In previous papers, the concept of the modulation vocoder (MODVOC) was introduced and its general capability to perform a selective transposition on polyphonic music content was pointed out. This enables applications that aim at changing the key mode of pre-recorded PCM music samples. In this paper, two enhancement techniques for selective pitch transposition by the MODVOC are proposed. The performance of the selective transposition application and the merit of these techniques are benchmarked by results obtained from a specially designed listening test methodology that is capable of handling extreme changes in pitch with respect to the original audio stimuli. Results of this subjective perceptual quality assessment are presented for items that have been converted between minor and major key mode by the MODVOC and, additionally, by the first commercially available software that is also capable of handling this task.
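As a purely symbolic illustration of what a key mode change implies, the following sketch maps notes from a major key to the parallel minor key by lowering only some scale degrees. It assumes the natural minor scale as the target, which the abstract does not specify, and it works on MIDI note numbers rather than on audio; the function names are hypothetical and this is not the MODVOC processing itself.

```python
# Semitone distances above the tonic of major scale degrees 3, 6 and 7.
# Assumption: the target is the natural minor scale, where exactly these
# three degrees sit one semitone lower than in the major scale.
MAJOR_DEGREES_TO_LOWER = {4, 9, 11}


def major_to_natural_minor(midi_note, tonic_pitch_class):
    """Return the MIDI note after a major to natural-minor mode change.

    Only notes on degrees 3, 6 and 7 of the major scale are shifted; all
    other notes are left untouched, which is what makes the transposition
    "selective".
    """
    interval = (midi_note - tonic_pitch_class) % 12
    if interval in MAJOR_DEGREES_TO_LOWER:
        return midi_note - 1
    return midi_note


def transposition_ratio(semitones):
    """Frequency ratio of an equal-tempered shift by `semitones`."""
    return 2.0 ** (semitones / 12.0)


if __name__ == "__main__":
    c_major_triad = [60, 64, 67]  # C4, E4, G4
    print([major_to_natural_minor(n, 0) for n in c_major_triad])
    print(transposition_ratio(-1))
```

In this sketch only the third of the C major triad moves (E to E flat), while the root and fifth stay in place; a signal-domain method has to achieve the same effect on the corresponding components of the mixture.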
The aim of sound morphing is to obtain a sound that falls perceptually between two (or more) sounds. Ideally, we want to morph perceptually relevant features of sounds and be able to independently manipulate them. In this work we present a method to obtain perceptually intermediate spectral envelopes guided by high-level spectral shape descriptors and a technique that employs evolutionary computation to independently manipulate the timbral features captured by the descriptors. High-level descriptors are measures of the acoustic correlates of salient timbre dimensions derived from perceptual studies, such that the manipulation of the descriptors corresponds to potentially interesting timbral variations.
Although resynthesis may seem a simple analysis/synthesis process, it is quite a complex task, even more so when it comes to recreating a singing voice. This paper presents a system whose goal is to start with an original audio stream of someone singing and recreate the same performance (melody, phonetics, dynamics) using an internal vocal sound library (choir or solo voice). By extracting dynamics and pitch information, and looking for phonetic similarities between the original audio frames and the frames of the sound library, a completely new audio stream is created. The obtained audio results, although not perfect (mainly due to the existence of audio artifacts), show that this technological approach may become an extremely powerful audio tool.
This paper presents our research efforts to synthesize complex instrumental gestures using a score-based control scheme. Our specific goal is to simulate the rasgueado technique that is popular especially in flamenco music. This technique is also used in the classical guitar repertoire. Rasgueado is especially challenging as ordinary music notation is not adequate to represent the dense stream of notes required for a convincing simulation. We take two approaches to realize our task. First, we use the practical knowledge of how the actual performance is accomplished by the human player. A second, complementary, approach is to analyze an excerpt from real guitar playing. Our main focus here is to extract the onset times and the amplitudes of the recorded gesture. Next we combine the results from the two analysis steps using a constraint-based approach to find possible pitch and fingering sequences. Finally we translate the findings to our macro-note scheme that allows us to algorithmically fill a musical score.
In this work the Fan Chirp Transform (FChT), which provides an acute representation of harmonically related linear chirp signals, is applied to the analysis of pitch content in polyphonic music. The implementation introduced was devised to be computationally manageable and enables the generalization of the FChT for the analysis of non-linear chirps. The combination with the Constant Q Transform is explored to build a multi-resolution FChT. An existing method to compute pitch salience from the FChT is improved and adapted to handle polyphonic music. In this way a useful melodic content visualization tool is obtained. The results of a frame-based melody detection evaluation indicate that the introduced technique is very promising as a front-end for music analysis.
In this paper we propose a method for automatic local time adaptation of the spectrogram of an audio signal, based on its decomposition within a Gabor multi-frame. The sparsity of the analyses within each individual frame is evaluated through Rényi entropy measures. According to the sparsity of the decompositions, an optimal resolution and a reduced multi-frame are determined, defining an adapted spectrogram with variable resolution and hop size. The composition of such a reduced multi-frame allows an immediate definition of a dual frame: re-synthesis techniques for this adapted analysis are easily derived from the traditional phase vocoder scheme.
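The sparsity criterion can be illustrated with a minimal sketch of the Rényi entropy of order α, H_α(p) = log2(Σ p_i^α) / (1 − α), applied to a normalized energy distribution; a lower entropy indicates a sparser, more concentrated analysis. The order α = 3, the function names, and the dictionary of candidate analyses are assumptions made for illustration, not details taken from the paper.

```python
import math


def renyi_entropy(energies, alpha=3.0):
    """Rényi entropy (in bits) of a non-negative energy distribution.

    `energies` would typically hold squared magnitudes of time-frequency
    coefficients; they are normalized to sum to one before the entropy
    is computed.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(energies)
    if total <= 0:
        raise ValueError("energies must have a positive sum")
    power_sum = sum((e / total) ** alpha for e in energies)
    return math.log2(power_sum) / (1.0 - alpha)


def sparsest_resolution(candidates, alpha=3.0):
    """Return the key of the candidate analysis with the lowest entropy.

    `candidates` maps a label (for example a window length) to the list
    of coefficient energies obtained with that resolution.
    """
    return min(candidates, key=lambda k: renyi_entropy(candidates[k], alpha))


if __name__ == "__main__":
    analyses = {
        "short window": [1.0, 1.0, 1.0, 1.0],  # energy spread evenly
        "long window": [0.0, 4.0, 0.0, 0.0],   # energy in one coefficient
    }
    print(sparsest_resolution(analyses))
```

A flat distribution over N coefficients yields log2(N) bits and a single nonzero coefficient yields zero, so picking the minimum selects the most concentrated representation among the candidates.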
Audio samplers often need to modify the pitch of recorded sounds in order to generate scales or chords. This article tackles the use of Gabor masks and their capacity to improve the perceptual realism of transposed notes obtained through the classical phase vocoder algorithm. Gabor masks can be seen as operators that allow the modification of the time-dependent spectral content of sounds by modifying their time-frequency representation. The goal here is to restore a distribution of energy that is more in line with the physics of the structure that generated the original sound. The Gabor mask is elaborated using an estimation of the spectral envelope evolution in the time-frequency plane, and then applied to the modified Gabor transform. This operation turns the modified Gabor transform into another one which respects the estimated spectral envelope evolution, and therefore leads to a note that is more perceptually convincing.
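One simple way to picture a mask of this kind is as a pointwise gain over the time-frequency plane. The sketch below assumes the mask is the ratio between a target envelope and the envelope of the transposed sound, with a small floor to avoid division by zero; this construction, the function names, and the nested-list layout (one row per time frame) are illustrative assumptions and may differ from the estimation used in the article.

```python
def gabor_mask(current_envelope, target_envelope, floor=1e-12):
    """Build a real, non-negative mask as target / current, per bin.

    Both arguments are lists of frames, each frame being a list of
    envelope magnitudes per frequency bin.
    """
    return [
        [t / max(c, floor) for c, t in zip(cur_row, tgt_row)]
        for cur_row, tgt_row in zip(current_envelope, target_envelope)
    ]


def apply_mask(coefficients, mask):
    """Multiply complex time-frequency coefficients by the mask.

    The mask is real, so only magnitudes change and phases are kept.
    """
    return [
        [g * m for g, m in zip(coef_row, mask_row)]
        for coef_row, mask_row in zip(coefficients, mask)
    ]


if __name__ == "__main__":
    current = [[2.0, 4.0]]        # envelope of the transposed note
    target = [[1.0, 8.0]]         # envelope estimated from the original
    coeffs = [[2 + 2j, 0 + 1j]]   # modified Gabor coefficients
    print(apply_mask(coeffs, gabor_mask(current, target)))
```

Because the gain is applied coefficient by coefficient, the phases produced by the transposition are left unchanged and only the energy distribution is reshaped.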
A method is described that simultaneously estimates the frequency, phase and amplitude of two overlapping partials in a monaural musical signal from the amplitudes and phases in three frequency bins of the signal’s Odd Discrete Fourier Transform (ODFT). From the transform of the analysis window in its analytical form, and given the frequencies of the two partials, an analytical solution for the amplitude and phase of the two overlapping partials was obtained. The frequencies, in turn, are estimated by numerically solving a system of two equations in two unknowns, since no analytical solution could be found. Although the estimation is done independently frame by frame, particular situations (e.g. extremely close frequencies, same phase in the time window) lead to errors, which can be partly corrected with a moving average filter over several time frames. Results are presented for artificial sinusoids with time-varying frequencies and amplitudes, and with different levels of added noise. The system still performs well with a signal-to-noise ratio down to 30 dB, with moderately modulated frequencies, and with time-varying amplitudes.
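For readers unfamiliar with the ODFT, the following minimal sketch computes it directly from the commonly used definition X[k] = Σ_n x[n] h[n] exp(−j2π(k + 1/2)n/N), in which the bin centres are shifted by half a bin relative to the DFT. The direct O(N²) evaluation, the function name, and the default rectangular window are choices made for illustration; the three-bin estimator described in the abstract is not reproduced here.

```python
import cmath
import math


def odft(frame, window=None):
    """Odd-frequency DFT of one frame, evaluated directly.

    Bin k is centred on the normalized frequency (k + 0.5) / N cycles
    per sample. If `window` is None, a rectangular window is used.
    """
    n_samples = len(frame)
    if window is None:
        window = [1.0] * n_samples
    spectrum = []
    for k in range(n_samples):
        acc = 0j
        for n in range(n_samples):
            phase = -2.0 * math.pi * (k + 0.5) * n / n_samples
            acc += frame[n] * window[n] * cmath.exp(1j * phase)
        spectrum.append(acc)
    return spectrum


if __name__ == "__main__":
    size = 8
    # Complex exponential located exactly on the centre of ODFT bin 2.
    tone = [cmath.exp(2j * math.pi * 2.5 * n / size) for n in range(size)]
    for k, value in enumerate(odft(tone)):
        print(k, round(abs(value), 6))
```

With a rectangular window, a complex exponential at (k + 1/2)/N cycles per sample falls entirely into bin k, which is a convenient sanity check before adding an analysis window.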
Sound applications based on sinusoidal modeling depend heavily on the efficiency and the precision of the estimators of their analysis stage. In a previous work, theoretical bounds for the best achievable precision were shown, and these bounds are reached by efficient estimators like the reassignment or the derivative methods. We show that it is possible to break these theoretical bounds with just a few additional bits of information about the original content, introducing the concept of “informed analysis”. This paper shows that existing estimators combined with some additional information can reach any expected level of precision, even in very low signal-to-noise ratio conditions, thus enabling high-quality sound effects without the typical but unwanted musical noise.
This paper presents two ways to improve the Real-Time Iterative Spectrogram Inversion (RTISI) algorithm. The standard RTISI phase estimator with look-ahead processes the buffered frames in reverse order. We show that better results are achieved by controlling this order according to frame energy. Another improvement is to initialize the last row of the phase estimator buffer by progressing the unwrapped phase difference of the previous frames. Furthermore, we extend these improvements to dual window length phase estimation and analyze the performance in SER with respect to different analysis window lengths.
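The two ideas can be sketched in a few lines, under stated assumptions. The abstract does not say in which direction the energy ordering runs, so the sketch exposes it as a parameter; the phase initialization assumes a constant hop size, so that the increment observed between the two previous frames can be reused directly (modulo 2π). All function names are hypothetical and this is not the RTISI implementation itself.

```python
import math


def frame_energy(magnitudes):
    """Sum of squared spectral magnitudes of one frame."""
    return sum(m * m for m in magnitudes)


def processing_order(buffer_magnitudes, descending=True):
    """Indices of the buffered frames sorted by energy.

    Assumption: `descending=True` (highest energy first) is only one
    possible reading of "according to frame energy".
    """
    energies = [frame_energy(f) for f in buffer_magnitudes]
    return sorted(range(len(energies)), key=lambda i: energies[i],
                  reverse=descending)


def wrap_phase(phi):
    """Map a phase value to its principal value in [-pi, pi]."""
    return math.atan2(math.sin(phi), math.cos(phi))


def extrapolate_phase(prev2, prev1):
    """Initial phase guess for a new frame, bin by bin.

    Each bin is advanced by the phase increment observed between the
    two previous frames, then wrapped to the principal value.
    """
    return [wrap_phase(p1 + wrap_phase(p1 - p2))
            for p2, p1 in zip(prev2, prev1)]


if __name__ == "__main__":
    frames = [[1.0, 1.0], [3.0, 0.0], [0.0, 0.5]]
    print(processing_order(frames))
    print(extrapolate_phase([0.0, 0.0], [1.0, 3.0]))
```

The extrapolated phases would serve only as a starting point for the last buffer row; the iterative estimation then refines them as usual.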