On the window-disjoint-orthogonality of speech sources in reverberant humanoid scenarios
Many speech source separation approaches are based on the assumption of orthogonality of speech sources in the time-frequency domain. The target speech source is demixed from the mixture by applying the ideal binary mask to the mixture. The time-frequency orthogonality of speech sources has so far been investigated in detail only for anechoic and artificially mixed speech mixtures. This paper evaluates how the orthogonality of speech sources decreases in a realistic reverberant humanoid recording setup and indicates strategies to enhance the separation capabilities of algorithms based on ideal binary masks under these conditions. It is shown that the SIR of the target source demixed from the mixture using the ideal binary mask decreases by approximately 3 dB for reverberation times of T60 = 0.6 s compared to the anechoic scenario. For humanoid setups, the spatial distribution of the sources and the choice of ear channel introduce differences in SIR of a further 3 dB, which leads to specific strategies for choosing the best channel for demixing.
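The ideal-binary-mask demixing the abstract refers to can be sketched as follows; the array shapes and toy magnitude values are illustrative assumptions, not the paper's recording setup:

```python
import numpy as np

def ideal_binary_mask(target_mag, interferer_mag):
    """Ideal binary mask: 1 in time-frequency cells where the target
    magnitude dominates the interferer, 0 elsewhere."""
    return (target_mag > interferer_mag).astype(float)

def demix(mixture_stft, mask):
    """Apply the binary mask to the (complex) mixture STFT."""
    return mask * mixture_stft

# Toy 2x3 time-frequency magnitude grids for target and interferer.
target = np.array([[3.0, 0.1, 2.0],
                   [0.2, 4.0, 0.3]])
interf = np.array([[1.0, 2.0, 0.5],
                   [3.0, 1.0, 2.0]])
mask = ideal_binary_mask(target, interf)
print(mask)
```

The "ideal" mask requires knowing the isolated target and interferer spectrograms, which is why it serves as an upper bound for evaluating separation quality rather than as a practical algorithm.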
Analysis-and-manipulation approach to pitch and duration of musical instrument sounds without distorting timbral characteristics
This paper presents an analysis-and-manipulation method that can generate musical instrument sounds with arbitrary pitches and durations from the sound of a given musical instrument (called the seed) without distorting its timbral characteristics. Based on psychoacoustical knowledge of the auditory effects of timbre, we defined timbral features based on the spectrogram of the sound of a musical instrument as (i) the relative amplitudes of the harmonic peaks, (ii) the distribution of the inharmonic component, and (iii) the temporal envelopes. First, to analyze the timbral features of a seed, it was separated into harmonic and inharmonic components using Itoyama’s integrated model. For pitch manipulation, we took into account the pitch dependency of features (i) and (ii): the value of each feature was predicted using a cubic polynomial that approximates the distribution of the feature over pitches. To manipulate duration, we focused on preserving feature (iii) in the attack and decay segments of the seed; therefore, only the steady segment was expanded or shrunk. In addition, we propose a method for reproducing the properties of vibrato. Experimental results demonstrated the quality of the sounds synthesized with our method: the spectral and MFCC distances between the synthesized and actual sounds of 32 instruments were reduced by 64.70% and 32.31%, respectively.
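The cubic-polynomial prediction of a pitch-dependent feature can be sketched as below; the pitches and feature values are invented for illustration, and `np.polyfit` stands in for whatever fitting procedure the paper actually uses:

```python
import numpy as np

# Hypothetical measured feature values (e.g. the relative amplitude of
# the 2nd harmonic) at a few MIDI pitches of the same instrument.
pitches = np.array([40, 48, 55, 62, 69, 76], dtype=float)
feature = np.array([0.82, 0.74, 0.61, 0.50, 0.42, 0.37])

# Fit a cubic polynomial to the feature-vs-pitch distribution ...
coeffs = np.polyfit(pitches, feature, deg=3)

# ... and predict the feature value at an unseen target pitch.
predicted = np.polyval(coeffs, 60.0)
print(round(float(predicted), 3))
```

The same fit-and-evaluate pattern would be repeated per feature, giving the manipulation stage a predicted feature value for any target pitch between the measured ones.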
An amplitude- and frequency-modulation vocoder for audio signal processing
The decomposition of audio signals into perceptually meaningful modulation components is highly desirable, both for the development of new audio effects and as a building block for future efficient audio compression algorithms. In the past, there has always been a distinction between parametric coding methods and waveform coding: while waveform coding methods scale easily up to transparency (provided the necessary bit rate is available), parametric coding schemes are subject to the limitations of the underlying source models. Conversely, parametric methods usually offer a wealth of manipulation possibilities, which can be exploited for the application of audio effects, while waveform coding is strictly limited to reproducing the original signal as faithfully as possible. The analysis/synthesis approach presented in this paper is an attempt to bridge this gap by enabling a seamless transition between the two approaches.
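A minimal sketch of one common way to obtain amplitude- and frequency-modulation components is the FFT-based analytic signal; this is a generic Hilbert-transform construction shown for orientation, not the vocoder of the paper:

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via one-sided spectrum (even-length input assumed)."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:n // 2] = 2.0
    h[n // 2] = 1.0
    return np.fft.ifft(X * h)

fs = 1000.0
t = np.arange(1000) / fs
# 50 Hz carrier with a slow 2 Hz amplitude modulation.
x = (1.0 + 0.5 * np.sin(2 * np.pi * 2 * t)) * np.cos(2 * np.pi * 50 * t)

z = analytic_signal(x)
am = np.abs(z)                               # amplitude envelope
inst_phase = np.unwrap(np.angle(z))
fm = np.diff(inst_phase) / (2 * np.pi) * fs  # instantaneous frequency (Hz)

print(round(float(np.median(fm)), 1))        # close to the 50 Hz carrier
```

For this narrowband toy signal the envelope and instantaneous frequency are recovered cleanly; a full AM/FM vocoder would apply such a decomposition per subband of a filterbank.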
Wide-band harmonic sinusoidal modeling
In this paper we propose a method to estimate and transform harmonic components in wide-band conditions from a single period of the analyzed signal. This method allows harmonic parameters to be estimated with higher temporal resolution than typical Short-Time Fourier Transform (STFT) based methods. We also discuss transformation and synthesis strategies in this context, focusing on the human voice.
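The single-period idea can be illustrated with a toy example: if the analysis frame contains exactly one period of a harmonic signal, each DFT bin directly holds one harmonic. This is a simplified sketch, not the paper's estimator:

```python
import numpy as np

def harmonics_from_one_period(period):
    """DFT of exactly one period: bin k holds harmonic k directly."""
    n = len(period)
    spectrum = np.fft.rfft(period) / n
    amps = 2.0 * np.abs(spectrum[1:])   # one-sided harmonic amplitudes
    phases = np.angle(spectrum[1:])     # (Nyquist-bin doubling ignored;
    return amps, phases                 #  that bin is zero here)

# Synthetic single period: a fundamental with 3 harmonics of known
# amplitudes 1.0, 0.5, and 0.25.
n = 64
t = np.arange(n) / n
period = (1.0 * np.sin(2 * np.pi * t)
          + 0.5 * np.sin(2 * np.pi * 2 * t)
          + 0.25 * np.sin(2 * np.pi * 3 * t))
amps, _ = harmonics_from_one_period(period)
print(np.round(amps[:3], 3))
```

Because the analysis window spans only one period rather than several, the harmonic parameters can be updated once per period, which is where the higher temporal resolution comes from.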
Time mosaics - An image processing approach to audio visualization
This paper presents a new approach to the visualization of monophonic audio files that simultaneously illustrates general audio properties and the component sounds that comprise a given input file. This approach represents sound clip sequences using archetypal images which are subjected to image processing filters driven by audio characteristics such as power, pitch, and signal-to-noise ratio. Where the audio consists of a single sound, it is represented by a single image that has been subjected to filtering. Heterogeneous audio files are represented as a seamless image mosaic along a time axis, where each component image in the mosaic maps directly to a discovered component sound. To support this, the system separates the individual sounds in a given audio file and reveals the overlapping periods between sound clips. Compared with existing visualization methods such as oscilloscopes and spectrograms, this approach yields more accessible illustrations of audio files, which are suitable for casual and non-expert users. We propose that this method could be used as an efficient means of scanning audio database query results and navigating audio databases through browsing, since the user can visually scan the file contents and audio properties simultaneously.
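A toy stand-in for the feature-driven filtering: scale an archetypal image's brightness by each audio frame's RMS power and concatenate the resulting tiles along a time axis. The function name, image, and frame values are hypothetical, and a single brightness filter stands in for the paper's richer filter set:

```python
import numpy as np

def power_driven_brightness(image, frames):
    """One filtered image tile per audio frame, brightness scaled by the
    frame's RMS power; tiles are concatenated along the time axis."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    gain = rms / (rms.max() + 1e-12)            # normalise to [0, 1]
    tiles = [np.clip(image * g, 0, 255) for g in gain]
    return np.concatenate(tiles, axis=1)

image = np.full((4, 4), 200.0)                  # archetypal image stand-in
loud = np.ones((1, 256))
quiet = 0.5 * np.ones((1, 256))
frames = np.vstack([loud, quiet])
mosaic = power_driven_brightness(image, frames)
print(mosaic.shape)
```

Quieter frames yield darker tiles, so the mosaic conveys the loudness contour of the file at a glance, in the spirit of the visualization described above.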
Generalization of the derivative analysis method to non-stationary sinusoidal modeling
In the context of non-stationary sinusoidal modeling, this paper introduces a generalization of the derivative method (presented at the first DAFx edition) for the analysis stage. The new method is then compared to the reassignment method for the estimation of all the parameters of the model (phase, amplitude, frequency, amplitude modulation, and frequency modulation), and to the Cramér-Rao bounds. It turns out that the new method is less biased, and thus outperforms the reassignment method in most cases for signal-to-noise ratios greater than −10 dB.
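The core derivative idea can be sketched in a simplified DFT-ratio form (not the paper's generalized estimator): the magnitude ratio between the spectra of a signal and its discrete derivative at the peak bin yields a frequency estimate, once the bias of the first-order difference is inverted:

```python
import numpy as np

fs = 8000.0
f_true = 440.0
n = 1024
t = np.arange(n + 1) / fs
x = np.cos(2 * np.pi * f_true * t)

# First-order difference as a discrete approximation of the derivative.
dx = (x[1:] - x[:-1]) * fs
x = x[1:]

w = np.hanning(n)
X = np.fft.rfft(w * x)
DX = np.fft.rfft(w * dx)

k = int(np.argmax(np.abs(X)))          # peak bin of the sinusoid
ratio = np.abs(DX[k]) / np.abs(X[k])   # ~ |H(f0)| of the difference filter

# The difference filter has |H(f)| = 2*fs*sin(pi*f/fs); invert it.
f_est = np.arcsin(ratio / (2 * fs)) * fs / np.pi
print(round(float(f_est), 1))
```

Unlike plain peak picking, which is quantized to the bin spacing (here about 7.8 Hz), the ratio-based estimate resolves the frequency well inside a bin.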
Spectutils, an audio signal analysis and visualization toolkit for GNU Octave
Spectutils is a GNU Octave toolkit for analyzing and visualizing audio signals. It allows the user to display oscillograms, FFT spectrograms, and pitch detection graphs. Spectutils can best be characterized as a user interface for GNU Octave that integrates signal analysis and visualization functionality into dedicated function calls. Signal analysis with Spectutils therefore requires little or no prior knowledge of Octave or MATLAB programming.
Center channel separation based on spatial analysis
This paper gives a brief description of an audio channel and sound source separation algorithm using spatial cues. Basically, the inter-channel level difference (ICLD) is used to discriminate sound sources in a spatial grid for each channel pair and analysis subband. The inter-channel cross-correlation (ICC) is also used to determine the sound source location area and the contribution factor of the composite sound source under consideration. In this paper, the separation of the center and side channels of a stereophonic music signal using this spatial sound source discrimination method is introduced. It is implemented simply by using the given information of the center channel location and the derived spatial cues. The separated center channel signal matches well with the separated side channels when they are reproduced simultaneously.
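The ICLD cue can be sketched per subband as below; the equal-width rFFT bands and toy signals are simplifications standing in for whatever filterbank and analysis frames the algorithm actually uses:

```python
import numpy as np

def icld_db(left, right, n_bands=4):
    """Inter-channel level difference (dB) per analysis subband, with
    equal-width rFFT bands as a simple stand-in filterbank."""
    L = np.abs(np.fft.rfft(left))
    R = np.abs(np.fft.rfft(right))
    out = []
    for idx in np.array_split(np.arange(len(L)), n_bands):
        pl = np.sum(L[idx] ** 2) + 1e-12
        pr = np.sum(R[idx] ** 2) + 1e-12
        out.append(10.0 * np.log10(pl / pr))
    return np.array(out)

# Toy stereo frame: a "center" source equal in both channels plus a
# right-panned 3 kHz source.
fs = 8000.0
t = np.arange(512) / fs
center = np.sin(2 * np.pi * 200 * t)
side = np.sin(2 * np.pi * 3000 * t)
left = center
right = center + 2.0 * side

vals = icld_db(left, right)
print(np.round(vals, 1))  # near 0 dB except in the band holding 3 kHz
```

Subbands with ICLD near 0 dB point to a centrally panned source, while strongly positive or negative values point to the sides, which is the basis for assigning subbands to the center or side channels.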
Detecting arrivals within room impulse responses using matching pursuit
This paper proposes to use Matching Pursuit to investigate some statistical foundations of room acoustics, such as the temporal distribution of arrivals and the estimation of the mixing time. As this has never been experimentally explored, this study is a first step towards a validation of the ergodic theory of reverberation. The use of Matching Pursuit is implicit, since correlation between the impulse response and the direct sound is assumed. Compensation for the energy decay is necessary to obtain stationary signals. Methods for determining the temporal boundaries of the direct sound, for choosing an appropriate stopping criterion based on the similarity between acoustical indices of the original RIR and those of the synthesized signal, and for experimentally defining the mixing time constitute the scope of this study.
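The implicit Matching Pursuit described in the abstract (greedy correlation of the RIR with the direct sound) can be sketched on a toy, noiseless impulse response; the fixed atom count replaces the paper's similarity-based stopping criterion:

```python
import numpy as np

def detect_arrivals(rir, direct, n_arrivals):
    """Greedy matching pursuit with time-shifted copies of the direct
    sound as the dictionary; returns the detected arrival offsets."""
    atom = direct / np.linalg.norm(direct)
    residual = rir.astype(float).copy()
    m = len(atom)
    arrivals = []
    for _ in range(n_arrivals):
        # Correlate the residual with the atom at every shift.
        corr = np.array([residual[i:i + m] @ atom
                         for i in range(len(residual) - m + 1)])
        i = int(np.argmax(np.abs(corr)))
        arrivals.append(i)
        residual[i:i + m] -= corr[i] * atom   # subtract the best atom
    return sorted(arrivals)

# Toy RIR: direct sound at t=0 plus two scaled, delayed reflections.
direct = np.array([1.0, -0.6, 0.3, -0.1])
rir = np.zeros(100)
for delay, gain in [(0, 1.0), (30, 0.5), (70, 0.25)]:
    rir[delay:delay + 4] += gain * direct
print(detect_arrivals(rir, direct, 3))  # → [0, 30, 70]
```

On a real RIR the atoms overlap and sit in noise, which is why the temporal boundaries of the direct sound, the decay compensation, and the stopping criterion all matter in the study above.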
Inferring the hand configuration from hand clapping sounds
In this paper, a technique for inferring the configuration of a clapper’s hands from a hand clapping sound is described. The method was developed based on analysis of synthetic and recorded hand clap sounds, labeled with the corresponding hand configurations. A naïve Bayes classifier was constructed to automatically classify the data using two different feature sets. The results indicate that the approach is applicable for inferring the hand configuration.
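A naïve Bayes classifier of the kind mentioned can be sketched with a minimal Gaussian implementation; the two-dimensional features and their values below are invented for the sketch and are not the paper's feature sets:

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian naive Bayes: per-class, per-feature mean/variance."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9
                             for c in self.classes])
        self.logprior = np.log([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # log p(x|c) + log p(c), with features assumed independent.
        ll = -0.5 * (np.log(2 * np.pi * self.var)[None]
                     + (X[:, None, :] - self.mu[None]) ** 2
                     / self.var[None]).sum(-1)
        return self.classes[np.argmax(ll + self.logprior, axis=1)]

# Hypothetical 2-D features (say, spectral centroid in kHz and decay
# time in ms) for two hand configurations: flat (0) vs cupped (1).
rng = np.random.default_rng(0)
flat = rng.normal([1.2, 40.0], [0.1, 4.0], size=(30, 2))
cupped = rng.normal([0.7, 60.0], [0.1, 4.0], size=(30, 2))
X = np.vstack([flat, cupped])
y = np.array([0] * 30 + [1] * 30)

clf = GaussianNB().fit(X, y)
print(clf.predict(np.array([[1.25, 38.0], [0.65, 62.0]])))
```

The independence assumption makes training trivial (one mean and variance per class and feature), which suits the small labeled datasets typical of this kind of study.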