Room Acoustic Modelling Using a Hybrid Ray-Tracing/Feedback Delay Network Method
Combining different room acoustic modelling methods could provide a better balance between perceptual plausibility and computational efficiency than using a single and potentially more computationally expensive model. In this work, a hybrid acoustic modelling system that integrates ray tracing (RT) with an advanced feedback delay network (FDN) is designed to generate perceptually plausible RIRs. A multiple stimuli with hidden reference and anchor (MUSHRA) test and a two-alternative-forced-choice (2AFC) discrimination task have been conducted to compare the proposed method against ground truth recordings and conventional RT-based approaches. The results show that the proposed system delivers robust performance in various scenarios, achieving highly plausible reverberation synthesis.
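To make the FDN half of such a hybrid concrete, the sketch below implements a minimal four-line feedback delay network in Python/NumPy with a Householder feedback matrix and per-line attenuations set for a chosen reverberation time. The delay lengths, decay time, and matrix choice are assumed demonstration values; this is not the paper's advanced FDN, and the coupling to the ray-tracing stage is not shown.

```python
# Minimal feedback delay network (FDN) sketch -- illustrative only, not the
# paper's "advanced" FDN. Delay lengths, decay time and feedback matrix are
# assumed values chosen for demonstration.
import numpy as np

def fdn_impulse_response(fs=48000, dur=1.5, delays=(1031, 1327, 1523, 1871), t60=1.2):
    n_lines = len(delays)
    # Householder feedback matrix: lossless (orthogonal) and cheap to apply.
    v = np.ones((n_lines, 1)) / np.sqrt(n_lines)
    A = np.eye(n_lines) - 2.0 * (v @ v.T)
    # Per-line attenuation so that energy decays by 60 dB in t60 seconds.
    g = np.array([10.0 ** (-3.0 * d / (fs * t60)) for d in delays])
    bufs = [np.zeros(d) for d in delays]
    idx = [0] * n_lines
    n = int(dur * fs)
    x = np.zeros(n)
    x[0] = 1.0                           # impulse input
    y = np.zeros(n)
    for t in range(n):
        outs = np.array([bufs[i][idx[i]] for i in range(n_lines)])
        y[t] = outs.sum()
        feedback = A @ (g * outs)
        for i in range(n_lines):
            bufs[i][idx[i]] = x[t] + feedback[i]
            idx[i] = (idx[i] + 1) % len(bufs[i])
    return y

ir = fdn_impulse_response()
```

Because the Householder matrix is lossless, the decay rate of this sketch is governed entirely by the per-line gains derived from the target reverberation time.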
Recent CCRMA research in Digital Audio Synthesis, Processing and Effects
This extended abstract summarizes DAFx-related developments at CCRMA over the past year or so.
On the use of zero-crossing rate for an application of classification of percussive sounds
We address the issue of automatically extracting rhythm descriptors from audio signals, to be eventually used in content-based musical applications such as those in the context of MPEG-7. Our aim is to approach the comprehension of auditory scenes in raw polyphonic audio signals without preliminary source separation. As a first step towards the automatic extraction of rhythmic structures from signals taken from the popular music repertoire, we propose an approach for automatically extracting the time indexes of occurrences of different percussive timbres in an audio signal. Within this framework, we found that a particular issue lies in the classification of percussive sounds. In this paper, we report on the method currently used to deal with this problem.
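For orientation, the following sketch computes the frame-wise zero-crossing rate named in the title, a low-level descriptor often used to separate noise-like percussive timbres (e.g. hi-hats) from low-frequency ones (e.g. kick drums). The frame length, hop size and toy signals are assumptions for illustration, not the paper's settings.

```python
# Frame-wise zero-crossing rate (ZCR): fraction of sample-to-sample sign
# changes per frame. Frame/hop sizes are assumed demo values.
import numpy as np

def zero_crossing_rate(x, frame_len=1024, hop=512):
    rates = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len]
        # Count sign changes between consecutive samples, normalise by length.
        crossings = np.sum(np.abs(np.diff(np.signbit(frame).astype(int))))
        rates.append(crossings / frame_len)
    return np.array(rates)

# Toy usage: a high ZCR suggests a noise-like (hi-hat-like) frame.
fs = 44100
noise = np.random.randn(fs)                           # stand-in for a hi-hat
tone = np.sin(2 * np.pi * 60 * np.arange(fs) / fs)    # kick-like low tone
print(zero_crossing_rate(noise).mean(), zero_crossing_rate(tone).mean())
```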
Vocal Tract Area Estimation by Gradient Descent
Articulatory features can provide interpretable and flexible controls for the synthesis of human vocalizations by allowing the user to directly modify parameters like vocal strain or lip position. To make this manipulation through resynthesis possible, we need to estimate the features that result in a desired vocalization directly from audio recordings. In this work, we propose a white-box optimization technique for estimating glottal source parameters and vocal tract shapes from audio recordings of human vowels. The approach is based on inverse filtering and optimizing the frequency response of a waveguide model of the vocal tract with gradient descent, propagating error gradients through the mapping of articulatory features to the vocal tract area function. We apply this method to the task of matching the sound of the Pink Trombone, an interactive articulatory synthesizer, to a given vocalization. We find that our method accurately recovers control functions for audio generated by the Pink Trombone itself. We then compare our technique against evolutionary optimization algorithms and a neural network trained to predict control parameters from audio. A subjective evaluation finds that our approach outperforms these black-box optimization baselines on the task of reproducing human vocalizations.
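A minimal sketch of the underlying idea, under strong simplifying assumptions: the vocal tract is reduced to a lossless tube whose area function maps to reflection coefficients and an all-pole magnitude response via the step-up recursion, and the log-areas are fitted to a target response by gradient descent using PyTorch autograd (assumed available). The section count, learning rate and target are illustrative; this is not the paper's waveguide model of the Pink Trombone, nor its glottal-source estimation.

```python
# Fit a tube area function to a target magnitude response by gradient descent.
# Areas -> reflection coefficients -> prediction polynomial A(z) (step-up
# recursion) -> |1/A(e^{jw})|. All numbers below are assumed demo values.
import torch

N_SECTIONS = 8
FS = 16000
freqs = torch.linspace(100, 7000, 200)

def areas_to_reflections(areas):
    # Reflection coefficient at each junction between adjacent tube sections.
    return (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])

def allpole_response(refl, freqs, fs):
    # Step-up recursion: reflection coefficients -> polynomial coefficients.
    a = torch.ones(1)
    for k in refl:
        a_ext = torch.cat([a, torch.zeros(1)])
        a = a_ext + k * torch.flip(a_ext, dims=[0])
    # Evaluate |1 / A(e^{jw})| on the requested frequency grid.
    w = 2 * torch.pi * freqs / fs
    n = torch.arange(a.numel(), dtype=torch.float32)
    E = torch.exp(-1j * w[:, None] * n[None, :])
    A = (E * a.to(torch.complex64)).sum(dim=1)
    return 1.0 / torch.abs(A)

# Hypothetical target: response of a hand-picked "true" area function.
true_areas = torch.tensor([1.0, 1.5, 2.5, 3.0, 2.0, 1.2, 0.8, 1.4])
target = allpole_response(areas_to_reflections(true_areas), freqs, FS).detach()

# Optimise log-areas so the recovered areas stay positive.
log_areas = torch.zeros(N_SECTIONS, requires_grad=True)
opt = torch.optim.Adam([log_areas], lr=0.05)
for step in range(500):
    opt.zero_grad()
    pred = allpole_response(areas_to_reflections(torch.exp(log_areas)), freqs, FS)
    loss = torch.mean((torch.log(pred) - torch.log(target)) ** 2)
    loss.backward()
    opt.step()
```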
Contact Sensor Processing for Acoustic Instrument Sensor Matching Using a Modal Architecture
This paper proposes a method to filter the output of instrument contact sensors to approximate the response of a well-placed microphone. A modal approach is proposed in which mode frequencies and damping ratios are fit to the frequency response of the contact sensor, and the mode gains are then determined for both the contact sensor and the microphone. The mode frequencies and damping ratios are presumed to be associated with the resonances of the instrument. Accordingly, the corresponding contact sensor and microphone mode gains will account for the instrument radiation. The ratios between the contact sensor and microphone gains are then used to create a parallel bank of second-order biquad filters that processes the contact sensor signal to estimate the microphone signal.
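The sketch below illustrates the final stage of such a scheme: a parallel bank of second-order resonators, one per identified instrument mode, whose gains are set from the microphone-to-sensor gain ratio. The mode frequencies, damping ratios and gain ratios are made-up illustrative values rather than fitted ones, and the resonator discretisation is one common choice, not necessarily the paper's.

```python
# Parallel modal biquad bank applied to a contact-sensor signal to estimate
# a microphone signal. Mode parameters are hypothetical demo values.
import numpy as np
from scipy.signal import lfilter

FS = 44100

def mode_biquad(freq, zeta, gain, fs):
    # Discretise a single damped resonant mode as a complex-conjugate pole pair.
    w0 = 2 * np.pi * freq
    pole = np.exp((-zeta * w0 + 1j * w0 * np.sqrt(1 - zeta ** 2)) / fs)
    a = np.array([1.0, -2 * pole.real, abs(pole) ** 2])    # denominator
    b = np.array([0.0, gain, 0.0])                          # scaled numerator
    return b, a

# Hypothetical fitted modes: (frequency in Hz, damping ratio, mic/sensor gain ratio)
modes = [(196.0, 0.002, 1.8), (392.0, 0.003, 0.7), (587.0, 0.004, 1.2)]

def sensor_to_mic(sensor_signal, modes, fs=FS):
    # Sum the outputs of the parallel biquad bank.
    out = np.zeros_like(sensor_signal, dtype=float)
    for freq, zeta, ratio in modes:
        b, a = mode_biquad(freq, zeta, ratio, fs)
        out += lfilter(b, a, sensor_signal)
    return out

# Toy usage on a noise burst standing in for a contact-sensor recording.
sensor = np.random.randn(FS)
mic_estimate = sensor_to_mic(sensor, modes)
```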
The Beating Equalizer and its Application to the Synthesis and Modification of Piano Tones
This paper presents an improved method for simulating and modifying the beating effect in piano tones. The beating effect is an audible phenomenon characteristic of the piano, and hence it should be accounted for in realistic piano synthesis. The proposed method, which is independent of the synthesis technique, contains a cascade of second-order equalizing filters, where each filter produces the beating effect for a single partial by modulating the peak gain. Moreover, the method offers a way to control the beating frequency and the beating depth, and it can be used to modify the beating envelope in existing tones. The results show that the proposed method is able to simulate the desired beating effect.
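A minimal sketch of one stage of such a cascade follows: a second-order peaking equalizer (standard RBJ cookbook formulas) centred on one partial, whose peak gain is modulated at the beating rate so the partial's level rises and falls. The partial frequency, beating rate, depth and Q below are assumed demo values; the paper's method cascades one such filter per beating partial.

```python
# Single "beating equalizer" stage: a peaking EQ whose gain oscillates at the
# beating frequency. Coefficients are recomputed per sample; parameters are
# assumed demo values.
import numpy as np

def beating_eq(x, fs, f_partial, f_beat, depth_db, Q=30.0):
    y = np.zeros_like(x)
    z1 = z2 = 0.0                          # direct-form-II-transposed state
    w0 = 2 * np.pi * f_partial / fs
    cosw0, sinw0 = np.cos(w0), np.sin(w0)
    alpha = sinw0 / (2 * Q)
    for n in range(len(x)):
        # Time-varying peak gain in dB, oscillating at the beating frequency.
        g_db = depth_db * np.sin(2 * np.pi * f_beat * n / fs)
        A = 10.0 ** (g_db / 40.0)
        b0, b1, b2 = 1 + alpha * A, -2 * cosw0, 1 - alpha * A
        a0, a1, a2 = 1 + alpha / A, -2 * cosw0, 1 - alpha / A
        b0, b1, b2, a1, a2 = b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0
        y[n] = b0 * x[n] + z1
        z1 = b1 * x[n] - a1 * y[n] + z2
        z2 = b2 * x[n] - a2 * y[n]
    return y

# Toy usage: add ~1 Hz beating to the 440 Hz partial of a synthetic tone.
fs = 44100
t = np.arange(fs * 2) / fs
tone = np.sin(2 * np.pi * 440 * t)
out = beating_eq(tone, fs, f_partial=440.0, f_beat=1.0, depth_db=6.0)
```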
A Comparison of Extended Source-Filter Models for Musical Signal Reconstruction
Recently, we have witnessed an increasing use of the source-filter model in music analysis, which is achieved by integrating the source-filter model into a non-negative matrix factorisation (NMF) framework or statistical models. The combination of the source-filter model and the NMF framework reduces the number of free parameters needed and makes the model more flexible to extend. This paper compares four extended source-filter models: the source-filter-decay (SFD) model, the NMF with time-frequency activations (NMF-ARMA) model, the multi-excitation (ME) model and the source-filter model based on β-divergence (SFbeta model). The first two models represent time-varying spectra by adding a loss filter and a time-varying filter, respectively. The latter two are extended by using multiple excitations and including a scale factor, respectively. The models are tested using sounds of 15 instruments from the RWC Music Database. Performance is evaluated based on the relative reconstruction error. The results show that the NMF-ARMA model outperforms the other models, but uses the largest set of parameters.
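All four extended models build on a non-negative factorisation of a magnitude spectrogram, V ≈ WH, evaluated here by relative reconstruction error. For orientation only, the sketch below shows plain NMF with Kullback-Leibler multiplicative updates and that error measure; the source-filter extensions constrain the columns of W with excitation and filter terms, which is not shown, and the rank and iteration count are assumed values.

```python
# Plain NMF with Kullback-Leibler multiplicative updates, plus the relative
# reconstruction error used for evaluation. Rank/iterations are demo values.
import numpy as np

def nmf_kl(V, rank=8, n_iter=200, eps=1e-9):
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # Multiplicative updates minimising the KL divergence D(V || WH).
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H

# Toy usage on a random non-negative "spectrogram".
V = np.abs(np.random.default_rng(1).random((513, 100)))
W, H = nmf_kl(V)
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```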
Piano-SSM: Diagonal State Space Models for Efficient MIDI-to-Raw Audio Synthesis
Deep State Space Models (SSMs) have shown remarkable performance in long-sequence reasoning tasks such as raw audio classification and audio generation. This paper introduces Piano-SSM, an end-to-end deep SSM neural network architecture designed to synthesize raw piano audio directly from MIDI input. The network requires no intermediate representations or domain-specific expert knowledge, simplifying training and improving accessibility. Quantitative evaluations on the MAESTRO dataset show that Piano-SSM achieves a Multi-Scale Spectral Loss (MSSL) of 7.02 at 16kHz, outperforming DDSP-Piano v1 with an MSSL of 7.09. At 24kHz, Piano-SSM maintains competitive performance with an MSSL of 6.75, closely matching DDSP-Piano v2’s result of 6.58. Evaluations on the MAPS dataset yield an MSSL of 8.23, demonstrating generalization capability even when training with very limited data. Further analysis highlights Piano-SSM’s ability to train on high-sampling-rate audio while synthesizing audio at lower sampling rates, explicitly linking the performance loss to aliasing effects. Additionally, the proposed model facilitates real-time causal inference through a custom C++17 header-only implementation. On an Intel Core i7-12700 processor at 4.5GHz with single-core inference, the largest network synthesizes one second of audio at 44.1kHz in 0.44s, at a workload of 23.1GFLOP/s and with a 10.1µs input/output delay, while the smallest network at 16kHz needs only 0.04s at 2.3GFLOP/s with a 2.6µs input/output delay. These results underscore Piano-SSM’s practical utility and efficiency in real-time audio synthesis applications.
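To illustrate why a diagonal SSM supports cheap per-sample causal inference, the sketch below runs the recurrence x[t+1] = A x[t] + B u[t], y[t] = Re(C x[t]) + D u[t] with a diagonal complex state matrix, so each step costs only O(state_dim) elementwise operations. The state size and initialisation are illustrative; Piano-SSM is a learned, multi-layer architecture, which this single untrained layer does not represent.

```python
# One diagonal state-space layer run causally, sample by sample.
# Parameters are random demo values, not trained weights.
import numpy as np

def diagonal_ssm(u, state_dim=16, seed=0):
    rng = np.random.default_rng(seed)
    # Stable diagonal dynamics: poles strictly inside the unit circle.
    A = 0.99 * np.exp(1j * rng.uniform(0, np.pi, state_dim))
    B = rng.standard_normal(state_dim) + 1j * rng.standard_normal(state_dim)
    C = rng.standard_normal(state_dim) + 1j * rng.standard_normal(state_dim)
    D = 0.1
    x = np.zeros(state_dim, dtype=complex)
    y = np.zeros(len(u))
    for t in range(len(u)):
        x = A * x + B * u[t]               # elementwise: O(state_dim) per sample
        y[t] = (C * x).sum().real + D * u[t]
    return y

# Toy usage: filter one second of a stand-in excitation signal at 16 kHz.
u = np.random.default_rng(1).standard_normal(16000)
y = diagonal_ssm(u)
```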
Timbre-Constrained Recursive Time-Varying Analysis for Musical Note Separation
Note separation in music signal processing becomes difficult when there are overlapping partials from co-existing notes produced by either the same or different musical instruments. In order to deal with this problem, it is necessary to incorporate certain invariant features of musical instrument sounds into the separation processing. For example, the timbre of a note of a musical instrument may be used as one possible invariant feature. In this paper, a timbre estimate is used to represent this feature such that it becomes a constraint when note separation is performed on a mixture signal. To demonstrate the proposed method, a time-dependent recursive regularization analysis is employed. Spectral envelopes of different notes are estimated and a modified parameter update strategy is applied to the recursive regularization process. The experimental results show that the flaws due to the overlapping-partial problem can be effectively reduced through the proposed approach.
Interactive digital audio environments: gesture as a musical parameter
This paper presents some possible relationships between gesture and sound that may be built with an interactive digital audio environment. In a traditional musical situation, gesture usually produces sound. The relationship between gesture and sound is unique: it is a cause-and-effect link. In computer music, gesture can be uncoupled from sound because the computer can carry out all aspects of sound production, from composition up to interpretation and performance. Real-time computing technology and the development of human gesture-tracking systems may enable gesture to be reintroduced into the practice of computer music, but with a completely renewed approach. There is no longer a need to create direct cause-and-effect relationships for sound production, and gesture may be seen as another musical parameter to play with in the context of interactive musical performances.