Download Self-Supervised Disentanglement of Harmonic and Rhythmic Features in Music Audio Signals The aim of latent variable disentanglement is to infer the multiple informative latent representations that lie behind a data generation process and is a key factor in controllable data generation. In this paper, we propose a deep neural network-based self-supervised learning method to infer the disentangled rhythmic and harmonic representations behind music audio generation. We train a variational autoencoder that generates an audio mel-spectrogram from two latent features representing the rhythmic and harmonic content. In the training phase, the variational autoencoder is trained to reconstruct the input mel-spectrogram given its pitch-shifted version. At each forward computation in the training phase, a vector rotation operation is applied to one of the latent features, assuming that the dimensions of the feature vectors are related to pitch intervals. Therefore, in the trained variational autoencoder, the rotated latent feature represents the pitch-related information of the mel-spectrogram, and the unrotated latent feature represents the pitch-invariant information, i.e., the rhythmic content. The proposed method was evaluated using a predictor-based disentanglement metric on the learned features. Furthermore, we demonstrate its application to the automatic generation of music remixes.
Download Decorrelation for Immersive Audio Applications and Sound Effects Audio decorrelation is a fundamental building block for immersive audio applications. It has applications in parametric spatial audio coding, audio upmix, audio sound effects and audio rendering for virtual or augmented reality applications. In this paper, we provide insights into the practical design considerations of an audio decorrelator on the example of the decorrelator contained within the upcoming MPEG-I Immersive Audio ISO standard [1]. We describe the desirable properties of such a decorrelator, common approaches for implementation and our particular technology choices for the decorrelator used in MPEG-I for rendering sound sources with homogeneous extent.
Download Feature Based Delay Line Using Real-Time Concatenative Synthesis In this paper we introduce a novel approach utilizing real-time concatenative synthesis to produce a Feature-Based Delay Line (FBDL). Expanding upon the concept of a traditional delay, its most basic function is familiar – a dry signal is copied to an audio buffer whose read position is time shifted producing a delayed or "wet" signal that is then remixed with the dry. In our implementation, however, the traditionally unaltered wet signal is modified such that the audio delay buffer is segmented and concatenated according to specific audio features. Specifically, the input audio is analyzed and segmented as it is written to the delay buffer, where delayed segments are matched to a target feature set, such that the most similar segments are selected to constitute the wet signal of the delay. Targeting methods, either manual or automated, can be used to explore the feature space of the delay line buffer based on dry signal feature information and relevant targeting parameters, such as delay time. This paper will outline our process, detailing important requirements such as targeting and considerations for feature extraction and concatenation synthesis, as well as discussing use cases, performance evaluation, and commentary on the potential of advances to digital delay lines.
Download A frequency tracker based on a Kalman filter update of a single parameter adaptive notch filter In designing a frequency tracker, the goal is to follow the continual time variation of the frequency from a particular sinusoidal component in a noisy signal with a high accuracy and a low sample delay. Although there exists a plethora of frequency trackers in the literature, in this paper, we focus on the particular class of frequency trackers that are built upon an adaptive notch filter (ANF), i.e. a constrained bi-quadratic infinite impulse response filter, where only a single parameter needs to be estimated. As opposed to using the conventional least-mean-square (LMS) algorithm, we present an alternative approach for the estimation of this parameter, which ultimately corresponds to the frequency to be tracked. Specifically, we reformulate the ANF in terms of a state-space model, where the state contains the unknown parameter and can be subsequently updated using a Kalman filter. We also demonstrate that such an approach is equivalent to doing a normalized LMS filter update, where the regularization parameter can be expressed as the ratio of the variance of the measurement noise to the variance of the prediction error. Through an evaluation with both simulated and realistic data, it is shown that in comparison to the LMS-updated frequency tracker, the proposed Kalmanupdated alternative, results in a more accurate performance, with a faster convergence rate, while maintaining a low computational complexity and the ability to be updated on a sample-by-sample basis.
Download Nonlinear Strings based on Masses and Springs Due to advances in computational power, physical modelling for sound synthesis has gained an increased popularity over the past decades. Although much work has been done to accurately simulate existing physical systems, much less work exists on the use of physical modelling simply for the sake of creating sonically interesting sounds. This work presents a mass-spring network, inspired by existing models of the physical string. Masses have 2 translational degrees of freedom (DoF), and the springs have an additional equilibrium separation term, which together result in highly nonlinear effects. The main aim of this work is to create sonically interesting sounds while retaining some of the natural qualities of the physical string, as opposed to accurately simulating it. Although the implementation exhibits chaotic behaviour for certain choices of parameters, the presented system can create sonically interesting timbres, including nonlinear pitch glides and ‘wobbles’.
Download Probabilistic Reverberation Model Based on Echo Density and Kurtosis This article proposes a probabilistic model for synthesizing room impulse responses (RIRs) for use in convolution artificial reverberators. The proposed method is based on the concept of echo density. Echo density is a measure of the number of echoes per second in an impulse response and is a demonstrated perceptual metric of artificial reverberation quality. As echo density is related to the statistical measure of kurtosis, this article demonstrates that the statistics of an RIR can be modeled using a probabilistic mixture model. A mixture model designed specifically for modeling RIRs is proposed. The proposed method is useful for statistically replicating RIRs of a measured environment, thereby synthesizing new independent observations of an acoustic space. A perceptual pilot study is carried out to evaluate the fidelity of the replication process in monophonic and stereo artificial reverberators.
Download Fully Conditioned and Low-Latency Black-Box Modeling of Analog Compression Neural networks have been found suitable for virtual analog modeling applications. Several analog audio effects have been successfully modeled with deep learning techniques, using low-latency and conditioned architectures suitable for real-world applications. Challenges remain with effects presenting more complex responses, such as nonlinear and time-varying input-output relationships. This paper proposes a deep-learning model for the analog compression effect. The architecture we introduce is fully conditioned by the device control parameters and it works on small audio segments, allowing low-latency real-time implementations. The architecture is used to model the CL 1B analog optical compressor, showing an overall high accuracy and ability to capture the different attack and release compression profiles. The proposed architecture’ ability to model audio compression behaviors is also verified using datasets from other compressors. Limitations remain with heavy compression scenarios determined by the conditioning parameters.
Download Low-cost Numerical Approximation of HRTFs: a Non-Linear Frequency Sampling Approach Head-related transfer functions (HRTFs) describe filters that model the scattering effect of the human body on sound waves. In their discrete-time form, they are used in acoustic simulations for virtual reality (VR) or augmented reality (AR), and since HRTFs are listener-specific, the use of individualized HRTFs allows achieving more realistic perceptual results. One way to produce individualized HRTFs is by estimating the sound field around the subjects’ 3D representations (meshes) via numerical simulations, which compute discrete complex pressure values in the frequency domain in regular frequency steps. Despite the advances in the area, the computational resources required for this process are still considerably high and increase with frequency. The goal of this paper is to tackle the high computational cost associated with this task by sampling the frequency domain using hybrid linear-logarithmic frequency resolution. The results attained in simulations performed using 23 real 3D meshes suggest that the proposed strategy is able to reduce the computational cost while still providing remarkably low spectral distortion, even in simulations that require as little as 11.2% of the original total processing time.
Download A Differentiable Digital Moog Filter For Machine Learning Applications In this project, a digital ladder filter has been investigated and expanded. This structure is a simplified digital analog model of the well known analog Moog ladder filter. The goal of this paper is to derive the differentiation expressions of this filter with respect to its control parameters in order to integrate it in machine learning systems. The derivation of the backpropagation method is described in this work, it can be generalized to a Moog filter or a similar filter having any number of stages. Subsequently, the example of an adaptive Moog filter is provided. Finally, a machine learning application example is shown where the filter is integrated in a deep learning framework.
Download A Quadric Surface Model of Vacuum Tubes for Virtual Analog Applications Despite the prevalence of modern audio technology, vacuum tube amplifiers continue to play a vital role in the music industry. For this reason, over the years, many different digital techniques have been introduced for accomplishing their emulation. In this paper, we propose a novel quadric surface model for tube simulations able to overcome the Cardarilli model in terms of efficiency whilst retaining comparable accuracy when grid current is negligible. After showing the model capability to well outline tubes starting from measurement data, we perform an efficiency comparison by implementing the considered tube models as nonlinear 3-port elements in the Wave Digital domain. We do this by taking into account the typical common-cathode gain stage employed in vacuum tube guitar amplifiers. The proposed model turns out to be characterized by a speedup of 4.6× with respect to the Cardarilli model, proving thus to be promising for real-time Virtual Analog applications.