Audio Visualization via Delay Embedding and Subspace Learning
We describe a sequence of methods for producing videos from audio signals. Our visualizations capture perceptual features such as harmonicity and brightness: they produce stable images from periodic sounds and slowly evolving images from inharmonic ones, and they associate jagged shapes with brighter sounds and rounded shapes with darker ones. We interpret our methods as adaptive FIR filterbanks and show how, for larger values of the complexity parameters, we can perform accurate frequency detection without the Fourier transform. Attached to the paper is a code repository containing the Jupyter notebook used to generate the images and videos cited. We also provide code for a real-time C++ implementation of the simplest visualization method. We discuss the mathematical theory of our methods in the two appendices.
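As a rough, self-contained illustration of the two ingredients named in the title (not the authors' exact method), the sketch below delay-embeds one audio frame into a higher-dimensional space and projects the embedded point cloud onto its top two principal directions; a periodic frame then traces a stable closed curve, while an inharmonic frame drifts between video frames. The frame length, embedding dimension, and use of PCA for the subspace step are illustrative assumptions.

```python
# Minimal sketch of audio visualization via delay embedding + subspace learning.
# This illustrates the general technique, not the paper's exact method;
# frame length, embedding dimension, and the PCA projection are assumptions.
import numpy as np

def delay_embed(frame, dim=32, lag=1):
    """Stack lagged copies of the frame into points in R^dim."""
    n = len(frame) - (dim - 1) * lag
    return np.stack([frame[i * lag: i * lag + n] for i in range(dim)], axis=1)

def visualize_frame(frame, dim=32):
    """Project the delay-embedded frame onto its top two principal directions."""
    X = delay_embed(frame, dim)
    X = X - X.mean(axis=0)
    # Subspace learning step: PCA via SVD of the embedded point cloud.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T          # 2-D curve to draw for this video frame

if __name__ == "__main__":
    sr = 44100
    t = np.arange(2048) / sr
    periodic = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
    curve = visualize_frame(periodic)
    print(curve.shape)           # (n_points, 2): a stable closed curve for a periodic sound
```

Drawing one such curve per hop of the input signal yields the frames of a video.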
Real-Time System for Sound Enhancement in Noisy Environment
Noise can degrade the listening experience in many real-life situations in which loudspeakers are used for playback. One way to reduce the effect of the noise is to wear headphones, but these can be uncomfortable and are not permitted on some occasions. In this context, a system for improving audio perception and the intelligibility of sounds in a noisy domestic environment is introduced, and a real-time implementation is proposed. The system comprises three main blocks: a noise estimation procedure based on an adaptive algorithm, an auditory spectral masking algorithm that estimates the music threshold capable of masking the noise source, and an FFT equalizer used to apply the estimated levels. The system has been implemented on an embedded DSP board, using one microphone for ambient noise analysis and two vibrating sound transducers for sound reproduction. Several experiments in simulated and real-world scenarios have been carried out to demonstrate the effectiveness of the proposed approach.
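The FFT-equalizer block lends itself to a short sketch. The code below applies per-bin gains to a block of the playback signal so that its magnitude spectrum reaches a threshold derived from an ambient-noise estimate; the gain limiting and the use of the raw noise magnitude as the masking threshold are simplifying assumptions and do not reproduce the paper's auditory masking model.

```python
# Minimal sketch of an FFT-equalizer stage: boost spectral bins of the music
# so they reach a masking threshold estimated from the ambient noise.
# Band layout, smoothing, and gain limits are illustrative assumptions,
# not the paper's implementation.
import numpy as np

def fft_equalize(block, noise_mag, n_fft=1024, max_boost_db=12.0):
    """Boost spectral bins of `block` toward the estimated noise magnitude."""
    spec = np.fft.rfft(block, n_fft)
    music_mag = np.abs(spec) + 1e-12
    # Gain needed for the music level to reach the (noise-derived) threshold,
    # limited to avoid excessive boosting.
    gain = np.clip(noise_mag / music_mag, 1.0, 10 ** (max_boost_db / 20.0))
    return np.fft.irfft(spec * gain, n_fft)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.standard_normal(1024) * 0.1                       # quiet music block
    noise_mag = np.abs(np.fft.rfft(rng.standard_normal(1024)))    # ambient noise estimate
    out = fft_equalize(block, noise_mag)
    print(out.shape)
```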
Hybrid Audio Inpainting Approach with Structured Sparse Decomposition and Sinusoidal Modeling
This research presents a novel hybrid audio inpainting approach that accounts for the diversity of signals and enhances reconstruction quality. Existing inpainting approaches have limitations, such as energy drop and poor reconstruction quality for non-stationary signals. Based on the observation that an audio signal can be modeled as a mixture of three components (tonal, transient, and noise), the proposed approach divides the left and right reliable neighborhoods around the gap into these components using a structured sparse decomposition technique. The gap is reconstructed by extrapolating parameters estimated from the reliable neighborhoods of each component. Component-targeted methods are refined and employed to extrapolate the parameters according to each component's acoustic characteristics. Experiments were conducted to evaluate the performance of the hybrid approach and compare it with other state-of-the-art inpainting approaches. The results show that the hybrid approach achieves high-quality reconstruction and low computational complexity across various gap lengths and signal types, particularly for longer gaps and non-stationary signals.
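As a toy illustration of the sinusoidal-modeling side of such a scheme (not the paper's algorithm), the sketch below estimates the frequency, amplitude, and phase of a single partial from the reliable samples to the left of a gap and extrapolates it across the gap. A real hybrid inpainter would track many partials, handle the transient and noise components separately, and blend estimates from both sides of the gap; the sampling rate, gap length, and neighborhood size here are arbitrary.

```python
# Toy sketch: extrapolate one sinusoidal partial across a gap from its left neighborhood.
# A real hybrid inpainter tracks many partials, treats transients and noise separately,
# and blends estimates from both sides of the gap; this is only an illustration.
import numpy as np

sr = 16000
t = np.arange(2 * sr) / sr
signal = 0.8 * np.sin(2 * np.pi * 220.0 * t + 0.3)
gap = slice(16000, 17600)                      # 100 ms gap to reconstruct
left = signal[gap.start - 4000: gap.start]     # reliable left neighborhood

# Estimate the partial's frequency from the strongest FFT peak of the left neighborhood.
win = np.hanning(len(left))
spec = np.fft.rfft(left * win)
k = np.argmax(np.abs(spec))
freq = k * sr / len(left)

# Least-squares fit of amplitude and phase at that frequency.
n = np.arange(len(left))
basis = np.stack([np.cos(2 * np.pi * freq * n / sr),
                  np.sin(2 * np.pi * freq * n / sr)], axis=1)
a, b = np.linalg.lstsq(basis, left, rcond=None)[0]

# Extrapolate the fitted partial over the gap by continuing the sample index.
m = np.arange(len(left), len(left) + (gap.stop - gap.start))
reconstruction = a * np.cos(2 * np.pi * freq * m / sr) + b * np.sin(2 * np.pi * freq * m / sr)
print(np.max(np.abs(reconstruction - signal[gap])))   # small for this stationary tone
```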
HRTF Spatial Upsampling in the Spherical Harmonics Domain Employing a Generative Adversarial Network
A Head-Related Transfer Function (HRTF) captures the alterations a sound wave undergoes on its way from the source to the entrances of a listener’s left and right ear canals, and is imperative for creating immersive experiences in virtual and augmented reality (VR/AR). Nevertheless, creating personalized HRTFs demands sophisticated equipment and is hindered by time-consuming data acquisition processes. To address these challenges, various techniques for HRTF interpolation and upsampling have been proposed. This paper illustrates how Generative Adversarial Networks (GANs) can be applied to HRTF data upsampling in the spherical harmonics domain. We propose using an Autoencoding Generative Adversarial Network (AE-GAN) to upsample low-degree spherical harmonics coefficients and obtain a more accurate representation of the full HRTF set. The proposed method is benchmarked against two baselines: barycentric interpolation and HRTF selection. Results from a log-spectral distortion (LSD) evaluation suggest that the proposed AE-GAN has significant potential for upsampling very sparse HRTFs, achieving a 17% improvement over the baseline methods.
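For context, the sketch below shows one common formulation of the log-spectral distortion (LSD) metric used for the evaluation: the RMS difference in dB between reference and reconstructed HRTF magnitude spectra, averaged over directions. The FFT size, frequency range, and averaging are assumptions, and the paper's exact evaluation settings may differ.

```python
# Sketch of the log-spectral distortion (LSD) metric commonly used to evaluate
# HRTF reconstruction quality; the exact frequency range and averaging used in
# the paper are not reproduced here and are assumptions.
import numpy as np

def lsd(h_ref, h_est, n_fft=256):
    """Mean RMS difference, in dB, between reference and estimated HRTF magnitude spectra.

    h_ref, h_est: arrays of shape (n_directions, n_taps) containing HRIRs.
    """
    H_ref = np.abs(np.fft.rfft(h_ref, n_fft, axis=-1)) + 1e-12
    H_est = np.abs(np.fft.rfft(h_est, n_fft, axis=-1)) + 1e-12
    diff_db = 20.0 * np.log10(H_ref / H_est)
    return np.mean(np.sqrt(np.mean(diff_db ** 2, axis=-1)))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    hrir = rng.standard_normal((440, 128)) * np.exp(-np.arange(128) / 16.0)
    estimate = hrir + 0.05 * rng.standard_normal(hrir.shape)   # mock reconstruction
    print(f"LSD: {lsd(hrir, estimate):.2f} dB")
```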
NBU: Neural Binaural Upmixing of Stereo Content
While immersive music productions have become popular in recent years, music content produced during the last decades has been predominantly mixed for stereo. This paper presents a data-driven approach to automatic binaural upmixing of stereo music. The network architecture HDemucs, previously utilized for both source separation and binauralization, is leveraged for an end-to-end approach to binaural upmixing. We employ two distinct datasets, demonstrating that while custom-designed training data enhances the accuracy of spatial positioning, the use of professionally mixed music yields superior spatialization. The trained networks show a capacity to process multiple simultaneous sources individually and add valid binaural cues, effectively positioning sources with an average azimuthal error of less than 11.3°. A listening test with binaural experts shows that the approach outperforms digital-signal-processing-based binauralization of stereo content in terms of spaciousness while preserving audio quality.
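Evaluations of binaural upmixing ultimately rest on low-level interaural cues. The sketch below extracts a broadband interaural time difference (via cross-correlation) and an interaural level difference from one frame of a binaural signal; it is a generic illustration, not the evaluation pipeline behind the 11.3° figure reported above, and the frame handling and cue definitions are assumptions.

```python
# Rough sketch of extracting interaural cues (ITD and ILD) from a binaural signal,
# the kind of low-level measurement that azimuthal-error evaluations build on.
# Frame size and the simple broadband cross-correlation are illustrative assumptions.
import numpy as np

def interaural_cues(left, right, sr=48000, max_itd_ms=1.0):
    """Return (ITD in seconds, ILD in dB); positive ITD means the left channel leads."""
    max_lag = int(sr * max_itd_ms / 1e3)
    corr = np.correlate(right, left, mode="full")
    mid = len(corr) // 2                                   # zero-lag index
    lag = np.argmax(corr[mid - max_lag: mid + max_lag + 1]) - max_lag
    ild = 20.0 * np.log10((np.std(left) + 1e-12) / (np.std(right) + 1e-12))
    return lag / sr, ild

if __name__ == "__main__":
    sr = 48000
    t = np.arange(sr // 10) / sr
    src = np.sin(2 * np.pi * 500 * t)
    delay = 20                                             # ~0.42 ms lead in the left ear
    left = np.concatenate([src, np.zeros(delay)])
    right = 0.7 * np.concatenate([np.zeros(delay), src])   # quieter, delayed right ear
    print(interaural_cues(left, right, sr))
```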
Frequency-Dependent Characteristics and Perceptual Validation of the Interaural Thresholded Level Distribution
The interaural thresholded level distribution (ITLD) is a novel metric of auditory source width (ASW), derived from the psychophysical processes and structures of the inner ear. While several of the ITLD’s objective properties have been presented in previous work, its frequency-dependent characteristics and perceptual relationship with ASW have not been previously explored. This paper presents an investigation into these properties of the ITLD, which exhibits pronounced variation in band-limited behaviour as the octave-band centre frequency is increased. Additionally, a very strong correlation was found between [1 – ITLD] and normalised values of ASW collected from a semantic differential listening test based on the Multiple Stimulus with Hidden Reference and Anchor (MUSHRA) framework. Perceptual relationships between various ITLD-derived quantities were also investigated, showing that the low-pass filter intrinsic to the ITLD calculation strengthened the relationship between [1 – ITLD] and ASW. A subsequent test using transient stimuli, as well as investigations into other psychoacoustic properties of the metric such as its just-noticeable difference, were outlined as subjects for future research, to gain a deeper understanding of the subjective properties of the ITLD.
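The ITLD itself is defined in the authors' earlier work and is not reproduced here; the snippet below only illustrates the kind of correlation analysis described above, relating [1 – metric] values to normalised listening-test ratings, using hypothetical placeholder numbers.

```python
# Illustration of the correlation analysis described above, with hypothetical data:
# relate [1 - metric] values to normalised ASW ratings from a listening test.
# The ITLD computation itself (defined in the authors' prior work) is not shown.
import numpy as np

itld = np.array([0.92, 0.85, 0.74, 0.60, 0.41, 0.30])   # hypothetical ITLD per stimulus
asw = np.array([0.10, 0.22, 0.35, 0.55, 0.78, 0.90])    # hypothetical normalised ASW ratings

r = np.corrcoef(1.0 - itld, asw)[0, 1]
print(f"Pearson r between [1 - ITLD] and ASW: {r:.3f}")
```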
Decoding Sound Source Location From EEG: Preliminary Comparisons of Spatial Rendering and Location
Spatial auditory acuity is contingent on the quality of spatial cues presented during listening. Electroencephalography (EEG) shows promise for finding neural markers of such acuity in recorded neural activity, potentially mitigating common challenges with behavioural assessment (e.g., sound source localisation tasks). This study presents findings from three preliminary experiments which investigated variations in neural responses to auditory stimuli under different spatial listening conditions: free-field (loudspeaker-based), individual Head-Related Transfer Functions (HRTFs), and non-individual HRTFs. Three participants, each taking part in one experiment, were exposed to auditory stimuli from various spatial locations while neural activity was recorded via EEG. The resulting neural responses underwent a decoding protocol to assess how decoding accuracy varied between stimulus locations over time. Decoding accuracy was highest for free-field auditory stimuli, with significant but lower decoding accuracy between left- and right-hemisphere locations for individual and non-individual HRTF stimuli. A latency in significant decoding accuracy was observed between listening conditions for locations dominated by spectral cues. Furthermore, the findings suggest that decoding accuracy between free-field and non-individual HRTF stimuli may reflect behavioural front-back confusion rates.
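The abstract does not detail the decoding protocol, so the sketch below shows a generic time-resolved decoding analysis of the kind commonly used to obtain such accuracy-over-time curves: cross-validated classification of stimulus location from epoched EEG in a sliding window. The epoch dimensions, classifier choice, and injected class difference are all hypothetical.

```python
# Generic time-resolved decoding sketch: classify stimulus location from EEG epochs
# in a sliding window and track cross-validated accuracy over time. The epoch shape,
# classifier, and window length are assumptions; the paper's protocol may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_channels, n_samples = 200, 32, 256        # hypothetical epoched EEG data
epochs = rng.standard_normal((n_trials, n_channels, n_samples))
labels = rng.integers(0, 2, n_trials)                  # e.g. left vs right hemifield
epochs[labels == 1, :, 100:140] += 0.4                 # injected "evoked" class difference

window = 20
accuracy = []
for start in range(0, n_samples - window, window):
    X = epochs[:, :, start:start + window].reshape(n_trials, -1)
    clf = LogisticRegression(max_iter=1000)
    accuracy.append(cross_val_score(clf, X, labels, cv=5).mean())
print(np.round(accuracy, 2))                           # accuracy rises where the classes differ
```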
A Deep Learning Approach to the Prediction of Time-Frequency Spatial Parameters for Use in Stereo Upmixing
This paper presents a deep learning approach to parametric time-frequency parameter prediction for use within stereo upmixing algorithms. The approach uses a Multi-Channel U-Net with Residual connections (MuCh-Res-U-Net), trained on a novel dataset of stereo and parametric time-frequency spatial audio data, to predict time-frequency spatial parameters from a stereo input signal for positions on a 50-point Lebedev quadrature sampling of the sphere. An example upmix pipeline is then proposed which uses the predicted time-frequency spatial parameters to extract stereo signal components and remap them to target spherical harmonic components, facilitating the generation of a full spherical representation of the upmixed sound field.
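The remapping onto spherical harmonic components can be sketched for the first-order case. The code below encodes per-direction signal components, e.g. taken at the points of a 50-point spherical grid, into ACN/SN3D first-order ambisonic channels; the grid, order, and convention are illustrative assumptions rather than the proposed pipeline.

```python
# Sketch of the final remapping step: encode directional signal components into
# first-order spherical harmonic (ambisonic, ACN/SN3D) components for a set of
# directions on the sphere. Grid, order, and convention are illustrative assumptions.
import numpy as np

def sh_encode_first_order(signals, azimuths, elevations):
    """Encode per-direction signals (n_dirs, n_samples) into W, Y, Z, X components."""
    az, el = np.asarray(azimuths), np.asarray(elevations)
    Y = np.stack([np.ones_like(az),                    # W (ACN 0)
                  np.sin(az) * np.cos(el),             # Y (ACN 1)
                  np.sin(el),                          # Z (ACN 2)
                  np.cos(az) * np.cos(el)], axis=0)    # X (ACN 3)
    return Y @ signals                                 # (4, n_samples) ambisonic signals

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_dirs, n_samples = 50, 4800                       # e.g. a 50-point spherical grid
    signals = rng.standard_normal((n_dirs, n_samples)) # per-direction extracted components
    az = rng.uniform(-np.pi, np.pi, n_dirs)
    el = rng.uniform(-np.pi / 2, np.pi / 2, n_dirs)
    print(sh_encode_first_order(signals, az, el).shape)  # (4, 4800)
```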
Audio-Visual Talker Localization in Video for Spatial Sound Reproduction
Object-based audio production requires positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating the automatic extraction of the talker's positional metadata relative to the camera’s reference frame. With the integration of the visual modality, this study expands upon our previous investigation, which focused solely on audio-based active speaker detection and localization. Our experiments compare conventional audio-visual approaches for active speaker detection that leverage monaural audio, our previous audio-only method that leverages multichannel recordings from a microphone array, and a novel audio-visual approach integrating vision and multichannel audio. We found that the two modalities complement each other. Multichannel audio, which overcomes the problem of visual occlusions, provides a double-digit reduction in detection error compared to audio-visual methods with single-channel audio. The combination of multichannel audio and vision further enhances spatial accuracy, leading to a four-percentage-point increase in F1 score on the Tragic Talkers dataset. Future investigations will assess the robustness of the model in noisy and highly reverberant environments, as well as tackle the problem of off-screen speakers.
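Frame-level active speaker detection is commonly scored with precision, recall, and F1. The sketch below computes a binary F1 over per-frame speaking labels; the example labels are hypothetical, and the paper's evaluation on the Tragic Talkers dataset may differ in detail.

```python
# Sketch of the frame-level F1 evaluation typically used for active speaker
# detection: compare per-frame predicted speaking/non-speaking labels against
# ground truth. The labels below are hypothetical placeholders.
import numpy as np

def f1_score(truth, pred):
    """Binary F1 over per-frame active-speaker labels."""
    truth, pred = np.asarray(truth, bool), np.asarray(pred, bool)
    tp = np.sum(truth & pred)
    fp = np.sum(~truth & pred)
    fn = np.sum(truth & ~pred)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

truth = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
pred  = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
print(f"F1 = {f1_score(truth, pred):.3f}")
```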
QUBX: Rust Library for Queue-Based Multithreaded Real-Time Parallel Audio Streams Processing and Management
The concurrent management of real-time audio streams poses an increasingly complex technical challenge in digital audio signal processing, requiring efficient and intuitive solutions. Qubx tackles this challenge with an architecture based on dynamic circular queues, tailored to optimize and synchronize the processing of parallel audio streams. It is a library written in Rust, a modern and powerful ecosystem that still offers a limited range of tools for digital signal processing and management. Additionally, Rust’s inherent safety guarantees and expressive type system bolster the resilience and stability of the proposed tool.
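The queue-based pattern described above can be illustrated generically (here in Python for brevity): audio blocks flow from a producer to per-stream workers through bounded FIFO queues, one queue per parallel stream. This is not QUBX's API, and Python's queue.Queue is only a stand-in for the dynamic circular queues the library implements in Rust.

```python
# Generic illustration of queue-based multithreaded audio-stream processing:
# a producer feeds audio blocks to per-stream worker threads via bounded queues.
# Not QUBX's API; QUBX implements its own circular queues in Rust.
import queue
import threading
import numpy as np

BLOCK = 512
streams = [queue.Queue(maxsize=8) for _ in range(2)]   # one bounded queue per stream

def producer(n_blocks=100):
    rng = np.random.default_rng(0)
    for _ in range(n_blocks):
        block = rng.standard_normal(BLOCK).astype(np.float32)
        for q in streams:
            q.put(block)                               # blocks if a consumer falls behind
    for q in streams:
        q.put(None)                                    # end-of-stream sentinel

def worker(q, gain):
    processed = 0
    while (block := q.get()) is not None:
        _ = block * gain                               # stand-in for per-stream DSP
        processed += 1
    print(f"stream done, processed {processed} blocks")

threads = [threading.Thread(target=producer)] + [
    threading.Thread(target=worker, args=(q, g)) for q, g in zip(streams, (0.5, 2.0))
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```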