Automatic Classification of Chains of Guitar Effects Through Evolutionary Neural Architecture Search

Recent studies on classifying electric guitar effects have achieved
high accuracy, particularly with deep learning techniques. However, these studies often rely on simplified datasets consisting
mainly of single notes rather than realistic guitar recordings.
Moreover, in the specific field of effect chain estimation, the literature tends to rely on large models, making them impractical for
real-time or resource-constrained applications. In this work, we
recorded realistic guitar performances using four different guitars
and created three datasets by applying a chain of five effects with
increasing complexity: (1) fixed order and parameters, (2) fixed order with randomly sampled parameters, and (3) random order and
parameters. We also propose a novel Neural Architecture Search
method aimed at discovering accurate yet compact convolutional
neural network models to reduce power and memory consumption.
We compared its performance to a basic random search strategy,
showing that our custom Neural Architecture Search outperformed
random search in identifying models that balance accuracy and
complexity. We found that the number of convolutional and pooling layers becomes increasingly important as dataset complexity
grows, while dense layers have less impact. Additionally, among
the effects, tremolo was identified as the most challenging to classify.
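
To make the search loop concrete, here is a minimal sketch of an evolutionary architecture search over CNN hyperparameters. The search space, mutation scheme, fitness weighting, and the train_and_evaluate stub are illustrative assumptions, not the paper's actual method.

```python
# Minimal evolutionary search over CNN hyperparameters.
# The search space, mutation scheme, and fitness trade-off below are
# illustrative assumptions, not the paper's actual NAS method.
import random

SPACE = {
    "n_conv":   [1, 2, 3, 4],     # convolution/pooling blocks
    "channels": [8, 16, 32, 64],  # width of each block
    "n_dense":  [0, 1, 2],        # dense layers before the output head
}

def sample():
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(arch):
    child = dict(arch)
    key = random.choice(list(SPACE))
    child[key] = random.choice(SPACE[key])
    return child

def complexity(arch):
    # Crude proxy for parameter count / memory footprint.
    return arch["n_conv"] * arch["channels"] ** 2 + arch["n_dense"] * 1024

def train_and_evaluate(arch):
    # Stub: in practice, build the CNN, train it on the effect-chain
    # dataset, and return validation accuracy in [0, 1].
    return random.random()

def fitness(arch, lam=1e-5):
    # Reward accuracy, penalize size: the accuracy/complexity balance.
    return train_and_evaluate(arch) - lam * complexity(arch)

population = [sample() for _ in range(8)]
for generation in range(10):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:4]                 # keep the fittest half
    population = parents + [mutate(random.choice(parents)) for _ in range(4)]

print("best architecture:", max(population, key=fitness))
```
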
Inference-Time Structured Pruning for Real-Time Neural Network Audio Effects

Structured pruning is a technique for reducing the computational
load and memory footprint of neural networks by removing structured subsets of parameters according to a predefined schedule
or ranking criterion.
This paper investigates the application of
structured pruning to real-time neural network audio effects, focusing on both feedforward networks and recurrent architectures.
We evaluate multiple pruning strategies at inference time, without retraining, and analyze their effects on model performance. To
quantify the trade-off between parameter count and audio fidelity,
we construct a theoretical model of the approximation error as a
function of network architecture and pruning level. The resulting bounds establish a principled relationship between pruning-induced sparsity and functional error, enabling informed deployment of neural audio effects in constrained real-time environments.
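
As a concrete illustration of inference-time structured pruning, the sketch below removes whole hidden units from a small feedforward layer pair by an L2 ranking of their incoming weight rows, with no retraining. The layer sizes and the ranking criterion are assumptions, not the paper's exact setup.

```python
# Inference-time structured pruning of a feedforward layer pair
# (illustrative sketch; sizes and the L2 criterion are assumptions).
import torch

torch.manual_seed(0)
fc1 = torch.nn.Linear(64, 128)
fc2 = torch.nn.Linear(128, 32)

def prune_hidden_units(fc_in, fc_out, keep_ratio=0.5):
    """Remove whole hidden units ranked by the L2 norm of their
    incoming weight rows; no retraining is performed."""
    norms = fc_in.weight.detach().norm(dim=1)        # one score per unit
    n_keep = int(keep_ratio * norms.numel())
    keep = norms.topk(n_keep).indices.sort().values  # surviving unit indices
    new_in = torch.nn.Linear(fc_in.in_features, n_keep)
    new_out = torch.nn.Linear(n_keep, fc_out.out_features)
    with torch.no_grad():
        new_in.weight.copy_(fc_in.weight[keep])
        new_in.bias.copy_(fc_in.bias[keep])
        new_out.weight.copy_(fc_out.weight[:, keep])  # drop matching columns
        new_out.bias.copy_(fc_out.bias)
    return new_in, new_out

x = torch.randn(1, 64)
dense = fc2(torch.relu(fc1(x)))
p1, p2 = prune_hidden_units(fc1, fc2, keep_ratio=0.5)
pruned = p2(torch.relu(p1(x)))
print("approximation error:", (dense - pruned).norm().item())
```
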
Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial Approaches

Accurately estimating nonlinear audio effects without access to
paired input-output signals remains a challenging problem. This
work studies unsupervised probabilistic approaches for solving this
task. We introduce a method, novel for this application, based
on diffusion generative models for blind system identification, enabling the estimation of unknown nonlinear effects using black- and gray-box models. We compare this method with a previously proposed adversarial approach, analyzing the performance of both methods under different parameterizations of the
effect operator and varying lengths of available effected recordings. Through experiments on guitar distortion effects, we show
that the diffusion-based approach provides more stable results and
is less sensitive to data availability, while the adversarial approach
is superior at estimating more pronounced distortion effects. Our
findings contribute to the robust unsupervised blind estimation of
audio effects, demonstrating the potential of diffusion models for
system identification in music technology.
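
For readers unfamiliar with the setup, a gray-box effect operator of the kind such blind estimators fit might look like the following sketch; the drive/tone parameterization is a generic assumption, not necessarily the one compared in the paper.

```python
# A generic gray-box distortion operator of the kind such blind
# estimators fit (illustrative assumption, not the paper's operator).
import torch

class GrayBoxDistortion(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.pre_gain = torch.nn.Parameter(torch.tensor(1.0))   # drive
        self.post_gain = torch.nn.Parameter(torch.tensor(1.0))  # output level
        # Short learnable FIR acting as a tone/coloration stage.
        self.tone = torch.nn.Conv1d(1, 1, kernel_size=17, padding=8, bias=False)

    def forward(self, x):                   # x: (batch, 1, samples)
        y = torch.tanh(self.pre_gain * x)   # static clipping nonlinearity
        y = self.tone(y)                    # linear filtering after clipping
        return self.post_gain * y

fx = GrayBoxDistortion()
wet = fx(torch.randn(4, 1, 2048))           # only wet audio is observed
print(wet.shape)                            # torch.Size([4, 1, 2048])
```
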
Empirical Results for Adjusting Truncated Backpropagation Through Time While Training Neural Audio Effects

This paper investigates the optimization of Truncated Backpropagation Through Time (TBPTT) for training neural networks in
digital audio effect modeling, with a focus on dynamic range compression. The study evaluates key TBPTT hyperparameters – sequence number, batch size, and sequence length – and their influence on model performance. Using a convolutional-recurrent architecture, we conduct extensive experiments across datasets with
and without conditioning by user controls. Results demonstrate
that carefully tuning these parameters enhances model accuracy
and training stability, while also reducing computational demands.
Objective evaluations confirm improved performance with optimized settings, while subjective listening tests indicate that the
revised TBPTT configuration maintains high perceptual quality.
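
A minimal TBPTT loop makes these hyperparameters concrete: the signal is split into truncated chunks, and the recurrent state is detached at each chunk boundary so gradients do not flow further back. The model and values below are stand-ins, not the paper's configuration.

```python
# Minimal TBPTT loop for a recurrent audio-effect model (illustrative;
# the model, batch size, and lengths are assumed stand-ins).
import torch

model = torch.nn.GRU(input_size=1, hidden_size=16, batch_first=True)
head = torch.nn.Linear(16, 1)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)

batch_size, total_len, trunc_len = 8, 8192, 512  # key TBPTT hyperparameters
x = torch.randn(batch_size, total_len, 1)        # input audio
y = torch.tanh(x)                                # toy target "effect"

hidden = None
for start in range(0, total_len, trunc_len):
    xc = x[:, start:start + trunc_len]
    yc = y[:, start:start + trunc_len]
    out, hidden = model(xc, hidden)
    loss = torch.nn.functional.mse_loss(head(out), yc)
    opt.zero_grad()
    loss.backward()
    opt.step()
    hidden = hidden.detach()  # truncate the gradient at the chunk boundary
print("final chunk loss:", loss.item())
```
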
Neural-Driven Multi-Band Processing for Automatic Equalization and Style Transfer

We present a Neural-Driven Multi-Band Processor (NDMP), a differentiable audio processing framework that augments a static six-band Parametric Equalizer (PEQ) with per-band dynamic range
compression. We optimize this processor using neural inference
for two tasks: Automatic Equalization (AutoEQ), which estimates
tonal and dynamic corrections without a reference, and Production
Style Transfer (NDMP-ST), which adapts the processing of an input signal to match the tonal and dynamic characteristics of a reference. We train NDMP using a self-supervised strategy, where the
model learns to recover a clean signal from inputs degraded with
randomly sampled NDMP parameters and gain adjustments. This
setup eliminates the need for paired input–target data and enables
end-to-end training with audio-domain loss functions. At inference time, AutoEQ enhances previously unseen inputs in a blind setting, while NDMP-ST performs style transfer by predicting task-specific processing parameters. We evaluate our approach on the
MUSDB18 dataset using both objective metrics (e.g., SI-SDR,
PESQ, STFT loss) and a listening test.
Our results show that
NDMP consistently outperforms traditional PEQ and a PEQ+DRC
(single-band) baseline, offering a robust neural framework for audio enhancement that combines learned spectral and dynamic control.
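
The self-supervised recipe can be sketched as follows: corrupt clean audio with randomly sampled processor parameters, then treat the (degraded, clean) pair as training data under an audio-domain loss. The crude FFT-based band split below is an illustrative stand-in for the actual six-band PEQ with per-band compression.

```python
# Sketch of the self-supervised recipe: corrupt clean audio with random
# processor parameters, then learn to undo it. The equal-width band split
# and parameter ranges are illustrative assumptions, not the NDMP itself.
import torch

def random_band_gains(n_bands=6, max_db=12.0):
    # Random per-band gains standing in for sampled NDMP parameters.
    return 10 ** ((torch.rand(n_bands) * 2 - 1) * max_db / 20)

def degrade(clean, gains):
    # Crude six-band processor: split the spectrum into equal bands and
    # apply one gain per band (a stand-in for PEQ + per-band DRC).
    spec = torch.fft.rfft(clean)
    bands = torch.chunk(torch.arange(spec.numel()), len(gains))
    for idx, g in zip(bands, gains):
        spec[idx] = spec[idx] * g
    return torch.fft.irfft(spec, n=clean.numel())

clean = torch.randn(4096)
degraded = degrade(clean, random_band_gains())
# Training pair: (degraded, clean) -> audio-domain loss, no paired data needed.
loss = torch.nn.functional.l1_loss(degraded, clean)
print("example loss:", loss.item())
```
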
TorchFX: A Modern Approach to Audio DSP with PyTorch and GPU Acceleration

The increasing complexity and real-time processing demands of
audio signals require optimized algorithms that utilize the computational power of Graphics Processing Units (GPUs).
Existing Digital Signal Processing (DSP) libraries often do not provide
the necessary efficiency and flexibility, particularly for integrating
with Artificial Intelligence (AI) models. In response, we introduce TorchFX: a GPU-accelerated Python library for DSP, engineered to facilitate sophisticated audio signal processing. Built on
the PyTorch framework, TorchFX offers an Object-Oriented interface similar to torchaudio but enhances functionality with a novel
pipe operator for intuitive filter chaining. The library provides a
comprehensive suite of Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters, with a focus on multichannel
audio, thereby facilitating the integration of DSP and AI-based
approaches.
Our benchmarking results demonstrate significant
efficiency gains over traditional libraries like SciPy, particularly
in multichannel contexts. While there are current limitations in
GPU compatibility, ongoing developments promise broader support and real-time processing capabilities. TorchFX aims to become a useful tool for the community, contributing to innovation
in GPU-accelerated DSP. TorchFX is publicly available on GitHub
at https://github.com/matteospanio/torchfx.
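
To illustrate the idea behind such a pipe operator, overloading Python's `__or__` is enough to make effect chains read left to right. The classes below are generic stand-ins, not TorchFX's actual API; see the repository for the real interface.

```python
# Generic pipe-operator pattern for filter chaining (illustrative;
# not TorchFX's actual classes or API).
import torch

class FX:
    def process(self, x):
        raise NotImplementedError
    def __call__(self, x):
        return self.process(x)
    def __or__(self, other):
        # `a | b` returns a new effect that applies a, then b.
        return Chain(self, other)

class Chain(FX):
    def __init__(self, first, second):
        self.first, self.second = first, second
    def process(self, x):
        return self.second(self.first(x))

class Gain(FX):
    def __init__(self, g):
        self.g = g
    def process(self, x):
        return self.g * x

class HardClip(FX):
    def __init__(self, limit=1.0):
        self.limit = limit
    def process(self, x):
        return x.clamp(-self.limit, self.limit)

fx = Gain(4.0) | HardClip(0.8) | Gain(0.5)  # chain reads left to right
y = fx(torch.randn(2, 48000))               # e.g., a stereo buffer
```

Composing this way keeps each effect independently testable while the composite chain remains a single callable.
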
Hyperbolic Embeddings for Order-Aware Classification of Audio Effect Chains

Audio effects (AFXs) are essential tools in music production, frequently applied in chains to shape timbre and dynamics. The order of AFXs in a chain plays a crucial role in determining the final sound, particularly when non-linear (e.g., distortion) or time-variant (e.g., chorus) processors are involved. Despite its importance, most AFX-related studies have primarily focused on estimating effect types and their parameters from a wet signal. To
address this gap, we formulate AFX chain recognition as the task
of jointly estimating AFX types and their order from a wet signal.
We propose a neural-network-based method that embeds wet signals into a hyperbolic space and classifies their AFX chains. Hyperbolic space can represent tree-structured data more efficiently
than Euclidean space due to its exponential expansion property.
Since AFX chains can be represented as trees, with AFXs as nodes
and edges encoding effect order, hyperbolic space is well-suited
for modeling the exponentially growing and non-commutative nature of ordered AFX combinations, where changes in effect order can result in different final sounds. Experiments using guitar
sounds demonstrate that, with an appropriate curvature, the proposed method outperforms its Euclidean counterpart. Further analysis based on AFX type and chain length highlights the effectiveness of the proposed method in capturing AFX order.
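
For reference, the Poincaré ball is the standard model for such embeddings; whether the paper uses this exact model and distance is an assumption on our part. With curvature parameter c > 0, the distance between two points is:

```latex
% Poincaré ball distance for curvature -c (standard form; whether the
% paper uses this exact model is an assumption).
\[
  d_c(\mathbf{u}, \mathbf{v}) =
  \frac{1}{\sqrt{c}}\,
  \operatorname{arcosh}\!\left(
    1 + \frac{2c\,\lVert \mathbf{u}-\mathbf{v}\rVert^{2}}
             {\left(1 - c\lVert \mathbf{u}\rVert^{2}\right)
              \left(1 - c\lVert \mathbf{v}\rVert^{2}\right)}
  \right)
\]
```

Volume in this space grows exponentially with radius, mirroring how the number of possible ordered AFX chains grows with chain length, which is why curvature matters in the experiments.
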
Towards an Objective Comparison of Panning Feature Algorithms for Unsupervised Learning

Estimates of panning attributes are important features to extract from a piece of recorded music, with downstream uses such
as classification, quality assessment, and listening enhancement.
While several algorithms exist in the literature, there is currently
no comparison between them and no studies to suggest which one
is most suitable for any particular task. This paper compares four
algorithms for extracting amplitude panning features with respect
to their suitability for unsupervised learning. It finds commonalities between them and analyses their results on a small set of
commercial music excerpts chosen for their distinct panning features. The ability of each algorithm to differentiate between the
tracks is analysed. The results can be used in future work to either
select the most appropriate panning feature algorithm or create a
version customized for a particular task.
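
As an example of what such a feature looks like, the sketch below computes a simple per-bin panning index from left/right spectra. It is illustrative only, and not necessarily one of the four algorithms compared in the paper.

```python
# A simple amplitude panning feature: a per-bin left/right energy
# index in [-1, 1] (illustrative; not the paper's four algorithms).
import numpy as np

def panning_index(left, right, n_fft=2048, hop=512):
    """Return a time-frequency map: -1 = hard left, 0 = center, +1 = hard right."""
    def stft(x):
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        return np.fft.rfft(frames * np.hanning(n_fft), axis=1)
    L, R = stft(left), stft(right)
    num = np.abs(R) ** 2 - np.abs(L) ** 2
    den = np.abs(L) ** 2 + np.abs(R) ** 2 + 1e-12
    return num / den

sr = 44100
t = np.arange(sr) / sr
mono = np.sin(2 * np.pi * 440 * t)
left, right = 0.9 * mono, 0.3 * mono                      # source panned left
print("mean index:", panning_index(left, right).mean())   # negative -> left
```
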
Unsupervised Text-to-Sound Mapping via Embedding Space Alignment

This work focuses on developing an artistic tool that performs an
unsupervised mapping between text and sound, converting an input text string into a series of sounds from a given sound corpus.
Using a pre-trained sound embedding model and a separate, pre-trained text embedding model, the goal is to find a mapping between the two feature spaces. Our approach is unsupervised, which allows any sound corpus to be used with the system.
The tool performs the task of text-to-sound retrieval, creating a
soundfile in which each word in the text input is mapped to a single sound in the corpus, and the resulting sounds are concatenated
to play sequentially. We experiment with three different mapping
methods, and perform quantitative and qualitative evaluations on
the outputs. Our results demonstrate the potential of unsupervised
methods for creative applications in text-to-sound mapping.
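
A minimal version of the retrieval step might look like this. The random matrices stand in for the pre-trained text and sound embeddings, and the shared-space assumption glosses over the alignment step that the paper's three mapping methods address.

```python
# Minimal retrieval sketch: map each word embedding to its nearest sound
# embedding by cosine similarity (random stand-ins for the pre-trained
# embedding models; the paper's mapping methods are not reproduced).
import numpy as np

rng = np.random.default_rng(0)
word_emb = rng.normal(size=(5, 64))     # one row per input word
sound_emb = rng.normal(size=(200, 64))  # one row per corpus sound

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# A shared space is assumed here; in practice an alignment between the
# text and sound spaces would be learned first.
sims = normalize(word_emb) @ normalize(sound_emb).T  # cosine similarities
nearest = sims.argmax(axis=1)                        # one sound per word
print("word -> sound indices:", nearest)  # concatenate these sounds in order
```
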
Generative Latent Spaces for Neural Synthesis of Audio Textures

This paper investigates the synthesis of audio textures and the
structure of generative latent spaces using Variational Autoencoders (VAEs) within two paradigms of neural audio synthesis:
DSP-inspired and data-driven approaches. For each paradigm, we
propose VAE-based frameworks that allow fine-grained temporal
control. We introduce datasets across three categories of environmental sounds to support our investigations. We evaluate and compare the models’ reconstruction performance using objective metrics, and investigate their generative capabilities and latent space
structure through latent space interpolations.
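
A typical latent-interpolation probe of the kind described might be sketched as follows; the decoder here is an untrained stand-in for a trained VAE's decoder.

```python
# Latent interpolation of the kind used to probe a VAE's latent space
# (illustrative; the decoder is an untrained stand-in).
import torch

latent_dim = 16
decoder = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, 256), torch.nn.Tanh(),
    torch.nn.Linear(256, 1024),       # 1024 audio samples per decoded frame
)

z_a = torch.randn(latent_dim)  # latent code of texture A (stand-in)
z_b = torch.randn(latent_dim)  # latent code of texture B (stand-in)

# Walk linearly from z_a to z_b and decode each step; smooth changes in
# the output suggest a well-structured latent space.
steps = torch.linspace(0, 1, 9)
frames = [decoder((1 - a) * z_a + a * z_b) for a in steps]
audio = torch.cat(frames)      # naive concatenation of decoded frames
print(audio.shape)             # torch.Size([9216])
```
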