Relative music loudness estimation is a MIR task that consists of dividing audio into segments of three classes: Foreground Music, Background Music, and No Music. Given the temporal correlation of music, we approach the task using a type of network able to model temporal context: the Temporal Convolutional Network (TCN). We propose two architectures: a plain TCN, and a novel architecture that combines a TCN with a Convolutional Neural Network (CNN) front-end, which we name CNN-TCN. We expect the CNN front-end to act as a feature extraction stage that makes more efficient use of the network's parameters. We use the OpenBMAT dataset
to train and test 40 TCN and 80 CNN-TCN models with two grid
searches over a set of hyper-parameters. We compare our models with the two best algorithms submitted to the tasks of music
detection and relative music loudness estimation in MIREX 2019.
All our models outperform the MIREX algorithms, even with fewer parameters. The CNN-TCN emerges as the best architecture, as all its models outperform all TCN models. We show that adding a CNN front-end to a TCN can reduce the number of parameters of the network while improving performance. The CNN front-end works effectively as a feature extractor, producing consistent patterns that identify different combinations of music and non-music sounds, and it also yields a smoother output than the TCN models.
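
To make the CNN-TCN combination concrete, the following is a minimal sketch, written in PyTorch as an assumption (the framework, layer sizes, mel-band count, pooling factors, and block count are all illustrative, not the configurations explored in our grid searches). A small CNN collapses the frequency axis of a mel-spectrogram into one feature vector per frame; a stack of dilated residual 1-D convolutions then models temporal context and a 1x1 convolution produces frame-wise logits for the three classes.

import torch
import torch.nn as nn

class CNNFrontEnd(nn.Module):
    """CNN front-end: learns local time-frequency patterns and collapses
    the frequency axis into a feature vector per frame for the TCN."""
    def __init__(self, n_mels=96, out_features=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),  # pool along frequency only, keep frames
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
        )
        self.proj = nn.Conv1d(32 * (n_mels // 16), out_features, kernel_size=1)

    def forward(self, x):            # x: (batch, 1, n_mels, frames)
        h = self.conv(x)             # (batch, 32, n_mels // 16, frames)
        h = h.flatten(1, 2)          # stack channels x frequency per frame
        return self.proj(h)          # (batch, out_features, frames)

class TCNBlock(nn.Module):
    """Residual block of dilated 1-D convolutions, the basic TCN unit.
    Symmetric (non-causal) padding keeps the frame count, which is fine
    for offline segmentation."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=dilation, padding=dilation),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.net(x)       # residual connection

class CNNTCN(nn.Module):
    """CNN front-end followed by a TCN and a frame-wise 3-class head:
    Foreground Music, Background Music, No Music."""
    def __init__(self, n_mels=96, channels=64, n_blocks=5):
        super().__init__()
        self.front_end = CNNFrontEnd(n_mels, channels)
        self.tcn = nn.Sequential(
            *[TCNBlock(channels, dilation=2 ** i) for i in range(n_blocks)])
        self.head = nn.Conv1d(channels, 3, kernel_size=1)

    def forward(self, x):
        return self.head(self.tcn(self.front_end(x)))  # (batch, 3, frames)

logits = CNNTCN()(torch.randn(2, 1, 96, 500))  # 2 clips, 500 frames each
print(logits.shape)                            # torch.Size([2, 3, 500])

Doubling the dilation in each block grows the receptive field exponentially with depth, which is how the TCN captures long-range temporal context with few parameters; the 1x1 projection after the front-end is one way to keep the channel count, and hence the parameter budget, small.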