A Cosine-Distance Based Neural Network for Music Artist Recognition Using Raw I-Vector Feature

Hamid Eghbal-Zadeh; Matthias Dorfer; Gerhard Widmer
DAFx-2016 - Brno
Recently, i-vector features have entered the field of Music Information Retrieval (MIR), exhibiting highly promising performance in important tasks such as music artist recognition or music similarity estimation. The i-vector modelling approach relies on a complex processing chain that limits by the use of engineered features such as MFCCs. The goal of the present paper is to make an important step towards a truly end-to-end modelling system inspired by the i-vector pipeline, to exploit the power of Deep Neural Networks1 (DNNs) to learn optimized feature spaces and transformations. Several authors have already tried to combine the power of DNNs with i-vector features, where DNNs were used for feature extraction, scoring or classification. In this paper, we try to use neural networks for the important step of i-vector post-processing and classification for the task of music artist recognition. Specifically, we propose a novel neural network for i-vector features with a cosine-distance loss function, optimized with stochastic gradient decent (SGD). We first show that current networks do not perform well with unprocessed i-vector features, and that post-processing methods such as Within-Class Covariance Normalization (WCCN) and Linear Discriminant Analysis (LDA) are crucially important to improve the i-vector representation. We further demonstrate that these linear projections (WCCN and LDA) can not be learned using general objective functions usually used in neural networks. We examine our network on a 50-class music artist recognition dataset using i-vectors extracted from frame-level timbre features. Our experiments suggest that using our network with fully unprocessed i-vectors, we can achieve the performance of the i-vector pipeline which uses i-vector post processing methods such as LDA and WCCN.