Improving Lyrics-to-Audio Alignment Using Frame-wise Phoneme Labels with Masked Cross Entropy Loss

Tian Cheng; Tomoyasu Nakano; Masataka Goto
DAFx-2025 - Ancona
This paper addresses the task of lyrics-to-audio alignment, which involves synchronizing textual lyrics with the corresponding music audio. Most publicly available datasets for this task provide annotations only at the line or word level, which poses a challenge for training lyrics-to-audio alignment models due to the lack of frame-wise phoneme labels. However, we find that phoneme labels can be partially derived from word-level annotations: for single-phoneme words, all frames within the word can be labeled with that phoneme; for multi-phoneme words, phoneme labels can be assigned to the first and last frames of the word. To leverage this partial information, we construct a mask over those frames and propose a masked frame-wise cross-entropy (CE) loss that considers only the frames with known phoneme labels. As a baseline model, we adopt an autoencoder trained with a Connectionist Temporal Classification (CTC) loss and a reconstruction loss. We then enhance the training process by incorporating the proposed masked frame-wise CE loss. Experimental results show that incorporating this loss improves alignment performance. Compared with other state-of-the-art models, our model achieves a comparable Mean Absolute Error (MAE) of 0.216 seconds and the best Median Absolute Error (MedAE) of 0.041 seconds on the Jamendo test set.
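The masked frame-wise CE loss described in the abstract can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the function names, tensor shapes, and the way the known-label mask is built from word-level annotations are assumptions for exposition, not the authors' implementation.

```python
# Illustrative PyTorch sketch (not the paper's code) of deriving partial
# frame-wise phoneme labels from word-level annotations and computing a
# cross-entropy loss only on frames whose label is known.
import torch
import torch.nn.functional as F


def build_known_mask(word_frames, word_phonemes, num_frames):
    """Derive partial frame labels from word-level annotations.

    word_frames:   list of (start_frame, end_frame) per word (end exclusive)
    word_phonemes: list of phoneme-index sequences, one per word
    Returns (targets, mask), each of shape (num_frames,).
    """
    targets = torch.zeros(num_frames, dtype=torch.long)
    mask = torch.zeros(num_frames, dtype=torch.bool)
    for (start, end), phonemes in zip(word_frames, word_phonemes):
        if len(phonemes) == 1:
            # Single-phoneme word: every frame of the word gets that phoneme.
            targets[start:end] = phonemes[0]
            mask[start:end] = True
        else:
            # Multi-phoneme word: only the first and last frames are known.
            targets[start] = phonemes[0]
            targets[end - 1] = phonemes[-1]
            mask[start] = True
            mask[end - 1] = True
    return targets, mask


def masked_frame_ce_loss(logits, phoneme_targets, known_mask):
    """Frame-wise CE averaged over frames with known phoneme labels.

    logits:          (batch, frames, num_phonemes) raw network outputs
    phoneme_targets: (batch, frames) phoneme indices (ignored where unknown)
    known_mask:      (batch, frames) bool, True where the label is known
    """
    per_frame = F.cross_entropy(
        logits.transpose(1, 2),   # (batch, num_phonemes, frames)
        phoneme_targets,
        reduction="none",         # keep one loss value per frame
    )                             # -> (batch, frames)
    per_frame = per_frame * known_mask.float()
    # Average over known frames only; clamp avoids division by zero.
    return per_frame.sum() / known_mask.float().sum().clamp(min=1.0)
```

In training, such a term would be added to the CTC and reconstruction losses of the baseline autoencoder, so the model receives frame-level supervision on the subset of frames where phoneme labels can be inferred.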