Zero-mean convolutions for level-invariant singing voice detection

Jan Schlüter, Bernhard Lehner

Research output: Conference proceeding/Chapter in Book/Report › Chapter › peer-review

Abstract

State-of-the-art singing voice detectors are based on classifiers trained on annotated examples. As recently shown, such detectors have an important weakness: Since singing voice is correlated with sound level in training data, classifiers learn to become sensitive to input magnitude, and give different predictions for the same signal at different sound levels. Starting from a Convolutional Neural Network (CNN) trained on logarithmic-magnitude mel spectrogram excerpts, we eliminate this dependency by forcing each first-layer convolutional filter to be zero-mean – that is, to have its coefficients sum to zero. In contrast to four other methods – data augmentation, instance normalization, spectral delta features, and per-channel energy normalization (PCEN) – that we evaluated on a large-scale public dataset, zero-mean convolutions achieve perfect sound level invariance without any impact on prediction accuracy or computational requirements. We assume that zero-mean convolutions would be useful for other machine listening tasks requiring robustness to level changes.
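
The invariance claimed in the abstract follows from a short argument: scaling a signal by a gain g adds the constant log(g) to every bin of its logarithmic-magnitude spectrogram, and a filter whose coefficients sum to zero maps any constant offset to zero. The sketch below illustrates this in plain Python. It is not the authors' implementation: the helper names (zero_mean, conv2d_valid, max_abs_diff), the toy data, and the choice to enforce the constraint by subtracting the kernel mean are assumptions made for illustration, and it assumes a pure log(x) compression with no additive offset or clipping.

```python
import math
import random

def zero_mean(kernel):
    """Subtract the mean so the kernel's coefficients sum to zero."""
    n = sum(len(row) for row in kernel)
    mean = sum(sum(row) for row in kernel) / n
    return [[w - mean for w in row] for row in kernel]

def conv2d_valid(x, kernel):
    """Plain 2D cross-correlation ('valid' mode, no bias), as in a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(kernel[a][b] * x[i + a][j + b]
                for a in range(kh) for b in range(kw))
            for j in range(len(x[0]) - kw + 1)
        ]
        for i in range(len(x) - kh + 1)
    ]

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2D lists."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

random.seed(0)
# Hypothetical stand-in for a magnitude mel spectrogram (bands x frames).
magnitude = [[random.uniform(0.1, 1.0) for _ in range(8)] for _ in range(6)]
gain = 0.25  # the same signal at a lower sound level

log_spec = [[math.log(m) for m in row] for row in magnitude]
log_spec_scaled = [[math.log(gain * m) for m in row] for row in magnitude]

# Arbitrary example filter whose coefficients sum to 5, not 0.
raw_kernel = [[1.0, 2.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 1.0]]
zm_kernel = zero_mean(raw_kernel)

# Unconstrained filter: output shifts by log(gain) * sum(kernel).
diff_raw = max_abs_diff(conv2d_valid(log_spec, raw_kernel),
                        conv2d_valid(log_spec_scaled, raw_kernel))
# Zero-mean filter: output is unchanged up to floating-point rounding.
diff_zm = max_abs_diff(conv2d_valid(log_spec, zm_kernel),
                       conv2d_valid(log_spec_scaled, zm_kernel))

print(f"unconstrained filter, max output change: {diff_raw:.6f}")
print(f"zero-mean filter, max output change:     {diff_zm:.2e}")
```

With the unconstrained filter the output changes by |log(g)| times the sum of the coefficients; with the zero-mean filter the change is at the level of rounding error. Under the same assumption, subtracting the kernel mean inside the forward pass of the first layer would keep the constraint satisfied throughout training.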
Original language: English
Title of host publication: Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018
Pages: 321-326
Number of pages: 6
Publication status: Published - 2018
