Zero-mean convolutions for level-invariant singing voice detection

Jan Schlüter, Bernhard Lehner

Publication: Conference proceedings / contribution to book or report, chapter, peer-reviewed

Abstract

State-of-the-art singing voice detectors are based on classifiers trained on annotated examples. As recently shown, such detectors have an important weakness: Since singing voice is correlated with sound level in training data, classifiers learn to become sensitive to input magnitude, and give different predictions for the same signal at different sound levels. Starting from a Convolutional Neural Network (CNN) trained on logarithmic-magnitude mel spectrogram excerpts, we eliminate this dependency by forcing each first-layer convolutional filter to be zero-mean – that is, to have its coefficients sum to zero. In contrast to four other methods – data augmentation, instance normalization, spectral delta features, and per-channel energy normalization (PCEN) – that we evaluated on a large-scale public dataset, zero-mean convolutions achieve perfect sound level invariance without any impact on prediction accuracy or computational requirements. We assume that zero-mean convolutions would be useful for other machine listening tasks requiring robustness to level changes.
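The invariance argument can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a pure logarithm is applied to the magnitudes (so a gain factor becomes an additive offset), a bias-free "valid" convolution, and toy data. The helper names `zero_mean` and `conv2d_valid`, the kernel values, and the gain are hypothetical choices made for illustration only.

```python
import numpy as np

def zero_mean(kernel):
    """Subtract the mean so the kernel coefficients sum to zero."""
    return kernel - kernel.mean()

def conv2d_valid(x, kernel):
    """Plain 'valid' 2D cross-correlation, without padding or bias."""
    kh, kw = kernel.shape
    windows = np.lib.stride_tricks.sliding_window_view(x, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

rng = np.random.default_rng(0)
magnitudes = rng.uniform(0.1, 10.0, size=(20, 30))  # toy stand-in for mel magnitudes
kernel = np.arange(9.0).reshape(3, 3)               # toy stand-in for a learned filter
gain = 0.25                                         # hypothetical change in input level

# Assumption: pure log scaling, so log(gain * m) = log(m) + log(gain).
log_spec = np.log(magnitudes)
log_spec_scaled = np.log(gain * magnitudes)

# Unconstrained filter: the offset log(gain) leaks into the output,
# scaled by the sum of the filter coefficients.
plain_diff = np.abs(
    conv2d_valid(log_spec_scaled, kernel) - conv2d_valid(log_spec, kernel)
).max()

# Zero-mean filter: the coefficients sum to zero, so the offset cancels.
zm_kernel = zero_mean(kernel)
zm_diff = np.abs(
    conv2d_valid(log_spec_scaled, zm_kernel) - conv2d_valid(log_spec, zm_kernel)
).max()

print(f"unconstrained filter, max output change: {plain_diff:.4f}")
print(f"zero-mean filter, max output change:     {zm_diff:.2e}")
```

In this sketch the unconstrained filter shifts every output by `sum(kernel) * log(gain)` in absolute value, while the zero-mean filter changes only by floating-point rounding error. In a trained network the constraint would have to hold throughout training, for example by re-centering each first-layer filter in the forward pass; that detail is an assumption here and is not stated in the abstract.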
Original language: English
Title: Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018
Pages: 321-326
Number of pages: 6
Publication status: Published - 2018

Publication series

Name: Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018
