Zero-mean convolutions for level-invariant singing voice detection

Jan Schlüter; Bernhard Lehner

Embedded AI (eAI)

Research output: Conference proceeding/Chapter in Book/Report/ › Chapter › peer-review

Abstract

State-of-the-art singing voice detectors are based on classifiers trained on annotated examples. As recently shown, such detectors have an important weakness: Since singing voice is correlated with sound level in training data, classifiers learn to become sensitive to input magnitude, and give different predictions for the same signal at different sound levels. Starting from a Convolutional Neural Network (CNN) trained on logarithmic-magnitude mel spectrogram excerpts, we eliminate this dependency by forcing each first-layer convolutional filter to be zero-mean – that is, to have its coefficients sum to zero. In contrast to four other methods – data augmentation, instance normalization, spectral delta features, and per-channel energy normalization (PCEN) – that we evaluated on a large-scale public dataset, zero-mean convolutions achieve perfect sound level invariance without any impact on prediction accuracy or computational requirements. We assume that zero-mean convolutions would be useful for other machine listening tasks requiring robustness to level changes.
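
The invariance claimed in the abstract follows from a short argument: scaling a signal by a gain g multiplies its magnitude spectrogram by g, which adds the constant log(g) to every bin of the log-magnitude spectrogram. A filter whose coefficients sum to zero maps any constant to zero, so its output does not change. The sketch below illustrates this with plain Python. It is not the authors' implementation: the spectrogram values, the 3x3 kernel, the gain, and the helper names (`zero_mean`, `conv2d_valid`, `max_abs_diff`) are all hypothetical, and enforcing the constraint by subtracting the kernel mean is an assumption, since the abstract does not say how the constraint is imposed. It also assumes a pure logarithm; a compression such as log(1 + x) would not turn a gain into an additive constant.

```python
import math
import random


def zero_mean(kernel):
    """Subtract the kernel's mean so that its coefficients sum to zero.

    Assumption: one simple way to impose the zero-mean constraint.
    """
    flat = [w for row in kernel for w in row]
    mean = sum(flat) / len(flat)
    return [[w - mean for w in row] for row in kernel]


def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation, the operation CNN layers apply."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two 2-D lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)
# Hypothetical magnitude "spectrogram": 8 bands x 10 frames, strictly positive.
magnitudes = [[rng.uniform(0.1, 1.0) for _ in range(10)] for _ in range(8)]
gain = 0.25  # hypothetical level change (about -12 dB)

# Log-magnitude features at the original and at the reduced level.
log_spec = [[math.log(m) for m in row] for row in magnitudes]
log_spec_scaled = [[math.log(gain * m) for m in row] for row in magnitudes]

# Hypothetical first-layer filter; its coefficients sum to 2.0.
kernel = [[1.0, 2.0, 1.0],
          [0.0, 0.5, 0.0],
          [-1.0, -1.0, -0.5]]
kernel_zm = zero_mean(kernel)

# How much does the filter output change when the input level changes?
diff_plain = max_abs_diff(conv2d_valid(log_spec, kernel),
                          conv2d_valid(log_spec_scaled, kernel))
diff_zero_mean = max_abs_diff(conv2d_valid(log_spec, kernel_zm),
                              conv2d_valid(log_spec_scaled, kernel_zm))
print(diff_plain, diff_zero_mean)
```

With the unconstrained kernel, every output shifts by log(g) times the coefficient sum, here about 2.77 in absolute value. With the zero-mean kernel, the two outputs agree up to floating-point rounding.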

Original language	English
Title of host publication	Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018
Pages	321-326
Number of pages	6
Publication status	Published - 2018

Publication series

Name	Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018

Cite this

@inbook{9bda344a11dc4dc3906bce2cacd484c4,
  title = "Zero-mean convolutions for level-invariant singing voice detection",
  abstract = "State-of-the-art singing voice detectors are based on classifiers trained on annotated examples. As recently shown, such detectors have an important weakness: Since singing voice is correlated with sound level in training data, classifiers learn to become sensitive to input magnitude, and give different predictions for the same signal at different sound levels. Starting from a Convolutional Neural Network (CNN) trained on logarithmic-magnitude mel spectrogram excerpts, we eliminate this dependency by forcing each first-layer convolutional filter to be zero-mean – that is, to have its coefficients sum to zero. In contrast to four other methods – data augmentation, instance normalization, spectral delta features, and per-channel energy normalization (PCEN) – that we evaluated on a large-scale public dataset, zero-mean convolutions achieve perfect sound level invariance without any impact on prediction accuracy or computational requirements. We assume that zero-mean convolutions would be useful for other machine listening tasks requiring robustness to level changes.",
  author = "Jan Schl{\"u}ter and Bernhard Lehner",
  year = "2018",
  language = "English",
  isbn = "9782954035123",
  series = "Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018",
  pages = "321--326",
  booktitle = "Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018",
}

Zero-mean convolutions for level-invariant singing voice detection. / Schlüter, Jan; Lehner, Bernhard.
Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018. 2018. p. 321-326 (Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018).

TY  - CHAP
T1  - Zero-mean convolutions for level-invariant singing voice detection
AU  - Schlüter, Jan
AU  - Lehner, Bernhard
PY  - 2018
Y1  - 2018
AB  - State-of-the-art singing voice detectors are based on classifiers trained on annotated examples. As recently shown, such detectors have an important weakness: Since singing voice is correlated with sound level in training data, classifiers learn to become sensitive to input magnitude, and give different predictions for the same signal at different sound levels. Starting from a Convolutional Neural Network (CNN) trained on logarithmic-magnitude mel spectrogram excerpts, we eliminate this dependency by forcing each first-layer convolutional filter to be zero-mean – that is, to have its coefficients sum to zero. In contrast to four other methods – data augmentation, instance normalization, spectral delta features, and per-channel energy normalization (PCEN) – that we evaluated on a large-scale public dataset, zero-mean convolutions achieve perfect sound level invariance without any impact on prediction accuracy or computational requirements. We assume that zero-mean convolutions would be useful for other machine listening tasks requiring robustness to level changes.
UR  - https://www.mendeley.com/catalogue/6df3a513-3d51-3670-b7d4-d93b8ea40fa4/
M3  - Chapter
SN  - 9782954035123
T3  - Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018
SP  - 321
EP  - 326
BT  - Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018
ER  -
