Improving voice activity detection in movies

Bernhard Lehner; Gerhard Widmer; Reinhard Sonnleitner

doi:10.21437/interspeech.2015-455

Embedded AI (eAI)

Research output: Contribution to conference (No Proceedings) › Paper › peer-review

Abstract

Voice Activity Detection in movies is a non-trivial and challenging task. The different emotional states of the speakers, as well as the variety of soundscapes and noises, contribute to the complexity of the task. In this paper, we propose a set of lightweight features that are specifically designed to perform under such conditions, while at the same time preventing confusion of singing voice with speech. For evaluation, we use four full-length movies, previously unseen to the system and painstakingly annotated. We compare our detector to a state-of-the-art reference system. The new approach performs better, yielding just about half the Equal Error Rate (EER). Furthermore, since the ground truth annotation task is extremely tedious, and to help advance work on this topic, we release the annotations of all four movies to the research community.
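For readers unfamiliar with the metric, the Equal Error Rate is conventionally the operating point at which the false alarm rate (non-speech frames labelled as speech) equals the miss rate (speech frames labelled as non-speech). The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name, the frame-level score and label inputs, and the "score >= threshold means speech" convention are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def equal_error_rate(scores: Sequence[float],
                     labels: Sequence[int]) -> Tuple[float, float]:
    """Return (eer, threshold) for per-frame detector scores.

    labels: 1 = speech frame, 0 = non-speech frame (hypothetical encoding).
    A frame counts as detected speech when score >= threshold.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one speech and one non-speech frame")

    best_gap, best_eer, best_thr = float("inf"), 1.0, float("inf")
    # Candidate thresholds: every distinct score, plus +inf (nothing detected).
    for thr in sorted(set(scores)) + [float("inf")]:
        false_alarms = sum(1 for s, y in zip(scores, labels)
                           if y == 0 and s >= thr)
        misses = sum(1 for s, y in zip(scores, labels)
                     if y == 1 and s < thr)
        far = false_alarms / n_neg  # false alarm rate
        mr = misses / n_pos         # miss rate
        gap = abs(far - mr)
        # Keep the threshold where the two error rates are closest;
        # with finite data they may not cross exactly, so average them.
        if gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + mr) / 2, thr
    return best_eer, best_thr


if __name__ == "__main__":
    # Toy scores and labels, invented purely for illustration.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
    eer, thr = equal_error_rate(toy_scores, toy_labels)
    print(f"EER = {eer:.2f} at threshold {thr}")
```

This brute-force scan over thresholds is quadratic in the number of frames, which is fine for a small illustration. For full-length movies, a sorted single-pass sweep would be the more practical choice.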

Original language	English
Number of pages	5
DOIs	https://doi.org/10.21437/interspeech.2015-455
Publication status	Published - 2015

Keywords

Speech detection
Voice activity detection

Access to Document

10.21437/interspeech.2015-455

Cite this

@conference{0fcbe9e90fcc47b4ac767f2656d89bbf,

title = "Improving voice activity detection in movies",

abstract = "Voice Activity Detection in movies is a non-trivial and challenging task. The different emotional states of the speakers, as well as the variety of soundscapes and noises, contribute to the complexity of the task. In this paper, we propose a set of lightweight features that are specifically designed to perform under such conditions, while at the same time preventing confusion of singing voice with speech. For evaluation, we use four full-length movies, previously unseen to the system and painstakingly annotated. We compare our detector to a state-of-the-art reference system. The new approach performs better, yielding just about half the Equal Error Rate (EER). Furthermore, since the ground truth annotation task is extremely tedious, and to help advance work on this topic, we release the annotations of all four movies to the research community.",

keywords = "Speech detection, Voice activity detection",

author = "Bernhard Lehner and Gerhard Widmer and Reinhard Sonnleitner",

year = "2015",

doi = "10.21437/interspeech.2015-455",

language = "English",

}

TY - CONF

T1 - Improving voice activity detection in movies

AU - Lehner, Bernhard

AU - Widmer, Gerhard

AU - Sonnleitner, Reinhard

PY - 2015

Y1 - 2015

AB - Voice Activity Detection in movies is a non-trivial and challenging task. The different emotional states of the speakers, as well as the variety of soundscapes and noises, contribute to the complexity of the task. In this paper, we propose a set of lightweight features that are specifically designed to perform under such conditions, while at the same time preventing confusion of singing voice with speech. For evaluation, we use four full-length movies, previously unseen to the system and painstakingly annotated. We compare our detector to a state-of-the-art reference system. The new approach performs better, yielding just about half the Equal Error Rate (EER). Furthermore, since the ground truth annotation task is extremely tedious, and to help advance work on this topic, we release the annotations of all four movies to the research community.

KW - Speech detection

KW - Voice activity detection

UR - https://www.mendeley.com/catalogue/88b679d9-685a-31e0-9f4c-1d0f119e53cd/

U2 - 10.21437/interspeech.2015-455

DO - 10.21437/interspeech.2015-455

M3 - Paper

ER -
