Improving voice activity detection in movies

Bernhard Lehner; Gerhard Widmer; Reinhard Sonnleitner

doi:10.21437/interspeech.2015-455

Embedded AI (eAI)

Publication: Conference contribution › Paper › Peer-reviewed

Abstract

Voice Activity Detection in movies is a non-trivial and challenging task. The different emotional states of the speakers, as well as the variety of soundscapes and noises, contribute to the complexity of the task. In this paper, we propose a set of lightweight features that are specifically designed to perform under such conditions, while at the same time preventing confusion of singing voice with speech. For evaluation, we use four full-length movies, previously unseen to the system and painstakingly annotated. We compare our detector to a state-of-the-art reference system. The new approach performs better, yielding just about half the Equal Error Rate (EER). Furthermore, since the ground truth annotation task is extremely tedious, and to help advance this topic, we release the annotations of all four movies to the research community.
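The abstract reports its headline result as an Equal Error Rate, i.e. the operating point at which the false alarm rate and the miss rate of the detector are equal. This record does not describe how the authors computed it, so the following is only a minimal sketch of one common way to approximate an EER, assuming per-frame detector scores and binary ground-truth labels. The function name `equal_error_rate` and the example values are hypothetical and not taken from the paper.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a binary detector (illustrative sketch).

    scores: per-frame detector outputs, higher means "more likely speech".
    labels: per-frame ground truth, 1 for speech and 0 for non-speech.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one speech and one non-speech frame")

    best_gap, eer = None, None
    # Sweep every observed score as a decision threshold, plus one threshold
    # above the maximum so that "reject everything" is also considered.
    thresholds = sorted(set(scores)) + [float("inf")]
    for t in thresholds:
        false_alarms = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        misses = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        far = false_alarms / negatives  # false alarm rate
        mr = misses / positives         # miss rate
        gap = abs(far - mr)
        # Keep the threshold where the two error rates are closest and
        # report their mean as the EER estimate.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + mr) / 2
    return eer


# Hypothetical toy data: six frames, three speech and three non-speech.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```

With finite data the two error rates rarely cross exactly, which is why the sketch averages them at the closest threshold. Interpolating along the ROC curve is an alternative that gives a smoother estimate.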

Original language	English
Number of pages	5
DOIs	https://doi.org/10.21437/interspeech.2015-455
Publication status	Published - 2015

Cite this

@conference{0fcbe9e90fcc47b4ac767f2656d89bbf,

title = "Improving voice activity detection in movies",

keywords = "Speech detection, Voice activity detection",

author = "Bernhard Lehner and Gerhard Widmer and Reinhard Sonnleitner",

year = "2015",

doi = "10.21437/interspeech.2015-455",

language = "English",

}

TY - CONF

T1 - Improving voice activity detection in movies

AU - Lehner, Bernhard

AU - Widmer, Gerhard

AU - Sonnleitner, Reinhard

PY - 2015

Y1 - 2015

KW - Speech detection

KW - Voice activity detection

UR - https://www.mendeley.com/catalogue/88b679d9-685a-31e0-9f4c-1d0f119e53cd/

U2 - 10.21437/interspeech.2015-455

DO - 10.21437/interspeech.2015-455

M3 - Paper

ER -
