TLCSim: A Large-Scale Two-Level Clustering Similarity Search with MapReduce

Trong Nhan Phan; Markus Jäger; Stefan Nadschläger; Pablo Gómez Pérez; Cong An Nguyen; Christian Huber

Embedded AI (eAI)

Research output: Contribution to conference (No Proceedings) › Paper › peer-review

Abstract

Similarity search has become a principal operation not only in databases but also in diverse application domains. Very large datasets, however, pose a major challenge because of the sheer volume of data that must be processed. To address this challenge, we propose a two-level clustering approach that supports fast similarity search in massive datasets. In addition, we embed pruning and filtering strategies into our methods so that data redundancy, data accuracy, inessential data accesses, unnecessary distance computations, and their downstream consequences are taken into account. Furthermore, we validate our methods through a series of empirical experiments on large real-world datasets. The results show that our approach outperforms two inverted-index-based approaches, especially for large query batches.

Original language	English
Publication status	Published - 2016

Cite this

@conference{7201b94d337b477c8b9224ff1016db88,

title = "TLCSim: A Large-Scale Two-Level Clustering Similarity Search with MapReduce",

abstract = "Similarity search has become a principal operation not only in databases but also in diverse application domains. Very large datasets, however, pose a big challenge on its enormous volume-processing capability. In order to deal with the challenge, we propose a two-level clustering approach aiming at supporting fast similarity searches in massive datasets. In addition, we embed some pruning and filtering strategies into our methods so that redundancy-free data, data accuracy, inessential data accesses, unnecessary distance computations, and other following consequences are taken into account. Furthermore, we validate our methods by a series of empirical experiments in real big datasets. The results show that our approach performs better than the two inverted index-based approaches, especially when given big query batches.",

author = "Phan, {Trong Nhan} and J{\"a}ger, Markus and Nadschl{\"a}ger, Stefan and P{\'e}rez, {Pablo G{\'o}mez} and Nguyen, {Cong An} and Huber, Christian",

year = "2016",

language = "English",

}

TY - CONF

T1 - TLCSim: A Large-Scale Two-Level Clustering Similarity Search with MapReduce

AU - Phan, Trong Nhan

AU - Jäger, Markus

AU - Nadschläger, Stefan

AU - Pérez, Pablo Gómez

AU - Nguyen, Cong An

AU - Huber, Christian

PY - 2016

Y1 - 2016

N2 - Similarity search has become a principal operation not only in databases but also in diverse application domains. Very large datasets, however, pose a big challenge on its enormous volume-processing capability. In order to deal with the challenge, we propose a two-level clustering approach aiming at supporting fast similarity searches in massive datasets. In addition, we embed some pruning and filtering strategies into our methods so that redundancy-free data, data accuracy, inessential data accesses, unnecessary distance computations, and other following consequences are taken into account. Furthermore, we validate our methods by a series of empirical experiments in real big datasets. The results show that our approach performs better than the two inverted index-based approaches, especially when given big query batches.

AB - Similarity search has become a principal operation not only in databases but also in diverse application domains. Very large datasets, however, pose a big challenge on its enormous volume-processing capability. In order to deal with the challenge, we propose a two-level clustering approach aiming at supporting fast similarity searches in massive datasets. In addition, we embed some pruning and filtering strategies into our methods so that redundancy-free data, data accuracy, inessential data accesses, unnecessary distance computations, and other following consequences are taken into account. Furthermore, we validate our methods by a series of empirical experiments in real big datasets. The results show that our approach performs better than the two inverted index-based approaches, especially when given big query batches.

M3 - Paper

ER -
