Journal
INFORMATION PROCESSING & MANAGEMENT
Volume 48, Issue 5, Pages 873-888
Publisher
ELSEVIER SCI LTD
DOI: 10.1016/j.ipm.2010.12.003
Keywords
MapReduce; Indexing; Efficiency; Scalability; Hadoop
Abstract
In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora remains a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework, and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies, and, for the most efficient strategy, we examine how it scales with respect to corpus size and processing power. Our results attest to both the importance of minimising data transfer between machines for IO-intensive tasks like indexing, and the suitability of the per-posting-list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing. (C) 2010 Elsevier Ltd. All rights reserved.
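To illustrate the general idea behind a per-posting-list MapReduce indexing strategy, the following is a minimal in-memory sketch: the map phase emits (term, posting) pairs per document, the shuffle groups postings by term, and the reduce phase assembles one complete posting list per term. This is an illustrative sketch only, not the paper's Hadoop implementation; all function names and the toy corpus are assumptions for the example.

```python
# Minimal in-memory sketch of per-posting-list MapReduce indexing.
# Illustrative only; not the authors' Hadoop-based implementation.
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit one (term, (doc_id, term_frequency)) pair per unique term
    # in the document, as a mapper would for its input split.
    counts = defaultdict(int)
    for term in text.lower().split():
        counts[term] += 1
    return [(term, (doc_id, tf)) for term, tf in counts.items()]

def reduce_phase(term, postings):
    # Build the full posting list for a single term, sorted by doc_id,
    # so each reducer writes complete posting lists to the index.
    return term, sorted(postings)

def run_mapreduce(corpus):
    # Simulate the shuffle: group intermediate pairs by term,
    # as the MapReduce framework would between map and reduce.
    grouped = defaultdict(list)
    for doc_id, text in corpus.items():
        for term, posting in map_phase(doc_id, text):
            grouped[term].append(posting)
    return dict(reduce_phase(t, p) for t, p in grouped.items())

corpus = {1: "map reduce indexing", 2: "indexing at scale"}
index = run_mapreduce(corpus)
# index["indexing"] == [(1, 1), (2, 1)]
```

Emitting compact postings rather than raw document text keeps the map-to-reduce transfer small, which reflects the abstract's point that minimising data transfer between machines is key for IO-intensive tasks like indexing.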