4.5 Article

Design and evaluation of small-large outer joins in cloud computing environments

Journal

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
Volume 110, Issue -, Pages 2-15

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE
DOI: 10.1016/j.jpdc.2017.02.007

Keywords

Parallel joins; Outer joins; Small-large joins; Cloud computing; Performance evaluation

Funding

  1. German Research Foundation (DFG) within the Cluster of Excellence Center for Advancing Electronics Dresden (cfaed)
  2. Collaborative Research Center [SFB 912]
  3. Emmy Noether [KR 4381/1-1]

Ask authors/readers for more resources

Large-scale analytics is a key application area for data processing and parallel computing research. One of the most common (and challenging) operations in this domain is the join. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially in the extremely popular cloud computing environments. A common type of outer join is the small-large outer join, where one relation is relatively small and the other is large. Conventional implementations on this condition, such as one based on hash redistribution, often incur significant network communication, while the duplication-based approaches are complex and inefficient. In this work, we present a new method called DDR (duplication and direct redistribution), which aims to enable efficient small-large outer joins in cloud computing environments while being easy to implement using existing predicates in data processing frameworks. We present the detailed implementation of our approach and evaluate its performance through extensive experiments over the widely used MapReduce and Spark platforms. We show that the proposed method is scalable and can achieve significant performance improvements over the conventional approaches. Compared to the state of-art method, the DDR algorithm is shown to be easier to implement and can achieve very similar or better performance under different outer join workloads, and thus, can be considered as a new option for current data analysis applications. Moreover, our detailed experimental results also have provided insights of current small-large outer join implementations, thereby allowing system developers to make a more informed choice for their data analysis applications. (C) 2017 Elsevier Inc. All rights reserved.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.5
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available