4.4 Article

Partition-Based Online Aggregation with Shared Sampling in the Cloud

Journal

JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY
Volume 28, Issue 6, Pages 989-1011

Publisher

SCIENCE PRESS
DOI: 10.1007/s11390-013-1393-6

Keywords

cloud; MapReduce; partition; online aggregation; shared sampling

Funding

  1. National Basic Research 973 Program of China [2010CB328104]
  2. National Natural Science Foundation of China [61070161, 61202449, 61320106007]
  3. National High Technology Research and Development 863 Program of China [2013AA013503]
  4. Specialized Research Fund for the Doctoral Program of Higher Education of China [20110092130002]
  5. Jiangsu Provincial Key Laboratory of Network and Information Security [BM2003201]
  6. Laboratory of Computer Network and Information Integration of Ministry of Education of China [93K-9]
  7. Shanghai Key Laboratory of Scalable Computing and Systems of China [2010DS680095]

Online aggregation is an attractive sampling-based technique that answers aggregation queries with an estimate of the final result, with a confidence interval that tightens over time. It has been built into a MapReduce-based cloud system for big data analytics, which allows users to monitor query progress and save money by killing the computation early once sufficient accuracy has been obtained. However, the gap between the current mechanism of the MapReduce paradigm and the requirements of online aggregation limits its performance in several ways, such as: 1) low sampling efficiency, because online aggregation in MapReduce does not account for skewed data distributions, and 2) large redundant I/O cost, caused by the independent job execution mechanism of MapReduce. In this paper, we present OLACloud, a MapReduce-based cloud system that supports online aggregation for different data distributions and large-scale concurrent query processing. We propose a content-aware repartition method with a fair-allocation block placement strategy to increase sampling efficiency while guaranteeing storage and computation load balancing. We also develop a shared sampling method that shares sampling opportunities among multiple queries to reduce redundant I/O cost. We implement OLACloud in Hadoop and conduct an extensive experimental study on the TPC-H benchmark with skewed data distributions. Our results demonstrate the efficiency and effectiveness of OLACloud.
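To make the two central ideas of the abstract concrete, the following is a minimal single-machine sketch, not OLACloud's implementation and not Hadoop code. It assumes that rows arrive in random order (standing in for random sampling), that the aggregate is AVG, and that a normal approximation gives an acceptable confidence interval. The names `OnlineMean`, `shared_scan`, `demo`, and the `group`/`value` fields are hypothetical and chosen only for illustration. Each estimator reports an interval that narrows as more rows are seen, and one scan of the sample feeds several queries, so the data is read once rather than once per query.

```python
import math
import random


class OnlineMean:
    """Running estimate of AVG(value) with a normal-approximation interval."""

    def __init__(self, predicate=lambda row: True, z=1.96):
        self.predicate = predicate  # hypothetical per-query filter (WHERE clause)
        self.z = z                  # 1.96 is roughly the 95% normal quantile
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # sum of squared deviations (Welford's method)

    def update(self, row):
        if not self.predicate(row):
            return
        x = row["value"]
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self):
        """Half-width of the confidence interval; infinite until n >= 2."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)
        return self.z * math.sqrt(variance / self.n)


def shared_scan(rows, estimators, target_half_width, check_every=1000):
    """Feed one stream of sampled rows to every query's estimator.

    Stops early once all estimators are at least as tight as the target.
    Returns the number of rows read.
    """
    seen = 0
    for row in rows:
        seen += 1
        for est in estimators:
            est.update(row)
        if seen % check_every == 0 and all(
            est.half_width() <= target_half_width for est in estimators
        ):
            break
    return seen


def demo():
    rng = random.Random(0)
    data = [
        {"group": rng.choice("ab"), "value": rng.gauss(50.0, 10.0)}
        for _ in range(50_000)
    ]
    rng.shuffle(data)  # random order stands in for random sampling
    q_all = OnlineMean()
    q_a = OnlineMean(predicate=lambda r: r["group"] == "a")
    rows_read = shared_scan(data, [q_all, q_a], target_half_width=0.5)
    return rows_read, len(data), q_all, q_a
```

Calling `demo()` runs two queries over one shared scan and stops well before the full dataset is read. In a skewed dataset, a selective predicate such as the one on `q_a` would match few sampled rows and its interval would tighten slowly, which is the sampling-efficiency problem the abstract attributes to skewed data distributions.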
