Article

Characterizing Co-Located Workloads in Alibaba Cloud Datacenters

Journal

IEEE TRANSACTIONS ON CLOUD COMPUTING
Volume 10, Issue 4, Pages 2381-2397

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCC.2020.3034500

Keywords

Co-located jobs; workload characterization; online services; batch jobs; Internet data center; scheduling

Funding

  1. Natural Science Foundation of China [61972118]
  2. Key Research and Development Program of Zhejiang Province [2018C01098]

This article conducts a comprehensive analysis of a trace from Alibaba's production cluster and reports several characteristics: a daily cyclical fluctuation in workload, the memory system as the performance bottleneck, batch jobs that can be approximated by a Zipf distribution, the negative effect of co-location with online services on batch jobs, and container resource usage that fluctuates in step with the entire cluster.
Workload characteristics are vital for both data center operation and job scheduling in co-located data centers, where online services and batch jobs are deployed on the same production cluster. In this article, a comprehensive analysis is conducted on Alibaba's cluster-trace-v2018, collected from a production cluster of 4034 machines. The findings and insights are the following: (1) The workload on the production cluster exhibits a daily cyclical fluctuation in CPU and disk I/O utilization, and the memory system has become the performance bottleneck of a co-located cluster. (2) Batch jobs, including their tasks and derived instances, can be approximated by a Zipf distribution. However, all batch jobs with directed acyclic graph (DAG) dependencies suffer from co-location with online services, because the online services are given higher priority. (3) The resource usage of containers shows a cyclical fluctuation consistent with that of the whole cluster, while their memory usage remains approximately constant. (4) The number of batch jobs co-located with online services depends on the mispredictions per kilo instructions (MPKI) of the online services. To guarantee the quality of service (QoS) of online services, the number of batch jobs co-located on the same machine should decrease when the MPKI of the online services rises.
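As an illustration of what approximating batch workload counts by a Zipf distribution can mean in practice, the following is a minimal sketch that estimates a Zipf exponent from rank-ordered counts using a least-squares fit in log-log space. This is not the authors' method, which the abstract does not describe. The function name `fit_zipf_exponent` and the synthetic counts are hypothetical, and the log-log fit is a simple estimator chosen only for clarity (maximum-likelihood fitting is generally more robust).

```python
import math


def fit_zipf_exponent(counts):
    """Estimate s in count ~ C * rank**(-s) by least squares on log-log data.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Rank the positive counts from largest (rank 1) to smallest.
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive counts")

    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # The slope of log(count) against log(rank) is -s.
    return -sxy / sxx


# Synthetic data (assumption, not trace data): counts that follow 1000 / rank.
synthetic_counts = [1000.0 / rank for rank in range(1, 51)]
estimated_s = fit_zipf_exponent(synthetic_counts)
```

With real data, `counts` might be, for example, the number of instances per task or tasks per job extracted from the trace. A roughly straight line on a log-log rank-frequency plot is what would make the Zipf approximation plausible.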
