Article

ServiceRank: Root Cause Identification of Anomaly in Large-Scale Microservice Architectures

Journal

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TDSC.2021.3083671

Keywords

Computer architecture; Measurement; Cloud computing; Monitoring; Topology; Throughput; Heuristic algorithms; Microservice architecture; root cause; anomaly propagation; impact graph; cloud computing

Funding

  1. National Key R&D Program of China [2017YFB1200700]
  2. National Natural Science Foundation of China [62072006]
  3. Science and Technology on Communication Networks Laboratory [6142104200103]
  4. National Key Laboratory of Science and Technology on Reliability and Environmental Engineering [6142004180403]
  5. IBM Shared University Research Project

This article discusses the challenges and implications of diagnosing the root causes of anomalies in large-scale microservice architectures in the cloud. It proposes a novel framework called ServiceRank, which detects anomalies and identifies their root causes quickly and accurately. ServiceRank includes an anomaly detector, a root cause analysis module, and mechanisms that eliminate the impact of cloud design patterns on anomaly diagnosis.
A growing number of business applications running in the cloud are adopting the microservice architecture. This article presents the challenges and implications of diagnosing the root causes of anomalies in large-scale microservice architectures, using real incidents in IBM Bluemix. To tackle these challenges, we propose ServiceRank, a novel framework for anomaly detection and root cause identification in the microservice architecture. ServiceRank pairs an anomaly detector, which flags suspected abnormal services without pre-defined thresholds, with a root cause analysis module. To generalize our approach, we design a causal relationship extraction approach that constructs impact graphs for root cause investigation according to the specific anomaly. To eliminate the impact of cloud design patterns on anomaly diagnosis, we propose a correlation calibration mechanism in ServiceRank and present a calibration algorithm for the circuit breaker, a typical protection pattern in the microservice architecture. Finally, we design a heuristic investigation algorithm based on a second-order random walk to identify the anomaly's root cause. Experimental results in a simulated environment and on the IBM Bluemix platform show that ServiceRank outperforms the selected baseline approaches in accuracy and identifies the root cause service quickly when an anomaly occurs. Moreover, ServiceRank can be deployed rapidly and easily in various systems without any pre-defined knowledge.
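The abstract does not give the transition rule of the second-order random walk, so the following is only a minimal sketch of the general idea, not the ServiceRank algorithm itself. It assumes a node2vec-style bias (hypothetical parameters `p` and `q`), a hypothetical impact graph whose edge weights stand in for correlation with the anomalous metric, and a self-loop on each candidate that models how strongly its own metric matches the anomaly. Services are ranked by how often the walker visits them.

```python
import random
from collections import Counter


def rank_root_causes(graph, start, steps, p=1.0, q=1.0, seed=0):
    """Rank services by visit count of a second-order random walk.

    graph: dict mapping node -> {neighbor: weight}; weights are assumed
           (hypothetically) to be correlations with the anomalous metric.
    p, q:  node2vec-style return / in-out parameters (an assumption,
           not taken from the paper).
    """
    rng = random.Random(seed)
    visits = Counter()
    prev, cur = None, start
    for _ in range(steps):
        visits[cur] += 1
        neighbors = graph.get(cur, {})
        if not neighbors:
            # Dead end: restart the walk from the anomalous entry service.
            prev, cur = None, start
            continue
        weights = []
        for nxt, w in neighbors.items():
            if prev is None:
                bias = 1.0            # first step: no history yet
            elif nxt == prev:
                bias = 1.0 / p        # stepping back to the previous node
            elif nxt in graph.get(prev, {}):
                bias = 1.0            # next node is also adjacent to previous
            else:
                bias = 1.0 / q        # moving further away from previous
            weights.append(w * bias)
        nxt = rng.choices(list(neighbors), weights=weights, k=1)[0]
        prev, cur = cur, nxt
    return visits.most_common()


# Hypothetical impact graph: the anomaly is observed at "frontend".
impact_graph = {
    "frontend": {"orders": 0.9, "catalog": 0.1},
    "orders": {"db": 0.9, "frontend": 0.1},
    "catalog": {"frontend": 1.0},
    "db": {"db": 0.9, "orders": 0.1},
}

ranking = rank_root_causes(impact_graph, "frontend", steps=5000)
print(ranking[0][0])  # the most-visited service is the top suspect
```

Because the next step depends on both the current and the previous node, the walk is second-order; with `p = q = 1` it reduces to an ordinary weighted random walk, which makes those two parameters a convenient knob for experimentation.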
