Journal
IEEE INFOCOM 2020 - IEEE CONFERENCE ON COMPUTER COMMUNICATIONS
Volume -, Issue -, Pages 2420-2429Publisher
IEEE
DOI: 10.1109/infocom41043.2020.9155219
Keywords
-
Categories
Funding
- Beijing National Research Center for Information Science and Technology (BNRist) key projects
- National Key R&D Program of China [2019YFB1802504]
Ask authors/readers for more resources
In large-scale online service system, to enhance the quality of services, engineers need to collect various monitoring data and write many rules to trigger alerts. However, the number of alerts is way more than what on-call engineers can properly investigate. Thus, in practice, alerts are classified into several priority levels using manual rules, and on-call engineers primarily focus on handling the alerts with the highest priority level (i.e., severe alerts). Unfortunately, due to the complex and dynamic nature of the online services, this rule-based approach results in missed severe alerts or wasted troubleshooting time on non-severe alerts. In this paper, we propose AlertRank, an automatic and adaptive framework for identifying severe alerts. Specifically, AlertRank extracts a set of powerful and interpretable features (textual and temporal alert features, univariate and multivariate anomaly features for monitoring metrics), adopts XGBoost ranking algorithm to identify the severe alerts out of all incoming alerts, and uses novel methods to obtain labels for both training and testing. Experiments on the datasets from a top global commercial bank demonstrate that AlertRank is effective and achieves the F1-score of 0.89 on average, outperforming all baselines. The feedback from practice shows AlertRank can significantly save the manual efforts for on-call engineers.
Authors
I am an author on this paper
Click your name to claim this paper and add it to your profile.
Reviews
Recommended
No Data Available