Article

Energy-based anomaly detection for mixed data

Journal

KNOWLEDGE AND INFORMATION SYSTEMS
Volume 57, Issue 2, Pages 413-435

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s10115-018-1168-z

Keywords

Mixed data; Mixed-variate restricted Boltzmann machine; Deep belief net; Multilevel anomaly detection

Funding

  1. Telstra-Deakin Centre of Excellence in Big Data and Machine Learning

Anomalies are data points that deviate significantly from the norm. Thus, anomaly detection amounts to finding data points located far away from their neighbors, i.e., those lying in low-density regions. Classic anomaly detection methods are largely designed for a single data type, either continuous or discrete. However, real-world data is increasingly heterogeneous, where a data point can have both discrete and continuous attributes. Mixed data poses multiple challenges, including (a) capturing the inter-type correlation structures and (b) measuring deviation from the norm under multiple types. These challenges are exacerbated under (c) high-dimensional regimes. In this paper, we propose a new scalable unsupervised anomaly detection method for mixed data based on the Mixed-variate Restricted Boltzmann Machine (Mv.RBM). The Mv.RBM is a principled probabilistic method that estimates the density of mixed data. We propose to use the free energy derived from the Mv.RBM as the anomaly score, since it equals the negative log-density of the data up to an additive constant. We then extend this method to detect anomalies across multiple levels of data abstraction, an effective approach for dealing with high-dimensional settings. The extension is dubbed MIXMAD, which stands for MIXed data Multilevel Anomaly Detection. In MIXMAD, we sequentially construct an ensemble of mixed-data Deep Belief Nets (DBNs) with varying depths. Each DBN is an energy-based detector at a predefined abstraction level. Predictions across the ensemble are finally combined via a simple rank aggregation method. The proposed methods are evaluated on a comprehensive suite of synthetic and real high-dimensional datasets. The results demonstrate that for anomaly detection, (a) a proper handling of mixed types is necessary, (b) free energy is a powerful anomaly scoring method, (c) multilevel abstraction of data is important for high-dimensional data, and (d) empirically, Mv.RBM and MIXMAD are superior to popular unsupervised detection methods for both homogeneous and mixed data.
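
To make the two scoring ideas in the abstract concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes a plain binary RBM (binary visible and hidden units) rather than the mixed-variate model, whose visible terms differ by data type, and it assumes mean-rank averaging as the "simple rank aggregation", which the abstract does not specify. The names `free_energy`, `ranks`, and `mean_rank` are hypothetical, as are the parameters `a`, `b`, and `W`.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def free_energy(v, a, b, W):
    """Free energy of a binary RBM (an assumption; not the mixed-variate case).

    v: visible vector of 0/1 values, a: visible biases, b: hidden biases,
    W[i][k]: weight between visible unit i and hidden unit k.
    F(v) = -sum_i a_i v_i - sum_k softplus(b_k + sum_i W[i][k] v_i)
    so that exp(-F(v)) is the unnormalised probability of v.
    """
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = 0.0
    for k in range(len(b)):
        activation = b[k] + sum(W[i][k] * v[i] for i in range(len(v)))
        hidden_term += softplus(activation)
    return -visible_term - hidden_term


def ranks(scores):
    """Rank positions (0 = lowest score); ties are broken by index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for position, idx in enumerate(order):
        result[idx] = position
    return result


def mean_rank(score_lists):
    """Average each point's rank over several detectors (assumed aggregation)."""
    per_detector = [ranks(scores) for scores in score_lists]
    n_points = len(score_lists[0])
    n_detectors = len(per_detector)
    return [sum(r[i] for r in per_detector) / n_detectors for i in range(n_points)]
```

A higher free energy means a lower model density, so points would be flagged in decreasing order of `free_energy`. With several detectors, for example one per network depth, `mean_rank` combines their score lists on a common scale without requiring the raw energies to be comparable.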
