Article

Anomaly Detection and Repair for Accurate Predictions in Geo-distributed Big Data

Journal

BIG DATA RESEARCH
Volume 16, Pages 18-35

Publisher

ELSEVIER
DOI: 10.1016/j.bdr.2019.04.001

Keywords

Anomaly detection; Data repair; Geo-distributed big data; Spatial autocorrelation; Neural networks; Gradient-boosting

Funding

  1. Ministry of Education, Universities and Research (MIUR) through the project ComESto - Community Energy Storage: Gestione Aggregata di Sistemi d'Accumulo dell'Energia in Power Cloud [ARS01_01259]
  2. European Commission through the project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data [ICT-2013-612944]
  3. European Commission through the project TOREADOR - TrustwOrthy model-awaRE Analytics Data platform [988797]
  4. project Microsoft Azure for Research, ReCaS [PONa3_00052]
  5. project Microsoft Azure for Research, PRISMA [PON04a2_A]

Abstract

The growing number of geo-distributed sensor networks generates huge volumes of data from multiple geographical locations at an increasing rate. This raises important issues, which become more challenging when the final goal is to analyze the data for forecasting or, more generally, for predictive tasks. This paper proposes a framework that supports predictive modeling tasks on streaming data coming from multiple geo-referenced sensors. In particular, we propose a distance-based anomaly detection strategy that considers objects described by embedding features learned via a stacked auto-encoder. We then devise a repair strategy that repairs the data detected as anomalous by exploiting non-anomalous data measured by sensors in nearby spatial locations. Subsequently, we adopt Gradient Boosted Trees (GBTs) to predict/forecast the values assumed by a target variable of interest for the repaired, newly arriving (unlabeled) data, using either the original feature representation or the embedding feature representation learned via the stacked auto-encoder. The workflow is implemented with distributed Apache Spark programming primitives and tested in a cluster environment. We perform experiments to assess the performance of each module, separately and in combination, on the predictive modeling of one-day-ahead energy production for multiple renewable energy sites. Accuracy results show that the proposed framework reduces the error by up to 13.56%. Moreover, scalability results demonstrate the efficiency of the proposed framework in terms of speedup, scaleup and execution time under a stress test. (C) 2019 Elsevier Inc. All rights reserved.
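
The detect-then-repair idea in the abstract can be illustrated with a small single-machine sketch. This is not the authors' implementation: the abstract does not specify the distance measure, the anomaly threshold, or the repair formula, so the k-nearest-neighbor score, the median-based threshold, and the inverse-distance-weighted repair below are all assumptions chosen for illustration. The function names are hypothetical, the raw readings stand in for the embeddings that a stacked auto-encoder would produce, and plain NumPy is used instead of Apache Spark.

```python
import numpy as np

def knn_distance_scores(embeddings, k=2):
    """Score each row by its mean distance to its k nearest other rows."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore the distance of a row to itself
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

def detect_anomalies(embeddings, k=2, factor=3.0):
    """Flag rows whose score exceeds factor * median score (assumed rule)."""
    scores = knn_distance_scores(embeddings, k)
    return scores > factor * np.median(scores)

def repair(values, coords, is_anomalous, n_neighbors=3):
    """Replace each anomalous row with an inverse-distance-weighted mean
    of the spatially nearest non-anomalous sensors (assumed formula).
    Assumes at least one non-anomalous sensor exists."""
    repaired = values.astype(float).copy()
    clean = np.flatnonzero(~is_anomalous)
    for i in np.flatnonzero(is_anomalous):
        d = np.linalg.norm(coords[clean] - coords[i], axis=1)
        order = np.argsort(d)[:n_neighbors]
        w = 1.0 / (d[order] + 1e-9)  # closer sensors get larger weights
        repaired[i] = (w[:, None] * values[clean[order]]).sum(axis=0) / w.sum()
    return repaired

# Toy data: five sensors, two readings each; the last sensor is faulty.
values = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05], [9.0, 9.0]])
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])

flags = detect_anomalies(values)        # raw values used in place of embeddings
repaired = repair(values, coords, flags)
```

On this toy input only the last sensor is flagged, and its readings are replaced by a weighted average of its three nearest clean neighbors. In the framework described above, a downstream regressor such as Gradient Boosted Trees would then be trained and applied on the repaired data.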
