Article

Anomaly Detection and Repair for Accurate Predictions in Geo-distributed Big Data

BIG DATA RESEARCH (2019)

Journal

BIG DATA RESEARCH

Volume 16, Pages 18-35

Publisher

ELSEVIER

DOI: 10.1016/j.bdr.2019.04.001

Keywords

Anomaly detection; Data repair; Geo-distributed big data; Spatial autocorrelation; Neural networks; Gradient-boosting

Funding

Ministry of Education, Universities and Research (MIUR) through the project ComESto - Community Energy Storage: Gestione Aggregata di Sistemi d'Accumulo dell'Energia in Power Cloud [ARS01_01259]
European Commission through the project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data [ICT-2013-612944]
European Commission through the project TOREADOR - TrustwOrthy model-awaRE Analytics Data platform [988797]
project Microsoft Azure for Research, ReCaS [PONa3_00052]
project Microsoft Azure for Research, PRISMA [PON04a2_A]

Abstract

The increasing presence of geo-distributed sensor networks implies the generation of huge volumes of data from multiple geographical locations at an increasing rate. This raises important issues, which become more challenging when the final goal is to analyze the data for forecasting or, more generally, for predictive tasks. This paper proposes a framework that supports predictive modeling tasks on streaming data coming from multiple geo-referenced sensors. In particular, we propose a distance-based anomaly detection strategy that considers objects described by embedding features learned via a stacked auto-encoder. We then devise a repair strategy that repairs the data detected as anomalous by exploiting non-anomalous data measured by sensors in nearby spatial locations. Subsequently, we adopt Gradient Boosted Trees (GBTs) to predict/forecast the values assumed by a target variable of interest for the repaired, newly arriving (unlabeled) data, using either the original feature representation or the embedding feature representation learned via the stacked auto-encoder. The workflow is implemented with distributed Apache Spark programming primitives and tested in a cluster environment. We perform experiments to assess the performance of each module, separately and in combination, on the predictive modeling of one-day-ahead energy production for multiple renewable energy sites. Accuracy results show that the proposed framework reduces the error by up to 13.56%. Moreover, scalability results demonstrate the efficiency of the proposed framework in terms of speedup, scaleup and execution time under a stress test. (C) 2019 Elsevier Inc. All rights reserved.
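
The detect-then-repair idea in the abstract can be illustrated with a small, single-machine sketch. This is not the authors' implementation: the abstract does not give the distance rule, the threshold, or the repair formula, so the k-th-nearest-neighbor test, the inverse-distance weighting, and all names below (`detect_anomalies`, `repair`, `k`, `threshold`, `n_neighbors`) are assumptions made for illustration. The input vectors stand in for the embedding features that the paper learns with a stacked auto-encoder, and the toy data is invented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect_anomalies(vectors, k=2, threshold=1.0):
    """Flag a vector when its k-th nearest other vector is farther than threshold.

    Assumed rule: the abstract only says the detection is distance-based.
    """
    flags = []
    for i, v in enumerate(vectors):
        dists = sorted(euclidean(v, w) for j, w in enumerate(vectors) if j != i)
        flags.append(dists[k - 1] > threshold)
    return flags


def repair(vectors, locations, flags, n_neighbors=2):
    """Replace each flagged vector with an inverse-distance-weighted mean of the
    non-flagged vectors from the spatially nearest sensors (assumed formula)."""
    repaired = [list(v) for v in vectors]
    normal = [i for i, flagged in enumerate(flags) if not flagged]
    for i, flagged in enumerate(flags):
        if not flagged or not normal:
            continue  # nothing to repair, or no clean neighbor to borrow from
        nearest = sorted(
            normal, key=lambda j: euclidean(locations[i], locations[j])
        )[:n_neighbors]
        weights = [
            1.0 / (euclidean(locations[i], locations[j]) + 1e-9) for j in nearest
        ]
        total = sum(weights)
        repaired[i] = [
            sum(w * vectors[j][d] for w, j in zip(weights, nearest)) / total
            for d in range(len(vectors[i]))
        ]
    return repaired


# Toy data (hypothetical): four sensors on a unit square, one with an outlying reading.
locations = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
vectors = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [9.0, 9.0]]

flags = detect_anomalies(vectors, k=2, threshold=1.0)
repaired = repair(vectors, locations, flags, n_neighbors=2)
print(flags)  # [False, False, False, True]
print(repaired[3])
```

In this toy run only the fourth sensor is flagged, and its reading is rebuilt from the two closest clean sensors. In the setting the abstract describes, the repaired data would then be passed to a Gradient Boosted Trees model, and each step would run on Apache Spark rather than in plain Python.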
