4.6 Article

Prediction of Microcystis Occurrences and Analysis Using Machine Learning in High-Dimension, Low-Sample-Size and Imbalanced Water Quality Data

Journal

HARMFUL ALGAE
Volume 117, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.hal.2022.102273

Keywords

Water reservoir; Harmful algal blooms; Microcystis blooms; Machine learning; Feature engineering; Feature selection

This study utilizes machine learning models to predict the outbreak of Microcystis and analyze its causes. Feature Engineering and Feature Selection algorithms are applied to address the challenges of high dimensionality, low sample size, and imbalance in the water quality data. The results suggest that total nitrogen, chemical oxygen demand, chlorophyll-a, dissolved oxygen saturation, and water temperature are associated with Microcystis occurrences, providing new insights not found in previous studies.
Machine learning, deep learning, and water quality data have been used in recent years to predict outbreaks of harmful algae, especially Microcystis, and to analyze their causes. However, for various reasons, water quality data are often High-Dimension, Low-Sample-Size (HDLSS), meaning the sample size is lower than the number of dimensions. Moreover, imbalance problems may arise due to bias in the occurrence frequency of Microcystis. These problems make it difficult to predict the occurrence of Microcystis and analyze its causes with machine learning. In this study, a machine learning model that applies Feature Engineering (FE) and Feature Selection (FS) algorithms is used to predict outbreaks of Microcystis and analyze the outbreak factors from imbalanced HDLSS water quality data. Prediction performance was verified with binary classification, determining whether Microcystis would occur in the future, by applying three machine learning models to four data patterns. The cause analysis of Microcystis occurrence was performed by visualizing the results of applying FE and FS. On the test data, the predictive performance of the FE and FS methods was significantly better than that of the conventional method, with accuracy 0.108 points higher and F-value 0.691 points higher. Prediction performance also increased with smaller model capacity. Data-driven analysis suggested that total nitrogen, chemical oxygen demand, chlorophyll-a, dissolved oxygen saturation, and water temperature are associated with Microcystis occurrences. The results also indicated that basic statistics of the water quality distribution over a year (especially mean, standard deviation, and skewness), rather than the concentrations of water components, are related to the occurrence of Microcystis. These findings have not been reported in previous studies and are expected to contribute significantly to future studies of algae.
This study provides a method for analyzing water quality data that have high dimensionality and small sample sizes, imbalance problems, or both.
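
As a rough illustration of two ideas mentioned in the abstract, the sketch below computes yearly summary statistics (mean, standard deviation, skewness) for one water quality variable and compares accuracy with an F-value on imbalanced labels. It is not the authors' implementation: the function names are hypothetical, the skewness is the simple population form, and the F-value is assumed to be the F1 score, which the abstract does not specify.

```python
from statistics import fmean, pstdev


def summary_features(values):
    """Mean, population standard deviation, and skewness of one year's
    measurements of a single variable (hypothetical helper)."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        skew = 0.0  # a constant series has no asymmetry
    else:
        skew = fmean([((v - mu) / sigma) ** 3 for v in values])
    return {"mean": mu, "std": sigma, "skew": skew}


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f_value(y_true, y_pred):
    """F1 score for the positive class (1 = occurrence); assumed definition."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With nine non-occurrence labels and one occurrence, a model that always predicts "no occurrence" reaches an accuracy of 0.9 but an F-value of 0.0, which may be why both metrics are reported for imbalanced data.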
