Article

A Machine Learning Based Framework for Verification and Validation of Massive Scale Image Data

Journal

IEEE TRANSACTIONS ON BIG DATA
Volume 7, Issue 2, Pages 451-467

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TBDATA.2017.2680460

Keywords

Big data; Diffraction; Morphology; Software; Machine learning algorithms; Three-dimensional displays; Testing; diffraction image; machine learning; deep learning; metamorphic testing

Funding

  1. National Science Foundation [1262933, 1560037]
  2. Directorate for Computer & Information Science & Engineering; Division of Computing and Communication Foundations [1262933] (Funding Source: National Science Foundation)
  3. Directorate for Computer & Information Science & Engineering; Division of Computing and Communication Foundations [1560037] (Funding Source: National Science Foundation)

This study introduces a big data system called CMA for classifying biological cells based on cell morphology captured in diffraction images. A framework has been developed to rigorously validate and verify the massive scale image data, software tools, and machine learning algorithms in order to ensure system quality.
Big data validation and system verification are crucial for ensuring the quality of big data applications. However, a rigorous technique for such tasks has yet to emerge. Over the past decade, we have developed a big data system called CMA for investigating the classification of biological cells based on cell morphology captured in diffraction images. CMA includes a collection of scientific software tools, machine learning algorithms, and a large-scale cell image repository. To ensure the quality of CMA, we developed a framework for rigorously validating the massive-scale image data and for adequately verifying both the software tools and the machine learning algorithms. The big data is validated by iteratively selecting data with a machine learning approach. The framework introduces an experimental approach, guided by a feature selection algorithm, to select an optimal feature set that improves machine learning performance. Verification of the software and algorithms is based on an iterative metamorphic testing approach, because the software and algorithms are non-testable. A machine learning approach is introduced for developing test oracles iteratively to ensure the adequacy of the test coverage criteria. The performance of the machine learning algorithm is evaluated with stratified N-fold cross-validation and a confusion matrix. We describe the design of the proposed big data verification and validation framework with CMA as the case study, and demonstrate its effectiveness by verifying and validating the dataset, the software, and the algorithms in CMA.
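
The evaluation step named in the abstract, stratified N-fold cross-validation summarized in a confusion matrix, can be illustrated with a minimal sketch. This is not the CMA implementation: the function names (`stratified_folds`, `confusion_matrix`, `cross_validate`) and the one-dimensional nearest-mean classifier are hypothetical stand-ins chosen so that the example runs with the Python standard library only.

```python
from collections import defaultdict
import random


def stratified_folds(labels, n_folds, seed=0):
    """Split sample indices into n_folds lists, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets a near-equal share.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def confusion_matrix(true_labels, predicted_labels, classes):
    """matrix[i][j] counts samples of classes[i] that were predicted as classes[j]."""
    position = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[position[truth]][position[predicted]] += 1
    return matrix


def cross_validate(features, labels, n_folds, train_fn, predict_fn):
    """Train on N-1 folds, test on the held-out fold, and sum the per-fold matrices."""
    classes = sorted(set(labels))
    total = [[0] * len(classes) for _ in classes]
    for held_out in stratified_folds(labels, n_folds):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        predictions = [predict_fn(model, features[i]) for i in held_out]
        fold_matrix = confusion_matrix([labels[i] for i in held_out],
                                       predictions, classes)
        for r, row in enumerate(fold_matrix):
            for c, count in enumerate(row):
                total[r][c] += count
    return classes, total


# Hypothetical toy classifier (an assumption, not from the paper):
# predict the class whose training mean is closest to the sample.
def train_nearest_mean(xs, ys):
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_nearest_mean(model, x):
    return min(model, key=lambda y: abs(model[y] - x))


# Made-up, well-separated data: six samples of class "a", four of class "b".
features = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 10.0, 10.1, 10.2, 10.3]
labels = ["a"] * 6 + ["b"] * 4
classes, matrix = cross_validate(features, labels, 2,
                                 train_nearest_mean, predict_nearest_mean)
print(classes, matrix)
```

Stratifying the folds matters when class sizes are unequal: a plain random split could leave a fold with few or no samples of a minority class, which would distort the per-class rows of the confusion matrix.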
