☆ 4.7 Article

Effect of normalization methods on the performance of supervised learning algorithms applied to HTSeq-FPKM-UQ data sets: 7SK RNA expression as a predictor of survival in patients with colon adenocarcinoma

BRIEFINGS IN BIOINFORMATICS (2019)

Journal

BRIEFINGS IN BIOINFORMATICS

Volume 20, Issue 3, Pages 985-994

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/bib/bbx153

Keywords

7SK RNA; gene expression; colon adenocarcinoma; normalization methods; supervised machine learning algorithms; TCGA HTSeq-FPKM-UQ data sets

Funding

Mathematical Biosciences Institute
National Science Foundation [DMS 1440386]

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Abstract

Motivation: One of the main challenges in machine learning (ML) is choosing an appropriate normalization method. Here, we examine the effect of various normalization methods on analyzing FPKM upper quartile (FPKM-UQ) RNA sequencing data sets. We collect the HTSeq-FPKM-UQ files of patients with colon adenocarcinoma from TCGA-COAD project. We compare three most common normalization methods: scaling, standardizing using z-score and vector normalization by visualizing the normalized data set and evaluating the performance of 12 supervised learning algorithms on the normalized data set. Additionally, for each of these normalization methods, we use two different normalization strategies: normalizing samples (files) or normalizing features (genes). Results: Regardless of normalization methods, a support vector machine (SVM) model with the radial basis function kernel had the maximum accuracy (78%) in predicting the vital status of the patients. However, the fitting time of SVM depended on the normalization methods, and it reached its minimum fitting time when files were normalized to the unit length. Furthermore, among all 12 learning algorithms and 6 different normalization techniques, the Bernoulli naive Bayes model after standardizing files had the best performance in terms of maximizing the accuracy as well as minimizing the fitting time. We also investigated the effect of dimensionality reduction methods on the performance of the supervised ML algorithms. Reducing the dimension of the data set did not increase the maximum accuracy of 78%. However, it leaded to discovery of the 7SK RNA gene expression as a predictor of survival in patients with colon adenocarcinoma with accuracy of 78%.

Effect of normalization methods on the performance of supervised learning algorithms applied to HTSeq-FPKM-UQ data sets: 7SK RNA expression as a predictor of survival in patients with colon adenocarcinoma

Journal

BRIEFINGS IN BIOINFORMATICS

Publisher

OXFORD UNIV PRESS

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Effect of normalization methods on the performance of supervised learning algorithms applied to HTSeq-FPKM-UQ data sets: 7SK RNA expression as a predictor of survival in patients with colon adenocarcinoma

Journal

BRIEFINGS IN BIOINFORMATICS

Publisher

OXFORD UNIV PRESS

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper