Article; Proceedings Paper

GAN-based data augmentation for transcriptomics: survey and comparative assessment

BIOINFORMATICS (2023)

Journal

BIOINFORMATICS

Volume 39, Issue -, Pages i111-i120

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/bioinformatics/btad239

Categories

Biochemical Research Methods; Biotechnology & Applied Microbiology; Computer Science, Interdisciplinary Applications; Mathematical & Computational Biology; Statistics & Probability

Summary

This article analyzes GAN-based data augmentation strategies for cancer phenotype classification. The results show that augmentation significantly improves both binary and multiclass classification performance. Without augmentation, classifiers trained on only 50 RNA-seq samples reach an accuracy of 94% and 70% for binary and tissue classification, respectively; adding 1000 augmented samples raises the accuracy to 98% and 94%. Richer architectures and more expensive training of the generative models correlate positively with augmentation performance and generated data quality. Several performance indicators are required to assess the quality of the generated data correctly.

Abstract

Motivation: Transcriptomics data are becoming more accessible due to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting the full predictive power of deep learning models for phenotype prediction. Artificially enhancing the training sets, namely data augmentation, is suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes.

Results: This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields an accuracy of, respectively, 94% and 70% for binary and tissue classification. In comparison, we achieved 98% and 94% accuracy when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN yield better augmentation performance and generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly.

Availability and implementation: All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository:

