4.7 Article Proceedings Paper

GAN-based data augmentation for transcriptomics: survey and comparative assessment


This article analyzes the application of GAN-based data augmentation strategies in cancer phenotype classification. The results show a significant improvement in binary and multiclass classification performance through data augmentation. Without augmentation, the accuracy of classifiers trained on only 50 RNA-seq samples is 94% and 70% for binary and tissue classification respectively, while adding 1000 augmented samples increases the accuracy to 98% and 94%. The strength and training cost of the generative models positively correlate with the augmentation performance and generated data quality. Multiple performance indicators are required to assess the quality of the generated data correctly.
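
The evaluation protocol summarized above (train on 50 real samples, then again on the same 50 plus 1000 generated samples, and compare accuracy on held-out data) can be sketched as follows. This is a minimal illustration under loud assumptions, not the authors' code: the data are toy Gaussian profiles rather than TCGA RNA-seq, the generator is a per-class Gaussian fit standing in for a trained GAN, the classifier is a nearest-centroid rule, and every function name is hypothetical. Because the stand-in generator only re-samples what the 50 real samples already describe, it is not expected to reproduce the gains reported in the article.

```python
import random
import statistics

N_GENES = 20  # toy dimensionality (assumption); real RNA-seq profiles have thousands of genes


def make_samples(n, label, rng):
    """Draw n toy expression profiles for one class (hypothetical data, not TCGA)."""
    shift = 0.5 * label  # class 1 is shifted by 0.5 on every gene
    return [([rng.gauss(shift, 1.0) for _ in range(N_GENES)], label) for _ in range(n)]


def fit_generator(samples):
    """Stand-in for a trained conditional GAN: per-class, per-gene mean and std."""
    params = {}
    for label in sorted({y for _, y in samples}):
        columns = zip(*[x for x, y in samples if y == label])
        params[label] = [(statistics.fmean(c), statistics.stdev(c)) for c in columns]
    return params


def generate(params, n_per_class, rng):
    """Sample n_per_class labeled synthetic profiles for each class."""
    return [([rng.gauss(m, s) for m, s in stats], label)
            for label, stats in params.items()
            for _ in range(n_per_class)]


def fit_centroids(samples):
    """Nearest-centroid classifier: store the mean profile of each class."""
    centroids = {}
    for label in sorted({y for _, y in samples}):
        columns = zip(*[x for x, y in samples if y == label])
        centroids[label] = [statistics.fmean(c) for c in columns]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(centroids, x) == y for x, y in samples) / len(samples)


rng = random.Random(0)
real_train = make_samples(25, 0, rng) + make_samples(25, 1, rng)    # 50 real samples
test_set = make_samples(200, 0, rng) + make_samples(200, 1, rng)    # held-out evaluation set

synthetic = generate(fit_generator(real_train), 500, rng)           # 1000 generated samples
augmented_train = real_train + synthetic

acc_real = accuracy(fit_centroids(real_train), test_set)
acc_aug = accuracy(fit_centroids(augmented_train), test_set)
print(f"real only: {acc_real:.3f}  real + synthetic: {acc_aug:.3f}")
```

The structure is what matters here: the generator is fitted only on the small real training set, the synthetic samples carry labels, and both classifiers are scored on the same held-out real data. Swapping in an actual GAN and a neural classifier changes `fit_generator`, `generate`, and `fit_centroids`, but not the comparison itself.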
Motivation: Transcriptomics data are becoming more accessible thanks to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting the full predictive power of deep learning models for phenotype prediction. Artificially enlarging the training set, known as data augmentation, has been suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes.

Results: This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields accuracies of 94% and 70% for binary and tissue classification, respectively. In comparison, we achieved accuracies of 98% and 94% when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN yield better augmentation performance and better generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly.

Availability and implementation: All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository:
