Article

iProbiotics: a machine learning platform for rapid identification of probiotic properties from whole-genome primary sequences

Journal

BRIEFINGS IN BIOINFORMATICS
Volume 23, Issue 1

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bib/bbab477

Keywords

probiotic; k-mer composition; prediction; feature selection; machine learning

Funding

  1. National Natural Science Foundation of China [62061034, 31922071]
  2. National Natural Science Foundation of Inner Mongolia [2021ZD08]
  3. Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region [NJYT-18-B01]
  4. Science and Technology Major Project of Inner Mongolia Autonomous Region of China [2019ZD031]


Lactic acid bacteria consortia are commonly found in food and some possess probiotic properties. This study developed a machine learning-based platform using genomic data to identify probiotics. Results showed diverse oligonucleotide composition in probiotic genomes and a bias towards genes/pathways related to probiotic function. The study also created an online bioinformatic tool, iProbiotics, for rapid probiotic screening.
Lactic acid bacteria consortia are commonly present in food, and some of these bacteria possess probiotic properties. However, discovering and experimentally validating probiotics requires extensive time and effort, so effective screening methods for identifying probiotics are of great interest. Advances in sequencing technology have generated massive genomic data, enabling us to create a machine learning-based platform for this purpose. This study first assembled a comprehensive probiotic genome dataset from the probiotic database (PROBIO) and literature surveys. Then, k-mer (k = 2 to 8) compositional analysis was performed, revealing diverse oligonucleotide composition across strain genomes and apparently more probiotic (P-) features in probiotic genomes than in non-probiotic genomes. To reduce noise and improve computational efficiency, the 87 376 k-mers were refined by an incremental feature selection (IFS) method; the model reached its maximum accuracy with 184 core features, achieving a prediction accuracy of 97.77% and an area under the curve of 98.00%. Functional genomic analysis using annotations from the Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Rapid Annotation using Subsystem Technology (RAST) databases, together with analysis of genes associated with host gastrointestinal survival/settlement, carbohydrate utilization, drug resistance and virulence factors, revealed that the distribution of P-features was biased toward genes/pathways related to probiotic function. Our results suggest that probiotic function is determined not by a single gene but by a combination of k-mer genomic components, providing new insights into the identification and underlying mechanisms of probiotics. This work created a novel and free online bioinformatic tool, iProbiotics, to facilitate rapid screening for probiotics.
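The k-mer compositional step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of relative frequencies, the overlapping single-strand windows and the handling of ambiguous bases are all assumptions made for illustration. It does show where the figure of 87 376 comes from: there are 4^k possible k-mers over A, C, G and T, and summing 4^k for k = 2 to 8 gives 87 376 candidate features.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous DNA bases are counted


def kmer_feature_names(k_min=2, k_max=8):
    """Enumerate every possible k-mer over ACGT for k in [k_min, k_max]."""
    return [
        "".join(p)
        for k in range(k_min, k_max + 1)
        for p in product(ALPHABET, repeat=k)
    ]


def kmer_frequencies(sequence, k):
    """Relative frequency of each k-mer in `sequence` (overlapping windows).

    Hypothetical helper: the paper does not state its exact normalization.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Discard windows that contain ambiguous bases such as N.
    valid = {kmer: n for kmer, n in counts.items() if set(kmer) <= set(ALPHABET)}
    total = sum(valid.values())
    return {kmer: n / total for kmer, n in valid.items()} if total else {}


def feature_vector(sequence, k_min=2, k_max=8):
    """Concatenate the per-k frequency profiles into one fixed-order vector."""
    freqs = {}
    for k in range(k_min, k_max + 1):
        freqs.update(kmer_frequencies(sequence, k))
    return [freqs.get(name, 0.0) for name in kmer_feature_names(k_min, k_max)]


if __name__ == "__main__":
    # Size of the full feature space for k = 2..8.
    print(len(kmer_feature_names()))
    # Dinucleotide profile of a toy sequence.
    print(kmer_frequencies("ACGTACGT", 2))
```

A vector of this kind, computed per genome, is the sort of input that a feature-selection procedure such as IFS would then reduce to a smaller core set; the selection step itself is not sketched here.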
