Article

Identifying Differences in the Performance of Machine Learning Models for Off-Targets Trained on Publicly Available and Proprietary Data Sets

Journal

CHEMICAL RESEARCH IN TOXICOLOGY
Volume 36, Issue 8, Pages 1300-1312

Publisher

AMER CHEMICAL SOC
DOI: 10.1021/acs.chemrestox.3c00042

This study investigates the distribution of positive and nonpositive entries within the ChEMBL database and its impact on the performance of classification models. The results indicate that models trained on publicly available data tend to overpredict positives, while models based on industry data sets predict negatives more often. The visualization of the prediction space further strengthens these findings by identifying regions where predictions converge. Furthermore, the utilization of these models for consensus modeling for potential adverse events prediction is highlighted.
Each year, publicly available databases are updated with new compounds from different research institutions. Positive experimental outcomes are more likely to be reported; therefore, they account for a considerable fraction of these entries. Established publicly available databases such as ChEMBL allow researchers to use information without constrictions and create predictive tools for a broad spectrum of applications in the field of toxicology. Therefore, we investigated the distribution of positive and nonpositive entries within ChEMBL for a set of off-targets and its impact on the performance of classification models when applied to pharmaceutical industry data sets. Results indicate that models trained on publicly available data tend to overpredict positives, and models based on industry data sets predict negatives more often than those built using publicly available data sets. This is strengthened even further by the visualization of the prediction space for a set of 10,000 compounds, which makes it possible to identify regions in the chemical space where predictions converge. Finally, we highlight the utilization of these models for consensus modeling for potential adverse events prediction.
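
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the function names, the toy predictions, and the simple agreement rule used for the consensus are all assumptions made for illustration. The sketch compares how often two binary classifiers predict "positive" and labels each compound by whether the two models agree.

```python
from typing import List


def positive_rate(preds: List[int]) -> float:
    """Fraction of compounds predicted positive (1) by a model."""
    return sum(preds) / len(preds)


def consensus(public: List[int], industry: List[int]) -> List[str]:
    """Label each compound by agreement between two models.

    Hypothetical rule: both positive -> "positive",
    both negative -> "negative", otherwise -> "disagree".
    """
    labels = []
    for p, i in zip(public, industry):
        if p == 1 and i == 1:
            labels.append("positive")
        elif p == 0 and i == 0:
            labels.append("negative")
        else:
            labels.append("disagree")
    return labels


# Toy predictions (invented for illustration, not data from the paper).
public_preds = [1, 1, 1, 0, 1, 0]    # model trained on public data
industry_preds = [1, 0, 0, 0, 1, 0]  # model trained on industry data

print(positive_rate(public_preds), positive_rate(industry_preds))
print(consensus(public_preds, industry_preds))
```

In this toy case the public-data model predicts positives more often than the industry-data model, and the "disagree" label marks compounds where the two predictions do not converge.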
