Article

Assessment of machine learning approaches for predicting the crystallization propensity of active pharmaceutical ingredients

Journal

CrystEngComm
Volume 21, Issue 8, Pages 1215-1223

Publisher

Royal Society of Chemistry
DOI: 10.1039/c8ce01589a

In this report, three machine learning approaches were assessed for their ability to predict the crystallization propensities of a set of small organic compounds (<709 Da). The algorithms evaluated were random forest regression (RFR), support vector machine regression (SVMR) and neural networks (NN). In addition to the choice of algorithm, the influence of different molecular descriptors, the size of the training set, and various experimental factors on the predictive ability of the methods was also considered. In particular, the solvent used, the presence of impurities and/or degradants, the influence of potentially seeded crystallizations, and implied supersaturation levels were explicitly investigated. For smaller training set sizes (e.g., ~50), very little difference in the accuracy of the three algorithms was observed. Beyond training set sizes of 150, however, the RFR algorithm typically outperformed the others by up to 20% RMSE. Additionally, as a result of the improved performance with larger training set sizes, the RFR models built with explicit treatment of the solvent typically outperformed models that considered only the active pharmaceutical ingredient (API). For example, the best-performing API-only model had an RMSE of 30%, whereas the API + solvent models had an RMSE of 20%. Beyond inclusion of the solvent, the presence of impurities and/or degradants had the greatest influence on model accuracy. When experiments involving impurities and/or degradants were excluded, an additional improvement of up to 10% RMSE was observed in some cases.
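
The comparison described above (RMSE as a function of training set size) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the two input features are hypothetical stand-ins for molecular descriptors, and a simple k-nearest-neighbour regressor is swapped in for the RFR, SVMR and NN models, none of which is reproduced here. It shows only how a fixed held-out set can be scored at several training set sizes.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def knn_predict(train_X, train_y, x, k=3):
    """Stand-in regressor (NOT one of the paper's models):
    mean target of the k training points nearest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def learning_curve(X, y, sizes, n_test, seed=0):
    """Return {training set size: RMSE} on one fixed held-out test set."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    test, pool = indices[:n_test], indices[n_test:]
    y_test = [y[i] for i in test]
    results = {}
    for n in sizes:
        train = pool[:n]  # nested subsets: each larger set contains the smaller one
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        preds = [knn_predict(train_X, train_y, X[i]) for i in test]
        results[n] = rmse(y_test, preds)
    return results


# Synthetic data: two hypothetical descriptors and a noisy propensity in percent.
data_rng = random.Random(42)
X = [(data_rng.random(), data_rng.random()) for _ in range(400)]
y = [100 * (0.6 * a + 0.4 * b) + data_rng.gauss(0, 5) for a, b in X]

curve = learning_curve(X, y, sizes=[50, 150, 300], n_test=100)
for n, err in curve.items():
    print(f"training size {n}: RMSE = {err:.1f}%")
```

Because the test set is held fixed and the training subsets are nested, differences between the reported RMSE values reflect only the added training data, which is the kind of comparison the abstract summarizes.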
