Extreme gradient boosting combined with conformal predictors for informative solubility estimation
Abstract: We used the extreme gradient boosting (XGB) algorithm to predict the experimental solubility of chemical compounds in water and organic solvents and to select significant molecular descriptors. The accuracy of prediction of our forward stepwise top-importance XGB (FSTI-XGB) on curated solubility data sets in terms of RMSE was found to be 0.59–0.76 Log(S) for two water data sets, while for organic solvent data sets it was 0.69–0.79 Log(S) for the Methanol data set, 0.65–0.79 for the Ethanol data set, and 0.62–0.70 Log(S) for the Acetone data set. That was the first step. In the second step, we used uncurated and curated AquaSolDB data sets for applicability domain (AD) tests of Drugbank, PubChem, and COCONUT databases and determined that more than 95% of studied ca. 500,000 compounds were within the AD. In the third step, we applied conformal prediction to obtain narrow prediction intervals and we successfully validated them using test sets’ true solubility values. With prediction intervals obtained in the last fourth step, we were able to estimate individual error margins and the accuracy class of the solubility prediction for molecules within the AD of three public databases. All that was possible without the knowledge of experimental database solubilities. We find these four steps novel because usually, solubility-related works only study the first step or the first two steps.
PublicationMolecules 29(1), 19
Other Funding informationPharmaceutical Manufacturing Technology Centre Core Cleaning Project (TC-2018-0026) and the Innovation Partnership Program (IP-2020-0957).
Also affiliated with
- Bernal Institute
Sustainable development goals
- (9) Industry, Innovation and Infrastructure
Department or School
- Chemical Sciences