Development and Evaluation of ADME Models Using Proprietary and Opensource Data
Absorption, Distribution, Metabolism and Elimination (ADME) properties are important factors in the drug discovery pipeline. Literature ADME data are often collected in large chemical databases like ChEMBL, which might be an asset to improve the prediction of ADME properties. Pharmaceutical companies build ADME Quantitative Structure Property Relationships (QSPR) models using proprietary data and thus the inclusion of literature data might be a valuable source for the development of predictive models. The aim of this study was to investigate whether merging literature and proprietary data could improve the predictive activity of proprietary models and enlarge their applicability domain (AD). ADME predictive models for Caco-2 (A to B) permeability and LogD7.4 were built with data extracted from Evotec and ChEMBL database. Predictive models were developed for each property and three different training sets were used based on: proprietary compounds (Evotec models), literature compounds (ChEMBL models) and a merged set of proprietary and literature compounds (Evotec+ChEMBL models). The Random Forest (RF), Partial Least Squares (PLS) and Support Vector Regression (SVR) were used to develop the models. The performance of the models was evaluated by using two types of test sets: a diverse test set (20 % compounds of available data randomly selected) and a temporal test set (data published after the models were built). The descriptors that used were the physiochemical descriptors, the structural Molecular Access System (MACCS) descriptors and the Partial equalisation of orbital electronegativity – van der Walls surface areas (Peoe-VSA) descriptors. The AD of the models was evaluated with four distance to model metrics, which were the: kNN with Euclidean distance, kNN with Manhattan distance, Leverage and Mahalanobis distance. The ability of an existing Evotec Caco-2 permeability model to assess literature compounds (extracted from ChEMBL) was evaluated. The literature test set was predicted with a higher RMSE compared to the RMSE in prediction for internal compounds. Additionally, a number of literature compounds was found to be outside the AD of the Evotec model, thus highlighting an area of improvement for proprietary Evotec models. Furthermore, the effect of the inclusion of literature data in the existing Caco-2 permeability and LogD7.4 Evotec proprietary models was evaluated. The RF algorithm was the highest performing method for the development of Caco-2 permeability models and the SVR for the LogD7.4 models. In addition, the leverage method proved to be the most appropriate for the evaluation of the models’ AD. The permeability model built merging literature and proprietary data (Evotec+ChEMBL model) predicted a literature temporal test set with an RMSE of 0.68 while the Evotec model showed an RMSE of 0.74. Even in the case of the Evotec temporal test set, the two models performed similarly and the AD of the mixed models (incorporating both literature and proprietary data) was enlarged. The 86.15% of the compounds in the proprietary temporal test set were within the AD of the Evotec+ChEMBL model, while 76.50% of the compounds of the same test set appeared to be within the AD of the Evotec model. Similarly, the LogD7.4 Evotec+ChEMBL model predicted a literature temporal test set with an RMSE of 0.77 while the Evotec model showed an RMSE of 0.83. Even in the case of the Evotec temporal test set, the two models performed similarly but the AD of the mixed models (incorporating both literature and proprietary data) was enlarged. The 94.86% of the compounds in the proprietary temporal test set were within the AD of the Evotec+ChEMBL model, while 88.49% of the compounds of the same test set appeared to be within the AD of the Evotec model. This study demonstrated that the inclusion of public ADME data into proprietary models improved the performance of proprietary models and enlarged at the same time their AD. The methodology presented herein will be applied by Evotec computational scientists to re-build the Caco-2 and LogD7.4 Evotec proprietary models considering literature data as discussed in this thesis.