Quantitative structure-activity relationship (QSAR) modeling uses mathematical models to predict the biological activity or physicochemical properties of a molecule from its structure. QSAR has been widely used in drug discovery, environmental toxicology, and chemical risk assessment for decades. However, traditional QSAR models often suffer from limitations such as low accuracy, poor generalization, and high computational cost.
Machine learning (ML) is a branch of artificial intelligence that enables computers to learn patterns from data rather than follow explicitly programmed rules. ML has revolutionized fields such as computer vision, natural language processing, and recommender systems. It is also well suited to QSAR modeling, as it can handle complex, high-dimensional data, learn nonlinear and hidden patterns, and optimize model performance automatically.
In this blog, we will introduce some recent advances in ML-based QSAR modeling, and discuss the challenges and opportunities for future research. We will cover the following topics:
Molecular representation learning: how to encode the structural information of molecules into numerical vectors that can be used by ML models.
Pretraining and transfer learning: how to leverage large-scale unlabeled or labeled data to improve the performance of QSAR models on specific tasks or domains.
AutoML and ensemble learning: how to automate the process of model selection, hyperparameter tuning, and model integration for QSAR modeling.
Evaluation and interpretation: how to assess the validity, reliability, and explainability of QSAR models.
Molecular Representation Learning
One of the key steps in QSAR modeling is to represent the molecules as numerical vectors, also known as molecular descriptors or features, that capture the relevant information for the prediction task. There are many types of molecular descriptors, such as physicochemical properties, topological indices, fingerprints, pharmacophores, and graph-based features. However, most of these descriptors are either predefined by human experts or extracted by simple rules, which may not be optimal or sufficient for the complex and diverse nature of molecular structures.
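To make this concrete, here is a minimal sketch of computing such predefined descriptors with RDKit; the descriptor and fingerprint choices below are illustrative assumptions, not a recommendation for any particular task.

```python
# A minimal sketch of predefined molecular descriptors with RDKit
# (assumes RDKit is installed; the choices below are illustrative).
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
import numpy as np

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# Expert-defined physicochemical descriptors.
physchem = [Descriptors.MolWt(mol), Descriptors.MolLogP(mol), Descriptors.TPSA(mol)]

# A rule-based 2048-bit Morgan (circular) fingerprint.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
features = np.array(fp)  # ready to feed into a downstream ML model

print(physchem, int(features.sum()))
```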
Molecular representation learning (MRL) instead uses ML techniques to learn molecular descriptors from data, without relying on handcrafted rules or expert-defined features. MRL methods can be divided into three categories based on the input format of the molecules: 1D sequential tokens, 2D topology graphs, and 3D conformers.
1D sequential tokens: Molecules can be represented as sequences of tokens, such as SMILES, InChI, or SELFIES strings, which encode the atom and bond information of the molecule. ML models such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers can learn molecular representations from these sequences. For example, SMILES2Vec is an RNN-based model that learns molecular representations directly from SMILES strings for property prediction. MolBERT ² is a transformer-based model that learns molecular representations by pretraining on large-scale unlabeled SMILES data with masked language modeling and related self-supervised tasks.
2D topology graphs: Molecules can also be represented as graphs, where the nodes are atoms and the edges are bonds. ML models such as graph neural networks (GNNs), including graph convolutional networks (GCNs) and graph attention networks (GATs), can learn molecular representations from these graphs. For example, the MPNN ³ framework learns molecular representations through message-passing and readout functions (a minimal message-passing sketch follows this list). GraphAF ⁴ is a flow-based model that learns molecular representations by generating molecular graphs autoregressively.
3D conformers: Molecules can also be represented as 3D conformers, which capture the spatial arrangement of the atoms and, in turn, intermolecular interactions. ML models such as point cloud networks (PCNs), 3D CNNs, and 3D transformers can learn molecular representations from these conformers. For example, SchNet ⁵ learns molecular representations directly from atomic positions using continuous-filter convolutional layers. 3D-MoFiT ⁶ is a 3D transformer-based model that learns molecular representations by pretraining on large-scale unlabeled 3D conformers with masked atom prediction and contrastive learning tasks.
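To make the 2D graph case concrete, the sketch below implements a single message-passing step plus a sum readout in plain PyTorch. It is a toy illustration of the general idea, not a faithful reimplementation of MPNN or any other published architecture; the feature encoding and layer sizes are arbitrary assumptions.

```python
# A minimal, illustrative message-passing step on a molecular graph
# (plain PyTorch; not a faithful reimplementation of any paper).
import torch
import torch.nn as nn
from rdkit import Chem

def mol_to_graph(smiles):
    """Build a crude one-hot atom feature matrix and an adjacency matrix."""
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()
    x = torch.zeros(n, 100)                      # one-hot on atomic number
    for atom in mol.GetAtoms():
        x[atom.GetIdx(), atom.GetAtomicNum()] = 1.0
    adj = torch.eye(n)                           # self-loops keep each atom's own state
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        adj[i, j] = adj[j, i] = 1.0
    return x, adj

class MessagePassingLayer(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)

    def forward(self, x, adj):
        # Each atom sums its neighbours' features, then applies a learned transform.
        return torch.relu(self.linear(adj @ x))

x, adj = mol_to_graph("CCO")                     # ethanol
h = MessagePassingLayer(100, 64)(x, adj)
graph_embedding = h.sum(dim=0)                   # "readout": sum over all atoms
print(graph_embedding.shape)                     # torch.Size([64])
```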
MRL can provide more expressive, flexible, and adaptive molecular descriptors for QSAR modeling, as it can capture the structural, chemical, and biological information of the molecules from different perspectives and levels of abstraction. MRL can also enable the generation of novel molecules with desired properties or activities, by decoding or optimizing the learned molecular representations.
Pretraining and Transfer Learning
Another challenge in QSAR modeling is the scarcity and heterogeneity of the labeled data, which may limit the performance and generalization of the QSAR models. Pretraining and transfer learning are effective methods to overcome this challenge, as they can leverage the large-scale unlabeled or labeled data from different sources or domains to improve the QSAR models on specific tasks or domains.
Pretraining trains an ML model on a large-scale unlabeled or labeled data set with a general or auxiliary objective, such as masked language modeling, next sentence prediction, contrastive learning, or multi-task learning. The pretrained model learns general patterns from the data that can be useful across many downstream tasks or domains.
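The snippet below is a toy, from-scratch sketch of the masked-token idea on SMILES characters; the character vocabulary, masking rate, and model are deliberately minimal and hypothetical.

```python
# A toy sketch of the masked-token pretraining objective on SMILES characters
# (illustrative only; real setups use proper tokenizers and transformer encoders).
import random
import torch
import torch.nn as nn

vocab = {c: i for i, c in enumerate("#()+-=1234567890BCFHINOPSclnos[]")}
MASK = len(vocab)                                  # extra id reserved for [MASK]

def mask_tokens(smiles, p=0.15):
    ids = [vocab[c] for c in smiles if c in vocab]
    inputs, labels = [], []
    for t in ids:
        if random.random() < p:
            inputs.append(MASK)
            labels.append(t)                       # model must recover this token
        else:
            inputs.append(t)
            labels.append(-100)                    # -100 is ignored by the loss
    if all(l == -100 for l in labels):             # ensure at least one masked position
        inputs[0], labels[0] = MASK, ids[0]
    return torch.tensor(inputs), torch.tensor(labels)

embed = nn.Embedding(len(vocab) + 1, 32)           # +1 for the mask id
head = nn.Linear(32, len(vocab))
inputs, labels = mask_tokens("CC(=O)Oc1ccccc1C(=O)O")
logits = head(embed(inputs))
loss = nn.functional.cross_entropy(logits, labels, ignore_index=-100)
loss.backward()                                    # one pretraining step's gradients
```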
Transfer learning adapts a pretrained model to a specific task or domain using a small amount of labeled data, by fine-tuning the model parameters or adding task-specific layers. The adapted model inherits general knowledge from the pretrained model while learning task-specific patterns from the new data.
Pretraining and transfer learning can be applied to QSAR modeling by using MRL models as the base models and molecular data sets for pretraining and fine-tuning. For example, ChemBERTa is a transformer-based MRL model that is pretrained on a large-scale unlabeled SMILES data set with masked language modeling, and then fine-tuned on various QSAR tasks. Uni-QSAR is an MRL model that combines 1D, 2D, and 3D molecular representations; it is pretrained on a large-scale labeled data set with multi-task learning and then fine-tuned on various QSAR tasks.
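A hedged sketch of this fine-tuning pattern with the Hugging Face transformers library is shown below; the ChemBERTa checkpoint name is one distributed on the Hugging Face Hub, and the labels are made-up activity values, so treat both as assumptions to verify.

```python
# A sketch of fine-tuning a pretrained SMILES transformer for QSAR regression
# (checkpoint name and activity values are assumptions; swap in your own).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=1, problem_type="regression"  # single regression head
)

smiles = ["CCO", "CC(=O)Oc1ccccc1C(=O)O"]
labels = torch.tensor([[-0.3], [1.2]])             # made-up activity values

batch = tokenizer(smiles, padding=True, return_tensors="pt")
out = model(**batch, labels=labels)
out.loss.backward()  # gradients flow into the pretrained encoder (fine-tuning)
```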
Pretraining and transfer learning can significantly improve the performance and generalization of QSAR models, as they can exploit the rich and diverse information from the large-scale data, and also adapt to the specific tasks or domains with the small-scale data.
AutoML and Ensemble Learning
Another challenge in QSAR modeling is the complexity and diversity of the ML models themselves, which can require substantial manual effort and expertise to select, tune, and integrate for different tasks or domains. AutoML and ensemble learning are powerful methods to automate and optimize this process, as they can search for, evaluate, and combine the best ML models for QSAR modeling.
AutoML automates model selection and hyperparameter tuning using optimization algorithms such as grid search, random search, Bayesian optimization, evolutionary algorithms, or reinforcement learning. It can find an optimal or near-optimal model and hyperparameter configuration for a given task or domain with little human intervention.
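As an illustration, the sketch below uses scikit-learn's RandomizedSearchCV, with random search standing in for the fancier optimizers listed above; the fingerprint matrix and activity values are synthetic stand-ins.

```python
# A minimal hyperparameter-search sketch with scikit-learn's RandomizedSearchCV
# (X and y are synthetic stand-ins for fingerprints and measured activities).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 512)).astype(float)   # stand-in fingerprints
y = rng.normal(size=200)                                 # stand-in activities

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [100, 300, 500],
        "max_depth": [None, 8, 16],
        "min_samples_leaf": [1, 3, 5],
    },
    n_iter=10, cv=5, scoring="neg_mean_squared_error", random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```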
Ensemble learning integrates multiple ML models, of the same or different types, using aggregation methods such as averaging, voting, stacking, or boosting. It can improve the performance and robustness of QSAR models by reducing the variance, bias, and overfitting of the individual models.
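The following sketch stacks two heterogeneous regressors with scikit-learn; the base models and meta-model are illustrative choices, and the data are the same kind of synthetic stand-ins as before.

```python
# A minimal ensembling sketch: stacking heterogeneous models with scikit-learn
# (model choices are illustrative; data are synthetic stand-ins).
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 512)).astype(float)
y = rng.normal(size=200)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(random_state=0)),
        ("gbm", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=Ridge(),   # meta-model learns how to weight the base models
)
print(cross_val_score(stack, X, y, cv=5, scoring="r2").mean())
```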
AutoML and ensemble learning can be applied to QSAR modeling by treating ML-based QSAR models as the candidate pool and QSAR data sets as the optimization and aggregation data. For example, DeepAutoQSAR is an AutoML tool that searches over and ranks a collection of ML-based QSAR models and hyperparameters by their fitness, then ensembles the best models by averaging. Chemprop, a message passing neural network package, supports ensembling multiple models trained on different data splits by averaging their predictions.
AutoML and ensemble learning can greatly simplify and enhance the QSAR modeling process, as they can automate and optimize the model selection, hyperparameter tuning, and model integration for different tasks or domains.
Evaluation and Interpretation
The final step in QSAR modeling is to evaluate and interpret the models, which is essential for validating and explaining them. Evaluation and interpretation are themselves challenging, as they require rigorous and reliable methods and metrics to assess the validity, reliability, and explainability of QSAR models.
Evaluation measures the performance and quality of QSAR models using metrics such as accuracy, precision, recall, F1-score, ROC-AUC, MSE, MAE, R², and Q². It can also assess validity and reliability through methods such as cross-validation, external validation, applicability domain analysis, and Y-scrambling.
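Below is a small sketch combining cross-validated R² with a Y-scrambling sanity check on synthetic data; a trustworthy model should beat its scrambled-label counterpart by a wide margin.

```python
# A hedged evaluation sketch: cross-validated R² plus a Y-scrambling check
# (synthetic data with a planted signal, so the contrast is visible).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 512)).astype(float)        # stand-in fingerprints
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=200)  # planted signal

model = RandomForestRegressor(random_state=0)
r2_real = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

y_scrambled = rng.permutation(y)                   # break the X-y relationship
r2_scrambled = cross_val_score(model, X, y_scrambled, cv=5, scoring="r2").mean()

print(f"R2 real: {r2_real:.2f}, R2 scrambled: {r2_scrambled:.2f}")
```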
Interpretation explains the rationale and mechanism behind a QSAR model's predictions, using methods such as feature importance, feature contributions, feature interactions, saliency maps, activation maps, and attention maps. It can also provide deeper insights and guidance through counterfactual, adversarial, and causal analyses.
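As a minimal example of model-agnostic interpretation, the sketch below uses scikit-learn's permutation feature importance on the same synthetic setup; which bits surface as important depends entirely on the assumed data.

```python
# A minimal interpretation sketch with permutation feature importance
# (model-agnostic; synthetic data with signal planted in the first 10 bits).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 512)).astype(float)
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=200)

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
top = result.importances_mean.argsort()[::-1][:5]
print("Most influential fingerprint bits:", top)   # should surface bits 0-9
```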
In practice, these methods are applied directly to the QSAR data sets and models at hand. For example, the Therapeutics Data Commons (TDC) lets us evaluate and compare ML-based QSAR models on 22 toxicity prediction tasks, such as acute toxicity, hERG blockers, and Ames mutagenicity. Using the TDC APIs, we can easily access the data sets, train and test models, and generate evaluation results and plots. We can also compare our models against state-of-the-art baselines such as Uni-QSAR, the AutoML framework discussed above that combines molecular representation learning and pretrained models. TDC thereby helps us understand the strengths and weaknesses of different QSAR models and improve them accordingly.
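For reference, a short sketch of pulling one such toxicity task through the TDC API (assuming the PyTDC package is installed, e.g. via `pip install PyTDC`) looks like this:

```python
# Fetching a TDC toxicity benchmark (the hERG task name is from the TDC docs).
from tdc.single_pred import Tox

data = Tox(name="hERG")
split = data.get_split()              # dict with 'train', 'valid', 'test' DataFrames
train = split["train"]
print(train.columns.tolist())         # ['Drug_ID', 'Drug', 'Y'] — SMILES in 'Drug'
```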
Conclusion
The fusion of quantitative structure-activity relationship (QSAR) modeling and machine learning addresses the limitations of traditional QSAR and enhances drug discovery. Molecular representation learning diversifies descriptors with 1D tokens, 2D graphs, and 3D conformers; pretraining and transfer learning tackle data scarcity; AutoML and ensemble learning automate model building; and evaluation and interpretation methods ensure model validity. Together, these advances give us accurate, generalizable, and interpretable QSAR models that speed up drug discovery and reduce its cost.