Welcome to CLM’s documentation!
A package to train and evaluate language models of chemical structures, as described in Qiang et al. [QWL+24].
Note that training and evaluating chemical language models requires significant computational resources. For example, running the default config.yaml file distributed with the repository will submit a total of 2102 jobs, with a total requested runtime of more than 75,000 hours and a maximum memory request of 256 GB. Actual runtimes will generally fall substantially below these requests, although unusually large datasets may take longer and require the default resource requests to be modified.

Resource requirements also vary substantially with the configuration. Key parameters include the dataset size, the number of cross-validation folds, the degree of non-canonical SMILES enumeration, and the amount of sampling performed from the trained models. For instance, performing three-fold cross-validation with non-canonical SMILES enumeration at a fixed augmentation factor of 10x would reduce the resource request by approximately 20-fold. Nonetheless, access to a high-performance computing cluster with GPU resources is strongly recommended for training and evaluating new models.
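To make the scaling above concrete, the minimal Python sketch below models compute cost as roughly proportional to the product of the number of cross-validation folds and the SMILES augmentation factor. The function name and the example values are illustrative assumptions only; they are not part of the clm API and do not reflect the actual defaults in config.yaml.

```python
# Minimal sketch: total compute is modeled as scaling with the product of
# the number of cross-validation folds and the SMILES augmentation factor.
# The values below are placeholders for illustration; consult config.yaml
# in the repository for the actual defaults.

def relative_cost(cv_folds: int, augmentation_factor: int) -> int:
    """Relative compute under a simple folds-times-augmentation model."""
    return cv_folds * augmentation_factor

# Comparing two candidate configurations before submitting any jobs:
reduced = relative_cost(cv_folds=3, augmentation_factor=10)
larger = relative_cost(cv_folds=10, augmentation_factor=30)
print(f"larger configuration costs ~{larger / reduced:.0f}x more")  # ~10x here
```

Dataset size and the amount of sampling from the trained models scale the cost further, so this product should be read as a rough lower bound rather than an exact job count.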
If you simply want to apply the trained DeepMet model to annotate metabolomics data, consider using the DeepMet web application.
[QWL+24] Hantao Qiang, Fei Wang, Wenyun Lu, Xi Xing, Hahn Kim, Sandrine A. M. Merette, Lucas B. Ayres, Eponine Oler, Jenna E. AbuSalim, Asael Roichman, and others. Language model-guided anticipation and discovery of unknown metabolites. bioRxiv, 2024.