Snakemake Workflow Steps

This document describes the purpose of the individual steps in the Snakemake workflow. Note that certain model evaluation and figure-generation steps have been omitted from this discussion.

Rulegraph

preprocess

Read the input SMILES and save a file of “canonical” SMILES. This entails converting each SMILES string to a molecule using RDKit, removing light fragments, neutralizing charges, filtering for valid elements, removing very rare tokens from the SMILES vocabulary (and eliminating the corresponding molecules), and saving the resulting SMILES. An InChIKey is added for each SMILES string, and any extra metadata supplied with an input SMILES string is preserved in the output file.
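The rare-token filter can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's implementation: the regex tokenizer is simplified, the names tokenize and drop_rare_tokens and the min_count threshold are hypothetical, and the sketch makes a single pass without recomputing counts after molecules are removed.

```python
import re
from collections import Counter

# Simplified SMILES tokenizer (an assumption for illustration): bracket atoms,
# the two-letter halogens Br and Cl, ring labels written as %NN, then any
# remaining single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

def drop_rare_tokens(smiles_list, min_count=2):
    """Drop molecules containing a token seen in fewer than min_count molecules."""
    counts = Counter()
    for smi in smiles_list:
        counts.update(set(tokenize(smi)))  # count each token once per molecule
    rare = {tok for tok, n in counts.items() if n < min_count}
    return [smi for smi in smiles_list if not rare & set(tokenize(smi))]

smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1", "C[Se]C", "CCCl", "ClCCN"]
kept = drop_rare_tokens(smiles, min_count=2)
print(kept)  # the [Se] token occurs in only one molecule, so C[Se]C is dropped
```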

create_training_sets

Split the input SMILES into separate files for train/test folds, and create .vocabulary files for the training SMILES. The number of folds is set in the config.yaml file. The SMILES can optionally be augmented by an enum factor (typically 10x to 100x), and typically are, to increase the size of the training set; subsequent steps of the pipeline are run for each enum factor in separate folders.
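As a rough sketch of the fold-splitting idea (augmentation and vocabulary files are left out), the code below shuffles the input and deals it into n_folds disjoint test sets, with the remainder of each fold used for training. The function name split_folds and its seed argument are hypothetical; the actual step may assign molecules to folds differently.

```python
import random

def split_folds(smiles_list, n_folds, seed=0):
    """Return a list of (train, test) pairs, one per fold."""
    shuffled = list(smiles_list)
    random.Random(seed).shuffle(shuffled)
    folds = []
    for k in range(n_folds):
        test = shuffled[k::n_folds]  # every n_folds-th item, starting at offset k
        train = [s for i, s in enumerate(shuffled) if i % n_folds != k]
        folds.append((train, test))
    return folds

smiles = ["CCO", "CCN", "CCC", "CCCl", "c1ccccc1", "CC(=O)O"]
folds = split_folds(smiles, n_folds=3)
for k, (train, test) in enumerate(folds):
    print(k, len(train), len(test))
```

Every molecule appears in exactly one test set, so evaluation on the test folds covers the whole input once.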

When creating training sets, the prefix train_ is used for files that may be augmented, and train0_ for the non-augmented files. The former are used for training, the latter for evaluation. Similarly, the prefixes test_ and test0_ are used for the test files.

This step also optionally takes a min_tc parameter to keep only SMILES that are similar to a randomly selected seed SMILES string.
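The sketch below illustrates seed-based similarity filtering, assuming that min_tc is a minimum Tanimoto coefficient. The character-bigram "fingerprint" is only a stand-in that keeps the example dependency-free; a real implementation would compare molecular fingerprints (for example, computed with RDKit). All function names here are hypothetical.

```python
import random

def bigrams(smiles):
    # Stand-in "fingerprint": the set of adjacent character pairs.
    # This is NOT a chemical fingerprint; it only makes the sketch runnable.
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}

def tanimoto(a, b):
    """Tanimoto coefficient of two sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def filter_by_seed(smiles_list, min_tc, seed=0):
    """Pick a random seed molecule and keep entries at least min_tc similar to it."""
    seed_smiles = random.Random(seed).choice(smiles_list)
    seed_fp = bigrams(seed_smiles)
    kept = [s for s in smiles_list if tanimoto(bigrams(s), seed_fp) >= min_tc]
    return seed_smiles, kept

smiles = ["CCO", "CCCO", "CCN", "c1ccccc1"]
seed_smiles, kept = filter_by_seed(smiles, min_tc=0.5)
print(seed_smiles, kept)
```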

add_carbon

This step generates data for the AddCarbon model by inserting a carbon atom at a random position in training set SMILES. This is useful for model evaluation later on (in the write_structural_prior_CV step).
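The insertion itself can be sketched in a few lines. This is a string-level illustration under the assumption that candidates are generated by plain character insertion; many such candidates are not valid SMILES, so a real implementation would need to parse each one (for example, with RDKit) and discard failures. The name add_carbon is reused here for a hypothetical helper.

```python
import random

def add_carbon(smiles, rng):
    """Insert a single 'C' at a random position in the SMILES string."""
    pos = rng.randrange(len(smiles) + 1)  # positions 0..len(smiles) inclusive
    return smiles[:pos] + "C" + smiles[pos:]

rng = random.Random(0)
candidates = [add_carbon("CC(=O)O", rng) for _ in range(5)]
print(candidates)  # validity is not checked in this sketch
```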

collapse_train_test_data

This step merges the per-fold files generated in the create_training_sets step into a single training file and a single test file. This is useful for the write_structural_prior_CV_overall step, explained later.

train_models_RNN

This step trains one or more RNN models on an input fold; the number of models equals the number of train_seeds entries in config.yaml. This is the most time- and resource-intensive step in the entire pipeline. The specifics of the model (number of layers, epochs, batch size, etc.) can be configured in the model_params section of the config.yaml file.

Each training fold is internally split into training and validation sets using a split ratio of 0.9. The trained model and loss files are written out.

sample_molecules_RNN

This step samples sample_mols SMILES (typically on the order of 10M) from a trained model and fold, and saves them to an output file. It is possible to sample multiple times from the same trained model using multiple sample_seed values.

tabulate_molecules

For the SMILES sampled from a model and a particular fold, this step adds the calculated molecular mass, formula, and sampling frequency. SMILES that are found in the training file for that particular fold are filtered out, based on a comparison of InChIKeys.
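A minimal sketch of the frequency count and training-set filter is shown below. It assumes the InChIKey of each sampled SMILES string has already been computed; the inchikey_of mapping, the placeholder key strings, and the row field names are all hypothetical, and the mass and formula columns are omitted.

```python
from collections import Counter

def tabulate(sampled, inchikey_of, train_keys):
    """Count sampled SMILES and drop those whose InChIKey is in the training set."""
    rows = []
    for smi, n in Counter(sampled).most_common():  # highest frequency first
        key = inchikey_of[smi]
        if key in train_keys:
            continue  # molecule was seen during training, so it is filtered out
        rows.append({"smiles": smi, "inchikey": key, "frequency": n})
    return rows

sampled = ["CCO", "CCN", "CCO", "CCC", "CCN", "CCO"]
# Placeholder keys for illustration only; these are not real InChIKeys.
inchikey_of = {"CCO": "KEY-CCO", "CCN": "KEY-CCN", "CCC": "KEY-CCC"}
rows = tabulate(sampled, inchikey_of, train_keys={"KEY-CCN"})
print(rows)
```

Comparing by InChIKey rather than by string means that two different SMILES strings for the same molecule are treated as the same entry.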

collect_tabulated_molecules

This step aggregates the sampled SMILES from all samples obtained across all |train_seeds| models in a fold, and adds aggregate metrics such as sampling frequency.

write_structural_prior_CV

This step evaluates test SMILES against the trained models for each fold, with the PubChem and AddCarbon models as baselines.

For each fold, and for each SMILES string in the test dataset, we generate statistics for the occurrence of that test SMILES string in each of the 3 “models” (trained_smiles/sampled_smiles/PubChem).

For each of the 3 “models”, we keep track of the SMILES that fall within some tolerance of the true molecular weight of the test SMILES string. When these are sorted by decreasing sampling frequency, the position at which we find the “correct” SMILES string (that is, an exact match on the SMILES string) gives the “rank” of each test SMILES string. Lower ranks indicate a better model. This output is written to ranks_file.
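The ranking procedure described above can be sketched as follows. Several details are assumptions made for illustration: the tolerance is expressed in ppm, ranks are 1-based, a missing target yields None, and the function name and tuple layout are hypothetical. The masses in the example are approximate.

```python
def rank_of_target(candidates, target_smiles, target_mass, ppm_tol=10.0):
    """candidates: list of (smiles, mass, frequency) tuples for one "model"."""
    window = abs(target_mass) * ppm_tol / 1e6
    # Keep only candidates whose mass lies within the tolerance window.
    in_window = [c for c in candidates if abs(c[1] - target_mass) <= window]
    # Most frequently sampled candidates come first.
    in_window.sort(key=lambda c: c[2], reverse=True)
    for rank, (smi, _mass, _freq) in enumerate(in_window, start=1):
        if smi == target_smiles:
            return rank
    return None  # the target was never generated within the mass window

candidates = [
    ("CCO", 46.0419, 120),  # ethanol
    ("COC", 46.0419, 300),  # dimethyl ether, same formula and mass
    ("CCN", 45.0578, 500),  # ethylamine, outside the mass window
]
rank = rank_of_target(candidates, "CCO", 46.0419)
print(rank)  # 2: only COC is in the window with a higher frequency
```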

write_formula_prior_CV

This step is analogous to write_structural_prior_CV, but here “correctness” is based on a matching formula, not a matching SMILES string.

process_tabulated_molecules

This step aggregates sampled SMILES across all folds, calculating metrics like average sampling frequency (if metric=freq-avg).
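One plausible reading of the freq-avg metric is sketched below: sum each molecule's sampling frequency over the folds and divide by the number of folds, so that a molecule absent from a fold contributes zero. Whether the pipeline averages over all folds or only over the folds in which a molecule appears is not stated here, so treat this as an assumption; the function name and input layout are hypothetical.

```python
from collections import defaultdict

def average_frequency(fold_tables):
    """fold_tables: list of {smiles: frequency} dicts, one per fold."""
    totals = defaultdict(float)
    for table in fold_tables:
        for smi, n in table.items():
            totals[smi] += n
    n_folds = len(fold_tables)
    # A molecule missing from a fold counts as frequency 0 in that fold.
    return {smi: total / n_folds for smi, total in totals.items()}

fold_tables = [{"CCO": 4, "CCN": 2}, {"CCO": 2}]
averaged = average_frequency(fold_tables)
print(averaged)
```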

write_structural_prior_CV_overall

This step is analogous to write_structural_prior_CV, but evaluates test SMILES against the trained models across all folds.