Killer features¶
Pick and mix featurizers, predictors and similarity filters. Track your experiments in an easily-modifiable file.
We use Google’s gin-config, a “lightweight configuration framework for Python”. The featurizer, predictory and filter components of ADMET-XSpec can be found in the project root directory under ./configs:
├── configs
│ ...
│ ├── featurizers
│ ├── predictors
│ ├── sim_filters
│ ...
You can set up your own experiment by following our examples under ./configs/examples. Simply copy these .gin files and modify them, naming them whatever you like to track your experiments.
To understand more about how to configure ProcessingPipeline and produce a model, visit Guide 3.3: “Training and choosing your processing_plan”.
Add new data in whichever directory structure you like: provide it with a
friendly_name, outline a set of basic facts about it, and be on your way.
The datasets we used for our experiments are in ./data/datasets, with the directory structure serving as an organizational aid. The source of truth about a dataset - including its friendly_name for locating it - can be found in its accompanying params.yaml file. Here’s an example:
AChE/mouse/binary_classification/params.yaml:
1. friendly_name: "AChE_mouse_IC50"
2. raw_or_derived: "raw"
3. category: "AChE_IC50"
4. is_chembl: true
5. task_setting: "binary_classification"
6. filter_criteria:
7. Standard Units:
8. - "nM"
9. Standard Type:
10. - "IC50"
11. threshold: null
12. threshold_source: "AChE_human_IC50"
Twelve lines of configuration isn’t bad at all! You can find guidance on configuring params.yaml in Guide 3.1: “Sourcing and setting up data”.
Track splits, models, metrics, and logs. Every product of your
ProcessingPipelineruns is stored indata/cache.
Let this tree ./data/cache output serve as an example:
├── models
│ └── LightGBM_clf_ecfp_featurizer_4b52a
│ └── scaffold_e4737_tanimoto_5p_filter_c2805_91da5
│ ├── hyperparams.yaml
│ ├── metrics.yaml
│ ├── model_final_refit.pkl
│ ├── model_metadata.yaml
│ ├── model.pkl
│ ├── operative_config.gin
│ └── training_log
│ └── console.log
└── splits
├── registry.txt
├── scaffold_e4737_tanimoto_5p_filter_c2805_660d3
│ ├── console.log
│ ├── operative_config.gin
│ ├── test
│ │ ├── data.csv
│ │ └── params.yaml
│ └── train
│ ├── data.csv
│ └── params.yaml
There are two subdirectories that interest us: models and splits.
The first contains trained models ready for use in InferencePipeline. The following files are also outputted:
Hyperparameters (of particular interest when running optimization; see Guide 3.3: “Training & choosing your processing plan”)
Metrics on the test set
Final model refits on the entire training data
Pipeline metadata related to ADMET-XSpec
The
.ginconfig settings, referred to as an “operative_config”The training log (i.e., the output of
logging.infoor whatever level you set)
The second contains dataset splits ready to be reused and reconfigured for subsequent ProcessingPipeline runs. You can run the ProcessingPipeline without training a model and use it only for data splitting. You can then feed the split data to train a model of your choice. You can also reuse split data from previous training runs.
Here’s an outline of the contents of splits:
registry.txt, which contains a list of all splits (theirfriendly_names) withindata/cache/splits, updated on eachProcessingPipelinerun Within each run’s resulting splits (e.g.,scaffold_e4737_tanimoto_5p_filter_c2805_660d3):The splitting log (i.e., the output of
logging)The
.ginconfig settingsThe created train split (derived dataset), along with its
params.yamlThe created test split (derived dataset), along with its
params.yaml