Killer features¶

Pick and mix featurizers, predictors and similarity filters. Track your experiments in an easily-modifiable file.

We use Google’s gin-config, a “lightweight configuration framework for Python”. The featurizer, predictory and filter components of ADMET-XSpec can be found in the project root directory under ./configs:

├── configs
│   ...
│   ├── featurizers
│   ├── predictors
│   ├── sim_filters
│   ...

You can set up your own experiment by following our examples under ./configs/examples. Simply copy these .gin files and modify them, naming them whatever you like to track your experiments.

To understand more about how to configure ProcessingPipeline and produce a model, visit Guide 3.3: “Training and choosing your processing_plan”.

Add new data in whichever directory structure you like: provide it with a friendly_name, outline a set of basic facts about it, and be on your way.

The datasets we used for our experiments are in ./data/datasets, with the directory structure serving as an organizational aid. The source of truth about a dataset - including its friendly_name for locating it - can be found in its accompanying params.yaml file. Here’s an example:

AChE/mouse/binary_classification/params.yaml:

 friendly_name: "AChE_mouse_IC50"
 raw_or_derived: "raw"
 category: "AChE_IC50"
 is_chembl: true
 task_setting: "binary_classification"
 filter_criteria:
     Standard Units:
       - "nM"
     Standard Type:
      - "IC50"
threshold: null
threshold_source: "AChE_human_IC50"

Twelve lines of configuration isn’t bad at all! You can find guidance on configuring params.yaml in Guide 3.1: “Sourcing and setting up data”.

Track splits, models, metrics, and logs. Every product of your ProcessingPipeline runs is stored in data/cache.

Let this tree ./data/cache output serve as an example:

├── models
│   └── LightGBM_clf_ecfp_featurizer_4b52a
│       └── scaffold_e4737_tanimoto_5p_filter_c2805_91da5
│           ├── hyperparams.yaml
│           ├── metrics.yaml
│           ├── model_final_refit.pkl
│           ├── model_metadata.yaml
│           ├── model.pkl
│           ├── operative_config.gin
│           └── training_log
│               └── console.log
└── splits
    ├── registry.txt
    ├── scaffold_e4737_tanimoto_5p_filter_c2805_660d3
    │   ├── console.log
    │   ├── operative_config.gin
    │   ├── test
    │   │   ├── data.csv
    │   │   └── params.yaml
    │   └── train
    │       ├── data.csv
    │       └── params.yaml

There are two subdirectories that interest us: models and splits.

The first contains trained models ready for use in InferencePipeline. The following files are also outputted:

Hyperparameters (of particular interest when running optimization; see Guide 3.3: “Training & choosing your processing plan”)
Metrics on the test set
Final model refits on the entire training data
Pipeline metadata related to ADMET-XSpec
The .gin config settings, referred to as an “operative_config”
The training log (i.e., the output of logging.info or whatever level you set)

The second contains dataset splits ready to be reused and reconfigured for subsequent ProcessingPipeline runs. You can run the ProcessingPipeline without training a model and use it only for data splitting. You can then feed the split data to train a model of your choice. You can also reuse split data from previous training runs.

Here’s an outline of the contents of splits:

registry.txt, which contains a list of all splits (their friendly_names) within data/cache/splits, updated on each ProcessingPipeline run Within each run’s resulting splits (e.g., scaffold_e4737_tanimoto_5p_filter_c2805_660d3):
1. The splitting log (i.e., the output of logging)
2. The .gin config settings
3. The created train split (derived dataset), along with its params.yaml
4. The created test split (derived dataset), along with its params.yaml