Killer features
===============
1. Pick and mix featurizers, predictors and similarity filters. Track your experiments in an easily-modifiable file.

We use Google's [gin-config](https://github.com/google/gin-config), a "lightweight configuration framework for Python". The featurizer, predictory and filter components of ADMET-XSpec can be found in the project root directory under `./configs`:
```
├── configs
│   ...
│   ├── featurizers
│   ├── predictors
│   ├── sim_filters
│   ...
```

You can set up your own experiment by following our examples under `./configs/examples`. Simply copy these `.gin` files and modify them, naming them whatever you like to track your experiments.

To understand more about how to configure `ProcessingPipeline` and produce a model, visit Guide 3.3: "Training and choosing your `processing_plan`".


2. Add new data in whichever directory structure you like: provide it with a `friendly_name`, outline a set of basic facts about it, and be on your way.

The datasets we used for our experiments are in `./data/datasets`, with the directory structure serving as an organizational aid. The source of truth about a dataset - including its `friendly_name` for locating it - can be found in its accompanying `params.yaml` file. Here's an example:

`AChE/mouse/binary_classification/params.yaml`:
```yaml
1.  friendly_name: "AChE_mouse_IC50"
2.  raw_or_derived: "raw"
3.  category: "AChE_IC50"
4.  is_chembl: true
5.  task_setting: "binary_classification"
6.  filter_criteria:
7.      Standard Units:
8.        - "nM"
9.      Standard Type:
10.       - "IC50"
11. threshold: null
12. threshold_source: "AChE_human_IC50"
```

Twelve lines of configuration isn't bad at all! You can find guidance on configuring `params.yaml` in Guide 3.1: "Sourcing and setting up data".


3. Track splits, models, metrics, and logs. Every product of your `ProcessingPipeline` runs is stored in `data/cache`.

Let this `tree ./data/cache` output serve as an example:
```bash
├── models
│   └── LightGBM_clf_ecfp_featurizer_4b52a
│       └── scaffold_e4737_tanimoto_5p_filter_c2805_91da5
│           ├── hyperparams.yaml
│           ├── metrics.yaml
│           ├── model_final_refit.pkl
│           ├── model_metadata.yaml
│           ├── model.pkl
│           ├── operative_config.gin
│           └── training_log
│               └── console.log
└── splits
    ├── registry.txt
    ├── scaffold_e4737_tanimoto_5p_filter_c2805_660d3
    │   ├── console.log
    │   ├── operative_config.gin
    │   ├── test
    │   │   ├── data.csv
    │   │   └── params.yaml
    │   └── train
    │       ├── data.csv
    │       └── params.yaml
```

There are two subdirectories that interest us: `models` and `splits`.

The first contains trained models ready for use in `InferencePipeline`. The following files are also outputted:
1. Hyperparameters (of particular interest when running optimization; see Guide 3.3: "Training & choosing your processing plan")
2. Metrics on the test set
3. Final model refits on the entire training data
4. Pipeline metadata related to ADMET-XSpec
5. The `.gin` config settings, referred to as an "operative_config"
6. The training log (i.e., the output of `logging.info` or whatever level you set)

The second contains dataset splits ready to be reused and reconfigured for subsequent `ProcessingPipeline` runs. You can run the `ProcessingPipeline` without training a model and use it only for data splitting. You can then feed the split data to train a model of your choice. You can also reuse split data from previous training runs.

Here's an outline of the contents of `splits`:

- `registry.txt`, which contains a list of all splits (their `friendly_names`) within `data/cache/splits`, updated on each `ProcessingPipeline` run Within each run's resulting splits (e.g., `scaffold_e4737_tanimoto_5p_filter_c2805_660d3`):
    1. The splitting log (i.e., the output of `logging`)
    2. The `.gin` config settings
    3. The created train split (_derived dataset_), along with its `params.yaml`
    4. The created test split (_derived dataset_), along with its `params.yaml`