Choosing your "processing plan" =============================== #### Introductory notes ADMET-XSpec lets you switch between splitting and preprocessing data and training models on compositions of that data quickly, within a single config file. In addition, you can opt in or out of generating visualizations for both the original body of data and your compositions (mixed splits). To support this functionality, we use what we call "processing plans". The name reflects the fact that they control the `ProcessingPipeline`. #### Goals After reading this section, you should understand: 1. What processing plans represent and how they connect to the `run` method of ProcessingPipeline 2. Which processing plans are possible and how to configure them 3. The create/load distinction with train/test splits At a high level, a processing plan controls these 9 steps in the execution of the `ProcessingPipeline`: ``` # Step 1: Load datasets # Step 2: Visualize raw datasets # Step 3: Create(*) train/test splits # Step 4: Save train/test splits # Step 5: Visualize train/test splits # Step 6: Load optimized hyperparameters # Step 7: Optimize hyperparameters # Step 8: Train and evaluate the model # Step 9: Refit final model on full dataset ``` These steps are comments taken from the `run` method itself. They are placed between control flow blocks to disable certain parts from running with an `if-else`. If you want deeper insight by looking into the methods called by `ProcessingPipeline`, you can check out the `run` method. However, you will do fine by sticking to one of the processing plans we have provided in `configs/processing_plans`. These are used in our set of `configs/examples`. Let's look at `configs/processing_plans/train_optimize.gin`, the one you are likely to use most often. With this processing plan, the `ProcessingPipeline` will: 1. Load your raw ChEMBL datasets and preprocess them. 2. **Not** visualize the preprocessed datasets, since you set that step to 'False'. 3. Create(\*) your training and test splits, and save them to cache. 4. **Not** visualize the train/test splits, since you set that step to 'False'. 5. **Not** load hyperparameters found to be optimal in a different run, since you set that to 'False'. More on this in Guide 3.4: "Training and optimization". 6. Find optimal hyperparameters for training and train a model on them, as well as refit the model on the entire train+test dataset and generate metrics based on that. ```bash ProcessingPipeline.do_load_datasets = True ProcessingPipeline.do_visualize_datasets = False ProcessingPipeline.do_load_train_test = True ProcessingPipeline.do_dump_train_test = True ProcessingPipeline.do_visualize_train_test = False ProcessingPipeline.do_load_optimized_hyperparams = False ProcessingPipeline.do_optimize_hyperparams = True ProcessingPipeline.do_train_model = True ProcessingPipeline.do_refit_final_model = True ``` The file `configs/processing_plans/_possible_plans.gin` serves as a reminder of what plans you can create whenever you find yourself outside of our docs. The (\*) symbol highlights the ambiguity that may arise from referring to the train/test split stage as "creation" on the one hand and "loading" on the other. Consider the following two processing plans, which we will call "Just visualize raw" and "Train from select splits": "Just visualize raw" ```bash ProcessingPipeline.do_load_datasets = True ProcessingPipeline.do_visualize_datasets = True ProcessingPipeline.do_load_train_test = False ProcessingPipeline.do_dump_train_test = False ProcessingPipeline.do_visualize_train_test = False ProcessingPipeline.do_load_optimized_hyperparams = False ProcessingPipeline.do_optimize_hyperparams = False ProcessingPipeline.do_train_model = False ProcessingPipeline.do_refit_final_model = False ``` "Train on select splits" ```bash ProcessingPipeline.do_load_datasets = False ProcessingPipeline.do_visualize_datasets = False ProcessingPipeline.do_load_train_test = True ProcessingPipeline.do_dump_train_test = False ProcessingPipeline.do_visualize_train_test = False ProcessingPipeline.do_load_optimized_hyperparams = False ProcessingPipeline.do_optimize_hyperparams = True ProcessingPipeline.do_train_model = True ProcessingPipeline.do_refit_final_model = True ``` You can see how in the first situation we wish to simply process the original data in some way without training a model, and in the second situation we do not want to interact with the original data at all, since we already have generated splits and aim to train a model on those splits. These splits have been saved to disk (in `data/cache`) and are therefore "loaded". In this way, ambiguity arises when we run `ProcessingPipeline` on original data, "creating splits" in the process and immediately proceeding to train a model on those splits. This is, in fact, the only way to create splits: the original data must be loaded, and then splits must be dumped. That is when they are "created". This is not reflected in the step's name: `do_load_train_test`. To reconcile this, we can think of the `ProcessingPipeline` as immediately "loading" the splits it had created.