src.processing_pipeline

class src.processing_pipeline.ProcessingPipeline(do_load_datasets: bool, do_visualize_datasets: bool, do_load_train_test: bool, do_dump_train_test: bool, do_visualize_train_test: bool, do_load_optimized_hyperparams: bool, do_optimize_hyperparams: bool, do_train_model: bool, do_refit_final_model: bool, data_interface: DataInterface, predictor: PredictorBase | None = None, featurizer: FeaturizerBase | None = None, reducer: ReducerBase | None = None, splitter: DataSplitterBase | None = None, sim_filter: SimilarityFilterBase | None = None, datasets: List[str] | None = None, manual_train_splits: List[str] | None = None, manual_test_splits: List[str] | None = None, test_origin_dataset: str | None = None, task_setting: str = 'regression', smiles_col: str = 'smiles', source_col: str = 'source', target_col: str = 'y', logfile: str | None = None, override_cache: bool = False)

Orchestrate dataset loading, splitting, filtering, visualization, and model training.

Manages the complete workflow, from raw datasets through train/test splitting, optional similarity filtering, visualization via dimensionality reduction, hyperparameter optimization, model training, and evaluation. Each stage is switched on or off by its corresponding do_* flag.

Parameters:
  • do_load_datasets (bool) – Whether to load and prepare datasets

  • do_visualize_datasets (bool) – Whether to generate visualizations of raw datasets

  • do_load_train_test (bool) – Whether to create or load train/test splits

  • do_dump_train_test (bool) – Whether to save train/test splits to disk

  • do_visualize_train_test (bool) – Whether to visualize train/test splits

  • do_load_optimized_hyperparams (bool) – Whether to load pre-optimized hyperparameters

  • do_optimize_hyperparams (bool) – Whether to run hyperparameter optimization

  • do_train_model (bool) – Whether to train and evaluate model

  • do_refit_final_model (bool) – Whether to refit on combined train+test data

  • data_interface (DataInterface) – Interface for dataset loading and persistence

  • predictor (PredictorBase or None) – Machine learning model for prediction

  • featurizer (FeaturizerBase or None) – Molecular featurizer (if predictor doesn’t use internal featurization)

  • reducer (ReducerBase or None) – Dimensionality reducer for visualization

  • splitter (DataSplitterBase or None) – Strategy for train/test splitting

  • sim_filter (SimilarityFilterBase or None) – Similarity filter for augmentation data

  • datasets (List[str] or None) – List of dataset friendly names to load

  • manual_train_splits (List[str] or None) – Pre-split training set names (alternative to splitter)

  • manual_test_splits (List[str] or None) – Pre-split test set names (alternative to splitter)

  • test_origin_dataset (str or None) – Dataset name defining test origin for filtering

  • task_setting (str) – Task type ('regression' or 'binary_classification')

  • smiles_col (str) – Column name for SMILES strings

  • source_col (str) – Column name for dataset source labels

  • target_col (str) – Column name for target values

  • logfile (str or None) – Path to log file

  • override_cache (bool) – Whether to regenerate cached datasets

Variables:
  • split_key – Cache key for current train/test split configuration

  • predictor_key – Cache key for predictor configuration

  • optimized_hyperparameters – Loaded or optimized hyperparameters
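All nine do_* flags are required arguments without defaults, so a call site has to spell out each one. The following is a minimal sketch of one way to assemble them, assuming nothing beyond the flag names in the signature above. The helper make_flags and the objects my_interface and my_predictor are hypothetical and not part of src.processing_pipeline.

```python
from typing import Dict

# Flag names copied from the ProcessingPipeline signature, in signature order.
STEP_FLAGS = (
    "do_load_datasets",
    "do_visualize_datasets",
    "do_load_train_test",
    "do_dump_train_test",
    "do_visualize_train_test",
    "do_load_optimized_hyperparams",
    "do_optimize_hyperparams",
    "do_train_model",
    "do_refit_final_model",
)


def make_flags(*enabled: str) -> Dict[str, bool]:
    """Hypothetical helper: return all nine flags, True only for the named ones."""
    unknown = set(enabled) - set(STEP_FLAGS)
    if unknown:
        raise ValueError(f"unknown step flags: {sorted(unknown)}")
    return {name: name in enabled for name in STEP_FLAGS}


# Enable loading, splitting, and training; every other step stays off.
kwargs = make_flags("do_load_datasets", "do_load_train_test", "do_train_model")

# Optional settings, shown here with their documented default values.
kwargs.update(task_setting="regression", smiles_col="smiles", target_col="y")

# The actual call is left as a comment because it needs project objects
# (my_interface and my_predictor are hypothetical placeholders):
# pipeline = ProcessingPipeline(data_interface=my_interface,
#                               predictor=my_predictor, **kwargs)
```

Passing the flags by keyword keeps the call readable and guards against swapping two adjacent booleans in the long positional list.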

run() → None

Execute complete pipeline workflow.

Runs configured steps in sequence:

  1. Load datasets
  2. Visualize raw datasets (if enabled)
  3. Create train/test splits (if enabled)
  4. Save splits (if enabled)
  5. Visualize splits (if enabled)
  6. Load optimized hyperparameters (if enabled)
  7. Optimize hyperparameters (if enabled)
  8. Train and evaluate model (if enabled)
  9. Refit final model on full data (if enabled)

Return type:

None
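The flag-gated sequencing that run() describes can be illustrated with a toy class. This is only a sketch of the general pattern, not the actual implementation: MiniPipeline, its private step methods, and the executed list are hypothetical, and only three of the nine flags are modelled.

```python
from typing import Callable, List, Tuple


class MiniPipeline:
    """Toy stand-in for illustration; not the real ProcessingPipeline."""

    def __init__(
        self,
        do_load_datasets: bool,
        do_train_model: bool,
        do_refit_final_model: bool,
    ) -> None:
        self.do_load_datasets = do_load_datasets
        self.do_train_model = do_train_model
        self.do_refit_final_model = do_refit_final_model
        # Records which steps ran, in order (hypothetical attribute).
        self.executed: List[str] = []

    def _load_datasets(self) -> None:
        self.executed.append("load_datasets")

    def _train_model(self) -> None:
        self.executed.append("train_model")

    def _refit_final_model(self) -> None:
        self.executed.append("refit_final_model")

    def run(self) -> None:
        # Pair each flag with its step; list order fixes execution order.
        steps: List[Tuple[bool, Callable[[], None]]] = [
            (self.do_load_datasets, self._load_datasets),
            (self.do_train_model, self._train_model),
            (self.do_refit_final_model, self._refit_final_model),
        ]
        for enabled, step in steps:
            if enabled:
                step()


pipeline = MiniPipeline(
    do_load_datasets=True,
    do_train_model=False,
    do_refit_final_model=True,
)
pipeline.run()
print(pipeline.executed)
```

Disabled steps are skipped without affecting the relative order of the enabled ones, which is the behaviour the numbered list implies.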