src.data.data_interface

class src.data.data_interface.DataInterface(dataset_dir: str, cache_dir: str, visualizations_dir: str, data_config_filename: str, prepared_filename: str, registry_filename: str = 'registry.txt')

Manage dataset loading, normalization, and train/test split persistence.

Variables:
  • possible_smiles_cols – Column names to search for SMILES strings

  • possible_label_cols – Column names to search for target labels

  • logfile – Path to logging output file

  • override_cache – Whether to regenerate cached prepared datasets

  • task_setting – Task type (regression, binary_classification, multi_class_classification)

check_train_test_split_exists(cache_key: str) → bool

Check if train/test split files exist for given cache key.

Parameters:

cache_key (str) – Unique identifier for the split

Returns:

True if both train and test CSV files exist

Return type:

bool
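
A minimal sketch of the check described above. The helper name `split_files_exist`, the `splits_dir` argument, and the `train.csv` / `test.csv` file names are assumptions for illustration; the layout actually used by `DataInterface` is not documented here.

```python
from pathlib import Path


def split_files_exist(splits_dir: Path, cache_key: str) -> bool:
    """Return True only if both split CSVs are present (hypothetical layout)."""
    split_dir = splits_dir / cache_key
    # Assumed file names; the real class may name these differently.
    train_path = split_dir / "train.csv"
    test_path = split_dir / "test.csv"
    return train_path.is_file() and test_path.is_file()
```

Requiring both files means a half-written split (for example, only the train CSV) is treated as missing and can be regenerated.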

dump_gin_config(path: Path) → None

Write current Gin configuration to specified directory.

Parameters:

path (Path) – Directory where config should be written

Return type:

None

dump_gin_config_to_model_dir(model_cache_key: str, data_cache_key: str) → None

Write Gin configuration to model directory.

Parameters:
  • model_cache_key (str) – Model identifier

  • data_cache_key (str) – Data split identifier

Return type:

None

dump_logs(path: Path) → None

Copy log file contents to specified directory.

Parameters:

path (Path) – Directory where logs should be written

Return type:

None
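
One way the copy step could look, as a sketch only. The function name `copy_log_to_dir` and the choice to keep the original file name are assumptions; the real method reads the log path from the instance's `logfile` attribute.

```python
import shutil
from pathlib import Path


def copy_log_to_dir(logfile: Path, target_dir: Path) -> Path:
    """Copy logfile into target_dir, creating the directory if needed."""
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy returns the destination path of the copied file.
    return Path(shutil.copy(logfile, target_dir / logfile.name))
```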

dump_training_logs(model_cache_key: str, data_cache_key: str) → None

Copy training logs to model directory.

Parameters:
  • model_cache_key (str) – Model identifier

  • data_cache_key (str) – Data split identifier

Return type:

None

get_by_friendly_name(friendly_name: str, is_in_splits=False) → DataFrame

Load dataset by friendly name from either prepared datasets or splits.

Parameters:
  • friendly_name (str) – Human-readable dataset identifier

  • is_in_splits (bool) – If True, load from the splits directory; otherwise load from the prepared datasets

Returns:

Loaded dataset

Return type:

pd.DataFrame

static get_clean_smiles_df(df: DataFrame, smiles_col: str) → DataFrame

Canonicalize SMILES strings and remove rows with invalid SMILES.

Parameters:
  • df (pd.DataFrame) – DataFrame containing SMILES column

  • smiles_col (str) – Name of column containing SMILES strings

Returns:

DataFrame with canonicalized SMILES, invalid rows removed

Return type:

pd.DataFrame

classmethod get_label_col_in_raw(raw_df: DataFrame) → str

Identify label column in raw dataset by checking known column names.

Parameters:

raw_df (pd.DataFrame) – Raw dataset DataFrame

Returns:

Name of label column

Return type:

str

Raises:

ValueError – If no label column found in DataFrame
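
The lookup can be sketched roughly as follows. The candidate names in `POSSIBLE_LABEL_COLS` are placeholders (the real list lives in `possible_label_cols`), `find_label_col` is a hypothetical helper, and returning the first match in candidate order is an assumption about tie-breaking.

```python
from typing import Iterable, Sequence

# Hypothetical candidate names; the real list is held in possible_label_cols.
POSSIBLE_LABEL_COLS = ("y", "label", "activity")


def find_label_col(columns: Iterable[str],
                   candidates: Sequence[str] = POSSIBLE_LABEL_COLS) -> str:
    """Return the first candidate name that appears among the columns."""
    present = set(columns)
    for name in candidates:
        if name in present:
            return name
    raise ValueError(f"No label column found; tried {list(candidates)}")
```

With a pandas DataFrame, the call would be `find_label_col(raw_df.columns)`.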

get_normalized_df(df_to_prepare: DataFrame) → DataFrame

Normalize dataset to standard format with ‘smiles’ and ‘y’ columns.

Performs column renaming, SMILES canonicalization, and removal of rows with NaN values in either column.

Parameters:

df_to_prepare (pd.DataFrame) – Raw dataset DataFrame

Returns:

Normalized DataFrame with columns ‘smiles’ and ‘y’

Return type:

pd.DataFrame

Raises:

RuntimeError – If normalized DataFrame lacks ‘smiles’ or ‘y’ columns

classmethod get_smiles_col_in_raw(raw_df: DataFrame) → str

Identify SMILES column in raw dataset by checking known column names.

Parameters:

raw_df (pd.DataFrame) – Raw dataset DataFrame

Returns:

Name of SMILES column

Return type:

str

Raises:

ValueError – If no SMILES column found in DataFrame

get_split_friendly_names(cache_key: str) → Tuple[str, str]

Retrieve friendly names for train and test splits from cache key.

Parameters:

cache_key (str) – Unique identifier for the split

Returns:

Tuple of (train_friendly_name, test_friendly_name)

Return type:

Tuple[str, str]

Raises:
  • FileNotFoundError – If params file not found for train or test split

  • RuntimeError – If friendly_name not found in params file

get_train_test_friendly_names(cache_key: str) → str

Retrieve friendly name from train split params.

Parameters:

cache_key (str) – Unique identifier for the split

Returns:

Friendly name from train split parameters

Return type:

str

Raises:
  • FileNotFoundError – If params file not found for split

  • RuntimeError – If friendly_name not found in params file

load_hyperparams(model_cache_key: str, data_cache_key: str) → Dict

Load model hyperparameters from YAML file.

Parameters:
  • model_cache_key (str) – Model identifier

  • data_cache_key (str) – Data split identifier

Returns:

Dictionary of hyperparameter names and values

Return type:

Dict

Raises:

FileNotFoundError – If params file does not exist

pickle_model(model: PredictorBase, model_cache_key: str, data_cache_key: str, save_as_refit: bool = False) → None

Serialize trained model to disk using pickle.

Parameters:
  • model (PredictorBase) – Trained predictor instance

  • model_cache_key (str) – Model identifier (e.g., algorithm type)

  • data_cache_key (str) – Data split identifier

  • save_as_refit (bool) – If True, save as refit model; else as standard model

Return type:

None

save_hyperparams(params: Dict, model_cache_key: str, data_cache_key: str) → None

Save model hyperparameters to YAML file.

Parameters:
  • params (Dict) – Dictionary of hyperparameter names and values

  • model_cache_key (str) – Model identifier

  • data_cache_key (str) – Data split identifier

Return type:

None

save_metrics(metrics: Dict, model_cache_key: str, data_cache_key: str) → None

Save model evaluation metrics to YAML file.

Parameters:
  • metrics (Dict) – Dictionary of metric names and values

  • model_cache_key (str) – Model identifier

  • data_cache_key (str) – Data split identifier

Return type:

None

save_model_metadata(metadata: Dict, model_cache_key: str, data_cache_key: str) → None

Save model metadata to YAML file.

Parameters:
  • metadata (Dict) – Dictionary of metadata key-value pairs

  • model_cache_key (str) – Model identifier

  • data_cache_key (str) – Data split identifier

Return type:

None

save_train_test_split(train_df: DataFrame, test_df: DataFrame, cache_key: str, split_friendly_name: str, classification_or_regression: str) → None

Persist train/test split to disk with metadata.

Saves both DataFrames as CSV files, creates params YAML files with metadata, and dumps Gin config and console logs to the split directory.

Parameters:
  • train_df (pd.DataFrame) – Training set DataFrame

  • test_df (pd.DataFrame) – Test set DataFrame

  • cache_key (str) – Unique identifier for this split

  • split_friendly_name (str) – Human-readable name for this split

  • classification_or_regression (str) – Task type string

Return type:

None

save_visualization(friendly_name: str, visualization: Image) → None

Save visualization image to visualizations directory.

Parameters:
  • friendly_name (str) – Identifier used in output filename

  • visualization (Image) – PIL Image object to save

Return type:

None

set_logfile(logfile: str) → None

Set the path for log file output.

Parameters:

logfile (str) – Path to log file

Return type:

None

set_override_cache(override_cache: bool) → None

Control whether to regenerate cached prepared datasets.

Parameters:

override_cache (bool) – If True, ignore existing cached datasets

Return type:

None

set_task_setting(task_setting: str) → None

Set the machine learning task type.

Parameters:

task_setting (str) – One of ‘regression’, ‘binary_classification’, ‘multi_class_classification’

Raises:

AssertionError – If task_setting is not one of the allowed values

Return type:

None
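
A sketch of the validation this setter performs, using the three values listed above. `TaskConfig` is a hypothetical stand-in for the relevant part of `DataInterface`, and the error message text is an assumption.

```python
ALLOWED_TASK_SETTINGS = (
    "regression",
    "binary_classification",
    "multi_class_classification",
)


class TaskConfig:
    """Hypothetical stand-in for the part of DataInterface that stores the task."""

    def __init__(self) -> None:
        self.task_setting = None

    def set_task_setting(self, task_setting: str) -> None:
        # Reject anything outside the documented set of task types.
        assert task_setting in ALLOWED_TASK_SETTINGS, (
            f"task_setting must be one of {ALLOWED_TASK_SETTINGS}, got {task_setting!r}"
        )
        self.task_setting = task_setting
```

Because the check is an `assert`, it is skipped when Python runs with the `-O` flag; callers that need the check in optimized mode would have to validate the value themselves.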

unpickle_model(model_cache_key: str, data_cache_key: str) → PredictorBase

Deserialize trained model from disk.

Parameters:
  • model_cache_key (str) – Model identifier

  • data_cache_key (str) – Data split identifier

Returns:

Loaded predictor instance

Return type:

PredictorBase

Raises:

FileNotFoundError – If model file does not exist at expected path
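
The save/load round trip can be sketched with the standard `pickle` module. Everything about the path is an assumption here: the `model_path` helper, the `<cache_dir>/<data_cache_key>/<model_cache_key>/model.pkl` layout, and the explicit `cache_dir` argument are illustrative only, and the `save_as_refit` variant is omitted.

```python
import pickle
from pathlib import Path
from typing import Any


def model_path(cache_dir: Path, model_cache_key: str, data_cache_key: str) -> Path:
    # Assumed layout: <cache_dir>/<data_cache_key>/<model_cache_key>/model.pkl
    return cache_dir / data_cache_key / model_cache_key / "model.pkl"


def pickle_model(model: Any, cache_dir: Path,
                 model_cache_key: str, data_cache_key: str) -> None:
    """Serialize the model, creating parent directories as needed."""
    path = model_path(cache_dir, model_cache_key, data_cache_key)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(model, fh)


def unpickle_model(cache_dir: Path, model_cache_key: str, data_cache_key: str) -> Any:
    """Load a previously pickled model or raise FileNotFoundError."""
    path = model_path(cache_dir, model_cache_key, data_cache_key)
    if not path.is_file():
        raise FileNotFoundError(f"No model file at {path}")
    with path.open("rb") as fh:
        return pickle.load(fh)
```

Unpickling executes code embedded in the file, so model files should only be loaded from a cache directory you trust.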

update_datasets_registry() → None

Scan dataset directory and write friendly names to registry file.

Searches for YAML files that contain a friendly_name field and writes each friendly name to the registry file.

Return type:

None

update_registries() → None

Update registry files.

Currently only the splits registry is updated; the datasets registry is not refreshed by this call.

Return type:

None

update_splits_registry() → None

Scan splits directory and write friendly names with timestamps to registry.

Searches for YAML params files, extracts the friendly name from each, reads the matching timestamp from the console logs, sorts the entries by timestamp, and writes them to the registry file.

Return type:

None
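
The sort-and-write step could look like the sketch below. It assumes the friendly names and timestamps have already been collected as `(name, datetime)` pairs; the helper name `write_splits_registry`, the oldest-first ordering, and the `<timestamp> <name>` line format are all assumptions rather than the documented registry format.

```python
from datetime import datetime
from pathlib import Path
from typing import Iterable, Tuple


def write_splits_registry(entries: Iterable[Tuple[str, datetime]],
                          registry_path: Path) -> None:
    """Write one '<ISO timestamp> <friendly name>' line per split."""
    # Sort oldest first; the real ordering direction is an assumption.
    ordered = sorted(entries, key=lambda entry: entry[1])
    lines = [f"{ts.isoformat()} {name}\n" for name, ts in ordered]
    registry_path.write_text("".join(lines), encoding="utf-8")
```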