src.data.data_interface

class src.data.data_interface.DataInterface(dataset_dir: str, cache_dir: str, visualizations_dir: str, data_config_filename: str, prepared_filename: str, registry_filename: str = 'registry.txt')

Manage dataset loading, normalization, and train/test split persistence.

Variables:

possible_smiles_cols – Column names to search for SMILES strings
possible_label_cols – Column names to search for target labels
logfile – Path to logging output file
override_cache – Whether to regenerate cached prepared datasets
task_setting – Task type (regression, binary_classification, multi_class_classification)

check_train_test_split_exists(cache_key: str) → bool

Check if train/test split files exist for given cache key.

Parameters: cache_key (str) – Unique identifier for the split
Returns: True if both train and test CSV files exist
Return type: bool
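
The check described above amounts to testing for two files. A minimal sketch, assuming a hypothetical layout of `<splits_dir>/<cache_key>/train.csv` and `test.csv`; the function name `split_files_exist` and the file names are illustrative and not taken from the library:

```python
from pathlib import Path


def split_files_exist(splits_dir: Path, cache_key: str) -> bool:
    """Return True only if both split CSVs exist for cache_key.

    The layout <splits_dir>/<cache_key>/{train,test}.csv is an assumption
    made for this sketch, not the documented on-disk layout.
    """
    split_dir = splits_dir / cache_key
    train_csv = split_dir / "train.csv"
    test_csv = split_dir / "test.csv"
    # Both files must be present; a lone train.csv counts as "no split".
    return train_csv.is_file() and test_csv.is_file()
```

Requiring both files means a partially written split is treated as missing, so a caller would regenerate it instead of loading half of it.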

dump_gin_config(path: Path) → None

Write current Gin configuration to specified directory.

Parameters: path (Path) – Directory where config should be written
Return type: None

dump_gin_config_to_model_dir(model_cache_key: str, data_cache_key: str) → None

Write Gin configuration to model directory.

Parameters:

model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier

Return type:

None

dump_logs(path: Path) → None

Copy log file contents to specified directory.

Parameters: path (Path) – Directory where logs should be written
Return type: None

dump_training_logs(model_cache_key: str, data_cache_key: str) → None

Copy training logs to model directory.

Parameters:

model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier

Return type:

None

get_by_friendly_name(friendly_name: str, is_in_splits=False) → DataFrame

Load dataset by friendly name from either prepared datasets or splits.

Parameters:

friendly_name (str) – Human-readable dataset identifier
is_in_splits (bool) – If True, load from splits directory; else from datasets

Returns:

Loaded dataset

Return type:

pd.DataFrame

static get_clean_smiles_df(df: DataFrame, smiles_col: str) → DataFrame

Canonicalize SMILES strings and remove rows with invalid SMILES.

Parameters:

df (pd.DataFrame) – DataFrame containing SMILES column
smiles_col (str) – Name of column containing SMILES strings

Returns:

DataFrame with canonicalized SMILES, invalid rows removed

Return type:

pd.DataFrame

classmethod get_label_col_in_raw(raw_df: DataFrame) → str

Identify label column in raw dataset by checking known column names.

Parameters: raw_df (pd.DataFrame) – Raw dataset DataFrame
Returns: Name of label column
Return type: str
Raises: ValueError – If no label column found in DataFrame

get_normalized_df(df_to_prepare: DataFrame) → DataFrame

Normalize dataset to standard format with ‘smiles’ and ‘y’ columns.

Performs column renaming, SMILES canonicalization, and removal of rows with NaN values in either column.

Parameters: df_to_prepare (pd.DataFrame) – Raw dataset DataFrame
Returns: Normalized DataFrame with columns ‘smiles’ and ‘y’
Return type: pd.DataFrame
Raises: RuntimeError – If normalized DataFrame lacks ‘smiles’ or ‘y’ columns
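
The renaming and NaN-removal steps described above can be sketched with plain pandas. This is an illustration under stated assumptions, not the library's implementation: the helper name `normalize_columns` is hypothetical, the source column names are passed in explicitly, only the two target columns are kept, and SMILES canonicalization is deliberately left out.

```python
import pandas as pd


def normalize_columns(df: pd.DataFrame, smiles_col: str, label_col: str) -> pd.DataFrame:
    """Rename to 'smiles'/'y' and drop rows with NaN in either column.

    SMILES canonicalization is intentionally omitted from this sketch.
    """
    out = df.rename(columns={smiles_col: "smiles", label_col: "y"})
    # Mirror the documented failure mode when a required column is absent.
    missing = {"smiles", "y"} - set(out.columns)
    if missing:
        raise RuntimeError(f"Normalized DataFrame lacks columns: {sorted(missing)}")
    # Keeping only the two target columns is an assumption of this sketch.
    out = out[["smiles", "y"]].dropna(subset=["smiles", "y"])
    return out.reset_index(drop=True)


# Made-up example data: one row lacks a SMILES string, one lacks a label.
raw = pd.DataFrame({"Drug": ["CCO", None, "c1ccccc1"], "Y": [0.5, 1.0, None]})
normalized = normalize_columns(raw, smiles_col="Drug", label_col="Y")
```

In this example only the first row has both a SMILES string and a label, so `normalized` contains that single row.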

classmethod get_smiles_col_in_raw(raw_df: DataFrame) → str

Identify SMILES column in raw dataset by checking known column names.

Parameters: raw_df (pd.DataFrame) – Raw dataset DataFrame
Returns: Name of SMILES column
Return type: str
Raises: ValueError – If no SMILES column found in DataFrame
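
Both column-detection class methods (`get_label_col_in_raw` and `get_smiles_col_in_raw`) are described as checking a list of known names. A minimal sketch of that lookup, assuming the first matching candidate wins; the helper name and the candidate lists below are hypothetical stand-ins for `possible_smiles_cols` and `possible_label_cols`, whose real values are not shown in this reference:

```python
from typing import Iterable, Sequence


def find_first_known_column(columns: Iterable[str], candidates: Sequence[str], kind: str) -> str:
    """Return the first candidate name that is present in columns."""
    available = set(columns)
    for name in candidates:
        if name in available:
            return name
    # Same exception type as documented for the class methods.
    raise ValueError(f"No {kind} column found; tried {list(candidates)}")


# Hypothetical candidate lists, used only for illustration.
smiles_candidates = ["smiles", "SMILES", "Drug"]
label_candidates = ["y", "Y", "label"]

# With a DataFrame, raw_df.columns can be passed in place of this list.
raw_columns = ["Drug_ID", "Drug", "Y"]
smiles_col = find_first_known_column(raw_columns, smiles_candidates, "SMILES")
label_col = find_first_known_column(raw_columns, label_candidates, "label")
```

Matching here is exact and case-sensitive, so the order of the candidate list decides which column is chosen when several are present.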

get_split_friendly_names(cache_key: str) → Tuple[str, str]

Retrieve friendly names for train and test splits from cache key.

Parameters:

cache_key (str) – Unique identifier for the split

Returns:

Tuple of (train_friendly_name, test_friendly_name)

Return type:

Tuple[str, str]

Raises:

FileNotFoundError – If params file not found for train or test split
RuntimeError – If friendly_name not found in params file

get_train_test_friendly_names(cache_key: str) → str

Retrieve friendly name from train split params.

Parameters:

cache_key (str) – Unique identifier for the split

Returns:

Friendly name from train split parameters

Return type:

str

Raises:

FileNotFoundError – If params file not found for split
RuntimeError – If friendly_name not found in params file

load_hyperparams(model_cache_key: str, data_cache_key: str) → Dict

Load model hyperparameters from YAML file.

Parameters:

model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier

Returns:

Dictionary of hyperparameter names and values

Return type:

Dict

Raises:

FileNotFoundError – If params file does not exist

pickle_model(model: PredictorBase, model_cache_key: str, data_cache_key: str, save_as_refit: bool = False) → None

Serialize trained model to disk using pickle.

Parameters:

model (PredictorBase) – Trained predictor instance
model_cache_key (str) – Model identifier (e.g., algorithm type)
data_cache_key (str) – Data split identifier
save_as_refit (bool) – If True, save as refit model; else as standard model

Return type:

None

save_hyperparams(params: Dict, model_cache_key: str, data_cache_key: str) → None

Save model hyperparameters to YAML file.

Parameters:

params (Dict) – Dictionary of hyperparameter names and values
model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier

Return type:

None

save_metrics(metrics: Dict, model_cache_key: str, data_cache_key: str) → None

Save model evaluation metrics to YAML file.

Parameters:

metrics (Dict) – Dictionary of metric names and values
model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier

Return type:

None

save_model_metadata(metadata: Dict, model_cache_key: str, data_cache_key: str) → None

Save model metadata to YAML file.

Parameters:

metadata (Dict) – Dictionary of metadata key-value pairs
model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier

Return type:

None

save_train_test_split(train_df: DataFrame, test_df: DataFrame, cache_key: str, split_friendly_name: str, classification_or_regression: str) → None

Persist train/test split to disk with metadata.

Saves both DataFrames as CSV files, creates params YAML files with metadata, and dumps Gin config and console logs to the split directory.

Parameters:

train_df (pd.DataFrame) – Training set DataFrame
test_df (pd.DataFrame) – Test set DataFrame
cache_key (str) – Unique identifier for this split
split_friendly_name (str) – Human-readable name for this split
classification_or_regression (str) – Task type string

Return type:

None

save_visualization(friendly_name: str, visualization: Image) → None

Save visualization image to visualizations directory.

Parameters:

friendly_name (str) – Identifier used in output filename
visualization (Image) – PIL Image object to save

Return type:

None

set_logfile(logfile: str) → None

Set the path for log file output.

Parameters: logfile (str) – Path to log file
Return type: None

set_override_cache(override_cache: bool) → None

Control whether to regenerate cached prepared datasets.

Parameters: override_cache (bool) – If True, ignore existing cached datasets
Return type: None

set_task_setting(task_setting: str) → None

Set the machine learning task type.

Parameters: task_setting (str) – One of ‘regression’, ‘binary_classification’, ‘multi_class_classification’
Raises: AssertionError – If task_setting is not one of the allowed values
Return type: None
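
The validation described above can be sketched as an assertion against the three allowed strings. The constant and function names here are hypothetical; only the allowed values and the `AssertionError` come from the documentation:

```python
ALLOWED_TASK_SETTINGS = (
    "regression",
    "binary_classification",
    "multi_class_classification",
)


def validate_task_setting(task_setting: str) -> str:
    """Return task_setting unchanged if it is allowed, else fail an assertion."""
    assert task_setting in ALLOWED_TASK_SETTINGS, (
        f"task_setting must be one of {ALLOWED_TASK_SETTINGS}, got {task_setting!r}"
    )
    return task_setting
```

Because the check is an `assert`, it is skipped when Python runs with the `-O` flag, so callers should not rely on it to reject bad input in optimized runs.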

unpickle_model(model_cache_key: str, data_cache_key: str) → PredictorBase

Deserialize trained model from disk.

Parameters:

model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier

Returns:

Loaded predictor instance

Return type:

PredictorBase

Raises:

FileNotFoundError – If model file does not exist at expected path
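
The `pickle_model` / `unpickle_model` pair describes a round trip through the standard `pickle` module, keyed by the model and data cache keys. A minimal sketch of that round trip follows; the directory layout, the file names `model.pkl` and `model_refit.pkl`, and all three helper names are assumptions made for illustration:

```python
import pickle
from pathlib import Path
from typing import Any


def model_path(root: Path, model_cache_key: str, data_cache_key: str, refit: bool = False) -> Path:
    # Hypothetical layout: <root>/<data_cache_key>/<model_cache_key>/model[_refit].pkl
    name = "model_refit.pkl" if refit else "model.pkl"
    return root / data_cache_key / model_cache_key / name


def save_model(model: Any, path: Path) -> None:
    """Pickle any picklable object to path, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(model, fh)


def load_model(path: Path) -> Any:
    """Load a pickled object, raising FileNotFoundError if it is missing."""
    if not path.is_file():
        raise FileNotFoundError(f"No pickled model at {path}")
    with path.open("rb") as fh:
        return pickle.load(fh)
```

Unpickling can execute arbitrary code, so model files should only be loaded from locations you trust.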

update_datasets_registry() → None

Scan dataset directory and write friendly names to registry file.

Searches for YAML files containing friendly_name field and writes them to registry file.

Return type: None

update_registries() → None

Update all registry files.

Currently this updates only the splits registry.

Return type: None

update_splits_registry() → None

Scan splits directory and write friendly names with timestamps to registry.

Searches for YAML params files, extracts friendly names and timestamps from console logs, sorts by timestamp, and writes to registry file.

Return type: None
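
The final sort-and-write step of the registry update can be sketched as follows. This assumes the friendly names and timestamps have already been extracted; the helper name, the tab-separated line format, and the oldest-first ordering are all assumptions, since the reference does not specify them:

```python
from datetime import datetime
from pathlib import Path
from typing import Iterable, Tuple


def write_registry(entries: Iterable[Tuple[str, datetime]], registry_path: Path) -> None:
    """Write one '<timestamp><TAB><friendly_name>' line per entry, oldest first."""
    # Sort on the timestamp, the second element of each (name, timestamp) pair.
    ordered = sorted(entries, key=lambda item: item[1])
    lines = [f"{stamp.isoformat()}\t{name}" for name, stamp in ordered]
    text = "\n".join(lines) + ("\n" if lines else "")
    registry_path.write_text(text, encoding="utf-8")
```

Writing the whole file in one call replaces any previous registry contents, which matches the idea of rebuilding the registry from a fresh scan.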