src.data.data_interface
- class src.data.data_interface.DataInterface(dataset_dir: str, cache_dir: str, visualizations_dir: str, data_config_filename: str, prepared_filename: str, registry_filename: str = 'registry.txt')
Manage dataset loading, normalization, and train/test split persistence.
- Variables:
possible_smiles_cols – Column names to search for SMILES strings
possible_label_cols – Column names to search for target labels
logfile – Path to logging output file
override_cache – Whether to regenerate cached prepared datasets
task_setting – Task type (regression, binary_classification, multi_class_classification)
- check_train_test_split_exists(cache_key: str) → bool
Check if train/test split files exist for given cache key.
- Parameters:
cache_key (str) – Unique identifier for the split
- Returns:
True if both train and test CSV files exist
- Return type:
bool
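The check above can be sketched as follows. This is a minimal illustration, assuming a hypothetical layout in which each split is stored under `<splits_dir>/<cache_key>/` as `train.csv` and `test.csv`; the directory layout, the file names, and the `split_files_exist` helper are assumptions for illustration and are not taken from the class itself.

```python
from pathlib import Path


def split_files_exist(splits_dir: Path, cache_key: str) -> bool:
    """Return True only if both split CSVs exist (hypothetical layout)."""
    split_dir = Path(splits_dir) / cache_key
    # Both files must be present; a lone train.csv does not count as a split.
    return (split_dir / "train.csv").is_file() and (split_dir / "test.csv").is_file()
```

Requiring both files means a partially written split is treated as missing, so a caller can safely regenerate it.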
- dump_gin_config(path: Path) → None
Write current Gin configuration to specified directory.
- Parameters:
path (Path) – Directory where config should be written
- Return type:
None
- dump_gin_config_to_model_dir(model_cache_key: str, data_cache_key: str) → None
Write Gin configuration to model directory.
- Parameters:
model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier
- Return type:
None
- dump_logs(path: Path) → None
Copy log file contents to specified directory.
- Parameters:
path (Path) – Directory where logs should be written
- Return type:
None
- dump_training_logs(model_cache_key: str, data_cache_key: str) → None
Copy training logs to model directory.
- Parameters:
model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier
- Return type:
None
- get_by_friendly_name(friendly_name: str, is_in_splits=False) → DataFrame
Load dataset by friendly name from either prepared datasets or splits.
- Parameters:
friendly_name (str) – Human-readable dataset identifier
is_in_splits (bool) – If True, load from splits directory; else from datasets
- Returns:
Loaded dataset
- Return type:
pd.DataFrame
- static get_clean_smiles_df(df: DataFrame, smiles_col: str) → DataFrame
Canonicalize SMILES strings and remove rows with invalid SMILES.
- Parameters:
df (pd.DataFrame) – DataFrame containing SMILES column
smiles_col (str) – Name of column containing SMILES strings
- Returns:
DataFrame with canonicalized SMILES, invalid rows removed
- Return type:
pd.DataFrame
- classmethod get_label_col_in_raw(raw_df: DataFrame) → str
Identify label column in raw dataset by checking known column names.
- Parameters:
raw_df (pd.DataFrame) – Raw dataset DataFrame
- Returns:
Name of label column
- Return type:
str
- Raises:
ValueError – If no label column found in DataFrame
- get_normalized_df(df_to_prepare: DataFrame) → DataFrame
Normalize dataset to standard format with ‘smiles’ and ‘y’ columns.
Performs column renaming, SMILES canonicalization, and removal of rows with NaN values in either column.
- Parameters:
df_to_prepare (pd.DataFrame) – Raw dataset DataFrame
- Returns:
Normalized DataFrame with columns ‘smiles’ and ‘y’
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If normalized DataFrame lacks ‘smiles’ or ‘y’ columns
- classmethod get_smiles_col_in_raw(raw_df: DataFrame) → str
Identify SMILES column in raw dataset by checking known column names.
- Parameters:
raw_df (pd.DataFrame) – Raw dataset DataFrame
- Returns:
Name of SMILES column
- Return type:
str
- Raises:
ValueError – If no SMILES column found in DataFrame
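A minimal sketch of this kind of column lookup, written against a plain list of column names so that it runs without pandas (with a DataFrame, `raw_df.columns` would be passed instead). The helper `find_first_known_column`, the candidate names, and the rule that candidates are tried in list order are all assumptions for illustration; the actual candidates live in `possible_smiles_cols`, and the same pattern would apply to `possible_label_cols`.

```python
from typing import Iterable, Sequence


def find_first_known_column(columns: Iterable[str], candidates: Sequence[str], kind: str) -> str:
    """Return the first candidate name present in `columns`, else raise ValueError."""
    available = set(columns)
    # Assumption: candidates are tried in list order, so earlier names win.
    for name in candidates:
        if name in available:
            return name
    raise ValueError(f"No {kind} column found; tried {list(candidates)}")


# Hypothetical candidate list, not the real contents of possible_smiles_cols.
POSSIBLE_SMILES_COLS = ["smiles", "SMILES", "Drug"]

smiles_col = find_first_known_column(["Drug_ID", "Drug", "Y"], POSSIBLE_SMILES_COLS, "SMILES")
```

Raising `ValueError` when nothing matches keeps the failure close to the data-loading step instead of surfacing later as a missing-column error.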
- get_split_friendly_names(cache_key: str) → Tuple[str, str]
Retrieve friendly names for train and test splits from cache key.
- Parameters:
cache_key (str) – Unique identifier for the split
- Returns:
Tuple of (train_friendly_name, test_friendly_name)
- Return type:
Tuple[str, str]
- Raises:
FileNotFoundError – If params file not found for train or test split
RuntimeError – If friendly_name not found in params file
- get_train_test_friendly_names(cache_key: str) → str
Retrieve friendly name from train split params.
- Parameters:
cache_key (str) – Unique identifier for the split
- Returns:
Friendly name from train split parameters
- Return type:
str
- Raises:
FileNotFoundError – If params file not found for split
RuntimeError – If friendly_name not found in params file
- load_hyperparams(model_cache_key: str, data_cache_key: str) → Dict
Load model hyperparameters from YAML file.
- Parameters:
model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier
- Returns:
Dictionary of hyperparameter names and values
- Return type:
Dict
- Raises:
FileNotFoundError – If params file does not exist
- pickle_model(model: PredictorBase, model_cache_key: str, data_cache_key: str, save_as_refit: bool = False) → None
Serialize trained model to disk using pickle.
- Parameters:
model (PredictorBase) – Trained predictor instance
model_cache_key (str) – Model identifier (e.g., algorithm type)
data_cache_key (str) – Data split identifier
save_as_refit (bool) – If True, save as refit model; else as standard model
- Return type:
None
- save_hyperparams(params: Dict, model_cache_key: str, data_cache_key: str) → None
Save model hyperparameters to YAML file.
- Parameters:
params (Dict) – Dictionary of hyperparameter names and values
model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier
- Return type:
None
- save_metrics(metrics: Dict, model_cache_key: str, data_cache_key: str) → None
Save model evaluation metrics to YAML file.
- Parameters:
metrics (Dict) – Dictionary of metric names and values
model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier
- Return type:
None
- save_model_metadata(metadata: Dict, model_cache_key: str, data_cache_key: str) → None
Save model metadata to YAML file.
- Parameters:
metadata (Dict) – Dictionary of metadata key-value pairs
model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier
- Return type:
None
- save_train_test_split(train_df: DataFrame, test_df: DataFrame, cache_key: str, split_friendly_name: str, classification_or_regression: str) → None
Persist train/test split to disk with metadata.
Saves both DataFrames as CSV files, creates params YAML files with metadata, and dumps Gin config and console logs to the split directory.
- Parameters:
train_df (pd.DataFrame) – Training set DataFrame
test_df (pd.DataFrame) – Test set DataFrame
cache_key (str) – Unique identifier for this split
split_friendly_name (str) – Human-readable name for this split
classification_or_regression (str) – Task type string
- Return type:
None
- save_visualization(friendly_name: str, visualization: Image) → None
Save visualization image to visualizations directory.
- Parameters:
friendly_name (str) – Identifier used in output filename
visualization (Image) – PIL Image object to save
- Return type:
None
- set_logfile(logfile: str) → None
Set the path for log file output.
- Parameters:
logfile (str) – Path to log file
- Return type:
None
- set_override_cache(override_cache: bool) → None
Control whether to regenerate cached prepared datasets.
- Parameters:
override_cache (bool) – If True, ignore existing cached datasets
- Return type:
None
- set_task_setting(task_setting: str) → None
Set the machine learning task type.
- Parameters:
task_setting (str) – One of ‘regression’, ‘binary_classification’, ‘multi_class_classification’
- Raises:
AssertionError – If task_setting is not one of the allowed values
- Return type:
None
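The validation described above can be sketched with a small stand-in class. `TaskSettingHolder` and `ALLOWED_TASK_SETTINGS` are hypothetical names introduced for illustration; only the three allowed values and the `AssertionError` behavior come from the documentation.

```python
ALLOWED_TASK_SETTINGS = ("regression", "binary_classification", "multi_class_classification")


class TaskSettingHolder:
    """Hypothetical stand-in for the part of DataInterface that stores task_setting."""

    def __init__(self) -> None:
        self.task_setting = None

    def set_task_setting(self, task_setting: str) -> None:
        # Mirrors the documented contract: unknown values raise AssertionError.
        assert task_setting in ALLOWED_TASK_SETTINGS, (
            f"task_setting must be one of {ALLOWED_TASK_SETTINGS}, got {task_setting!r}"
        )
        self.task_setting = task_setting
```

Note that `assert` statements are skipped when Python runs with the `-O` flag, so a check written this way only guards against bad values in non-optimized runs.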
- unpickle_model(model_cache_key: str, data_cache_key: str) → PredictorBase
Deserialize trained model from disk.
- Parameters:
model_cache_key (str) – Model identifier
data_cache_key (str) – Data split identifier
- Returns:
Loaded predictor instance
- Return type:
PredictorBase
- Raises:
FileNotFoundError – If model file does not exist at expected path
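A minimal sketch of the pickle round trip that `pickle_model` and `unpickle_model` describe. The on-disk layout (`<models_dir>/<data_cache_key>/<model_cache_key>/model.pkl`), the `models_dir` argument, and the `model_path` helper are assumptions made for illustration; the real class resolves its paths internally, and the `save_as_refit` variant is omitted here.

```python
import pickle
from pathlib import Path
from typing import Any


def model_path(models_dir: Path, model_cache_key: str, data_cache_key: str) -> Path:
    # Hypothetical layout: one directory per (data split, model) pair.
    return Path(models_dir) / data_cache_key / model_cache_key / "model.pkl"


def pickle_model(model: Any, models_dir: Path, model_cache_key: str, data_cache_key: str) -> None:
    path = model_path(models_dir, model_cache_key, data_cache_key)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(model, fh)


def unpickle_model(models_dir: Path, model_cache_key: str, data_cache_key: str) -> Any:
    path = model_path(models_dir, model_cache_key, data_cache_key)
    # Fail with an explicit path so a missing model is easy to diagnose.
    if not path.is_file():
        raise FileNotFoundError(f"No pickled model at {path}")
    with path.open("rb") as fh:
        return pickle.load(fh)
```

Because `pickle.load` can execute arbitrary code during deserialization, model files should only be loaded from locations you trust.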
- update_datasets_registry() → None
Scan dataset directory and write friendly names to registry file.
Searches for YAML files that contain a friendly_name field and writes those friendly names to the registry file.
- Return type:
None
- update_registries() → None
Update the registry files.
Currently only the splits registry is updated.
- Return type:
None
- update_splits_registry() → None
Scan splits directory and write friendly names with timestamps to registry.
Searches for YAML params files, extracts friendly names and timestamps from console logs, sorts by timestamp, and writes to registry file.
- Return type:
None
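The final sort-and-write step of the registry update can be sketched as below. This covers only the ordering and output, not the YAML or console-log parsing. The `write_splits_registry` helper, the tab-separated line format, and the oldest-first ordering are assumptions for illustration; the documentation states only that entries are sorted by timestamp.

```python
from datetime import datetime
from pathlib import Path
from typing import Iterable, Tuple


def write_splits_registry(entries: Iterable[Tuple[str, datetime]], registry_path: Path) -> None:
    """Write one `timestamp<TAB>friendly_name` line per split, oldest first."""
    # Assumption: ascending order; the real registry may sort newest first.
    ordered = sorted(entries, key=lambda entry: entry[1])
    lines = [f"{ts.isoformat()}\t{name}" for name, ts in ordered]
    Path(registry_path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
```

Rewriting the whole file on each update keeps the registry consistent with what is on disk, at the cost of rescanning every split.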