[FEAT] Refactor STFT preprocessing and training pipeline into importable modules #48

Open
opened 2025-04-22 04:19:59 +00:00 by nuluh · 0 comments
nuluh commented 2025-04-22 04:19:59 +00:00 (Migrated from github.com)

Problem Statement

The current notebook containing STFT processing, train-test-split, and labeling functionality has grown complex and difficult to read. The code for data splitting and labeling is embedded within notebook cells, making it hard to maintain and reuse across different experiments. This approach limits code reusability and makes the notebook less readable for thesis documentation purposes.

Proposed Solution

Refactor the data splitting and labeling code from the notebook into properly structured Python modules that can be imported. This will:

  1. Create dedicated Python modules in the src/ directory for:

    • Dataset splitting functionality (train/test/validation splits)
    • Labeling generation and management
    • Model training pipeline components
  2. Clean up the notebook to focus on experiment flow, visualization, and results rather than implementation details.

  3. Implement proper documentation, typing, and error handling in the new modules.

Alternatives Considered

  • Keeping code in notebook but moving to separate notebook cells: This would improve readability but wouldn't address reusability.
  • Using Jupyter notebook extensions to collapse code: Helps with readability but doesn't improve maintainability.
  • Creating Python scripts without proper module structure: Would help with reusability but might create import issues.

Component

Python Source Code

Priority

High (significantly improves workflow)

Implementation Ideas

# Proposed module structure:
# src/
# └── ml/
#     ├── __init__.py
#     ├── data_splitting.py
#     ├── labeling.py
#     └── training.py
from typing import Optional

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split as sklearn_split


def create_train_test_split(
    data: pd.DataFrame,
    test_size: float = 0.2,
    random_state: int = 42,
    stratify: Optional[np.ndarray] = None,
) -> tuple:
    """
    Create a stratified train-test split from STFT data.
    
    Parameters:
    -----------
    data : pd.DataFrame
        The input DataFrame containing STFT data
    test_size : float
        Proportion of data to use for testing (default: 0.2)
    random_state : int
        Random seed for reproducibility (default: 42)
    stratify : np.ndarray, optional
        Labels to use for stratified sampling
        
    Returns:
    --------
    tuple
        (X_train, X_test, y_train, y_test) - Split datasets
    """
    # Extract features and labels
    X = data.drop('label_column', axis=1) if 'label_column' in data.columns else data
    y = data['label_column'] if 'label_column' in data.columns else stratify
    
    # Create split; fall back to the extracted labels for stratification
    # when no explicit stratify array is passed, so a label column found
    # in the DataFrame is actually used for stratified sampling
    X_train, X_test, y_train, y_test = sklearn_split(
        X, y, test_size=test_size, random_state=random_state,
        stratify=stratify if stratify is not None else y,
    )
    
    return X_train, X_test, y_train, y_test
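
The `generate_labels` function imported in the refactored workflow below is not sketched above. A minimal placeholder for `src/ml/labeling.py`, assuming labels already exist as a column in the STFT DataFrame (the default column name `label` is an assumption, not a decided interface), could look like:

```python
import numpy as np
import pandas as pd


def generate_labels(data: pd.DataFrame, label_column: str = "label") -> np.ndarray:
    """Return the label array for an STFT DataFrame.

    Hypothetical sketch: assumes labels are stored in an existing column
    and fails early with a clear error instead of inside the split call.
    """
    if label_column not in data.columns:
        raise KeyError(f"Expected a '{label_column}' column in the STFT data")
    return data[label_column].to_numpy()
```

The exact labeling logic (thresholding, windowing, class mapping, etc.) would replace the simple column lookup once the module is implemented.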

Expected Benefits

  1. Improved notebook readability with focus on experimental flow and results
  2. Enhanced code reusability across different notebooks and experiments
  3. Better maintainability with single-responsibility modules
  4. Easier testing of individual components
  5. Clearer documentation of the machine learning pipeline for thesis readers
  6. Simplified iteration on model improvements
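
To illustrate benefit 4, once the split logic lives in a module it can be exercised with a plain unit test. A sketch (the function body is inlined here for self-containment; in practice it would be imported from `src.ml.data_splitting`, and the test could be collected by pytest):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# In practice: from src.ml.data_splitting import create_train_test_split
def create_train_test_split(data, test_size=0.2, random_state=42, stratify=None):
    X = data.drop("label_column", axis=1) if "label_column" in data.columns else data
    y = data["label_column"] if "label_column" in data.columns else stratify
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state,
        stratify=stratify if stratify is not None else y,
    )


def test_split_sizes_and_stratification():
    # Tiny balanced dataset: 10 rows, 5 of each class
    data = pd.DataFrame({
        "feat": np.arange(10, dtype=float),
        "label_column": [0, 1] * 5,
    })
    X_train, X_test, y_train, y_test = create_train_test_split(data, test_size=0.2)
    assert len(X_test) == 2            # 20% of 10 rows held out
    assert len(X_train) == 8
    assert set(y_test) == {0, 1}       # stratification keeps both classes
```

Such tests are impractical while the logic sits inside notebook cells, which is the core motivation for the refactor.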

Additional Context

This refactoring aligns with software engineering best practices and will make the thesis code more professional and maintainable. The modules should include appropriate error handling, type hints, and docstrings to ensure they're robust and well-documented.
Once implemented, the notebook workflow would change from:

# Current: Multiple cells of complex code
# Cell 1
X = stft_data.drop('label', axis=1)
y = stft_data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ... more complex code ...

To

# Refactored: Clean, readable notebook
from src.ml.data_splitting import create_train_test_split
from src.ml.labeling import generate_labels

# Generate labels
labels = generate_labels(stft_data)

# Create split with one clear function call
X_train, X_test, y_train, y_test = create_train_test_split(
    stft_data, 
    test_size=0.2,
    stratify=labels
)
Reference: nuluh/thesis#48