[FEAT] Refactor STFT preprocessing and training pipeline into importable modules #48

Open
opened 2025-04-22 04:19:59 +00:00 by nuluh · 0 comments
nuluh commented 2025-04-22 04:19:59 +00:00 (Migrated from github.com)

Problem Statement

The current notebook containing STFT processing, train-test-split, and labeling functionality has grown complex and difficult to read. The code for data splitting and labeling is embedded within notebook cells, making it hard to maintain and reuse across different experiments. This approach limits code reusability and makes the notebook less readable for thesis documentation purposes.

Proposed Solution

Refactor the data splitting and labeling code from the notebook into properly structured Python modules that can be imported. This will:

  1. Create dedicated Python modules in the src/ directory for:

    • Dataset splitting functionality (train/test/validation splits)
    • Labeling generation and management
    • Model training pipeline components
  2. Clean up the notebook to focus on experiment flow, visualization, and results rather than implementation details.

  3. Implement proper documentation, typing, and error handling in the new modules.

Alternatives Considered

  • Keeping code in notebook but moving to separate notebook cells: This would improve readability but wouldn't address reusability.
  • Using Jupyter notebook extensions to collapse code: Helps with readability but doesn't improve maintainability.
  • Creating Python scripts without proper module structure: Would help with reusability but might create import issues.

Component

Python Source Code

Priority

High (significantly improves workflow)

Implementation Ideas

# Proposed module structure:
# src/
# └── ml/
#     ├── __init__.py
#     ├── data_splitting.py
#     ├── labeling.py
#     └── training.py
from typing import Optional

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split as sklearn_split


def create_train_test_split(
    data: pd.DataFrame,
    test_size: float = 0.2,
    random_state: int = 42,
    stratify: Optional[np.ndarray] = None,
) -> tuple:
    """
    Create a stratified train-test split from STFT data.
    
    Parameters:
    -----------
    data : pd.DataFrame
        The input DataFrame containing STFT data
    test_size : float
        Proportion of data to use for testing (default: 0.2)
    random_state : int
        Random seed for reproducibility (default: 42)
    stratify : np.ndarray, optional
        Labels to use for stratified sampling
        
    Returns:
    --------
    tuple
        (X_train, X_test, y_train, y_test) - Split datasets
    """
    # Extract features and labels
    X = data.drop('label_column', axis=1) if 'label_column' in data.columns else data
    y = data['label_column'] if 'label_column' in data.columns else stratify
    
    # Create split; fall back to the extracted labels for stratification
    # when no explicit stratify array is passed, so a label column found
    # in the DataFrame is actually used for stratified sampling
    X_train, X_test, y_train, y_test = sklearn_split(
        X, y, test_size=test_size, random_state=random_state,
        stratify=stratify if stratify is not None else y,
    )
    
    return X_train, X_test, y_train, y_test
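
The `generate_labels` function imported in the refactored workflow below is not sketched above. A minimal placeholder for `src/ml/labeling.py`, assuming labels already exist as a column in the STFT DataFrame (the default column name `label` is an assumption, not a decided interface), could look like:

```python
import numpy as np
import pandas as pd


def generate_labels(data: pd.DataFrame, label_column: str = "label") -> np.ndarray:
    """Return the label array for an STFT DataFrame.

    Hypothetical sketch: assumes labels are stored in an existing column
    and fails early with a clear error instead of inside the split call.
    """
    if label_column not in data.columns:
        raise KeyError(f"Expected a '{label_column}' column in the STFT data")
    return data[label_column].to_numpy()
```

The exact labeling logic (thresholding, windowing, class mapping, etc.) would replace the simple column lookup once the module is implemented.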

Expected Benefits

  1. Improved notebook readability with focus on experimental flow and results
  2. Enhanced code reusability across different notebooks and experiments
  3. Better maintainability with single-responsibility modules
  4. Easier testing of individual components
  5. Clearer documentation of the machine learning pipeline for thesis readers
  6. Simplified iteration on model improvements
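
To illustrate benefit 4, once the split logic lives in a module it can be exercised with a plain unit test. A sketch (the function body is inlined here for self-containment; in practice it would be imported from `src.ml.data_splitting`, and the test could be collected by pytest):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# In practice: from src.ml.data_splitting import create_train_test_split
def create_train_test_split(data, test_size=0.2, random_state=42, stratify=None):
    X = data.drop("label_column", axis=1) if "label_column" in data.columns else data
    y = data["label_column"] if "label_column" in data.columns else stratify
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state,
        stratify=stratify if stratify is not None else y,
    )


def test_split_sizes_and_stratification():
    # Tiny balanced dataset: 10 rows, 5 of each class
    data = pd.DataFrame({
        "feat": np.arange(10, dtype=float),
        "label_column": [0, 1] * 5,
    })
    X_train, X_test, y_train, y_test = create_train_test_split(data, test_size=0.2)
    assert len(X_test) == 2            # 20% of 10 rows held out
    assert len(X_train) == 8
    assert set(y_test) == {0, 1}       # stratification keeps both classes
```

Such tests are impractical while the logic sits inside notebook cells, which is the core motivation for the refactor.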

Additional Context

This refactoring aligns with software engineering best practices and will make the thesis code more professional and maintainable. The modules should include appropriate error handling, type hints, and docstrings to ensure they're robust and well-documented.
Once implemented, the notebook workflow would change from:

# Current: Multiple cells of complex code
# Cell 1
X = stft_data.drop('label', axis=1)
y = stft_data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ... more complex code ...

To

# Refactored: Clean, readable notebook
from src.ml.data_splitting import create_train_test_split
from src.ml.labeling import generate_labels

# Generate labels
labels = generate_labels(stft_data)

# Create split with one clear function call
X_train, X_test, y_train, y_test = create_train_test_split(
    stft_data, 
    test_size=0.2,
    stratify=labels
)
Reference: nuluh/thesis#48