[EXP] Cross-dataset validation #74

Open
opened 2025-05-17 14:29:19 +00:00 by meowndor · 0 comments
meowndor commented 2025-05-17 14:29:19 +00:00 (Migrated from github.com)

Hypothesis

Training on one dataset (A) and validating on another (B), and vice versa, will provide a more robust evaluation of model generalization than standard train/test splits within each dataset.

Background & Motivation

My thesis proposal defense revealed an important validation gap: my professor requested an evaluation of how well models trained on one dataset perform on another. This cross-dataset validation approach will test real-world generalization and reveal whether the models are learning dataset-specific patterns rather than generalizable features.

This approach addresses potential data leakage concerns and provides stronger evidence for the robustness of the proposed methods across different data-collection environments.

Dataset

  • Dataset A: 15,390 samples × 514 features (513 STFT features + 1 label column)
  • Dataset B: 15,390 samples × 514 features (513 STFT features + 1 label column)
  • Both datasets are already preprocessed and ready for model training
  • Current implementation uses sklearn's train_test_split within each dataset
  • New implementation will train on one entire dataset and validate on the other
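Given the shapes above, the feature/label separation can be sketched as follows. This is a minimal sketch: it assumes each dataset arrives as a NumPy array whose last column is the label, and uses a random stand-in array since the real files are not shown here.

```python
import numpy as np

def split_features_labels(data: np.ndarray):
    """Split a (n_samples, 514) array into 513 STFT features and 1 label column."""
    X = data[:, :-1]   # 513 STFT feature columns
    y = data[:, -1]    # final label column
    return X, y

# Synthetic stand-in with the documented shape (15,390 samples x 514 columns)
dataset_a = np.random.rand(15390, 514)
X_a, y_a = split_features_labels(dataset_a)
print(X_a.shape, y_a.shape)  # (15390, 513) (15390,)
```

The same helper applies unchanged to dataset B, since both datasets share the same layout.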

Methodology

  1. Implement two cross-dataset validation scenarios:

    • Scenario 1: Train on dataset A, validate on dataset B
    • Scenario 2: Train on dataset B, validate on dataset A
  2. For each scenario:

    • Train all previously implemented models on the training dataset
    • Evaluate performance metrics on the validation dataset
    • Compare results with the standard within-dataset validation approach
  3. Create visualization comparing performance across all validation approaches:

    • Standard within-dataset validation on dataset A
    • Standard within-dataset validation on dataset B
    • Train A → Test B
    • Train B → Test A
  4. Analyze discrepancies in performance between validation methods
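The four scenarios above can be driven from one evaluation loop. A minimal sketch with scikit-learn, using tiny synthetic arrays as stand-ins for the actual preprocessed datasets and a Random Forest as a stand-in for the previously implemented models:

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_all_scenarios(model, X_a, y_a, X_b, y_b, seed=42):
    """Return accuracy for the four validation approaches in the methodology."""
    results = {}
    # Standard within-dataset validation (the current train_test_split approach)
    for name, X, y in [("within_A", X_a, y_a), ("within_B", X_b, y_b)]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        m = clone(model).fit(X_tr, y_tr)
        results[name] = accuracy_score(y_te, m.predict(X_te))
    # Cross-dataset validation: train on one full dataset, test on the other
    for name, (X_tr, y_tr), (X_te, y_te) in [
        ("A_to_B", (X_a, y_a), (X_b, y_b)),
        ("B_to_A", (X_b, y_b), (X_a, y_a)),
    ]:
        m = clone(model).fit(X_tr, y_tr)
        results[name] = accuracy_score(y_te, m.predict(X_te))
    return results

# Tiny synthetic stand-ins for datasets A and B
rng = np.random.default_rng(0)
X_a, y_a = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_b, y_b = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
scores = evaluate_all_scenarios(
    RandomForestClassifier(n_estimators=10, random_state=0), X_a, y_a, X_b, y_b
)
print(scores)
```

Using `clone` ensures each scenario trains a fresh, identically configured model, so the only difference between the four results is the data split.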

Parameters & Hyperparameters

  • Use identical hyperparameters as in previous experiments for fair comparison

  • For each model type (e.g., Random Forest, SVM, Neural Network):

    • Learning rate: [same as previous]
    • Architecture: [same as previous]
    • Regularization parameters: [same as previous]
    • Training epochs/iterations: [same as previous]
  • Key modification is only the training/validation data split approach
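One way to enforce identical hyperparameters across both scenarios is a single shared configuration reused for every run. The values below are hypothetical placeholders, not the actual thesis settings; the real values come from the previous experiments.

```python
# Hypothetical shared hyperparameters; substitute the values from previous experiments
SHARED_CONFIG = {
    "random_forest": {"n_estimators": 100, "max_depth": None, "random_state": 42},
    "svm": {"C": 1.0, "kernel": "rbf"},
    "neural_network": {"hidden_layer_sizes": (128, 64), "learning_rate_init": 1e-3,
                       "alpha": 1e-4, "max_iter": 200, "random_state": 42},
}

def build_models(config=SHARED_CONFIG):
    """Instantiate every model from the same config so only the data split varies."""
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    return {
        "random_forest": RandomForestClassifier(**config["random_forest"]),
        "svm": SVC(**config["svm"], probability=True),
        "neural_network": MLPClassifier(**config["neural_network"]),
    }
```

Building every scenario's models through the same factory makes the "identical hyperparameters" claim checkable rather than a convention.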

Evaluation Metrics

  • Accuracy
  • F1-score (macro and per-class)
  • Confusion matrix
  • ROC-AUC (for applicable models)
  • Cross-entropy loss
  • Performance gap between standard validation and cross-dataset validation
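All of the listed metrics are available in scikit-learn. A minimal sketch on a toy prediction set (the labels and probabilities below are illustrative only):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, roc_auc_score)

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
y_proba = np.array([0.1, 0.8, 0.4, 0.2])  # predicted probability of class 1

print(accuracy_score(y_true, y_pred))             # 0.75
print(f1_score(y_true, y_pred, average="macro"))  # macro F1 across both classes
print(f1_score(y_true, y_pred, average=None))     # per-class F1
print(confusion_matrix(y_true, y_pred))           # [[2 0] [1 1]]
print(roc_auc_score(y_true, y_proba))             # ROC-AUC from probabilities
print(log_loss(y_true, y_proba))                  # cross-entropy loss
```

Computing the performance gap is then just the difference between each metric under within-dataset and cross-dataset validation.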

Notebook Location

notebooks/cross_dataset_validation.ipynb

Dependencies

  • Depends on issue #XX (Data preprocessing for datasets A and B)
  • #63

References

No response

Additional Notes

This experiment is critical for the thesis defense, as it addresses a specific request from the committee. It will demonstrate the robustness of my approach across datasets collected in different environments.

The implementation can leverage the existing model training pipeline with minimal modifications to the data loading and evaluation procedures. The main code changes will be to the dataset splitting logic rather than model architecture or training.

Expected outcome: Some performance drop in cross-dataset validation is anticipated, but a drop greater than 15-20% would indicate overfitting to dataset-specific patterns and may require revisiting feature engineering.
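The 15-20% threshold above can be checked mechanically as a relative drop. A sketch, with illustrative metric values and the band's upper bound as the default cutoff:

```python
def relative_drop(within_score: float, cross_score: float) -> float:
    """Relative performance drop from within-dataset to cross-dataset validation."""
    return (within_score - cross_score) / within_score

def flags_overfitting(within_score: float, cross_score: float,
                      threshold: float = 0.20) -> bool:
    """True when the drop exceeds the upper bound of the stated 15-20% band."""
    return relative_drop(within_score, cross_score) > threshold

# Illustrative numbers: 0.92 accuracy within-dataset, 0.68 cross-dataset
drop = relative_drop(0.92, 0.68)
print(f"{drop:.1%}")                  # 26.1% -> beyond the 15-20% band
print(flags_overfitting(0.92, 0.68))  # True
```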

Reference: nuluh/thesis#74