[EXP] Cross-dataset validation #74

Open
opened 2025-05-17 14:29:19 +00:00 by meowndor · 0 comments
meowndor commented 2025-05-17 14:29:19 +00:00 (Migrated from github.com)

Hypothesis

Training on one dataset (A) and validating on another (B), and vice versa, will provide a more robust evaluation of model generalization than standard train/test splits within each dataset.

Background & Motivation

My thesis proposal defense revealed an important validation gap: my professor requested an evaluation of how well models trained on one dataset perform on another. This cross-dataset validation approach will test real-world generalization and reveal whether the models are learning dataset-specific patterns rather than generalizable features.

This approach addresses potential data leakage concerns and provides stronger evidence for the robustness of the proposed methods across different data-collection environments.

Dataset

  • Dataset A: 15,390 samples × 514 features (513 STFT features + 1 label column)
  • Dataset B: 15,390 samples × 514 features (513 STFT features + 1 label column)
  • Both datasets are already preprocessed and ready for model training
  • Current implementation uses sklearn's train_test_split within each dataset
  • New implementation will train on one entire dataset and validate on the other
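Given the shapes above, the feature/label separation can be sketched as follows. This is a minimal sketch: it assumes each dataset arrives as a NumPy array whose last column is the label, and uses a random stand-in array since the real files are not shown here.

```python
import numpy as np

def split_features_labels(data: np.ndarray):
    """Split a (n_samples, 514) array into 513 STFT features and 1 label column."""
    X = data[:, :-1]   # 513 STFT feature columns
    y = data[:, -1]    # final label column
    return X, y

# Synthetic stand-in with the documented shape (15,390 samples x 514 columns)
dataset_a = np.random.rand(15390, 514)
X_a, y_a = split_features_labels(dataset_a)
print(X_a.shape, y_a.shape)  # (15390, 513) (15390,)
```

The same helper applies unchanged to dataset B, since both datasets share the same layout.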

Methodology

  1. Implement two cross-dataset validation scenarios:

    • Scenario 1: Train on dataset A, validate on dataset B
    • Scenario 2: Train on dataset B, validate on dataset A
  2. For each scenario:

    • Train all previously implemented models on the training dataset
    • Evaluate performance metrics on the validation dataset
    • Compare results with the standard within-dataset validation approach
  3. Create visualization comparing performance across all validation approaches:

    • Standard within-dataset validation on dataset A
    • Standard within-dataset validation on dataset B
    • Train A → Test B
    • Train B → Test A
  4. Analyze discrepancies in performance between validation methods
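The four scenarios above can be driven from one evaluation loop. A minimal sketch with scikit-learn, using tiny synthetic arrays as stand-ins for the actual preprocessed datasets and a Random Forest as a stand-in for the previously implemented models:

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_all_scenarios(model, X_a, y_a, X_b, y_b, seed=42):
    """Return accuracy for the four validation approaches in the methodology."""
    results = {}
    # Standard within-dataset validation (the current train_test_split approach)
    for name, X, y in [("within_A", X_a, y_a), ("within_B", X_b, y_b)]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        m = clone(model).fit(X_tr, y_tr)
        results[name] = accuracy_score(y_te, m.predict(X_te))
    # Cross-dataset validation: train on one full dataset, test on the other
    for name, (X_tr, y_tr), (X_te, y_te) in [
        ("A_to_B", (X_a, y_a), (X_b, y_b)),
        ("B_to_A", (X_b, y_b), (X_a, y_a)),
    ]:
        m = clone(model).fit(X_tr, y_tr)
        results[name] = accuracy_score(y_te, m.predict(X_te))
    return results

# Tiny synthetic stand-ins for datasets A and B
rng = np.random.default_rng(0)
X_a, y_a = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_b, y_b = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
scores = evaluate_all_scenarios(
    RandomForestClassifier(n_estimators=10, random_state=0), X_a, y_a, X_b, y_b
)
print(scores)
```

Using `clone` ensures each scenario trains a fresh, identically configured model, so the only difference between the four results is the data split.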

Parameters & Hyperparameters

  • Use identical hyperparameters as in previous experiments for fair comparison

  • For each model type (e.g., Random Forest, SVM, Neural Network):

    • Learning rate: [same as previous]
    • Architecture: [same as previous]
    • Regularization parameters: [same as previous]
    • Training epochs/iterations: [same as previous]
  • Key modification is only the training/validation data split approach
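One way to enforce identical hyperparameters across both scenarios is a single shared configuration reused for every run. The values below are hypothetical placeholders, not the actual thesis settings; the real values come from the previous experiments.

```python
# Hypothetical shared hyperparameters; substitute the values from previous experiments
SHARED_CONFIG = {
    "random_forest": {"n_estimators": 100, "max_depth": None, "random_state": 42},
    "svm": {"C": 1.0, "kernel": "rbf"},
    "neural_network": {"hidden_layer_sizes": (128, 64), "learning_rate_init": 1e-3,
                       "alpha": 1e-4, "max_iter": 200, "random_state": 42},
}

def build_models(config=SHARED_CONFIG):
    """Instantiate every model from the same config so only the data split varies."""
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    return {
        "random_forest": RandomForestClassifier(**config["random_forest"]),
        "svm": SVC(**config["svm"], probability=True),
        "neural_network": MLPClassifier(**config["neural_network"]),
    }
```

Building every scenario's models through the same factory makes the "identical hyperparameters" claim checkable rather than a convention.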

Evaluation Metrics

  • Accuracy
  • F1-score (macro and per-class)
  • Confusion matrix
  • ROC-AUC (for applicable models)
  • Cross-entropy loss
  • Performance gap between standard validation and cross-dataset validation
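All of the listed metrics are available in scikit-learn. A minimal sketch on a toy prediction set (the labels and probabilities below are illustrative only):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, roc_auc_score)

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
y_proba = np.array([0.1, 0.8, 0.4, 0.2])  # predicted probability of class 1

print(accuracy_score(y_true, y_pred))             # 0.75
print(f1_score(y_true, y_pred, average="macro"))  # macro F1 across both classes
print(f1_score(y_true, y_pred, average=None))     # per-class F1
print(confusion_matrix(y_true, y_pred))           # [[2 0] [1 1]]
print(roc_auc_score(y_true, y_proba))             # ROC-AUC from probabilities
print(log_loss(y_true, y_proba))                  # cross-entropy loss
```

Computing the performance gap is then just the difference between each metric under within-dataset and cross-dataset validation.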

Notebook Location

notebooks/cross_dataset_validation.ipynb

Dependencies

  • Depends on issue #XX (Data preprocessing for datasets A and B)
  • #63

References

No response

Additional Notes

This experiment is critical for the thesis defense, as it addresses a specific request from the committee. It will demonstrate the robustness of my approach across datasets collected in different environments.

The implementation can leverage the existing model training pipeline with minimal modifications to the data loading and evaluation procedures. The main code changes will be to the dataset splitting logic rather than model architecture or training.

Expected outcome: Some performance drop in cross-dataset validation is anticipated, but a drop greater than 15-20% would indicate overfitting to dataset-specific patterns and may require revisiting feature engineering.
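The 15-20% threshold above can be checked mechanically as a relative drop. A sketch, with illustrative metric values and the band's upper bound as the default cutoff:

```python
def relative_drop(within_score: float, cross_score: float) -> float:
    """Relative performance drop from within-dataset to cross-dataset validation."""
    return (within_score - cross_score) / within_score

def flags_overfitting(within_score: float, cross_score: float,
                      threshold: float = 0.20) -> bool:
    """True when the drop exceeds the upper bound of the stated 15-20% band."""
    return relative_drop(within_score, cross_score) > threshold

# Illustrative numbers: 0.92 accuracy within-dataset, 0.68 cross-dataset
drop = relative_drop(0.92, 0.68)
print(f"{drop:.1%}")                  # 26.1% -> beyond the 15-20% band
print(flags_overfitting(0.92, 0.68))  # True
```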

Reference: nuluh/thesis#74