[EXP] Evaluate dataset separation strategies: combined vs. separate datasets for train/test splits #47

Closed
opened 2025-04-21 01:22:43 +00:00 by nuluh · 1 comment
nuluh commented 2025-04-21 01:22:43 +00:00 (Migrated from github.com)

Hypothesis

Using separate datasets for training and testing will provide more reliable and generalizable model performance estimates than combining both datasets and using random splits, particularly considering the temporal and subject-specific characteristics of accelerometer data.

Background & Motivation

Recent research in ML applications highlights the risk of inflated performance metrics due to data leakage when improper dataset splitting is used. For accelerometer time-series data, similar or repeated measurements may exist across the datasets, potentially causing cross-contamination between training and test sets if combined indiscriminately. This investigation aims to determine the most appropriate dataset strategy for my thesis to ensure valid and reproducible results.

Dataset

  • Two accelerometer datasets with time-domain and frequency-domain (STFT) features
  • Dataset A
  • Dataset B
  • Potential temporal relationships within each dataset
  • Potential similarities/differences between the two datasets

Methodology

  1. Approach A - Combined datasets:

    • Merge both datasets
    • Perform stratified random split (e.g., 80/20)
    • Apply k-fold cross-validation (k=5)
    • Analyze performance metrics
  2. Approach B - Separate datasets:

    • Train on Dataset A, test on Dataset B
    • Train on Dataset B, test on Dataset A
    • Train on combined datasets, validate on hold-out portions of each
    • Analyze performance metrics and generalization capability
  3. Analysis of potential data leakage:

    • Temporal auto-correlation analysis within datasets
    • Feature similarity analysis between datasets
    • Subject/measurement overlap assessment
  4. Statistical comparison of performance stability:

    • Compare variance in metrics across different split strategies
    • Assess confidence intervals for each approach
    • Evaluate generalization gap (difference between training and test performance)
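The two split strategies above can be sketched as follows. This is a minimal illustration, not the thesis pipeline: it uses synthetic stand-in data from `make_classification` in place of the real STFT feature matrices, and a logistic regression as a placeholder model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for Dataset A and Dataset B.
X_a, y_a = make_classification(n_samples=300, n_features=20, random_state=0)
X_b, y_b = make_classification(n_samples=300, n_features=20, random_state=1)

model = LogisticRegression(max_iter=1000)

# Approach A: merge both datasets, stratified 5-fold cross-validation.
X_all, y_all = np.vstack([X_a, X_b]), np.concatenate([y_a, y_b])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
combined_scores = cross_val_score(model, X_all, y_all, cv=cv)

# Approach B: train on one dataset, test on the other (both directions).
acc_a_to_b = accuracy_score(y_b, model.fit(X_a, y_a).predict(X_b))
acc_b_to_a = accuracy_score(y_a, model.fit(X_b, y_b).predict(X_a))

# Generalization gap: in-sample accuracy minus cross-dataset accuracy.
train_acc = accuracy_score(y_a, model.fit(X_a, y_a).predict(X_a))
gap = train_acc - acc_a_to_b
```

If the cross-dataset accuracies fall well below the combined CV scores, that is itself evidence of leakage or distribution shift in the combined setting.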

Parameters & Hyperparameters

  • Model architecture: [specify model type]
  • Fixed hyperparameters across all experiments:
    • Learning rate: [value]
    • Batch size: [value]
    • Training epochs: [value]
    • Regularization settings: [values]
  • Same pre-processing pipeline for all experiments
  • Same random seeds for reproducibility
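One way to enforce the "same random seeds" requirement is a small helper that fixes every RNG the pipeline touches before each experiment. The seed value here is a placeholder; the actual seed(s) should be the ones recorded for the thesis runs.

```python
import random
import numpy as np

SEED = 42  # placeholder; substitute the seed(s) recorded for the thesis runs

def set_seeds(seed: int = SEED) -> None:
    """Fix the RNGs used by the pre-processing and modelling code."""
    random.seed(seed)
    np.random.seed(seed)

# Two runs with the same seed must produce identical draws.
set_seeds()
first_draw = np.random.rand(3)
set_seeds()
second_draw = np.random.rand(3)
```

If a deep-learning framework is used, its own seeding call (and any determinism flags) would need to be added to `set_seeds` as well.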

Evaluation Metrics

  • Primary metrics:

    • Classification accuracy
    • F1-score
    • AUC-ROC
  • Secondary analysis:

    • Standard deviation of metrics across folds/runs
    • Confidence intervals
    • Performance stability analysis
    • Generalization gap measurements
  • Statistical tests:

    • Paired t-tests between approaches
    • McNemar's test for comparing classifier disagreement
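The two statistical tests above can be sketched as follows, with hypothetical per-fold F1 scores and per-sample correctness vectors standing in for real experiment output. McNemar's test is computed here as an exact binomial test on the discordant pairs, which avoids any extra dependency beyond SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold metric values for the two split strategies.
combined_f1 = np.array([0.91, 0.93, 0.92, 0.94, 0.90])
separate_f1 = np.array([0.84, 0.86, 0.85, 0.83, 0.87])

# Paired t-test: the same folds/runs evaluated under both approaches.
t_stat, p_value = stats.ttest_rel(combined_f1, separate_f1)

# McNemar's exact test from the two classifiers' per-sample correctness.
correct_1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)
correct_2 = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 1], dtype=bool)
b = int(np.sum(correct_1 & ~correct_2))  # only classifier 1 correct
c = int(np.sum(~correct_1 & correct_2))  # only classifier 2 correct
mcnemar_p = stats.binomtest(b, b + c, 0.5).pvalue if (b + c) else 1.0
```

Note that the paired t-test is only valid when the per-fold scores are genuinely paired (identical folds under both strategies), which is another reason to fix the random seeds.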

Notebook Location

notebooks/stft.ipynb

Dependencies

No response

References

  1. "Inflation of Test Accuracy Due to Data Leakage in Deep Learning-Based Classification of OCT Images"

    • Key finding: Improper dataset splitting with similar/repeated measurements leads to data leakage and inflated metrics
  2. "Impact of Train/Test Sample Regimen on Performance Estimate Stability of Machine Learning in Cardiovascular Imaging"

    • Key finding: Single random splits cause significant variability in performance metrics, especially in small datasets
  3. "Cross-Validation Is All You Need: A Statistical Approach to Label Noise Estimation"

    • Key finding: Repeated cross-validation helps detect and mitigate noisy labels in datasets
  4. Cawley, G. C., & Talbot, N. L. (2010). "On over-fitting in model selection and subsequent selection bias in performance evaluation"

    • Relevant for understanding how model selection can introduce bias in performance estimation

Additional Notes

This experiment is critical for establishing the methodological foundation of my thesis. The results will directly impact:

  1. The reliability of all subsequent ML experiments
  2. The validity of conclusions drawn from model performance
  3. The reproducibility of my thesis results

Based on initial literature review, I expect the separate dataset approach to provide more conservative but reliable estimates, while the combined approach might show higher but potentially inflated performance metrics.

nuluh commented 2025-07-27 21:56:00 +00:00 (Migrated from github.com)

out of scope


Reference: nuluh/thesis#47