[EXP] Evaluate dataset separation strategies: combined vs. separate datasets for train/test splits #47

Closed
opened 2025-04-21 01:22:43 +00:00 by nuluh · 1 comment
nuluh commented 2025-04-21 01:22:43 +00:00 (Migrated from github.com)

Hypothesis

Using separate datasets for training and testing will provide more reliable and generalizable model performance estimates than combining both datasets and using random splits, particularly considering the temporal and subject-specific characteristics of accelerometer data.

Background & Motivation

Recent research in ML applications highlights the risk of inflated performance metrics due to data leakage when improper dataset splitting is used. For accelerometer time-series data, similar or repeated measurements may exist across the datasets, potentially causing cross-contamination between training and test sets if combined indiscriminately. This investigation aims to determine the most appropriate dataset strategy for my thesis to ensure valid and reproducible results.

Dataset

  • Two accelerometer datasets with time-domain and frequency-domain (STFT) features
  • Dataset A
  • Dataset B
  • Potential temporal relationships within each dataset
  • Potential similarities/differences between the two datasets

Methodology

  1. Approach A - Combined datasets:

    • Merge both datasets
    • Perform stratified random split (e.g., 80/20)
    • Apply k-fold cross-validation (k=5)
    • Analyze performance metrics
  2. Approach B - Separate datasets:

    • Train on Dataset A, test on Dataset B
    • Train on Dataset B, test on Dataset A
    • Train on combined datasets, validate on hold-out portions of each
    • Analyze performance metrics and generalization capability
  3. Analysis of potential data leakage:

    • Temporal auto-correlation analysis within datasets
    • Feature similarity analysis between datasets
    • Subject/measurement overlap assessment
  4. Statistical comparison of performance stability:

    • Compare variance in metrics across different split strategies
    • Assess confidence intervals for each approach
    • Evaluate generalization gap (difference between training and test performance)
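The two split strategies above can be sketched as follows. This is a minimal illustration, not the thesis pipeline: it uses synthetic stand-in data from `make_classification` in place of the real STFT feature matrices, and a logistic regression as a placeholder model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for Dataset A and Dataset B.
X_a, y_a = make_classification(n_samples=300, n_features=20, random_state=0)
X_b, y_b = make_classification(n_samples=300, n_features=20, random_state=1)

model = LogisticRegression(max_iter=1000)

# Approach A: merge both datasets, stratified 5-fold cross-validation.
X_all, y_all = np.vstack([X_a, X_b]), np.concatenate([y_a, y_b])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
combined_scores = cross_val_score(model, X_all, y_all, cv=cv)

# Approach B: train on one dataset, test on the other (both directions).
acc_a_to_b = accuracy_score(y_b, model.fit(X_a, y_a).predict(X_b))
acc_b_to_a = accuracy_score(y_a, model.fit(X_b, y_b).predict(X_a))

# Generalization gap: in-sample accuracy minus cross-dataset accuracy.
train_acc = accuracy_score(y_a, model.fit(X_a, y_a).predict(X_a))
gap = train_acc - acc_a_to_b
```

If the cross-dataset accuracies fall well below the combined CV scores, that is itself evidence of leakage or distribution shift in the combined setting.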

Parameters & Hyperparameters

  • Model architecture: [specify model type]
  • Fixed hyperparameters across all experiments:
    • Learning rate: [value]
    • Batch size: [value]
    • Training epochs: [value]
    • Regularization settings: [values]
  • Same pre-processing pipeline for all experiments
  • Same random seeds for reproducibility
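One way to enforce the "same random seeds" requirement is a small helper that fixes every RNG the pipeline touches before each experiment. The seed value here is a placeholder; the actual seed(s) should be the ones recorded for the thesis runs.

```python
import random
import numpy as np

SEED = 42  # placeholder; substitute the seed(s) recorded for the thesis runs

def set_seeds(seed: int = SEED) -> None:
    """Fix the RNGs used by the pre-processing and modelling code."""
    random.seed(seed)
    np.random.seed(seed)

# Two runs with the same seed must produce identical draws.
set_seeds()
first_draw = np.random.rand(3)
set_seeds()
second_draw = np.random.rand(3)
```

If a deep-learning framework is used, its own seeding call (and any determinism flags) would need to be added to `set_seeds` as well.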

Evaluation Metrics

  • Primary metrics:

    • Classification accuracy
    • F1-score
    • AUC-ROC
  • Secondary analysis:

    • Standard deviation of metrics across folds/runs
    • Confidence intervals
    • Performance stability analysis
    • Generalization gap measurements
  • Statistical tests:

    • Paired t-tests between approaches
    • McNemar's test for comparing classifier disagreement
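The two statistical tests above can be sketched as follows, with hypothetical per-fold F1 scores and per-sample correctness vectors standing in for real experiment output. McNemar's test is computed here as an exact binomial test on the discordant pairs, which avoids any extra dependency beyond SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold metric values for the two split strategies.
combined_f1 = np.array([0.91, 0.93, 0.92, 0.94, 0.90])
separate_f1 = np.array([0.84, 0.86, 0.85, 0.83, 0.87])

# Paired t-test: the same folds/runs evaluated under both approaches.
t_stat, p_value = stats.ttest_rel(combined_f1, separate_f1)

# McNemar's exact test from the two classifiers' per-sample correctness.
correct_1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)
correct_2 = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 1], dtype=bool)
b = int(np.sum(correct_1 & ~correct_2))  # only classifier 1 correct
c = int(np.sum(~correct_1 & correct_2))  # only classifier 2 correct
mcnemar_p = stats.binomtest(b, b + c, 0.5).pvalue if (b + c) else 1.0
```

Note that the paired t-test is only valid when the per-fold scores are genuinely paired (identical folds under both strategies), which is another reason to fix the random seeds.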

Notebook Location

notebooks/stft.ipynb

Dependencies

No response

References

  1. "Inflation of Test Accuracy Due to Data Leakage in Deep Learning-Based Classification of OCT Images"

    • Key finding: Improper dataset splitting with similar/repeated measurements leads to data leakage and inflated metrics
  2. "Impact of Train/Test Sample Regimen on Performance Estimate Stability of Machine Learning in Cardiovascular Imaging"

    • Key finding: Single random splits cause significant variability in performance metrics, especially in small datasets
  3. "Cross-Validation Is All You Need: A Statistical Approach to Label Noise Estimation"

    • Key finding: Repeated cross-validation helps detect and mitigate noisy labels in datasets
  4. Cawley, G. C., & Talbot, N. L. (2010). "On over-fitting in model selection and subsequent selection bias in performance evaluation"

    • Relevant for understanding how model selection can introduce bias in performance estimation

Additional Notes

This experiment is critical for establishing the methodological foundation of my thesis. The results will directly impact:

  1. The reliability of all subsequent ML experiments
  2. The validity of conclusions drawn from model performance
  3. The reproducibility of my thesis results

Based on initial literature review, I expect the separate dataset approach to provide more conservative but reliable estimates, while the combined approach might show higher but potentially inflated performance metrics.

nuluh commented 2025-07-27 21:56:00 +00:00 (Migrated from github.com)

out of scope


Reference: nuluh/thesis#47