[EXP] Evaluate dataset separation strategies: combined vs. separate datasets for train/test splits #47
Hypothesis
Using separate datasets for training and testing will provide more reliable and generalizable model performance estimates than pooling both datasets and splitting randomly, particularly given the temporal and subject-specific characteristics of accelerometer data.
Background & Motivation
Recent research in ML applications highlights the risk of inflated performance metrics from data leakage when datasets are split improperly. For accelerometer time-series data, similar or repeated measurements may exist across the two datasets, so combining them indiscriminately risks cross-contamination between training and test sets. This investigation aims to determine the most appropriate dataset strategy for my thesis to ensure valid and reproducible results.
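The leakage mechanism above can be made concrete with a small sketch. Assuming windowed accelerometer features carry a subject ID (the data shapes and IDs below are toy stand-ins, not the thesis datasets), a plain random split scatters each subject's windows across both sides, while a subject-wise split keeps them apart:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Toy stand-in for windowed accelerometer features: 10 subjects, 20 windows each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
subjects = np.repeat(np.arange(10), 20)  # grouping key, one ID per window

# Naive random split: windows from the same subject land on both sides.
tr_idx, te_idx = train_test_split(np.arange(len(X)), test_size=0.3, random_state=0)
naive_overlap = set(subjects[tr_idx]) & set(subjects[te_idx])

# Subject-wise split: every subject is entirely in train or entirely in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
tr_idx, te_idx = next(gss.split(X, groups=subjects))
grouped_overlap = set(subjects[tr_idx]) & set(subjects[te_idx])

print(len(naive_overlap), len(grouped_overlap))  # grouped_overlap is empty
```

With temporally correlated windows, the overlapping subjects in the naive split are exactly what inflates test scores.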
Dataset
Methodology
Approach A - Combined datasets:
Approach B - Separate datasets:
Analysis of potential data leakage:
Statistical comparison of performance stability:
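For the leakage-analysis step, one cheap first check is whether identical (or near-identical after rounding) feature windows appear in both datasets before they are combined. A minimal sketch, with `cross_dataset_duplicates` as a hypothetical helper and synthetic arrays standing in for the real feature matrices:

```python
import numpy as np

def cross_dataset_duplicates(a: np.ndarray, b: np.ndarray, decimals: int = 6) -> int:
    """Count feature rows of `a` that also appear in `b` (after rounding),
    a cheap proxy for exact-duplicate leakage between two datasets."""
    seen = {row.tobytes() for row in np.round(b, decimals)}
    return sum(row.tobytes() in seen for row in np.round(a, decimals))

rng = np.random.default_rng(1)
ds1 = rng.normal(size=(100, 16))
ds2 = rng.normal(size=(100, 16))
ds2[:5] = ds1[:5]  # plant 5 shared windows to simulate contamination

print(cross_dataset_duplicates(ds1, ds2))  # → 5
```

Exact-row hashing only catches verbatim duplicates; near-duplicates from overlapping sliding windows would need a distance-based check on top of this.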
Parameters & Hyperparameters
Evaluation Metrics
Primary metrics:
Secondary analysis:
Statistical tests:
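For the statistical tests, a paired non-parametric test over matched fold scores is one reasonable choice, since the two strategies are evaluated on the same folds. A minimal sketch with hypothetical fold-wise accuracies (placeholders for the scores the notebook would produce, not real results):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical fold-wise accuracies from the two strategies over the same 10 folds.
combined = np.array([0.91, 0.93, 0.92, 0.94, 0.90, 0.92, 0.93, 0.91, 0.95, 0.92])
separate = np.array([0.85, 0.88, 0.84, 0.87, 0.86, 0.85, 0.88, 0.84, 0.87, 0.86])

# Paired Wilcoxon signed-rank test on the fold-wise score differences.
stat, p = wilcoxon(combined, separate)

# Stability comparison: spread of scores across folds for each strategy.
print(p < 0.05, combined.std(), separate.std())
```

A consistently significant gap in the combined strategy's favour, on its own, would be evidence for the inflation hypothesis rather than for better generalization.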
Notebook Location
notebooks/stft.ipynb
Dependencies
No response
References
"Inflation of Test Accuracy Due to Data Leakage in Deep Learning-Based Classification of OCT Images"
"Impact of Train/Test Sample Regimen on Performance Estimate Stability of Machine Learning in Cardiovascular Imaging"
"Cross-Validation Is All You Need: A Statistical Approach to Label Noise Estimation"
Cawley, G. C., & Talbot, N. L. (2010). "On over-fitting in model selection and subsequent selection bias in performance evaluation"
Additional Notes
This experiment is critical for establishing the methodological foundation of my thesis. The results will directly determine the dataset-splitting strategy used in all subsequent experiments.
Based on an initial literature review, I expect the separate-dataset approach to yield more conservative but reliable estimates, while the combined approach may show higher but potentially inflated performance metrics.
Out of Scope