[FEAT] Implement memory-efficient data processing options for STFT pipeline #45
Problem Statement
The current data preprocessing pipeline keeps every intermediate result, from the raw CSV through to the export-ready CSV, as objects in memory, consuming approximately 3-4 GB of RAM. This high memory usage becomes problematic when processing large datasets or working on machines with limited resources.
Proposed Solution
Enhance the preprocessing class to offer multiple memory management strategies through a configurable option:
The implementation should allow users to select the appropriate strategy based on their hardware constraints and performance needs.
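As a rough illustration of the configurable option, a minimal sketch might look like the following. The class name `Preprocessor`, the strategy values, and the `chunk_size` and `cache_dir` parameters are assumptions made for this example; the issue itself only names a disk-based and a chunked strategy.

```python
from enum import Enum


class MemoryStrategy(Enum):
    """Hypothetical strategy names (the issue names only disk-based and chunked)."""
    IN_MEMORY = "in_memory"  # assumed default: current behaviour, everything in RAM
    DISK = "disk"            # intermediate results written to files
    CHUNKED = "chunked"      # data processed in fixed-size pieces


class Preprocessor:
    """Stand-in for the project's preprocessing class (real name not given here)."""

    def __init__(self, memory_strategy="in_memory", chunk_size=10_000, cache_dir=None):
        # Enum lookup raises ValueError for an unknown strategy name.
        self.memory_strategy = MemoryStrategy(memory_strategy)
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size  # only meaningful for the chunked strategy
        self.cache_dir = cache_dir    # only meaningful for the disk strategy
```

Validating the option in the constructor means a typo in the strategy name fails immediately rather than partway through a long preprocessing run.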
Alternatives Considered
The initial workaround considered was simply saving processed data to a file between steps and reloading it, but this is inefficient because the data must be read back from disk before STFT processing. This approach also doesn't address the fundamental memory usage pattern.
Another alternative would be to completely rewrite the pipeline to use generators and streaming, but this would require significant code restructuring.
Component
Python Source Code
Priority
Medium (nice to have)
Implementation Ideas
Add a memory_strategy parameter to the preprocessing class constructor with options:
For disk-based strategy:
For chunked strategy:
Add memory usage estimation method to help users select appropriate strategy
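The memory usage estimation could start from something as simple as the sketch below. It is a back-of-the-envelope calculation, not a measurement: the function names, the 8-bytes-per-value default (float64), the `n_copies` factor for intermediate results, and the 50% safety factor are all assumptions for illustration, and real usage will be higher once DataFrame overhead and STFT output are included.

```python
def estimate_memory_bytes(n_rows, n_cols, bytes_per_value=8, n_copies=1):
    """Rough lower bound for dense numeric data held n_copies times in memory."""
    if min(n_rows, n_cols, bytes_per_value, n_copies) < 0:
        raise ValueError("all arguments must be non-negative")
    return n_rows * n_cols * bytes_per_value * n_copies


def fits_in_memory(estimated_bytes, available_bytes, safety_factor=0.5):
    """Return True if the estimate fits within a fraction of the available RAM."""
    return estimated_bytes <= available_bytes * safety_factor


# Example: 1,000,000 rows x 50 columns of float64, kept at 3 pipeline stages.
needed = estimate_memory_bytes(1_000_000, 50, n_copies=3)
print(needed, fits_in_memory(needed, 4 * 1024**3))
```

A helper like this could back a warning or a suggested strategy, so that users do not have to guess whether the default in-memory mode is safe on their machine.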
Expected Benefits
Additional Context
This enhancement complements the ongoing work on STFT processing. While fixing the export shape bug, we should consider implementing this memory optimization to make the entire pipeline more robust.
Initial benchmarking suggests that even with the file-based approach, the overall processing time might only increase by 15-20%, which is an acceptable tradeoff for the memory benefits.
Sample code for implementation:
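The block below is an illustrative sketch of one way a chunked strategy might read its input, not the project's actual code. The function name `iter_csv_chunks` and the default chunk size are hypothetical, and it uses only the standard library so that it runs without extra dependencies.

```python
import csv
from itertools import islice


def iter_csv_chunks(path, chunk_size=10_000):
    """Yield lists of at most chunk_size rows (as dicts) from a CSV file."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            # islice pulls the next chunk_size rows without loading the whole file.
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk
```

Only one chunk is held in memory at a time, so peak usage is bounded by `chunk_size` rather than by file size. Note that STFT windows spanning a chunk boundary would need overlap handling between consecutive chunks, which this sketch does not address.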