[FEAT] Implement memory-efficient data processing options for STFT pipeline #45

Open
opened 2025-04-20 07:00:27 +00:00 by nuluh · 0 comments
nuluh commented 2025-04-20 07:00:27 +00:00 (Migrated from github.com)

Problem Statement

The current data preprocessing pipeline keeps every intermediate result, from the raw CSV input to the export-ready CSV, as objects in memory, consuming approximately 3-4 GB of RAM. This high memory usage becomes problematic when processing large datasets or working on machines with limited resources.

Proposed Solution

Enhance the preprocessing class to offer multiple memory management strategies through a configurable option:

  1. In-memory processing (current approach) - keeps all data in memory for the fastest processing
  2. File-based incremental processing - saves intermediate results to disk between processing steps
  3. Stream-based processing - processes data in chunks without keeping the full dataset in memory

The implementation should allow users to select the appropriate strategy based on their hardware constraints and performance needs.

Alternatives Considered

The initial workaround considered was simply saving processed data to a file between steps and reloading it, but this is inefficient because the data must be read back from disk before STFT processing. It also doesn't address the fundamental memory usage pattern.

Another alternative would be to completely rewrite the pipeline to use generators and streaming, but this would require significant code restructuring.

Component

Python Source Code

Priority

Medium (nice to have)

Implementation Ideas

  1. Add a memory_strategy parameter to the preprocessing class constructor with options:

    • "in_memory" (default): current implementation
    • "disk_based": save intermediate results to disk
    • "chunked": process in fixed-size chunks
  2. For disk-based strategy:

    • Implement temporary file management with context managers
    • Use memory-mapped files (np.memmap) for large arrays
    • Add cleanup methods to remove temporary files
  3. For chunked strategy:

    • Implement generator-based processing functions
    • Define appropriate chunk sizes based on typical memory constraints
    • Add progress tracking for chunk-based processing
  4. Add memory usage estimation method to help users select appropriate strategy
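The disk-based bullets above (temporary file management via context managers, np.memmap for large arrays, cleanup) could be combined into one small helper. A minimal sketch, where `temp_memmap` is a hypothetical name and float32 is an assumed default dtype:

```python
import os
import tempfile
from contextlib import contextmanager

import numpy as np

@contextmanager
def temp_memmap(shape, dtype=np.float32):
    """Yield a writable array backed by a temporary file on disk.

    The array behaves like a normal ndarray but pages data to/from
    disk, and the backing file is removed when the block exits.
    """
    fd, path = tempfile.mkstemp(suffix=".dat")
    os.close(fd)
    arr = None
    try:
        arr = np.memmap(path, dtype=dtype, mode="w+", shape=shape)
        yield arr
    finally:
        if arr is not None:
            arr.flush()  # persist pending writes before cleanup
            del arr
        os.remove(path)
```

With something like `with temp_memmap((n_frames, n_bins)) as spec:`, an intermediate STFT array never has to fit entirely in RAM, and the temp file cannot be leaked on an exception.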

Expected Benefits

  1. Enable processing of larger datasets than currently possible with available RAM
  2. Make the preprocessing pipeline more flexible for different computing environments
  3. Provide options that balance between speed (in-memory) and memory efficiency (chunked)
  4. Allow thesis experiments to run on computers with limited RAM

Additional Context

This enhancement complements the ongoing work on STFT processing. While fixing the export shape bug, we should consider implementing this memory optimization to make the entire pipeline more robust.

Initial benchmarking suggests that even with the file-based approach, the overall processing time might only increase by 15-20%, which is an acceptable tradeoff for the memory benefits.

Sample code for implementation:

import os
import tempfile
import pandas as pd

class DataProcessor:
    def __init__(self, memory_strategy="in_memory", chunk_size=1000):
        """
        Initialize the data processor with a memory strategy.

        Parameters
        ----------
        memory_strategy : str
            "in_memory": Keep all data in memory (fast but memory intensive)
            "disk_based": Save intermediate results to disk
            "chunked": Process data in chunks
        chunk_size : int
            Number of rows per chunk when using chunked processing
        """
        self.memory_strategy = memory_strategy
        self.chunk_size = chunk_size

    def process_file(self, input_file, output_file):
        if self.memory_strategy == "in_memory":
            self._process_in_memory(input_file, output_file)
        elif self.memory_strategy == "disk_based":
            self._process_disk_based(input_file, output_file)
        elif self.memory_strategy == "chunked":
            self._process_chunked(input_file, output_file)
        else:
            raise ValueError(f"Unknown memory strategy: {self.memory_strategy}")

    def _process_in_memory(self, input_file, output_file):
        # Load the whole file, transform, write once (fast, memory hungry).
        self._transform(pd.read_csv(input_file)).to_csv(output_file, index=False)

    def _process_disk_based(self, input_file, output_file):
        # Park intermediate results in a temporary file between steps.
        fd, tmp_path = tempfile.mkstemp(suffix=".csv")
        os.close(fd)
        try:
            pd.read_csv(input_file).to_csv(tmp_path, index=False)
            self._transform(pd.read_csv(tmp_path)).to_csv(output_file, index=False)
        finally:
            os.remove(tmp_path)

    def _process_chunked(self, input_file, output_file):
        # Stream the file; only chunk_size rows are in memory at a time.
        for i, chunk in enumerate(pd.read_csv(input_file, chunksize=self.chunk_size)):
            self._transform(chunk).to_csv(output_file, mode="w" if i == 0 else "a", header=(i == 0), index=False)

    def _transform(self, df):
        # Placeholder for the actual preprocessing/STFT steps.
        return df
Reference: nuluh/thesis#45