[FEAT] Implement memory-efficient data processing options for STFT pipeline #45

Open
opened 2025-04-20 07:00:27 +00:00 by nuluh · 0 comments
nuluh commented 2025-04-20 07:00:27 +00:00 (Migrated from github.com)

Problem Statement

The current data preprocessing pipeline keeps every intermediate result, from the raw CSV input to the export-ready CSV, as objects in memory, consuming approximately 3-4 GB of RAM. This high memory usage becomes problematic when processing large datasets or working on machines with limited resources.

Proposed Solution

Enhance the preprocessing class to offer multiple memory management strategies through a configurable option:

  1. In-memory processing (current approach) - keeps all data in memory for the fastest processing
  2. File-based incremental processing - saves intermediate results to disk between processing steps
  3. Stream-based processing - processes data in chunks without keeping the full dataset in memory

The implementation should allow users to select the appropriate strategy based on their hardware constraints and performance needs.

Alternatives Considered

The initial workaround considered was simply saving processed data to a file between steps and reloading it, but this is inefficient because the data must be read back from disk before STFT processing. It also doesn't address the fundamental memory usage pattern.

Another alternative would be to completely rewrite the pipeline to use generators and streaming, but this would require significant code restructuring.

Component

Python Source Code

Priority

Medium (nice to have)

Implementation Ideas

  1. Add a memory_strategy parameter to the preprocessing class constructor with options:

    • "in_memory" (default): current implementation
    • "disk_based": save intermediate results to disk
    • "chunked": process in fixed-size chunks
  2. For disk-based strategy:

    • Implement temporary file management with context managers
    • Use memory-mapped files (np.memmap) for large arrays
    • Add cleanup methods to remove temporary files
  3. For chunked strategy:

    • Implement generator-based processing functions
    • Define appropriate chunk sizes based on typical memory constraints
    • Add progress tracking for chunk-based processing
  4. Add memory usage estimation method to help users select appropriate strategy
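The disk-based bullets above (temporary file management via context managers, np.memmap for large arrays, cleanup) could be combined into one small helper. A minimal sketch, where `temp_memmap` is a hypothetical name and float32 is an assumed default dtype:

```python
import os
import tempfile
from contextlib import contextmanager

import numpy as np

@contextmanager
def temp_memmap(shape, dtype=np.float32):
    """Yield a writable array backed by a temporary file on disk.

    The array behaves like a normal ndarray but pages data to/from
    disk, and the backing file is removed when the block exits.
    """
    fd, path = tempfile.mkstemp(suffix=".dat")
    os.close(fd)
    arr = None
    try:
        arr = np.memmap(path, dtype=dtype, mode="w+", shape=shape)
        yield arr
    finally:
        if arr is not None:
            arr.flush()  # persist pending writes before cleanup
            del arr
        os.remove(path)
```

With something like `with temp_memmap((n_frames, n_bins)) as spec:`, an intermediate STFT array never has to fit entirely in RAM, and the temp file cannot be leaked on an exception.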

Expected Benefits

  1. Enable processing of larger datasets than currently possible with available RAM
  2. Make the preprocessing pipeline more flexible for different computing environments
  3. Provide options that balance between speed (in-memory) and memory efficiency (chunked)
  4. Allow thesis experiments to run on computers with limited RAM

Additional Context

This enhancement complements the ongoing work on STFT processing. While fixing the export shape bug, we should consider implementing this memory optimization to make the entire pipeline more robust.

Initial benchmarking suggests that even with the file-based approach, the overall processing time might only increase by 15-20%, which is an acceptable tradeoff for the memory benefits.

Sample code for implementation:

import os
import tempfile
import pandas as pd

class DataProcessor:
    def __init__(self, memory_strategy="in_memory", chunk_size=1000):
        """
        Initialize the data processor with a memory strategy.

        Parameters
        ----------
        memory_strategy : str
            "in_memory": Keep all data in memory (fast but memory intensive)
            "disk_based": Save intermediate results to disk
            "chunked": Process data in chunks
        chunk_size : int
            Number of rows per chunk when using chunked processing
        """
        self.memory_strategy = memory_strategy
        self.chunk_size = chunk_size

    def process_file(self, input_file, output_file):
        if self.memory_strategy == "in_memory":
            self._process_in_memory(input_file, output_file)
        elif self.memory_strategy == "disk_based":
            self._process_disk_based(input_file, output_file)
        elif self.memory_strategy == "chunked":
            self._process_chunked(input_file, output_file)
        else:
            raise ValueError(f"Unknown memory strategy: {self.memory_strategy}")

    def _process_in_memory(self, input_file, output_file):
        # Load the whole file, transform, write once (fast, memory hungry).
        self._transform(pd.read_csv(input_file)).to_csv(output_file, index=False)

    def _process_disk_based(self, input_file, output_file):
        # Park intermediate results in a temporary file between steps.
        fd, tmp_path = tempfile.mkstemp(suffix=".csv")
        os.close(fd)
        try:
            pd.read_csv(input_file).to_csv(tmp_path, index=False)
            self._transform(pd.read_csv(tmp_path)).to_csv(output_file, index=False)
        finally:
            os.remove(tmp_path)

    def _process_chunked(self, input_file, output_file):
        # Stream the file; only chunk_size rows are in memory at a time.
        for i, chunk in enumerate(pd.read_csv(input_file, chunksize=self.chunk_size)):
            self._transform(chunk).to_csv(output_file, mode="w" if i == 0 else "a", header=(i == 0), index=False)

    def _transform(self, df):
        # Placeholder for the actual preprocessing/STFT steps.
        return df
Reference: nuluh/thesis#45