[FEAT] Implement file status tracking to prevent processing incomplete CSVs #46

Open
opened 2025-04-20 07:15:37 +00:00 by nuluh · 0 comments
nuluh commented 2025-04-20 07:15:37 +00:00 (Migrated from github.com)

Problem Statement

During STFT processing, errors can occur when processing incomplete or corrupted CSV files, particularly when a previous process was interrupted. There's currently no mechanism to track file completion status, which can lead to shape mismatches when attempting operations like pandas.DataFrame.to_csv(mode='a') on partially processed files.
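For context, the failure mode is easy to reproduce: if a previous run was killed mid-write, the CSV can end with a truncated row, and pandas pads the missing fields with NaN instead of failing loudly, so a later append compounds the corruption. A minimal sketch (file contents are illustrative):

```python
import io

import pandas as pd

# Simulates a CSV whose final row was cut off by an interrupted write:
# the header promises three columns, but the last row only has two.
truncated = "a,b,c\n1,2,3\n4,5\n"

df = pd.read_csv(io.StringIO(truncated))
# pandas pads the missing trailing field with NaN rather than raising,
# so a subsequent to_csv(mode='a') appends onto a corrupted file silently.
print(df.shape)              # (2, 3)
print(df["c"].isna().any())  # True
```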

Proposed Solution

Implement a file status tracking system that:

  1. Marks files that are being processed with a temporary extension (e.g., .csv.temp)
  2. Renames files to their final extension (.csv) only after successful completion
  3. Adds validation checks before processing to detect and handle incomplete files
  4. Provides recovery options for incomplete files

Alternatives Considered

While exception handling could catch errors during processing, it's a reactive approach that doesn't prevent the initial processing attempt on incomplete files. Locking mechanisms could also be used but add complexity that may not be necessary for a single-user thesis project.

Component

Python Source Code

Priority

Medium (nice to have)

Implementation Ideas

  1. Create a file status manager class that handles:

    • Renaming files with temporary extensions during processing
    • Validating file status before processing
    • Safe completion of file operations
  2. Implementation approach:

    • Use atomic file operations for renaming
    • Add metadata about processing status in file headers or separate metadata files
    • Implement context managers for safe file handling
  3. Example pseudocode:

import glob
import os

class FileStatusManager:
    def __init__(self, base_path):
        self.base_path = base_path

    def start_processing(self, filename):
        """Mark file as being processed by adding the .incomplete extension."""
        incomplete_path = f"{filename}.incomplete"
        if os.path.exists(filename):
            os.rename(filename, incomplete_path)
        return incomplete_path

    def complete_processing(self, incomplete_path):
        """Mark file as successfully processed by removing the .incomplete extension."""
        final_path = incomplete_path.replace('.incomplete', '')
        os.rename(incomplete_path, final_path)
        return final_path

    def is_complete(self, filename):
        """Check that the file exists and no .incomplete marker is present."""
        return os.path.exists(filename) and not os.path.exists(f"{filename}.incomplete")

    def find_incomplete_files(self):
        """Find all incomplete files in base_path."""
        return glob.glob(f"{self.base_path}/*.incomplete")
  4. Example usage:
import pandas as pd

def process_data(input_file, output_file):
    file_manager = FileStatusManager('.')

    # Check if the output already exists and is complete
    if file_manager.is_complete(output_file):
        print(f"Output file {output_file} already exists and is complete.")
        return

    # Start processing and mark the output as incomplete
    temp_output = file_manager.start_processing(output_file)

    try:
        # Do processing
        df = pd.read_csv(input_file)
        # ... processing steps ...
        df.to_csv(temp_output, index=False)

        # Mark as complete
        file_manager.complete_processing(temp_output)
    except Exception as e:
        print(f"Error processing {input_file}: {e}")
        # File keeps its .incomplete extension so the run can be resumed

Expected Benefits

  1. Prevents processing errors by clearly identifying incomplete files
  2. Enables easy recovery from interrupted processing runs
  3. Provides visual indication (via file extensions) of processing status
  4. Makes the pipeline more robust against interruptions and crashes
  5. Simplifies debugging by preserving the state of interrupted operations

Additional Context

This feature would be particularly useful during long batch processing operations where interruptions are more likely. It would complement the memory optimization feature (issue #45) by adding another layer of robustness to the processing pipeline.
The implementation should be lightweight and not add significant overhead to the processing time. The focus should be on preventing data corruption and providing clear status indicators.
This feature request outlines a practical approach to preventing errors when processing incomplete files. It's a simple safeguard mechanism that can save time and frustration by making the data processing pipeline more resilient to interruptions.


Reference: nuluh/thesis#46