Add 'labels' Column to Time-domain Feature Extraction DataFrame #10

2024-08-20T06:11:35Z

nuluh commented

(Migrated from github.com)

Description

We need to include a 'labels' column in our feature extraction DataFrame to support downstream tasks such as training machine learning models. Currently, the DataFrame generated by the build_features function contains only the extracted features and carries no labels for them.

Expected Behavior

The DataFrame should include a 'labels' column in which each row holds the label of the dataset from which that row's features were extracted.

Current Behavior

The current implementation generates a DataFrame without a 'labels' column, so the DataFrame cannot be used directly in supervised learning scenarios. Here is the info() output of the current features DataFrame:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Mean                50 non-null     float64
 1   Max                 50 non-null     float64
 2   Peak (Pm)           50 non-null     float64
 3   Peak-to-Peak (Pk)   50 non-null     float64
 4   RMS                 50 non-null     float64
 5   Variance            50 non-null     float64
 6   Standard Deviation  50 non-null     float64
 7   Power               50 non-null     float64
 8   Crest Factor        50 non-null     float64
 9   Form Factor         50 non-null     float64
 10  Pulse Indicator     50 non-null     float64
 11  Margin              50 non-null     float64
 12  Kurtosis            50 non-null     float64
 13  Skewness            50 non-null     float64
dtypes: float64(14)
memory usage: 5.6 KB

Possible Solution

Modify the build_features function to append a 'labels' column to the DataFrame. This column could be derived from the directory names or a specific pattern in the filenames, depending on how our data is structured.
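
As a minimal sketch of the two options above, the helper below derives a label either from the parent directory name or from a regular-expression match on the filename. The function name `derive_label`, the `pattern` parameter, and the example paths are hypothetical; this issue does not specify the actual layout of the data.

```python
import re
from pathlib import Path
from typing import Optional


def derive_label(csv_path: str, pattern: Optional[str] = None) -> str:
    """Return a label for one CSV file (hypothetical helper).

    Without a pattern, the label is the name of the parent directory.
    With a pattern, the label is the first capture group matched
    against the filename.
    """
    path = Path(csv_path)
    if pattern is None:
        return path.parent.name
    match = re.search(pattern, path.name)
    if match is None:
        raise ValueError(f"No label found in filename: {path.name}")
    return match.group(1)


# Assumed layouts, for illustration only.
print(derive_label("data/DAMAGE_1/TEST_01.csv"))                 # DAMAGE_1
print(derive_label("data/D1_TEST01.csv", pattern=r"^(D\d+)_"))   # D1
```

Using the directory name needs no naming convention inside the files themselves, while the filename pattern would suit a flat directory; which one fits depends on how the data is organized.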

Steps to Reproduce

1. Run the build_features script with the current setup.
2. Observe that the resulting DataFrame, saved as combined_features.csv, does not include a 'labels' column.
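
The observation in the last step could also be checked programmatically. The sketch below reads only the header row of a CSV file and reports whether a given column is present; the function name `has_labels_column` is hypothetical, and the path to combined_features.csv depends on the output directory used.

```python
import csv


def has_labels_column(csv_path: str, column: str = "labels") -> bool:
    """Return True if the CSV header contains the given column name."""
    with open(csv_path, newline="") as f:
        # The first row of the file is the header written by to_csv.
        header = next(csv.reader(f), [])
    return column in header
```

With the current setup, calling `has_labels_column` on the generated combined_features.csv would be expected to return False.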

Context (Environment)

The feature extraction is crucial for our model training, and having labeled data is necessary for any supervised learning approach. The absence of labels impacts our ability to directly train models using the extracted features.

Possible Implementation

Here's a potential snippet for how we might modify the build_features function:

import os

import pandas as pd

# ExtractTimeFeatures is assumed to be defined elsewhere in the project
# and to return a dict mapping feature names to values for one CSV file.

def build_features(input_dir, output_dir):
    all_features = []  # One feature dict per CSV file
    for nth_damage in sorted(os.listdir(input_dir)):
        nth_damage_path = os.path.join(input_dir, nth_damage)
        if os.path.isdir(nth_damage_path):
            for nth_test in sorted(os.listdir(nth_damage_path)):
                nth_test_path = os.path.join(nth_damage_path, nth_test)
                if nth_test_path.endswith('.csv'):  # Skip non-CSV files
                    features = ExtractTimeFeatures(nth_test_path)
                    features['labels'] = nth_damage  # Assumes the label is the directory name
                    all_features.append(features)
    df = pd.DataFrame(all_features)
    os.makedirs(output_dir, exist_ok=True)  # Create the output directory if missing
    output_file_path = os.path.join(output_dir, 'combined_features.csv')
    df.to_csv(output_file_path, index=False)
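
Once a 'labels' column exists, the saved CSV could be split into a feature matrix and a label vector for supervised learning. The sketch below uses a small in-memory frame with made-up values in place of combined_features.csv; the helper name `split_features_labels` and the sample label strings are hypothetical.

```python
import pandas as pd


def split_features_labels(df: pd.DataFrame, label_col: str = "labels"):
    """Split a features DataFrame into (X, y); hypothetical helper."""
    if label_col not in df.columns:
        raise KeyError(f"Missing '{label_col}' column")
    X = df.drop(columns=[label_col])  # Feature columns only
    y = df[label_col]                 # Label column as a Series
    return X, y


# Made-up values standing in for pd.read_csv('combined_features.csv').
df = pd.DataFrame(
    {
        "Mean": [0.01, 0.03],
        "RMS": [0.12, 0.20],
        "labels": ["DAMAGE_1", "DAMAGE_2"],
    }
)
X, y = split_features_labels(df)
print(list(X.columns))  # ['Mean', 'RMS']
print(list(y))          # ['DAMAGE_1', 'DAMAGE_2']
```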



Reference: nuluh/thesis#10