diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..44fcab8 --- /dev/null +++ b/.gitattributes @@ -0,0 +1 @@ +*.ipynb filter=nbstripout \ No newline at end of file diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml new file mode 100644 index 0000000..020dfcd --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -0,0 +1,115 @@ +name: Bug Report +description: Report a bug or unexpected behavior +title: "[BUG] " +labels: ["bug"] +assignees: + - ${{github.actor}} +body: + - type: markdown + attributes: + value: | + Thanks for taking the time to fill out this bug report! + + - type: textarea + id: description + attributes: + label: Bug Description + description: A clear and concise description of what the bug is + placeholder: When I run the script, it crashes when processing large datasets... + validations: + required: true + + - type: textarea + id: reproduction + attributes: + label: Steps to Reproduce + description: Steps to reproduce the behavior + placeholder: | + 1. Go to notebook '...' + 2. Run cell #... + 3. See error + validations: + required: true + + - type: textarea + id: expected + attributes: + label: Expected Behavior + description: What did you expect to happen? + placeholder: The analysis should complete successfully and generate the visualization + validations: + required: true + + - type: textarea + id: actual + attributes: + label: Actual Behavior + description: What actually happened? + placeholder: The script crashes with a memory error after processing 1000 samples + validations: + required: true + + - type: textarea + id: logs + attributes: + label: Error Logs + description: Paste any relevant logs or error messages + render: shell + placeholder: | + Traceback (most recent call last): + File "script.py", line 42, in + main() + File "script.py", line 28, in main + process_data(data) + MemoryError: ... 
+ validations: + required: false + + - type: dropdown + id: component + attributes: + label: Component + description: Which part of the thesis project is affected? + options: + - LaTeX Document + - Python Source Code + - Jupyter Notebook + - Data Processing + - ML Model + - Visualization + - Build/Environment + validations: + required: true + + - type: input + id: version + attributes: + label: Version/Commit + description: Which version or commit hash are you using? + placeholder: v0.2.3 or 8d5b9a7 + validations: + required: true + + - type: textarea + id: environment + attributes: + label: Environment + description: Information about your environment + placeholder: | + - OS: [e.g. Ubuntu 22.04] + - Python: [e.g. 3.9.5] + - Relevant packages and versions: + - numpy: 1.22.3 + - scikit-learn: 1.0.2 + - tensorflow: 2.9.1 + validations: + required: false + + - type: textarea + id: additional + attributes: + label: Additional Context + description: Any other context or screenshots about the problem + placeholder: Add any other context about the problem here... 
+ validations: + required: false diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml new file mode 100644 index 0000000..0de487b --- /dev/null +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -0,0 +1,12 @@ +blank_issues_enabled: false +contact_links: + - name: Documentation + url: ../docs/README.md + about: Check the documentation before creating an issue + +# Template configurations +templates: + - name: bug_report.yml + - name: feature_request.yml + - name: experiment.yml + - name: documentation.yml diff --git a/.github/ISSUE_TEMPLATE/documentation.yml b/.github/ISSUE_TEMPLATE/documentation.yml new file mode 100644 index 0000000..4b4d1e0 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/documentation.yml @@ -0,0 +1,116 @@ +name: Documentation +description: Improvements or additions to documentation +title: "[DOC] " +labels: ["documentation"] +assignees: + - ${{github.actor}} +body: + - type: markdown + attributes: + value: | + Use this template for documentation-related tasks for your thesis project. + + - type: dropdown + id: doc_type + attributes: + label: Documentation Type + description: What type of documentation is this issue about? + options: + - Thesis Chapter/Section + - Code Documentation + - Experiment Documentation + - README/Project Documentation + - Literature Review + - Methodology Description + - Results Analysis + - API Reference + validations: + required: true + + - type: textarea + id: description + attributes: + label: Description + description: Describe what needs to be documented + placeholder: Need to document the data preprocessing pipeline including all transformation steps and rationale + validations: + required: true + + - type: textarea + id: current_state + attributes: + label: Current State + description: What's the current state of the documentation (if any)? 
+ placeholder: Currently there are some comments in the code but no comprehensive documentation of the preprocessing steps + validations: + required: false + + - type: textarea + id: proposed_changes + attributes: + label: Proposed Changes + description: What specific documentation changes do you want to make? + placeholder: | + 1. Create a dedicated markdown file describing each preprocessing step + 2. Add docstrings to all preprocessing functions + 3. Create a diagram showing the data flow + 4. Document parameter choices and their justification + validations: + required: true + + - type: input + id: location + attributes: + label: Documentation Location + description: Where will this documentation be stored? + placeholder: docs/data_preprocessing.md or src/preprocessing/README.md + validations: + required: true + + - type: dropdown + id: priority + attributes: + label: Priority + description: How important is this documentation? + options: + - Critical (required for thesis) + - High (important for understanding) + - Medium (helpful but not urgent) + - Low (nice to have) + validations: + required: true + + - type: dropdown + id: audience + attributes: + label: Target Audience + description: Who is the primary audience for this documentation? + options: + - Thesis Committee/Reviewers + - Future Self + - Other Researchers + - Technical Readers + - Non-technical Readers + - Multiple Audiences + validations: + required: true + + - type: textarea + id: references + attributes: + label: References + description: Any papers, documentation or other materials related to this documentation task + placeholder: | + - Smith et al. (2022). 
"Best practices in machine learning documentation" + - Code in src/preprocessing/normalize.py + validations: + required: false + + - type: textarea + id: notes + attributes: + label: Additional Notes + description: Any other relevant information + placeholder: This documentation will be referenced in Chapter 3 of the thesis + validations: + required: false diff --git a/.github/ISSUE_TEMPLATE/experiment.yml b/.github/ISSUE_TEMPLATE/experiment.yml new file mode 100644 index 0000000..55d530d --- /dev/null +++ b/.github/ISSUE_TEMPLATE/experiment.yml @@ -0,0 +1,124 @@ +# .github/ISSUE_TEMPLATE/experiment.yml +name: Experiment +description: Document a new ML experiment +title: "[EXP] " +labels: ["experiment"] +assignees: + - ${{github.actor}} +body: + - type: markdown + attributes: + value: | + Use this template to document a new experiment for your thesis. + + - type: textarea + id: hypothesis + attributes: + label: Hypothesis + description: What is the hypothesis you're testing with this experiment? + placeholder: Using a deeper network with residual connections will improve accuracy on the imbalanced dataset without increasing overfitting + validations: + required: true + + - type: textarea + id: background + attributes: + label: Background & Motivation + description: Background context and why this experiment is important + placeholder: Previous experiments showed promising results but suffered from overfitting. Recent literature suggests that... + validations: + required: true + + - type: textarea + id: dataset + attributes: + label: Dataset + description: What data will you use for this experiment? + placeholder: | + - Dataset: MNIST with augmentation + - Preprocessing: Standardization + random rotation + - Train/Test Split: 80/20 + - Validation strategy: 5-fold cross-validation + validations: + required: true + + - type: textarea + id: methodology + attributes: + label: Methodology + description: How will you conduct the experiment? + placeholder: | + 1. 
Implement ResNet architecture with varying depths (18, 34, 50) + 2. Train with early stopping (patience=10) + 3. Compare against baseline CNN from experiment #23 + 4. Analyze learning curves and performance metrics + validations: + required: true + + - type: textarea + id: parameters + attributes: + label: Parameters & Hyperparameters + description: List the key parameters for this experiment + placeholder: | + - Learning rate: 0.001 with Adam optimizer + - Batch size: 64 + - Epochs: Max 100 with early stopping + - Dropout rate: 0.3 + - L2 regularization: 1e-4 + validations: + required: true + + - type: textarea + id: metrics + attributes: + label: Evaluation Metrics + description: How will you evaluate the results? + placeholder: | + - Accuracy + - F1-score (macro-averaged) + - ROC-AUC + - Training vs. validation loss curves + - Inference time + validations: + required: true + + - type: input + id: notebook + attributes: + label: Notebook Location + description: Where will the experiment notebook be stored? + placeholder: notebooks/experiment_resnet_comparison.ipynb + validations: + required: false + + - type: textarea + id: dependencies + attributes: + label: Dependencies + description: What other issues or tasks does this experiment depend on? + placeholder: | + - Depends on issue #42 (Data preprocessing pipeline) + - Requires completion of issue #51 (Baseline model) + validations: + required: false + + - type: textarea + id: references + attributes: + label: References + description: Any papers, documentation or other materials relevant to this experiment + placeholder: | + - He et al. (2016). "Deep Residual Learning for Image Recognition" + - My previous experiment #23 (baseline CNN) + validations: + required: false + + - type: textarea + id: notes + attributes: + label: Additional Notes + description: Any other relevant information + placeholder: This experiment may require significant GPU resources. Expected runtime is ~3 hours on Tesla V100. 
+ validations: + required: false diff --git a/.github/ISSUE_TEMPLATE/feature_request.yml b/.github/ISSUE_TEMPLATE/feature_request.yml new file mode 100644 index 0000000..7147326 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.yml @@ -0,0 +1,99 @@ +# .github/ISSUE_TEMPLATE/feature_request.yml +name: Feature Request +description: Suggest a new feature or enhancement +title: "[FEAT] " +labels: ["enhancement"] +assignees: + - ${{github.actor}} +body: + - type: markdown + attributes: + value: | + Thanks for taking the time to propose a new feature! + + - type: textarea + id: problem + attributes: + label: Problem Statement + description: What problem are you trying to solve with this feature? + placeholder: I'm frustrated when trying to analyze different model results because I need to manually compare them... + validations: + required: true + + - type: textarea + id: solution + attributes: + label: Proposed Solution + description: Describe the solution you'd like to implement + placeholder: Create a visualization utility that automatically compares results across multiple models and experiments + validations: + required: true + + - type: textarea + id: alternatives + attributes: + label: Alternatives Considered + description: Describe alternatives you've considered + placeholder: I considered using an external tool, but integrating directly would provide better workflow + validations: + required: false + + - type: dropdown + id: component + attributes: + label: Component + description: Which part of the thesis project would this feature affect? + options: + - LaTeX Document + - Python Source Code + - Jupyter Notebook + - Data Processing + - ML Model + - Visualization + - Build/Environment + - Multiple Components + validations: + required: true + + - type: dropdown + id: priority + attributes: + label: Priority + description: How important is this feature for your thesis progression? 
+ options: + - Critical (blocks progress) + - High (significantly improves workflow) + - Medium (nice to have) + - Low (minor improvement) + validations: + required: true + + - type: textarea + id: implementation + attributes: + label: Implementation Ideas + description: Any initial thoughts on how to implement this feature? + placeholder: | + - Could use matplotlib's subplot feature + - Would need to standardize the model output format + - Should include statistical significance tests + validations: + required: false + + - type: textarea + id: benefits + attributes: + label: Expected Benefits + description: How will this feature benefit your thesis work? + placeholder: This will save time in analysis and provide more consistent comparisons across experiments + validations: + required: true + + - type: textarea + id: additional + attributes: + label: Additional Context + description: Any other context, screenshots, or reference material + placeholder: Here's a paper that uses a similar approach... + validations: + required: false diff --git a/.gitmessage b/.gitmessage new file mode 100644 index 0000000..4afc0c8 --- /dev/null +++ b/.gitmessage @@ -0,0 +1,30 @@ +# .gitmessage + +# <type>(<scope>): <subject> +# |<---- Using a Maximum Of 50 Characters ---->| +# +# Explain the problem that this commit is solving. Focus on why you +# are making this change as opposed to how. Use clear, concise language. 
+# |<---- Try To Limit Each Line to a Maximum Of 72 Characters ---->| +# +# -- COMMIT END -- +# Types: +# feat (new feature) +# fix (bug fix) +# refactor (refactoring code) +# style (formatting, no code change) +# doc (changes to documentation) +# test (adding or refactoring tests) +# perf (performance improvements) +# chore (routine tasks, dependencies) +# exp (experimental work/exploration) +# +# Scope: +# latex (changes to thesis LaTeX) +# src (changes to Python source code) +# nb (changes to notebooks) +# ml (ML model specific changes) +# data (data processing/preparation) +# viz (visualization related) +# all (changes spanning entire repository) +# -------------------- \ No newline at end of file diff --git a/LICENSE b/LICENSE index e69de29..488f0a3 100644 --- a/LICENSE +++ b/LICENSE @@ -0,0 +1,7 @@ +Copyright 2024 Rifqi D. Panuluh + +All Rights Reserved. + +This repository is for viewing purposes only. No part of this repository, including but not limited to the code, files, and documentation, may be copied, reproduced, modified, or distributed in any form or by any means without the prior written permission of the copyright holder. + +Unauthorized use, distribution, or modification of this repository may result in legal action. diff --git a/README.md b/README.md index e69de29..29b847c 100644 --- a/README.md +++ b/README.md @@ -0,0 +1,18 @@ +## Summary + +This repository contains the work related to my thesis, which focuses on damage localization prediction. The research explores the application of machine learning techniques to structural health monitoring. + +**Note:** This repository does not contain the secondary data used in the analysis. The code is designed to work with data from the [QUGS (Qatar University Grandstand Simulator)](https://www.structuralvibration.com/benchmark/qugs/) dataset, which is not included here. + +The repository is private and access is restricted only to those who have been given explicit permission by the owner. 
Access is provided solely for the purpose of brief review or seeking technical guidance. + +## Restrictions + +- **No Derivative Works or Cloning:** Any form of copying, cloning, or creating derivative works based on this repository is strictly prohibited. +- **Limited Access:** Use beyond brief review or collaboration is not allowed without prior permission from the owner. + +--- + +All contents of this repository, including the thesis idea, code, and associated data, are copyrighted © 2024 by Rifqi Panuluh. Unauthorized use or duplication is prohibited. + +[LICENSE](https://github.com/nuluh/thesis?tab=License-1-ov-file#readme) diff --git a/code/notebooks/03_feature_extraction.ipynb b/code/notebooks/03_feature_extraction.ipynb index 77c576d..e4f6154 100644 --- a/code/notebooks/03_feature_extraction.ipynb +++ b/code/notebooks/03_feature_extraction.ipynb @@ -157,6 +157,19 @@ "source": [ "# Define a function to extract numbers from a filename that later used as labels features\n", "def extract_numbers(filename):\n", + " '''\n", + " Extract numbers from a filename\n", + "\n", + " Parameters\n", + " ----------\n", + " filename : str\n", + " The filename to extract numbers from\n", + "\n", + " Returns\n", + " -------\n", + " list\n", + " A list of extracted numbers: [damage_number, test_number, sensor_number]\n", + " '''\n", " # Find all occurrences of one or more digits in the filename\n", " numbers = re.findall(r'\\d+', filename)\n", " # Convert the list of number strings to integers\n", @@ -168,6 +181,7 @@ " all_features = []\n", " for nth_damage in os.listdir(input_dir):\n", " nth_damage_path = os.path.join(input_dir, nth_damage)\n", + " print(f'Extracting features from damage folder {nth_damage_path}')\n", " if os.path.isdir(nth_damage_path):\n", " for nth_test in os.listdir(nth_damage_path):\n", " nth_test_path = os.path.join(nth_damage_path, nth_test)\n", @@ -348,6 +362,430 @@ "sns.pairplot(subset_df, hue='label', diag_kind='kde')\n", "plt.show()" ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "## QUGS Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This section tests the `FeatureExtractor` class from the `time_domain_features.py` script on real QUGS data that has been converted for use in this thesis." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Importing Modules\n", + "\n", + "Use relative imports or modify the path to include the directory where the module is stored. In this example, we’ll simulate the relative import setup." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create Real DataFrame\n", + "\n", + "Create one DataFrame from one of the raw data files. Simulate importing the `FeatureExtractor` from its relative path in the notebooks directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Convert to DataFrame (simulating processed data input)\n", + "single_data_dir = \"D:/thesis/data/converted/raw/DAMAGE_2/D2_TEST05_01.csv\"\n", + "df = pd.read_csv(single_data_dir)\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Take the absolute value of the data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[df.columns[-1]] = df[df.columns[-1]].abs()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Visualize Data Points" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# Plotting the data points\n", + "plt.figure(figsize=(8, 6))\n", + "plt.plot(df['Time'], df[df.columns[-1]], marker='o', color='blue', label='Data Points')\n",
"plt.title('Scatter Plot of Data Points')\n", + "plt.xlabel('Time')\n", + "plt.ylabel('Amp')\n", + "plt.legend()\n", + "plt.grid(True)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Downsampled Plot with Alpha Blending\n", + "\n", + "Reduce the number of data points by sampling a subset of the data and use transparency to help visualize the density of overlapping points." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# Downsample the data by taking every nth point\n", + "n = 1 # Adjust this value as needed\n", + "downsampled_df = df.iloc[::n, :]\n", + "\n", + "# Plotting the downsampled data points with alpha blending\n", + "plt.figure(figsize=(8, 6))\n", + "plt.plot(downsampled_df['Time'], downsampled_df[downsampled_df.columns[-1]], alpha=0.5, color='blue', label='Data Points')\n", + "plt.title('Scatter Plot of Downsampled Data Points')\n", + "plt.xlabel('Time')\n", + "plt.ylabel('Amp')\n", + "plt.legend()\n", + "plt.grid(True)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Line Plot with Rolling Avg" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# Calculate the rolling average\n", + "window_size = 50 # Adjust this value as needed\n", + "rolling_avg = df[df.columns[-1]].rolling(window=window_size).mean()\n", + "\n", + "# Plotting the original data points and the rolling average\n", + "plt.figure(figsize=(8, 6))\n", + "plt.plot(df['Time'], df[df.columns[-1]], alpha=0.3, color='blue', label='Original Data')\n", + "plt.plot(df['Time'], rolling_avg, color='red', label='Rolling Average')\n", + "plt.title('Line Plot with Rolling Average')\n", + "plt.xlabel('Time')\n", + "plt.ylabel('Amp')\n", + "plt.legend()\n", + "plt.grid(True)\n", + 
"plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Print Time-domain Features (Single CSV Real Data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import sys\n", + "import os\n", + "# Assuming the src directory is one level up from the notebooks directory\n", + "sys.path.append('../src/features')\n", + "from time_domain_features import FeatureExtractor\n", + "\n", + "\n", + "# Extract features\n", + "extracted = FeatureExtractor(df[df.columns[-1]])\n", + "\n", + "# Format with a pandas DataFrame\n", + "features = pd.DataFrame(extracted.features, index=[0])\n", + "features\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Print Time-domain Features (Multiple CSV Real Data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import sys\n", + "import os\n", + "import re\n", + "# Assuming the src directory is one level up from the notebooks directory\n", + "sys.path.append('../src/features')\n", + "from time_domain_features import ExtractTimeFeatures # use wrapper function instead of class for easy use\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### The function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Define a function to extract numbers from a filename; the numbers are later used as label features\n", + "def extract_numbers(filename):\n", + " '''\n", + " Extract numbers from a filename\n", + "\n", + " Parameters\n", + " ----------\n", + " filename : str\n", + " The filename to extract numbers from\n", + "\n", + " Returns\n", + " -------\n", + " list\n", + " A list of extracted numbers: [damage_number, test_number, sensor_number]\n", + " '''\n", + " # Find all occurrences of one or more digits in the filename\n",
"    numbers = re.findall(r'\\d+', filename)\n", + " # Convert the list of number strings to integers\n", + " numbers = [int(num) for num in numbers]\n", + " # Return the list of numbers\n", + " return numbers\n", + "\n", + "def build_features(input_dir:str, sensor:int=None, verbose:bool=False, absolute:bool=False):\n", + " all_features = []\n", + " for nth_damage in os.listdir(input_dir):\n", + " nth_damage_path = os.path.join(input_dir, nth_damage)\n", + " if verbose:\n", + " print(f'Extracting features from damage folder {nth_damage_path}')\n", + " if os.path.isdir(nth_damage_path):\n", + " for nth_test in os.listdir(nth_damage_path):\n", + " nth_test_path = os.path.join(nth_damage_path, nth_test)\n", + " # if verbose:\n", + " # print(f'Extracting features from {nth_test_path}')\n", + " if sensor is not None:\n", + " # Check if the file has the specified sensor suffix\n", + " if not nth_test.endswith(f'_{sensor:02}.csv'):\n", + " continue\n", + " # if verbose:\n", + " # print(f'Extracting features from {nth_test_path}')\n", + " features = ExtractTimeFeatures(nth_test_path, absolute=absolute) # returns the features of one CSV file as a dictionary\n", + " if verbose:\n", + " print(features)\n", + " features['label'] = extract_numbers(nth_test)[0] # add the damage label to the dictionary\n", + " features['filename'] = nth_test # add the filename to the dictionary\n", + " all_features.append(features)\n", + "\n", + " # Create a DataFrame from the list of dictionaries\n", + " df = pd.DataFrame(all_features)\n", + " return df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Execute the automation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_dir = \"D:/thesis/data/converted/raw\"\n", + "# Extract features\n", + "df1 = build_features(data_dir, sensor=1, verbose=True, absolute=True)\n", + "df2 = build_features(data_dir, sensor=2, verbose=True, absolute=True)" + ] + }, 
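As a quick sanity check of the filename parsing used by `build_features` above, the regex-based extraction can be exercised standalone. This is a condensed sketch of the notebook's `extract_numbers`, not the thesis module itself; the sample filenames are the QUGS-style names that appear in the cells above.

```python
import re

def extract_numbers(filename):
    # Same parsing as in the notebook: every run of digits becomes one integer
    return [int(num) for num in re.findall(r'\d+', filename)]

# Damage case 2, test 5, sensor 1
print(extract_numbers('D2_TEST05_01.csv'))       # -> [2, 5, 1]
# Damage case 1, test 1, sensor 2
print(extract_numbers('DAMAGE_1_TEST1_02.csv'))  # -> [1, 1, 2]
```

Note that the first extracted number is the damage case, which is why `extract_numbers(nth_test)[0]` is used as the label.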
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df1.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Assuming your DataFrame is named 'df'\n", + "\n", + "# Subsetting the DataFrame to include only the first 3 columns and the label\n", + "subset_df = df1[['Mean', 'Max', 'Peak (Pm)', 'label']]\n", + "\n", + "# Plotting the pairplot\n", + "g = sns.pairplot(subset_df, hue='label', diag_kind='kde')\n", + "\n", + "# Adjusting the axis limits\n", + "# for ax in g.axes.flatten():\n", + "# ax.set_xlim(-10, 10) # Adjust these limits based on your data\n", + "# ax.set_ylim(-10, 10) # Adjust these limits based on your data\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df2.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Assuming your DataFrame is named 'df'\n", + "\n", + "# Subsetting the DataFrame to include only the first 3 columns and the label\n", + "subset_df = df2[['Mean', 'Max', 'Standard Deviation', 'Kurtosis', 'label']]\n", + "\n", + "# Plotting the pairplot\n", + "g = sns.pairplot(subset_df, hue='label', diag_kind='kde')\n", + "\n", + "# Adjusting the axis limits\n", + "# for ax in g.axes.flatten():\n", + "# ax.set_xlim(-10, 10) # Adjust these limits based on your data\n", + "# ax.set_ylim(-10, 10) # Adjust these limits based on your data\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Perform division" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Separate the label column\n", + 
"label_column = df1.iloc[:, -2]\n", + "\n", + "# Compute relative features by element-wise division of the two sensors' features\n", + "df_relative = df2.iloc[:, :-2] / df1.iloc[:, :-2]\n", + "\n", + "# Add the label column back to the resulting DataFrame\n", + "df_relative['label'] = label_column\n", + "\n", + "# Append a suffix to all feature column names\n", + "suffix = '_rel'\n", + "df_relative.columns = [col + suffix if col != 'label' else col for col in df_relative.columns]\n", + "\n", + "# Display the resulting DataFrame\n", + "df_relative" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Subset the DataFrame for pair plots, since there are many features" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Using the relative-feature DataFrame 'df_relative'\n", + "\n", + "# Subset the DataFrame to three representative features plus the label\n", + "subset_df = df_relative[['Mean_rel', 'Max_rel', 'Peak (Pm)_rel', 'label']]\n", + "\n", + "# Plotting the pairplot\n", + "g = sns.pairplot(subset_df, hue='label', diag_kind='kde')\n", + "\n", + "# Adjusting the axis limits\n", + "# for ax in g.axes.flatten():\n", + "# ax.set_xlim(-10, 10) # Adjust these limits based on your data\n", + "# ax.set_ylim(-10, 10) # Adjust these limits based on your data\n", + "\n", + "plt.show()" + ] + } ], "metadata": { diff --git a/code/notebooks/stft.ipynb b/code/notebooks/stft.ipynb new file mode 100644 index 0000000..b1c16b3 --- /dev/null +++ b/code/notebooks/stft.ipynb @@ -0,0 +1,857 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sensor1 = 
pd.read_csv('D:/thesis/data/converted/raw/DAMAGE_1/DAMAGE_1_TEST1_01.csv',sep=',')\n", + "sensor2 = pd.read_csv('D:/thesis/data/converted/raw/DAMAGE_1/DAMAGE_1_TEST1_02.csv',sep=',')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sensor1.columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df1 = pd.DataFrame()\n", + "df1['s1'] = sensor1[sensor1.columns[-1]]\n", + "df1['s2'] = sensor2[sensor2.columns[-1]]\n", + "df1\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def merge_two_sensors(damage_path, damage):\n", + " df = pd.DataFrame()\n", + " for file in os.listdir(damage_path):\n", + " pattern = re.compile(r'DAMAGE_\\d+_TEST\\d+_\\d{2}\\.csv')\n", + " try:\n", + " assert pattern.match(file), f\"File {file} does not match the required format, skipping...\"\n", + " # assert \"TEST01\" in file, f\"File {file} does not contain 'TEST01', skipping...\" #TODO: should be trained using the whole test file\n", + " print(f\"Processing file: {file}\")\n", + " # Append the full path of the file to sensor1 or sensor2 based on the filename\n", + " if file.endswith('_01.csv'):\n", + " df['sensor 1'] = pd.read_csv(os.path.join('D:/thesis/data/converted/raw', damage, file), sep=',', usecols=[1])\n", + " elif file.endswith('_02.csv'):\n", + " df['sensor 2'] = pd.read_csv(os.path.join('D:/thesis/data/converted/raw', damage, file), sep=',', usecols=[1])\n", + " except AssertionError as e:\n", + " print(e)\n", + " continue # Skip to the next iteration\n", + " return df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import re\n", + "\n", + "df = []\n", + "for damage in os.listdir('D:/thesis/data/converted/raw'):\n", + " damage_path = os.path.join('D:/thesis/data/converted/raw', damage)\n", + " 
df.append(merge_two_sensors(damage_path, damage))\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "len(df)\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Combined plot of sensor 1 and sensor 2 from the DAMAGE_1 TEST1 files\n", + "\n", + "plt.plot(df1['s2'], label='sensor 2')\n", + "plt.plot(df1['s1'], label='sensor 1', alpha=0.5)\n", + "plt.xlabel(\"Number of samples\")\n", + "plt.ylabel(\"Amplitude\")\n", + "plt.title(\"Raw vibration signal\")\n", + "plt.ylim(-7.5, 5)\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "signal_sensor1_test1 = []\n", + "signal_sensor2_test1 = []\n", + "\n", + "for data in df:\n", + " signal_sensor1_test1.append(data['sensor 1'].values)\n", + " signal_sensor2_test1.append(data['sensor 2'].values)\n", + "\n", + "print(len(signal_sensor1_test1))\n", + "print(len(signal_sensor2_test1))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Applying Short-Time Fourier Transform (STFT)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "os.getcwd()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import pandas as pd\n", + "import numpy as np\n", + "from scipy.signal import stft\n", + "from scipy.signal.windows import hann # window functions live in scipy.signal.windows in current SciPy\n", + "from multiprocessing import Pool\n", + "\n", + "\n", + "\n", + "# Function to compute and append STFT data\n", + "def process_stft(args):\n", + " # Define STFT parameters\n", + " window_size = 1024\n", + " hop_size = 512\n", + " window = hann(window_size)\n", + "\n", + " Fs = 1024 # Sampling frequency in Hz\n", + " \n", + " damage_num, test_num, sensor_suffix = args\n", + " sensor_name = 
active_sensors[sensor_suffix]\n", + " sensor_num = sensor_suffix[-1] # '1' or '2'\n", + " \n", + " # Construct the file path\n", + " file_name = f'DAMAGE_{damage_num}_TEST{test_num}_{sensor_suffix}.csv'\n", + " file_path = os.path.join(damage_base_path, f'DAMAGE_{damage_num}', file_name)\n", + " \n", + " # Check if the file exists\n", + " if not os.path.isfile(file_path):\n", + " print(f\"File {file_path} does not exist. Skipping...\")\n", + " return\n", + " \n", + " # Read the CSV\n", + " try:\n", + " df = pd.read_csv(file_path)\n", + " except Exception as e:\n", + " print(f\"Error reading {file_path}: {e}. Skipping...\")\n", + " return\n", + " \n", + " # Ensure the CSV has exactly two columns\n", + " if df.shape[1] != 2:\n", + " print(f\"Unexpected number of columns in {file_path}. Skipping...\")\n", + " return\n", + " \n", + " # Extract sensor data\n", + " sensor_column = df.columns[1]\n", + " sensor_data = df[sensor_column].values\n", + " \n", + " # Compute STFT\n", + " frequencies, times, Zxx = stft(sensor_data, fs=Fs, window=window, nperseg=window_size, noverlap=window_size - hop_size)\n", + " magnitude = np.abs(Zxx)\n", + " flattened_stft = magnitude.flatten()\n", + " \n", + " # Define the output CSV file path\n", + " stft_file_name = f'stft_data{sensor_num}_{damage_num}.csv'\n", + " sensor_output_dir = os.path.join(damage_base_path, sensor_name.lower())\n", + " os.makedirs(sensor_output_dir, exist_ok=True)\n", + " stft_file_path = os.path.join(sensor_output_dir, stft_file_name)\n", + " print(stft_file_path)\n", + " # Append the flattened STFT to the CSV\n", + " try:\n", + " flattened_stft_df = pd.DataFrame([flattened_stft])\n", + " if not os.path.isfile(stft_file_path):\n", + " # Create a new CSV\n", + " flattened_stft_df.to_csv(stft_file_path, index=False, header=False)\n", + " else:\n", + " # Append to existing CSV\n", + " flattened_stft_df.to_csv(stft_file_path, mode='a', index=False, header=False)\n", + " print(f\"Appended STFT data to 
{stft_file_path}\")\n",
+    "    except Exception as e:\n",
+    "        print(f\"Error writing to {stft_file_path}: {e}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define the base path where DAMAGE_X folders are located\n",
+    "damage_base_path = 'D:/thesis/data/converted/raw/'\n",
+    "\n",
+    "# Define active sensors\n",
+    "active_sensors = {\n",
+    "    '01': 'sensor1',  # Beginning map sensor\n",
+    "    '02': 'sensor2'   # End map sensor\n",
+    "}\n",
+    "\n",
+    "# Define damage cases and test runs\n",
+    "damage_cases = range(1, 7)  # Adjust based on actual number of damage cases\n",
+    "test_runs = range(1, 6)  # TEST01 to TEST05\n",
+    "args_list = []\n",
+    "\n",
+    "# Prepare the list of arguments for parallel processing\n",
+    "for damage_num in damage_cases:\n",
+    "    for test_num in test_runs:\n",
+    "        for sensor_suffix in active_sensors.keys():\n",
+    "            args_list.append((damage_num, test_num, sensor_suffix))\n",
+    "\n",
+    "print(len(args_list))\n",
+    "args_list"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Process STFTs sequentially instead of in parallel\n",
+    "if __name__ == \"__main__\":\n",
+    "    print(\"Starting sequential STFT processing...\")\n",
+    "    for i, arg in enumerate(args_list, 1):\n",
+    "        process_stft(arg)\n",
+    "        print(f\"Processed {i}/{len(args_list)} files\")\n",
+    "    print(\"STFT processing completed.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from scipy.signal import stft\n",
+    "from scipy.signal.windows import hann\n",
+    "\n",
+    "# Applying STFT\n",
+    "vibration_data = signal_sensor1_test1[1]\n",
+    "window_size = 1024\n",
+    "hop_size = 512\n",
+    "window = hann(window_size)  # Creating a Hanning window\n",
+    "Fs = 1024\n",
+    "\n",
+    "frequencies, times, Zxx = stft(vibration_data,\n",
+    "                               fs=Fs,\n",
+    "                               window=window,\n",
+    "                               nperseg=window_size,\n",
+    "                               noverlap=window_size - hop_size)\n",
+    "# Plotting the STFT Data\n",
+    "plt.pcolormesh(times, frequencies, np.abs(Zxx), shading='gouraud')\n",
+    "plt.title('STFT Magnitude for case 1, sensor 1 signal')\n",
+    "plt.ylabel('Frequency [Hz]')\n",
+    "plt.xlabel('Time [sec]')\n",
+    "plt.show()\n",
+    "\n",
+    "# Inspect the size of the STFT grid\n",
+    "print(len(frequencies))\n",
+    "print(len(times))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading STFT Data from CSV Files"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "os.listdir('D:/thesis/data/working')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "ready_data1 = []\n",
+    "for file in os.listdir('D:/thesis/data/working/sensor1'):\n",
+    "    ready_data1.append(pd.read_csv(os.path.join('D:/thesis/data/working/sensor1', file)))\n",
+    "# ready_data1[1]\n",
+    "# TODO: for pcolormesh, add a title, set x as frequency and y as time, and rotate/transpose the data\n",
+    "# Plotting the STFT Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ready_data1[1]\n",
+    "plt.pcolormesh(ready_data1[1])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for i in range(6):\n",
+    "    plt.pcolormesh(ready_data1[i])\n",
+    "    plt.title(f'STFT Magnitude for case {i} sensor 1')\n",
+    "    plt.xlabel('Frequency [Hz]')\n",
+    "    plt.ylabel('Time [sec]')\n",
+    "    plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ready_data2 = []\n",
+    "for file in os.listdir('D:/thesis/data/working/sensor2'):\n",
+    "    ready_data2.append(pd.read_csv(os.path.join('D:/thesis/data/working/sensor2', file)))\n",
+    "ready_data2[5]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   
"execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(len(ready_data1))\n",
+    "print(len(ready_data2))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "x1 = 0\n",
+    "\n",
+    "for i in range(len(ready_data1)):\n",
+    "    print(ready_data1[i].shape)\n",
+    "    x1 = x1 + ready_data1[i].shape[0]\n",
+    "\n",
+    "print(x1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "x2 = 0\n",
+    "\n",
+    "for i in range(len(ready_data2)):\n",
+    "    print(ready_data2[i].shape)\n",
+    "    x2 = x2 + ready_data2[i].shape[0]\n",
+    "\n",
+    "print(x2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Stack the per-damage-case STFT rows into a single feature matrix\n",
+    "x1 = np.concatenate(ready_data1, axis=0)\n",
+    "print(type(x1))\n",
+    "pd.DataFrame(x1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "x2 = np.concatenate(ready_data2, axis=0)\n",
+    "pd.DataFrame(x2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(x1.shape)\n",
+    "print(x2.shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Integer class labels, one per damage case\n",
+    "y_1 = 0\n",
+    "y_2 = 1\n",
+    "y_3 = 2\n",
+    "y_4 = 3\n",
+    "y_5 = 4\n",
+    "y_6 = 5"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "y_data 
= [y_1, y_2, y_3, y_4, y_5, y_6]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for i in range(len(y_data)):\n", + " print(ready_data1[i].shape[0])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for i in range(len(y_data)):\n", + " y_data[i] = [y_data[i]]*ready_data1[i].shape[0]\n", + " y_data[i] = np.array(y_data[i])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y_data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y = y_data[0]\n", + "\n", + "for i in range(len(y_data) - 1):\n", + " #print(i)\n", + " y = np.concatenate((y, y_data[i+1]), axis=0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(y.shape)\n", + "print(np.unique(y))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "x_train1, x_test1, y_train, y_test = train_test_split(x1, y, test_size=0.2, random_state=2)\n", + "x_train2, x_test2, y_train, y_test = train_test_split(x2, y, test_size=0.2, random_state=2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "from sklearn.metrics import accuracy_score\n", + "from sklearn.ensemble import RandomForestClassifier, BaggingClassifier\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.neighbors import KNeighborsClassifier\n", + "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", + "from sklearn.svm import SVC\n", + "from xgboost import XGBClassifier" + ] + }, + { + "cell_type": "code", + "execution_count": null, + 
"metadata": {},
+   "outputs": [],
+   "source": [
+    "# Check the shapes of x_train and y_train\n",
+    "print(\"Shape of x1_train:\", x_train1.shape)\n",
+    "print(\"Shape of x2_train:\", x_train2.shape)\n",
+    "print(\"Shape of y_train:\", y_train.shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "accuracies1 = []\n",
+    "accuracies2 = []\n",
+    "\n",
+    "def format_acc(acc):\n",
+    "    # Highlight accuracies above 90% in green\n",
+    "    return f\"\\033[92m{acc:.2f}\\033[00m\" if acc > 90 else f\"{acc:.2f}\"\n",
+    "\n",
+    "classifiers = [\n",
+    "    (\"Random Forest\", RandomForestClassifier()),\n",
+    "    (\"Bagged Trees\", BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10)),\n",
+    "    (\"Decision Tree\", DecisionTreeClassifier()),\n",
+    "    (\"KNeighbors\", KNeighborsClassifier()),\n",
+    "    (\"Linear Discriminant Analysis\", LinearDiscriminantAnalysis()),\n",
+    "    (\"Support Vector Machine\", SVC()),\n",
+    "    (\"XGBoost\", XGBClassifier()),\n",
+    "]\n",
+    "\n",
+    "for name, model in classifiers:\n",
+    "    # Fit and score on sensor 1, then refit the same estimator on sensor 2\n",
+    "    model.fit(x_train1, y_train)\n",
+    "    acc1 = accuracy_score(y_test, model.predict(x_test1)) * 100\n",
+    "    accuracies1.append(acc1)\n",
+    "    print(f\"{name} Accuracy for sensor 1:\", format_acc(acc1))\n",
+    "\n",
+    "    model.fit(x_train2, y_train)\n",
+    "    acc2 = accuracy_score(y_test, model.predict(x_test2)) * 100\n",
+    "    accuracies2.append(acc2)\n",
+    "    print(f\"{name} Accuracy for sensor 2:\", format_acc(acc2))\n",
+    "\n",
+    "# Keep the individual model names used by later cells\n",
+    "(rf_model, bagged_model, dt_model, knn_model,\n",
+    " lda_model, svm_model, xgboost_model) = (model for _, model in classifiers)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(accuracies1)\n",
+    "print(accuracies2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "models = [rf_model, bagged_model, dt_model, knn_model, lda_model, svm_model, 
xgboost_model]\n",
+    "model_names = [\"Random Forest\", \"Bagged Trees\", \"Decision Tree\", \"KNN\", \"LDA\", \"SVM\", \"XGBoost\"]\n",
+    "\n",
+    "bar_width = 0.35  # Width of each bar\n",
+    "index = np.arange(len(model_names))  # Index for the bars\n",
+    "\n",
+    "# Plotting the bar graph\n",
+    "plt.figure(figsize=(14, 8))\n",
+    "\n",
+    "# Bar plot for Sensor 1\n",
+    "plt.bar(index, accuracies1, width=bar_width, color='blue', label='Sensor 1')\n",
+    "\n",
+    "# Bar plot for Sensor 2\n",
+    "plt.bar(index + bar_width, accuracies2, width=bar_width, color='orange', label='Sensor 2')\n",
+    "\n",
+    "# Add values on top of each bar, with the same vertical offset for both sensors\n",
+    "for i, acc1, acc2 in zip(index, accuracies1, accuracies2):\n",
+    "    plt.text(i, acc1 + 1, f'{acc1:.2f}%', ha='center', va='bottom', color='black')\n",
+    "    plt.text(i + bar_width, acc2 + 1, f'{acc2:.2f}%', ha='center', va='bottom', color='black')\n",
+    "\n",
+    "# Customize the plot\n",
+    "plt.xlabel('Model Name →')\n",
+    "plt.ylabel('Accuracy →')\n",
+    "plt.title('Accuracy of classifiers for Sensors 1 and 2 with 513 features')\n",
+    "plt.xticks(index + bar_width / 2, model_names)  # Set x-tick positions\n",
+    "plt.legend()\n",
+    "plt.ylim(0, 100)\n",
+    "\n",
+    "# Show the plot\n",
+    "plt.show()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import os\n",
+    "import matplotlib.pyplot as plt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from scipy.signal import stft\n",
+    "from scipy.signal.windows import hann\n",
+    "\n",
+    "def spectograph(data_dir: str):\n",
+    "    # Plot an STFT spectrogram for every sensor CSV under data_dir,\n",
+    "    # using the same STFT parameters as the cells above\n",
+    "    window_size = 1024\n",
+    "    hop_size = 512\n",
+    "    window = hann(window_size)  # Creating a Hanning window\n",
+    "    for damage in os.listdir(data_dir):\n",
+    "        d = os.path.join(data_dir, damage)\n",
+    "        if not os.path.isdir(d):\n",
+    "            continue\n",
+    "        for file in os.listdir(d):\n",
+    "            f = os.path.join(d, file)\n",
+    "            print(f)\n",
+    "            vibration_data = pd.read_csv(f, sep=',').iloc[:, 1].values\n",
+    "            frequencies, times, Zxx = stft(vibration_data, fs=1024, window=window,\n",
+    "                                           nperseg=window_size, noverlap=window_size - hop_size)\n",
+    "            plt.pcolormesh(times, frequencies, np.abs(Zxx), shading='gouraud')\n",
+    "            plt.title(f'STFT Magnitude for {file}')\n",
+    "            plt.ylabel('Frequency [Hz]')\n",
+    "            plt.xlabel('Time [sec]')\n",
+    "            plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "spectograph('D:/thesis/data/converted/raw')"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/code/src/features/frequency_domain_features.py b/code/src/features/frequency_domain_features.py
new file mode 100644
index 0000000..e918b60
--- /dev/null
+++ 
b/code/src/features/frequency_domain_features.py
@@ -0,0 +1,192 @@
+import numpy as np
+import pandas as pd
+from scipy.fft import fft
+
+# Sampling interval passed to fftfreq (kept exactly as in the original code: 0.1 / 25600 s)
+SAMPLE_SPACING = .1 / 25600
+
+def _frame_spectra(signal, frame_size, hop_length):
+    # Yield (f, y) per frame: one-sided frequency bins and normalized FFT magnitudes
+    for i in range(0, len(signal), hop_length):
+        frame = signal[i:i + frame_size]
+        L = len(frame)
+        y = np.abs(np.fft.fft(frame / L))[:L // 2]
+        f = np.fft.fftfreq(L, SAMPLE_SPACING)[:L // 2]
+        yield f, y
+
+def get_mean_freq(signal, frame_size, hop_length):
+    return np.array([np.sum(y) / frame_size
+                     for f, y in _frame_spectra(signal, frame_size, hop_length)])
+
+def get_variance_freq(signal, frame_size, hop_length):
+    out = []
+    for f, y in _frame_spectra(signal, frame_size, hop_length):
+        mean = np.sum(y) / frame_size
+        out.append(np.sum((y - mean) ** 2) / (frame_size - 1))
+    return np.array(out)
+
+def get_third_freq(signal, frame_size, hop_length):
+    # Skewness of the magnitude spectrum per frame
+    out = []
+    for f, y in _frame_spectra(signal, frame_size, hop_length):
+        mean = np.sum(y) / frame_size
+        sigma = np.sqrt(np.sum((y - mean) ** 2) / (frame_size - 1))
+        out.append(np.sum((y - mean) ** 3) / (frame_size * sigma ** 3))
+    return np.array(out)
+
+def get_forth_freq(signal, frame_size, hop_length):
+    # Kurtosis of the magnitude spectrum per frame
+    out = []
+    for f, y in _frame_spectra(signal, frame_size, hop_length):
+        mean = np.sum(y) / frame_size
+        var = np.sum((y - mean) ** 2) / (frame_size - 1)
+        out.append(np.sum((y - mean) ** 4) / (frame_size * var ** 2))
+    return np.array(out)
+
+def get_grand_freq(signal, frame_size, hop_length):
+    # Spectral centroid: amplitude-weighted mean frequency
+    return np.array([np.sum(f * y) / np.sum(y)
+                     for f, y in _frame_spectra(signal, frame_size, hop_length)])
+
+def get_std_freq(signal, frame_size, hop_length):
+    out = []
+    for f, y in _frame_spectra(signal, frame_size, hop_length):
+        grand = np.sum(f * y) / np.sum(y)
+        out.append(np.sqrt(np.sum((f - grand) ** 2 * y) / frame_size))
+    return np.array(out)
+
+def get_Cfactor_freq(signal, frame_size, hop_length):
+    return np.array([np.sqrt(np.sum(f ** 2 * y) / np.sum(y))
+                     for f, y in _frame_spectra(signal, frame_size, hop_length)])
+
+def get_Dfactor_freq(signal, frame_size, hop_length):
+    return np.array([np.sqrt(np.sum(f ** 4 * y) / np.sum(f ** 2 * y))
+                     for f, y in _frame_spectra(signal, frame_size, hop_length)])
+
+def get_Efactor_freq(signal, frame_size, hop_length):
+    return np.array([np.sqrt(np.sum(f ** 2 * y) / np.sqrt(np.sum(y) * np.sum(f ** 4 * y)))
+                     for f, y in _frame_spectra(signal, frame_size, hop_length)])
+
+def get_Gfactor_freq(signal, frame_size, hop_length):
+    # Coefficient of variation of the spectrum: std_freq / grand_freq
+    out = []
+    for f, y in _frame_spectra(signal, frame_size, hop_length):
+        grand = np.sum(f * y) / np.sum(y)
+        std = np.sqrt(np.sum((f - grand) ** 2 * y) / frame_size)
+        out.append(std / grand)
+    return np.array(out)
+
+def get_third1_freq(signal, frame_size, hop_length):
+    # Amplitude-weighted skewness about the spectral centroid
+    out = []
+    for f, y in _frame_spectra(signal, frame_size, hop_length):
+        grand = np.sum(f * y) / np.sum(y)
+        std = np.sqrt(np.sum((f - grand) ** 2 * y) / frame_size)
+        out.append(np.sum((f - grand) ** 3 * y) / (frame_size * std ** 3))
+    return np.array(out)
+
+def get_forth1_freq(signal, frame_size, hop_length):
+    # Amplitude-weighted kurtosis about the spectral centroid
+    out = []
+    for f, y in _frame_spectra(signal, frame_size, hop_length):
+        grand = np.sum(f * y) / np.sum(y)
+        std = np.sqrt(np.sum((f - grand) ** 2 * y) / frame_size)
+        out.append(np.sum((f - grand) ** 4 * y) / (frame_size * std ** 4))
+    return np.array(out)
+
+def get_Hfactor_freq(signal, frame_size, hop_length):
+    out = []
+    for f, y in _frame_spectra(signal, frame_size, hop_length):
+        grand = np.sum(f * y) / np.sum(y)
+        std = np.sqrt(np.sum((f - grand) ** 2 * y) / frame_size)
+        out.append(np.sum(np.sqrt(abs(f - grand)) * y) / (frame_size * np.sqrt(std)))
+    return np.array(out)
+
+def get_Jfactor_freq(signal, frame_size, hop_length):
+    # Note: identical to get_Hfactor_freq in the original implementation
+    return get_Hfactor_freq(signal, frame_size, hop_length)
+
+class FrequencyFeatureExtractor:
+    def __init__(self, data):
+        # Assuming data is a numpy array
+        self.x = data
+        # Perform FFT and compute magnitude of frequency components
+        self.frequency_spectrum = 
np.abs(fft(self.x))
+        self.n = len(self.frequency_spectrum)
+        self.mean_freq = np.mean(self.frequency_spectrum)
+        self.variance_freq = np.var(self.frequency_spectrum)
+        self.std_freq = np.std(self.frequency_spectrum)
+
+        # Calculate the required frequency features
+        self.features = self.calculate_features()
+
+    def calculate_features(self):
+        S_mu = self.mean_freq
+        S_MAX = np.max(self.frequency_spectrum)
+        S_SBP = np.sum(self.frequency_spectrum)
+        S_Peak = np.max(self.frequency_spectrum)  # same value as S_MAX; kept to mirror the feature list
+        S_V = np.sum((self.frequency_spectrum - S_mu) ** 2) / (self.n - 1)
+        S_Sigma = np.sqrt(S_V)
+        S_Skewness = np.sum((self.frequency_spectrum - S_mu) ** 3) / (self.n * S_Sigma ** 3)
+        S_Kurtosis = np.sum((self.frequency_spectrum - S_mu) ** 4) / (self.n * S_Sigma ** 4)
+        S_RSPPB = S_Peak / S_mu
+
+        return {
+            'Mean of band Power Spectrum (S_mu)': S_mu,
+            'Max of band power spectrum (S_MAX)': S_MAX,
+            'Sum of total band power (S_SBP)': S_SBP,
+            'Peak of band power (S_Peak)': S_Peak,
+            'Variance of band power (S_V)': S_V,
+            'Standard Deviation of band power (S_Sigma)': S_Sigma,
+            'Skewness of band power (S_Skewness)': S_Skewness,
+            'Kurtosis of band power (S_Kurtosis)': S_Kurtosis,
+            'Relative Spectral Peak per Band Power (S_RSPPB)': S_RSPPB
+        }
+
+    def __repr__(self):
+        result = \"Frequency Domain Feature Extraction Results:\\n\"
+        for feature, value in self.features.items():
+            result += f\"{feature}: {value:.4f}\\n\"
+        return result
+
+def ExtractFrequencyFeatures(csv_path):
+    data = pd.read_csv(csv_path, skiprows=1)  # Skip the header row separator char info
+    extractor = FrequencyFeatureExtractor(data.iloc[:, 1].values)  # Assuming the data is in the second column
+    features = extractor.features
+    return features
+
+# Usage Example
+# features = ExtractFrequencyFeatures('path_to_your_data.csv')
+# print(features)
diff --git a/code/src/features/time_domain_features.py b/code/src/features/time_domain_features.py
index 1ef4ace..f97e0d1 100644
--- a/code/src/features/time_domain_features.py
+++ b/code/src/features/time_domain_features.py
@@ -36,9 +36,12 @@ class FeatureExtractor:
         result += f\"{feature}: {value:.4f}\\n\"
         return result
 
-def ExtractTimeFeatures(object):
+def ExtractTimeFeatures(object, absolute=False):
     data = pd.read_csv(object, skiprows=1)  # Skip the header row separator char info
-    extractor = FeatureExtractor(data.iloc[:, 1].values)  # Assuming the data is in the second column
+    if absolute:
+        extractor = FeatureExtractor(np.abs(data.iloc[:, 1].values))  # Assuming the data is in the second column
+    else:
+        extractor = FeatureExtractor(data.iloc[:, 1].values)
     features = extractor.features
     return features
 # Save features to a file
diff --git a/code/src/process_stft.py b/code/src/process_stft.py
new file mode 100644
index 0000000..1de44b4
--- /dev/null
+++ b/code/src/process_stft.py
@@ -0,0 +1,115 @@
+import os
+import pandas as pd
+import numpy as np
+from scipy.signal import stft
+from scipy.signal.windows import hann  # hann lives in scipy.signal.windows in current SciPy
+import multiprocessing
+
+# Define the base directory where DAMAGE_X folders are located
+damage_base_path = 'D:/thesis/data/converted/raw'
+
+# Define output directories for each sensor
+output_dirs = {
+    'sensor1': os.path.join(damage_base_path, 'sensor1'),
+    'sensor2': os.path.join(damage_base_path, 'sensor2')
+}
+
+# Create output directories if they don't exist
+for dir_path in output_dirs.values():
+    os.makedirs(dir_path, exist_ok=True)
+
+# Define STFT parameters
+window_size = 1024
+hop_size = 512
+window = hann(window_size)
+Fs = 1024
+
+# Number of damage cases (adjust as needed)
+num_damage_cases = 6  # Change to 30 if you have 30 damage cases
+
+# Number of test runs per damage case
+num_test_runs = 5
+
+# Function to perform STFT and return magnitude
+def compute_stft(vibration_data):
+    frequencies, times, Zxx = stft(
+        vibration_data,
+        fs=Fs,
+        window=window,
+        nperseg=window_size,
+        noverlap=window_size - hop_size
+    )
+    stft_magnitude = np.abs(Zxx)
+    return stft_magnitude.T  # Transpose to have 
frequencies as columns + +def process_damage_case(damage_num): + damage_folder = os.path.join(damage_base_path, f'DAMAGE_{damage_num}') + + # Check if the damage folder exists + if not os.path.isdir(damage_folder): + print(f"Folder {damage_folder} does not exist. Skipping...") + return + + # Process Sensor 1 and Sensor 2 separately + for sensor_num in [1, 2]: + aggregated_stft = [] # List to hold STFTs from all test runs + + # Iterate over all test runs + for test_num in range(1, num_test_runs + 1): + # Construct the filename based on sensor number + # Sensor 1 corresponds to '_01', Sensor 2 corresponds to '_02' + sensor_suffix = f'_0{sensor_num}' + file_name = f'DAMAGE_{damage_num}_TEST{test_num}{sensor_suffix}.csv' + file_path = os.path.join(damage_folder, file_name) + + # Check if the file exists + if not os.path.isfile(file_path): + print(f"File {file_path} does not exist. Skipping...") + continue + + # Read the CSV file + try: + df = pd.read_csv(file_path) + except Exception as e: + print(f"Error reading {file_path}: {e}. Skipping...") + continue + + # Ensure the CSV has exactly two columns: 'Timestamp (s)' and 'Sensor X' + if df.shape[1] != 2: + print(f"Unexpected number of columns in {file_path}. Expected 2, got {df.shape[1]}. 
Skipping...") + continue + + # Extract vibration data (assuming the second column is sensor data) + vibration_data = df.iloc[:, 1].values + + # Perform STFT + stft_magnitude = compute_stft(vibration_data) + + # Convert STFT result to DataFrame + df_stft = pd.DataFrame( + stft_magnitude, + columns=[f"Freq_{freq:.2f}" for freq in np.linspace(0, Fs/2, stft_magnitude.shape[1])] + ) + + # Append to the aggregated list + aggregated_stft.append(df_stft) + + # Concatenate all STFT DataFrames vertically + if aggregated_stft: + df_aggregated = pd.concat(aggregated_stft, ignore_index=True) + + # Define output filename + output_file = os.path.join( + output_dirs[f'sensor{sensor_num}'], + f'stft_data{sensor_num}_{damage_num}.csv' + ) + + # Save the aggregated STFT to CSV + df_aggregated.to_csv(output_file, index=False) + print(f"Saved aggregated STFT for Sensor {sensor_num}, Damage {damage_num} to {output_file}") + else: + print(f"No STFT data aggregated for Sensor {sensor_num}, Damage {damage_num}.") + +if __name__ == "__main__": # Added main guard for multiprocessing + with multiprocessing.Pool() as pool: + pool.map(process_damage_case, range(1, num_damage_cases + 1)) diff --git a/code/src/verify_stft.py b/code/src/verify_stft.py new file mode 100644 index 0000000..1831c3e --- /dev/null +++ b/code/src/verify_stft.py @@ -0,0 +1,133 @@ +import os +import pandas as pd +import numpy as np +from scipy.signal import stft, hann +import glob + +# Define the base directory where DAMAGE_X folders are located +damage_base_path = 'D:/thesis/data/converted/raw/' + +# Define sensor directories +sensor_dirs = { + 'sensor1': os.path.join(damage_base_path, 'sensor1'), + 'sensor2': os.path.join(damage_base_path, 'sensor2') +} + +# Define STFT parameters +window_size = 1024 +hop_size = 512 +window = hann(window_size) +Fs = 1024 + +def verify_stft(damage_num, test_num, sensor_num): + """ + Verifies the STFT of an individual test run against the aggregated STFT data. 

    Parameters:
    - damage_num (int): Damage case number.
    - test_num (int): Test run number.
    - sensor_num (int): Sensor number (1 or 2).
    """
    # Mapping sensor number to suffix
    sensor_suffix = f'_0{sensor_num}'

    # Construct the file name for the individual test run
    individual_file_name = f'DAMAGE_{damage_num}_TEST{test_num}{sensor_suffix}.csv'
    individual_file_path = os.path.join(damage_base_path, f'DAMAGE_{damage_num}', individual_file_name)

    # Check if the individual file exists
    if not os.path.isfile(individual_file_path):
        print(f"File {individual_file_path} does not exist. Skipping verification for this test run.")
        return

    # Read the individual test run CSV
    try:
        df_individual = pd.read_csv(individual_file_path)
    except Exception as e:
        print(f"Error reading {individual_file_path}: {e}. Skipping verification for this test run.")
        return

    # Ensure the CSV has exactly two columns: 'Timestamp (s)' and 'Sensor X'
    if df_individual.shape[1] != 2:
        print(f"Unexpected number of columns in {individual_file_path}. Expected 2, got {df_individual.shape[1]}. Skipping.")
        return

    # Extract vibration data
    vibration_data = df_individual.iloc[:, 1].values

    # Perform STFT
    frequencies, times, Zxx = stft(
        vibration_data,
        fs=Fs,
        window=window,
        nperseg=window_size,
        noverlap=window_size - hop_size
    )

    # Compute magnitude and transpose to (frames, frequency bins); 513 bins for nperseg=1024
    stft_magnitude = np.abs(Zxx).T

    # Select random row indices to verify (e.g., 3 random rows)
    np.random.seed(42)  # For reproducibility
    sample_row_indices = np.random.choice(stft_magnitude.shape[0], size=3, replace=False)

    # Read the aggregated STFT CSV
    aggregated_file_name = f'stft_data{sensor_num}_{damage_num}.csv'
    aggregated_file_path = os.path.join(sensor_dirs[f'sensor{sensor_num}'], aggregated_file_name)

    if not os.path.isfile(aggregated_file_path):
        print(f"Aggregated file {aggregated_file_path} does not exist. Skipping verification for this test run.")
        return

    try:
        df_aggregated = pd.read_csv(aggregated_file_path)
    except Exception as e:
        print(f"Error reading {aggregated_file_path}: {e}. Skipping verification for this test run.")
        return

    # Calculate the starting row index in the aggregated CSV
    # Each test run contributes 513 rows
    start_row = (test_num - 1) * 513
    end_row = start_row + 513  # Exclusive

    # Ensure the aggregated CSV has enough rows
    if df_aggregated.shape[0] < end_row:
        print(f"Aggregated file {aggregated_file_path} does not have enough rows for Test {test_num}. Skipping.")
        return

    # Extract the corresponding STFT block from the aggregated CSV
    df_aggregated_block = df_aggregated.iloc[start_row:end_row].values  # Shape: (513, 513)

    # Compare selected rows
    all_match = True
    for row_idx in sample_row_indices:
        individual_row = stft_magnitude[row_idx]
        aggregated_row = df_aggregated_block[row_idx]

        # Check if the rows are almost equal within a tolerance
        if np.allclose(individual_row, aggregated_row, atol=1e-6):
            verification_status = "MATCH"
        else:
            verification_status = "MISMATCH"
            all_match = False

        # Print the comparison details
        print(f"Comparing Damage {damage_num}, Test {test_num}, Sensor {sensor_num}, Row {row_idx}: {verification_status}")
        print(f"Individual STFT Row {row_idx}: {individual_row[:5]} ... {individual_row[-5:]}")
        print(f"Aggregated STFT Row {row_idx + start_row}: {aggregated_row[:5]} ... {aggregated_row[-5:]}\n")

    # If all sampled rows match, print a verification success message
    if all_match:
        print(f"STFT of DAMAGE_{damage_num}_TEST{test_num}{sensor_suffix}.csv is verified: rows {start_row} to {end_row} (exclusive; 513 rows) of `stft_data{sensor_num}_{damage_num}.csv`.\n")
    else:
        print(f"STFT of DAMAGE_{damage_num}_TEST{test_num}{sensor_suffix}.csv has discrepancies in rows {start_row} to {end_row} (exclusive; 513 rows) of `stft_data{sensor_num}_{damage_num}.csv`.\n")

# Define the number of damage cases and test runs
num_damage_cases = 6  # Adjust to 30 as per your dataset
num_test_runs = 5

# Iterate through all damage cases, test runs, and sensors
for damage_num in range(1, num_damage_cases + 1):
    for test_num in range(1, num_test_runs + 1):
        for sensor_num in [1, 2]:
            verify_stft(damage_num, test_num, sensor_num)
diff --git a/data/QUGS/convert.py b/data/QUGS/convert.py new file mode 100644 index 0000000..071a537 --- /dev/null +++ b/data/QUGS/convert.py @@ -0,0 +1,68 @@
import pandas as pd
import os
import sys
from colorama import Fore, Style, init

def create_damage_files(base_path, output_base, prefix):
    # Initialize colorama
    init(autoreset=True)

    # Generate column labels based on expected duplication in input files
    columns = ['Real'] + [f'Real.{i}' for i in range(1, 30)]  # Names pandas assigns to the duplicated 'Real' headers

    sensor_end_map = {1: 'Real.25', 2: 'Real.26', 3: 'Real.27', 4: 'Real.28', 5: 'Real.29'}

    # Define the damage scenarios and the corresponding original file indices
    damage_scenarios = {
        1: range(1, 6),    # Damage 1 files from zzzAD1.csv to zzzAD5.csv
        2: range(6, 11),   # Damage 2 files from zzzAD6.csv to zzzAD10.csv
        3: range(11, 16),  # Damage 3 files from zzzAD11.csv to zzzAD15.csv
        4: range(16, 21),  # Damage 4 files from zzzAD16.csv to zzzAD20.csv
        5: range(21, 26),  # Damage 5 files from zzzAD21.csv to zzzAD25.csv
        6: range(26, 31)   # Damage 6 files from zzzAD26.csv to zzzAD30.csv
    }
    damage_pad = len(str(len(damage_scenarios)))
    test_pad = len(str(30))

    for damage, files in damage_scenarios.items():
        for i, file_index in enumerate(files, start=1):
            # Load original data file
            file_path = os.path.join(base_path, f'zzz{prefix}D{file_index}.TXT')
            df = pd.read_csv(file_path, sep='\t', skiprows=10)  # Header row follows the 10-line preamble

            top_sensor = columns[i-1]
            print(top_sensor, type(top_sensor))
            output_file_1 = os.path.join(output_base, f'DAMAGE_{damage}', f'DAMAGE{damage}_TEST{i}_01.csv')
            print(f"Creating {output_file_1} from zzz{prefix}D{file_index}.TXT")
            print("Taking datetime column on index 0...")
            print(f"Taking `{top_sensor}`...")
            df[['Time', top_sensor]].to_csv(output_file_1, index=False)
            print(Fore.GREEN + "Done")

            bottom_sensor = sensor_end_map[i]
            output_file_2 = os.path.join(output_base, f'DAMAGE_{damage}', f'DAMAGE{damage}_TEST{i}_02.csv')
            print(f"Creating {output_file_2} from zzz{prefix}D{file_index}.TXT")
            print("Taking datetime column on index 0...")
            print(f"Taking `{bottom_sensor}`...")
            df[['Time', bottom_sensor]].to_csv(output_file_2, index=False)
            print(Fore.GREEN + "Done")
            print("---")

def main():
    if len(sys.argv) < 4:
        print("Usage: python convert.py <base_path> <output_base> <prefix>")
        sys.exit(1)

    base_path = sys.argv[1]
    output_base = sys.argv[2]
    prefix = sys.argv[3]

    # Create output folders if they don't exist (one per damage scenario)
    for i in range(1, 7):
        os.makedirs(os.path.join(output_base, f'DAMAGE_{i}'), exist_ok=True)

    create_damage_files(base_path, output_base, prefix)
    print(Fore.YELLOW + Style.BRIGHT + "All files have been created successfully.")

if __name__ == "__main__":
    main()
diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md new file mode 100644 index 0000000..3daa472 --- /dev/null +++ b/docs/CONTRIBUTING.md @@ -0,0 +1,66 @@
This document outlines the process for developing and contributing to my own thesis project. Following these guidelines ensures consistent quality and maintains a clear development history.

## Development Workflow

### 1. Issue Creation
Before working on any new feature, experiment, or bug fix:
- Create a GitHub issue using the appropriate template
- Assign it to myself
- Add relevant labels
- Link it to the project board if applicable

### 2. Branching Strategy
Use the following branch naming convention:
- `feature/<issue-number>-short-description`
- `bugfix/<issue-number>-short-description`
- `experiment/<issue-number>-short-description`
- `doc/<issue-number>-short-description`

Always branch from `main` for new features/experiments.

### 3. Development Process
- Make regular, atomic commits following the commit message template
- Include the issue number in commit messages (e.g., "#42")
- Push changes at the end of each work session

### 4. Code Quality
- Follow PEP 8 guidelines for Python code
- Document functions with docstrings
- Maintain test coverage for custom functions
- Keep notebooks clean and well-documented

### 5. Pull Requests
Even when working alone, use PRs for significant changes:
- Create a PR from your feature branch to `main`
- Reference the issue(s) it resolves
- Include a summary of changes
- Self-review the PR before merging

### 6. Versioning
Follow semantic versioning:
- Major version: significant thesis milestones or structural changes
- Minor version: new experiments, features, or chapters
- Patch version: bug fixes and minor improvements

### 7. Documentation
Update documentation with each significant change:
- Keep the README current
- Update function documentation
- Maintain clear experiment descriptions in notebooks
- Record significant decisions and findings

## LaTeX Guidelines
- Use a consistent citation style
- Break long sections into multiple files
- Use meaningful label names for cross-references
- Consider version-control-friendly LaTeX practices (one sentence per line)

## Experiment Tracking
For each experiment:
- Create an issue documenting the experiment design
- Reference related papers and previous experiments
- Document parameters and results in the notebook
- Summarize findings in the issue before closing

## Commit Categories
Use the categories defined in the commit template to clearly classify changes.
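The branch-then-commit workflow described above can be exercised end-to-end in a throwaway repository. This is a minimal sketch; the issue number (42), branch name, and file name are hypothetical examples, not part of the thesis project:

```shell
# Walk through the branching and commit conventions in a scratch repo.
# Issue #42 and the branch/file names below are illustrative only.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email "author@example.com"
git config user.name "Thesis Author"
git commit -q --allow-empty -m "chore: initial commit"
git branch -M main                      # ensure the base branch is named `main`
# Branch from main using the feature/<issue-number>-short-description convention
git checkout -q -b feature/42-stft-aggregation
echo "# aggregate STFT per sensor" > notes.md
git add notes.md
# Atomic commit that references the issue number
git commit -q -m "Add STFT aggregation notes (#42)"
git log --oneline -1
```

After merging such a branch back to `main` via a self-reviewed PR, the issue reference in the commit message links the change to its GitHub issue automatically.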