lang-main/notebooks/archive/Analyse_4-1.ipynb
2024-08-07 20:06:06 +02:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Analyse 2-2**\n",
"\n",
"## Strategie & Fokus\n",
"\n",
"- Versuche Clustering bzw. Zusammenfassung von Begriffen (z.B. Prüfung, Prüfen, Überprüfung)\n",
"- Orientierung an Häufigkeitsverteilung: häufigere Begriffe zuerst analysieren"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Merkmal 1: Clustering von Vorgangsbeschreibungen\n",
"\n",
"## Recherche\n",
"[Textmining HS Hannover](https://textmining.wp.hs-hannover.de/Preprocessing.html)\n",
"\n",
"### Allgemeine Zergliederung der Einzelbeschreibungen\n",
"\n",
"- Text in Sätze\n",
"- Sätze in Wörter\n",
"- Wörter in Grundform:\n",
" - Lemma: Die Form des Wortes, wie sie in einem Wörterbuch steht. Z.B.: Haus, laufen, begründen\n",
" - Stamm: Das Wort ohne Flexionsendungen (Prefixe und Suffixe). Z.B.: Haus, lauf, begründ\n",
" - Wurzel: Kern des Wortes, von dem das Wort ggf. durch Derivation abgeleitet wurde. Z.B.: Haus, lauf, Grund\n",
"- Wortartbestimmung\n",
" - klassische Part-of-Speech-Erkennung (herkömmliche Wortart)\n",
" - Named Entity Recognition (NER) (Eigennamen)\n",
" - Bsp. spaCy: Person, Ort, Organisation, Verschiedenes\n",
"\n",
"#### Semantik\n",
"\n",
"- Wörter innerhalb eines Satzes größere Zusammenhänge als außerhalb\n",
"\n",
"### Pakete\n",
"\n",
"- Englisch: \n",
" - [NLTK](https://www.nltk.org/)\n",
"- Deutsch:\n",
" - [HanTa - The Hanover Tagger](https://github.com/wartaal/HanTa/tree/master)\n",
" - [TreeTagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)\n",
" - [Python Wrapper](https://treetaggerwrapper.readthedocs.io/en/latest/)\n",
" - [spaCy](https://spacy.io/)\n",
" - [Beispiel 1](https://www.trinnovative.de/blog/2020-09-08-natural-language-processing-mit-spacy.html)"
]
},
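{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration (added; not part of the original analysis): a toy suffix\n",
"# stripper showing the stem/lemma distinction described above. Real analyses\n",
"# should use HanTa, TreeTagger or spaCy (links above); the suffix list is a\n",
"# rough assumption, not a complete German morphology.\n",
"def toy_stem(word: str) -> str:\n",
"    \"\"\"Strip common German inflectional endings (longest match first).\"\"\"\n",
"    for suffix in ('ungen', 'ung', 'et', 'en', 'er', 'e', 't'):\n",
"        if word.lower().endswith(suffix) and len(word) - len(suffix) >= 3:\n",
"            return word[: len(word) - len(suffix)]\n",
"    return word\n",
"\n",
"[toy_stem(w) for w in ('laufen', 'Prüfung', 'begründet')]  # ['lauf', 'Prüf', 'begründ']\n"
]
},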
{
"cell_type": "markdown",
"metadata": {},
"source": [
"21.02.:\n",
"- Überarbeitung RegEx-Filterung\n",
"- Verbesserung Duplikatefindung über Ähnlichkeit"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyse"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\foersterflorian\\mambaforge\\envs\\ihm2\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from pandas import DataFrame, Series\n",
"import spacy\n",
"from spacy.lang.de import German as GermanSpacyModel\n",
"import sentence_transformers\n",
"from sentence_transformers import SentenceTransformer\n",
"from collections import Counter\n",
"from itertools import combinations\n",
"from dateutil.parser import parse\n",
"import re\n",
"\n",
"\n",
"import logging\n",
"import sys\n",
"import pickle\n",
"\n",
"from ihm_analyze.helpers import (\n",
" save_pickle,\n",
" load_pickle,\n",
" build_embedding_map,\n",
" build_cosSim_matrix,\n",
" filt_thresh_cosSim_matrix,\n",
" list_cosSim_dupl_candidates,\n",
" choose_cosSim_dupl_candidates,\n",
")\n",
"\n",
"LOGGING_LEVEL = 'INFO'\n",
"logging.basicConfig(level=LOGGING_LEVEL, stream=sys.stdout)\n",
"logger = logging.getLogger('base')"
]
},
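{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch (added for illustration): the ihm_analyze helpers imported\n",
"# above presumably implement a cosine-similarity duplicate search along\n",
"# these lines; the function names and threshold here are assumptions, not\n",
"# the package's actual API.\n",
"import numpy as np\n",
"\n",
"def cosine_sim_matrix(emb: np.ndarray) -> np.ndarray:\n",
"    \"\"\"Pairwise cosine similarity for row-wise embedding vectors.\"\"\"\n",
"    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)\n",
"    return unit @ unit.T\n",
"\n",
"def duplicate_candidates(emb, threshold=0.9):\n",
"    \"\"\"Index pairs (i, j), i < j, whose cosine similarity exceeds threshold.\"\"\"\n",
"    sim = np.triu(cosine_sim_matrix(emb), k=1)  # keep upper triangle only\n",
"    return [(int(i), int(j)) for i, j in zip(*np.where(sim > threshold))]\n",
"\n",
"# toy embeddings: rows 0 and 1 point in the same direction\n",
"duplicate_candidates(np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]))  # [(0, 1)]\n"
]
},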
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"LOAD_CALC_FILES = False\n",
"\n",
"DESC_BLACKLIST = set(['-'])\n",
"\"\"\"\n",
"GENERAL_BLACKLIST = set([\n",
" 'herr', 'hr.', 'förster', 'graf', 'stöppel', \n",
" 'stab', 'kw', 'h.', 'koch', 'heininger', '.',\n",
" 'schwab', 'm.', 'wenninger', '-', '--',\n",
"])\n",
"\"\"\"\n",
"\n",
"GENERAL_BLACKLIST = set([\n",
" 'herr', 'hr.' 'kw', 'h.', '.',\n",
" 'm.', '-', '--', 'dr.', 'dr',\n",
"])\n",
"\n",
"#GENERAL_BLACKLIST = set()\n",
"#POS_of_interest = set(['NOUN', 'PROPN', 'ADJ', 'VERB', 'AUX'])\n",
"#POS_of_interest = set(['NOUN', 'ADJ', 'VERB', 'AUX'])\n",
"#POS_of_interest = set(['NOUN', 'PROPN'])\n",
"POS_of_interest = set(['NOUN', 'PROPN', 'VERB', 'AUX'])\n",
"#TAG_of_interest = set(['ADJD'])\n",
"TAG_of_interest = set()\n",
"\n",
"#POS_INDIRECT = set(['AUX', 'VERB'])\n",
"POS_INDIRECT = set(['AUX'])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# load language model\n",
"# transformer model without vector embeddings\n",
"# can not be used to calculate similarities\n",
"# using sentence transformers instead\n",
"nlp = spacy.load('de_dep_news_trf')\n",
"#nlp = spacy.load('de_core_news_lg')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2\n",
"INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu\n"
]
}
],
"source": [
"model_stfr = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 129020 entries, 0 to 129019\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 VorgangsID 129020 non-null int64 \n",
" 1 ObjektID 129020 non-null int64 \n",
" 2 HObjektText 129003 non-null object \n",
" 3 ObjektArtID 129020 non-null int64 \n",
" 4 ObjektArtText 128372 non-null object \n",
" 5 VorgangsTypID 129020 non-null int64 \n",
" 6 VorgangsTypName 129020 non-null object \n",
" 7 VorgangsDatum 129020 non-null datetime64[ns]\n",
" 8 VorgangsStatusId 129020 non-null int64 \n",
" 9 VorgangsPrioritaet 129020 non-null int64 \n",
" 10 VorgangsBeschreibung 124087 non-null object \n",
" 11 VorgangsOrt 507 non-null object \n",
" 12 VorgangsArtText 129020 non-null object \n",
" 13 ErledigungsDatum 129020 non-null datetime64[ns]\n",
" 14 ErledigungsArtText 128474 non-null object \n",
" 15 ErledigungsBeschreibung 118135 non-null object \n",
" 16 MPMelderArbeitsplatz 6359 non-null object \n",
" 17 MPAbteilungBezeichnung 6359 non-null object \n",
" 18 Arbeitsbeginn 123538 non-null datetime64[ns]\n",
" 19 ErstellungsDatum 129020 non-null datetime64[ns]\n",
"dtypes: datetime64[ns](4), int64(6), object(10)\n",
"memory usage: 19.7+ MB\n"
]
}
],
"source": [
"# load dataset\n",
"DATA_SET_ID = 'Export4'\n",
"FILE_PATH = f'01_2_Rohdaten_neu/{DATA_SET_ID}.csv'\n",
"date_cols = ['VorgangsDatum', 'ErledigungsDatum', 'Arbeitsbeginn', 'ErstellungsDatum']\n",
"raw = pd.read_csv(filepath_or_buffer=FILE_PATH, sep=';', encoding='cp1252', parse_dates=date_cols, dayfirst=True)\n",
"raw.info()"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsID</th>\n",
" <th>ObjektID</th>\n",
" <th>HObjektText</th>\n",
" <th>ObjektArtID</th>\n",
" <th>ObjektArtText</th>\n",
" <th>VorgangsTypID</th>\n",
" <th>VorgangsTypName</th>\n",
" <th>VorgangsDatum</th>\n",
" <th>VorgangsStatusId</th>\n",
" <th>VorgangsPrioritaet</th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>VorgangsOrt</th>\n",
" <th>VorgangsArtText</th>\n",
" <th>ErledigungsDatum</th>\n",
" <th>ErledigungsArtText</th>\n",
" <th>ErledigungsBeschreibung</th>\n",
" <th>MPMelderArbeitsplatz</th>\n",
" <th>MPAbteilungBezeichnung</th>\n",
" <th>Arbeitsbeginn</th>\n",
" <th>ErstellungsDatum</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11</td>\n",
" <td>114</td>\n",
" <td>427 C , Webmaschine, DL 280 EMS Breite 280</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-06</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Kettbaum kaputt</td>\n",
" <td>2019-03-06</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>NaT</td>\n",
" <td>2019-03-06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>17</td>\n",
" <td>124</td>\n",
" <td>621 C , Webmaschine, DL 280 EMS Breite 280</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-11</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>asgasdg</td>\n",
" <td>2019-03-11</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Elektrowerkstatt</td>\n",
" <td>Elektrowerkstatt</td>\n",
" <td>NaT</td>\n",
" <td>2019-03-11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>53</td>\n",
" <td>244</td>\n",
" <td>285 C, Webmaschine, SG 220 EMS</td>\n",
" <td>5</td>\n",
" <td>Greifer-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-19</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Kupplung schleift</td>\n",
" <td>NaN</td>\n",
" <td>Kupplung defekt</td>\n",
" <td>2019-03-20</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>NaN</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>NaT</td>\n",
" <td>2019-03-19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>58</td>\n",
" <td>257</td>\n",
" <td>107, Webmaschine, OM 220 EOS</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-21</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Gegengewicht wieder anbringen</td>\n",
" <td>NaN</td>\n",
" <td>Gegengewicht an der Webmaschine abgefallen</td>\n",
" <td>2019-03-21</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Schraube ausgebohrt\\nGegengewicht wieder angeb...</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>2019-03-21</td>\n",
" <td>2019-03-21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>81</td>\n",
" <td>138</td>\n",
" <td>00138, Schärmaschine 9,</td>\n",
" <td>16</td>\n",
" <td>Schärmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-25</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>da ist etwas gebrochen. (Herr Heininger)</td>\n",
" <td>NaN</td>\n",
" <td>zentrale Bremsenverstellung linke Gatterseite ...</td>\n",
" <td>2019-03-25</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Bolzen gebrochen. Bolzen neu angefertig und di...</td>\n",
" <td>Vorwerk</td>\n",
" <td>Vorwerk</td>\n",
" <td>2019-03-25</td>\n",
" <td>2019-03-25</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" VorgangsID ObjektID HObjektText \\\n",
"0 11 114 427 C , Webmaschine, DL 280 EMS Breite 280 \n",
"1 17 124 621 C , Webmaschine, DL 280 EMS Breite 280 \n",
"2 53 244 285 C, Webmaschine, SG 220 EMS \n",
"3 58 257 107, Webmaschine, OM 220 EOS \n",
"4 81 138 00138, Schärmaschine 9, \n",
"\n",
" ObjektArtID ObjektArtText VorgangsTypID VorgangsTypName \\\n",
"0 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
"1 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
"2 5 Greifer-Webmaschine 3 Reparaturauftrag (Portal) \n",
"3 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
"4 16 Schärmaschine 3 Reparaturauftrag (Portal) \n",
"\n",
" VorgangsDatum VorgangsStatusId VorgangsPrioritaet \\\n",
"0 2019-03-06 4 0 \n",
"1 2019-03-11 5 0 \n",
"2 2019-03-19 5 0 \n",
"3 2019-03-21 5 0 \n",
"4 2019-03-25 5 0 \n",
"\n",
" VorgangsBeschreibung VorgangsOrt \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 Kupplung schleift NaN \n",
"3 Gegengewicht wieder anbringen NaN \n",
"4 da ist etwas gebrochen. (Herr Heininger) NaN \n",
"\n",
" VorgangsArtText ErledigungsDatum \\\n",
"0 Kettbaum kaputt 2019-03-06 \n",
"1 asgasdg 2019-03-11 \n",
"2 Kupplung defekt 2019-03-20 \n",
"3 Gegengewicht an der Webmaschine abgefallen 2019-03-21 \n",
"4 zentrale Bremsenverstellung linke Gatterseite ... 2019-03-25 \n",
"\n",
" ErledigungsArtText ErledigungsBeschreibung \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 Reparatur UTT NaN \n",
"3 Reparatur UTT Schraube ausgebohrt\\nGegengewicht wieder angeb... \n",
"4 Reparatur UTT Bolzen gebrochen. Bolzen neu angefertig und di... \n",
"\n",
" MPMelderArbeitsplatz MPAbteilungBezeichnung Arbeitsbeginn ErstellungsDatum \n",
"0 Weberei Weberei NaT 2019-03-06 \n",
"1 Elektrowerkstatt Elektrowerkstatt NaT 2019-03-11 \n",
"2 Weberei Weberei NaT 2019-03-19 \n",
"3 Weberei Weberei 2019-03-21 2019-03-21 \n",
"4 Vorwerk Vorwerk 2019-03-25 2019-03-25 "
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw.head()"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Anzahl Features: 20\n"
]
}
],
"source": [
"print(f\"Anzahl Features: {len(raw.columns)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Neue Features gegenüber letzter Analyse:**\n",
"- ``ObjektArtID``\n",
"- ``ObjektArtText``\n",
"- ``VorgangsTypName``"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Duplikate"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [],
"source": [
"duplicates_filt = raw.duplicated()"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Anzahl Duplikate: 84\n"
]
}
],
"source": [
"print(f\"Anzahl Duplikate: {duplicates_filt.sum()}\")"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [],
"source": [
"filt_data = raw[duplicates_filt]\n",
"uni_obj_id_dupl = filt_data['ObjektID'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Anzahl einzigartiger Objekt-IDs unter Duplikaten: 47\n"
]
}
],
"source": [
"print(f\"Anzahl einzigartiger Objekt-IDs unter Duplikaten: {len(uni_obj_id_dupl)}\")"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 128936 entries, 0 to 128935\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 VorgangsID 128936 non-null int64 \n",
" 1 ObjektID 128936 non-null int64 \n",
" 2 HObjektText 128920 non-null object \n",
" 3 ObjektArtID 128936 non-null int64 \n",
" 4 ObjektArtText 128289 non-null object \n",
" 5 VorgangsTypID 128936 non-null int64 \n",
" 6 VorgangsTypName 128936 non-null object \n",
" 7 VorgangsDatum 128936 non-null datetime64[ns]\n",
" 8 VorgangsStatusId 128936 non-null int64 \n",
" 9 VorgangsPrioritaet 128936 non-null int64 \n",
" 10 VorgangsBeschreibung 124008 non-null object \n",
" 11 VorgangsOrt 507 non-null object \n",
" 12 VorgangsArtText 128936 non-null object \n",
" 13 ErledigungsDatum 128936 non-null datetime64[ns]\n",
" 14 ErledigungsArtText 128402 non-null object \n",
" 15 ErledigungsBeschreibung 118086 non-null object \n",
" 16 MPMelderArbeitsplatz 6337 non-null object \n",
" 17 MPAbteilungBezeichnung 6337 non-null object \n",
" 18 Arbeitsbeginn 123480 non-null datetime64[ns]\n",
" 19 ErstellungsDatum 128936 non-null datetime64[ns]\n",
"dtypes: datetime64[ns](4), int64(6), object(10)\n",
"memory usage: 19.7+ MB\n"
]
}
],
"source": [
"wo_duplicates = raw.drop_duplicates(ignore_index=True)\n",
"wo_duplicates.info()"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [],
"source": [
"SAVE_PATH_DF_DUPL_OCCUR = f'./02_1_Preprocess1/{DATA_SET_ID}_00_DF_wo_dupl.parquet'\n",
"wo_duplicates.to_parquet(SAVE_PATH_DF_DUPL_OCCUR)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ``VorgangsBeschreibung``"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **NA vals und Duplikate**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"String-Bereinigung"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"SPECIAL_CHARS = set(['&', '$', '%', '§', '/', '(', ')', '_', \n",
" '+', '', '--', '<', '>', '´',\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"def clean_string_slim(string: str) -> str:\n",
" # remove special chars\n",
" pattern = r'[\\t\\n\\r\\f\\v]'\n",
" string = re.sub(pattern, ' ', string)\n",
" # remove whitespaces at the beginning and the end\n",
" string = string.strip()\n",
" \n",
" return string\n",
"\n",
"def clean_string(string: str) -> str:\n",
" #num_reps = 5\n",
" \n",
" # remove special chars\n",
" pattern = r'[\\t\\n\\r\\f\\v]'\n",
" string = re.sub(pattern, ' ', string)\n",
" # remove dates\n",
" pattern = r'[\\d]{1,4}[.:][\\d]{1,4}[.:][\\d]{1,4}'\n",
" string = re.sub(pattern, '', string)\n",
" # remove times\n",
" pattern = r'[\\d]{1,2}[:][\\d]{1,2}[:][\\d]{0,2}'\n",
" string = re.sub(pattern, '', string)\n",
" # remove all chars despite punctuation and alphanumeric ones\n",
" pattern = r'[^ \\w.,;:\\-äöüÄÖÜ]+'\n",
" string = re.sub(pattern, '', string)\n",
" # remove - where it is used as em dash\n",
" pattern = r'[\\W]+-[\\W]+'\n",
" string = re.sub(pattern, ' ', string)\n",
" # remove whitespaces in front of punctuation\n",
" pattern = r'[ ]+([;,.:])'\n",
" string = re.sub(pattern, r'\\1', string)\n",
" # remove multiple whitespaces\n",
" pattern = r'[ ]+'\n",
" string = re.sub(pattern, ' ', string)\n",
" # remove whitespaces at the beginning and the end\n",
" string = string.strip()\n",
" \n",
" #while num_reps != 0:\n",
" #string = string.replace('\\n', ' ')\n",
" #string = string.replace('\\t', ' ')\n",
" #string = string.replace(' ', ' ')\n",
" #string = string.replace(' ', ' ')\n",
" #string = string.replace(' - ', ' ')\n",
" \"\"\"\n",
" for char in SPECIAL_CHARS:\n",
" string = string.replace(char, '')\n",
" \n",
" #num_reps -= 1\n",
" \n",
" # remove spaces at the beginning and the end\n",
" string = string.strip()\n",
" \"\"\"\n",
" \n",
" return string"
]
},
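{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration (added; not executed in the original run): the effect of the\n",
"# main clean_string patterns on a made-up description string.\n",
"import re\n",
"\n",
"sample = 'Kontrolle 03.05.2023 14:30:00 durchgeführt (extern)\\n'\n",
"step = re.sub(r'[\\t\\n\\r\\f\\v]', ' ', sample)                      # control whitespace -> space\n",
"step = re.sub(r'[\\d]{1,4}[.:][\\d]{1,4}[.:][\\d]{1,4}', '', step)  # dates and hh:mm:ss times\n",
"step = re.sub(r'[^ \\w.,;:\\-äöüÄÖÜ]+', '', step)                  # drop remaining special chars\n",
"re.sub(r'[ ]+', ' ', step).strip()                                # 'Kontrolle durchgeführt extern'\n"
]
},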
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"base = wo_duplicates.copy()\n",
"base = base.dropna(axis=0, subset='VorgangsBeschreibung')\n",
"# preprocessing\n",
"#base['VorgangsBeschreibung'] = base['VorgangsBeschreibung'].map(clean_string)\n",
"base['VorgangsBeschreibung'] = base['VorgangsBeschreibung'].map(clean_string_slim)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsID</th>\n",
" <th>ObjektID</th>\n",
" <th>HObjektText</th>\n",
" <th>ObjektArtID</th>\n",
" <th>ObjektArtText</th>\n",
" <th>VorgangsTypID</th>\n",
" <th>VorgangsTypName</th>\n",
" <th>VorgangsDatum</th>\n",
" <th>VorgangsStatusId</th>\n",
" <th>VorgangsPrioritaet</th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>VorgangsOrt</th>\n",
" <th>VorgangsArtText</th>\n",
" <th>ErledigungsDatum</th>\n",
" <th>ErledigungsArtText</th>\n",
" <th>ErledigungsBeschreibung</th>\n",
" <th>MPMelderArbeitsplatz</th>\n",
" <th>MPAbteilungBezeichnung</th>\n",
" <th>Arbeitsbeginn</th>\n",
" <th>ErstellungsDatum</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>140837</td>\n",
" <td>728</td>\n",
" <td>10107, Rechteckfilter H1,</td>\n",
" <td>9</td>\n",
" <td>Behälter</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2022-03-30</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Filter Links Klopfer Defekt</td>\n",
" <td>NaN</td>\n",
" <td>Klopfer defekt</td>\n",
" <td>2022-03-30</td>\n",
" <td>Ausgetauscht</td>\n",
" <td>.</td>\n",
" <td>Produktion</td>\n",
" <td>Produktion</td>\n",
" <td>2022-03-30</td>\n",
" <td>2022-03-30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>136284</td>\n",
" <td>1280</td>\n",
" <td>03024, Flachform Hubtisch, H2E12</td>\n",
" <td>30</td>\n",
" <td>Hydraulik</td>\n",
" <td>2</td>\n",
" <td>Störungsmeldung</td>\n",
" <td>2022-03-25</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Anfahrschutz für Hydraulikkupplung abgefahren.</td>\n",
" <td>NaN</td>\n",
" <td>Defekt</td>\n",
" <td>2022-03-30</td>\n",
" <td>Repariert</td>\n",
" <td>Geschweißt</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-03-30</td>\n",
" <td>2022-03-25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>116920</td>\n",
" <td>1518</td>\n",
" <td>00576, Leitstrahlmischer 1,</td>\n",
" <td>41</td>\n",
" <td>Mischer</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-14</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>.</td>\n",
" <td>NaN</td>\n",
" <td>halbjährlich Wartung (W)</td>\n",
" <td>2022-04-21</td>\n",
" <td>Planmäßige Wartung</td>\n",
" <td>.</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-21</td>\n",
" <td>2021-11-22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>21260</td>\n",
" <td>2097</td>\n",
" <td>00827, Überladebrücke Rampe 1,</td>\n",
" <td>58</td>\n",
" <td>Verladung</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-05-06</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Prüfung durch externen DL</td>\n",
" <td>NaN</td>\n",
" <td>jährliche Prüfung externer Dienstleister (P)</td>\n",
" <td>2022-04-25</td>\n",
" <td>Geprüft ohne Mängel</td>\n",
" <td>Geprüft ohne Mängel.</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-04</td>\n",
" <td>2021-04-14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>116374</td>\n",
" <td>1703</td>\n",
" <td>00715, Vogelsang, 2</td>\n",
" <td>3</td>\n",
" <td>Pumpen</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-14</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Wartung nach Arbeitsplan</td>\n",
" <td>NaN</td>\n",
" <td>halbjährlich Wartung (W)</td>\n",
" <td>2022-05-12</td>\n",
" <td>Planmäßige Wartung</td>\n",
" <td>Wartung wie geplant durchgeführt</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-20</td>\n",
" <td>2021-11-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14774</th>\n",
" <td>165211</td>\n",
" <td>723</td>\n",
" <td>10102, Nasswäscher AGT 2,</td>\n",
" <td>9</td>\n",
" <td>Behälter</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2023-05-01</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Manuelle Dosierung des Biozids</td>\n",
" <td>NaN</td>\n",
" <td>Biozid Dosierung Montag (W)</td>\n",
" <td>2023-05-03</td>\n",
" <td>Planmäßige Wartung</td>\n",
" <td>.</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2023-05-03</td>\n",
" <td>2022-10-10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14775</th>\n",
" <td>54805</td>\n",
" <td>2365</td>\n",
" <td>03544, Dampfkessel BHKW 1,</td>\n",
" <td>11</td>\n",
" <td>Dampferzeuger</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2023-05-03</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td></td>\n",
" <td>NaN</td>\n",
" <td>dreitägige Überprüfung Mittwoch (W)</td>\n",
" <td>2023-05-03</td>\n",
" <td>Planmäßige Wartung</td>\n",
" <td>Nach Vorgabe</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2023-05-03</td>\n",
" <td>2021-06-04</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14776</th>\n",
" <td>166438</td>\n",
" <td>3214</td>\n",
" <td>03760, Seepexpumpe , BN 5-12L</td>\n",
" <td>3</td>\n",
" <td>Pumpen</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2023-04-24</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Wartung nach Arbeitsplan</td>\n",
" <td>NaN</td>\n",
" <td>halbjährlich Wartung (W)</td>\n",
" <td>2023-05-03</td>\n",
" <td>Planmäßige Wartung</td>\n",
" <td>Wartung wie geplant durchgeführt</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2023-05-02</td>\n",
" <td>2022-10-24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14777</th>\n",
" <td>166443</td>\n",
" <td>1277</td>\n",
" <td>00593, Hydraulik für Deckelhubeinrichtung,</td>\n",
" <td>30</td>\n",
" <td>Hydraulik</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2023-04-24</td>\n",
" <td>12</td>\n",
" <td>0</td>\n",
" <td>Wartung nach Arbeitsplan</td>\n",
" <td>NaN</td>\n",
" <td>halbjährlich Wartung (W)</td>\n",
" <td>2023-05-03</td>\n",
" <td>Planmäßige Wartung</td>\n",
" <td>Wartung wie geplant durchgeführt</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2023-05-03</td>\n",
" <td>2022-10-24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14778</th>\n",
" <td>195126</td>\n",
" <td>1266</td>\n",
" <td>02000, BHKW 1,</td>\n",
" <td>28</td>\n",
" <td>Heizung</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2023-05-01</td>\n",
" <td>14</td>\n",
" <td>0</td>\n",
" <td>regelmäßige Wartung nach Herstellervorgaben al...</td>\n",
" <td>NaN</td>\n",
" <td>2.000h Wartung (W)</td>\n",
" <td>2023-04-26</td>\n",
" <td>Planmäßige Wartung</td>\n",
" <td>Wartung wie geplant durchgeführt</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2023-04-24</td>\n",
" <td>2023-03-20</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>14481 rows × 20 columns</p>\n",
"</div>"
],
"text/plain": [
" VorgangsID ObjektID HObjektText \\\n",
"0 140837 728 10107, Rechteckfilter H1, \n",
"1 136284 1280 03024, Flachform Hubtisch, H2E12 \n",
"2 116920 1518 00576, Leitstrahlmischer 1, \n",
"3 21260 2097 00827, Überladebrücke Rampe 1, \n",
"4 116374 1703 00715, Vogelsang, 2 \n",
"... ... ... ... \n",
"14774 165211 723 10102, Nasswäscher AGT 2, \n",
"14775 54805 2365 03544, Dampfkessel BHKW 1, \n",
"14776 166438 3214 03760, Seepexpumpe , BN 5-12L \n",
"14777 166443 1277 00593, Hydraulik für Deckelhubeinrichtung, \n",
"14778 195126 1266 02000, BHKW 1, \n",
"\n",
" ObjektArtID ObjektArtText VorgangsTypID VorgangsTypName \\\n",
"0 9 Behälter 3 Reparaturauftrag (Portal) \n",
"1 30 Hydraulik 2 Störungsmeldung \n",
"2 41 Mischer 1 Wartung \n",
"3 58 Verladung 1 Wartung \n",
"4 3 Pumpen 1 Wartung \n",
"... ... ... ... ... \n",
"14774 9 Behälter 1 Wartung \n",
"14775 11 Dampferzeuger 1 Wartung \n",
"14776 3 Pumpen 1 Wartung \n",
"14777 30 Hydraulik 1 Wartung \n",
"14778 28 Heizung 1 Wartung \n",
"\n",
" VorgangsDatum VorgangsStatusId VorgangsPrioritaet \\\n",
"0 2022-03-30 2 0 \n",
"1 2022-03-25 2 0 \n",
"2 2022-04-14 2 0 \n",
"3 2022-05-06 2 0 \n",
"4 2022-04-14 2 0 \n",
"... ... ... ... \n",
"14774 2023-05-01 2 0 \n",
"14775 2023-05-03 2 0 \n",
"14776 2023-04-24 2 0 \n",
"14777 2023-04-24 12 0 \n",
"14778 2023-05-01 14 0 \n",
"\n",
" VorgangsBeschreibung VorgangsOrt \\\n",
"0 Filter Links Klopfer Defekt NaN \n",
"1 Anfahrschutz für Hydraulikkupplung abgefahren. NaN \n",
"2 . NaN \n",
"3 Prüfung durch externen DL NaN \n",
"4 Wartung nach Arbeitsplan NaN \n",
"... ... ... \n",
"14774 Manuelle Dosierung des Biozids NaN \n",
"14775 NaN \n",
"14776 Wartung nach Arbeitsplan NaN \n",
"14777 Wartung nach Arbeitsplan NaN \n",
"14778 regelmäßige Wartung nach Herstellervorgaben al... NaN \n",
"\n",
" VorgangsArtText ErledigungsDatum \\\n",
"0 Klopfer defekt 2022-03-30 \n",
"1 Defekt 2022-03-30 \n",
"2 halbjährlich Wartung (W) 2022-04-21 \n",
"3 jährliche Prüfung externer Dienstleister (P) 2022-04-25 \n",
"4 halbjährlich Wartung (W) 2022-05-12 \n",
"... ... ... \n",
"14774 Biozid Dosierung Montag (W) 2023-05-03 \n",
"14775 dreitägige Überprüfung Mittwoch (W) 2023-05-03 \n",
"14776 halbjährlich Wartung (W) 2023-05-03 \n",
"14777 halbjährlich Wartung (W) 2023-05-03 \n",
"14778 2.000h Wartung (W) 2023-04-26 \n",
"\n",
" ErledigungsArtText ErledigungsBeschreibung \\\n",
"0 Ausgetauscht . \n",
"1 Repariert Geschweißt \n",
"2 Planmäßige Wartung . \n",
"3 Geprüft ohne Mängel Geprüft ohne Mängel. \n",
"4 Planmäßige Wartung Wartung wie geplant durchgeführt \n",
"... ... ... \n",
"14774 Planmäßige Wartung . \n",
"14775 Planmäßige Wartung Nach Vorgabe \n",
"14776 Planmäßige Wartung Wartung wie geplant durchgeführt \n",
"14777 Planmäßige Wartung Wartung wie geplant durchgeführt \n",
"14778 Planmäßige Wartung Wartung wie geplant durchgeführt \n",
"\n",
" MPMelderArbeitsplatz MPAbteilungBezeichnung Arbeitsbeginn \\\n",
"0 Produktion Produktion 2022-03-30 \n",
"1 NaN NaN 2022-03-30 \n",
"2 NaN NaN 2022-04-21 \n",
"3 NaN NaN 2022-04-04 \n",
"4 NaN NaN 2022-04-20 \n",
"... ... ... ... \n",
"14774 NaN NaN 2023-05-03 \n",
"14775 NaN NaN 2023-05-03 \n",
"14776 NaN NaN 2023-05-02 \n",
"14777 NaN NaN 2023-05-03 \n",
"14778 NaN NaN 2023-04-24 \n",
"\n",
" ErstellungsDatum \n",
"0 2022-03-30 \n",
"1 2022-03-25 \n",
"2 2021-11-22 \n",
"3 2021-04-14 \n",
"4 2021-11-17 \n",
"... ... \n",
"14774 2022-10-10 \n",
"14775 2021-06-04 \n",
"14776 2022-10-24 \n",
"14777 2022-10-24 \n",
"14778 2023-03-20 \n",
"\n",
"[14481 rows x 20 columns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"base"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Einträge: 14481\n"
]
}
],
"source": [
"descriptions = base['VorgangsBeschreibung']\n",
"print(f\"Einträge: {len(descriptions)}\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Anzahl Duplikate Vorgangsbeschreibungen: 12297\n",
"Anzahl einzigartiger Vorgangsbeschreibungen: 2184\n",
"Anteil einzigartiger Vorgangsbeschreibungen: 15.08 %\n"
]
}
],
"source": [
"num_dupl_descr = descriptions.duplicated().sum()\n",
"uni_descr = descriptions.unique()\n",
"num_uni_descr = len(uni_descr)\n",
"\n",
"print(f\"Anzahl Duplikate Vorgangsbeschreibungen: {num_dupl_descr}\")\n",
"print(f\"Anzahl einzigartiger Vorgangsbeschreibungen: {num_uni_descr}\")\n",
"print(f\"Anteil einzigartiger Vorgangsbeschreibungen: {num_uni_descr / len(descriptions) * 100:.2f} %\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"SAVE_PATH_DF_DUPL_OCCUR = f'./02_1_Preprocess1/{DATA_SET_ID}_01_DF_num_occur_temp1.parquet'\n",
"\n",
"if not LOAD_CALC_FILES:\n",
" cols = ['descr', 'len', 'num_occur', 'assoc_obj_ids', 'num_assoc_obj_ids']\n",
" descr_df = pd.DataFrame(columns=cols)\n",
" max_val = 0\n",
" text = None\n",
" index = 0\n",
"\n",
"\n",
" for idx, description in enumerate(uni_descr):\n",
" len_descr = len(description)\n",
" filt = base['VorgangsBeschreibung'] == description\n",
" temp = base[filt]\n",
" assoc_obj_ids = temp['ObjektID'].unique()\n",
" assoc_obj_ids = np.sort(assoc_obj_ids, kind='stable')\n",
" num_assoc_obj_ids = len(assoc_obj_ids)\n",
" num_dupl = filt.sum()\n",
" \n",
" conc_df = pd.DataFrame(data=[[\n",
" description,\n",
" len_descr,\n",
" num_dupl,\n",
" assoc_obj_ids,\n",
" num_assoc_obj_ids\n",
" ]], columns=cols)\n",
" \n",
" descr_df = pd.concat([descr_df, conc_df], ignore_index=True)\n",
" \n",
" if num_dupl > max_val:\n",
" max_val = num_dupl\n",
" index = idx\n",
" text = description\n",
" \n",
" temp1 = descr_df.sort_values(by='num_occur', ascending=False)\n",
" \n",
" # saving\n",
" temp1.to_parquet(SAVE_PATH_DF_DUPL_OCCUR)\n",
"else:\n",
" # loading\n",
" temp1 = pd.read_parquet(SAVE_PATH_DF_DUPL_OCCUR)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>len</th>\n",
" <th>num_occur</th>\n",
" <th>assoc_obj_ids</th>\n",
" <th>num_assoc_obj_ids</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
" <td>527</td>\n",
" <td>2809</td>\n",
" <td>[404, 405, 406, 407, 408, 409, 410, 411, 412, ...</td>\n",
" <td>1724</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>VDE Prüfung</td>\n",
" <td>11</td>\n",
" <td>2034</td>\n",
" <td>[404, 407, 408, 409, 410, 411, 412, 413, 414, ...</td>\n",
" <td>1187</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Wartung nach Arbeitsplan</td>\n",
" <td>24</td>\n",
" <td>1062</td>\n",
" <td>[726, 798, 800, 801, 802, 921, 922, 923, 924, ...</td>\n",
" <td>218</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Manuelle Dosierung des Biozids</td>\n",
" <td>30</td>\n",
" <td>526</td>\n",
" <td>[0, 722, 723, 724, 726]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
" <td>29</td>\n",
" <td>511</td>\n",
" <td>[722, 723, 724, 725, 726]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>844</th>\n",
" <td>Filterabreinigung AGT 1 : das erste Ventil von...</td>\n",
" <td>68</td>\n",
" <td>1</td>\n",
" <td>[0]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>843</th>\n",
" <td>Abnahmeprüfung durch Sachkundigen</td>\n",
" <td>33</td>\n",
" <td>1</td>\n",
" <td>[1245]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>842</th>\n",
" <td>Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung</td>\n",
" <td>47</td>\n",
" <td>1</td>\n",
" <td>[1326]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>841</th>\n",
" <td>Ausgeführt</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>[2365]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2183</th>\n",
" <td>Antrieb neu Dichten. Liegt auf Werkbank</td>\n",
" <td>39</td>\n",
" <td>1</td>\n",
" <td>[0]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2184 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" descr len num_occur \\\n",
"14 Bestimmen des Prüftermins für elektrische Arbe... 527 2809 \n",
"16 VDE Prüfung 11 2034 \n",
"4 Wartung nach Arbeitsplan 24 1062 \n",
"7 Manuelle Dosierung des Biozids 30 526 \n",
"12 Mikrobiologie(Abklatsch-Test) 29 511 \n",
"... ... ... ... \n",
"844 Filterabreinigung AGT 1 : das erste Ventil von... 68 1 \n",
"843 Abnahmeprüfung durch Sachkundigen 33 1 \n",
"842 Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung 47 1 \n",
"841 Ausgeführt 10 1 \n",
"2183 Antrieb neu Dichten. Liegt auf Werkbank 39 1 \n",
"\n",
" assoc_obj_ids num_assoc_obj_ids \n",
"14 [404, 405, 406, 407, 408, 409, 410, 411, 412, ... 1724 \n",
"16 [404, 407, 408, 409, 410, 411, 412, 413, 414, ... 1187 \n",
"4 [726, 798, 800, 801, 802, 921, 922, 923, 924, ... 218 \n",
"7 [0, 722, 723, 724, 726] 5 \n",
"12 [722, 723, 724, 725, 726] 5 \n",
"... ... ... \n",
"844 [0] 1 \n",
"843 [1245] 1 \n",
"842 [1326] 1 \n",
"841 [2365] 1 \n",
"2183 [0] 1 \n",
"\n",
"[2184 rows x 5 columns]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2184"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(temp1)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Bestimmen des Prüftermins für elektrische Arbeitsmittel(Teil der Gefährdungsbeurteilung gemäß Betribssicherheitsverordnung §3) Ist immer ein Jahr gültig! Erklärung: -Warum stehen vor jeder Auswahl die Zahlen 1-7? Antwort: Es gibt die Gefahren Klasse 1-7 daher wurde auch bei jeder Auswahlmöglichkeit die Gefahrenklasse mit integriert. Gefährdungsklasse 1 2 3 4 5 6 7 Zustand Spitzenniv. sehr gut gut normal beeinträchtigt schlecht sehr schlecht Einwirkung/Gefährdung keine sehr niedrig niedrig normal erhöht hoch sehr hoch'"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1.iloc[0,0]"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'VDE Prüfung'"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1.iloc[1,0]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# saving\n",
"SAVE_PATH_DF_DUPL_OCCUR = f'./02_1_Preprocess1/{DATA_SET_ID}_01_DF_num_occur_temp1.parquet'\n",
"#temp1.to_parquet(SAVE_PATH_DF_DUPL_OCCUR)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Cosine Similarity**"
]
},
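  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cosine similarity score used below can be sketched with plain NumPy, independent of the embedding model (the vectors here are illustrative, not real embeddings):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def cos_sim(a, b):\n",
    "    # angle-based similarity of two vectors, in [-1, 1]\n",
    "    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
    "\n",
    "a = np.array([1.0, 0.0, 1.0])\n",
    "b = np.array([1.0, 1.0, 0.0])\n",
    "print(cos_sim(a, b))\n",
    "```"
   ]
  },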
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"# eliminate descriptions with less than 6 symbols\n",
"subset_data = temp1.loc[temp1['len'] > 5, 'descr'].copy()\n",
"#subset_data = subset_data.iloc[0:100]"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2171"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(subset_data)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"# saving\n",
"SAVE_PATH_SUBSET_DATA = f'./02_1_Preprocess1/{DATA_SET_ID}_02_1_subset_data.pkl'\n",
"if not LOAD_CALC_FILES:\n",
" subset_data.to_pickle(SAVE_PATH_SUBSET_DATA)\n",
"else:\n",
" subset_data = pd.read_pickle(SAVE_PATH_SUBSET_DATA)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Wie geht man mit unbekannten Wörtern um?"
]
},
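  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One simple policy, sketched with a made-up toy vocabulary: average the known word vectors and flag texts whose tokens are all out-of-vocabulary (this mirrors the `vector_norm` check on spaCy docs used below):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# toy vocabulary, purely illustrative\n",
    "vocab = {'vde': np.array([0.0, 1.0]), 'pruefung': np.array([1.0, 0.0])}\n",
    "\n",
    "def embed(text):\n",
    "    vecs = [vocab[w] for w in text.lower().split() if w in vocab]\n",
    "    if not vecs:\n",
    "        return np.zeros(2)  # every token unknown -> zero vector\n",
    "    return np.mean(vecs, axis=0)\n",
    "\n",
    "print(np.linalg.norm(embed('VDE Pruefung')))  # known words -> nonzero norm\n",
    "print(np.linalg.norm(embed('xyz abc')))       # all unknown -> zero norm\n",
    "```"
   ]
  },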
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# build mapping of embeddings for given model\n",
"def build_embedding_map(\n",
" data: Series,\n",
" model: GermanSpacyModel | SentenceTransformer,\n",
") -> dict[int, tuple['Embedding',str]]:\n",
" # dictionary with embeddings\n",
" embeddings: dict[int, tuple['Embedding',str]] = dict()\n",
" is_spacy = False\n",
" is_STRF = False\n",
" \n",
" if isinstance(model, spacy.lang.de.German):\n",
" is_spacy = True\n",
" elif isinstance(model, SentenceTransformer):\n",
" is_STRF = True\n",
" \n",
" if not any((is_spacy, is_STRF)):\n",
" raise NotImplementedError(\"Model type unknown\")\n",
" \n",
" for (idx, text) in subset_data.items():\n",
" \n",
" if is_spacy:\n",
" embd = model(text)\n",
" embeddings[idx] = (embd, text)\n",
" # check for empty vectors\n",
" if not doc.vector_norm:\n",
" print('--- Unknown Words ---')\n",
" print(f'{embd.text=} has no vector')\n",
" elif is_STRF:\n",
" embd = model.encode(text, show_progress_bar=False, normalize_embeddings=False)\n",
" embeddings[idx] = (embd, text)\n",
" \n",
" return embeddings, (is_spacy, is_STRF)\n",
"\n",
"# build similarity matrix out of embeddings\n",
"def build_cosSim_matrix(\n",
" data: Series,\n",
" model: GermanSpacyModel | SentenceTransformer,\n",
") -> DataFrame:\n",
" # build empty matrix\n",
" df_index = data.index\n",
" cosineSim_idx_matrix = pd.DataFrame(data=0., columns=df_index, \n",
" index=df_index, dtype=np.float32)\n",
" \n",
" # obtain embeddings based on used model\n",
" embds, (is_spacy, is_STRF) = build_embedding_map(\n",
" data=data,\n",
" model=model\n",
" )\n",
" \n",
" # apply index based mapping for efficient handling of large texts\n",
" combs = combinations(df_index, 2)\n",
" \n",
" for (idx1, idx2) in combs:\n",
" #print(f\"{idx1=}, {idx2=}\")\n",
" embd1 = embds[idx1][0]\n",
" embd2 = embds[idx2][0]\n",
" \n",
" # calculate similarity based on model type\n",
" if is_spacy:\n",
" cosSim = embd1.similarity(embd2)\n",
" elif is_STRF:\n",
" cosSim = sentence_transformers.util.cos_sim(embd1, embd2)\n",
" cosSim = cosSim.item()\n",
" \n",
" cosineSim_idx_matrix.at[idx1, idx2] = cosSim\n",
" \n",
" return cosineSim_idx_matrix, embds"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"SKIP = False\n",
"SAVE_PATH_COSSIM_MATRIX_WHOLE = f'./02_1_Preprocess1/{DATA_SET_ID}_02_2_cosineSim_idx_matrix_whole_textbased.parquet'\n",
"SAVE_PATH_COSSIM_EMBDS_WHOLE = f'./02_1_Preprocess1/{DATA_SET_ID}_02_2_cosineSim_idx_embds_whole_textbased.pkl'\n",
"\n",
"if not SKIP:\n",
" cosineSim_idx_matrix, embds = build_cosSim_matrix(\n",
" data=subset_data,\n",
" model=model_stfr,\n",
" )\n",
" # saving\n",
" cosineSim_idx_matrix.to_parquet(SAVE_PATH_COSSIM_MATRIX_WHOLE)\n",
" save_pickle(obj=embds, path=SAVE_PATH_COSSIM_EMBDS_WHOLE)\n",
"else:\n",
" cosineSim_idx_matrix = pd.read_parquet(SAVE_PATH_COSSIM_MATRIX_WHOLE)\n",
" embds = load_pickle(SAVE_PATH_COSSIM_EMBDS_WHOLE)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2171, 2171)"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cosineSim_idx_matrix.to_numpy().shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# obtain index pairs with cosine similarity \n",
"# greater than or equal to given threshold value\n",
"\n",
"def filt_thresh_cosSim_matrix(\n",
" threshold: float,\n",
" cosineSim_idx_matrix: DataFrame,\n",
"):\n",
" cosineSim_filt = cosineSim_idx_matrix.where(cosineSim_idx_matrix >= threshold).stack()\n",
" \n",
" return cosineSim_filt\n",
"\n",
"def list_cosSim_dupl_candidates(\n",
" cosineSim_filt: Series,\n",
" embeddings: dict[int, tuple['Embedding',str]],\n",
"):\n",
" # compare found duplicates\n",
" columns = ['idx1', 'text1', 'idx2', 'text2', 'score']\n",
" df_candidates = pd.DataFrame(columns=columns)\n",
" \n",
" index_pairs = list()\n",
"\n",
" for ((idx1, idx2), score) in cosineSim_filt.items():\n",
" # get text content from embedding as second tuple entry\n",
" content = [[\n",
" idx1,\n",
" embeddings[idx1][1],\n",
" idx2,\n",
" embeddings[idx2][1],\n",
" score,\n",
" ]]\n",
" df_conc = pd.DataFrame(columns=columns, data=content)\n",
" \n",
" df_candidates = pd.concat([df_candidates, df_conc])\n",
" index_pairs.append((idx1, idx2))\n",
" \n",
" return df_candidates, index_pairs\n",
"\n",
"def choose_cosSim_dupl_candidates(\n",
" cosineSim_filt: Series,\n",
" embeddings: dict[int, tuple['Embedding',str]],\n",
") -> tuple[DataFrame, list[tuple['Index', 'Index']]]:\n",
" # compare found duplicates\n",
" columns = ['idx1', 'text1', 'idx2', 'text2', 'score']\n",
" df_candidates = pd.DataFrame(columns=columns)\n",
" \n",
" index_pairs = list()\n",
"\n",
" for ((idx1, idx2), score) in cosineSim_filt.items():\n",
" # get texts for comparison\n",
" text1 = embeddings[idx1][1]\n",
" text2 = embeddings[idx2][1]\n",
" # get decision\n",
" print('---------- New Decision ----------')\n",
" print('text1:\\n', text1, '\\n', flush=True)\n",
" print('text2:\\n', text2, '\\n', flush=True)\n",
" decision = input('Please enter >>y<< if this is a duplicate, else hit enter:')\n",
" \n",
" if not decision == 'y':\n",
" continue\n",
" \n",
" # get text content from embedding as second tuple entry\n",
" content = [[\n",
" idx1,\n",
" text1,\n",
" idx2,\n",
" text2,\n",
" score,\n",
" ]]\n",
" df_conc = pd.DataFrame(columns=columns, data=content)\n",
" \n",
" df_candidates = pd.concat([df_candidates, df_conc])\n",
" index_pairs.append((idx1, idx2))\n",
" \n",
" return df_candidates, index_pairs"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14 18 0.851394\n",
"16 181 0.818661\n",
" 195 0.840125\n",
" 87 0.812861\n",
" 1306 0.818661\n",
" ... \n",
"876 911 0.812442\n",
"929 910 0.847216\n",
" 870 0.964813\n",
"910 870 0.830993\n",
"837 868 0.951816\n",
"Length: 1445, dtype: float32"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"SIMILARITY_THRESHOLD = 0.8\n",
"SAVE_PATH_COSSIM_CANDFILT_WHOLE = f'./02_1_Preprocess1/{DATA_SET_ID}_02_3_cosineSim_idx_cand_filter_textbased.pkl'\n",
"\n",
"SKIP = False\n",
"if not SKIP:\n",
" cosineSim_filt = filt_thresh_cosSim_matrix(\n",
" threshold=SIMILARITY_THRESHOLD,\n",
" cosineSim_idx_matrix=cosineSim_idx_matrix,\n",
" )\n",
" # saving\n",
" cosineSim_filt.to_pickle(SAVE_PATH_COSSIM_CANDFILT_WHOLE)\n",
"else:\n",
" cosineSim_filt = pd.read_pickle(SAVE_PATH_COSSIM_CANDFILT_WHOLE)\n",
"cosineSim_filt"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"a:\\Arbeitsaufgaben\\Instandhaltung\\ihm_analyze\\helpers.py:131: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df_candidates = pd.concat([df_candidates, df_conc])\n"
]
}
],
"source": [
"SKIP = False\n",
"SAVE_PATH_DUPL_CANDIDATES = (f'./02_1_Preprocess1/{DATA_SET_ID}_02_4_dupl_candidates_'\n",
" f'cosSim_thresh_{SIMILARITY_THRESHOLD}.xlsx')\n",
"SAVE_PATH_IDX_CAND_PAIRS = f'./02_1_Preprocess1/{DATA_SET_ID}_02_4_dupl_idx_pairs_whole_Exp4.pkl'\n",
"\n",
"if not SKIP:\n",
" cosSim_dupl_candidates, dupl_idx_pairs = list_cosSim_dupl_candidates(\n",
" cosineSim_filt=cosineSim_filt,\n",
" embeddings=embds,\n",
" )\n",
" # save results\n",
" cosSim_dupl_candidates.to_excel(SAVE_PATH_DUPL_CANDIDATES)\n",
" save_pickle(obj=dupl_idx_pairs, path=SAVE_PATH_IDX_CAND_PAIRS)\n",
" #cosSim_dupl_candidates\n",
"else:\n",
" cosSim_dupl_candidates = pd.read_excel(SAVE_PATH_DUPL_CANDIDATES, index_col=0)\n",
" dupl_idx_pairs = load_pickle(SAVE_PATH_IDX_CAND_PAIRS)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>idx1</th>\n",
" <th>text1</th>\n",
" <th>idx2</th>\n",
" <th>text2</th>\n",
" <th>score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>14</td>\n",
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
" <td>18</td>\n",
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
" <td>0.851394</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>16</td>\n",
" <td>VDE Prüfung</td>\n",
" <td>181</td>\n",
" <td>· VDE Prüfung</td>\n",
" <td>0.818661</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>16</td>\n",
" <td>VDE Prüfung</td>\n",
" <td>195</td>\n",
" <td>VDE Prüfung nach VDE 0701/0702</td>\n",
" <td>0.840125</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>16</td>\n",
" <td>VDE Prüfung</td>\n",
" <td>87</td>\n",
" <td>Prüfung nach VDE 701/702</td>\n",
" <td>0.812861</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>16</td>\n",
" <td>VDE Prüfung</td>\n",
" <td>1306</td>\n",
" <td>·VDE Prüfung</td>\n",
" <td>0.818661</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>876</td>\n",
" <td>defekte Filter-Stützkörpe von AGT2 und AGT 3</td>\n",
" <td>911</td>\n",
" <td>AGT1 Filter Trichter 2 Klopfer defekt</td>\n",
" <td>0.812442</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>929</td>\n",
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
" <td>910</td>\n",
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
" <td>0.847216</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>929</td>\n",
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
" <td>870</td>\n",
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
" <td>0.964813</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>910</td>\n",
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
" <td>870</td>\n",
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
" <td>0.830993</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>837</td>\n",
" <td>Die Pumpe ist undicht bzw. tropft. Halle:2 Ebe...</td>\n",
" <td>868</td>\n",
" <td>Die Pumpe ist undicht bzw. tropft. Halle:2 Ebe...</td>\n",
" <td>0.951816</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1445 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" idx1 text1 idx2 \\\n",
"0 14 Bestimmen des Prüftermins für elektrische Arbe... 18 \n",
"0 16 VDE Prüfung 181 \n",
"0 16 VDE Prüfung 195 \n",
"0 16 VDE Prüfung 87 \n",
"0 16 VDE Prüfung 1306 \n",
".. ... ... ... \n",
"0 876 defekte Filter-Stützkörpe von AGT2 und AGT 3 911 \n",
"0 929 Unter \"Sonstiges\" können Sie alle anderen Mäng... 910 \n",
"0 929 Unter \"Sonstiges\" können Sie alle anderen Mäng... 870 \n",
"0 910 Unter \"Sonstiges\" können Sie alle anderen Mäng... 870 \n",
"0 837 Die Pumpe ist undicht bzw. tropft. Halle:2 Ebe... 868 \n",
"\n",
" text2 score \n",
"0 Bestimmen des Prüftermins für elektrische Arbe... 0.851394 \n",
"0 · VDE Prüfung 0.818661 \n",
"0 VDE Prüfung nach VDE 0701/0702 0.840125 \n",
"0 Prüfung nach VDE 701/702 0.812861 \n",
"0 ·VDE Prüfung 0.818661 \n",
".. ... ... \n",
"0 AGT1 Filter Trichter 2 Klopfer defekt 0.812442 \n",
"0 Unter \"Sonstiges\" können Sie alle anderen Mäng... 0.847216 \n",
"0 Unter \"Sonstiges\" können Sie alle anderen Mäng... 0.964813 \n",
"0 Unter \"Sonstiges\" können Sie alle anderen Mäng... 0.830993 \n",
"0 Die Pumpe ist undicht bzw. tropft. Halle:2 Ebe... 0.951816 \n",
"\n",
"[1445 rows x 5 columns]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cosSim_dupl_candidates"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Nächste Schritte:**\n",
"- Grenz-Threshold finden, bei dem Duplikate gerade noch richtig erkannt werden"
]
},
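  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How the number of candidate pairs shrinks with the threshold can be read off the upper triangle of the similarity matrix; a minimal sketch with a toy matrix (values are illustrative):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "sim = np.array([[1.0, 0.82, 0.40],\n",
    "                [0.82, 1.0, 0.95],\n",
    "                [0.40, 0.95, 1.0]])\n",
    "iu = np.triu_indices_from(sim, k=1)  # upper triangle: each pair counted once\n",
    "counts = {t: int((sim[iu] >= t).sum()) for t in (0.8, 0.9, 0.95)}\n",
    "print(counts)\n",
    "```"
   ]
  },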
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"if False:\n",
" thresholds = (0.75, 0.8, 0.85, 0.9, 0.93, 0.95, 0.96, 0.97, 0.98)\n",
"\n",
" for thresh in thresholds:\n",
" \n",
" cosineSim_filt = filt_thresh_cosSim_matrix(\n",
" threshold=thresh,\n",
" cosineSim_idx_matrix=cosineSim_idx_matrix.copy(),\n",
" )\n",
" \n",
" cosSim_dupl_candidates = list_cosSim_dupl_candidates(\n",
" cosineSim_filt=cosineSim_filt,\n",
" embeddings=embds,\n",
" )\n",
" \n",
" # saving path\n",
" saving_path = (f'./Filterung_Duplikate/dupl_candidates_'\n",
" f'cosSim_thresh_{thresh}_STFR.xlsx')\n",
" \n",
" cosSim_dupl_candidates.to_excel(saving_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Ergebnisse:**\n",
"- kein allgemeiner Threshold ableitbar, nur grober Richtwert\n",
"- Paare mit geringerem Score stellenweise ähnlicher als die mit höherem Score\n",
"- finale Entscheidung für Duplikat händisch, da Kontextwissen trotzdem notwendig\n",
"- Arbeit mit ``temp1`` und merging von Einträgen\n",
"\n",
"- für gesamten Datensatz händisch nicht zielführend (über 9300 Einträge, die verglichen werden müssten)\n",
"- für ersten Wurf: Merging basierend auf Threshold von ``0.8``"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"*Manual Decision*"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"# manually decide if candidates are indeed duplicates\n",
"\n",
"SKIP = True\n",
"if not SKIP:\n",
" cosSim_dupl_candidates, dupl_idx_pairs = choose_cosSim_dupl_candidates(\n",
" cosineSim_filt=cosineSim_filt,\n",
" embeddings=embds,\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"#save_pickle(obj=dupl_idx_pairs, path='./Filterung_Duplikate/dupl_idx_pairs_Exp4.pkl')"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"#dupl_idx_pairs = load_pickle(path='./Filterung_Duplikate/dupl_idx_pairs_Exp4.pkl')\n",
"#dupl_idx_pairs = load_pickle(path='./02_1_Preprocess1/dupl_idx_pairs_whole_Exp4.pkl')\n",
"\n",
"#dupl_idx_pairs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"*Eliminate Candidates*"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"temp2 = temp1.copy()\n",
"dupl_idx_pairs = load_pickle(path=SAVE_PATH_IDX_CAND_PAIRS)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1445"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(dupl_idx_pairs)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"# merge duplicates\n",
"\n",
"# to-do:\n",
"# merge: 'num_occur', 'assoc_obj_ids', \n",
"# recalc: 'num_assoc_obj_ids'\n",
"\n",
"for (i1, i2) in dupl_idx_pairs:\n",
" \n",
" # if an entry does not exist anymore, skip this pair\n",
" if i1 not in temp2.index or i2 not in temp2.index:\n",
" continue\n",
" \n",
" # merge num occur\n",
" num_occur1 = temp2.at[i1, 'num_occur']\n",
" num_occur2 = temp2.at[i2, 'num_occur']\n",
" new_num_occur = num_occur1 + num_occur2\n",
"\n",
" # merge assoc obj ids\n",
" assoc_ids1 = temp2.at[i1, 'assoc_obj_ids']\n",
" assoc_ids2 = temp2.at[i2, 'assoc_obj_ids']\n",
" new_assoc_ids = np.append(assoc_ids1, assoc_ids2)\n",
" new_assoc_ids = np.unique(new_assoc_ids.flatten())\n",
"\n",
" # recalc num assoc obj ids\n",
" new_num_assoc_obj_ids = len(new_assoc_ids)\n",
"\n",
" # write porperties to first entry\n",
" temp2.at[i1, 'num_occur'] = new_num_occur\n",
" temp2.at[i1, 'assoc_obj_ids'] = new_assoc_ids\n",
" temp2.at[i1, 'num_assoc_obj_ids'] = new_num_assoc_obj_ids\n",
" \n",
" # drop second entry\n",
" temp2 = temp2.drop(index=i2)"
]
},
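  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the pairwise merge above drops the second entry of each pair, so chains of candidates (A~B, B~C) are only merged as far as both indices still exist. A transitive grouping, e.g. via union-find, would merge whole chains; a minimal sketch (the index pairs are illustrative):\n",
    "\n",
    "```python\n",
    "def group_pairs(pairs):\n",
    "    # union-find: group indices connected through any chain of pairs\n",
    "    parent = {}\n",
    "    def find(x):\n",
    "        parent.setdefault(x, x)\n",
    "        while parent[x] != x:\n",
    "            parent[x] = parent[parent[x]]  # path halving\n",
    "            x = parent[x]\n",
    "        return x\n",
    "    for a, b in pairs:\n",
    "        parent[find(a)] = find(b)\n",
    "    groups = {}\n",
    "    for x in list(parent):\n",
    "        groups.setdefault(find(x), set()).add(x)\n",
    "    return list(groups.values())\n",
    "\n",
    "print(group_pairs([(14, 18), (16, 181), (181, 195)]))\n",
    "```"
   ]
  },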
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>len</th>\n",
" <th>num_occur</th>\n",
" <th>assoc_obj_ids</th>\n",
" <th>num_assoc_obj_ids</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
" <td>527</td>\n",
" <td>2809</td>\n",
" <td>[404, 405, 406, 407, 408, 409, 410, 411, 412, ...</td>\n",
" <td>1724</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>VDE Prüfung</td>\n",
" <td>11</td>\n",
" <td>2034</td>\n",
" <td>[404, 407, 408, 409, 410, 411, 412, 413, 414, ...</td>\n",
" <td>1187</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Wartung nach Arbeitsplan</td>\n",
" <td>24</td>\n",
" <td>1062</td>\n",
" <td>[726, 798, 800, 801, 802, 921, 922, 923, 924, ...</td>\n",
" <td>218</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Manuelle Dosierung des Biozids</td>\n",
" <td>30</td>\n",
" <td>526</td>\n",
" <td>[0, 722, 723, 724, 726]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
" <td>29</td>\n",
" <td>511</td>\n",
" <td>[722, 723, 724, 725, 726]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" descr len num_occur \\\n",
"14 Bestimmen des Prüftermins für elektrische Arbe... 527 2809 \n",
"16 VDE Prüfung 11 2034 \n",
"4 Wartung nach Arbeitsplan 24 1062 \n",
"7 Manuelle Dosierung des Biozids 30 526 \n",
"12 Mikrobiologie(Abklatsch-Test) 29 511 \n",
"\n",
" assoc_obj_ids num_assoc_obj_ids \n",
"14 [404, 405, 406, 407, 408, 409, 410, 411, 412, ... 1724 \n",
"16 [404, 407, 408, 409, 410, 411, 412, 413, 414, ... 1187 \n",
"4 [726, 798, 800, 801, 802, 921, 922, 923, 924, ... 218 \n",
"7 [0, 722, 723, 724, 726] 5 \n",
"12 [722, 723, 724, 725, 726] 5 "
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1.head()"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>len</th>\n",
" <th>num_occur</th>\n",
" <th>assoc_obj_ids</th>\n",
" <th>num_assoc_obj_ids</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
" <td>527</td>\n",
" <td>3081</td>\n",
" <td>[404, 405, 406, 407, 408, 409, 410, 411, 412, ...</td>\n",
" <td>1724</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>VDE Prüfung</td>\n",
" <td>11</td>\n",
" <td>2201</td>\n",
" <td>[404, 407, 408, 409, 410, 411, 412, 413, 414, ...</td>\n",
" <td>1203</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Wartung nach Arbeitsplan</td>\n",
" <td>24</td>\n",
" <td>1091</td>\n",
" <td>[726, 798, 800, 801, 802, 921, 922, 923, 924, ...</td>\n",
" <td>219</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Manuelle Dosierung des Biozids</td>\n",
" <td>30</td>\n",
" <td>526</td>\n",
" <td>[0, 722, 723, 724, 726]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
" <td>29</td>\n",
" <td>511</td>\n",
" <td>[722, 723, 724, 725, 726]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" descr len num_occur \\\n",
"14 Bestimmen des Prüftermins für elektrische Arbe... 527 3081 \n",
"16 VDE Prüfung 11 2201 \n",
"4 Wartung nach Arbeitsplan 24 1091 \n",
"7 Manuelle Dosierung des Biozids 30 526 \n",
"12 Mikrobiologie(Abklatsch-Test) 29 511 \n",
"\n",
" assoc_obj_ids num_assoc_obj_ids \n",
"14 [404, 405, 406, 407, 408, 409, 410, 411, 412, ... 1724 \n",
"16 [404, 407, 408, 409, 410, 411, 412, 413, 414, ... 1203 \n",
"4 [726, 798, 800, 801, 802, 921, 922, 923, 924, ... 219 \n",
"7 [0, 722, 723, 724, 726] 5 \n",
"12 [722, 723, 724, 725, 726] 5 "
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp2.head()"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 2184 entries, 14 to 2183\n",
"Data columns (total 5 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 descr 2184 non-null object\n",
" 1 len 2184 non-null object\n",
" 2 num_occur 2184 non-null object\n",
" 3 assoc_obj_ids 2184 non-null object\n",
" 4 num_assoc_obj_ids 2184 non-null object\n",
"dtypes: object(5)\n",
"memory usage: 166.9+ KB\n"
]
}
],
"source": [
"temp1.info()"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 1735 entries, 14 to 2183\n",
"Data columns (total 5 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 descr 1735 non-null object\n",
" 1 len 1735 non-null object\n",
" 2 num_occur 1735 non-null object\n",
" 3 assoc_obj_ids 1735 non-null object\n",
" 4 num_assoc_obj_ids 1735 non-null object\n",
"dtypes: object(5)\n",
"memory usage: 81.3+ KB\n"
]
}
],
"source": [
"temp2.info()"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"# transform assoc_obj_ids to list to be able to save DF\n",
"temp2['assoc_obj_ids'] = temp2['assoc_obj_ids'].map(lambda x: x.tolist())"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>len</th>\n",
" <th>num_occur</th>\n",
" <th>assoc_obj_ids</th>\n",
" <th>num_assoc_obj_ids</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
" <td>527</td>\n",
" <td>3081</td>\n",
" <td>[404, 405, 406, 407, 408, 409, 410, 411, 412, ...</td>\n",
" <td>1724</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>VDE Prüfung</td>\n",
" <td>11</td>\n",
" <td>2201</td>\n",
" <td>[404, 407, 408, 409, 410, 411, 412, 413, 414, ...</td>\n",
" <td>1203</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Wartung nach Arbeitsplan</td>\n",
" <td>24</td>\n",
" <td>1091</td>\n",
" <td>[726, 798, 800, 801, 802, 921, 922, 923, 924, ...</td>\n",
" <td>219</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Manuelle Dosierung des Biozids</td>\n",
" <td>30</td>\n",
" <td>526</td>\n",
" <td>[0, 722, 723, 724, 726]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
" <td>29</td>\n",
" <td>511</td>\n",
" <td>[722, 723, 724, 725, 726]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>844</th>\n",
" <td>Filterabreinigung AGT 1 : das erste Ventil von...</td>\n",
" <td>68</td>\n",
" <td>1</td>\n",
" <td>[0]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>843</th>\n",
" <td>Abnahmeprüfung durch Sachkundigen</td>\n",
" <td>33</td>\n",
" <td>1</td>\n",
" <td>[1245]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>842</th>\n",
" <td>Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung</td>\n",
" <td>47</td>\n",
" <td>1</td>\n",
" <td>[1326]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>841</th>\n",
" <td>Ausgeführt</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>[2365]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2183</th>\n",
" <td>Antrieb neu Dichten. Liegt auf Werkbank</td>\n",
" <td>39</td>\n",
" <td>1</td>\n",
" <td>[0]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1735 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" descr len num_occur \\\n",
"14 Bestimmen des Prüftermins für elektrische Arbe... 527 3081 \n",
"16 VDE Prüfung 11 2201 \n",
"4 Wartung nach Arbeitsplan 24 1091 \n",
"7 Manuelle Dosierung des Biozids 30 526 \n",
"12 Mikrobiologie(Abklatsch-Test) 29 511 \n",
"... ... ... ... \n",
"844 Filterabreinigung AGT 1 : das erste Ventil von... 68 1 \n",
"843 Abnahmeprüfung durch Sachkundigen 33 1 \n",
"842 Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung 47 1 \n",
"841 Ausgeführt 10 1 \n",
"2183 Antrieb neu Dichten. Liegt auf Werkbank 39 1 \n",
"\n",
" assoc_obj_ids num_assoc_obj_ids \n",
"14 [404, 405, 406, 407, 408, 409, 410, 411, 412, ... 1724 \n",
"16 [404, 407, 408, 409, 410, 411, 412, 413, 414, ... 1203 \n",
"4 [726, 798, 800, 801, 802, 921, 922, 923, 924, ... 219 \n",
"7 [0, 722, 723, 724, 726] 5 \n",
"12 [722, 723, 724, 725, 726] 5 \n",
"... ... ... \n",
"844 [0] 1 \n",
"843 [1245] 1 \n",
"842 [1326] 1 \n",
"841 [2365] 1 \n",
"2183 [0] 1 \n",
"\n",
"[1735 rows x 5 columns]"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp2"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"SAVE_PATH_REMOVED_DUPL = f'./02_1_Preprocess1/{DATA_SET_ID}_03_dataset_remov_dupl_similar_whole.pkl'\n",
"temp2.to_pickle(SAVE_PATH_REMOVED_DUPL)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Handling of spelling errors (Hunspell via PyEnchant)\n",
"- Handling of vector embeddings via transformer models:\n",
"  - higher error tolerance (spelling mistakes, redundant or insignificant words)\n",
"  - not dependent on every word being present in the vocabulary (cf. spaCy model)\n",
"  - higher accuracy at detecting actual duplicates in first experiments\n",
"- use vector embeddings for duplicate detection"
]
},
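The embedding-based duplicate search sketched above boils down to pairwise cosine similarity over an embedding matrix. A minimal numpy sketch of that idea (the embeddings here are hand-made placeholders; in the notebook they would come from a transformer model):

```python
import numpy as np

def find_duplicate_pairs(embeddings: np.ndarray, threshold: float = 0.9):
    """Return index pairs (i, j), i < j, whose cosine similarity >= threshold."""
    # normalize rows so that the plain dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # only look at the strict upper triangle to skip self- and mirrored pairs
    i_idx, j_idx = np.triu_indices(len(embeddings), k=1)
    mask = sim[i_idx, j_idx] >= threshold
    return list(zip(i_idx[mask].tolist(), j_idx[mask].tolist()))

# toy embeddings: rows 0 and 1 are near-identical, row 2 points elsewhere
emb = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
print(find_duplicate_pairs(emb, threshold=0.95))  # → [(0, 1)]
```

Near-identical descriptions land above the threshold even when a few characters differ, which is what makes this more tolerant of spelling errors than exact string matching.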
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ---> Model Training: Data Set"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"# data for model training\n",
"data = temp1.iloc[50:300,0].to_list()\n",
"data = [e for e in data if e != '']\n",
"\n",
"with open('spacy_train/training_data_2.txt','w', encoding='utf-8') as f:\n",
" f.writelines(\"\\n\".join(data))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### spaCy"
]
},
{
"cell_type": "code",
"execution_count": 245,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Durchführung: Sollwert: 20 0,1g'"
]
},
"execution_count": 245,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string = temp1.iloc[-2,0]\n",
"#string = temp1.iloc[0,0]\n",
"string"
]
},
{
"cell_type": "code",
"execution_count": 246,
"metadata": {},
"outputs": [],
"source": [
"string = 'Ich spiele jeden Tag mit den Kindern im Garten. Das ist schön.'\n",
"string = 'Die Maschine XYZ ist aufgrund einer Störung im Druckluftsystem defekt.'\n",
"#string = 'The machine XYZ is broken because of a failure in the air pressure system.'\n",
"#string = 'Wir benötigen das Werkzeug von Herr Stöppel, um das derzeit abzuarbeiten.Dies wird durch Herrn Strebe getan.'"
]
},
{
"cell_type": "code",
"execution_count": 247,
"metadata": {},
"outputs": [],
"source": [
"doc = nlp(string)"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [],
"source": [
"# simulate occurrence counter\n",
"OCC_COUNTER = 10"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"SPELL_CHECK_NON_CHARS = set([' ', '.', ',', ';', ':', '-'])\n",
"CLEANING = True\n",
"#CLEANING = False\n",
"\n",
"def pre_clean_word(string: str) -> str:\n",
"    # keep only letters (incl. German umlauts); drop digits, punctuation and whitespace\n",
"    pattern = r'[^A-Za-zäöüÄÖÜ]+'\n",
"    string = re.sub(pattern, '', string)\n",
"    return string\n",
"\n",
"# https://stackoverflow.com/questions/25341945/check-if-string-has-date-any-format \n",
"def is_str_date(string, fuzzy=False):\n",
" \n",
" try:\n",
" parse(string, fuzzy=fuzzy)\n",
" return True\n",
" except ValueError:\n",
" return False\n",
"\n",
"\n",
"def obtain_sub_tree(token):\n",
"    # collect all descendants of the token in the dependency subtree\n",
" descendants = list(token.subtree)\n",
" descendants.remove(token)\n",
" logger.debug(f'Token >>{token}<< has subtree >>{descendants}<<')\n",
" return descendants\n",
"\n",
"\n",
"def add_children_descendants(\n",
" parent,\n",
" weight,\n",
" connections,\n",
" unique_tokens,\n",
" children_sents,\n",
" map_2_word: dict[str, str] | None = None,\n",
"):\n",
" # add child as key\n",
" if CLEANING:\n",
" parent_lemma = pre_clean_word(string=parent.lemma_)\n",
" \n",
" # map words\n",
" if map_2_word is not None:\n",
" if parent_lemma.lower() in map_2_word:\n",
" parent_lemma = map_2_word[parent_lemma.lower()]\n",
" #logger.info(f\"[SUCCESS] Mapped PARENT to {parent_lemma}\")\n",
" \n",
" if parent_lemma != '':\n",
" if (parent_lemma, parent.pos_) in connections:\n",
"                connections[(parent_lemma, parent.pos_)].append(children_sents)\n",
" else:\n",
" # do not add auxiliary words\n",
" if parent.pos_ != 'AUX':\n",
" unique_tokens.add(parent_lemma)\n",
" connections[(parent_lemma, parent.pos_)] = list()\n",
" connections[(parent_lemma, parent.pos_)].append(children_sents)\n",
" #connections[parent.lemma_].append([descendant.lemma_, descendant])\n",
" else:\n",
" if (parent.lemma_, parent.pos_) in connections:\n",
"            connections[(parent.lemma_, parent.pos_)].append(children_sents)\n",
" else:\n",
" # do not add auxiliary words\n",
" if parent.pos_ != 'AUX':\n",
" unique_tokens.add(parent.lemma_)\n",
" connections[(parent.lemma_, parent.pos_)] = list()\n",
" connections[(parent.lemma_, parent.pos_)].append(children_sents)\n",
" #connections[parent.lemma_].append([descendant.lemma_, descendant])\n",
"\n",
"\n",
"def obtain_descendant_info(\n",
" doc,\n",
" weight,\n",
" POS_of_interest,\n",
" TAG_of_interest,\n",
" connections,\n",
" unique_tokens,\n",
" map_2_word: dict[str, str] | None = None,\n",
"):\n",
" \n",
" # iterate over sentences\n",
" for sent in doc.sents:\n",
" \n",
" # iterate over tokens in one sentence\n",
" for token in sent:\n",
" \n",
" if not (token.pos_ in POS_of_interest or token.tag_ in TAG_of_interest):\n",
" continue\n",
" elif token.lemma_.lower() in GENERAL_BLACKLIST:\n",
" logger.debug(f'Eliminated parent >>{token}<< because of blacklist')\n",
" continue\n",
" \n",
" descendants = obtain_sub_tree(token=token)\n",
" \n",
" # iterate over all children if there are any\n",
" if descendants is not None:\n",
" # list with all children in the current sentence\n",
" children_sents = list()\n",
" \n",
" for child in descendants:\n",
" logger.debug(f'Token is >>{token}<< with child >>{child}<< and POS {child.pos_}')\n",
" \n",
"                    # eliminate cases of cross-references with verbs\n",
" if ((token.pos_ == 'AUX' or token.pos_ == 'VERB') and\n",
" (child.pos_ == 'AUX' or child.pos_ == 'VERB')):\n",
" continue\n",
" elif not (child.pos_ in POS_of_interest or child.tag_ in TAG_of_interest):\n",
" continue\n",
" elif child.lemma_.lower() in GENERAL_BLACKLIST:\n",
" logger.debug(f'Eliminated child >>{child}<< because of blacklist')\n",
" continue\n",
" \n",
" \n",
" if CLEANING:\n",
" child = pre_clean_word(string=child.lemma_)\n",
" if child == '':\n",
" continue\n",
" #child = pre_clean_word(string=child)\n",
" \n",
" if (child not in DESC_BLACKLIST and\n",
" not is_str_date(string=child)):\n",
" #not is_str_date(string=child.text)):\n",
" #children_sents.append((child.lemma_, weight))\n",
" \n",
" # map words\n",
" if map_2_word is not None:\n",
" if child.lower() in map_2_word:\n",
" child = map_2_word[child.lower()]\n",
" #logger.info(f\"[SUCCESS] Mapped CHILD to {child}\")\n",
" \n",
" children_sents.append((child, weight))\n",
" \n",
" #if child.lemma_ not in unique_tokens:\n",
" if (child not in unique_tokens and\n",
" not is_str_date(string=child)):\n",
" #unique_tokens.add(child.lemma_)\n",
" unique_tokens.add(child)\n",
" \n",
" else:\n",
" if (child.lemma_ not in DESC_BLACKLIST and\n",
" not is_str_date(string=child.text)):\n",
" children_sents.append((child.lemma_, weight))\n",
" \n",
" if child.lemma_ not in unique_tokens:\n",
" unique_tokens.add(child.lemma_)\n",
" \n",
" # add list of children for current parent if not empty\n",
" if children_sents:\n",
" \n",
" add_children_descendants(\n",
" parent=token,\n",
" weight=weight,\n",
" connections=connections,\n",
" unique_tokens=unique_tokens,\n",
" children_sents=children_sents,\n",
" map_2_word=map_2_word,\n",
" )"
]
},
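A caveat on the `is_str_date` helper defined above: `dateutil.parser.parse` is deliberately permissive, so bare numbers are also accepted as dates. A small standalone check of that behaviour:

```python
from dateutil.parser import parse

def is_str_date(string: str, fuzzy: bool = False) -> bool:
    try:
        parse(string, fuzzy=fuzzy)
        return True
    except ValueError:
        return False

print(is_str_date('24.08.2023'))  # → True
print(is_str_date('Ventil'))      # → False
# bare numbers are parsed as a day of the current month:
print(is_str_date('12'))          # → True
```

Note also that `parse` can raise `OverflowError` for extremely large numeric strings, which the `except ValueError` clause does not catch.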
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"def obtain_adj_matrix(unique_tokens, connections):\n",
"\n",
" adj_mat = pd.DataFrame(\n",
" data=0, \n",
" columns=list(unique_tokens), \n",
" index=list(unique_tokens),\n",
" dtype=np.uint32,\n",
" )\n",
" \n",
" for (pred, POS), descendants_list in connections.items():\n",
" #print(f'{pred=}, {descendants=}')\n",
" \n",
" for descendants in descendants_list:\n",
" #print(f'{descendants}')\n",
" \n",
" if POS not in POS_INDIRECT:\n",
" for (desc, weight) in descendants:\n",
" adj_mat.at[pred, desc] += weight\n",
" \n",
" else:\n",
" if len(descendants) > 1:\n",
" # if auxiliary word, make connection between all associated words\n",
" combs = combinations(descendants, r=2)\n",
" \n",
" for comb in combs:\n",
" # comb is tuple ((word_1, weight), (word_2, weight))\n",
" weight = comb[0][1]\n",
" word_1 = comb[0][0]\n",
" word_2 = comb[1][0]\n",
" \n",
" \"\"\"\n",
" if ((word_1 == 'Eigenverantwortlichkeit' or word_1 == 'neu') and\n",
" (word_2 == 'Eigenverantwortlichkeit' or word_2 == 'neu')):\n",
" print(f'Hello from {pred=} with {descendants=}')\n",
" \"\"\"\n",
" \n",
" adj_mat.at[word_1, word_2] += weight\n",
" \n",
" return adj_mat\n",
"\n",
"\n",
"def make_undir_adj_matrix(adj_mat):\n",
" \n",
" adj_mat_undir = adj_mat.copy()\n",
" arr = adj_mat_undir.to_numpy()\n",
" arr_upper = np.triu(arr)\n",
"    # transpose the strict lower triangle (k=-1 keeps the diagonal\n",
"    # from being counted twice) and fold it onto the upper triangle\n",
"    arr_lower = np.tril(arr, k=-1).T\n",
"    arr_new = arr_lower + arr_upper\n",
" \n",
" adj_mat_undir.loc[:] = arr_new\n",
" \n",
" return adj_mat_undir"
]
},
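Folding a directed adjacency matrix into an undirected one, as `make_undir_adj_matrix` does, can be illustrated on a toy array (a sketch independent of the notebook's DataFrame wrapper); using the strict lower triangle (`k=-1`) avoids counting the diagonal twice:

```python
import numpy as np

# toy directed adjacency matrix:
#   a->b weight 2, b->a weight 3, c->a weight 1
arr = np.array([
    [0, 2, 0],
    [3, 0, 0],
    [1, 0, 0],
])

# fold the strict lower triangle (k=-1 excludes the diagonal, so self-loop
# weights are not counted twice) onto the upper triangle
undirected = np.triu(arr) + np.tril(arr, k=-1).T
print(undirected.tolist())  # → [[0, 5, 1], [0, 0, 0], [0, 0, 0]]
```

The opposite edge weights 2 and 3 merge into a single undirected weight of 5 in the upper triangle.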
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Full dataset"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"SKIP = False\n",
"\n",
"SAVE_PATH_REMOVED_DUPL = f'./02_1_Preprocess1/{DATA_SET_ID}_03_dataset_remov_dupl_similar_whole.pkl'\n",
"if not SKIP:\n",
" temp2 = pd.read_pickle(SAVE_PATH_REMOVED_DUPL)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>len</th>\n",
" <th>num_occur</th>\n",
" <th>assoc_obj_ids</th>\n",
" <th>num_assoc_obj_ids</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
" <td>527</td>\n",
" <td>3081</td>\n",
" <td>[404, 405, 406, 407, 408, 409, 410, 411, 412, ...</td>\n",
" <td>1724</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>VDE Prüfung</td>\n",
" <td>11</td>\n",
" <td>2201</td>\n",
" <td>[404, 407, 408, 409, 410, 411, 412, 413, 414, ...</td>\n",
" <td>1203</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Wartung nach Arbeitsplan</td>\n",
" <td>24</td>\n",
" <td>1091</td>\n",
" <td>[726, 798, 800, 801, 802, 921, 922, 923, 924, ...</td>\n",
" <td>219</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Manuelle Dosierung des Biozids</td>\n",
" <td>30</td>\n",
" <td>526</td>\n",
" <td>[0, 722, 723, 724, 726]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
" <td>29</td>\n",
" <td>511</td>\n",
" <td>[722, 723, 724, 725, 726]</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" descr len num_occur \\\n",
"14 Bestimmen des Prüftermins für elektrische Arbe... 527 3081 \n",
"16 VDE Prüfung 11 2201 \n",
"4 Wartung nach Arbeitsplan 24 1091 \n",
"7 Manuelle Dosierung des Biozids 30 526 \n",
"12 Mikrobiologie(Abklatsch-Test) 29 511 \n",
"\n",
" assoc_obj_ids num_assoc_obj_ids \n",
"14 [404, 405, 406, 407, 408, 409, 410, 411, 412, ... 1724 \n",
"16 [404, 407, 408, 409, 410, 411, 412, 413, 414, ... 1203 \n",
"4 [726, 798, 800, 801, 802, 921, 922, 923, 924, ... 219 \n",
"7 [0, 722, 723, 724, 726] 5 \n",
"12 [722, 723, 724, 725, 726] 5 "
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp2.head()"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"# analyze first 10 entries\n",
"#descr = temp1[['descr', 'num_occur']]\n",
"descr = temp2[['descr', 'num_occur']]\n",
"#descr = descr.iloc[:7,:]"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"#descr.iat[0,0] = 'Das ist ein Test am 24.08.2023'"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1735"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(descr)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>num_occur</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
" <td>3081</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>VDE Prüfung</td>\n",
" <td>2201</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Wartung nach Arbeitsplan</td>\n",
" <td>1091</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Manuelle Dosierung des Biozids</td>\n",
" <td>526</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
" <td>511</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>844</th>\n",
" <td>Filterabreinigung AGT 1 : das erste Ventil von...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>843</th>\n",
" <td>Abnahmeprüfung durch Sachkundigen</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>842</th>\n",
" <td>Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>841</th>\n",
" <td>Ausgeführt</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2183</th>\n",
" <td>Antrieb neu Dichten. Liegt auf Werkbank</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1735 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" descr num_occur\n",
"14 Bestimmen des Prüftermins für elektrische Arbe... 3081\n",
"16 VDE Prüfung 2201\n",
"4 Wartung nach Arbeitsplan 1091\n",
"7 Manuelle Dosierung des Biozids 526\n",
"12 Mikrobiologie(Abklatsch-Test) 511\n",
"... ... ...\n",
"844 Filterabreinigung AGT 1 : das erste Ventil von... 1\n",
"843 Abnahmeprüfung durch Sachkundigen 1\n",
"842 Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung 1\n",
"841 Ausgeführt 1\n",
"2183 Antrieb neu Dichten. Liegt auf Werkbank 1\n",
"\n",
"[1735 rows x 2 columns]"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"descr"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"#LOAD_CALC_FILES = True\n",
"#LOAD_CALC_FILES = False\n",
"#IS_TEST = True\n",
"IS_TEST = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Discovered groups**\n",
"- Prüfung:\n",
" - Prüfen\n",
" - Sichtprüfung\n",
" - Überprüfung / überprüfen\n",
" - Kontrolle / kontrollieren\n",
" - sicherstellen / Sicherstellung\n",
" - Wartung / warten\n",
" - Reinigung / reinigen\n",
" - Prüfbericht\n",
"- Handlung:\n",
" - Schmierung\n",
" - schmieren\n",
" - reinigen\n",
" - Reinigung\n",
" - schneiden / nachschneiden\n",
"- zyklisch:\n",
" - täglich\n",
" - wöchentlich\n",
" - monatlich\n",
" - jährlich\n",
"- Datum:\n",
" - Uhr\n",
" - Montag, Dienstag, Mittwoch, Donnerstag, Freitag, Samstag, Sonntag\n",
"- Kleinteile:\n",
" - Schraube\n",
" - Adapter\n",
" - Halterung\n",
" - Scheibe\n",
" - Gewinde\n",
" - Ventil\n",
" - Schalter\n",
" - Befestigungsschraube\n",
"- Komponenten:\n",
" - Kupplung\n",
" - Motor\n",
" - Getriebe\n",
" - Ventilator\n",
" - Zahnriemen\n",
"  - Transformator\n",
" - Filterelement\n",
" - Dosierpumpe\n",
" - Luftschlauch\n",
" - Dichtung\n",
" - Filter\n",
" - Scharnier\n",
" - Spannrolle\n",
" - Druckluftbehälter\n",
" - Kette\n",
" - Anschlüsse\n",
" - Schläuche\n",
" - Beleuchtung\n",
"- Elektrik:\n",
" - Zuleitung\n",
" - Kabel\n",
" - Steckdose\n",
" - Elektriker\n",
" - Elektronik\n",
" - elektrisch\n",
" - Sicherheitsbeleuchtung\n",
"- Anlagen:\n",
" - Mischanlage\n",
" - Maschine\n",
" - Wasserenthärtungsanlage\n",
" - Lüftungsanlage\n",
" - Klimaanlage\n",
"- Vereinbarung:\n",
" - Wartungsvertrag\n",
" - Neuvertrag\n",
" - Vertrag\n",
" - terminieren / terminiert\n",
" - Absprache\n",
" - melden\n",
" - telefonisch\n",
" - mitteilen\n",
"- Störbild:\n",
" - defekt\n",
" - kaputt\n",
" - Geräusch\n",
" - undicht\n",
" - leckt\n",
" - Dichtigkeit\n",
"- Abteilung:\n",
" - Buchhaltung\n",
" - Betriebstechnik\n",
" - Entwicklung\n",
"- Ort:\n",
" - Kesselhaus\n",
" - Durchfahrt\n",
" - Dach\n",
" - Haupteingang\n",
" - Werkbank\n",
" - Schlosserei"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
"word_2_map = {\n",
" 'Prüfung': ['prüfen', 'sichtprüfung', 'überprüfung', 'überprüfen',\n",
" 'kontrolle', 'kontrollieren', 'sicherstellen', 'sicherstellung',\n",
" 'reinigung', 'reinigen', 'prüfbericht', 'sichtkontrolle',\n",
" 'rundgang', 'technikrundgang'],\n",
" 'Wartung': ['wartung', 'warten', 'wartungstätigkeit', 'wartungsarbeit',\n",
" 'wartungsplan'],\n",
" 'Handlung': ['schmierung', 'schmieren', 'reinigen', 'reinigung',\n",
" 'schneiden', 'nachschneiden'],\n",
" 'zyklisch': ['täglich', 'tägliche', 'täglicher', 'wöchentlich', 'wöchentliche', 'monatlich', 'jährlich',\n",
" 'halbjährlich', 'monatliche', 'wartungsintervall'],\n",
" 'Datum': ['uhr', 'montag', 'dienstag', 'mittwoch', 'donnerstag',\n",
" 'freitag', 'samstag', 'sonntag'],\n",
" 'Kleinteile': ['schraube', 'adapter', 'halterung', 'scheibe', 'gewinde',\n",
" 'ventil', 'schalter', 'befestigungsschraube'],\n",
" 'Komponenten': ['kupplung', 'motor', 'getriebe', 'ventilator',\n",
" 'zahnriemen', 'transformator', 'filterelement',\n",
" 'dosierpumpe', 'luftschlauch', 'dichtung', 'filter',\n",
" 'scharnier', 'spannrolle', 'druckluftbehälter', 'kette',\n",
" 'anschlüsse', 'anschluss', 'schläuche', 'schlauch', 'beleuchtung'],\n",
" 'Elektrik': ['zuleitung', 'kabel', 'steckdose', 'elektriker',\n",
" 'elektronik', 'elektrisch', 'sicherheitsbeleuchtung'],\n",
" 'Anlagen': ['anlage', 'mischanlage', 'maschine', 'klimaanlage', 'filteranlage',\n",
" 'wasserenthärtungsanlage', 'lüftungsanlage', 'wasseraufbereitungsanlage'],\n",
"    'Vereinbarung': ['wartungsvertrag', 'neuvertrag', 'vertrag', 'terminieren',\n",
"                     'terminiert', 'absprache', 'melden', 'telefonisch', 'mitteilen'],\n",
" 'Störbild': ['defekt', 'kaputt', 'geräusch', 'undicht', 'leckt', 'dichtigkeit'],\n",
" 'Abteilung': ['buchhaltung', 'betriebstechnik', 'entwicklung'],\n",
" 'Ort': ['kesselhaus', 'durchfahrt', 'dach', \n",
" 'haupteingang', 'werkbank', 'schlosserei'],\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Question: is there a way to classify terms automatically?\n",
"  - e.g. automatically detecting whether a term denotes a component or not"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
"map_2_word = dict()\n",
"\n",
"for key, word_list in word_2_map.items():\n",
" \n",
" for word in word_list:\n",
" map_2_word[word] = key"
]
},
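The inversion above turns the category → word-list mapping into a flat word → category lookup. A small self-contained check of the same pattern (with a hypothetical two-category mapping, not the full dictionary from the notebook):

```python
# hypothetical two-category mapping (stand-in for the full word_2_map above)
word_2_map = {
    'Prüfung': ['prüfen', 'kontrolle'],
    'Datum': ['montag', 'uhr'],
}

# invert: every member word points back to its category label
map_2_word = {word: key for key, words in word_2_map.items() for word in words}

print(map_2_word['kontrolle'])           # → Prüfung
print(map_2_word.get('motor', 'motor'))  # unmapped words fall through unchanged
```

If a word were listed under two categories, the category iterated last would silently win, so the member lists should stay disjoint.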
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"IS_TEST = False\n",
"LOAD_CALC_FILES = False"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1735"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(descr)"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:base:Number of entries processed: 1, Percent completed: 0.06\n",
"INFO:base:Number of entries processed: 501, Percent completed: 28.88\n",
"INFO:base:Number of entries processed: 1001, Percent completed: 57.69\n",
"INFO:base:Number of entries processed: 1501, Percent completed: 86.51\n"
]
}
],
"source": [
"# adjacency matrix\n",
"connections = dict()\n",
"unique_tokens = set()\n",
"UPDATE_STATUS = 500\n",
"length_data = len(descr)\n",
"\n",
"if not LOAD_CALC_FILES or IS_TEST:\n",
" for count, description in enumerate(descr.iterrows()):\n",
" \n",
" text = description[1]['descr']\n",
" weight = description[1]['num_occur']\n",
" \n",
" doc = nlp(text)\n",
" \n",
" obtain_descendant_info(\n",
" doc=doc,\n",
" weight=weight,\n",
" POS_of_interest=POS_of_interest,\n",
" TAG_of_interest=TAG_of_interest,\n",
" connections=connections,\n",
" unique_tokens=unique_tokens,\n",
" map_2_word=None,\n",
" )\n",
" \n",
" if count % UPDATE_STATUS == 0:\n",
" logger.info(f'Number of entries processed: {count+1}, Percent completed: {((count+1) / length_data) * 100:.2f}')"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"adj_mat = obtain_adj_matrix(\n",
" unique_tokens=unique_tokens, \n",
" connections=connections\n",
")\n",
"adj_mat_undir = make_undir_adj_matrix(adj_mat=adj_mat)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"SAVE_PATH_UNI_TOKENS = f'./02_1_Preprocess1/{DATA_SET_ID}_04_1_unique_tokens.pkl'\n",
"SAVE_PATH_CONNECTIONS = f'./02_1_Preprocess1/{DATA_SET_ID}_04_1_connections.pkl'\n",
"SAVE_PATH_ADJ_DF = f'./02_1_Preprocess1/{DATA_SET_ID}_04_2_adj_mat_df.parquet'\n",
"SAVE_PATH_ADJ_DF_UNDIR = f'./02_1_Preprocess1/{DATA_SET_ID}_04_2_adj_mat_df_undir.parquet'\n",
"if not IS_TEST:\n",
" if LOAD_CALC_FILES:\n",
"        connections = load_pickle(SAVE_PATH_CONNECTIONS)\n",
"        unique_tokens = load_pickle(SAVE_PATH_UNI_TOKENS)\n",
" adj_mat = pd.read_parquet(SAVE_PATH_ADJ_DF)\n",
" adj_mat_undir = pd.read_parquet(SAVE_PATH_ADJ_DF_UNDIR)\n",
" else:\n",
" adj_mat.to_parquet(SAVE_PATH_ADJ_DF)\n",
" adj_mat_undir.to_parquet(SAVE_PATH_ADJ_DF_UNDIR)\n",
" save_pickle(obj=connections, path=SAVE_PATH_CONNECTIONS)\n",
" save_pickle(obj=unique_tokens, path=SAVE_PATH_UNI_TOKENS)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Dampf</th>\n",
" <th>Riss</th>\n",
" <th>Förderleistung</th>\n",
" <th>festlegen</th>\n",
" <th>weie</th>\n",
" <th>reperatur</th>\n",
" <th>Edelstahlblech</th>\n",
" <th>Kidde</th>\n",
" <th>Anlagenstillstand</th>\n",
" <th>Füllstandssonde</th>\n",
" <th>...</th>\n",
" <th>Kaltwasserhähne</th>\n",
" <th>Andreas</th>\n",
" <th>Haltebügel</th>\n",
" <th>Sicherheitsschalter</th>\n",
" <th>Tränkle</th>\n",
" <th>Fall</th>\n",
" <th>Zusatzstoff</th>\n",
" <th>Gelenk</th>\n",
" <th>trocknen</th>\n",
" <th>Kilo</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>AB</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ABIC</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>AGT</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>AGipsTechnikRZB</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>AKU</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>überfähren</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>überholen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>überprüfen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>übertragen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>überwachen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2337 rows × 2337 columns</p>\n",
"</div>"
],
"text/plain": [
" Dampf Riss Förderleistung festlegen weie reperatur \\\n",
"AB 0 0 0 0 0 0 \n",
"ABIC 0 0 0 0 0 0 \n",
"AGT 0 0 0 0 0 0 \n",
"AGipsTechnikRZB 0 0 0 0 0 0 \n",
"AKU 0 0 0 0 0 0 \n",
"... ... ... ... ... ... ... \n",
"überfähren 0 0 0 0 0 0 \n",
"überholen 0 0 0 0 0 0 \n",
"überprüfen 0 0 0 0 0 0 \n",
"übertragen 0 0 0 0 0 0 \n",
"überwachen 0 0 0 0 0 0 \n",
"\n",
" Edelstahlblech Kidde Anlagenstillstand Füllstandssonde \\\n",
"AB 0 0 0 0 \n",
"ABIC 0 0 0 0 \n",
"AGT 0 0 0 0 \n",
"AGipsTechnikRZB 0 0 0 0 \n",
"AKU 0 0 0 0 \n",
"... ... ... ... ... \n",
"überfähren 0 0 0 0 \n",
"überholen 0 0 0 0 \n",
"überprüfen 0 0 0 0 \n",
"übertragen 0 0 0 0 \n",
"überwachen 0 0 0 0 \n",
"\n",
" ... Kaltwasserhähne Andreas Haltebügel \\\n",
"AB ... 0 0 0 \n",
"ABIC ... 0 0 0 \n",
"AGT ... 0 0 0 \n",
"AGipsTechnikRZB ... 0 0 0 \n",
"AKU ... 0 0 0 \n",
"... ... ... ... ... \n",
"überfähren ... 0 0 0 \n",
"überholen ... 0 0 0 \n",
"überprüfen ... 0 0 0 \n",
"übertragen ... 0 0 0 \n",
"überwachen ... 0 0 0 \n",
"\n",
" Sicherheitsschalter Tränkle Fall Zusatzstoff Gelenk \\\n",
"AB 0 0 0 1 0 \n",
"ABIC 0 0 0 0 0 \n",
"AGT 0 0 0 0 0 \n",
"AGipsTechnikRZB 0 0 0 0 0 \n",
"AKU 0 0 0 0 0 \n",
"... ... ... ... ... ... \n",
"überfähren 0 0 0 0 0 \n",
"überholen 0 0 0 0 0 \n",
"überprüfen 0 0 0 0 0 \n",
"übertragen 0 0 0 0 0 \n",
"überwachen 0 0 0 0 0 \n",
"\n",
" trocknen Kilo \n",
"AB 0 0 \n",
"ABIC 0 0 \n",
"AGT 0 0 \n",
"AGipsTechnikRZB 0 0 \n",
"AKU 0 0 \n",
"... ... ... \n",
"überfähren 0 0 \n",
"überholen 0 0 \n",
"überprüfen 0 0 \n",
"übertragen 0 0 \n",
"überwachen 0 0 \n",
"\n",
"[2337 rows x 2337 columns]"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adj_mat_undir.sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'AGipsTechnikRZB'"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ret = adj_mat_undir.sort_index().index[3]\n",
"ret"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"is_str_date(ret)"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adj_mat_undir.loc[ret,:].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Apply a weight threshold to the adjacency matrix"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [],
"source": [
"WEIGHT_THRESHOLD = 120\n",
"arr = adj_mat_undir.to_numpy()\n",
"arr = np.where(arr < WEIGHT_THRESHOLD, 0, arr)"
]
},
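  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The thresholding step can be sketched in isolation (a minimal sketch with a made-up 3×3 weight matrix; `WEIGHT_THRESHOLD` as above):"
   ]
  },

```python
import numpy as np
import pandas as pd

WEIGHT_THRESHOLD = 120

# made-up symmetric co-occurrence weights between three words
adj = pd.DataFrame(
    [[0, 150, 30], [150, 0, 200], [30, 200, 0]],
    index=["a", "b", "c"], columns=["a", "b", "c"],
)
arr = adj.to_numpy()
# zero out every edge below the threshold, keep heavier edges unchanged
arr = np.where(arr < WEIGHT_THRESHOLD, 0, arr)
print(int(np.count_nonzero(arr)))  # 4 nonzero entries = 2 undirected edges
```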
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"190"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.count_nonzero(arr)"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"70"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp = np.sum(arr, axis=0)\n",
"np.count_nonzero(temp)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"thresh_adj_mat = adj_mat_undir.copy()\n",
"thresh_adj_mat.loc[:] = arr"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Dampf</th>\n",
" <th>Riss</th>\n",
" <th>Förderleistung</th>\n",
" <th>festlegen</th>\n",
" <th>weie</th>\n",
" <th>reperatur</th>\n",
" <th>Edelstahlblech</th>\n",
" <th>Kidde</th>\n",
" <th>Anlagenstillstand</th>\n",
" <th>Füllstandssonde</th>\n",
" <th>...</th>\n",
" <th>Kaltwasserhähne</th>\n",
" <th>Andreas</th>\n",
" <th>Haltebügel</th>\n",
" <th>Sicherheitsschalter</th>\n",
" <th>Tränkle</th>\n",
" <th>Fall</th>\n",
" <th>Zusatzstoff</th>\n",
" <th>Gelenk</th>\n",
" <th>trocknen</th>\n",
" <th>Kilo</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Dampf</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Riss</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Förderleistung</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>festlegen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>weie</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fall</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Zusatzstoff</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Gelenk</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>trocknen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Kilo</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2337 rows × 2337 columns</p>\n",
"</div>"
],
"text/plain": [
" Dampf Riss Förderleistung festlegen weie reperatur \\\n",
"Dampf 0 0 0 0 0 0 \n",
"Riss 0 0 0 0 0 0 \n",
"Förderleistung 0 0 0 0 0 0 \n",
"festlegen 0 0 0 0 0 0 \n",
"weie 0 0 0 0 0 0 \n",
"... ... ... ... ... ... ... \n",
"Fall 0 0 0 0 0 0 \n",
"Zusatzstoff 0 0 0 0 0 0 \n",
"Gelenk 0 0 0 0 0 0 \n",
"trocknen 0 0 0 0 0 0 \n",
"Kilo 0 0 0 0 0 0 \n",
"\n",
" Edelstahlblech Kidde Anlagenstillstand Füllstandssonde \\\n",
"Dampf 0 0 0 0 \n",
"Riss 0 0 0 0 \n",
"Förderleistung 0 0 0 0 \n",
"festlegen 0 0 0 0 \n",
"weie 0 0 0 0 \n",
"... ... ... ... ... \n",
"Fall 0 0 0 0 \n",
"Zusatzstoff 0 0 0 0 \n",
"Gelenk 0 0 0 0 \n",
"trocknen 0 0 0 0 \n",
"Kilo 0 0 0 0 \n",
"\n",
" ... Kaltwasserhähne Andreas Haltebügel \\\n",
"Dampf ... 0 0 0 \n",
"Riss ... 0 0 0 \n",
"Förderleistung ... 0 0 0 \n",
"festlegen ... 0 0 0 \n",
"weie ... 0 0 0 \n",
"... ... ... ... ... \n",
"Fall ... 0 0 0 \n",
"Zusatzstoff ... 0 0 0 \n",
"Gelenk ... 0 0 0 \n",
"trocknen ... 0 0 0 \n",
"Kilo ... 0 0 0 \n",
"\n",
" Sicherheitsschalter Tränkle Fall Zusatzstoff Gelenk \\\n",
"Dampf 0 0 0 0 0 \n",
"Riss 0 0 0 0 0 \n",
"Förderleistung 0 0 0 0 0 \n",
"festlegen 0 0 0 0 0 \n",
"weie 0 0 0 0 0 \n",
"... ... ... ... ... ... \n",
"Fall 0 0 0 0 0 \n",
"Zusatzstoff 0 0 0 0 0 \n",
"Gelenk 0 0 0 0 0 \n",
"trocknen 0 0 0 0 0 \n",
"Kilo 0 0 0 0 0 \n",
"\n",
" trocknen Kilo \n",
"Dampf 0 0 \n",
"Riss 0 0 \n",
"Förderleistung 0 0 \n",
"festlegen 0 0 \n",
"weie 0 0 \n",
"... ... ... \n",
"Fall 0 0 \n",
"Zusatzstoff 0 0 \n",
"Gelenk 0 0 \n",
"trocknen 0 0 \n",
"Kilo 0 0 \n",
"\n",
"[2337 rows x 2337 columns]"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"thresh_adj_mat"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"ADJ_MAT_PATH_CSV = f'./02_2_Preprocess2/{DATA_SET_ID}_01_1_adj_mat_thresh_mapping_{WEIGHT_THRESHOLD}.csv'\n",
"thresh_adj_mat.to_csv(path_or_buf=ADJ_MAT_PATH_CSV, encoding='cp1252', sep=';')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "*Transfer into a NetworkX Graph for Export to Standardized Formats*"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"import networkx as nx"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
"G = nx.from_pandas_adjacency(thresh_adj_mat)"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"SAVE_PATH_GRAPHML = f'./02_2_Preprocess2/{DATA_SET_ID}_adj_mat_thresh_{WEIGHT_THRESHOLD}.graphml'\n",
"nx.write_graphml(G, SAVE_PATH_GRAPHML)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Test cosine similarity\n",
    "- build a matrix of similarity scores for every word pair (upper triangular matrix)\n",
    "- filter the table by a threshold\n",
    "- use the thresholded weight adjacency matrix as a mask\n",
    "    - analyse only highly weighted groups\n",
    "- analyse the relationships as a graph (similar to the previous approach)\n",
    "- form groups and name them (e.g. Prüfung+Überprüfung+Kontrolle --> Überprüfung)\n",
    "- build a dictionary from the groups and match terms during construction"
]
},
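  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cosine similarity computed below via spaCy can be written out directly (a minimal sketch independent of spaCy; the vectors are made up):"
   ]
  },

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); 0.0 for an all-zero vector,
    # mirroring spaCy's behaviour for words without a vector
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return float(np.dot(u, v) / (nu * nv))

print(round(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])), 4))  # 0.7071
```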
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
    "def build_cosine_similarity_matrix(\n",
    "    adj_mat\n",
    "):\n",
    "    # words to compare: the row labels of the adjacency matrix\n",
    "    words = adj_mat.index.to_list()\n",
    "    \n",
    "    # similarity matrix, initialised to 0.0 (only the upper triangle is filled)\n",
    "    cos_mat = pd.DataFrame(\n",
    "        data=0., \n",
    "        columns=words, \n",
    "        index=words,\n",
    "        dtype=np.float32,\n",
    "    )\n",
    "    \n",
    "    # each unordered word pair exactly once\n",
    "    for (word1, word2) in combinations(words, 2):\n",
    "        # look up the lexemes in the model vocabulary\n",
    "        w1 = nlp.vocab[str(word1)]\n",
    "        w2 = nlp.vocab[str(word2)]\n",
    "        # cosine similarity of the word vectors; lexemes without a vector\n",
    "        # trigger spaCy warning W008 and yield a similarity of 0.0\n",
    "        cos_sim = w1.similarity(w2)\n",
    "        cos_mat.at[word1, word2] = cos_sim\n",
    "    \n",
    "    return cos_mat"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\foersterflorian\\AppData\\Local\\Temp\\ipykernel_17216\\213623562.py:20: UserWarning: [W008] Evaluating Lexeme.similarity based on empty vectors.\n",
" cos_sim = w1.similarity(w2)\n"
]
}
],
"source": [
"cos_mat = build_cosine_similarity_matrix(adj_mat=adj_mat_undir)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Klübertemp</th>\n",
" <th>Schusssuche</th>\n",
" <th>Laser</th>\n",
" <th>Schaftteile</th>\n",
" <th>Dichtsätz</th>\n",
" <th>Tastatur</th>\n",
" <th>Vorspuleinheit</th>\n",
" <th>beginnen</th>\n",
" <th>auslesen</th>\n",
" <th>Kettspannung</th>\n",
" <th>...</th>\n",
" <th>Tänzerwalze</th>\n",
" <th>Abfallkante</th>\n",
" <th>rappeln</th>\n",
" <th>Rottenegger</th>\n",
" <th>Contrawalze</th>\n",
" <th>Eisenträger</th>\n",
" <th>Hängegurte</th>\n",
" <th>Treffen</th>\n",
" <th>Greiferarmen</th>\n",
" <th>Nadelleist</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Klübertemp</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Schusssuche</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Laser</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.324276</td>\n",
" <td>0.0</td>\n",
" <td>0.059743</td>\n",
" <td>0.133676</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-0.063913</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.167521</td>\n",
" <td>0.0</td>\n",
" <td>-0.029860</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Schaftteile</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dichtsätz</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Eisenträger</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.170954</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hängegurte</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Treffen</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Greiferarmen</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Nadelleist</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6951 rows × 6951 columns</p>\n",
"</div>"
],
"text/plain": [
" Klübertemp Schusssuche Laser Schaftteile Dichtsätz \\\n",
"Klübertemp 0.0 0.0 0.0 0.0 0.0 \n",
"Schusssuche 0.0 0.0 0.0 0.0 0.0 \n",
"Laser 0.0 0.0 0.0 0.0 0.0 \n",
"Schaftteile 0.0 0.0 0.0 0.0 0.0 \n",
"Dichtsätz 0.0 0.0 0.0 0.0 0.0 \n",
"... ... ... ... ... ... \n",
"Eisenträger 0.0 0.0 0.0 0.0 0.0 \n",
"Hängegurte 0.0 0.0 0.0 0.0 0.0 \n",
"Treffen 0.0 0.0 0.0 0.0 0.0 \n",
"Greiferarmen 0.0 0.0 0.0 0.0 0.0 \n",
"Nadelleist 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
" Tastatur Vorspuleinheit beginnen auslesen Kettspannung ... \\\n",
"Klübertemp 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Schusssuche 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Laser 0.324276 0.0 0.059743 0.133676 0.0 ... \n",
"Schaftteile 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Dichtsätz 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"... ... ... ... ... ... ... \n",
"Eisenträger 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Hängegurte 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Treffen 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Greiferarmen 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Nadelleist 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"\n",
" Tänzerwalze Abfallkante rappeln Rottenegger Contrawalze \\\n",
"Klübertemp 0.0 0.0 0.000000 0.0 0.0 \n",
"Schusssuche 0.0 0.0 0.000000 0.0 0.0 \n",
"Laser 0.0 0.0 -0.063913 0.0 0.0 \n",
"Schaftteile 0.0 0.0 0.000000 0.0 0.0 \n",
"Dichtsätz 0.0 0.0 0.000000 0.0 0.0 \n",
"... ... ... ... ... ... \n",
"Eisenträger 0.0 0.0 0.000000 0.0 0.0 \n",
"Hängegurte 0.0 0.0 0.000000 0.0 0.0 \n",
"Treffen 0.0 0.0 0.000000 0.0 0.0 \n",
"Greiferarmen 0.0 0.0 0.000000 0.0 0.0 \n",
"Nadelleist 0.0 0.0 0.000000 0.0 0.0 \n",
"\n",
" Eisenträger Hängegurte Treffen Greiferarmen Nadelleist \n",
"Klübertemp 0.000000 0.0 0.000000 0.0 0.0 \n",
"Schusssuche 0.000000 0.0 0.000000 0.0 0.0 \n",
"Laser 0.167521 0.0 -0.029860 0.0 0.0 \n",
"Schaftteile 0.000000 0.0 0.000000 0.0 0.0 \n",
"Dichtsätz 0.000000 0.0 0.000000 0.0 0.0 \n",
"... ... ... ... ... ... \n",
"Eisenträger 0.000000 0.0 0.170954 0.0 0.0 \n",
"Hängegurte 0.000000 0.0 0.000000 0.0 0.0 \n",
"Treffen 0.000000 0.0 0.000000 0.0 0.0 \n",
"Greiferarmen 0.000000 0.0 0.000000 0.0 0.0 \n",
"Nadelleist 0.000000 0.0 0.000000 0.0 0.0 \n",
"\n",
"[6951 rows x 6951 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cos_mat"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"WEIGHT_THRESHOLD = 10\n",
"arr = adj_mat_undir.to_numpy()\n",
"COS_THRESHOLD = 0.4\n",
"cos_arr = cos_mat.to_numpy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cos_arr_filt = np.where((cos_arr > COS_THRESHOLD) & (arr >= WEIGHT_THRESHOLD), cos_arr, 0)"
]
},
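  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Combining both thresholds acts as a logical AND over the two matrices (a minimal sketch with made-up 2×2 matrices; thresholds as above):"
   ]
  },

```python
import numpy as np

WEIGHT_THRESHOLD = 10
COS_THRESHOLD = 0.4

weights = np.array([[0, 12], [12, 0]])      # co-occurrence counts
cos = np.array([[0.0, 0.55], [0.55, 0.0]])  # cosine similarities
# keep a similarity score only where the pair is BOTH similar enough
# and frequent enough; everything else becomes 0
filt = np.where((cos > COS_THRESHOLD) & (weights >= WEIGHT_THRESHOLD), cos, 0)
print(int(np.count_nonzero(filt)))  # 2
```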
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" ...,\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cos_arr_filt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"217"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"np.count_nonzero(cos_arr_filt)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"thresh_cos_mat = cos_mat.copy()\n",
"thresh_cos_mat[:] = cos_arr_filt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Verstärkung</th>\n",
" <th>Zuluftfilter</th>\n",
" <th>klemmt</th>\n",
" <th>Komminikation</th>\n",
" <th>Doppelholztische</th>\n",
" <th>Deckenbeleuchtung</th>\n",
" <th>Abfalltransport</th>\n",
" <th>fahrbar</th>\n",
" <th>Folieneinlauf</th>\n",
" <th>entsorgen</th>\n",
" <th>...</th>\n",
" <th>neuwertig</th>\n",
" <th>Bleit</th>\n",
" <th>Rauchentwicklung</th>\n",
" <th>Kompressorsteuerung</th>\n",
" <th>anziehen</th>\n",
" <th>Mitarbeiterin</th>\n",
" <th>Nägel</th>\n",
" <th>WZ</th>\n",
" <th>ExSchutzAnlage</th>\n",
" <th>Gemisch</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Verstärkung</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Zuluftfilter</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>klemmt</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Komminikation</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Doppelholztische</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mitarbeiterin</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Nägel</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>WZ</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ExSchutzAnlage</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Gemisch</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6951 rows × 6951 columns</p>\n",
"</div>"
],
"text/plain": [
" Verstärkung Zuluftfilter klemmt Komminikation \\\n",
"Verstärkung 0.0 0.0 0.0 0.0 \n",
"Zuluftfilter 0.0 0.0 0.0 0.0 \n",
"klemmt 0.0 0.0 0.0 0.0 \n",
"Komminikation 0.0 0.0 0.0 0.0 \n",
"Doppelholztische 0.0 0.0 0.0 0.0 \n",
"... ... ... ... ... \n",
"Mitarbeiterin 0.0 0.0 0.0 0.0 \n",
"Nägel 0.0 0.0 0.0 0.0 \n",
"WZ 0.0 0.0 0.0 0.0 \n",
"ExSchutzAnlage 0.0 0.0 0.0 0.0 \n",
"Gemisch 0.0 0.0 0.0 0.0 \n",
"\n",
" Doppelholztische Deckenbeleuchtung Abfalltransport \\\n",
"Verstärkung 0.0 0.0 0.0 \n",
"Zuluftfilter 0.0 0.0 0.0 \n",
"klemmt 0.0 0.0 0.0 \n",
"Komminikation 0.0 0.0 0.0 \n",
"Doppelholztische 0.0 0.0 0.0 \n",
"... ... ... ... \n",
"Mitarbeiterin 0.0 0.0 0.0 \n",
"Nägel 0.0 0.0 0.0 \n",
"WZ 0.0 0.0 0.0 \n",
"ExSchutzAnlage 0.0 0.0 0.0 \n",
"Gemisch 0.0 0.0 0.0 \n",
"\n",
" fahrbar Folieneinlauf entsorgen ... neuwertig Bleit \\\n",
"Verstärkung 0.0 0.0 0.0 ... 0.0 0.0 \n",
"Zuluftfilter 0.0 0.0 0.0 ... 0.0 0.0 \n",
"klemmt 0.0 0.0 0.0 ... 0.0 0.0 \n",
"Komminikation 0.0 0.0 0.0 ... 0.0 0.0 \n",
"Doppelholztische 0.0 0.0 0.0 ... 0.0 0.0 \n",
"... ... ... ... ... ... ... \n",
"Mitarbeiterin 0.0 0.0 0.0 ... 0.0 0.0 \n",
"Nägel 0.0 0.0 0.0 ... 0.0 0.0 \n",
"WZ 0.0 0.0 0.0 ... 0.0 0.0 \n",
"ExSchutzAnlage 0.0 0.0 0.0 ... 0.0 0.0 \n",
"Gemisch 0.0 0.0 0.0 ... 0.0 0.0 \n",
"\n",
" Rauchentwicklung Kompressorsteuerung anziehen \\\n",
"Verstärkung 0.0 0.0 0.0 \n",
"Zuluftfilter 0.0 0.0 0.0 \n",
"klemmt 0.0 0.0 0.0 \n",
"Komminikation 0.0 0.0 0.0 \n",
"Doppelholztische 0.0 0.0 0.0 \n",
"... ... ... ... \n",
"Mitarbeiterin 0.0 0.0 0.0 \n",
"Nägel 0.0 0.0 0.0 \n",
"WZ 0.0 0.0 0.0 \n",
"ExSchutzAnlage 0.0 0.0 0.0 \n",
"Gemisch 0.0 0.0 0.0 \n",
"\n",
" Mitarbeiterin Nägel WZ ExSchutzAnlage Gemisch \n",
"Verstärkung 0.0 0.0 0.0 0.0 0.0 \n",
"Zuluftfilter 0.0 0.0 0.0 0.0 0.0 \n",
"klemmt 0.0 0.0 0.0 0.0 0.0 \n",
"Komminikation 0.0 0.0 0.0 0.0 0.0 \n",
"Doppelholztische 0.0 0.0 0.0 0.0 0.0 \n",
"... ... ... ... ... ... \n",
"Mitarbeiterin 0.0 0.0 0.0 0.0 0.0 \n",
"Nägel 0.0 0.0 0.0 0.0 0.0 \n",
"WZ 0.0 0.0 0.0 0.0 0.0 \n",
"ExSchutzAnlage 0.0 0.0 0.0 0.0 0.0 \n",
"Gemisch 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
"[6951 rows x 6951 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"thresh_cos_mat"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"COS_MAT_PATH_CSV = f'./Graphanalyse_Gruppen/cos_mat_Wthresh_{WEIGHT_THRESHOLD}_Cthresh{int(COS_THRESHOLD*100)}.csv'\n",
"thresh_cos_mat.to_csv(path_or_buf=COS_MAT_PATH_CSV, encoding='cp1252', sep=';')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}