{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"# **Analysis 2-2**\n",
"\n",
"## Strategy & Focus\n",
"\n",
"- Try clustering or merging related terms (e.g. Prüfung, Prüfen, Überprüfung)\n",
"- Follow the frequency distribution: analyze the more frequent terms first"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"---\n",
"\n",
"# Feature 1: Clustering of Task Descriptions\n",
"\n",
"## Research\n",
"[Textmining HS Hannover](https://textmining.wp.hs-hannover.de/Preprocessing.html)\n",
"\n",
"### General Decomposition of the Individual Descriptions\n",
"\n",
"- Text into sentences\n",
"- Sentences into words\n",
"- Words into a base form:\n",
"    - Lemma: the form of the word as listed in a dictionary, e.g. Haus, laufen, begründen\n",
"    - Stem: the word without its inflectional affixes (prefixes and suffixes), e.g. Haus, lauf, begründ\n",
"    - Root: the core of the word, from which the word may have been formed by derivation, e.g. Haus, lauf, Grund\n",
"- Part-of-speech determination\n",
"    - classic part-of-speech tagging (conventional word classes)\n",
"    - Named Entity Recognition (NER) (proper names)\n",
"        - e.g. spaCy: person, location, organization, miscellaneous\n",
"\n",
"#### Semantics\n",
"\n",
"- Words within the same sentence are more closely related to each other than to words outside it\n",
"\n",
"### Packages\n",
"\n",
"- English:\n",
"    - [NLTK](https://www.nltk.org/)\n",
"- German:\n",
"    - [HanTa - The Hanover Tagger](https://github.com/wartaal/HanTa/tree/master)\n",
"    - [TreeTagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)\n",
"        - [Python Wrapper](https://treetaggerwrapper.readthedocs.io/en/latest/)\n",
"    - [spaCy](https://spacy.io/)\n",
"        - [Example 1](https://www.trinnovative.de/blog/2020-09-08-natural-language-processing-mit-spacy.html)"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"21 Feb:\n",
"- Reworked the regex filtering\n",
"- Improved duplicate detection via similarity"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"## Analysis"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"c:\\Users\\foersterflorian\\mambaforge\\envs\\ihm2\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
||
" from .autonotebook import tqdm as notebook_tqdm\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"import numpy as np\n",
|
||
"import pandas as pd\n",
|
||
"from pandas import DataFrame, Series\n",
|
||
"import spacy\n",
|
||
"from spacy.lang.de import German as GermanSpacyModel\n",
|
||
"import sentence_transformers\n",
|
||
"from sentence_transformers import SentenceTransformer\n",
|
||
"from collections import Counter\n",
|
||
"from itertools import combinations\n",
|
||
"from dateutil.parser import parse\n",
|
||
"import re\n",
|
||
"\n",
|
||
"\n",
|
||
"import logging\n",
|
||
"import sys\n",
|
||
"import pickle\n",
|
||
"\n",
|
||
"from ihm_analyze.helpers import (\n",
|
||
" save_pickle,\n",
|
||
" load_pickle,\n",
|
||
" build_embedding_map,\n",
|
||
" build_cosSim_matrix,\n",
|
||
" filt_thresh_cosSim_matrix,\n",
|
||
" list_cosSim_dupl_candidates,\n",
|
||
" choose_cosSim_dupl_candidates,\n",
|
||
")\n",
|
||
"\n",
|
||
"LOGGING_LEVEL = 'INFO'\n",
|
||
"logging.basicConfig(level=LOGGING_LEVEL, stream=sys.stdout)\n",
|
||
"logger = logging.getLogger('base')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"%load_ext autoreload\n",
|
||
"%autoreload 2"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"LOAD_CALC_FILES = False\n",
|
||
"\n",
|
||
"DESC_BLACKLIST = set(['-'])\n",
|
||
"\"\"\"\n",
|
||
"GENERAL_BLACKLIST = set([\n",
|
||
" 'herr', 'hr.', 'förster', 'graf', 'stöppel', \n",
|
||
" 'stab', 'kw', 'h.', 'koch', 'heininger', '.',\n",
|
||
" 'schwab', 'm.', 'wenninger', '-', '--',\n",
|
||
"])\n",
|
||
"\"\"\"\n",
|
||
"\n",
"GENERAL_BLACKLIST = set([\n",
"    'herr', 'hr.', 'kw', 'h.', '.',\n",
"    'm.', '-', '--', 'dr.', 'dr',\n",
"])\n",
"\n",
|
||
"#GENERAL_BLACKLIST = set()\n",
|
||
"#POS_of_interest = set(['NOUN', 'PROPN', 'ADJ', 'VERB', 'AUX'])\n",
|
||
"#POS_of_interest = set(['NOUN', 'ADJ', 'VERB', 'AUX'])\n",
|
||
"#POS_of_interest = set(['NOUN', 'PROPN'])\n",
|
||
"POS_of_interest = set(['NOUN', 'PROPN', 'VERB', 'AUX'])\n",
|
||
"#TAG_of_interest = set(['ADJD'])\n",
|
||
"TAG_of_interest = set()\n",
|
||
"\n",
|
||
"#POS_INDIRECT = set(['AUX', 'VERB'])\n",
|
||
"POS_INDIRECT = set(['AUX'])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"# load the spaCy language model\n",
"# the transformer pipeline ships without static word vectors,\n",
"# so it cannot be used to compute similarities;\n",
"# sentence-transformers is used for that instead\n",
"nlp = spacy.load('de_dep_news_trf')\n",
"#nlp = spacy.load('de_core_news_lg')"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2\n",
|
||
"INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"model_stfr = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 98,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<class 'pandas.core.frame.DataFrame'>\n",
|
||
"RangeIndex: 129020 entries, 0 to 129019\n",
|
||
"Data columns (total 20 columns):\n",
|
||
" # Column Non-Null Count Dtype \n",
|
||
"--- ------ -------------- ----- \n",
|
||
" 0 VorgangsID 129020 non-null int64 \n",
|
||
" 1 ObjektID 129020 non-null int64 \n",
|
||
" 2 HObjektText 129003 non-null object \n",
|
||
" 3 ObjektArtID 129020 non-null int64 \n",
|
||
" 4 ObjektArtText 128372 non-null object \n",
|
||
" 5 VorgangsTypID 129020 non-null int64 \n",
|
||
" 6 VorgangsTypName 129020 non-null object \n",
|
||
" 7 VorgangsDatum 129020 non-null datetime64[ns]\n",
|
||
" 8 VorgangsStatusId 129020 non-null int64 \n",
|
||
" 9 VorgangsPrioritaet 129020 non-null int64 \n",
|
||
" 10 VorgangsBeschreibung 124087 non-null object \n",
|
||
" 11 VorgangsOrt 507 non-null object \n",
|
||
" 12 VorgangsArtText 129020 non-null object \n",
|
||
" 13 ErledigungsDatum 129020 non-null datetime64[ns]\n",
|
||
" 14 ErledigungsArtText 128474 non-null object \n",
|
||
" 15 ErledigungsBeschreibung 118135 non-null object \n",
|
||
" 16 MPMelderArbeitsplatz 6359 non-null object \n",
|
||
" 17 MPAbteilungBezeichnung 6359 non-null object \n",
|
||
" 18 Arbeitsbeginn 123538 non-null datetime64[ns]\n",
|
||
" 19 ErstellungsDatum 129020 non-null datetime64[ns]\n",
|
||
"dtypes: datetime64[ns](4), int64(6), object(10)\n",
|
||
"memory usage: 19.7+ MB\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# load dataset\n",
|
||
"DATA_SET_ID = 'Export4'\n",
|
||
"FILE_PATH = f'01_2_Rohdaten_neu/{DATA_SET_ID}.csv'\n",
|
||
"date_cols = ['VorgangsDatum', 'ErledigungsDatum', 'Arbeitsbeginn', 'ErstellungsDatum']\n",
|
||
"raw = pd.read_csv(filepath_or_buffer=FILE_PATH, sep=';', encoding='cp1252', parse_dates=date_cols, dayfirst=True)\n",
|
||
"raw.info()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 99,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>VorgangsID</th>\n",
|
||
" <th>ObjektID</th>\n",
|
||
" <th>HObjektText</th>\n",
|
||
" <th>ObjektArtID</th>\n",
|
||
" <th>ObjektArtText</th>\n",
|
||
" <th>VorgangsTypID</th>\n",
|
||
" <th>VorgangsTypName</th>\n",
|
||
" <th>VorgangsDatum</th>\n",
|
||
" <th>VorgangsStatusId</th>\n",
|
||
" <th>VorgangsPrioritaet</th>\n",
|
||
" <th>VorgangsBeschreibung</th>\n",
|
||
" <th>VorgangsOrt</th>\n",
|
||
" <th>VorgangsArtText</th>\n",
|
||
" <th>ErledigungsDatum</th>\n",
|
||
" <th>ErledigungsArtText</th>\n",
|
||
" <th>ErledigungsBeschreibung</th>\n",
|
||
" <th>MPMelderArbeitsplatz</th>\n",
|
||
" <th>MPAbteilungBezeichnung</th>\n",
|
||
" <th>Arbeitsbeginn</th>\n",
|
||
" <th>ErstellungsDatum</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>11</td>\n",
|
||
" <td>114</td>\n",
|
||
" <td>427 C , Webmaschine, DL 280 EMS Breite 280</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Luft-Webmaschine</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Reparaturauftrag (Portal)</td>\n",
|
||
" <td>2019-03-06</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Kettbaum kaputt</td>\n",
|
||
" <td>2019-03-06</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Weberei</td>\n",
|
||
" <td>Weberei</td>\n",
|
||
" <td>NaT</td>\n",
|
||
" <td>2019-03-06</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>17</td>\n",
|
||
" <td>124</td>\n",
|
||
" <td>621 C , Webmaschine, DL 280 EMS Breite 280</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Luft-Webmaschine</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Reparaturauftrag (Portal)</td>\n",
|
||
" <td>2019-03-11</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>asgasdg</td>\n",
|
||
" <td>2019-03-11</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Elektrowerkstatt</td>\n",
|
||
" <td>Elektrowerkstatt</td>\n",
|
||
" <td>NaT</td>\n",
|
||
" <td>2019-03-11</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>53</td>\n",
|
||
" <td>244</td>\n",
|
||
" <td>285 C, Webmaschine, SG 220 EMS</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Greifer-Webmaschine</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Reparaturauftrag (Portal)</td>\n",
|
||
" <td>2019-03-19</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>Kupplung schleift</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Kupplung defekt</td>\n",
|
||
" <td>2019-03-20</td>\n",
|
||
" <td>Reparatur UTT</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Weberei</td>\n",
|
||
" <td>Weberei</td>\n",
|
||
" <td>NaT</td>\n",
|
||
" <td>2019-03-19</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>58</td>\n",
|
||
" <td>257</td>\n",
|
||
" <td>107, Webmaschine, OM 220 EOS</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Luft-Webmaschine</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Reparaturauftrag (Portal)</td>\n",
|
||
" <td>2019-03-21</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>Gegengewicht wieder anbringen</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Gegengewicht an der Webmaschine abgefallen</td>\n",
|
||
" <td>2019-03-21</td>\n",
|
||
" <td>Reparatur UTT</td>\n",
|
||
" <td>Schraube ausgebohrt\\nGegengewicht wieder angeb...</td>\n",
|
||
" <td>Weberei</td>\n",
|
||
" <td>Weberei</td>\n",
|
||
" <td>2019-03-21</td>\n",
|
||
" <td>2019-03-21</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>81</td>\n",
|
||
" <td>138</td>\n",
|
||
" <td>00138, Schärmaschine 9,</td>\n",
|
||
" <td>16</td>\n",
|
||
" <td>Schärmaschine</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Reparaturauftrag (Portal)</td>\n",
|
||
" <td>2019-03-25</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>da ist etwas gebrochen. (Herr Heininger)</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>zentrale Bremsenverstellung linke Gatterseite ...</td>\n",
|
||
" <td>2019-03-25</td>\n",
|
||
" <td>Reparatur UTT</td>\n",
|
||
" <td>Bolzen gebrochen. Bolzen neu angefertig und di...</td>\n",
|
||
" <td>Vorwerk</td>\n",
|
||
" <td>Vorwerk</td>\n",
|
||
" <td>2019-03-25</td>\n",
|
||
" <td>2019-03-25</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" VorgangsID ObjektID HObjektText \\\n",
|
||
"0 11 114 427 C , Webmaschine, DL 280 EMS Breite 280 \n",
|
||
"1 17 124 621 C , Webmaschine, DL 280 EMS Breite 280 \n",
|
||
"2 53 244 285 C, Webmaschine, SG 220 EMS \n",
|
||
"3 58 257 107, Webmaschine, OM 220 EOS \n",
|
||
"4 81 138 00138, Schärmaschine 9, \n",
|
||
"\n",
|
||
" ObjektArtID ObjektArtText VorgangsTypID VorgangsTypName \\\n",
|
||
"0 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
|
||
"1 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
|
||
"2 5 Greifer-Webmaschine 3 Reparaturauftrag (Portal) \n",
|
||
"3 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
|
||
"4 16 Schärmaschine 3 Reparaturauftrag (Portal) \n",
|
||
"\n",
|
||
" VorgangsDatum VorgangsStatusId VorgangsPrioritaet \\\n",
|
||
"0 2019-03-06 4 0 \n",
|
||
"1 2019-03-11 5 0 \n",
|
||
"2 2019-03-19 5 0 \n",
|
||
"3 2019-03-21 5 0 \n",
|
||
"4 2019-03-25 5 0 \n",
|
||
"\n",
|
||
" VorgangsBeschreibung VorgangsOrt \\\n",
|
||
"0 NaN NaN \n",
|
||
"1 NaN NaN \n",
|
||
"2 Kupplung schleift NaN \n",
|
||
"3 Gegengewicht wieder anbringen NaN \n",
|
||
"4 da ist etwas gebrochen. (Herr Heininger) NaN \n",
|
||
"\n",
|
||
" VorgangsArtText ErledigungsDatum \\\n",
|
||
"0 Kettbaum kaputt 2019-03-06 \n",
|
||
"1 asgasdg 2019-03-11 \n",
|
||
"2 Kupplung defekt 2019-03-20 \n",
|
||
"3 Gegengewicht an der Webmaschine abgefallen 2019-03-21 \n",
|
||
"4 zentrale Bremsenverstellung linke Gatterseite ... 2019-03-25 \n",
|
||
"\n",
|
||
" ErledigungsArtText ErledigungsBeschreibung \\\n",
|
||
"0 NaN NaN \n",
|
||
"1 NaN NaN \n",
|
||
"2 Reparatur UTT NaN \n",
|
||
"3 Reparatur UTT Schraube ausgebohrt\\nGegengewicht wieder angeb... \n",
|
||
"4 Reparatur UTT Bolzen gebrochen. Bolzen neu angefertig und di... \n",
|
||
"\n",
|
||
" MPMelderArbeitsplatz MPAbteilungBezeichnung Arbeitsbeginn ErstellungsDatum \n",
|
||
"0 Weberei Weberei NaT 2019-03-06 \n",
|
||
"1 Elektrowerkstatt Elektrowerkstatt NaT 2019-03-11 \n",
|
||
"2 Weberei Weberei NaT 2019-03-19 \n",
|
||
"3 Weberei Weberei 2019-03-21 2019-03-21 \n",
|
||
"4 Vorwerk Vorwerk 2019-03-25 2019-03-25 "
|
||
]
|
||
},
|
||
"execution_count": 99,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"raw.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 100,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Anzahl Features: 20\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(f\"Anzahl Features: {len(raw.columns)}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"**New features compared to the previous analysis:**\n",
"- ``ObjektArtID``\n",
"- ``ObjektArtText``\n",
"- ``VorgangsTypName``"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### Duplicates"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 101,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"duplicates_filt = raw.duplicated()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 102,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Anzahl Duplikate: 84\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(f\"Anzahl Duplikate: {duplicates_filt.sum()}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 103,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"filt_data = raw[duplicates_filt]\n",
|
||
"uni_obj_id_dupl = filt_data['ObjektID'].unique()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 104,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Anzahl einzigartiger Objekt-IDs unter Duplikaten: 47\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(f\"Anzahl einzigartiger Objekt-IDs unter Duplikaten: {len(uni_obj_id_dupl)}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 105,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<class 'pandas.core.frame.DataFrame'>\n",
|
||
"RangeIndex: 128936 entries, 0 to 128935\n",
|
||
"Data columns (total 20 columns):\n",
|
||
" # Column Non-Null Count Dtype \n",
|
||
"--- ------ -------------- ----- \n",
|
||
" 0 VorgangsID 128936 non-null int64 \n",
|
||
" 1 ObjektID 128936 non-null int64 \n",
|
||
" 2 HObjektText 128920 non-null object \n",
|
||
" 3 ObjektArtID 128936 non-null int64 \n",
|
||
" 4 ObjektArtText 128289 non-null object \n",
|
||
" 5 VorgangsTypID 128936 non-null int64 \n",
|
||
" 6 VorgangsTypName 128936 non-null object \n",
|
||
" 7 VorgangsDatum 128936 non-null datetime64[ns]\n",
|
||
" 8 VorgangsStatusId 128936 non-null int64 \n",
|
||
" 9 VorgangsPrioritaet 128936 non-null int64 \n",
|
||
" 10 VorgangsBeschreibung 124008 non-null object \n",
|
||
" 11 VorgangsOrt 507 non-null object \n",
|
||
" 12 VorgangsArtText 128936 non-null object \n",
|
||
" 13 ErledigungsDatum 128936 non-null datetime64[ns]\n",
|
||
" 14 ErledigungsArtText 128402 non-null object \n",
|
||
" 15 ErledigungsBeschreibung 118086 non-null object \n",
|
||
" 16 MPMelderArbeitsplatz 6337 non-null object \n",
|
||
" 17 MPAbteilungBezeichnung 6337 non-null object \n",
|
||
" 18 Arbeitsbeginn 123480 non-null datetime64[ns]\n",
|
||
" 19 ErstellungsDatum 128936 non-null datetime64[ns]\n",
|
||
"dtypes: datetime64[ns](4), int64(6), object(10)\n",
|
||
"memory usage: 19.7+ MB\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"wo_duplicates = raw.drop_duplicates(ignore_index=True)\n",
|
||
"wo_duplicates.info()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 97,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"SAVE_PATH_DF_DUPL_OCCUR = f'./02_1_Preprocess1/{DATA_SET_ID}_00_DF_wo_dupl.parquet'\n",
|
||
"wo_duplicates.to_parquet(SAVE_PATH_DF_DUPL_OCCUR)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### ``VorgangsBeschreibung``"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"#### **NA values and duplicates**"
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"String cleaning"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"SPECIAL_CHARS = set(['&', '$', '%', '§', '/', '(', ')', '_', \n",
|
||
" '+', '–', '--', '<', '>', '´',\n",
|
||
"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 17,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"def clean_string_slim(string: str) -> str:\n",
"    # replace whitespace control characters (tab, newline, ...) with a space\n",
"    pattern = r'[\\t\\n\\r\\f\\v]'\n",
"    string = re.sub(pattern, ' ', string)\n",
"    # remove whitespace at the beginning and the end\n",
"    string = string.strip()\n",
"\n",
"    return string\n",
"\n",
"def clean_string(string: str) -> str:\n",
"    #num_reps = 5\n",
"\n",
"    # replace whitespace control characters (tab, newline, ...) with a space\n",
"    pattern = r'[\\t\\n\\r\\f\\v]'\n",
"    string = re.sub(pattern, ' ', string)\n",
"    # remove dates (e.g. 21.02.2023); this also matches colon-separated times such as 12:30:00\n",
"    pattern = r'[\\d]{1,4}[.:][\\d]{1,4}[.:][\\d]{1,4}'\n",
"    string = re.sub(pattern, '', string)\n",
"    # remove times (hh:mm: with an optional seconds part)\n",
"    pattern = r'[\\d]{1,2}[:][\\d]{1,2}[:][\\d]{0,2}'\n",
"    string = re.sub(pattern, '', string)\n",
"    # remove all characters except spaces, word characters, and the punctuation . , ; : -\n",
"    pattern = r'[^ \\w.,;:\\-äöüÄÖÜ]+'\n",
"    string = re.sub(pattern, '', string)\n",
"    # replace a hyphen used as a dash between non-word characters (e.g. ' - ') with a space\n",
"    pattern = r'[\\W]+-[\\W]+'\n",
"    string = re.sub(pattern, ' ', string)\n",
"    # remove whitespace in front of punctuation\n",
"    pattern = r'[ ]+([;,.:])'\n",
"    string = re.sub(pattern, r'\\1', string)\n",
"    # collapse runs of spaces into a single space\n",
"    pattern = r'[ ]+'\n",
"    string = re.sub(pattern, ' ', string)\n",
"    # remove whitespace at the beginning and the end\n",
"    string = string.strip()\n",
"\n",
"    #while num_reps != 0:\n",
"        #string = string.replace('\\n', ' ')\n",
"        #string = string.replace('\\t', ' ')\n",
"        #string = string.replace(' ', ' ')\n",
"        #string = string.replace(' ', ' ')\n",
"        #string = string.replace(' - ', ' ')\n",
"    \"\"\"\n",
"    for char in SPECIAL_CHARS:\n",
"        string = string.replace(char, '')\n",
"\n",
"    #num_reps -= 1\n",
"\n",
"    # remove spaces at the beginning and the end\n",
"    string = string.strip()\n",
"    \"\"\"\n",
"\n",
"    return string"
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 18,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"base = wo_duplicates.copy()\n",
|
||
"base = base.dropna(axis=0, subset='VorgangsBeschreibung')\n",
|
||
"# preprocessing\n",
|
||
"#base['VorgangsBeschreibung'] = base['VorgangsBeschreibung'].map(clean_string)\n",
|
||
"base['VorgangsBeschreibung'] = base['VorgangsBeschreibung'].map(clean_string_slim)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>VorgangsID</th>\n",
|
||
" <th>ObjektID</th>\n",
|
||
" <th>HObjektText</th>\n",
|
||
" <th>ObjektArtID</th>\n",
|
||
" <th>ObjektArtText</th>\n",
|
||
" <th>VorgangsTypID</th>\n",
|
||
" <th>VorgangsTypName</th>\n",
|
||
" <th>VorgangsDatum</th>\n",
|
||
" <th>VorgangsStatusId</th>\n",
|
||
" <th>VorgangsPrioritaet</th>\n",
|
||
" <th>VorgangsBeschreibung</th>\n",
|
||
" <th>VorgangsOrt</th>\n",
|
||
" <th>VorgangsArtText</th>\n",
|
||
" <th>ErledigungsDatum</th>\n",
|
||
" <th>ErledigungsArtText</th>\n",
|
||
" <th>ErledigungsBeschreibung</th>\n",
|
||
" <th>MPMelderArbeitsplatz</th>\n",
|
||
" <th>MPAbteilungBezeichnung</th>\n",
|
||
" <th>Arbeitsbeginn</th>\n",
|
||
" <th>ErstellungsDatum</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>140837</td>\n",
|
||
" <td>728</td>\n",
|
||
" <td>10107, Rechteckfilter H1,</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>Behälter</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Reparaturauftrag (Portal)</td>\n",
|
||
" <td>2022-03-30</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>Filter Links Klopfer Defekt</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Klopfer defekt</td>\n",
|
||
" <td>2022-03-30</td>\n",
|
||
" <td>Ausgetauscht</td>\n",
|
||
" <td>.</td>\n",
|
||
" <td>Produktion</td>\n",
|
||
" <td>Produktion</td>\n",
|
||
" <td>2022-03-30</td>\n",
|
||
" <td>2022-03-30</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>136284</td>\n",
|
||
" <td>1280</td>\n",
|
||
" <td>03024, Flachform Hubtisch, H2E12</td>\n",
|
||
" <td>30</td>\n",
|
||
" <td>Hydraulik</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>Störungsmeldung</td>\n",
|
||
" <td>2022-03-25</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>Anfahrschutz für Hydraulikkupplung abgefahren.</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Defekt</td>\n",
|
||
" <td>2022-03-30</td>\n",
|
||
" <td>Repariert</td>\n",
|
||
" <td>Geschweißt</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2022-03-30</td>\n",
|
||
" <td>2022-03-25</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>116920</td>\n",
|
||
" <td>1518</td>\n",
|
||
" <td>00576, Leitstrahlmischer 1,</td>\n",
|
||
" <td>41</td>\n",
|
||
" <td>Mischer</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Wartung</td>\n",
|
||
" <td>2022-04-14</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>.</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>halbjährlich Wartung (W)</td>\n",
|
||
" <td>2022-04-21</td>\n",
|
||
" <td>Planmäßige Wartung</td>\n",
|
||
" <td>.</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2022-04-21</td>\n",
|
||
" <td>2021-11-22</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>21260</td>\n",
|
||
" <td>2097</td>\n",
|
||
" <td>00827, Überladebrücke Rampe 1,</td>\n",
|
||
" <td>58</td>\n",
|
||
" <td>Verladung</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Wartung</td>\n",
|
||
" <td>2022-05-06</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>Prüfung durch externen DL</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>jährliche Prüfung externer Dienstleister (P)</td>\n",
|
||
" <td>2022-04-25</td>\n",
|
||
" <td>Geprüft ohne Mängel</td>\n",
|
||
" <td>Geprüft ohne Mängel.</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2022-04-04</td>\n",
|
||
" <td>2021-04-14</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>116374</td>\n",
|
||
" <td>1703</td>\n",
|
||
" <td>00715, Vogelsang, 2</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Pumpen</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Wartung</td>\n",
|
||
" <td>2022-04-14</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>Wartung nach Arbeitsplan</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>halbjährlich Wartung (W)</td>\n",
|
||
" <td>2022-05-12</td>\n",
|
||
" <td>Planmäßige Wartung</td>\n",
|
||
" <td>Wartung wie geplant durchgeführt</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2022-04-20</td>\n",
|
||
" <td>2021-11-17</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>14774</th>\n",
|
||
" <td>165211</td>\n",
|
||
" <td>723</td>\n",
|
||
" <td>10102, Nasswäscher AGT 2,</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>Behälter</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Wartung</td>\n",
|
||
" <td>2023-05-01</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>Manuelle Dosierung des Biozids</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Biozid Dosierung Montag (W)</td>\n",
|
||
" <td>2023-05-03</td>\n",
|
||
" <td>Planmäßige Wartung</td>\n",
|
||
" <td>.</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2023-05-03</td>\n",
|
||
" <td>2022-10-10</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>14775</th>\n",
|
||
" <td>54805</td>\n",
|
||
" <td>2365</td>\n",
|
||
" <td>03544, Dampfkessel BHKW 1,</td>\n",
|
||
" <td>11</td>\n",
|
||
" <td>Dampferzeuger</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Wartung</td>\n",
|
||
" <td>2023-05-03</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td></td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>dreitägige Überprüfung Mittwoch (W)</td>\n",
|
||
" <td>2023-05-03</td>\n",
|
||
" <td>Planmäßige Wartung</td>\n",
|
||
" <td>Nach Vorgabe</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2023-05-03</td>\n",
|
||
" <td>2021-06-04</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>14776</th>\n",
|
||
" <td>166438</td>\n",
|
||
" <td>3214</td>\n",
|
||
" <td>03760, Seepexpumpe , BN 5-12L</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Pumpen</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Wartung</td>\n",
|
||
" <td>2023-04-24</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>Wartung nach Arbeitsplan</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>halbjährlich Wartung (W)</td>\n",
|
||
" <td>2023-05-03</td>\n",
|
||
" <td>Planmäßige Wartung</td>\n",
|
||
" <td>Wartung wie geplant durchgeführt</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2023-05-02</td>\n",
|
||
" <td>2022-10-24</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>14777</th>\n",
|
||
" <td>166443</td>\n",
|
||
" <td>1277</td>\n",
|
||
" <td>00593, Hydraulik für Deckelhubeinrichtung,</td>\n",
|
||
" <td>30</td>\n",
|
||
" <td>Hydraulik</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Wartung</td>\n",
|
||
" <td>2023-04-24</td>\n",
|
||
" <td>12</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>Wartung nach Arbeitsplan</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>halbjährlich Wartung (W)</td>\n",
|
||
" <td>2023-05-03</td>\n",
|
||
" <td>Planmäßige Wartung</td>\n",
|
||
" <td>Wartung wie geplant durchgeführt</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2023-05-03</td>\n",
|
||
" <td>2022-10-24</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>14778</th>\n",
|
||
" <td>195126</td>\n",
|
||
" <td>1266</td>\n",
|
||
" <td>02000, BHKW 1,</td>\n",
|
||
" <td>28</td>\n",
|
||
" <td>Heizung</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Wartung</td>\n",
|
||
" <td>2023-05-01</td>\n",
|
||
" <td>14</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>regelmäßige Wartung nach Herstellervorgaben al...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2.000h Wartung (W)</td>\n",
|
||
" <td>2023-04-26</td>\n",
|
||
" <td>Planmäßige Wartung</td>\n",
|
||
" <td>Wartung wie geplant durchgeführt</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2023-04-24</td>\n",
|
||
" <td>2023-03-20</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>14481 rows × 20 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" VorgangsID ObjektID HObjektText \\\n",
|
||
"0 140837 728 10107, Rechteckfilter H1, \n",
|
||
"1 136284 1280 03024, Flachform Hubtisch, H2E12 \n",
|
||
"2 116920 1518 00576, Leitstrahlmischer 1, \n",
|
||
"3 21260 2097 00827, Überladebrücke Rampe 1, \n",
|
||
"4 116374 1703 00715, Vogelsang, 2 \n",
|
||
"... ... ... ... \n",
|
||
"14774 165211 723 10102, Nasswäscher AGT 2, \n",
|
||
"14775 54805 2365 03544, Dampfkessel BHKW 1, \n",
|
||
"14776 166438 3214 03760, Seepexpumpe , BN 5-12L \n",
|
||
"14777 166443 1277 00593, Hydraulik für Deckelhubeinrichtung, \n",
|
||
"14778 195126 1266 02000, BHKW 1, \n",
|
||
"\n",
|
||
" ObjektArtID ObjektArtText VorgangsTypID VorgangsTypName \\\n",
|
||
"0 9 Behälter 3 Reparaturauftrag (Portal) \n",
|
||
"1 30 Hydraulik 2 Störungsmeldung \n",
|
||
"2 41 Mischer 1 Wartung \n",
|
||
"3 58 Verladung 1 Wartung \n",
|
||
"4 3 Pumpen 1 Wartung \n",
|
||
"... ... ... ... ... \n",
|
||
"14774 9 Behälter 1 Wartung \n",
|
||
"14775 11 Dampferzeuger 1 Wartung \n",
|
||
"14776 3 Pumpen 1 Wartung \n",
|
||
"14777 30 Hydraulik 1 Wartung \n",
|
||
"14778 28 Heizung 1 Wartung \n",
|
||
"\n",
|
||
" VorgangsDatum VorgangsStatusId VorgangsPrioritaet \\\n",
|
||
"0 2022-03-30 2 0 \n",
|
||
"1 2022-03-25 2 0 \n",
|
||
"2 2022-04-14 2 0 \n",
|
||
"3 2022-05-06 2 0 \n",
|
||
"4 2022-04-14 2 0 \n",
|
||
"... ... ... ... \n",
|
||
"14774 2023-05-01 2 0 \n",
|
||
"14775 2023-05-03 2 0 \n",
|
||
"14776 2023-04-24 2 0 \n",
|
||
"14777 2023-04-24 12 0 \n",
|
||
"14778 2023-05-01 14 0 \n",
|
||
"\n",
|
||
" VorgangsBeschreibung VorgangsOrt \\\n",
|
||
"0 Filter Links Klopfer Defekt NaN \n",
|
||
"1 Anfahrschutz für Hydraulikkupplung abgefahren. NaN \n",
|
||
"2 . NaN \n",
|
||
"3 Prüfung durch externen DL NaN \n",
|
||
"4 Wartung nach Arbeitsplan NaN \n",
|
||
"... ... ... \n",
|
||
"14774 Manuelle Dosierung des Biozids NaN \n",
|
||
"14775 NaN \n",
|
||
"14776 Wartung nach Arbeitsplan NaN \n",
|
||
"14777 Wartung nach Arbeitsplan NaN \n",
|
||
"14778 regelmäßige Wartung nach Herstellervorgaben al... NaN \n",
|
||
"\n",
|
||
" VorgangsArtText ErledigungsDatum \\\n",
|
||
"0 Klopfer defekt 2022-03-30 \n",
|
||
"1 Defekt 2022-03-30 \n",
|
||
"2 halbjährlich Wartung (W) 2022-04-21 \n",
|
||
"3 jährliche Prüfung externer Dienstleister (P) 2022-04-25 \n",
|
||
"4 halbjährlich Wartung (W) 2022-05-12 \n",
|
||
"... ... ... \n",
|
||
"14774 Biozid Dosierung Montag (W) 2023-05-03 \n",
|
||
"14775 dreitägige Überprüfung Mittwoch (W) 2023-05-03 \n",
|
||
"14776 halbjährlich Wartung (W) 2023-05-03 \n",
|
||
"14777 halbjährlich Wartung (W) 2023-05-03 \n",
|
||
"14778 2.000h Wartung (W) 2023-04-26 \n",
|
||
"\n",
|
||
" ErledigungsArtText ErledigungsBeschreibung \\\n",
|
||
"0 Ausgetauscht . \n",
|
||
"1 Repariert Geschweißt \n",
|
||
"2 Planmäßige Wartung . \n",
|
||
"3 Geprüft ohne Mängel Geprüft ohne Mängel. \n",
|
||
"4 Planmäßige Wartung Wartung wie geplant durchgeführt \n",
|
||
"... ... ... \n",
|
||
"14774 Planmäßige Wartung . \n",
|
||
"14775 Planmäßige Wartung Nach Vorgabe \n",
|
||
"14776 Planmäßige Wartung Wartung wie geplant durchgeführt \n",
|
||
"14777 Planmäßige Wartung Wartung wie geplant durchgeführt \n",
|
||
"14778 Planmäßige Wartung Wartung wie geplant durchgeführt \n",
|
||
"\n",
|
||
" MPMelderArbeitsplatz MPAbteilungBezeichnung Arbeitsbeginn \\\n",
|
||
"0 Produktion Produktion 2022-03-30 \n",
|
||
"1 NaN NaN 2022-03-30 \n",
|
||
"2 NaN NaN 2022-04-21 \n",
|
||
"3 NaN NaN 2022-04-04 \n",
|
||
"4 NaN NaN 2022-04-20 \n",
|
||
"... ... ... ... \n",
|
||
"14774 NaN NaN 2023-05-03 \n",
|
||
"14775 NaN NaN 2023-05-03 \n",
|
||
"14776 NaN NaN 2023-05-02 \n",
|
||
"14777 NaN NaN 2023-05-03 \n",
|
||
"14778 NaN NaN 2023-04-24 \n",
|
||
"\n",
|
||
" ErstellungsDatum \n",
|
||
"0 2022-03-30 \n",
|
||
"1 2022-03-25 \n",
|
||
"2 2021-11-22 \n",
|
||
"3 2021-04-14 \n",
|
||
"4 2021-11-17 \n",
|
||
"... ... \n",
|
||
"14774 2022-10-10 \n",
|
||
"14775 2021-06-04 \n",
|
||
"14776 2022-10-24 \n",
|
||
"14777 2022-10-24 \n",
|
||
"14778 2023-03-20 \n",
|
||
"\n",
|
||
"[14481 rows x 20 columns]"
|
||
]
|
||
},
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"base"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 20,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Entries: 14481\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"descriptions = base['VorgangsBeschreibung']\n",
|
||
"print(f\"Entries: {len(descriptions)}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Number of duplicate process descriptions: 12297\n",
|
||
"Number of unique process descriptions: 2184\n",
|
||
"Share of unique process descriptions: 15.08 %\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"num_dupl_descr = descriptions.duplicated().sum()\n",
|
||
"uni_descr = descriptions.unique()\n",
|
||
"num_uni_descr = len(uni_descr)\n",
|
||
"\n",
|
||
"print(f\"Number of duplicate process descriptions: {num_dupl_descr}\")\n",
|
||
"print(f\"Number of unique process descriptions: {num_uni_descr}\")\n",
|
||
"print(f\"Share of unique process descriptions: {num_uni_descr / len(descriptions) * 100:.2f} %\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"SAVE_PATH_DF_DUPL_OCCUR = f'./02_1_Preprocess1/{DATA_SET_ID}_01_DF_num_occur_temp1.parquet'\n",
|
||
"\n",
|
||
"if not LOAD_CALC_FILES:\n",
|
||
" cols = ['descr', 'len', 'num_occur', 'assoc_obj_ids', 'num_assoc_obj_ids']\n",
|
||
" descr_df = pd.DataFrame(columns=cols)\n",
|
||
" max_val = 0\n",
|
||
" text = None\n",
|
||
" index = 0\n",
|
||
"\n",
|
||
"\n",
|
||
" for idx, description in enumerate(uni_descr):\n",
|
||
" len_descr = len(description)\n",
|
||
" filt = base['VorgangsBeschreibung'] == description\n",
|
||
" temp = base[filt]\n",
|
||
" assoc_obj_ids = temp['ObjektID'].unique()\n",
|
||
" assoc_obj_ids = np.sort(assoc_obj_ids, kind='stable')\n",
|
||
" num_assoc_obj_ids = len(assoc_obj_ids)\n",
|
||
" num_dupl = filt.sum()\n",
|
||
" \n",
|
||
" conc_df = pd.DataFrame(data=[[\n",
|
||
" description,\n",
|
||
" len_descr,\n",
|
||
" num_dupl,\n",
|
||
" assoc_obj_ids,\n",
|
||
" num_assoc_obj_ids\n",
|
||
" ]], columns=cols)\n",
|
||
" \n",
|
||
" descr_df = pd.concat([descr_df, conc_df], ignore_index=True)\n",
|
||
" \n",
|
||
" if num_dupl > max_val:\n",
|
||
" max_val = num_dupl\n",
|
||
" index = idx\n",
|
||
" text = description\n",
|
||
" \n",
|
||
" temp1 = descr_df.sort_values(by='num_occur', ascending=False)\n",
|
||
" \n",
|
||
" # saving\n",
|
||
" temp1.to_parquet(SAVE_PATH_DF_DUPL_OCCUR)\n",
|
||
"else:\n",
|
||
" # loading\n",
|
||
" temp1 = pd.read_parquet(SAVE_PATH_DF_DUPL_OCCUR)"
|
||
]
|
||
},
|
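The loop above filters `base` once per unique description, so its cost grows with the number of unique texts times the number of rows. A minimal sketch of the same aggregation in a single pass follows, using plain Python containers instead of pandas. The function name `count_descriptions` and the toy rows are hypothetical and only illustrate the idea; they are not taken from the notebook's helpers.

```python
from collections import defaultdict


def count_descriptions(rows):
    """Aggregate (description, object_id) pairs in a single pass.

    Returns a list of dicts sorted by descending number of occurrences,
    with the same fields as the columns of ``descr_df``.
    """
    num_occur = defaultdict(int)
    obj_ids = defaultdict(set)
    for description, object_id in rows:
        num_occur[description] += 1
        obj_ids[description].add(object_id)

    table = [
        {
            "descr": descr,
            "len": len(descr),
            "num_occur": count,
            "assoc_obj_ids": sorted(obj_ids[descr]),
            "num_assoc_obj_ids": len(obj_ids[descr]),
        }
        for descr, count in num_occur.items()
    ]
    # sorted() is stable, so ties keep their first-seen order
    return sorted(table, key=lambda row: row["num_occur"], reverse=True)


# hypothetical toy rows (description, ObjektID), for illustration only
rows = [
    ("VDE Prüfung", 404),
    ("Wartung nach Arbeitsplan", 726),
    ("VDE Prüfung", 407),
    ("VDE Prüfung", 404),
]
table = count_descriptions(rows)
```

With pandas, the same result could presumably be obtained with a `groupby` on `VorgangsBeschreibung`; the stdlib version is shown here only because it has no dependencies.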
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>descr</th>\n",
|
||
" <th>len</th>\n",
|
||
" <th>num_occur</th>\n",
|
||
" <th>assoc_obj_ids</th>\n",
|
||
" <th>num_assoc_obj_ids</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>14</th>\n",
|
||
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
|
||
" <td>527</td>\n",
|
||
" <td>2809</td>\n",
|
||
" <td>[404, 405, 406, 407, 408, 409, 410, 411, 412, ...</td>\n",
|
||
" <td>1724</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>16</th>\n",
|
||
" <td>VDE Prüfung</td>\n",
|
||
" <td>11</td>\n",
|
||
" <td>2034</td>\n",
|
||
" <td>[404, 407, 408, 409, 410, 411, 412, 413, 414, ...</td>\n",
|
||
" <td>1187</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>Wartung nach Arbeitsplan</td>\n",
|
||
" <td>24</td>\n",
|
||
" <td>1062</td>\n",
|
||
" <td>[726, 798, 800, 801, 802, 921, 922, 923, 924, ...</td>\n",
|
||
" <td>218</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>Manuelle Dosierung des Biozids</td>\n",
|
||
" <td>30</td>\n",
|
||
" <td>526</td>\n",
|
||
" <td>[0, 722, 723, 724, 726]</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>12</th>\n",
|
||
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
|
||
" <td>29</td>\n",
|
||
" <td>511</td>\n",
|
||
" <td>[722, 723, 724, 725, 726]</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>844</th>\n",
|
||
" <td>Filterabreinigung AGT 1 : das erste Ventil von...</td>\n",
|
||
" <td>68</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>[0]</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>843</th>\n",
|
||
" <td>Abnahmeprüfung durch Sachkundigen</td>\n",
|
||
" <td>33</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>[1245]</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>842</th>\n",
|
||
" <td>Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung</td>\n",
|
||
" <td>47</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>[1326]</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>841</th>\n",
|
||
" <td>Ausgeführt</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>[2365]</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2183</th>\n",
|
||
" <td>Antrieb neu Dichten. Liegt auf Werkbank</td>\n",
|
||
" <td>39</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>[0]</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>2184 rows × 5 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" descr len num_occur \\\n",
|
||
"14 Bestimmen des Prüftermins für elektrische Arbe... 527 2809 \n",
|
||
"16 VDE Prüfung 11 2034 \n",
|
||
"4 Wartung nach Arbeitsplan 24 1062 \n",
|
||
"7 Manuelle Dosierung des Biozids 30 526 \n",
|
||
"12 Mikrobiologie(Abklatsch-Test) 29 511 \n",
|
||
"... ... ... ... \n",
|
||
"844 Filterabreinigung AGT 1 : das erste Ventil von... 68 1 \n",
|
||
"843 Abnahmeprüfung durch Sachkundigen 33 1 \n",
|
||
"842 Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung 47 1 \n",
|
||
"841 Ausgeführt 10 1 \n",
|
||
"2183 Antrieb neu Dichten. Liegt auf Werkbank 39 1 \n",
|
||
"\n",
|
||
" assoc_obj_ids num_assoc_obj_ids \n",
|
||
"14 [404, 405, 406, 407, 408, 409, 410, 411, 412, ... 1724 \n",
|
||
"16 [404, 407, 408, 409, 410, 411, 412, 413, 414, ... 1187 \n",
|
||
"4 [726, 798, 800, 801, 802, 921, 922, 923, 924, ... 218 \n",
|
||
"7 [0, 722, 723, 724, 726] 5 \n",
|
||
"12 [722, 723, 724, 725, 726] 5 \n",
|
||
"... ... ... \n",
|
||
"844 [0] 1 \n",
|
||
"843 [1245] 1 \n",
|
||
"842 [1326] 1 \n",
|
||
"841 [2365] 1 \n",
|
||
"2183 [0] 1 \n",
|
||
"\n",
|
||
"[2184 rows x 5 columns]"
|
||
]
|
||
},
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"temp1"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 29,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"2184"
|
||
]
|
||
},
|
||
"execution_count": 29,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"len(temp1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Bestimmen des Prüftermins für elektrische Arbeitsmittel(Teil der Gefährdungsbeurteilung gemäß Betribssicherheitsverordnung §3) Ist immer ein Jahr gültig! Erklärung: -Warum stehen vor jeder Auswahl die Zahlen 1-7? Antwort: Es gibt die Gefahren Klasse 1-7 daher wurde auch bei jeder Auswahlmöglichkeit die Gefahrenklasse mit integriert. Gefährdungsklasse 1 2 3 4 5 6 7 Zustand Spitzenniv. sehr gut gut normal beeinträchtigt schlecht sehr schlecht Einwirkung/Gefährdung keine sehr niedrig niedrig normal erhöht hoch sehr hoch'"
|
||
]
|
||
},
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"temp1.iloc[0,0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'VDE Prüfung'"
|
||
]
|
||
},
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"temp1.iloc[1,0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# saving\n",
|
||
"SAVE_PATH_DF_DUPL_OCCUR = f'./02_1_Preprocess1/{DATA_SET_ID}_01_DF_num_occur_temp1.parquet'\n",
|
||
"#temp1.to_parquet(SAVE_PATH_DF_DUPL_OCCUR)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Cosine Similarity**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 34,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# drop descriptions with fewer than 6 characters\n",
|
||
"subset_data = temp1.loc[temp1['len'] > 5, 'descr'].copy()\n",
|
||
"#subset_data = subset_data.iloc[0:100]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 35,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"2171"
|
||
]
|
||
},
|
||
"execution_count": 35,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"len(subset_data)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 36,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# saving\n",
|
||
"SAVE_PATH_SUBSET_DATA = f'./02_1_Preprocess1/{DATA_SET_ID}_02_1_subset_data.pkl'\n",
|
||
"if not LOAD_CALC_FILES:\n",
|
||
" subset_data.to_pickle(SAVE_PATH_SUBSET_DATA)\n",
|
||
"else:\n",
|
||
" subset_data = pd.read_pickle(SAVE_PATH_SUBSET_DATA)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"- How should unknown (out-of-vocabulary) words be handled?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# build mapping of embeddings for given model\n",
|
||
"def build_embedding_map(\n",
|
||
" data: Series,\n",
|
||
" model: GermanSpacyModel | SentenceTransformer,\n",
|
||
") -> tuple[dict[int, tuple['Embedding',str]], tuple[bool, bool]]:\n",
|
||
" # dictionary with embeddings\n",
|
||
" embeddings: dict[int, tuple['Embedding',str]] = dict()\n",
|
||
" is_spacy = False\n",
|
||
" is_STRF = False\n",
|
||
" \n",
|
||
" if isinstance(model, spacy.lang.de.German):\n",
|
||
" is_spacy = True\n",
|
||
" elif isinstance(model, SentenceTransformer):\n",
|
||
" is_STRF = True\n",
|
||
" \n",
|
||
" if not any((is_spacy, is_STRF)):\n",
|
||
" raise NotImplementedError(\"Model type unknown\")\n",
|
||
" \n",
|
||
"    for (idx, text) in data.items():\n",
|
||
" \n",
|
||
" if is_spacy:\n",
|
||
" embd = model(text)\n",
|
||
" embeddings[idx] = (embd, text)\n",
|
||
" # check for empty vectors\n",
|
||
"            if not embd.vector_norm:\n",
|
||
" print('--- Unknown Words ---')\n",
|
||
" print(f'{embd.text=} has no vector')\n",
|
||
" elif is_STRF:\n",
|
||
" embd = model.encode(text, show_progress_bar=False, normalize_embeddings=False)\n",
|
||
" embeddings[idx] = (embd, text)\n",
|
||
" \n",
|
||
" return embeddings, (is_spacy, is_STRF)\n",
|
||
"\n",
|
||
"# build similarity matrix out of embeddings\n",
|
||
"def build_cosSim_matrix(\n",
|
||
" data: Series,\n",
|
||
" model: GermanSpacyModel | SentenceTransformer,\n",
|
||
") -> tuple[DataFrame, dict[int, tuple['Embedding',str]]]:\n",
|
||
" # build empty matrix\n",
|
||
" df_index = data.index\n",
|
||
" cosineSim_idx_matrix = pd.DataFrame(data=0., columns=df_index, \n",
|
||
" index=df_index, dtype=np.float32)\n",
|
||
" \n",
|
||
" # obtain embeddings based on used model\n",
|
||
" embds, (is_spacy, is_STRF) = build_embedding_map(\n",
|
||
" data=data,\n",
|
||
" model=model\n",
|
||
" )\n",
|
||
" \n",
|
||
" # apply index based mapping for efficient handling of large texts\n",
|
||
" combs = combinations(df_index, 2)\n",
|
||
" \n",
|
||
" for (idx1, idx2) in combs:\n",
|
||
" #print(f\"{idx1=}, {idx2=}\")\n",
|
||
" embd1 = embds[idx1][0]\n",
|
||
" embd2 = embds[idx2][0]\n",
|
||
" \n",
|
||
" # calculate similarity based on model type\n",
|
||
" if is_spacy:\n",
|
||
" cosSim = embd1.similarity(embd2)\n",
|
||
" elif is_STRF:\n",
|
||
" cosSim = sentence_transformers.util.cos_sim(embd1, embd2)\n",
|
||
" cosSim = cosSim.item()\n",
|
||
" \n",
|
||
" cosineSim_idx_matrix.at[idx1, idx2] = cosSim\n",
|
||
" \n",
|
||
" return cosineSim_idx_matrix, embds"
|
||
]
|
||
},
|
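For reference, the cosine similarity that both model branches above rely on is cos(u, v) = u·v / (|u| |v|). The sketch below computes it for plain Python lists and evaluates every unordered index pair exactly once, mirroring the `combinations(df_index, 2)` loop. The names `cosine_similarity` and `pairwise_similarities` and the toy vectors are assumptions for illustration. Returning 0.0 for a zero vector is one possible convention for texts without a vector, not something the notebook prescribes.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # assumption: a text without a vector is similar to nothing
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarities(vectors):
    """Map each unordered index pair (i, j) to its cosine similarity.

    ``vectors`` maps an index to its embedding; each pair appears once,
    in the order in which the indices are stored.
    """
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i, j in combinations(vectors, 2)
    }


# hypothetical two-dimensional toy embeddings keyed by row index
vectors = {14: [1.0, 0.0], 16: [1.0, 1.0], 4: [0.0, 2.0]}
sims = pairwise_similarities(vectors)
```

Because each pair is stored only once, only the upper triangle is filled, which is also why the filtered result further down lists every candidate pair a single time.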
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 37,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"SKIP = False\n",
|
||
"SAVE_PATH_COSSIM_MATRIX_WHOLE = f'./02_1_Preprocess1/{DATA_SET_ID}_02_2_cosineSim_idx_matrix_whole_textbased.parquet'\n",
|
||
"SAVE_PATH_COSSIM_EMBDS_WHOLE = f'./02_1_Preprocess1/{DATA_SET_ID}_02_2_cosineSim_idx_embds_whole_textbased.pkl'\n",
|
||
"\n",
|
||
"if not SKIP:\n",
|
||
" cosineSim_idx_matrix, embds = build_cosSim_matrix(\n",
|
||
" data=subset_data,\n",
|
||
" model=model_stfr,\n",
|
||
" )\n",
|
||
" # saving\n",
|
||
" cosineSim_idx_matrix.to_parquet(SAVE_PATH_COSSIM_MATRIX_WHOLE)\n",
|
||
" save_pickle(obj=embds, path=SAVE_PATH_COSSIM_EMBDS_WHOLE)\n",
|
||
"else:\n",
|
||
" cosineSim_idx_matrix = pd.read_parquet(SAVE_PATH_COSSIM_MATRIX_WHOLE)\n",
|
||
" embds = load_pickle(SAVE_PATH_COSSIM_EMBDS_WHOLE)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 38,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(2171, 2171)"
|
||
]
|
||
},
|
||
"execution_count": 38,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"cosineSim_idx_matrix.to_numpy().shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# obtain index pairs with cosine similarity \n",
|
||
"# greater than or equal to given threshold value\n",
|
||
"\n",
|
||
"def filt_thresh_cosSim_matrix(\n",
|
||
" threshold: float,\n",
|
||
" cosineSim_idx_matrix: DataFrame,\n",
|
||
"):\n",
|
||
" cosineSim_filt = cosineSim_idx_matrix.where(cosineSim_idx_matrix >= threshold).stack()\n",
|
||
" \n",
|
||
" return cosineSim_filt\n",
|
||
"\n",
|
||
"def list_cosSim_dupl_candidates(\n",
|
||
" cosineSim_filt: Series,\n",
|
||
" embeddings: dict[int, tuple['Embedding',str]],\n",
|
||
"):\n",
|
||
" # compare found duplicates\n",
|
||
" columns = ['idx1', 'text1', 'idx2', 'text2', 'score']\n",
|
||
" df_candidates = pd.DataFrame(columns=columns)\n",
|
||
" \n",
|
||
" index_pairs = list()\n",
|
||
"\n",
|
||
" for ((idx1, idx2), score) in cosineSim_filt.items():\n",
|
||
" # get text content from embedding as second tuple entry\n",
|
||
" content = [[\n",
|
||
" idx1,\n",
|
||
" embeddings[idx1][1],\n",
|
||
" idx2,\n",
|
||
" embeddings[idx2][1],\n",
|
||
" score,\n",
|
||
" ]]\n",
|
||
" df_conc = pd.DataFrame(columns=columns, data=content)\n",
|
||
" \n",
|
||
" df_candidates = pd.concat([df_candidates, df_conc])\n",
|
||
" index_pairs.append((idx1, idx2))\n",
|
||
" \n",
|
||
" return df_candidates, index_pairs\n",
|
||
"\n",
|
||
"def choose_cosSim_dupl_candidates(\n",
|
||
" cosineSim_filt: Series,\n",
|
||
" embeddings: dict[int, tuple['Embedding',str]],\n",
|
||
") -> tuple[DataFrame, list[tuple['Index', 'Index']]]:\n",
|
||
" # compare found duplicates\n",
|
||
" columns = ['idx1', 'text1', 'idx2', 'text2', 'score']\n",
|
||
" df_candidates = pd.DataFrame(columns=columns)\n",
|
||
" \n",
|
||
" index_pairs = list()\n",
|
||
"\n",
|
||
" for ((idx1, idx2), score) in cosineSim_filt.items():\n",
|
||
" # get texts for comparison\n",
|
||
" text1 = embeddings[idx1][1]\n",
|
||
" text2 = embeddings[idx2][1]\n",
|
||
" # get decision\n",
|
||
" print('---------- New Decision ----------')\n",
|
||
" print('text1:\\n', text1, '\\n', flush=True)\n",
|
||
" print('text2:\\n', text2, '\\n', flush=True)\n",
|
||
" decision = input('Please enter >>y<< if this is a duplicate, else hit enter:')\n",
|
||
" \n",
|
||
" if not decision == 'y':\n",
|
||
" continue\n",
|
||
" \n",
|
||
" # get text content from embedding as second tuple entry\n",
|
||
" content = [[\n",
|
||
" idx1,\n",
|
||
" text1,\n",
|
||
" idx2,\n",
|
||
" text2,\n",
|
||
" score,\n",
|
||
" ]]\n",
|
||
" df_conc = pd.DataFrame(columns=columns, data=content)\n",
|
||
" \n",
|
||
" df_candidates = pd.concat([df_candidates, df_conc])\n",
|
||
" index_pairs.append((idx1, idx2))\n",
|
||
" \n",
|
||
" return df_candidates, index_pairs"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 39,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"14 18 0.851394\n",
|
||
"16 181 0.818661\n",
|
||
" 195 0.840125\n",
|
||
" 87 0.812861\n",
|
||
" 1306 0.818661\n",
|
||
" ... \n",
|
||
"876 911 0.812442\n",
|
||
"929 910 0.847216\n",
|
||
" 870 0.964813\n",
|
||
"910 870 0.830993\n",
|
||
"837 868 0.951816\n",
|
||
"Length: 1445, dtype: float32"
|
||
]
|
||
},
|
||
"execution_count": 39,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"SIMILARITY_THRESHOLD = 0.8\n",
|
||
"SAVE_PATH_COSSIM_CANDFILT_WHOLE = f'./02_1_Preprocess1/{DATA_SET_ID}_02_3_cosineSim_idx_cand_filter_textbased.pkl'\n",
|
||
"\n",
|
||
"SKIP = False\n",
|
||
"if not SKIP:\n",
|
||
" cosineSim_filt = filt_thresh_cosSim_matrix(\n",
|
||
" threshold=SIMILARITY_THRESHOLD,\n",
|
||
" cosineSim_idx_matrix=cosineSim_idx_matrix,\n",
|
||
" )\n",
|
||
" # saving\n",
|
||
" cosineSim_filt.to_pickle(SAVE_PATH_COSSIM_CANDFILT_WHOLE)\n",
|
||
"else:\n",
|
||
" cosineSim_filt = pd.read_pickle(SAVE_PATH_COSSIM_CANDFILT_WHOLE)\n",
|
||
"cosineSim_filt"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 40,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"a:\\Arbeitsaufgaben\\Instandhaltung\\ihm_analyze\\helpers.py:131: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
|
||
" df_candidates = pd.concat([df_candidates, df_conc])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"SKIP = False\n",
|
||
"SAVE_PATH_DUPL_CANDIDATES = (f'./02_1_Preprocess1/{DATA_SET_ID}_02_4_dupl_candidates_'\n",
|
||
" f'cosSim_thresh_{SIMILARITY_THRESHOLD}.xlsx')\n",
|
||
"SAVE_PATH_IDX_CAND_PAIRS = f'./02_1_Preprocess1/{DATA_SET_ID}_02_4_dupl_idx_pairs_whole_Exp4.pkl'\n",
|
||
"\n",
|
||
"if not SKIP:\n",
|
||
" cosSim_dupl_candidates, dupl_idx_pairs = list_cosSim_dupl_candidates(\n",
|
||
" cosineSim_filt=cosineSim_filt,\n",
|
||
" embeddings=embds,\n",
|
||
" )\n",
|
||
" # save results\n",
|
||
" cosSim_dupl_candidates.to_excel(SAVE_PATH_DUPL_CANDIDATES)\n",
|
||
" save_pickle(obj=dupl_idx_pairs, path=SAVE_PATH_IDX_CAND_PAIRS)\n",
|
||
" #cosSim_dupl_candidates\n",
|
||
"else:\n",
|
||
" cosSim_dupl_candidates = pd.read_excel(SAVE_PATH_DUPL_CANDIDATES, index_col=0)\n",
|
||
" dupl_idx_pairs = load_pickle(SAVE_PATH_IDX_CAND_PAIRS)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 41,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>idx1</th>\n",
|
||
" <th>text1</th>\n",
|
||
" <th>idx2</th>\n",
|
||
" <th>text2</th>\n",
|
||
" <th>score</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>14</td>\n",
|
||
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
|
||
" <td>18</td>\n",
|
||
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
|
||
" <td>0.851394</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>16</td>\n",
|
||
" <td>VDE Prüfung</td>\n",
|
||
" <td>181</td>\n",
|
||
" <td>· VDE Prüfung</td>\n",
|
||
" <td>0.818661</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>16</td>\n",
|
||
" <td>VDE Prüfung</td>\n",
|
||
" <td>195</td>\n",
|
||
" <td>VDE Prüfung nach VDE 0701/0702</td>\n",
|
||
" <td>0.840125</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>16</td>\n",
|
||
" <td>VDE Prüfung</td>\n",
|
||
" <td>87</td>\n",
|
||
" <td>Prüfung nach VDE 701/702</td>\n",
|
||
" <td>0.812861</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>16</td>\n",
|
||
" <td>VDE Prüfung</td>\n",
|
||
" <td>1306</td>\n",
|
||
" <td>·VDE Prüfung</td>\n",
|
||
" <td>0.818661</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>876</td>\n",
|
||
" <td>defekte Filter-Stützkörpe von AGT2 und AGT 3</td>\n",
|
||
" <td>911</td>\n",
|
||
" <td>AGT1 Filter Trichter 2 Klopfer defekt</td>\n",
|
||
" <td>0.812442</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>929</td>\n",
|
||
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
|
||
" <td>910</td>\n",
|
||
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
|
||
" <td>0.847216</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>929</td>\n",
|
||
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
|
||
" <td>870</td>\n",
|
||
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
|
||
" <td>0.964813</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>910</td>\n",
|
||
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
|
||
" <td>870</td>\n",
|
||
" <td>Unter \"Sonstiges\" können Sie alle anderen Mäng...</td>\n",
|
||
" <td>0.830993</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>837</td>\n",
|
||
" <td>Die Pumpe ist undicht bzw. tropft. Halle:2 Ebe...</td>\n",
|
||
" <td>868</td>\n",
|
||
" <td>Die Pumpe ist undicht bzw. tropft. Halle:2 Ebe...</td>\n",
|
||
" <td>0.951816</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>1445 rows × 5 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" idx1 text1 idx2 \\\n",
|
||
"0 14 Bestimmen des Prüftermins für elektrische Arbe... 18 \n",
|
||
"0 16 VDE Prüfung 181 \n",
|
||
"0 16 VDE Prüfung 195 \n",
|
||
"0 16 VDE Prüfung 87 \n",
|
||
"0 16 VDE Prüfung 1306 \n",
|
||
".. ... ... ... \n",
|
||
"0 876 defekte Filter-Stützkörpe von AGT2 und AGT 3 911 \n",
|
||
"0 929 Unter \"Sonstiges\" können Sie alle anderen Mäng... 910 \n",
|
||
"0 929 Unter \"Sonstiges\" können Sie alle anderen Mäng... 870 \n",
|
||
"0 910 Unter \"Sonstiges\" können Sie alle anderen Mäng... 870 \n",
|
||
"0 837 Die Pumpe ist undicht bzw. tropft. Halle:2 Ebe... 868 \n",
|
||
"\n",
|
||
" text2 score \n",
|
||
"0 Bestimmen des Prüftermins für elektrische Arbe... 0.851394 \n",
|
||
"0 · VDE Prüfung 0.818661 \n",
|
||
"0 VDE Prüfung nach VDE 0701/0702 0.840125 \n",
|
||
"0 Prüfung nach VDE 701/702 0.812861 \n",
|
||
"0 ·VDE Prüfung 0.818661 \n",
|
||
".. ... ... \n",
|
||
"0 AGT1 Filter Trichter 2 Klopfer defekt 0.812442 \n",
|
||
"0 Unter \"Sonstiges\" können Sie alle anderen Mäng... 0.847216 \n",
|
||
"0 Unter \"Sonstiges\" können Sie alle anderen Mäng... 0.964813 \n",
|
||
"0 Unter \"Sonstiges\" können Sie alle anderen Mäng... 0.830993 \n",
|
||
"0 Die Pumpe ist undicht bzw. tropft. Halle:2 Ebe... 0.951816 \n",
|
||
"\n",
|
||
"[1445 rows x 5 columns]"
|
||
]
|
||
},
|
||
"execution_count": 41,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"cosSim_dupl_candidates"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Next steps:**\n",
|
||
"- Find the cutoff threshold at which duplicates are still just correctly identified"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 42,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"if False:\n",
|
||
" thresholds = (0.75, 0.8, 0.85, 0.9, 0.93, 0.95, 0.96, 0.97, 0.98)\n",
|
||
"\n",
|
||
" for thresh in thresholds:\n",
|
||
" \n",
|
||
" cosineSim_filt = filt_thresh_cosSim_matrix(\n",
|
||
" threshold=thresh,\n",
|
||
" cosineSim_idx_matrix=cosineSim_idx_matrix.copy(),\n",
|
||
" )\n",
|
||
" \n",
|
||
"        cosSim_dupl_candidates, _ = list_cosSim_dupl_candidates(\n",
|
||
" cosineSim_filt=cosineSim_filt,\n",
|
||
" embeddings=embds,\n",
|
||
" )\n",
|
||
" \n",
|
||
" # saving path\n",
|
||
" saving_path = (f'./Filterung_Duplikate/dupl_candidates_'\n",
|
||
" f'cosSim_thresh_{thresh}_STFR.xlsx')\n",
|
||
" \n",
|
||
" cosSim_dupl_candidates.to_excel(saving_path)"
|
||
]
|
||
},
|
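Before exporting one Excel file per threshold, it can help to tabulate how many candidate pairs each threshold would produce. A minimal sketch, assuming the scores are available as a mapping from index pair to score. `count_candidates` is a hypothetical helper, and the five scores are copied from the filtered output shown further up, so the counts only describe this small excerpt.

```python
def count_candidates(scores, thresholds):
    """Count pairs whose score is greater than or equal to each threshold."""
    return {
        thresh: sum(1 for score in scores.values() if score >= thresh)
        for thresh in thresholds
    }


# excerpt of (idx1, idx2) -> score pairs from the filtered output above
scores = {
    (14, 18): 0.851394,
    (16, 181): 0.818661,
    (16, 195): 0.840125,
    (929, 870): 0.964813,
    (837, 868): 0.951816,
}
counts = count_candidates(scores, thresholds=(0.8, 0.85, 0.9, 0.95))
```

A sharp drop between two neighbouring thresholds would be a natural place to inspect the pairs by hand.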
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Results:**\n",
|
||
"- No general threshold can be derived, only a rough guideline value\n",
|
||
"- Pairs with a lower score are in places more similar than pairs with a higher score\n",
|
||
"- The final duplicate decision is made by hand, since context knowledge is still required\n",
|
||
"- Work with ``temp1`` and merge entries\n",
|
||
"\n",
|
||
"- Manual review is not practical for the whole data set (more than 9300 entries would have to be compared)\n",
|
||
"- As a first pass: merge based on a threshold of ``0.8``"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"---\n",
|
||
"\n",
|
||
"*Manual Decision*"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 53,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# manually decide if candidates are indeed duplicates\n",
|
||
"\n",
|
||
"SKIP = True\n",
|
||
"if not SKIP:\n",
|
||
" cosSim_dupl_candidates, dupl_idx_pairs = choose_cosSim_dupl_candidates(\n",
|
||
" cosineSim_filt=cosineSim_filt,\n",
|
||
" embeddings=embds,\n",
|
||
" )"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 54,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"#save_pickle(obj=dupl_idx_pairs, path='./Filterung_Duplikate/dupl_idx_pairs_Exp4.pkl')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 72,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"#dupl_idx_pairs = load_pickle(path='./Filterung_Duplikate/dupl_idx_pairs_Exp4.pkl')\n",
|
||
"#dupl_idx_pairs = load_pickle(path='./02_1_Preprocess1/dupl_idx_pairs_whole_Exp4.pkl')\n",
|
||
"\n",
|
||
"#dupl_idx_pairs"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"---\n",
|
||
"\n",
|
||
"*Eliminate Candidates*"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 43,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"temp2 = temp1.copy()\n",
|
||
"dupl_idx_pairs = load_pickle(path=SAVE_PATH_IDX_CAND_PAIRS)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 44,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"1445"
|
||
]
|
||
},
|
||
"execution_count": 44,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"len(dupl_idx_pairs)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 45,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# merge duplicates\n",
|
||
"\n",
|
||
"# to-do:\n",
|
||
"# merge: 'num_occur', 'assoc_obj_ids', \n",
|
||
"# recalc: 'num_assoc_obj_ids'\n",
|
||
"\n",
|
||
"for (i1, i2) in dupl_idx_pairs:\n",
|
||
" \n",
|
||
" # if an entry does not exist anymore, skip this pair\n",
|
||
" if i1 not in temp2.index or i2 not in temp2.index:\n",
|
||
" continue\n",
|
||
" \n",
|
||
" # merge num occur\n",
|
||
" num_occur1 = temp2.at[i1, 'num_occur']\n",
|
||
" num_occur2 = temp2.at[i2, 'num_occur']\n",
|
||
" new_num_occur = num_occur1 + num_occur2\n",
|
||
"\n",
|
||
" # merge assoc obj ids\n",
|
||
" assoc_ids1 = temp2.at[i1, 'assoc_obj_ids']\n",
|
||
" assoc_ids2 = temp2.at[i2, 'assoc_obj_ids']\n",
|
||
" new_assoc_ids = np.append(assoc_ids1, assoc_ids2)\n",
|
||
" new_assoc_ids = np.unique(new_assoc_ids.flatten())\n",
|
||
"\n",
|
||
" # recalc num assoc obj ids\n",
|
||
" new_num_assoc_obj_ids = len(new_assoc_ids)\n",
|
||
"\n",
|
||
" # write porperties to first entry\n",
|
||
" temp2.at[i1, 'num_occur'] = new_num_occur\n",
|
||
" temp2.at[i1, 'assoc_obj_ids'] = new_assoc_ids\n",
|
||
" temp2.at[i1, 'num_assoc_obj_ids'] = new_num_assoc_obj_ids\n",
|
||
" \n",
|
||
" # drop second entry\n",
|
||
" temp2 = temp2.drop(index=i2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 46,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>descr</th>\n",
|
||
" <th>len</th>\n",
|
||
" <th>num_occur</th>\n",
|
||
" <th>assoc_obj_ids</th>\n",
|
||
" <th>num_assoc_obj_ids</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>14</th>\n",
|
||
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
|
||
" <td>527</td>\n",
|
||
" <td>2809</td>\n",
|
||
" <td>[404, 405, 406, 407, 408, 409, 410, 411, 412, ...</td>\n",
|
||
" <td>1724</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>16</th>\n",
|
||
" <td>VDE Prüfung</td>\n",
|
||
" <td>11</td>\n",
|
||
" <td>2034</td>\n",
|
||
" <td>[404, 407, 408, 409, 410, 411, 412, 413, 414, ...</td>\n",
|
||
" <td>1187</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>Wartung nach Arbeitsplan</td>\n",
|
||
" <td>24</td>\n",
|
||
" <td>1062</td>\n",
|
||
" <td>[726, 798, 800, 801, 802, 921, 922, 923, 924, ...</td>\n",
|
||
" <td>218</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>Manuelle Dosierung des Biozids</td>\n",
|
||
" <td>30</td>\n",
|
||
" <td>526</td>\n",
|
||
" <td>[0, 722, 723, 724, 726]</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>12</th>\n",
|
||
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
|
||
" <td>29</td>\n",
|
||
" <td>511</td>\n",
|
||
" <td>[722, 723, 724, 725, 726]</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" descr len num_occur \\\n",
|
||
"14 Bestimmen des Prüftermins für elektrische Arbe... 527 2809 \n",
|
||
"16 VDE Prüfung 11 2034 \n",
|
||
"4 Wartung nach Arbeitsplan 24 1062 \n",
|
||
"7 Manuelle Dosierung des Biozids 30 526 \n",
|
||
"12 Mikrobiologie(Abklatsch-Test) 29 511 \n",
|
||
"\n",
|
||
" assoc_obj_ids num_assoc_obj_ids \n",
|
||
"14 [404, 405, 406, 407, 408, 409, 410, 411, 412, ... 1724 \n",
|
||
"16 [404, 407, 408, 409, 410, 411, 412, 413, 414, ... 1187 \n",
|
||
"4 [726, 798, 800, 801, 802, 921, 922, 923, 924, ... 218 \n",
|
||
"7 [0, 722, 723, 724, 726] 5 \n",
|
||
"12 [722, 723, 724, 725, 726] 5 "
|
||
]
|
||
},
|
||
"execution_count": 46,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"temp1.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 47,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>descr</th>\n",
|
||
" <th>len</th>\n",
|
||
" <th>num_occur</th>\n",
|
||
" <th>assoc_obj_ids</th>\n",
|
||
" <th>num_assoc_obj_ids</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>14</th>\n",
|
||
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
|
||
" <td>527</td>\n",
|
||
" <td>3081</td>\n",
|
||
" <td>[404, 405, 406, 407, 408, 409, 410, 411, 412, ...</td>\n",
|
||
" <td>1724</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>16</th>\n",
|
||
" <td>VDE Prüfung</td>\n",
|
||
" <td>11</td>\n",
|
||
" <td>2201</td>\n",
|
||
" <td>[404, 407, 408, 409, 410, 411, 412, 413, 414, ...</td>\n",
|
||
" <td>1203</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>Wartung nach Arbeitsplan</td>\n",
|
||
" <td>24</td>\n",
|
||
" <td>1091</td>\n",
|
||
" <td>[726, 798, 800, 801, 802, 921, 922, 923, 924, ...</td>\n",
|
||
" <td>219</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>Manuelle Dosierung des Biozids</td>\n",
|
||
" <td>30</td>\n",
|
||
" <td>526</td>\n",
|
||
" <td>[0, 722, 723, 724, 726]</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>12</th>\n",
|
||
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
|
||
" <td>29</td>\n",
|
||
" <td>511</td>\n",
|
||
" <td>[722, 723, 724, 725, 726]</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" descr len num_occur \\\n",
|
||
"14 Bestimmen des Prüftermins für elektrische Arbe... 527 3081 \n",
|
||
"16 VDE Prüfung 11 2201 \n",
|
||
"4 Wartung nach Arbeitsplan 24 1091 \n",
|
||
"7 Manuelle Dosierung des Biozids 30 526 \n",
|
||
"12 Mikrobiologie(Abklatsch-Test) 29 511 \n",
|
||
"\n",
|
||
" assoc_obj_ids num_assoc_obj_ids \n",
|
||
"14 [404, 405, 406, 407, 408, 409, 410, 411, 412, ... 1724 \n",
|
||
"16 [404, 407, 408, 409, 410, 411, 412, 413, 414, ... 1203 \n",
|
||
"4 [726, 798, 800, 801, 802, 921, 922, 923, 924, ... 219 \n",
|
||
"7 [0, 722, 723, 724, 726] 5 \n",
|
||
"12 [722, 723, 724, 725, 726] 5 "
|
||
]
|
||
},
|
||
"execution_count": 47,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"temp2.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 59,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<class 'pandas.core.frame.DataFrame'>\n",
|
||
"Index: 2184 entries, 14 to 2183\n",
|
||
"Data columns (total 5 columns):\n",
|
||
" # Column Non-Null Count Dtype \n",
|
||
"--- ------ -------------- ----- \n",
|
||
" 0 descr 2184 non-null object\n",
|
||
" 1 len 2184 non-null object\n",
|
||
" 2 num_occur 2184 non-null object\n",
|
||
" 3 assoc_obj_ids 2184 non-null object\n",
|
||
" 4 num_assoc_obj_ids 2184 non-null object\n",
|
||
"dtypes: object(5)\n",
|
||
"memory usage: 166.9+ KB\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"temp1.info()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 60,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<class 'pandas.core.frame.DataFrame'>\n",
|
||
"Index: 1735 entries, 14 to 2183\n",
|
||
"Data columns (total 5 columns):\n",
|
||
" # Column Non-Null Count Dtype \n",
|
||
"--- ------ -------------- ----- \n",
|
||
" 0 descr 1735 non-null object\n",
|
||
" 1 len 1735 non-null object\n",
|
||
" 2 num_occur 1735 non-null object\n",
|
||
" 3 assoc_obj_ids 1735 non-null object\n",
|
||
" 4 num_assoc_obj_ids 1735 non-null object\n",
|
||
"dtypes: object(5)\n",
|
||
"memory usage: 81.3+ KB\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"temp2.info()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 48,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# transform assoc_obj_ids to list to be able to save DF\n",
|
||
"temp2['assoc_obj_ids'] = temp2['assoc_obj_ids'].map(lambda x: x.tolist())"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 49,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>descr</th>\n",
|
||
" <th>len</th>\n",
|
||
" <th>num_occur</th>\n",
|
||
" <th>assoc_obj_ids</th>\n",
|
||
" <th>num_assoc_obj_ids</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>14</th>\n",
|
||
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
|
||
" <td>527</td>\n",
|
||
" <td>3081</td>\n",
|
||
" <td>[404, 405, 406, 407, 408, 409, 410, 411, 412, ...</td>\n",
|
||
" <td>1724</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>16</th>\n",
|
||
" <td>VDE Prüfung</td>\n",
|
||
" <td>11</td>\n",
|
||
" <td>2201</td>\n",
|
||
" <td>[404, 407, 408, 409, 410, 411, 412, 413, 414, ...</td>\n",
|
||
" <td>1203</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>Wartung nach Arbeitsplan</td>\n",
|
||
" <td>24</td>\n",
|
||
" <td>1091</td>\n",
|
||
" <td>[726, 798, 800, 801, 802, 921, 922, 923, 924, ...</td>\n",
|
||
" <td>219</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>Manuelle Dosierung des Biozids</td>\n",
|
||
" <td>30</td>\n",
|
||
" <td>526</td>\n",
|
||
" <td>[0, 722, 723, 724, 726]</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>12</th>\n",
|
||
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
|
||
" <td>29</td>\n",
|
||
" <td>511</td>\n",
|
||
" <td>[722, 723, 724, 725, 726]</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>844</th>\n",
|
||
" <td>Filterabreinigung AGT 1 : das erste Ventil von...</td>\n",
|
||
" <td>68</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>[0]</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>843</th>\n",
|
||
" <td>Abnahmeprüfung durch Sachkundigen</td>\n",
|
||
" <td>33</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>[1245]</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>842</th>\n",
|
||
" <td>Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung</td>\n",
|
||
" <td>47</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>[1326]</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>841</th>\n",
|
||
" <td>Ausgeführt</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>[2365]</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2183</th>\n",
|
||
" <td>Antrieb neu Dichten. Liegt auf Werkbank</td>\n",
|
||
" <td>39</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>[0]</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>1735 rows × 5 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" descr len num_occur \\\n",
|
||
"14 Bestimmen des Prüftermins für elektrische Arbe... 527 3081 \n",
|
||
"16 VDE Prüfung 11 2201 \n",
|
||
"4 Wartung nach Arbeitsplan 24 1091 \n",
|
||
"7 Manuelle Dosierung des Biozids 30 526 \n",
|
||
"12 Mikrobiologie(Abklatsch-Test) 29 511 \n",
|
||
"... ... ... ... \n",
|
||
"844 Filterabreinigung AGT 1 : das erste Ventil von... 68 1 \n",
|
||
"843 Abnahmeprüfung durch Sachkundigen 33 1 \n",
|
||
"842 Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung 47 1 \n",
|
||
"841 Ausgeführt 10 1 \n",
|
||
"2183 Antrieb neu Dichten. Liegt auf Werkbank 39 1 \n",
|
||
"\n",
|
||
" assoc_obj_ids num_assoc_obj_ids \n",
|
||
"14 [404, 405, 406, 407, 408, 409, 410, 411, 412, ... 1724 \n",
|
||
"16 [404, 407, 408, 409, 410, 411, 412, 413, 414, ... 1203 \n",
|
||
"4 [726, 798, 800, 801, 802, 921, 922, 923, 924, ... 219 \n",
|
||
"7 [0, 722, 723, 724, 726] 5 \n",
|
||
"12 [722, 723, 724, 725, 726] 5 \n",
|
||
"... ... ... \n",
|
||
"844 [0] 1 \n",
|
||
"843 [1245] 1 \n",
|
||
"842 [1326] 1 \n",
|
||
"841 [2365] 1 \n",
|
||
"2183 [0] 1 \n",
|
||
"\n",
|
||
"[1735 rows x 5 columns]"
|
||
]
|
||
},
|
||
"execution_count": 49,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"temp2"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 50,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"SAVE_PATH_REMOVED_DUPL = f'./02_1_Preprocess1/{DATA_SET_ID}_03_dataset_remov_dupl_similar_whole.pkl'\n",
|
||
"temp2.to_pickle(SAVE_PATH_REMOVED_DUPL)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"- Handling von Rechtschreibfehlern (Hunspell über PyEnchant)\n",
|
||
"- Handling von Vector-Embeddings über Transformer-Modelle:\n",
|
||
" - höhere Fehlertoleranz (Rechtschreibung, redundante oder unbedeutende Worte)\n",
|
||
" - nicht angewiesen, dass jedes Wort im Vocabulary vorkommt (vgl. spaCy-Modell)\n",
|
||
" - bei ersten Versuchen höhere Genauigkeit bei der Erkennung tatsächlicher Duplikate\n",
|
||
"- Nutzung Vector-Embeddings für Duplikatfindung"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### ---> Model Training: Data Set"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 44,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# data for model training\n",
|
||
"data = temp1.iloc[50:300,0].to_list()\n",
|
||
"data = [e for e in data if e != '']\n",
|
||
"\n",
|
||
"with open('spacy_train/training_data_2.txt','w', encoding='utf-8') as f:\n",
|
||
" f.writelines(\"\\n\".join(data))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### spaCy"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 245,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Durchführung: Sollwert: 20 0,1g'"
|
||
]
|
||
},
|
||
"execution_count": 245,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"string = temp1.iloc[-2,0]\n",
|
||
"#string = temp1.iloc[0,0]\n",
|
||
"string"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 246,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"string = 'Ich spiele jeden Tag mit den Kindern im Garten. Das ist schön.'\n",
|
||
"string = 'Die Maschine XYZ ist aufgrund einer Störung im Druckluftsystem defekt.'\n",
|
||
"#string = 'The machine XYZ is broken because of a failure in the air pressure system.'\n",
|
||
"#string = 'Wir benötigen das Werkzeug von Herr Stöppel, um das derzeit abzuarbeiten.Dies wird durch Herrn Strebe getan.'"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 247,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"doc = nlp(string)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 131,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# simulate occurence counter\n",
|
||
"OCC_COUNTER = 10"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 51,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"SPELL_CHECK_NON_CHARS = set([' ', '.', ',', ';', ':', '-'])\n",
|
||
"CLEANING = True\n",
|
||
"#CLEANING = False\n",
|
||
"\n",
|
||
"def pre_clean_word(string: str) -> str:\n",
|
||
" \n",
|
||
" pattern = r'[^A-Za-zäöüÄÖÜ]+'\n",
|
||
" string = re.sub(pattern, '', string)\n",
|
||
" \"\"\"\n",
|
||
" for char in SPELL_CHECK_NON_CHARS:\n",
|
||
" string = string.replace(char, '')\n",
|
||
" \"\"\"\n",
|
||
" \n",
|
||
" return string\n",
|
||
"\n",
|
||
"# https://stackoverflow.com/questions/25341945/check-if-string-has-date-any-format \n",
|
||
"def is_str_date(string, fuzzy=False):\n",
|
||
" \n",
|
||
" try:\n",
|
||
" parse(string, fuzzy=fuzzy)\n",
|
||
" return True\n",
|
||
" except ValueError:\n",
|
||
" return False\n",
|
||
"\n",
|
||
"\n",
|
||
"def obtain_sub_tree(token):\n",
|
||
" # check if token is a POS of interest\n",
|
||
" descendants = list(token.subtree)\n",
|
||
" descendants.remove(token)\n",
|
||
" logger.debug(f'Token >>{token}<< has subtree >>{descendants}<<')\n",
|
||
" return descendants\n",
|
||
"\n",
|
||
"\n",
|
||
"def add_children_descendants(\n",
|
||
" parent,\n",
|
||
" weight,\n",
|
||
" connections,\n",
|
||
" unique_tokens,\n",
|
||
" children_sents,\n",
|
||
" map_2_word: dict[str, str] | None = None,\n",
|
||
"):\n",
|
||
" # add child as key\n",
|
||
" if CLEANING:\n",
|
||
" parent_lemma = pre_clean_word(string=parent.lemma_)\n",
|
||
" \n",
|
||
" # map words\n",
|
||
" if map_2_word is not None:\n",
|
||
" if parent_lemma.lower() in map_2_word:\n",
|
||
" parent_lemma = map_2_word[parent_lemma.lower()]\n",
|
||
" #logger.info(f\"[SUCCESS] Mapped PARENT to {parent_lemma}\")\n",
|
||
" \n",
|
||
" if parent_lemma != '':\n",
|
||
" if (parent_lemma, parent.pos_) in connections:\n",
|
||
" connections[(parent_lemma, parent.pos_)].append(children_sents)\n",
|
||
" connections[(parent_lemma, parent.pos_)].append(children_sents)\n",
|
||
" #connections[parent.lemma_].append([descendant.lemma_, descendant])\n",
|
||
" else:\n",
|
||
" # do not add auxiliary words\n",
|
||
" if parent.pos_ != 'AUX':\n",
|
||
" unique_tokens.add(parent_lemma)\n",
|
||
" connections[(parent_lemma, parent.pos_)] = list()\n",
|
||
" connections[(parent_lemma, parent.pos_)].append(children_sents)\n",
|
||
" #connections[parent.lemma_].append([descendant.lemma_, descendant])\n",
|
||
" else:\n",
|
||
" if (parent.lemma_, parent.pos_) in connections:\n",
|
||
" connections[(parent.lemma_, parent.pos_)].append(children_sents)\n",
|
||
" connections[(parent.lemma_, parent.pos_)].append(children_sents)\n",
|
||
" #connections[parent.lemma_].append([descendant.lemma_, descendant])\n",
|
||
" else:\n",
|
||
" # do not add auxiliary words\n",
|
||
" if parent.pos_ != 'AUX':\n",
|
||
" unique_tokens.add(parent.lemma_)\n",
|
||
" connections[(parent.lemma_, parent.pos_)] = list()\n",
|
||
" connections[(parent.lemma_, parent.pos_)].append(children_sents)\n",
|
||
" #connections[parent.lemma_].append([descendant.lemma_, descendant])\n",
|
||
"\n",
|
||
"\n",
|
||
"def obtain_descendant_info(\n",
|
||
" doc,\n",
|
||
" weight,\n",
|
||
" POS_of_interest,\n",
|
||
" TAG_of_interest,\n",
|
||
" connections,\n",
|
||
" unique_tokens,\n",
|
||
" map_2_word: dict[str, str] | None = None,\n",
|
||
"):\n",
|
||
" \n",
|
||
" # iterate over sentences\n",
|
||
" for sent in doc.sents:\n",
|
||
" \n",
|
||
" # iterate over tokens in one sentence\n",
|
||
" for token in sent:\n",
|
||
" \n",
|
||
" if not (token.pos_ in POS_of_interest or token.tag_ in TAG_of_interest):\n",
|
||
" continue\n",
|
||
" elif token.lemma_.lower() in GENERAL_BLACKLIST:\n",
|
||
" logger.debug(f'Eliminated parent >>{token}<< because of blacklist')\n",
|
||
" continue\n",
|
||
" \n",
|
||
" descendants = obtain_sub_tree(token=token)\n",
|
||
" \n",
|
||
" # iterate over all children if there are any\n",
|
||
" if descendants is not None:\n",
|
||
" # list with all children in the current sentence\n",
|
||
" children_sents = list()\n",
|
||
" \n",
|
||
" for child in descendants:\n",
|
||
" logger.debug(f'Token is >>{token}<< with child >>{child}<< and POS {child.pos_}')\n",
|
||
" \n",
|
||
" # elimnate cases of cross-references with verbs\n",
|
||
" if ((token.pos_ == 'AUX' or token.pos_ == 'VERB') and\n",
|
||
" (child.pos_ == 'AUX' or child.pos_ == 'VERB')):\n",
|
||
" continue\n",
|
||
" elif not (child.pos_ in POS_of_interest or child.tag_ in TAG_of_interest):\n",
|
||
" continue\n",
|
||
" elif child.lemma_.lower() in GENERAL_BLACKLIST:\n",
|
||
" logger.debug(f'Eliminated child >>{child}<< because of blacklist')\n",
|
||
" continue\n",
|
||
" \n",
|
||
" \n",
|
||
" if CLEANING:\n",
|
||
" child = pre_clean_word(string=child.lemma_)\n",
|
||
" if child == '':\n",
|
||
" continue\n",
|
||
" #child = pre_clean_word(string=child)\n",
|
||
" \n",
|
||
" if (child not in DESC_BLACKLIST and\n",
|
||
" not is_str_date(string=child)):\n",
|
||
" #not is_str_date(string=child.text)):\n",
|
||
" #children_sents.append((child.lemma_, weight))\n",
|
||
" \n",
|
||
" # map words\n",
|
||
" if map_2_word is not None:\n",
|
||
" if child.lower() in map_2_word:\n",
|
||
" child = map_2_word[child.lower()]\n",
|
||
" #logger.info(f\"[SUCCESS] Mapped CHILD to {child}\")\n",
|
||
" \n",
|
||
" children_sents.append((child, weight))\n",
|
||
" \n",
|
||
" #if child.lemma_ not in unique_tokens:\n",
|
||
" if (child not in unique_tokens and\n",
|
||
" not is_str_date(string=child)):\n",
|
||
" #unique_tokens.add(child.lemma_)\n",
|
||
" unique_tokens.add(child)\n",
|
||
" \n",
|
||
" else:\n",
|
||
" if (child.lemma_ not in DESC_BLACKLIST and\n",
|
||
" not is_str_date(string=child.text)):\n",
|
||
" children_sents.append((child.lemma_, weight))\n",
|
||
" \n",
|
||
" if child.lemma_ not in unique_tokens:\n",
|
||
" unique_tokens.add(child.lemma_)\n",
|
||
" \n",
|
||
" # add list of children for current parent if not empty\n",
|
||
" if children_sents:\n",
|
||
" \n",
|
||
" add_children_descendants(\n",
|
||
" parent=token,\n",
|
||
" weight=weight,\n",
|
||
" connections=connections,\n",
|
||
" unique_tokens=unique_tokens,\n",
|
||
" children_sents=children_sents,\n",
|
||
" map_2_word=map_2_word,\n",
|
||
" )"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 52,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def obtain_adj_matrix(unique_tokens, connections):\n",
|
||
"\n",
|
||
" adj_mat = pd.DataFrame(\n",
|
||
" data=0, \n",
|
||
" columns=list(unique_tokens), \n",
|
||
" index=list(unique_tokens),\n",
|
||
" dtype=np.uint32,\n",
|
||
" )\n",
|
||
" \n",
|
||
" for (pred, POS), descendants_list in connections.items():\n",
|
||
" #print(f'{pred=}, {descendants=}')\n",
|
||
" \n",
|
||
" for descendants in descendants_list:\n",
|
||
" #print(f'{descendants}')\n",
|
||
" \n",
|
||
" if POS not in POS_INDIRECT:\n",
|
||
" for (desc, weight) in descendants:\n",
|
||
" adj_mat.at[pred, desc] += weight\n",
|
||
" \n",
|
||
" else:\n",
|
||
" if len(descendants) > 1:\n",
|
||
" # if auxiliary word, make connection between all associated words\n",
|
||
" combs = combinations(descendants, r=2)\n",
|
||
" \n",
|
||
" for comb in combs:\n",
|
||
" # comb is tuple ((word_1, weight), (word_2, weight))\n",
|
||
" weight = comb[0][1]\n",
|
||
" word_1 = comb[0][0]\n",
|
||
" word_2 = comb[1][0]\n",
|
||
" \n",
|
||
" \"\"\"\n",
|
||
" if ((word_1 == 'Eigenverantwortlichkeit' or word_1 == 'neu') and\n",
|
||
" (word_2 == 'Eigenverantwortlichkeit' or word_2 == 'neu')):\n",
|
||
" print(f'Hello from {pred=} with {descendants=}')\n",
|
||
" \"\"\"\n",
|
||
" \n",
|
||
" adj_mat.at[word_1, word_2] += weight\n",
|
||
" \n",
|
||
" return adj_mat\n",
|
||
"\n",
|
||
"\n",
|
||
"def make_undir_adj_matrix(adj_mat):\n",
|
||
" \n",
|
||
" adj_mat_undir = adj_mat.copy()\n",
|
||
" arr = adj_mat_undir.to_numpy()\n",
|
||
" arr_upper = np.triu(arr)\n",
|
||
" arr_lower = np.tril(arr)\n",
|
||
" arr_lower = np.rot90(np.fliplr(arr_lower))\n",
|
||
" arr_new = arr_lower + arr_upper\n",
|
||
" \n",
|
||
" adj_mat_undir.loc[:] = arr_new\n",
|
||
" \n",
|
||
" return adj_mat_undir"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### Gesamter Datensatz"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 61,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"SKIP = False\n",
|
||
"\n",
|
||
"SAVE_PATH_REMOVED_DUPL = f'./02_1_Preprocess1/{DATA_SET_ID}_03_dataset_remov_dupl_similar_whole.pkl'\n",
|
||
"if not SKIP:\n",
|
||
" temp2 = pd.read_pickle(SAVE_PATH_REMOVED_DUPL)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 62,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>descr</th>\n",
|
||
" <th>len</th>\n",
|
||
" <th>num_occur</th>\n",
|
||
" <th>assoc_obj_ids</th>\n",
|
||
" <th>num_assoc_obj_ids</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>14</th>\n",
|
||
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
|
||
" <td>527</td>\n",
|
||
" <td>3081</td>\n",
|
||
" <td>[404, 405, 406, 407, 408, 409, 410, 411, 412, ...</td>\n",
|
||
" <td>1724</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>16</th>\n",
|
||
" <td>VDE Prüfung</td>\n",
|
||
" <td>11</td>\n",
|
||
" <td>2201</td>\n",
|
||
" <td>[404, 407, 408, 409, 410, 411, 412, 413, 414, ...</td>\n",
|
||
" <td>1203</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>Wartung nach Arbeitsplan</td>\n",
|
||
" <td>24</td>\n",
|
||
" <td>1091</td>\n",
|
||
" <td>[726, 798, 800, 801, 802, 921, 922, 923, 924, ...</td>\n",
|
||
" <td>219</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>Manuelle Dosierung des Biozids</td>\n",
|
||
" <td>30</td>\n",
|
||
" <td>526</td>\n",
|
||
" <td>[0, 722, 723, 724, 726]</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>12</th>\n",
|
||
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
|
||
" <td>29</td>\n",
|
||
" <td>511</td>\n",
|
||
" <td>[722, 723, 724, 725, 726]</td>\n",
|
||
" <td>5</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" descr len num_occur \\\n",
|
||
"14 Bestimmen des Prüftermins für elektrische Arbe... 527 3081 \n",
|
||
"16 VDE Prüfung 11 2201 \n",
|
||
"4 Wartung nach Arbeitsplan 24 1091 \n",
|
||
"7 Manuelle Dosierung des Biozids 30 526 \n",
|
||
"12 Mikrobiologie(Abklatsch-Test) 29 511 \n",
|
||
"\n",
|
||
" assoc_obj_ids num_assoc_obj_ids \n",
|
||
"14 [404, 405, 406, 407, 408, 409, 410, 411, 412, ... 1724 \n",
|
||
"16 [404, 407, 408, 409, 410, 411, 412, 413, 414, ... 1203 \n",
|
||
"4 [726, 798, 800, 801, 802, 921, 922, 923, 924, ... 219 \n",
|
||
"7 [0, 722, 723, 724, 726] 5 \n",
|
||
"12 [722, 723, 724, 725, 726] 5 "
|
||
]
|
||
},
|
||
"execution_count": 62,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"temp2.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 63,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# analysiere erste 10 Einträge\n",
|
||
"#descr = temp1[['descr', 'num_occur']]\n",
|
||
"descr = temp2[['descr', 'num_occur']]\n",
|
||
"#descr = descr.iloc[:7,:]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 64,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"#descr.iat[0,0] = 'Das ist ein Test am 24.08.2023'"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 65,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"1735"
|
||
]
|
||
},
|
||
"execution_count": 65,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"len(descr)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 66,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>descr</th>\n",
|
||
" <th>num_occur</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>14</th>\n",
|
||
" <td>Bestimmen des Prüftermins für elektrische Arbe...</td>\n",
|
||
" <td>3081</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>16</th>\n",
|
||
" <td>VDE Prüfung</td>\n",
|
||
" <td>2201</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>Wartung nach Arbeitsplan</td>\n",
|
||
" <td>1091</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>Manuelle Dosierung des Biozids</td>\n",
|
||
" <td>526</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>12</th>\n",
|
||
" <td>Mikrobiologie(Abklatsch-Test)</td>\n",
|
||
" <td>511</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>844</th>\n",
|
||
" <td>Filterabreinigung AGT 1 : das erste Ventil von...</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>843</th>\n",
|
||
" <td>Abnahmeprüfung durch Sachkundigen</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>842</th>\n",
|
||
" <td>Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>841</th>\n",
|
||
" <td>Ausgeführt</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2183</th>\n",
|
||
" <td>Antrieb neu Dichten. Liegt auf Werkbank</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>1735 rows × 2 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" descr num_occur\n",
|
||
"14 Bestimmen des Prüftermins für elektrische Arbe... 3081\n",
|
||
"16 VDE Prüfung 2201\n",
|
||
"4 Wartung nach Arbeitsplan 1091\n",
|
||
"7 Manuelle Dosierung des Biozids 526\n",
|
||
"12 Mikrobiologie(Abklatsch-Test) 511\n",
|
||
"... ... ...\n",
|
||
"844 Filterabreinigung AGT 1 : das erste Ventil von... 1\n",
|
||
"843 Abnahmeprüfung durch Sachkundigen 1\n",
|
||
"842 Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung 1\n",
|
||
"841 Ausgeführt 1\n",
|
||
"2183 Antrieb neu Dichten. Liegt auf Werkbank 1\n",
|
||
"\n",
|
||
"[1735 rows x 2 columns]"
|
||
]
|
||
},
|
||
"execution_count": 66,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"descr"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 67,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"#LOAD_CALC_FILES = True\n",
|
||
"#LOAD_CALC_FILES = False\n",
|
||
"#IS_TEST = True\n",
|
||
"IS_TEST = False"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"**Discovered groups** (German terms as they appear in the data)\n",
|
||
"- Prüfung:\n",
|
||
" - Prüfen\n",
|
||
" - Sichtprüfung\n",
|
||
" - Überprüfung / überprüfen\n",
|
||
" - Kontrolle / kontrollieren\n",
|
||
" - sicherstellen / Sicherstellung\n",
|
||
" - Wartung / warten\n",
|
||
" - Reinigung / reinigen\n",
|
||
" - Prüfbericht\n",
|
||
"- Handlung:\n",
|
||
" - Schmierung\n",
|
||
" - schmieren\n",
|
||
" - reinigen\n",
|
||
" - Reinigung\n",
|
||
" - schneiden / nachschneiden\n",
|
||
"- zyklisch:\n",
|
||
" - täglich\n",
|
||
" - wöchentlich\n",
|
||
" - monatlich\n",
|
||
" - jährlich\n",
|
||
"- Datum:\n",
|
||
" - Uhr\n",
|
||
" - Montag, Dienstag, Mittwoch, Donnerstag, Freitag, Samstag, Sonntag\n",
|
||
"- Kleinteile:\n",
|
||
" - Schraube\n",
|
||
" - Adapter\n",
|
||
" - Halterung\n",
|
||
" - Scheibe\n",
|
||
" - Gewinde\n",
|
||
" - Ventil\n",
|
||
" - Schalter\n",
|
||
" - Befestigungsschraube\n",
|
||
"- Komponenten:\n",
|
||
" - Kupplung\n",
|
||
" - Motor\n",
|
||
" - Getriebe\n",
|
||
" - Ventilator\n",
|
||
" - Zahnriemen\n",
"    - Transformator\n",
|
||
" - Filterelement\n",
|
||
" - Dosierpumpe\n",
|
||
" - Luftschlauch\n",
|
||
" - Dichtung\n",
|
||
" - Filter\n",
|
||
" - Scharnier\n",
|
||
" - Spannrolle\n",
|
||
" - Druckluftbehälter\n",
|
||
" - Kette\n",
|
||
" - Anschlüsse\n",
|
||
" - Schläuche\n",
|
||
" - Beleuchtung\n",
|
||
"- Elektrik:\n",
|
||
" - Zuleitung\n",
|
||
" - Kabel\n",
|
||
" - Steckdose\n",
|
||
" - Elektriker\n",
|
||
" - Elektronik\n",
|
||
" - elektrisch\n",
|
||
" - Sicherheitsbeleuchtung\n",
|
||
"- Anlagen:\n",
|
||
" - Mischanlage\n",
|
||
" - Maschine\n",
|
||
" - Wasserenthärtungsanlage\n",
|
||
" - Lüftungsanlage\n",
|
||
" - Klimaanlage\n",
|
||
"- Vereinbarung:\n",
|
||
" - Wartungsvertrag\n",
|
||
" - Neuvertrag\n",
|
||
" - Vertrag\n",
|
||
" - terminieren / terminiert\n",
|
||
" - Absprache\n",
|
||
" - melden\n",
|
||
" - telefonisch\n",
|
||
" - mitteilen\n",
|
||
"- Störbild:\n",
|
||
" - defekt\n",
|
||
" - kaputt\n",
|
||
" - Geräusch\n",
|
||
" - undicht\n",
|
||
" - leckt\n",
|
||
" - Dichtigkeit\n",
|
||
"- Abteilung:\n",
|
||
" - Buchhaltung\n",
|
||
" - Betriebstechnik\n",
|
||
" - Entwicklung\n",
|
||
"- Ort:\n",
|
||
" - Kesselhaus\n",
|
||
" - Durchfahrt\n",
|
||
" - Dach\n",
|
||
" - Haupteingang\n",
|
||
" - Werkbank\n",
|
||
" - Schlosserei"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 68,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"word_2_map = {\n",
|
||
" 'Prüfung': ['prüfen', 'sichtprüfung', 'überprüfung', 'überprüfen',\n",
|
||
" 'kontrolle', 'kontrollieren', 'sicherstellen', 'sicherstellung',\n",
|
||
" 'reinigung', 'reinigen', 'prüfbericht', 'sichtkontrolle',\n",
|
||
" 'rundgang', 'technikrundgang'],\n",
|
||
" 'Wartung': ['wartung', 'warten', 'wartungstätigkeit', 'wartungsarbeit',\n",
|
||
" 'wartungsplan'],\n",
|
||
" 'Handlung': ['schmierung', 'schmieren', 'reinigen', 'reinigung',\n",
|
||
" 'schneiden', 'nachschneiden'],\n",
|
||
" 'zyklisch': ['täglich', 'tägliche', 'täglicher', 'wöchentlich', 'wöchentliche', 'monatlich', 'jährlich',\n",
|
||
" 'halbjährlich', 'monatliche', 'wartungsintervall'],\n",
|
||
" 'Datum': ['uhr', 'montag', 'dienstag', 'mittwoch', 'donnerstag',\n",
|
||
" 'freitag', 'samstag', 'sonntag'],\n",
|
||
" 'Kleinteile': ['schraube', 'adapter', 'halterung', 'scheibe', 'gewinde',\n",
|
||
" 'ventil', 'schalter', 'befestigungsschraube'],\n",
|
||
" 'Komponenten': ['kupplung', 'motor', 'getriebe', 'ventilator',\n",
|
||
" 'zahnriemen', 'transformator', 'filterelement',\n",
|
||
" 'dosierpumpe', 'luftschlauch', 'dichtung', 'filter',\n",
|
||
" 'scharnier', 'spannrolle', 'druckluftbehälter', 'kette',\n",
|
||
" 'anschlüsse', 'anschluss', 'schläuche', 'schlauch', 'beleuchtung'],\n",
|
||
" 'Elektrik': ['zuleitung', 'kabel', 'steckdose', 'elektriker',\n",
|
||
" 'elektronik', 'elektrisch', 'sicherheitsbeleuchtung'],\n",
|
||
" 'Anlagen': ['anlage', 'mischanlage', 'maschine', 'klimaanlage', 'filteranlage',\n",
|
||
" 'wasserenthärtungsanlage', 'lüftungsanlage', 'wasseraufbereitungsanlage'],\n",
"    'Vereinbarung': ['wartungsvertrag', 'neuvertrag', 'vertrag', 'terminieren',\n",
|
||
" 'terminiert', 'absprache', 'melden', 'telefonisch', 'mitteilen'],\n",
|
||
" 'Störbild': ['defekt', 'kaputt', 'geräusch', 'undicht', 'leckt', 'dichtigkeit'],\n",
|
||
" 'Abteilung': ['buchhaltung', 'betriebstechnik', 'entwicklung'],\n",
|
||
" 'Ort': ['kesselhaus', 'durchfahrt', 'dach', \n",
|
||
" 'haupteingang', 'werkbank', 'schlosserei'],\n",
|
||
"}"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"- Question: Is there a way to classify terms automatically?\n",
"    - e.g. automatically flag whether a term denotes a component or not"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 69,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"# invert word_2_map: token -> group name\n",
"# note: a token listed in several groups keeps the group that is iterated last\n",
"map_2_word = dict()\n",
"\n",
"for key, word_list in word_2_map.items():\n",
"    for word in word_list:\n",
"        map_2_word[word] = key"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 70,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"IS_TEST = False\n",
|
||
"LOAD_CALC_FILES = False"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 71,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"1735"
|
||
]
|
||
},
|
||
"execution_count": 71,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"len(descr)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 72,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"INFO:base:Number of entries processed: 1, Percent completed: 0.06\n",
|
||
"INFO:base:Number of entries processed: 501, Percent completed: 28.88\n",
|
||
"INFO:base:Number of entries processed: 1001, Percent completed: 57.69\n",
|
||
"INFO:base:Number of entries processed: 1501, Percent completed: 86.51\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# adjacency matrix\n",
|
||
"connections = dict()\n",
|
||
"unique_tokens = set()\n",
|
||
"UPDATE_STATUS = 500\n",
|
||
"length_data = len(descr)\n",
|
||
"\n",
|
||
"if not LOAD_CALC_FILES or IS_TEST:\n",
|
||
" for count, description in enumerate(descr.iterrows()):\n",
|
||
" \n",
|
||
" text = description[1]['descr']\n",
|
||
" weight = description[1]['num_occur']\n",
|
||
" \n",
|
||
" doc = nlp(text)\n",
|
||
" \n",
|
||
" obtain_descendant_info(\n",
|
||
" doc=doc,\n",
|
||
" weight=weight,\n",
|
||
" POS_of_interest=POS_of_interest,\n",
|
||
" TAG_of_interest=TAG_of_interest,\n",
|
||
" connections=connections,\n",
|
||
" unique_tokens=unique_tokens,\n",
|
||
" map_2_word=None,\n",
|
||
" )\n",
|
||
" \n",
|
||
" if count % UPDATE_STATUS == 0:\n",
|
||
" logger.info(f'Number of entries processed: {count+1}, Percent completed: {((count+1) / length_data) * 100:.2f}')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 73,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"adj_mat = obtain_adj_matrix(\n",
|
||
" unique_tokens=unique_tokens, \n",
|
||
" connections=connections\n",
|
||
")\n",
|
||
"adj_mat_undir = make_undir_adj_matrix(adj_mat=adj_mat)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 74,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"SAVE_PATH_UNI_TOKENS = f'./02_1_Preprocess1/{DATA_SET_ID}_04_1_unique_tokens.pkl'\n",
|
||
"SAVE_PATH_CONNECTIONS = f'./02_1_Preprocess1/{DATA_SET_ID}_04_1_connections.pkl'\n",
|
||
"SAVE_PATH_ADJ_DF = f'./02_1_Preprocess1/{DATA_SET_ID}_04_2_adj_mat_df.parquet'\n",
|
||
"SAVE_PATH_ADJ_DF_UNDIR = f'./02_1_Preprocess1/{DATA_SET_ID}_04_2_adj_mat_df_undir.parquet'\n",
|
||
"if not IS_TEST:\n",
|
||
" if LOAD_CALC_FILES:\n",
"        connections = load_pickle(SAVE_PATH_CONNECTIONS)\n",
"        unique_tokens = load_pickle(SAVE_PATH_UNI_TOKENS)\n",
"        adj_mat = pd.read_parquet(SAVE_PATH_ADJ_DF)\n",
"        adj_mat_undir = pd.read_parquet(SAVE_PATH_ADJ_DF_UNDIR)\n",
|
||
" else:\n",
|
||
" adj_mat.to_parquet(SAVE_PATH_ADJ_DF)\n",
|
||
" adj_mat_undir.to_parquet(SAVE_PATH_ADJ_DF_UNDIR)\n",
|
||
" save_pickle(obj=connections, path=SAVE_PATH_CONNECTIONS)\n",
|
||
" save_pickle(obj=unique_tokens, path=SAVE_PATH_UNI_TOKENS)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 75,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Dampf</th>\n",
|
||
" <th>Riss</th>\n",
|
||
" <th>Förderleistung</th>\n",
|
||
" <th>festlegen</th>\n",
|
||
" <th>weie</th>\n",
|
||
" <th>reperatur</th>\n",
|
||
" <th>Edelstahlblech</th>\n",
|
||
" <th>Kidde</th>\n",
|
||
" <th>Anlagenstillstand</th>\n",
|
||
" <th>Füllstandssonde</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>Kaltwasserhähne</th>\n",
|
||
" <th>Andreas</th>\n",
|
||
" <th>Haltebügel</th>\n",
|
||
" <th>Sicherheitsschalter</th>\n",
|
||
" <th>Tränkle</th>\n",
|
||
" <th>Fall</th>\n",
|
||
" <th>Zusatzstoff</th>\n",
|
||
" <th>Gelenk</th>\n",
|
||
" <th>trocknen</th>\n",
|
||
" <th>Kilo</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>AB</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>ABIC</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>AGT</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>AGipsTechnikRZB</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>AKU</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>überfähren</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>überholen</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>überprüfen</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>übertragen</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>überwachen</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>2337 rows × 2337 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Dampf Riss Förderleistung festlegen weie reperatur \\\n",
|
||
"AB 0 0 0 0 0 0 \n",
|
||
"ABIC 0 0 0 0 0 0 \n",
|
||
"AGT 0 0 0 0 0 0 \n",
|
||
"AGipsTechnikRZB 0 0 0 0 0 0 \n",
|
||
"AKU 0 0 0 0 0 0 \n",
|
||
"... ... ... ... ... ... ... \n",
|
||
"überfähren 0 0 0 0 0 0 \n",
|
||
"überholen 0 0 0 0 0 0 \n",
|
||
"überprüfen 0 0 0 0 0 0 \n",
|
||
"übertragen 0 0 0 0 0 0 \n",
|
||
"überwachen 0 0 0 0 0 0 \n",
|
||
"\n",
|
||
" Edelstahlblech Kidde Anlagenstillstand Füllstandssonde \\\n",
|
||
"AB 0 0 0 0 \n",
|
||
"ABIC 0 0 0 0 \n",
|
||
"AGT 0 0 0 0 \n",
|
||
"AGipsTechnikRZB 0 0 0 0 \n",
|
||
"AKU 0 0 0 0 \n",
|
||
"... ... ... ... ... \n",
|
||
"überfähren 0 0 0 0 \n",
|
||
"überholen 0 0 0 0 \n",
|
||
"überprüfen 0 0 0 0 \n",
|
||
"übertragen 0 0 0 0 \n",
|
||
"überwachen 0 0 0 0 \n",
|
||
"\n",
|
||
" ... Kaltwasserhähne Andreas Haltebügel \\\n",
|
||
"AB ... 0 0 0 \n",
|
||
"ABIC ... 0 0 0 \n",
|
||
"AGT ... 0 0 0 \n",
|
||
"AGipsTechnikRZB ... 0 0 0 \n",
|
||
"AKU ... 0 0 0 \n",
|
||
"... ... ... ... ... \n",
|
||
"überfähren ... 0 0 0 \n",
|
||
"überholen ... 0 0 0 \n",
|
||
"überprüfen ... 0 0 0 \n",
|
||
"übertragen ... 0 0 0 \n",
|
||
"überwachen ... 0 0 0 \n",
|
||
"\n",
|
||
" Sicherheitsschalter Tränkle Fall Zusatzstoff Gelenk \\\n",
|
||
"AB 0 0 0 1 0 \n",
|
||
"ABIC 0 0 0 0 0 \n",
|
||
"AGT 0 0 0 0 0 \n",
|
||
"AGipsTechnikRZB 0 0 0 0 0 \n",
|
||
"AKU 0 0 0 0 0 \n",
|
||
"... ... ... ... ... ... \n",
|
||
"überfähren 0 0 0 0 0 \n",
|
||
"überholen 0 0 0 0 0 \n",
|
||
"überprüfen 0 0 0 0 0 \n",
|
||
"übertragen 0 0 0 0 0 \n",
|
||
"überwachen 0 0 0 0 0 \n",
|
||
"\n",
|
||
" trocknen Kilo \n",
|
||
"AB 0 0 \n",
|
||
"ABIC 0 0 \n",
|
||
"AGT 0 0 \n",
|
||
"AGipsTechnikRZB 0 0 \n",
|
||
"AKU 0 0 \n",
|
||
"... ... ... \n",
|
||
"überfähren 0 0 \n",
|
||
"überholen 0 0 \n",
|
||
"überprüfen 0 0 \n",
|
||
"übertragen 0 0 \n",
|
||
"überwachen 0 0 \n",
|
||
"\n",
|
||
"[2337 rows x 2337 columns]"
|
||
]
|
||
},
|
||
"execution_count": 75,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"adj_mat_undir.sort_index()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 76,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'AGipsTechnikRZB'"
|
||
]
|
||
},
|
||
"execution_count": 76,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"ret = adj_mat_undir.sort_index().index[3]\n",
|
||
"ret"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 77,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"False"
|
||
]
|
||
},
|
||
"execution_count": 77,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"is_str_date(ret)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 78,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"12"
|
||
]
|
||
},
|
||
"execution_count": 78,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"adj_mat_undir.loc[ret,:].sum()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Threshold: set all edge weights below `WEIGHT_THRESHOLD` to zero"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 88,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"WEIGHT_THRESHOLD = 120\n",
|
||
"arr = adj_mat_undir.to_numpy()\n",
|
||
"arr = np.where(arr < WEIGHT_THRESHOLD, 0, arr)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 89,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"190"
|
||
]
|
||
},
|
||
"execution_count": 89,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"np.count_nonzero(arr)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 90,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"70"
|
||
]
|
||
},
|
||
"execution_count": 90,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"temp = np.sum(arr, axis=0)\n",
|
||
"np.count_nonzero(temp)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 91,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"thresh_adj_mat = adj_mat_undir.copy()\n",
|
||
"thresh_adj_mat.loc[:] = arr"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 92,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Dampf</th>\n",
|
||
" <th>Riss</th>\n",
|
||
" <th>Förderleistung</th>\n",
|
||
" <th>festlegen</th>\n",
|
||
" <th>weie</th>\n",
|
||
" <th>reperatur</th>\n",
|
||
" <th>Edelstahlblech</th>\n",
|
||
" <th>Kidde</th>\n",
|
||
" <th>Anlagenstillstand</th>\n",
|
||
" <th>Füllstandssonde</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>Kaltwasserhähne</th>\n",
|
||
" <th>Andreas</th>\n",
|
||
" <th>Haltebügel</th>\n",
|
||
" <th>Sicherheitsschalter</th>\n",
|
||
" <th>Tränkle</th>\n",
|
||
" <th>Fall</th>\n",
|
||
" <th>Zusatzstoff</th>\n",
|
||
" <th>Gelenk</th>\n",
|
||
" <th>trocknen</th>\n",
|
||
" <th>Kilo</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>Dampf</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Riss</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Förderleistung</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>festlegen</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>weie</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Fall</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Zusatzstoff</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Gelenk</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>trocknen</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Kilo</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>2337 rows × 2337 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Dampf Riss Förderleistung festlegen weie reperatur \\\n",
|
||
"Dampf 0 0 0 0 0 0 \n",
|
||
"Riss 0 0 0 0 0 0 \n",
|
||
"Förderleistung 0 0 0 0 0 0 \n",
|
||
"festlegen 0 0 0 0 0 0 \n",
|
||
"weie 0 0 0 0 0 0 \n",
|
||
"... ... ... ... ... ... ... \n",
|
||
"Fall 0 0 0 0 0 0 \n",
|
||
"Zusatzstoff 0 0 0 0 0 0 \n",
|
||
"Gelenk 0 0 0 0 0 0 \n",
|
||
"trocknen 0 0 0 0 0 0 \n",
|
||
"Kilo 0 0 0 0 0 0 \n",
|
||
"\n",
|
||
" Edelstahlblech Kidde Anlagenstillstand Füllstandssonde \\\n",
|
||
"Dampf 0 0 0 0 \n",
|
||
"Riss 0 0 0 0 \n",
|
||
"Förderleistung 0 0 0 0 \n",
|
||
"festlegen 0 0 0 0 \n",
|
||
"weie 0 0 0 0 \n",
|
||
"... ... ... ... ... \n",
|
||
"Fall 0 0 0 0 \n",
|
||
"Zusatzstoff 0 0 0 0 \n",
|
||
"Gelenk 0 0 0 0 \n",
|
||
"trocknen 0 0 0 0 \n",
|
||
"Kilo 0 0 0 0 \n",
|
||
"\n",
|
||
" ... Kaltwasserhähne Andreas Haltebügel \\\n",
|
||
"Dampf ... 0 0 0 \n",
|
||
"Riss ... 0 0 0 \n",
|
||
"Förderleistung ... 0 0 0 \n",
|
||
"festlegen ... 0 0 0 \n",
|
||
"weie ... 0 0 0 \n",
|
||
"... ... ... ... ... \n",
|
||
"Fall ... 0 0 0 \n",
|
||
"Zusatzstoff ... 0 0 0 \n",
|
||
"Gelenk ... 0 0 0 \n",
|
||
"trocknen ... 0 0 0 \n",
|
||
"Kilo ... 0 0 0 \n",
|
||
"\n",
|
||
" Sicherheitsschalter Tränkle Fall Zusatzstoff Gelenk \\\n",
|
||
"Dampf 0 0 0 0 0 \n",
|
||
"Riss 0 0 0 0 0 \n",
|
||
"Förderleistung 0 0 0 0 0 \n",
|
||
"festlegen 0 0 0 0 0 \n",
|
||
"weie 0 0 0 0 0 \n",
|
||
"... ... ... ... ... ... \n",
|
||
"Fall 0 0 0 0 0 \n",
|
||
"Zusatzstoff 0 0 0 0 0 \n",
|
||
"Gelenk 0 0 0 0 0 \n",
|
||
"trocknen 0 0 0 0 0 \n",
|
||
"Kilo 0 0 0 0 0 \n",
|
||
"\n",
|
||
" trocknen Kilo \n",
|
||
"Dampf 0 0 \n",
|
||
"Riss 0 0 \n",
|
||
"Förderleistung 0 0 \n",
|
||
"festlegen 0 0 \n",
|
||
"weie 0 0 \n",
|
||
"... ... ... \n",
|
||
"Fall 0 0 \n",
|
||
"Zusatzstoff 0 0 \n",
|
||
"Gelenk 0 0 \n",
|
||
"trocknen 0 0 \n",
|
||
"Kilo 0 0 \n",
|
||
"\n",
|
||
"[2337 rows x 2337 columns]"
|
||
]
|
||
},
|
||
"execution_count": 92,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"thresh_adj_mat"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 93,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"ADJ_MAT_PATH_CSV = f'./02_2_Preprocess2/{DATA_SET_ID}_01_1_adj_mat_thresh_mapping_{WEIGHT_THRESHOLD}.csv'\n",
|
||
"thresh_adj_mat.to_csv(path_or_buf=ADJ_MAT_PATH_CSV, encoding='cp1252', sep=';')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"*Transfer in NetworkX Graph for Exporting to Standardized Formats*"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 94,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import networkx as nx"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 95,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"G = nx.from_pandas_adjacency(thresh_adj_mat)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 96,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"SAVE_PATH_GRAPHML = f'./02_2_Preprocess2/{DATA_SET_ID}_adj_mat_thresh_{WEIGHT_THRESHOLD}.graphml'\n",
|
||
"nx.write_graphml(G, SAVE_PATH_GRAPHML)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Test Cosine Similarity\n",
|
||
"- build a matrix of similarity scores (upper triangular matrix)\n",
|
||
"- one score for every word pair\n",
|
||
"- filter the table by a threshold\n",
|
||
"- use the thresholded weight adjacency matrix as a mask\n",
|
||
"    - analyze only high-weight groups\n",
|
||
"- analyze the relationships as a graph (similar to the previous approach)\n",
|
||
"- form groups and name them (e.g. Prüfung+Überprüfung+Kontrolle --> Überprüfung)\n",
|
||
"- build a dictionary from these groups and match terms against it during creation"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def build_cosine_similarity_matrix(\n",
|
||
" adj_mat\n",
|
||
"):\n",
|
||
" # obtain words to compare\n",
|
||
" words = adj_mat.index.to_list()\n",
|
||
" \n",
|
||
"    # similarity matrix, initialized to zero; only the upper triangle is filled below\n",
|
||
" cos_mat = pd.DataFrame(\n",
|
||
" data=0., \n",
|
||
" columns=words, \n",
|
||
" index=words,\n",
|
||
" dtype=np.float32,\n",
|
||
" )\n",
|
||
" \n",
|
||
" for (word1, word2) in combinations(words, 2):\n",
|
||
"        # look up both words as lexemes in the spaCy model vocabulary\n",
|
||
" w1 = nlp.vocab[str(word1)]\n",
|
||
" w2 = nlp.vocab[str(word2)]\n",
|
||
"        # skip pairs lacking a word vector: the score would stay 0.0 and spaCy would emit W008\n",
|
||
"        if not (w1.has_vector and w2.has_vector):\n",
|
||
"            continue\n",
|
||
" # calculate cosine similarity\n",
|
||
" cos_sim = w1.similarity(w2)\n",
|
||
"        # store the score in the upper triangle (word1 precedes word2 in the index)\n",
|
||
" cos_mat.at[word1, word2] = cos_sim\n",
|
||
" \n",
|
||
" return cos_mat"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"C:\\Users\\foersterflorian\\AppData\\Local\\Temp\\ipykernel_17216\\213623562.py:20: UserWarning: [W008] Evaluating Lexeme.similarity based on empty vectors.\n",
|
||
" cos_sim = w1.similarity(w2)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"cos_mat = build_cosine_similarity_matrix(adj_mat=adj_mat_undir)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Klübertemp</th>\n",
|
||
" <th>Schusssuche</th>\n",
|
||
" <th>Laser</th>\n",
|
||
" <th>Schaftteile</th>\n",
|
||
" <th>Dichtsätz</th>\n",
|
||
" <th>Tastatur</th>\n",
|
||
" <th>Vorspuleinheit</th>\n",
|
||
" <th>beginnen</th>\n",
|
||
" <th>auslesen</th>\n",
|
||
" <th>Kettspannung</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>Tänzerwalze</th>\n",
|
||
" <th>Abfallkante</th>\n",
|
||
" <th>rappeln</th>\n",
|
||
" <th>Rottenegger</th>\n",
|
||
" <th>Contrawalze</th>\n",
|
||
" <th>Eisenträger</th>\n",
|
||
" <th>Hängegurte</th>\n",
|
||
" <th>Treffen</th>\n",
|
||
" <th>Greiferarmen</th>\n",
|
||
" <th>Nadelleist</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>Klübertemp</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Schusssuche</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Laser</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.324276</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.059743</td>\n",
|
||
" <td>0.133676</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>-0.063913</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.167521</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>-0.029860</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Schaftteile</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Dichtsätz</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Eisenträger</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.170954</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Hängegurte</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Treffen</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Greiferarmen</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Nadelleist</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>6951 rows × 6951 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Klübertemp Schusssuche Laser Schaftteile Dichtsätz \\\n",
|
||
"Klübertemp 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Schusssuche 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Laser 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Schaftteile 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Dichtsätz 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"... ... ... ... ... ... \n",
|
||
"Eisenträger 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Hängegurte 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Treffen 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Greiferarmen 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Nadelleist 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"\n",
|
||
" Tastatur Vorspuleinheit beginnen auslesen Kettspannung ... \\\n",
|
||
"Klübertemp 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
|
||
"Schusssuche 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
|
||
"Laser 0.324276 0.0 0.059743 0.133676 0.0 ... \n",
|
||
"Schaftteile 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
|
||
"Dichtsätz 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
|
||
"... ... ... ... ... ... ... \n",
|
||
"Eisenträger 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
|
||
"Hängegurte 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
|
||
"Treffen 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
|
||
"Greiferarmen 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
|
||
"Nadelleist 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
|
||
"\n",
|
||
" Tänzerwalze Abfallkante rappeln Rottenegger Contrawalze \\\n",
|
||
"Klübertemp 0.0 0.0 0.000000 0.0 0.0 \n",
|
||
"Schusssuche 0.0 0.0 0.000000 0.0 0.0 \n",
|
||
"Laser 0.0 0.0 -0.063913 0.0 0.0 \n",
|
||
"Schaftteile 0.0 0.0 0.000000 0.0 0.0 \n",
|
||
"Dichtsätz 0.0 0.0 0.000000 0.0 0.0 \n",
|
||
"... ... ... ... ... ... \n",
|
||
"Eisenträger 0.0 0.0 0.000000 0.0 0.0 \n",
|
||
"Hängegurte 0.0 0.0 0.000000 0.0 0.0 \n",
|
||
"Treffen 0.0 0.0 0.000000 0.0 0.0 \n",
|
||
"Greiferarmen 0.0 0.0 0.000000 0.0 0.0 \n",
|
||
"Nadelleist 0.0 0.0 0.000000 0.0 0.0 \n",
|
||
"\n",
|
||
" Eisenträger Hängegurte Treffen Greiferarmen Nadelleist \n",
|
||
"Klübertemp 0.000000 0.0 0.000000 0.0 0.0 \n",
|
||
"Schusssuche 0.000000 0.0 0.000000 0.0 0.0 \n",
|
||
"Laser 0.167521 0.0 -0.029860 0.0 0.0 \n",
|
||
"Schaftteile 0.000000 0.0 0.000000 0.0 0.0 \n",
|
||
"Dichtsätz 0.000000 0.0 0.000000 0.0 0.0 \n",
|
||
"... ... ... ... ... ... \n",
|
||
"Eisenträger 0.000000 0.0 0.170954 0.0 0.0 \n",
|
||
"Hängegurte 0.000000 0.0 0.000000 0.0 0.0 \n",
|
||
"Treffen 0.000000 0.0 0.000000 0.0 0.0 \n",
|
||
"Greiferarmen 0.000000 0.0 0.000000 0.0 0.0 \n",
|
||
"Nadelleist 0.000000 0.0 0.000000 0.0 0.0 \n",
|
||
"\n",
|
||
"[6951 rows x 6951 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"cos_mat"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"WEIGHT_THRESHOLD = 10\n",
|
||
"arr = adj_mat_undir.to_numpy()\n",
|
||
"COS_THRESHOLD = 0.4\n",
|
||
"cos_arr = cos_mat.to_numpy()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"cos_arr_filt = np.where((cos_arr > COS_THRESHOLD) & (arr >= WEIGHT_THRESHOLD), cos_arr, 0)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array([[0., 0., 0., ..., 0., 0., 0.],\n",
|
||
" [0., 0., 0., ..., 0., 0., 0.],\n",
|
||
" [0., 0., 0., ..., 0., 0., 0.],\n",
|
||
" ...,\n",
|
||
" [0., 0., 0., ..., 0., 0., 0.],\n",
|
||
" [0., 0., 0., ..., 0., 0., 0.],\n",
|
||
" [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"cos_arr_filt"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"217"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"np.count_nonzero(cos_arr_filt)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"thresh_cos_mat = cos_mat.copy()\n",
|
||
"thresh_cos_mat[:] = cos_arr_filt"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Verstärkung</th>\n",
|
||
" <th>Zuluftfilter</th>\n",
|
||
" <th>klemmt</th>\n",
|
||
" <th>Komminikation</th>\n",
|
||
" <th>Doppelholztische</th>\n",
|
||
" <th>Deckenbeleuchtung</th>\n",
|
||
" <th>Abfalltransport</th>\n",
|
||
" <th>fahrbar</th>\n",
|
||
" <th>Folieneinlauf</th>\n",
|
||
" <th>entsorgen</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>neuwertig</th>\n",
|
||
" <th>Bleit</th>\n",
|
||
" <th>Rauchentwicklung</th>\n",
|
||
" <th>Kompressorsteuerung</th>\n",
|
||
" <th>anziehen</th>\n",
|
||
" <th>Mitarbeiterin</th>\n",
|
||
" <th>Nägel</th>\n",
|
||
" <th>WZ</th>\n",
|
||
" <th>ExSchutzAnlage</th>\n",
|
||
" <th>Gemisch</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>Verstärkung</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Zuluftfilter</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>klemmt</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Komminikation</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Doppelholztische</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>...</th>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Mitarbeiterin</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Nägel</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>WZ</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>ExSchutzAnlage</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>Gemisch</th>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>6951 rows × 6951 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Verstärkung Zuluftfilter klemmt Komminikation \\\n",
|
||
"Verstärkung 0.0 0.0 0.0 0.0 \n",
|
||
"Zuluftfilter 0.0 0.0 0.0 0.0 \n",
|
||
"klemmt 0.0 0.0 0.0 0.0 \n",
|
||
"Komminikation 0.0 0.0 0.0 0.0 \n",
|
||
"Doppelholztische 0.0 0.0 0.0 0.0 \n",
|
||
"... ... ... ... ... \n",
|
||
"Mitarbeiterin 0.0 0.0 0.0 0.0 \n",
|
||
"Nägel 0.0 0.0 0.0 0.0 \n",
|
||
"WZ 0.0 0.0 0.0 0.0 \n",
|
||
"ExSchutzAnlage 0.0 0.0 0.0 0.0 \n",
|
||
"Gemisch 0.0 0.0 0.0 0.0 \n",
|
||
"\n",
|
||
" Doppelholztische Deckenbeleuchtung Abfalltransport \\\n",
|
||
"Verstärkung 0.0 0.0 0.0 \n",
|
||
"Zuluftfilter 0.0 0.0 0.0 \n",
|
||
"klemmt 0.0 0.0 0.0 \n",
|
||
"Komminikation 0.0 0.0 0.0 \n",
|
||
"Doppelholztische 0.0 0.0 0.0 \n",
|
||
"... ... ... ... \n",
|
||
"Mitarbeiterin 0.0 0.0 0.0 \n",
|
||
"Nägel 0.0 0.0 0.0 \n",
|
||
"WZ 0.0 0.0 0.0 \n",
|
||
"ExSchutzAnlage 0.0 0.0 0.0 \n",
|
||
"Gemisch 0.0 0.0 0.0 \n",
|
||
"\n",
|
||
" fahrbar Folieneinlauf entsorgen ... neuwertig Bleit \\\n",
|
||
"Verstärkung 0.0 0.0 0.0 ... 0.0 0.0 \n",
|
||
"Zuluftfilter 0.0 0.0 0.0 ... 0.0 0.0 \n",
|
||
"klemmt 0.0 0.0 0.0 ... 0.0 0.0 \n",
|
||
"Komminikation 0.0 0.0 0.0 ... 0.0 0.0 \n",
|
||
"Doppelholztische 0.0 0.0 0.0 ... 0.0 0.0 \n",
|
||
"... ... ... ... ... ... ... \n",
|
||
"Mitarbeiterin 0.0 0.0 0.0 ... 0.0 0.0 \n",
|
||
"Nägel 0.0 0.0 0.0 ... 0.0 0.0 \n",
|
||
"WZ 0.0 0.0 0.0 ... 0.0 0.0 \n",
|
||
"ExSchutzAnlage 0.0 0.0 0.0 ... 0.0 0.0 \n",
|
||
"Gemisch 0.0 0.0 0.0 ... 0.0 0.0 \n",
|
||
"\n",
|
||
" Rauchentwicklung Kompressorsteuerung anziehen \\\n",
|
||
"Verstärkung 0.0 0.0 0.0 \n",
|
||
"Zuluftfilter 0.0 0.0 0.0 \n",
|
||
"klemmt 0.0 0.0 0.0 \n",
|
||
"Komminikation 0.0 0.0 0.0 \n",
|
||
"Doppelholztische 0.0 0.0 0.0 \n",
|
||
"... ... ... ... \n",
|
||
"Mitarbeiterin 0.0 0.0 0.0 \n",
|
||
"Nägel 0.0 0.0 0.0 \n",
|
||
"WZ 0.0 0.0 0.0 \n",
|
||
"ExSchutzAnlage 0.0 0.0 0.0 \n",
|
||
"Gemisch 0.0 0.0 0.0 \n",
|
||
"\n",
|
||
" Mitarbeiterin Nägel WZ ExSchutzAnlage Gemisch \n",
|
||
"Verstärkung 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Zuluftfilter 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"klemmt 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Komminikation 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Doppelholztische 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"... ... ... ... ... ... \n",
|
||
"Mitarbeiterin 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Nägel 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"WZ 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"ExSchutzAnlage 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"Gemisch 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"\n",
|
||
"[6951 rows x 6951 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"thresh_cos_mat"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"COS_MAT_PATH_CSV = f'./Graphanalyse_Gruppen/cos_mat_Wthresh_{WEIGHT_THRESHOLD}_Cthresh{int(COS_THRESHOLD*100)}.csv'\n",
|
||
"thresh_cos_mat.to_csv(path_or_buf=COS_MAT_PATH_CSV, encoding='cp1252', sep=';')"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.11.7"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 4
|
||
}
|