lang-main/notebooks/archive/Analyse_2-2.ipynb
2024-08-07 20:06:06 +02:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Analysis 2-2**\n",
"\n",
"## Strategy & Focus\n",
"\n",
"- Try clustering or merging related terms (e.g. Prüfung, Prüfen, Überprüfung)\n",
"- Follow the frequency distribution: analyse the most frequent terms first"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Feature 1: Clustering of process descriptions (Vorgangsbeschreibungen)\n",
"\n",
"## Research\n",
"[Textmining HS Hannover](https://textmining.wp.hs-hannover.de/Preprocessing.html)\n",
"\n",
"### General decomposition of the individual descriptions\n",
"\n",
"- Text into sentences\n",
"- Sentences into words\n",
"- Words into a base form:\n",
"    - Lemma: the form of the word as it appears in a dictionary, e.g. Haus, laufen, begründen\n",
"    - Stem: the word without its inflectional affixes (prefixes and suffixes), e.g. Haus, lauf, begründ\n",
"    - Root: the core of the word, from which the word may have been formed by derivation, e.g. Haus, lauf, Grund\n",
"- Part-of-speech determination\n",
"    - classic part-of-speech (POS) tagging (conventional word class)\n",
"    - Named Entity Recognition (NER) (proper names)\n",
"        - Example spaCy: person, location, organisation, miscellaneous\n",
"\n",
"#### Semantics\n",
"\n",
"- Words within the same sentence are more strongly related than words in different sentences\n",
"\n",
"### Packages\n",
"\n",
"- English:\n",
"    - [NLTK](https://www.nltk.org/)\n",
"- German:\n",
"    - [HanTa - The Hanover Tagger](https://github.com/wartaal/HanTa/tree/master)\n",
"    - [TreeTagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)\n",
"        - [Python Wrapper](https://treetaggerwrapper.readthedocs.io/en/latest/)\n",
"    - [spaCy](https://spacy.io/)\n",
"        - [Example 1](https://www.trinnovative.de/blog/2020-09-08-natural-language-processing-mit-spacy.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"21.02.:\n",
"- Reworked the RegEx filtering\n",
"- Improved duplicate detection based on similarity"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis"
]
},
{
"cell_type": "code",
"execution_count": 339,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from pandas import DataFrame, Series\n",
"import spacy\n",
"from spacy.lang.de import German as GermanSpacyModel\n",
"import sentence_transformers\n",
"from sentence_transformers import SentenceTransformer\n",
"from collections import Counter\n",
"from itertools import combinations\n",
"from dateutil.parser import parse\n",
"import re\n",
"from spellchecker import SpellChecker\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"import logging\n",
"import sys\n",
"import pickle\n",
"\n",
"LOGGING_LEVEL = 'INFO'\n",
"logging.basicConfig(level=LOGGING_LEVEL, stream=sys.stdout)\n",
"logger = logging.getLogger('base')"
]
},
{
"cell_type": "code",
"execution_count": 340,
"metadata": {},
"outputs": [],
"source": [
"def save_pickle(obj, path):\n",
"    with open(path, 'wb') as file:\n",
"        pickle.dump(obj, file, protocol=pickle.HIGHEST_PROTOCOL)\n",
"\n",
"def load_pickle(path):\n",
"    with open(path, 'rb') as file:\n",
"        obj = pickle.load(file)\n",
"    return obj"
]
},
{
"cell_type": "code",
"execution_count": 341,
"metadata": {},
"outputs": [],
"source": [
"sns.set()\n",
"LOAD_CALC_FILES = False\n",
"\n",
"DESC_BLACKLIST = set(['-'])\n",
"\"\"\"\n",
"GENERAL_BLACKLIST = set([\n",
"    'herr', 'hr.', 'förster', 'graf', 'stöppel',\n",
"    'stab', 'kw', 'h.', 'koch', 'heininger', '.',\n",
"    'schwab', 'm.', 'wenninger', '-', '--',\n",
"])\n",
"\"\"\"\n",
"\n",
"GENERAL_BLACKLIST = set([\n",
"    'herr', 'hr.', 'kw', 'h.', '.',\n",
"    'm.', '-', '--', 'dr.', 'dr',\n",
"])\n",
"\n",
"#GENERAL_BLACKLIST = set()\n",
"#POS_of_interest = set(['NOUN', 'PROPN', 'ADJ', 'VERB', 'AUX'])\n",
"POS_of_interest = set(['NOUN', 'ADJ', 'VERB', 'AUX'])\n",
"TAG_of_interest = set(['ADJD'])"
]
},
{
"cell_type": "code",
"execution_count": 388,
"metadata": {},
"outputs": [
{
"ename": "RuntimeError",
"evalue": "Error(s) in loading state_dict for BertModel:\n\tUnexpected key(s) in state_dict: \"embeddings.position_ids\". ",
"output_type": "error",
"traceback": [
"---------------------------------------------------------------------------",
"RuntimeError                              Traceback (most recent call last)",
"Cell In[388], line 5\n----> 5 nlp = spacy.load('de_dep_news_trf')",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy\\__init__.py:51, in load(name, vocab, disable, enable, exclude, config)\n---> 51 return util.load_model(name, vocab=vocab, disable=disable, enable=enable, exclude=exclude, config=config)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy\\util.py:465, in load_model(name, vocab, disable, enable, exclude, config)\n--> 465 return load_model_from_package(name, **kwargs)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy\\util.py:501, in load_model_from_package(name, vocab, disable, enable, exclude, config)\n--> 501 return cls.load(vocab=vocab, disable=disable, enable=enable, exclude=exclude, config=config)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\de_dep_news_trf\\__init__.py:10, in load(**overrides)\n---> 10 return load_model_from_init_py(__file__, **overrides)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy\\util.py:682, in load_model_from_init_py(init_file, vocab, disable, enable, exclude, config)\n--> 682 return load_model_from_path(data_path, vocab=vocab, meta=meta, disable=disable, enable=enable, exclude=exclude, config=config)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy\\util.py:547, in load_model_from_path(model_path, meta, vocab, disable, enable, exclude, config)\n--> 547 return nlp.from_disk(model_path, exclude=exclude, overrides=overrides)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy\\language.py:2156, in Language.from_disk(self, path, exclude, overrides)\n-> 2156 util.from_disk(path, deserializers, exclude)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy\\util.py:1392, in from_disk(path, readers, exclude)\n-> 1392 reader(path / key)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy\\language.py:2150, in Language.from_disk.<locals>.<lambda>(p, proc)\n-> 2150 deserializers[name] = lambda p, proc=proc: proc.from_disk(p, exclude=[\"vocab\"])",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy_transformers\\pipeline_component.py:416, in Transformer.from_disk(self, path, exclude)\n--> 416 util.from_disk(path, deserialize, exclude)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy\\util.py:1392, in from_disk(path, readers, exclude)\n-> 1392 reader(path / key)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy_transformers\\pipeline_component.py:390, in Transformer.from_disk.<locals>.load_model(p)\n--> 390 self.model.from_bytes(mfile.read())",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\thinc\\model.py:619, in Model.from_bytes(self, bytes_data)\n--> 619 return self.from_dict(msg)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\thinc\\model.py:657, in Model.from_dict(self, msg)\n--> 657 node.shims[i].from_bytes(shim_bytes)",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\spacy_transformers\\layers\\hf_shim.py:120, in HFShim.from_bytes(self, bytes_data)\n--> 120 self._model.load_state_dict(torch.load(filelike, map_location=device))",
"File c:\\Users\\foersterflorian\\mambaforge\\envs\\test\\Lib\\site-packages\\torch\\nn\\modules\\module.py:2041, in Module.load_state_dict(self, state_dict, strict)\n-> 2041 raise RuntimeError('Error(s) in loading state_dict for {}:\\n\\t{}'.format(self.__class__.__name__, \"\\n\\t\".join(error_msgs)))",
"RuntimeError: Error(s) in loading state_dict for BertModel:\n\tUnexpected key(s) in state_dict: \"embeddings.position_ids\". "
]
}
],
"source": [
"# Load the spaCy language model.\n",
"# The transformer pipeline ships without static word vectors,\n",
"# so it cannot be used to calculate similarities;\n",
"# sentence-transformers is used for that instead.\n",
"nlp = spacy.load('de_dep_news_trf')\n",
"#nlp = spacy.load('de_core_news_lg')"
]
},
{
"cell_type": "code",
"execution_count": 343,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2\n",
"INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu\n"
]
}
],
"source": [
"model_stfr = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')"
]
},
{
"cell_type": "code",
"execution_count": 344,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 129020 entries, 0 to 129019\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 VorgangsID 129020 non-null int64 \n",
" 1 ObjektID 129020 non-null int64 \n",
" 2 HObjektText 129003 non-null object \n",
" 3 ObjektArtID 129020 non-null int64 \n",
" 4 ObjektArtText 128372 non-null object \n",
" 5 VorgangsTypID 129020 non-null int64 \n",
" 6 VorgangsTypName 129020 non-null object \n",
" 7 VorgangsDatum 129020 non-null datetime64[ns]\n",
" 8 VorgangsStatusId 129020 non-null int64 \n",
" 9 VorgangsPrioritaet 129020 non-null int64 \n",
" 10 VorgangsBeschreibung 124087 non-null object \n",
" 11 VorgangsOrt 507 non-null object \n",
" 12 VorgangsArtText 129020 non-null object \n",
" 13 ErledigungsDatum 129020 non-null datetime64[ns]\n",
" 14 ErledigungsArtText 128474 non-null object \n",
" 15 ErledigungsBeschreibung 118135 non-null object \n",
" 16 MPMelderArbeitsplatz 6359 non-null object \n",
" 17 MPAbteilungBezeichnung 6359 non-null object \n",
" 18 Arbeitsbeginn 123538 non-null datetime64[ns]\n",
" 19 ErstellungsDatum 129020 non-null datetime64[ns]\n",
"dtypes: datetime64[ns](4), int64(6), object(10)\n",
"memory usage: 19.7+ MB\n"
]
}
],
"source": [
"# load dataset\n",
"FILE_PATH = '01_2_Rohdaten_neu/Export4.csv'\n",
"date_cols = ['VorgangsDatum', 'ErledigungsDatum', 'Arbeitsbeginn', 'ErstellungsDatum']\n",
"raw = pd.read_csv(filepath_or_buffer=FILE_PATH, sep=';', encoding='cp1252', parse_dates=date_cols, dayfirst=True)\n",
"raw.info()"
]
},
{
"cell_type": "code",
"execution_count": 345,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsID</th>\n",
" <th>ObjektID</th>\n",
" <th>HObjektText</th>\n",
" <th>ObjektArtID</th>\n",
" <th>ObjektArtText</th>\n",
" <th>VorgangsTypID</th>\n",
" <th>VorgangsTypName</th>\n",
" <th>VorgangsDatum</th>\n",
" <th>VorgangsStatusId</th>\n",
" <th>VorgangsPrioritaet</th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>VorgangsOrt</th>\n",
" <th>VorgangsArtText</th>\n",
" <th>ErledigungsDatum</th>\n",
" <th>ErledigungsArtText</th>\n",
" <th>ErledigungsBeschreibung</th>\n",
" <th>MPMelderArbeitsplatz</th>\n",
" <th>MPAbteilungBezeichnung</th>\n",
" <th>Arbeitsbeginn</th>\n",
" <th>ErstellungsDatum</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11</td>\n",
" <td>114</td>\n",
" <td>427 C , Webmaschine, DL 280 EMS Breite 280</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-06</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Kettbaum kaputt</td>\n",
" <td>2019-03-06</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>NaT</td>\n",
" <td>2019-03-06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>17</td>\n",
" <td>124</td>\n",
" <td>621 C , Webmaschine, DL 280 EMS Breite 280</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-11</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>asgasdg</td>\n",
" <td>2019-03-11</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Elektrowerkstatt</td>\n",
" <td>Elektrowerkstatt</td>\n",
" <td>NaT</td>\n",
" <td>2019-03-11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>53</td>\n",
" <td>244</td>\n",
" <td>285 C, Webmaschine, SG 220 EMS</td>\n",
" <td>5</td>\n",
" <td>Greifer-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-19</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Kupplung schleift</td>\n",
" <td>NaN</td>\n",
" <td>Kupplung defekt</td>\n",
" <td>2019-03-20</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>NaN</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>NaT</td>\n",
" <td>2019-03-19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>58</td>\n",
" <td>257</td>\n",
" <td>107, Webmaschine, OM 220 EOS</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-21</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Gegengewicht wieder anbringen</td>\n",
" <td>NaN</td>\n",
" <td>Gegengewicht an der Webmaschine abgefallen</td>\n",
" <td>2019-03-21</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Schraube ausgebohrt\\nGegengewicht wieder angeb...</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>2019-03-21</td>\n",
" <td>2019-03-21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>81</td>\n",
" <td>138</td>\n",
" <td>00138, Schärmaschine 9,</td>\n",
" <td>16</td>\n",
" <td>Schärmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-25</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>da ist etwas gebrochen. (Herr Heininger)</td>\n",
" <td>NaN</td>\n",
" <td>zentrale Bremsenverstellung linke Gatterseite ...</td>\n",
" <td>2019-03-25</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Bolzen gebrochen. Bolzen neu angefertig und di...</td>\n",
" <td>Vorwerk</td>\n",
" <td>Vorwerk</td>\n",
" <td>2019-03-25</td>\n",
" <td>2019-03-25</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" VorgangsID ObjektID HObjektText \\\n",
"0 11 114 427 C , Webmaschine, DL 280 EMS Breite 280 \n",
"1 17 124 621 C , Webmaschine, DL 280 EMS Breite 280 \n",
"2 53 244 285 C, Webmaschine, SG 220 EMS \n",
"3 58 257 107, Webmaschine, OM 220 EOS \n",
"4 81 138 00138, Schärmaschine 9, \n",
"\n",
" ObjektArtID ObjektArtText VorgangsTypID VorgangsTypName \\\n",
"0 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
"1 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
"2 5 Greifer-Webmaschine 3 Reparaturauftrag (Portal) \n",
"3 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
"4 16 Schärmaschine 3 Reparaturauftrag (Portal) \n",
"\n",
" VorgangsDatum VorgangsStatusId VorgangsPrioritaet \\\n",
"0 2019-03-06 4 0 \n",
"1 2019-03-11 5 0 \n",
"2 2019-03-19 5 0 \n",
"3 2019-03-21 5 0 \n",
"4 2019-03-25 5 0 \n",
"\n",
" VorgangsBeschreibung VorgangsOrt \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 Kupplung schleift NaN \n",
"3 Gegengewicht wieder anbringen NaN \n",
"4 da ist etwas gebrochen. (Herr Heininger) NaN \n",
"\n",
" VorgangsArtText ErledigungsDatum \\\n",
"0 Kettbaum kaputt 2019-03-06 \n",
"1 asgasdg 2019-03-11 \n",
"2 Kupplung defekt 2019-03-20 \n",
"3 Gegengewicht an der Webmaschine abgefallen 2019-03-21 \n",
"4 zentrale Bremsenverstellung linke Gatterseite ... 2019-03-25 \n",
"\n",
" ErledigungsArtText ErledigungsBeschreibung \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 Reparatur UTT NaN \n",
"3 Reparatur UTT Schraube ausgebohrt\\nGegengewicht wieder angeb... \n",
"4 Reparatur UTT Bolzen gebrochen. Bolzen neu angefertig und di... \n",
"\n",
" MPMelderArbeitsplatz MPAbteilungBezeichnung Arbeitsbeginn ErstellungsDatum \n",
"0 Weberei Weberei NaT 2019-03-06 \n",
"1 Elektrowerkstatt Elektrowerkstatt NaT 2019-03-11 \n",
"2 Weberei Weberei NaT 2019-03-19 \n",
"3 Weberei Weberei 2019-03-21 2019-03-21 \n",
"4 Vorwerk Vorwerk 2019-03-25 2019-03-25 "
]
},
"execution_count": 345,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw.head()"
]
},
{
"cell_type": "code",
"execution_count": 346,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of features: 20\n"
]
}
],
"source": [
"print(f\"Number of features: {len(raw.columns)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**New features compared to the previous analysis:**\n",
"- ``ObjektArtID``\n",
"- ``ObjektArtText``\n",
"- ``VorgangsTypName``"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Duplicates"
]
},
{
"cell_type": "code",
"execution_count": 347,
"metadata": {},
"outputs": [],
"source": [
"duplicates_filt = raw.duplicated()"
]
},
{
"cell_type": "code",
"execution_count": 348,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicates: 84\n"
]
}
],
"source": [
"print(f\"Number of duplicates: {duplicates_filt.sum()}\")"
]
},
{
"cell_type": "code",
"execution_count": 349,
"metadata": {},
"outputs": [],
"source": [
"filt_data = raw[duplicates_filt]\n",
"uni_obj_id_dupl = filt_data['ObjektID'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 350,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of unique object IDs among duplicates: 47\n"
]
}
],
"source": [
"print(f\"Number of unique object IDs among duplicates: {len(uni_obj_id_dupl)}\")"
]
},
{
"cell_type": "code",
"execution_count": 351,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 128936 entries, 0 to 128935\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 VorgangsID 128936 non-null int64 \n",
" 1 ObjektID 128936 non-null int64 \n",
" 2 HObjektText 128920 non-null object \n",
" 3 ObjektArtID 128936 non-null int64 \n",
" 4 ObjektArtText 128289 non-null object \n",
" 5 VorgangsTypID 128936 non-null int64 \n",
" 6 VorgangsTypName 128936 non-null object \n",
" 7 VorgangsDatum 128936 non-null datetime64[ns]\n",
" 8 VorgangsStatusId 128936 non-null int64 \n",
" 9 VorgangsPrioritaet 128936 non-null int64 \n",
" 10 VorgangsBeschreibung 124008 non-null object \n",
" 11 VorgangsOrt 507 non-null object \n",
" 12 VorgangsArtText 128936 non-null object \n",
" 13 ErledigungsDatum 128936 non-null datetime64[ns]\n",
" 14 ErledigungsArtText 128402 non-null object \n",
" 15 ErledigungsBeschreibung 118086 non-null object \n",
" 16 MPMelderArbeitsplatz 6337 non-null object \n",
" 17 MPAbteilungBezeichnung 6337 non-null object \n",
" 18 Arbeitsbeginn 123480 non-null datetime64[ns]\n",
" 19 ErstellungsDatum 128936 non-null datetime64[ns]\n",
"dtypes: datetime64[ns](4), int64(6), object(10)\n",
"memory usage: 19.7+ MB\n"
]
}
],
"source": [
"wo_duplicates = raw.drop_duplicates(ignore_index=True)\n",
"wo_duplicates.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ``VorgangsBeschreibung``"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **NA values and duplicates**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"String cleaning"
]
},
{
"cell_type": "code",
"execution_count": 352,
"metadata": {},
"outputs": [],
"source": [
"SPECIAL_CHARS = set(['&', '$', '%', '§', '/', '(', ')', '_', \n",
" '+', '', '--', '<', '>', '´',\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 353,
"metadata": {},
"outputs": [],
"source": [
"def clean_string_slim(string: str) -> str:\n",
"    # replace whitespace control characters (tab, newline, etc.) with a space\n",
"    pattern = r'[\\t\\n\\r\\f\\v]'\n",
"    string = re.sub(pattern, ' ', string)\n",
"    # remove whitespaces at the beginning and the end\n",
"    string = string.strip()\n",
"\n",
"    return string\n",
"\n",
"def clean_string(string: str) -> str:\n",
"    #num_reps = 5\n",
"\n",
"    # replace whitespace control characters (tab, newline, etc.) with a space\n",
"    pattern = r'[\\t\\n\\r\\f\\v]'\n",
"    string = re.sub(pattern, ' ', string)\n",
"    # remove dates\n",
"    pattern = r'[\\d]{1,4}[.:][\\d]{1,4}[.:][\\d]{1,4}'\n",
"    string = re.sub(pattern, '', string)\n",
"    # remove times\n",
"    pattern = r'[\\d]{1,2}[:][\\d]{1,2}[:][\\d]{0,2}'\n",
"    string = re.sub(pattern, '', string)\n",
"    # remove all characters except spaces, word characters, and basic punctuation\n",
"    pattern = r'[^ \\w.,;:\\-äöüÄÖÜ]+'\n",
"    string = re.sub(pattern, '', string)\n",
"    # replace a hyphen surrounded by non-word characters (used as a dash) with a space\n",
"    pattern = r'[\\W]+-[\\W]+'\n",
"    string = re.sub(pattern, ' ', string)\n",
"    # remove whitespaces in front of punctuation\n",
"    pattern = r'[ ]+([;,.:])'\n",
"    string = re.sub(pattern, r'\\1', string)\n",
"    # collapse multiple whitespaces into one\n",
"    pattern = r'[ ]+'\n",
"    string = re.sub(pattern, ' ', string)\n",
"    # remove whitespaces at the beginning and the end\n",
"    string = string.strip()\n",
"\n",
"    #while num_reps != 0:\n",
"        #string = string.replace('\\n', ' ')\n",
"        #string = string.replace('\\t', ' ')\n",
"        #string = string.replace(' ', ' ')\n",
"        #string = string.replace(' ', ' ')\n",
"        #string = string.replace(' - ', ' ')\n",
"    \"\"\"\n",
"    for char in SPECIAL_CHARS:\n",
"        string = string.replace(char, '')\n",
"\n",
"    #num_reps -= 1\n",
"\n",
"    # remove spaces at the beginning and the end\n",
"    string = string.strip()\n",
"    \"\"\"\n",
"\n",
"    return string"
]
},
{
"cell_type": "code",
"execution_count": 354,
"metadata": {},
"outputs": [],
"source": [
"base = wo_duplicates.copy()\n",
"base = base.dropna(axis=0, subset='VorgangsBeschreibung')\n",
"# preprocessing\n",
"#base['VorgangsBeschreibung'] = base['VorgangsBeschreibung'].map(clean_string)\n",
"base['VorgangsBeschreibung'] = base['VorgangsBeschreibung'].map(clean_string_slim)"
]
},
{
"cell_type": "code",
"execution_count": 355,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsID</th>\n",
" <th>ObjektID</th>\n",
" <th>HObjektText</th>\n",
" <th>ObjektArtID</th>\n",
" <th>ObjektArtText</th>\n",
" <th>VorgangsTypID</th>\n",
" <th>VorgangsTypName</th>\n",
" <th>VorgangsDatum</th>\n",
" <th>VorgangsStatusId</th>\n",
" <th>VorgangsPrioritaet</th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>VorgangsOrt</th>\n",
" <th>VorgangsArtText</th>\n",
" <th>ErledigungsDatum</th>\n",
" <th>ErledigungsArtText</th>\n",
" <th>ErledigungsBeschreibung</th>\n",
" <th>MPMelderArbeitsplatz</th>\n",
" <th>MPAbteilungBezeichnung</th>\n",
" <th>Arbeitsbeginn</th>\n",
" <th>ErstellungsDatum</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>53</td>\n",
" <td>244</td>\n",
" <td>285 C, Webmaschine, SG 220 EMS</td>\n",
" <td>5</td>\n",
" <td>Greifer-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-19</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Kupplung schleift</td>\n",
" <td>NaN</td>\n",
" <td>Kupplung defekt</td>\n",
" <td>2019-03-20</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>NaN</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>NaT</td>\n",
" <td>2019-03-19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>58</td>\n",
" <td>257</td>\n",
" <td>107, Webmaschine, OM 220 EOS</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-21</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Gegengewicht wieder anbringen</td>\n",
" <td>NaN</td>\n",
" <td>Gegengewicht an der Webmaschine abgefallen</td>\n",
" <td>2019-03-21</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Schraube ausgebohrt\\nGegengewicht wieder angeb...</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>2019-03-21</td>\n",
" <td>2019-03-21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>81</td>\n",
" <td>138</td>\n",
" <td>00138, Schärmaschine 9,</td>\n",
" <td>16</td>\n",
" <td>Schärmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-25</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>da ist etwas gebrochen. (Herr Heininger)</td>\n",
" <td>NaN</td>\n",
" <td>zentrale Bremsenverstellung linke Gatterseite ...</td>\n",
" <td>2019-03-25</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Bolzen gebrochen. Bolzen neu angefertig und di...</td>\n",
" <td>Vorwerk</td>\n",
" <td>Vorwerk</td>\n",
" <td>2019-03-25</td>\n",
" <td>2019-03-25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>82</td>\n",
" <td>0</td>\n",
" <td>Warenschau allgemein</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-25</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Klappbügel Portalkran H31 defekt</td>\n",
" <td>Warenschau allgemein</td>\n",
" <td>Allgemeine Reparaturarbeiten</td>\n",
" <td>2019-03-25</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Feder ausgetauscht</td>\n",
" <td>Warenschau</td>\n",
" <td>Warenschau</td>\n",
" <td>2019-03-25</td>\n",
" <td>2019-03-25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>76</td>\n",
" <td>0</td>\n",
" <td>Neben der Türe</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-22</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Schraube nix mer gut</td>\n",
" <td>Neben der Türe</td>\n",
" <td>Kettbaum</td>\n",
" <td>2019-03-25</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Schrauben ausgebohrt\\t\\nGewinde nachgeschnitten\\t</td>\n",
" <td>Vorwerk</td>\n",
" <td>Vorwerk</td>\n",
" <td>2019-03-25</td>\n",
" <td>2019-03-22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>128931</th>\n",
" <td>518956</td>\n",
" <td>1708</td>\n",
" <td>01708, Betriebsfahrräder Schlosserei,</td>\n",
" <td>57</td>\n",
" <td>Interne Wartungsobjekte</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2023-06-19</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>2-wöchige Reinigung &amp; Sichtkontrolle (Technisc...</td>\n",
" <td>NaN</td>\n",
" <td>02 Interne Reinigung / Pflege / Überprüfung</td>\n",
" <td>2023-06-19</td>\n",
" <td>Intern UTT - Prüfung</td>\n",
" <td>Reinigung &amp; Sichtkontrolle (Technische Einric...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2023-06-19</td>\n",
" <td>2023-03-14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>128932</th>\n",
" <td>275123</td>\n",
" <td>1654</td>\n",
" <td>WEBEREI ALLGEMEIN, Weberei allgemein,</td>\n",
" <td>90</td>\n",
" <td>UTT allgemein</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2022-09-29</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Adapter entfernen und Gewinde nachschneiden.</td>\n",
" <td>NaN</td>\n",
" <td>Kettbaum-Adapter</td>\n",
" <td>2022-09-30</td>\n",
" <td>Intern UTT - Reparatur</td>\n",
" <td>mit schlosserei aufräumen</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>2022-09-30</td>\n",
" <td>2022-09-29</td>\n",
" </tr>\n",
" <tr>\n",
" <th>128933</th>\n",
" <td>275125</td>\n",
" <td>1795</td>\n",
" <td>A054.S, Jacquardmaschine,</td>\n",
" <td>24</td>\n",
" <td>Stäubli-Jacquardmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2022-09-30</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Alle 4 Schrauben und teile der Kettbaumlagerun...</td>\n",
" <td>NaN</td>\n",
" <td>Kettbaum</td>\n",
" <td>2022-09-30</td>\n",
" <td>Intern UTT - Reparatur</td>\n",
" <td>Neues Teil eingebaut und altes repariert</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>2022-09-30</td>\n",
" <td>2022-09-30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>128934</th>\n",
" <td>275188</td>\n",
" <td>1</td>\n",
" <td>00001, Ausrüstungsanlage 1,</td>\n",
" <td>1</td>\n",
" <td>Waschmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2022-09-30</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>Walzenlager WK 6 überprüfen/auswechseln</td>\n",
" <td>NaN</td>\n",
" <td>Lagereinheit (Wälzlager, Kugellager, etc.)</td>\n",
" <td>2022-10-04</td>\n",
" <td>Intern UTT - Reparatur</td>\n",
" <td>Lager getauscht</td>\n",
" <td>Ausrüstung</td>\n",
" <td>Ausrüstung</td>\n",
" <td>2022-10-04</td>\n",
" <td>2022-09-30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>128935</th>\n",
" <td>275219</td>\n",
" <td>326</td>\n",
" <td>B38, Niederhubwagen,</td>\n",
" <td>32</td>\n",
" <td>Flurförderzeuge / Putzmaschine / Rasenmäher</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2022-10-03</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Befestigung Deckel für Batteriefach defekt ...</td>\n",
" <td>NaN</td>\n",
" <td>Flurförderzeug</td>\n",
" <td>2022-10-05</td>\n",
" <td>Intern UTT - Reparatur</td>\n",
" <td>Neue Gasfeder eingebaut</td>\n",
" <td>Warenschau</td>\n",
" <td>Warenschau</td>\n",
" <td>2022-10-04</td>\n",
" <td>2022-10-03</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>124008 rows × 20 columns</p>\n",
"</div>"
],
"text/plain": [
" VorgangsID ObjektID HObjektText \\\n",
"2 53 244 285 C, Webmaschine, SG 220 EMS \n",
"3 58 257 107, Webmaschine, OM 220 EOS \n",
"4 81 138 00138, Schärmaschine 9, \n",
"5 82 0 Warenschau allgemein \n",
"6 76 0 Neben der Türe \n",
"... ... ... ... \n",
"128931 518956 1708 01708, Betriebsfahrräder Schlosserei, \n",
"128932 275123 1654 WEBEREI ALLGEMEIN, Weberei allgemein, \n",
"128933 275125 1795 A054.S, Jacquardmaschine, \n",
"128934 275188 1 00001, Ausrüstungsanlage 1, \n",
"128935 275219 326 B38, Niederhubwagen, \n",
"\n",
" ObjektArtID ObjektArtText \\\n",
"2 5 Greifer-Webmaschine \n",
"3 3 Luft-Webmaschine \n",
"4 16 Schärmaschine \n",
"5 0 NaN \n",
"6 0 NaN \n",
"... ... ... \n",
"128931 57 Interne Wartungsobjekte \n",
"128932 90 UTT allgemein \n",
"128933 24 Stäubli-Jacquardmaschine \n",
"128934 1 Waschmaschine \n",
"128935 32 Flurförderzeuge / Putzmaschine / Rasenmäher \n",
"\n",
" VorgangsTypID VorgangsTypName VorgangsDatum \\\n",
"2 3 Reparaturauftrag (Portal) 2019-03-19 \n",
"3 3 Reparaturauftrag (Portal) 2019-03-21 \n",
"4 3 Reparaturauftrag (Portal) 2019-03-25 \n",
"5 3 Reparaturauftrag (Portal) 2019-03-25 \n",
"6 3 Reparaturauftrag (Portal) 2019-03-22 \n",
"... ... ... ... \n",
"128931 1 Wartung 2023-06-19 \n",
"128932 3 Reparaturauftrag (Portal) 2022-09-29 \n",
"128933 3 Reparaturauftrag (Portal) 2022-09-30 \n",
"128934 3 Reparaturauftrag (Portal) 2022-09-30 \n",
"128935 3 Reparaturauftrag (Portal) 2022-10-03 \n",
"\n",
" VorgangsStatusId VorgangsPrioritaet \\\n",
"2 5 0 \n",
"3 5 0 \n",
"4 5 0 \n",
"5 5 0 \n",
"6 5 0 \n",
"... ... ... \n",
"128931 5 0 \n",
"128932 5 0 \n",
"128933 5 0 \n",
"128934 5 1 \n",
"128935 5 0 \n",
"\n",
" VorgangsBeschreibung \\\n",
"2 Kupplung schleift \n",
"3 Gegengewicht wieder anbringen \n",
"4 da ist etwas gebrochen. (Herr Heininger) \n",
"5 Klappbügel Portalkran H31 defekt \n",
"6 Schraube nix mer gut \n",
"... ... \n",
"128931 2-wöchige Reinigung & Sichtkontrolle (Technisc... \n",
"128932 Adapter entfernen und Gewinde nachschneiden. \n",
"128933 Alle 4 Schrauben und teile der Kettbaumlagerun... \n",
"128934 Walzenlager WK 6 überprüfen/auswechseln \n",
"128935 Befestigung Deckel für Batteriefach defekt ... \n",
"\n",
" VorgangsOrt \\\n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"5 Warenschau allgemein \n",
"6 Neben der Türe \n",
"... ... \n",
"128931 NaN \n",
"128932 NaN \n",
"128933 NaN \n",
"128934 NaN \n",
"128935 NaN \n",
"\n",
" VorgangsArtText ErledigungsDatum \\\n",
"2 Kupplung defekt 2019-03-20 \n",
"3 Gegengewicht an der Webmaschine abgefallen 2019-03-21 \n",
"4 zentrale Bremsenverstellung linke Gatterseite ... 2019-03-25 \n",
"5 Allgemeine Reparaturarbeiten 2019-03-25 \n",
"6 Kettbaum 2019-03-25 \n",
"... ... ... \n",
"128931 02 Interne Reinigung / Pflege / Überprüfung 2023-06-19 \n",
"128932 Kettbaum-Adapter 2022-09-30 \n",
"128933 Kettbaum 2022-09-30 \n",
"128934 Lagereinheit (Wälzlager, Kugellager, etc.) 2022-10-04 \n",
"128935 Flurförderzeug 2022-10-05 \n",
"\n",
" ErledigungsArtText \\\n",
"2 Reparatur UTT \n",
"3 Reparatur UTT \n",
"4 Reparatur UTT \n",
"5 Reparatur UTT \n",
"6 Reparatur UTT \n",
"... ... \n",
"128931 Intern UTT - Prüfung \n",
"128932 Intern UTT - Reparatur \n",
"128933 Intern UTT - Reparatur \n",
"128934 Intern UTT - Reparatur \n",
"128935 Intern UTT - Reparatur \n",
"\n",
" ErledigungsBeschreibung \\\n",
"2 NaN \n",
"3 Schraube ausgebohrt\\nGegengewicht wieder angeb... \n",
"4 Bolzen gebrochen. Bolzen neu angefertig und di... \n",
"5 Feder ausgetauscht \n",
"6 Schrauben ausgebohrt\\t\\nGewinde nachgeschnitten\\t \n",
"... ... \n",
"128931 Reinigung & Sichtkontrolle (Technische Einric... \n",
"128932 mit schlosserei aufräumen \n",
"128933 Neues Teil eingebaut und altes repariert \n",
"128934 Lager getauscht \n",
"128935 Neue Gasfeder eingebaut \n",
"\n",
" MPMelderArbeitsplatz MPAbteilungBezeichnung Arbeitsbeginn \\\n",
"2 Weberei Weberei NaT \n",
"3 Weberei Weberei 2019-03-21 \n",
"4 Vorwerk Vorwerk 2019-03-25 \n",
"5 Warenschau Warenschau 2019-03-25 \n",
"6 Vorwerk Vorwerk 2019-03-25 \n",
"... ... ... ... \n",
"128931 NaN NaN 2023-06-19 \n",
"128932 Weberei Weberei 2022-09-30 \n",
"128933 Weberei Weberei 2022-09-30 \n",
"128934 Ausrüstung Ausrüstung 2022-10-04 \n",
"128935 Warenschau Warenschau 2022-10-04 \n",
"\n",
" ErstellungsDatum \n",
"2 2019-03-19 \n",
"3 2019-03-21 \n",
"4 2019-03-25 \n",
"5 2019-03-25 \n",
"6 2019-03-22 \n",
"... ... \n",
"128931 2023-03-14 \n",
"128932 2022-09-29 \n",
"128933 2022-09-30 \n",
"128934 2022-09-30 \n",
"128935 2022-10-03 \n",
"\n",
"[124008 rows x 20 columns]"
]
},
"execution_count": 355,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"base"
]
},
{
"cell_type": "code",
"execution_count": 356,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entries: 124008\n"
]
}
],
"source": [
"descriptions = base['VorgangsBeschreibung']\n",
"print(f\"Entries: {len(descriptions)}\")"
]
},
{
"cell_type": "code",
"execution_count": 357,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicate activity descriptions: 117208\n",
"Number of unique activity descriptions: 6800\n",
"Share of unique activity descriptions: 5.48 %\n"
]
}
],
"source": [
"num_dupl_descr = descriptions.duplicated().sum()\n",
"uni_descr = descriptions.unique()\n",
"num_uni_descr = len(uni_descr)\n",
"\n",
"print(f\"Number of duplicate activity descriptions: {num_dupl_descr}\")\n",
"print(f\"Number of unique activity descriptions: {num_uni_descr}\")\n",
"print(f\"Share of unique activity descriptions: {num_uni_descr / len(descriptions) * 100:.2f} %\")"
]
},
{
"cell_type": "code",
"execution_count": 358,
"metadata": {},
"outputs": [],
"source": [
"if not LOAD_CALC_FILES:\n",
"    cols = ['descr', 'len', 'num_occur', 'assoc_obj_ids', 'num_assoc_obj_ids']\n",
"    descr_df = pd.DataFrame(columns=cols)\n",
"    # track the most frequent description and its position\n",
"    max_val = 0\n",
"    text = None\n",
"    index = 0\n",
"\n",
"    for idx, description in enumerate(uni_descr):\n",
"        len_descr = len(description)\n",
"        # select all rows that carry exactly this description\n",
"        filt = base['VorgangsBeschreibung'] == description\n",
"        temp = base[filt]\n",
"        # sorted unique object IDs associated with this description\n",
"        assoc_obj_ids = temp['ObjektID'].unique()\n",
"        assoc_obj_ids = np.sort(assoc_obj_ids, kind='stable')\n",
"        num_assoc_obj_ids = len(assoc_obj_ids)\n",
"        num_dupl = filt.sum()\n",
"\n",
"        conc_df = pd.DataFrame(data=[[\n",
"            description,\n",
"            len_descr,\n",
"            num_dupl,\n",
"            assoc_obj_ids,\n",
"            num_assoc_obj_ids\n",
"        ]], columns=cols)\n",
"\n",
"        descr_df = pd.concat([descr_df, conc_df], ignore_index=True)\n",
"\n",
"        if num_dupl > max_val:\n",
"            max_val = num_dupl\n",
"            index = idx\n",
"            text = description\n",
"\n",
"    # most frequent descriptions first\n",
"    temp1 = descr_df.sort_values(by='num_occur', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 359,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>len</th>\n",
" <th>num_occur</th>\n",
" <th>assoc_obj_ids</th>\n",
" <th>num_assoc_obj_ids</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>162</th>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>66</td>\n",
" <td>92592</td>\n",
" <td>[0, 17, 41, 42, 43, 44, 45, 46, 47, 51, 52, 53...</td>\n",
" <td>206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>Wöchentliche Sichtkontrolle / Reinigung</td>\n",
" <td>39</td>\n",
" <td>1654</td>\n",
" <td>[301, 304, 305, 313, 314, 331, 332, 510, 511, ...</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>131</th>\n",
" <td>Tägliche Überprüfung der Ölabscheider</td>\n",
" <td>37</td>\n",
" <td>1616</td>\n",
" <td>[0, 970, 2134, 2137]</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>160</th>\n",
" <td>Wöchentliche Kontrolle der WC-Anlagen</td>\n",
" <td>37</td>\n",
" <td>1265</td>\n",
" <td>[1352, 1353, 1354, 1684, 1685, 1686, 1687, 168...</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>140</th>\n",
" <td>Halbjährliche Kontrolle des Stabbreithalters</td>\n",
" <td>44</td>\n",
" <td>687</td>\n",
" <td>[51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 6...</td>\n",
" <td>166</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2679</th>\n",
" <td>Zahnräder der Laufkatze verschlissen Ersatztei...</td>\n",
" <td>170</td>\n",
" <td>1</td>\n",
" <td>[415]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2678</th>\n",
" <td>Bitte 8 Scheiben nach Muster anfertigen. Danke.</td>\n",
" <td>48</td>\n",
" <td>1</td>\n",
" <td>[140]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2677</th>\n",
" <td>Schalter für Bühne Schwenken abgerissen, bitte...</td>\n",
" <td>126</td>\n",
" <td>1</td>\n",
" <td>[323]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2676</th>\n",
" <td>Docke angefahren!</td>\n",
" <td>17</td>\n",
" <td>1</td>\n",
" <td>[176]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6799</th>\n",
" <td>Befestigung Deckel für Batteriefach defekt ...</td>\n",
" <td>107</td>\n",
" <td>1</td>\n",
" <td>[326]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6800 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" descr len num_occur \\\n",
"162 Tägliche Wartungstätigkeiten nach Vorgabe des ... 66 92592 \n",
"33 Wöchentliche Sichtkontrolle / Reinigung 39 1654 \n",
"131 Tägliche Überprüfung der Ölabscheider 37 1616 \n",
"160 Wöchentliche Kontrolle der WC-Anlagen 37 1265 \n",
"140 Halbjährliche Kontrolle des Stabbreithalters 44 687 \n",
"... ... ... ... \n",
"2679 Zahnräder der Laufkatze verschlissen Ersatztei... 170 1 \n",
"2678 Bitte 8 Scheiben nach Muster anfertigen. Danke. 48 1 \n",
"2677 Schalter für Bühne Schwenken abgerissen, bitte... 126 1 \n",
"2676 Docke angefahren! 17 1 \n",
"6799 Befestigung Deckel für Batteriefach defekt ... 107 1 \n",
"\n",
" assoc_obj_ids num_assoc_obj_ids \n",
"162 [0, 17, 41, 42, 43, 44, 45, 46, 47, 51, 52, 53... 206 \n",
"33 [301, 304, 305, 313, 314, 331, 332, 510, 511, ... 18 \n",
"131 [0, 970, 2134, 2137] 4 \n",
"160 [1352, 1353, 1354, 1684, 1685, 1686, 1687, 168... 11 \n",
"140 [51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 6... 166 \n",
"... ... ... \n",
"2679 [415] 1 \n",
"2678 [140] 1 \n",
"2677 [323] 1 \n",
"2676 [176] 1 \n",
"6799 [326] 1 \n",
"\n",
"[6800 rows x 5 columns]"
]
},
"execution_count": 359,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1"
]
},
{
"cell_type": "code",
"execution_count": 360,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Tägliche Wartungstätigkeiten nach Vorgabe des Maschinenherstellers'"
]
},
"execution_count": 360,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1.iloc[0,0]"
]
},
{
"cell_type": "code",
"execution_count": 361,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Wöchentliche Sichtkontrolle / Reinigung'"
]
},
"execution_count": 361,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1.iloc[1,0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Cosine Similarity**"
]
},
{
"cell_type": "code",
"execution_count": 362,
"metadata": {},
"outputs": [],
"source": [
"# keep only descriptions longer than 5 characters, then take the 100 most frequent ones\n",
"subset_data = temp1.loc[temp1['len'] > 5, 'descr'].copy()\n",
"subset_data = subset_data.iloc[0:100]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- How should unknown words be handled?"
]
},
{
"cell_type": "code",
"execution_count": 363,
"metadata": {},
"outputs": [],
"source": [
"# build mapping of embeddings for given model\n",
"def build_embedding_map(\n",
"    data: Series,\n",
"    model: GermanSpacyModel | SentenceTransformer,\n",
") -> tuple[dict[int, tuple['Embedding', str]], tuple[bool, bool]]:\n",
"    # maps each index label to a tuple of (embedding, original text)\n",
"    embeddings: dict[int, tuple['Embedding', str]] = dict()\n",
"    is_spacy = False\n",
"    is_STRF = False\n",
"\n",
"    if isinstance(model, spacy.lang.de.German):\n",
"        is_spacy = True\n",
"    elif isinstance(model, SentenceTransformer):\n",
"        is_STRF = True\n",
"\n",
"    if not any((is_spacy, is_STRF)):\n",
"        raise NotImplementedError(\"Model type unknown\")\n",
"\n",
"    # iterate over the series passed in as ``data``\n",
"    for (idx, text) in data.items():\n",
"\n",
"        if is_spacy:\n",
"            embd = model(text)\n",
"            embeddings[idx] = (embd, text)\n",
"            # check for empty vectors\n",
"            if not embd.vector_norm:\n",
"                print('--- Unknown Words ---')\n",
"                print(f'{embd.text=} has no vector')\n",
"        elif is_STRF:\n",
"            embd = model.encode(text, show_progress_bar=False, normalize_embeddings=False)\n",
"            embeddings[idx] = (embd, text)\n",
"\n",
"    return embeddings, (is_spacy, is_STRF)\n",
"\n",
"# build similarity matrix out of embeddings\n",
"def build_cosSim_matrix(\n",
"    data: Series,\n",
"    model: GermanSpacyModel | SentenceTransformer,\n",
") -> tuple[DataFrame, dict[int, tuple['Embedding', str]]]:\n",
"    # build empty matrix\n",
"    df_index = data.index\n",
"    cosineSim_idx_matrix = pd.DataFrame(data=0., columns=df_index,\n",
"                                        index=df_index, dtype=np.float32)\n",
"\n",
"    # obtain embeddings based on used model\n",
"    embds, (is_spacy, is_STRF) = build_embedding_map(\n",
"        data=data,\n",
"        model=model\n",
"    )\n",
"\n",
"    # apply index based mapping for efficient handling of large texts\n",
"    combs = combinations(df_index, 2)\n",
"\n",
"    # only the upper triangle is filled; diagonal and lower triangle stay 0\n",
"    for (idx1, idx2) in combs:\n",
"        #print(f\"{idx1=}, {idx2=}\")\n",
"        embd1 = embds[idx1][0]\n",
"        embd2 = embds[idx2][0]\n",
"\n",
"        # calculate similarity based on model type\n",
"        if is_spacy:\n",
"            cosSim = embd1.similarity(embd2)\n",
"        elif is_STRF:\n",
"            cosSim = sentence_transformers.util.cos_sim(embd1, embd2)\n",
"            cosSim = cosSim.item()\n",
"\n",
"        cosineSim_idx_matrix.at[idx1, idx2] = cosSim\n",
"\n",
"    return cosineSim_idx_matrix, embds"
]
},
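{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pairwise loop above can also be expressed as a single matrix product. The following is a minimal sketch, assuming the embeddings are available as a dense NumPy array with one row per text (which is what `SentenceTransformer.encode` returns by default for a list of texts). The function `cos_sim_upper_triangle` and the toy vectors are hypothetical names used only for illustration and are not part of the analysis above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch only: vectorised variant of the pairwise cosine similarity loop\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def cos_sim_upper_triangle(vectors: np.ndarray, index: list[int]) -> pd.DataFrame:\n",
"    # scale every row to unit length; zero vectors are left unchanged\n",
"    norms = np.linalg.norm(vectors, axis=1, keepdims=True)\n",
"    unit = vectors / np.where(norms == 0, 1.0, norms)\n",
"    # all pairwise cosine similarities in one matrix product\n",
"    sim = unit @ unit.T\n",
"    # keep the strict upper triangle only, mirroring the loop over combinations\n",
"    sim = np.triu(sim, k=1)\n",
"    return pd.DataFrame(sim, index=index, columns=index, dtype=np.float32)\n",
"\n",
"# hypothetical toy data: three 2-dimensional embeddings\n",
"toy_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])\n",
"toy_sim = cos_sim_upper_triangle(toy_vectors, index=[162, 33, 131])\n",
"toy_sim"
]
},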
{
"cell_type": "code",
"execution_count": 364,
"metadata": {},
"outputs": [],
"source": [
"cosineSim_idx_matrix, embds = build_cosSim_matrix(\n",
" data=subset_data,\n",
" model=model_stfr,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 365,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>162</th>\n",
" <th>33</th>\n",
" <th>131</th>\n",
" <th>160</th>\n",
" <th>140</th>\n",
" <th>1780</th>\n",
" <th>332</th>\n",
" <th>104</th>\n",
" <th>157</th>\n",
" <th>558</th>\n",
" <th>...</th>\n",
" <th>180</th>\n",
" <th>3485</th>\n",
" <th>2255</th>\n",
" <th>81</th>\n",
" <th>360</th>\n",
" <th>47</th>\n",
" <th>2951</th>\n",
" <th>185</th>\n",
" <th>566</th>\n",
" <th>40</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>162</th>\n",
" <td>0.0</td>\n",
" <td>0.441387</td>\n",
" <td>0.409547</td>\n",
" <td>0.307963</td>\n",
" <td>0.324018</td>\n",
" <td>0.506761</td>\n",
" <td>0.475413</td>\n",
" <td>0.475614</td>\n",
" <td>0.491961</td>\n",
" <td>0.472069</td>\n",
" <td>...</td>\n",
" <td>0.306548</td>\n",
" <td>0.318907</td>\n",
" <td>0.329199</td>\n",
" <td>0.296131</td>\n",
" <td>0.283268</td>\n",
" <td>0.442444</td>\n",
" <td>0.129318</td>\n",
" <td>0.425916</td>\n",
" <td>0.432691</td>\n",
" <td>0.356977</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.298110</td>\n",
" <td>0.372992</td>\n",
" <td>0.412453</td>\n",
" <td>0.374439</td>\n",
" <td>0.423904</td>\n",
" <td>0.416100</td>\n",
" <td>0.717584</td>\n",
" <td>0.422673</td>\n",
" <td>...</td>\n",
" <td>0.317514</td>\n",
" <td>0.321114</td>\n",
" <td>0.367475</td>\n",
" <td>0.327464</td>\n",
" <td>0.228003</td>\n",
" <td>0.351899</td>\n",
" <td>0.245888</td>\n",
" <td>0.383551</td>\n",
" <td>0.384033</td>\n",
" <td>0.746593</td>\n",
" </tr>\n",
" <tr>\n",
" <th>131</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.305271</td>\n",
" <td>0.390110</td>\n",
" <td>0.406878</td>\n",
" <td>0.390903</td>\n",
" <td>0.417179</td>\n",
" <td>0.324945</td>\n",
" <td>0.392856</td>\n",
" <td>...</td>\n",
" <td>0.387864</td>\n",
" <td>0.386872</td>\n",
" <td>0.466728</td>\n",
" <td>0.368427</td>\n",
" <td>0.297099</td>\n",
" <td>0.393476</td>\n",
" <td>0.080983</td>\n",
" <td>0.344004</td>\n",
" <td>0.346553</td>\n",
" <td>0.300196</td>\n",
" </tr>\n",
" <tr>\n",
" <th>160</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.294035</td>\n",
" <td>0.293377</td>\n",
" <td>0.457293</td>\n",
" <td>0.251860</td>\n",
" <td>0.327785</td>\n",
" <td>0.456575</td>\n",
" <td>...</td>\n",
" <td>0.291295</td>\n",
" <td>0.356851</td>\n",
" <td>0.326423</td>\n",
" <td>0.340315</td>\n",
" <td>0.241496</td>\n",
" <td>0.363125</td>\n",
" <td>0.205827</td>\n",
" <td>0.350013</td>\n",
" <td>0.322723</td>\n",
" <td>0.233216</td>\n",
" </tr>\n",
" <tr>\n",
" <th>140</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.353114</td>\n",
" <td>0.368328</td>\n",
" <td>0.319977</td>\n",
" <td>0.402378</td>\n",
" <td>0.368687</td>\n",
" <td>...</td>\n",
" <td>0.328528</td>\n",
" <td>0.298065</td>\n",
" <td>0.515159</td>\n",
" <td>0.315984</td>\n",
" <td>0.240238</td>\n",
" <td>0.406395</td>\n",
" <td>0.164005</td>\n",
" <td>0.405763</td>\n",
" <td>0.403172</td>\n",
" <td>0.381799</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.119025</td>\n",
" <td>0.294950</td>\n",
" <td>0.281203</td>\n",
" <td>0.317069</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2951</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.359434</td>\n",
" <td>0.353695</td>\n",
" <td>0.223206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>185</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.978342</td>\n",
" <td>0.411086</td>\n",
" </tr>\n",
" <tr>\n",
" <th>566</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.404999</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>100 rows × 100 columns</p>\n",
"</div>"
],
"text/plain": [
" 162 33 131 160 140 1780 332 \\\n",
"162 0.0 0.441387 0.409547 0.307963 0.324018 0.506761 0.475413 \n",
"33 0.0 0.000000 0.298110 0.372992 0.412453 0.374439 0.423904 \n",
"131 0.0 0.000000 0.000000 0.305271 0.390110 0.406878 0.390903 \n",
"160 0.0 0.000000 0.000000 0.000000 0.294035 0.293377 0.457293 \n",
"140 0.0 0.000000 0.000000 0.000000 0.000000 0.353114 0.368328 \n",
"... ... ... ... ... ... ... ... \n",
"47 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"2951 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"185 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"566 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"40 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"\n",
" 104 157 558 ... 180 3485 2255 \\\n",
"162 0.475614 0.491961 0.472069 ... 0.306548 0.318907 0.329199 \n",
"33 0.416100 0.717584 0.422673 ... 0.317514 0.321114 0.367475 \n",
"131 0.417179 0.324945 0.392856 ... 0.387864 0.386872 0.466728 \n",
"160 0.251860 0.327785 0.456575 ... 0.291295 0.356851 0.326423 \n",
"140 0.319977 0.402378 0.368687 ... 0.328528 0.298065 0.515159 \n",
"... ... ... ... ... ... ... ... \n",
"47 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n",
"2951 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n",
"185 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n",
"566 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n",
"40 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n",
"\n",
" 81 360 47 2951 185 566 40 \n",
"162 0.296131 0.283268 0.442444 0.129318 0.425916 0.432691 0.356977 \n",
"33 0.327464 0.228003 0.351899 0.245888 0.383551 0.384033 0.746593 \n",
"131 0.368427 0.297099 0.393476 0.080983 0.344004 0.346553 0.300196 \n",
"160 0.340315 0.241496 0.363125 0.205827 0.350013 0.322723 0.233216 \n",
"140 0.315984 0.240238 0.406395 0.164005 0.405763 0.403172 0.381799 \n",
"... ... ... ... ... ... ... ... \n",
"47 0.000000 0.000000 0.000000 0.119025 0.294950 0.281203 0.317069 \n",
"2951 0.000000 0.000000 0.000000 0.000000 0.359434 0.353695 0.223206 \n",
"185 0.000000 0.000000 0.000000 0.000000 0.000000 0.978342 0.411086 \n",
"566 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.404999 \n",
"40 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"\n",
"[100 rows x 100 columns]"
]
},
"execution_count": 365,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cosineSim_idx_matrix"
]
},
{
"cell_type": "code",
"execution_count": 366,
"metadata": {},
"outputs": [],
"source": [
"# obtain index pairs with cosine similarity \n",
"# greater than or equal to given threshold value\n",
"\n",
"def filt_thresh_cosSim_matrix(\n",
" threshold: float,\n",
" cosineSim_idx_matrix: DataFrame,\n",
"):\n",
" cosineSim_filt = cosineSim_idx_matrix.where(cosineSim_idx_matrix >= threshold).stack()\n",
" \n",
" return cosineSim_filt\n",
"\n",
"def list_cosSim_dupl_candidates(\n",
" cosineSim_filt: Series,\n",
" embeddings: dict[int, tuple['Embedding',str]],\n",
"):\n",
" # compare found duplicates\n",
" columns = ['idx1', 'text1', 'idx2', 'text2', 'score']\n",
" df_candidates = pd.DataFrame(columns=columns)\n",
" \n",
" index_pairs = list()\n",
"\n",
" for ((idx1, idx2), score) in cosineSim_filt.items():\n",
" # get text content from embedding as second tuple entry\n",
" content = [[\n",
" idx1,\n",
" embeddings[idx1][1],\n",
" idx2,\n",
" embeddings[idx2][1],\n",
" score,\n",
" ]]\n",
" df_conc = pd.DataFrame(columns=columns, data=content)\n",
" \n",
" df_candidates = pd.concat([df_candidates, df_conc])\n",
" index_pairs.append((idx1, idx2))\n",
" \n",
" return df_candidates, index_pairs\n",
"\n",
"def choose_cosSim_dupl_candidates(\n",
" cosineSim_filt: Series,\n",
" embeddings: dict[int, tuple['Embedding',str]],\n",
") -> tuple[DataFrame, list[tuple['Index', 'Index']]]:\n",
" # compare found duplicates\n",
" columns = ['idx1', 'text1', 'idx2', 'text2', 'score']\n",
" df_candidates = pd.DataFrame(columns=columns)\n",
" \n",
" index_pairs = list()\n",
"\n",
" for ((idx1, idx2), score) in cosineSim_filt.items():\n",
" # get texts for comparison\n",
" text1 = embeddings[idx1][1]\n",
" text2 = embeddings[idx2][1]\n",
" # get decision\n",
" print('---------- New Decision ----------')\n",
" print('text1:\\n', text1, '\\n', flush=True)\n",
" print('text2:\\n', text2, '\\n', flush=True)\n",
" decision = input('Please enter >>y<< if this is a duplicate, else hit enter:')\n",
" \n",
" if not decision == 'y':\n",
" continue\n",
" \n",
" # get text content from embedding as second tuple entry\n",
" content = [[\n",
" idx1,\n",
" text1,\n",
" idx2,\n",
" text2,\n",
" score,\n",
" ]]\n",
" df_conc = pd.DataFrame(columns=columns, data=content)\n",
" \n",
" df_candidates = pd.concat([df_candidates, df_conc])\n",
" index_pairs.append((idx1, idx2))\n",
" \n",
" return df_candidates, index_pairs"
]
},
{
"cell_type": "code",
"execution_count": 367,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"33 176 0.921449\n",
" 247 0.903092\n",
"332 558 0.987194\n",
"157 247 0.812700\n",
"176 247 0.816763\n",
"34 63 0.952310\n",
"477 247 0.831053\n",
"111 360 0.991955\n",
"53 56 0.866648\n",
" 15 0.871172\n",
"56 15 0.989507\n",
"84 191 0.999377\n",
"28 173 0.836900\n",
"184 40 0.959962\n",
"602 255 0.800500\n",
"29 78 0.939677\n",
"732 185 0.815442\n",
"136 174 0.943705\n",
"680 106 0.889502\n",
"6580 3371 0.866680\n",
"185 566 0.978342\n",
"dtype: float32"
]
},
"execution_count": 367,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"SIMILARITY_THRESHOLD = 0.8\n",
"\n",
"cosineSim_filt = filt_thresh_cosSim_matrix(\n",
" threshold=SIMILARITY_THRESHOLD,\n",
" cosineSim_idx_matrix=cosineSim_idx_matrix,\n",
")\n",
"cosineSim_filt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>idx1</th>\n",
" <th>text1</th>\n",
" <th>idx2</th>\n",
" <th>text2</th>\n",
" <th>score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>332</td>\n",
" <td>Prüfung von: - Scharniere - Dichtung - Schlie...</td>\n",
" <td>558</td>\n",
" <td>Monatliche Prüfung von: - Scharniere - Dichtu...</td>\n",
" <td>0.987194</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>111</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten durch die...</td>\n",
" <td>360</td>\n",
" <td>Wöchentliche Interne Wartungstätigkeiten durch...</td>\n",
" <td>0.991955</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>56</td>\n",
" <td>Vorgaben aus Brückner Wartungsplan (siehe Extr...</td>\n",
" <td>15</td>\n",
" <td>Vorgaben aus Brückner Wartungsplan siehe Extr...</td>\n",
" <td>0.989507</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>84</td>\n",
" <td>Vorgabe aus Wartungsplan Firma Menzel (siehe V...</td>\n",
" <td>191</td>\n",
" <td>Vorgabe aus Wartungsplan Firma Menzel (siehe V...</td>\n",
" <td>0.999377</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>185</td>\n",
" <td>Vorgabe aus Wartungsplan Firma Menzel (siehe V...</td>\n",
" <td>566</td>\n",
" <td>Vorgabe aus Wartungsplan Firma Menzel (siehe V...</td>\n",
" <td>0.978342</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" idx1 text1 idx2 \\\n",
"0 332 Prüfung von: - Scharniere - Dichtung - Schlie... 558 \n",
"0 111 Tägliche Interne Wartungstätigkeiten durch die... 360 \n",
"0 56 Vorgaben aus Brückner Wartungsplan (siehe Extr... 15 \n",
"0 84 Vorgabe aus Wartungsplan Firma Menzel (siehe V... 191 \n",
"0 185 Vorgabe aus Wartungsplan Firma Menzel (siehe V... 566 \n",
"\n",
" text2 score \n",
"0 Monatliche Prüfung von: - Scharniere - Dichtu... 0.987194 \n",
"0 Wöchentliche Interne Wartungstätigkeiten durch... 0.991955 \n",
"0 Vorgaben aus Brückner Wartungsplan siehe Extr... 0.989507 \n",
"0 Vorgabe aus Wartungsplan Firma Menzel (siehe V... 0.999377 \n",
"0 Vorgabe aus Wartungsplan Firma Menzel (siehe V... 0.978342 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cosSim_dupl_candidates, dupl_idx_pairs = list_cosSim_dupl_candidates(\n",
" cosineSim_filt=cosineSim_filt,\n",
" embeddings=embds,\n",
")\n",
"# save results\n",
"SAVE_PATH_DUPL_CANDIDATES = (f'./Filterung_Duplikate/dupl_candidates_'\n",
" f'cosSim_thresh_{SIMILARITY_THRESHOLD}.xlsx')\n",
"#cosSim_dupl_candidates.to_excel(SAVE_PATH_DUPL_CANDIDATES)\n",
"cosSim_dupl_candidates"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Nächste Schritte:**\n",
"- Grenz-Threshold finden, bei dem Duplikate gerade noch richtig erkannt werden"
]
},
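{
"cell_type": "markdown",
"metadata": {},
"source": [
"The threshold search can be tried out on a small hand-made matrix first. This is a minimal sketch, assuming an upper-triangular similarity matrix shaped like the one built above. All `toy_*` names and the similarity values are made up for illustration. The explicit `dropna()` only makes the removal of masked entries independent of the default behaviour of `stack()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch only: count candidate pairs per threshold on toy data\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"toy_idx = [10, 11, 12]\n",
"toy_matrix = pd.DataFrame(\n",
"    np.array([[0.0, 0.95, 0.40],\n",
"              [0.0, 0.00, 0.82],\n",
"              [0.0, 0.00, 0.00]]),\n",
"    index=toy_idx, columns=toy_idx,\n",
")\n",
"\n",
"for toy_thresh in (0.8, 0.9):\n",
"    # where() masks entries below the threshold with NaN, which are then dropped\n",
"    toy_pairs = toy_matrix.where(toy_matrix >= toy_thresh).stack().dropna()\n",
"    print(toy_thresh, len(toy_pairs), list(toy_pairs.index))"
]
},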
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"thresholds = (0.75, 0.8, 0.85, 0.9, 0.93, 0.95, 0.96, 0.97, 0.98)\n",
"\n",
"for thresh in thresholds:\n",
" \n",
" cosineSim_filt = filt_thresh_cosSim_matrix(\n",
" threshold=thresh,\n",
" cosineSim_idx_matrix=cosineSim_idx_matrix.copy(),\n",
" )\n",
" \n",
" cosSim_dupl_candidates = list_cosSim_dupl_candidates(\n",
" cosineSim_filt=cosineSim_filt,\n",
" embeddings=embds,\n",
" )\n",
" \n",
" # saving path\n",
" saving_path = (f'./Filterung_Duplikate/dupl_candidates_'\n",
" f'cosSim_thresh_{thresh}_STFR.xlsx')\n",
" \n",
" cosSim_dupl_candidates.to_excel(saving_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Ergebnisse:**\n",
"- kein allgemeiner Threshold ableitbar, nur grober Richtwert\n",
"- Paare mit geringerem Score stellenweise ähnlicher als die mit höherem Score\n",
"- finale Entscheidung für Duplikat händisch, da Kontextwissen trotzdem notwendig\n",
"- Arbeit mit ``temp1`` und merging von Einträgen"
]
},
{
"cell_type": "code",
"execution_count": 368,
"metadata": {},
"outputs": [],
"source": [
"# manually decide if candidates are indeed duplicates\n",
"\n",
"SKIP = True\n",
"if not SKIP:\n",
" cosSim_dupl_candidates, dupl_idx_pairs = choose_cosSim_dupl_candidates(\n",
" cosineSim_filt=cosineSim_filt,\n",
" embeddings=embds,\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#save_pickle(obj=dupl_idx_pairs, path='./Filterung_Duplikate/dupl_idx_pairs_Exp4.pkl')"
]
},
{
"cell_type": "code",
"execution_count": 369,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(33, 176),\n",
" (332, 558),\n",
" (34, 63),\n",
" (53, 56),\n",
" (53, 15),\n",
" (56, 15),\n",
" (84, 191),\n",
" (29, 78),\n",
" (136, 174),\n",
" (680, 106),\n",
" (185, 566)]"
]
},
"execution_count": 369,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dupl_idx_pairs = load_pickle(path='./Filterung_Duplikate/dupl_idx_pairs_Exp4.pkl')\n",
"dupl_idx_pairs"
]
},
{
"cell_type": "code",
"execution_count": 433,
"metadata": {},
"outputs": [],
"source": [
"temp2 = temp1.copy()"
]
},
{
"cell_type": "code",
"execution_count": 434,
"metadata": {},
"outputs": [],
"source": [
"# merge duplicates\n",
"\n",
"# to-do:\n",
"# merge: 'num_occur', 'assoc_obj_ids', \n",
"# recalc: 'num_assoc_obj_ids'\n",
"\n",
"for (i1, i2) in dupl_idx_pairs:\n",
" \n",
" # if an entry does not exist anymore, skip this pair\n",
" if i1 not in temp2.index or i2 not in temp2.index:\n",
" continue\n",
" \n",
" # merge num occur\n",
" num_occur1 = temp2.at[i1, 'num_occur']\n",
" num_occur2 = temp2.at[i2, 'num_occur']\n",
" new_num_occur = num_occur1 + num_occur2\n",
"\n",
" # merge assoc obj ids\n",
" assoc_ids1 = temp2.at[i1, 'assoc_obj_ids']\n",
" assoc_ids2 = temp2.at[i2, 'assoc_obj_ids']\n",
" new_assoc_ids = np.append(assoc_ids1, assoc_ids2)\n",
" new_assoc_ids = np.unique(new_assoc_ids.flatten())\n",
"\n",
" # recalc num assoc obj ids\n",
" new_num_assoc_obj_ids = len(new_assoc_ids)\n",
"\n",
" # write porperties to first entry\n",
" temp2.at[i1, 'num_occur'] = new_num_occur\n",
" temp2.at[i1, 'assoc_obj_ids'] = new_assoc_ids\n",
" temp2.at[i1, 'num_assoc_obj_ids'] = new_num_assoc_obj_ids\n",
" \n",
" # drop second entry\n",
" temp2 = temp2.drop(index=i2)"
]
},
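{
"cell_type": "markdown",
"metadata": {},
"source": [
"The merge loop above skips a pair as soon as one of its entries has already been dropped. For chained pairs such as `(a, b)` followed by `(b, c)`, this may leave `c` unmerged. One possible way around this is to group the pairs into connected sets first and then merge each set into one representative. The following is a minimal sketch using a simple union-find structure; `group_duplicate_pairs` is a hypothetical helper and not part of the analysis above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch only: group duplicate index pairs into connected sets\n",
"def group_duplicate_pairs(pairs):\n",
"    parent = {}\n",
"\n",
"    def find(x):\n",
"        # register unseen entries as their own representative\n",
"        parent.setdefault(x, x)\n",
"        while parent[x] != x:\n",
"            # shorten the path to the representative while walking up\n",
"            parent[x] = parent[parent[x]]\n",
"            x = parent[x]\n",
"        return x\n",
"\n",
"    for a, b in pairs:\n",
"        root_a, root_b = find(a), find(b)\n",
"        if root_a != root_b:\n",
"            # attach the group of the second entry to the group of the first\n",
"            parent[root_b] = root_a\n",
"\n",
"    groups = {}\n",
"    for x in parent:\n",
"        groups.setdefault(find(x), []).append(x)\n",
"    return groups\n",
"\n",
"# subset of the pairs listed above, used as toy input\n",
"toy_groups = group_duplicate_pairs([(53, 56), (53, 15), (56, 15), (84, 191)])\n",
"toy_groups"
]
},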
{
"cell_type": "code",
"execution_count": 435,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>len</th>\n",
" <th>num_occur</th>\n",
" <th>assoc_obj_ids</th>\n",
" <th>num_assoc_obj_ids</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>162</th>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>66</td>\n",
" <td>92592</td>\n",
" <td>[0, 17, 41, 42, 43, 44, 45, 46, 47, 51, 52, 53...</td>\n",
" <td>206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>Wöchentliche Sichtkontrolle / Reinigung</td>\n",
" <td>39</td>\n",
" <td>1654</td>\n",
" <td>[301, 304, 305, 313, 314, 331, 332, 510, 511, ...</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>131</th>\n",
" <td>Tägliche Überprüfung der Ölabscheider</td>\n",
" <td>37</td>\n",
" <td>1616</td>\n",
" <td>[0, 970, 2134, 2137]</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>160</th>\n",
" <td>Wöchentliche Kontrolle der WC-Anlagen</td>\n",
" <td>37</td>\n",
" <td>1265</td>\n",
" <td>[1352, 1353, 1354, 1684, 1685, 1686, 1687, 168...</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>140</th>\n",
" <td>Halbjährliche Kontrolle des Stabbreithalters</td>\n",
" <td>44</td>\n",
" <td>687</td>\n",
" <td>[51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 6...</td>\n",
" <td>166</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2679</th>\n",
" <td>Zahnräder der Laufkatze verschlissen Ersatztei...</td>\n",
" <td>170</td>\n",
" <td>1</td>\n",
" <td>[415]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2678</th>\n",
" <td>Bitte 8 Scheiben nach Muster anfertigen. Danke.</td>\n",
" <td>48</td>\n",
" <td>1</td>\n",
" <td>[140]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2677</th>\n",
" <td>Schalter für Bühne Schwenken abgerissen, bitte...</td>\n",
" <td>126</td>\n",
" <td>1</td>\n",
" <td>[323]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2676</th>\n",
" <td>Docke angefahren!</td>\n",
" <td>17</td>\n",
" <td>1</td>\n",
" <td>[176]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6799</th>\n",
" <td>Befestigung Deckel für Batteriefach defekt ...</td>\n",
" <td>107</td>\n",
" <td>1</td>\n",
" <td>[326]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6800 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" descr len num_occur \\\n",
"162 Tägliche Wartungstätigkeiten nach Vorgabe des ... 66 92592 \n",
"33 Wöchentliche Sichtkontrolle / Reinigung 39 1654 \n",
"131 Tägliche Überprüfung der Ölabscheider 37 1616 \n",
"160 Wöchentliche Kontrolle der WC-Anlagen 37 1265 \n",
"140 Halbjährliche Kontrolle des Stabbreithalters 44 687 \n",
"... ... ... ... \n",
"2679 Zahnräder der Laufkatze verschlissen Ersatztei... 170 1 \n",
"2678 Bitte 8 Scheiben nach Muster anfertigen. Danke. 48 1 \n",
"2677 Schalter für Bühne Schwenken abgerissen, bitte... 126 1 \n",
"2676 Docke angefahren! 17 1 \n",
"6799 Befestigung Deckel für Batteriefach defekt ... 107 1 \n",
"\n",
" assoc_obj_ids num_assoc_obj_ids \n",
"162 [0, 17, 41, 42, 43, 44, 45, 46, 47, 51, 52, 53... 206 \n",
"33 [301, 304, 305, 313, 314, 331, 332, 510, 511, ... 18 \n",
"131 [0, 970, 2134, 2137] 4 \n",
"160 [1352, 1353, 1354, 1684, 1685, 1686, 1687, 168... 11 \n",
"140 [51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 6... 166 \n",
"... ... ... \n",
"2679 [415] 1 \n",
"2678 [140] 1 \n",
"2677 [323] 1 \n",
"2676 [176] 1 \n",
"6799 [326] 1 \n",
"\n",
"[6800 rows x 5 columns]"
]
},
"execution_count": 435,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1"
]
},
{
"cell_type": "code",
"execution_count": 436,
"metadata": {},
"outputs": [],
"source": [
"temp2['assoc_obj_ids'] = temp2['assoc_obj_ids'].map(lambda x: x.tolist())"
]
},
{
"cell_type": "code",
"execution_count": 437,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>len</th>\n",
" <th>num_occur</th>\n",
" <th>assoc_obj_ids</th>\n",
" <th>num_assoc_obj_ids</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>162</th>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>66</td>\n",
" <td>92592</td>\n",
" <td>[0, 17, 41, 42, 43, 44, 45, 46, 47, 51, 52, 53...</td>\n",
" <td>206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>Wöchentliche Sichtkontrolle / Reinigung</td>\n",
" <td>39</td>\n",
" <td>2015</td>\n",
" <td>[301, 304, 305, 313, 314, 323, 329, 331, 332, ...</td>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>131</th>\n",
" <td>Tägliche Überprüfung der Ölabscheider</td>\n",
" <td>37</td>\n",
" <td>1616</td>\n",
" <td>[0, 970, 2134, 2137]</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>160</th>\n",
" <td>Wöchentliche Kontrolle der WC-Anlagen</td>\n",
" <td>37</td>\n",
" <td>1265</td>\n",
" <td>[1352, 1353, 1354, 1684, 1685, 1686, 1687, 168...</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>140</th>\n",
" <td>Halbjährliche Kontrolle des Stabbreithalters</td>\n",
" <td>44</td>\n",
" <td>687</td>\n",
" <td>[51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 6...</td>\n",
" <td>166</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2679</th>\n",
" <td>Zahnräder der Laufkatze verschlissen Ersatztei...</td>\n",
" <td>170</td>\n",
" <td>1</td>\n",
" <td>[415]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2678</th>\n",
" <td>Bitte 8 Scheiben nach Muster anfertigen. Danke.</td>\n",
" <td>48</td>\n",
" <td>1</td>\n",
" <td>[140]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2677</th>\n",
" <td>Schalter für Bühne Schwenken abgerissen, bitte...</td>\n",
" <td>126</td>\n",
" <td>1</td>\n",
" <td>[323]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2676</th>\n",
" <td>Docke angefahren!</td>\n",
" <td>17</td>\n",
" <td>1</td>\n",
" <td>[176]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6799</th>\n",
" <td>Befestigung Deckel für Batteriefach defekt ...</td>\n",
" <td>107</td>\n",
" <td>1</td>\n",
" <td>[326]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6790 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" descr len num_occur \\\n",
"162 Tägliche Wartungstätigkeiten nach Vorgabe des ... 66 92592 \n",
"33 Wöchentliche Sichtkontrolle / Reinigung 39 2015 \n",
"131 Tägliche Überprüfung der Ölabscheider 37 1616 \n",
"160 Wöchentliche Kontrolle der WC-Anlagen 37 1265 \n",
"140 Halbjährliche Kontrolle des Stabbreithalters 44 687 \n",
"... ... ... ... \n",
"2679 Zahnräder der Laufkatze verschlissen Ersatztei... 170 1 \n",
"2678 Bitte 8 Scheiben nach Muster anfertigen. Danke. 48 1 \n",
"2677 Schalter für Bühne Schwenken abgerissen, bitte... 126 1 \n",
"2676 Docke angefahren! 17 1 \n",
"6799 Befestigung Deckel für Batteriefach defekt ... 107 1 \n",
"\n",
" assoc_obj_ids num_assoc_obj_ids \n",
"162 [0, 17, 41, 42, 43, 44, 45, 46, 47, 51, 52, 53... 206 \n",
"33 [301, 304, 305, 313, 314, 323, 329, 331, 332, ... 23 \n",
"131 [0, 970, 2134, 2137] 4 \n",
"160 [1352, 1353, 1354, 1684, 1685, 1686, 1687, 168... 11 \n",
"140 [51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 6... 166 \n",
"... ... ... \n",
"2679 [415] 1 \n",
"2678 [140] 1 \n",
"2677 [323] 1 \n",
"2676 [176] 1 \n",
"6799 [326] 1 \n",
"\n",
"[6790 rows x 5 columns]"
]
},
"execution_count": 437,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 438,
"metadata": {},
"outputs": [],
"source": [
"LOAD_CALC_FILES = False\n",
"\n",
"# save/load dataframe\n",
"#FILE_PATH = 'VorgangsBeschreibung_analyse_20240306.fth'\n",
"FILE_PATH = 'VorgangsBeschreibung_analyse_20240306.pkl'\n",
"if LOAD_CALC_FILES:\n",
" temp1 = pd.read_feather(FILE_PATH)\n",
" temp1 = temp1.set_index('index')\n",
"else:\n",
" save_df = temp2.copy()\n",
" #save_df = temp2.reset_index()\n",
" #save_df.to_feather(FILE_PATH)\n",
" #save_df.to_parquet(FILE_PATH)\n",
" save_df.to_pickle(FILE_PATH)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Handling von Rechtschreibfehlern (Hunspell über PyEnchant)\n",
"- Handling von Vector-Embeddings über Transformer-Modelle:\n",
" - höhere Fehlertoleranz (Rechtschreibung, redundante oder unbedeutende Worte)\n",
" - nicht angewiesen, dass jedes Wort im Vocabulary vorkommt (vgl. spaCy-Modell)\n",
" - bei ersten Versuchen höhere Genauigkeit bei der Erkennung tatsächlicher Duplikate\n",
"- Nutzung Vector-Embeddings für Duplikatfindung"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ---> Model Training: Data Set"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# data for model training\n",
"data = temp1.iloc[50:300,0].to_list()\n",
"data = [e for e in data if e != '']\n",
"\n",
"with open('spacy_train/training_data_2.txt','w', encoding='utf-8') as f:\n",
" f.writelines(\"\\n\".join(data))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 234,
"metadata": {},
"outputs": [],
"source": [
"# save/load dataframe\n",
"FILE_PATH = 'VorgangsBeschreibung_analyse_1.fth'\n",
"if LOAD_CALC_FILES:\n",
" temp1 = pd.read_feather(FILE_PATH)\n",
" temp1 = temp1.set_index('index')\n",
"else:\n",
" save_df = temp1.reset_index()\n",
" save_df.to_feather(FILE_PATH)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### spaCy"
]
},
{
"cell_type": "code",
"execution_count": 245,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Durchführung: Sollwert: 20 0,1g'"
]
},
"execution_count": 245,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string = temp1.iloc[-2,0]\n",
"#string = temp1.iloc[0,0]\n",
"string"
]
},
{
"cell_type": "code",
"execution_count": 246,
"metadata": {},
"outputs": [],
"source": [
"string = 'Ich spiele jeden Tag mit den Kindern im Garten. Das ist schön.'\n",
"string = 'Die Maschine XYZ ist aufgrund einer Störung im Druckluftsystem defekt.'\n",
"#string = 'The machine XYZ is broken because of a failure in the air pressure system.'\n",
"#string = 'Wir benötigen das Werkzeug von Herr Stöppel, um das derzeit abzuarbeiten.Dies wird durch Herrn Strebe getan.'"
]
},
{
"cell_type": "code",
"execution_count": 247,
"metadata": {},
"outputs": [],
"source": [
"doc = nlp(string)"
]
},
{
"cell_type": "code",
"execution_count": 248,
"metadata": {},
"outputs": [],
"source": [
"# simulate occurence counter\n",
"OCC_COUNTER = 10"
]
},
{
"cell_type": "code",
"execution_count": 249,
"metadata": {},
"outputs": [],
"source": [
"SPELL_CHECK_NON_CHARS = set([' ', '.', ',', ';', ':', '-'])\n",
"CLEANING = True\n",
"#CLEANING = False\n",
"\n",
"def pre_clean_word(string: str) -> str:\n",
" \n",
" pattern = r'[^A-Za-zäöüÄÖÜ]+'\n",
" string = re.sub(pattern, '', string)\n",
" \"\"\"\n",
" for char in SPELL_CHECK_NON_CHARS:\n",
" string = string.replace(char, '')\n",
" \"\"\"\n",
" \n",
" return string\n",
"\n",
"# https://stackoverflow.com/questions/25341945/check-if-string-has-date-any-format \n",
"def is_str_date(string, fuzzy=False):\n",
" \n",
" try:\n",
" parse(string, fuzzy=fuzzy)\n",
" return True\n",
" except ValueError:\n",
" return False\n",
"\n",
"\n",
"def obtain_sub_tree(token):\n",
" # check if token is a POS of interest\n",
" descendants = list(token.subtree)\n",
" descendants.remove(token)\n",
" logger.debug(f'Token >>{token}<< has subtree >>{descendants}<<')\n",
" return descendants\n",
"\n",
"\n",
"def add_children_descendants(\n",
" parent,\n",
" weight,\n",
" connections,\n",
" unique_tokens,\n",
" children_sents,\n",
" map_2_word: dict[str, str] | None = None,\n",
"):\n",
" global CLEANING\n",
" # add child as key\n",
" if CLEANING:\n",
" parent_lemma = pre_clean_word(string=parent.lemma_)\n",
" \n",
" # map words\n",
" if word_2_map is not None:\n",
" if parent_lemma.lower() in map_2_word:\n",
" parent_lemma = map_2_word[parent_lemma.lower()]\n",
" #logger.info(f\"[SUCCESS] Mapped PARENT to {parent_lemma}\")\n",
" \n",
" if parent_lemma != '':\n",
" if (parent_lemma, parent.pos_) in connections:\n",
" connections[(parent_lemma, parent.pos_)].append(children_sents)\n",
" connections[(parent_lemma, parent.pos_)].append(children_sents)\n",
" #connections[parent.lemma_].append([descendant.lemma_, descendant])\n",
" else:\n",
" # do not add auxiliary words\n",
" if parent.pos_ != 'AUX':\n",
" unique_tokens.add(parent_lemma)\n",
" connections[(parent_lemma, parent.pos_)] = list()\n",
" connections[(parent_lemma, parent.pos_)].append(children_sents)\n",
" #connections[parent.lemma_].append([descendant.lemma_, descendant])\n",
" else:\n",
" if (parent.lemma_, parent.pos_) in connections:\n",
" connections[(parent.lemma_, parent.pos_)].append(children_sents)\n",
" connections[(parent.lemma_, parent.pos_)].append(children_sents)\n",
" #connections[parent.lemma_].append([descendant.lemma_, descendant])\n",
" else:\n",
" # do not add auxiliary words\n",
" if parent.pos_ != 'AUX':\n",
" unique_tokens.add(parent.lemma_)\n",
" connections[(parent.lemma_, parent.pos_)] = list()\n",
" connections[(parent.lemma_, parent.pos_)].append(children_sents)\n",
" #connections[parent.lemma_].append([descendant.lemma_, descendant])\n",
"\n",
"\n",
"def obtain_descendant_info(\n",
" doc,\n",
" weight,\n",
" POS_of_interest,\n",
" TAG_of_interest,\n",
" connections,\n",
" unique_tokens,\n",
" spell_check_candidates,\n",
" spell_check_whitelist,\n",
" spell_checker,\n",
" corrections,\n",
" map_2_word: dict[str, str] | None = None,\n",
"):\n",
" global GENERAL_BLACKLIST\n",
" global DESC_BLACKLIST\n",
" global CLEANING\n",
" \n",
" # iterate over sentences\n",
" for sent in doc.sents:\n",
" # [REWORK] spell check list\n",
" spell_check_words = list()\n",
" \n",
" # iterate over tokens in one sentence\n",
" for token in sent:\n",
" \n",
" if not (token.pos_ in POS_of_interest or token.tag_ in TAG_of_interest):\n",
" continue\n",
" elif token.lemma_.lower() in GENERAL_BLACKLIST:\n",
" logger.debug(f'Eliminated parent >>{token}<< because of blacklist')\n",
" continue\n",
" \n",
" # [REWORK] spell check\n",
" \"\"\"\n",
" if token.lemma_.lower() not in spell_check_whitelist:\n",
" word = pre_clean_word(string=token.lemma_.lower())\n",
" if word in corrections:\n",
" word = corrections[word]\n",
" elif not word.isdigit():\n",
" spell_check_words.append(word)\n",
" \"\"\"\n",
" \n",
" descendants = obtain_sub_tree(token=token)\n",
" \n",
" # iterate over all children if there are any\n",
" if descendants is not None:\n",
" # list with all children in the current sentence\n",
" children_sents = list()\n",
" \n",
" for child in descendants:\n",
" logger.debug(f'Token is >>{token}<< with child >>{child}<< and POS {child.pos_}')\n",
" \n",
" # elimnate cases of cross-references with verbs\n",
" if ((token.pos_ == 'AUX' or token.pos_ == 'VERB') and\n",
" (child.pos_ == 'AUX' or child.pos_ == 'VERB')):\n",
" continue\n",
" elif not (child.pos_ in POS_of_interest or child.tag_ in TAG_of_interest):\n",
" continue\n",
" elif child.lemma_.lower() in GENERAL_BLACKLIST:\n",
" logger.debug(f'Eliminated child >>{child}<< because of blacklist')\n",
" continue\n",
" \n",
" \n",
" if CLEANING:\n",
" child = pre_clean_word(string=child.lemma_)\n",
" if child == '':\n",
" continue\n",
" #child = pre_clean_word(string=child)\n",
" \n",
" if (child not in DESC_BLACKLIST and\n",
" not is_str_date(string=child)):\n",
" #not is_str_date(string=child.text)):\n",
" #children_sents.append((child.lemma_, weight))\n",
" \n",
" # map words\n",
" if map_2_word is not None:\n",
" if child.lower() in map_2_word:\n",
" child = map_2_word[child.lower()]\n",
" #logger.info(f\"[SUCCESS] Mapped CHILD to {child}\")\n",
" \n",
" children_sents.append((child, weight))\n",
" \n",
" #if child.lemma_ not in unique_tokens:\n",
" if child not in unique_tokens:\n",
" #unique_tokens.add(child.lemma_)\n",
" unique_tokens.add(child)\n",
" \n",
" else:\n",
" if (child.lemma_ not in DESC_BLACKLIST and\n",
" not is_str_date(string=child.text)):\n",
" children_sents.append((child.lemma_, weight))\n",
" \n",
" if child.lemma_ not in unique_tokens:\n",
" unique_tokens.add(child.lemma_)\n",
" \n",
" # [REWORK] spell check\n",
" \"\"\"\n",
" if child.lemma_.lower() not in spell_check_whitelist:\n",
" word = pre_clean_word(string=child.lemma_.lower())\n",
" if word in corrections:\n",
" word = corrections[word]\n",
" elif not word.isdigit():\n",
" spell_check_words.append(word)\n",
" \"\"\"\n",
" \n",
" # add list of children for current parent if not empty\n",
" if children_sents:\n",
" \n",
" add_children_descendants(\n",
" parent=token,\n",
" weight=weight,\n",
" connections=connections,\n",
" unique_tokens=unique_tokens,\n",
" children_sents=children_sents,\n",
" map_2_word=map_2_word,\n",
" )\n",
" \n",
" misspelled_candidates = spell_checker.unknown(spell_check_words)\n",
" spell_check_candidates.update(misspelled_candidates)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 250,
"metadata": {},
"outputs": [],
"source": [
"def obtain_adj_matrix(unique_tokens, connections):\n",
"\n",
"    # square matrix with one row and one column per unique token\n",
"    adj_mat = pd.DataFrame(\n",
"        data=0, \n",
"        columns=list(unique_tokens), \n",
"        index=list(unique_tokens),\n",
"        dtype=np.uint32,\n",
"    )\n",
"    \n",
"    for (pred, POS), descendants_list in connections.items():\n",
"        #print(f'{pred=}, {descendants=}')\n",
"        \n",
"        for descendants in descendants_list:\n",
"            #print(f'{descendants}')\n",
"            \n",
"            if POS != 'AUX':\n",
"                for (desc, weight) in descendants:\n",
"                    adj_mat.at[pred, desc] += weight\n",
"            \n",
"            else:\n",
"                if len(descendants) > 1:\n",
"                    # if auxiliary word, make connection between all associated words\n",
"                    combs = combinations(descendants, r=2)\n",
"                    \n",
"                    for comb in combs:\n",
"                        # comb is tuple ((word_1, weight), (word_2, weight))\n",
"                        weight = comb[0][1]\n",
"                        word_1 = comb[0][0]\n",
"                        word_2 = comb[1][0]\n",
"                        \n",
"                        \"\"\"\n",
"                        if ((word_1 == 'Eigenverantwortlichkeit' or word_1 == 'neu') and\n",
"                            (word_2 == 'Eigenverantwortlichkeit' or word_2 == 'neu')):\n",
"                            print(f'Hello from {pred=} with {descendants=}')\n",
"                        \"\"\"\n",
"                        \n",
"                        adj_mat.at[word_1, word_2] += weight\n",
"    \n",
"    return adj_mat\n",
"\n",
"\n",
"def make_undir_adj_matrix(adj_mat):\n",
"    \n",
"    adj_mat_undir = adj_mat.copy()\n",
"    arr = adj_mat_undir.to_numpy()\n",
"    arr_upper = np.triu(arr)\n",
"    arr_lower = np.tril(arr)\n",
"    # mirror the lower triangle onto the upper one (rot90 after fliplr equals the transpose)\n",
"    arr_lower = np.rot90(np.fliplr(arr_lower))\n",
"    # note: both triangles include the diagonal, so diagonal entries are counted twice\n",
"    arr_new = arr_lower + arr_upper\n",
"    \n",
"    adj_mat_undir.loc[:] = arr_new\n",
"    \n",
"    return adj_mat_undir"
]
},
{
"cell_type": "code",
"execution_count": 251,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<span class=\"tex2jax_ignore\"><svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"de\" id=\"1293a8bd098c40caafc3b29af76d443f-0\" class=\"displacy\" width=\"1800\" height=\"399.5\" direction=\"ltr\" style=\"max-width: none; height: 399.5px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr\">\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Die</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">DET</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"225\">Maschine</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"225\">NOUN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"400\">XYZ</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"400\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"575\">ist</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"575\">AUX</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"750\">aufgrund</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"750\">ADP</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"925\">einer</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"925\">DET</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1100\">Störung</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1100\">NOUN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1275\">im</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1275\">ADP</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1450\">Druckluftsystem</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1450\">NOUN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1625\">defekt.</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1625\">ADV</tspan>\n",
"</text>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1293a8bd098c40caafc3b29af76d443f-0-0\" stroke-width=\"2px\" d=\"M70,264.5 C70,177.0 215.0,177.0 215.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1293a8bd098c40caafc3b29af76d443f-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nk</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M70,266.5 L62,254.5 78,254.5\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1293a8bd098c40caafc3b29af76d443f-0-1\" stroke-width=\"2px\" d=\"M245,264.5 C245,89.5 570.0,89.5 570.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1293a8bd098c40caafc3b29af76d443f-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">sb</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M245,266.5 L237,254.5 253,254.5\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1293a8bd098c40caafc3b29af76d443f-0-2\" stroke-width=\"2px\" d=\"M245,264.5 C245,177.0 390.0,177.0 390.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1293a8bd098c40caafc3b29af76d443f-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nk</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M390.0,266.5 L398.0,254.5 382.0,254.5\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1293a8bd098c40caafc3b29af76d443f-0-3\" stroke-width=\"2px\" d=\"M595,264.5 C595,177.0 740.0,177.0 740.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1293a8bd098c40caafc3b29af76d443f-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">mo</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M740.0,266.5 L748.0,254.5 732.0,254.5\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1293a8bd098c40caafc3b29af76d443f-0-4\" stroke-width=\"2px\" d=\"M945,264.5 C945,177.0 1090.0,177.0 1090.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1293a8bd098c40caafc3b29af76d443f-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nk</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M945,266.5 L937,254.5 953,254.5\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1293a8bd098c40caafc3b29af76d443f-0-5\" stroke-width=\"2px\" d=\"M770,264.5 C770,89.5 1095.0,89.5 1095.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1293a8bd098c40caafc3b29af76d443f-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nk</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1095.0,266.5 L1103.0,254.5 1087.0,254.5\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1293a8bd098c40caafc3b29af76d443f-0-6\" stroke-width=\"2px\" d=\"M1120,264.5 C1120,177.0 1265.0,177.0 1265.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1293a8bd098c40caafc3b29af76d443f-0-6\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">mnr</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1265.0,266.5 L1273.0,254.5 1257.0,254.5\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1293a8bd098c40caafc3b29af76d443f-0-7\" stroke-width=\"2px\" d=\"M1295,264.5 C1295,177.0 1440.0,177.0 1440.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1293a8bd098c40caafc3b29af76d443f-0-7\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nk</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1440.0,266.5 L1448.0,254.5 1432.0,254.5\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1293a8bd098c40caafc3b29af76d443f-0-8\" stroke-width=\"2px\" d=\"M595,264.5 C595,2.0 1625.0,2.0 1625.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1293a8bd098c40caafc3b29af76d443f-0-8\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pd</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1625.0,266.5 L1633.0,254.5 1617.0,254.5\" fill=\"currentColor\"/>\n",
"</g>\n",
"</svg></span>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"spacy.displacy.render(doc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Entire dataset"
]
},
{
"cell_type": "code",
"execution_count": 252,
"metadata": {},
"outputs": [],
"source": [
"# analyze the entries (the restriction to the first rows is commented out below)\n",
"descr = temp1[['descr', 'num_occur']]\n",
"#descr = descr.iloc[:7,:]"
]
},
{
"cell_type": "code",
"execution_count": 253,
"metadata": {},
"outputs": [],
"source": [
"#descr.iat[0,0] = 'Das ist ein Test am 24.08.2023'"
]
},
{
"cell_type": "code",
"execution_count": 254,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6753"
]
},
"execution_count": 254,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(descr)"
]
},
{
"cell_type": "code",
"execution_count": 255,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>num_occur</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>161</th>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>92592</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>Wöchentliche Sichtkontrolle Reinigung</td>\n",
" <td>1654</td>\n",
" </tr>\n",
" <tr>\n",
" <th>130</th>\n",
" <td>Tägliche Überprüfung der Ölabscheider</td>\n",
" <td>1616</td>\n",
" </tr>\n",
" <tr>\n",
" <th>159</th>\n",
" <td>Wöchentliche Kontrolle der WC-Anlagen</td>\n",
" <td>1265</td>\n",
" </tr>\n",
" <tr>\n",
" <th>139</th>\n",
" <td>Halbjährliche Kontrolle des Stabbreithalters</td>\n",
" <td>687</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2665</th>\n",
" <td>Überprüfung der Y-Achse Schneidbrücke am LC 2 ...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2664</th>\n",
" <td>Luftschlauch muss ausgetauscht werden. Ist und...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2663</th>\n",
" <td>Riemenscheibe tauschen auf 650 UPM</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2660</th>\n",
" <td>Durchführung: Sollwert: 20 0,1g</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6752</th>\n",
" <td>Befestigung Deckel für Batteriefach defekt Hal...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6753 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" descr num_occur\n",
"161 Tägliche Wartungstätigkeiten nach Vorgabe des ... 92592\n",
"33 Wöchentliche Sichtkontrolle Reinigung 1654\n",
"130 Tägliche Überprüfung der Ölabscheider 1616\n",
"159 Wöchentliche Kontrolle der WC-Anlagen 1265\n",
"139 Halbjährliche Kontrolle des Stabbreithalters 687\n",
"... ... ...\n",
"2665 Überprüfung der Y-Achse Schneidbrücke am LC 2 ... 1\n",
"2664 Luftschlauch muss ausgetauscht werden. Ist und... 1\n",
"2663 Riemenscheibe tauschen auf 650 UPM 1\n",
"2660 Durchführung: Sollwert: 20 0,1g 1\n",
"6752 Befestigung Deckel für Batteriefach defekt Hal... 1\n",
"\n",
"[6753 rows x 2 columns]"
]
},
"execution_count": 255,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"descr"
]
},
{
"cell_type": "code",
"execution_count": 256,
"metadata": {},
"outputs": [],
"source": [
"#LOAD_CALC_FILES = True\n",
"#LOAD_CALC_FILES = False\n",
"#IS_TEST = True\n",
"IS_TEST = False"
]
},
{
"cell_type": "code",
"execution_count": 257,
"metadata": {},
"outputs": [],
"source": [
"spell_check_whitelist = {\n",
" '',\n",
" 'beschlag',\n",
" 'brandschutztechnische',\n",
" 'dichtung',\n",
" 'festhaltevorrichtung',\n",
" 'funktion',\n",
" 'halbjährliche',\n",
" 'kontrolle',\n",
" 'maschinenhersteller',\n",
" 'prüfung',\n",
" 'reinigung',\n",
" 'scharnier',\n",
" 'schließvorrichtung',\n",
" 'schmierung',\n",
" 'sichtkontrolle',\n",
" 'stabbreithalter',\n",
" 'technikrundgang',\n",
" 'vorgabe',\n",
" 'wartungstätigkeit',\n",
" 'wcanlage',\n",
" 'ölabscheider',\n",
" 'abarbeiten',\n",
" 'abgleichen',\n",
" 'abschmieren',\n",
" 'abschmierung',\n",
" 'abteilungsleiter',\n",
" 'akku',\n",
" 'analyse',\n",
" 'arbeitsplan',\n",
" 'aschenbecher',\n",
" 'auffüllen',\n",
" 'auflistung',\n",
" 'befestigungsschraube',\n",
" 'beschädigung',\n",
" 'betriebsstunde',\n",
" 'blombe',\n",
" 'blombieren',\n",
" 'brückner',\n",
" 'campenabwickler',\n",
" 'campenaufwickler',\n",
" 'desinfektionsmittel',\n",
" 'dichtigkeit',\n",
" 'druckkontrolle',\n",
" 'efficiosystem',\n",
" 'eigenverantwortlichkeit',\n",
" 'einrichtung',\n",
" 'email',\n",
" 'erledigungsdatum',\n",
" 'extradate',\n",
" 'extradatum',\n",
" 'filter',\n",
" 'firma',\n",
" 'formplatte',\n",
" 'frostprävention',\n",
" 'gegendruckbolze',\n",
" 'gesamtanlage',\n",
" 'heizungsanlage',\n",
" 'keller',\n",
" 'kesselhauskontrolle',\n",
" 'kesselwasser',\n",
" 'koffer',\n",
" 'kompensator',\n",
" 'kompressorstation',\n",
" 'kondensat',\n",
" 'kühlturm',\n",
" 'kühltürme',\n",
" 'lager',\n",
" 'laserabteilung',\n",
" 'leckage',\n",
" 'leerung',\n",
" 'leiterprüfung',\n",
" 'linearkugellager',\n",
" 'luftdruckkontrolle',\n",
" 'magazin',\n",
" 'maschinenbediener',\n",
" 'messwert',\n",
" 'monat',\n",
" 'motor',\n",
" 'papiermüllbehälter',\n",
" 'personalbüro',\n",
" 'pflasterschrank',\n",
" 'rieme',\n",
" 'rollenkette',\n",
" 'rundgang',\n",
" 'schweißkopf',\n",
" 'schweisskopf',\n",
" 'sichtprüfung',\n",
" 'speisewasser',\n",
" 'sprinkleranlage',\n",
" 'temperatursensor',\n",
" 'terminieren',\n",
" 'ticket',\n",
" 'trommel',\n",
" 'täglicher',\n",
" 'uvröhre',\n",
" 'ventilator',\n",
" 'verbandsmaterial',\n",
" 'verschleiß',\n",
" 'verschleiss',\n",
" 'vorbelegung',\n",
" 'wartung',\n",
" 'wartungsarbeit',\n",
" 'wartungsplan',\n",
" 'wasseraufbereitung',\n",
" 'wasseraufbereitungsanlage',\n",
" 'wasserverbrauch',\n",
" 'weberei',\n",
" 'wumagtrockner',\n",
" 'wäscherkontrolle',\n",
" 'wöchig',\n",
" 'abdichten',\n",
" 'abfluprüfung',\n",
" 'ablesen',\n",
" 'abluftkanal',\n",
" 'absauganlage',\n",
" 'abspeichern',\n",
" 'absprache',\n",
" 'aktivkohlepatron',\n",
" 'aktivkohlepatrone',\n",
" 'anbackung',\n",
" 'anfragen',\n",
" 'angebot',\n",
" 'anpresswalze',\n",
" 'ansaug',\n",
" 'anschluss',\n",
" 'anschluß',\n",
" 'anzahl',\n",
" 'auen',\n",
" 'auenbereich',\n",
" 'aueneinheit',\n",
" 'aufwickler',\n",
" 'ausblasöffnung',\n",
" 'ausbrennen',\n",
" 'auslassventil',\n",
" 'ausrüstung',\n",
" 'austausch',\n",
" 'axialpendelrollenlager',\n",
" 'batteriewechsel',\n",
" 'batterieüberprüfung',\n",
" 'baugruppe',\n",
" 'baumwolltuch',\n",
" 'bauteil',\n",
" 'befeuchter',\n",
" 'beleuchtung',\n",
" 'beschichtunglegierung',\n",
" 'besprechungszimmer',\n",
" 'bestandskontrolle',\n",
" 'bestellformular',\n",
" 'bestätigung',\n",
" 'bezeichnung',\n",
" 'binder',\n",
" 'blutstop',\n",
" 'bolze',\n",
" 'breitstreckwalze',\n",
" 'containerstellfläche',\n",
" 'contrawalze',\n",
" 'dachfläche',\n",
" 'dampfzylinder',\n",
" 'deformierung',\n",
" 'dezember',\n",
" 'din',\n",
" 'docke',\n",
" 'dokumentation',\n",
" 'dosierpumpe',\n",
" 'druckluftbehälter',\n",
" 'druckluftleitung',\n",
" 'druckluftschläuche',\n",
" 'drucktestkontrolle',\n",
" 'einterminieren',\n",
" 'eintragung',\n",
" 'einzelprotokoll',\n",
" 'einziehwalze',\n",
" 'elektisch',\n",
" 'element',\n",
" 'enthärtung',\n",
" 'entwässern',\n",
" 'erledigungsbeschreibeung',\n",
" 'erstehilfeeinrichtung',\n",
" 'erweiterung',\n",
" 'explosionsschutzanlage',\n",
" 'extradaten',\n",
" 'exzenterringbefestigung',\n",
" 'fa',\n",
" 'fach',\n",
" 'faltenbalge',\n",
" 'feedbackinput',\n",
" 'feuerwehrumfahrung',\n",
" 'filert',\n",
" 'filteranlage',\n",
" 'filterelement',\n",
" 'filterstufe',\n",
" 'fixtermin',\n",
" 'flanschlager',\n",
" 'flanschlagerquadrat',\n",
" 'fluchtwegsymbol',\n",
" 'flusenabsaugrohr',\n",
" 'freilauf',\n",
" 'fremdkörper',\n",
" 'führungswagen',\n",
" 'gaslager',\n",
" 'gaszählerstand',\n",
" 'gatter',\n",
" 'geräteinner',\n",
" 'geräteinneres',\n",
" 'geräusch',\n",
" 'gesamt',\n",
" 'gesamterzeugt',\n",
" 'getränkeautomat',\n",
" 'gewindebefestigung',\n",
" 'gewindestiftbefestigung',\n",
" 'gleitschiene',\n",
" 'grat',\n",
" 'gro',\n",
" 'grundplatte',\n",
" 'halle',\n",
" 'haupteingang',\n",
" 'hebebühne',\n",
" 'hebezeug',\n",
" 'helm',\n",
" 'hersteller',\n",
" 'hochregal',\n",
" 'hochtemperatur',\n",
" 'hochtemperatureinsatz',\n",
" 'hydraulik',\n",
" 'hydrauliköl',\n",
" 'impulseingang',\n",
" 'indikator',\n",
" 'inneneinheit',\n",
" 'insektenvernichter',\n",
" 'kabel',\n",
" 'kammer',\n",
" 'karton',\n",
" 'kegelradgetriebe',\n",
" 'kegelradgetriebemotor',\n",
" 'kette',\n",
" 'klemmrolle',\n",
" 'klimaanlage',\n",
" 'klimabühne',\n",
" 'klimagerät',\n",
" 'kompressor',\n",
" 'kompressorluftwert',\n",
" 'kontoll',\n",
" 'kontrawalze',\n",
" 'kontroll',\n",
" 'krankheit',\n",
" 'krän',\n",
" 'kräne',\n",
" 'kuehlaggregat',\n",
" 'kw',\n",
" 'kühlgerät',\n",
" 'lagereinheit',\n",
" 'lagereinsatz',\n",
" 'lagerort',\n",
" 'lagerung',\n",
" 'laser',\n",
" 'laufgeräusche',\n",
" 'luftansaugseite',\n",
" 'luftfilter',\n",
" 'luftfilterwasserabscheider',\n",
" 'luftmenge',\n",
" 'luftreiniger',\n",
" 'lösungsmittel',\n",
" 'lüftungsanlage',\n",
" 'macke',\n",
" 'managementsystem',\n",
" 'maschinenanschluss',\n",
" 'materialzersetzung',\n",
" 'messlager',\n",
" 'micron',\n",
" 'mischer',\n",
" 'monatlicher',\n",
" 'monatliches',\n",
" 'monteur',\n",
" 'moos',\n",
" 'motorstart',\n",
" 'nachfetten',\n",
" 'nachschmieren',\n",
" 'nachspann',\n",
" 'neuvertrag',\n",
" 'nord',\n",
" 'nottelefon',\n",
" 'nr',\n",
" 'oberer',\n",
" 'oberflächenkontrolle',\n",
" 'objektkarte',\n",
" 'palette',\n",
" 'pendelkugellager',\n",
" 'pfeifer',\n",
" 'platine',\n",
" 'pneum',\n",
" 'pneumatikventil',\n",
" 'pneumatisch',\n",
" 'pos',\n",
" 'positioniersystem',\n",
" 'prozesskennzahl',\n",
" 'prüfbericht',\n",
" 'prüfplan',\n",
" 'rampenbereich',\n",
" 'rauwalze',\n",
" 'regalprüfer',\n",
" 'regalsicherungsanlage',\n",
" 'reiniger',\n",
" 'reinigungstuch',\n",
" 'restlich',\n",
" 'risikoersatzteil',\n",
" 'rohrtrenner',\n",
" 'roller',\n",
" 'rundgangkontrollen',\n",
" 'rückmeldung',\n",
" 'sae',\n",
" 'sauberkeit',\n",
" 'schlitten',\n",
" 'schmierstoff',\n",
" 'schmierstoffmenge',\n",
" 'schneider',\n",
" 'schraube',\n",
" 'schraubenbestand',\n",
" 'schutzabdeckung',\n",
" 'sicherheitsbeleuchtung',\n",
" 'sicherheitseinrichtung',\n",
" 'sicherheitslichtschranke',\n",
" 'sicherheitsweste',\n",
" 'sicherstellung',\n",
" 'sonotrode',\n",
" 'sonotrodenständer',\n",
" 'spannkopflager',\n",
" 'spannlager',\n",
" 'spannrahmen',\n",
" 'spindel',\n",
" 'spindelhubgetriebe',\n",
" 'spindelmutter',\n",
" 'spülzeitprüfung',\n",
" 'stab',\n",
" 'stadtwasser',\n",
" 'stehlager',\n",
" 'stehlagergehäuse',\n",
" 'steuerung',\n",
" 'stückliste',\n",
" 'systemumstellung',\n",
" 'telefonanlage',\n",
" 'telefonat',\n",
" 'termin',\n",
" 'terminabsprache',\n",
" 'terminiern',\n",
" 'terminiert',\n",
" 'terminierung',\n",
" 'terminvorschlag',\n",
" 'testomat',\n",
" 'thermoheizelement',\n",
" 'torsprechanlage',\n",
" 'trinkwassernetz',\n",
" 'trockenzylinder',\n",
" 'tänzerrolle',\n",
" 'türdichtung',\n",
" 'türgriff',\n",
" 'türsicherung',\n",
" 'umlenkwalzen',\n",
" 'umrandung',\n",
" 'unkraut',\n",
" 'uschienenführung',\n",
" 'uvv',\n",
" 'ventil',\n",
" 'verbaut',\n",
" 'verbrennungsset',\n",
" 'vereinbarung',\n",
" 'verkalkung',\n",
" 'verschleiteileinsatz',\n",
" 'verschmutzung',\n",
" 'verschmutzungenlos',\n",
" 'verstellung',\n",
" 'verunreinigung',\n",
" 'vollständigkeit',\n",
" 'volumenzähler',\n",
" 'vorderer',\n",
" 'vordruck',\n",
" 'vorfilter',\n",
" 'vorfilterflie',\n",
" 'vorliegen',\n",
" 'vormonat',\n",
" 'wartungsintervall',\n",
" 'wartungsvertrag',\n",
" 'wasserfilter',\n",
" 'wasserhärte',\n",
" 'wasserpegelkontrolle',\n",
" 'wasserzählerstand',\n",
" 'wechselintervall',\n",
" 'wärmetauscher',\n",
" 'zahnrieme',\n",
" 'zahnstange',\n",
" 'zuleitung',\n",
" 'zuschicken',\n",
" 'ölfüllung',\n",
" 'ölstand',\n",
" 'ölstandsichtprüfung',\n",
" 'ölstandskontrolle',\n",
" 'überziehen'\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 258,
"metadata": {},
"outputs": [],
"source": [
"corrections: dict[str, str] = {\n",
" 'desifektionsmittel': 'desinfektionsmittel',\n",
" 'schweikopf': 'schweisskopf',\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Discovered groups**\n",
"- Prüfung (inspection):\n",
"  - Prüfen\n",
"  - Sichtprüfung\n",
"  - Überprüfung / überprüfen\n",
"  - Kontrolle / kontrollieren\n",
"  - sicherstellen / Sicherstellung\n",
"  - Wartung / warten\n",
"  - Reinigung / reinigen\n",
"  - Prüfbericht\n",
"- Handlung (action):\n",
"  - Schmierung\n",
"  - schmieren\n",
"  - reinigen\n",
"  - Reinigung\n",
"  - schneiden / nachschneiden\n",
"- zyklisch (cyclic):\n",
"  - täglich\n",
"  - wöchentlich\n",
"  - monatlich\n",
"  - jährlich\n",
"- Datum (date):\n",
"  - Uhr\n",
"  - Montag, Dienstag, Mittwoch, Donnerstag, Freitag, Samstag, Sonntag\n",
"- Kleinteile (small parts):\n",
"  - Schraube\n",
"  - Adapter\n",
"  - Halterung\n",
"  - Scheibe\n",
"  - Gewinde\n",
"  - Ventil\n",
"  - Schalter\n",
"  - Befestigungsschraube\n",
"- Komponenten (components):\n",
"  - Kupplung\n",
"  - Motor\n",
"  - Getriebe\n",
"  - Ventilator\n",
"  - Zahnriemen\n",
"  - Transformator\n",
"  - Filterelement\n",
"  - Dosierpumpe\n",
"  - Luftschlauch\n",
"  - Dichtung\n",
"  - Filter\n",
"  - Scharnier\n",
"  - Spannrolle\n",
"  - Druckluftbehälter\n",
"  - Kette\n",
"  - Anschlüsse\n",
"  - Schläuche\n",
"  - Beleuchtung\n",
"- Elektrik (electrics):\n",
"  - Zuleitung\n",
"  - Kabel\n",
"  - Steckdose\n",
"  - Elektriker\n",
"  - Elektronik\n",
"  - elektrisch\n",
"  - Sicherheitsbeleuchtung\n",
"- Anlagen (installations):\n",
"  - Mischanlage\n",
"  - Maschine\n",
"  - Wasserenthärtungsanlage\n",
"  - Lüftungsanlage\n",
"  - Klimaanlage\n",
"- Vereinbarung (agreement):\n",
"  - Wartungsvertrag\n",
"  - Neuvertrag\n",
"  - Vertrag\n",
"  - terminieren / terminiert\n",
"  - Absprache\n",
"  - melden\n",
"  - telefonisch\n",
"  - mitteilen\n",
"- Störbild (fault pattern):\n",
"  - defekt\n",
"  - kaputt\n",
"  - Geräusch\n",
"  - undicht\n",
"  - leckt\n",
"  - Dichtigkeit\n",
"- Abteilung (department):\n",
"  - Buchhaltung\n",
"  - Betriebstechnik\n",
"  - Entwicklung\n",
"- Ort (location):\n",
"  - Kesselhaus\n",
"  - Durchfahrt\n",
"  - Dach\n",
"  - Haupteingang\n",
"  - Werkbank\n",
"  - Schlosserei"
]
},
{
"cell_type": "code",
"execution_count": 272,
"metadata": {},
"outputs": [],
"source": [
"word_2_map = {\n",
"    'Prüfung': ['prüfen', 'sichtprüfung', 'überprüfung', 'überprüfen',\n",
"                'kontrolle', 'kontrollieren', 'sicherstellen', 'sicherstellung',\n",
"                'reinigung', 'reinigen', 'prüfbericht', 'sichtkontrolle',\n",
"                'rundgang', 'technikrundgang'],\n",
"    'Wartung': ['wartung', 'warten', 'wartungstätigkeit', 'wartungsarbeit',\n",
"                'wartungsplan'],\n",
"    'Handlung': ['schmierung', 'schmieren', 'reinigen', 'reinigung',\n",
"                 'schneiden', 'nachschneiden'],\n",
"    'zyklisch': ['täglich', 'tägliche', 'täglicher', 'wöchentlich', 'wöchentliche', 'monatlich', 'jährlich',\n",
"                 'halbjährlich', 'monatliche', 'wartungsintervall'],\n",
"    'Datum': ['uhr', 'montag', 'dienstag', 'mittwoch', 'donnerstag',\n",
"              'freitag', 'samstag', 'sonntag'],\n",
"    'Kleinteile': ['schraube', 'adapter', 'halterung', 'scheibe', 'gewinde',\n",
"                   'ventil', 'schalter', 'befestigungsschraube'],\n",
"    'Komponenten': ['kupplung', 'motor', 'getriebe', 'ventilator',\n",
"                    'zahnriemen', 'transformator', 'filterelement',\n",
"                    'dosierpumpe', 'luftschlauch', 'dichtung', 'filter',\n",
"                    'scharnier', 'spannrolle', 'druckluftbehälter', 'kette',\n",
"                    'anschlüsse', 'anschluss', 'schläuche', 'schlauch', 'beleuchtung'],\n",
"    'Elektrik': ['zuleitung', 'kabel', 'steckdose', 'elektriker',\n",
"                 'elektronik', 'elektrisch', 'sicherheitsbeleuchtung'],\n",
"    'Anlagen': ['anlage', 'mischanlage', 'maschine', 'klimaanlage', 'filteranlage',\n",
"                'wasserenthärtungsanlage', 'lüftungsanlage', 'wasseraufbereitungsanlage'],\n",
"    # comma after 'terminieren' is required; without it Python concatenates the two adjacent strings\n",
"    'Vereinbarung': ['wartungsvertrag', 'neuvertrag', 'vertrag', 'terminieren',\n",
"                     'terminiert', 'absprache', 'melden', 'telefonisch', 'mitteilen'],\n",
"    'Störbild': ['defekt', 'kaputt', 'geräusch', 'undicht', 'leckt', 'dichtigkeit'],\n",
"    'Abteilung': ['buchhaltung', 'betriebstechnik', 'entwicklung'],\n",
"    'Ort': ['kesselhaus', 'durchfahrt', 'dach', \n",
"            'haupteingang', 'werkbank', 'schlosserei'],\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Frage: Existiert Möglichkeit zur Klassifizierung von Begriffen?\n",
" - z.B. automatische Kennung, ob Komponente oder nicht"
]
},
{
"cell_type": "code",
"execution_count": 273,
"metadata": {},
"outputs": [],
"source": [
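"# invert the mapping: term -> category label\n",
"# note: a term listed under several categories keeps the last category assigned\n",
"map_2_word = dict()\n",
"\n",
"for key, word_list in word_2_map.items():\n",
"    for word in word_list:\n",
"        map_2_word[word] = key"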
]
},
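{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the inverted mapping keeps a single category per term, a term listed under several categories ends up with whichever category is processed last. A minimal sketch for spotting such overlaps, assuming a small hypothetical excerpt `sample_map` in place of the full `word_2_map`; the helper `find_ambiguous_terms` is an illustration and not part of the original analysis:\n",
"\n",
"```python\n",
"from collections import defaultdict\n",
"\n",
"\n",
"def find_ambiguous_terms(category_map):\n",
"    # collect every category in which a term is listed\n",
"    categories_per_term = defaultdict(list)\n",
"    for category, terms in category_map.items():\n",
"        for term in terms:\n",
"            categories_per_term[term].append(category)\n",
"    # keep only the terms that appear under more than one category\n",
"    return {term: cats for term, cats in categories_per_term.items() if len(cats) > 1}\n",
"\n",
"\n",
"# hypothetical excerpt of the category mapping\n",
"sample_map = {\n",
"    'Prüfung': ['prüfen', 'reinigung', 'reinigen'],\n",
"    'Handlung': ['schmieren', 'reinigen', 'reinigung'],\n",
"}\n",
"ambiguous = find_ambiguous_terms(sample_map)\n",
"print(ambiguous)\n",
"```\n",
"\n",
"Running the same check on the full mapping would list every term whose category depends on the order of the dictionary entries."
]
},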
{
"cell_type": "code",
"execution_count": 274,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:base:Number of entries processed: 1, Percent completed: 0.01\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:base:Number of entries processed: 501, Percent completed: 7.42\n",
"INFO:base:Number of entries processed: 1001, Percent completed: 14.82\n",
"INFO:base:Number of entries processed: 1501, Percent completed: 22.23\n",
"INFO:base:Number of entries processed: 2001, Percent completed: 29.63\n",
"INFO:base:Number of entries processed: 2501, Percent completed: 37.04\n",
"INFO:base:Number of entries processed: 3001, Percent completed: 44.44\n",
"INFO:base:Number of entries processed: 3501, Percent completed: 51.84\n",
"INFO:base:Number of entries processed: 4001, Percent completed: 59.25\n",
"INFO:base:Number of entries processed: 4501, Percent completed: 66.65\n",
"INFO:base:Number of entries processed: 5001, Percent completed: 74.06\n",
"INFO:base:Number of entries processed: 5501, Percent completed: 81.46\n",
"INFO:base:Number of entries processed: 6001, Percent completed: 88.86\n",
"INFO:base:Number of entries processed: 6501, Percent completed: 96.27\n"
]
}
],
"source": [
"# containers filled while building the adjacency matrix\n",
"connections = dict()\n",
"unique_tokens = set()\n",
"UPDATE_STATUS = 500  # log progress every N entries\n",
"length_data = len(descr)\n",
"spell_check_candidates = set()\n",
"spell_checker = SpellChecker(language='de', distance=1)\n",
"\n",
"if not LOAD_CALC_FILES or IS_TEST:\n",
"    for count, (_, row) in enumerate(descr.iterrows()):\n",
"        text = row['descr']\n",
"        weight = row['num_occur']\n",
"\n",
"        doc = nlp(text)\n",
"\n",
"        obtain_descendant_info(\n",
"            doc=doc,\n",
"            weight=weight,\n",
"            POS_of_interest=POS_of_interest,\n",
"            TAG_of_interest=TAG_of_interest,\n",
"            connections=connections,\n",
"            unique_tokens=unique_tokens,\n",
"            spell_check_candidates=spell_check_candidates,\n",
"            spell_check_whitelist=spell_check_whitelist,\n",
"            spell_checker=spell_checker,\n",
"            corrections=corrections,\n",
"            map_2_word=map_2_word,\n",
"        )\n",
"\n",
"        if count % UPDATE_STATUS == 0:\n",
"            logger.info(f'Number of entries processed: {count+1}, Percent completed: {((count+1) / length_data) * 100:.2f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 275,
"metadata": {},
"outputs": [],
"source": [
"ADJ_DF_PATH = './Graphanalyse/adj_mat_df.fth'\n",
"if not IS_TEST:\n",
" if LOAD_CALC_FILES:\n",
" adj_mat_undir = pd.read_feather(ADJ_DF_PATH)\n",
" adj_mat_undir = adj_mat_undir.set_index('index')\n",
" # additional information\n",
" connections = load_pickle('connections.pkl')\n",
" unique_tokens = load_pickle('unique_tokens.pkl')\n",
" else:\n",
" adj_mat = obtain_adj_matrix(unique_tokens=unique_tokens, connections=connections)\n",
" adj_mat_undir = make_undir_adj_matrix(adj_mat=adj_mat)\n",
" save_df = adj_mat_undir.reset_index()\n",
" save_df.to_feather(ADJ_DF_PATH)\n",
" # additional information\n",
" save_pickle(obj=connections, path='connections.pkl')\n",
" save_pickle(obj=unique_tokens, path='unique_tokens.pkl')"
]
},
{
"cell_type": "code",
"execution_count": 276,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Klübertemp</th>\n",
" <th>Schusssuche</th>\n",
" <th>Laser</th>\n",
" <th>Schaftteile</th>\n",
" <th>Dichtsätz</th>\n",
" <th>Tastatur</th>\n",
" <th>Vorspuleinheit</th>\n",
" <th>beginnen</th>\n",
" <th>auslesen</th>\n",
" <th>Kettspannung</th>\n",
" <th>...</th>\n",
" <th>Tänzerwalze</th>\n",
" <th>Abfallkante</th>\n",
" <th>rappeln</th>\n",
" <th>Rottenegger</th>\n",
" <th>Contrawalze</th>\n",
" <th>Eisenträger</th>\n",
" <th>Hängegurte</th>\n",
" <th>Treffen</th>\n",
" <th>Greiferarmen</th>\n",
" <th>Nadelleist</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>A</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ACHTUNG</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ACServomotor</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>AForm</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>AIB</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>überziech</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>überziehen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>überzogen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>üblich</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>üperprüfen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6845 rows × 6845 columns</p>\n",
"</div>"
],
"text/plain": [
" Klübertemp Schusssuche Laser Schaftteile Dichtsätz \\\n",
"A 0 0 0 0 0 \n",
"ACHTUNG 0 0 0 0 0 \n",
"ACServomotor 0 0 0 0 0 \n",
"AForm 0 0 0 0 0 \n",
"AIB 0 0 0 0 0 \n",
"... ... ... ... ... ... \n",
"überziech 0 0 0 0 0 \n",
"überziehen 0 0 0 0 0 \n",
"überzogen 0 0 0 0 0 \n",
"üblich 0 0 0 0 0 \n",
"üperprüfen 0 0 0 0 0 \n",
"\n",
" Tastatur Vorspuleinheit beginnen auslesen Kettspannung ... \\\n",
"A 0 0 0 0 0 ... \n",
"ACHTUNG 0 0 0 0 0 ... \n",
"ACServomotor 0 0 0 0 0 ... \n",
"AForm 0 0 0 0 0 ... \n",
"AIB 0 0 0 0 0 ... \n",
"... ... ... ... ... ... ... \n",
"überziech 0 0 0 0 0 ... \n",
"überziehen 0 0 0 0 0 ... \n",
"überzogen 0 0 0 0 0 ... \n",
"üblich 0 0 0 0 0 ... \n",
"üperprüfen 0 0 0 0 0 ... \n",
"\n",
" Tänzerwalze Abfallkante rappeln Rottenegger Contrawalze \\\n",
"A 0 0 0 0 0 \n",
"ACHTUNG 0 0 0 0 0 \n",
"ACServomotor 0 0 0 0 0 \n",
"AForm 0 0 0 0 0 \n",
"AIB 0 0 0 0 0 \n",
"... ... ... ... ... ... \n",
"überziech 0 0 0 0 0 \n",
"überziehen 0 0 0 0 17 \n",
"überzogen 0 0 0 0 6 \n",
"üblich 0 0 0 0 0 \n",
"üperprüfen 0 0 0 0 0 \n",
"\n",
" Eisenträger Hängegurte Treffen Greiferarmen Nadelleist \n",
"A 0 0 0 0 0 \n",
"ACHTUNG 0 0 0 0 0 \n",
"ACServomotor 0 0 0 0 0 \n",
"AForm 0 0 0 0 0 \n",
"AIB 0 0 0 0 0 \n",
"... ... ... ... ... ... \n",
"überziech 0 0 0 0 0 \n",
"überziehen 0 0 0 0 0 \n",
"überzogen 0 0 0 0 0 \n",
"üblich 0 0 0 0 0 \n",
"üperprüfen 0 0 0 0 0 \n",
"\n",
"[6845 rows x 6845 columns]"
]
},
"execution_count": 276,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adj_mat_undir.sort_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Test: cosine similarity\n",
"- build a matrix of similarity scores (upper triangular matrix)\n",
"- one score for every word pair\n",
"- filter the table by a threshold\n",
"- use the weighted adjacency matrix, filtered by a threshold, as a mask\n",
"  - analyse only groups with high weights\n",
"- analyse the relationships as a graph (similar to the previous approach)\n",
"- form groups and name them (e.g. Prüfung + Überprüfung + Kontrolle --> Überprüfung)\n",
"- build a dictionary from these groups and match terms against it during construction"
]
},
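{
"cell_type": "markdown",
"metadata": {},
"source": [
"The masking and grouping steps listed above could be sketched as follows. This is a minimal illustration on made-up toy values, not the notebook's actual pipeline: the function name `group_similar_terms`, the toy matrices and the thresholds are assumptions, and groups are taken to be the connected components of the masked graph.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def group_similar_terms(words, cos_arr, weight_arr, cos_threshold, weight_threshold):\n",
"    cos_arr = np.asarray(cos_arr, dtype=float)\n",
"    weight_arr = np.asarray(weight_arr, dtype=float)\n",
"    # keep a pair only if it is similar enough AND connected strongly enough\n",
"    mask = (cos_arr > cos_threshold) & (weight_arr >= weight_threshold)\n",
"    # treat the relation as undirected, even if only one triangle is filled\n",
"    mask = mask | mask.T\n",
"    np.fill_diagonal(mask, False)\n",
"\n",
"    groups, seen = [], set()\n",
"    for start in range(len(words)):\n",
"        if start in seen or not mask[start].any():\n",
"            continue\n",
"        # depth-first search over the masked graph\n",
"        stack, component = [start], []\n",
"        seen.add(start)\n",
"        while stack:\n",
"            node = stack.pop()\n",
"            component.append(words[node])\n",
"            for neighbour in np.flatnonzero(mask[node]).tolist():\n",
"                if neighbour not in seen:\n",
"                    seen.add(neighbour)\n",
"                    stack.append(neighbour)\n",
"        groups.append(sorted(component))\n",
"    return groups\n",
"\n",
"\n",
"# toy data (made up): similarity in the upper triangle, symmetric weights\n",
"words = ['Prüfung', 'Überprüfung', 'Kontrolle', 'Motor']\n",
"cos_toy = [[0.0, 0.8, 0.6, 0.1],\n",
"           [0.0, 0.0, 0.7, 0.0],\n",
"           [0.0, 0.0, 0.0, 0.2],\n",
"           [0.0, 0.0, 0.0, 0.0]]\n",
"weight_toy = [[0, 20, 15, 30],\n",
"              [20, 0, 5, 0],\n",
"              [15, 5, 0, 12],\n",
"              [30, 0, 12, 0]]\n",
"groups = group_similar_terms(words, cos_toy, weight_toy, cos_threshold=0.4, weight_threshold=10)\n",
"print(groups)\n",
"```\n",
"\n",
"Each resulting group would still need a manually chosen name before it can serve as a dictionary entry."
]
},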
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"def build_cosine_similarity_matrix(\n",
"    adj_mat\n",
"):\n",
"    # Pairwise cosine similarity of all index words of adj_mat, based on the\n",
"    # vectors of the loaded spaCy model; only the upper triangle is filled.\n",
"    # Words without a vector yield 0 (spaCy then emits warning W008).\n",
"\n",
"    # obtain words to compare\n",
"    words = adj_mat.index.to_list()\n",
"\n",
"    # cos matrix, initialised with zeros\n",
"    cos_mat = pd.DataFrame(\n",
"        data=0.,\n",
"        columns=words,\n",
"        index=words,\n",
"        dtype=np.float32,\n",
"    )\n",
"\n",
"    for (word1, word2) in combinations(words, 2):\n",
"        # look up both words in the model vocabulary\n",
"        w1 = nlp.vocab[str(word1)]\n",
"        w2 = nlp.vocab[str(word2)]\n",
"        # calculate cosine similarity\n",
"        cos_sim = w1.similarity(w2)\n",
"        # set value\n",
"        cos_mat.at[word1, word2] = cos_sim\n",
"\n",
"    return cos_mat"
]
},
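{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pairwise loop above may become slow for several thousand words. A possible vectorised alternative is sketched below, assuming the word vectors have already been stacked into one array (for example from `nlp.vocab[word].vector`); the function name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration. Rows without a vector are kept at zero, which mirrors the zero similarity behind warning W008.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def cosine_similarity_matrix(vectors):\n",
"    # vectors: array-like of shape (n_words, dim), one row per word\n",
"    vectors = np.asarray(vectors, dtype=np.float32)\n",
"    norms = np.linalg.norm(vectors, axis=1)\n",
"    has_vector = norms > 0\n",
"    # normalise rows; rows without a vector stay all-zero\n",
"    unit = np.zeros_like(vectors)\n",
"    unit[has_vector] = vectors[has_vector] / norms[has_vector, None]\n",
"    # dot products of unit vectors are the cosine similarities\n",
"    sim = unit @ unit.T\n",
"    # keep only the strict upper triangle, as in the pairwise loop\n",
"    return np.triu(sim, k=1)\n",
"\n",
"\n",
"# toy vectors (made up); the third word has no vector\n",
"toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]\n",
"sim = cosine_similarity_matrix(toy_vectors)\n",
"print(sim)\n",
"```\n",
"\n",
"The result could be wrapped in a `pd.DataFrame` with the word list as index and columns to match the layout of `cos_mat`."
]
},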
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\foersterflorian\\AppData\\Local\\Temp\\ipykernel_17216\\213623562.py:20: UserWarning: [W008] Evaluating Lexeme.similarity based on empty vectors.\n",
" cos_sim = w1.similarity(w2)\n"
]
}
],
"source": [
"cos_mat = build_cosine_similarity_matrix(adj_mat=adj_mat_undir)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Klübertemp</th>\n",
" <th>Schusssuche</th>\n",
" <th>Laser</th>\n",
" <th>Schaftteile</th>\n",
" <th>Dichtsätz</th>\n",
" <th>Tastatur</th>\n",
" <th>Vorspuleinheit</th>\n",
" <th>beginnen</th>\n",
" <th>auslesen</th>\n",
" <th>Kettspannung</th>\n",
" <th>...</th>\n",
" <th>Tänzerwalze</th>\n",
" <th>Abfallkante</th>\n",
" <th>rappeln</th>\n",
" <th>Rottenegger</th>\n",
" <th>Contrawalze</th>\n",
" <th>Eisenträger</th>\n",
" <th>Hängegurte</th>\n",
" <th>Treffen</th>\n",
" <th>Greiferarmen</th>\n",
" <th>Nadelleist</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Klübertemp</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Schusssuche</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Laser</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.324276</td>\n",
" <td>0.0</td>\n",
" <td>0.059743</td>\n",
" <td>0.133676</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-0.063913</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.167521</td>\n",
" <td>0.0</td>\n",
" <td>-0.029860</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Schaftteile</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dichtsätz</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Eisenträger</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.170954</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hängegurte</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Treffen</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Greiferarmen</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Nadelleist</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6951 rows × 6951 columns</p>\n",
"</div>"
],
"text/plain": [
" Klübertemp Schusssuche Laser Schaftteile Dichtsätz \\\n",
"Klübertemp 0.0 0.0 0.0 0.0 0.0 \n",
"Schusssuche 0.0 0.0 0.0 0.0 0.0 \n",
"Laser 0.0 0.0 0.0 0.0 0.0 \n",
"Schaftteile 0.0 0.0 0.0 0.0 0.0 \n",
"Dichtsätz 0.0 0.0 0.0 0.0 0.0 \n",
"... ... ... ... ... ... \n",
"Eisenträger 0.0 0.0 0.0 0.0 0.0 \n",
"Hängegurte 0.0 0.0 0.0 0.0 0.0 \n",
"Treffen 0.0 0.0 0.0 0.0 0.0 \n",
"Greiferarmen 0.0 0.0 0.0 0.0 0.0 \n",
"Nadelleist 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
" Tastatur Vorspuleinheit beginnen auslesen Kettspannung ... \\\n",
"Klübertemp 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Schusssuche 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Laser 0.324276 0.0 0.059743 0.133676 0.0 ... \n",
"Schaftteile 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Dichtsätz 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"... ... ... ... ... ... ... \n",
"Eisenträger 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Hängegurte 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Treffen 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Greiferarmen 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"Nadelleist 0.000000 0.0 0.000000 0.000000 0.0 ... \n",
"\n",
" Tänzerwalze Abfallkante rappeln Rottenegger Contrawalze \\\n",
"Klübertemp 0.0 0.0 0.000000 0.0 0.0 \n",
"Schusssuche 0.0 0.0 0.000000 0.0 0.0 \n",
"Laser 0.0 0.0 -0.063913 0.0 0.0 \n",
"Schaftteile 0.0 0.0 0.000000 0.0 0.0 \n",
"Dichtsätz 0.0 0.0 0.000000 0.0 0.0 \n",
"... ... ... ... ... ... \n",
"Eisenträger 0.0 0.0 0.000000 0.0 0.0 \n",
"Hängegurte 0.0 0.0 0.000000 0.0 0.0 \n",
"Treffen 0.0 0.0 0.000000 0.0 0.0 \n",
"Greiferarmen 0.0 0.0 0.000000 0.0 0.0 \n",
"Nadelleist 0.0 0.0 0.000000 0.0 0.0 \n",
"\n",
" Eisenträger Hängegurte Treffen Greiferarmen Nadelleist \n",
"Klübertemp 0.000000 0.0 0.000000 0.0 0.0 \n",
"Schusssuche 0.000000 0.0 0.000000 0.0 0.0 \n",
"Laser 0.167521 0.0 -0.029860 0.0 0.0 \n",
"Schaftteile 0.000000 0.0 0.000000 0.0 0.0 \n",
"Dichtsätz 0.000000 0.0 0.000000 0.0 0.0 \n",
"... ... ... ... ... ... \n",
"Eisenträger 0.000000 0.0 0.170954 0.0 0.0 \n",
"Hängegurte 0.000000 0.0 0.000000 0.0 0.0 \n",
"Treffen 0.000000 0.0 0.000000 0.0 0.0 \n",
"Greiferarmen 0.000000 0.0 0.000000 0.0 0.0 \n",
"Nadelleist 0.000000 0.0 0.000000 0.0 0.0 \n",
"\n",
"[6951 rows x 6951 columns]"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cos_mat"
]
},
{
"cell_type": "code",
"execution_count": 635,
"metadata": {},
"outputs": [],
"source": [
"WEIGHT_THRESHOLD = 10\n",
"arr = adj_mat_undir.to_numpy()\n",
"COS_THRESHOLD = 0.4\n",
"cos_arr = cos_mat.to_numpy()"
]
},
{
"cell_type": "code",
"execution_count": 636,
"metadata": {},
"outputs": [],
"source": [
"cos_arr_filt = np.where((cos_arr > COS_THRESHOLD) & (arr >= WEIGHT_THRESHOLD), cos_arr, 0)"
]
},
{
"cell_type": "code",
"execution_count": 637,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" ...,\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)"
]
},
"execution_count": 637,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cos_arr_filt"
]
},
{
"cell_type": "code",
"execution_count": 638,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"217"
]
},
"execution_count": 638,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.count_nonzero(cos_arr_filt)"
]
},
{
"cell_type": "code",
"execution_count": 639,
"metadata": {},
"outputs": [],
"source": [
"thresh_cos_mat = cos_mat.copy()\n",
"thresh_cos_mat[:] = cos_arr_filt"
]
},
{
"cell_type": "code",
"execution_count": 640,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Verstärkung</th>\n",
" <th>Zuluftfilter</th>\n",
" <th>klemmt</th>\n",
" <th>Komminikation</th>\n",
" <th>Doppelholztische</th>\n",
" <th>Deckenbeleuchtung</th>\n",
" <th>Abfalltransport</th>\n",
" <th>fahrbar</th>\n",
" <th>Folieneinlauf</th>\n",
" <th>entsorgen</th>\n",
" <th>...</th>\n",
" <th>neuwertig</th>\n",
" <th>Bleit</th>\n",
" <th>Rauchentwicklung</th>\n",
" <th>Kompressorsteuerung</th>\n",
" <th>anziehen</th>\n",
" <th>Mitarbeiterin</th>\n",
" <th>Nägel</th>\n",
" <th>WZ</th>\n",
" <th>ExSchutzAnlage</th>\n",
" <th>Gemisch</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Verstärkung</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Zuluftfilter</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>klemmt</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Komminikation</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Doppelholztische</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mitarbeiterin</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Nägel</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>WZ</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ExSchutzAnlage</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Gemisch</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6951 rows × 6951 columns</p>\n",
"</div>"
],
"text/plain": [
" Verstärkung Zuluftfilter klemmt Komminikation \\\n",
"Verstärkung 0.0 0.0 0.0 0.0 \n",
"Zuluftfilter 0.0 0.0 0.0 0.0 \n",
"klemmt 0.0 0.0 0.0 0.0 \n",
"Komminikation 0.0 0.0 0.0 0.0 \n",
"Doppelholztische 0.0 0.0 0.0 0.0 \n",
"... ... ... ... ... \n",
"Mitarbeiterin 0.0 0.0 0.0 0.0 \n",
"Nägel 0.0 0.0 0.0 0.0 \n",
"WZ 0.0 0.0 0.0 0.0 \n",
"ExSchutzAnlage 0.0 0.0 0.0 0.0 \n",
"Gemisch 0.0 0.0 0.0 0.0 \n",
"\n",
" Doppelholztische Deckenbeleuchtung Abfalltransport \\\n",
"Verstärkung 0.0 0.0 0.0 \n",
"Zuluftfilter 0.0 0.0 0.0 \n",
"klemmt 0.0 0.0 0.0 \n",
"Komminikation 0.0 0.0 0.0 \n",
"Doppelholztische 0.0 0.0 0.0 \n",
"... ... ... ... \n",
"Mitarbeiterin 0.0 0.0 0.0 \n",
"Nägel 0.0 0.0 0.0 \n",
"WZ 0.0 0.0 0.0 \n",
"ExSchutzAnlage 0.0 0.0 0.0 \n",
"Gemisch 0.0 0.0 0.0 \n",
"\n",
" fahrbar Folieneinlauf entsorgen ... neuwertig Bleit \\\n",
"Verstärkung 0.0 0.0 0.0 ... 0.0 0.0 \n",
"Zuluftfilter 0.0 0.0 0.0 ... 0.0 0.0 \n",
"klemmt 0.0 0.0 0.0 ... 0.0 0.0 \n",
"Komminikation 0.0 0.0 0.0 ... 0.0 0.0 \n",
"Doppelholztische 0.0 0.0 0.0 ... 0.0 0.0 \n",
"... ... ... ... ... ... ... \n",
"Mitarbeiterin 0.0 0.0 0.0 ... 0.0 0.0 \n",
"Nägel 0.0 0.0 0.0 ... 0.0 0.0 \n",
"WZ 0.0 0.0 0.0 ... 0.0 0.0 \n",
"ExSchutzAnlage 0.0 0.0 0.0 ... 0.0 0.0 \n",
"Gemisch 0.0 0.0 0.0 ... 0.0 0.0 \n",
"\n",
" Rauchentwicklung Kompressorsteuerung anziehen \\\n",
"Verstärkung 0.0 0.0 0.0 \n",
"Zuluftfilter 0.0 0.0 0.0 \n",
"klemmt 0.0 0.0 0.0 \n",
"Komminikation 0.0 0.0 0.0 \n",
"Doppelholztische 0.0 0.0 0.0 \n",
"... ... ... ... \n",
"Mitarbeiterin 0.0 0.0 0.0 \n",
"Nägel 0.0 0.0 0.0 \n",
"WZ 0.0 0.0 0.0 \n",
"ExSchutzAnlage 0.0 0.0 0.0 \n",
"Gemisch 0.0 0.0 0.0 \n",
"\n",
" Mitarbeiterin Nägel WZ ExSchutzAnlage Gemisch \n",
"Verstärkung 0.0 0.0 0.0 0.0 0.0 \n",
"Zuluftfilter 0.0 0.0 0.0 0.0 0.0 \n",
"klemmt 0.0 0.0 0.0 0.0 0.0 \n",
"Komminikation 0.0 0.0 0.0 0.0 0.0 \n",
"Doppelholztische 0.0 0.0 0.0 0.0 0.0 \n",
"... ... ... ... ... ... \n",
"Mitarbeiterin 0.0 0.0 0.0 0.0 0.0 \n",
"Nägel 0.0 0.0 0.0 0.0 0.0 \n",
"WZ 0.0 0.0 0.0 0.0 0.0 \n",
"ExSchutzAnlage 0.0 0.0 0.0 0.0 0.0 \n",
"Gemisch 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
"[6951 rows x 6951 columns]"
]
},
"execution_count": 640,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"thresh_cos_mat"
]
},
{
"cell_type": "code",
"execution_count": 641,
"metadata": {},
"outputs": [],
"source": [
"COS_MAT_PATH_CSV = f'./Graphanalyse_Gruppen/cos_mat_Wthresh_{WEIGHT_THRESHOLD}_Cthresh{int(COS_THRESHOLD*100)}.csv'\n",
"thresh_cos_mat.to_csv(path_or_buf=COS_MAT_PATH_CSV, encoding='cp1252', sep=';')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 603,
"metadata": {},
"outputs": [],
"source": [
"arr = adj_mat_undir.to_numpy()"
]
},
{
"cell_type": "code",
"execution_count": 604,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"24725"
]
},
"execution_count": 604,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.count_nonzero(arr)"
]
},
{
"cell_type": "code",
"execution_count": 605,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"92788"
]
},
"execution_count": 605,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.max(arr)"
]
},
{
"cell_type": "code",
"execution_count": 606,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"257"
]
},
"execution_count": 606,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"uni_arr = np.unique(arr)\n",
"len(uni_arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Threshold: set all edge weights below `WEIGHT_THRESHOLD` to 0"
]
},
{
"cell_type": "code",
"execution_count": 277,
"metadata": {},
"outputs": [],
"source": [
"WEIGHT_THRESHOLD = 50\n",
"arr = adj_mat_undir.to_numpy()\n",
"# zero out every edge whose weight is below the threshold\n",
"arr = np.where(arr < WEIGHT_THRESHOLD, 0, arr)"
]
},
{
"cell_type": "code",
"execution_count": 278,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"600"
]
},
"execution_count": 278,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.count_nonzero(arr)"
]
},
{
"cell_type": "code",
"execution_count": 279,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"216"
]
},
"execution_count": 279,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of columns (nodes) whose remaining edge weights sum to a non-zero value\n",
"temp = np.sum(arr, axis=0)\n",
"np.count_nonzero(temp)"
]
},
{
"cell_type": "code",
"execution_count": 280,
"metadata": {},
"outputs": [],
"source": [
"thresh_adj_mat = adj_mat_undir.copy()\n",
"thresh_adj_mat.loc[:] = arr"
]
},
{
"cell_type": "code",
"execution_count": 281,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Klübertemp</th>\n",
" <th>Schusssuche</th>\n",
" <th>Laser</th>\n",
" <th>Schaftteile</th>\n",
" <th>Dichtsätz</th>\n",
" <th>Tastatur</th>\n",
" <th>Vorspuleinheit</th>\n",
" <th>beginnen</th>\n",
" <th>auslesen</th>\n",
" <th>Kettspannung</th>\n",
" <th>...</th>\n",
" <th>Tänzerwalze</th>\n",
" <th>Abfallkante</th>\n",
" <th>rappeln</th>\n",
" <th>Rottenegger</th>\n",
" <th>Contrawalze</th>\n",
" <th>Eisenträger</th>\n",
" <th>Hängegurte</th>\n",
" <th>Treffen</th>\n",
" <th>Greiferarmen</th>\n",
" <th>Nadelleist</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Klübertemp</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Schusssuche</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Laser</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Schaftteile</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dichtsätz</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Eisenträger</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hängegurte</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Treffen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Greiferarmen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Nadelleist</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6845 rows × 6845 columns</p>\n",
"</div>"
],
"text/plain": [
" Klübertemp Schusssuche Laser Schaftteile Dichtsätz \\\n",
"Klübertemp 0 0 0 0 0 \n",
"Schusssuche 0 0 0 0 0 \n",
"Laser 0 0 0 0 0 \n",
"Schaftteile 0 0 0 0 0 \n",
"Dichtsätz 0 0 0 0 0 \n",
"... ... ... ... ... ... \n",
"Eisenträger 0 0 0 0 0 \n",
"Hängegurte 0 0 0 0 0 \n",
"Treffen 0 0 0 0 0 \n",
"Greiferarmen 0 0 0 0 0 \n",
"Nadelleist 0 0 0 0 0 \n",
"\n",
" Tastatur Vorspuleinheit beginnen auslesen Kettspannung ... \\\n",
"Klübertemp 0 0 0 0 0 ... \n",
"Schusssuche 0 0 0 0 0 ... \n",
"Laser 0 0 0 0 0 ... \n",
"Schaftteile 0 0 0 0 0 ... \n",
"Dichtsätz 0 0 0 0 0 ... \n",
"... ... ... ... ... ... ... \n",
"Eisenträger 0 0 0 0 0 ... \n",
"Hängegurte 0 0 0 0 0 ... \n",
"Treffen 0 0 0 0 0 ... \n",
"Greiferarmen 0 0 0 0 0 ... \n",
"Nadelleist 0 0 0 0 0 ... \n",
"\n",
" Tänzerwalze Abfallkante rappeln Rottenegger Contrawalze \\\n",
"Klübertemp 0 0 0 0 0 \n",
"Schusssuche 0 0 0 0 0 \n",
"Laser 0 0 0 0 0 \n",
"Schaftteile 0 0 0 0 0 \n",
"Dichtsätz 0 0 0 0 0 \n",
"... ... ... ... ... ... \n",
"Eisenträger 0 0 0 0 0 \n",
"Hängegurte 0 0 0 0 0 \n",
"Treffen 0 0 0 0 0 \n",
"Greiferarmen 0 0 0 0 0 \n",
"Nadelleist 0 0 0 0 0 \n",
"\n",
" Eisenträger Hängegurte Treffen Greiferarmen Nadelleist \n",
"Klübertemp 0 0 0 0 0 \n",
"Schusssuche 0 0 0 0 0 \n",
"Laser 0 0 0 0 0 \n",
"Schaftteile 0 0 0 0 0 \n",
"Dichtsätz 0 0 0 0 0 \n",
"... ... ... ... ... ... \n",
"Eisenträger 0 0 0 0 0 \n",
"Hängegurte 0 0 0 0 0 \n",
"Treffen 0 0 0 0 0 \n",
"Greiferarmen 0 0 0 0 0 \n",
"Nadelleist 0 0 0 0 0 \n",
"\n",
"[6845 rows x 6845 columns]"
]
},
"execution_count": 281,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"thresh_adj_mat"
]
},
{
"cell_type": "code",
"execution_count": 282,
"metadata": {},
"outputs": [],
"source": [
"ADJ_MAT_PATH_CSV = f'./Graphanalyse_Gruppen/adj_mat_thresh_mapping_{WEIGHT_THRESHOLD}.csv'\n",
"thresh_adj_mat.to_csv(path_or_buf=ADJ_MAT_PATH_CSV, encoding='cp1252', sep=';')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***Testing***"
]
},
{
"cell_type": "code",
"execution_count": 208,
"metadata": {},
"outputs": [],
"source": [
"important_words = []\n",
"all_entities = []\n",
"pos_tags = set()\n",
"pos_counter = dict()\n",
"token_counter = 0\n",
"\n",
"# NOTE: `descr` must be an iterable of description strings; iterating a\n",
"# DataFrame directly yields its column labels instead of its rows.\n",
"for description in descr:\n",
"    doc = nlp(description)\n",
"\n",
"    relevant_words = []\n",
"    for token in doc:\n",
"        POS = token.pos_\n",
"        token_counter += 1\n",
"        # count how often each part-of-speech tag occurs\n",
"        pos_counter[POS] = pos_counter.get(POS, 0) + 1\n",
"\n",
"        # keep only content words: nouns, proper nouns, adjectives, adverbs\n",
"        if (not token.is_stop and not token.is_punct and\n",
"                not token.is_space and\n",
"                POS in ('NOUN', 'PROPN', 'ADJ', 'ADV')):\n",
"            relevant_words.append((token.lemma_.lower(), POS))\n",
"            #pos_tags.add(token.pos_)\n",
"\n",
"    # named entities recognised in this description\n",
"    entities = []\n",
"    for ent in doc.ents:\n",
"        entities.append((ent.text, ent.label_))\n",
"\n",
"    important_words.extend(relevant_words)\n",
"    all_entities.extend(entities)"
]
},
{
"cell_type": "code",
"execution_count": 209,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('descr', 'ADV'), ('num_occur', 'NOUN')]"
]
},
"execution_count": 209,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"important_words"
]
},
{
"cell_type": "code",
"execution_count": 210,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 210,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(important_words)"
]
},
{
"cell_type": "code",
"execution_count": 211,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('descr', 'LOC'), ('num_occur', 'MISC')]"
]
},
"execution_count": 211,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_entities"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"count = Counter(important_words)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({('täglich', 'ADJ'): 3,\n",
" ('prüfung', 'NOUN'): 3,\n",
" ('sichtkontrolle', 'NOUN'): 2,\n",
" ('kontrolle', 'NOUN'): 2,\n",
" ('scharniere', 'NOUN'): 2,\n",
" ('dichtung', 'NOUN'): 2,\n",
" ('schließvorrichtung', 'NOUN'): 2,\n",
" ('schloß', 'NOUN'): 2,\n",
" ('beschlag', 'NOUN'): 2,\n",
" ('allgemein', 'ADJ'): 2,\n",
" ('funktion', 'NOUN'): 2,\n",
" ('schmierung', 'NOUN'): 2,\n",
" ('festhaltevorrichtung', 'NOUN'): 2,\n",
" ('monatliche', 'ADJ'): 2,\n",
" ('wartungstätigkeit', 'NOUN'): 1,\n",
" ('vorgabe', 'NOUN'): 1,\n",
" ('maschinenhersteller', 'NOUN'): 1,\n",
" ('wöchentliche', 'ADJ'): 1,\n",
" ('reinigung', 'NOUN'): 1,\n",
" ('überprüfung', 'NOUN'): 1,\n",
" ('ölabscheider', 'NOUN'): 1,\n",
" ('wöchentlich', 'ADJ'): 1,\n",
" ('wc-anlage', 'NOUN'): 1,\n",
" ('halbjährliche', 'ADJ'): 1,\n",
" ('stabbreithalter', 'NOUN'): 1,\n",
" ('brandschutztechnische', 'ADJ'): 1,\n",
" ('technikrundgang', 'NOUN'): 1})"
]
},
"execution_count": 225,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"NOUN 25722\n",
"PUNCT 11626\n",
"VERB 9093\n",
"ADP 7211\n",
"ADV 6526\n",
"PROPN 4481\n",
"NUM 4115\n",
"DET 3845\n",
"ADJ 2576\n",
"AUX 2329\n",
"PART 1561\n",
"CCONJ 1305\n",
"X 999\n",
"PRON 916\n",
"SCONJ 385\n",
"SPACE 236\n",
"INTJ 1\n",
"dtype: int64"
]
},
"execution_count": 180,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pos_count = pd.Series(data=pos_counter)\n",
"pos_count.sort_values(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"NOUN 0.310176\n",
"PUNCT 0.140196\n",
"VERB 0.109651\n",
"ADP 0.086956\n",
"ADV 0.078696\n",
"PROPN 0.054035\n",
"NUM 0.049622\n",
"DET 0.046366\n",
"ADJ 0.031063\n",
"AUX 0.028085\n",
"PART 0.018824\n",
"CCONJ 0.015737\n",
"X 0.012047\n",
"PRON 0.011046\n",
"SCONJ 0.004643\n",
"SPACE 0.002846\n",
"INTJ 0.000012\n",
"dtype: float64"
]
},
"execution_count": 184,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pos_count_rel = pos_count / pos_count.sum()\n",
"pos_count_rel.sort_values(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"82927"
]
},
"execution_count": 181,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"token_counter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further analysis of the descriptions\n",
"\n",
"- Clarify unclear relationships in the results for threshold 1200:\n",
"  - Find the corresponding descriptions\n",
"  - Put them into context\n",
"- Identify further blacklist words"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Unclear relationships at threshold 1200"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>len</th>\n",
" <th>num_occur</th>\n",
" <th>assoc_obj_ids</th>\n",
" <th>num_assoc_obj_ids</th>\n",
" </tr>\n",
" <tr>\n",
" <th>index</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>161</th>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>66</td>\n",
" <td>92592</td>\n",
" <td>[0, 17, 41, 42, 43, 44, 45, 46, 47, 51, 52, 53...</td>\n",
" <td>206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>Wöchentliche Sichtkontrolle Reinigung</td>\n",
" <td>37</td>\n",
" <td>1654</td>\n",
" <td>[301, 304, 305, 313, 314, 331, 332, 510, 511, ...</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>130</th>\n",
" <td>Tägliche Überprüfung der Ölabscheider</td>\n",
" <td>37</td>\n",
" <td>1616</td>\n",
" <td>[0, 970, 2134, 2137]</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>159</th>\n",
" <td>Wöchentliche Kontrolle der WC-Anlagen</td>\n",
" <td>37</td>\n",
" <td>1265</td>\n",
" <td>[1352, 1353, 1354, 1684, 1685, 1686, 1687, 168...</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>139</th>\n",
" <td>Halbjährliche Kontrolle des Stabbreithalters</td>\n",
" <td>44</td>\n",
" <td>687</td>\n",
" <td>[51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 6...</td>\n",
" <td>166</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2675</th>\n",
" <td>Stand 15.07.2020 Stöppel: Herr Langner Toyota ...</td>\n",
" <td>253</td>\n",
" <td>1</td>\n",
" <td>[311]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2674</th>\n",
" <td>Zahnräder der Laufkatze verschlissen Ersatztei...</td>\n",
" <td>167</td>\n",
" <td>1</td>\n",
" <td>[415]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2673</th>\n",
" <td>Bitte 8 Scheiben nach Muster anfertigen. Danke.</td>\n",
" <td>47</td>\n",
" <td>1</td>\n",
" <td>[140]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2672</th>\n",
" <td>Schalter für Bühne Schwenken abgerissen, bitte...</td>\n",
" <td>123</td>\n",
" <td>1</td>\n",
" <td>[323]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6781</th>\n",
" <td>Befestigung Deckel für Batteriefach defekt Hal...</td>\n",
" <td>99</td>\n",
" <td>1</td>\n",
" <td>[326]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6782 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" descr len num_occur \\\n",
"index \n",
"161 Tägliche Wartungstätigkeiten nach Vorgabe des ... 66 92592 \n",
"33 Wöchentliche Sichtkontrolle Reinigung 37 1654 \n",
"130 Tägliche Überprüfung der Ölabscheider 37 1616 \n",
"159 Wöchentliche Kontrolle der WC-Anlagen 37 1265 \n",
"139 Halbjährliche Kontrolle des Stabbreithalters 44 687 \n",
"... ... ... ... \n",
"2675 Stand 15.07.2020 Stöppel: Herr Langner Toyota ... 253 1 \n",
"2674 Zahnräder der Laufkatze verschlissen Ersatztei... 167 1 \n",
"2673 Bitte 8 Scheiben nach Muster anfertigen. Danke. 47 1 \n",
"2672 Schalter für Bühne Schwenken abgerissen, bitte... 123 1 \n",
"6781 Befestigung Deckel für Batteriefach defekt Hal... 99 1 \n",
"\n",
" assoc_obj_ids num_assoc_obj_ids \n",
"index \n",
"161 [0, 17, 41, 42, 43, 44, 45, 46, 47, 51, 52, 53... 206 \n",
"33 [301, 304, 305, 313, 314, 331, 332, 510, 511, ... 18 \n",
"130 [0, 970, 2134, 2137] 4 \n",
"159 [1352, 1353, 1354, 1684, 1685, 1686, 1687, 168... 11 \n",
"139 [51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 6... 166 \n",
"... ... ... \n",
"2675 [311] 1 \n",
"2674 [415] 1 \n",
"2673 [140] 1 \n",
"2672 [323] 1 \n",
"6781 [326] 1 \n",
"\n",
"[6782 rows x 5 columns]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional filter: keep only descriptions that occur at least 3 times\n",
"#temp2 = temp1.loc[temp1['num_occur'] >= 3, :]\n",
"temp2 = temp1.copy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#temp2 = temp2.iloc[:30,:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# lemmas are lower-cased below, so the search terms must be lower-case too\n",
"check_words = {word.lower() for word in ['E1.8']}\n",
"target_indices = list()\n",
"\n",
"for idx, row in temp2.iterrows():\n",
"    text = row['descr']\n",
"    doc = nlp(text)\n",
"\n",
"    # collect the lower-cased lemmas of all tokens of interest\n",
"    token_set = set()\n",
"    for token in doc:\n",
"        if not (token.pos_ in POS_of_interest or token.tag_ in TAG_of_interest):\n",
"            continue\n",
"        token_set.add(token.lemma_.lower())\n",
"        #print(f'{token_set=}')\n",
"\n",
"    # keep the row if its description contains every search term\n",
"    if token_set.issuperset(check_words):\n",
"        target_indices.append(idx)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"target_indices"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Vorgaben aus Pleva Wartungsplan Schmieren der Rollenlager der beiden Kameralaufschlitten des Strukturdetektors SD 1C siehe Extradaten'"
]
},
"execution_count": 506,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idx = target_indices[3]\n",
"temp2.at[idx, 'descr']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Leiterprüfung derzeit in Arbeit Abteilungsleiter sind per Email am 11.06.2019 über deren Eigenverantwortlichkeit und Mithilfe durch Herr Graf informiert worden.'"
]
},
"execution_count": 229,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp2.at[1921,'descr']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 197,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"token_set.issuperset(check_words)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'ADJD'}"
]
},
"execution_count": 180,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"POS_of_interest\n",
"TAG_of_interest"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test = 'Tägliche, tägliche Wartungstätigkeit des Maschinenherstellers Maschine'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"doc = nlp(test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"täglich\n",
"--\n",
"täglich\n",
"wartungstätigkeit\n",
"der\n",
"maschinenhersteller\n",
"maschine\n"
]
}
],
"source": [
"for token in doc:\n",
"    print(token.lemma_.lower())"
]
},
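{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# str.replace matches literal substrings, so the regex class '\\s' is dropped;\n",
"# remaining whitespace is handled by str.split()\n",
"replace_chars = [',', '\\n', '\\t']"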
]
},
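{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test = test.lower()\n",
"# replace separators with a space so that adjacent words are not merged\n",
"for char in replace_chars:\n",
"    test = test.replace(char, ' ')\n",
"# str.split() without arguments splits on any run of whitespace\n",
"test = test.split()\n",
"test = set(test)"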
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'des', 'maschine', 'maschinenherstellers', 'tägliche', 'wartungstätigkeit'}"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test.issuperset(check_words)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interim results:**\n",
"\n",
"*Certain ObjektIDs contain the escape character, others do not; no ObjektID occurs with both variants.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicates = 47689 for description with index no. 171:\n",
" Tägliche Wartungstätigkeiten nach Vorgabe des Maschinenherstellers\n",
"\n"
]
}
],
"source": [
"print(f\"Number of duplicates = {max_val} for description with index no. {index}:\\n {text}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Feature 2: VorgangsArtText"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"feature = 'VorgangsArtText'"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"base = wo_duplicates.copy()\n",
"base = base.dropna(axis=0, subset=feature)\n",
"base[feature] = base[feature].map(clean_string)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsID</th>\n",
" <th>ObjektID</th>\n",
" <th>HObjektText</th>\n",
" <th>ObjektArtID</th>\n",
" <th>ObjektArtText</th>\n",
" <th>VorgangsTypID</th>\n",
" <th>VorgangsTypName</th>\n",
" <th>VorgangsDatum</th>\n",
" <th>VorgangsStatusId</th>\n",
" <th>VorgangsPrioritaet</th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>VorgangsOrt</th>\n",
" <th>VorgangsArtText</th>\n",
" <th>ErledigungsDatum</th>\n",
" <th>ErledigungsArtText</th>\n",
" <th>ErledigungsBeschreibung</th>\n",
" <th>MPMelderArbeitsplatz</th>\n",
" <th>MPAbteilungBezeichnung</th>\n",
" <th>Arbeitsbeginn</th>\n",
" <th>ErstellungsDatum</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11</td>\n",
" <td>114</td>\n",
" <td>427 C , Webmaschine, DL 280 EMS Breite 280</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-06</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Kettbaum kaputt</td>\n",
" <td>2019-03-06</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>NaT</td>\n",
" <td>2019-03-06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>17</td>\n",
" <td>124</td>\n",
" <td>621 C , Webmaschine, DL 280 EMS Breite 280</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-11</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>asgasdg</td>\n",
" <td>2019-03-11</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Elektrowerkstatt</td>\n",
" <td>Elektrowerkstatt</td>\n",
" <td>NaT</td>\n",
" <td>2019-03-11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>53</td>\n",
" <td>244</td>\n",
" <td>285 C, Webmaschine, SG 220 EMS</td>\n",
" <td>5</td>\n",
" <td>Greifer-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-19</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Kupplung schleift</td>\n",
" <td>NaN</td>\n",
" <td>Kupplung defekt</td>\n",
" <td>2019-03-20</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>NaN</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>NaT</td>\n",
" <td>2019-03-19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>58</td>\n",
" <td>257</td>\n",
" <td>107, Webmaschine, OM 220 EOS</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-21</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Gegengewicht wieder anbringen</td>\n",
" <td>NaN</td>\n",
" <td>Gegengewicht an der Webmaschine abgefallen</td>\n",
" <td>2019-03-21</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Schraube ausgebohrt\\nGegengewicht wieder angeb...</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>2019-03-21</td>\n",
" <td>2019-03-21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>81</td>\n",
" <td>138</td>\n",
" <td>00138, Schärmaschine 9,</td>\n",
" <td>16</td>\n",
" <td>Schärmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-25</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>da ist etwas gebrochen. (Herr Heininger)</td>\n",
" <td>NaN</td>\n",
" <td>zentrale Bremsenverstellung linke Gatterseite ...</td>\n",
" <td>2019-03-25</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Bolzen gebrochen. Bolzen neu angefertig und di...</td>\n",
" <td>Vorwerk</td>\n",
" <td>Vorwerk</td>\n",
" <td>2019-03-25</td>\n",
" <td>2019-03-25</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" VorgangsID ObjektID HObjektText \\\n",
"0 11 114 427 C , Webmaschine, DL 280 EMS Breite 280 \n",
"1 17 124 621 C , Webmaschine, DL 280 EMS Breite 280 \n",
"2 53 244 285 C, Webmaschine, SG 220 EMS \n",
"3 58 257 107, Webmaschine, OM 220 EOS \n",
"4 81 138 00138, Schärmaschine 9, \n",
"\n",
" ObjektArtID ObjektArtText VorgangsTypID VorgangsTypName \\\n",
"0 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
"1 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
"2 5 Greifer-Webmaschine 3 Reparaturauftrag (Portal) \n",
"3 3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
"4 16 Schärmaschine 3 Reparaturauftrag (Portal) \n",
"\n",
" VorgangsDatum VorgangsStatusId VorgangsPrioritaet \\\n",
"0 2019-03-06 4 0 \n",
"1 2019-03-11 5 0 \n",
"2 2019-03-19 5 0 \n",
"3 2019-03-21 5 0 \n",
"4 2019-03-25 5 0 \n",
"\n",
" VorgangsBeschreibung VorgangsOrt \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 Kupplung schleift NaN \n",
"3 Gegengewicht wieder anbringen NaN \n",
"4 da ist etwas gebrochen. (Herr Heininger) NaN \n",
"\n",
" VorgangsArtText ErledigungsDatum \\\n",
"0 Kettbaum kaputt 2019-03-06 \n",
"1 asgasdg 2019-03-11 \n",
"2 Kupplung defekt 2019-03-20 \n",
"3 Gegengewicht an der Webmaschine abgefallen 2019-03-21 \n",
"4 zentrale Bremsenverstellung linke Gatterseite ... 2019-03-25 \n",
"\n",
" ErledigungsArtText ErledigungsBeschreibung \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 Reparatur UTT NaN \n",
"3 Reparatur UTT Schraube ausgebohrt\\nGegengewicht wieder angeb... \n",
"4 Reparatur UTT Bolzen gebrochen. Bolzen neu angefertig und di... \n",
"\n",
" MPMelderArbeitsplatz MPAbteilungBezeichnung Arbeitsbeginn ErstellungsDatum \n",
"0 Weberei Weberei NaT 2019-03-06 \n",
"1 Elektrowerkstatt Elektrowerkstatt NaT 2019-03-11 \n",
"2 Weberei Weberei NaT 2019-03-19 \n",
"3 Weberei Weberei 2019-03-21 2019-03-21 \n",
"4 Vorwerk Vorwerk 2019-03-25 2019-03-25 "
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"base.head()"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entries: 128936\n"
]
}
],
"source": [
"descriptions = base[feature]\n",
"print(f\"Entries: {len(descriptions)}\")"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicate VorgangsArtText: 128545\n",
"Number of unique VorgangsArtText: 391\n",
"Share of unique VorgangsArtText: 0.30 %\n"
]
}
],
"source": [
"num_dupl_descr = descriptions.duplicated().sum()\n",
"uni_descr = descriptions.unique()\n",
"num_uni_descr = len(uni_descr)\n",
"\n",
"print(f\"Number of duplicate {feature}: {num_dupl_descr}\")\n",
"print(f\"Number of unique {feature}: {num_uni_descr}\")\n",
"print(f\"Share of unique {feature}: {num_uni_descr / len(descriptions) * 100:.2f} %\")"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"if not LOAD_CALC_FILES:\n",
"    cols = ['descr', 'len', 'num_occur', 'assoc_obj_ids', 'num_assoc_obj_ids']\n",
"    rows = []\n",
"    max_val = 0\n",
"    text = None\n",
"    index = 0\n",
"\n",
"    for idx, description in enumerate(uni_descr):\n",
"        # all entries that share exactly this description\n",
"        filt = base[feature] == description\n",
"        assoc_obj_ids = base.loc[filt, 'ObjektID'].unique()\n",
"        assoc_obj_ids = np.sort(assoc_obj_ids, kind='stable')\n",
"        num_dupl = filt.sum()\n",
"\n",
"        rows.append([\n",
"            description,\n",
"            len(description),\n",
"            num_dupl,\n",
"            assoc_obj_ids,\n",
"            len(assoc_obj_ids),\n",
"        ])\n",
"\n",
"        # remember the most frequent description\n",
"        if num_dupl > max_val:\n",
"            max_val = num_dupl\n",
"            index = idx\n",
"            text = description\n",
"\n",
"    # build the frame once instead of concatenating inside the loop\n",
"    descr_df = pd.DataFrame(data=rows, columns=cols)\n",
"    temp1 = descr_df.sort_values(by='num_occur', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>len</th>\n",
" <th>num_occur</th>\n",
" <th>assoc_obj_ids</th>\n",
" <th>num_assoc_obj_ids</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>44</td>\n",
" <td>92719</td>\n",
" <td>[0, 17, 41, 42, 43, 44, 45, 46, 47, 51, 52, 53...</td>\n",
" <td>206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>01 Interne Reinigung Pflege Überprüfung</td>\n",
" <td>39</td>\n",
" <td>11250</td>\n",
" <td>[0, 7, 425, 426, 427, 428, 429, 517, 518, 576,...</td>\n",
" <td>349</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>02 Interne Reinigung Pflege Überprüfung</td>\n",
" <td>39</td>\n",
" <td>3263</td>\n",
" <td>[576, 906, 910, 940, 941, 942, 943, 1040, 1041...</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>Maschinen-Wartung wöchentlich</td>\n",
" <td>29</td>\n",
" <td>2408</td>\n",
" <td>[1, 301, 305, 313, 314, 331, 332, 510, 511, 51...</td>\n",
" <td>25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>Gesetzliche Wartung Prüfung jährlich</td>\n",
" <td>36</td>\n",
" <td>2403</td>\n",
" <td>[0, 191, 193, 195, 197, 200, 287, 288, 289, 29...</td>\n",
" <td>638</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>222</th>\n",
" <td>Walze WK 03 Umlenkwalze zapfen</td>\n",
" <td>30</td>\n",
" <td>1</td>\n",
" <td>[1]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>224</th>\n",
" <td>Leiter Nr. 90 und überprüfen</td>\n",
" <td>28</td>\n",
" <td>1</td>\n",
" <td>[1]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>225</th>\n",
" <td>Locht nicht mehr</td>\n",
" <td>16</td>\n",
" <td>1</td>\n",
" <td>[338]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>226</th>\n",
" <td>Maschine stellt immer wieder ab</td>\n",
" <td>31</td>\n",
" <td>1</td>\n",
" <td>[338]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>390</th>\n",
" <td>Gesetzliche Wartung Prüfung Anlagenprüfung Dru...</td>\n",
" <td>56</td>\n",
" <td>1</td>\n",
" <td>[547]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>391 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" descr len num_occur \\\n",
"60 Tägliche Interne Wartungstätigkeiten Weberei 44 92719 \n",
"10 01 Interne Reinigung Pflege Überprüfung 39 11250 \n",
"28 02 Interne Reinigung Pflege Überprüfung 39 3263 \n",
"29 Maschinen-Wartung wöchentlich 29 2408 \n",
"46 Gesetzliche Wartung Prüfung jährlich 36 2403 \n",
".. ... .. ... \n",
"222 Walze WK 03 Umlenkwalze zapfen 30 1 \n",
"224 Leiter Nr. 90 und überprüfen 28 1 \n",
"225 Locht nicht mehr 16 1 \n",
"226 Maschine stellt immer wieder ab 31 1 \n",
"390 Gesetzliche Wartung Prüfung Anlagenprüfung Dru... 56 1 \n",
"\n",
" assoc_obj_ids num_assoc_obj_ids \n",
"60 [0, 17, 41, 42, 43, 44, 45, 46, 47, 51, 52, 53... 206 \n",
"10 [0, 7, 425, 426, 427, 428, 429, 517, 518, 576,... 349 \n",
"28 [576, 906, 910, 940, 941, 942, 943, 1040, 1041... 52 \n",
"29 [1, 301, 305, 313, 314, 331, 332, 510, 511, 51... 25 \n",
"46 [0, 191, 193, 195, 197, 200, 287, 288, 289, 29... 638 \n",
".. ... ... \n",
"222 [1] 1 \n",
"224 [1] 1 \n",
"225 [338] 1 \n",
"226 [338] 1 \n",
"390 [547] 1 \n",
"\n",
"[391 rows x 5 columns]"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"# save/load dataframe\n",
"FILE_PATH = f'{feature}_analyse_1.fth'\n",
"if LOAD_CALC_FILES:\n",
"    temp1 = pd.read_feather(FILE_PATH)\n",
"    temp1 = temp1.set_index('index')\n",
"else:\n",
"    save_df = temp1.reset_index()\n",
"    save_df.to_feather(FILE_PATH)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Entire dataset"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"# analyse all entries; the commented-out slice below restricts the range for testing\n",
"descr = temp1[['descr', 'num_occur']]\n",
"#descr = descr.iloc[50:200,:]"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"#descr.iat[0,0] = 'Das ist ein Test am 24.08.2023'"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"391"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(descr)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>num_occur</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>92719</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>01 Interne Reinigung Pflege Überprüfung</td>\n",
" <td>11250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>02 Interne Reinigung Pflege Überprüfung</td>\n",
" <td>3263</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>Maschinen-Wartung wöchentlich</td>\n",
" <td>2408</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>Gesetzliche Wartung Prüfung jährlich</td>\n",
" <td>2403</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>222</th>\n",
" <td>Walze WK 03 Umlenkwalze zapfen</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>224</th>\n",
" <td>Leiter Nr. 90 und überprüfen</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>225</th>\n",
" <td>Locht nicht mehr</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>226</th>\n",
" <td>Maschine stellt immer wieder ab</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>390</th>\n",
" <td>Gesetzliche Wartung Prüfung Anlagenprüfung Dru...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>391 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" descr num_occur\n",
"60 Tägliche Interne Wartungstätigkeiten Weberei 92719\n",
"10 01 Interne Reinigung Pflege Überprüfung 11250\n",
"28 02 Interne Reinigung Pflege Überprüfung 3263\n",
"29 Maschinen-Wartung wöchentlich 2408\n",
"46 Gesetzliche Wartung Prüfung jährlich 2403\n",
".. ... ...\n",
"222 Walze WK 03 Umlenkwalze zapfen 1\n",
"224 Leiter Nr. 90 und überprüfen 1\n",
"225 Locht nicht mehr 1\n",
"226 Maschine stellt immer wieder ab 1\n",
"390 Gesetzliche Wartung Prüfung Anlagenprüfung Dru... 1\n",
"\n",
"[391 rows x 2 columns]"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"descr"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"#LOAD_CALC_FILES = True\n",
"#LOAD_CALC_FILES = False\n",
"#IS_TEST = True\n",
"IS_TEST = False"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:base:Number of entries processed: 1, Percent completed: 0.26\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:base:Number of entries processed: 101, Percent completed: 25.83\n",
"INFO:base:Number of entries processed: 201, Percent completed: 51.41\n",
"INFO:base:Number of entries processed: 301, Percent completed: 76.98\n"
]
}
],
"source": [
"# adjacency matrix\n",
"connections = dict()\n",
"unique_tokens = set()\n",
"UPDATE_STATUS = 100\n",
"length_data = len(descr)\n",
"spell_check_candidates = set()\n",
"spell_checker = SpellChecker(language='de', distance=1)\n",
"\n",
"if not LOAD_CALC_FILES or IS_TEST:\n",
"    for count, description in enumerate(descr.iterrows()):\n",
"\n",
"        text = description[1]['descr']\n",
"        weight = description[1]['num_occur']\n",
"\n",
"        doc = nlp(text)\n",
"\n",
"        obtain_descendant_info(\n",
"            doc=doc,\n",
"            weight=weight,\n",
"            POS_of_interest=POS_of_interest,\n",
"            TAG_of_interest=TAG_of_interest,\n",
"            connections=connections,\n",
"            unique_tokens=unique_tokens,\n",
"            spell_check_candidates=spell_check_candidates,\n",
"            spell_check_whitelist=spell_check_whitelist,\n",
"            spell_checker=spell_checker,\n",
"            corrections=corrections,\n",
"        )\n",
"\n",
"        if count % UPDATE_STATUS == 0:\n",
"            logger.info(f'Number of entries processed: {count+1}, Percent completed: {((count+1) / length_data) * 100:.2f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"ADJ_DF_PATH = f'./Graphanalyse/adj_mat_df_{feature}.fth'\n",
"if not IS_TEST:\n",
"    if LOAD_CALC_FILES:\n",
"        adj_mat_undir = pd.read_feather(ADJ_DF_PATH)\n",
"        adj_mat_undir = adj_mat_undir.set_index('index')\n",
"        # additional information\n",
"        connections = load_pickle('connections.pkl')\n",
"        unique_tokens = load_pickle('unique_tokens.pkl')\n",
"    else:\n",
"        adj_mat = obtain_adj_matrix(unique_tokens=unique_tokens, connections=connections)\n",
"        adj_mat_undir = make_undir_adj_matrix(adj_mat=adj_mat)\n",
"        save_df = adj_mat_undir.reset_index()\n",
"        save_df.to_feather(ADJ_DF_PATH)\n",
"        # additional information\n",
"        save_pickle(obj=connections, path='connections.pkl')\n",
"        save_pickle(obj=unique_tokens, path='unique_tokens.pkl')"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>lecken</th>\n",
" <th>WC</th>\n",
" <th>LKW</th>\n",
" <th>offen</th>\n",
" <th>Maschinen-Reinigung</th>\n",
" <th>Dockenwickler</th>\n",
" <th>halb-jährlich</th>\n",
" <th>Tisch</th>\n",
" <th>zentral</th>\n",
" <th>anbringen</th>\n",
" <th>...</th>\n",
" <th>undicht-</th>\n",
" <th>Platine</th>\n",
" <th>erneuern</th>\n",
" <th>Verschmutzung</th>\n",
" <th>befestigen</th>\n",
" <th>wechseln</th>\n",
" <th>Labor</th>\n",
" <th>Walze</th>\n",
" <th>anfahren</th>\n",
" <th>Leiter</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>12-monatige-Inspektion</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2-monatlich</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2-wöchentlich</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24-monatige-Inspektion</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3-jährlich</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Ölwechsel</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Überprüfung</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>äußerer</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>überprüfen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>überziehen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>390 rows × 390 columns</p>\n",
"</div>"
],
"text/plain": [
" lecken WC LKW offen Maschinen-Reinigung \\\n",
"12-monatige-Inspektion 0 0 0 0 0 \n",
"2-monatlich 0 0 0 0 0 \n",
"2-wöchentlich 0 0 0 0 0 \n",
"24-monatige-Inspektion 0 0 0 0 0 \n",
"3-jährlich 0 0 0 0 0 \n",
"... ... .. ... ... ... \n",
"Ölwechsel 0 0 0 0 0 \n",
"Überprüfung 0 0 0 0 0 \n",
"äußerer 0 0 0 0 0 \n",
"überprüfen 0 0 0 0 0 \n",
"überziehen 0 0 0 0 0 \n",
"\n",
" Dockenwickler halb-jährlich Tisch zentral \\\n",
"12-monatige-Inspektion 0 0 0 0 \n",
"2-monatlich 0 0 0 0 \n",
"2-wöchentlich 0 0 0 0 \n",
"24-monatige-Inspektion 0 0 0 0 \n",
"3-jährlich 0 0 0 0 \n",
"... ... ... ... ... \n",
"Ölwechsel 0 0 0 0 \n",
"Überprüfung 0 0 0 0 \n",
"äußerer 0 0 0 0 \n",
"überprüfen 0 0 0 0 \n",
"überziehen 0 0 0 0 \n",
"\n",
" anbringen ... undicht- Platine erneuern \\\n",
"12-monatige-Inspektion 0 ... 0 0 0 \n",
"2-monatlich 0 ... 0 0 0 \n",
"2-wöchentlich 0 ... 0 0 0 \n",
"24-monatige-Inspektion 0 ... 0 0 0 \n",
"3-jährlich 0 ... 0 0 0 \n",
"... ... ... ... ... ... \n",
"Ölwechsel 0 ... 0 0 0 \n",
"Überprüfung 0 ... 0 0 0 \n",
"äußerer 0 ... 0 0 0 \n",
"überprüfen 0 ... 0 0 0 \n",
"überziehen 0 ... 0 0 0 \n",
"\n",
" Verschmutzung befestigen wechseln Labor Walze \\\n",
"12-monatige-Inspektion 0 0 0 0 0 \n",
"2-monatlich 0 0 0 0 0 \n",
"2-wöchentlich 0 0 0 0 0 \n",
"24-monatige-Inspektion 0 0 0 0 0 \n",
"3-jährlich 0 0 0 0 0 \n",
"... ... ... ... ... ... \n",
"Ölwechsel 0 0 0 0 0 \n",
"Überprüfung 0 0 0 0 0 \n",
"äußerer 0 0 0 0 0 \n",
"überprüfen 0 0 0 0 0 \n",
"überziehen 0 0 0 0 1 \n",
"\n",
" anfahren Leiter \n",
"12-monatige-Inspektion 0 0 \n",
"2-monatlich 0 0 \n",
"2-wöchentlich 0 0 \n",
"24-monatige-Inspektion 0 0 \n",
"3-jährlich 0 0 \n",
"... ... ... \n",
"Ölwechsel 0 0 \n",
"Überprüfung 0 0 \n",
"äußerer 0 0 \n",
"überprüfen 0 1 \n",
"überziehen 0 0 \n",
"\n",
"[390 rows x 390 columns]"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adj_mat_undir.sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
"arr = adj_mat_undir.to_numpy()"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"391"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.count_nonzero(arr)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"92964"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.max(arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Threshold: edge weights below `WEIGHT_THRESHOLD` are set to 0."
]
},
{
"cell_type": "code",
"execution_count": 162,
"metadata": {},
"outputs": [],
"source": [
"WEIGHT_THRESHOLD = 0"
]
},
{
"cell_type": "code",
"execution_count": 163,
"metadata": {},
"outputs": [],
"source": [
"arr = adj_mat_undir.to_numpy()"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [],
"source": [
"arr = np.where(arr < WEIGHT_THRESHOLD, 0, arr)"
]
},
{
"cell_type": "code",
"execution_count": 165,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"391"
]
},
"execution_count": 165,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.count_nonzero(arr)"
]
},
{
"cell_type": "code",
"execution_count": 166,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"233"
]
},
"execution_count": 166,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp = np.sum(arr, axis=0)\n",
"np.count_nonzero(temp)"
]
},
{
"cell_type": "code",
"execution_count": 167,
"metadata": {},
"outputs": [],
"source": [
"thresh_adj_mat = adj_mat_undir.copy()\n",
"thresh_adj_mat.loc[:] = arr"
]
},
{
"cell_type": "code",
"execution_count": 168,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Wasserleitung</th>\n",
" <th>wechseln</th>\n",
" <th>Winkelpositionsgeber</th>\n",
" <th>Klimaanlagengerät</th>\n",
" <th>versetzen</th>\n",
" <th>Brennschlitten</th>\n",
" <th>feststellen</th>\n",
" <th>Stuhl</th>\n",
" <th>monatlich</th>\n",
" <th>anfertigen</th>\n",
" <th>...</th>\n",
" <th>Zahnriemen</th>\n",
" <th>Rampe</th>\n",
" <th>Tisch</th>\n",
" <th>defekt</th>\n",
" <th>Elektrische</th>\n",
" <th>haben</th>\n",
" <th>Wasserenthärtungsanlage</th>\n",
" <th>Gestank</th>\n",
" <th>Zahnrad</th>\n",
" <th>hydraulisch</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Wasserleitung</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>wechseln</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Winkelpositionsgeber</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Klimaanlagengerät</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>versetzen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>haben</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Wasserenthärtungsanlage</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Gestank</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Zahnrad</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hydraulisch</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>390 rows × 390 columns</p>\n",
"</div>"
],
"text/plain": [
" Wasserleitung wechseln Winkelpositionsgeber \\\n",
"Wasserleitung 0 0 0 \n",
"wechseln 0 0 0 \n",
"Winkelpositionsgeber 0 0 0 \n",
"Klimaanlagengerät 0 0 0 \n",
"versetzen 0 0 0 \n",
"... ... ... ... \n",
"haben 0 0 0 \n",
"Wasserenthärtungsanlage 0 0 0 \n",
"Gestank 0 0 0 \n",
"Zahnrad 0 0 0 \n",
"hydraulisch 0 0 0 \n",
"\n",
" Klimaanlagengerät versetzen Brennschlitten \\\n",
"Wasserleitung 0 0 0 \n",
"wechseln 0 0 0 \n",
"Winkelpositionsgeber 0 0 0 \n",
"Klimaanlagengerät 0 0 0 \n",
"versetzen 0 0 0 \n",
"... ... ... ... \n",
"haben 0 0 0 \n",
"Wasserenthärtungsanlage 0 0 0 \n",
"Gestank 0 0 0 \n",
"Zahnrad 0 0 0 \n",
"hydraulisch 0 0 0 \n",
"\n",
" feststellen Stuhl monatlich anfertigen ... \\\n",
"Wasserleitung 0 0 0 0 ... \n",
"wechseln 0 0 0 0 ... \n",
"Winkelpositionsgeber 0 0 0 0 ... \n",
"Klimaanlagengerät 0 0 0 0 ... \n",
"versetzen 0 0 0 0 ... \n",
"... ... ... ... ... ... \n",
"haben 0 0 0 0 ... \n",
"Wasserenthärtungsanlage 0 0 0 0 ... \n",
"Gestank 0 0 0 0 ... \n",
"Zahnrad 0 0 0 0 ... \n",
"hydraulisch 0 0 0 0 ... \n",
"\n",
" Zahnriemen Rampe Tisch defekt Elektrische haben \\\n",
"Wasserleitung 0 0 0 0 0 0 \n",
"wechseln 0 0 0 0 0 0 \n",
"Winkelpositionsgeber 0 0 0 1 0 0 \n",
"Klimaanlagengerät 0 0 0 0 0 0 \n",
"versetzen 0 0 0 0 0 0 \n",
"... ... ... ... ... ... ... \n",
"haben 0 0 0 0 0 0 \n",
"Wasserenthärtungsanlage 0 0 0 0 0 0 \n",
"Gestank 0 0 0 0 0 0 \n",
"Zahnrad 0 0 0 0 0 0 \n",
"hydraulisch 0 0 0 0 0 0 \n",
"\n",
" Wasserenthärtungsanlage Gestank Zahnrad \\\n",
"Wasserleitung 0 0 0 \n",
"wechseln 0 0 0 \n",
"Winkelpositionsgeber 0 0 0 \n",
"Klimaanlagengerät 0 0 0 \n",
"versetzen 0 0 0 \n",
"... ... ... ... \n",
"haben 0 0 0 \n",
"Wasserenthärtungsanlage 0 0 0 \n",
"Gestank 0 0 0 \n",
"Zahnrad 0 0 0 \n",
"hydraulisch 0 0 0 \n",
"\n",
" hydraulisch \n",
"Wasserleitung 0 \n",
"wechseln 0 \n",
"Winkelpositionsgeber 0 \n",
"Klimaanlagengerät 0 \n",
"versetzen 0 \n",
"... ... \n",
"haben 0 \n",
"Wasserenthärtungsanlage 0 \n",
"Gestank 0 \n",
"Zahnrad 0 \n",
"hydraulisch 0 \n",
"\n",
"[390 rows x 390 columns]"
]
},
"execution_count": 168,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"thresh_adj_mat"
]
},
{
"cell_type": "code",
"execution_count": 169,
"metadata": {},
"outputs": [],
"source": [
"ADJ_MAT_PATH_CSV = f'./Graphanalyse/adj_mat_thresh_{feature}_{WEIGHT_THRESHOLD}.csv'\n",
"thresh_adj_mat.to_csv(path_or_buf=ADJ_MAT_PATH_CSV, encoding='cp1252', sep=';')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Feature 3: ErledigungsBeschreibung"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"feature = 'ErledigungsBeschreibung'"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"base = wo_duplicates.copy()\n",
"base = base.dropna(axis=0, subset=feature)\n",
"base[feature] = base[feature].map(clean_string)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsID</th>\n",
" <th>ObjektID</th>\n",
" <th>HObjektText</th>\n",
" <th>ObjektArtID</th>\n",
" <th>ObjektArtText</th>\n",
" <th>VorgangsTypID</th>\n",
" <th>VorgangsTypName</th>\n",
" <th>VorgangsDatum</th>\n",
" <th>VorgangsStatusId</th>\n",
" <th>VorgangsPrioritaet</th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>VorgangsOrt</th>\n",
" <th>VorgangsArtText</th>\n",
" <th>ErledigungsDatum</th>\n",
" <th>ErledigungsArtText</th>\n",
" <th>ErledigungsBeschreibung</th>\n",
" <th>MPMelderArbeitsplatz</th>\n",
" <th>MPAbteilungBezeichnung</th>\n",
" <th>Arbeitsbeginn</th>\n",
" <th>ErstellungsDatum</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>58</td>\n",
" <td>257</td>\n",
" <td>107, Webmaschine, OM 220 EOS</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-21</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Gegengewicht wieder anbringen</td>\n",
" <td>NaN</td>\n",
" <td>Gegengewicht an der Webmaschine abgefallen</td>\n",
" <td>2019-03-21</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Schraube ausgebohrt Gegengewicht wieder angebr...</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>2019-03-21</td>\n",
" <td>2019-03-21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>81</td>\n",
" <td>138</td>\n",
" <td>00138, Schärmaschine 9,</td>\n",
" <td>16</td>\n",
" <td>Schärmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-25</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>da ist etwas gebrochen. (Herr Heininger)</td>\n",
" <td>NaN</td>\n",
" <td>zentrale Bremsenverstellung linke Gatterseite ...</td>\n",
" <td>2019-03-25</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Bolzen gebrochen. Bolzen neu angefertig und di...</td>\n",
" <td>Vorwerk</td>\n",
" <td>Vorwerk</td>\n",
" <td>2019-03-25</td>\n",
" <td>2019-03-25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>82</td>\n",
" <td>0</td>\n",
" <td>Warenschau allgemein</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-25</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Klappbügel Portalkran H31 defekt</td>\n",
" <td>Warenschau allgemein</td>\n",
" <td>Allgemeine Reparaturarbeiten</td>\n",
" <td>2019-03-25</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Feder ausgetauscht</td>\n",
" <td>Warenschau</td>\n",
" <td>Warenschau</td>\n",
" <td>2019-03-25</td>\n",
" <td>2019-03-25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>76</td>\n",
" <td>0</td>\n",
" <td>Neben der Türe</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-03-22</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Schraube nix mer gut</td>\n",
" <td>Neben der Türe</td>\n",
" <td>Kettbaum</td>\n",
" <td>2019-03-25</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>Schrauben ausgebohrt Gewinde nachgeschnitten</td>\n",
" <td>Vorwerk</td>\n",
" <td>Vorwerk</td>\n",
" <td>2019-03-25</td>\n",
" <td>2019-03-22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>111</td>\n",
" <td>241</td>\n",
" <td>294 C, Webmaschine, SG 240 EMS</td>\n",
" <td>5</td>\n",
" <td>Greifer-Webmaschine</td>\n",
" <td>3</td>\n",
" <td>Reparaturauftrag (Portal)</td>\n",
" <td>2019-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>KBK tauschen\\nUrsache vermutlich mechanisch</td>\n",
" <td>NaN</td>\n",
" <td>Kupplung-Brems-Kombination</td>\n",
" <td>2019-04-08</td>\n",
" <td>Reparatur UTT</td>\n",
" <td>da derzeit Keine Ersatzteile da Reparatur mit ...</td>\n",
" <td>Weberei</td>\n",
" <td>Weberei</td>\n",
" <td>2019-04-02</td>\n",
" <td>2019-04-01</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" VorgangsID ObjektID HObjektText ObjektArtID \\\n",
"3 58 257 107, Webmaschine, OM 220 EOS 3 \n",
"4 81 138 00138, Schärmaschine 9, 16 \n",
"5 82 0 Warenschau allgemein 0 \n",
"6 76 0 Neben der Türe 0 \n",
"8 111 241 294 C, Webmaschine, SG 240 EMS 5 \n",
"\n",
" ObjektArtText VorgangsTypID VorgangsTypName \\\n",
"3 Luft-Webmaschine 3 Reparaturauftrag (Portal) \n",
"4 Schärmaschine 3 Reparaturauftrag (Portal) \n",
"5 NaN 3 Reparaturauftrag (Portal) \n",
"6 NaN 3 Reparaturauftrag (Portal) \n",
"8 Greifer-Webmaschine 3 Reparaturauftrag (Portal) \n",
"\n",
" VorgangsDatum VorgangsStatusId VorgangsPrioritaet \\\n",
"3 2019-03-21 5 0 \n",
"4 2019-03-25 5 0 \n",
"5 2019-03-25 5 0 \n",
"6 2019-03-22 5 0 \n",
"8 2019-04-01 5 0 \n",
"\n",
" VorgangsBeschreibung VorgangsOrt \\\n",
"3 Gegengewicht wieder anbringen NaN \n",
"4 da ist etwas gebrochen. (Herr Heininger) NaN \n",
"5 Klappbügel Portalkran H31 defekt Warenschau allgemein \n",
"6 Schraube nix mer gut Neben der Türe \n",
"8 KBK tauschen\\nUrsache vermutlich mechanisch NaN \n",
"\n",
" VorgangsArtText ErledigungsDatum \\\n",
"3 Gegengewicht an der Webmaschine abgefallen 2019-03-21 \n",
"4 zentrale Bremsenverstellung linke Gatterseite ... 2019-03-25 \n",
"5 Allgemeine Reparaturarbeiten 2019-03-25 \n",
"6 Kettbaum 2019-03-25 \n",
"8 Kupplung-Brems-Kombination 2019-04-08 \n",
"\n",
" ErledigungsArtText ErledigungsBeschreibung \\\n",
"3 Reparatur UTT Schraube ausgebohrt Gegengewicht wieder angebr... \n",
"4 Reparatur UTT Bolzen gebrochen. Bolzen neu angefertig und di... \n",
"5 Reparatur UTT Feder ausgetauscht \n",
"6 Reparatur UTT Schrauben ausgebohrt Gewinde nachgeschnitten \n",
"8 Reparatur UTT da derzeit Keine Ersatzteile da Reparatur mit ... \n",
"\n",
" MPMelderArbeitsplatz MPAbteilungBezeichnung Arbeitsbeginn ErstellungsDatum \n",
"3 Weberei Weberei 2019-03-21 2019-03-21 \n",
"4 Vorwerk Vorwerk 2019-03-25 2019-03-25 \n",
"5 Warenschau Warenschau 2019-03-25 2019-03-25 \n",
"6 Vorwerk Vorwerk 2019-03-25 2019-03-22 \n",
"8 Weberei Weberei 2019-04-02 2019-04-01 "
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"base.head()"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entries: 118086\n"
]
}
],
"source": [
"descriptions = base[feature]\n",
"print(f\"Entries: {len(descriptions)}\")"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicate ErledigungsBeschreibung entries: 110707\n",
"Number of unique ErledigungsBeschreibung entries: 7379\n",
"Share of unique ErledigungsBeschreibung entries: 6.25 %\n"
]
}
],
"source": [
"num_dupl_descr = descriptions.duplicated().sum()\n",
"uni_descr = descriptions.unique()\n",
"num_uni_descr = len(uni_descr)\n",
"\n",
"print(f\"Number of duplicate {feature} entries: {num_dupl_descr}\")\n",
"print(f\"Number of unique {feature} entries: {num_uni_descr}\")\n",
"print(f\"Share of unique {feature} entries: {num_uni_descr / len(descriptions) * 100:.2f} %\")"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LOAD_CALC_FILES"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"if not LOAD_CALC_FILES:\n",
"    cols = ['descr', 'len', 'num_occur', 'assoc_obj_ids', 'num_assoc_obj_ids']\n",
"    descr_df = pd.DataFrame(columns=cols)\n",
"    # track the most frequent description while iterating\n",
"    max_val = 0\n",
"    text = None\n",
"    index = 0\n",
"\n",
"    # collect statistics for every unique description\n",
"    for idx, description in enumerate(uni_descr):\n",
"        len_descr = len(description)\n",
"        filt = base[feature] == description\n",
"        temp = base[filt]\n",
"        # sorted unique object IDs associated with this description\n",
"        assoc_obj_ids = temp['ObjektID'].unique()\n",
"        assoc_obj_ids = np.sort(assoc_obj_ids, kind='stable')\n",
"        num_assoc_obj_ids = len(assoc_obj_ids)\n",
"        # number of rows carrying exactly this description\n",
"        num_dupl = filt.sum()\n",
"\n",
"        conc_df = pd.DataFrame(data=[[\n",
"            description,\n",
"            len_descr,\n",
"            num_dupl,\n",
"            assoc_obj_ids,\n",
"            num_assoc_obj_ids\n",
"        ]], columns=cols)\n",
"\n",
"        descr_df = pd.concat([descr_df, conc_df], ignore_index=True)\n",
"\n",
"        if num_dupl > max_val:\n",
"            max_val = num_dupl\n",
"            index = idx\n",
"            text = description\n",
"\n",
"    # most frequent descriptions first\n",
"    temp1 = descr_df.sort_values(by='num_occur', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>len</th>\n",
" <th>num_occur</th>\n",
" <th>assoc_obj_ids</th>\n",
" <th>num_assoc_obj_ids</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>112</th>\n",
" <td>Sichtkontrolle durchgeführt Auffälligkeiten fe...</td>\n",
" <td>95</td>\n",
" <td>98720</td>\n",
" <td>[0, 1, 7, 17, 41, 42, 43, 44, 45, 46, 47, 51, ...</td>\n",
" <td>953</td>\n",
" </tr>\n",
" <tr>\n",
" <th>108</th>\n",
" <td>Sichtkontrolle durchgeführt Auffälligkeiten fe...</td>\n",
" <td>100</td>\n",
" <td>1450</td>\n",
" <td>[0, 1, 140, 301, 305, 313, 314, 576, 970, 1110...</td>\n",
" <td>28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>147</th>\n",
" <td>Externe Prüfung wurde durchgeführt Beanstandun...</td>\n",
" <td>119</td>\n",
" <td>1082</td>\n",
" <td>[191, 193, 195, 197, 200, 264, 287, 288, 289, ...</td>\n",
" <td>413</td>\n",
" </tr>\n",
" <tr>\n",
" <th>128</th>\n",
" <td>Reinigung durchgeführt Auffälligkeiten festges...</td>\n",
" <td>90</td>\n",
" <td>762</td>\n",
" <td>[0, 1, 7, 123, 136, 137, 138, 177, 298, 304, 3...</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96</th>\n",
" <td>Sichtkontrolle wie festgelegt durchgeführt Auf...</td>\n",
" <td>110</td>\n",
" <td>648</td>\n",
" <td>[1, 20, 21, 51, 52, 53, 54, 55, 56, 64, 65, 66...</td>\n",
" <td>271</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2805</th>\n",
" <td>X Achse Süd Führungswägen Kurze Version eingebaut</td>\n",
" <td>49</td>\n",
" <td>1</td>\n",
" <td>[21]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2804</th>\n",
" <td>Maschinenrahmen ausgerichtet und ausgebeult. M...</td>\n",
" <td>90</td>\n",
" <td>1</td>\n",
" <td>[144]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2803</th>\n",
" <td>Bügel und Stützräder getauscht</td>\n",
" <td>30</td>\n",
" <td>1</td>\n",
" <td>[315]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2802</th>\n",
" <td>Graf: TK wurde in Arbeitsauftrag 65487 gewandelt</td>\n",
" <td>48</td>\n",
" <td>1</td>\n",
" <td>[405]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7378</th>\n",
" <td>Neue Gasfeder eingebaut</td>\n",
" <td>23</td>\n",
" <td>1</td>\n",
" <td>[326]</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>7379 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
"                                                  descr  len  num_occur  \\\n",
"112   Sichtkontrolle durchgeführt Auffälligkeiten fe...   95      98720  \n",
"108   Sichtkontrolle durchgeführt Auffälligkeiten fe...  100       1450  \n",
"147   Externe Prüfung wurde durchgeführt Beanstandun...  119       1082  \n",
"128   Reinigung durchgeführt Auffälligkeiten festges...   90        762  \n",
"96    Sichtkontrolle wie festgelegt durchgeführt Auf...  110        648  \n",
"...                                                 ...  ...        ...  \n",
"2805  X Achse Süd Führungswägen Kurze Version eingebaut   49          1  \n",
"2804  Maschinenrahmen ausgerichtet und ausgebeult. M...   90          1  \n",
"2803                     Bügel und Stützräder getauscht   30          1  \n",
"2802   Graf: TK wurde in Arbeitsauftrag 65487 gewandelt   48          1  \n",
"7378                            Neue Gasfeder eingebaut   23          1  \n",
"\n",
"                                          assoc_obj_ids  num_assoc_obj_ids\n",
"112   [0, 1, 7, 17, 41, 42, 43, 44, 45, 46, 47, 51, ...                953\n",
"108   [0, 1, 140, 301, 305, 313, 314, 576, 970, 1110...                 28\n",
"147   [191, 193, 195, 197, 200, 264, 287, 288, 289, ...                413\n",
"128   [0, 1, 7, 123, 136, 137, 138, 177, 298, 304, 3...                 90\n",
"96    [1, 20, 21, 51, 52, 53, 54, 55, 56, 64, 65, 66...                271\n",
"...                                                 ...                ...\n",
"2805                                               [21]                  1\n",
"2804                                              [144]                  1\n",
"2803                                              [315]                  1\n",
"2802                                              [405]                  1\n",
"7378                                              [326]                  1\n",
"\n",
"[7379 rows x 5 columns]"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Sichtkontrolle durchgeführt Auffälligkeiten festgestellt vom Ausführenden bitte dazu schreiben:'"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1.iat[0,0]"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Sichtkontrolle durchgeführt Auffälligkeiten festgestellt vom Ausführenden bitte dazu schreiben: Nein'"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp1.iat[1,0]"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
"# save/load dataframe\n",
"FILE_PATH = f'{feature}_analyse_1.fth'\n",
"if LOAD_CALC_FILES:\n",
"    temp1 = pd.read_feather(FILE_PATH)\n",
"    temp1 = temp1.set_index('index')\n",
"else:\n",
"    save_df = temp1.reset_index()\n",
"    save_df.to_feather(FILE_PATH)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Entire dataset"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [],
"source": [
"# select description text and occurrence count for all unique entries\n",
"descr = temp1[['descr', 'num_occur']]\n",
"#descr = descr.iloc[50:200,:]"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"#descr.iat[0,0] = 'Das ist ein Test am 24.08.2023'"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7379"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(descr)"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>descr</th>\n",
" <th>num_occur</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>112</th>\n",
" <td>Sichtkontrolle durchgeführt Auffälligkeiten fe...</td>\n",
" <td>98720</td>\n",
" </tr>\n",
" <tr>\n",
" <th>108</th>\n",
" <td>Sichtkontrolle durchgeführt Auffälligkeiten fe...</td>\n",
" <td>1450</td>\n",
" </tr>\n",
" <tr>\n",
" <th>147</th>\n",
" <td>Externe Prüfung wurde durchgeführt Beanstandun...</td>\n",
" <td>1082</td>\n",
" </tr>\n",
" <tr>\n",
" <th>128</th>\n",
" <td>Reinigung durchgeführt Auffälligkeiten festges...</td>\n",
" <td>762</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96</th>\n",
" <td>Sichtkontrolle wie festgelegt durchgeführt Auf...</td>\n",
" <td>648</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2805</th>\n",
" <td>X Achse Süd Führungswägen Kurze Version eingebaut</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2804</th>\n",
" <td>Maschinenrahmen ausgerichtet und ausgebeult. M...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2803</th>\n",
" <td>Bügel und Stützräder getauscht</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2802</th>\n",
" <td>Graf: TK wurde in Arbeitsauftrag 65487 gewandelt</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7378</th>\n",
" <td>Neue Gasfeder eingebaut</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>7379 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
"                                                  descr  num_occur\n",
"112   Sichtkontrolle durchgeführt Auffälligkeiten fe...      98720\n",
"108   Sichtkontrolle durchgeführt Auffälligkeiten fe...       1450\n",
"147   Externe Prüfung wurde durchgeführt Beanstandun...       1082\n",
"128   Reinigung durchgeführt Auffälligkeiten festges...        762\n",
"96    Sichtkontrolle wie festgelegt durchgeführt Auf...        648\n",
"...                                                 ...        ...\n",
"2805  X Achse Süd Führungswägen Kurze Version eingebaut          1\n",
"2804  Maschinenrahmen ausgerichtet und ausgebeult. M...          1\n",
"2803                     Bügel und Stützräder getauscht          1\n",
"2802   Graf: TK wurde in Arbeitsauftrag 65487 gewandelt          1\n",
"7378                            Neue Gasfeder eingebaut          1\n",
"\n",
"[7379 rows x 2 columns]"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"descr"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [],
"source": [
"#LOAD_CALC_FILES = True\n",
"#LOAD_CALC_FILES = False\n",
"#IS_TEST = True\n",
"IS_TEST = False"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:base:Number of entries processed: 1, Percent completed: 0.01\n",
"INFO:base:Number of entries processed: 501, Percent completed: 6.79\n",
"INFO:base:Number of entries processed: 1001, Percent completed: 13.57\n",
"INFO:base:Number of entries processed: 1501, Percent completed: 20.34\n",
"INFO:base:Number of entries processed: 2001, Percent completed: 27.12\n",
"INFO:base:Number of entries processed: 2501, Percent completed: 33.89\n",
"INFO:base:Number of entries processed: 3001, Percent completed: 40.67\n",
"INFO:base:Number of entries processed: 3501, Percent completed: 47.45\n",
"INFO:base:Number of entries processed: 4001, Percent completed: 54.22\n",
"INFO:base:Number of entries processed: 4501, Percent completed: 61.00\n",
"INFO:base:Number of entries processed: 5001, Percent completed: 67.77\n",
"INFO:base:Number of entries processed: 5501, Percent completed: 74.55\n",
"INFO:base:Number of entries processed: 6001, Percent completed: 81.33\n",
"INFO:base:Number of entries processed: 6501, Percent completed: 88.10\n",
"INFO:base:Number of entries processed: 7001, Percent completed: 94.88\n"
]
}
],
"source": [
"# collect token connections as the basis for the adjacency matrix\n",
"connections = dict()\n",
"unique_tokens = set()\n",
"UPDATE_STATUS = 500\n",
"length_data = len(descr)\n",
"spell_check_candidates = set()\n",
"spell_checker = SpellChecker(language='de', distance=1)\n",
"\n",
"if not LOAD_CALC_FILES or IS_TEST:\n",
"    for count, description in enumerate(descr.iterrows()):\n",
"\n",
"        # iterrows() yields (index, row); the occurrence count serves as weight\n",
"        text = description[1]['descr']\n",
"        weight = description[1]['num_occur']\n",
"\n",
"        doc = nlp(text)\n",
"\n",
"        obtain_descendant_info(\n",
"            doc=doc,\n",
"            weight=weight,\n",
"            POS_of_interest=POS_of_interest,\n",
"            TAG_of_interest=TAG_of_interest,\n",
"            connections=connections,\n",
"            unique_tokens=unique_tokens,\n",
"            spell_check_candidates=spell_check_candidates,\n",
"            spell_check_whitelist=spell_check_whitelist,\n",
"            spell_checker=spell_checker,\n",
"            corrections=corrections,\n",
"        )\n",
"\n",
"        # log progress every UPDATE_STATUS entries\n",
"        if count % UPDATE_STATUS == 0:\n",
"            logger.info(f'Number of entries processed: {count+1}, Percent completed: {((count+1) / length_data) * 100:.2f}')"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"ADJ_DF_PATH = f'./Graphanalyse/adj_mat_df_{feature}.fth'\n",
"if not IS_TEST:\n",
"    if LOAD_CALC_FILES:\n",
"        adj_mat_undir = pd.read_feather(ADJ_DF_PATH)\n",
"        adj_mat_undir = adj_mat_undir.set_index('index')\n",
"        # additional information\n",
"        connections = load_pickle('connections.pkl')\n",
"        unique_tokens = load_pickle('unique_tokens.pkl')\n",
"    else:\n",
"        adj_mat = obtain_adj_matrix(unique_tokens=unique_tokens, connections=connections)\n",
"        adj_mat_undir = make_undir_adj_matrix(adj_mat=adj_mat)\n",
"        save_df = adj_mat_undir.reset_index()\n",
"        save_df.to_feather(ADJ_DF_PATH)\n",
"        # additional information\n",
"        save_pickle(obj=connections, path='connections.pkl')\n",
"        save_pickle(obj=unique_tokens, path='unique_tokens.pkl')"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>funktionsfähig</th>\n",
" <th>Zwischenbehälter</th>\n",
" <th>Ölfilter</th>\n",
" <th>Rechter</th>\n",
" <th>Kontaktproblem</th>\n",
" <th>Geschweisst</th>\n",
" <th>vorbereiten</th>\n",
" <th>Gelenkbolzen</th>\n",
" <th>Silikonfass</th>\n",
" <th>Ausbau</th>\n",
" <th>...</th>\n",
" <th>Kom</th>\n",
" <th>anlernen</th>\n",
" <th>nah</th>\n",
" <th>Begutachtung</th>\n",
" <th>Betriebszeit</th>\n",
" <th>paletten</th>\n",
" <th>augetreten</th>\n",
" <th>Antriebszahnrad</th>\n",
" <th>Gewindereparaturset</th>\n",
" <th>Heizventil</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>-20C</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>-Befestihgung</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>-Einlaufwalze</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>-Entlüftungssicherung</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>-Faltbalken</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>überzogenn</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>überzoggen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>übrtprüfen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ünerziehen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>üperprüfen</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6946 rows × 6946 columns</p>\n",
"</div>"
],
"text/plain": [
" funktionsfähig Zwischenbehälter Ölfilter Rechter \\\n",
"-20C 0 0 0 0 \n",
"-Befestihgung 0 0 0 0 \n",
"-Einlaufwalze 0 0 0 0 \n",
"-Entlüftungssicherung 0 0 0 0 \n",
"-Faltbalken 0 0 0 0 \n",
"... ... ... ... ... \n",
"überzogenn 0 0 0 0 \n",
"überzoggen 0 0 0 0 \n",
"übrtprüfen 0 0 0 0 \n",
"ünerziehen 0 0 0 0 \n",
"üperprüfen 0 0 0 0 \n",
"\n",
" Kontaktproblem Geschweisst vorbereiten Gelenkbolzen \\\n",
"-20C 0 0 0 0 \n",
"-Befestihgung 0 0 0 0 \n",
"-Einlaufwalze 0 0 0 0 \n",
"-Entlüftungssicherung 0 0 0 0 \n",
"-Faltbalken 0 0 0 0 \n",
"... ... ... ... ... \n",
"überzogenn 0 0 0 0 \n",
"überzoggen 0 0 0 0 \n",
"übrtprüfen 0 0 0 0 \n",
"ünerziehen 0 0 0 0 \n",
"üperprüfen 0 0 0 0 \n",
"\n",
" Silikonfass Ausbau ... Kom anlernen nah \\\n",
"-20C 0 0 ... 0 0 0 \n",
"-Befestihgung 0 0 ... 0 0 0 \n",
"-Einlaufwalze 0 0 ... 0 0 0 \n",
"-Entlüftungssicherung 0 0 ... 0 0 0 \n",
"-Faltbalken 0 0 ... 0 0 0 \n",
"... ... ... ... ... ... ... \n",
"überzogenn 0 0 ... 0 0 0 \n",
"überzoggen 0 0 ... 0 0 0 \n",
"übrtprüfen 0 0 ... 0 0 0 \n",
"ünerziehen 0 0 ... 0 0 0 \n",
"üperprüfen 0 0 ... 0 0 0 \n",
"\n",
" Begutachtung Betriebszeit paletten augetreten \\\n",
"-20C 0 0 0 0 \n",
"-Befestihgung 0 0 0 0 \n",
"-Einlaufwalze 0 0 0 0 \n",
"-Entlüftungssicherung 0 0 0 0 \n",
"-Faltbalken 0 0 0 0 \n",
"... ... ... ... ... \n",
"überzogenn 0 0 0 0 \n",
"überzoggen 0 0 0 0 \n",
"übrtprüfen 0 0 0 0 \n",
"ünerziehen 0 0 0 0 \n",
"üperprüfen 0 0 0 0 \n",
"\n",
" Antriebszahnrad Gewindereparaturset Heizventil \n",
"-20C 0 0 0 \n",
"-Befestihgung 0 0 0 \n",
"-Einlaufwalze 0 0 0 \n",
"-Entlüftungssicherung 0 0 0 \n",
"-Faltbalken 0 0 0 \n",
"... ... ... ... \n",
"überzogenn 0 0 0 \n",
"überzoggen 0 0 0 \n",
"übrtprüfen 0 0 0 \n",
"ünerziehen 0 0 0 \n",
"üperprüfen 0 0 0 \n",
"\n",
"[6946 rows x 6946 columns]"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adj_mat_undir.sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
"arr = adj_mat_undir.to_numpy()"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"24171"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.count_nonzero(arr)"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"103601"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.max(arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Threshold: set all edge weights below `WEIGHT_THRESHOLD` to zero."
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [],
"source": [
"WEIGHT_THRESHOLD = 30"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [],
"source": [
"arr = adj_mat_undir.to_numpy()"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [],
"source": [
"arr = np.where(arr < WEIGHT_THRESHOLD, 0, arr)"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"138"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.count_nonzero(arr)"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [],
"source": [
"thresh_adj_mat = adj_mat_undir.copy()\n",
"thresh_adj_mat.loc[:] = arr"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>funktionsfähig</th>\n",
" <th>Zwischenbehälter</th>\n",
" <th>Ölfilter</th>\n",
" <th>Rechter</th>\n",
" <th>Kontaktproblem</th>\n",
" <th>Geschweisst</th>\n",
" <th>vorbereiten</th>\n",
" <th>Gelenkbolzen</th>\n",
" <th>Silikonfass</th>\n",
" <th>Ausbau</th>\n",
" <th>...</th>\n",
" <th>Kom</th>\n",
" <th>anlernen</th>\n",
" <th>nah</th>\n",
" <th>Begutachtung</th>\n",
" <th>Betriebszeit</th>\n",
" <th>paletten</th>\n",
" <th>augetreten</th>\n",
" <th>Antriebszahnrad</th>\n",
" <th>Gewindereparaturset</th>\n",
" <th>Heizventil</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>funktionsfähig</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Zwischenbehälter</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Ölfilter</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Rechter</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Kontaktproblem</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>paletten</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>augetreten</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Antriebszahnrad</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Gewindereparaturset</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Heizventil</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6946 rows × 6946 columns</p>\n",
"</div>"
],
"text/plain": [
" funktionsfähig Zwischenbehälter Ölfilter Rechter \\\n",
"funktionsfähig 0 0 0 0 \n",
"Zwischenbehälter 0 0 0 0 \n",
"Ölfilter 0 0 0 0 \n",
"Rechter 0 0 0 0 \n",
"Kontaktproblem 0 0 0 0 \n",
"... ... ... ... ... \n",
"paletten 0 0 0 0 \n",
"augetreten 0 0 0 0 \n",
"Antriebszahnrad 0 0 0 0 \n",
"Gewindereparaturset 0 0 0 0 \n",
"Heizventil 0 0 0 0 \n",
"\n",
" Kontaktproblem Geschweisst vorbereiten Gelenkbolzen \\\n",
"funktionsfähig 0 0 0 0 \n",
"Zwischenbehälter 0 0 0 0 \n",
"Ölfilter 0 0 0 0 \n",
"Rechter 0 0 0 0 \n",
"Kontaktproblem 0 0 0 0 \n",
"... ... ... ... ... \n",
"paletten 0 0 0 0 \n",
"augetreten 0 0 0 0 \n",
"Antriebszahnrad 0 0 0 0 \n",
"Gewindereparaturset 0 0 0 0 \n",
"Heizventil 0 0 0 0 \n",
"\n",
" Silikonfass Ausbau ... Kom anlernen nah \\\n",
"funktionsfähig 0 0 ... 0 0 0 \n",
"Zwischenbehälter 0 0 ... 0 0 0 \n",
"Ölfilter 0 0 ... 0 0 0 \n",
"Rechter 0 0 ... 0 0 0 \n",
"Kontaktproblem 0 0 ... 0 0 0 \n",
"... ... ... ... ... ... ... \n",
"paletten 0 0 ... 0 0 0 \n",
"augetreten 0 0 ... 0 0 0 \n",
"Antriebszahnrad 0 0 ... 0 0 0 \n",
"Gewindereparaturset 0 0 ... 0 0 0 \n",
"Heizventil 0 0 ... 0 0 0 \n",
"\n",
" Begutachtung Betriebszeit paletten augetreten \\\n",
"funktionsfähig 0 0 0 0 \n",
"Zwischenbehälter 0 0 0 0 \n",
"Ölfilter 0 0 0 0 \n",
"Rechter 0 0 0 0 \n",
"Kontaktproblem 0 0 0 0 \n",
"... ... ... ... ... \n",
"paletten 0 0 0 0 \n",
"augetreten 0 0 0 0 \n",
"Antriebszahnrad 0 0 0 0 \n",
"Gewindereparaturset 0 0 0 0 \n",
"Heizventil 0 0 0 0 \n",
"\n",
" Antriebszahnrad Gewindereparaturset Heizventil \n",
"funktionsfähig 0 0 0 \n",
"Zwischenbehälter 0 0 0 \n",
"Ölfilter 0 0 0 \n",
"Rechter 0 0 0 \n",
"Kontaktproblem 0 0 0 \n",
"... ... ... ... \n",
"paletten 0 0 0 \n",
"augetreten 0 0 0 \n",
"Antriebszahnrad 0 0 0 \n",
"Gewindereparaturset 0 0 0 \n",
"Heizventil 0 0 0 \n",
"\n",
"[6946 rows x 6946 columns]"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"thresh_adj_mat"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [],
"source": [
"ADJ_MAT_PATH_CSV = f'./Graphanalyse/adj_mat_thresh_{feature}_{WEIGHT_THRESHOLD}.csv'\n",
"thresh_adj_mat.to_csv(path_or_buf=ADJ_MAT_PATH_CSV, encoding='cp1252', sep=';')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# **Addendum**\n",
"\n",
"#### **Example: analysis of the entry with the most duplicates**"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of entries with the selected description: 47689\n"
]
}
],
"source": [
"# pick the description under study (index 171 in uni_descr) and keep only matching rows\n",
"crit = uni_descr[171]\n",
"filt = wo_duplicates['VorgangsBeschreibung'] == crit\n",
"temp = wo_duplicates[filt]\n",
"print(f\"Number of entries with the selected description: {len(temp)}\")"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsID</th>\n",
" <th>ObjektID</th>\n",
" <th>HObjektText</th>\n",
" <th>ObjektArtID</th>\n",
" <th>ObjektArtText</th>\n",
" <th>VorgangsTypID</th>\n",
" <th>VorgangsTypName</th>\n",
" <th>VorgangsDatum</th>\n",
" <th>VorgangsStatusId</th>\n",
" <th>VorgangsPrioritaet</th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>VorgangsOrt</th>\n",
" <th>VorgangsArtText</th>\n",
" <th>ErledigungsDatum</th>\n",
" <th>ErledigungsArtText</th>\n",
" <th>ErledigungsBeschreibung</th>\n",
" <th>MPMelderArbeitsplatz</th>\n",
" <th>MPAbteilungBezeichnung</th>\n",
" <th>Arbeitsbeginn</th>\n",
" <th>ErstellungsDatum</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>288</th>\n",
" <td>155717</td>\n",
" <td>187</td>\n",
" <td>246, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>289</th>\n",
" <td>152507</td>\n",
" <td>177</td>\n",
" <td>204 S SI , Webmaschine, DL 280 EMS Breite 220</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-09</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-09</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-09</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>318</th>\n",
" <td>255972</td>\n",
" <td>249</td>\n",
" <td>203 C S SI, Webmaschine, DL 280 EMS Breite 220</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-07-30</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-07-30</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-07-30</td>\n",
" <td>2022-04-28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>319</th>\n",
" <td>255977</td>\n",
" <td>249</td>\n",
" <td>203 C S SI, Webmaschine, DL 280 EMS Breite 220</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-08-04</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-08-04</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-08-04</td>\n",
" <td>2022-04-28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>340</th>\n",
" <td>267942</td>\n",
" <td>187</td>\n",
" <td>246, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-08-07</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-08-07</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-08-07</td>\n",
" <td>2022-08-05</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" VorgangsID ObjektID HObjektText \\\n",
"288 155717 187 246, Webmaschine Jacquard, \n",
"289 152507 177 204 S SI , Webmaschine, DL 280 EMS Breite 220 \n",
"318 255972 249 203 C S SI, Webmaschine, DL 280 EMS Breite 220 \n",
"319 255977 249 203 C S SI, Webmaschine, DL 280 EMS Breite 220 \n",
"340 267942 187 246, Webmaschine Jacquard, \n",
"\n",
" ObjektArtID ObjektArtText VorgangsTypID VorgangsTypName \\\n",
"288 6 Jacquard-Webmaschine 1 Wartung \n",
"289 3 Luft-Webmaschine 1 Wartung \n",
"318 3 Luft-Webmaschine 1 Wartung \n",
"319 3 Luft-Webmaschine 1 Wartung \n",
"340 6 Jacquard-Webmaschine 1 Wartung \n",
"\n",
" VorgangsDatum VorgangsStatusId VorgangsPrioritaet \\\n",
"288 2022-04-01 5 0 \n",
"289 2022-04-09 5 0 \n",
"318 2022-07-30 5 0 \n",
"319 2022-08-04 5 0 \n",
"340 2022-08-07 5 0 \n",
"\n",
" VorgangsBeschreibung VorgangsOrt \\\n",
"288 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"289 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"318 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"319 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"340 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"\n",
" VorgangsArtText ErledigungsDatum \\\n",
"288 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"289 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-09 \n",
"318 Tägliche Interne Wartungstätigkeiten Weberei 2022-07-30 \n",
"319 Tägliche Interne Wartungstätigkeiten Weberei 2022-08-04 \n",
"340 Tägliche Interne Wartungstätigkeiten Weberei 2022-08-07 \n",
"\n",
" ErledigungsArtText \\\n",
"288 Intern UTT - Sichtkontrolle \n",
"289 Intern UTT - Sichtkontrolle \n",
"318 Intern UTT - Sichtkontrolle \n",
"319 Intern UTT - Sichtkontrolle \n",
"340 Intern UTT - Sichtkontrolle \n",
"\n",
" ErledigungsBeschreibung MPMelderArbeitsplatz \\\n",
"288 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"289 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"318 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"319 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"340 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"\n",
" MPAbteilungBezeichnung Arbeitsbeginn ErstellungsDatum \n",
"288 NaN 2022-04-01 2022-02-17 \n",
"289 NaN 2022-04-09 2022-02-17 \n",
"318 NaN 2022-07-30 2022-04-28 \n",
"319 NaN 2022-08-04 2022-04-28 \n",
"340 NaN 2022-08-07 2022-08-05 "
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp.head()"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"# check which features differ between the duplicated entries\n",
"analyse_columns = ['ObjektID', 'VorgangsTypID', 'VorgangsTypName']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ObjektID"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 187, 177, 249, 2654, 1792, 272, 271, 270, 269, 268, 186,\n",
" 178, 179, 2317, 2318, 2473, 2559, 1244, 240, 241, 180, 220,\n",
" 221, 222, 223, 224, 961, 962, 2166, 3212, 267, 266, 181,\n",
" 182, 213, 214, 174, 175, 176, 156, 157, 158, 247, 248,\n",
" 183, 265, 278, 1793, 1794, 218, 217, 219, 215, 216, 2319,\n",
" 2320, 228, 184, 152, 153, 2165, 154, 155, 159, 167, 168,\n",
" 169, 2313, 2314, 2315, 2316, 212, 211, 160, 161, 162, 164,\n",
" 165, 166, 264, 273, 274, 277, 276, 275, 279, 280, 281,\n",
" 282, 283, 242, 243, 244, 245, 246, 225, 227, 229, 170,\n",
" 171, 172, 173, 230, 231, 3213, 3211, 3214], dtype=int64)"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp['ObjektID'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
"filt = temp['ObjektID'] == 2318\n",
"temp_fil1 = temp[filt]"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsID</th>\n",
" <th>ObjektID</th>\n",
" <th>HObjektText</th>\n",
" <th>ObjektArtID</th>\n",
" <th>ObjektArtText</th>\n",
" <th>VorgangsTypID</th>\n",
" <th>VorgangsTypName</th>\n",
" <th>VorgangsDatum</th>\n",
" <th>VorgangsStatusId</th>\n",
" <th>VorgangsPrioritaet</th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>VorgangsOrt</th>\n",
" <th>VorgangsArtText</th>\n",
" <th>ErledigungsDatum</th>\n",
" <th>ErledigungsArtText</th>\n",
" <th>ErledigungsBeschreibung</th>\n",
" <th>MPMelderArbeitsplatz</th>\n",
" <th>MPAbteilungBezeichnung</th>\n",
" <th>Arbeitsbeginn</th>\n",
" <th>ErstellungsDatum</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>878</th>\n",
" <td>269743</td>\n",
" <td>2318</td>\n",
" <td>A067, Webmaschine, DL 280 EMS Breite 280</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-10-31</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-10-31</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-10-31</td>\n",
" <td>2022-08-05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6099</th>\n",
" <td>152490</td>\n",
" <td>2318</td>\n",
" <td>A067, Webmaschine, DL 280 EMS Breite 280</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-03-24</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-03-24</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-03-24</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13905</th>\n",
" <td>152476</td>\n",
" <td>2318</td>\n",
" <td>A067, Webmaschine, DL 280 EMS Breite 280</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-03-10</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-03-10</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-03-10</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14019</th>\n",
" <td>248301</td>\n",
" <td>2318</td>\n",
" <td>A067, Webmaschine, DL 280 EMS Breite 280</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-28</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-28</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-28</td>\n",
" <td>2022-04-14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14211</th>\n",
" <td>254914</td>\n",
" <td>2318</td>\n",
" <td>A067, Webmaschine, DL 280 EMS Breite 280</td>\n",
" <td>3</td>\n",
" <td>Luft-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-05-19</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-05-19</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-05-19</td>\n",
" <td>2022-04-28</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" VorgangsID ObjektID HObjektText \\\n",
"878 269743 2318 A067, Webmaschine, DL 280 EMS Breite 280 \n",
"6099 152490 2318 A067, Webmaschine, DL 280 EMS Breite 280 \n",
"13905 152476 2318 A067, Webmaschine, DL 280 EMS Breite 280 \n",
"14019 248301 2318 A067, Webmaschine, DL 280 EMS Breite 280 \n",
"14211 254914 2318 A067, Webmaschine, DL 280 EMS Breite 280 \n",
"\n",
" ObjektArtID ObjektArtText VorgangsTypID VorgangsTypName \\\n",
"878 3 Luft-Webmaschine 1 Wartung \n",
"6099 3 Luft-Webmaschine 1 Wartung \n",
"13905 3 Luft-Webmaschine 1 Wartung \n",
"14019 3 Luft-Webmaschine 1 Wartung \n",
"14211 3 Luft-Webmaschine 1 Wartung \n",
"\n",
" VorgangsDatum VorgangsStatusId VorgangsPrioritaet \\\n",
"878 2022-10-31 5 0 \n",
"6099 2022-03-24 5 0 \n",
"13905 2022-03-10 5 0 \n",
"14019 2022-04-28 5 0 \n",
"14211 2022-05-19 5 0 \n",
"\n",
" VorgangsBeschreibung VorgangsOrt \\\n",
"878 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"6099 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"13905 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"14019 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"14211 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"\n",
" VorgangsArtText ErledigungsDatum \\\n",
"878 Tägliche Interne Wartungstätigkeiten Weberei 2022-10-31 \n",
"6099 Tägliche Interne Wartungstätigkeiten Weberei 2022-03-24 \n",
"13905 Tägliche Interne Wartungstätigkeiten Weberei 2022-03-10 \n",
"14019 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-28 \n",
"14211 Tägliche Interne Wartungstätigkeiten Weberei 2022-05-19 \n",
"\n",
" ErledigungsArtText \\\n",
"878 Intern UTT - Sichtkontrolle \n",
"6099 Intern UTT - Sichtkontrolle \n",
"13905 Intern UTT - Sichtkontrolle \n",
"14019 Intern UTT - Sichtkontrolle \n",
"14211 Intern UTT - Sichtkontrolle \n",
"\n",
" ErledigungsBeschreibung MPMelderArbeitsplatz \\\n",
"878 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"6099 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"13905 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"14019 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"14211 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"\n",
" MPAbteilungBezeichnung Arbeitsbeginn ErstellungsDatum \n",
"878 NaN 2022-10-31 2022-08-05 \n",
"6099 NaN 2022-03-24 2022-02-17 \n",
"13905 NaN 2022-03-10 2022-02-17 \n",
"14019 NaN 2022-04-28 2022-04-14 \n",
"14211 NaN 2022-05-19 2022-04-28 "
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp_fil1.head()"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<DatetimeArray>\n",
"['2022-10-31 00:00:00', '2022-03-24 00:00:00', '2022-03-10 00:00:00',\n",
" '2022-04-28 00:00:00', '2022-05-19 00:00:00', '2022-04-09 00:00:00',\n",
" '2022-04-21 00:00:00', '2022-06-11 00:00:00', '2022-05-12 00:00:00',\n",
" '2022-04-23 00:00:00',\n",
" ...\n",
" '2022-10-28 00:00:00', '2022-07-06 00:00:00', '2023-06-14 00:00:00',\n",
" '2022-10-29 00:00:00', '2022-07-07 00:00:00', '2023-06-15 00:00:00',\n",
" '2022-05-05 00:00:00', '2022-10-30 00:00:00', '2022-07-08 00:00:00',\n",
" '2022-10-19 00:00:00']\n",
"Length: 462, dtype: datetime64[ns]"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp_fil1['VorgangsDatum'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"462"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(temp_fil1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"VorgangsID"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of unique VorgangsID values: 1855 (3.89 % of the filtered entries)\n"
]
}
],
"source": [
"# the share is relative to the filtered subset `temp`, not to the full dataset\n",
"uni_VorgangsID = temp['VorgangsID'].unique()\n",
"num_uni_VorgangsID = len(uni_VorgangsID)\n",
"print(f'Number of unique VorgangsID values: {num_uni_VorgangsID} ({num_uni_VorgangsID / len(temp) * 100:.2f} % of the filtered entries)')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"155717"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"uni_VorgangsID[0]"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"filt = temp['VorgangsID'] == uni_VorgangsID[0]\n",
"temp_fil1 = temp[filt]"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsID</th>\n",
" <th>ObjektID</th>\n",
" <th>HObjektText</th>\n",
" <th>ObjektArtID</th>\n",
" <th>ObjektArtText</th>\n",
" <th>VorgangsTypID</th>\n",
" <th>VorgangsTypName</th>\n",
" <th>VorgangsDatum</th>\n",
" <th>VorgangsStatusId</th>\n",
" <th>VorgangsPrioritaet</th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>VorgangsOrt</th>\n",
" <th>VorgangsArtText</th>\n",
" <th>ErledigungsDatum</th>\n",
" <th>ErledigungsArtText</th>\n",
" <th>ErledigungsBeschreibung</th>\n",
" <th>MPMelderArbeitsplatz</th>\n",
" <th>MPAbteilungBezeichnung</th>\n",
" <th>Arbeitsbeginn</th>\n",
" <th>ErstellungsDatum</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>288</th>\n",
" <td>155717</td>\n",
" <td>187</td>\n",
" <td>246, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2718</th>\n",
" <td>155717</td>\n",
" <td>1792</td>\n",
" <td>A057, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2719</th>\n",
" <td>155717</td>\n",
" <td>186</td>\n",
" <td>245 J, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2720</th>\n",
" <td>155717</td>\n",
" <td>2473</td>\n",
" <td>A056, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5504</th>\n",
" <td>155717</td>\n",
" <td>2559</td>\n",
" <td>A070, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5505</th>\n",
" <td>155717</td>\n",
" <td>961</td>\n",
" <td>A054, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5506</th>\n",
" <td>155717</td>\n",
" <td>962</td>\n",
" <td>A055, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5507</th>\n",
" <td>155717</td>\n",
" <td>2166</td>\n",
" <td>A061, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5508</th>\n",
" <td>155717</td>\n",
" <td>1793</td>\n",
" <td>A058, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5509</th>\n",
" <td>155717</td>\n",
" <td>1794</td>\n",
" <td>A059, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8294</th>\n",
" <td>155717</td>\n",
" <td>2165</td>\n",
" <td>A060, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>NaN</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" VorgangsID ObjektID HObjektText ObjektArtID \\\n",
"288 155717 187 246, Webmaschine Jacquard, 6 \n",
"2718 155717 1792 A057, Webmaschine Jacquard, 6 \n",
"2719 155717 186 245 J, Webmaschine Jacquard, 6 \n",
"2720 155717 2473 A056, Webmaschine Jacquard, 6 \n",
"5504 155717 2559 A070, Webmaschine Jacquard, 6 \n",
"5505 155717 961 A054, Webmaschine Jacquard, 6 \n",
"5506 155717 962 A055, Webmaschine Jacquard, 6 \n",
"5507 155717 2166 A061, Webmaschine Jacquard, 6 \n",
"5508 155717 1793 A058, Webmaschine Jacquard, 6 \n",
"5509 155717 1794 A059, Webmaschine Jacquard, 6 \n",
"8294 155717 2165 A060, Webmaschine Jacquard, 6 \n",
"\n",
" ObjektArtText VorgangsTypID VorgangsTypName VorgangsDatum \\\n",
"288 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"2718 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"2719 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"2720 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5504 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5505 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5506 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5507 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5508 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5509 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"8294 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"\n",
" VorgangsStatusId VorgangsPrioritaet \\\n",
"288 5 0 \n",
"2718 5 0 \n",
"2719 5 0 \n",
"2720 5 0 \n",
"5504 5 0 \n",
"5505 5 0 \n",
"5506 5 0 \n",
"5507 5 0 \n",
"5508 5 0 \n",
"5509 5 0 \n",
"8294 5 0 \n",
"\n",
" VorgangsBeschreibung VorgangsOrt \\\n",
"288 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"2718 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"2719 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"2720 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"5504 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"5505 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"5506 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"5507 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"5508 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"5509 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"8294 Tägliche Wartungstätigkeiten nach Vorgabe des ... NaN \n",
"\n",
" VorgangsArtText ErledigungsDatum \\\n",
"288 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"2718 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"2719 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"2720 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5504 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5505 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5506 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5507 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5508 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5509 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"8294 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"\n",
" ErledigungsArtText \\\n",
"288 Intern UTT - Sichtkontrolle \n",
"2718 Intern UTT - Sichtkontrolle \n",
"2719 Intern UTT - Sichtkontrolle \n",
"2720 Intern UTT - Sichtkontrolle \n",
"5504 Intern UTT - Sichtkontrolle \n",
"5505 Intern UTT - Sichtkontrolle \n",
"5506 Intern UTT - Sichtkontrolle \n",
"5507 Intern UTT - Sichtkontrolle \n",
"5508 Intern UTT - Sichtkontrolle \n",
"5509 Intern UTT - Sichtkontrolle \n",
"8294 Intern UTT - Sichtkontrolle \n",
"\n",
" ErledigungsBeschreibung MPMelderArbeitsplatz \\\n",
"288 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"2718 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"2719 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"2720 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"5504 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"5505 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"5506 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"5507 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"5508 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"5509 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"8294 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... NaN \n",
"\n",
" MPAbteilungBezeichnung Arbeitsbeginn ErstellungsDatum \n",
"288 NaN 2022-04-01 2022-02-17 \n",
"2718 NaN 2022-04-01 2022-02-17 \n",
"2719 NaN 2022-04-01 2022-02-17 \n",
"2720 NaN 2022-04-01 2022-02-17 \n",
"5504 NaN 2022-04-01 2022-02-17 \n",
"5505 NaN 2022-04-01 2022-02-17 \n",
"5506 NaN 2022-04-01 2022-02-17 \n",
"5507 NaN 2022-04-01 2022-02-17 \n",
"5508 NaN 2022-04-01 2022-02-17 \n",
"5509 NaN 2022-04-01 2022-02-17 \n",
"8294 NaN 2022-04-01 2022-02-17 "
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp_fil1"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of entries with the selected VorgangsID: 11\n",
"Number of unique ObjektIDs among them: 11\n"
]
}
],
"source": [
"# Replace missing values (NaN) with False\n",
"temp_fil2 = temp_fil1.fillna(value=False)\n",
"print(f'Number of entries with the selected VorgangsID: {len(temp_fil2)}')\n",
"uni_obj_id = temp_fil2['ObjektID'].nunique()\n",
"print(f'Number of unique ObjektIDs among them: {uni_obj_id}')"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 187, 1792, 186, 2473, 2559, 961, 962, 2166, 1793, 1794, 2165],\n",
" dtype=int64)"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp_fil2['ObjektID'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsID</th>\n",
" <th>ObjektID</th>\n",
" <th>HObjektText</th>\n",
" <th>ObjektArtID</th>\n",
" <th>ObjektArtText</th>\n",
" <th>VorgangsTypID</th>\n",
" <th>VorgangsTypName</th>\n",
" <th>VorgangsDatum</th>\n",
" <th>VorgangsStatusId</th>\n",
" <th>VorgangsPrioritaet</th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>VorgangsOrt</th>\n",
" <th>VorgangsArtText</th>\n",
" <th>ErledigungsDatum</th>\n",
" <th>ErledigungsArtText</th>\n",
" <th>ErledigungsBeschreibung</th>\n",
" <th>MPMelderArbeitsplatz</th>\n",
" <th>MPAbteilungBezeichnung</th>\n",
" <th>Arbeitsbeginn</th>\n",
" <th>ErstellungsDatum</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>288</th>\n",
" <td>155717</td>\n",
" <td>187</td>\n",
" <td>246, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>False</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2718</th>\n",
" <td>155717</td>\n",
" <td>1792</td>\n",
" <td>A057, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>False</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2719</th>\n",
" <td>155717</td>\n",
" <td>186</td>\n",
" <td>245 J, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>False</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2720</th>\n",
" <td>155717</td>\n",
" <td>2473</td>\n",
" <td>A056, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>False</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5504</th>\n",
" <td>155717</td>\n",
" <td>2559</td>\n",
" <td>A070, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>False</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5505</th>\n",
" <td>155717</td>\n",
" <td>961</td>\n",
" <td>A054, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>False</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5506</th>\n",
" <td>155717</td>\n",
" <td>962</td>\n",
" <td>A055, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>False</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5507</th>\n",
" <td>155717</td>\n",
" <td>2166</td>\n",
" <td>A061, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>False</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5508</th>\n",
" <td>155717</td>\n",
" <td>1793</td>\n",
" <td>A058, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>False</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5509</th>\n",
" <td>155717</td>\n",
" <td>1794</td>\n",
" <td>A059, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>False</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8294</th>\n",
" <td>155717</td>\n",
" <td>2165</td>\n",
" <td>A060, Webmaschine Jacquard,</td>\n",
" <td>6</td>\n",
" <td>Jacquard-Webmaschine</td>\n",
" <td>1</td>\n",
" <td>Wartung</td>\n",
" <td>2022-04-01</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>Tägliche Wartungstätigkeiten nach Vorgabe des ...</td>\n",
" <td>False</td>\n",
" <td>Tägliche Interne Wartungstätigkeiten Weberei</td>\n",
" <td>2022-04-01</td>\n",
" <td>Intern UTT - Sichtkontrolle</td>\n",
" <td>Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2022-04-01</td>\n",
" <td>2022-02-17</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" VorgangsID ObjektID HObjektText ObjektArtID \\\n",
"288 155717 187 246, Webmaschine Jacquard, 6 \n",
"2718 155717 1792 A057, Webmaschine Jacquard, 6 \n",
"2719 155717 186 245 J, Webmaschine Jacquard, 6 \n",
"2720 155717 2473 A056, Webmaschine Jacquard, 6 \n",
"5504 155717 2559 A070, Webmaschine Jacquard, 6 \n",
"5505 155717 961 A054, Webmaschine Jacquard, 6 \n",
"5506 155717 962 A055, Webmaschine Jacquard, 6 \n",
"5507 155717 2166 A061, Webmaschine Jacquard, 6 \n",
"5508 155717 1793 A058, Webmaschine Jacquard, 6 \n",
"5509 155717 1794 A059, Webmaschine Jacquard, 6 \n",
"8294 155717 2165 A060, Webmaschine Jacquard, 6 \n",
"\n",
" ObjektArtText VorgangsTypID VorgangsTypName VorgangsDatum \\\n",
"288 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"2718 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"2719 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"2720 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5504 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5505 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5506 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5507 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5508 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"5509 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"8294 Jacquard-Webmaschine 1 Wartung 2022-04-01 \n",
"\n",
" VorgangsStatusId VorgangsPrioritaet \\\n",
"288 5 0 \n",
"2718 5 0 \n",
"2719 5 0 \n",
"2720 5 0 \n",
"5504 5 0 \n",
"5505 5 0 \n",
"5506 5 0 \n",
"5507 5 0 \n",
"5508 5 0 \n",
"5509 5 0 \n",
"8294 5 0 \n",
"\n",
" VorgangsBeschreibung VorgangsOrt \\\n",
"288 Tägliche Wartungstätigkeiten nach Vorgabe des ... False \n",
"2718 Tägliche Wartungstätigkeiten nach Vorgabe des ... False \n",
"2719 Tägliche Wartungstätigkeiten nach Vorgabe des ... False \n",
"2720 Tägliche Wartungstätigkeiten nach Vorgabe des ... False \n",
"5504 Tägliche Wartungstätigkeiten nach Vorgabe des ... False \n",
"5505 Tägliche Wartungstätigkeiten nach Vorgabe des ... False \n",
"5506 Tägliche Wartungstätigkeiten nach Vorgabe des ... False \n",
"5507 Tägliche Wartungstätigkeiten nach Vorgabe des ... False \n",
"5508 Tägliche Wartungstätigkeiten nach Vorgabe des ... False \n",
"5509 Tägliche Wartungstätigkeiten nach Vorgabe des ... False \n",
"8294 Tägliche Wartungstätigkeiten nach Vorgabe des ... False \n",
"\n",
" VorgangsArtText ErledigungsDatum \\\n",
"288 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"2718 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"2719 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"2720 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5504 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5505 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5506 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5507 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5508 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"5509 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"8294 Tägliche Interne Wartungstätigkeiten Weberei 2022-04-01 \n",
"\n",
" ErledigungsArtText \\\n",
"288 Intern UTT - Sichtkontrolle \n",
"2718 Intern UTT - Sichtkontrolle \n",
"2719 Intern UTT - Sichtkontrolle \n",
"2720 Intern UTT - Sichtkontrolle \n",
"5504 Intern UTT - Sichtkontrolle \n",
"5505 Intern UTT - Sichtkontrolle \n",
"5506 Intern UTT - Sichtkontrolle \n",
"5507 Intern UTT - Sichtkontrolle \n",
"5508 Intern UTT - Sichtkontrolle \n",
"5509 Intern UTT - Sichtkontrolle \n",
"8294 Intern UTT - Sichtkontrolle \n",
"\n",
" ErledigungsBeschreibung MPMelderArbeitsplatz \\\n",
"288 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... False \n",
"2718 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... False \n",
"2719 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... False \n",
"2720 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... False \n",
"5504 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... False \n",
"5505 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... False \n",
"5506 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... False \n",
"5507 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... False \n",
"5508 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... False \n",
"5509 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... False \n",
"8294 Sichtkontrolle durchgeführt\\n\\nAuffälligkeiten... False \n",
"\n",
" MPAbteilungBezeichnung Arbeitsbeginn ErstellungsDatum \n",
"288 False 2022-04-01 2022-02-17 \n",
"2718 False 2022-04-01 2022-02-17 \n",
"2719 False 2022-04-01 2022-02-17 \n",
"2720 False 2022-04-01 2022-02-17 \n",
"5504 False 2022-04-01 2022-02-17 \n",
"5505 False 2022-04-01 2022-02-17 \n",
"5506 False 2022-04-01 2022-02-17 \n",
"5507 False 2022-04-01 2022-02-17 \n",
"5508 False 2022-04-01 2022-02-17 \n",
"5509 False 2022-04-01 2022-02-17 \n",
"8294 False 2022-04-01 2022-02-17 "
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp_fil2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Question: Can multiple ObjektIDs be assigned to one Vorgang (operation)? If so, why are there different completion dates (ErledigungsDatum)?*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Length of the descriptions**"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"# Convert the Series to a DataFrame and add the character count of each description\n",
"descriptions = descriptions.to_frame()\n",
"descriptions['length_description'] = descriptions['VorgangsBeschreibung'].str.len()\n",
"descriptions = descriptions.sort_values(by='length_description', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 124008.000000\n",
"mean 70.351751\n",
"std 53.080901\n",
"min 1.000000\n",
"25% 66.000000\n",
"50% 66.000000\n",
"75% 67.000000\n",
"max 3137.000000\n",
"Name: length_description, dtype: float64"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Summary statistics of the description lengths (in characters)\n",
"len_descr = descriptions['length_description']\n",
"len_descr.describe()"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>length_description</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>8704</th>\n",
" <td>Vorgaben aus Held Wartungsplan\\n\\nLC-X-Achse /...</td>\n",
" <td>3137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7826</th>\n",
" <td>Vorgaben aus Held Wartungsplan\\n\\nLC-X-Achse /...</td>\n",
" <td>3137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49779</th>\n",
" <td>Laut Wartungsvertrag (Hr.Radtke) Bestellnummer...</td>\n",
" <td>2311</td>\n",
" </tr>\n",
" <tr>\n",
" <th>124118</th>\n",
" <td>Laut Wartungsvertrag (Hr.Radtke) Bestellnummer...</td>\n",
" <td>2311</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14853</th>\n",
" <td>Laut Wartungsvertrag (Hr.Radtke) Bestellnummer...</td>\n",
" <td>2311</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" VorgangsBeschreibung length_description\n",
"8704 Vorgaben aus Held Wartungsplan\\n\\nLC-X-Achse /... 3137\n",
"7826 Vorgaben aus Held Wartungsplan\\n\\nLC-X-Achse /... 3137\n",
"49779 Laut Wartungsvertrag (Hr.Radtke) Bestellnummer... 2311\n",
"124118 Laut Wartungsvertrag (Hr.Radtke) Bestellnummer... 2311\n",
"14853 Laut Wartungsvertrag (Hr.Radtke) Bestellnummer... 2311"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"descriptions.head()"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VorgangsBeschreibung</th>\n",
" <th>length_description</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>8704</th>\n",
" <td>Vorgaben aus Held Wartungsplan\\n\\nLC-X-Achse /...</td>\n",
" <td>3137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7826</th>\n",
" <td>Vorgaben aus Held Wartungsplan\\n\\nLC-X-Achse /...</td>\n",
" <td>3137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49779</th>\n",
" <td>Laut Wartungsvertrag (Hr.Radtke) Bestellnummer...</td>\n",
" <td>2311</td>\n",
" </tr>\n",
" <tr>\n",
" <th>124118</th>\n",
" <td>Laut Wartungsvertrag (Hr.Radtke) Bestellnummer...</td>\n",
" <td>2311</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14853</th>\n",
" <td>Laut Wartungsvertrag (Hr.Radtke) Bestellnummer...</td>\n",
" <td>2311</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13450</th>\n",
" <td></td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13451</th>\n",
" <td></td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29979</th>\n",
" <td></td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13452</th>\n",
" <td></td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21214</th>\n",
" <td>\\n</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>124008 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" VorgangsBeschreibung length_description\n",
"8704 Vorgaben aus Held Wartungsplan\\n\\nLC-X-Achse /... 3137\n",
"7826 Vorgaben aus Held Wartungsplan\\n\\nLC-X-Achse /... 3137\n",
"49779 Laut Wartungsvertrag (Hr.Radtke) Bestellnummer... 2311\n",
"124118 Laut Wartungsvertrag (Hr.Radtke) Bestellnummer... 2311\n",
"14853 Laut Wartungsvertrag (Hr.Radtke) Bestellnummer... 2311\n",
"... ... ...\n",
"13450 1\n",
"13451 1\n",
"29979 1\n",
"13452 1\n",
"21214 \\n 1\n",
"\n",
"[124008 rows x 2 columns]"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"descriptions"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}