# **Analyse 2-2**

## Strategie & Fokus

- Versuche Clustering bzw. Zusammenfassung von Begriffen (z.B. Prüfung, Prüfen, Überprüfung)
- Orientierung an Häufigkeitsverteilung: häufigere Begriffe zuerst analysieren

---

# Merkmal 1: Clustering von Vorgangsbeschreibungen

## Recherche
[Textmining HS Hannover](https://textmining.wp.hs-hannover.de/Preprocessing.html)

### Allgemeine Zergliederung der Einzelbeschreibungen

- Text in Sätze
- Sätze in Wörter
- Wörter in Grundform:
    - Lemma: Die Form des Wortes, wie sie in einem Wörterbuch steht. Z.B.: Haus, laufen, begründen
    - Stamm: Das Wort ohne Flexionsendungen (Prefixe und Suffixe). Z.B.: Haus, lauf, begründ
    - Wurzel: Kern des Wortes, von dem das Wort ggf. durch Derivation abgeleitet wurde. Z.B.: Haus, lauf, Grund
- Wortartbestimmung
    - klassische Part-of-Speech-Erkennung (herkömmliche Wortart)
    - Named Entity Recognition (NER) (Eigennamen)
        - Bsp. spaCy: Person, Ort, Organisation, Verschiedenes

#### Semantik

- Wörter innerhalb eines Satzes größere Zusammenhänge als außerhalb

### Pakete

- Englisch: 
    - [NLTK](https://www.nltk.org/)
- Deutsch:
    - [HanTa - The Hanover Tagger](https://github.com/wartaal/HanTa/tree/master)
    - [TreeTagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
        - [Python Wrapper](https://treetaggerwrapper.readthedocs.io/en/latest/)
    - [spaCy](https://spacy.io/)
        - [Beispiel 1](https://www.trinnovative.de/blog/2020-09-08-natural-language-processing-mit-spacy.html)

21.02.:
- Überarbeitung RegEx-Filterung
- Verbesserung Duplikatefindung über Ähnlichkeit

## Analyse

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import spacy
from spacy.lang.de import German as GermanSpacyModel
import sentence_transformers
from sentence_transformers import SentenceTransformer
from collections import Counter
from itertools import combinations
from dateutil.parser import parse
import re


import logging
import sys
import pickle

from ihm_analyze.helpers import (
    save_pickle,
    load_pickle,
    build_embedding_map,
    build_cosSim_matrix,
    filt_thresh_cosSim_matrix,
    list_cosSim_dupl_candidates,
    choose_cosSim_dupl_candidates,
)

LOGGING_LEVEL = 'INFO'
logging.basicConfig(level=LOGGING_LEVEL, stream=sys.stdout)
logger = logging.getLogger('base')

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
LOAD_CALC_FILES = False

DESC_BLACKLIST = set(['-'])
"""
GENERAL_BLACKLIST = set([
    'herr', 'hr.', 'förster', 'graf', 'stöppel', 
    'stab', 'kw', 'h.', 'koch', 'heininger', '.',
    'schwab', 'm.', 'wenninger', '-', '--',
])
"""

GENERAL_BLACKLIST = set([
    'herr', 'hr.' 'kw', 'h.', '.',
    'm.', '-', '--', 'dr.', 'dr',
])

#GENERAL_BLACKLIST = set()
#POS_of_interest = set(['NOUN', 'PROPN', 'ADJ', 'VERB', 'AUX'])
#POS_of_interest = set(['NOUN', 'ADJ', 'VERB', 'AUX'])
#POS_of_interest = set(['NOUN', 'PROPN'])
POS_of_interest = set(['NOUN', 'PROPN', 'VERB', 'AUX'])
#TAG_of_interest = set(['ADJD'])
TAG_of_interest = set()

#POS_INDIRECT = set(['AUX', 'VERB'])
POS_INDIRECT = set(['AUX'])

In [4]:
# load language model
# transformer model without vector embeddings
# can not be used to calculate similarities
# using sentence transformers instead
nlp = spacy.load('de_dep_news_trf')
#nlp = spacy.load('de_core_news_lg')

In [5]:
model_stfr = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu


In [98]:
# load dataset
DATA_SET_ID = 'Export4'
FILE_PATH = f'01_2_Rohdaten_neu/{DATA_SET_ID}.csv'
date_cols = ['VorgangsDatum', 'ErledigungsDatum', 'Arbeitsbeginn', 'ErstellungsDatum']
raw = pd.read_csv(filepath_or_buffer=FILE_PATH, sep=';', encoding='cp1252', parse_dates=date_cols, dayfirst=True)
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129020 entries, 0 to 129019
Data columns (total 20 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   VorgangsID               129020 non-null  int64         
 1   ObjektID                 129020 non-null  int64         
 2   HObjektText              129003 non-null  object        
 3   ObjektArtID              129020 non-null  int64         
 4   ObjektArtText            128372 non-null  object        
 5   VorgangsTypID            129020 non-null  int64         
 6   VorgangsTypName          129020 non-null  object        
 7   VorgangsDatum            129020 non-null  datetime64[ns]
 8   VorgangsStatusId         129020 non-null  int64         
 9   VorgangsPrioritaet       129020 non-null  int64         
 10  VorgangsBeschreibung     124087 non-null  object        
 11  VorgangsOrt              507 non-null     object        
 12  VorgangsArtText 

In [99]:
raw.head()

Unnamed: 0,VorgangsID,ObjektID,HObjektText,ObjektArtID,ObjektArtText,VorgangsTypID,VorgangsTypName,VorgangsDatum,VorgangsStatusId,VorgangsPrioritaet,VorgangsBeschreibung,VorgangsOrt,VorgangsArtText,ErledigungsDatum,ErledigungsArtText,ErledigungsBeschreibung,MPMelderArbeitsplatz,MPAbteilungBezeichnung,Arbeitsbeginn,ErstellungsDatum
0,11,114,"427 C , Webmaschine, DL 280 EMS Breite 280",3,Luft-Webmaschine,3,Reparaturauftrag (Portal),2019-03-06,4,0,,,Kettbaum kaputt,2019-03-06,,,Weberei,Weberei,NaT,2019-03-06
1,17,124,"621 C , Webmaschine, DL 280 EMS Breite 280",3,Luft-Webmaschine,3,Reparaturauftrag (Portal),2019-03-11,5,0,,,asgasdg,2019-03-11,,,Elektrowerkstatt,Elektrowerkstatt,NaT,2019-03-11
2,53,244,"285 C, Webmaschine, SG 220 EMS",5,Greifer-Webmaschine,3,Reparaturauftrag (Portal),2019-03-19,5,0,Kupplung schleift,,Kupplung defekt,2019-03-20,Reparatur UTT,,Weberei,Weberei,NaT,2019-03-19
3,58,257,"107, Webmaschine, OM 220 EOS",3,Luft-Webmaschine,3,Reparaturauftrag (Portal),2019-03-21,5,0,Gegengewicht wieder anbringen,,Gegengewicht an der Webmaschine abgefallen,2019-03-21,Reparatur UTT,Schraube ausgebohrt\nGegengewicht wieder angeb...,Weberei,Weberei,2019-03-21,2019-03-21
4,81,138,"00138, Schärmaschine 9,",16,Schärmaschine,3,Reparaturauftrag (Portal),2019-03-25,5,0,da ist etwas gebrochen. (Herr Heininger),,zentrale Bremsenverstellung linke Gatterseite ...,2019-03-25,Reparatur UTT,Bolzen gebrochen. Bolzen neu angefertig und di...,Vorwerk,Vorwerk,2019-03-25,2019-03-25


In [100]:
print(f"Anzahl Features: {len(raw.columns)}")

Anzahl Features: 20


**Neue Features gegenüber letzter Analyse:**
- ``ObjektArtID``
- ``ObjektArtText``
- ``VorgangsTypName``

### Duplikate

In [101]:
duplicates_filt = raw.duplicated()

In [102]:
print(f"Anzahl Duplikate: {duplicates_filt.sum()}")

Anzahl Duplikate: 84


In [103]:
filt_data = raw[duplicates_filt]
uni_obj_id_dupl = filt_data['ObjektID'].unique()

In [104]:
print(f"Anzahl einzigartiger Objekt-IDs unter Duplikaten: {len(uni_obj_id_dupl)}")

Anzahl einzigartiger Objekt-IDs unter Duplikaten: 47


In [105]:
wo_duplicates = raw.drop_duplicates(ignore_index=True)
wo_duplicates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128936 entries, 0 to 128935
Data columns (total 20 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   VorgangsID               128936 non-null  int64         
 1   ObjektID                 128936 non-null  int64         
 2   HObjektText              128920 non-null  object        
 3   ObjektArtID              128936 non-null  int64         
 4   ObjektArtText            128289 non-null  object        
 5   VorgangsTypID            128936 non-null  int64         
 6   VorgangsTypName          128936 non-null  object        
 7   VorgangsDatum            128936 non-null  datetime64[ns]
 8   VorgangsStatusId         128936 non-null  int64         
 9   VorgangsPrioritaet       128936 non-null  int64         
 10  VorgangsBeschreibung     124008 non-null  object        
 11  VorgangsOrt              507 non-null     object        
 12  VorgangsArtText 

In [97]:
SAVE_PATH_DF_DUPL_OCCUR = f'./02_1_Preprocess1/{DATA_SET_ID}_00_DF_wo_dupl.parquet'
wo_duplicates.to_parquet(SAVE_PATH_DF_DUPL_OCCUR)

### ``VorgangsBeschreibung``

#### **NA vals und Duplikate**

String-Bereinigung

In [16]:
SPECIAL_CHARS = set(['&', '$', '%', '§', '/', '(', ')', '_', 
                     '+', '–', '--', '<', '>', '´',
])

In [17]:
def clean_string_slim(string: str) -> str:
    # remove special chars
    pattern = r'[\t\n\r\f\v]'
    string = re.sub(pattern, ' ', string)
    # remove whitespaces at the beginning and the end
    string = string.strip()
    
    return string

def clean_string(string: str) -> str:
    #num_reps = 5
    
    # remove special chars
    pattern = r'[\t\n\r\f\v]'
    string = re.sub(pattern, ' ', string)
    # remove dates
    pattern = r'[\d]{1,4}[.:][\d]{1,4}[.:][\d]{1,4}'
    string = re.sub(pattern, '', string)
    # remove times
    pattern = r'[\d]{1,2}[:][\d]{1,2}[:][\d]{0,2}'
    string = re.sub(pattern, '', string)
    # remove all chars despite punctuation and alphanumeric ones
    pattern = r'[^ \w.,;:\-äöüÄÖÜ]+'
    string = re.sub(pattern, '', string)
    # remove - where it is used as em dash
    pattern = r'[\W]+-[\W]+'
    string = re.sub(pattern, ' ', string)
    # remove whitespaces in front of punctuation
    pattern = r'[ ]+([;,.:])'
    string = re.sub(pattern, r'\1', string)
    # remove multiple whitespaces
    pattern = r'[ ]+'
    string = re.sub(pattern, ' ', string)
    # remove whitespaces at the beginning and the end
    string = string.strip()
    
    #while num_reps != 0:
        #string = string.replace('\n', ' ')
        #string = string.replace('\t', ' ')
        #string = string.replace('  ', ' ')
        #string = string.replace('   ', ' ')
    #string = string.replace(' - ', ' ')
    """
    for char in SPECIAL_CHARS:
        string = string.replace(char, '')
        
        #num_reps -= 1
    
    # remove spaces at the beginning and the end
    string = string.strip()
    """
    
    return string

In [18]:
base = wo_duplicates.copy()
base = base.dropna(axis=0, subset='VorgangsBeschreibung')
# preprocessing
#base['VorgangsBeschreibung'] = base['VorgangsBeschreibung'].map(clean_string)
base['VorgangsBeschreibung'] = base['VorgangsBeschreibung'].map(clean_string_slim)

In [19]:
base

Unnamed: 0,VorgangsID,ObjektID,HObjektText,ObjektArtID,ObjektArtText,VorgangsTypID,VorgangsTypName,VorgangsDatum,VorgangsStatusId,VorgangsPrioritaet,VorgangsBeschreibung,VorgangsOrt,VorgangsArtText,ErledigungsDatum,ErledigungsArtText,ErledigungsBeschreibung,MPMelderArbeitsplatz,MPAbteilungBezeichnung,Arbeitsbeginn,ErstellungsDatum
0,140837,728,"10107, Rechteckfilter H1,",9,Behälter,3,Reparaturauftrag (Portal),2022-03-30,2,0,Filter Links Klopfer Defekt,,Klopfer defekt,2022-03-30,Ausgetauscht,.,Produktion,Produktion,2022-03-30,2022-03-30
1,136284,1280,"03024, Flachform Hubtisch, H2E12",30,Hydraulik,2,Störungsmeldung,2022-03-25,2,0,Anfahrschutz für Hydraulikkupplung abgefahren.,,Defekt,2022-03-30,Repariert,Geschweißt,,,2022-03-30,2022-03-25
2,116920,1518,"00576, Leitstrahlmischer 1,",41,Mischer,1,Wartung,2022-04-14,2,0,.,,halbjährlich Wartung (W),2022-04-21,Planmäßige Wartung,.,,,2022-04-21,2021-11-22
3,21260,2097,"00827, Überladebrücke Rampe 1,",58,Verladung,1,Wartung,2022-05-06,2,0,Prüfung durch externen DL,,jährliche Prüfung externer Dienstleister (P),2022-04-25,Geprüft ohne Mängel,Geprüft ohne Mängel.,,,2022-04-04,2021-04-14
4,116374,1703,"00715, Vogelsang, 2",3,Pumpen,1,Wartung,2022-04-14,2,0,Wartung nach Arbeitsplan,,halbjährlich Wartung (W),2022-05-12,Planmäßige Wartung,Wartung wie geplant durchgeführt,,,2022-04-20,2021-11-17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14774,165211,723,"10102, Nasswäscher AGT 2,",9,Behälter,1,Wartung,2023-05-01,2,0,Manuelle Dosierung des Biozids,,Biozid Dosierung Montag (W),2023-05-03,Planmäßige Wartung,.,,,2023-05-03,2022-10-10
14775,54805,2365,"03544, Dampfkessel BHKW 1,",11,Dampferzeuger,1,Wartung,2023-05-03,2,0,,,dreitägige Überprüfung Mittwoch (W),2023-05-03,Planmäßige Wartung,Nach Vorgabe,,,2023-05-03,2021-06-04
14776,166438,3214,"03760, Seepexpumpe , BN 5-12L",3,Pumpen,1,Wartung,2023-04-24,2,0,Wartung nach Arbeitsplan,,halbjährlich Wartung (W),2023-05-03,Planmäßige Wartung,Wartung wie geplant durchgeführt,,,2023-05-02,2022-10-24
14777,166443,1277,"00593, Hydraulik für Deckelhubeinrichtung,",30,Hydraulik,1,Wartung,2023-04-24,12,0,Wartung nach Arbeitsplan,,halbjährlich Wartung (W),2023-05-03,Planmäßige Wartung,Wartung wie geplant durchgeführt,,,2023-05-03,2022-10-24


In [20]:
descriptions = base['VorgangsBeschreibung']
print(f"Einträge: {len(descriptions)}")

Einträge: 14481


In [21]:
num_dupl_descr = descriptions.duplicated().sum()
uni_descr = descriptions.unique()
num_uni_descr = len(uni_descr)

print(f"Anzahl Duplikate Vorgangsbeschreibungen: {num_dupl_descr}")
print(f"Anzahl einzigartiger Vorgangsbeschreibungen: {num_uni_descr}")
print(f"Anteil einzigartiger Vorgangsbeschreibungen: {num_uni_descr / len(descriptions) * 100:.2f} %")

Anzahl Duplikate Vorgangsbeschreibungen: 12297
Anzahl einzigartiger Vorgangsbeschreibungen: 2184
Anteil einzigartiger Vorgangsbeschreibungen: 15.08 %


In [22]:
SAVE_PATH_DF_DUPL_OCCUR = f'./02_1_Preprocess1/{DATA_SET_ID}_01_DF_num_occur_temp1.parquet'

if not LOAD_CALC_FILES:
    cols = ['descr', 'len', 'num_occur', 'assoc_obj_ids', 'num_assoc_obj_ids']
    descr_df = pd.DataFrame(columns=cols)
    max_val = 0
    text = None
    index = 0


    for idx, description in enumerate(uni_descr):
        len_descr = len(description)
        filt = base['VorgangsBeschreibung'] == description
        temp = base[filt]
        assoc_obj_ids = temp['ObjektID'].unique()
        assoc_obj_ids = np.sort(assoc_obj_ids, kind='stable')
        num_assoc_obj_ids = len(assoc_obj_ids)
        num_dupl = filt.sum()
        
        conc_df = pd.DataFrame(data=[[
                                description,
                                len_descr,
                                num_dupl,
                                assoc_obj_ids,
                                num_assoc_obj_ids
                            ]], columns=cols)
        
        descr_df = pd.concat([descr_df, conc_df], ignore_index=True)
        
        if num_dupl > max_val:
            max_val = num_dupl
            index = idx
            text = description
            
    temp1 = descr_df.sort_values(by='num_occur', ascending=False)
    
    # saving
    temp1.to_parquet(SAVE_PATH_DF_DUPL_OCCUR)
else:
    # loading
    temp1 = pd.read_parquet(SAVE_PATH_DF_DUPL_OCCUR)

In [23]:
temp1

Unnamed: 0,descr,len,num_occur,assoc_obj_ids,num_assoc_obj_ids
14,Bestimmen des Prüftermins für elektrische Arbe...,527,2809,"[404, 405, 406, 407, 408, 409, 410, 411, 412, ...",1724
16,VDE Prüfung,11,2034,"[404, 407, 408, 409, 410, 411, 412, 413, 414, ...",1187
4,Wartung nach Arbeitsplan,24,1062,"[726, 798, 800, 801, 802, 921, 922, 923, 924, ...",218
7,Manuelle Dosierung des Biozids,30,526,"[0, 722, 723, 724, 726]",5
12,Mikrobiologie(Abklatsch-Test),29,511,"[722, 723, 724, 725, 726]",5
...,...,...,...,...,...
844,Filterabreinigung AGT 1 : das erste Ventil von...,68,1,[0],1
843,Abnahmeprüfung durch Sachkundigen,33,1,[1245],1
842,Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung,47,1,[1326],1
841,Ausgeführt,10,1,[2365],1


In [29]:
len(temp1)

2184

In [24]:
temp1.iloc[0,0]

'Bestimmen des Prüftermins für elektrische Arbeitsmittel(Teil der Gefährdungsbeurteilung gemäß Betribssicherheitsverordnung §3)  Ist immer ein Jahr gültig!  Erklärung:  -Warum stehen vor jeder Auswahl die Zahlen 1-7?  Antwort: Es gibt die Gefahren Klasse 1-7 daher wurde auch bei jeder Auswahlmöglichkeit die Gefahrenklasse mit integriert.  Gefährdungsklasse 1 2 3 4 5 6 7 Zustand Spitzenniv. sehr gut gut normal beeinträchtigt schlecht sehr schlecht Einwirkung/Gefährdung keine sehr niedrig niedrig normal erhöht hoch sehr hoch'

In [25]:
temp1.iloc[1,0]

'VDE Prüfung'

In [26]:
# saving
SAVE_PATH_DF_DUPL_OCCUR = f'./02_1_Preprocess1/{DATA_SET_ID}_01_DF_num_occur_temp1.parquet'
#temp1.to_parquet(SAVE_PATH_DF_DUPL_OCCUR)

**Cosine Similarity**

In [34]:
# eliminate descriptions with less than 6 symbols
subset_data = temp1.loc[temp1['len'] > 5, 'descr'].copy()
#subset_data = subset_data.iloc[0:100]

In [35]:
len(subset_data)

2171

In [36]:
# saving
SAVE_PATH_SUBSET_DATA = f'./02_1_Preprocess1/{DATA_SET_ID}_02_1_subset_data.pkl'
if not LOAD_CALC_FILES:
    subset_data.to_pickle(SAVE_PATH_SUBSET_DATA)
else:
    subset_data = pd.read_pickle(SAVE_PATH_SUBSET_DATA)

- Wie geht man mit unbekannten Wörtern um?

# build mapping of embeddings for given model
def build_embedding_map(
    data: Series,
    model: GermanSpacyModel | SentenceTransformer,
) -> dict[int, tuple['Embedding',str]]:
    # dictionary with embeddings
    embeddings: dict[int, tuple['Embedding',str]] = dict()
    is_spacy = False
    is_STRF = False
    
    if isinstance(model, spacy.lang.de.German):
        is_spacy = True
    elif isinstance(model, SentenceTransformer):
        is_STRF = True
        
    if not any((is_spacy, is_STRF)):
        raise NotImplementedError("Model type unknown")
        
    for (idx, text) in subset_data.items():
        
        if is_spacy:
            embd = model(text)
            embeddings[idx] = (embd, text)
            # check for empty vectors
            if not doc.vector_norm:
                print('--- Unknown Words ---')
                print(f'{embd.text=} has no vector')
        elif is_STRF:
            embd = model.encode(text, show_progress_bar=False, normalize_embeddings=False)
            embeddings[idx] = (embd, text)
    
    return embeddings, (is_spacy, is_STRF)

# build similarity matrix out of embeddings
def build_cosSim_matrix(
    data: Series,
    model: GermanSpacyModel | SentenceTransformer,
) -> DataFrame:
    # build empty matrix
    df_index = data.index
    cosineSim_idx_matrix = pd.DataFrame(data=0., columns=df_index, 
                                    index=df_index, dtype=np.float32)
    
    # obtain embeddings based on used model
    embds, (is_spacy, is_STRF) = build_embedding_map(
        data=data,
        model=model
    )
    
    # apply index based mapping for efficient handling of large texts
    combs = combinations(df_index, 2)
    
    for (idx1, idx2) in combs:
        #print(f"{idx1=}, {idx2=}")
        embd1 = embds[idx1][0]
        embd2 = embds[idx2][0]
        
        # calculate similarity based on model type
        if is_spacy:
            cosSim = embd1.similarity(embd2)
        elif is_STRF:
            cosSim = sentence_transformers.util.cos_sim(embd1, embd2)
            cosSim = cosSim.item()
        
        cosineSim_idx_matrix.at[idx1, idx2] = cosSim
        
    return cosineSim_idx_matrix, embds

In [37]:
SKIP = False
SAVE_PATH_COSSIM_MATRIX_WHOLE = f'./02_1_Preprocess1/{DATA_SET_ID}_02_2_cosineSim_idx_matrix_whole_textbased.parquet'
SAVE_PATH_COSSIM_EMBDS_WHOLE = f'./02_1_Preprocess1/{DATA_SET_ID}_02_2_cosineSim_idx_embds_whole_textbased.pkl'

if not SKIP:
    cosineSim_idx_matrix, embds = build_cosSim_matrix(
        data=subset_data,
        model=model_stfr,
    )
    # saving
    cosineSim_idx_matrix.to_parquet(SAVE_PATH_COSSIM_MATRIX_WHOLE)
    save_pickle(obj=embds, path=SAVE_PATH_COSSIM_EMBDS_WHOLE)
else:
    cosineSim_idx_matrix = pd.read_parquet(SAVE_PATH_COSSIM_MATRIX_WHOLE)
    embds = load_pickle(SAVE_PATH_COSSIM_EMBDS_WHOLE)

In [38]:
cosineSim_idx_matrix.to_numpy().shape

(2171, 2171)

# obtain index pairs with cosine similarity 
# greater than or equal to given threshold value

def filt_thresh_cosSim_matrix(
    threshold: float,
    cosineSim_idx_matrix: DataFrame,
):
    cosineSim_filt = cosineSim_idx_matrix.where(cosineSim_idx_matrix >= threshold).stack()
    
    return cosineSim_filt

def list_cosSim_dupl_candidates(
    cosineSim_filt: Series,
    embeddings: dict[int, tuple['Embedding',str]],
):
    # compare found duplicates
    columns = ['idx1', 'text1', 'idx2', 'text2', 'score']
    df_candidates = pd.DataFrame(columns=columns)
    
    index_pairs = list()

    for ((idx1, idx2), score) in cosineSim_filt.items():
        # get text content from embedding as second tuple entry
        content = [[
            idx1,
            embeddings[idx1][1],
            idx2,
            embeddings[idx2][1],
            score,
        ]]
        df_conc = pd.DataFrame(columns=columns, data=content)
        
        df_candidates = pd.concat([df_candidates, df_conc])
        index_pairs.append((idx1, idx2))
    
    return df_candidates, index_pairs

def choose_cosSim_dupl_candidates(
    cosineSim_filt: Series,
    embeddings: dict[int, tuple['Embedding',str]],
) -> tuple[DataFrame, list[tuple['Index', 'Index']]]:
    # compare found duplicates
    columns = ['idx1', 'text1', 'idx2', 'text2', 'score']
    df_candidates = pd.DataFrame(columns=columns)
    
    index_pairs = list()

    for ((idx1, idx2), score) in cosineSim_filt.items():
        # get texts for comparison
        text1 = embeddings[idx1][1]
        text2 = embeddings[idx2][1]
        # get decision
        print('---------- New Decision ----------')
        print('text1:\n', text1, '\n', flush=True)
        print('text2:\n', text2, '\n', flush=True)
        decision = input('Please enter >>y<< if this is a duplicate, else hit enter:')
        
        if not decision == 'y':
            continue
        
        # get text content from embedding as second tuple entry
        content = [[
            idx1,
            text1,
            idx2,
            text2,
            score,
        ]]
        df_conc = pd.DataFrame(columns=columns, data=content)
        
        df_candidates = pd.concat([df_candidates, df_conc])
        index_pairs.append((idx1, idx2))
    
    return df_candidates, index_pairs

In [39]:
SIMILARITY_THRESHOLD = 0.8
SAVE_PATH_COSSIM_CANDFILT_WHOLE = f'./02_1_Preprocess1/{DATA_SET_ID}_02_3_cosineSim_idx_cand_filter_textbased.pkl'

SKIP = False
if not SKIP:
    cosineSim_filt = filt_thresh_cosSim_matrix(
        threshold=SIMILARITY_THRESHOLD,
        cosineSim_idx_matrix=cosineSim_idx_matrix,
    )
    # saving
    cosineSim_filt.to_pickle(SAVE_PATH_COSSIM_CANDFILT_WHOLE)
else:
    cosineSim_filt = pd.read_pickle(SAVE_PATH_COSSIM_CANDFILT_WHOLE)
cosineSim_filt

14   18      0.851394
16   181     0.818661
     195     0.840125
     87      0.812861
     1306    0.818661
               ...   
876  911     0.812442
929  910     0.847216
     870     0.964813
910  870     0.830993
837  868     0.951816
Length: 1445, dtype: float32

In [40]:
SKIP = False
SAVE_PATH_DUPL_CANDIDATES = (f'./02_1_Preprocess1/{DATA_SET_ID}_02_4_dupl_candidates_'
                                f'cosSim_thresh_{SIMILARITY_THRESHOLD}.xlsx')
SAVE_PATH_IDX_CAND_PAIRS = f'./02_1_Preprocess1/{DATA_SET_ID}_02_4_dupl_idx_pairs_whole_Exp4.pkl'

if not SKIP:
    cosSim_dupl_candidates, dupl_idx_pairs = list_cosSim_dupl_candidates(
        cosineSim_filt=cosineSim_filt,
        embeddings=embds,
    )
    # save results
    cosSim_dupl_candidates.to_excel(SAVE_PATH_DUPL_CANDIDATES)
    save_pickle(obj=dupl_idx_pairs, path=SAVE_PATH_IDX_CAND_PAIRS)
    #cosSim_dupl_candidates
else:
    cosSim_dupl_candidates = pd.read_excel(SAVE_PATH_DUPL_CANDIDATES, index_col=0)
    dupl_idx_pairs = load_pickle(SAVE_PATH_IDX_CAND_PAIRS)

  df_candidates = pd.concat([df_candidates, df_conc])


In [41]:
cosSim_dupl_candidates

Unnamed: 0,idx1,text1,idx2,text2,score
0,14,Bestimmen des Prüftermins für elektrische Arbe...,18,Bestimmen des Prüftermins für elektrische Arbe...,0.851394
0,16,VDE Prüfung,181,· VDE Prüfung,0.818661
0,16,VDE Prüfung,195,VDE Prüfung nach VDE 0701/0702,0.840125
0,16,VDE Prüfung,87,Prüfung nach VDE 701/702,0.812861
0,16,VDE Prüfung,1306,·VDE Prüfung,0.818661
...,...,...,...,...,...
0,876,defekte Filter-Stützkörpe von AGT2 und AGT 3,911,AGT1 Filter Trichter 2 Klopfer defekt,0.812442
0,929,"Unter ""Sonstiges"" können Sie alle anderen Mäng...",910,"Unter ""Sonstiges"" können Sie alle anderen Mäng...",0.847216
0,929,"Unter ""Sonstiges"" können Sie alle anderen Mäng...",870,"Unter ""Sonstiges"" können Sie alle anderen Mäng...",0.964813
0,910,"Unter ""Sonstiges"" können Sie alle anderen Mäng...",870,"Unter ""Sonstiges"" können Sie alle anderen Mäng...",0.830993


**Nächste Schritte:**
- Grenz-Threshold finden, bei dem Duplikate gerade noch richtig erkannt werden

In [42]:
if False:
    thresholds = (0.75, 0.8, 0.85, 0.9, 0.93, 0.95, 0.96, 0.97, 0.98)

    for thresh in thresholds:
        
        cosineSim_filt = filt_thresh_cosSim_matrix(
            threshold=thresh,
            cosineSim_idx_matrix=cosineSim_idx_matrix.copy(),
        )
        
        cosSim_dupl_candidates = list_cosSim_dupl_candidates(
            cosineSim_filt=cosineSim_filt,
            embeddings=embds,
        )
        
        # saving path
        saving_path = (f'./Filterung_Duplikate/dupl_candidates_'
                    f'cosSim_thresh_{thresh}_STFR.xlsx')
        
        cosSim_dupl_candidates.to_excel(saving_path)

**Ergebnisse:**
- kein allgemeiner Threshold ableitbar, nur grober Richtwert
- Paare mit geringerem Score stellenweise ähnlicher als die mit höherem Score
- finale Entscheidung für Duplikat händisch, da Kontextwissen trotzdem notwendig
- Arbeit mit ``temp1`` und merging von Einträgen

- für gesamten Datensatz händisch nicht zielführend (über 9300 Einträge, die verglichen werden müssten)
- für ersten Wurf: Merging basierend auf Threshold von ``0.8``

---

*Manual Decision*

In [53]:
# manually decide if candidates are indeed duplicates

SKIP = True
if not SKIP:
    cosSim_dupl_candidates, dupl_idx_pairs = choose_cosSim_dupl_candidates(
        cosineSim_filt=cosineSim_filt,
        embeddings=embds,
    )

In [54]:
#save_pickle(obj=dupl_idx_pairs, path='./Filterung_Duplikate/dupl_idx_pairs_Exp4.pkl')

In [72]:
#dupl_idx_pairs = load_pickle(path='./Filterung_Duplikate/dupl_idx_pairs_Exp4.pkl')
#dupl_idx_pairs = load_pickle(path='./02_1_Preprocess1/dupl_idx_pairs_whole_Exp4.pkl')

#dupl_idx_pairs

---

*Eliminate Candidates*

In [43]:
temp2 = temp1.copy()
dupl_idx_pairs = load_pickle(path=SAVE_PATH_IDX_CAND_PAIRS)

In [44]:
len(dupl_idx_pairs)

1445

In [45]:
# merge duplicates

# to-do:
# merge: 'num_occur', 'assoc_obj_ids', 
# recalc: 'num_assoc_obj_ids'

for (i1, i2) in dupl_idx_pairs:
    
    # if an entry does not exist anymore, skip this pair
    if i1 not in temp2.index or i2 not in temp2.index:
        continue
    
    # merge num occur
    num_occur1 = temp2.at[i1, 'num_occur']
    num_occur2 = temp2.at[i2, 'num_occur']
    new_num_occur = num_occur1 + num_occur2

    # merge assoc obj ids
    assoc_ids1 = temp2.at[i1, 'assoc_obj_ids']
    assoc_ids2 = temp2.at[i2, 'assoc_obj_ids']
    new_assoc_ids = np.append(assoc_ids1, assoc_ids2)
    new_assoc_ids = np.unique(new_assoc_ids.flatten())

    # recalc num assoc obj ids
    new_num_assoc_obj_ids = len(new_assoc_ids)

    # write porperties to first entry
    temp2.at[i1, 'num_occur'] = new_num_occur
    temp2.at[i1, 'assoc_obj_ids'] = new_assoc_ids
    temp2.at[i1, 'num_assoc_obj_ids'] = new_num_assoc_obj_ids
    
    # drop second entry
    temp2 = temp2.drop(index=i2)

In [46]:
temp1.head()

Unnamed: 0,descr,len,num_occur,assoc_obj_ids,num_assoc_obj_ids
14,Bestimmen des Prüftermins für elektrische Arbe...,527,2809,"[404, 405, 406, 407, 408, 409, 410, 411, 412, ...",1724
16,VDE Prüfung,11,2034,"[404, 407, 408, 409, 410, 411, 412, 413, 414, ...",1187
4,Wartung nach Arbeitsplan,24,1062,"[726, 798, 800, 801, 802, 921, 922, 923, 924, ...",218
7,Manuelle Dosierung des Biozids,30,526,"[0, 722, 723, 724, 726]",5
12,Mikrobiologie(Abklatsch-Test),29,511,"[722, 723, 724, 725, 726]",5


In [47]:
temp2.head()

Unnamed: 0,descr,len,num_occur,assoc_obj_ids,num_assoc_obj_ids
14,Bestimmen des Prüftermins für elektrische Arbe...,527,3081,"[404, 405, 406, 407, 408, 409, 410, 411, 412, ...",1724
16,VDE Prüfung,11,2201,"[404, 407, 408, 409, 410, 411, 412, 413, 414, ...",1203
4,Wartung nach Arbeitsplan,24,1091,"[726, 798, 800, 801, 802, 921, 922, 923, 924, ...",219
7,Manuelle Dosierung des Biozids,30,526,"[0, 722, 723, 724, 726]",5
12,Mikrobiologie(Abklatsch-Test),29,511,"[722, 723, 724, 725, 726]",5


In [59]:
temp1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2184 entries, 14 to 2183
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   descr              2184 non-null   object
 1   len                2184 non-null   object
 2   num_occur          2184 non-null   object
 3   assoc_obj_ids      2184 non-null   object
 4   num_assoc_obj_ids  2184 non-null   object
dtypes: object(5)
memory usage: 166.9+ KB


In [60]:
temp2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1735 entries, 14 to 2183
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   descr              1735 non-null   object
 1   len                1735 non-null   object
 2   num_occur          1735 non-null   object
 3   assoc_obj_ids      1735 non-null   object
 4   num_assoc_obj_ids  1735 non-null   object
dtypes: object(5)
memory usage: 81.3+ KB


In [48]:
# transform assoc_obj_ids to list to be able to save DF
temp2['assoc_obj_ids'] = temp2['assoc_obj_ids'].map(lambda x: x.tolist())

In [49]:
temp2

Unnamed: 0,descr,len,num_occur,assoc_obj_ids,num_assoc_obj_ids
14,Bestimmen des Prüftermins für elektrische Arbe...,527,3081,"[404, 405, 406, 407, 408, 409, 410, 411, 412, ...",1724
16,VDE Prüfung,11,2201,"[404, 407, 408, 409, 410, 411, 412, 413, 414, ...",1203
4,Wartung nach Arbeitsplan,24,1091,"[726, 798, 800, 801, 802, 921, 922, 923, 924, ...",219
7,Manuelle Dosierung des Biozids,30,526,"[0, 722, 723, 724, 726]",5
12,Mikrobiologie(Abklatsch-Test),29,511,"[722, 723, 724, 725, 726]",5
...,...,...,...,...,...
844,Filterabreinigung AGT 1 : das erste Ventil von...,68,1,[0],1
843,Abnahmeprüfung durch Sachkundigen,33,1,[1245],1
842,Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung,47,1,[1326],1
841,Ausgeführt,10,1,[2365],1


In [50]:
SAVE_PATH_REMOVED_DUPL = f'./02_1_Preprocess1/{DATA_SET_ID}_03_dataset_remov_dupl_similar_whole.pkl'
temp2.to_pickle(SAVE_PATH_REMOVED_DUPL)

- Handling von Rechtschreibfehlern (Hunspell über PyEnchant)
- Handling von Vector-Embeddings über Transformer-Modelle:
    - höhere Fehlertoleranz (Rechtschreibung, redundante oder unbedeutende Worte)
    - nicht angewiesen, dass jedes Wort im Vocabulary vorkommt (vgl. spaCy-Modell)
    - bei ersten Versuchen höhere Genauigkeit bei der Erkennung tatsächlicher Duplikate
- Nutzung Vector-Embeddings für Duplikatfindung

#### ---> Model Training: Data Set

In [44]:
# data for model training
data = temp1.iloc[50:300,0].to_list()
data = [e for e in data if e != '']

with open('spacy_train/training_data_2.txt','w', encoding='utf-8') as f:
    f.writelines("\n".join(data))

#### spaCy

In [245]:
string = temp1.iloc[-2,0]
#string = temp1.iloc[0,0]
string

'Durchführung: Sollwert: 20 0,1g'

In [246]:
string = 'Ich spiele jeden Tag mit den Kindern im Garten. Das ist schön.'
string = 'Die Maschine XYZ ist aufgrund einer Störung im Druckluftsystem defekt.'
#string = 'The machine XYZ is broken because of a failure in the air pressure system.'
#string = 'Wir benötigen das Werkzeug von Herr Stöppel, um das derzeit abzuarbeiten.Dies wird durch Herrn Strebe getan.'

In [247]:
doc = nlp(string)

In [131]:
# simulate occurence counter
OCC_COUNTER = 10

In [51]:
SPELL_CHECK_NON_CHARS = set([' ', '.', ',', ';', ':', '-'])
CLEANING = True
#CLEANING = False

def pre_clean_word(string: str) -> str:
    
    pattern = r'[^A-Za-zäöüÄÖÜ]+'
    string = re.sub(pattern, '', string)
    """
    for char in SPELL_CHECK_NON_CHARS:
        string = string.replace(char, '')
    """
    
    return string

# https://stackoverflow.com/questions/25341945/check-if-string-has-date-any-format 
def is_str_date(string, fuzzy=False):
    
    try:
        parse(string, fuzzy=fuzzy)
        return True
    except ValueError:
        return False


def obtain_sub_tree(token):
    # check if token is a POS of interest
    descendants = list(token.subtree)
    descendants.remove(token)
    logger.debug(f'Token >>{token}<< has subtree >>{descendants}<<')
    return descendants


def add_children_descendants(
    parent,
    weight,
    connections,
    unique_tokens,
    children_sents,
    map_2_word: dict[str, str] | None = None,
):
    # add child as key
    if CLEANING:
        parent_lemma = pre_clean_word(string=parent.lemma_)
        
        # map words
        if map_2_word is not None:
            if parent_lemma.lower() in map_2_word:
                parent_lemma = map_2_word[parent_lemma.lower()]
                #logger.info(f"[SUCCESS] Mapped PARENT to {parent_lemma}")
        
        if parent_lemma != '':
            if (parent_lemma, parent.pos_) in connections:
                connections[(parent_lemma, parent.pos_)].append(children_sents)
                connections[(parent_lemma, parent.pos_)].append(children_sents)
                #connections[parent.lemma_].append([descendant.lemma_, descendant])
            else:
                # do not add auxiliary words
                if parent.pos_ != 'AUX':
                    unique_tokens.add(parent_lemma)
                connections[(parent_lemma, parent.pos_)] = list()
                connections[(parent_lemma, parent.pos_)].append(children_sents)
                #connections[parent.lemma_].append([descendant.lemma_, descendant])
    else:
        if (parent.lemma_, parent.pos_) in connections:
            connections[(parent.lemma_, parent.pos_)].append(children_sents)
            connections[(parent.lemma_, parent.pos_)].append(children_sents)
            #connections[parent.lemma_].append([descendant.lemma_, descendant])
        else:
            # do not add auxiliary words
            if parent.pos_ != 'AUX':
                unique_tokens.add(parent.lemma_)
            connections[(parent.lemma_, parent.pos_)] = list()
            connections[(parent.lemma_, parent.pos_)].append(children_sents)
            #connections[parent.lemma_].append([descendant.lemma_, descendant])


def obtain_descendant_info(
    doc,
    weight,
    POS_of_interest,
    TAG_of_interest,
    connections,
    unique_tokens,
    map_2_word: dict[str, str] | None = None,
):
    
    # iterate over sentences
    for sent in doc.sents:
        
        # iterate over tokens in one sentence
        for token in sent:
            
            if not (token.pos_ in POS_of_interest or token.tag_ in TAG_of_interest):
                continue
            elif token.lemma_.lower() in GENERAL_BLACKLIST:
                logger.debug(f'Eliminated parent >>{token}<< because of blacklist')
                continue
            
            descendants = obtain_sub_tree(token=token)
            
            # iterate over all children if there are any
            if descendants is not None:
                # list with all children in the current sentence
                children_sents = list()
                
                for child in descendants:
                    logger.debug(f'Token is >>{token}<< with child >>{child}<< and POS {child.pos_}')
                    
                    # elimnate cases of cross-references with verbs
                    if ((token.pos_ == 'AUX' or token.pos_ == 'VERB') and
                        (child.pos_ == 'AUX' or child.pos_ == 'VERB')):
                        continue
                    elif not (child.pos_ in POS_of_interest or child.tag_ in TAG_of_interest):
                        continue
                    elif child.lemma_.lower() in GENERAL_BLACKLIST:
                        logger.debug(f'Eliminated child >>{child}<< because of blacklist')
                        continue
                    
                    
                    if CLEANING:
                        child = pre_clean_word(string=child.lemma_)
                        if child == '':
                            continue
                        #child = pre_clean_word(string=child)
                        
                        if (child not in DESC_BLACKLIST and
                            not is_str_date(string=child)):
                            #not is_str_date(string=child.text)):
                            #children_sents.append((child.lemma_, weight))
                            
                            # map words
                            if map_2_word is not None:
                                if child.lower() in map_2_word:
                                    child = map_2_word[child.lower()]
                                    #logger.info(f"[SUCCESS] Mapped CHILD to {child}")
                            
                            children_sents.append((child, weight))
                        
                        #if child.lemma_ not in unique_tokens:
                        if (child not in unique_tokens and
                            not is_str_date(string=child)):
                            #unique_tokens.add(child.lemma_)
                            unique_tokens.add(child)
                            
                    else:
                        if (child.lemma_ not in DESC_BLACKLIST and
                            not is_str_date(string=child.text)):
                            children_sents.append((child.lemma_, weight))
                        
                        if child.lemma_ not in unique_tokens:
                            unique_tokens.add(child.lemma_)
                
                # add list of children for current parent if not empty
                if children_sents:
                    
                    add_children_descendants(
                        parent=token,
                        weight=weight,
                        connections=connections,
                        unique_tokens=unique_tokens,
                        children_sents=children_sents,
                        map_2_word=map_2_word,
                    )

In [52]:
def obtain_adj_matrix(unique_tokens, connections):

    adj_mat = pd.DataFrame(
        data=0, 
        columns=list(unique_tokens), 
        index=list(unique_tokens),
        dtype=np.uint32,
    )
    
    for (pred, POS), descendants_list in connections.items():
        #print(f'{pred=}, {descendants=}')
        
        for descendants in descendants_list:
            #print(f'{descendants}')
            
            if POS not in POS_INDIRECT:
                for (desc, weight) in descendants:
                    adj_mat.at[pred, desc] += weight
            
            else:
                if len(descendants) > 1:
                    # if auxiliary word, make connection between all associated words
                    combs = combinations(descendants, r=2)
                    
                    for comb in combs:
                        # comb is tuple ((word_1, weight), (word_2, weight))
                        weight = comb[0][1]
                        word_1 = comb[0][0]
                        word_2 = comb[1][0]
                        
                        """
                        if ((word_1 == 'Eigenverantwortlichkeit' or word_1 == 'neu') and
                            (word_2 == 'Eigenverantwortlichkeit' or word_2 == 'neu')):
                            print(f'Hello from {pred=} with {descendants=}')
                        """
                        
                        adj_mat.at[word_1, word_2] += weight
    
    return adj_mat


def make_undir_adj_matrix(adj_mat):
    
    adj_mat_undir = adj_mat.copy()
    arr = adj_mat_undir.to_numpy()
    arr_upper = np.triu(arr)
    arr_lower = np.tril(arr)
    arr_lower = np.rot90(np.fliplr(arr_lower))
    arr_new = arr_lower + arr_upper
    
    adj_mat_undir.loc[:] = arr_new
    
    return adj_mat_undir

#### Gesamter Datensatz

In [61]:
SKIP = False

SAVE_PATH_REMOVED_DUPL = f'./02_1_Preprocess1/{DATA_SET_ID}_03_dataset_remov_dupl_similar_whole.pkl'
if not SKIP:
    temp2 = pd.read_pickle(SAVE_PATH_REMOVED_DUPL)

In [62]:
temp2.head()

Unnamed: 0,descr,len,num_occur,assoc_obj_ids,num_assoc_obj_ids
14,Bestimmen des Prüftermins für elektrische Arbe...,527,3081,"[404, 405, 406, 407, 408, 409, 410, 411, 412, ...",1724
16,VDE Prüfung,11,2201,"[404, 407, 408, 409, 410, 411, 412, 413, 414, ...",1203
4,Wartung nach Arbeitsplan,24,1091,"[726, 798, 800, 801, 802, 921, 922, 923, 924, ...",219
7,Manuelle Dosierung des Biozids,30,526,"[0, 722, 723, 724, 726]",5
12,Mikrobiologie(Abklatsch-Test),29,511,"[722, 723, 724, 725, 726]",5


In [63]:
# analysiere erste 10 Einträge
#descr = temp1[['descr', 'num_occur']]
descr = temp2[['descr', 'num_occur']]
#descr = descr.iloc[:7,:]

In [64]:
#descr.iat[0,0] = 'Das ist ein Test am 24.08.2023'

In [65]:
len(descr)

1735

In [66]:
descr

Unnamed: 0,descr,num_occur
14,Bestimmen des Prüftermins für elektrische Arbe...,3081
16,VDE Prüfung,2201
4,Wartung nach Arbeitsplan,1091
7,Manuelle Dosierung des Biozids,526
12,Mikrobiologie(Abklatsch-Test),511
...,...,...
844,Filterabreinigung AGT 1 : das erste Ventil von...,1
843,Abnahmeprüfung durch Sachkundigen,1
842,Sprühluftverdichter (Nr.2) ZE4 VSD2 auf Störung,1
841,Ausgeführt,1


In [67]:
#LOAD_CALC_FILES = True
#LOAD_CALC_FILES = False
#IS_TEST = True
IS_TEST = False

**Entdeckte Gruppen**
- Prüfung:
    - Prüfen
    - Sichtprüfung
    - Überprüfung / überprüfen
    - Kontrolle / kontrollieren
    - sicherstellen / Sicherstellung
    - Wartung / warten
    - Reinigung / reinigen
    - Prüfbericht
- Handlung:
    - Schmierung
    - schmieren
    - reinigen
    - Reinigung
    - schneiden / nachschneiden
- zyklisch:
    - täglich
    - wöchentlich
    - monatlich
    - jährlich
- Datum:
    - Uhr
    - Montag, Dienstag, Mittwoch, Donnerstag, Freitag, Samstag, Sonntag
- Kleinteile:
    - Schraube
    - Adapter
    - Halterung
    - Scheibe
    - Gewinde
    - Ventil
    - Schalter
    - Befestigungsschraube
- Komponenten:
    - Kupplung
    - Motor
    - Getriebe
    - Ventilator
    - Zahnriemen
    - Tranformator
    - Filterelement
    - Dosierpumpe
    - Luftschlauch
    - Dichtung
    - Filter
    - Scharnier
    - Spannrolle
    - Druckluftbehälter
    - Kette
    - Anschlüsse
    - Schläuche
    - Beleuchtung
- Elektrik:
    - Zuleitung
    - Kabel
    - Steckdose
    - Elektriker
    - Elektronik
    - elektrisch
    - Sicherheitsbeleuchtung
- Anlagen:
    - Mischanlage
    - Maschine
    - Wasserenthärtungsanlage
    - Lüftungsanlage
    - Klimaanlage
- Vereinbarung:
    - Wartungsvertrag
    - Neuvertrag
    - Vertrag
    - terminieren / terminiert
    - Absprache
    - melden
    - telefonisch
    - mitteilen
- Störbild:
    - defekt
    - kaputt
    - Geräusch
    - undicht
    - leckt
    - Dichtigkeit
- Abteilung:
    - Buchhaltung
    - Betriebstechnik
    - Entwicklung
- Ort:
    - Kesselhaus
    - Durchfahrt
    - Dach
    - Haupteingang
    - Werkbank
    - Schlosserei

In [68]:
word_2_map = {
    'Prüfung': ['prüfen', 'sichtprüfung', 'überprüfung', 'überprüfen',
                'kontrolle', 'kontrollieren', 'sicherstellen', 'sicherstellung',
                'reinigung', 'reinigen', 'prüfbericht', 'sichtkontrolle',
                'rundgang', 'technikrundgang'],
    'Wartung': ['wartung', 'warten', 'wartungstätigkeit', 'wartungsarbeit',
                'wartungsplan'],
    'Handlung': ['schmierung', 'schmieren', 'reinigen', 'reinigung',
                 'schneiden', 'nachschneiden'],
    'zyklisch': ['täglich', 'tägliche', 'täglicher', 'wöchentlich', 'wöchentliche', 'monatlich', 'jährlich',
                 'halbjährlich', 'monatliche', 'wartungsintervall'],
    'Datum': ['uhr', 'montag', 'dienstag', 'mittwoch', 'donnerstag',
              'freitag', 'samstag', 'sonntag'],
    'Kleinteile': ['schraube', 'adapter', 'halterung', 'scheibe', 'gewinde',
                   'ventil', 'schalter', 'befestigungsschraube'],
    'Komponenten': ['kupplung', 'motor', 'getriebe', 'ventilator',
                    'zahnriemen', 'transformator', 'filterelement',
                    'dosierpumpe', 'luftschlauch', 'dichtung', 'filter',
                    'scharnier', 'spannrolle', 'druckluftbehälter', 'kette',
                    'anschlüsse', 'anschluss', 'schläuche', 'schlauch', 'beleuchtung'],
    'Elektrik': ['zuleitung', 'kabel', 'steckdose', 'elektriker',
                 'elektronik', 'elektrisch', 'sicherheitsbeleuchtung'],
    'Anlagen': ['anlage', 'mischanlage', 'maschine', 'klimaanlage', 'filteranlage',
                'wasserenthärtungsanlage', 'lüftungsanlage', 'wasseraufbereitungsanlage'],
    'Vereinbarung': ['wartungsvertrag', 'neuvertrag', 'vertrag', 'terminieren'
                     'terminiert', 'absprache', 'melden', 'telefonisch', 'mitteilen'],
    'Störbild': ['defekt', 'kaputt', 'geräusch', 'undicht', 'leckt', 'dichtigkeit'],
    'Abteilung': ['buchhaltung', 'betriebstechnik', 'entwicklung'],
    'Ort': ['kesselhaus', 'durchfahrt', 'dach', 
            'haupteingang', 'werkbank', 'schlosserei'],
}

- Frage: Existiert Möglichkeit zur Klassifizierung von Begriffen?
    - z.B. automatische Kennung, ob Komponente oder nicht

In [69]:
map_2_word = dict()

for key, word_list in word_2_map.items():
    
    for word in word_list:
        map_2_word[word] = key

In [70]:
IS_TEST = False
LOAD_CALC_FILES = False

In [71]:
len(descr)

1735

In [72]:
# adjacency matrix
connections = dict()
unique_tokens = set()
UPDATE_STATUS = 500
length_data = len(descr)

if not LOAD_CALC_FILES or IS_TEST:
    for count, description in enumerate(descr.iterrows()):
        
        text = description[1]['descr']
        weight = description[1]['num_occur']
        
        doc = nlp(text)
        
        obtain_descendant_info(
            doc=doc,
            weight=weight,
            POS_of_interest=POS_of_interest,
            TAG_of_interest=TAG_of_interest,
            connections=connections,
            unique_tokens=unique_tokens,
            map_2_word=None,
        )
        
        if count % UPDATE_STATUS == 0:
            logger.info(f'Number of entries processed: {count+1}, Percent completed: {((count+1) / length_data) * 100:.2f}')

INFO:base:Number of entries processed: 1, Percent completed: 0.06
INFO:base:Number of entries processed: 501, Percent completed: 28.88
INFO:base:Number of entries processed: 1001, Percent completed: 57.69
INFO:base:Number of entries processed: 1501, Percent completed: 86.51


In [73]:
adj_mat = obtain_adj_matrix(
    unique_tokens=unique_tokens, 
    connections=connections
)
adj_mat_undir = make_undir_adj_matrix(adj_mat=adj_mat)

In [74]:
SAVE_PATH_UNI_TOKENS = f'./02_1_Preprocess1/{DATA_SET_ID}_04_1_unique_tokens.pkl'
SAVE_PATH_CONNECTIONS = f'./02_1_Preprocess1/{DATA_SET_ID}_04_1_connections.pkl'
SAVE_PATH_ADJ_DF = f'./02_1_Preprocess1/{DATA_SET_ID}_04_2_adj_mat_df.parquet'
SAVE_PATH_ADJ_DF_UNDIR = f'./02_1_Preprocess1/{DATA_SET_ID}_04_2_adj_mat_df_undir.parquet'
if not IS_TEST:
    if LOAD_CALC_FILES:
        connections = load_pickle(SAVE_PATH_UNI_TOKENS)
        unique_tokens = load_pickle(SAVE_PATH_CONNECTIONS)
        adj_mat = pd.read_parquet(SAVE_PATH_ADJ_DF)
        adj_mat_undir = pd.read_parquet(SAVE_PATH_ADJ_DF_UNDIR)
    else:
        adj_mat.to_parquet(SAVE_PATH_ADJ_DF)
        adj_mat_undir.to_parquet(SAVE_PATH_ADJ_DF_UNDIR)
        save_pickle(obj=connections, path=SAVE_PATH_CONNECTIONS)
        save_pickle(obj=unique_tokens, path=SAVE_PATH_UNI_TOKENS)

In [75]:
adj_mat_undir.sort_index()

Unnamed: 0,Dampf,Riss,Förderleistung,festlegen,weie,reperatur,Edelstahlblech,Kidde,Anlagenstillstand,Füllstandssonde,...,Kaltwasserhähne,Andreas,Haltebügel,Sicherheitsschalter,Tränkle,Fall,Zusatzstoff,Gelenk,trocknen,Kilo
AB,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
ABIC,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AGT,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AGipsTechnikRZB,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AKU,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
überfähren,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
überholen,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
überprüfen,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
übertragen,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [76]:
ret = adj_mat_undir.sort_index().index[3]
ret

'AGipsTechnikRZB'

In [77]:
is_str_date(ret)

False

In [78]:
adj_mat_undir.loc[ret,:].sum()

12

Threshold

In [88]:
WEIGHT_THRESHOLD = 120
arr = adj_mat_undir.to_numpy()
arr = np.where(arr < WEIGHT_THRESHOLD, 0, arr)

In [89]:
np.count_nonzero(arr)

190

In [90]:
temp = np.sum(arr, axis=0)
np.count_nonzero(temp)

70

In [91]:
thresh_adj_mat = adj_mat_undir.copy()
thresh_adj_mat.loc[:] = arr

In [92]:
thresh_adj_mat

Unnamed: 0,Dampf,Riss,Förderleistung,festlegen,weie,reperatur,Edelstahlblech,Kidde,Anlagenstillstand,Füllstandssonde,...,Kaltwasserhähne,Andreas,Haltebügel,Sicherheitsschalter,Tränkle,Fall,Zusatzstoff,Gelenk,trocknen,Kilo
Dampf,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Riss,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Förderleistung,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
festlegen,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
weie,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Fall,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Zusatzstoff,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Gelenk,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
trocknen,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [93]:
ADJ_MAT_PATH_CSV = f'./02_2_Preprocess2/{DATA_SET_ID}_01_1_adj_mat_thresh_mapping_{WEIGHT_THRESHOLD}.csv'
thresh_adj_mat.to_csv(path_or_buf=ADJ_MAT_PATH_CSV, encoding='cp1252', sep=';')

*Transfer in NetworkX Graph for Exporting to Standardized Formats*

In [94]:
import networkx as nx

In [95]:
G = nx.from_pandas_adjacency(thresh_adj_mat)

In [96]:
SAVE_PATH_GRAPHML = f'./02_2_Preprocess2/{DATA_SET_ID}_adj_mat_thresh_{WEIGHT_THRESHOLD}.graphml'
nx.write_graphml(G, SAVE_PATH_GRAPHML)

Test Cosine Similarity
- erstelle Matrix mit Ähnlichkeits-Score (obere Dreiecksmatrix)
- jedes Wortpaar
- filtere Tabelle nach Threshold
- nutze Gewichts-Adjezenzmatrix mit Threshold als Maske
    - nur Analyse von hochgewichtigen Gruppen
- analysiere Zusammenhänge in Form von Graph (ähnlich bisherigem Vorgehen)
- bilde Gruppen und benenne diese (z.B. Prüfung+Überprüfung+Kontrolle --> Überprüfung)
- baue daraus Wörterbuch und matche Begriffe bei der Erstellung

In [None]:
def build_cosine_similarity_matrix(
    adj_mat
):
    # obtain words to compare
    words = adj_mat.index.to_list()
    
    # cos matrix
    cos_mat = pd.DataFrame(
        data=0., 
        columns=words, 
        index=words,
        dtype=np.float32,
    )
    
    for (word1, word2) in combinations(words, 2):
        # obtain model vocabulary
        w1 = nlp.vocab[str(word1)]
        w2 = nlp.vocab[str(word2)]
        # calculate cosine similarity
        cos_sim = w1.similarity(w2)
        # set value
        cos_mat.at[word1, word2] = cos_sim
        
    return cos_mat

In [None]:
cos_mat = build_cosine_similarity_matrix(adj_mat=adj_mat_undir)

  cos_sim = w1.similarity(w2)


In [None]:
cos_mat

Unnamed: 0,Klübertemp,Schusssuche,Laser,Schaftteile,Dichtsätz,Tastatur,Vorspuleinheit,beginnen,auslesen,Kettspannung,...,Tänzerwalze,Abfallkante,rappeln,Rottenegger,Contrawalze,Eisenträger,Hängegurte,Treffen,Greiferarmen,Nadelleist
Klübertemp,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
Schusssuche,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
Laser,0.0,0.0,0.0,0.0,0.0,0.324276,0.0,0.059743,0.133676,0.0,...,0.0,0.0,-0.063913,0.0,0.0,0.167521,0.0,-0.029860,0.0,0.0
Schaftteile,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
Dichtsätz,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Eisenträger,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.170954,0.0,0.0
Hängegurte,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
Treffen,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
Greiferarmen,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0


In [None]:
WEIGHT_THRESHOLD = 10
arr = adj_mat_undir.to_numpy()
COS_THRESHOLD = 0.4
cos_arr = cos_mat.to_numpy()

In [None]:
cos_arr_filt = np.where((cos_arr > COS_THRESHOLD) & (arr >= WEIGHT_THRESHOLD), cos_arr, 0)

In [None]:
cos_arr_filt

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [None]:
np.count_nonzero(cos_arr_filt)

217

In [None]:
thresh_cos_mat = cos_mat.copy()
thresh_cos_mat[:] = cos_arr_filt

In [None]:
thresh_cos_mat

Unnamed: 0,Verstärkung,Zuluftfilter,klemmt,Komminikation,Doppelholztische,Deckenbeleuchtung,Abfalltransport,fahrbar,Folieneinlauf,entsorgen,...,neuwertig,Bleit,Rauchentwicklung,Kompressorsteuerung,anziehen,Mitarbeiterin,Nägel,WZ,ExSchutzAnlage,Gemisch
Verstärkung,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zuluftfilter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
klemmt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Komminikation,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doppelholztische,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mitarbeiterin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Nägel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
WZ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ExSchutzAnlage,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
COS_MAT_PATH_CSV = f'./Graphanalyse_Gruppen/cos_mat_Wthresh_{WEIGHT_THRESHOLD}_Cthresh{int(COS_THRESHOLD*100)}.csv'
thresh_cos_mat.to_csv(path_or_buf=COS_MAT_PATH_CSV, encoding='cp1252', sep=';')