From 2002ed71a66f5665772f9d01443d22f2888c3e6c Mon Sep 17 00:00:00 2001 From: pBxr <> Date: Sun, 20 Oct 2024 11:12:11 +0200 Subject: [PATCH] Insert NER Plugin into ttw --- .gitignore | 5 + README.md | 32 ++-- cpp_core/TagTool_WiZArd.dev | 2 +- cpp_core/ttwClasses.h | 4 +- python_frame/TagTool_WiZArd_Start.py | 209 ++++++++++++++----------- python_frame/pyNER.py | 226 +++++++++++++++++++++++++-- python_frame/ttw_help.html | 169 ++++++++++++++++++-- 7 files changed, 519 insertions(+), 128 deletions(-) create mode 100644 .gitignore diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..6782ff8 --- /dev/null +++ b/.gitignore @@ -0,0 +1,5 @@ +*.docx + +*.bak + +*.exe \ No newline at end of file diff --git a/README.md b/README.md index a14ac7e..18f02a1 100644 --- a/README.md +++ b/README.md @@ -35,7 +35,7 @@ ttw consists of two components: - it also runs several integrity checks on the files (- step by step it will also take over the functions from the `c++` core) -2.) The `c++` core (`tagtool_v2-0-0.exe`): +2.) The `c++` core (`tagtool_v2-1-0.exe`): - it runs most of the main tasks - using the `Python` framework it needs to be embedded into the framework´s main folder - like in former releases it still can be run as a standalone application using a terminal. @@ -57,13 +57,13 @@ pyinstaller -wF --icon="Logo.ico" TagTool_WiZArd_Start.py Result is `TagTool_WiZArd_Start.exe`. -3.) Create `tagtool_v2-0-0.exe` using this repo (`cpp_core`): -A simple way to create the `tagtool_v2-0-0.exe` file from the `c++` core is to use Embarcadero Dev-C++ 6.3.: +3.) Create `tagtool_v2-1-0.exe` using this repo (`cpp_core`): +A simple way to create the `tagtool_v2-1-0.exe` file from the `c++` core is to use Embarcadero Dev-C++ 6.3.: - Open the `.dev` file and add all `c++` files to your project (`main.cpp` and all header files (`.h`)) - If using Embarcadero Dev-C++ 6.3 add "`-std=c++17`" in Project Options -> Parameter s -> C++ compilers. - Run "Rebuild all". 
-Result is `tagtool_v2-0-0.exe` +Result is `tagtool_v2-1-0.exe` ## How to setup and run @@ -74,7 +74,7 @@ Result is `tagtool_v2-0-0.exe` - ttw_help.html - Logo.ico - Logo.gif -- tagtool_v2-0-0.exe (how to create the `.exe` file from the `c++` core see above) +- tagtool_v2-1-0.exe (how to create the `.exe` file from the `c++` core see above) - and the \resources folder (with all necessary files downloaded together with the ttw release) If you create a shortcut on your desktop to start `TagTool_WiZArd_Start.exe` you don´t have to touch the ttw folder again. @@ -90,23 +90,34 @@ For preparing the `.csv` files and all other questions how to run the applicatio **Alternatively: Stand alone from console:** -After compiling the binary (tagtool_v2-0-0.exe, see above) open a terminal and run "tagtool_v2-0-0.exe" either with the parameter "--help" to get further informations or together with the name of the file you want to process. +After compiling the binary (tagtool_v2-1-0.exe, see above) open a terminal and run "tagtool_v2-1-0.exe" either with the parameter "--help" to get further informations or together with the name of the file you want to process. Be sure not to omit the `.html`-ending of the file you want to process. -Be sure that all necessary files are saved in the **same folder** together with the `tagtool_v2-0-0.exe` file, i. e. +Be sure that all necessary files are saved in the **same folder** together with the `tagtool_v2-1-0.exe` file, i. e. - 01_MetadataValueList.csv - 02_AuthorYearList.csv - 03_ImageCreditList.csv - 04_ToSearchAndReplaceList.csv - article.html -- tagtool_v2-0-0.exe +- tagtool_v2-1-0.exe - \resources See "--help" to find all necessary informations to run the application in a standalone version. For preparing the `.csv` files see `ttw_help.html`. +## New in v2.1.0 + +Starting with v2.1.0 `ttw` comes with a test version of a `Named Entity Recognition (NER)` Plugin option. 
The NER Plugin needs a specific environment and various additional libraries with special dependencies. The plugin is therefore switched off by default in the release versions to avoid conflicts. If you want to test the plugin: +- Prepare your environment carefully, see the README.md file with the complete documentation here: https://github.com/pBxr/NER_Plugin_for_ttw. +- Activate the plugin in the `Python` source code before re-interpreting the Python files. See `TagTool_WiZArd_Start.py` and set the `NER_Plugin_Switch` to `True`. +The insufficient quality of the `iDAI.gazetteer` query results was ignored for this first test version (as well as the webservice´s default query limit). Working on filter mechanisms to improve the quality of the results will be a task for forthcoming commits. +For more information see the "Help" file and especially the documentation here: https://github.com/pBxr/NER_Plugin_for_ttw. + +New also: +- Function to convert tables to XML, implemented with Beautiful Soup (therefore not available when using the console version). + ## New in v2.0.0 -- Starting with v2.0.0 ttw comes with a GUI, based on `Python/tkinter`. Although the `c++` core can still be used as terminal standalone application (`tagtool_v2-0-0.exe`, see above), it is not recommended, because the `Python` framework does several integrity checks. +- Starting with v2.0.0 ttw comes with a GUI, based on `Python/tkinter`. Although the `c++` core can still be used as terminal standalone application (`tagtool_v2-1-0.exe`, see above), it is not recommended, because the `Python` framework does several integrity checks. Also new to previous versions: - The article file and value lists no longer need to be saved in the same folder with ttw, any directory can be chosen. 
@@ -157,4 +168,5 @@ Therefore new in v1.3.0: Additional mode implemented when ttw is called from web ## See also - For ttw_webx see https://github.com/pBxr/ttw_WebExtension -- ID_Extractor (ID_Ex) for extracting IDs and references from `.jats` article files, especially for the above mentioned journals, see https://github.com/pBxr/ID_Extractor +- ID_Extractor (ID_Ex) for extracting IDs and references from `.jats` article files, especially for the above mentioned journals, see https://github.com/pBxr/ID_Extractor +- Test Environment for a TagTool_WiZArD Named Entity Recognition Plugin, see https://github.com/pBxr/NER_Plugin_for_ttw. \ No newline at end of file diff --git a/cpp_core/TagTool_WiZArd.dev b/cpp_core/TagTool_WiZArd.dev index e77e2d5..5063fc3 100644 --- a/cpp_core/TagTool_WiZArd.dev +++ b/cpp_core/TagTool_WiZArd.dev @@ -19,7 +19,7 @@ ObjectOutput= LogOutput= LogOutputEnabled=0 OverrideOutput=1 -OverrideOutputName=tagtool_v2-0-0.exe +OverrideOutputName=tagtool_v2-1-0.exe HostApplication= UseCustomMakefile=0 CustomMakefile= diff --git a/cpp_core/ttwClasses.h b/cpp_core/ttwClasses.h index 3a39831..a0c2238 100644 --- a/cpp_core/ttwClasses.h +++ b/cpp_core/ttwClasses.h @@ -634,8 +634,8 @@ string strongEndXML_ = ""; //Global settings and switches... 
-string versionNumber = "v2-0-0"; -string versionTag = "v2.0.0"; +string versionNumber = "v2-1-0"; +string versionTag = "v2.1.0"; bool firstRun=true; bool nextRunIsSet=true; diff --git a/python_frame/TagTool_WiZArd_Start.py b/python_frame/TagTool_WiZArd_Start.py index 67464e3..7ec3ea0 100644 --- a/python_frame/TagTool_WiZArd_Start.py +++ b/python_frame/TagTool_WiZArd_Start.py @@ -17,7 +17,7 @@ from Settings import files import pyScripts as pyScr -import pyNER + class MainWindow(tkinter.Frame): @@ -143,7 +143,7 @@ def actualize_widgets(self): #NER plugin button self.buttonStartNER = ttk.Button(self, text = "Open NER plugin", style = "TButton", - command=lambda: self.set_NER_settings()) + command=lambda: self.basic_NER_lib_check()) self.buttonStartNER["width"] = 20 self.buttonStartNER.grid(column = 3, row = heightRow1 + 2, sticky="nw") @@ -188,6 +188,51 @@ def actualize_widgets(self): self.buttonOpenBrowser["width"] = 25 self.buttonOpenBrowser.grid(column = 2, row = heightRow1 + 3, sticky="nw") + def basic_NER_lib_check(self): + + if NER_Plugin_Switch == False: + textInfo = ("NER Plugin must be activated first.\n\n" + "See \"About\" -> \"Help\" for instructions.\n\n\n") + tkinter.messagebox.showwarning(title="ERROR", \ + message=textInfo) + return + + if self.files.fileName == "": + tkinter.messagebox.showwarning(title="ERROR", \ + message="No file selected!") + self.settings.selectedFileIsReady = False + self.bgColorTboxArticle = "red" + self.actualize_widgets() + return + + #Extensive tests ommitted, but at least a quick check, + #whether it can be assumed that the necessary environment exists. 
+ try: + from transformers import pipeline + except ModuleNotFoundError as err: + textInfo = ("The required NER libraries do not seem to be installed.\n\n" + "Check your environment.\n\n" + "See \"About\" -> \"Help\" for instructions.\n\n\n") + tkinter.messagebox.showwarning(title="ERROR", \ + message=textInfo) + return + else: + yesnoResult = tkinter.messagebox.askyesno(title="Important Information", \ + message=( "The now opening NER Plugin must to be run before " + "preparing the file \"04_ToSearchAndReplaceList.csv\".\n\n" + "So:\n" + "1. Run the NER Pluging first. Its results will be saved " + "in the file \"NER_results\\02_Gazetteer_IDs_DRAFT.csv\"\n\n" + "2. Copy the entries you have approved and selected into " + "\"04_ToSearchAndReplaceList.csv\"\n\n" + "3. After having prepared the other mandatory .csv files run TagTool.\n\n\n" + "Do you wish to continue?" + )) + if yesnoResult == False: + return + else: + self.set_NER_settings() + def create_MenuBar(self): self.menueBarFile = tkinter.Menu(self.menu, tearoff=False) @@ -270,9 +315,7 @@ def reset_app(self): def run_NER_Plugin(self, window): #In this version the default settings cannot be changed so they are hard coded - pyNER.run_NER_process(self.files, self.settings) - - textInfo = "Process finished. 
Check result" + success, textInfo = pyNER.run_NER_process(self.files, self.settings) tkinter.messagebox.showinfo(title="Info", \ message=textInfo) @@ -348,89 +391,71 @@ def set_functions(self): def set_NER_settings(self): + + setNER_PluginWindow = tkinter.Toplevel() + setNER_PluginWindow.geometry('600x500') + setNER_PluginWindow.title('Named Entity Recognition Plugin') + setNER_PluginWindow.iconbitmap(self.settings.cwd+"\\Logo.ico") + + #Models + self.groupModels = tkinter.LabelFrame(setNER_PluginWindow) + self.groupModels["text"] = "Models" + self.groupModels.grid(sticky="w", pady = 10, padx = 10) + self.tboxModels = Text(self.groupModels, height=len(self.settings.NER_Settings['Model']), width=70, + background=self.settings.colorNeutral) + self.tboxModels.configure(font=self.textFont) + self.tboxModels.grid() + for item in self.settings.NER_Settings['Model']: + self.tboxModels.insert("end", item + "\n") + self.tboxModels.config(state='disabled') + + #Entities + self.groupEntities = tkinter.LabelFrame(setNER_PluginWindow) + self.groupEntities["text"] = "Entity Types" + self.groupEntities.grid(sticky="w", pady = 10, padx = 10) + self.tboxEntities = Text(self.groupEntities, height=len(self.settings.NER_Settings['Entity Type']), width=70, + background=self.settings.colorNeutral) + self.tboxEntities.configure(font=self.textFont) + self.tboxEntities.grid() + for item in self.settings.NER_Settings['Entity Type']: + self.tboxEntities.insert("end", item + "\n") + self.tboxEntities.config(state='disabled') + + #Sources + self.groupSources = tkinter.LabelFrame(setNER_PluginWindow) + self.groupSources["text"] = "Ways of Source Text Extraction" + self.groupSources.grid(sticky="w", pady = 10, padx = 10) + self.tboxSources = Text(self.groupSources, height=len(self.settings.NER_Settings['Source']), width=70, + background=self.settings.colorNeutral) + self.tboxSources.configure(font=self.textFont) + self.tboxSources.grid() + for item in self.settings.NER_Settings['Source']: + 
self.tboxSources.insert("end", item + "\n") + self.tboxSources.config(state='disabled') - if self.files.fileName == "": - tkinter.messagebox.showwarning(title="ERROR", \ - message="No file selected!") - self.settings.selectedFileIsReady = False - self.bgColorTboxArticle = "red" - self.actualize_widgets() - else: - tkinter.messagebox.showwarning(title="Important Information", \ - message=( "The now opening NER Plugin needs to be run before " - "preparing the file \"04_ToSearchAndReplaceList.csv\".\n\n" - "So:\n" - "1. Run the NER Pluging first. Its results will be saved " - "in the file \"NER_results\\02_Gazetteer_IDs_DRAFT.csv\"\n\n" - "2. Copy the entries you have approved into " - "\"04_ToSearchAndReplaceList.csv\"\n\n" - "3. After having prepared the other mandatory files run TagTool" - )) - - setNER_PluginWindow = tkinter.Toplevel() - setNER_PluginWindow.geometry('600x500') - setNER_PluginWindow.title('Named Entity Recognition Plugin') - setNER_PluginWindow.iconbitmap(self.settings.cwd+"\\Logo.ico") - - #Models - self.groupModels = tkinter.LabelFrame(setNER_PluginWindow) - self.groupModels["text"] = "Models" - self.groupModels.grid(sticky="w", pady = 10, padx = 10) - self.tboxModels = Text(self.groupModels, height=len(self.settings.NER_Settings['Model']), width=70, - background=self.settings.colorNeutral) - self.tboxModels.configure(font=self.textFont) - self.tboxModels.grid() - for item in self.settings.NER_Settings['Model']: - self.tboxModels.insert("end", item + "\n") - self.tboxModels.config(state='disabled') - - #Entities - self.groupEntities = tkinter.LabelFrame(setNER_PluginWindow) - self.groupEntities["text"] = "Entity Types" - self.groupEntities.grid(sticky="w", pady = 10, padx = 10) - self.tboxEntities = Text(self.groupEntities, height=len(self.settings.NER_Settings['Entity Type']), width=70, - background=self.settings.colorNeutral) - self.tboxEntities.configure(font=self.textFont) - self.tboxEntities.grid() - for item in 
self.settings.NER_Settings['Entity Type']: - self.tboxEntities.insert("end", item + "\n") - self.tboxEntities.config(state='disabled') - - #Sources - self.groupSources = tkinter.LabelFrame(setNER_PluginWindow) - self.groupSources["text"] = "Ways of Source Text Extraction" - self.groupSources.grid(sticky="w", pady = 10, padx = 10) - self.tboxSources = Text(self.groupSources, height=len(self.settings.NER_Settings['Source']), width=70, - background=self.settings.colorNeutral) - self.tboxSources.configure(font=self.textFont) - self.tboxSources.grid() - for item in self.settings.NER_Settings['Source']: - self.tboxSources.insert("end", item + "\n") - self.tboxSources.config(state='disabled') - - #In this version the default settings cannot be changed so they are hard coded - #Selected Settings - self.groupSettings = tkinter.LabelFrame(setNER_PluginWindow) - self.groupSettings["text"] = "Selected Settings" - self.groupSettings.grid(sticky="w", pady = 10, padx = 10) - self.tboxSettings = Text(self.groupSettings, height=3, width=70, - background=self.settings.okGreen) - self.tboxSettings.configure(font=self.textFont) - self.tboxSettings.grid() - for x, y in self.settings.NER_SettingsSet.items(): - self.tboxSettings.insert("end", x + ": " + y + "\n") - self.tboxSettings.config(state='disabled') - - self.buttonRunNER = ttk.Button(setNER_PluginWindow, text = "Run NER Plugin", style = "TButton", - command=lambda: self.run_NER_Plugin(setNER_PluginWindow)) - self.buttonRunNER.grid(sticky="e") + #In this version the default settings cannot be changed so they are hard coded + #Selected Settings + self.groupSettings = tkinter.LabelFrame(setNER_PluginWindow) + self.groupSettings["text"] = "Selected Settings" + self.groupSettings.grid(sticky="w", pady = 10, padx = 10) + self.tboxSettings = Text(self.groupSettings, height=3, width=70, + background=self.settings.okGreen) + self.tboxSettings.configure(font=self.textFont) + self.tboxSettings.grid() + for x, y in 
self.settings.NER_SettingsSet.items(): + self.tboxSettings.insert("end", x + ": " + y + "\n") + self.tboxSettings.config(state='disabled') + + self.buttonRunNER = ttk.Button(setNER_PluginWindow, text = "Run NER Plugin", style = "TButton", + command=lambda: self.run_NER_Plugin(setNER_PluginWindow)) + self.buttonRunNER.grid(sticky="e") - self.tboxInfo = Text(setNER_PluginWindow, height=1, width=70, background="#ffff66") - self.tboxInfo.configure(font=self.textFont) - infoText = "NOTE: In this test versions this selected settings are predefined." - self.tboxInfo.insert("end", infoText) - self.tboxInfo.grid(sticky = "w", pady = 10, padx = 10) - self.tboxInfo.config(state='disabled') + self.tboxInfo = Text(setNER_PluginWindow, height=1, width=70, background="#ffff66") + self.tboxInfo.configure(font=self.textFont) + infoText = "NOTE: In this test versions this selected settings are predefined." + self.tboxInfo.insert("end", infoText) + self.tboxInfo.grid(sticky = "w", pady = 10, padx = 10) + self.tboxInfo.config(state='disabled') def show_help(self): @@ -474,7 +499,8 @@ def start_process(self): subprocess.run(pandocCall, stdout=FNULL, stderr=FNULL, shell=False) #Step 2: Run ttw - ttwCall = "\"" + self.settings.cwd + "\\tagtool_v2-0-0.exe\"" + " \""\ + versionNumberCall = versionNumber.replace(".","-") + ttwCall = "\"" + self.settings.cwd + "\\tagtool_v"+versionNumberCall+".exe\"" + " \""\ + self.files.projectPath + self.settings.target + "\"" #In case of whitespaces @@ -500,7 +526,14 @@ def start_process(self): root = tkinter.Tk() global versionNumber - versionNumber = "2.0.0" + versionNumber = "2.1.0" + + #Here is the switch if you want to test the NER Plugin + global NER_Plugin_Switch + NER_Plugin_Switch = False + if NER_Plugin_Switch == True: + import pyNER + currentDirectory = os.getcwd() titleText = "Welcome to TagToolWiZArd application " + "(v"+versionNumber+")" root.title(titleText) diff --git a/python_frame/pyNER.py b/python_frame/pyNER.py index eadce78..5a31487 
100644 --- a/python_frame/pyNER.py +++ b/python_frame/pyNER.py @@ -1,17 +1,156 @@ +import time import subprocess import os +from bs4 import BeautifulSoup -def run_NER_process(files, settings): +#Make sure that the basic functions are not beeing affected by this plugin in case of missing libraries. +#A quick check of the environment before running is done by the basic_NER_lib_check() function in the main file. +try: + import json + import requests + from transformers import AutoTokenizer, AutoModelForTokenClassification + from transformers import pipeline +except: + pass + + +class log_NER_Class: + + def __init__(self, files): + + self.files = files + + self.actualTime = time.localtime() + self.year, self.month, self.day = self.actualTime[0:3] + self.hour, self.minute, self.second = self.actualTime[3:6] + + self.NERresultPath = self.files.projectPath+"NER_results/" + + if not os.path.exists(self.NERresultPath): + os.makedirs(self.NERresultPath) + + self.logCollector = [] + + self.logCollector.append(f"Log file: {self.year:4d}-{self.month:02d}-{self.day:02d}_{self.hour:02d}:{self.minute:02d}:{self.second:02d}\n") + + def add_to_log(self, logInput): + + self.logCollector.append(logInput) + + def save_log(self): + + with open(self.NERresultPath + "01_log.txt", 'w', encoding="utf8") as fp: + + for logEntry in self.logCollector: + fp.write(logEntry) + #print(logEntry) + fp.close() + + + def save_results(self, resultJSONList, resultForCSVList): + + self.save_log() + + #Now the .csv list + intro=("Place name|Suggested ID\nNote|\"(*NOT LIKELY*)\" means, " + "that the entity is not of type \"archaeological-site\", " + "\"archaeological-area\" or \"populated-place\".\n") + + resultForCSVList_sorted = sorted(resultForCSVList) + + with open(self.NERresultPath + "02_Gazetteer_IDs_DRAFT.csv", 'w', encoding="utf8") as fp: + fp.write(intro + "\n") + #print(intro) + for item in resultForCSVList_sorted: + fp.write(item + "\n") + #print(item) + fp.close() + + #Now the complete .json 
file + with open(self.NERresultPath + "03_Gazetteer_result_detailed.json", 'w', encoding="utf8") as fp: + resultJSON = json.dumps(resultJSONList, + indent=4, sort_keys=False, + separators=(',', ': '), ensure_ascii=False) + fp.write(resultJSON) + fp.close() + + +def call_gazetteer(results, logGenerator): + + listGazetteer = [] + listForCSV = [] + csvRow = "" + logGenerator.add_to_log("\n3. iDAI.gazetteer query result") + + for result in results: + + toBeRun = filter_NER_results(result) #Decides whether the entry will be run or not + + if toBeRun == True: + #Most simple way to call gazetteer, only for testing purposes, more elaborated filters following. + #See also the README.md file here https://github.com/pBxr/NER_Plugin_for_ttw on this point. + toSearch = "https://gazetteer.dainst.org/search.json?q=" + result + response = requests.get(toSearch) + resultListComplete = response.json() + + logGenerator.add_to_log(f"\n--------------------------------------------------------------\nSearching in iDAI.gazetteer for \"{result}\"\n") + + logGenerator.add_to_log(f"Number of results: {resultListComplete['total']}\n") + + resultList = resultListComplete['result'] + i=1 + + for item in resultList: + if item['prefName']['title']: + logGenerator.add_to_log(f"Nr. 
{i}: Preferred Name: {item['prefName']['title']}\n") + + if "types" in item: + logGenerator.add_to_log("Type: ") + + for entry in item['types']: + logGenerator.add_to_log(entry +", ") + logGenerator.add_to_log("\n") + + if "@id" in item: + logGenerator.add_to_log(item['@id']) + + if item['prefName']['title'] and "@id" in item: + + if ("types" in item) and ('archaeological-area' in item['types'] + or 'populated-place' in item['types'] + or 'archaeological-site' in item['types']): + + csvRow = result + "|" + item['@id'] + logGenerator.add_to_log("\n") + listForCSV.append(csvRow) + else: + result2 = result + "(*NOT LIKELY*)" + csvRow = result2 + "|" + item['@id'] + logGenerator.add_to_log("\n") + + listForCSV.append(csvRow) #To save only the needed entries + i+=1 + + logGenerator.add_to_log("\n--------------------------------------------------------------\n") + toSearch="" + + listGazetteer.append(resultListComplete) #To save the complete result + + return listGazetteer, listForCSV + + +def filter_NER_results(result): """ - This is only the first step to get a plugin integrated into ttw. - The complete NER functions will be inserted during the next commits. + This is only a simple placeholder for a more elaborated function. 
""" + if len(result) > 3: + return True + else: + return False - inputText = prepare_folder_and_input_Text(files, settings) - -def prepare_folder_and_input_Text(files, settings): +def prepare_folder_and_input_text(files, settings): #Prepare folder pathNERresults = files.projectPath + "NER_results" @@ -19,20 +158,19 @@ def prepare_folder_and_input_Text(files, settings): if not os.path.exists(pathNERresults): os.makedirs(pathNERresults) - #Convert text to the chosen input format for NER pipeline + #Convert text to the selected input format if settings.NER_SettingsSet['Source'] == 'Convert .docx to .txt and get text': pandocParameter = "00_Plain_article_text.txt" else: pandocParameter = "00_Plain_article_text.html" - #Put together the command to call pandoc to convert the .docx file into the chosen format + #Put together the pandoc call to convert the .docx file into the selected format and save it pandocCall = "pandoc -o " + "\"" + pathNERresults + "\\" + pandocParameter + "\"" + " " + "\"" + files.projectPath + files.fileName + "\"" FNULL = open(os.devnull, 'w') #For subprocess subprocess.run(pandocCall, stdout=FNULL, stderr=FNULL, shell=False) - #Return the plain text to NER. - #If a structured file is chosen, text gets extracted with bs4 + #Return the plain text for the pipeline. If a structured format like .html is selected, text gets extracted with bs4. plainTextPath = pathNERresults + "\\" + pandocParameter if settings.NER_SettingsSet['Source'] == 'Convert .docx to .txt and get text': @@ -47,4 +185,70 @@ def prepare_folder_and_input_Text(files, settings): #Remove blank lines inputText = str(text).replace('\n\n','') return inputText - + + +def return_location_names(nerResults, logGenerator): + + listNames = [] + logGenerator.add_to_log("\n1. NER result:\n") + for result in nerResults: + logGenerator.add_to_log(str(result)+"\n") + + for entry in nerResults: + #"I-LOC" and "B-LOC" are specific for the selected model. 
If you use another model check the entity types. + if entry['entity'] == "I-LOC" or entry['entity'] == "B-LOC": + + if "##" in entry['word']: + toInsert = entry['word'].replace("##", "") + else: + toInsert = "%%"+ entry['word'] + listNames.append(toInsert) + + result = ''.join(listNames) + locationNamesRaw = result.split("%%") + locationNamesRaw.remove('') + locationNames = list(set(locationNamesRaw)) + + logGenerator.add_to_log("\n2. Extracted entities de-tokenized\n") + for entry in locationNames: + logGenerator.add_to_log(entry +", ") + logGenerator.add_to_log("\n") + return locationNames + + +def run_NER_process(files, settings): + + logGenerator = log_NER_Class(files) + + inputText = prepare_folder_and_input_text(files, settings) + + selectedModell = settings.NER_SettingsSet['Model'] + try: + #Now run NER + tokenizer = AutoTokenizer.from_pretrained(selectedModell) + model = AutoModelForTokenClassification.from_pretrained(selectedModell) + nlp = pipeline("ner", model=model, tokenizer=tokenizer) + + nerResults = nlp(inputText) + + #Now extract names, get iDAI.gazetteer entries and save log and results + extractedLocationNames = return_location_names(nerResults, logGenerator) + resultJSONList, resultForCSVList = call_gazetteer(extractedLocationNames, logGenerator) + logGenerator.save_log() + logGenerator.save_results(resultJSONList, resultForCSVList) + + except: + textInfo = ("Some unexpected problem occured while starting the NER pipeline.\n\n" + "Check your environment.\n\n" + "See \"About\" -> \"Help\" for instructions.\n\n\n") + logGenerator.add_to_log(textInfo) + logGenerator.save_log() + return False, textInfo + + else: + textInfo = "Process finished. Check result" + return True, textInfo + + + + diff --git a/python_frame/ttw_help.html b/python_frame/ttw_help.html index 1f17850..d932b64 100644 --- a/python_frame/ttw_help.html +++ b/python_frame/ttw_help.html @@ -153,23 +153,30 @@

TagTool_WiZArd Help

-

v2.0.0

+

TagTool_WiZArd Help

+

v2.1.0

Contents

-

Before starting TagToolWiZArd (ttw) 1

-

Preparation of the article file 2

-

Preparation of the .csv files 3

-

In general 3

-

Metadata list (01_MetadataValueList.csv) 4

-

Bibliography (02_AuthorYearList.csv) 4

-

Illustration credits and captions (03_IllustrationCreditList.csv) 5

-

To search and replace (04_ToSearchAndReplaceList.csv) 5

-

Common errors 6

-

Start ttw 6

-

Automatic checks 7

-

Mandatory .csv files 7

-

Article file: 7

-

XML output 8

-

Result 8

+

TagTool_WiZArd Help 1

+

Before starting TagToolWiZArd (ttw) 2

+

Preparation of the article file 3

+

Preparation of the .csv files 4

+

In general 4

+

Metadata list (01_MetadataValueList.csv) 4

+

Bibliography (02_AuthorYearList.csv) 5

+

Illustration credits and captions (03_IllustrationCreditList.csv) 5

+

To search and replace (04_ToSearchAndReplaceList.csv) 6

+

Notes for function “Additional search and replace” 6

+

Common errors 6

+

Start ttw 7

+

Automatic checks 8

+

Mandatory .csv files 8

+

Article file: 8

+

XML output 8

+

Result 8

+

Named Entity Recognition (NER) Plugin Option 9

+

Requirements for using the NER Plugin 9

+

NER functions 9

+

Before starting TagToolWiZArd (ttw)

1.) ttw runs only on Windows. macOS, Linux and other systems are not supported yet.

2.) Install pandoc on your machine. TagTool_WiZArd (ttw) is tested with pandoc 2.16.2; other versions may cause problems.

@@ -183,6 +190,8 @@

Before starting TagToolWiZArd (ttw) +

4.) Create a shortcut on your desktop to start TagTool_WiZArd_Start.exe. In this case you don´t have to touch the ttw folder again.

@@ -190,8 +199,14 @@

Before starting TagToolWiZArd (ttw)

Make sure that the 4 mandatory value lists are in the same folder as the article file:

+ + +

For the preparation of the .csv files see below.

@@ -200,9 +215,17 @@

Preparation of the article file

  • Arrangement of the article sections: To avoid errors in the automatic paragraph numbering, it is recommended to place all technical sections or information at the end of the article and not to use headlines for them, e.g.

    + + + +
  • @@ -212,14 +235,24 @@

    Preparation of the article file

    Also important:

    + + +

    Figure references within the text:

    + +

    Preparation of the .csv files

    @@ -228,31 +261,49 @@

    In general

    Prepare each file, including its categories, following the example files in the folder \resources:

    + + +

    Use the option "Save as .csv" for the conversion. Be sure that:

    + +

    Important:

    +

    Metadata list (01_MetadataValueList.csv)

    The list needs to contain two columns, separated with the pipe symbol (“|”), using the following schema:

    +

    That means:

    + +

    Please note:

    @@ -263,25 +314,39 @@

    Bibliography (02_AuthorYearList.csv)

    The list needs to contain three columns, separated with the pipe symbol (“|”), using the following schema:

    +

    That means:

    + + +

    Please note:

    + +

    Illustration credits and captions (03_IllustrationCreditList.csv)

    The list needs to contain four columns, separated with the pipe symbol (“|”), using the following schema:

    +

    That means:

    @@ -290,7 +355,11 @@

    Illustr + +

    Please note:

    @@ -301,43 +370,77 @@

    To search and repla

    The list needs to contain two columns, separated with the pipe symbol (“|”), using the following schema:

    +

    Notes for function “Additional search and replace”

    + + +

    Common errors

    + +

    When preparing the .csv lists:

    + + +

    Start ttw

    When starting ttw the following functions are chosen by default (recommended):

    + + + + + + +

    Please note:

    +

    Automatic checks

    @@ -346,7 +449,11 @@

    Mandatory .csv files

    ttw also carries out a simple integrity check on the 4 .csv value lists:

    + +

    Article file:

    @@ -356,7 +463,11 @@

    XML output

    Important:

    + +

    Result

    @@ -366,11 +477,37 @@

    Result

    If the files are written successfully:

    +

    Important:

    +

    Named Entity Recognition (NER) Plugin Option

    +

    Requirements for using the NER Plugin

    +

    The NER Plugin needs a specific environment and various additional libraries with special dependencies. The plugin is therefore switched off by default in the release versions to avoid conflicts.

    +

    If you want to test the plugin:

    + +

    If problems occur when running the plugin, check the log file 01_log.txt in the NER_results folder (see below).
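Whether the additional libraries are present can be checked before the plugin is activated. The following is only a minimal sketch, assuming the plugin needs the `transformers`, `requests` and `bs4` packages (the modules imported in `pyNER.py`). The helper name `missing_ner_libraries` is hypothetical and not part of ttw, and the check says nothing about versions or dependencies.

```python
import importlib.util


def missing_ner_libraries(required=("transformers", "requests", "bs4")):
    """Return the names of top-level packages that cannot be found.

    Hypothetical helper, not part of ttw: it only looks the packages up
    and neither imports them nor tests their versions.
    """
    return [name for name in required
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_ner_libraries()
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All checked libraries were found.")
```

An empty list only means that the packages were found. The quick check in `basic_NER_lib_check()` still runs when the plugin button is pressed.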

    +

    NER functions

    +

    In this version the default settings (e. g. the model) are predefined and cannot be changed. The focus lies on the extraction of location names.
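How location names can be rebuilt from token-level model output is sketched below. The sketch assumes the output format handled in `pyNER.py`: a list of dictionaries with the keys `entity` and `word`, the labels `B-LOC` and `I-LOC`, and WordPiece continuation tokens marked with `##`. Other models may use different labels. The function name `merge_location_tokens` is hypothetical. Unlike `return_location_names()` in the patch, this simplified variant keeps the order of first appearance.

```python
def merge_location_tokens(ner_results, labels=("B-LOC", "I-LOC")):
    """Join WordPiece sub-tokens of location entities into whole names.

    Simplified, hypothetical sketch: every token without the "##" prefix
    starts a new name, every "##" token is appended to the previous name.
    """
    names = []
    for entry in ner_results:
        if entry["entity"] not in labels:
            continue  # Skip persons, organisations and so on
        word = entry["word"]
        if word.startswith("##"):
            piece = word[2:]
            if names:
                names[-1] += piece   # Continuation of the previous name
            else:
                names.append(piece)  # No previous name to attach to
        else:
            names.append(word)
    # Remove duplicates while keeping the order of first appearance
    return list(dict.fromkeys(names))


sample = [
    {"entity": "B-LOC", "word": "Per"},
    {"entity": "I-LOC", "word": "##gam"},
    {"entity": "I-LOC", "word": "##on"},
    {"entity": "B-PER", "word": "Carl"},
    {"entity": "B-LOC", "word": "Athens"},
    {"entity": "B-LOC", "word": "Athens"},
]
print(merge_location_tokens(sample))
```

As in the patch, names consisting of several words are not joined, so each word is treated as a separate name.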

    + +

    That means that the NER Plugin must be run before preparing the file "04_ToSearchAndReplaceList.csv":

    +
      +
    1. Run the NER Plugin first. Its results will be saved in the file "NER_results\02_Gazetteer_IDs_DRAFT.csv".

    2. +
    3. Since the quality of the iDAI.gazetteer query results is still low in this test version, you need to revise the result manually.

    4. +
    5. Copy the entries you have approved and selected into "04_ToSearchAndReplaceList.csv".

    6. +
    7. After having prepared the other mandatory .csv files run ttw.

    8. +
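The manual revision of the draft list relies on the "(*NOT LIKELY*)" marker. How such draft rows can be derived from a query result is sketched below. The sketch assumes a result shaped like the one handled in `call_gazetteer()`: a dictionary with a `result` list whose items may carry `prefName`, `types` and `@id`. The sample data, the identifier values and the function name `draft_csv_rows` are made up for illustration, and no request is sent to iDAI.gazetteer.

```python
# Entity types that are treated as plausible matches
LIKELY_TYPES = {"archaeological-site", "archaeological-area", "populated-place"}


def draft_csv_rows(place_name, query_result):
    """Build pipe-separated draft rows ("name|ID") for one place name.

    Hypothetical sketch: items without a preferred name or ID are skipped,
    items of other types get the "(*NOT LIKELY*)" marker.
    """
    rows = []
    for item in query_result.get("result", []):
        title = item.get("prefName", {}).get("title")
        item_id = item.get("@id")
        if not title or not item_id:
            continue
        likely = bool(LIKELY_TYPES & set(item.get("types", [])))
        label = place_name if likely else place_name + "(*NOT LIKELY*)"
        rows.append(label + "|" + item_id)
    return rows


# Made-up sample data, not a real iDAI.gazetteer response
sample_result = {
    "total": 3,
    "result": [
        {"prefName": {"title": "Pergamon"},
         "types": ["archaeological-site"],
         "@id": "https://example.org/place/1"},
        {"prefName": {"title": "Pergamon Museum"},
         "types": ["building-institution"],
         "@id": "https://example.org/place/2"},
        {"prefName": {"title": "Entry without ID"}},
    ],
}

for row in draft_csv_rows("Pergamon", sample_result):
    print(row)
```

Only rows without the marker are candidates for "04_ToSearchAndReplaceList.csv", and they still need to be checked by hand.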