From 2002ed71a66f5665772f9d01443d22f2888c3e6c Mon Sep 17 00:00:00 2001
From: pBxr <>
Date: Sun, 20 Oct 2024 11:12:11 +0200
Subject: [PATCH] Insert NER Plugin into ttw
---
.gitignore | 5 +
README.md | 32 ++--
cpp_core/TagTool_WiZArd.dev | 2 +-
cpp_core/ttwClasses.h | 4 +-
python_frame/TagTool_WiZArd_Start.py | 209 ++++++++++++++-----------
python_frame/pyNER.py | 226 +++++++++++++++++++++++++--
python_frame/ttw_help.html | 169 ++++++++++++++++++--
7 files changed, 519 insertions(+), 128 deletions(-)
create mode 100644 .gitignore
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..6782ff8
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,5 @@
+*.docx
+
+*.bak
+
+*.exe
\ No newline at end of file
diff --git a/README.md b/README.md
index a14ac7e..18f02a1 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ ttw consists of two components:
- it also runs several integrity checks on the files
(- step by step it will also take over the functions from the `c++` core)
-2.) The `c++` core (`tagtool_v2-0-0.exe`):
+2.) The `c++` core (`tagtool_v2-1-0.exe`):
- it runs most of the main tasks
- using the `Python` framework it needs to be embedded into the framework´s main folder
- like in former releases it still can be run as a standalone application using a terminal.
@@ -57,13 +57,13 @@ pyinstaller -wF --icon="Logo.ico" TagTool_WiZArd_Start.py
Result is `TagTool_WiZArd_Start.exe`.
-3.) Create `tagtool_v2-0-0.exe` using this repo (`cpp_core`):
-A simple way to create the `tagtool_v2-0-0.exe` file from the `c++` core is to use Embarcadero Dev-C++ 6.3.:
+3.) Create `tagtool_v2-1-0.exe` using this repo (`cpp_core`):
+A simple way to create the `tagtool_v2-1-0.exe` file from the `c++` core is to use Embarcadero Dev-C++ 6.3.:
- Open the `.dev` file and add all `c++` files to your project (`main.cpp` and all header files (`.h`))
- If using Embarcadero Dev-C++ 6.3 add "`-std=c++17`" in Project Options -> Parameter s -> C++ compilers.
- Run "Rebuild all".
-Result is `tagtool_v2-0-0.exe`
+Result is `tagtool_v2-1-0.exe`
## How to setup and run
@@ -74,7 +74,7 @@ Result is `tagtool_v2-0-0.exe`
- ttw_help.html
- Logo.ico
- Logo.gif
-- tagtool_v2-0-0.exe (how to create the `.exe` file from the `c++` core see above)
+- tagtool_v2-1-0.exe (how to create the `.exe` file from the `c++` core see above)
- and the \resources folder (with all necessary files downloaded together with the ttw release)
If you create a shortcut on your desktop to start `TagTool_WiZArd_Start.exe` you don´t have to touch the ttw folder again.
@@ -90,23 +90,34 @@ For preparing the `.csv` files and all other questions how to run the applicatio
**Alternatively: Stand alone from console:**
-After compiling the binary (tagtool_v2-0-0.exe, see above) open a terminal and run "tagtool_v2-0-0.exe" either with the parameter "--help" to get further informations or together with the name of the file you want to process.
+After compiling the binary (tagtool_v2-1-0.exe, see above) open a terminal and run "tagtool_v2-1-0.exe" either with the parameter "--help" to get further information or together with the name of the file you want to process.
Be sure not to omit the `.html`-ending of the file you want to process.
-Be sure that all necessary files are saved in the **same folder** together with the `tagtool_v2-0-0.exe` file, i. e.
+Be sure that all necessary files are saved in the **same folder** together with the `tagtool_v2-1-0.exe` file, i. e.
- 01_MetadataValueList.csv
- 02_AuthorYearList.csv
- 03_ImageCreditList.csv
- 04_ToSearchAndReplaceList.csv
- article.html
-- tagtool_v2-0-0.exe
+- tagtool_v2-1-0.exe
- \resources
See "--help" to find all necessary informations to run the application in a standalone version.
For preparing the `.csv` files see `ttw_help.html`.
+## New in v2.1.0
+
+Starting with v2.1.0 `ttw` comes with a test version of a `Named Entity Recognition (NER)` Plugin option. The NER Plugin needs a specific environment and various additional libraries with special dependencies. This plugin therefore is switched off by default in the release versions to avoid conflicts. If you want to test the plugin:
+- Prepare your environment carefully, see the README.md file with the complete documentation here: https://github.com/pBxr/NER_Plugin_for_ttw.
+- Activate the plugin in the `Python` source code before re-interpreting the Python files. See `TagTool_WiZArd_Start.py` and set the `NER_Plugin_Switch` to `True`.
+This first test version does not yet address the insufficient quality of the `iDAI.gazetteer` query results or the web service's default query limit. Filter mechanisms to improve result quality are planned for forthcoming commits.
+For more information see the "Help" file and especially the documentation here: https://github.com/pBxr/NER_Plugin_for_ttw.
+
+New also:
+- Function to convert tables to XML, implemented with Beautiful Soup (therefore not available when using the console version).
+
## New in v2.0.0
-- Starting with v2.0.0 ttw comes with a GUI, based on `Python/tkinter`. Although the `c++` core can still be used as terminal standalone application (`tagtool_v2-0-0.exe`, see above), it is not recommended, because the `Python` framework does several integrity checks.
+- Starting with v2.0.0 ttw comes with a GUI, based on `Python/tkinter`. Although the `c++` core can still be used as terminal standalone application (`tagtool_v2-1-0.exe`, see above), it is not recommended, because the `Python` framework does several integrity checks.
Also new to previous versions:
- The article file and value lists no longer need to be saved in the same folder with ttw, any directory can be chosen.
@@ -157,4 +168,5 @@ Therefore new in v1.3.0: Additional mode implemented when ttw is called from web
## See also
- For ttw_webx see https://github.com/pBxr/ttw_WebExtension
-- ID_Extractor (ID_Ex) for extracting IDs and references from `.jats` article files, especially for the above mentioned journals, see https://github.com/pBxr/ID_Extractor
+- ID_Extractor (ID_Ex) for extracting IDs and references from `.jats` article files, especially for the above mentioned journals, see https://github.com/pBxr/ID_Extractor
+- Test Environment for a TagTool_WiZArD Named Entity Recognition Plugin, see https://github.com/pBxr/NER_Plugin_for_ttw.
\ No newline at end of file
diff --git a/cpp_core/TagTool_WiZArd.dev b/cpp_core/TagTool_WiZArd.dev
index e77e2d5..5063fc3 100644
--- a/cpp_core/TagTool_WiZArd.dev
+++ b/cpp_core/TagTool_WiZArd.dev
@@ -19,7 +19,7 @@ ObjectOutput=
LogOutput=
LogOutputEnabled=0
OverrideOutput=1
-OverrideOutputName=tagtool_v2-0-0.exe
+OverrideOutputName=tagtool_v2-1-0.exe
HostApplication=
UseCustomMakefile=0
CustomMakefile=
diff --git a/cpp_core/ttwClasses.h b/cpp_core/ttwClasses.h
index 3a39831..a0c2238 100644
--- a/cpp_core/ttwClasses.h
+++ b/cpp_core/ttwClasses.h
@@ -634,8 +634,8 @@ string strongEndXML_ = "";
//Global settings and switches...
-string versionNumber = "v2-0-0";
-string versionTag = "v2.0.0";
+string versionNumber = "v2-1-0";
+string versionTag = "v2.1.0";
bool firstRun=true;
bool nextRunIsSet=true;
diff --git a/python_frame/TagTool_WiZArd_Start.py b/python_frame/TagTool_WiZArd_Start.py
index 67464e3..7ec3ea0 100644
--- a/python_frame/TagTool_WiZArd_Start.py
+++ b/python_frame/TagTool_WiZArd_Start.py
@@ -17,7 +17,7 @@
from Settings import files
import pyScripts as pyScr
-import pyNER
+
class MainWindow(tkinter.Frame):
@@ -143,7 +143,7 @@ def actualize_widgets(self):
#NER plugin button
self.buttonStartNER = ttk.Button(self, text = "Open NER plugin", style = "TButton",
- command=lambda: self.set_NER_settings())
+ command=lambda: self.basic_NER_lib_check())
self.buttonStartNER["width"] = 20
self.buttonStartNER.grid(column = 3, row = heightRow1 + 2, sticky="nw")
@@ -188,6 +188,51 @@ def actualize_widgets(self):
self.buttonOpenBrowser["width"] = 25
self.buttonOpenBrowser.grid(column = 2, row = heightRow1 + 3, sticky="nw")
+ def basic_NER_lib_check(self):
+
+ if NER_Plugin_Switch == False:
+ textInfo = ("NER Plugin must be activated first.\n\n"
+ "See \"About\" -> \"Help\" for instructions.\n\n\n")
+ tkinter.messagebox.showwarning(title="ERROR", \
+ message=textInfo)
+ return
+
+ if self.files.fileName == "":
+ tkinter.messagebox.showwarning(title="ERROR", \
+ message="No file selected!")
+ self.settings.selectedFileIsReady = False
+ self.bgColorTboxArticle = "red"
+ self.actualize_widgets()
+ return
+
+ #Extensive tests are omitted; this is only a quick check
+ #of whether the necessary environment can be assumed to exist.
+ try:
+ from transformers import pipeline
+ except ModuleNotFoundError as err:
+ textInfo = ("The required NER libraries do not seem to be installed.\n\n"
+ "Check your environment.\n\n"
+ "See \"About\" -> \"Help\" for instructions.\n\n\n")
+ tkinter.messagebox.showwarning(title="ERROR", \
+ message=textInfo)
+ return
+ else:
+ yesnoResult = tkinter.messagebox.askyesno(title="Important Information", \
+ message=( "The NER Plugin that opens next must be run before "
+ "preparing the file \"04_ToSearchAndReplaceList.csv\".\n\n"
+ "So:\n"
+ "1. Run the NER Plugin first. Its results will be saved "
+ "in the file \"NER_results\\02_Gazetteer_IDs_DRAFT.csv\"\n\n"
+ "2. Copy the entries you have approved and selected into "
+ "\"04_ToSearchAndReplaceList.csv\"\n\n"
+ "3. After having prepared the other mandatory .csv files run TagTool.\n\n\n"
+ "Do you wish to continue?"
+ ))
+ if yesnoResult == False:
+ return
+ else:
+ self.set_NER_settings()
+
def create_MenuBar(self):
self.menueBarFile = tkinter.Menu(self.menu, tearoff=False)
@@ -270,9 +315,7 @@ def reset_app(self):
def run_NER_Plugin(self, window):
#In this version the default settings cannot be changed so they are hard coded
- pyNER.run_NER_process(self.files, self.settings)
-
- textInfo = "Process finished. Check result"
+ success, textInfo = pyNER.run_NER_process(self.files, self.settings)
tkinter.messagebox.showinfo(title="Info", \
message=textInfo)
@@ -348,89 +391,71 @@ def set_functions(self):
def set_NER_settings(self):
+
+ setNER_PluginWindow = tkinter.Toplevel()
+ setNER_PluginWindow.geometry('600x500')
+ setNER_PluginWindow.title('Named Entity Recognition Plugin')
+ setNER_PluginWindow.iconbitmap(self.settings.cwd+"\\Logo.ico")
+
+ #Models
+ self.groupModels = tkinter.LabelFrame(setNER_PluginWindow)
+ self.groupModels["text"] = "Models"
+ self.groupModels.grid(sticky="w", pady = 10, padx = 10)
+ self.tboxModels = Text(self.groupModels, height=len(self.settings.NER_Settings['Model']), width=70,
+ background=self.settings.colorNeutral)
+ self.tboxModels.configure(font=self.textFont)
+ self.tboxModels.grid()
+ for item in self.settings.NER_Settings['Model']:
+ self.tboxModels.insert("end", item + "\n")
+ self.tboxModels.config(state='disabled')
+
+ #Entities
+ self.groupEntities = tkinter.LabelFrame(setNER_PluginWindow)
+ self.groupEntities["text"] = "Entity Types"
+ self.groupEntities.grid(sticky="w", pady = 10, padx = 10)
+ self.tboxEntities = Text(self.groupEntities, height=len(self.settings.NER_Settings['Entity Type']), width=70,
+ background=self.settings.colorNeutral)
+ self.tboxEntities.configure(font=self.textFont)
+ self.tboxEntities.grid()
+ for item in self.settings.NER_Settings['Entity Type']:
+ self.tboxEntities.insert("end", item + "\n")
+ self.tboxEntities.config(state='disabled')
+
+ #Sources
+ self.groupSources = tkinter.LabelFrame(setNER_PluginWindow)
+ self.groupSources["text"] = "Ways of Source Text Extraction"
+ self.groupSources.grid(sticky="w", pady = 10, padx = 10)
+ self.tboxSources = Text(self.groupSources, height=len(self.settings.NER_Settings['Source']), width=70,
+ background=self.settings.colorNeutral)
+ self.tboxSources.configure(font=self.textFont)
+ self.tboxSources.grid()
+ for item in self.settings.NER_Settings['Source']:
+ self.tboxSources.insert("end", item + "\n")
+ self.tboxSources.config(state='disabled')
- if self.files.fileName == "":
- tkinter.messagebox.showwarning(title="ERROR", \
- message="No file selected!")
- self.settings.selectedFileIsReady = False
- self.bgColorTboxArticle = "red"
- self.actualize_widgets()
- else:
- tkinter.messagebox.showwarning(title="Important Information", \
- message=( "The now opening NER Plugin needs to be run before "
- "preparing the file \"04_ToSearchAndReplaceList.csv\".\n\n"
- "So:\n"
- "1. Run the NER Pluging first. Its results will be saved "
- "in the file \"NER_results\\02_Gazetteer_IDs_DRAFT.csv\"\n\n"
- "2. Copy the entries you have approved into "
- "\"04_ToSearchAndReplaceList.csv\"\n\n"
- "3. After having prepared the other mandatory files run TagTool"
- ))
-
- setNER_PluginWindow = tkinter.Toplevel()
- setNER_PluginWindow.geometry('600x500')
- setNER_PluginWindow.title('Named Entity Recognition Plugin')
- setNER_PluginWindow.iconbitmap(self.settings.cwd+"\\Logo.ico")
-
- #Models
- self.groupModels = tkinter.LabelFrame(setNER_PluginWindow)
- self.groupModels["text"] = "Models"
- self.groupModels.grid(sticky="w", pady = 10, padx = 10)
- self.tboxModels = Text(self.groupModels, height=len(self.settings.NER_Settings['Model']), width=70,
- background=self.settings.colorNeutral)
- self.tboxModels.configure(font=self.textFont)
- self.tboxModels.grid()
- for item in self.settings.NER_Settings['Model']:
- self.tboxModels.insert("end", item + "\n")
- self.tboxModels.config(state='disabled')
-
- #Entities
- self.groupEntities = tkinter.LabelFrame(setNER_PluginWindow)
- self.groupEntities["text"] = "Entity Types"
- self.groupEntities.grid(sticky="w", pady = 10, padx = 10)
- self.tboxEntities = Text(self.groupEntities, height=len(self.settings.NER_Settings['Entity Type']), width=70,
- background=self.settings.colorNeutral)
- self.tboxEntities.configure(font=self.textFont)
- self.tboxEntities.grid()
- for item in self.settings.NER_Settings['Entity Type']:
- self.tboxEntities.insert("end", item + "\n")
- self.tboxEntities.config(state='disabled')
-
- #Sources
- self.groupSources = tkinter.LabelFrame(setNER_PluginWindow)
- self.groupSources["text"] = "Ways of Source Text Extraction"
- self.groupSources.grid(sticky="w", pady = 10, padx = 10)
- self.tboxSources = Text(self.groupSources, height=len(self.settings.NER_Settings['Source']), width=70,
- background=self.settings.colorNeutral)
- self.tboxSources.configure(font=self.textFont)
- self.tboxSources.grid()
- for item in self.settings.NER_Settings['Source']:
- self.tboxSources.insert("end", item + "\n")
- self.tboxSources.config(state='disabled')
-
- #In this version the default settings cannot be changed so they are hard coded
- #Selected Settings
- self.groupSettings = tkinter.LabelFrame(setNER_PluginWindow)
- self.groupSettings["text"] = "Selected Settings"
- self.groupSettings.grid(sticky="w", pady = 10, padx = 10)
- self.tboxSettings = Text(self.groupSettings, height=3, width=70,
- background=self.settings.okGreen)
- self.tboxSettings.configure(font=self.textFont)
- self.tboxSettings.grid()
- for x, y in self.settings.NER_SettingsSet.items():
- self.tboxSettings.insert("end", x + ": " + y + "\n")
- self.tboxSettings.config(state='disabled')
-
- self.buttonRunNER = ttk.Button(setNER_PluginWindow, text = "Run NER Plugin", style = "TButton",
- command=lambda: self.run_NER_Plugin(setNER_PluginWindow))
- self.buttonRunNER.grid(sticky="e")
+ #In this version the default settings cannot be changed so they are hard coded
+ #Selected Settings
+ self.groupSettings = tkinter.LabelFrame(setNER_PluginWindow)
+ self.groupSettings["text"] = "Selected Settings"
+ self.groupSettings.grid(sticky="w", pady = 10, padx = 10)
+ self.tboxSettings = Text(self.groupSettings, height=3, width=70,
+ background=self.settings.okGreen)
+ self.tboxSettings.configure(font=self.textFont)
+ self.tboxSettings.grid()
+ for x, y in self.settings.NER_SettingsSet.items():
+ self.tboxSettings.insert("end", x + ": " + y + "\n")
+ self.tboxSettings.config(state='disabled')
+
+ self.buttonRunNER = ttk.Button(setNER_PluginWindow, text = "Run NER Plugin", style = "TButton",
+ command=lambda: self.run_NER_Plugin(setNER_PluginWindow))
+ self.buttonRunNER.grid(sticky="e")
- self.tboxInfo = Text(setNER_PluginWindow, height=1, width=70, background="#ffff66")
- self.tboxInfo.configure(font=self.textFont)
- infoText = "NOTE: In this test versions this selected settings are predefined."
- self.tboxInfo.insert("end", infoText)
- self.tboxInfo.grid(sticky = "w", pady = 10, padx = 10)
- self.tboxInfo.config(state='disabled')
+ self.tboxInfo = Text(setNER_PluginWindow, height=1, width=70, background="#ffff66")
+ self.tboxInfo.configure(font=self.textFont)
+ infoText = "NOTE: In this test version the selected settings are predefined."
+ self.tboxInfo.insert("end", infoText)
+ self.tboxInfo.grid(sticky = "w", pady = 10, padx = 10)
+ self.tboxInfo.config(state='disabled')
def show_help(self):
@@ -474,7 +499,8 @@ def start_process(self):
subprocess.run(pandocCall, stdout=FNULL, stderr=FNULL, shell=False)
#Step 2: Run ttw
- ttwCall = "\"" + self.settings.cwd + "\\tagtool_v2-0-0.exe\"" + " \""\
+ versionNumberCall = versionNumber.replace(".","-")
+ ttwCall = "\"" + self.settings.cwd + "\\tagtool_v"+versionNumberCall+".exe\"" + " \""\
+ self.files.projectPath + self.settings.target + "\""
#In case of whitespaces
@@ -500,7 +526,14 @@ def start_process(self):
root = tkinter.Tk()
global versionNumber
- versionNumber = "2.0.0"
+ versionNumber = "2.1.0"
+
+ #Here is the switch if you want to test the NER Plugin
+ global NER_Plugin_Switch
+ NER_Plugin_Switch = False
+ if NER_Plugin_Switch == True:
+ import pyNER
+
currentDirectory = os.getcwd()
titleText = "Welcome to TagToolWiZArd application " + "(v"+versionNumber+")"
root.title(titleText)
diff --git a/python_frame/pyNER.py b/python_frame/pyNER.py
index eadce78..5a31487 100644
--- a/python_frame/pyNER.py
+++ b/python_frame/pyNER.py
@@ -1,17 +1,156 @@
+import time
import subprocess
import os
+from bs4 import BeautifulSoup
-def run_NER_process(files, settings):
+#Make sure that the basic functions are not affected by this plugin if libraries are missing.
+#A quick check of the environment before running is done by the basic_NER_lib_check() function in the main file.
+try:
+ import json
+ import requests
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+ from transformers import pipeline
+except:
+ pass
+
+
+class log_NER_Class:
+
+ def __init__(self, files):
+
+ self.files = files
+
+ self.actualTime = time.localtime()
+ self.year, self.month, self.day = self.actualTime[0:3]
+ self.hour, self.minute, self.second = self.actualTime[3:6]
+
+ self.NERresultPath = self.files.projectPath+"NER_results/"
+
+ if not os.path.exists(self.NERresultPath):
+ os.makedirs(self.NERresultPath)
+
+ self.logCollector = []
+
+ self.logCollector.append(f"Log file: {self.year:4d}-{self.month:02d}-{self.day:02d}_{self.hour:02d}:{self.minute:02d}:{self.second:02d}\n")
+
+ def add_to_log(self, logInput):
+
+ self.logCollector.append(logInput)
+
+ def save_log(self):
+
+ with open(self.NERresultPath + "01_log.txt", 'w', encoding="utf8") as fp:
+
+ for logEntry in self.logCollector:
+ fp.write(logEntry)
+ #print(logEntry)
+ fp.close()
+
+
+ def save_results(self, resultJSONList, resultForCSVList):
+
+ self.save_log()
+
+ #Now the .csv list
+ intro=("Place name|Suggested ID\nNote|\"(*NOT LIKELY*)\" means, "
+ "that the entity is not of type \"archaeological-site\", "
+ "\"archaeological-area\" or \"populated-place\".\n")
+
+ resultForCSVList_sorted = sorted(resultForCSVList)
+
+ with open(self.NERresultPath + "02_Gazetteer_IDs_DRAFT.csv", 'w', encoding="utf8") as fp:
+ fp.write(intro + "\n")
+ #print(intro)
+ for item in resultForCSVList_sorted:
+ fp.write(item + "\n")
+ #print(item)
+ fp.close()
+
+ #Now the complete .json file
+ with open(self.NERresultPath + "03_Gazetteer_result_detailed.json", 'w', encoding="utf8") as fp:
+ resultJSON = json.dumps(resultJSONList,
+ indent=4, sort_keys=False,
+ separators=(',', ': '), ensure_ascii=False)
+ fp.write(resultJSON)
+ fp.close()
+
+
+def call_gazetteer(results, logGenerator):
+
+ listGazetteer = []
+ listForCSV = []
+ csvRow = ""
+ logGenerator.add_to_log("\n3. iDAI.gazetteer query result")
+
+ for result in results:
+
+ toBeRun = filter_NER_results(result) #Decides whether the entry will be run or not
+
+ if toBeRun == True:
+ #Simplest way to call the gazetteer, for testing purposes only; more elaborate filters will follow.
+ #See also the README.md file here https://github.com/pBxr/NER_Plugin_for_ttw on this point.
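+ #Pass the name as a query parameter so that spaces and special characters get URL-encoded
+ toSearch = "https://gazetteer.dainst.org/search.json"
+ response = requests.get(toSearch, params={"q": result})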
+ resultListComplete = response.json()
+
+ logGenerator.add_to_log(f"\n--------------------------------------------------------------\nSearching in iDAI.gazetteer for \"{result}\"\n")
+
+ logGenerator.add_to_log(f"Number of results: {resultListComplete['total']}\n")
+
+ resultList = resultListComplete['result']
+ i=1
+
+ for item in resultList:
+ if item['prefName']['title']:
+ logGenerator.add_to_log(f"Nr. {i}: Preferred Name: {item['prefName']['title']}\n")
+
+ if "types" in item:
+ logGenerator.add_to_log("Type: ")
+
+ for entry in item['types']:
+ logGenerator.add_to_log(entry +", ")
+ logGenerator.add_to_log("\n")
+
+ if "@id" in item:
+ logGenerator.add_to_log(item['@id'])
+
+ if item['prefName']['title'] and "@id" in item:
+
+ if ("types" in item) and ('archaeological-area' in item['types']
+ or 'populated-place' in item['types']
+ or 'archaeological-site' in item['types']):
+
+ csvRow = result + "|" + item['@id']
+ logGenerator.add_to_log("\n")
+ listForCSV.append(csvRow)
+ else:
+ result2 = result + "(*NOT LIKELY*)"
+ csvRow = result2 + "|" + item['@id']
+ logGenerator.add_to_log("\n")
+
+ listForCSV.append(csvRow) #To save only the needed entries
+ i+=1
+
+ logGenerator.add_to_log("\n--------------------------------------------------------------\n")
+ toSearch=""
+
+ listGazetteer.append(resultListComplete) #To save the complete result
+
+ return listGazetteer, listForCSV
+
+
+def filter_NER_results(result):
"""
- This is only the first step to get a plugin integrated into ttw.
- The complete NER functions will be inserted during the next commits.
+ This is only a simple placeholder for a more elaborate function.
"""
+ if len(result) > 3:
+ return True
+ else:
+ return False
- inputText = prepare_folder_and_input_Text(files, settings)
-
-def prepare_folder_and_input_Text(files, settings):
+def prepare_folder_and_input_text(files, settings):
#Prepare folder
pathNERresults = files.projectPath + "NER_results"
@@ -19,20 +158,19 @@ def prepare_folder_and_input_Text(files, settings):
if not os.path.exists(pathNERresults):
os.makedirs(pathNERresults)
- #Convert text to the chosen input format for NER pipeline
+ #Convert text to the selected input format
if settings.NER_SettingsSet['Source'] == 'Convert .docx to .txt and get text':
pandocParameter = "00_Plain_article_text.txt"
else:
pandocParameter = "00_Plain_article_text.html"
- #Put together the command to call pandoc to convert the .docx file into the chosen format
+ #Put together the pandoc call to convert the .docx file into the selected format and save it
pandocCall = "pandoc -o " + "\"" + pathNERresults + "\\" + pandocParameter + "\"" + " " + "\"" + files.projectPath + files.fileName + "\""
FNULL = open(os.devnull, 'w') #For subprocess
subprocess.run(pandocCall, stdout=FNULL, stderr=FNULL, shell=False)
- #Return the plain text to NER.
- #If a structured file is chosen, text gets extracted with bs4
+ #Return the plain text for the pipeline. If a structured format like .html is selected, text gets extracted with bs4.
plainTextPath = pathNERresults + "\\" + pandocParameter
if settings.NER_SettingsSet['Source'] == 'Convert .docx to .txt and get text':
@@ -47,4 +185,70 @@ def prepare_folder_and_input_Text(files, settings):
#Remove blank lines
inputText = str(text).replace('\n\n','')
return inputText
-
+
+
+def return_location_names(nerResults, logGenerator):
+
+ listNames = []
+ logGenerator.add_to_log("\n1. NER result:\n")
+ for result in nerResults:
+ logGenerator.add_to_log(str(result)+"\n")
+
+ for entry in nerResults:
+ #"I-LOC" and "B-LOC" are specific for the selected model. If you use another model check the entity types.
+ if entry['entity'] == "I-LOC" or entry['entity'] == "B-LOC":
+
+ if "##" in entry['word']:
+ toInsert = entry['word'].replace("##", "")
+ else:
+ toInsert = "%%"+ entry['word']
+ listNames.append(toInsert)
+
+ result = ''.join(listNames)
+ #Drop empty fragments; list.remove('') would raise ValueError if no empty string is present
+ locationNamesRaw = [name for name in result.split("%%") if name]
+ locationNames = list(set(locationNamesRaw))
+
+ logGenerator.add_to_log("\n2. Extracted entities de-tokenized\n")
+ for entry in locationNames:
+ logGenerator.add_to_log(entry +", ")
+ logGenerator.add_to_log("\n")
+ return locationNames
+
+
+def run_NER_process(files, settings):
+
+ logGenerator = log_NER_Class(files)
+
+ inputText = prepare_folder_and_input_text(files, settings)
+
+ selectedModell = settings.NER_SettingsSet['Model']
+ try:
+ #Now run NER
+ tokenizer = AutoTokenizer.from_pretrained(selectedModell)
+ model = AutoModelForTokenClassification.from_pretrained(selectedModell)
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer)
+
+ nerResults = nlp(inputText)
+
+ #Now extract names, get iDAI.gazetteer entries and save log and results
+ extractedLocationNames = return_location_names(nerResults, logGenerator)
+ resultJSONList, resultForCSVList = call_gazetteer(extractedLocationNames, logGenerator)
+ logGenerator.save_log()
+ logGenerator.save_results(resultJSONList, resultForCSVList)
+
+ except:
+ textInfo = ("An unexpected problem occurred while running the NER pipeline.\n\n"
+ "Check your environment.\n\n"
+ "See \"About\" -> \"Help\" for instructions.\n\n\n")
+ logGenerator.add_to_log(textInfo)
+ logGenerator.save_log()
+ return False, textInfo
+
+ else:
+ textInfo = "Process finished. Check result"
+ return True, textInfo
+
+
+
+
diff --git a/python_frame/ttw_help.html b/python_frame/ttw_help.html
index 1f17850..d932b64 100644
--- a/python_frame/ttw_help.html
+++ b/python_frame/ttw_help.html
@@ -153,23 +153,30 @@
TagTool_WiZArd Help
v2.0.0
+v2.1.0
Before starting TagToolWiZArd (ttw) 1
-Preparation of the article file 2
-Preparation of the .csv files 3
- -Metadata list (01_MetadataValueList.csv) 4
-Bibliography (02_AuthorYearList.csv) 4
-Illustration credits and captions (03_IllustrationCreditList.csv) 5
-To search and replace (04_ToSearchAndReplaceList.csv) 5
- - - - - - - + +Before starting TagToolWiZArd (ttw) 2
+Preparation of the article file 3
+Preparation of the .csv files 4
+ +Metadata list (01_MetadataValueList.csv) 4
+Bibliography (02_AuthorYearList.csv) 5
+Illustration credits and captions (03_IllustrationCreditList.csv) 5
+To search and replace (04_ToSearchAndReplaceList.csv) 6
+Notes for function “Additional search and replace” 6
+ + + + + + + +Named Entity Recognition (NER) Plugin Option 9
+Requirements for using the NER Plugin 9
+ +1.) ttw runs only on Windows. MacOS, Linux and so on are not supported yet.
2.) Install pandoc on your machine. TagTool_WiZArd (ttw) is tested with version pandoc 2.16.2, other versions may cause problems.
@@ -183,6 +190,8 @@tagtool_v2-0-0.exe (or later versions)
and the \resources folder (with all necessary files downloaded together with the ttw release)
4.) Create a shortcut on your desktop to start TagTool_WiZArd_Start.exe. In this case you don´t have to touch the ttw folder again.
@@ -190,8 +199,14 @@Make sure that the 4 mandatory value lists are in the same folder together with the article file:
01_MetadataValueList.csv
02_AuthorYearList.csv
03_ImageCreditList.csv
04_ToSearchAndReplaceList.csv
For the preparation of the .csv files see below.
@@ -200,9 +215,17 @@Arrangement of the article sections: To avoid errors in the automatic paragraph numbering, it is recommended to place all technical sections or information at the end of the article and not to use headlines for them, e.g.
Abstract
Keywords
Bibliography/references
Illustrations credits/captions
Address/affiliations.
Also important:
If the chapter headlines are not marked with MS Word style sheet headline templates, it may lead to errors in the automatic paragraph numbering. It is therefore recommended to use MS Word style sheet headline templates, which has the additional advantage that the tool then automatically sets the correct style sheet needed for the typesetting process.
Paragraph numbers may be set by mistake in catalog sections or other unusually formatted sections. If the manuscript contains too many such sections or special formatting, it is not recommended to use the tool; prepare the manuscript manually instead.
The tool can process simple tables (for output to .html).
Foreign-language texts (Greek, Cyrillic) are no problem, if unicode fonts are used (Calibri, Noto or similar).
Figure references within the text:
Only figure references that are in brackets and begin directly after the opening bracket – i.e. "(Abb.", "(Fig." or "(Figs." etc. – are marked automatically. All other variants are ignored and must be marked manually.
Make sure that there is no other content in the brackets apart from the figure references (e.g. references to inventory numbers or similar) or drag these manually behind the bracket: "(Fig. 6, K 93.301.4)" → "(Fig. 6) (K 93.301.4)".
References to figure ranges are automatically resolved by ttw, as this is necessary for the viewer output, i.e. "(Fig. 2-5. 8-10)" will be resolved to "(Fig. 2. 3. 4. 5. 8. 9. 10)". In addition to normal hyphens between the numbers, ttw should also recognize dashes.
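The range resolution described above can be illustrated with a short Python sketch. It is not ttw's actual implementation; the helper name `expand_figure_ranges` and the regular expression are assumptions made purely for illustration.

```python
import re

# Hypothetical helper, not part of ttw: matches "2-5" style ranges and
# accepts a hyphen, an en dash (\u2013) or an em dash (\u2014) between numbers.
RANGE = re.compile(r"(\d+)\s*[-\u2013\u2014]\s*(\d+)")


def expand_figure_ranges(reference: str) -> str:
    """Expand every numeric range in a figure reference into single numbers."""

    def _expand(match: re.Match) -> str:
        start, end = int(match.group(1)), int(match.group(2))
        if end < start:
            return match.group(0)  # leave reversed ranges untouched
        return ". ".join(str(n) for n in range(start, end + 1))

    return RANGE.sub(_expand, reference)


print(expand_figure_ranges("(Fig. 2-5. 8-10)"))  # (Fig. 2. 3. 4. 5. 8. 9. 10)
```

References without a range, such as "(Fig. 6)", pass through unchanged.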
Prepare each file, including its categories, following the example files in the folder \resources:
01_MetadataValueList_TEMPLATE.csv
02_AuthorYearList_EXAMPLE.csv,
03_IllustrationCreditList_EXAMPLE.csv
04_ToSearchAndReplaceList_EXAMPLE.csv.
Use the option "Save as .csv" for the conversion. Be sure that:
the character encoding ("character set"/"Zeichensatz") is set to "Unicode (UTF-8)",
the separating character ("field delimiter"/"Feldtrenner") is set to "|",
that no "string delimiter"/"Zeichenketten-Trenner" is entered (to avoid possible conflicts with similar characters).
Important:
All 4 .csv value lists are mandatory.
In case you do not intend special alterations (references, illustrations, search and replace) you can use the value lists 02, 03 and 04 with no entries.
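The export settings above (UTF-8, "|" as field delimiter, no string delimiter) can be checked with a few lines of Python before running ttw. This is only a sketch under those assumptions; the function name `read_value_list` is hypothetical and not part of ttw.

```python
import csv
import io


def read_value_list(stream):
    """Return the rows of a '|'-separated value list.

    Quoting is disabled because the lists are saved without a string
    delimiter, so quotation marks are kept as ordinary characters.
    """
    reader = csv.reader(stream, delimiter="|", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]  # skip blank lines


# In-memory sample using the example entry for 04_ToSearchAndReplaceList.csv
sample = io.StringIO("Olympia|https://gazetteer.dainst.org/place/2281840\n")
print(read_value_list(sample))
```

For a real file, the stream would be opened with `open(path, encoding="utf-8", newline="")`; a row with an unexpected number of columns then points to a wrong delimiter or a stray "|" in the text.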
The list needs to contain two columns, separated with the pipe symbol (“|”), using the following schema:
##Source list to replace ##|##Replace with##
Example: ###_Insert volume number_###|42
That means:
Open "01_MetadataValueList_TEMPLATE.csv" in the "resources" directory (see above).
Enter the metadata of the journal and the article.
Save the file in the same folder where you have saved the article file as "01_MetadataValueList.csv"
Please note:
@@ -263,25 +314,39 @@The list needs to contain three columns, separated with the pipe symbol (“|”) using following schema:
Author/Year|Full citation|Identifier
Example: Bury 1932|R. G. Bury, The Symposium of Plato (Cambridge MA 1932)|https://zenon.dainst.org/Record/000182245
That means:
Open "02_AuthorYearList.csv" in the "resources" directory (see above).
Enter the necessary information.
If no identifier is entered in the "Identifier" column, the tool automatically sets a note during the conversion.
Save the file in the same folder where you have saved the article file as "02_AuthorYearList.csv".
Please note:
ttw tags the references directly in the article file. Therefore the bibliography/references must also remain in the article in a congruent version.
Only articles using the author-year-schema are suitable for ttw. If this is not the case, it is not recommended to use ttw.
In some cases, the automatic tagging does not work if special characters appear in the short quotation and/or the full quotation (superscript numbers etc.). These passages can easily be identified after the conversion because the style sheet has not been applied.
The list must contain four columns, separated by the pipe symbol (“|”), using the following schema:
Figure number|Captions|Path of source file (for xml-Version)|Copyright/Illustration credit
Example: Fig. 1|Plan of Heraion with excavation area 2010–2013 marked in yellow|C:/Data/Article/image.jpg|All rights reserved
That means:
@@ -290,7 +355,11 @@Copy the captions and the information on the illustration credits into the file.
The "Path of source file (for xml version only)" only needs to be entered if an XML output is chosen. In the case of .html output, this column can remain empty.
Save the file in the same folder where you have saved the article file as "03_IllustrationCreditList.csv"
Please note:
@@ -301,43 +370,77 @@The list must contain two columns, separated by the pipe symbol (“|”), using the following schema:
##Search##|##Replace with##
Example: Olympia|https://gazetteer.dainst.org/place/2281840
The purpose of this function is mainly to set tagged hyperlinks. Other and especially more complex find-replace operations should be done in other applications for better reliability.
If a plain URL is entered as the replacement string, the tool automatically creates the complete tagged link, prepared for the chosen output format (.html or .xml).
To avoid adding links to every occurrence of the search expression, the "@" character can be used as a prefix (both in the search expression in the .csv file and in the article text). The tool will remove this prefix when creating the tagged link.
See "04_ToSearchAndReplaceList_EXAMPLE.csv" in the folder \resources.
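The effect of the "@" prefix can be illustrated with a small sketch. It is not ttw's implementation (the actual replacement is part of the tool itself); the function names and the plain `<a href>` markup are assumptions chosen for illustration, and the tags ttw really writes for .html or .xml output may look different.

```python
import html


def make_link(label, url):
    # Hypothetical tag layout; ttw's real output markup may differ.
    return f'<a href="{html.escape(url, quote=True)}">{label}</a>'


def apply_search_replace(text, search, replacement):
    """Replace every occurrence of `search` in `text`.

    A plain URL as replacement is turned into a link whose label is the
    search expression without a leading "@".
    """
    label = search[1:] if search.startswith("@") else search
    if replacement.startswith(("http://", "https://")):
        replacement = make_link(label, replacement)
    return text.replace(search, replacement)
```

With the search expression `@Olympia`, only the occurrences marked as `@Olympia` in the article text are linked, and the prefix disappears from the result; unmarked occurrences of `Olympia` stay untouched.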
Unusual formatting in the footnote section (such as manual paragraph marks, tables, indentations, blank lines or similar).
Remnants of previous formatting that remain (sometimes hidden) in the footnotes or elsewhere in the text, such as section breaks (a frequent problem when opening Mac MS Word documents in Windows MS Word or LibreOffice).
Additional content in the brackets of the figure references (see above).
When preparing the .csv lists:
Blank lines at the beginning, at the end or in the middle of the list. (ttw tries to correct this, but success is not guaranteed.)
02_AuthorYearList.csv: A field entry is missing (either the full citation belonging to a short citation or vice versa), or a field entry has whitespace at its beginning or end. (ttw tries to correct this, but success is not guaranteed.)
03_IllustrationCreditList.csv: A field entry is missing in the column "Figure number", "Captions" or "Copyright/Illustration credit".
When converting to .csv, LibreOffice sometimes inserts paragraph marks in case the table itself was not prepared correctly or incorrect settings were selected when saving. This step should therefore be checked carefully. Tip: Sometimes it is more reliable to create the source table in a Word table, convert it to text with the appropriate separator and then save the result as a .csv file.
When starting ttw the following functions are chosen by default (recommended):
Apply custom DAI citation style features (not yet fully implemented)
Set customized journal body tags
Set figure references tags
Set author year tags
Set paragraph numbers
Insert tagged illustration credits section
Search and replace by using the separate value list (see remarks below)
Output format will be HTML.
Please note:
You can change the functions and/or the output format using the ttw menu bar or the buttons.
It is not recommended to change the functions if you are not familiar with ttw.
ttw also carries out a simple integrity check on the 4 .csv value lists:
If a .csv list contains fewer than the expected number of columns/cells (see the templates above) or empty rows within the document, ttw throws an error and the tool will not be ready to run. In this case check the files and reload the article.
If one of the .csv lists ends with an empty line, ttw throws a warning, i. e. the tool will be ready to run, but if the process fails it is recommended to update the value lists.
Although ttw removes leading and trailing whitespace from the cell entries, it is recommended to prepare the .csv lists carefully.
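The rules of this integrity check can be sketched as follows, for example to test a list before loading it into ttw. The function name `check_value_list` and the message texts are hypothetical; the sketch only mirrors the rules described above and is not the check ttw itself performs.

```python
def check_value_list(lines, expected_columns):
    """Return (errors, warnings) for the lines of one pipe-delimited list."""
    errors, warnings = [], []

    # Empty lines at the very end of the list only lead to a warning.
    end = len(lines)
    while end > 0 and not lines[end - 1].strip():
        end -= 1
    if end < len(lines):
        warnings.append("empty line at end of list")

    for number, line in enumerate(lines[:end], start=1):
        if not line.strip():
            # An empty row inside the list is an error.
            errors.append(f"line {number}: empty row")
            continue
        cells = line.split("|")
        if len(cells) < expected_columns:
            errors.append(
                f"line {number}: {len(cells)} of {expected_columns} columns"
            )
    return errors, warnings
```

An empty list yields neither errors nor warnings, which matches the rule that the value lists 02, 03 and 04 may be used with no entries.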
Important:
In most cases this function will only create a non-valid .xml version.
Manual completion is necessary, especially concerning the section endings. Check especially the marker "<!-- CHECK POSITION OF CLOSING TAG-->" that is set in the xml file automatically.
Table conversion is not implemented yet.
If the files are written successfully:
Open MS Word and open the edited .html-file.
Save it immediately as a .docx-file
Important:
For a correct representation in MS Word the folder "\__ress" needs to be in the same folder as the .html article file.
The NER Plugin needs a specific environment and various additional libraries with special dependencies. This plugin therefore is switched off by default in the release versions to avoid conflicts.
+If you want to test the plugin:
+Prepare your environment carefully, see the README.md file with the complete documentation here: https://github.com/pBxr/NER_Plugin_for_ttw.
Activate the plugin in the Python source code before re-interpreting the Python files, see the README.md file here: https://github.com/pBxr/TagTool_WiZArd.
In case of problems when running the plugin check the log.txt file (see below).
+In this version the default settings (e. g. the model) are predefined and cannot be changed. The focus is on the extraction of location names.
+The plugin gets the plain text from the selected text file and saves it in a folder ("NER_results") in the selected project directory.
It extracts a tokenized list of result entries that are then re-merged into a list of place names.
The place names are run through the iDAI.gazetteer webservice to identify the locations and extract the gazetteer-IDs.
A log file with the tokenized results, a draft for the final .csv list and the complete gazetteer query result as .json file will be saved in the NER_results folder.
That means that the NER Plugin must be run before preparing the file "04_ToSearchAndReplaceList.csv":
+Run the NER Plugin first. Its results will be saved in the file "NER_results\02_Gazetteer_IDs_DRAFT.csv".
Since the quality of the iDAI.gazetteer query results is still low in this test version, you need to revise the result manually.
Copy the entries you have approved and selected into "04_ToSearchAndReplaceList.csv".
After having prepared the other mandatory .csv files run ttw.
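The re-merging of tokenized results into place names mentioned above can be pictured with the following sketch. It rests on assumptions that the help text does not confirm: that each result entry is a dictionary with a "word" and an "entity" key, that subword pieces start with "##", and that locations are labelled "B-LOC"/"I-LOC" (the layout commonly produced by token classification models). The function name `merge_location_tokens` is hypothetical and the plugin's own merging logic may differ.

```python
def merge_location_tokens(tokens):
    """Merge token-level results into place names, keeping first-seen order."""
    names, current = [], ""
    for token in tokens:
        word, label = token["word"], token["entity"]
        if not label.endswith("LOC"):
            # Any non-location token closes the name that is being built.
            if current:
                names.append(current)
                current = ""
            continue
        if word.startswith("##"):
            # Subword piece: attach it directly to the previous piece.
            current += word[2:]
        elif label.startswith("I-") and current:
            # Further word of the same multi-word name.
            current += " " + word
        else:
            # A new name starts; store the previous one first.
            if current:
                names.append(current)
            current = word
    if current:
        names.append(current)
    # Remove duplicates while preserving the order of first appearance.
    return list(dict.fromkeys(names))
```

The resulting list is the kind of input that would then be sent to the iDAI.gazetteer query; the manual revision of the query results remains necessary in any case.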