Docker_Instructions

Get docker

Windows 7 and 8 users: download from this link
Windows 10 and later users: download (from the Stable channel) from this link
macOS users: download (from the Stable channel) from this link
Linux (Ubuntu) users: follow the instructions in this link

Pull the image by running the following command in your terminal (on Windows 7 and 8, use the Docker Quickstart Terminal; on Windows 10, use Command Prompt):

docker pull medkhem/grobid-dictionaries

If the image pull was successful, your terminal should show something similar to the screenshot below

To make sure that your setup is ready for the following steps, run this command:

docker run -it medkhem/grobid-dictionaries bash

Your terminal should look like the following screenshot

Before moving on to the next step, leave this test container by typing "exit" and pressing Enter

You need the 'toyData' directory to create dummy models. You can get it from the GitHub repository
You can now run the image with your local 'toyData' directory mounted as the container's 'resources' directory, so that the two are shared between your machine and the container. Whatever you do in one of these directories is applied to both

For macOS users:

docker run -v PATH_TO_YOUR_TOYDATA/toyData:/grobid/grobid-dictionaries/resources -p 8080:8080 -it medkhem/grobid-dictionaries bash

For Windows 7 users:

docker run -v //c/Users/YOUR_USERNAME/Desktop/toyData:/grobid/grobid-dictionaries/resources -p 8080:8080 -it medkhem/grobid-dictionaries bash

For Windows 10 Pro users:

docker run -v C:/Users/YOUR_USERNAME/Desktop/toyData:/grobid/grobid-dictionaries/resources -p 8080:8080 -it medkhem/grobid-dictionaries bash

Create/train models by running these commands

For Dictionary Segmentation model, run:

mvn generate-resources -P train_dictionary_segmentation -e

For Dictionary Body Segmentation model, run:

mvn generate-resources -P train_dictionary_body_segmentation -e

For Lexical Entry model, run:

mvn generate-resources -P train_lexicalEntries -e

For Form model, run:

mvn generate-resources -P train_form -e

For Sense model, run:

mvn generate-resources -P train_sense -e

For the first stage model of processing etymology information (EtymQuote model), run:

mvn generate-resources -P train_etymQuote -e

For the second stage model of processing etymology information (Etym model), run:

mvn generate-resources -P train_etym -e
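
If you want to train every model in one go, the commands above can be wrapped in a loop. The following is only a sketch: the function name train_all and the MVN override variable are illustrative assumptions, not part of grobid-dictionaries. It runs the same profiles in the order listed above and stops at the first failure.

```shell
# Maven profiles, in the order they are listed above.
PROFILES="train_dictionary_segmentation train_dictionary_body_segmentation train_lexicalEntries train_form train_sense train_etymQuote train_etym"

# Hypothetical helper: run each training profile in turn.
# MVN can be overridden (for example MVN=echo) to preview the commands
# without starting a training run.
train_all() {
  for profile in $PROFILES; do
    "${MVN:-mvn}" generate-resources -P "$profile" -e || return 1
  done
}
```

Inside the container, calling train_all starts the training runs; calling it as MVN=echo train_all only prints the arguments that would be passed to mvn.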

Run the web service to see the output of the models

mvn -DskipTests jetty:run-war

You can see the running application in your web browser:

On Windows 7, port 8080 must be free; the web application is then available at: http://192.168.99.100:8080
On Linux, macOS and Windows 10, the web application is available at:
http://localhost:8080

To shut down the server, press Ctrl+C

To create training data from your dictionary, copy the directory containing the PDFs for the target model and paste it under the corresponding model location in your toyData.

For Dictionary Segmentation model:

java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF  -dOut resources -exe createTrainingDictionarySegmentation

For Dictionary Body Segmentation model:

java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF  -dOut resources -exe createTrainingDictionaryBodySegmentation

For Lexical Entry model:

java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF  -dOut resources -exe createTrainingLexicalEntry

For Form model:

java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF  -dOut resources -exe createTrainingForm

For Sense model:

java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF  -dOut resources -exe createTrainingSense

For EtymQuote model:

java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF  -dOut resources -exe createTrainingEtymQuote

For Etym model:

java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF  -dOut resources -exe createTrainingEtym

If you are using macOS, you might need to remove the '.DS_Store' file, which prevents the jar from running (it is mistaken for a PDF)
Note that the choice of pages is also important: it should be varied
The above commands create training data to be annotated from scratch (files ending with tei.xml). It is also possible to generate pre-annotations using the current model, to be corrected afterwards (this mode is recommended once the model being trained has become more precise). To do so, the last token of the above commands should include Annotated. For example: createTrainingDictionarySegmentation -> createAnnotatedTrainingDictionarySegmentation
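
The notes above can be sketched as a few small shell helpers. This is a minimal sketch under stated assumptions: the function names clean_ds_store, annotated_exe and training_cmd are hypothetical, and training_cmd only prints the command so that you can inspect it before running it. The jar path and the option names are the ones used in the commands above.

```shell
# Jar path as used in the commands above.
JAR=/grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar

# Hypothetical helper: delete macOS .DS_Store files below a directory,
# so that only the PDFs remain as input.
clean_ds_store() {
  find "$1" -type f -name '.DS_Store' -delete
}

# Hypothetical helper: createTrainingX -> createAnnotatedTrainingX.
annotated_exe() {
  printf '%s\n' "$1" | sed 's/^createTraining/createAnnotatedTraining/'
}

# Hypothetical helper: print the command for one model.
# $1 = model suffix (e.g. Form), $2 = input directory,
# $3 = "annotated" to request pre-annotations from the current model.
training_cmd() {
  exe="createTraining$1"
  if [ "${3:-}" = "annotated" ]; then
    exe=$(annotated_exe "$exe")
  fi
  printf 'java -jar %s -dIn %s -dOut resources -exe %s\n' "$JAR" "$2" "$exe"
}
```

For example, clean_ds_store resources/DIRECTORY_OF_YOUR_PDF followed by training_cmd Form resources/DIRECTORY_OF_YOUR_PDF annotated prints the Form command with createAnnotatedTrainingForm as its last token.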

Annotate your files
Move your tei.xml files under your toyData/dataset/MODEL_NAME/corpus/tei directory and the rest (except rng and css files) under your toyData/dataset/MODEL_NAME/corpus/raw directory
Train the model (see "Create/train models" above)
Don't forget to put the same files under evaluation: tei.xml files under your toyData/dataset/MODEL_NAME/evaluation/tei directory and the rest (except rng and css files) under your toyData/dataset/MODEL_NAME/evaluation/raw directory. If you have carried out your annotation correctly, you should see 100% in the evaluation table displayed at the end of the model training
Run the web app to see the result
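
The file placement described in the steps above can be sketched as a small shell function. It is an illustration only: the name place_files is hypothetical, and it assumes that the generated files for one model sit together in a single directory. It copies rather than moves, so that the same files can be placed under both corpus and evaluation.

```shell
# Hypothetical helper: sort generated files into one dataset split.
# $1 = directory holding the generated files
# $2 = model directory, e.g. toyData/dataset/MODEL_NAME
# $3 = split name: corpus or evaluation
place_files() {
  src="$1"
  dest="$2/$3"
  mkdir -p "$dest/tei" "$dest/raw"
  for f in "$src"/*; do
    [ -f "$f" ] || continue                 # skip sub-directories
    case "$f" in
      *.rng|*.css) ;;                       # rng and css files are left out
      *tei.xml)    cp "$f" "$dest/tei/" ;;  # annotated TEI files
      *)           cp "$f" "$dest/raw/" ;;  # all remaining files
    esac
  done
}
```

Calling place_files GENERATED_DIR toyData/dataset/MODEL_NAME corpus and then the same command with evaluation as the last argument fills both splits, where GENERATED_DIR stands for wherever your generated files are.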
