Docker_Instructions
- Get Docker
- Windows 7 and 8 users: download from this link
- Windows 10 and later users: download (from the Stable channel) from this link
- macOS users: download (from the Stable channel) from this link
- Linux (Ubuntu) users: follow the instructions in this link
- Pull the image by running the following in your terminal (on Windows 7 and 8, use the Docker Quickstart Terminal; on Windows 10, use Command Prompt)
docker pull medkhem/grobid-dictionaries
If the image pull was successful, your terminal should show output similar to the screenshot below
To make sure that you have everything needed for the following steps (3 onwards), run the following command
docker run -it medkhem/grobid-dictionaries bash
Your terminal should look like the following screenshot
Before starting step 3, exit this test container by typing "exit" and pressing Enter
- You need the 'toyData' directory to create dummy models. You can get it from the GitHub repository
- You can now run the image with your 'toyData' directory and the container's 'resources' directory shared as a volume between your machine and the container. Whatever you change in one of these directories is applied to the other
- For macOS users:
docker run -v PATH_TO_YOUR_TOYDATA/toyData:/grobid/grobid-dictionaries/resources -p 8080:8080 -it medkhem/grobid-dictionaries bash
- For Windows 7 users:
docker run -v //c/Users/YOUR_USERNAME/Desktop/toyData:/grobid/grobid-dictionaries/resources -p 8080:8080 -it medkhem/grobid-dictionaries bash
- For Windows 10 Pro users:
docker run -v C:/Users/YOUR_USERNAME/Desktop/toyData:/grobid/grobid-dictionaries/resources -p 8080:8080 -it medkhem/grobid-dictionaries bash
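The commands above differ only in the host path passed to -v. As a minimal sketch, assuming a POSIX shell on the host, the wrapper below checks that the host directory exists before starting the container. The function name run_grobid_dictionaries and the DRY_RUN variable are illustrative and not part of the project; the container path, port mapping and image name are copied from the commands above.

```shell
#!/bin/sh
# Illustrative helper (not part of grobid-dictionaries): start the container
# with a host toyData directory mounted on the container's resources directory.
run_grobid_dictionaries() {
    toydata_dir=$1
    # Refuse to start if the host directory to be mounted does not exist.
    if [ ! -d "$toydata_dir" ]; then
        echo "toyData directory not found: $toydata_dir" >&2
        return 1
    fi
    # Build the same command as in the instructions above.
    set -- docker run \
        -v "$toydata_dir:/grobid/grobid-dictionaries/resources" \
        -p 8080:8080 -it medkhem/grobid-dictionaries bash
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Dry run: only print the command that would be executed.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, DRY_RUN=1 run_grobid_dictionaries "$HOME/Desktop/toyData" prints the command without running it; without DRY_RUN=1 the container is started.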
- Create/train models by running these commands
For Dictionary Segmentation model, run:
mvn generate-resources -P train_dictionary_segmentation -e
For Dictionary Body Segmentation model, run:
mvn generate-resources -P train_dictionary_body_segmentation -e
For Lexical Entry model, run:
mvn generate-resources -P train_lexicalEntries -e
For Form model, run:
mvn generate-resources -P train_form -e
For Sense model, run:
mvn generate-resources -P train_sense -e
For the first stage model of processing etymology information (EtymQuote model), run:
mvn generate-resources -P train_etymQuote -e
For the second stage model of processing etymology information (Etym model), run:
mvn generate-resources -P train_etym -e
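The seven training commands above can also be run in one pass. The sketch below assumes it is run inside the container from the directory where the mvn commands above work, and that training data is present for every model. The names PROFILES, train_all and DRY_RUN are illustrative; the profile names are copied from the commands above.

```shell
#!/bin/sh
# Maven profiles, in the order listed in the instructions above.
PROFILES="train_dictionary_segmentation train_dictionary_body_segmentation train_lexicalEntries train_form train_sense train_etymQuote train_etym"

# Illustrative helper: train every model in turn, stopping at the first failure.
train_all() {
    for profile in $PROFILES; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            # Dry run: only print the command that would be executed.
            echo "mvn generate-resources -P $profile -e"
        else
            mvn generate-resources -P "$profile" -e || return 1
        fi
    done
}
```

Running DRY_RUN=1 train_all lists the commands first; train_all then runs them. Stopping at the first failure keeps the error of the failing model at the end of the terminal output.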
- Run the web service to see the output of the models
mvn -DskipTests jetty:run-war
You can see the running application in your web browser:
- For Windows 7, port 8080 must be free; the web application is then available at the address:
http://192.168.99.100:8080
- For Linux, macOS and Windows 10, the web application runs at the address:
http://localhost:8080
To shut down the server, press
Ctrl + C
- To create training data from your dictionary, copy the directory of PDFs for the target model into the corresponding model location under your toyData, then run the matching command below.
For Dictionary Segmentation model:
java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF -dOut resources -exe createTrainingDictionarySegmentation
For Dictionary Body Segmentation model:
java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF -dOut resources -exe createTrainingDictionaryBodySegmentation
For Lexical Entry model:
java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF -dOut resources -exe createTrainingLexicalEntry
For Form model:
java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF -dOut resources -exe createTrainingForm
For Sense model:
java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF -dOut resources -exe createTrainingSense
For EtymQuote model:
java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF -dOut resources -exe createTrainingEtymQuote
For Etym model:
java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/DIRECTORY_OF_YOUR_PDF -dOut resources -exe createTrainingEtym
- If you are using macOS, you might need to remove the '.DS_Store' file, which prevents the jar from running (it is mistaken for a PDF)
- Note that the choice of pages is also important: it should be varied
- The above commands create training data to be annotated from scratch (files ending with tei.xml). It is also possible to generate pre-annotations using the current model, to be corrected afterwards (this mode is recommended once the model being trained becomes more precise). To do so, the last token of the above commands should include Annotated. For example: createTrainingDictionarySegmentation -> createAnnotatedTrainingDictionarySegmentation
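Two of the notes above lend themselves to small helpers: removing '.DS_Store' files from a PDF directory, and deriving the Annotated variant of a command name. The sketch below assumes a POSIX shell and assumes that the Annotated naming pattern shown for Dictionary Segmentation holds for the other models as well. The function names clean_ds_store and annotated_exe are illustrative, not part of the project.

```shell
#!/bin/sh
# Illustrative helper: delete macOS .DS_Store files below the given directory
# so that they are not picked up alongside the PDFs.
clean_ds_store() {
    find "$1" -name '.DS_Store' -type f -exec rm -f {} +
}

# Illustrative helper: turn createTrainingX into createAnnotatedTrainingX.
# Assumption: every model follows the pattern of the example given above.
annotated_exe() {
    case $1 in
        createTraining*)
            echo "createAnnotatedTraining${1#createTraining}"
            ;;
        *)
            echo "unexpected command name: $1" >&2
            return 1
            ;;
    esac
}
```

For example, clean_ds_store resources/DIRECTORY_OF_YOUR_PDF can be run before the jar, and the value printed by annotated_exe createTrainingLexicalEntry can be passed to -exe in place of createTrainingLexicalEntry.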
- Annotate your files
- Move your tei.xml files under your toyData/dataset/MODEL_NAME/corpus/tei directory and the rest (except rng and css files) under your toyData/dataset/MODEL_NAME/corpus/raw directory
- Train the model (step 5)
- Don't forget to put the same files under evaluation: tei.xml files under your toyData/dataset/MODEL_NAME/evaluation/tei directory and the rest (except rng and css files) under your toyData/dataset/MODEL_NAME/evaluation/raw directory. If you have carried out your annotation correctly, you should see 100% in the evaluation table displayed at the end of the model training
- Run the web app to see the result
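The file layout described in the move and evaluation steps above can be sketched as a small shell function. It assumes that the source directory contains only the generated and annotated files for a single model, and it copies rather than moves so that the same files land under both corpus and evaluation. The function name distribute_training_files and its argument order are illustrative, not part of the project.

```shell
#!/bin/sh
# Illustrative helper: distribute_training_files SRC_DIR TOYDATA_DIR MODEL_NAME
# Copies *.tei.xml files to corpus/tei and evaluation/tei, and all other files
# (except .rng and .css) to corpus/raw and evaluation/raw.
distribute_training_files() {
    src=$1
    dataset=$2/dataset/$3
    # Create the target directories if they do not exist yet.
    for split in corpus evaluation; do
        mkdir -p "$dataset/$split/tei" "$dataset/$split/raw"
    done
    for f in "$src"/*; do
        # Skip subdirectories and the unmatched-glob case.
        [ -f "$f" ] || continue
        case $f in
            *.rng|*.css) continue ;;   # schema and stylesheet stay behind
            *.tei.xml)   sub=tei ;;
            *)           sub=raw ;;
        esac
        cp "$f" "$dataset/corpus/$sub/"
        cp "$f" "$dataset/evaluation/$sub/"
    done
}
```

A call then looks like distribute_training_files SRC_DIR PATH_TO_YOUR_TOYDATA/toyData MODEL_NAME, where SRC_DIR is a placeholder for the directory holding your annotated output. The originals are left in place and can be deleted once training succeeds.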