Handling new content

This folder shows how the implemented and trained (W)DSR models can be used to annotate new corpora made of raw text files. The process is illustrated in the demo notebook (demo_prediction.ipynb), and the command lines to replicate it are given below.

For the purpose of the example, two chapters from children's novels are provided in the texts folder:

Les_Aventures_de_Nono_VII.txt: Les Aventures de Nono, Jean Grave, 1901, chapter VII: LE TRAVAIL À AUTONOMIE (Wikisource link)
Encore_Heidi_09.txt: Encore Heidi, Johanna Spyri, 1882, chapter IX: On se dit adieu, mais au revoir ! (Wikisource link)

Preprocessing

Generating `.json` corpus

The .txt files are compiled into a .json corpus that contains their raw text. For each file, split is set to test, original_corpus is set to new_content, and the labels field is left empty. This can be done by running the following command line:

python -c 'from new_content_helpers import make_json_from_texts; make_json_from_texts(folder_path="texts", output_dir="new_content_preprocessed")'
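The shape of the resulting corpus can be pictured with the sketch below. It is an illustration under stated assumptions, not the repository's make_json_from_texts: the function names build_corpus and write_corpus, the file_name and text field names, the list-of-records layout, and the use of an empty list for labels are all hypothetical. Only the split, original_corpus and labels fields and their values come from the description above.

```python
import json
from pathlib import Path


def build_corpus(folder_path):
    """Collect every .txt file of a folder into a list of corpus records."""
    records = []
    for txt_file in sorted(Path(folder_path).glob("*.txt")):
        records.append({
            "file_name": txt_file.stem,                    # hypothetical field name
            "text": txt_file.read_text(encoding="utf-8"),  # hypothetical field name
            "split": "test",                   # new content is only used for prediction
            "original_corpus": "new_content",
            "labels": [],                      # assumed representation of "left empty"
        })
    return records


def write_corpus(records, output_path):
    """Save the records as UTF-8 JSON, creating the output folder if needed."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)
```

Writing with ensure_ascii=False keeps accented French characters readable in the saved file instead of escaping them.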

Generating `.tsv` files

Then, for the ML models, the tokenized files need to be generated (e.g. with spaCy tokenization):

python ../preprocessing/data_utils.py --data_dir new_content_preprocessed/new_corpus.json --output_dir new_content_preprocessed/ --do_split False --tokenizer 'spacy_tokenization'
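To give an idea of what a tokenized file holds, the sketch below splits raw text into tokens and records each token's character span, in the spirit of the token and token_idx columns of the merged table shown further down. It uses a simple regular expression as a stand-in for spaCy, so its output will not match the real tokenizer on every input, and the names TOKEN_PATTERN, tokenize_with_offsets and to_tsv_lines are hypothetical.

```python
import re

# A word (optionally followed by an apostrophe, as in French elisions such as
# "C’") or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+['’]?|[^\w\s]")


def tokenize_with_offsets(text):
    """Return (token, (start, end)) pairs with character offsets into text."""
    return [(m.group(), (m.start(), m.end())) for m in TOKEN_PATTERN.finditer(text)]


def to_tsv_lines(text):
    """Format the tokens as tab-separated lines, one token per line."""
    lines = ["token\ttoken_idx"]
    for token, span in tokenize_with_offsets(text):
        lines.append(f"{token}\t{span}")
    return lines
```

Keeping the character offsets makes it possible to map each predicted label back to its exact position in the original .txt file.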

Run (W)DSR

(W)DSR can then be run using configuration files that leverage the trained models:

python ../run_experiments.py --configs_folder new_content_configs

The config folder stores the configuration files of the best models (i.e. similar to the files in experiments_configs/best_configs), adapted to the newly created corpus.

Labelled files

The tokenized text files, together with the models' predictions, can be found in the output folder.

These files can be merged and saved in a new table with the following command line:

python -c 'from new_content_helpers import merge_predictions; merge_predictions()'

The resulting .tsv file is then stored in the output folder. It combines the predictions of the different models in a single file, as shown below:

	file	token	sentstart	token_idx	pred_regex	pred_flair	pred_transformer
11420	Les_Aventures_de_Nono_VII	—	no	(9238, 9239)	DS	DS	DS
11421	Les_Aventures_de_Nono_VII	C’	no	(9240, 9242)	DS	DS	DS
11422	Les_Aventures_de_Nono_VII	est	no	(9242, 9245)	DS	DS	DS
11423	Les_Aventures_de_Nono_VII	pour	no	(9246, 9250)	DS	DS	DS
11424	Les_Aventures_de_Nono_VII	toi	no	(9251, 9254)	DS	DS	DS
11425	Les_Aventures_de_Nono_VII	,	no	(9254, 9255)	DS	DS	DS
11426	Les_Aventures_de_Nono_VII	fit	no	(9256, 9259)	DS	O	O
11427	Les_Aventures_de_Nono_VII	Nono	no	(9260, 9264)	DS	O	O
11428	Les_Aventures_de_Nono_VII	en	no	(9265, 9267)	DS	O	O
11429	Les_Aventures_de_Nono_VII	la	no	(9268, 9270)	DS	O	O
11430	Les_Aventures_de_Nono_VII	lui	no	(9271, 9274)	DS	O	O
11431	Les_Aventures_de_Nono_VII	posant	no	(9275, 9281)	DS	O	O
11432	Les_Aventures_de_Nono_VII	sur	no	(9282, 9285)	DS	O	O
11433	Les_Aventures_de_Nono_VII	la	no	(9286, 9288)	DS	O	O
11434	Les_Aventures_de_Nono_VII	tête	no	(9289, 9293)	DS	O	O
11435	Les_Aventures_de_Nono_VII	.	no	(9293, 9294)	DS	O	O
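A merged table like this one makes it easy to spot where the models disagree. The sketch below is a minimal illustration, assuming the merged file is tab-separated with the three prediction columns shown above. The function name find_disagreements and the constant PRED_COLUMNS are hypothetical and not part of the repository.

```python
import csv
import io

# Prediction columns taken from the header of the merged table.
PRED_COLUMNS = ("pred_regex", "pred_flair", "pred_transformer")


def find_disagreements(tsv_text):
    """Return the rows for which the models do not all predict the same label."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # tokens may be quote characters; keep them literal
    )
    return [row for row in reader if len({row[col] for col in PRED_COLUMNS}) > 1]
```

On the excerpt above, this would return the ten rows from "fit" to the final ".", where the regex baseline predicts DS while the other two models predict O.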
