This repository contains the source code of the L3i++ team for SemEval 2023 Task 2: CoNER.
We use the dataset from SemEval 2023 Task 2: MultiCoNER II (Multilingual Complex Named Entity Recognition), which is available here. The dataset covers 12 languages (English, Spanish, Swedish, Ukrainian, Portuguese, French, Farsi, German, Chinese, Hindi, Bangla, and Italian) and is divided into three parts: train, dev, and test. Each part consists of CoNLL files, which are the input data for the model. The CoNLL files have the following format:
# id 0d88e010-c6e8-4409-9dec-a785e43eac16 domain=de
sie _ _ O
war _ _ O
die _ _ O
erste _ _ O
frau _ _ O
die _ _ O
beim _ _ O
großes _ _ B-Facility
auge _ _ I-Facility
beobachtet _ _ O
durfte _ _ O
. _ _ O
See the sample files in the public_data/DE-German/ folder.
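As an illustration of the format above, here is a minimal Python sketch that reads CoNLL lines and collects the tagged entities. It is not the repository's preprocessing code: the names `read_conll` and `extract_entities` are hypothetical, and the sketch assumes that each sentence starts with a `# id` header, that the token is the first column and the BIO tag is the last column, and that sentences may be separated by blank lines.

```python
from typing import Iterable, List, Tuple

# (sentence id, tokens, tags)
Sentence = Tuple[str, List[str], List[str]]

# The German example shown above, used as sample input.
SAMPLE = """# id 0d88e010-c6e8-4409-9dec-a785e43eac16 domain=de
sie _ _ O
war _ _ O
die _ _ O
erste _ _ O
frau _ _ O
die _ _ O
beim _ _ O
großes _ _ B-Facility
auge _ _ I-Facility
beobachtet _ _ O
durfte _ _ O
. _ _ O
"""


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL lines into (id, tokens, tags) triples."""
    sentences: List[Sentence] = []
    sent_id, tokens, tags = "", [], []
    for raw in lines:
        line = raw.strip()
        if line.startswith("# id") or not line:
            # A header or a blank line closes the sentence collected so far.
            if tokens:
                sentences.append((sent_id, tokens, tags))
                tokens, tags = [], []
            if line:
                parts = line.split()
                sent_id = parts[2] if len(parts) > 2 else ""
        else:
            fields = line.split()
            tokens.append(fields[0])   # first column: the token
            tags.append(fields[-1])    # last column: the BIO tag
    if tokens:
        sentences.append((sent_id, tokens, tags))
    return sentences


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Return (entity type, entity text) pairs from BIO tags."""
    entities: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # O tags (and I- tags that do not continue an entity) close it.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


if __name__ == "__main__":
    for sent_id, tokens, tags in read_conll(SAMPLE.splitlines()):
        print(sent_id, extract_entities(tokens, tags))
```

To read a real file, pass an open file handle instead of `SAMPLE.splitlines()`, for example `read_conll(open(path, encoding="utf-8"))`.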
Run the following command to install the required packages:
pip install -r requirements.txt
To preprocess the data, run the following command:
python ./models/preprocess.py --input_dir './public_data/DE-German/' --output_dir './preprocessed_data/' --lang 'de'
See the sample files produced by the preprocessing step in the preprocessed_data folder.
To train the model, run the following command:
python ./models/train.py --train './preprocessed_data/de-train.csv' --test './preprocessed_data/de-dev.csv' --output_dir './bart_de' --model 'bart'
You can also access the monolingual English trained model here as an example of how a model is saved.
To run inference with the model and export the results, run the following command:
python ./models/inference.py --data_path './public_data/DE-German/de_test.conll' --word_max_length 4 --model 'mbart' --model_path './best_model/' --output_path './de.pred.conll'
If you would rather not run the three commands above one by one, you can run the following commands to reproduce the results end to end:
chmod +x run.sh
./run.sh
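The contents of run.sh are not reproduced in this README. As a rough sketch of what chaining the three steps looks like, the Python snippet below builds the preprocessing, training, and inference commands for German and runs them in order, stopping at the first failure. The argument values are copied verbatim from the commands above; the function names `german_pipeline` and `run_pipeline` are hypothetical and are not part of the repository.

```python
import shlex
import subprocess
from typing import List


def german_pipeline() -> List[List[str]]:
    """Return the three README commands for German as argument lists."""
    return [
        ["python", "./models/preprocess.py",
         "--input_dir", "./public_data/DE-German/",
         "--output_dir", "./preprocessed_data/",
         "--lang", "de"],
        ["python", "./models/train.py",
         "--train", "./preprocessed_data/de-train.csv",
         "--test", "./preprocessed_data/de-dev.csv",
         "--output_dir", "./bart_de",
         "--model", "bart"],
        ["python", "./models/inference.py",
         "--data_path", "./public_data/DE-German/de_test.conll",
         "--word_max_length", "4",
         "--model", "mbart",
         "--model_path", "./best_model/",
         "--output_path", "./de.pred.conll"],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command and, unless dry_run is set, execute it in order."""
    shown: List[str] = []
    for cmd in commands:
        text = shlex.join(cmd)  # requires Python 3.8+
        print(text)
        shown.append(text)
        if not dry_run:
            # check=True raises CalledProcessError, so later steps are skipped
            # when an earlier step fails.
            subprocess.run(cmd, check=True)
    return shown


if __name__ == "__main__":
    # Dry run by default: only prints the commands.
    run_pipeline(german_pipeline(), dry_run=True)
```

Passing `dry_run=False` executes the steps, which requires the repository scripts and data to be present in the working directory.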
We will update the results after the leaderboard is released.