This is the official code implementation for the LEPISZCZE benchmark experiments. "This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish" (NeurIPS 2022) (Łukasz Augustyniak, Kamil Tagowski, Albert Sawczyn, Denis Janiak, Roman Bartusiak, Adrian Szymczak, Marcin Wątroba, Arkadiusz Janz, Piotr Szymański, Mikołaj Morzy, Tomasz Kajdanowicz, Maciej Piasecki).
LEPISZCZE benchmark resources
Name | Description | URL |
---|---|---|
Leaderboard | LEPISZCZE Leaderboard | LEPISZCZE |
Library | clarin-pl/embeddings: our library with pre-defined NLP pipelines for text classification, pair text classification, and sequence labeling tasks | GitHub |
Experiments dashboard | Weights & Biases dashboard with our experiments | W&B |
Datasets | LEPISZCZE Datasets are accessible through our HuggingFace Hub organization page. | HuggingFace |
KLEJ-Datasets | Datasets for KLEJ benchmark are accessible through Allegro HuggingFace organization page. | HuggingFace |
@inproceedings{augustyniak2022lepiszcze,
author = {Augustyniak, Lukasz and Tagowski, Kamil and Sawczyn, Albert and Janiak, Denis and Bartusiak, Roman and Szymczak, Adrian and Janz, Arkadiusz and Szyma\'{n}ski, Piotr and W\k{a}troba, Marcin and Morzy, Miko\l aj and Kajdanowicz, Tomasz and Piasecki, Maciej},
booktitle = {Advances in Neural Information Processing Systems},
editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
pages = {21805--21818},
publisher = {Curran Associates, Inc.},
title = {This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish},
url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/890b206ebb79e550f3988cb8db936f42-Paper-Datasets_and_Benchmarks.pdf},
volume = {35},
year = {2022}
}
If you have any questions or concerns about the LEPISZCZE benchmark, feel free to contact us:
- Łukasz lukasz.augustyniak@pwr.edu.pl
- Kamil kamil.tagowski@pwr.edu.pl
- Albert albert.sawczyn@pwr.edu.pl
- Denis denis.janiak@pwr.edu.pl
DVC Repository Access
Due to the size of the pipeline output data, we do not provide public access to our DVC remote repository. However, if you are interested in any of the data artifacts, don't hesitate to get in touch with us.
The repository can be set up via Poetry or Docker.
Prerequisites:
- Python: 3.9+
- Poetry [LINK].
- CUDA 11.3+ for GPU support (Recommended)
Installation
poetry install
For GPU support
poetry run poe force-torch-cuda
Building image
docker build . -f docker/Dockerfile -t lepiszcze
Once the container is set up, activate the LEPISZCZE conda environment:
conda activate LEPISZCZE
Our experiments can be reproduced with DVC, with results logged to W&B. Set up your W&B token, then run the dvc repro command.
DISCLAIMER: Reproducing the full pipeline could take more than 2000 hours on a single GPU. We advise executing stages in parallel on multiple GPUs.
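
One way to spread stages over several GPUs is sketched below. This is a minimal illustration, not the workflow used by the authors: it assumes the chosen stages do not depend on each other, that `dvc` is on the `PATH`, and that each stage respects `CUDA_VISIBLE_DEVICES`. The stage names in the example are hypothetical placeholders; the real ones are defined in the repository's DVC pipeline files.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def assign_stages(stages, gpu_ids):
    """Distribute stage names over GPUs in round-robin order."""
    queues = {gpu: [] for gpu in gpu_ids}
    for index, stage in enumerate(stages):
        queues[gpu_ids[index % len(gpu_ids)]].append(stage)
    return queues


def run_queue(gpu, stages):
    """Run one GPU's stages sequentially and return their exit codes."""
    # CUDA_VISIBLE_DEVICES restricts the child process to a single GPU.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
    return [
        subprocess.run(["dvc", "repro", stage], env=env).returncode
        for stage in stages
    ]


def run_parallel(stages, gpu_ids):
    """Start one worker per GPU; each worker processes its own queue."""
    queues = assign_stages(stages, gpu_ids)
    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        futures = {
            gpu: pool.submit(run_queue, gpu, queue)
            for gpu, queue in queues.items()
        }
        return {gpu: future.result() for gpu, future in futures.items()}


if __name__ == "__main__":
    # Hypothetical stage names, used only to show the assignment.
    print(assign_stages(["stage_a", "stage_b", "stage_c"], [0, 1]))
```

Calling `run_parallel(stage_names, [0, 1])` would then launch the actual `dvc repro` commands, one queue per GPU. Stages that share dependencies should be run in dependency order instead.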
Experiment configs can be found under configs.
DISCLAIMER: For some of the datasets, we had to manually limit the maximum sequence length to 512 for the hyperparameter search.
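
To illustrate what such a limit means for a sequence labeling sample, the sketch below cuts a token list and its labels to at most 512 items. It is an assumption-laden illustration only: the repository may apply the limit at the subword tokenizer level rather than on raw tokens, and the function name `truncate_example` is hypothetical.

```python
MAX_SEQUENCE_LENGTH = 512  # the limit stated in the disclaimer above


def truncate_example(tokens, labels, max_length=MAX_SEQUENCE_LENGTH):
    """Cut a token sequence and its labels to at most max_length items."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    # Slicing keeps shorter sequences unchanged.
    return tokens[:max_length], labels[:max_length]
```

Tokens and labels are truncated together so that each remaining token keeps its label.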
Model hyperparameter configurations can be accessed via the W&B dashboard. Example: [LINK]
dataset name | task type | input_column_name(s) | target_column_name | description |
---|---|---|---|---|
clarin-pl/kpwr-ner | sequence labeling (named entity recognition) | tokens | ner | KPWR-NER is a part of the Polish Corpus of Wrocław University of Technology (KPWr). Its objective is recognition of named entities, e.g., people, institutions etc. |
clarin-pl/polemo2-official | classification (sentiment analysis) | text | target | A corpus of consumer reviews from 4 domains: medicine, hotels, products and school. |
clarin-pl/2021-punctuation-restoration | punctuation restoration | text_in | text_out | Dataset contains original texts and ASR output. It is a part of PolEval 2021 Competition. |
clarin-pl/nkjp-pos | sequence labeling (part-of-speech tagging) | tokens | pos_tags | NKJP-POS is a part of the National Corpus of Polish. Its objective is part-of-speech tagging, e.g., nouns, verbs, adjectives, adverbs, etc. |
clarin-pl/aspectemo | sequence labeling (sentiment classification) | tokens | labels | AspectEmo Corpus is an extended version of a publicly available PolEmo 2.0 corpus of Polish customer reviews used in many projects on the use of different methods in sentiment analysis. |
laugustyniak/political-advertising-pl | sequence labeling (political advertising) | tokens | tags | First publicly open dataset for detecting specific text chunks and categories of political advertising in the Polish language. |
laugustyniak/abusive-clauses-pl | classification (abusive-clauses) | text | class | Dataset with Polish abusive clauses examples. |
allegro/klej-dyk | pair classification (question answering)* | (question, answer) | target | The Did You Know (pol. Czy wiesz?) dataset consists of human-annotated question-answer pairs. |
allegro/klej-psc | pair classification (text summarization)* | (extract_text, summary_text) | label | The Polish Summaries Corpus contains news articles and their summaries. |
allegro/klej-cdsc-e | pair classification (textual entailment)* | (sentence_A, sentence_B) | entailment_judgment | Polish sentence pairs that are human-annotated for textual entailment. |
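
The column names in the table can be used to split a record into inputs and target. The sketch below is a minimal illustration for three of the datasets. The helper names `select_columns` and `load_split` are hypothetical. The existence of a `train` split is an assumption, and depending on the installed `datasets` version, `load_dataset` may need extra arguments for some of these datasets.

```python
# Column names copied from the table above (three datasets shown).
DATASET_COLUMNS = {
    "clarin-pl/kpwr-ner": (["tokens"], "ner"),
    "clarin-pl/polemo2-official": (["text"], "target"),
    "allegro/klej-cdsc-e": (["sentence_A", "sentence_B"], "entailment_judgment"),
}


def select_columns(example, dataset_name):
    """Split one record into (inputs, target) using the table's column names."""
    input_columns, target_column = DATASET_COLUMNS[dataset_name]
    inputs = tuple(example[name] for name in input_columns)
    return inputs, example[target_column]


def load_split(dataset_name, split="train"):
    """Download one split from the HuggingFace Hub.

    Requires the `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily, optional dependency

    return load_dataset(dataset_name, split=split)
```

For example, `select_columns(load_split("clarin-pl/polemo2-official")[0], "clarin-pl/polemo2-official")` would return the review text and its sentiment label, assuming the download succeeds.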