GitHub - s-nlp/multilingual_detox: ACL SRW "Exploring cross-lingual textual style transfer with large multilingual language models"

Exploring Cross-lingual Textual Style Transfer with Large Multilingual Language Models

This repository contains the code for the ACL SRW paper "Exploring Cross-lingual Textual Style Transfer with Large Multilingual Language Models" by Daniil Moskovskiy, Daryna Dementieva, and Alexander Panchenko.

Setup

Step 1: Install dependencies

pip install -r requirements.txt

Step 2: Run experiments

Multilingual setup:

python mt5_trainer.py \
    --batch_size 16 \
    --use_russian 1 \
    --max_steps 40000 \
    --learning_rate 1e-5 \
    --output_dir trained_models

Cross-lingual setup:

python mt5_trainer.py \
    --batch_size 16 \
    --use_russian 0 \
    --max_steps 40000 \
    --learning_rate 1e-5 \
    --output_dir trained_models
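
The two commands above differ only in the value of `--use_russian`. As a rough illustration of that interface, the sketch below parses the same flag names with `argparse`. It is an assumption about how such a command line could be handled, not a copy of `mt5_trainer.py`: the argument types and defaults are guessed from the example values, and `describe_setup` is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flag names used in the commands above (types assumed)."""
    parser = argparse.ArgumentParser(description="Illustrative trainer CLI sketch")
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--use_russian", type=int, choices=[0, 1], default=0)
    parser.add_argument("--max_steps", type=int, default=40000)
    parser.add_argument("--learning_rate", type=float, default=1e-5)
    parser.add_argument("--output_dir", type=str, default="trained_models")
    return parser


def describe_setup(args: argparse.Namespace) -> str:
    """Hypothetical helper: name the setup implied by --use_russian."""
    return "multilingual" if args.use_russian == 1 else "cross-lingual"


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try.
example_args = build_parser().parse_args(["--use_russian", "1", "--batch_size", "16"])
print(describe_setup(example_args), vars(example_args))
```

Restricting `--use_russian` to 0 or 1 makes a mistyped value fail at parse time rather than partway through a long training run.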

Step 3: Generate detoxifications for test data

For Russian, use the test.tsv file from data/russian_data/. For English, use data/english_data/test_toxic_parallel.txt.
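
To inspect the two test files before running inference, a minimal sketch like the following may help. It assumes the Russian file is tab-separated with a header row and that the English file holds one sentence per line; neither layout is documented here, so both are assumptions, and the column name passed to `read_tsv_column` has to be checked against the actual file.

```python
import csv
from pathlib import Path

# File locations as named in this README, relative to the repository root.
RUSSIAN_TEST = Path("data") / "russian_data" / "test.tsv"
ENGLISH_TEST = Path("data") / "english_data" / "test_toxic_parallel.txt"


def read_tsv_column(path, column):
    """Return the values of one named column from a TSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row[column] for row in reader]


def read_lines(path):
    """Return the non-empty lines of a plain-text file, without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]
```

For example, `read_lines(ENGLISH_TEST)` would return the English test sentences as a list of strings, and `read_tsv_column(RUSSIAN_TEST, name)` the Russian ones, where `name` is whichever header the file actually uses.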

Example for Russian:

python inference.py \
    --model_name mbart \
    --model_path mbarts/mbart_10000_EN_RU \
    --language ru

and for English:

python inference.py \
    --model_name mbart \
    --model_path mbarts/mbart_10000_EN_RU \
    --language en
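
Since the Russian and English invocations differ only in `--language`, both can be driven from one small script. The sketch below is illustrative: it builds the argument lists using only the flags shown above, and the helper names, the `dry_run` switch, and the restriction to `ru` and `en` are assumptions rather than part of this repository.

```python
import subprocess
import sys


def build_inference_command(language, model_path, model_name="mbart"):
    """Assemble the argument list for inference.py from the flags shown above."""
    if language not in ("ru", "en"):
        raise ValueError(f"unsupported language: {language!r}")
    return [
        sys.executable, "inference.py",
        "--model_name", model_name,
        "--model_path", model_path,
        "--language", language,
    ]


def run_for_languages(model_path, languages=("ru", "en"), dry_run=True):
    """Build one command per language; execute them only when dry_run is False."""
    commands = [build_inference_command(lang, model_path) for lang in languages]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if inference.py exits non-zero.
            subprocess.run(command, check=True)
    return commands


# Dry run: print the commands without launching inference.
for command in run_for_languages("mbarts/mbart_10000_EN_RU"):
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems, and the default dry run lets you check the commands before starting a long job.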

Step 4: Calculate metrics

Example for Russian:

python evaluate_ru.py \
    --result_filename results_en \
    --input_dir mbarts/mbart_10000_EN_RU \
    --output_dir mbarts

Data

For English, we use the ParaDetox parallel detoxification corpus; please cite the original paper and see the original ParaDetox repository for details. For Russian, we use the RuDetox corpus from the RUSSE Detoxification Competition; please cite the competition if you use the data.

Citation for English data:

@inproceedings{logacheva-etal-2022-paradetox,
    title = "{P}ara{D}etox: Detoxification with Parallel Data",
    author = "Logacheva, Varvara  and
      Dementieva, Daryna  and
      Ustyantsev, Sergey  and
      Moskovskiy, Daniil  and
      Dale, David  and
      Krotova, Irina  and
      Semenov, Nikita  and
      Panchenko, Alexander",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.469",
    pages = "6804--6818",
    abstract = "We present a novel pipeline for the collection of parallel data for the detoxification task. We collect non-toxic paraphrases for over 10,000 English toxic sentences. We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic-neutral sentence pairs. We release two parallel corpora which can be used for the training of detoxification models. To the best of our knowledge, these are the first parallel datasets for this task. We describe our pipeline in detail to make it fast to set up for a new language or domain, thus contributing to faster and easier development of new parallel resources. We train several detoxification models on the collected data and compare them with several baselines and state-of-the-art unsupervised approaches. We conduct both automatic and manual evaluations. All models trained on parallel data outperform the state-of-the-art unsupervised models by a large margin. This suggests that our novel datasets can boost the performance of detoxification systems.",
}

Citation for Russian data:

@article{russe2022detoxification,
  title={RUSSE-2022: Findings of the First Russian Detoxification Task Based on Parallel Corpora},
  author={Dementieva, Daryna and Nikishina, Irina and Logacheva, Varvara and Fenogenova, Alena and Dale, David and Krotova, Irina and Semenov, Nikita and Shavrina, Tatiana and Panchenko, Alexander},
  booktitle={Computational Linguistics and Intellectual Technologies},
  year={2022}
}

Results

The main results are shown in the table below.

Here are some examples of generated text.

Citation

@inproceedings{DBLP:conf/acl/MoskovskiyDP22,
  author    = {Daniil Moskovskiy and
               Daryna Dementieva and
               Alexander Panchenko},
  editor    = {Samuel Louvan and
               Andrea Madotto and
               Brielen Madureira},
  title     = {Exploring Cross-lingual Text Detoxification with Large Multilingual
               Language Models},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational
               Linguistics: Student Research Workshop, {ACL} 2022, Dublin, Ireland,
               May 22-27, 2022},
  pages     = {346--354},
  publisher = {Association for Computational Linguistics},
  year      = {2022},
  url       = {https://aclanthology.org/2022.acl-srw.26},
  timestamp = {Thu, 19 May 2022 16:52:59 +0200},
  biburl    = {https://dblp.org/rec/conf/acl/MoskovskiyDP22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contacts

If you find an issue, do not hesitate to report it via GitHub Issues.

Feel free to contact Daniil Moskovskiy via e-mail or Telegram.

