
Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners

📃 Paper | 📭 Contact

⛰️ Overview

This repository provides the code and models of our work "Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners". In this work, we discover and comprehensively investigate the spontaneous multilingual alignment improvement of LLMs. We find that LLMs instruction-tuned on question translation data (i.e., without annotated answers) achieve stronger alignment between English and a wide range of languages, including languages unseen during instruction-tuning. We further use different settings and mechanistic interpretability methods to comprehensively analyze LLM behavior in multilingual scenarios. Our work suggests that LLMs have great potential for improving multilingual alignment efficiently, with strong generalization across languages and tasks.

📈 Benchmarks & Datasets

We provide the benchmarks and datasets used in our experiments under ./data. The details are listed below:

Dataset | Usage | Languages | Path
Amazon Reviews Polarity | Question Translation Alignment | \ | ./data/ap_emotion
SNLI | Question Translation Alignment | \ | ./data/snli
PAWS | Question Translation Alignment | \ | ./data/paws
Amazon Reviews Polarity | Evaluation | en, zh, de, fr, es, it, nl, ja, ru, sv, sl, pl, bg, no, ms, is, hi, th, sw, bn | ./data/ap_emotion
SNLI | Evaluation | en, zh, de, fr, es, it, nl, ja, ru, sv, sl, pl, bg, no, ms, is, hi, th, sw, bn | ./data/snli
PAWS | Evaluation | en, zh, de, fr, es, it, nl, ja, ru, sv, sl, pl, bg, no, ms, is, hi, th, sw, bn | ./data/paws

🧩 Installation

To set up this repository, follow these steps:

git clone https://github.com/Shimao-Zhang/LLM-Multilingual-Learner.git
cd LLM-Multilingual-Learner
pip install -r requirements.txt

🛠️ Training

We train our models with LLaMA-Factory.

Replace the model and data paths in ./LLaMA-Factory/sft_question_single_lora.bash with the appropriate paths, and use the template that corresponds to your base model.

  • finetuning
bash ./LLaMA-Factory/sft_question_single_lora.bash

For finetuning, you can use the hyperparameters below:

#!/bin/bash

export HF_HOME=/home/huggingface_cache_path

# model_name_or_path, dataset_name, template_name, and output_dir_path are placeholders:
# replace them with your own model, dataset, template, and output paths.
# --warmup_steps should be set to roughly 1/10 of the total number of training steps.
CUDA_VISIBLE_DEVICES=0 python ./src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path model_name_or_path \
    --dataset dataset_name \
    --dataset_dir ./data \
    --template template_name \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir output_dir_path \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 2048 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps total_step/10 \
    --save_steps 150000 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --val_size 0.05 \
    --plot_loss \
    --fp16
  • merge (a conceptual sketch of this step follows below)
bash ./LLaMA-Factory/merge_lora_weights.bash
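
The merge script presumably wraps LLaMA-Factory's export utility; conceptually, the step folds the trained LoRA deltas back into the base weights. Below is a minimal, hypothetical sketch of that operation using the peft library. It is an illustration, not the repository's actual script, and all paths are placeholders.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "model_name_or_path"    # base model used for finetuning (placeholder)
adapter_path = "output_dir_path"          # the --output_dir of the finetuning run (placeholder)
merged_path = "merged_model_dir_path"     # where the standalone merged model is written (placeholder)

base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_path)
merged = model.merge_and_unload()         # fold the LoRA deltas into the base weights
merged.save_pretrained(merged_path)
AutoTokenizer.from_pretrained(base_model_path).save_pretrained(merged_path)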

📏 Evaluation & Analysis

We evaluate the model via constrained decoding and compute accuracy. To evaluate model performance, you can use the following commands. Note that before running the scripts, you should set the appropriate model_size, target_lang, and model path in the corresponding .py file. Minimal sketches of the constrained-decoding evaluation and of the logit-lens analysis are given after the list below.

  • evaluating with Amazon Reviews Polarity
cd ./scripts
bash run_emotion_eval.bash
  • evaluating with SNLI
cd ./scripts
bash run_snli_eval.bash
  • evaluating with PAWS
cd ./scripts
bash run_paws_eval.bash
  • logit lens
cd ./scripts
bash run_emotion.bash
  • Principal Component Analysis

    Run the Jupyter notebook knowledge_finding.ipynb
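
For reference, here is a minimal, self-contained sketch of the constrained-decoding evaluation idea: each candidate label word is scored by its conditional log-probability given the prompt, the highest-scoring label is taken as the prediction, and accuracy is computed against the gold labels. This is an illustration rather than the repository's exact script; the model path, prompt, and label words are hypothetical placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "merged_model_dir_path"   # placeholder: path to the evaluated (merged) model
labels = ["positive", "negative"]      # illustrative label words for a binary sentiment task

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
model.eval()

def predict(prompt: str) -> str:
    """Constrained decoding: return the candidate label with the highest log-probability."""
    scores = []
    for label in labels:
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        ids = tokenizer(prompt + " " + label, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            logits = model(ids).logits
        # positions prompt_len-1 ... L-2 predict the label tokens at positions prompt_len ... L-1
        log_probs = torch.log_softmax(logits[0, prompt_len - 1:-1], dim=-1)
        label_ids = ids[0, prompt_len:]
        scores.append(log_probs.gather(1, label_ids.unsqueeze(1)).sum().item())
    return labels[scores.index(max(scores))]

# toy accuracy computation over (prompt, gold_label) pairs
examples = [("Review: I really love this product. Sentiment:", "positive")]
accuracy = sum(predict(p) == g for p, g in examples) / len(examples)
print(f"accuracy = {accuracy:.3f}")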
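
Similarly, a generic logit-lens sketch (again an illustration, not the pipeline in run_emotion.bash): intermediate hidden states are projected through the final norm and the unembedding matrix to see which token each layer would predict. It assumes a LLaMA-style architecture where the final norm is model.model.norm; the model path and prompt are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "merged_model_dir_path"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
model.eval()

prompt = "Question: Is the sky blue? Answer:"   # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states = (embedding layer, layer 1, ..., layer N)
for layer_idx, hidden in enumerate(outputs.hidden_states):
    h = model.model.norm(hidden[:, -1, :])    # final RMSNorm (LLaMA-style models)
    logits = model.lm_head(h)                 # unembedding / LM head
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d}: {top_token!r}")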

🌲 Citation

If you find this repository helpful, feel free to cite our paper. The BibTeX entry below is taken from Google Scholar; although we have updated the title in our latest preprint version, Google Scholar has not picked up the new title yet.

For now, you can simply use the original citation information:

@article{zhang2024large,
  title={Large Language Models are Good Spontaneous Multilingual Learners: Is the Multilingual Annotated Data Necessary?},
  author={Zhang, Shimao and Gao, Changjiang and Zhu, Wenhao and Chen, Jiajun and Huang, Xin and Han, Xue and Feng, Junlan and Deng, Chao and Huang, Shujian},
  journal={arXiv preprint arXiv:2405.13816},
  year={2024}
}
