🌍GlotEval: Massively Multilingual Evaluation of Large Language Models

GlotEval is a unified evaluation toolkit for benchmarking Large Language Models (LLMs) across multiple languages and tasks. It supports text classification, machine translation, summarization, token classification, and open-ended generation, with a focus on massively multilingual coverage spanning more than 1,500 languages.

✨Key Features

🌐 Consistent Multilingual Benchmarking

  • Standardized ISO 639-3 language code alignment
  • Support for diverse language families (Bantu, Dravidian, Uralic, etc.)
  • Automatic language mapping for large-scale benchmarks

🗣️ Language-Specific Prompt Templates

  • Centralized multilingual prompt library
  • Configure prompts per language
  • Automatic prompt translation via Microsoft Translator (130+ languages)

🔁 Non-English-Centered Machine Translation

  • Evaluate translation beyond English-centric pivots
  • Support any-to-pivot and pivot-to-any directions

🧪 Multilingual Tasks

  • Text Classification: SIB-200, Taxi-1500
  • Machine Translation: Flores-200, Flores+, AmericasNLP, IN22, NTREX-128, Tatoeba, NTEU, TICO-19, MAFAND, MMHB, OpenSubtitles
  • Summarization: XLSum
  • Token Classification: WikiANN, UD
  • Comprehension: MMLU-style tasks (MMMLU, Global-MMLU)
  • Open-ended Generation: Aya, PolyWrite
  • Intrinsic Evaluation: PBC, MaLA

🤖 Model Compatibility

  • Hugging Face Transformers: for classification, tagging, etc.
  • vLLM: efficient, large-batch inference for generation tasks

📏 Rich Evaluation Metrics

  • Machine Translation: BLEU, ChrF++, COMET (see the sacrebleu sketch after this list)
  • Summarization: ROUGE-L
  • Classification: Accuracy, F1
  • Open-ended Generation: Self-BLEU, etc.
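
As a point of reference, the MT metrics can be reproduced with the sacrebleu library listed in the requirements. The snippet below is a minimal, standalone sketch with made-up hypothesis/reference strings, not GlotEval's internal scoring code:

from sacrebleu.metrics import BLEU, CHRF

# Toy hypotheses and one reference stream (illustrative strings only)
hypotheses = ["Das ist ein Test.", "Hallo Welt."]
references = [["Dies ist ein Test.", "Hallo, Welt."]]

bleu = BLEU().corpus_score(hypotheses, references)
chrf_pp = CHRF(word_order=2).corpus_score(hypotheses, references)  # ChrF++
print(round(bleu.score, 2), round(chrf_pp.score, 2))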

⚙️Requirements

  • Python 3.8+
  • PyTorch
  • Additional libraries: transformers, vllm, pandas, sacrebleu, etc.
  • Benchmark-specific data files: .conllu, .tsv, .jsonl, etc.

🚀Quickstart

1️⃣ Clone the Repository

git clone https://github.com/MaLA-LM/GlotEval
cd GlotEval

2️⃣ Set Up Environment

conda create -n gloteval python=3.9
conda activate gloteval
pip install -r requirements.txt

3️⃣ Prepare Data

  • Download benchmark data from the GitHub Releases page and place it under benchmark_dataset/, e.g.:
    • benchmark_dataset/flores200
    • benchmark_dataset/wikiann
  • Update paths in config.json if needed

4️⃣ Run an Evaluation

python main.py \
  --model_name "Qwen/Qwen2-1.5B" \
  --benchmarks xlsum sib200 \
  --params config.json \
  --output_dir results \
  --langs zho gsw por fra fin \
  --store_details \
  --efficiency_analysis

📝 Notes:

  • --model_name: Hugging Face model name or your local model path
  • --benchmarks: Choose one or more tasks
  • --params: Path to the config file specifying prompts, shots, etc.
  • --output_dir: Directory to store results
  • --langs: ISO 639-3 codes (e.g., zho=Chinese, gsw=Swiss German)
  • --store_details: Save detailed output for each sample in CSV format
  • --efficiency_analysis: Track and report token generation efficiency metrics

5️⃣ Check Results

  • Results saved under: results/<model>/<timestamp>/
  • Includes: scores.json, detailed CSVs (if enabled)

🛠️Configuration & Customization

The central configuration is in config.json, which specifies:

🔧 Model Arguments

"model_args": {
  "device": "cuda",
  "tensor_parallel_size": 1,
  "batch_size": 1,
  "dtype": "auto",
  "max_num_seqs": 256,
  "sampling_params": {
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 128,
    "stop": "\n"
  }
}
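
For orientation, the vLLM-related fields above map closely onto vLLM's own LLM constructor and SamplingParams. The sketch below shows that correspondence under the assumption that GlotEval passes these values through more or less directly; it is not the toolkit's actual loading code:

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-1.5B",       # --model_name from the CLI
    tensor_parallel_size=1,
    dtype="auto",
    max_num_seqs=256,
)
sampling = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128, stop=["\n"])
outputs = llm.generate(["Translate into French: Hello world."], sampling)
print(outputs[0].outputs[0].text)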

🧑‍🏫 Prompt Strategy

"prompt_language_strategy": "single",
"prompt_language": "eng_Latn",
  • "single": Use the same prompt in one language for all datasets
  • "multi": Use language-specific prompts when available
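
In practice the two strategies amount to how a prompt template is chosen for each evaluation language. A minimal sketch, assuming a prompt library keyed by task and language code (the dict layout and English fallback are illustrative):

def pick_prompt(prompt_library, task, eval_lang,
                strategy="single", prompt_language="eng_Latn"):
    templates = prompt_library[task]          # e.g. {"eng_Latn": "...", "zho_Hans": "..."}
    if strategy == "single":
        return templates[prompt_language]     # one prompt language for every dataset
    # "multi": prefer a language-specific prompt, fall back to English
    return templates.get(eval_lang, templates["eng_Latn"])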

🧪 Benchmark-Specific Parameters

"benchmark_params": {
  "flores200_mt": {
    "n_shots": 3,
    "seed": 42,
    "center_lang": "eng_Latn",
    "direction": "center-x"
  },
  "xlsum": {
    "n_shots": 0,
    "seed": 42
  }
}
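
The center_lang and direction fields determine which translation pairs are evaluated for a benchmark. The sketch below shows one plausible expansion into (source, target) pairs; the direction labels other than "center-x" are assumptions, not documented values:

def expand_pairs(langs, center_lang="eng_Latn", direction="center-x"):
    others = [code for code in langs if code != center_lang]
    if direction == "center-x":               # pivot-to-any
        return [(center_lang, tgt) for tgt in others]
    if direction == "x-center":               # any-to-pivot (label assumed)
        return [(src, center_lang) for src in others]
    raise ValueError(f"unknown direction: {direction}")

print(expand_pairs(["eng_Latn", "zho_Hans", "fin_Latn"]))
# [('eng_Latn', 'zho_Hans'), ('eng_Latn', 'fin_Latn')]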

📋 Task-Specific Prompt Guidelines

"prompt_guidelines": {
  "translation": {
    "required_placeholders": ["{src_text}", "{tgt_lang}"],
    "optional_placeholders": ["{src_lang}"],
    "description": "For translation tasks, the instruction template must include {src_text} and {tgt_lang}."
  }
}
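
Concretely, a translation instruction template that satisfies these guidelines could look like the example below; the wording is illustrative, only the placeholders are mandated:

template = "Translate the following {src_lang} sentence into {tgt_lang}:\n{src_text}"

prompt = template.format(
    src_lang="French",                   # optional placeholder
    tgt_lang="Finnish",                  # required
    src_text="Bonjour tout le monde.",   # required
)
print(prompt)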

🧰Utility Tools

GlotEval includes two important utility tools that enhance its multilingual capabilities:

🔠 1. Language ID Alignment

The language alignment tool standardizes language codes from various benchmarks to the ISO 639-3 format with script information (e.g., eng_Latn, zho_Hans). This enables seamless cross-benchmark language-specific evaluation.

📘 Read more about Language ID Alignment

Features:

  • Processes inconsistent language codes from benchmarks (e.g., zh, zho, cmn, Chinese)
  • Maps to standardized ISO 639-3 codes with script information
  • Automatically detects scripts using GlotScript
  • Handles special cases like CJK scripts with precise identification
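
To illustrate the idea (not the shipped alignment tables), the normalization boils down to mapping whatever code a benchmark uses onto a single <iso639-3>_<Script> label:

# Illustrative excerpt only; the real tool derives these mappings from its
# alignment data and uses GlotScript to detect the writing system.
LEGACY_TO_STANDARD = {
    "zh": "zho_Hans", "zho": "zho_Hans", "cmn": "zho_Hans", "Chinese": "zho_Hans",
    "en": "eng_Latn", "fr": "fra_Latn", "pt": "por_Latn",
}

def align_lang_code(benchmark_code: str) -> str:
    """Map a benchmark-specific code to a standardized ISO 639-3 + script label."""
    try:
        return LEGACY_TO_STANDARD[benchmark_code]
    except KeyError as err:
        raise ValueError(f"no alignment known for {benchmark_code!r}") from err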

🧾 2. Multilingual Prompt Builder

This tool helps create and manage prompts in multiple languages for all evaluation tasks.

📘 Read more about the Multilingual Prompt Builder

Features:

  • Translates prompts from a source language to 130+ target languages
  • Preserves placeholders during translation
  • Supports various prompt formats for different tasks
  • Creates a comprehensive prompt library for consistent multilingual evaluation
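
One way to keep placeholders intact through machine translation is to mask them before the API call and restore them afterwards. The sketch below illustrates that pattern; translate_fn stands in for a Microsoft Translator call and is not GlotEval's actual implementation:

import re

PLACEHOLDER = re.compile(r"\{[a-z_]+\}")

def translate_prompt(template: str, translate_fn) -> str:
    slots = PLACEHOLDER.findall(template)
    masked = template
    for i, slot in enumerate(slots):
        masked = masked.replace(slot, f"__SLOT{i}__", 1)   # opaque tokens MT leaves alone
    translated = translate_fn(masked)                      # e.g. Microsoft Translator API
    for i, slot in enumerate(slots):
        translated = translated.replace(f"__SLOT{i}__", slot, 1)
    return translated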

📤Expected Output

After running an evaluation, GlotEval produces:

🧾 1. scores.json

{
  "xlsum": {
    "zho_Hans": {
      "rouge_l_f1": 0.342
    },
    "fra_Latn": {
      "rouge_l_f1": 0.387
    }
  },
  "sib200": {
    "zho_Hans": {
      "accuracy": 0.78
    }
  }
}
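
Because scores.json is nested by benchmark and language, a few lines of Python (pandas is already in the requirements) are enough to flatten it into a table for inspection; the path below is just the example run from this README:

import json
import pandas as pd

with open("results/Qwen2-1.5B/2025-03-30_10-12-43/scores.json") as f:
    scores = json.load(f)

rows = [
    {"benchmark": bench, "language": lang, **metrics}
    for bench, langs in scores.items()
    for lang, metrics in langs.items()
]
print(pd.DataFrame(rows))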

📊 2. Detailed CSVs (if --store_details is specified)

  • Contains each sample's prompt, model output, reference, and corresponding scores
  • Useful for fine-grained error analysis and qualitative evaluation

⏱️ 3. Efficiency Report (if --efficiency_analysis is used)

  • Metrics include tokens per second, prefill and decode times, etc.

📁 Output Directory Structure

results/
  └── Qwen2-1.5B/
      └── 2025-03-30_10-12-43/
          ├── scores.json
          ├── xlsum_zho_Hans.csv
          ├── sib200_zho_Hans.csv
          └── efficiency.json

🤝Contributing

We welcome contributions! Please see the GitHub repository for guidelines and how to get involved.


📄License

GlotEval is released under the Apache-2.0 license.


📚Citation

@article{gloteval,
    title={GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models}, 
    author={Hengyu Luo and Zihao Li and Joseph Attieh and Sawal Devkota and Ona de Gibert and Shaoxiong Ji and Peiqin Lin and Bhavani Sai Praneeth Varma Mantina and Ananda Sreenidhi and Raúl Vázquez and Mengjie Wang and Samea Yusofi and Jörg Tiedemann},
    year={2025},
    journal={arXiv preprint arXiv:2504.04155},
    url={https://arxiv.org/abs/2504.04155}, 
}
