Hack the Tokenizer

A Python library for augmenting pretrained language model tokenizers to handle non-English languages more effectively, with a focus on Portuguese.

Motivation

Language models have revolutionized AI applications across domains, but they face significant challenges with non-English languages. Much of this disparity stems from tokenization inefficiency, which shows up in two ways:

Quality Gap: Non-English text generation often yields lower-quality output
Cost Disparity: Processing non-English text costs more because the same content requires more tokens

For example, the Llama-3 tokenizer breaks down the English sentence "This is a thesis proposal!" into 6 tokens, while the equivalent Portuguese sentence "Isto é uma proposta de tese!" produces 10 tokens. This inefficiency:

Reduces generation quality (more tokens = more opportunities for errors)
Increases API costs (pricing is typically per token)
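
The gap can be quantified as fertility, the average number of tokens per whitespace-separated word. The sketch below is illustrative only: `toy_tokenize` and its tiny English-leaning vocabulary are made-up stand-ins, not the Llama-3 tokenizer, so its counts are not expected to match the 6 and 10 quoted above. Any callable that maps a string to a list of tokens can be passed in its place.

```python
from typing import Callable, Iterable, List

# Hypothetical, English-leaning vocabulary used only for this illustration.
TOY_VOCAB = {"this", "is", "a", "thesis", "proposal", "de", "uma", "!"}


def toy_tokenize(text: str) -> List[str]:
    """Greedy longest-match tokenizer with single-character fallback."""
    tokens: List[str] = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest vocabulary entry that matches at `start`.
            for end in range(len(word), start, -1):
                if word[start:end] in TOY_VOCAB:
                    break
            else:
                end = start + 1  # unknown character becomes its own token
            tokens.append(word[start:end])
            start = end
    return tokens


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    return n_tokens / n_words if n_words else 0.0


english = fertility(["This is a thesis proposal!"], toy_tokenize)
portuguese = fertility(["Isto é uma proposta de tese!"], toy_tokenize)
print(f"English fertility:    {english:.2f}")
print(f"Portuguese fertility: {portuguese:.2f}")
```

Because the toy vocabulary contains few Portuguese pieces, most Portuguese words fall back to single characters, which mirrors the effect described above on a small scale.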

Approach

This project develops a method to augment pretrained tokenizers with language-specific tokens, improving efficiency for non-English languages without requiring full model retraining:

Token Generation: Create new language-specific tokens
Embedding Initialization: Initialize embeddings for new tokens using various strategies
Fine-tuning: Optimize the new token embeddings
Evaluation: Measure improvements using benchmarks and fertility metrics
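
A minimal sketch of the first two steps, under two assumptions that this README does not confirm: candidate tokens are picked by word frequency, and "mean" initialization averages the embeddings of the pieces the original tokenizer splits a new token into. The vocabulary, embedding values, and helper names (`select_new_tokens`, `mean_init`) are hypothetical and written in plain Python; they are not taken from the library.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def select_new_tokens(corpus: Sequence[str], vocab: Dict[str, int], num_tokens: int) -> List[str]:
    """Pick the most frequent whitespace-separated words missing from `vocab`."""
    counts = Counter(word for text in corpus for word in text.lower().split())
    candidates = [word for word, _ in counts.most_common() if word not in vocab]
    return candidates[:num_tokens]


def mean_init(token: str, tokenize: Callable[[str], List[str]],
              vocab: Dict[str, int], embeddings: List[Vector]) -> Vector:
    """Average the embeddings of the pieces the old tokenizer produces for `token`."""
    piece_ids = [vocab[piece] for piece in tokenize(token)]
    dim = len(embeddings[0])
    return [sum(embeddings[i][d] for i in piece_ids) / len(piece_ids) for d in range(dim)]


# Toy setup: a four-entry vocabulary with 2-dimensional embeddings.
vocab = {"pro": 0, "posta": 1, "te": 2, "se": 3}
embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
splits = {"proposta": ["pro", "posta"], "tese": ["te", "se"]}


def tokenize(word: str) -> List[str]:
    """Stand-in for the original tokenizer: a fixed lookup of word splits."""
    return splits[word]


corpus = ["proposta de tese", "a proposta", "tese tese proposta"]

for token in select_new_tokens(corpus, vocab, num_tokens=2):
    new_row = mean_init(token, tokenize, vocab, embeddings)
    vocab[token] = len(embeddings)  # the new id is the next free row
    embeddings.append(new_row)
    print(token, "->", new_row)
```

With a real Transformers model, the same idea would typically be applied after `tokenizer.add_tokens(...)` and `model.resize_token_embeddings(len(tokenizer))`, writing each averaged vector into the newly created embedding row.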

Installation

# Clone the repository
git clone https://github.com/LIAAD/hack_the_tokenizer.git
cd hack_the_tokenizer

# Install dependencies
pip install -r requirements.txt

# Install the package in development mode
pip install -e .

Requirements

Python 3.10+
PyTorch
Transformers
Other dependencies listed in requirements.txt

Usage

Quick Start

from hack_tokenizer.hack import ModelHacker
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create a ModelHacker instance
hacker = ModelHacker(
    dataset=["your Portuguese text corpus here"],
    batch_size=8,
    learning_rate=1e-6
)

# Hack the tokenizer with 1000 new Portuguese tokens
model, tokenizer = hacker.hack(
    model=model,
    tokenizer=tokenizer,
    encoding_tokenizer=tokenizer,
    num_tokens=1000,
    embed_initializer_method="mean",
    show_progress=True,
    train=True
)

# Save the hacked model and tokenizer
model.save_pretrained("./hacked-model")
tokenizer.save_pretrained("./hacked-tokenizer")
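
One way to sanity-check the result is to compare token counts on the same text before and after hacking. The sketch below is an assumption-laden illustration, not part of the library: `count_tokens` and `token_reduction` are hypothetical helpers that only rely on an `encode(text, add_special_tokens=False)` method, and the two stand-in classes exist solely so the example runs without downloading a model.

```python
from typing import Iterable, List


def count_tokens(tokenizer, texts: Iterable[str]) -> int:
    """Total number of token ids `tokenizer.encode` returns for `texts`."""
    return sum(len(tokenizer.encode(text, add_special_tokens=False)) for text in texts)


def token_reduction(base_tokenizer, hacked_tokenizer, texts: Iterable[str]) -> float:
    """Relative drop in token count (negative if the count grew)."""
    texts = list(texts)
    before = count_tokens(base_tokenizer, texts)
    after = count_tokens(hacked_tokenizer, texts)
    if before == 0:
        raise ValueError("the baseline tokenizer produced no tokens")
    return 1.0 - after / before


class CharTokenizer:
    """Stand-in 'original' tokenizer: one id per non-space character."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        return [ord(char) for char in text if not char.isspace()]


class WordTokenizer:
    """Stand-in 'hacked' tokenizer: one id per whitespace-separated word."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        return list(range(len(text.split())))


sample = ["Isto é uma proposta de tese!"]
reduction = token_reduction(CharTokenizer(), WordTokenizer(), sample)
print(f"Token count reduced by {reduction:.0%}")
```

In practice the stand-ins would be replaced by a freshly loaded copy of the original tokenizer and the tokenizer returned by `hacker.hack`, evaluated on held-out Portuguese text.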

CLI Usage

# Run tokenizer hacking with default parameters
python -m hack_tokenizer.hack --model microsoft/phi-2 --num-tokens 1000

# Run evaluation on hacked model
python -m hack_tokenizer.evaluation --model ./hacked-model --tokenizer ./hacked-tokenizer

Project Structure

hack_tokenizer/
├── benchmark/            # Benchmark implementations
│   ├── base.py           # Base benchmark class
│   ├── CalamePT.py       # Portuguese benchmark
│   ├── MMLU.py           # Multilingual benchmark
│   └── SuperGLUE.py      # English benchmark
├── evaluation/           # Evaluation framework
│   └── evaluation.py     # Main evaluation logic
├── hack/                 # Core tokenizer hacking functionality
│   ├── ModelHacker.py    # Model embedding manipulation
│   └── TokenizerHack.py  # Tokenizer modification
├── metrics/              # Evaluation metrics
│   ├── base.py           # Base metric class
│   ├── FertilityBoost.py # Fertility improvement metric
│   ├── FertilityInput.py # Input tokenization efficiency
│   ├── FertilityOutput.py # Output tokenization efficiency
│   └── Perplexity.py     # Language modeling quality
└── utils/                # Utility functions
    ├── cli.py            # Command-line interface
    ├── DatasetClass.py   # Dataset handling
    ├── functions.py      # Helper functions
    └── loader.py         # Model loading utilities

Results

Our approach demonstrates significant improvements for Portuguese language processing:

Token Efficiency: Reduced token count by 15-30% for Portuguese text
Generation Quality: Improved coherence and fluency in Portuguese text generation
Cost Reduction: Lower token counts translate to reduced API costs
Model Performance: Maintained or improved performance on Portuguese benchmarks

The most effective embedding initialization strategy was mean initialization, which outperformed weighted average and translation-based approaches.

Benchmarks

The project includes several benchmarks to evaluate performance:

CalamePT: Portuguese language benchmark
MMLU: Multilingual benchmark with Portuguese subset
SuperGLUE: English benchmark to verify no regression in original language

To run benchmarks:

python -m hack_tokenizer.evaluation --benchmark calamept --model ./hacked-model

Contributing

Contributions are welcome! Here are some areas for future work:

Support for additional languages beyond Portuguese
Alternative embedding initialization strategies
Integration with modern LLM deployment frameworks (e.g., vLLM)
Performance optimization for larger models

Please follow these steps to contribute:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Citation

If you use this work in your research, please cite:

@misc{pinto2025hack,
  author = {Pinto, Duarte},
  title = {Hack the Tokenizer: Augmenting Pretrained Language Models for Non-English Languages},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/LIAAD/hack_the_tokenizer}}
}

License

This project is licensed under the MIT License; see the LICENSE file for details.
