中文 | English
Due to the huge vocabulary size (151,936) of Qwen models, the Embedding and LM Head weights are excessively heavy. Therefore, this project provides a tokenizer vocabulary pruning solution for Qwen and Qwen-VL.
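To make the saving concrete, here is a conceptual sketch (with illustrative Qwen-7B-scale shapes, not this repo's actual code) of how pruning shrinks those weights: only the rows of the Embedding / LM Head that correspond to retained token IDs are kept.

```python
# Conceptual sketch only: keep the Embedding rows for surviving token IDs.
# Shapes are illustrative; the retained IDs below are hypothetical.
import torch

vocab_size, hidden_size = 151936, 4096
embedding = torch.randn(vocab_size, hidden_size)   # ~622M parameters by itself

kept_ids = torch.tensor([0, 5, 42, 151935])        # hypothetical retained token IDs
pruned_embedding = embedding[kept_ids]             # shape: (len(kept_ids), hidden_size)
```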
If my open source projects have inspired you, some sponsorship would be a great help to my subsequent open source work. ❤️🙏 (Previous Supporters)
Run the following command to install the required packages:
pip install -r requirements.txt
This tokenizer vocabulary pruning tool supports the following LLM models.
Please download your base model from the above checkpoints.
We support two types of tokenizer vocabulary pruning: lossless (with respect to the support data) and lossy (pruning to a target size).
To conduct lossless vocabulary pruning, simply run the following script with your own data/model paths.
bash prune_lossless.sh
The script will first prune the vocabulary and save it to the output path, and then check whether the old tokenizer and the new tokenizer are equivalent.
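For intuition, here is a minimal sketch of what that equivalence check verifies (assumed behavior; the actual check lives in this repo's scripts, and the paths match the arguments explained below):

```python
# Lossless pruning remaps token IDs but must preserve the segmentation:
# the decoded pieces should match one-to-one, and round-trips stay exact.
from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("../../checkpoints/Qwen-VL-Chat/", trust_remote_code=True)
new_tok = AutoTokenizer.from_pretrained("../../checkpoints/Qwen-VL-Chat-new-vocab/", trust_remote_code=True)

text = "A white cat."
old_ids, new_ids = old_tok.encode(text), new_tok.encode(text)

assert [old_tok.decode([i]) for i in old_ids] == [new_tok.decode([i]) for i in new_ids]
assert new_tok.decode(new_ids) == text
```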
Explanation of the arguments used in the script:
old_model_path="../../checkpoints/Qwen-VL-Chat/"
new_model_path="../../checkpoints/Qwen-VL-Chat-new-vocab/"
support_data="../../VLMEvalKit/raw_data/"
support_lang="" # optional (using "langdetect") e.g., support_lang="zh-cn en"
inherit_vocab_count="" # optional
Run the following bash script to conduct lossy vocabulary pruning to a target size.
bash prune_lossy.sh
This script adds an argument 'target_size', which removes the less frequent tokens and causes mismatches between the old tokenizer and the new tokenizer. Therefore, it no longer conducts the equivalence check.
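For intuition, here is a conceptual sketch of the lossy selection step (not this repo's exact algorithm): token frequencies are counted over the support data, and only the target_size most frequent IDs are kept.

```python
# Conceptual sketch: keep the target_size most frequent token IDs observed
# over the support data; everything rarer is pruned away.
from collections import Counter

def select_kept_ids(tokenizer, texts, target_size):
    counts = Counter()
    for text in texts:
        counts.update(tokenizer.encode(text))
    return {tok_id for tok_id, _ in counts.most_common(target_size)}
```

Because rare tokens are dropped, some strings no longer tokenize identically under the new tokenizer, which is why the equivalence check is skipped.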
- For support_lang, note that language detection uses the langdetect package, so please use valid language abbreviations (see the snippet after this list).
- Post-processing: for Qwen models, change SPECIAL_START_ID in tokenization_qwen.py to the size of your new tiktoken BPE file; check the printed log (see the following example).
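For reference, a quick example of how langdetect assigns the abbreviations that support_lang expects (the sample strings are arbitrary):

```python
# langdetect returns codes such as 'en' or 'zh-cn'; pass the same codes to
# support_lang. Seeding makes the otherwise nondeterministic detector stable.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0
print(detect("What a nice day"))  # -> 'en'
print(detect("今天天气不错"))       # -> 'zh-cn'
```

And a sketch of the SPECIAL_START_ID edit; 151643 is the stock Qwen value, while the replacement below is a hypothetical number. Read the real one (the new tiktoken BPE file size) from your printed log:

```python
# In tokenization_qwen.py of the pruned model:
# before (stock Qwen):
#   SPECIAL_START_ID = 151643
# after pruning, set it to the new tiktoken BPE file size from the log:
SPECIAL_START_ID = 51200  # hypothetical value; use the size printed in your log
```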
We provide sample data in "./sample_data/x.json" as an example of the support data used in vocabulary pruning. Each file contains either a dictionary with a query and a response, or a dictionary with a plain-text prompt.
Support Data Format A:
{
    "query": "Picture 1: <img>/YOUR_OWN_PATH/MMBench/demo.jpg</img>\nWhat is in the image? (This query will be tokenized with the system prompt)",
    "response": "A white cat. (This response will be directly tokenized from plain text)"
}
Support Data Format B:
{
    "prompt": "In the heart of the open sky, Where the winds of change freely sigh, A soul finds its endless flight, In the boundless realms of light. (This prompt will be directly tokenized from plain text)"
}
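For illustration, a minimal sketch of consuming such files, assuming each JSON file holds a list of dictionaries in Format A or B:

```python
# Read the support data and flatten each sample into a single text,
# matching the two formats above (list layout is an assumption).
import json

with open("./sample_data/x.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples:
    if "query" in sample:                       # Format A: query/response pair
        text = sample["query"] + sample["response"]
    else:                                       # Format B: plain text
        text = sample["prompt"]
```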
If you find this project helpful to your research, please kindly consider citing it in your publications.
@misc{tang2024tokenizerpruner,
title = {Qwen Tokenizer Pruner},
author = {Tang, Kaihua},
year = {2024},
note = {\url{https://github.com/KaihuaTang/Qwen-Tokenizer-Pruner}},
}