This project trains a Byte Pair Encoding (BPE) tokenizer tailored for the Turkish language using a combination of Parquet and text files.
1. **Clone the Repository**

   ```shell
   git clone <repository_url>
   cd turkish_tokenizer
   ```
2. **Install Dependencies**

   It's recommended to use a virtual environment.

   ```shell
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```
3. **Prepare Data**

   - Place all your `.parquet` files in `data/turkish/`.
   - Place all your additional `.txt` files in `data/turkish_texts/`.
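As a sketch of how the two data sources might be read, assuming the Parquet files carry the raw text in a `text` column (the actual column name and loading code in `main.py` may differ):

```python
from pathlib import Path


def load_text_corpus(txt_dir: str) -> list[str]:
    """Collect non-empty lines from every .txt file under txt_dir."""
    lines = []
    for path in sorted(Path(txt_dir).glob("*.txt")):
        with path.open(encoding="utf-8") as f:
            lines.extend(line.strip() for line in f if line.strip())
    return lines


def load_parquet_corpus(parquet_dir: str, text_column: str = "text") -> list[str]:
    """Collect strings from the given column of every .parquet file.

    Requires pandas with a Parquet engine (pyarrow or fastparquet) installed.
    The column name "text" is an assumption, not taken from the project.
    """
    import pandas as pd

    texts = []
    for path in sorted(Path(parquet_dir).glob("*.parquet")):
        df = pd.read_parquet(path)
        texts.extend(df[text_column].dropna().astype(str).tolist())
    return texts
```

With the directory layout above, the combined corpus would be `load_parquet_corpus("data/turkish/") + load_text_corpus("data/turkish_texts/")`.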
4. **Run the Training Script**

   ```shell
   python main.py
   ```

   The script will:

   - Load and clean the data.
   - Train the BPE tokenizer.
   - Save the tokenizer to `tokenizer/turkish_bpe_tokenizer.json`.
   - Evaluate the tokenizer to ensure quality.
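The core of BPE training is repeatedly merging the most frequent adjacent symbol pair in the corpus. A minimal, pure-Python sketch of one merge iteration (the real training in `main.py` presumably uses a tokenizer library rather than code like this):

```python
from collections import Counter


def most_frequent_pair(words: dict) -> tuple:
    """Return the most frequent adjacent symbol pair, weighted by word frequency.

    `words` maps a tuple of symbols (a segmented word) to its corpus count.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words: dict, pair: tuple) -> dict:
    """Rewrite every word so occurrences of `pair` become one merged symbol."""
    merged_symbol = "".join(pair)
    out = {}
    for symbols, freq in words.items():
        new_symbols = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(merged_symbol)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        out[key] = out.get(key, 0) + freq
    return out
```

Running this loop `VOCAB_SIZE` minus base-alphabet-size times, and recording each chosen pair, yields the merge table a BPE tokenizer applies at encoding time.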
5. **Check Logs**

   The process logs are saved in `tokenizer_training.log`. Review this file for detailed information about the training process and any potential issues.
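The log file name suggests a standard `logging` setup along these lines (a sketch; the actual level and format used in `main.py` may differ):

```python
import logging

# force=True (Python 3.8+) replaces any handlers configured earlier,
# so all subsequent log records go to the training log file.
logging.basicConfig(
    filename="tokenizer_training.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    force=True,
)
logging.info("Training started")
```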
**Configuration**

- **Vocabulary Size:** Adjust `VOCAB_SIZE` in `main.py` as needed.
- **Minimum Frequency:** Adjust `MIN_FREQUENCY` in `main.py`.
- **Special Tokens:** Modify `SPECIAL_TOKENS` in `main.py` if required.
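Taken together, the three knobs above might appear near the top of `main.py` like this (values here are illustrative, not the project's actual defaults):

```python
# Illustrative settings -- adjust to your corpus size and model needs.
VOCAB_SIZE = 32000        # target number of tokens in the final vocabulary
MIN_FREQUENCY = 2         # pairs seen fewer times than this are never merged
SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
```

A larger `VOCAB_SIZE` shortens tokenized sequences at the cost of a bigger embedding table, while raising `MIN_FREQUENCY` keeps rare, noisy merges out of the vocabulary.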