This project trains a Byte Pair Encoding (BPE) tokenizer tailored for the Turkish language using a combination of Parquet and text files.
1. **Clone the Repository**

   ```shell
   git clone <repository_url>
   cd turkish_tokenizer
   ```
2. **Install Dependencies**

   It's recommended to use a virtual environment.

   ```shell
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```
3. **Prepare Data**

   - Place all your `.parquet` files in `data/turkish/`.
   - Place all your additional `.txt` files in `data/turkish_texts/`.
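As a sketch of how the two data sources might be read, assuming the Parquet files carry the raw text in a `text` column (the actual column name and loading code in `main.py` may differ):

```python
from pathlib import Path


def load_text_corpus(txt_dir: str) -> list[str]:
    """Collect non-empty lines from every .txt file under txt_dir."""
    lines = []
    for path in sorted(Path(txt_dir).glob("*.txt")):
        with path.open(encoding="utf-8") as f:
            lines.extend(line.strip() for line in f if line.strip())
    return lines


def load_parquet_corpus(parquet_dir: str, text_column: str = "text") -> list[str]:
    """Collect strings from the given column of every .parquet file.

    Requires pandas with a Parquet engine (pyarrow or fastparquet) installed.
    The column name "text" is an assumption, not taken from the project.
    """
    import pandas as pd

    texts = []
    for path in sorted(Path(parquet_dir).glob("*.parquet")):
        df = pd.read_parquet(path)
        texts.extend(df[text_column].dropna().astype(str).tolist())
    return texts
```

With the directory layout above, the combined corpus would be `load_parquet_corpus("data/turkish/") + load_text_corpus("data/turkish_texts/")`.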
4. **Run the Training Script**

   ```shell
   python main.py
   ```

   The script will:

   - Load and clean the data.
   - Train the BPE tokenizer.
   - Save the tokenizer to `tokenizer/turkish_bpe_tokenizer.json`.
   - Evaluate the tokenizer to ensure quality.
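The core of BPE training is repeatedly merging the most frequent adjacent symbol pair in the corpus. A minimal, pure-Python sketch of one merge iteration (the real training in `main.py` presumably uses a tokenizer library rather than code like this):

```python
from collections import Counter


def most_frequent_pair(words: dict) -> tuple:
    """Return the most frequent adjacent symbol pair, weighted by word frequency.

    `words` maps a tuple of symbols (a segmented word) to its corpus count.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words: dict, pair: tuple) -> dict:
    """Rewrite every word so occurrences of `pair` become one merged symbol."""
    merged_symbol = "".join(pair)
    out = {}
    for symbols, freq in words.items():
        new_symbols = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(merged_symbol)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        out[key] = out.get(key, 0) + freq
    return out
```

Running this loop `VOCAB_SIZE` minus base-alphabet-size times, and recording each chosen pair, yields the merge table a BPE tokenizer applies at encoding time.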
5. **Check Logs**

   The process logs are saved in `tokenizer_training.log`. Review this file for detailed information about the training process and any potential issues.
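The log file name suggests a standard `logging` setup along these lines (a sketch; the actual level and format used in `main.py` may differ):

```python
import logging

# force=True (Python 3.8+) replaces any handlers configured earlier,
# so all subsequent log records go to the training log file.
logging.basicConfig(
    filename="tokenizer_training.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    force=True,
)
logging.info("Training started")
```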
**Configuration**

- **Vocabulary Size:** Adjust `VOCAB_SIZE` in `main.py` as needed.
- **Minimum Frequency:** Adjust `MIN_FREQUENCY` in `main.py`.
- **Special Tokens:** Modify `SPECIAL_TOKENS` in `main.py` if required.
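Taken together, the three knobs above might appear near the top of `main.py` like this (values here are illustrative, not the project's actual defaults):

```python
# Illustrative settings -- adjust to your corpus size and model needs.
VOCAB_SIZE = 32000        # target number of tokens in the final vocabulary
MIN_FREQUENCY = 2         # pairs seen fewer times than this are never merged
SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
```

A larger `VOCAB_SIZE` shortens tokenized sequences at the cost of a bigger embedding table, while raising `MIN_FREQUENCY` keeps rare, noisy merges out of the vocabulary.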