Byte Pair Encoding (BPE) tokenizer tailored for the Turkish language

cobanov/turkish-bpe-tokenizer

Turkish BPE Tokenizer Training

This project trains a Byte Pair Encoding (BPE) tokenizer for the Turkish language on a corpus drawn from Parquet and plain-text files.

Setup Instructions

  1. Clone the Repository

    git clone <repository_url>
    cd turkish-bpe-tokenizer
  2. Install Dependencies

    It's recommended to use a virtual environment.

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
  3. Prepare Data

    • Place all your .parquet files in data/turkish/.
    • Place all your additional .txt files in data/turkish_texts/.
  4. Run the Training Script

    python main.py

    The script will:

    • Load and clean the data.
    • Train the BPE tokenizer.
    • Save the tokenizer to tokenizer/turkish_bpe_tokenizer.json.
    • Evaluate the trained tokenizer as a quality check.
  5. Check Logs

    The process logs are saved in tokenizer_training.log. Review this file for detailed information about the training process and any potential issues.
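
The "load and clean" step is not spelled out in this README, so the following is only a minimal sketch of what it could look like for the .txt inputs. The helper names clean_line and iter_text_lines are hypothetical, and the cleaning rules (Unicode NFC normalization, whitespace collapsing, dropping empty lines) are assumptions rather than the behaviour of main.py. NFC matters for Turkish because letters such as İ and ğ can arrive either precomposed or as a base letter plus a combining mark. Parquet inputs would need a third-party reader and are left out here.

```python
import re
import unicodedata
from pathlib import Path
from typing import Iterator

_WHITESPACE = re.compile(r"\s+")


def clean_line(line: str) -> str:
    """Normalize to NFC and collapse whitespace (assumed cleaning rules)."""
    normalized = unicodedata.normalize("NFC", line)
    return _WHITESPACE.sub(" ", normalized).strip()


def iter_text_lines(text_dir: str) -> Iterator[str]:
    """Yield cleaned, non-empty lines from every .txt file in text_dir."""
    for path in sorted(Path(text_dir).glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            for raw in handle:
                cleaned = clean_line(raw)
                if cleaned:  # skip lines that were blank or whitespace only
                    yield cleaned
```

With the layout from step 3, iter_text_lines("data/turkish_texts") would stream cleaned lines one at a time, so the whole corpus does not have to fit in memory before training.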

Customization

  • Vocabulary Size: Adjust VOCAB_SIZE in main.py as needed.
  • Minimum Frequency: Adjust MIN_FREQUENCY in main.py.
  • Special Tokens: Modify SPECIAL_TOKENS in main.py if required.
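
To make the effect of these three settings concrete, here is a toy, pure-Python BPE trainer. It is an illustration under stated assumptions, not the code in main.py, which may well rely on a tokenizer library instead. The constant names come from the list above; their values, the train_bpe function, and the special token strings are hypothetical.

```python
from collections import Counter

VOCAB_SIZE = 30                      # illustrative value, not the project's setting
MIN_FREQUENCY = 2                    # illustrative value, not the project's setting
SPECIAL_TOKENS = ["[UNK]", "[PAD]"]  # hypothetical token strings


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(pair[0] + pair[1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, vocab_size, min_frequency, special_tokens):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    word_counts = Counter(words)
    splits = {word: tuple(word) for word in word_counts}
    # Special tokens are reserved first, then every character seen in the corpus.
    vocab = list(special_tokens) + sorted({ch for word in word_counts for ch in word})
    merges = []
    while len(vocab) < vocab_size:
        pair_counts = Counter()
        for word, symbols in splits.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += word_counts[word]
        if not pair_counts:
            break  # every word is already a single symbol
        # Highest count wins; ties are broken by the pair itself for determinism.
        best, count = max(pair_counts.items(), key=lambda item: (item[1], item[0]))
        if count < min_frequency:
            break  # remaining pairs are too rare to merge
        merges.append(best)
        token = best[0] + best[1]
        if token not in vocab:
            vocab.append(token)
        splits = {word: merge_pair(symbols, best) for word, symbols in splits.items()}
    return vocab, merges


vocab, merges = train_bpe(
    ["ev", "ev", "evde", "evler"], VOCAB_SIZE, MIN_FREQUENCY, SPECIAL_TOKENS
)
print(merges)  # [('e', 'v')]
```

In this sketch, a larger VOCAB_SIZE allows more merges and therefore longer subword units, a higher MIN_FREQUENCY stops merging earlier by ignoring rare pairs, and the special tokens occupy the first vocabulary slots before any characters or merges are added.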

License

MIT License
