
Cannot extend the vocab of LLaMA-3 using SentencePiece anymore (unlike LLaMA-2)?!? #67

Closed
thusinh1969 opened this issue Apr 19, 2024 · 50 comments

Comments

@thusinh1969

thusinh1969 commented Apr 19, 2024

I usually extend the vocab to make the model fit the Vietnamese language better. The code is below. However, it seems that the LLaMA-3 tokenizer no longer works with SentencePiece, and even LlamaTokenizer is no longer compatible with LLaMA-3. Any hint, please?

In the meanwhile, the standard AutoTokenizer can no longer load LLaMA-3's new tokenizer.model. Any help is highly appreciated.

import os
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

def extendVocab(tokenizer, source_tokenizer_file,
                extra_vocab_model_files, output_path, reload=True, verbose=False):

    # load the current tokenizer and its SentencePiece proto
    print('Create current vocab proto...')
    source_tokenizer = tokenizer.from_pretrained(source_tokenizer_file, trust_remote_code=True)
    try:
        base_spm = sp_pb2_model.ModelProto()
        base_spm.ParseFromString(source_tokenizer.sp_model.serialized_model_proto())  ### <---- error here !
    except AttributeError:
        base_spm = source_tokenizer.get_vocab()

    for new_vocab in extra_vocab_model_files:
        # load the extra SentencePiece model and its proto
        print('Loading extra vocab file...', new_vocab)
        VN_sp_model = spm.SentencePieceProcessor()
        VN_sp_model.Load(new_vocab)

        print('Create extra vocab proto...')
        VN_spm = sp_pb2_model.ModelProto()
        VN_spm.ParseFromString(VN_sp_model.serialized_model_proto())

        # print number of tokens
        print("Source tokenizer len:", len(source_tokenizer))
        print("Extra tokenizer len:", len(VN_sp_model))
        print(source_tokenizer.all_special_tokens)
        print(source_tokenizer.all_special_ids)
        print(source_tokenizer.special_tokens_map)

        print('Adding extra vocab into current vocab ...')

        ## Add extra tokens to the current tokenizer
        spm_tokens_set = set(p.piece for p in base_spm.pieces)
        print(f"Before: {len(spm_tokens_set)}")

        for p in VN_spm.pieces:
            piece = p.piece
            if piece not in spm_tokens_set:
                if verbose:
                    print(piece)
                new_p = sp_pb2_model.ModelProto().SentencePiece()
                new_p.piece = piece
                new_p.score = 0
                base_spm.pieces.append(new_p)

        print(f"New model pieces: {len(base_spm.pieces)}")

    # save the merged SentencePiece model
    target_path_sp = "/".join(output_path.split('/')[:-1]) + "/sp"
    target_file = output_path.split('/')[-1]
    os.makedirs(target_path_sp, exist_ok=True)
    print('Saving new tokenizer sp model:', target_path_sp + "/" + target_file)
    with open(target_path_sp + "/" + target_file, 'wb') as f:
        f.write(base_spm.SerializeToString())

    # reload the merged model and save it in HF format
    print('Reloading sp model..')
    reload_extended_tokenizer = tokenizer(target_path_sp + "/" + target_file)
    hf_output_path = "/".join(output_path.split('/')[:-1]) + "/hf"
    os.makedirs(hf_output_path, exist_ok=True)
    print('Saving new tokenizer hf model ...', hf_output_path)
    reload_extended_tokenizer.save_pretrained(hf_output_path)

    text = '''Những công trình vĩ đại của bác Hồ Chí minh đã ghi dấu ấn lớn cho toàn thế giới và nhân loại. Bác là người đáng yêu.
    The primary use of LLaMA is research on large language models, including'''

    print(f"Tokenized by origin tokenizer: {source_tokenizer.tokenize(text)}")
    print(f"Tokenized by new tokenizer: {reload_extended_tokenizer.tokenize(text)}")

    print('Reloading completely new HF tokenizer ...')
    reloaded_tokenizer = tokenizer.from_pretrained(hf_output_path, trust_remote_code=True)
    print(reloaded_tokenizer)
    return reloaded_tokenizer

Thanks,
Steve

@StephennFernandes

cc @ArthurZucker

Is there a way this could be handled in hf tokenizers?

A few pointers and/or some code would really help a lot of folks.

@amitsangani

Llama 3 has an improved tokenizer based on Tiktoken, whereas Llama 2 was based on SentencePiece. The Llama 3 tokenizer expands the vocabulary size to 128k (from 32k tokens in the previous version).

https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py

Can you try AutoTokenizer instead of LlamaTokenizer?

@StephennFernandes

@amitsangani AutoTokenizer doesn't work.

The following was the go-to script for extending the tokenizer of LLaMA-2:

import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"]="python"
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--llama_tokenizer_dir', default="meta-llama/Llama-2-7b-hf", type=str)
parser.add_argument('--chinese_sp_model_file', default='./chinese_sp.model', type=str)
args = parser.parse_args()

llama_tokenizer_dir = args.llama_tokenizer_dir
chinese_sp_model_file = args.chinese_sp_model_file

# load
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_sp_model = spm.SentencePieceProcessor()
chinese_sp_model.Load(chinese_sp_model_file)

llama_spm = sp_pb2_model.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_spm = sp_pb2_model.ModelProto()
chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto())

# print number of tokens
print(len(llama_tokenizer),len(chinese_sp_model))
print(llama_tokenizer.all_special_tokens)
print(llama_tokenizer.all_special_ids)
print(llama_tokenizer.special_tokens_map)

## Add Chinese tokens to LLaMA tokenizer
llama_spm_tokens_set=set(p.piece for p in llama_spm.pieces)
print(len(llama_spm_tokens_set))
print(f"Before:{len(llama_spm_tokens_set)}")
for p in chinese_spm.pieces:
    piece = p.piece
    if piece not in llama_spm_tokens_set:
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        llama_spm.pieces.append(new_p)
print(f"New model pieces: {len(llama_spm.pieces)}")

## Save
output_sp_dir = 'merged_tokenizer_sp'
output_hf_dir = 'merged_tokenizer_hf' # the path to save Chinese-LLaMA tokenizer
os.makedirs(output_sp_dir,exist_ok=True)
with open(output_sp_dir+'/chinese_llama.model', 'wb') as f:
    f.write(llama_spm.SerializeToString())
tokenizer = LlamaTokenizer(vocab_file=output_sp_dir+'/chinese_llama.model')

tokenizer.save_pretrained(output_hf_dir)
print(f"Chinese-LLaMA tokenizer has been saved to {output_hf_dir}")


# Test
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_llama_tokenizer = LlamaTokenizer.from_pretrained(output_hf_dir)
print(tokenizer.all_special_tokens)
print(tokenizer.all_special_ids)
print(tokenizer.special_tokens_map)
text='''白日依山尽,黄河入海流。欲穷千里目,更上一层楼。
The primary use of LLaMA is research on large language models, including'''
print("Test text:\n",text)
print(f"Tokenized by LLaMA tokenizer:{llama_tokenizer.tokenize(text)}")
print(f"Tokenized by Chinese-LLaMA tokenizer:{chinese_llama_tokenizer.tokenize(text)}")

Upon changing LlamaTokenizer to AutoTokenizer and trying to extend the tokenizer on LLaMA-3, the following error occurs:

  File "/media/user/drive_2/tokenizer_extension/merge_tokenizer.py", line 21, in <module>
    llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
                              ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'sp_model'

cc @ArthurZucker does this look like an hf issue?
Currently running transformers version 4.33.1.
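
The AttributeError itself is expected: the Llama-3 checkpoint ships a fast, tiktoken-style BPE tokenizer with no SentencePiece proto behind it, so there is no sp_model to merge into. A minimal sketch for telling the two cases apart (the model id below is an assumption):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
if getattr(tok, "sp_model", None) is not None:
    # Slow, SentencePiece-backed tokenizer: the proto-merge script above applies.
    print("SentencePiece-backed tokenizer")
else:
    # Fast (BPE) tokenizer: extend it with tok.add_tokens(...) instead.
    print(f"Fast tokenizer: {type(tok).__name__}, is_fast={tok.is_fast}")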

@thusinh1969
Author

> Can you try AutoTokenizer instead of LlamaTokenizer?

I tried, but no luck. Some quick example code would help. A 128k vocab still does not cover the basic vocabulary of Vietnamese.

Thanks in advance.
Steve

@StephennFernandes

Despite setting use_fast=False when loading the Llama tokenizer with AutoTokenizer, I still get the same error.

> [quoting the previous comment in full: the LLaMA-2 merge script and the AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'sp_model' with AutoTokenizer on LLaMA-3, transformers 4.33.1]

@thusinh1969
Author

Any help please...!

@amitsangani

@osanseviero @HamidShojanazeri - any ideas on how to resolve this?

@VishnuPJ

@StephennFernandes, any update? I am also trying to do the same.

@thusinh1969
Author

I did it like this, and I am not sure whether this breaks the LLaMA-3 tokenizer or not. Please comment.

from transformers import AutoTokenizer
model_name = "/home/steve/data02/LLaMA/LLaMA-3/models/llama-3-8b-instruct/"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Check length of LLaMA-3 tokenizer
len(tokenizer)
>>> 128256

# Check tokenizing Vietnamese
tokenizer.tokenize("Tôi nhớ lắm Bác Hồ kính yêu của đạo phật") # Tôi nhớ lắm Bác Hồ kính yêu của đạo Phật
>>> ['Tôi',
 'ĠnhỼ',
 'Ġl',
 'ắm',
 'ĠB',
 'ác',
 'ĠHá»ĵ',
 'ĠkÃŃnh',
 'Ġyêu',
 'Ġcủa',
 'ĠÄijạo',
 'Ġph',
 'áºŃt']

# Check tokenizing English
tokenizer.tokenize("My English class will open in June 2024")
>>> ['My', 'ĠEnglish', 'Ġclass', 'Ġwill', 'Ġopen', 'Ġin', 'ĠJune', 'Ġ', '202', '4']

# Add all 4 new vocabs 
all_vocabs = ["/home/steve/data01/VN-vocab-model/poem_dataset/PRE-TRAINING-200G/VOCAB_COMBINED/VN-KINH-30k_unigram.model",
              "/home/steve/data01/VN-vocab-model/poem_dataset/PRE-TRAINING-200G/VOCAB_COMBINED/CN-KINH-30k_unigram.model",
              "/home/steve/data01/VN-vocab-model/VN-LLama-tokenizer_40k_36m_sp/VN40kHUGE_unigram_36m.model",
              "/home/steve/data01/VN-vocab-model/Ancient-Vocab-4157/Ancient-Vocab-4157.model"]

import sentencepiece as spm
VN_sp_model = spm.SentencePieceProcessor()
for v in all_vocabs:
    VN_sp_model.Load(v)
    vocab = [str(VN_sp_model.decode(i)) for i in range(len(VN_sp_model))]
    tokenizer.add_tokens(vocab)

# Check new length of LLaMA-3 tokenizer
len(tokenizer)
>>> 197453

# Test new tokenizer with Vietnamese
tokenizer.tokenize("Tôi nhớ lắm Bác Hồ kính yêu của đạo phật từ ngày 12/4/2019")
>>> ['Tôi',
 'Ġ',
 'nhớ',
 'Ġ',
 'lắm',
 'Ġ',
 'Bác',
 'Ġ',
 'Hồ',
 'Ġ',
 'kính',
 'Ġ',
 'yêu',
 'Ġ',
 'của',
 'Ġ',
 'đạo',
 'Ġ',
 'phật',
 'Ġ',
 'từ',
 'Ġ',
 'ngày',
 'Ġ',
 '12/4',
 '/2019']

# Test new tokenizer with the same English statement
tokenizer.tokenize("My English class will open in June 2024")
>>> ['My',
 'Ġ',
 'English',
 'Ġ',
 'class',
 'Ġ',
 'will',
 'Ġ',
 'open',
 'Ġ',
 'in',
 'Ġ',
 'June',
 'Ġ',
 '2024']

I can save the tokenizer, but reloading takes forever because the new tokens are not standard vocabulary tokens but added ones. Also, I am NOT very sure that adding tokens/words from SentencePiece training into the LLaMA-3 tiktoken tokenizer is correct either.

Please comment and give hints if any. We need a solid solution from Meta.
Steve
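
One likely reason for the stray 'Ġ' tokens in the output above (a guess, not a confirmed diagnosis): VN_sp_model.decode(i) strips SentencePiece's "▁" word-boundary marker, so the added tokens carry no leading-space information and the space before each word falls back to the base BPE vocabulary. A minimal sketch that keeps the marker and turns it into a real leading space before adding, assuming the same unigram models listed above:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("VN40kHUGE_unigram_36m.model")  # one of the vocab files listed above

pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
# "\u2581" is SentencePiece's word-boundary marker; map it to a plain space so the
# fast tokenizer's added-token matcher can absorb the preceding space as well.
# Skip SentencePiece control pieces like <unk>, <s>, </s>.
new_tokens = [p.replace("\u2581", " ") for p in pieces if not p.startswith("<")]
tokenizer.add_tokens(new_tokens)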

@ArthurZucker
Collaborator

Hi all! This is not an hf bug.
For any tokenizer that is in transformers and that you load using AutoTokenizer.from_pretrained, you can add any token using tokenizer.add_tokens(["token1", "token2"]) etc.
There is no need for complex logic, and @thusinh1969's proposal works as expected.
Reloading should not be super slow, however; that might be a bug.
One fix could be:

from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("my-new-tokenizer")
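
A minimal end-to-end sketch of that add_tokens route (the model id, token list, and output path below are illustrative, not from this thread):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
new_tokens = ["nhớ", "lắm", "Bác"]             # example Vietnamese tokens
num_added = tokenizer.add_tokens(new_tokens)   # tokens already in the vocab are skipped
print(f"Added {num_added} tokens, new size: {len(tokenizer)}")
tokenizer.save_pretrained("llama-3-extended-tokenizer")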

@StephennFernandes

Okay, that pretty much solves this.
@ArthurZucker, could you please confirm the correct way to check whether a new token from the extended tokenizer already exists in the original Llama tokenizer?

I currently do this:

from tqdm import tqdm

for p in tqdm(chinese_spm.pieces, desc="merging tokenizers"):
    piece = p.piece 
    if piece not in llama_tokenizer.vocab.keys():
        llama_tokenizer.add_tokens(piece)
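
If that loop is slow, one possible simplification (assuming the same variables as above): add_tokens accepts a whole list and already skips tokens that are in the vocabulary, so the per-piece calls can usually be collapsed into a single batched call:

existing = set(llama_tokenizer.get_vocab().keys())
candidates = [p.piece for p in chinese_spm.pieces if p.piece not in existing]
num_added = llama_tokenizer.add_tokens(candidates)  # one call instead of one per piece
print(f"Added {num_added} new tokens")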

@StephennFernandes

@amitsangani could you also share the steps to train a tiktoken tokenizer from scratch? Given that you have found better tokenizer efficiency, it would be great to train the extension vocabulary with tiktoken and then merge it into the Llama tokenizer.

@VishnuPJ

> [quoting @thusinh1969's comment above: extending the LLaMA-3 tokenizer via tokenizer.add_tokens with the SentencePiece vocab files]

I did it the way suggested by @thusinh1969.
I modified the tokenizer and resized the token embeddings using "model.resize_token_embeddings(len(tokenizer))".
But when I try to run training, I get:
"RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn"

@thusinh1969
Author

thusinh1969 commented Apr 23, 2024

> Reloading should not be super slow, however; that might be a bug. One fix could be: tok = Tokenizer.from_pretrained("my-new-tokenizer")

That gives a completely different Tokenizer object. You have to do it like this:

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast
tokenizer_new = Tokenizer.from_pretrained("thusinh1969/llama-3-VN-CN-Ancient-tokenizer")
tokenizer_new_fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer_new)

Now you can use tokenizer_new_fast as a regular tokenizer.
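
A possible follow-up (a sketch; the output path and special-token choices are assumptions): save the wrapped tokenizer once with save_pretrained so later loads can go straight through AutoTokenizer, and re-declare the special tokens on the wrapper, since PreTrainedTokenizerFast does not pick them up from the raw Tokenizer object:

from transformers import AutoTokenizer, PreTrainedTokenizerFast

tokenizer_new_fast = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer_new,
    bos_token="<|begin_of_text|>",
    eos_token="<|end_of_text|>",
)
tokenizer_new_fast.save_pretrained("llama-3-VN-extended-hf")
tokenizer_reloaded = AutoTokenizer.from_pretrained("llama-3-VN-extended-hf")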

@StephennFernandes

> [quoting @thusinh1969's tokenizer-extension comment and @VishnuPJ's report of RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn]

@thusinh1969 are you getting this issue as well when expanding the token embeddings during continual pre-training?

@thusinh1969
Author

> [quoting the exchange above, ending with @StephennFernandes's question: are you also getting the grad error when expanding the token embeddings during continual pre-training?]

No. That is a different error related to your model/training setup, probably gradient-related.

import torch
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM
from tokenizers import Tokenizer

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=getattr(torch, "bfloat16"),
                                             device_map='auto',
                                             low_cpu_mem_usage=True)
tokenizer_new = Tokenizer.from_pretrained("thusinh1969/llama-3-VN-CN-Ancient-tokenizer")
tokenizer_new_fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer_new)
model.resize_token_embeddings(len(tokenizer_new_fast))

That should do it.
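
If the grad error persists, one common cause (an assumption based on the error message, not something confirmed in this thread) is gradient checkpointing combined with inputs that do not require grad, for example when only part of the model is trainable. transformers has a helper for that case:

# Hedged suggestion, only relevant if gradient checkpointing is enabled:
# make the embedding outputs require grad so checkpointed backprop has a grad_fn.
model.enable_input_require_grads()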

@thusinh1969
Author

FYI: in order to further finetune a LLaMA-3 model with this new extended tokenizer while keeping the proper LLaMA-3 chat format, you have to change the ChatFormat class as follows:

from typing import List

# Message and Dialog are the type aliases from llama/tokenizer.py in the
# meta-llama/llama3 repo; `tokenizer` here is the extended HF tokenizer.
class ChatFormat:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def encode_header(self, message: Message) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.added_tokens_encoder["<|start_header_id|>"])
        tokens.extend(self.tokenizer.encode(message["role"], add_special_tokens=False))
        tokens.append(self.tokenizer.added_tokens_encoder["<|end_header_id|>"])
        tokens.extend(self.tokenizer.encode("\n\n", add_special_tokens=False))
        return tokens

    def encode_message(self, message: Message) -> List[int]:
        tokens = self.encode_header(message)
        tokens.extend(
            self.tokenizer.encode(message["content"].strip(), add_special_tokens=False)
        )
        tokens.append(self.tokenizer.added_tokens_encoder["<|eot_id|>"])
        return tokens

    def encode_dialog_prompt(self, dialog: Dialog) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.added_tokens_encoder["<|begin_of_text|>"])
        for message in dialog:
            tokens.extend(self.encode_message(message))
        # Add the start of an assistant message for the model to complete.
        tokens.extend(self.encode_header({"role": "assistant", "content": ""}))
        return tokens
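
A hypothetical usage example (assuming tokenizer_new_fast is the extended tokenizer from the earlier comments and that the special tokens above survived the extension):

chat_format = ChatFormat(tokenizer_new_fast)
dialog = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Xin chào!"},
]
prompt_ids = chat_format.encode_dialog_prompt(dialog)
print(tokenizer_new_fast.decode(prompt_ids))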

@ArthurZucker
Collaborator

Regarding efficiency, I'll check as well; the ignore_merges option should improve it anyway.

@thusinh1969
Author

thusinh1969 commented Apr 23, 2024

Something is WRONG. PreTrainedTokenizerFast (which LLaMA-3 uses) decodes weird output once you add a token to the vocab using the .add_tokens(word) function.

I used the standard tokenizer from the LLaMA-3 repo and added only ONE word to the original tokenizer, and...:

import tokenizers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.add_tokens(tokenizers.AddedToken("Bác"))
tokenizer
>>>PreTrainedTokenizerFast(name_or_path='/home/steve/data02/LLaMA/LLaMA-3/models/llama-3-8b-instruct/', vocab_size=128000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|begin_of_text|>', 'eos_token': '<|end_of_text|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	128000: AddedToken("<|begin_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128001: AddedToken("<|end_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128002: AddedToken("<|reserved_special_token_0|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128003: AddedToken("<|reserved_special_token_1|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128004: AddedToken("<|reserved_special_token_2|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128005: AddedToken("<|reserved_special_token_3|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128006: AddedToken("<|start_header_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128007: AddedToken("<|end_header_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128008: AddedToken("<|reserved_special_token_4|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128009: AddedToken("<|eot_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128010: AddedToken("<|reserved_special_token_5|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128011: AddedToken("<|reserved_special_token_6|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128012: AddedToken("<|reserved_special_token_7|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128013: AddedToken("<|reserved_special_token_8|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128014: AddedToken("<|reserved_special_token_9|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128015: AddedToken("<|reserved_special_token_10|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128016: AddedToken("<|reserved_special_token_11|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128017: AddedToken("<|reserved_special_token_12|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128018: AddedToken("<|reserved_special_token_13|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128019: AddedToken("<|reserved_special_token_14|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128020: AddedToken("<|reserved_special_token_15|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128021: AddedToken("<|reserved_special_token_16|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128022: AddedToken("<|reserved_special_token_17|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128023: AddedToken("<|reserved_special_token_18|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128024: AddedToken("<|reserved_special_token_19|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128025: AddedToken("<|reserved_special_token_20|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128026: AddedToken("<|reserved_special_token_21|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128027: AddedToken("<|reserved_special_token_22|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128028: AddedToken("<|reserved_special_token_23|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128029: AddedToken("<|reserved_special_token_24|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128030: AddedToken("<|reserved_special_token_25|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128031: AddedToken("<|reserved_special_token_26|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128032: AddedToken("<|reserved_special_token_27|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128033: AddedToken("<|reserved_special_token_28|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128034: AddedToken("<|reserved_special_token_29|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128035: AddedToken("<|reserved_special_token_30|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128036: AddedToken("<|reserved_special_token_31|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128037: AddedToken("<|reserved_special_token_32|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128038: AddedToken("<|reserved_special_token_33|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128039: AddedToken("<|reserved_special_token_34|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128040: AddedToken("<|reserved_special_token_35|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128041: AddedToken("<|reserved_special_token_36|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128042: AddedToken("<|reserved_special_token_37|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128043: AddedToken("<|reserved_special_token_38|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128044: AddedToken("<|reserved_special_token_39|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128045: AddedToken("<|reserved_special_token_40|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128046: AddedToken("<|reserved_special_token_41|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128047: AddedToken("<|reserved_special_token_42|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128048: AddedToken("<|reserved_special_token_43|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128049: AddedToken("<|reserved_special_token_44|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128050: AddedToken("<|reserved_special_token_45|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128051: AddedToken("<|reserved_special_token_46|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128052: AddedToken("<|reserved_special_token_47|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128053: AddedToken("<|reserved_special_token_48|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128054: AddedToken("<|reserved_special_token_49|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128055: AddedToken("<|reserved_special_token_50|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128056: AddedToken("<|reserved_special_token_51|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128057: AddedToken("<|reserved_special_token_52|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128058: AddedToken("<|reserved_special_token_53|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128059: AddedToken("<|reserved_special_token_54|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128060: AddedToken("<|reserved_special_token_55|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128061: AddedToken("<|reserved_special_token_56|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128062: AddedToken("<|reserved_special_token_57|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128063: AddedToken("<|reserved_special_token_58|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128064: AddedToken("<|reserved_special_token_59|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128065: AddedToken("<|reserved_special_token_60|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128066: AddedToken("<|reserved_special_token_61|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128067: AddedToken("<|reserved_special_token_62|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128068: AddedToken("<|reserved_special_token_63|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128069: AddedToken("<|reserved_special_token_64|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128070: AddedToken("<|reserved_special_token_65|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128071: AddedToken("<|reserved_special_token_66|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128072: AddedToken("<|reserved_special_token_67|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128073: AddedToken("<|reserved_special_token_68|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128074: AddedToken("<|reserved_special_token_69|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128075: AddedToken("<|reserved_special_token_70|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128076: AddedToken("<|reserved_special_token_71|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128077: AddedToken("<|reserved_special_token_72|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128078: AddedToken("<|reserved_special_token_73|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128079: AddedToken("<|reserved_special_token_74|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128080: AddedToken("<|reserved_special_token_75|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128081: AddedToken("<|reserved_special_token_76|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128082: AddedToken("<|reserved_special_token_77|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128083: AddedToken("<|reserved_special_token_78|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128084: AddedToken("<|reserved_special_token_79|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128085: AddedToken("<|reserved_special_token_80|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128086: AddedToken("<|reserved_special_token_81|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128087: AddedToken("<|reserved_special_token_82|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128088: AddedToken("<|reserved_special_token_83|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128089: AddedToken("<|reserved_special_token_84|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128090: AddedToken("<|reserved_special_token_85|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128091: AddedToken("<|reserved_special_token_86|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128092: AddedToken("<|reserved_special_token_87|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128093: AddedToken("<|reserved_special_token_88|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128094: AddedToken("<|reserved_special_token_89|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128095: AddedToken("<|reserved_special_token_90|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128096: AddedToken("<|reserved_special_token_91|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128097: AddedToken("<|reserved_special_token_92|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128098: AddedToken("<|reserved_special_token_93|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128099: AddedToken("<|reserved_special_token_94|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128100: AddedToken("<|reserved_special_token_95|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128101: AddedToken("<|reserved_special_token_96|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128102: AddedToken("<|reserved_special_token_97|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128103: AddedToken("<|reserved_special_token_98|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128104: AddedToken("<|reserved_special_token_99|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128105: AddedToken("<|reserved_special_token_100|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128106: AddedToken("<|reserved_special_token_101|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128107: AddedToken("<|reserved_special_token_102|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128108: AddedToken("<|reserved_special_token_103|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128109: AddedToken("<|reserved_special_token_104|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128110: AddedToken("<|reserved_special_token_105|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128111: AddedToken("<|reserved_special_token_106|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128112: AddedToken("<|reserved_special_token_107|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128113: AddedToken("<|reserved_special_token_108|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128114: AddedToken("<|reserved_special_token_109|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128115: AddedToken("<|reserved_special_token_110|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128116: AddedToken("<|reserved_special_token_111|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128117: AddedToken("<|reserved_special_token_112|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128118: AddedToken("<|reserved_special_token_113|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128119: AddedToken("<|reserved_special_token_114|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128120: AddedToken("<|reserved_special_token_115|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128121: AddedToken("<|reserved_special_token_116|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128122: AddedToken("<|reserved_special_token_117|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128123: AddedToken("<|reserved_special_token_118|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128124: AddedToken("<|reserved_special_token_119|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128125: AddedToken("<|reserved_special_token_120|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128126: AddedToken("<|reserved_special_token_121|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128127: AddedToken("<|reserved_special_token_122|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128128: AddedToken("<|reserved_special_token_123|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128129: AddedToken("<|reserved_special_token_124|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128130: AddedToken("<|reserved_special_token_125|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128131: AddedToken("<|reserved_special_token_126|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128132: AddedToken("<|reserved_special_token_127|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128133: AddedToken("<|reserved_special_token_128|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128134: AddedToken("<|reserved_special_token_129|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128135: AddedToken("<|reserved_special_token_130|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128136: AddedToken("<|reserved_special_token_131|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128137: AddedToken("<|reserved_special_token_132|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128138: AddedToken("<|reserved_special_token_133|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128139: AddedToken("<|reserved_special_token_134|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128140: AddedToken("<|reserved_special_token_135|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128141: AddedToken("<|reserved_special_token_136|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128142: AddedToken("<|reserved_special_token_137|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128143: AddedToken("<|reserved_special_token_138|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128144: AddedToken("<|reserved_special_token_139|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128145: AddedToken("<|reserved_special_token_140|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128146: AddedToken("<|reserved_special_token_141|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128147: AddedToken("<|reserved_special_token_142|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128148: AddedToken("<|reserved_special_token_143|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128149: AddedToken("<|reserved_special_token_144|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128150: AddedToken("<|reserved_special_token_145|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128151: AddedToken("<|reserved_special_token_146|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128152: AddedToken("<|reserved_special_token_147|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128153: AddedToken("<|reserved_special_token_148|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128154: AddedToken("<|reserved_special_token_149|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128155: AddedToken("<|reserved_special_token_150|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128156: AddedToken("<|reserved_special_token_151|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128157: AddedToken("<|reserved_special_token_152|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128158: AddedToken("<|reserved_special_token_153|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128159: AddedToken("<|reserved_special_token_154|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128160: AddedToken("<|reserved_special_token_155|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128161: AddedToken("<|reserved_special_token_156|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128162: AddedToken("<|reserved_special_token_157|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128163: AddedToken("<|reserved_special_token_158|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128164: AddedToken("<|reserved_special_token_159|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128165: AddedToken("<|reserved_special_token_160|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128166: AddedToken("<|reserved_special_token_161|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128167: AddedToken("<|reserved_special_token_162|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128168: AddedToken("<|reserved_special_token_163|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128169: AddedToken("<|reserved_special_token_164|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128170: AddedToken("<|reserved_special_token_165|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128171: AddedToken("<|reserved_special_token_166|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128172: AddedToken("<|reserved_special_token_167|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128173: AddedToken("<|reserved_special_token_168|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128174: AddedToken("<|reserved_special_token_169|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128175: AddedToken("<|reserved_special_token_170|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128176: AddedToken("<|reserved_special_token_171|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128177: AddedToken("<|reserved_special_token_172|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128178: AddedToken("<|reserved_special_token_173|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128179: AddedToken("<|reserved_special_token_174|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128180: AddedToken("<|reserved_special_token_175|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128181: AddedToken("<|reserved_special_token_176|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128182: AddedToken("<|reserved_special_token_177|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128183: AddedToken("<|reserved_special_token_178|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128184: AddedToken("<|reserved_special_token_179|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128185: AddedToken("<|reserved_special_token_180|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128186: AddedToken("<|reserved_special_token_181|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128187: AddedToken("<|reserved_special_token_182|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128188: AddedToken("<|reserved_special_token_183|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128189: AddedToken("<|reserved_special_token_184|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128190: AddedToken("<|reserved_special_token_185|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128191: AddedToken("<|reserved_special_token_186|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128192: AddedToken("<|reserved_special_token_187|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128193: AddedToken("<|reserved_special_token_188|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128194: AddedToken("<|reserved_special_token_189|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128195: AddedToken("<|reserved_special_token_190|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128196: AddedToken("<|reserved_special_token_191|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128197: AddedToken("<|reserved_special_token_192|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128198: AddedToken("<|reserved_special_token_193|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128199: AddedToken("<|reserved_special_token_194|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128200: AddedToken("<|reserved_special_token_195|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128201: AddedToken("<|reserved_special_token_196|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128202: AddedToken("<|reserved_special_token_197|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128203: AddedToken("<|reserved_special_token_198|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128204: AddedToken("<|reserved_special_token_199|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128205: AddedToken("<|reserved_special_token_200|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128206: AddedToken("<|reserved_special_token_201|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128207: AddedToken("<|reserved_special_token_202|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128208: AddedToken("<|reserved_special_token_203|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128209: AddedToken("<|reserved_special_token_204|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128210: AddedToken("<|reserved_special_token_205|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128211: AddedToken("<|reserved_special_token_206|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128212: AddedToken("<|reserved_special_token_207|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128213: AddedToken("<|reserved_special_token_208|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128214: AddedToken("<|reserved_special_token_209|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128215: AddedToken("<|reserved_special_token_210|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128216: AddedToken("<|reserved_special_token_211|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128217: AddedToken("<|reserved_special_token_212|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128218: AddedToken("<|reserved_special_token_213|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128219: AddedToken("<|reserved_special_token_214|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128220: AddedToken("<|reserved_special_token_215|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128221: AddedToken("<|reserved_special_token_216|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128222: AddedToken("<|reserved_special_token_217|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128223: AddedToken("<|reserved_special_token_218|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128224: AddedToken("<|reserved_special_token_219|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128225: AddedToken("<|reserved_special_token_220|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128226: AddedToken("<|reserved_special_token_221|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128227: AddedToken("<|reserved_special_token_222|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128228: AddedToken("<|reserved_special_token_223|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128229: AddedToken("<|reserved_special_token_224|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128230: AddedToken("<|reserved_special_token_225|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128231: AddedToken("<|reserved_special_token_226|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128232: AddedToken("<|reserved_special_token_227|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128233: AddedToken("<|reserved_special_token_228|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128234: AddedToken("<|reserved_special_token_229|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128235: AddedToken("<|reserved_special_token_230|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128236: AddedToken("<|reserved_special_token_231|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128237: AddedToken("<|reserved_special_token_232|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128238: AddedToken("<|reserved_special_token_233|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128239: AddedToken("<|reserved_special_token_234|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128240: AddedToken("<|reserved_special_token_235|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128241: AddedToken("<|reserved_special_token_236|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128242: AddedToken("<|reserved_special_token_237|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128243: AddedToken("<|reserved_special_token_238|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128244: AddedToken("<|reserved_special_token_239|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128245: AddedToken("<|reserved_special_token_240|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128246: AddedToken("<|reserved_special_token_241|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128247: AddedToken("<|reserved_special_token_242|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128248: AddedToken("<|reserved_special_token_243|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128249: AddedToken("<|reserved_special_token_244|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128250: AddedToken("<|reserved_special_token_245|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128251: AddedToken("<|reserved_special_token_246|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128252: AddedToken("<|reserved_special_token_247|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128253: AddedToken("<|reserved_special_token_248|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128254: AddedToken("<|reserved_special_token_249|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128255: AddedToken("<|reserved_special_token_250|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128256: AddedToken("Bác", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
}

tokenizer.decode(tokenizer.encode("Bác"))
>>>B�c

It does NOT use the newly added token at all. Why? Any help please. Something must be missing.
Steve

@VishnuPJ

VishnuPJ commented Apr 23, 2024

When adding a new token with
tokenizer.add_tokens(['ininin']) and resizing with model.resize_token_embeddings(len(tokenizer)), I am getting the error "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn".
But when doing tokenizer.add_tokens(['inin']) there is no error?

Why is that? @ArthurZucker @thusinh1969

@StephennFernandes

StephennFernandes commented Apr 23, 2024

@VishnuPJ are you saving the tokenizer and then expanding the token embeddings by loading the tokenizer freshly?

I don't understand your error clearly, can you elaborate more?

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

tokenizer_new = Tokenizer.from_pretrained("thusinh1969/llama-3-VN-CN-Ancient-tokenizer")
tokenizer_new_fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer_new)

Try doing this, then save the fast tokenizer, then freshly load the tokenizer as usual and try to expand the token embeddings (a sketch of that flow follows below).
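For concreteness, here is a minimal sketch of that save / reload / resize flow. The local path is an illustrative assumption, not something given in this thread:

from tokenizers import Tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizerFast

# Wrap the raw tokenizers object in a fast tokenizer and persist it.
tokenizer_new = Tokenizer.from_pretrained("thusinh1969/llama-3-VN-CN-Ancient-tokenizer")
tokenizer_new_fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer_new)
tokenizer_new_fast.save_pretrained("./llama3-extended-tokenizer")  # illustrative path

# Reload it the usual way and grow the model's embedding matrix to the new vocab size.
tokenizer_reloaded = AutoTokenizer.from_pretrained("./llama3-extended-tokenizer")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model.resize_token_embeddings(len(tokenizer_reloaded))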

@VishnuPJ

@VishnuPJ are you saving the tokenizer and then expanding the token embeddings by loading the tokenizer freshly?

I don't understand your error clearly, can you elaborate more?

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

tokenizer_new = Tokenizer.from_pretrained("thusinh1969/llama-3-VN-CN-Ancient-tokenizer")
tokenizer_new_fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer_new)

Try doing this, then save the fast tokenizer, then freshly load the tokenizer as usual and try to expand the token embeddings.

Sorry for the confusion. I was able to add the tokens and the tokenizer works as expected. But while running trainer.train() I am getting the above error.

@StephennFernandes

@VishnuPJ OK, seems like a trainer issue.

@thusinh1969 can you check what this issue could actually be?

I'd recommend cross-checking your code with Chinese-LLaMA-Alpaca-2 in case you haven't already.

Besides this, I feel only @ArthurZucker and/or @osanseviero could help us out on this.

@ArthurZucker
Collaborator

Regarding the new added token, the "issue" is that you need to make sure you add the correct representation of the string:

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> pre_tokenizers.ByteLevel(False,False).pre_tokenize_str("Bác")
[('BÃ¡c', (0, 3))]
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.decode(tokenizer.encode("Bác"))
'<|begin_of_text|>Bác'
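One quick way to check whether the addition actually took effect is to look at the id the token maps to; this is an editorial check, not code from the thread:

new_id = tokenizer.convert_tokens_to_ids("Bác")
print(new_id, tokenizer.encode("Bác", add_special_tokens=False))
# If the token was registered, new_id should be 128256 or higher (the base Llama-3 vocab
# ends at 128255) and the encode() call should return that single id.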

@ArthurZucker
Collaborator

Since the strings are pre-tokenized to their byte-level representation (it's not a normalization), you need to add the token using pre_tokenizers.ByteLevel(False,False).pre_tokenize_str.

@StephennFernandes

Thanks a lot @ArthurZucker 😊

it really means a ton !!

@thusinh1969
Author

thusinh1969 commented Apr 24, 2024

Regarding the new added token, the "issue" is that you need to make sure you add the correct representation of the string:

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> pre_tokenizers.ByteLevel(False,False).pre_tokenize_str("Bác")
[('BÃ¡c', (0, 3))]
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.decode(tokenizer.encode("Bác"))
'<|begin_of_text|>Bác'

Does not help. This will create 3 tokens for the single word "Bác", which is exactly what we want to avoid. It should be only 1 token.

tokenizer.encode("Bác", add_special_tokens=False)
>>>[33, 1995, 66]

This is very inefficient.
Steve

@ArthurZucker
Collaborator

ArthurZucker commented Apr 24, 2024

Mmm, no, then it's not added properly. Let me try again; sorry, I forgot to check the ids.

@ArthurZucker
Collaborator

ArthurZucker commented Apr 24, 2024

Ok:

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.encode("Bác")
128256 # a new token

this is alright, the only issue is the decoding.
Let me find a fix and if needed update tokenizers to support this

@VishnuPJ

VishnuPJ commented Apr 24, 2024

@VishnuPJ ok seems like a trainer issue.

@thusinh1969 can you check what this issue could actually be ?

Id recommend cross checking your code with Chinese LLama alpaca 2 incase you haven't already.

besides this I feel only @ArthurZucker and/or @osanseviero could help us out in this

This issue is resolved. We need to add the lines below before calling get_peft_model(model, lora_config).

tokenizer.add_tokens(["NEW_TOKEN", "NEW_TOKEN_2"])
model.resize_token_embeddings(len(tokenizer))

Previously I added those lines after get_peft_model(), which somehow messes up the model and tokenizer, I guess.
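To make the ordering concrete, here is a minimal sketch assuming a causal LM fine-tuned with PEFT/LoRA; the model name and LoRA settings are illustrative, not taken from this thread:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# 1) Extend the vocab and resize the embeddings on the *base* model first.
tokenizer.add_tokens(["NEW_TOKEN", "NEW_TOKEN_2"])
model.resize_token_embeddings(len(tokenizer))

# 2) Only then wrap the model with PEFT.
lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
    # If the new embedding rows should actually be trained, it may also be necessary
    # to keep the embeddings and LM head trainable:
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, lora_config)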

@StephennFernandes

Ok:

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.encode("Bác")
128256 # a new token

this is alright, the only issue is the decoding.
Let me find a fix and if needed update tokenizers to support this

@ArthurZucker so just for clarification, the decoder produces char/byte-based tokenization while decoding?

@ArthurZucker
Collaborator

Yep, overall the token that was added is Bác, then it gets encoded, and the decoder tries to decode Bác as if it were a byte-level representation, thus failing.
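As an editorial illustration of that failure mode (not code from this thread): the ByteLevel decoder maps each character of the token string back to a byte and then decodes the byte sequence as UTF-8. For these particular characters the byte map is the identity, so ord() is enough to show what happens:

# The raw token "Bác" is treated as byte-level symbols: B -> 0x42, á -> 0xE1, c -> 0x63.
raw_bytes = bytes(ord(c) for c in "Bác")            # b'B\xe1c'
print(raw_bytes.decode("utf-8", errors="replace"))  # 'B�c' -- 0xE1 alone is not valid UTF-8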

@ArthurZucker
Collaborator

I think the easiest solution is to simply make sure the ByteLevel decoder does not process the added tokens.

@hpsun1109

hpsun1109 commented Apr 25, 2024

Regarding the new added token, the "issue" is that you need to make sure you add the correct representation of the string:

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> pre_tokenizers.ByteLevel(False,False).pre_tokenize_str("Bác")
[('Bác', (0, 3))]
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.decode(tokenizer.encode("Bác"))
'<|begin_of_text|>Bác'

@ArthurZucker How do I output the bos_token? It doesn't work when I set "tokenizer.add_bos_token = True". Thanks

@ArthurZucker
Collaborator

huggingface/tokenizers#1513 will fix the issue for the new tokens

@thusinh1969
Author

huggingface/tokenizers#1513 will fix the issue for the new tokens

Wonderful. When will it be merged, and into which repo, so we can come back and test?

Cheers,
Steve

@dengxiaotian123

dengxiaotian123 commented Apr 29, 2024

@ArthurZucker I am confused about the tokenizer used for tiktoken training. What encoding is used for the corpus (such as cl100k_base or p50k_base) when training the tokenizer? What is the encoding of these characters? For example, ['åIJ¦', 'ãĢĤ']

['Tôi', 'ĠnhỼ', 'Ġl', 'ắm', 'ĠB', 'ác', 'ĠHá»ĵ', 'ĠkÃŃnh', 'Ġyêu', 'Ġcủa', 'ĠÄijạo', 'Ġph', 'áºŃt']

Input Chinese characters and output similar to this

word = "否。"
print('word', word)
print(tokenizer.tokenize(word))
print(tokenizer(word).input_ids)
print('decode: ', tokenizer.decode(tokenizer(word).input_ids))

word 否。
['åIJ¦', 'ãĢĤ']
[33476, 1811]

@ArthurZucker
Collaborator

It is a unicode representation of the bytes!
https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L1478 should give you an idea of how we obtain the tokens!
What you do is that you take word = "否。", you encode it to bytes:

>>> word = "否。"
>>> bword = b'\xe5\x90\xa6'
>>> decoded = b'\xe5\x90\xa6'.decode("latin-1")
>>> [ord(char) for char in decoded.decode("latin-1")] 
[229, 144, 166]

then you fetch the unicode representation of the bytes (which are supposed to come from utf-8):

{33: '!', 34: '"', 35: '#', 36: '$', 37: '%', 38: '&', 39: "'", 40: '(', 41: ')', 42: '*', 43: '+', 44: ',', 45: '-', 46: '.', 47: '/', 48: '0', 49: '1', 50: '2', 51: '3', 52: '4', 53: '5', 54: '6', 55: '7', 56: '8', 57: '9', 58: ':', 59: ';', 60: '<', 61: '=', 62: '>', 63: '?', 64: '@', 65: 'A', 66: 'B', 67: 'C', 68: 'D', 69: 'E', 70: 'F', 71: 'G', 72: 'H', 73: 'I', 74: 'J', 75: 'K', 76: 'L', 77: 'M', 78: 'N', 79: 'O', 80: 'P', 81: 'Q', 82: 'R', 83: 'S', 84: 'T', 85: 'U', 86: 'V', 87: 'W', 88: 'X', 89: 'Y', 90: 'Z', 91: '[', 92: '\\', 93: ']', 94: '^', 95: '_', 96: '`', 97: 'a', 98: 'b', 99: 'c', 100: 'd', 101: 'e', 102: 'f', 103: 'g', 104: 'h', 105: 'i', 106: 'j', 107: 'k', 108: 'l', 109: 'm', 110: 'n', 111: 'o', 112: 'p', 113: 'q', 114: 'r', 115: 's', 116: 't', 117: 'u', 118: 'v', 119: 'w', 120: 'x', 121: 'y', 122: 'z', 123: '{', 124: '|', 125: '}', 126: '~', 161: '¡', 162: '¢', 163: '£', 164: '¤', 165: '¥', 166: '¦', 167: '§', 168: '¨', 169: '©', 170: 'ª', 171: '«', 172: '¬', 174: '®', 175: '¯', 176: '°', 177: '±', 178: '²', 179: '³', 180: '´', 181: 'µ', 182: '¶', 183: '·', 184: '¸', 185: '¹', 186: 'º', 187: '»', 188: '¼', 189: '½', 190: '¾', 191: '¿', 192: 'À', 193: 'Á', 194: 'Â', 195: 'Ã', 196: 'Ä', 197: 'Å', 198: 'Æ', 199: 'Ç', 200: 'È', 201: 'É', 202: 'Ê', 203: 'Ë', 204: 'Ì', 205: 'Í', 206: 'Î', 207: 'Ï', 208: 'Ð', 209: 'Ñ', 210: 'Ò', 211: 'Ó', 212: 'Ô', 213: 'Õ', 214: 'Ö', 215: '×', 216: 'Ø', 217: 'Ù', 218: 'Ú', 219: 'Û', 220: 'Ü', 221: 'Ý', 222: 'Þ', 223: 'ß', 224: 'à', 225: 'á', 226: 'â', 227: 'ã', 228: 'ä', 229: 'å', 230: 'æ', 231: 'ç', 232: 'è', 233: 'é', 234: 'ê', 235: 'ë', 236: 'ì', 237: 'í', 238: 'î', 239: 'ï', 240: 'ð', 241: 'ñ', 242: 'ò', 243: 'ó', 244: 'ô', 245: 'õ', 246: 'ö', 247: '÷', 248: 'ø', 249: 'ù', 250: 'ú', 251: 'û', 252: 'ü', 253: 'ý', 254: 'þ', 255: 'ÿ', 0: 'Ā', 1: 'ā', 2: 'Ă', 3: 'ă', 4: 'Ą', 5: 'ą', 6: 'Ć', 7: 'ć', 8: 'Ĉ', 9: 'ĉ', 10: 'Ċ', 11: 'ċ', 12: 'Č', 13: 'č', 14: 'Ď', 15: 'ď', 16: 'Đ', 17: 'đ', 18: 'Ē', 19: 'ē', 20: 'Ĕ', 21: 'ĕ', 22: 'Ė', 23: 'ė', 24: 'Ę', 25: 'ę', 26: 'Ě', 27: 'ě', 28: 'Ĝ', 29: 'ĝ', 30: 'Ğ', 31: 'ğ', 32: 'Ġ', 127: 'ġ', 128: 'Ģ', 129: 'ģ', 130: 'Ĥ', 131: 'ĥ', 132: 'Ħ', 133: 'ħ', 134: 'Ĩ', 135: 'ĩ', 136: 'Ī', 137: 'ī', 138: 'Ĭ', 139: 'ĭ', 140: 'Į', 141: 'į', 142: 'İ', 143: 'ı', 144: 'IJ', 145: 'ij', 146: 'Ĵ', 147: 'ĵ', 148: 'Ķ', 149: 'ķ', 150: 'ĸ', 151: 'Ĺ', 152: 'ĺ', 153: 'Ļ', 154: 'ļ', 155: 'Ľ', 156: 'ľ', 157: 'Ŀ', 158: 'ŀ', 159: 'Ł', 160: 'ł', 173: 'Ń'}

this basically allows you to represent any byte array in unicode, simplifying the tokenization process.
The idea is to show 'åIJ¦' as a token instead of showing b'\xe5\x90\xa6'.
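For readers who want to reproduce that mapping, here is a small sketch of the GPT-2-style byte-to-unicode table (the same idea as the convert_slow_tokenizer.py code linked above); the function name and the final print are illustrative:

def bytes_to_unicode():
    # Printable bytes keep their own character; the remaining bytes are shifted to 256+.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
word = "否。"
print("".join(byte_encoder[b] for b in word.encode("utf-8")))  # åIJ¦ãĢĤ

This reproduces the dictionary shown above, and the output matches the tokens ['åIJ¦', 'ãĢĤ'] from the earlier question.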

@ArthurZucker
Collaborator

(\xe5 gives 229: 'å', \x90 gives 144: 'IJ', etc.)

@thusinh1969
Author

thusinh1969 commented May 3, 2024

Gents and @ArthurZucker, is the decoder fix merged already somewhere?

Thanks,
Steve

@ArthurZucker
Collaborator

huggingface/tokenizers#1513 can be used, gonna merge today and prepare the update to transformers + tokenizers release

@StephennFernandes

@amitsangani @ArthurZucker

How do I train a tiktoken tokenizer from scratch? I see even Phi-3 uses a tiktoken tokenizer, but I cannot find any documentation on how to train the tiktoken tokenizer.

All help would be greatly appreciated.

@thusinh1969
Author

@amitsangani @ArthurZucker

How do I train a tiktoken tokenizer from scratch? I see even Phi-3 uses a tiktoken tokenizer, but I cannot find any documentation on how to train the tiktoken tokenizer.

All help would be greatly appreciated.

Train a SentencePiece model and merge, see the code above (a rough sketch of that flow is added below). But its decoder is buggy, hence we have to wait for the change to be merged into HF's tokenizers package.

@ArthurZucker when should we expect the change to be part of the official tokenizers package?

Thanks,
Steve
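For concreteness, a rough sketch of that "train SentencePiece, then merge into the Llama-3 fast tokenizer" flow; the corpus path, vocab size and the byte-level membership check are illustrative assumptions, not code from this thread:

import sentencepiece as spm
from tokenizers import AddedToken, pre_tokenizers
from transformers import AutoTokenizer

# Train a small SentencePiece BPE model on the target-language corpus (illustrative settings).
spm.SentencePieceTrainer.train(input="vi_corpus.txt", model_prefix="vi_sp",
                               vocab_size=8000, model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="vi_sp.model")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
existing = set(tokenizer.get_vocab())            # the base vocab stores byte-level strings
byte_level = pre_tokenizers.ByteLevel(False, False)

def already_known(piece: str) -> bool:
    # Compare in byte-level space, since that is how the base vocab is stored.
    return "".join(tok for tok, _ in byte_level.pre_tokenize_str(piece)) in existing

new_pieces = [sp.id_to_piece(i).replace("▁", " ") for i in range(sp.get_piece_size())]
to_add = [AddedToken(p, normalized=False, special=False)
          for p in new_pieces
          if p.strip() and not p.startswith("<") and not already_known(p)]  # skip control pieces like <unk>
print(tokenizer.add_tokens(to_add), "tokens added")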

@StephennFernandes

@amitsangani @ArthurZucker

How do I train a tiktoken tokenizer from scratch? I see even Phi-3 uses a tiktoken tokenizer, but I cannot find any documentation on how to train the tiktoken tokenizer.

All help would be greatly appreciated.

Train a SentencePiece model and merge, see the code above. But its decoder is buggy, hence we have to wait for the change to be merged into HF's tokenizers package.

@ArthurZucker when should we expect the change to be part of the official tokenizers package?

Thanks,
Steve

I know that we could train an SPM model and merge, but that's not the point; whether there is a way to train tiktoken from scratch was my actual query.

As I see it, even other orgs use their own custom-trained versions of tiktoken, like the Phi-3 model did.

@thusinh1969
Author

thusinh1969 commented May 13, 2024

Gents,

I installed tokenizers from source (tokenizers-0.19.1.dev0) from main branch. It is now working.

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.decode(tokenizer.encode("Bác")), tokenizer.encode("Bác")
('Bác', [128256])

I am closing the issue; we can now extend the vocab and continue pretraining Llama-3.

Thanks @ArthurZucker et al.,
Steve

@ArthurZucker
Collaborator

🤗 glad I was of help!
@StephennFernandes I don't know tiktoken, I can help you train from scratch using tokenizers but otherwise it's outside my domain of knowledge!
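Since tiktoken itself does not document a training entry point, here is a rough sketch of what "train from scratch using tokenizers" could look like: a byte-level BPE trained with the tokenizers library. File names, vocab size and special tokens are illustrative assumptions:

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),   # cover all 256 byte symbols
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("my-byte-level-bpe.json")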

@woohwan

woohwan commented May 15, 2024

Gents,

I installed tokenizers from source (tokenizers-0.19.1.dev0) from main branch. It is now working.

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.decode(tokenizer.encode("Bác")), tokenizer.encode("Bác")
('Bác', [128256])

I am closing the issue; we can now extend the vocab and continue pretraining Llama-3.

Thanks @ArthurZucker et al., Steve

I'm a newbie in the LLM field.

I want to extend the Llama-3 tokenizer with a Korean corpus.
Can you tell me what to modify when following https://huggingface.co/learn/nlp-course/chapter6/2?
The result of tokenization does not change when I do the same.

Anyone, please help.
Thanks.

@Yuhuajoe

>>> word = "否。"
>>> bword = b'\xe5\x90\xa6'
>>> decoded = b'\xe5\x90\xa6'.decode("latin-1")
>>> [ord(char) for char in decoded.decode("latin-1")] 
[229, 144, 166]

@ArthurZucker when I run the code above, I get the error below:

         2 bword = b'\xe5\x90\xa6'
         3 decoded = b'\xe5\x90\xa6'.decode("latin-1")
----> 4 [ord(char) for char in decoded.decode("latin-1")] 

AttributeError: 'str' object has no attribute 'decode'

@ArthurZucker
Collaborator

Hey! decoded is already a string; you probably wanted to do [ord(char) for char in decoded] 😉
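For reference, a corrected, runnable version of that snippet (an editorial addition mirroring the fix above):

word = "否。"
decoded = word.encode("utf-8").decode("latin-1")   # the UTF-8 bytes reinterpreted as latin-1 characters
print([ord(char) for char in decoded])             # [229, 144, 166, 227, 128, 130]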

@StephennFernandes

@amitsangani
Hey Amit, could you please tell us how to pretrain the tokenizer from scratch using tiktoken, like you did for training Llama 3?
