adding user defined tokens #30824 #30929
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
control_symbols = [
    AddedToken(token, normalized=True, special=False) for token in self.proto.trainer_spec.control_symbols
]
tokenizer.add_tokens(user_defined_symbols + control_symbols)
I think we should set normalized to False, and more importantly control_symbols should be special, no? (I don't remember which ones should be skipped during decoding.) Also, adding both special and non-special tokens relates to #30574.
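For illustration, a minimal sketch of that suggestion applied to the diff above (assuming AddedToken is the class from tokenizers already used there):

# Sketch of the suggested tweak: control symbols kept un-normalized and
# flagged as special so they can be skipped during decoding.
control_symbols = [
    AddedToken(token, normalized=False, special=True) for token in self.proto.trainer_spec.control_symbols
]
tokenizer.add_tokens(user_defined_symbols + control_symbols)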
add_prefix_space = self.proto.normalizer_spec.add_dummy_prefix
# tokenizer.add_prefix_space = add_prefix_space

if hasattr(self.original_tokenizer, "add_prefix_space") and self.original_tokenizer.add_prefix_space is not None:
Suggested change:
- if hasattr(self.original_tokenizer, "add_prefix_space") and self.original_tokenizer.add_prefix_space is not None:
+ if add_prefix_space:
Should we rather test this, or do some check on the original tokenizer?
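A possible getattr-based variant of that check, purely as a sketch (the fallback to the proto value is an assumption, not necessarily what the PR should do):

# Hypothetical: read the attribute with a default instead of hasattr,
# falling back to the sentencepiece proto when it is absent.
add_prefix_space = getattr(self.original_tokenizer, "add_prefix_space", None)
if add_prefix_space is None:
    add_prefix_space = self.proto.normalizer_spec.add_dummy_prefix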
Here it's missing the part where we want to reset normalizer.add_dummy_prefix when we do save_pretrained, plus a small test that shows this works as expected.
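Such a test could be a simple round trip, something like the sketch below (the checkpoint name and the add_prefix_space kwarg plumbing are assumptions):

import tempfile

from transformers import AutoTokenizer

def test_add_prefix_space_survives_save_pretrained():
    # Hypothetical: any sentencepiece-based checkpoint would do here.
    tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer", add_prefix_space=False)
    with tempfile.TemporaryDirectory() as tmp:
        tok.save_pretrained(tmp)
        reloaded = AutoTokenizer.from_pretrained(tmp)
    assert reloaded.add_prefix_space is False
    assert reloaded.encode("Hello") == tok.encode("Hello")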
# TODO: ita
self.add_prefix_space = add_prefix_space
# if add_prefix_space is not None:
#     kwargs["from_slow"] = True

if self.force_from_slow() is True:
    kwargs["from_slow"] = True
IMO this should be called in PreTrainedTokenizerFast.
elif fast_tokenizer_file is not None:  # and not from_slow:
    # We have a serialization from tokenizers which lets us directly build the backend
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
why are we changing the order of priority here?
I'll check the order, but it is to allow converting without sentencepiece installed.
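For context, a simplified sketch of the loading priority being discussed (condensed from the shape of PreTrainedTokenizerFast.__init__; not the exact code):

# Prefer the serialized tokenizer.json, which needs no sentencepiece,
# and only fall back to converting the slow tokenizer when it is absent.
if fast_tokenizer_file is not None and not from_slow:
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
elif slow_tokenizer is not None:
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)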
def update_normalizer(self):
    """Updates the underlying normalizer with the current `add_prefix_space` and `legacy` settings."""
    sequence = []
    if getattr(self, "legacy", True):
        if getattr(self, "add_prefix_space", True):
            sequence += [normalizers.Prepend(prepend="▁")]
        sequence += [normalizers.Replace(pattern=" ", content="▁")]
        self._tokenizer.normalizer = normalizers.Sequence(sequence)
    elif not getattr(self, "legacy", True):
        self._tokenizer.normalizer = normalizers.Sequence(sequence)  # TODO: ita2

def update_pre_tokenizer(self):
    """Updates the underlying pre-tokenizer with the current `add_prefix_space` setting."""
    sequence = []
    if getattr(self, "add_prefix_space", None) is None:
        if getattr(self._tokenizer, "normalizer", None) is None:
            return
        curr_normalizer = json.loads(self._tokenizer.normalizer.__getstate__().decode("utf-8"))
        if "normalizers" not in curr_normalizer:
            return
        prepend_normalizer = [n for n in curr_normalizer["normalizers"] if n["type"] == "Prepend"]
        if prepend_normalizer:
            self.add_prefix_space = True
        else:
            return
    elif getattr(self, "add_prefix_space") is False:
        prepend_scheme = "never"

    if getattr(self, "add_prefix_space", True):
        prepend_scheme = "always"
    if not getattr(self, "legacy", True):
        prepend_scheme = "first"
    self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
        replacement="▁", prepend_scheme=prepend_scheme, split=False
    )
    self.update_normalizer()

def update_post_processor(self):
    """
    Updates the underlying post processor with the current `bos_token` and `eos_token`.
    """
    bos = self.bos_token
    bos_token_id = self.bos_token_id
    if bos is None and self.add_bos_token:
        raise ValueError("add_bos_token = True but bos_token = None")

    eos = self.eos_token
    eos_token_id = self.eos_token_id
    if eos is None and self.add_eos_token:
        raise ValueError("add_eos_token = True but eos_token = None")

    single = f"{(bos + ':0 ') if self.add_bos_token else ''}$A:0{(' ' + eos + ':0') if self.add_eos_token else ''}"
    pair = f"{single}{(' ' + bos + ':1') if self.add_bos_token else ''} $B:1{(' ' + eos + ':1') if self.add_eos_token else ''}"

    special_tokens = []
    if self.add_bos_token:
        special_tokens.append((bos, bos_token_id))
    if self.add_eos_token:
        special_tokens.append((eos, eos_token_id))
    self._tokenizer.post_processor = processors.TemplateProcessing(
        single=single, pair=pair, special_tokens=special_tokens
    )
Nice! It's mostly missing ... testssssss
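A sketch of what such a test could look like (the checkpoint name and the add_bos_token/add_eos_token attributes living on PreTrainedTokenizerFast are assumptions):

from transformers import AutoTokenizer

def test_update_post_processor_toggles_bos_eos():
    # Hypothetical sentencepiece-based checkpoint.
    tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
    tok.add_bos_token, tok.add_eos_token = True, True
    tok.update_post_processor()
    ids = tok.encode("hello", add_special_tokens=True)
    assert ids[0] == tok.bos_token_id and ids[-1] == tok.eos_token_id

    tok.add_bos_token, tok.add_eos_token = False, False
    tok.update_post_processor()
    ids = tok.encode("hello", add_special_tokens=True)
    assert tok.bos_token_id not in ids and tok.eos_token_id not in ids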
if type(self) == PreTrainedTokenizerFast and all(item in kwargs for item in ["add_bos_token", "add_eos_token", "eos_token", "bos_token"]):
    self.add_bos_token = kwargs.get("add_bos_token")
    self.add_eos_token = kwargs.get("add_eos_token")
    self.update_post_processor()
potentially this needs to be called when the eos and bos are updated! 🤗
where are they updated? It looks like only on initialization?
If someone calls add_special_tokens, or we go into eos_token_id's setter and eos_token's setter 😉
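One way that setter hook could look, purely as a sketch (the _eos_token backing attribute and add_eos_token flag are assumptions):

# Hypothetical: resync the backend whenever eos_token is reassigned.
@eos_token.setter
def eos_token(self, value):
    self._eos_token = value
    if getattr(self, "add_eos_token", False):
        self.update_post_processor()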
We can close this as #31305 supersedes it, right?
Fixes #30824 #30947