Skip to content

Commit

Permalink
[Tokenizer doc] Clarification about add_prefix_space (#24368)
Browse files Browse the repository at this point in the history
* nits

* more details

* fixup

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
  • Loading branch information
ArthurZucker and amyeroberts authored Jun 20, 2023
1 parent 0527c1c commit 7feba74
Show file tree
Hide file tree
Showing 3 changed files with 12 additions and 5 deletions.
6 changes: 4 additions & 2 deletions src/transformers/generation/configuration_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -143,8 +143,10 @@ class GenerationConfig(PushToHubMixin):
If set to int > 0, all ngrams of that size can only occur once.
bad_words_ids(`List[List[int]]`, *optional*):
List of token ids that are not allowed to be generated. In order to get the token ids of the words that
should not appear in the generated text, use `tokenizer(bad_words, add_prefix_space=True,
add_special_tokens=False).input_ids`.
should not appear in the generated text, make sure to set `add_prefix_space=True` when initializing the
tokenizer, and use `tokenizer(bad_words, add_special_tokens=False).input_ids`. The `add_prefix_space`
argument is only supported for some slow tokenizers, as fast tokenizers' prefixing behaviours come from
`pre tokenizers`. Read more [here](https://huggingface.co/docs/tokenizers/api/pre-tokenizers).
force_words_ids(`List[List[int]]` or `List[List[List[int]]]`, *optional*):
List of token ids that must be generated. If given a `List[List[int]]`, this is treated as a simple list of
words that must be included, the opposite to `bad_words_ids`. If given `List[List[List[int]]]`, this
Expand Down
6 changes: 4 additions & 2 deletions src/transformers/generation/logits_process.py
Original file line number Diff line number Diff line change
Expand Up @@ -546,8 +546,10 @@ class NoBadWordsLogitsProcessor(LogitsProcessor):
Args:
bad_words_ids (`List[List[int]]`):
List of list of token ids that are not allowed to be generated. In order to get the token ids of the words
that should not appear in the generated text, use `tokenizer(bad_words, add_prefix_space=True,
add_special_tokens=False).input_ids`.
that should not appear in the generated text, make sure to set `add_prefix_space=True` when initializing
the tokenizer, and use `tokenizer(bad_words, add_special_tokens=False).input_ids`. The `add_prefix_space`
argument is only supported for some slow tokenizers, as fast tokenizers' prefixing behaviours come from
`pre tokenizers`. Read more [here](https://huggingface.co/docs/tokenizers/api/pre-tokenizers).
eos_token_id (`Union[int, List[int]]`):
The id of the *end-of-sequence* token. Optionally, use a list to set multiple *end-of-sequence* tokens.
"""
Expand Down
5 changes: 4 additions & 1 deletion src/transformers/generation/tf_logits_process.py
Original file line number Diff line number Diff line change
Expand Up @@ -292,7 +292,10 @@ class TFNoBadWordsLogitsProcessor(TFLogitsProcessor):
Args:
bad_words_ids (`List[List[int]]`):
List of list of token ids that are not allowed to be generated. In order to get the tokens of the words
that should not appear in the generated text, use `tokenizer(bad_word, add_prefix_space=True).input_ids`.
that should not appear in the generated text, make sure to set `add_prefix_space=True` when initializing
the tokenizer, and use `tokenizer(bad_words, add_special_tokens=False).input_ids`. The `add_prefix_space`
argument is only supported for some slow tokenizers, as fast tokenizers' prefixing behaviours come from
`pre tokenizers`. Read more [here](https://huggingface.co/docs/tokenizers/api/pre-tokenizers).
eos_token_id (`int`):
The id of the *end-of-sequence* token.
"""
Expand Down

0 comments on commit 7feba74

Please sign in to comment.