Fix stride condition. (#1321)
* Release all at once for simplicity.

* rc2
Narsil authored Aug 14, 2023
1 parent b35d33f commit 9a93c50
Showing 3 changed files with 3 additions and 3 deletions.
2 changes: 1 addition & 1 deletion bindings/python/py_src/tokenizers/__init__.py
@@ -1,4 +1,4 @@
-__version__ = "0.13.4.rc1"
+__version__ = "0.13.4.rc2"
 
 from enum import Enum
 from typing import List, Tuple, Union
2 changes: 1 addition & 1 deletion bindings/python/setup.py
@@ -9,7 +9,7 @@
 
 setup(
     name="tokenizers",
-    version="0.13.4.rc1",
+    version="0.13.4.rc2",
     description="Fast and Customizable Tokenizers",
     long_description=open("README.md", "r", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
2 changes: 1 addition & 1 deletion tokenizers/src/tokenizer/mod.rs
@@ -605,7 +605,7 @@ where
         if let Some(trunc_params) = &trunc {
             let n_added_tokens = self.get_n_added_tokens(false);
             let effective_max_length = trunc_params.max_length - n_added_tokens;
-            if effective_max_length <= trunc_params.stride {
+            if effective_max_length < trunc_params.stride {
                 return Err(Box::new(TruncationParamError(format!(
                     "tokenizer stride set to {}, which is greater than or equal to its effective max length of {} (= {} original max length - {} added special tokens), ",
                     trunc_params.stride, effective_max_length, trunc_params.max_length, n_added_tokens

boyleconnor (Contributor) commented on Aug 17, 2023:

@Narsil why was this changed? I am pretty sure it should be <=, not < (further downstream, it will panic if !(stride < max_len)).

I tested with the following code (reminder: BERT adds exactly 2 special tokens in this case):

    tokenizer = Tokenizer.from_pretrained('bert-base-cased')
    tokenizer.enable_truncation(max_length=10, stride=8)  # This should fail but doesn't

which does not raise an error (even though it should). But an error is produced when the following is run:

    tokenizer.encode("This piece of text is at least ten tokens long. In fact, it is likely many more than that.")

    Traceback (most recent call last):
      File "/home/connor/PycharmProjects/tokenizers/test_stride_warning.py", line 11, in <module>
        main()
      File "/home/connor/PycharmProjects/tokenizers/test_stride_warning.py", line 7, in main
        tokenizer.encode("This piece of text is at least ten tokens long. In fact, it is likely many more than that.")
    pyo3_runtime.PanicException: `stride` must be strictly less than `max_len=8` (note that `max_len` may be shorter than the max length of the original model, as it subtracts the number of special characters
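The off-by-one the comment describes can be sketched outside the library. This is a minimal sketch with hypothetical helper names (not tokenizers API), assuming BERT-style n_added_tokens = 2: the commit's check rejects only stride strictly greater than the effective max length, while the downstream truncation requires stride to be strictly less than it, so the boundary case stride == effective_max_length passes validation and then panics.

```python
def validates(max_length, stride, n_added_tokens=2):
    # Mirrors the check after this commit: reject only when the
    # effective length is strictly smaller than the stride.
    effective_max_length = max_length - n_added_tokens
    return not (effective_max_length < stride)

def downstream_requires(max_length, stride, n_added_tokens=2):
    # The later truncation step panics unless stride is STRICTLY
    # less than the effective length.
    effective_max_length = max_length - n_added_tokens
    return stride < effective_max_length

# Boundary case from the comment: max_length=10, stride=8, 2 added tokens,
# so effective_max_length == stride == 8. Validation passes, downstream fails.
assert validates(10, 8) is True
assert downstream_requires(10, 8) is False

# One token of slack satisfies both conditions.
assert validates(10, 7) and downstream_requires(10, 7)
```

Restoring the pre-commit `<=` comparison would make `validates` and `downstream_requires` agree on the boundary case.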
