Upgrade pyo3 to 0.16 #956

h-vetinari · 2022-03-21T06:33:01Z

Closes #934

h-vetinari · 2022-03-21T08:03:53Z

Looking at the abundance of CI errors, I'm not sure I'm going to be the best person to shepherd this to completion. I don't know rust that well, and pyo3/tokenizers even less. With some guidance I might get there, but I'm not going to be autonomous.

messense · 2022-03-21T08:21:40Z

@h-vetinari Here is the pyo3 migration guide: https://pyo3.rs/v0.15.1/migration.html

Narsil · 2022-03-21T08:27:28Z

Hi here.

First, I think we should move to 0.16 directly since it's the latest version of pyo3 (as long as we're making an update here, we might as well get the latest one), unless there are unintended breaking changes that would prevent this.

Narsil · 2022-03-21T08:28:34Z

For the linting, you can normally use make style within bindings/python to fix the formatting.
We also run clippy (cargo clippy) as part of linting.

messense · 2022-03-21T08:29:09Z

FYI, moving to pyo3 0.16 requires dropping Python 3.6 support.

Narsil · 2022-03-21T08:35:13Z

> FYI, moving to pyo3 0.16 requires dropping Python 3.6 support.

Will it refuse to compile? If that's the case, it's not great.
tokenizers stopped building for 3.6 because there's no GH runner for it anymore, but it would be better if people could still build it themselves. I don't see any killer feature for 0.16 in the changelog.

messense · 2022-03-21T08:40:58Z

> Will it refuse to compile? If that's the case, it's not great.

I think so.

Python 3.6 reached EOL on 23 Dec 2021, so users should upgrade if they care about security.

h-vetinari · 2022-03-21T08:41:34Z

If someone wants to push into this PR, I'd be thrilled to receive support (and make people collaborators on my fork if necessary).

I mainly attempted this because pyo3>=0.15 is a hard requirement to support python 3.10 in conda-forge, and several NLP packages are blocked on not having tokenizers for 3.10.

h-vetinari · 2022-03-21T08:43:15Z

However, a minimal backport of #650 to tags/python-v0.11.6 also didn't work.

It fails with something like:

import: 'tokenizers'
TypeError: type 'tokenizers.models.Model' is not an acceptable base type
thread '<unnamed>' panicked at 'An error occurred while initializing class BPE', /home/conda/feedstock_root/build_artifacts/tokenizers_1647850774618/_build_env/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.15.0/src/type_object.rs:102:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
pyo3_runtime.PanicException: An error occurred while initializing class BPE
thread '<unnamed>' panicked at 'Python API call failed'

h-vetinari · 2022-03-21T08:45:26Z

PS. I don't care much about pyo3 0.15 or 0.16, though IMO python 3.6 support should really not be a determining factor in anything anymore. For perspective - a lot of projects following NEP 29 (among them numpy, scipy, pandas, etc.) also dropped support for python 3.7 in their latest releases already.

Narsil · 2022-03-21T09:12:33Z

> PS. I don't care much about pyo3 0.15 or 0.16, though IMO python 3.6 support should really not be a determining factor in anything anymore. For perspective - a lot of projects following NEP 29 (among them numpy, scipy, pandas, etc.) also dropped support for python 3.7 in their latest releases already.

Fair enough ! Let's try 0.16 then.

messense · 2022-03-21T09:23:44Z

Hey @h-vetinari, I've sent you a PR to upgrade to 0.16: h-vetinari#1

h-vetinari · 2022-03-21T09:27:08Z

> Hey @h-vetinari, I've sent you a PR to upgrade to 0.16: h-vetinari#1

Thanks a lot! 🙃

I sent you an invite to collaborate on my fork, then you can push into this PR directly

messense · 2022-03-21T11:00:46Z

TypeError: PreTokenizedEncodeInput must be Union[PreTokenizedInputSequence, Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence]]

I'm not sure what changed in rust-numpy or pyo3 that makes this test case fail: https://github.com/huggingface/tokenizers/runs/5625562727?check_suite_focus=true

@adamreichold Any idea?

adamreichold · 2022-03-21T11:24:15Z

> @adamreichold Any idea?

I am sorry but I have a hard time following the layers here. My first impression is that the test in question does not even reach the Rust code yet but fails already in the Python around it? That said, I think the best candidate for surfacing typing issues at runtime is that before 0.16, rust-numpy (incorrectly) did not check element type and dimension when downcasting to arrays, i.e. PyO3/rust-numpy#265 (I did not find any mention of downcasts with PyArray though. Actually I did not find PyArray mentioned at all?)

messense · 2022-03-21T11:26:42Z

> My first impression is that the test in question does not even reach the Rust code yet but fails already in the Python around it?

It's rejected here in Rust code

tokenizers/bindings/python/src/tokenizer.rs

Lines 420 to 423 in 1bb9884

    
    Err(exceptions::PyTypeError::new_err(
        "PreTokenizedEncodeInput must be Union[PreTokenizedInputSequence, \
        Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence]]",
    ))

messense · 2022-03-21T11:28:44Z

And I suspect it has something to do with this code in PyArrayUnicode or PyArrayStr

tokenizers/bindings/python/src/tokenizer.rs

Lines 259 to 340 in 1bb9884

    
    struct PyArrayUnicode(Vec<String>);

    impl FromPyObject<'_> for PyArrayUnicode {
        fn extract(ob: &PyAny) -> PyResult<Self> {
            let array = ob.downcast::<PyArray1<u8>>()?;
            let arr = array.as_array_ptr();
            let (type_num, elsize, alignment, data) = unsafe {
                let desc = (*arr).descr;
                (
                    (*desc).type_num,
                    (*desc).elsize as usize,
                    (*desc).alignment as usize,
                    (*arr).data,
                )
            };
            let n_elem = array.shape()[0];

            // type_num == 19 => Unicode
            if type_num != 19 {
                return Err(exceptions::PyTypeError::new_err(
                    "Expected a np.array[dtype='U']",
                ));
            }

            unsafe {
                let all_bytes = std::slice::from_raw_parts(data as *const u8, elsize * n_elem);
                let seq = (0..n_elem)
                    .map(|i| {
                        let bytes = &all_bytes[i * elsize..(i + 1) * elsize];
                        let unicode = pyo3::ffi::PyUnicode_FromUnicode(
                            bytes.as_ptr() as *const _,
                            elsize as isize / alignment as isize,
                        );
                        let gil = Python::acquire_gil();
                        let py = gil.python();
                        let obj = PyObject::from_owned_ptr(py, unicode);
                        let s = obj.cast_as::<PyString>(py)?;
                        Ok(s.to_string_lossy().trim_matches(char::from(0)).to_owned())
                    })
                    .collect::<PyResult<Vec<_>>>()?;

                Ok(Self(seq))
            }
        }
    }

    impl From<PyArrayUnicode> for tk::InputSequence<'_> {
        fn from(s: PyArrayUnicode) -> Self {
            s.0.into()
        }
    }

    struct PyArrayStr(Vec<String>);

    impl FromPyObject<'_> for PyArrayStr {
        fn extract(ob: &PyAny) -> PyResult<Self> {
            let array = ob.downcast::<PyArray1<u8>>()?;
            let arr = array.as_array_ptr();
            let (type_num, data) = unsafe { ((*(*arr).descr).type_num, (*arr).data) };
            let n_elem = array.shape()[0];

            if type_num != 17 {
                return Err(exceptions::PyTypeError::new_err(
                    "Expected a np.array[dtype='O']",
                ));
            }

            unsafe {
                let objects = std::slice::from_raw_parts(data as *const PyObject, n_elem);
                let seq = objects
                    .iter()
                    .map(|obj| {
                        let gil = Python::acquire_gil();
                        let py = gil.python();
                        let s = obj.cast_as::<PyString>(py)?;
                        Ok(s.to_string_lossy().into_owned())
                    })
                    .collect::<PyResult<Vec<_>>>()?;

                Ok(Self(seq))
            }
        }
    }

adamreichold · 2022-03-21T11:33:48Z

> And I suspect it has something to do with this code in PyArrayUnicode or PyArrayStr

This would indeed point to the downcast fixes, so it probably only worked by accident before. I think replacing the ? with .unwrap() in

tokenizers/bindings/python/src/tokenizer.rs

Line 262 in 1bb9884

let array = ob.downcast::<PyArray1<u8>>()?;

might shed some light on this.

messense · 2022-03-21T11:54:21Z

Fails with PyDowncastError

pyo3_runtime.PanicException: called Result::unwrap() on an Err value: PyDowncastError { from: array(['My', 'name', 'is', 'John'], dtype='<U7'), to: "PyArray<T, D>" }

I guess tokenizers also wants PyO3/rust-numpy#141

adamreichold · 2022-03-21T12:33:06Z

> Fails with PyDowncastError
>
> pyo3_runtime.PanicException: called Result::unwrap() on an Err value: PyDowncastError { from: array(['My', 'name', 'is', 'John'], dtype='<U7'), to: "PyArray<T, D>" }

This should not have worked in the first place, as <U7 would be an array whose elements each hold 7 Py_UCS4 code points (28 bytes per element) stored in little-endian byte order. So treating it as an array of single-byte u8 elements should yield incorrect strides at the very least.
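To make the layout concrete, here is a minimal std-only sketch of how a `<U7`-style buffer could be decoded without going through the Python C API. The function names `decode_fixed_width_ucs4` and `encode_fixed_width_ucs4` are hypothetical and not part of tokenizers, pyo3 or rust-numpy; the sketch assumes a contiguous little-endian buffer and stops at the first NUL padding code point, which differs slightly from the `trim_matches` call in the code under discussion.

```rust
/// Decode a NumPy-style fixed-width Unicode buffer (dtype `<U{width}`) into strings.
///
/// Assumption for this sketch: each element occupies `width * 4` bytes, i.e.
/// `width` UCS4 code points stored as little-endian u32 values and padded with
/// trailing NUL code points. Returns `None` on a malformed buffer.
fn decode_fixed_width_ucs4(data: &[u8], width: usize) -> Option<Vec<String>> {
    let elsize = width * 4; // bytes per element, e.g. 28 for `<U7`
    if elsize == 0 || data.len() % elsize != 0 {
        return None;
    }
    data.chunks_exact(elsize)
        .map(|element| {
            element
                .chunks_exact(4)
                .map(|cp| u32::from_le_bytes([cp[0], cp[1], cp[2], cp[3]]))
                .take_while(|&cp| cp != 0) // stop at the NUL padding
                .map(char::from_u32) // yields None for invalid code points
                .collect::<Option<String>>()
        })
        .collect()
}

/// Hypothetical helper for the example: lay strings out the way the sketch
/// above expects, truncating or NUL-padding each one to `width` code points.
fn encode_fixed_width_ucs4(words: &[&str], width: usize) -> Vec<u8> {
    let mut out = Vec::new();
    for word in words {
        let mut cps: Vec<u32> = word.chars().map(|c| c as u32).take(width).collect();
        cps.resize(width, 0);
        for cp in cps {
            out.extend_from_slice(&cp.to_le_bytes());
        }
    }
    out
}
```

Encoding the four words from the error message with `width = 7` gives a buffer of 4 * 28 = 112 bytes, which is why viewing the same memory as one-byte `u8` elements cannot describe it correctly.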

> I guess tokenizers also wants PyO3/rust-numpy#141

This would be the best solution, but the existing code does not really use the API provided by rust-numpy anyway (which is why this worked in the past), so I think an immediate fix would be to just use the PyArray_Check directly (which checks only if it is an array but does not consider element type and dimensionality):

So instead of

fn extract(ob: &PyAny) -> PyResult<Self> {
    let array = ob.downcast::<PyArray1<u8>>()?;
    let arr = array.as_array_ptr();
...

one could do

fn extract(ob: &PyAny) -> PyResult<Self> {
    // PyArray_Check is an FFI call, hence the unsafe block.
    if unsafe { npyffi::PyArray_Check(ob.py(), ob.as_ptr()) } == 0 {
        return Err(exceptions::PyTypeError::new_err(
            "Expected an np.array",
        ));
    }
    let arr = ob.as_ptr() as *mut npyffi::PyArrayObject;
...

and the rest of the code should continue to work as-is. (I have not actually tried to compile this so there are certainly errors in there.)
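The difference between the two checks can be illustrated with a toy model in plain Rust. Everything here (`ElemType`, `Value`, `check_is_array`, `check_typed`) is hypothetical and only mimics the behavior described in this thread; it is not the rust-numpy API.

```rust
/// Hypothetical element types for the toy model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ElemType {
    U8,
    Unicode,
    Object,
}

/// Hypothetical stand-in for a Python object seen from Rust.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Value {
    NotAnArray,
    Array { elem: ElemType, ndim: usize },
}

/// Lenient check, in the spirit of PyArray_Check: only asks "is it an array?".
fn check_is_array(v: Value) -> Result<(), String> {
    match v {
        Value::Array { .. } => Ok(()),
        Value::NotAnArray => Err("not an array".to_string()),
    }
}

/// Strict check, in the spirit of a typed downcast: element type and
/// dimensionality must both match what the caller asked for.
fn check_typed(v: Value, want_elem: ElemType, want_ndim: usize) -> Result<(), String> {
    match v {
        Value::Array { elem, ndim } if elem == want_elem && ndim == want_ndim => Ok(()),
        Value::Array { elem, ndim } => Err(format!(
            "expected {:?} with {} dimension(s), got {:?} with {}",
            want_elem, want_ndim, elem, ndim
        )),
        Value::NotAnArray => Err("not an array".to_string()),
    }
}
```

In this model a one-dimensional Unicode array passes `check_is_array` but fails `check_typed(.., ElemType::U8, 1)`, which mirrors the PyDowncastError seen above.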

adamreichold · 2022-03-21T13:17:51Z

I am sorry that I did not try this out myself, but from reading the code, I think the part

let shape =
    unsafe { slice::from_raw_parts((*arr).dimensions as *mut usize, (*arr).nd as usize) };
let n_elem = shape[0];

should probably check nd for correctness and also verify contiguousness as it then accesses the data as a slice (Even a one-dimensional array could have non-unit strides.), i.e.

if (*arr).nd != 1 { /* return dimensionality error */ }
let n_elem = *(*arr).dimensions;

if (*arr).flags & (npyffi::NPY_ARRAY_C_CONTIGUOUS | npyffi::NPY_ARRAY_F_CONTIGUOUS) == 0 { /* return non-contiguous error */ }
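Why the contiguity check matters can be shown with a small std-only sketch. `View1` is a hypothetical stand-in for a one-dimensional array header, with the stride counted in elements for simplicity (NumPy counts strides in bytes); it is not a rust-numpy type.

```rust
/// Toy one-dimensional view: a backing buffer plus a logical length and a
/// stride counted in elements.
struct View1<'a> {
    data: &'a [i32],
    len: usize,
    stride: usize,
}

impl<'a> View1<'a> {
    /// A view is contiguous when consecutive elements are adjacent in memory.
    fn is_contiguous(&self) -> bool {
        self.len <= 1 || self.stride == 1
    }

    /// Borrow the elements as a plain slice, refusing non-contiguous views
    /// instead of silently returning the wrong elements.
    fn as_slice(&self) -> Result<&[i32], &'static str> {
        if !self.is_contiguous() {
            return Err("array is not contiguous");
        }
        self.data
            .get(..self.len)
            .ok_or("length exceeds backing buffer")
    }

    /// Stride-aware copy that works for any layout.
    /// Panics if the view reaches past the backing buffer.
    fn to_vec(&self) -> Vec<i32> {
        (0..self.len).map(|i| self.data[i * self.stride]).collect()
    }
}
```

For a view over `[10, 11, 12, 13, 14, 15]` with `len = 3` and `stride = 2`, the logical elements are 10, 12 and 14, whereas reading the first three slots as a slice would give 10, 11 and 12. Rejecting such views, or indexing through the stride, avoids that mismatch.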

adamreichold · 2022-03-21T16:11:09Z

bindings/python/Cargo.toml

-numpy = "0.12"
-ndarray = "0.13"
+env_logger = "0.9.0"
+pyo3 = "0.16.2"


I think the remaining test failures could be resolved by adding

    resolver = "2" # or edition = "2021"

    [dev-dependencies]
    pyo3 = { version = "0.16", features = ["auto-initialize"] }

This used to be part of the default features but is not any more since 0.14.

(If Rust 1.51 which introduced the resolver = "2" option is too new, then the feature can just be added to the normal [dependencies] entry.)

Narsil · 2022-03-21T18:17:37Z

Hi @messense ,

I don't have time today to do a full review (I tried to make the tests run during the day so you could see what was happening).

This is becoming a very big PR, which I don't think is a good thing.

Do you mind adding comments on the PR yourself explaining what is going on and why the changes are important? It would help me tremendously to review faster (otherwise I will just ask questions :))

The unsafe calls are basically a big NO in tokenizers.

adamreichold · 2022-03-21T18:29:13Z

> The unsafe calls are basically a big NO in tokenizers.

If this refers to the calls related to npyffi, this is not materially "more unsafe" than it already was, as the old (and incorrect) version of downcast::<PyArray1<u8>> was doing the exact same thing. The only difference is the manual access to the dimensions, but I would say that this is balanced out by adding the check for contiguous arrays, which is missing on main (alternatively, the accesses would need to consider the array's stride to be fully general).

Hopefully, we will be able to implement PyO3/rust-numpy#141 eventually and the whole business can be done using safe code.

adamreichold · 2022-03-21T18:36:23Z

bindings/python/src/tokenizer.rs

        let (type_num, data) = unsafe { ((*(*arr).descr).type_num, (*arr).data) };
-        let n_elem = array.shape()[0];

        if type_num != 17 {


I just noticed that this second case is not about a Unicode array, but actually an array containing PyObject, which we do support since 0.16. So this whole method should be able to become safe by using downcast::<PyArray1<PyObject>>() and then array.readonly().as_array().iter(), which would even remove the requirement of a contiguous array.

Narsil · 2022-03-23T08:55:55Z

Thank you so much for this !

This is a very cool and valuable PR.

In terms of merging, what I plan to do is first release 0.12 without this change. That release contains some slight backward-incompatible changes (for the decoder) and the ability to drop the regex in ByteLevel (those are already on master). These are relatively significant changes and are necessary for HF's bigscience project https://bigscience.huggingface.co/.

I will probably wait a week or so afterwards to make sure those changes have no unintended consequences and we have a safe base for our Bigscience project.

Then I will merge this PR, and release 0.13 probably shortly after with all due tests.

McPatate

👌🏻

Narsil · 2022-03-28T12:52:38Z

Following the trend of other HF repos, we're moving to the main branch instead of master.

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

is all that is needed on your end.

h-vetinari · 2022-04-12T22:07:18Z

Sorry, messed up the rebase. 🤦

Fixing it

h-vetinari · 2022-04-14T23:17:17Z

Hey @Narsil 👋

I'm happy to keep rebasing this PR, but just wanted to check where things stand currently with the plans for 0.13.0 🙃

Narsil · 2022-04-19T12:30:21Z

Hi @h-vetinari ,

It's coming. Don't worry about rebasing unless you want to; I can always rebase later.
Currently, as mentioned, we're letting the version with just the changes needed for bigscience prove itself before merging this PR. In the end, 0.12.1 was only released last week (0.12.0 had a breaking change which ended up being pretty bad for transformers, so it had to be reverted, and it took some time to run the full test suite before getting 0.12.1 done).

h-vetinari · 2022-05-19T08:37:20Z

Thanks for merging this @Narsil! :)

Any timeline for 0.13? 🙃

Narsil · 2022-05-19T12:20:19Z

Unfortunately no definite timeline. As you might have guessed, handling tokenizers is only a little part of what I do at HF and releases do take quite a bit of attention.

h-vetinari · 2022-06-27T06:55:58Z

A very gentle ping for a tokenizer release with the updated pyo3 :)

h-vetinari · 2022-08-01T22:05:01Z

Another month, another ping... 🙃

Narsil · 2022-09-19T14:02:47Z

Hey @h-vetinari, sorry this took quite a long time; there's definitely a lot to do and I basically work on this in my spare time.

I wanted to release 0.13.0 today, but afaik I cannot because manylinux2010 wheel is built with a static interpreter and I don't really know how to fix that issue.

How can we run the manylinux build and make it work?

I had 2 ideas:

1. Find some quay image with a shared Python interpreter (I couldn't find one, even in the crates recommended for distribution).
2. Remove auto-initialize just for those manylinux builds and embed the interpreter for them, but it seems this needs a bit more work: Enabling static interpreter embedding for manylinux. #1064

davidhewitt · 2022-09-19T15:34:33Z

@Narsil do you have the error from the manylinux2010 build? Maybe I can offer insight.

Narsil · 2022-09-19T20:05:21Z

Yes, it claims that it's using a static interpreter (which it is), but the extension-module feature should be used, which should (afaik) disable the warning and let it compile properly.

h-vetinari changed the title ~~Upgrade pyo3 to 0.15 (redux)~~ Upgrade pyo3 to 0.16 Mar 21, 2022

h-vetinari mentioned this pull request Mar 21, 2022

Rebuild for python310 (redux) conda-forge/tokenizers-feedstock#40

Closed

messense force-pushed the pyo3 branch 2 times, most recently from 02210b8 to f4e3d48 Compare March 21, 2022 14:43

adamreichold reviewed Mar 21, 2022


messense force-pushed the pyo3 branch from f4e3d48 to 05f3204 Compare March 22, 2022 05:36

messense force-pushed the pyo3 branch from 79c5022 to 954361c Compare March 23, 2022 01:39

Narsil requested review from McPatate and mishig25 March 23, 2022 09:34

McPatate approved these changes Mar 24, 2022


h-vetinari force-pushed the pyo3 branch from 954361c to 2b72e00 Compare April 12, 2022 22:05

h-vetinari force-pushed the pyo3 branch from 2b72e00 to 87f157c Compare April 12, 2022 22:08

messense added 8 commits April 15, 2022 10:21

98c8786  Upgrade pyo3 to 0.15
         Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>
a532fd6  Upgrade pyo3 to 0.16
         Rebase-conflicts-fixed-by: H. Vetinari <h.vetinari@gmx.com>
0d942cd  Install Python before running cargo clippy
69b1f6b  Fix clippy warnings
fec2512  Use PyArray_Check instead of downcasting to PyArray1<u8>
2343a7c  Enable auto-initialize of pyo3 to fix `cargo test --no-default-features`
7341bb4  Fix some test cases
         Why do they change?
b2955fb  Refactor and add SAFETY comments to PyArrayUnicode
         Replace deprecated `PyUnicode_FromUnicode` with `PyUnicode_FromKindAndData`

h-vetinari force-pushed the pyo3 branch from 87f157c to b2955fb Compare April 14, 2022 23:22

Narsil merged commit 519cc13 into huggingface:main May 5, 2022

h-vetinari deleted the pyo3 branch May 5, 2022 16:07

h-vetinari mentioned this pull request Oct 24, 2022

Run upstream test suite conda-forge/tokenizers-feedstock#55

Merged
