Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Did you know that py::buffer_info::format has a different meaning on Windows? #1908

Open
jpivarski opened this issue Sep 4, 2019 · 18 comments

Comments

@jpivarski
Copy link

On Windows (both 32-bit and 64-bit Python),

std::string f(py::array a) {
    py::buffer_info info = a.request();
    return info.format;
}

returns "l" for a numpy.int32 array and "q" for a numpy.int64 array, whereas MacOS and Linux return "i" and "l", respectively. (That is, "l" is ambiguous.)

To be safe against misinterpretations, I'm using

auto b = a.cast<py::array_t<int64_t, py::array::c_style | py::array::forcecast>>();

to convert whatever I have into int64_t (and similarly for int32_t), if necessary. (Then, choosing int32_t vs int64_t based on platform is just an optimization, not needed for correctness.)

@jpivarski
Copy link
Author

The same is true on a 32-bit Docker image (i386/ubuntu):

  • numpy.int32format == "l"
  • numpy.int64format == "q"

whereas on 64-bit Linux:

  • numpy.int32format == "i"
  • numpy.int64format == "l"

Windows, regardless of whether it's 32-bit or 64-bit, take the same meanings as 32-bit Linux.

@jpivarski
Copy link
Author

Idioms like the following are successful on 32-bit Linux, 64-bit Linux, 32-bit Windows, and 64-bit Windows (and 64-bit MacOS; haven't tested 32-bit MacOS). It's possible that the __i386__ macro is only defined on gcc, so that's another thing I'll have to consider.

    if (format_.compare("d") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::float64);
    }
    else if (format_.compare("f") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::float32);
    }
#if defined _MSC_VER || defined __i386__
    else if (format_.compare("q") == 0) {
#else
    else if (format_.compare("l") == 0) {
#endif
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::int64);
    }
#if defined _MSC_VER || defined __i386__
    else if (format_.compare("Q") == 0) {
#else
    else if (format_.compare("L") == 0) {
#endif
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::uint64);
    }
#if defined _MSC_VER || defined __i386__
    else if (format_.compare("l") == 0) {
#else
    else if (format_.compare("i") == 0) {
#endif
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::int32);
    }
#if defined _MSC_VER || defined __i386__
    else if (format_.compare("L") == 0) {
#else
    else if (format_.compare("I") == 0) {
#endif
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::uint32);
    }
    else if (format_.compare("h") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::int16);
    }
    else if (format_.compare("H") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::uint16);
    }
    else if (format_.compare("b") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::int8);
    }
    else if (format_.compare("B") == 0  ||  format_.compare("c") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::uint8);
    }
    else if (format_.compare("?") == 0) {
      out = std::make_shared<PrimitiveType>(parameters_, PrimitiveType::boolean);
    }

@YannickJadoul
Copy link
Collaborator

I believe this is a combination of C/C++ not defining the sizes of integer types and the way numpy handles this.

Is there any particular issue when using this from pybind11, that doesn't match C/C++/numpy? Or do you suggest we add something to the docs somewhere, or ...?

@jpivarski
Copy link
Author

Ultimately, all I need is a work-around, so if I'm given a buffer_info, I know how to cast it using stdint types. Above, I gave my work-around in the hope that it would help others, but recently, we've been running into more format related bugs that we'll cross-reference to this issue. At the moment, we don't know what's going wrong in the current work-around, so I don't have a specific request yet.

If this does get figured out, documentation is probably all that's needed, maybe a small section at the end of this page, since that's where we learn about the buffer_info interface and how to work with NumPy arrays. Maybe the best thing would be to have a listing of all the possible values format can take, with how each should be cast? (I just ran into 'e' for float16, as a result of taking the square root of an int8 in NumPy. It would be better not to discover each case, one by one, in production code.)

@YannickJadoul
Copy link
Collaborator

This does seem to mainly be a numpy feature/issue, though, where the meaning of i, l, q is int, long, long long, the size of these being platform-dependent, and np.int32 and np.int64 are aliases:

Testing out with 64-bit manylinux2010 Docker image:

$ docker run -it --rm quay.io/pypa/manylinux2010_x86_64 bash
[root@9962fd909c16 /]# /opt/python/cp38-cp38/bin/pip install numpy
Collecting numpy
  Downloading numpy-1.19.0-cp38-cp38-manylinux2010_x86_64.whl (14.6 MB)
     |████████████████████████████████| 14.6 MB 97 kB/s 
Installing collected packages: numpy
Successfully installed numpy-1.19.0
[root@9962fd909c16 /]# /opt/python/cp38-cp38/bin/python
Python 3.8.3 (default, Jun 16 2020, 21:10:27) 
[GCC 8.3.1 20190311 (Red Hat 8.3.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.dtype(np.int32).char
'i'
>>> np.dtype(np.int64).char
'l'
>>> 

32-bit:

$ docker run -it --rm quay.io/pypa/manylinux2010_i686 bash
[root@795a929a35e8 /]# /opt/python/cp38-cp38/bin/pip install numpy
Collecting numpy
  Downloading numpy-1.19.0-cp38-cp38-manylinux2010_i686.whl (12.3 MB)
     |████████████████████████████████| 12.3 MB 513 kB/s 
Installing collected packages: numpy
Successfully installed numpy-1.19.0
[root@795a929a35e8 /]# /opt/python/cp38-cp38/bin/python
Python 3.8.3 (default, Jul  4 2020, 10:49:13) 
[GCC 8.3.1 20190311 (Red Hat 8.3.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.dtype(np.int32).char
'l'
>>> np.dtype(np.int64).char
'q'
>>> 

See e.g., here: https://numpy.org/doc/stable/reference/arrays.scalars.html#built-in-scalar-types

I guess we could refer to these numpy docs, in some warning, when talking about dtypes in pybind11? If so, do you have time to make a quick PR on that?

@jpivarski
Copy link
Author

I guess we could refer to these numpy docs, in some warning, when talking about dtypes in pybind11? If so, do you have time to make a quick PR on that?

I could. In the meantime, I'll use this space to try figuring things out, out in the open.

To try to get a handle on all the possible numeric types NumPy can handle, I scanned all objects attached to the numpy module, looking for array types, like this:

>>> [x for x in dir(numpy)
...  if isinstance(getattr(numpy, x), type) and issubclass(getattr(numpy, x), numpy.generic)]

then excluded any types that are not a leaf in the class hierarchy with:

[x for x in dir(numpy)
...  if isinstance(getattr(numpy, x), type) and issubclass(getattr(numpy, x), numpy.generic)]

Accounting for multiple names for the same type objects, nothing on this list has been left out of the above. Creating an array of each of these types, passing it into buffer_info, this is what I get for the format strings:

Linux 32-bit Linux 64-bit MacOS 64-bit Windows 32-bit Windows 64-bit diff?
bool_ ? ? ? ? ?
int8 b b b b b
int16 h h h h h
int32 l i i l l yes
int64 q l l q q yes
uint8 B B B B B
uint16 H H H H H
uint32 L I I L L yes
uint64 Q L L Q Q yes
intc i i i i i
uintc I I I I I
longlong q q q q q
ulonglong Q Q Q Q Q
float16 e e e e e
float32 f f f f f
float64 d d d d d
float128 N/A g g N/A N/A yes
complex64 Zf Zf Zf Zf Zf
complex128 Zd Zd Zd Zd Zd
complex256 N/A Zg Zg N/A N/A yes
datetime64 M M M M M
timedelta64 m m m m m
bytes_ 3s 3s 3s 3s 3s
str_ 3w 3w 3w 3w 3w
record T{...} T{...} T{...} T{...} T{...}
object_ O O O O O

On all systems where I could test it (Linux 64-bit, MacOS 64-bit, Windows 32-bit, and Windows 64-bit), Python 2.7 differed from Python 3.6+ only in that bytes_ and str_ are not distinguished (both are 3s). Numeric types are unaffected.

datetime64 and timedelta64 (M and m) both raised "ValueError: cannot include dtype 'M' in a buffer"; I'm using pybind 2.4.3.

A bug that we recently encountered involves a distinction between int64 and longlong. On 64-bit Linux, numpy.int64 is identically equal to numpy.intp, but numpy.longlong is a distinct object. I wonder why.

>>> numpy.int32, numpy.int64, numpy.intp, numpy.longlong
(<class 'numpy.int32'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.longlong'>)

@bstaletic
Copy link
Collaborator

On 64-bit Linux, numpy.int64 is identically equal to numpy.intp, but numpy.longlong is a distinct object. I wonder why.

I guess because intptr_t and int64_t are typedefs of long and long long is a distinct type.

@YannickJadoul
Copy link
Collaborator

I could. In the meantime, I'll use this space to try figuring things out, out in the open.

Thanks! And yes, please do; no problem at all to aggregate everything together in one place!

Creating an array of each of these types, passing it into buffer_info, this is what I get for the format strings:

Nice overview! So the main thing to note - I believe - is that intc, uintc, longlong, ulonglong, etc (did you somehow miss long and ulong?) are consistent across platforms (even if they have different physical sizes). So numpy defines the format strings based on the C types, and not based on the actual size of each integer type. We can debate on whether that's a good idea or not (very likely depends on the actual case), but there's not a lot we can do about it.

Note that you can compare dtype objects, btw:

>>> np.dtype(np.intc) == np.dtype(np.long)
False
>>> np.dtype(np.long) == np.dtype(np.longlong)
True

By some weird coincidence, I once made a PR on numpy about these dtype docstrings (during/after the sprints at EuroSciPy), and I while I don't remember all details, even writing docs for this was quite hairy given that you have these aliases. Maybe this is useful to look at: numpy/numpy#11858?

datetime64 and timedelta64 (M and m) both raised "ValueError: cannot include dtype 'M' in a buffer"; I'm using pybind 2.4.3.

If I'm not mistaken, there are some issues/PRs open on this, that we haven't gotten to yet.

@bstaletic
Copy link
Collaborator

I believe - is that intc, uintc, longlong, ulonglong, etc (did you somehow miss long and ulong?) are consistent across platforms (even if they have different physical sizes).

I think your conclusion is wrong here.

https://en.cppreference.com/w/cpp/language/types

Unix-like OS's always have int == 32bits, as does Win64. Same thing for long long - always 64 bits. That's why the format is consistent.

@YannickJadoul
Copy link
Collaborator

Unix-like OS's always have int == 32bits, as does Win64. Same thing for long long - always 64 bits. That's why the format is consistent.

Huh, interesting. But I don't think it changes anything. The thing is just that i is defined to be the format character for intc, according to https://numpy.org/doc/stable/reference/arrays.scalars.html#built-in-scalar-types, while the other, sized types (int32, etc) are just aliases to a type with that amount of bits.

@EricCousineau-TRI
Copy link
Collaborator

@jpivarski I believe your table effectively arises from the (rather complex) set of aliasing rules from this header file:
https://github.com/numpy/numpy/blob/6ef5ec39cdfaf77aa4600ec2e3bf9f679a4fd527/numpy/core/include/numpy/npy_common.h

In #1329, I thought I had aligned this with how it was defined in the latest version of NumPy (at least in time of writing); however, it does seem like that assumption may have been rather brittle, as noted by this issue.

@YannickJadoul pointed it out in this review comment:
#1329 (review)

I'm just wondering if there's a robust way to test this out in CI...

@jpivarski
Copy link
Author

For workarounds, I've pivoted from trying to interpret each format as a distinct type to using format only for what NumPy calls "kind" and getting the size from itemsize: https://github.com/scikit-hep/awkward-1.0/blob/dd8cd018373efe44740adaf5148a1bb9dafb3124/src/libawkward/util.cpp#L126-L221

For instance,

      else if (fmt == std::string("b")  ||
               fmt == std::string("h")  ||
               fmt == std::string("i")  ||
               fmt == std::string("l")  ||
               fmt == std::string("q")) {
        if (itemsize == 1) {
          return dtype::int8;
        }
        else if (itemsize == 2) {
          return dtype::int16;
        }
        else if (itemsize == 4) {
          return dtype::int32;
        }
        else if (itemsize == 8) {
          return dtype::int64;
        }

(Sometimes, there's also an endianness character, >, <, =, even if the dtype happens to be native endian, so we have to skip that, too.)

As you can see, I've mapped it to a platform-independent enum and I just use the enum subsequently (but keep the format string around, in case a structured array or array of strings has to be passed through but not interpreted).

That's what #1329 is about, right? If pybind provides a platform-independent enum, then there's nothing to document—we should just use that enum, right?

@YannickJadoul
Copy link
Collaborator

For instance,

      else if (fmt == std::string("b")  ||
               fmt == std::string("h")  ||
               fmt == std::string("i")  ||
               fmt == std::string("l")  ||
               fmt == std::string("q")) {
        if (itemsize == 1) {
          return dtype::int8;
        }
        else if (itemsize == 2) {
          return dtype::int16;
        }
        else if (itemsize == 4) {
          return dtype::int32;
        }
        else if (itemsize == 8) {
          return dtype::int64;
        }

I believe this makes sense, indeed.

Maybe there ought to be a better way to handle/expose this in pybind11, though. I suppose this mapping should be known at compile time?

If pybind provides a platform-independent enum, then there's nothing to document—we should just use that enum, right?

pybind11 doesn't use numpy's headers though. (I suppose since we don't want to depend on them? I don't know, actually.) Instead we access some hidden Python capsule in numpy that contains function pointers to the API. So this might be easier said that done.

@bstaletic
Copy link
Collaborator

From what I can tell, according to https://numpy.org/doc/stable/reference/arrays.scalars.html#built-in-scalar-types, there's a few things pybind could improve.

  1. Use NPY_SHORT_ instead of NPY_INT16, as to not assume the sizes of short, int and others.
  2. We're missing NPY_LONG and NPY_ULONG, which are specified to match a python int, as opposed to long.
  3. We're missing NPY_INTP and NPY_UINTP, corresponding to intptr_t and uintptr_t.
  4. A half precision float is missing, though that's missing from the C standard as well. However, there's not complex counterpart to the scalar f16.
  5. Finally py::object is missing its format descriptor, but I'm not sure if we want to support this one, since everything is predicated on std::is_arithmetic.

pybind11 doesn't use numpy's headers though. (I suppose since we don't want to depend on them? I don't know, actually.)

On my distro, numpy headers are in site-packages/numpy/core/include...

Instead we access some hidden Python capsule in numpy that contains function pointers to the API.

It's the numpy.core.multiarray._ARRAY_API capsule.

@wjakob
Copy link
Member

wjakob commented Jul 21, 2020

The replication of NumPy header contents was a conscious decision to break a problematic dependency. Otherwise every project that supports a function call with a py::array somewhere will require NumPy to be installed, header file search paths to be set up correctly, etc.

@YannickJadoul
Copy link
Collaborator

there's a few things pybind could improve.

Agree with all of these.
py::object would be nice, though, but that might just be a different improvement PR, rather than the one fixing the others.

Otherwise every project that supports a function call with a py::array somewhere will require NumPy to be installed, header file search paths to be set up correctly, etc.

Only for compilation, right; and if you're using numpy.h, you probably have numpy installed as developer? And not if you download a wheel as a user? But yes, this work, so why not keep it.

@henryiii
Copy link
Collaborator

By the way, the decision to avoid a NumPy dependency does make it easier to build - that's a difference between Cython wheel builds and PyBind11. With Cython, you have have min NumPy dependencies, something like this:

<something or other for 2.7 and 3.5, I don't remember>
numpy==1.12.1; python_version=="3.6"
numpy==1.14.5; python_version=="3.7"
numpy==1.17.3; python_version>="3.8"

When you build a wheel, you have to have the oldest version of NumPy supported, since a wheel built with a newer NumPy can't be used by an older NumPy, and older NumPy's don't have wheels/support for newer Pythons. Now modern tooling, like PEP 518's pyproject.toml, and FindPython natively supporting NumPy in CMake 3.14, and cibuildwheel, makes this much easier than it used to be, but it's still not as simple and elegant as PyBind11's not requiring it at all.

havogt pushed a commit to GridTools/gridtools that referenced this issue Apr 12, 2021
Python integer format char is ambiguous and platform dependent. PyBind11 `format_descriptor<...>::format()` always returns "q" and "Q" for 64bit integers, independent of the platform. Compatible passed-in Python buffers on the other hand might also have the equivalent format "l" or "L" set. See pybind/pybind11#1806 and pybind/pybind11#1908 for details. This fix introduces a special case for integer format comparisons, just checking size and signedness.
@rwgk
Copy link
Collaborator

rwgk commented May 19, 2023

FWIW, I had a run-in with this issue, too, but then found that there is already a solution, which is made more visible & accessible under #4674 (py::buffer_info::item_type_is_equivalent_to<T>).

ashvardanian added a commit to ashvardanian/SimSIMD that referenced this issue Oct 28, 2024
In Python, implementing type resolution across Windows
and Linux/macOS is tricky, as the same `l` specifier can
map to both `int32` and `int64`. Moreover, expanding to
to 8-byte integers the `u8` specifiers can mean both
8-bit and 8-byte integers, causing confusion.

The same issue is discussed in detail in the PyBind11 repo:
pybind/pybind11#1908
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants