TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open
28 of 41 tasks
Tracked by #59664
jorisvandenbossche opened this issue Aug 28, 2023 · 15 comments
Labels
API Design Arrow pyarrow functionality Strings String extension data type and string data

@jorisvandenbossche
Member

jorisvandenbossche commented Aug 28, 2023

Overview of work for the future string dtype (PDEP-14).

Main implementation:

Testing related:

Open design questions / behaviour changes to implement:

Known bugs that need to be fixed:

Documentation:


[original issue body]

With PDEP-10 (#52711, https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html), we decided to start using pyarrow for the default string data type in pandas 3.0.

For pandas 2.1, an option was added to already enable this future default data type, and then the various ways to construct a DataFrame (type inference in the constructor, IO methods) will use the new string dtype as default:

>>> import pandas as pd
>>> pd.options.future.infer_string = True

>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: string

>>> pd.Series(["a", "b", None]).dtype
string[pyarrow_numpy]

This is documented at https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings

One aspect that was discussed after the PDEP (mostly at the sprint, I think; creating this issue for a better public record of it) is that a data type that would become the default in pandas 3.0 (which for the rest still uses all numpy dtypes with numpy NaN missing value semantics) should probably also still use the same default semantics and result in numpy data types when doing operations on the string column that result in a boolean or numeric data type (e.g. .str.startswith(..), .str.len(..), .str.count(), etc., or comparison operators like ==).
(This way, a user only gets an ArrowDtype column when explicitly asking for one, and not by default through using the default string dtype.)

To achieve this, @phofl has done several PRs to refactor the current pyarrow-based string dtypes, to add another variant which uses StringDtype(storage="pyarrow_numpy") instead of ArrowDtype("string"). From the updated whatsnew: "This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return np.nan as the missing value indicator". Main PR:

plus some follow-ups (#54720, #54585, #54591).

cc @pandas-dev/pandas-core

@jorisvandenbossche
Member Author

jorisvandenbossche commented Aug 28, 2023

One general question is about naming: currently, this data type is implemented as StringDtype(storage="pyarrow_numpy") (a parametrized instance of the existing StringDtype), thus using the "pyarrow_numpy" name (which can also be seen in the repr of the dtype, as "string[pyarrow_numpy]").
And the actual array class used is ArrowStringArrayNumpySemantics, expanding the existing ArrowStringArray.

I don't think "pyarrow_numpy" is a great name, but we also couldn't directly think of something better. The StringDtype(storage="pyarrow") is already taken by the existing version of that (using "nullable" semantics), and changing that to mean this new string dtype would break existing users of this (although we officially still label this as experimental, and could still change it).

In general, ideally not too many users should actually directly use the term "pyarrow_numpy". When the future option is enabled, I think we should ensure one can simply use e.g. astype("string") and get this new default (without having to do astype("string[pyarrow_numpy]")). Note: this is currently not yet the case; opening a separate issue to discuss this -> #54793

@nickeubank
Contributor

I recognize I'm late to this, but out of curiosity, why use pyarrow string arrays instead of using numpy structured types for unicode strings (e.g., dtype='<U12' or whatever length)?

I understand the performance issues with pandas object dtype (it's just an array of references to Python strings), but I thought the structured numpy unicode dtypes avoided all these issues, and wouldn't require a mixed-syntax/implementation.

@bashtage
Contributor

> I recognize I'm late to this, but out of curiosity, why use pyarrow string arrays instead of using numpy structured types for unicode strings (e.g., dtype='<U12' or whatever length)?

NumPy only supports rectangular arrays of strings. So '<U12' requires 12*4 bytes (it uses UTF-32 encoding) for every entry, irrespective of its size or the characters used. More efficient storage methods use ragged arrays, where usually 2 things are stored in the array: a memory address of the actual UTF-8 string and the length of the string. Consider the simple example:

a
abcdefghijkl

In NumPy this array requires 96 bytes for storage (+ overheads). In an efficient encoding this requires something on the order of 1 (a) + 8 (memory address) + 8 (length, assuming int64) + 12 (other string) + 8 + 8 = 45 bytes, which is about 50%. If an array is very sparse (say it has one very long string and the rest are short), then the ratio of space required can get really bad.

NumPy is working on first-party support for ragged UTF strings in NumPy 2.0(ish), which requires a new way to define dtypes.
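
The storage arithmetic in the comment above can be reproduced with a small sketch that uses only the Python standard library. The helper names `fixed_width_bytes` and `ragged_bytes` are hypothetical, and the 8-byte address plus 8-byte length per entry is the assumption stated in the comment, not a description of the actual Arrow memory layout.

```python
def fixed_width_bytes(strings):
    """Bytes for a NumPy-style '<U{n}' array: every entry takes n * 4 bytes (UTF-32)."""
    width = max(len(s) for s in strings)  # n is set by the longest string
    return len(strings) * width * 4


def ragged_bytes(strings, address_size=8, length_size=8):
    """Bytes for a ragged layout: UTF-8 payload plus an address and a length per entry.

    The 8 + 8 bytes of per-entry overhead are an assumption taken from the comment.
    """
    return sum(len(s.encode("utf-8")) + address_size + length_size for s in strings)


strings = ["a", "abcdefghijkl"]
fixed = fixed_width_bytes(strings)
ragged = ragged_bytes(strings)
print(fixed, ragged, round(ragged / fixed, 2))
```

Replacing the second string with a much longer one shows the effect described for sparse arrays: the fixed-width cost grows for every entry, while the ragged cost grows only by the extra payload.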

@nickeubank
Contributor

Ah, ok, thanks! I knew about that issue, but hadn't realized Arrow strings did something different (which I infer is the case from context). Appreciate the clarification!

@WillAyd
Member

WillAyd commented Feb 16, 2024

After discussing this on Slack, I don't think that the new string dtype should use pyarrow storage with numpy NaN semantics. That may help internal development transition to 3.0, but makes for a confusing long term strategy for our end users. I feel like we are going to pay a heavy price over time to constantly clarify what the return type should be for various algorithms.

@phofl pointed out to me in slack that we would have the following string types:

  • string[pyarrow_numpy]
  • string[pyarrow]
  • pd.ArrowDtype(pa.string())

And with NumPy 2.0 there is the potential for a native NumPy string dtype.

So if we take these and apply an operation like .str.len you may get the following return types:

  • string[pyarrow_numpy] - yields np.int64 or np.float64 (depending on presence of NaNs)
  • string[pyarrow] - yields Int64
  • pd.ArrowDtype(pa.string()) - yields int64[pyarrow]
  • NumPy 2.0 string type - TBD

The only type I was expecting from the above to be yielded is int64[pyarrow], which is also tucked away in arguably the most obscure string dtype. Stated explicitly, I think pandas 3.0 should use string[pyarrow] as the default and have algorithms applied to that return pyarrow types

The sheer number of iterations that can be produced from these different variants gets very confusing; I think the simpler story in the long term is that we just have NumPy / Arrow arrays and any algorithms applied to those yield whatever the underlying library provides

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 16, 2024

> That may help internal development transition to 3.0

To be clear, the motivation of this was not for internal development ease (no, we actually needed to add yet another ExtensionArray for it), but to help user transition (less breaking change for users)

> The sheer number of iterations that can be produced from these different variants gets very confusing

I won't deny that this listing is confusing ... but I would personally argue that the current plan reduces the number of variants for users. Currently, with the planned "pyarrow_numpy" variant as the default string dtype (and without enabling a custom experimental setting), a user will only see 1 string dtype, and only 1 kind of integer type, only 1 bool type, etc.

If we would let the default string dtype return pyarrow types, and you have a workflow with some operations that involve string columns, you can end up with a dataframe with both numpy and pyarrow dtypes, while the user never asked for pyarrow columns in the first place. And then you have one numeric column that uses NaN as the missing value, and one numeric column that uses NA as the missing value (and treats NaN as not missing). Or you have one datetime column that has a datetime64[ns] dtype, and another datetime column that uses timestamp[us][pyarrow].
I think such a situation is much more confusing. Especially because this would happen for all users by default, while you only get the variations in string dtypes you list above when specifically enabling some option.

> I think the simpler story in the long term is that we just have NumPy / Arrow arrays and any algorithms applied to those yield whatever the underlying library provides

We are certainly not there (and need to discuss this more), but IMO the simpler story in the long term is that we just have pandas arrays and data types (and the average user doesn't have to care about whether it's numpy or pyarrow under the hood).

@WillAyd
Member

WillAyd commented Feb 16, 2024

Thanks for those clarifications @jorisvandenbossche - very insightful.

> If we would let the default string dtype return pyarrow types, and you have a workflow with some operations that involve string columns, you can end up with a dataframe with both numpy and pyarrow dtypes, while the user never asked for pyarrow columns in the first place

I recognize this is not ideal but I'm also not sure it is that big of a problem given pandas type system history. Is it that different from:

ser = pd.Series(["abc", None])
ser.str.len()
0    3.0
1    NaN
dtype: float64

Giving a different return type than:

ser = pd.Series(["abc", "def"])
ser.str.len()
0    3
1    3
dtype: int64

? Especially for primitive types I don't see the distinction between pyarrow / numpy array types being all that important, particularly since those can be zero-copy.

> but IMO the simpler story in the long term is that we just have pandas arrays and data types (and the average user doesn't have to care about whether it's numpy or pyarrow under the hood)

The problem I foresee with this is that it limits what users can do to the common denominator of the underlying libraries. If coming from Arrow, you lose streaming support, bitmasking and nullability handling when trying to make a compatibility layer with NumPy. For the inverse, your arrays become limited to 1-D. For types that exist in one library or the other, we would arguably be adding another layer that just isn't necessary. I think doing this prevents us from really utilizing the strengths of either library.

If users wanted to stick with NumPy semantics exclusively I think the new NumPy string dtype should be the right choice in the long term. I don't believe that existed at the time of this original conversation, but it may now negate the need for a pyarrow_numpy string dtype. Over the long run having IO methods that use a dtype_backend="numpy_nullable" but that don't return NumPy strings I think is also going to be confusing

@rhshadrach
Member

> Especially for primitive types I don't see the distinction between pyarrow / numpy array types being all that important,

I think the way they treat NA values differently in comparisons is quite important.

@Dr-Irv
Contributor

Dr-Irv commented Feb 16, 2024

> I think the way they treat NA values differently in comparisons is quite important.

I agree. TBH, it seems like the decision to make the semantics of using pyarrow strings like numpy in terms of missing values is related to the issue of how we do numpy to pandas conversions when we have pd.NA and np.nan present in a float array.

As I understand it, the pyarrow semantics for missing values is to use a missing value sentinel similar to pd.NA (I think that is pyarrow.NA). That's in contrast with the numpy semantics of using np.nan to represent a missing value. So the string[pyarrow_numpy] makes those 2 sentinels equivalent. But then we have to decide when we do operations (e.g., len()) that are not returning a string type on a pyarrow string array that has missing values whether we return np.nan or pd.NA or pyarrow.NA to correspond to the string missing value entries, or, equivalently returning a numpy array using np.nan, or a pyarrow array using pyarrow.NA, or a pandas extension array using pd.NA.
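
The two ways of marking a missing entry discussed above can be contrasted in a minimal standard-library sketch. It assumes an Arrow-style validity bitmap with least-significant-bit numbering; `pack_validity` and `is_valid` are hypothetical helper names, not pandas or pyarrow functions.

```python
import math


def pack_validity(values):
    """Pack a validity bitmap: bit i is 1 when slot i holds a value (LSB-first)."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Read bit i of the validity bitmap."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


lengths = [3, None, 5]  # e.g. string lengths with one missing entry

# Sentinel approach: NaN only exists for floats, so the integers are upcast.
with_sentinel = [math.nan if v is None else float(v) for v in lengths]

# Bitmap approach: the data stays integer; a placeholder fills the missing slot.
data = [0 if v is None else v for v in lengths]
validity = pack_validity(lengths)

print(with_sentinel)
print(data, [is_valid(validity, i) for i in range(len(lengths))])
```

The sentinel variant is why an integer result turns into floats as soon as one entry is missing, whereas the bitmap variant keeps the integer type and tracks missingness separately.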

@WillAyd
Member

WillAyd commented Feb 16, 2024

If you don't want pyarrow nullability what is the advantage of using a pyarrow array with numpy semantics versus just a numpy array?

> As I understand it, the pyarrow semantics for missing values is to use a missing value sentinel similar to pd.NA (I think that is pyarrow.NA). That's in contrast with the numpy semantics of using np.nan to represent a missing value.

Arrow uses a validity bitmask whereas numpy doesn't offer anything outside of IEEE 754 floating point arithmetic, which is still applicable within Arrow computations

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import numpy as np
>>> pc.equal(pa.array([1., None, np.nan]), pa.array([1., None, np.nan]))
<pyarrow.lib.BooleanArray object at 0x79189d8ad960>
[
  true,
  null,
  false
]

Though I'm not clear on why this matters for algorithms against a string type?
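
The result of the pyarrow comparison above can be mimicked without pyarrow in a rough sketch, where `None` stands in for an Arrow null and `null_aware_equal` is a hypothetical helper. It only models the two rules at play (a null operand yields null; NaN follows IEEE 754) and is not how pyarrow implements `pc.equal`.

```python
import math


def null_aware_equal(left, right):
    """Element-wise equality where None models a null.

    A null on either side yields null (None). NaN is an ordinary float
    value here, so IEEE 754 applies and NaN == NaN is False.
    """
    return [
        None if a is None or b is None else a == b
        for a, b in zip(left, right)
    ]


values = [1.0, None, math.nan]
result = null_aware_equal(values, values)
print(result)  # [True, None, False]
```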

@phofl
Member

phofl commented Feb 16, 2024

> Though I'm not clear on why this matters for algorithms against a string type?

I don't think we are talking about the same thing. Even if we agree that it doesn't matter for string columns, it matters for all columns that you create from the string columns, e.g.

ser.str.len returns int64[pyarrow], and thus you have a column that behaves differently than your neighbouring columns with int64; this is a very, very bad UX.

@WillAyd
Member

WillAyd commented Feb 16, 2024

How does an int64[pyarrow] behave differently than an np.int64? The underlying data buffers for these are going to be identical.

@rhshadrach
Member

rhshadrach commented Feb 17, 2024

@WillAyd - this code successfully detects NA values with NumPy dtypes, but not pyarrow:

import pandas as pd

df1 = pd.DataFrame({"a": [1, 1, 2]}, dtype="int64[pyarrow]")
df2 = pd.DataFrame({"a": [1, 3], "b": [6, 7]}, dtype="int64[pyarrow]")
# The left merge leaves "b" missing for the row where a == 2
result = df1.merge(df2, on="a", how="left")
# Check for NA values
if (result["b"] != result["b"]).any():
    raise ValueError("Whoopsie!")
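
Why the `x != x` check behaves differently can be sketched in plain Python, with `math.nan` standing in for a NumPy-style missing value and `None` for a null. The helper names are hypothetical, and skipping nulls in the reduction is an assumption modelled on a skipna-style default rather than a statement about pandas internals.

```python
import math


def not_equal_nan(column):
    """NaN semantics: a missing value is NaN, and NaN != NaN is True."""
    return [x != x for x in column]


def not_equal_null(column):
    """Null-propagating semantics: comparing a null (None) yields null."""
    return [None if x is None else x != x for x in column]


def any_skip_null(mask):
    """Reduce with any(), skipping nulls (assumed skipna-style behaviour)."""
    return any(v for v in mask if v is not None)


nan_column = [6.0, 6.0, math.nan]  # merged "b" column with a NaN for the unmatched row
null_column = [6, 6, None]         # same column with a null for the unmatched row

detected_with_nan = any(not_equal_nan(nan_column))
detected_with_null = any_skip_null(not_equal_null(null_column))
print(detected_with_nan, detected_with_null)  # True False
```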

@WillAyd
Member

WillAyd commented Feb 17, 2024

Might be some confusion over null versus NaN. pyarrow works like what is discussed in #32265. null uses Kleene logic for comparisons whereas NaN != NaN by definition in IEEE 754. You can use pc.is_null and/or pc.is_nan to distinguish as needed. Our own isna implementation must already wrap the latter.

>>> result["b"].isna()
0    False
1    False
2     True
Name: b, dtype: bool

@rhshadrach
Member

@WillAyd - sure, but the difference doesn't just come up when you are looking for NA values - it can impact the result of any comparison. I just gave one example.
