
Plan for a native string dtype #35169

Closed
TomAugspurger opened this issue Jul 7, 2020 · 27 comments

Labels: Enhancement, Strings (String extension data type and string data)

Comments

@TomAugspurger (Contributor)

Apache Arrow has support for natively storing UTF-8 data, and work is ongoing
to add kernels (e.g. str.isupper()) that operate on that data. This issue
is to discuss how we can expose the native string dtype to pandas' users.

There are several things to discuss:

  1. How do users opt into this behavior?
  2. What the fallback mode should be for kernels that aren't implemented yet.

How do users opt into Arrow-backed StringArray?

The primary difficulty is the additional Arrow dependency. I'm assuming that we
are not ready to adopt it as a required dependency, so all of this will be
opt-in for now (though this point is open for discussion).

StringArray is marked as experimental, so our usual restrictions on
API-breaking changes don't apply. But we want to do this in a way that's not too disruptive.

There are three ways to get a StringDtype-dtype array today (see the example after this list):

  1. Infer: pd.array(['a', 'b', None])
  2. Explicit dtype=pd.StringDtype()
  3. String alias dtype="string"
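
A quick check that these three construction paths already agree today (standard pandas API):

import pandas as pd

a = pd.array(["a", "b", None])                          # 1. inferred
b = pd.array(["a", "b", None], dtype=pd.StringDtype())  # 2. explicit dtype
c = pd.array(["a", "b", None], dtype="string")          # 3. string alias
assert a.dtype == b.dtype == c.dtype                    # all StringDtype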

My preference is for all of these to stay consistent: they should all give
either a StringArray backed by an object-dtype ndarray, or a StringArray
backed by Arrow memory.

I also have a preference for not keeping around our old implementation for too
long. So I don't think we want something like pd.PythonStringDtype() as a
way to get the StringArray backed by an object-dtype ndarray.

The easiest way to support this is, I think, an option.

>>> pd.options.mode.use_arrow_string_dtype = True

Then all of those would create an Arrow-backed StringArray.
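
A minimal sketch of how such an option could dispatch between backends (hypothetical: neither the option nor an Arrow-backed StringArray exists yet, so a module-level flag and the raw Arrow storage stand in for them):

import pandas as pd

# Stand-in for the proposed pd.options.mode.use_arrow_string_dtype.
USE_ARROW_STRING_DTYPE = False

def make_string_array(values):
    # Dispatch on the proposed option: Arrow memory vs. object-dtype ndarray.
    if USE_ARROW_STRING_DTYPE:
        import pyarrow as pa
        return pa.array(values, type=pa.string())  # raw Arrow-backed storage
    return pd.array(values, dtype="string")        # object-backed StringArray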

Fallback Mode

It's likely that Arrow 1.0 will not implement all the string kernels we need.
So when someone does

>>> Series(['a', 'b'], dtype="string").str.normalize()  # no arrow kernel

we have a few options:

  1. Raise, stating that there's no kernel for normalize.
  2. PerformanceWarning, astype to object, do the operation, and convert back

I'm not sure which is best. My preference for now is probably to raise, but I could see doing either.
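
A sketch of what option 2 could look like (the fallback helper is hypothetical; only the .str methods and pandas.errors.PerformanceWarning are existing API):

import warnings
import pandas as pd
from pandas.errors import PerformanceWarning

def fallback_str_op(series, op_name, *args, **kwargs):
    # Option 2: warn, round-trip through object dtype, convert back.
    warnings.warn(
        f"no Arrow kernel for {op_name!r}; falling back to object dtype",
        PerformanceWarning,
    )
    result = getattr(series.astype(object).str, op_name)(*args, **kwargs)
    # Convert back only when the result is still string-like.
    return result.astype("string") if result.dtype == object else result

out = fallback_str_op(pd.Series(["a", "b"], dtype="string"), "normalize", "NFC")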

@jreback (Contributor) commented Jul 7, 2020

Why make this complicated? I would just make arrow an import requirement of StringArray and call it a day; it's already experimental. Then bump the required version of arrow as needed. We did exactly this with parquet.

@xhochy (Contributor) commented Jul 8, 2020

Points that come to mind, having worked a bit on this in fletcher:

  • Arrow's data structures aren't as straightforward as numpy's. There needs to be a decision on what the backing pyarrow type of the StringArray is (see the pyarrow illustration after this list).
    • Should it be a pyarrow.Array (one contiguous block of memory) or a pyarrow.ChunkedArray (concatenation is free)?
    • Arrow has two string types: string (32-bit offsets for start/end) and large_string (64-bit offsets for start/end). Do we support both, or limit to one?
  • What should be the return type of algorithms that return non-string types (mostly bool and List[str]): always object?
  • pandas supports in-place modification of arrays. The pyarrow.(Chunked)Array structures don't support this. There are two ways to approach this, but for string arrays only the first makes sense due to how strings are stored in Arrow. If other Arrow types are used at some future point, this becomes a more critical decision; I'm keeping it here mainly to raise awareness of the immutability.
    • Always copy the whole array, even on 1-element modifications.
    • Implement a separate pandas.ArrowArray class that adheres to the Arrow memory layout (so that construction of a pyarrow.(Chunked)Array is zero-copy) but allows in-place edits.
  • Arrow 1.0 will probably only implement ~20% of the algorithms, so in most cases a fallback to object mode will be triggered. In my tests, this has roughly a 2x performance penalty compared to a pure object-typed StringArray, which makes such a class less desirable. Although I would expect that developing an Arrow-backed StringArray in pandas will take some time, and we can probably work on this and merge to master once the next release post-1.0 is published.
  • Additionally, I would strongly prefer not to have a global option pd.options.mode.use_arrow_string_dtype, but rather keep the switch with dtype=object and dtype=string so that one can decide on a per-column basis whether to use Arrow. This makes it easier if you implement algorithms on top of the arrays using e.g. numba but aren't able to convert them all at once to the new dtype.
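
For readers unfamiliar with the pyarrow structures referenced above, a small illustration (standard pyarrow API):

import pyarrow as pa

# string: one contiguous block of memory, 32-bit offsets.
arr = pa.array(["a", "bb", None], type=pa.string())

# large_string: 64-bit offsets, for columns exceeding the 32-bit limit.
large = pa.array(["a", "bb"], type=pa.large_string())

# ChunkedArray: a logical array composed of chunks; concatenation is
# cheap because nothing is copied into one contiguous buffer.
chunked = pa.chunked_array([["a", "bb"], ["ccc"]], type=pa.string())
print(chunked.num_chunks)  # 2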

Also noting here: I have spent quite some time (sadly not sufficient time) prototyping things around this in fletcher. I'm happy to contribute them here to get things going. My feeling is that the parts that aren't working nicely yet mostly need work on the Arrow side, not in pandas. But having a draft PR up here would probably help people understand what is needed.

@xhochy (Contributor) commented Jul 8, 2020

Also note that I track the algorithm coverage of the current pandas API vs what is implemented in pyarrow here: xhochy/fletcher#121

@TomAugspurger (Contributor, Author)

[jreback] Why make this complicated? I would just make arrow an import requirement of StringArray and call it a day; it's already experimental. Then bump the required version of arrow as needed. We did exactly this with parquet.

That's the simplest way from our end. Are we willing to require arrow to opt into the new string dtype?

Thanks for that list @xhochy, that's extremely valuable. In particular

  1. It makes this feel like less of a drop-in replacement for our current string implementation (especially around in-place mutation).
  2. We might want to consider implementing a ListDtype, based on arrow memory, at the same time. This would support things like .str.split(..., expand=False). I'll open a separate issue for that.

@xhochy (Contributor) commented Jul 8, 2020

Regarding the immutability: I posted an explanation in #8640 (comment), in case a reader of this thread is unclear about why we aren't just making a mutable type here instead.
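
To make the immutability constraint concrete: pyarrow arrays cannot be modified in place, so any "setitem" must rebuild (at least part of) the array. A minimal sketch (the helper is ours, not a pyarrow feature):

import pyarrow as pa

def set_element(arr: pa.Array, i: int, value: str) -> pa.Array:
    # There is no arr[i] = value in pyarrow; rebuild with a full copy.
    values = arr.to_pylist()
    values[i] = value
    return pa.array(values, type=arr.type)

arr = pa.array(["aa", "bb", "cc"])
arr2 = set_element(arr, 1, "xx")  # new array; arr is unchanged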

@jorisvandenbossche (Member)

@TomAugspurger thanks for getting this discussion started!

[jreback] I would just make arrow an import requirement of StringArray and call it a day; it's already experimental.

I would personally prefer not to tie the use of the "string" dtype to the presence of pyarrow. If pyarrow is optional (and I personally prefer to keep it that way for now), I would prefer keeping it optional for the new dtypes as well.

[TomAugspurger ] I also have a preference for not keeping around our old implementation for too
long. So I don't think we want something like pd.PythonStringDtype() as a
way to get the StringArray backed by an object-dtype ndarray.

As long as pyarrow is optional, I think we need to keep the "old" implementation around (related to the above). But I agree we probably don't want a pd.PythonStringDtype().
In case we keep the use of the pyarrow string array optional for the "string" dtype, I think there are basically two options: have different dtypes (like StringDtype and PythonStringDtype, or other names), or have a single "parametrized" dtype (eg StringDtype(use_objects=True/False), where the default could be None and detect whether pyarrow is installed).
Of course, such a single dtype might also give corner cases where that information is lost, etc.

@jorisvandenbossche (Member)

[xhochy] What should be the return type of algorithms that return non-string types (mostly bool and List[str]): always object?

As in general with our new nullable dtypes, operations should as much as possible also return nullable dtypes (so eg the nullable boolean dtype). For List that's of course not yet possible.

I have spent quite some time (sadly not sufficient time) prototyping things around this in fletcher.

@xhochy based on that experience, what's your current take on the Array vs ChunkedArray point you raise above?
I think fletcher right now handles this very explicitly, using different EAs/dtypes for both? But I think for pandas we want to "hide" this somewhat from the user (so we either need to choose one, or handle both in the same array under the hood)?

@xhochy (Contributor) commented Jul 8, 2020

[jorisvandenbossche] @xhochy based on that experience, what's your current take on the Array vs ChunkedArray point you raise above?

"I have no idea anymore".

From an Arrow standpoint, ChunkedArray is more natural, as it provides the full flexibility of the format. It is just that chunking makes implementing algorithms on top more complicated; spatialpandas thus uses an Array, as this makes the implementation simpler. Nowadays I lean a bit more towards Array, as this is simpler for the end user. But I'm totally undecided here and interested in other people's views.

@jorisvandenbossche (Member)

We discussed this a bit yesterday on the call. I'll try to summarize the points that were brought up (based on our notes and my recollection), with some additional background:

  • Array vs ChunkedArray
    • Context: pyarrow provides two possible data structures to use under the hood for a column in a DataFrame. Array is a single contiguous array, ChunkedArray can be seen as a list of arrays that represents a single logical array.
    • In fletcher, support for both is implemented, but this is made explicit (different dtype for both: FletcherContinuousDtype and FletcherChunkedDtype).
    • When you can call out to pyarrow functions, it is mostly smooth to use either (most kernels in pyarrow will accept both). But when needing to implement some custom functionality in pandas, ChunkedArrays add complexity (eg finding a position in a ChunkedArray first requires finding the chunk, and then the position within the chunk -> "double" indexing; see the sketch after this list, and eg fletcher's setitem for an example)
    • When reading large binary/string data (> 2GB of data in total for a single column), pyarrow can automatically return a ChunkedArray instead of an Array.
    • ChunkedArray can give a cheap concat (as the arrays don't need to be copied into a single contiguous array).
    • Conclusion? Nothing definitive, but a general inclination towards starting with ChunkedArray. Since this is mostly hidden from the user, we can always re-evaluate along the way (and if we decide to switch from ChunkedArray to Array, the code only gets easier later on).
  • String vs LargeString type
    • Context: the default String type in pyarrow uses int32 offsets into the contiguous binary array of characters (offsets denote the start/stop of each "scalar" string in the contiguous array). This means the total number of bytes in a single string array is limited to np.iinfo(np.int32).max, i.e. 2**31 - 1 bytes, roughly 2 GiB (a maximum for the full array, and hence also for a single element). To overcome this limitation (without chunking), there is also a LargeString type using int64 offsets (increasing the memory use, but giving a practically unlimited size of ~8000 PB).
    • We didn't discuss much about this, but I think there are multiple options: a) choose a single one (eg "string" and rely on chunking to support larger data) b) support both using a single pandas dtype towards the user (a dtype parametrized on the offsets type) c) support both with separate pandas dtypes (following the string / large_string distinction of Arrow).
  • Mutability
    • Context: since the array of strings is stored in one contiguous chunk of bytes, assigning a different string to a single element is in general not possible (with the exception of, eg, a string of the exact same length in bytes). See also API/ENH: dtype='string' / pd.String #8640 (comment)
    • There are workarounds possible to still provide the same end-user experience of mutability, but this means that basically each assignment (__setitem__) leads to a copy of (part of) the data. This gives a different performance profile for this operation (especially when doing assignments in a loop), and this might also change the API regarding "views" (when mutating leads to a copy, other arrays that are a view on the original are not updated as you would expect).
    • This might be a reason to keep the “object python string” dtype for people that want to do a lot of mutations? (apart from keeping it in case we don't require pyarrow for StringDtype)
    • We need to ensure we provide efficient "bulk" assignment/replacement operations (eg the replace method where you provide a replacement mapping, assignment with a list / boolean mask, ...)
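
To make the "double" indexing point above concrete, a minimal sketch of resolving a flat position in a ChunkedArray (plain pyarrow; the helper name is ours):

import pyarrow as pa

def resolve_position(chunked: pa.ChunkedArray, i: int):
    # Map a flat index to (chunk_index, index_within_chunk).
    for chunk_index, chunk in enumerate(chunked.chunks):
        if i < len(chunk):
            return chunk_index, i
        i -= len(chunk)
    raise IndexError("index out of bounds")

ca = pa.chunked_array([["a", "b"], ["c"]])
print(resolve_position(ca, 2))  # (1, 0)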

@toobaz (Member) commented Jul 9, 2020

* Mutability

As a user, this is what has always worried me most since we started talking about arrow as a backend for strings. But I think asking users to choose between mutability and efficiency would be acceptable - as opposed to (I think) making efficient, non-mutable strings the default, and (definitely) as opposed to entirely replacing the current mutable type with one that isn't.

@xhochy (Contributor) commented Jul 10, 2020

As this will probably need more than the string algorithms exposed by the str accessor (covered by ARROW-555), I have set up an umbrella issue on the Arrow side for the remaining parts: ARROW-9401. Notably, we probably need to implement some custom take operations for pandas.

I plan to start a PR with the basic scaffolding for the data type next week.

@xhochy (Contributor) commented Jul 13, 2020

Quite early but if someone wants to follow progress, I opened a draft PR: #35259

@TomAugspurger (Contributor, Author)

Thanks for all the discussion here.

[Joris] I would personally prefer not to tie the use of the "string" dtype to the presence of pyarrow.

Agreed with this. I think the downsides of the Arrow implementation (fundamental: mutability; temporary: not all algos implemented) mean that we'll want to keep the current string implementation around for at least some time. And given the choice between object, object-backed string, and Arrow-backed string, I'd get rid of the object-dtype implementation first (i.e. make Series.str raise for object dtype).

[Uwe] Additionally, I would strongly prefer not to have a global option

Fair point. Agreed that we can avoid that, at least for now. @jorisvandenbossche suggested a parameter like pd.Series(['a', 'b'], dtype=pd.StringDtype(storage="python"/"pyarrow")). IMO that gives a decent level of flexibility. Then the only "ambiguous" case is whether dtype="string" refers to the python object storage or the pyarrow one. Perhaps we could use the global option to resolve that ambiguity?
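
A sketch of the suggested parametrized dtype (this is essentially the shape the API eventually took in pandas 1.3; the "pyarrow" storage requires pyarrow to be installed):

import pandas as pd

# Object-backed storage: no pyarrow requirement.
s_py = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="python"))

# Arrow-backed storage.
s_pa = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="pyarrow"))

# Equivalent string alias for the Arrow-backed variant.
s_alias = pd.Series(["a", "b"], dtype="string[pyarrow]")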

@TomAugspurger (Contributor, Author)

@xhochy do you have thoughts on using the dtype to specify how the values should be stored? It's a bit unusual for that to be an attribute of the dtype rather than of the array, but I think it fits a bit better with the rest of pandas (pd.array, for example).

Alternatively we give up on the goal of having a single user-facing StringArray and StringDtype, and just have StringArray and ArrowStringArray. But I think there's some value in having a single class that users deal with.

@jorisvandenbossche (Member)

[TomAugspurger] ...and just have StringArray and ArrowStringArray

Did you mean to say "just have StringDtype and ArrowStringDtype" ? Because we could still have two array classes (but eg subclassing the same base StringArray class), which are coupled to a single, parametrized StringDtype.

@TomAugspurger (Contributor, Author)

I'm not sure exactly what I had in my head, but having separate array classes seems fine.

@Josh-Ring-jisc commented Aug 27, 2020

An interesting approach that would allow both performance and mutability here would be to pad the UTF-8 strings to their maximum possible length. Even with the memory consumption hit from this, I think it would still be a big improvement over the pure-python implementation in terms of memory consumption and performance.

This way, if a character grows from 1 byte to, say, 6 bytes, it's not a problem: we would have already allocated enough room for the change via the padding (and we would avoid having to reallocate the whole array for small changes, as there's always enough space to accommodate the new value).

I imagine that this could require work on the Arrow side of things, do you think this is worth pursuing?

@xhochy (Contributor) commented Aug 28, 2020

This is similar to (but more extreme than) the NumPy approach, and it won't work efficiently for two reasons:

  • The largest string array size in Arrow is 2**31-1 bytes for the StringType and 2**63-1 bytes for the LargeStringType (signed 32/64-bit offsets). Padding every element to the maximum length would use an infeasible amount of space.
  • Having a lot of padding between string elements greatly reduces cache efficiency: we would be fetching a lot of data into the L3 cache that we never use. This is in contrast to the current Arrow implementation, where every element fetched from memory into the CPU caches (through lookahead) contains data that is used for one of the following rows.

@Josh-Ring-jisc commented Aug 28, 2020

I think my previous thought was ill-defined; I'll clarify it a bit.
What about allowing a user-defined character limit per row, where we allocate, per character, the maximum space a UTF-8 character can require (since they are variable in length)?

So if someone knew that their strings would never be more than, say, 255 characters (for instance because the data came from a database), they could use that information to allocate only the space they need, with minimal padding and minimal cache misses, while retaining mutability.

In the future, if the user specified a specific subset of UTF-8, e.g. "ASCII only", it should be possible to further reduce the padding and memory requirements.

edit:
If this needed to be automatic, the limit could be populated from the maximum string length in the data.

Obviously nothing is perfect, but this does seem like a useful compromise in the many cases where mutability without reallocating the array would be helpful.

I would argue that some padding overhead is still cheaper than repeatedly reallocating a large array. (A rough comparison of the two layouts follows below.)
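
A back-of-the-envelope comparison of the fixed-width layout proposed here versus Arrow's offset-based layout (illustrative numbers only; the 20-byte average row is an assumption):

# 1 million rows, strings up to 255 characters, worst case 4 bytes per
# UTF-8 character (RFC 3629 caps UTF-8 at 4 bytes per code point).
n_rows, max_chars, bytes_per_char = 1_000_000, 255, 4

fixed_width = n_rows * max_chars * bytes_per_char  # mutable in place
# Arrow: actual string bytes plus one 32-bit offset per row, assuming
# 20 bytes of string data per row on average.
arrow_layout = n_rows * 20 + (n_rows + 1) * 4      # immutable

print(f"fixed width: {fixed_width / 1e9:.2f} GB")  # ~1.02 GB
print(f"arrow:       {arrow_layout / 1e6:.0f} MB") # ~24 MB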

@jorisvandenbossche (Member)

The PR of @TomAugspurger at #36142 proposed a way to expose this "native/arrow-backed string dtype" to the user. Specifically, that PR was doing:

  1. One user-facing pd.StringDtype, parametrized by storage = "python"/"pyarrow".
  2. Two (or three) user-facing StringArray classes: an ArrowStringArray, a PythonStringArray, and maybe a base StringArray.

So for now, the "string" dtype would still default to "string[python]", but the user has the option to explicitly choose the arrow-based one (with "string[arrow]" or pd.StringDtype(storage="pyarrow")).

Are people generally OK with this as a way to provide this as an opt-in in the short term?

@xhochy (Contributor) commented Oct 19, 2020

[jorisvandenbossche] Are people generally OK with this as a way to provide this as an opt-in in the short term?

Sounds good, as this still gives the end user an easy and obvious way to switch between both implementations.

@TomAugspurger (Contributor, Author) commented Oct 19, 2020

Sounds good to me too.

I expect that the class hierarchy will be something like

class StringArray(ExtensionArray):
    ...


class ArrowStringArray(StringArray):
    ...


class PythonStringArray(StringArray, PandasArray):
    ...
and trying to instantiate StringArray directly wouldn't be allowed (it may or may not be public in pandas.arrays).

@jorisvandenbossche (Member) commented Oct 20, 2020

We had a quick chat with @simonjayhawkins and @TomAugspurger; the rough next steps we see, which can be worked on separately, are:

@jorisvandenbossche (Member) commented Nov 14, 2020

Other follow-ups needed after #35259

@jorisvandenbossche (Member)

One aspect that @simonjayhawkins raised while finalizing the implementation in #39908 (#39908 (comment)) is the use of a parametrized dtype (StringDtype(storage="python"|"pyarrow"), with each parametrization using its own specialized array class) versus having a single dtype / array class (with the storage backend being a property of the array class, thus hiding this "implementation detail" from the user much more).

Bringing that up here, because it's a more fundamental issue (and not just a technical implementation detail of the PR) that relates to some of the things discussed above.

For example, @xhochy argued above for not (only) having a global option to control which storage backend is used, so you can decide on a per-column basis whether to use Arrow.

Personally, because using Arrow or not can still have some important user-facing consequences (eg regarding mutability / overhead when mutating the Arrow-based storage), I think it is useful to make this option easily available to the user when needed (so eg with the storage= keyword for StringDtype).

@TomAugspurger (Contributor, Author)

[jorisvandenbossche] I think it is useful to make this option easily available to the user when needed (so eg with the storage= keyword for StringDtype).

Agreed with this.

@TomAugspurger (Contributor, Author)

There are a few follow-up items to discuss, but I think this issue can be closed now that pandas 1.3.0 is released. Thanks all!
