
Plan for a native string dtype #35169

Closed
TomAugspurger opened this issue Jul 7, 2020 · 27 comments

Labels: Enhancement, Strings (String extension data type and string data)

Comments

@TomAugspurger (Contributor)

Apache Arrow has support for natively storing UTF-8 data, and work is ongoing
to add kernels (e.g. str.isupper()) that operate on that data. This issue
is to discuss how we can expose the native string dtype to pandas' users.

There are several things to discuss:

  1. How do users opt into this behavior?
  2. What the fallback mode should be for kernels that aren't implemented yet.

How do users opt into Arrow-backed StringArray?

The primary difficulty is the additional Arrow dependency. I'm assuming that we
are not ready to adopt it as a required dependency, so all of this will be
opt-in for now (though this point is open for discussion).

StringArray is marked as experimental, so our usual restrictions on
API-breaking changes don't apply. But we want to do this in a way that's not too disruptive.

There are three ways to get a StringDtype-dtype array today (see the example after this list):

  1. Infer: pd.array(['a', 'b', None])
  2. Explicit dtype=pd.StringDtype()
  3. String alias dtype="string"
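
A quick check that these three construction paths already agree today (standard pandas API):

import pandas as pd

a = pd.array(["a", "b", None])                          # 1. inferred
b = pd.array(["a", "b", None], dtype=pd.StringDtype())  # 2. explicit dtype
c = pd.array(["a", "b", None], dtype="string")          # 3. string alias
assert a.dtype == b.dtype == c.dtype                    # all StringDtype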

My preference is for all of these to stay consistent: they should all give
either a StringArray backed by an object-dtype ndarray, or a StringArray
backed by Arrow memory.

I also have a preference for not keeping around our old implementation for too
long. So I don't think we want something like pd.PythonStringDtype() as a
way to get the StringArray backed by an object-dtype ndarray.

The easiest way to support this is, I think, an option.

>>> pd.options.mode.use_arrow_string_dtype = True

Then all of those would create an Arrow-backed StringArray.
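
A minimal sketch of how such an option could dispatch between backends (hypothetical: neither the option nor an Arrow-backed StringArray exists yet, so a module-level flag and the raw Arrow storage stand in for them):

import pandas as pd

# Stand-in for the proposed pd.options.mode.use_arrow_string_dtype.
USE_ARROW_STRING_DTYPE = False

def make_string_array(values):
    # Dispatch on the proposed option: Arrow memory vs. object-dtype ndarray.
    if USE_ARROW_STRING_DTYPE:
        import pyarrow as pa
        return pa.array(values, type=pa.string())  # raw Arrow-backed storage
    return pd.array(values, dtype="string")        # object-backed StringArray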

Fallback Mode

It's likely that Arrow 1.0 will not implement all the string kernels we need.
So when someone does

>>> Series(['a', 'b'], dtype="string").str.normalize()  # no arrow kernel

we have a few options:

  1. Raise, stating that there's no kernel for normalize.
  2. PerformanceWarning, astype to object, do the operation, and convert back

I'm not sure which is best. My preference for now is probably to raise, but I could see doing either.
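
A sketch of what option 2 could look like (the fallback helper is hypothetical; only the .str methods and pandas.errors.PerformanceWarning are existing API):

import warnings
import pandas as pd
from pandas.errors import PerformanceWarning

def fallback_str_op(series, op_name, *args, **kwargs):
    # Option 2: warn, round-trip through object dtype, convert back.
    warnings.warn(
        f"no Arrow kernel for {op_name!r}; falling back to object dtype",
        PerformanceWarning,
    )
    result = getattr(series.astype(object).str, op_name)(*args, **kwargs)
    # Convert back only when the result is still string-like.
    return result.astype("string") if result.dtype == object else result

out = fallback_str_op(pd.Series(["a", "b"], dtype="string"), "normalize", "NFC")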

@jreback (Contributor) commented Jul 7, 2020

Why make this complicated? I would just make arrow an import requirement of StringArray and call it a day; it's already experimental. Then bump the required version of arrow as needed. We did exactly this with parquet.

@xhochy (Contributor) commented Jul 8, 2020

Points that come to mind, having worked a bit on this in fletcher:

  • Arrow's data structures aren't as straightforward as numpy's. There needs to be a decision on what the backing pyarrow type of the StringArray is (see the pyarrow illustration after this list).
    • Should it be a pyarrow.Array (one contiguous block of memory) or a pyarrow.ChunkedArray (concatenation is free)?
    • Arrow has two string types: string (32-bit offsets for start/end) and large_string (64-bit offsets for start/end). Do we support both, or limit to one?
  • What should be the return type of algorithms that return non-string types (mostly bool and List[str]): always object?
  • pandas supports in-place modification of arrays. The pyarrow.(Chunked)Array structures don't support this. There are two ways to approach this, but for string arrays only the first makes sense due to how strings are stored in Arrow. If other Arrow types are used at some future point, this becomes a more critical decision; I'm keeping it here mainly to raise awareness of the immutability.
    • Always copy the whole array, even on 1-element modifications.
    • Implement a separate pandas.ArrowArray class that adheres to the Arrow memory layout (so that construction of a pyarrow.(Chunked)Array is zero-copy) but allows in-place edits.
  • Arrow 1.0 will probably only implement ~20% of the algorithms, so in most cases a fallback to object mode will be triggered. In my tests, this has roughly a 2x performance penalty compared to a pure object-typed StringArray, which makes such a class less desirable. Although I would expect that developing an Arrow-backed StringArray in pandas will take some time, and we can probably work on this and merge to master once the next release post-1.0 is published.
  • Additionally, I would strongly prefer not to have a global option pd.options.mode.use_arrow_string_dtype, but rather keep the switch with dtype=object and dtype=string so that one can decide on a per-column basis whether to use Arrow. This makes it easier if you implement algorithms on top of the arrays using e.g. numba but aren't able to convert them all at once to the new dtype.
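
For readers unfamiliar with the pyarrow structures referenced above, a small illustration (standard pyarrow API):

import pyarrow as pa

# string: one contiguous block of memory, 32-bit offsets.
arr = pa.array(["a", "bb", None], type=pa.string())

# large_string: 64-bit offsets, for columns exceeding the 32-bit limit.
large = pa.array(["a", "bb"], type=pa.large_string())

# ChunkedArray: a logical array composed of chunks; concatenation is
# cheap because nothing is copied into one contiguous buffer.
chunked = pa.chunked_array([["a", "bb"], ["ccc"]], type=pa.string())
print(chunked.num_chunks)  # 2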

Also noting here: I have spent quite some time (sadly not sufficient time) prototyping things around this in fletcher. I'm happy to contribute them here to get things going. My feeling is that the parts that aren't working nicely yet mostly need work on the Arrow side, not in pandas. But having a draft PR up here would probably help people understand what is needed.

@xhochy (Contributor) commented Jul 8, 2020

Also note that I track the algorithm coverage of the current pandas API vs what is implemented in pyarrow here: xhochy/fletcher#121

@TomAugspurger (Contributor, Author)

[jreback] Why make this complicated? I would just make arrow an import requirement of StringArray and call it a day; it's already experimental. Then bump the required version of arrow as needed. We did exactly this with parquet.

That's the simplest way from our end. Are we willing to require arrow to opt into the new string dtype?

Thanks for that list @xhochy, that's extremely valuable. In particular

  1. It makes this feel like less of a drop-in replacement for our current string implementation (especially around in-place mutation).
  2. We might want to consider implementing a ListDtype, based on arrow memory, at the same time. This would support things like .str.split(..., expand=False). I'll open a separate issue for that.

@xhochy (Contributor) commented Jul 8, 2020

Regarding the immutability: I posted an explanation in #8640 (comment), in case a reader of this thread is unclear about why we aren't just making a mutable type here instead.
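
To make the immutability constraint concrete: pyarrow arrays cannot be modified in place, so any "setitem" must rebuild (at least part of) the array. A minimal sketch (the helper is ours, not a pyarrow feature):

import pyarrow as pa

def set_element(arr: pa.Array, i: int, value: str) -> pa.Array:
    # There is no arr[i] = value in pyarrow; rebuild with a full copy.
    values = arr.to_pylist()
    values[i] = value
    return pa.array(values, type=arr.type)

arr = pa.array(["aa", "bb", "cc"])
arr2 = set_element(arr, 1, "xx")  # new array; arr is unchanged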

@jorisvandenbossche (Member)

@TomAugspurger thanks for getting this discussion started!

[jreback] I would just make arrow an import requirement of StringArray and call it a day; it's already experimental.

I would personally prefer not to tie the use of the "string" dtype to the presence of pyarrow. If pyarrow is optional (and I personally prefer to keep it that way for now), I would prefer keeping it optional for the new dtypes as well.

[TomAugspurger ] I also have a preference for not keeping around our old implementation for too
long. So I don't think we want something like pd.PythonStringDtype() as a
way to get the StringArray backed by an object-dtype ndarray.

As long as pyarrow is optional, I think we need to keep the "old" implementation around (related to the above). But I agree we probably don't want a pd.PythonStringDtype().
In case we keep the use of the pyarrow string array optional for the "string" dtype, I think there are basically two options: have different dtypes (like StringDtype and PythonStringDtype, or other names), or have a single "parametrized" dtype (eg StringDtype(use_objects=True/False), where the default could be None and detect whether pyarrow is installed).
Of course, such a single dtype might also give corner cases where that information is lost, etc.

@jorisvandenbossche (Member)

[xhochy] What should be the return type of algorithms that return non-string types (mostly bool and List[str]): always object?

As in general with our new nullable dtypes, operations should as much as possible also return nullable dtypes (so eg the nullable boolean dtype). For List that's of course not yet possible.

I have spent quite some time (sadly not sufficient time) prototyping things around this in fletcher.

@xhochy based on that experience, what's your current take on the Array vs ChunkedArray point you raise above?
I think fletcher right now handles this very explicitly, using different EAs/dtypes for both? But I think for pandas we want to "hide" this somewhat from the user (so we either need to choose one, or handle both in the same array under the hood)?

@xhochy (Contributor) commented Jul 8, 2020

[jorisvandenbossche] @xhochy based on that experience, what's your current take on the Array vs ChunkedArray point you raise above?

"I have no idea anymore".

From an Arrow standpoint, ChunkedArray is more natural, as it provides the full flexibility of the format. It is just that chunking makes implementing algorithms on top more complicated; spatialpandas thus uses an Array, as this makes the implementation simpler. Nowadays I lean a bit more towards Array, as this is simpler for the end user. But I'm totally undecided here and interested in other people's views.

@jorisvandenbossche (Member)

We discussed this a bit yesterday on the call. I'll try to summarize the points that were brought up (based on our notes and my recollection), with some additional background:

  • Array vs ChunkedArray
    • Context: pyarrow provides two possible data structures to use under the hood for a column in a DataFrame. Array is a single contiguous array, ChunkedArray can be seen as a list of arrays that represents a single logical array.
    • In fletcher, support for both is implemented, but this is made explicit (different dtype for both: FletcherContinuousDtype and FletcherChunkedDtype).
    • When you can call out to pyarrow functions, it is mostly smooth to use either (most kernels in pyarrow will accept both). But when needing to implement some custom functionality in pandas, ChunkedArrays add complexity (eg finding a position in a ChunkedArray first requires finding the chunk, and then the position within the chunk -> "double" indexing; see the sketch after this list, and eg fletcher's setitem for an example)
    • When reading large binary/string data (> 2GB of data in total for a single column), pyarrow can automatically return a ChunkedArray instead of an Array.
    • ChunkedArray can give a cheap concat (as the arrays don't need to be copied into a single contiguous array).
    • Conclusion? Nothing definitive, but a general inclination towards starting with ChunkedArray. Since this is mostly hidden from the user, we can always re-evaluate along the way (and if we decide to switch from ChunkedArray to Array, the code only gets easier later on).
  • String vs LargeString type
    • Context: the default String type in pyarrow uses int32 offsets into the contiguous binary array of characters (offsets denote the start/stop of each "scalar" string in the contiguous array). This means the total number of bytes in a single string array is limited to np.iinfo(np.int32).max, i.e. 2**31 - 1 bytes, roughly 2 GiB (a maximum for the full array, and hence also for a single element). To overcome this limitation (without chunking), there is also a LargeString type using int64 offsets (increasing the memory use, but giving a practically unlimited size of ~8000 PB).
    • We didn't discuss much about this, but I think there are multiple options: a) choose a single one (eg "string" and rely on chunking to support larger data) b) support both using a single pandas dtype towards the user (a dtype parametrized on the offsets type) c) support both with separate pandas dtypes (following the string / large_string distinction of Arrow).
  • Mutability
    • Context: since the array of strings is stored in one contiguous chunk of bytes, assigning a different string to a single element is in general not possible (with the exception of, eg, a string of the exact same length in bytes). See also API/ENH: dtype='string' / pd.String #8640 (comment)
    • There are workarounds possible to still provide the same end-user experience of mutability, but this means that basically each assignment (__setitem__) leads to a copy of (part of) the data. This gives a different performance profile for this operation (especially when doing assignments in a loop), and this might also change the API regarding "views" (when mutating leads to a copy, other arrays that are a view on the original are not updated as you would expect).
    • This might be a reason to keep the “object python string” dtype for people that want to do a lot of mutations? (apart from keeping it in case we don't require pyarrow for StringDtype)
    • We need to ensure we provide efficient "bulk" assignment/replacement operations (eg the replace method where you provide a replacement mapping, assignment with a list / boolean mask, ...)
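
To make the "double" indexing point above concrete, a minimal sketch of resolving a flat position in a ChunkedArray (plain pyarrow; the helper name is ours):

import pyarrow as pa

def resolve_position(chunked: pa.ChunkedArray, i: int):
    # Map a flat index to (chunk_index, index_within_chunk).
    for chunk_index, chunk in enumerate(chunked.chunks):
        if i < len(chunk):
            return chunk_index, i
        i -= len(chunk)
    raise IndexError("index out of bounds")

ca = pa.chunked_array([["a", "b"], ["c"]])
print(resolve_position(ca, 2))  # (1, 0)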

@toobaz (Member) commented Jul 9, 2020

* Mutability

As a user, this is what has always worried me most since we started talking about arrow as a backend for strings. But I think asking users to choose between mutability and efficiency would be acceptable - as opposed to (I think) making efficient, non-mutable strings the default, and (definitely) as opposed to entirely replacing the current mutable type with one that isn't.

@xhochy (Contributor) commented Jul 10, 2020

As this will probably need more than the string algorithms exposed by the str accessor (covered by ARROW-555), I have set up an umbrella issue on the Arrow side for the remaining parts: ARROW-9401. Notably, we probably need to implement some custom take operations for pandas.

I plan to start a PR with the basic scaffolding for the data type next week.

@xhochy (Contributor) commented Jul 13, 2020

Quite early but if someone wants to follow progress, I opened a draft PR: #35259

@TomAugspurger (Contributor, Author)

Thanks for all the discussion here.

[Joris] I would personally prefer not to tie the use of the "string" dtype to the presence of pyarrow.

Agreed with this. I think the downsides of the Arrow implementation (fundamental: mutability; temporary: not all algos implemented) mean that we'll want to keep the current string implementation around for at least some time. And given the choice between object, object-backed string, and Arrow-backed string, I'd get rid of the object-dtype implementation first (i.e. make Series.str raise for object dtype).

[Uwe] Additionally, I would strongly prefer not to have a global option

Fair point. Agreed that we can avoid that, at least for now. @jorisvandenbossche suggested a parameter like pd.Series(['a', 'b'], dtype=pd.StringDtype(storage="python"/"pyarrow")). IMO that gives a decent level of flexibility. Then the only "ambiguous" case is whether dtype="string" refers to the python object storage or the pyarrow one. Perhaps we could use the global option to resolve that ambiguity?
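
A sketch of the suggested parametrized dtype (this is essentially the shape the API eventually took in pandas 1.3; the "pyarrow" storage requires pyarrow to be installed):

import pandas as pd

# Object-backed storage: no pyarrow requirement.
s_py = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="python"))

# Arrow-backed storage.
s_pa = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="pyarrow"))

# Equivalent string alias for the Arrow-backed variant.
s_alias = pd.Series(["a", "b"], dtype="string[pyarrow]")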

@TomAugspurger (Contributor, Author)

@xhochy do you have thoughts on using the dtype to specify how the values should be stored? It's a bit unusual for that to be an attribute of the dtype rather than of the array, but I think it fits a bit better with the rest of pandas (pd.array, for example).

Alternatively we give up on the goal of having a single user-facing StringArray and StringDtype, and just have StringArray and ArrowStringArray. But I think there's some value in having a single class that users deal with.

@jorisvandenbossche (Member)

[TomAugspurger] ...and just have StringArray and ArrowStringArray

Did you mean to say "just have StringDtype and ArrowStringDtype" ? Because we could still have two array classes (but eg subclassing the same base StringArray class), which are coupled to a single, parametrized StringDtype.

@TomAugspurger (Contributor, Author)

I'm not sure exactly what I had in my head, but having separate array classes seems fine.

@Josh-Ring-jisc commented Aug 27, 2020

An interesting approach that would allow both performance and mutability here would be to pad the UTF-8 strings to their maximum possible length. Even with the memory consumption hit from this, I think it would still be a big improvement over the pure-python implementation in terms of memory consumption and performance.

This way, if a character grows from 1 byte to, say, 6 bytes, it's not a problem: we would have already allocated enough room for the change via the padding (and we would avoid having to reallocate the whole array for small changes, as there's always enough space to accommodate the new value).

I imagine that this could require work on the Arrow side of things, do you think this is worth pursuing?

@xhochy (Contributor) commented Aug 28, 2020

This is similar to (but more extreme than) the NumPy approach, and it won't work efficiently for two reasons:

  • The largest string array size in Arrow is 2**31-1 bytes for the StringType and 2**63-1 bytes for the LargeStringType (signed 32/64-bit offsets). Padding every element to the maximum length would use an infeasible amount of space.
  • Having a lot of padding between string elements greatly reduces cache efficiency: we would be fetching a lot of data into the L3 cache that we never use. This is in contrast to the current Arrow implementation, where every element fetched from memory into the CPU caches (through lookahead) contains data that is used for one of the following rows.

@Josh-Ring-jisc commented Aug 28, 2020

I think my previous thought was ill-defined; I'll clarify it a bit.
What about allowing a user-defined character limit per row, where we allocate, per character, the maximum space a UTF-8 character can require (since they are variable in length)?

So if someone knew that their strings would never be more than, say, 255 characters (for instance because the data came from a database), they could use that information to allocate only the space they need, with minimal padding and minimal cache misses, while retaining mutability.

In the future, if the user specified a specific subset of UTF-8, e.g. "ASCII only", it should be possible to further reduce the padding and memory requirements.

edit:
If this needed to be automatic, the limit could be populated from the maximum string length in the data.

Obviously nothing is perfect, but this does seem like a useful compromise in the many cases where mutability without reallocating the array would be helpful.

I would argue that some padding overhead is still cheaper than repeatedly reallocating a large array. (A rough comparison of the two layouts follows below.)
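
A back-of-the-envelope comparison of the fixed-width layout proposed here versus Arrow's offset-based layout (illustrative numbers only; the 20-byte average row is an assumption):

# 1 million rows, strings up to 255 characters, worst case 4 bytes per
# UTF-8 character (RFC 3629 caps UTF-8 at 4 bytes per code point).
n_rows, max_chars, bytes_per_char = 1_000_000, 255, 4

fixed_width = n_rows * max_chars * bytes_per_char  # mutable in place
# Arrow: actual string bytes plus one 32-bit offset per row, assuming
# 20 bytes of string data per row on average.
arrow_layout = n_rows * 20 + (n_rows + 1) * 4      # immutable

print(f"fixed width: {fixed_width / 1e9:.2f} GB")  # ~1.02 GB
print(f"arrow:       {arrow_layout / 1e6:.0f} MB") # ~24 MB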

@jorisvandenbossche (Member)

The PR of @TomAugspurger at #36142 proposed a way to expose this "native/arrow-backed string dtype" to the user. Specifically, that PR was doing:

  1. One user-facing pd.StringDtype, parametrized by storage = "python"/"pyarrow".
  2. Two (or three) user-facing StringArray classes: an ArrowStringArray, a PythonStringArray, and maybe a base StringArray.

So for now, the "string" dtype would still default to "string[python]", but the user has the option to explicitly choose the arrow-based one (with "string[arrow]" or pd.StringDtype(storage="pyarrow")).

Are people generally OK with this as a way to provide this as an opt-in in the short term?

@xhochy (Contributor) commented Oct 19, 2020

[jorisvandenbossche] Are people generally OK with this as a way to provide this as an opt-in in the short term?

Sounds good, as this still gives the end user an easy and obvious way to switch between both implementations.

@TomAugspurger (Contributor, Author) commented Oct 19, 2020

Sounds good to me too.

I expect that the class hierarchy will be something like

class StringArray(ExtensionArray):
    ...


class ArrowStringArray(StringArray):
    ...


class PythonStringArray(StringArray, PandasArray):
    ...
and trying to instantiate StringArray directly wouldn't be allowed (it may or may not be public in pandas.arrays).

@jorisvandenbossche (Member) commented Oct 20, 2020

We had a quick chat with @simonjayhawkins and @TomAugspurger; the rough next steps we see, which can be worked on separately, are:

@jorisvandenbossche (Member) commented Nov 14, 2020

Other follow-ups needed after #35259

@jorisvandenbossche (Member)

One aspect that @simonjayhawkins raised while finalizing the implementation in #39908 (#39908 (comment)) is the use of a parametrized dtype (StringDtype(storage="python"|"pyarrow"), with each parametrization using its own specialized array class) versus having a single dtype / array class (with the storage backend being a property of the array class, thus hiding this "implementation detail" from the user much more).

Bringing that up here, because it's a more fundamental issue (and not just a technical implementation detail of the PR) that relates to some of the things discussed above.

For example, @xhochy argued above for not (only) having a global option to control which storage backend is used, so you can decide on a per-column basis whether to use Arrow.

Personally, because using Arrow or not can still have some important user-facing consequences (eg regarding mutability / overhead when mutating the Arrow-based storage), I think it is useful to make this option easily available to the user when needed (so eg with the storage= keyword for StringDtype).

@TomAugspurger (Contributor, Author)

[jorisvandenbossche] I think it is useful to make this option easily available to the user when needed (so eg with the storage= keyword for StringDtype).

Agreed with this.

@TomAugspurger (Contributor, Author)

There are a few follow-up items to discuss, but I think this issue can be closed now that pandas 1.3.0 is released. Thanks all!
