ENH: Arrow backed string array - implement factorize() method without casting to objects #38007

simonjayhawkins · 2020-11-22T20:06:58Z

xref #35169 (comment), follow-on to #35259

This is more or less a copy/paste from https://github.com/xhochy/fletcher, with a slight tidy-up, for initial review feedback.

Still to do:

- benchmarking
- return type for chunked array with more than 1 chunk
- maybe update tests

We don't have failing tests, but the return type should be Tuple[np.ndarray, ExtensionArray]. With more than 1 chunk we instead return factorize(np_array, na_sentinel=na_sentinel), and the return type of pd.factorize is Tuple[np.ndarray, Union[np.ndarray, ABCIndex]]. (mypy doesn't report this as an error since np.ndarray resolves to Any.)

Since we don't have failing tests, we probably should parametrize the data for the base extension tests with an array with one chunk and an array with multiple chunks.
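
The invariant that a one-chunk versus multi-chunk parametrization would exercise can be sketched in plain Python, without pyarrow. The function below is a hypothetical stand-in (factorize_chunks is not a pandas or pyarrow API): it models each chunk as a list of strings with None for missing values, and shows that the codes and uniques should not depend on how the data is split into chunks.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def factorize_chunks(
    chunks: Sequence[Sequence[Optional[str]]], na_sentinel: int = -1
) -> Tuple[List[int], List[str]]:
    """Illustrative model only: encode values across all chunks with one shared dictionary."""
    seen: Dict[str, int] = {}  # value -> code, in order of first appearance
    codes: List[int] = []
    for chunk in chunks:
        for value in chunk:
            if value is None:
                # Missing values get the sentinel and never enter the uniques.
                codes.append(na_sentinel)
            else:
                codes.append(seen.setdefault(value, len(seen)))
    return codes, list(seen)


one_chunk = factorize_chunks([["a", "b", None, "a", "c"]])
two_chunks = factorize_chunks([["a", "b"], [None, "a", "c"]])
print(one_chunk == two_chunks)
```

A test parametrized over both layouts would assert exactly this equality, plus the return types discussed above.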

cc @jorisvandenbossche @xhochy

jorisvandenbossche

> we probably should parametrize data for the base extension tests with an array with one chunk and an array with multiple chunks.

That sounds like a good idea, yes (not sure how easy it is to do, since there are multiple data fixtures).

jorisvandenbossche · 2020-11-23T07:39:10Z

pandas/core/arrays/string_arrow.py

-        return cls._from_sequence(values)
+    @doc(ExtensionArray.factorize)
+    def factorize(self, na_sentinel: int = -1) -> Tuple[np.ndarray, ExtensionArray]:
+        if self._data.num_chunks == 1:


Nowadays, dictionary_encode works fine for ChunkedArrays as well, so I am not sure this if statement is actually needed.

Took a stab at that in fletcher to let CI verify this assumption: it seems to work with pyarrow 0.17-2.0 (xhochy/fletcher#206).

github-actions · 2020-12-24T00:24:24Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2020-12-24T20:39:08Z

This should yield a perf improvement, yes? Can you add an asv benchmark for this?

jorisvandenbossche · 2020-12-28T13:50:46Z

There is an existing benchmark in algorithms.py::Factorize.time_factorize() which has a "string" case -> can rename this to "str", and add an actual "string" dtype version.

jreback · 2020-12-28T17:11:14Z

pandas/core/arrays/string_arrow.py

+            if indices.dtype.kind == "f":
+                indices[np.isnan(indices)] = na_sentinel
+                indices = indices.astype(int)
+            if not is_int64_dtype(indices):


you can just do
indices = indices.astype(np.int64, copy=False)
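
A minimal sketch of that conversion, assuming indices arrives as a NumPy array that is either already an integer type or a float array in which NaN marks missing values. The helper name to_int64_codes is made up for illustration; only the np.isnan assignment and the astype(np.int64, copy=False) call come from the diff and the suggestion above.

```python
import numpy as np


def to_int64_codes(indices: np.ndarray, na_sentinel: int = -1) -> np.ndarray:
    """Hypothetical helper: replace NaN with the sentinel, then ensure int64."""
    indices = np.asarray(indices)
    if indices.dtype.kind == "f":
        # Copy first so the caller's array is not modified (the diff assigns in place).
        indices = indices.copy()
        indices[np.isnan(indices)] = na_sentinel
    # copy=False avoids a copy when the dtype is already int64,
    # which is why no separate is_int64_dtype check is needed.
    return indices.astype(np.int64, copy=False)


float_codes = to_int64_codes(np.array([0.0, np.nan, 1.0]))
int32_codes = to_int64_codes(np.array([0, 1, 0], dtype=np.int32))
print(float_codes, float_codes.dtype)
print(int32_codes, int32_codes.dtype)
```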

jreback · 2020-12-28T17:11:23Z

pandas/core/arrays/string_arrow.py

+            indices = encoded.indices.to_pandas()
+            if indices.dtype.kind == "f":
+                indices[np.isnan(indices)] = na_sentinel
+                indices = indices.astype(int)


int -> np.int64

github-actions · 2021-01-28T00:17:55Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2021-02-11T00:25:25Z

status here?

simonjayhawkins · 2021-02-16T11:19:23Z

> status here?

Benchmarks added, but pyarrow is not added to the asv env. This is OK for local testing with asv dev. I'm not sure whether we want to add pyarrow to the asv env.

Extension tests updated for arrays with more than 1 chunk. The good news is that we now see the factorize failures.

These will be fixed next by incorporating the latest changes from fletcher and addressing the other comments.

jorisvandenbossche

Looks good to me!

jorisvandenbossche · 2021-02-18T15:25:29Z

asv_bench/benchmarks/algorithms.py

@@ -5,6 +5,7 @@
 from pandas._libs import lib

 import pandas as pd
+from pandas.core.arrays.string_arrow import ArrowStringDtype


Can you do this in a try/except? (We need to still be able to run the benchmarks with slightly older pandas versions that might not have this import available.)
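
One way such a guard could look is sketched below. The import path is the one from the diff; the HAS_ARROW_STRING flag, the benchmark_dtypes helper, and the exact list of labels are assumptions made for illustration and are not taken from the benchmark file.

```python
# Guarded import: pandas versions without ArrowStringDtype (or a missing
# pandas install) raise ImportError, and the benchmark module still loads.
try:
    from pandas.core.arrays.string_arrow import ArrowStringDtype
    HAS_ARROW_STRING = True
except ImportError:
    ArrowStringDtype = None
    HAS_ARROW_STRING = False


def benchmark_dtypes():
    """Hypothetical helper: only offer the Arrow-backed case when it is importable."""
    dtypes = ["int", "uint", "float", "string"]
    if HAS_ARROW_STRING:
        dtypes.append("string_arrow")
    return dtypes


print(benchmark_dtypes())
```

The point of the pattern is that the module-level import never fails, so the remaining benchmark cases keep running on versions that lack the new dtype.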

jorisvandenbossche · 2021-02-18T15:30:12Z

pandas/core/arrays/string_arrow.py

+        ).to_pandas()
+        if indices.dtype.kind == "f":
+            indices[np.isnan(indices)] = na_sentinel
+        indices = indices.astype(np.int64, copy=False)


Wondering, is the int64 needed here? (pyarrow will typically use int32 as default I think)

I suppose that we always return int64 from factorize for the indices. Short-term, casting to int64 might be best then (to ensure nothing else breaks because of not doing that), but long term we should maybe check if internally we require int64 or would be fine with int32 as well.

> Wondering, is the int64 needed here? (pyarrow will typically use int32 as default I think)

refactor in 0023f08 partially to address comments

but yes, we seem to be getting an int32 from pyarrow

also we could maybe work with numpy arrays here directly for the indices instead of pandas Series?

jorisvandenbossche · 2021-02-18T15:37:56Z

@simonjayhawkins did you check what difference it gives in performance for the benchmark case compared to object dtype? (just that single case, no need to run the full suite)

simonjayhawkins · 2021-02-19T16:44:31Z

with changes in this PR to compare with String(Index)

[ 25.00%] ··· algorithms.Factorize.time_factorize                                                                            ok
[ 25.00%] ··· ======== ======= ==================== ==========
               unique    sort         dtype                   
              -------- ------- -------------------- ----------
                True     True          int           3.66±0ms 
                True     True          uint          4.55±0ms 
                True     True         float          15.1±0ms 
                True     True         string         59.8±0ms 
                True     True     datetime64[ns]     182±0μs  
                True     True   datetime64[ns, tz]   185±0μs  
                True     True         Int64          5.07±0ms 
                True     True        boolean         900±0μs  
                True     True      string_arrow      63.4±0ms 
                True    False          int           3.04±0ms 
                True    False          uint          2.78±0ms 
                True    False         float          2.49±0ms 
                True    False         string         8.48±0ms 
                True    False     datetime64[ns]     182±0μs  
                True    False   datetime64[ns, tz]   183±0μs  
                True    False         Int64          3.19±0ms 
                True    False        boolean         603±0μs  
                True    False      string_arrow      7.00±0ms 
               False     True          int           8.14±0ms 
               False     True          uint          8.45±0ms 
               False     True         float          20.9±0ms 
               False     True         string         75.7±0ms 
               False     True     datetime64[ns]     10.8±0ms 
               False     True   datetime64[ns, tz]   9.70±0ms 
               False     True         Int64          9.70±0ms 
               False     True        boolean         3.04±0ms 
               False     True      string_arrow      68.8±0ms 
               False    False          int           6.43±0ms 
               False    False          uint          6.28±0ms 
               False    False         float          8.13±0ms 
               False    False         string         20.4±0ms 
               False    False     datetime64[ns]     8.89±0ms 
               False    False   datetime64[ns, tz]   6.56±0ms 
               False    False         Int64          5.61±0ms 
               False    False        boolean         2.53±0ms 
               False    False      string_arrow      11.6±0ms 
              ======== ======= ==================== ==========

without factorize to compare with _from_factorized

[ 25.00%] ··· algorithms.Factorize.time_factorize                                                                            ok
[ 25.00%] ··· ======== ======= ==================== ==========
               unique    sort         dtype                   
              -------- ------- -------------------- ----------
                True     True          int           3.25±0ms 
                True     True          uint          4.36±0ms 
                True     True         float          14.9±0ms 
                True     True         string         59.6±0ms 
                True     True     datetime64[ns]     174±0μs  
                True     True   datetime64[ns, tz]   175±0μs  
                True     True         Int64          3.49±0ms 
                True     True        boolean         858±0μs  
                True     True      string_arrow      73.9±0ms 
                True    False          int           1.81±0ms 
                True    False          uint          1.50±0ms 
                True    False         float          2.45±0ms 
                True    False         string         7.45±0ms 
                True    False     datetime64[ns]     179±0μs  
                True    False   datetime64[ns, tz]   174±0μs  
                True    False         Int64          2.80±0ms 
                True    False        boolean         584±0μs  
                True    False      string_arrow      16.5±0ms 
               False     True          int           8.34±0ms 
               False     True          uint          9.37±0ms 
               False     True         float          21.1±0ms 
               False     True         string         75.0±0ms 
               False     True     datetime64[ns]     10.7±0ms 
               False     True   datetime64[ns, tz]   11.8±0ms 
               False     True         Int64          8.35±0ms 
               False     True        boolean         3.27±0ms 
               False     True      string_arrow      125±0ms  
               False    False          int           7.07±0ms 
               False    False          uint          6.18±0ms 
               False    False         float          8.71±0ms 
               False    False         string         23.1±0ms 
               False    False     datetime64[ns]     7.16±0ms 
               False    False   datetime64[ns, tz]   6.87±0ms 
               False    False         Int64          6.39±0ms 
               False    False        boolean         2.95±0ms 
               False    False      string_arrow      74.2±0ms 
              ======== ======= ==================== ==========
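
To read the two runs side by side, the string_arrow rows can be divided (time without the factorize override over time with this PR). The timings below are copied from the two tables above; the script itself is only an illustrative calculation, and single asv dev runs like these carry no error estimate.

```python
# string_arrow timings in ms, keyed by (unique, sort), copied from the two runs above.
with_pr = {
    (True, True): 63.4,
    (True, False): 7.00,
    (False, True): 68.8,
    (False, False): 11.6,
}
without_factorize = {
    (True, True): 73.9,
    (True, False): 16.5,
    (False, True): 125.0,
    (False, False): 74.2,
}

# Ratio > 1 means the run with this PR was faster.
speedups = {key: without_factorize[key] / with_pr[key] for key in with_pr}
for (unique, sort), ratio in speedups.items():
    print(f"unique={unique!s:<5} sort={sort!s:<5} speedup={ratio:.1f}x")
```

On these numbers the ratio ranges from about 1.2x (unique, sorted) to about 6.4x (non-unique, unsorted).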

simonjayhawkins · 2021-03-02T13:06:45Z

@jorisvandenbossche @jreback anything else to be done here?

jorisvandenbossche

Thanks for the ping. Can you merge latest master to be sure?

simonjayhawkins · 2021-03-02T16:22:40Z

> Can you merge latest master to be sure?

greenish.

jorisvandenbossche · 2021-03-02T16:51:48Z

Thanks!

simonjayhawkins added 3 commits November 21, 2020 19:54

moreless copy/paste from fletcher

c53a3c2

use docstring from base class

b7d0ab8

remove redundant type check

154496a

simonjayhawkins added the ExtensionArray and Strings labels Nov 22, 2020

jorisvandenbossche reviewed Nov 23, 2020

jorisvandenbossche mentioned this pull request Nov 23, 2020

Plan for a native string dtype #35169

Closed

jorisvandenbossche changed the title ~~Arrow backed string array - implement factorize() method (instead of converting to objects through _values_for_factorize)~~ ENH: Arrow backed string array - implement factorize() method without casting to objects Nov 23, 2020

github-actions bot added the Stale label Dec 24, 2020

jorisvandenbossche removed the Stale label Dec 24, 2020

jreback requested changes Dec 28, 2020

github-actions bot added the Stale label Jan 28, 2021

jorisvandenbossche removed the Stale label Jan 28, 2021

Merge remote-tracking branch 'upstream/master' into factorize

c545970

simonjayhawkins added 4 commits February 15, 2021 10:46

Merge remote-tracking branch 'upstream/master' into factorize

6e3aac8

ignore new mypy error

73c7de9

update algorithms.Factorize.time_factorize

42ca9c3

test for arrays with 2 chunks

a251537

simonjayhawkins added 4 commits February 18, 2021 11:13

Merge remote-tracking branch 'upstream/master' into factorize

dbc8253

fix failing test_factorize_equivalence

ea59c38

fix failing test_factorize_empty

7d98727

address dtype comment

0023f08

simonjayhawkins added this to the 1.3 milestone Feb 18, 2021

simonjayhawkins requested a review from jorisvandenbossche February 18, 2021 15:16

simonjayhawkins requested a review from jreback February 18, 2021 15:16

jorisvandenbossche reviewed Feb 18, 2021

simonjayhawkins added 2 commits February 19, 2021 16:46

move ArrowStringDtype import inside try/except

6a28414

Merge remote-tracking branch 'upstream/master' into factorize

c4db20d

jorisvandenbossche approved these changes Mar 2, 2021

Merge remote-tracking branch 'upstream/master' into factorize

88ab4f4

jorisvandenbossche merged commit 7be41ca into pandas-dev:master Mar 2, 2021

simonjayhawkins deleted the factorize branch March 2, 2021 17:09
