Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Arrow backed string array - implement factorize() method without casting to objects #38007

Merged
merged 15 commits into from
Mar 2, 2021

Conversation

simonjayhawkins
Copy link
Member

@simonjayhawkins simonjayhawkins commented Nov 22, 2020

xref #35169 (comment), follow-on to #35259

This is moreless copy/paste from https://github.com/xhochy/fletcher with a slight tidy for initial review feedback

still to do

  • benchmarking
  • return type for chunked array with more than 1 chunk
  • maybe update tests.

we don't have failing tests, but the return type should be Tuple[np.ndarray, ExtensionArray] while we have return factorize(np_array, na_sentinel=na_sentinel) if more than 1 chunk and the return type of pd.factorize is Tuple[np.ndarray, Union[np.ndarray, ABCIndex]] (mypy doesn't report this as an error since np.ndarray resolves to Any)

since we don't have failing tests, we probably should paramatrize data for the base extension tests with an array with one chunk and a array with multiple chunks.

cc @jorisvandenbossche @xhochy

@simonjayhawkins simonjayhawkins added ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data labels Nov 22, 2020
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably should paramatrize data for the base extension tests with an array with one chunk and a array with multiple chunks.

That sounds as a good idea, yes (not sure how easy it is to do, since there are multiple data fixtures)

return cls._from_sequence(values)
@doc(ExtensionArray.factorize)
def factorize(self, na_sentinel: int = -1) -> Tuple[np.ndarray, ExtensionArray]:
if self._data.num_chunks == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nowadays, dictionary_encode works fine for ChunkedArrays as well, so I am not sure this if statement is actually needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tooks a stab at that in fletcher to let CI verify this assumption: Seems to work with pyarrow 0.17-2.0 xhochy/fletcher#206

@jorisvandenbossche jorisvandenbossche changed the title Arrow backed string array - implement factorize() method (instead of converting to objects through _values_for_factorize) ENH: Arrow backed string array - implement factorize() method without casting to objects Nov 23, 2020
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@jreback
Copy link
Contributor

jreback commented Dec 24, 2020

this should yield a perf improvement yes? can you add an asv for this.

@jorisvandenbossche
Copy link
Member

There is an existing benchmark in algorithms.py::Factorize.time_factorize() which has a "string" case -> can rename this to "str", and add an actual "string" dtype version.

if indices.dtype.kind == "f":
indices[np.isnan(indices)] = na_sentinel
indices = indices.astype(int)
if not is_int64_dtype(indices):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just do
indices = indices.astype(np.int64, copy=False)

indices = encoded.indices.to_pandas()
if indices.dtype.kind == "f":
indices[np.isnan(indices)] = na_sentinel
indices = indices.astype(int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int -> np.int64

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

status here?

@simonjayhawkins
Copy link
Member Author

simonjayhawkins commented Feb 16, 2021

status here?

benchmarks added but pyarrow not added to env. this is OK for local testing with asv dev. I'm not sure whether we want to add pyarrow to asv env.

extensions tests updated for arrays with more than 1 chunk. good news is that we now see the factorize failures

these will be fixed by incorporating latest changes from fletcher next and other comments addressed.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@@ -5,6 +5,7 @@
from pandas._libs import lib

import pandas as pd
from pandas.core.arrays.string_arrow import ArrowStringDtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do this in a try/except? (we need to be able to still run the benchmarks with slightly older pandas version that might not have this import available)

).to_pandas()
if indices.dtype.kind == "f":
indices[np.isnan(indices)] = na_sentinel
indices = indices.astype(np.int64, copy=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering, is the int64 needed here? (pyarrow will typically use int32 as default I think)

I suppose that we always return int64 from factorize for the indices. Short-term, casting to int64 might be best then (to ensure nothing else breaks because of not doing that), but long term we should maybe check if internally we require int64 or would be fine with int32 as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering, is the int64 needed here? (pyarrow will typically use int32 as default I think)

refactor in 0023f08 partially to address comments

but yes, we seem to be getting an int32 from pyarrow

also we could maybe work with numpy arrays here directly for the indices instead of pandas Series?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 18, 2021

@simonjayhawkins did you check what difference it gives in performance for the benchmark case compared to object dtype? (just that single case, no need to run the full suite)

@simonjayhawkins
Copy link
Member Author

with changes in this PR to compare with String(Index)

[ 25.00%] ··· algorithms.Factorize.time_factorize                                                                            ok
[ 25.00%] ··· ======== ======= ==================== ==========
               unique    sort         dtype                   
              -------- ------- -------------------- ----------
                True     True          int           3.66±0ms 
                True     True          uint          4.55±0ms 
                True     True         float          15.1±0ms 
                True     True         string         59.8±0ms 
                True     True     datetime64[ns]     182±0μs  
                True     True   datetime64[ns, tz]   185±0μs  
                True     True         Int64          5.07±0ms 
                True     True        boolean         900±0μs  
                True     True      string_arrow      63.4±0ms 
                True    False          int           3.04±0ms 
                True    False          uint          2.78±0ms 
                True    False         float          2.49±0ms 
                True    False         string         8.48±0ms 
                True    False     datetime64[ns]     182±0μs  
                True    False   datetime64[ns, tz]   183±0μs  
                True    False         Int64          3.19±0ms 
                True    False        boolean         603±0μs  
                True    False      string_arrow      7.00±0ms 
               False     True          int           8.14±0ms 
               False     True          uint          8.45±0ms 
               False     True         float          20.9±0ms 
               False     True         string         75.7±0ms 
               False     True     datetime64[ns]     10.8±0ms 
               False     True   datetime64[ns, tz]   9.70±0ms 
               False     True         Int64          9.70±0ms 
               False     True        boolean         3.04±0ms 
               False     True      string_arrow      68.8±0ms 
               False    False          int           6.43±0ms 
               False    False          uint          6.28±0ms 
               False    False         float          8.13±0ms 
               False    False         string         20.4±0ms 
               False    False     datetime64[ns]     8.89±0ms 
               False    False   datetime64[ns, tz]   6.56±0ms 
               False    False         Int64          5.61±0ms 
               False    False        boolean         2.53±0ms 
               False    False      string_arrow      11.6±0ms 
              ======== ======= ==================== ==========

without factorize to compare with _from_factorized

[ 25.00%] ··· algorithms.Factorize.time_factorize                                                                            ok
[ 25.00%] ··· ======== ======= ==================== ==========
               unique    sort         dtype                   
              -------- ------- -------------------- ----------
                True     True          int           3.25±0ms 
                True     True          uint          4.36±0ms 
                True     True         float          14.9±0ms 
                True     True         string         59.6±0ms 
                True     True     datetime64[ns]     174±0μs  
                True     True   datetime64[ns, tz]   175±0μs  
                True     True         Int64          3.49±0ms 
                True     True        boolean         858±0μs  
                True     True      string_arrow      73.9±0ms 
                True    False          int           1.81±0ms 
                True    False          uint          1.50±0ms 
                True    False         float          2.45±0ms 
                True    False         string         7.45±0ms 
                True    False     datetime64[ns]     179±0μs  
                True    False   datetime64[ns, tz]   174±0μs  
                True    False         Int64          2.80±0ms 
                True    False        boolean         584±0μs  
                True    False      string_arrow      16.5±0ms 
               False     True          int           8.34±0ms 
               False     True          uint          9.37±0ms 
               False     True         float          21.1±0ms 
               False     True         string         75.0±0ms 
               False     True     datetime64[ns]     10.7±0ms 
               False     True   datetime64[ns, tz]   11.8±0ms 
               False     True         Int64          8.35±0ms 
               False     True        boolean         3.27±0ms 
               False     True      string_arrow      125±0ms  
               False    False          int           7.07±0ms 
               False    False          uint          6.18±0ms 
               False    False         float          8.71±0ms 
               False    False         string         23.1±0ms 
               False    False     datetime64[ns]     7.16±0ms 
               False    False   datetime64[ns, tz]   6.87±0ms 
               False    False         Int64          6.39±0ms 
               False    False        boolean         2.95±0ms 
               False    False      string_arrow      74.2±0ms 
              ======== ======= ==================== ==========

@simonjayhawkins
Copy link
Member Author

@jorisvandenbossche @jreback anything else to be done here?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the ping. Can you merge latest master to be sure?

@simonjayhawkins
Copy link
Member Author

Can you merge latest master to be sure?

greenish.

@jorisvandenbossche jorisvandenbossche merged commit 7be41ca into pandas-dev:master Mar 2, 2021
@jorisvandenbossche
Copy link
Member

Thanks!

@simonjayhawkins simonjayhawkins deleted the factorize branch March 2, 2021 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants