GH-40153: [Python] Avoid using np.take in Array.to_numpy() #40295

pitrou · 2024-02-29T15:56:02Z

Rationale for this change

Array.to_numpy calls np.take to linearize dictionary arrays. This fails on 32-bit NumPy builds because we pass NumPy 64-bit indices, which it would need to downcast to its native 32-bit index type.

What changes are included in this PR?

Avoid calling np.take, instead using our own dictionary decoding routine.
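
For readers unfamiliar with dictionary decoding, here is a minimal conceptual sketch in pure Python. It is not the routine used in this PR (that one lives in Arrow's C++ code); the function name `decode_dictionary` and the use of `None` to mark nulls are assumptions made for illustration only. It shows the operation that `np.take(dictionary, indices)` performed before: each index is replaced by the dictionary value it points to.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def decode_dictionary(
    indices: Sequence[Optional[int]], dictionary: Sequence[T]
) -> List[Optional[T]]:
    """Expand dictionary-encoded data into a dense list of values.

    Hypothetical helper: `None` in `indices` stands for a null slot.
    """
    decoded: List[Optional[T]] = []
    for index in indices:
        if index is None:
            # A null index decodes to a null value.
            decoded.append(None)
            continue
        if not 0 <= index < len(dictionary):
            # Reject out-of-range indices instead of wrapping around.
            raise IndexError(f"dictionary index {index} out of range")
        decoded.append(dictionary[index])
    return decoded


indices = [0, 1, None, 0, 2]
dictionary = ["apple", "banana", "cherry"]
dense = decode_dictionary(indices, dictionary)
print(dense)
```

Because the lookup is done element by element, the width of the index type does not matter here, which is the property the PR relies on when it stops handing 64-bit indices to NumPy.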

Are these changes tested?

Yes. A test failure is fixed on 32-bit.

Are there any user-facing changes?

No.

GitHub Issue: [Python] Test failures on 32-bit x86 #40153

pitrou · 2024-02-29T15:56:21Z

@github-actions crossbow submit -g python

github-actions · 2024-02-29T15:58:53Z

Revision: 7fc1ec7

Submitted crossbow builds: ursacomputing/crossbow @ actions-a31e58e0d0

Task	Status
test-conda-python-3.10
test-conda-python-3.10-cython2
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-latest
test-conda-python-3.10-pandas-nightly
test-conda-python-3.10-spark-v3.5.0
test-conda-python-3.10-substrait
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-upstream_devel
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.8
test-conda-python-3.8-pandas-1.0
test-conda-python-3.8-spark-v3.5.0
test-conda-python-3.9
test-conda-python-3.9-pandas-latest
test-cuda-python
test-debian-11-python-3-amd64
test-debian-11-python-3-i386
test-fedora-39-python-3
test-ubuntu-20.04-python-3
test-ubuntu-22.04-python-3

jorisvandenbossche · 2024-02-29T16:20:22Z

python/pyarrow/src/arrow/python/arrow_to_pandas.cc

@@ -2515,6 +2515,8 @@ Status ConvertChunkedArrayToPandas(const PandasOptions& options,
                                   std::shared_ptr<ChunkedArray> arr, PyObject* py_ref,
                                   PyObject** out) {
  if (options.decode_dictionaries && arr->type()->id() == Type::DICTIONARY) {
+    // XXX we should return an error as below if options.zero_copy_only
+    // is true, but that would break compatibility with existing tests.


So at the moment we essentially just ignore the zero_copy_only keyword in the case of a DictionaryArray?

conbench-apache-arrow · 2024-03-01T06:35:40Z

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 5c4869d.

There was 1 benchmark result indicating a performance regression:

Commit Run on ursa-i9-9960x at 2024-03-01 00:48:47Z
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-17, scale_factor=1

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

…che#40295)

### Rationale for this change

`Array.to_numpy` calls `np.take` to linearize dictionary arrays. This fails on 32-bit Numpy builds because we give Numpy 64-bit indices and Numpy would like to downcast them.

### What changes are included in this PR?

Avoid calling `np.take`, instead using our own dictionary decoding routine.

### Are these changes tested?

Yes. A test failure is fixed on 32-bit.

### Are there any user-facing changes?

No.

* GitHub Issue: apache#40153

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

apacheGH-40153: [Python] Avoid using np.take in Array.to_numpy()

7fc1ec7

pitrou requested a review from jorisvandenbossche February 29, 2024 15:56

github-actions bot added the Component: Python and awaiting review labels Feb 29, 2024

pitrou mentioned this pull request Feb 29, 2024

[Python] Test failures on 32-bit x86 #40153

Closed

jorisvandenbossche approved these changes Feb 29, 2024

github-actions bot added the awaiting merge label and removed the awaiting review label Feb 29, 2024

pitrou merged commit 5c4869d into apache:main Feb 29, 2024
14 checks passed

pitrou removed the awaiting merge label Feb 29, 2024

pitrou deleted the gh40153-dict-to-numpy branch February 29, 2024 18:14
