GH-40153: [Python] Avoid using np.take in Array.to_numpy() #40295

Merged
merged 1 commit into from
Feb 29, 2024

pitrou (Member) commented Feb 29, 2024

Rationale for this change

`Array.to_numpy` calls `np.take` to linearize dictionary arrays. This fails on 32-bit NumPy builds because we pass NumPy 64-bit indices, which it would have to downcast to its narrower native index type.
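To make the width mismatch concrete, here is a minimal sketch (not Arrow or NumPy code). The helper names are hypothetical, and using the interpreter's pointer size as a stand-in for NumPy's native index width is an assumption made for illustration.

```python
import struct


def native_index_bits() -> int:
    """Return the pointer width of this interpreter build, in bits.

    Assumption: this approximates the width of NumPy's native index type.
    """
    return struct.calcsize("P") * 8


def needs_downcast(index_bits: int, native_bits: int) -> bool:
    """Return True when the indices are wider than the native index type."""
    return index_bits > native_bits


# Dictionary indices handed over as 64-bit integers, checked against this build.
print(needs_downcast(64, native_index_bits()))
```

On a 64-bit build no narrowing is needed, which is presumably why the failure only showed up on 32-bit.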

What changes are included in this PR?

Avoid calling np.take, instead using our own dictionary decoding routine.
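For readers unfamiliar with dictionary decoding, the idea can be sketched in plain Python. This is not the routine the PR switches to; `decode_dictionary` is a hypothetical name, and representing nulls as `None` is an assumption of this sketch.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def decode_dictionary(
    indices: Sequence[Optional[int]], dictionary: Sequence[T]
) -> List[Optional[T]]:
    """Expand dictionary-encoded data into a dense list of values.

    Each index selects one entry of `dictionary`; a None index
    (a null slot in this sketch) stays None in the output.
    """
    decoded: List[Optional[T]] = []
    for idx in indices:
        if idx is None:
            decoded.append(None)
            continue
        if not 0 <= idx < len(dictionary):
            raise IndexError(f"dictionary index {idx} out of range")
        decoded.append(dictionary[idx])
    return decoded


print(decode_dictionary([0, 2, None, 1, 0], ["a", "b", "c"]))
# → ['a', 'c', None, 'b', 'a']
```

Because the lookup is done without handing the index buffer to `np.take`, the width of the index type no longer has to match what NumPy accepts.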

Are these changes tested?

Yes. A test failure is fixed on 32-bit.

Are there any user-facing changes?

No.

pitrou (Member, Author) commented Feb 29, 2024

@github-actions crossbow submit -g python

Revision: 7fc1ec7

Submitted crossbow builds: ursacomputing/crossbow @ actions-a31e58e0d0

Task Status
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-cuda-python GitHub Actions
test-debian-11-python-3-amd64 Azure
test-debian-11-python-3-i386 GitHub Actions
test-fedora-39-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 GitHub Actions

@@ -2515,6 +2515,8 @@ Status ConvertChunkedArrayToPandas(const PandasOptions& options,
                                   std::shared_ptr<ChunkedArray> arr, PyObject* py_ref,
                                   PyObject** out) {
  if (options.decode_dictionaries && arr->type()->id() == Type::DICTIONARY) {
    // XXX we should return an error as below if options.zero_copy_only
    // is true, but that would break compatibility with existing tests.
Member

At the moment we essentially just ignore the zero_copy_only keyword in case of a DictionaryArray?

Member Author

Yes.
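The behavior discussed in this thread can be modeled with a small sketch. It is a hypothetical model, not the Arrow implementation: `convert` and its `strict` flag are invented for illustration, with `strict=True` standing in for the error the XXX comment says should eventually be returned.

```python
from typing import List, Optional, Sequence


def convert(
    indices: List[int],
    dictionary: Optional[Sequence[str]] = None,
    zero_copy_only: bool = False,
    strict: bool = False,
) -> list:
    """Model a conversion that decodes dictionary-encoded input.

    With strict=False (the behavior confirmed in this thread),
    zero_copy_only is ignored for dictionary input and a decoded
    copy is always produced.
    """
    if dictionary is not None:
        if strict and zero_copy_only:
            raise ValueError("decoding a dictionary array requires a copy")
        # Decoding always materializes new values, so this is a copy.
        return [dictionary[i] for i in indices]
    # Non-dictionary input: hand back the same object (no copy).
    return indices


# zero_copy_only is silently ignored for dictionary input.
print(convert([0, 1, 0], ["x", "y"], zero_copy_only=True))
# → ['x', 'y', 'x']
```

Raising in the strict case would be the more consistent behavior, but, as the code comment notes, it would break compatibility with existing tests.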

github-actions bot added the `awaiting merge` label and removed the `awaiting review` label on Feb 29, 2024.
pitrou merged commit 5c4869d into apache:main on Feb 29, 2024.
14 checks passed.
pitrou removed the `awaiting merge` label on Feb 29, 2024.
pitrou deleted the gh40153-dict-to-numpy branch on February 29, 2024, 18:14.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 5c4869d.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
…che#40295)

* GitHub Issue: apache#40153

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>