Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch - updates maybe_indices_to_slice to use intp_t #59537

Closed
wants to merge 6 commits into from

Conversation

benjamindonnachie
Copy link

@benjamindonnachie benjamindonnachie commented Aug 17, 2024

Updates maybe_indices_to_slice to use uint64 allowing massive dataframes to be manipulated (see pandas-dev#59531)
Update maybe_indices_to_slice to use uint64_t allowing manipulation of massive data frames (See pandas-dev#59531)
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inp_t is the proper size for an indexer - does it fix things if you try using that consistently?

@benjamindonnachie
Copy link
Author

Happy to give it a try. I'll let you know.

@benjamindonnachie
Copy link
Author

Do you mean intp_t ? If so, changing max_len to intp_t and np.intp respectively in lib.pyx and lib.pyi (to match the array definitions) also produces expected results:

>>> import pandas as pd
>>> print (pd.__version__)
3.0.0.dev0+1351.g523afa840a.dirty
>>> test = pd.read_pickle("summary.pkl")
>>> summary = test[test['count'] > 0]
>>> print (len(test))
4261028590
>>> print (len(summary))
1621743

Very minor changes but happy to submit another PR if that works for you.

@WillAyd
Copy link
Member

WillAyd commented Aug 17, 2024

Yes that's correct. You can just update this PR - no need to create another

Revised to use intp_t for max_len in maybe_indices_to_slice.  As per conversation with WillAyd revised patch as intp_t is the proper size for an indexer.
Updated may_indices_to_slice to use np.intp for max_len.  As per conversation with WillAyd revised patch as intp_t is the proper size for an indexer.
@benjamindonnachie benjamindonnachie changed the title Patch - updates maybe_indices_to_slice to use uint64 Patch - updates maybe_indices_to_slice to use intp_t Aug 17, 2024
@benjamindonnachie
Copy link
Author

Got there.... Many thanks for your help! :)

@WillAyd
Copy link
Member

WillAyd commented Aug 17, 2024

@benjamindonnachie great I think this change looks pretty good! Let's see if CI is green.

You will also want to add a whatsnew entry, which can be added to pandas/doc/source/whatsnew/v3.0.0.rst` You can add it to the section on Bug Fixes > Indexing (just follow the same pattern you see for other notes already there)

Updated whatsnew to reflect fix to maybe_indices_to_slice to address OverflowError.
@benjamindonnachie
Copy link
Author

Whatsnew/v3.0.0.rst updated. Fingers crossed! :)

@mroeschke mroeschke added Internals Related to non-user accessible pandas implementation Indexing Related to indexing on series/frames, not to indexes themselves labels Aug 19, 2024
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Sep 19, 2024
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Oct 10, 2024
@benjamindonnachie
Copy link
Author

@mroeschke I'm not sure what else you need given the minor nature of this patch.

@mroeschke
Copy link
Member

It appeared the code checks were not all passing https://github.com/pandas-dev/pandas/actions/runs/10938423968/job/30366432851

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants