.sort() is broken when used after .filter(), only in 2.10.0 #5586

Closed
MattYoon opened this issue Feb 28, 2023 · 1 comment · Fixed by #5587
Labels
bug Something isn't working

Comments

@MattYoon

MattYoon commented Feb 28, 2023

Describe the bug

Hi, thank you for your support!

It seems like the addition of multiple key sort (#5502) in 2.10.0 broke the .sort() method.

After filtering a dataset with .filter(), .sort() appears to pass indices from the original, unfiltered dataset to query_table, which then raises an IndexError.

This only happens with the 2.10.0 release.

Steps to reproduce the bug

from datasets import load_dataset

# dataset with length of 1104
ds = load_dataset('glue', 'ax')['test']
ds = ds.filter(lambda x: x['idx'] > 1100)
ds.sort('premise')
print('Done')

  File "/home/dongkeun/datasets_test/test.py", line 5, in <module>
    ds.sort('premise')
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3959, in sort
    sort_table = query_table(
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 588, in query_table
    _check_valid_index_key(key, size)
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 537, in _check_valid_index_key
    _check_valid_index_key(max(key), size=size)
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 531, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 1103 is out of bounds for size 3
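
The error message pairs a row index from the unfiltered dataset (1103) with the length of the filtered one (3). The toy model below shows how an error of that shape can arise when original row indices are bounds-checked against the filtered size. It is only an illustrative sketch: `FilteredView` and its `query` method are hypothetical names, not the datasets library's internals, and the actual cause in 2.10.0 may differ in detail.

```python
class FilteredView:
    """Hypothetical stand-in for a filtered dataset: a full table plus kept row indices."""

    def __init__(self, table, indices):
        self.table = table      # every row of the unfiltered data
        self.indices = indices  # positions of the surviving rows inside `table`

    def __len__(self):
        return len(self.indices)

    def query(self, keys):
        """Return the rows at `keys`, where each key is a position within the view."""
        keys = list(keys)
        size = len(self)
        # Bounds check against the size of the filtered view, largest key first.
        if keys and max(keys) >= size:
            raise IndexError(f"Invalid key: {max(keys)} is out of bounds for size {size}")
        # Translate view positions into positions in the underlying table.
        return [self.table[self.indices[key]] for key in keys]


table = list(range(1104))  # same length as the dataset in the report
view = FilteredView(table, [i for i in table if i > 1100])

# Correct: keys are positions within the filtered view (0, 1, 2).
rows = view.query(range(len(view)))

# Incorrect: keys are positions in the unfiltered table (1101, 1102, 1103).
try:
    view.query(view.indices)
    error = None
except IndexError as exc:
    error = str(exc)
```

In this sketch the second call fails because the keys it receives are positions in the unfiltered table, while the bounds check uses the length of the filtered view.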

Expected behavior

It should sort the dataset and print "Done", as it does on 2.9.0.

Environment info

  • datasets version: 2.10.0
  • Platform: Linux-5.15.0-41-generic-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • PyArrow version: 11.0.0
  • Pandas version: 1.5.3
@mariosasko mariosasko added the bug Something isn't working label Feb 28, 2023
@lhoestq
Member

lhoestq commented Feb 28, 2023

Thanks for reporting, and thanks @mariosasko for fixing! We just did a patch release, 2.10.1, with the fix.
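
For anyone who wants to guard against the affected release in their own scripts, here is a minimal sketch of a version check. The helper names `parse_version` and `is_affected` are hypothetical, and the check assumes a plain numeric version string such as "2.10.0". It relies only on what this thread states: the bug appears in 2.10.0 and is fixed in 2.10.1.

```python
def parse_version(text):
    """Turn a version string like '2.10.0' into a tuple of ints, e.g. (2, 10, 0)."""
    # Only the first three dot-separated parts are read; suffixes such as
    # '.dev0' are ignored, which is an assumption of this sketch.
    return tuple(int(part) for part in text.split(".")[:3])


def is_affected(version_text):
    """Return True only for the release this issue reports as broken."""
    return parse_version(version_text) == (2, 10, 0)


# Example: decide whether an upgrade is needed for a few version strings.
checks = {v: is_affected(v) for v in ["2.9.0", "2.10.0", "2.10.1"]}
```

In a real script the string passed to `is_affected` would typically be `datasets.__version__`.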
