Describe the bug
Hi, thank you for your support!
It seems that the addition of multi-key sorting (#5502) in 2.10.0 broke the .sort() method.
After filtering a dataset with .filter(), .sort() appears to use the query_table index of the previous, unfiltered dataset, which results in an IndexError.
This only happens with the 2.10.0 release.
Steps to reproduce the bug
from datasets import load_dataset
# dataset with length of 1104
ds = load_dataset('glue', 'ax')['test']
ds = ds.filter(lambda x: x['idx'] > 1100)
ds.sort('premise')
print('Done')
  File "/home/dongkeun/datasets_test/test.py", line 5, in <module>
    ds.sort('premise')
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3959, in sort
    sort_table = query_table(
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 588, in query_table
    _check_valid_index_key(key, size)
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 537, in _check_valid_index_key
    _check_valid_index_key(max(key), size=size)
  File "/home/dongkeun/miniconda3/envs/datasets_test/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 531, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 1103 is out of bounds for size 3
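To illustrate the failure mode I suspect, here is a toy model in plain Python. It is only a sketch, assuming that a filtered dataset keeps the original rows plus an indices mapping; `FilteredView`, `check_valid_index`, `buggy_query` and `correct_query` are hypothetical names for illustration and are not part of `datasets`, and the actual cause in 2.10.0 may differ.

```python
class FilteredView:
    """Toy stand-in for a filtered dataset: unfiltered rows plus an indices mapping."""

    def __init__(self, rows, indices=None):
        self.rows = rows
        # Positions (in the unfiltered rows) that this view exposes.
        self.indices = list(range(len(rows))) if indices is None else indices

    def __len__(self):
        return len(self.indices)

    def filter(self, predicate):
        kept = [i for i in self.indices if predicate(self.rows[i])]
        return FilteredView(self.rows, kept)


def check_valid_index(key, size):
    # Same message format as the IndexError in the traceback above.
    if key >= size or key < -size:
        raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")


def buggy_query(view):
    # Assumed failure mode: positions in the unfiltered rows are validated
    # against the length of the filtered view.
    check_valid_index(max(view.indices), len(view))
    return [view.rows[i] for i in view.indices]


def correct_query(view):
    # Keys 0..len(view)-1 are validated against the view, then resolved
    # through the indices mapping.
    out = []
    for key in range(len(view)):
        check_valid_index(key, len(view))
        out.append(view.rows[view.indices[key]])
    return out


rows = [{"idx": i} for i in range(1104)]
view = FilteredView(rows).filter(lambda r: r["idx"] > 1100)

try:
    buggy_query(view)
    error_message = None
except IndexError as err:
    error_message = str(err)

sorted_idx = sorted(r["idx"] for r in correct_query(view))
print(error_message)
print(sorted_idx)
```

In this toy model, the filtered view has 3 rows whose positions in the unfiltered rows are 1101 to 1103, so validating the largest position against the view's length reproduces the same message as above.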
Expected behavior
It should sort the dataset and print "Done", as it does on 2.9.0.
Environment info
datasets version: 2.10.0