Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pyarrow is_null filter not working as expected after loading using deltalake #1496

Closed
paul-rohith opened this issue Jun 27, 2023 · 9 comments · Fixed by #1520
Closed

Pyarrow is_null filter not working as expected after loading using deltalake #1496

paul-rohith opened this issue Jun 27, 2023 · 9 comments · Fixed by #1520
Labels
binding/python Issues for the Python package bug Something isn't working good first issue Good for newcomers

Comments

@paul-rohith
Copy link

paul-rohith commented Jun 27, 2023

Environment

Delta-rs version: 0.10.0

Binding: Python

Environment: Seemed to have the same issue across Windows 10 and Amazon Workspaces

  • Cloud provider:
  • OS:
  • Other:

Bug

Explained in the following stackoverflow question: https://stackoverflow.com/questions/76557635/possible-bug-in-using-pyarrow-is-null-function-with-delta-tables
What happened: The is_null filter only returned rows from partitions where all rows have null values for the column in question i.e. any rows with null values that belonged to a partition that had other non-null values was not returned. Using pa.dataset.dataset("/path/to/table") works as expected.

What you expected to happen: I expected all rows with null values to be filtered.

How to reproduce it:

import pyarrow as pa
import pyarrow.compute as pc
from deltalake import write_deltalake, DeltaTable

partition_column = pa.array(['a', 'a', 'b', 'b', 'c', 'c'], type=pa.string())
int_value = pa.array([12, None, None, None, 15, 67], type=pa.int64())
names = ["partition_column", "int_value"]

table = pa.Table.from_arrays([partition_column, int_value], names=names)
write_deltalake("./resources/is_null_test_table", table, partition_by="partition_column")

ds = DeltaTable("./resources/is_null_test_table").to_pyarrow_dataset()
pat = ds.to_table(filter=(pc.field("int_value").is_valid()))
print(pat.to_pandas())

I'd expect this code to retrieve the 3 rows with None values but it only retrieves 2 of them - the ones which belong to the same partition ('b').

More details:

@paul-rohith paul-rohith added the bug Something isn't working label Jun 27, 2023
@paul-rohith
Copy link
Author

paul-rohith commented Jul 5, 2023

Hi, I'd appreciate any insights on whether this is something i can fix on my end. I'm adding an example of how to reproduce the error as well in the issue.

@wjones127
Copy link
Collaborator

Hi @paul-rohith. It sounds like we need to update the statistics conversion to handle nulls better. We handle the case where there is all null or all valid, but don't specify anything when there is mixed values.

delta-rs/python/src/lib.rs

Lines 645 to 657 in 56dfd25

for (col_name, null_count) in stats.null_count.iter().filter_map(|(k, v)| match v {
ColumnCountStat::Value(val) => Some((k, val)),
_ => None,
}) {
if *null_count == stats.num_records {
expressions.push(field.call1((col_name.clone(),))?.call_method0("is_null"));
}
if *null_count == 0 {
expressions.push(field.call1((col_name.clone(),))?.call_method0("is_valid"));
}
}
}

I suspect we instead need write guarantees like

(x > 0 and x < 10) or (x is null)

or perhaps we need to be using and_kleene instead of and.

@wjones127 wjones127 added good first issue Good for newcomers binding/python Issues for the Python package labels Jul 6, 2023
@paul-rohith
Copy link
Author

Hi, while I'd love to contribute to fixing the bug, I have no experience with Rust - I'm currently working in Python using the deltalake package.
I wanted to ask if you can think of any deltalake/pyarrow based workarounds for this issue until it's resolved. It has some pretty major effects on the code I'm trying to write.
Also, any timelines on when you expect this to be fixed would be great. Thanks!

@wjones127
Copy link
Collaborator

Im on vacation this week so I’ll take a look next week.

I don’t think there’s an obvious Python only patch. But the Rust code itself isn’t too hard. It’s just calling into some Python code.

@paul-rohith
Copy link
Author

Hi, thanks for your time!
Is there any other way I can filter Delta tables without loading all the data while the patch is in the works? That's my specific use-case currently so I'm a little stuck due to this bug, I noticed it seems to affect other filters as well.
If there's no other option, I think I'll have to load the entire data into a pyarrow table and then apply the filters but that's something I'd only like to do as an absolute last resort.

@paul-rohith
Copy link
Author

After some further testing, I think the bug is actually a little more general than what I initially reported:
In partitions with mixed (null and non-null) values, a filter that is satisfied by all non-null values returns the entire partition including the null value rows despite these not satisfying the filter.
I'm not sure if this is in line with the portion of buggy code highlighted above, will leave that to you to judge :)

@wjones127
Copy link
Collaborator

Instead of loading the whole table, you can convert to a dataset, and then get a record batch reader off of that. That will let you read data in batches and filter from there

@paul-rohith
Copy link
Author

Won't this still involve loading the entire data, just not in one go?
Or do you mean dataset.to_batches(filter=...), in which case I'm not sure why the same bug won't still be present?

@wjones127
Copy link
Collaborator

Yes, you can just read in batches. I suppose you could also just pass predicates that don’t involve nullable columns for the time being.

wjones127 added a commit that referenced this issue Jul 9, 2023
# Description

Fixes issue where predicate pushdown isn't working for null values. This
adds tests for both columns and partition columns.

# Related Issue(s)

- closes #1496


# Documentation

<!---
Share links to useful documentation
--->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package bug Something isn't working good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants