Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in_list doesn't handle value being a Dictionary type when check data type matches #9530

Closed
advancedxy opened this issue Mar 10, 2024 · 5 comments · Fixed by #10031
Closed
Labels
bug Something isn't working

Comments

@advancedxy
Copy link
Contributor

Describe the bug

When I'm working on apache/datafusion-comet#184, I notice that some tpc-ds queries failing when trying to convert the static list into the set.

The error message is something like:

Error from DataFusion Internal error: The data type inlist should be same, the value type is Dictionary(Int32, Utf8), one of list expr type is Utf8.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker..

To Reproduce

Create in input schema with a Dictionary type and then try to convert the in_list expression should easily reproduce this one.

Expected behavior

When using in_list to create a InListExpr, it should handle value type being Dictionary Type.

Additional context

No response

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 20, 2024

@advancedxy It seems that you are mixing dict with utf8. I think it is not a valid case 😕

I think the problem is that the caller should do the type coercion to convert utf8 to dict(i32, utf8) beforehand. In datafusion, we have done it in the optimization step, when we reach in_list here, we can ensure the types are consistent, so we just go ahead without type checking.

@alamb It seems that if the datafusion user imports the physical-expr functions only, they may need to fight with type coercion.

In lance, they struggle with data types from SQL string to logical expr too.

https://github.com/lancedb/lance/blob/faed71b04cc37646962badd3f19bb2369c7267f8/rust/lance-datafusion/src/expr.rs#L42-L46

@advancedxy
Copy link
Contributor Author

I think the problem is that the caller should do the type coercion to convert utf8 to dict(i32, utf8) beforehand. In datafusion, we have done it in the optimization step, when we reach in_list here, we can ensure the types are consistent, so we just go ahead without type checking.

I think in the comet cases, the value type is a dict(i32, utf8) and the static list is all utf8 type. So we should convert the dict(i32, utf8) to utf8 instead? However, like you said, physical-expr is used directly in comet, it might not be possible to covert the value type beforehand.

@alamb
Copy link
Contributor

alamb commented Mar 21, 2024

@alamb It seems that if the datafusion user imports the physical-expr functions only, they may need to fight with type coercion.

There is an example of how to invoke the coercion (but it starts from a logical expr)
https://github.com/apache/arrow-datafusion/blob/c5c9d3f57f361c6c01d0cb01c416f6a7e9dfd906/datafusion-examples/examples/expr_api.rs#L252-L264

@alamb
Copy link
Contributor

alamb commented Mar 21, 2024

I think in the comet cases, the value type is a dict(i32, utf8) and the static list is all utf8 type. So we should convert the dict(i32, utf8) to utf8 instead? However, like you said, physical-expr is used directly in comet, it might not be possible to covert the value type beforehand.

I can think of two possibilities

Option 1 Easy to code, less performant

cast the argument from Dict(Int, Utf8) to Utf8 (using cast)

Option 2: More code, but more performant

Implement direct support for Dictionary types in the Inlist physical expression implementation

Somewhere in here:
https://docs.rs/datafusion-physical-expr/36.0.0/src/datafusion_physical_expr/expressions/in_list.rs.html#146-210

(this would also help us in InfluxData so I would be happy to help review / work on a PR -- though my bandwidth for the next week is pretty limited)

@advancedxy
Copy link
Contributor Author

I can submit a PR for option 2 in the next 2 weeks maybe if no one is working on this.

though my bandwidth for the next week is pretty limited

It's not urgent, I think we can take your time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants