-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EPIC] Decouple logical from physical types #12622
Comments
I would like to organise a call so we can discuss the plan of action of this epic (as proposed #12536 (review) and in the ASF slack). In the meantime I'll try to collect as many open questions as possible and outline a meeting document. cc @alamb , @findepi , @jayzhan211 , @ozankabak and anyone else interested in this effort. |
I suggest to propose the plan directly, since it requires thinking to response, maybe not that efficiently to work in a call synchronously. And it would be more open to random people in the community. |
We should make it a goal that physical planning also abstracts over physical representation of individual batches. We should also make it a goal that function are expressible directly on logical types. Otherwise we won't be able to use them in logical planning. While function invocation can continue to work on physical representation, it does not necessarily have to be function implementor's burden to cater for them. See #12635 |
While I support both ideas (one of them was also mentioned in the original proposal) I think that there is still some discussion to be had before making them goals of this effort (#11513 (comment), #11513 (comment), and #11513 (comment)). I suggest we first define a plan (I'm drafting it now) for decoupling logical types from physical types (including the issue with functions that you are mentioning) and in parallel we can continue to validate this two ideas. |
Thanks @notfilippo . I understand it's not your goal to remove Arrow types from physical plans. |
What follows is my idea of the plan we might want to take to tackle this issue. Status and goalsCurrent status
#11513 initial goal
"Record batches as the physical type source" goal
How do we get there?
Introducing logical type limitations internally while keeping inputs and outputs the same should be helpful in order not to introduce too many breaking changes at once. At a high level, we could keep the type sources unchanged and just gradually limit the internal behaviour of the engine in order to reason about logical types instead of physical. Towards a Logical LogicalPlanThe
|
This seems like a good first step to me (and also a massive project in itself)
I think it is wise to split this out into its own second goal (as it can scope down the first part). It is probably good to note that this means we won't have "adaptive schema" in DataFusion at runtime until the second part is complete
I think UDFs will need to retain physical information (at least in All in all this sounds like an epic project. I hope we have the people who are excited to help make it happen! |
I've opened #12793 in order to continue the effort according to the plan. cc @jayzhan211 @findepi |
I can work on the logical type that replace current function signature on |
@jayzhan211 -- I was planning on introducing the logical types in the following PR, as I already started working on it. Do you mind waiting for it and then using it to replace the function signature? |
Sure |
@notfilippo this (#12622 (comment)) is a very nice graphical representation of what we want to do. thank you
@notfilippo awesome! |
Can a committer merge #13016 on |
Since the logical-types branch can easily diverge from the main branch, even when the sub-tasks are incomplete, would it be better to merge it into the main branch frequently and continue evolving it as new ideas emerge? |
It would be great! I think we need to rebase the branch first then remove scalarvlaue utf8view and largeutf8 |
Sounds good! I've started updating this branch to the main branch on my fork. I still need to iron out some issues, but I hope to create a PR tomorrow. |
Merging the branch with the upstream is quite a project as there are many merge conflicts and a lot of incompatible code was created in the meantime (on the other hand, it's good to see so much progress on DataFusion). As I understand it, the new Unfortunately, using Even this may be acceptable if the resulting solution is stable for the foreseeable future when considering the possible impact of logical types. However, being unable to match scalar types and values will undoubtedly impact ergonomics for some use cases. Furthermore, if we ever conclude that we do not need the physical type (e.g., because we can derive the physical type from the schema), these breaking changes could have been avoided. Therefore, are we sure that we require the I also experimented with adding a new variant
Adopting this strategy would mean that we cannot release logical types until we can get rid of the Any thoughts on that? |
I think downstream project are not all forced to switch to
The following is equivalent ScalarValue::Utf8View
Scalar {
ScalarValue::Utf8(String),
DataType::Utf8View
} ScalarValue::LargeUtf8
Scalar {
ScalarValue::Utf8(String),
DataType::LargeUtf8
}
I agree, If we can find such approach it would be great. |
Sorry I wasn't clear on this. I meant that matching directly on let take_idx = match &args[2] {
ColumnarValue::Scalar(ScalarValue::Int64(Some(v))) if v < &2 => *v as usize,
_ => unreachable!(),
}; becomes let take_idx = match &args[2] {
ColumnarValue::Scalar(scalar) => match scalar.value() {
ScalarValue::Int64(Some(v)) if v < &2 => *v as usize,
_ => unreachable!(),
},
_ => unreachable!(),
} And this isn't a deal breaker for me as one may argue that distinguishing between scalar/non-scalar and scalar types are two pair of shoes and should be handled separately, but at least in some cases this will impact ergonomics (e.g., tests). Maybe this also isn't such a big deal if downstream projects are not using these types of patterns.
I also don't know a good solution and maybe EDIT: |
Pattern matching is not impossible match value {
ColumnarValue::Scalar(scalar) if matches!(scalar.value(), ScalarValue::Int64(_)) && scalar.as_i64() > &2 => {
}
ColumnarValue::Scalar(scalar) if matches!(scalar.value(), ScalarValue::Utf8(_)) && scalar.as_string().as_str() == "datafusion" => {
}
} pub fn as_i64_opt(&self) -> Option<&i64> {
match self.value() {
ScalarValue::Int64(v) => v.as_ref(),
_ => None,
}
}
pub fn as_i64(&self) -> &i64 {
self.as_i64_opt().unwrap()
}
pub fn as_string_opt(&self) -> Option<&String> {
match self.value() {
ScalarValue::Utf8(v) => v.as_ref(),
_ => None,
}
}
pub fn as_string(&self) -> &String {
self.as_string_opt().unwrap()
} |
I've been trying to tackle the next issues in task (e.g., removing
An example would be We can do one of two things in
The problem with both approaches is that, once a test case fails, all debugging happens at runtime. I've been wondering if we could improve this by shifting the burden to compile time by changing the API of Here are some thoughts:
I think creating these two restrictions would greatly reduce the problems with unexpected types at runtime in datafusion and downstream dependencies. We can also iterate with this process as we move more functionality from LMK what you think. cc @jayzhan211 |
I think getting
Is it possible to carry
Getting
I agree, this is incorrect since I think the approach you mentioned are only possible if |
What I meant is that we should not be able to directly create a
I think we need to change many parts of DataFusion to use a
I think there are many parts like this in DF and maybe we can only start removing
+1
Yes all of the above only applies to a scenario where we have
I don't think this particular approach works on main as we must have this distinction between "physical" and "logical" Scalar. Then once we have this distinction, we would need to migrate stuff like the However, I've been experimenting with a few other approaches that might work in practice on main. These approaches aren't fleshed out but maybe you find some of them interesting:
|
Having a true "logical" Scalar sounds like a good idea, since either the current |
I found LogicalScalar to be a largely duplicated concept with only minor differences from ScalarValue, mainly aimed at reducing complexity. However, the reduction is minimal—for instance, eliminating Furthermore, if we separate logical and physical types in LogicalPlan and PhysicalPlan, we would need to implement coercion and casting logic at both layers. Since this isn’t currently the case, adding such complexity doesn’t seem worthwhile. As I previously thought about here, it may be a better approach to retain the "physical-type"—or more precisely, the "decoding-type" (arrow::DataType)—within the logical layer. We can still use simplified types like NativeType when beneficial, as we already do in function signature.
To solve the the issue mentioned, having function like In summary, I think we should keep |
Casting yes (eg for constant folding)
This may make sense in current architecture and with current design. |
Thanks for having a look at the PR (#14617, mentioning here for reference) and your input on this approach @jayzhan211 ! I agree that adding this type of complexity simply for eliminating
Here are the benefits that I'd see in
Some benefits that could become relevant in the future and may be more difficult to implement on
However, these are probably subjective and some of them are certainly underexplored so I can only make a guess that they would result in a net benefit.
I think with
I think you're right that we can get some of these benefits by using |
This is probably not true. Scalar is a free constant unlike column that has DataType in the defined table.
A question I have is: When is The LogicalScalar is used in
I think we can achieve similar benefit with methods of Scalar like Even if we can convert to To avoid needing the DataType for ScalarValue, we can explore two possible solutions. The reason we need the DataType for ScalarValue is to ensure that the conversion to the correct ArrayRef for arrow::kernel execution happens seamlessly. Here are two ways to eliminate the need for DataType in this context:
Both of these approaches are similar to what is described in issue #12720. If we no longer require DataType for ScalarValue, then bringing LogicalScalar into DataFusion becomes much easier. |
I see your point. Its definitely worth exploring the direction of improving ergonomics with helper functions. I'll close #14617 for now so it doesn't clutter the PR list. It will still be there if we ever need a |
Which may involve a bit of computation, especially for large expressions like deep AND or OR trees (#12604)
It's not helpful if logical planning remains tied to arrow
This is chicken and egg problem. |
First, we need to remove DataType from type coercion in the logical layer. Type coercion should only occur between NativeType values at this stage, while DataType coercion should be deferred to the physical plan or runtime (#12720). Once this is done, we can introduce LogicalScalar and replace An open question remains: Should type coercion be handled in the physical optimizer, or should we implement kernels for types that share the same NativeType but have different DataType values? The best approach likely involves both—using kernels when optimizations are possible and falling back on type coercion when they are not. |
Did you mean physical planner?
I think we should start with what is simpler, i.e the current state, where PP is statically typed with DataType. |
For example, the existing type coercion coerce string to exact DataType::Utf8, so if we have We can change to: The type coercion in LP only take care about NativeType::String and we end up with BUT, we don't do coercion in LP with Int and String coercion which has different NativeType, so even i32 and i64 is not allowed. Does this makes sense to you?
we can have coerce in physical optimizer. For somewhere else, maybe 🤔 |
If LP only cares about "native types" (aka logical types), then it doesn't know & doesn't care whether later it will be Utf8 or Utf8View. So, for this issue to be complete, we need the LP to be "decoupled" from physical types. I.e not use arrow DataType at all. |
Is your feature request related to a problem or challenge?
This epic tracks an ordered list tasks related to the proposal: Decouple logical from physical types (#11513). The goal is:
Describe the solution you'd like
Make
ScalarValue
values logical:Scalar
type for ColumnarValue #12536 (merged inlogical-types
)logical-types
)ScalarValue::LargeUtf8
andScalarValue::Utf8View
in favour ofScalarValue::Utf8
ScalarValue::LargeBinary
andScalarValue::BinaryView
in favour ofScalarValue::Binary
ScalarValue::Dictionary
(from Remove ScalarValue::Dictionary #12488)The text was updated successfully, but these errors were encountered: