[EPIC] Decouple logical from physical types #12622

Open
3 of 6 tasks
notfilippo opened this issue Sep 25, 2024 · 36 comments
Labels
enhancement New feature or request

Comments

@notfilippo
Contributor

notfilippo commented Sep 25, 2024

Is your feature request related to a problem or challenge?

This epic tracks an ordered list of tasks related to the proposal: Decouple logical from physical types (#11513). The goal is:

Logical operators during logical planning should unquestionably not have access to the physical type information, which should be reserved exclusively for physical planning and physical execution.

LogicalPlans will use LogicalType while PhysicalPlans will use DataType.

Describe the solution you'd like

Make ScalarValue values logical:

@notfilippo notfilippo added the enhancement New feature or request label Sep 25, 2024
@notfilippo
Contributor Author

notfilippo commented Sep 26, 2024

I would like to organise a call so we can discuss the plan of action for this epic (as proposed in #12536 (review) and in the ASF Slack). In the meantime I'll try to collect as many open questions as possible and outline a meeting document.

cc @alamb , @findepi , @jayzhan211 , @ozankabak and anyone else interested in this effort.

@jayzhan211
Contributor

jayzhan211 commented Sep 26, 2024

I suggest proposing the plan directly in writing: responding requires thought, so working synchronously in a call may not be that efficient. A written proposal would also be more open to anyone in the community.

@findepi
Member

findepi commented Sep 26, 2024

Logical operators during logical planning should unquestionably not have access to the physical type information, which should be reserved exclusively for physical planning and physical execution.

We should make it a goal that physical planning also abstracts over physical representation of individual batches.
see #11513 (comment)
and also @andygrove 's feedback from Comet perspective #11513 (comment)

We should also make it a goal that functions are expressible directly on logical types. Otherwise we won't be able to use them in logical planning. While function invocation can continue to work on physical representations, it does not necessarily have to be the function implementor's burden to cater for them. See #12635

@notfilippo
Contributor Author

notfilippo commented Sep 26, 2024

We should make it a goal that physical planning also abstracts over physical representation of individual batches.
We should also make it a goal that function are expressible directly on logical types.

While I support both ideas (one of them was also mentioned in the original proposal) I think that there is still some discussion to be had before making them goals of this effort (#11513 (comment), #11513 (comment), and #11513 (comment)).

I suggest we first define a plan (I'm drafting it now) for decoupling logical types from physical types (including the issue with functions that you are mentioning) and in parallel we can continue to validate these two ideas.

@findepi
Member

findepi commented Sep 27, 2024

Thanks @notfilippo . I understand it's not your goal to remove Arrow types from physical plans.
It looks like we have same discussion in two places now. I won't copy my reply from #11513, it's here for reference: #11513 (comment)
TL;DR -- the goals impact the design; incremental improvements lead to a local optimum.

@notfilippo
Contributor Author

notfilippo commented Sep 30, 2024

What follows is my idea of the plan we might want to take to tackle this issue.


Status and goals

Current status


  • Currently, DataFusion uses DFSchema (a qualified arrow::Schema) as the type source for both its logical and physical plans.
  • During execution, a physical plan gets interpreted against physical data, in the form of record batches which (if everything is correct) carry the same schema the plan expects as input.

#11513 initial goal


  • DataFusion will use a more generic schema as input for its logical plan, containing only logical type information.
  • Physical planning will remain unchanged, requiring instead the complete schema with the physical type information.
  • Execution will remain unchanged.

"Record batches as the physical type source" goal


  • DataFusion will only use logical type information for both logical and physical planning.
  • During execution, the physical type information will be sourced from the record batches as the single source of truth.

How do we get there?

a.k.a. changing the tires of a moving vehicle.


Introducing logical type limitations internally while keeping inputs and outputs the same should be helpful in order not to introduce too many breaking changes at once. At a high level, we could keep the type sources unchanged and just gradually limit the internal behaviour of the engine in order to reason about logical types instead of physical.

Towards a Logical LogicalPlan

The Scalar type


Currently the LogicalPlan uses physical type information in the tree. This information comes from table scans and scalar values. We could initially look into modifying the scalar value so that it represents fewer variants of types (from Utf8, LargeUtf8, Utf8View, Dictionary(Utf8) and more, down to just Utf8). But the physical type information that the current system requires still needs to be carried along temporarily.


This would require the introduction of a scalar type which would track the physical type of the scalar in the current logical plan, wrapping the "logical" value of the scalar.
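As a rough sketch (field names are assumed here, matching snippets used later in this thread; not a final API), such a wrapper could look like:

use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

// The "logical" value plus the physical type it must materialize as.
struct Scalar {
    value: ScalarValue,
    data_type: DataType,
}

impl Scalar {
    // The physical type lives on the wrapper, not on the value itself.
    fn data_type(&self) -> &DataType {
        &self.data_type
    }
}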


Casting to logically equivalent types should become straightforward.

This effort is tracked in this epic under the "Make ScalarValue values logical" section.

Introduce the logical type

Introduce LogicalType and the "logically equivalent" notion. ScalarValue will then no longer have a data_type() method; instead it will return its LogicalType, and data_type() will be defined on Scalar. There is currently some discussion in #11513 around how it should be represented.
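For illustration only (the variant set below is hypothetical, and its exact shape is what #11513 was still debating), the "logically equivalent" mapping could collapse several physical types into one logical variant:

use arrow::datatypes::DataType;

#[derive(Debug, PartialEq)]
enum LogicalType {
    Boolean,
    Int64,
    Utf8,
    // ...
}

// All logically-equivalent string encodings collapse into LogicalType::Utf8.
fn logical_type_of(dt: &DataType) -> Option<LogicalType> {
    match dt {
        DataType::Boolean => Some(LogicalType::Boolean),
        DataType::Int64 => Some(LogicalType::Int64),
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View => Some(LogicalType::Utf8),
        DataType::Dictionary(_, inner) => logical_type_of(inner),
        _ => None,
    }
}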

Logical expressions


It's time to make Expr logical. While type sources like Scalar and TableScan's columns will still provide physical types, internally they will interact via logical type comparisons.

What about functions?


UDFs should be able to work in the logical type system. A user defined function should be able to (a sketch follows the list):

  • Decide, given a signature, whether some logical type can fulfill the requirements
  • Identify, given some logical inputs, a logical return type

Discussion is happening here: #12635
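A hypothetical trait surface for these two capabilities (method names are illustrative, not the API proposed in #12635; LogicalType is the illustrative enum sketched earlier):

trait LogicalUdf {
    // Decide, given a signature, whether the logical argument types
    // fulfill the requirements.
    fn accepts(&self, arg_types: &[LogicalType]) -> bool;

    // Identify, given logical inputs, the logical return type.
    fn return_type(&self, arg_types: &[LogicalType]) -> Option<LogicalType>;
}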

Logical LogicalPlan


Once expressions are logical we can move type sources to the logical plane or translate their type information from physical to logical upstream (like a TableProvider schema() (physical) call being immediately translated to logical).
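A minimal sketch of that boundary translation, reusing the illustrative logical_type_of() from above (an assumed helper, not an existing API):

use arrow::datatypes::Schema;

// Translate a TableProvider's physical schema into (name, logical type)
// pairs as soon as it enters logical planning.
fn to_logical_fields(schema: &Schema) -> Vec<(String, Option<LogicalType>)> {
    schema
        .fields()
        .iter()
        .map(|f| (f.name().clone(), logical_type_of(f.data_type())))
        .collect()
}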


👷 WIP "Record batches as the physical type source" goal

cc @alamb, @jayzhan211, @findepi lmk if this plan seems logical 😄

@alamb
Contributor

alamb commented Oct 1, 2024

#11513 initial goal

DataFusion will use a more generic schema as input for its logical plan, containing only logical type information.
Physical planning will remain unchanged, requiring instead the complete schema with the physical type information.
Execution will remain unchanged.

This seems like a good first step to me (and also a massive project in itself)

"Record batches as the physical type source" goal

I think it is wise to split this out into its own second goal (as it can scope down the first part). It is probably good to note that this means we won't have "adaptive schema" in DataFusion at runtime until the second part is complete

UDFs should be able to work in the logical type system. A user defined function should be able to:

I think UDFs will need to retain physical information (at least in invoke() which manipulates the data directly)

All in all this sounds like an epic project. I hope we have the people who are excited to help make it happen!

@notfilippo
Contributor Author

notfilippo commented Oct 7, 2024

I've opened #12793 in order to continue the effort according to the plan. cc @jayzhan211 @findepi

@jayzhan211
Contributor

jayzhan211 commented Oct 8, 2024

I can work on the logical type that replaces the current function signature on main. #12751 (comment)

@notfilippo
Contributor Author

I can work on the logical type that replaces the current function signature on main. #12751 (comment)

@jayzhan211 -- I was planning on introducing the logical types in the following PR, as I already started working on it. Do you mind waiting for it and then using it to replace the function signature?

@jayzhan211
Contributor

I can work on the logical type that replaces the current function signature on main. #12751 (comment)

@jayzhan211 -- I was planning on introducing the logical types in the following PR, as I already started working on it. Do you mind waiting for it and then using it to replace the function signature?

Sure

@findepi
Member

findepi commented Oct 9, 2024

What follows is my idea of the plan we might want to take to tackle this issue.

@notfilippo this (#12622 (comment)) is a very nice graphical representation of what we want to do. thank you

  • I was planning on introducing the logical types

@notfilippo awesome!
I was hoping we get logical types sooner rather than later, even if nothing uses them initially.
simple functions #12635 is currently blocked on this

@notfilippo
Contributor Author

I was hoping we get logical types sooner rather than later, even if nothing uses them initially. Simple functions #12635 is currently blocked on this

I've opened #12853 targeting main.

@notfilippo
Contributor Author

Can a committer merge #13016 on logical-types? cc @alamb @jayzhan211

@jayzhan211
Contributor

Since the logical-types branch can easily diverge from the main branch, even when the sub-tasks are incomplete, would it be better to merge it into the main branch frequently and continue evolving it as new ideas emerge?

@tobixdev
Contributor

I'd like to help out with this (see this comment for some context).

Depending on how we proceed, I can help out with rebasing this branch or by trying to remove ScalarValue::LargeUtf8 etc. (hopefully with some inspiration from #11978).
Just lmk how you want to proceed.

@jayzhan211
Contributor

I'd like to help out with this (see this comment for some context).

Depending on how we proceed, I can help out with rebasing this branch or by trying to remove ScalarValue::LargeUtf8 etc. (hopefully with some inspiration from #11978).
Just lmk how you want to proceed.

It would be great! I think we need to rebase the branch first, then remove the ScalarValue::Utf8View and ScalarValue::LargeUtf8 variants.

@tobixdev
Contributor

Sounds good! I've started updating this branch against the main branch on my fork.
There were a bunch of merge conflicts, so hopefully I didn't mess anything up while resolving them.

I still need to iron out some issues, but I hope to create a PR tomorrow.

@tobixdev
Contributor

tobixdev commented Jan 17, 2025

Merging the branch with the upstream is quite a project as there are many merge conflicts and a lot of incompatible code was created in the meantime (on the other hand, it's good to see so much progress on DataFusion).
Before investing more time in this, I'd like to start another discussion about whether this is the solution we are going for (a question that arose from adapting the new code from main).

As I understand it, the new Scalar struct is only necessary for carrying a physical data type for a logical value so that we can support operations like arrow_cast.

Unfortunately, using Scalar in ColumnarValue and Expr breaks all patterns that try to match the scalar for these types.
While we can fix this in the DataFusion code base (we need a nested match or something similar in many cases), all downstream dependencies must rewrite their pattern-matching code, which can be (I assume) a huge undertaking for many projects.
Here is an example from the current version of logical-types.

Even this may be acceptable if the resulting solution is stable for the foreseeable future when considering the possible impact of logical types. However, being unable to match scalar types and values will undoubtedly impact ergonomics for some use cases. Furthermore, if we ever conclude that we do not need the physical type (e.g., because we can derive the physical type from the schema), these breaking changes could have been avoided.

Therefore, are we sure that we require the Scalar type? If yes, I can proceed with updating the PR. If not, maybe we should try to infer the necessary physical types from the schema to avoid this large breaking change.

I also experimented with adding a new variant WithPhysicalType to ScalarValue. This works great regarding breaking changes, as most match statements only consider the set of "supported" scalars. Unfortunately, it allows users to miss "logically equivalent" values with a different physical type when they only match against the regular variants, so I also think this is not an adequate solution (e.g., ScalarValue::WithPhysicalType(Box::new(ScalarValue::Utf8("abc")), DataType::Utf8View) will not be matched by ScalarValue::Utf8("abc")). However, it may be a less invasive transitional vehicle to carry the physical data type until we can infer the data type from the schema (and update optimizations etc.) to respect this.
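To illustrate the miss with this hypothetical WithPhysicalType variant:

// This arm only matches the plain variant; a logically equal value wrapped
// in the hypothetical WithPhysicalType variant falls through to the catch-all.
match scalar {
    ScalarValue::Utf8(Some(s)) => println!("got string {s}"),
    // ScalarValue::WithPhysicalType(inner, _) would need its own arm,
    // plus a recursive match on `inner`, to be handled correctly.
    _ => println!("missed: possibly a wrapped string"),
}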

Since the logical-types branch can easily diverge from the main branch, even when the sub-tasks are incomplete, would it be better to merge it into the main branch frequently and continue evolving it as new ideas emerge?

Adopting this strategy would mean that we cannot release logical types until we can get rid of the WithPhysicalType variant, which is unfortunate. Maybe we could even release this to downstream dependencies with adequate documentation, and perhaps mark the variant as deprecated? However, not breaking the pattern-matching capabilities of ColumnarValue and Expr may be worth the effort, and syncing with the main branch should be more straightforward since we also would not have that many breaking changes.

Any thoughts on that?

cc @notfilippo @findepi @alamb @jayzhan211

@jayzhan211
Contributor

jayzhan211 commented Jan 18, 2025

Unfortunately, using Scalar in ColumnarValue and Expr breaks all patterns that try to match the scalar for these types.
While we can fix this in the DataFusion code base (we need a nested match or something similar in many cases), all downstream dependencies must rewrite their pattern-matching code, which can be (I assume) a huge undertaking for many projects.
Here is an example from the current version of logical-types.

I think downstream projects are not all forced to switch to Scalar; they can keep their ScalarValue matching code, except for the Utf8View and LargeUtf8 matching cases.

Scalar is introduced to separate the concept of arrow's DataType with the value itself.

The following are equivalent:

ScalarValue::Utf8View

Scalar {
    value: ScalarValue::Utf8(String),
    data_type: DataType::Utf8View,
}

ScalarValue::LargeUtf8

Scalar {
    value: ScalarValue::Utf8(String),
    data_type: DataType::LargeUtf8,
}

Furthermore, if we ever conclude that we do not need the physical type (e.g., because we can derive the physical type from the schema), these breaking changes could have been avoided

I agree, ScalarValue::Utf8(String) is probably enough, but given the current status of the code, jumping directly to that state might be a challenge, because we might not have a schema telling us which physical type the string is in any kind of type-matching code 🤔

If we can find such an approach it would be great. Scalar is just a practical approach that we can iterate on. There might be a better solution toward the goal, but I have no such idea in my mind.

@tobixdev
Contributor

tobixdev commented Jan 18, 2025

I think downstream projects are not all forced to switch to Scalar; they can keep their ScalarValue matching code, except for the Utf8View and LargeUtf8 matching cases

Scalar is introduced to separate the concept of arrow's DataType with the value itself.

Sorry I wasn't clear on this. I meant that matching directly on ColumnarValue is not possible anymore. Basically, you have to create nested match statements and call .value() on the Scalar.

let take_idx = match &args[2] {
    ColumnarValue::Scalar(ScalarValue::Int64(Some(v))) if v < &2 => *v as usize,
    _ => unreachable!(),
};

becomes

let take_idx = match &args[2] {
    ColumnarValue::Scalar(scalar) => match scalar.value() {
        ScalarValue::Int64(Some(v)) if v < &2 => *v as usize,
        _ => unreachable!(),
    },
    _ => unreachable!(),
};

And this isn't a deal breaker for me, as one may argue that distinguishing between scalar/non-scalar and matching on scalar types are two separate concerns and should be handled separately, but at least in some cases this will impact ergonomics (e.g., tests). Maybe this also isn't such a big deal if downstream projects are not using these types of patterns.

If we can find such approach it would be great. Scalar is just a practical approach that we can iterate on it with. There might be a better solution toward the goal, but I have no such idea in my mind.

I also don't know a good solution and maybe Scalar is the best way to achieve this. But maybe we are also approaching this wrong and we might just require some tinkering at a different spot such that removing ScalarValue::LargeUtf8 and ScalarValue::Utf8View just becomes removing the enum variants. Sorry if I am rehashing some conversations here that you already had earlier. Just want to make sure that we are on the right track. :)

EDIT:
So I've given it some more thought and after experimenting a bit I think I am sold on using Scalar like it was intended. So if no one suggests an alternative I'll be working on getting the branch up-to-date.

@jayzhan211
Contributor

Pattern matching is not impossible:

match value {
    ColumnarValue::Scalar(scalar)
        if matches!(scalar.value(), ScalarValue::Int64(_)) && scalar.as_i64() > &2 => {
        // ...
    }
    ColumnarValue::Scalar(scalar)
        if matches!(scalar.value(), ScalarValue::Utf8(_)) && scalar.as_string().as_str() == "datafusion" => {
        // ...
    }
    _ => {}
}

pub fn as_i64_opt(&self) -> Option<&i64> {
    match self.value() {
        ScalarValue::Int64(v) => v.as_ref(),
        _ => None,
    }
}

pub fn as_i64(&self) -> &i64 {
    self.as_i64_opt().unwrap()
}

pub fn as_string_opt(&self) -> Option<&String> {
    match self.value() {
        ScalarValue::Utf8(v) => v.as_ref(),
        _ => None,
    }
}

pub fn as_string(&self) -> &String {
    self.as_string_opt().unwrap()
}

@tobixdev
Contributor

tobixdev commented Feb 5, 2025

I've been trying to tackle the next issues in the task list (e.g., removing Utf8View and Dictionary). However, removing them introduces subtle bugs (e.g., trying to compare a dictionary array with a string array causing an error) that basically result from the pattern:

  1. Convert a physical object (e.g., ArrowArray) or a DataType to a ScalarValue
  2. Convert this ScalarValue into a physical object (or call scalar_value.data_type())

An example would be ScalarValue::try_from(DataType::Utf8View)?.to_array()

We can do one of two things in TryFrom<DataType> for ScalarValue once Utf8View has been removed:

  • Return an error for DataType::Utf8View.
  • Return a ScalarValue::Utf8

The problem with both approaches is that, once a test case fails, all debugging happens at runtime. I've been wondering if we could improve this by shifting the burden to compile time by changing the API of Scalar and ScalarValue before removing the problematic ScalarValue variants.

Here are some thoughts:

  • There should be no function for creating a ScalarValue from a physical object or DataType. If this restriction is not imposed, it is easy to accidentally drop the physical type once we start removing variants. We would move this functionality to Scalar. One can still obtain a ScalarValue by calling Scalar::try_from(DataType::Utf8View).into_value(). However, the potential loss of the physical type is now explicit in the call to into_value().
  • There should be no method for creating a physical object from a ScalarValue without a DataType. If this restriction is not imposed, it is easy to create physical objects with an unexpected type. Methods like to_array would get a new DataType parameter that helps avoid creating physical objects with an unexpected type. Methods on Scalar would not require this parameter, as the data type is stored explicitly. Another approach would be to rename the current versions of to_array in ScalarValue to to_canonical_array (or something similar) and only have to_array in Scalar. This should prevent things like Scalar { value: ScalarValue::Utf8("test"), data_type: DataType::Utf8View }.into_value().to_array() (of course in a more subtle scenario).

I think creating these two restrictions would greatly reduce the problems with unexpected types at runtime in datafusion and downstream dependencies. We can also iterate with this process as we move more functionality from ScalarValue to Scalar and hopefully removing variants from ScalarValue will become easier after adapting the API. No functionality would get lost, we would just distribute it between Scalar and ScalarValue before removing the variants.
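A sketch of the two restrictions, assuming the Scalar wrapper discussed earlier in this thread (method names follow this comment; bodies elided):

use arrow::array::ArrayRef;
use arrow::datatypes::DataType;
use datafusion_common::{Result, ScalarValue};

impl Scalar {
    // Restriction 1: dropping down to the logical value is explicit,
    // so losing the physical type is visible at the call site.
    fn into_value(self) -> ScalarValue {
        self.value
    }

    // Materialization lives where the DataType lives.
    fn to_array(&self) -> Result<ArrayRef> {
        todo!("build an array of self.data_type from self.value")
    }
}

// Restriction 2: ScalarValue alone cannot produce a physical object;
// an explicit target type is always required (hypothetical free function).
fn scalar_value_to_array(value: &ScalarValue, data_type: &DataType) -> Result<ArrayRef> {
    todo!()
}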

LMK what you think.

cc @jayzhan211

@jayzhan211
Contributor

jayzhan211 commented Feb 6, 2025

There should be no function for creating a ScalarValue from a physical object or DataType.

I think getting Scalar from DataType or ArrayRef makes sense, so this restriction doesn't seem sound to me.

Scalar::try_from(DataType::Utf8View).into_value() loses the DataType, so it should only be used if we don't care about the DataType.

The problem with both approaches is that, once a test case fails, all debugging happens at runtime

Is it possible to carry Scalar along where we need to know the DataType? If not, what might be the case? I guess we need get_value() instead of into_value()?

There should be no method for creating a physical object from a ScalarValue without a DataType

This makes sense to me.

Getting an Array from a ScalarValue is valid today since the physical type is tightly coupled to ScalarValue. When we switch to Scalar, ScalarValue will no longer contain the physical type, so we can only transform a Scalar, which contains a DataType, into an Array.

Scalar{ value: ScalarValue::Utf8("test"), data_type: DataType::Utf8View}.into_value().to_array()

I agree, this is incorrect since the result of into_value() no longer carries a DataType, so we should only be able to call to_array() on Scalar.

I think the approaches you mentioned are only possible if Scalar exists? Do you have any idea that could work on the main branch?

@tobixdev
Contributor

tobixdev commented Feb 6, 2025

I think getting Scalar from DataType or ArrayRef makes sense, so the restriction of this doesn't seem sound to me.

What I meant is that we should not be able to directly create a ScalarValue from these things as it is easy to accidentally drop the physical type. For example, ScalarValue::from(DataType::Utf8View) should not be possible but Scalar::from(DataType::Utf8View) should be possible. As you said, one can call into_value() if you don't care about the physical type.

Is it possible to carry Scalar along where we need to know the DataType? If not, what might be the case? I guess we need get_value() instead of into_value()?

I think we need to change many parts of DataFusion to use a Scalar instead of a ScalarValue if we go with this approach. For example, in Accumulator (expr-common/src/accumulator.rs):

  • Accumulator::evaluate will need to return a Scalar instead of a ScalarValue. Otherwise aggregates cannot evaluate to physical objects.
  • This begs the question whether Accumulator::state should return Result<Vec<Scalar>> instead of Result<Vec<ScalarValue>> (I think so; see the sketch below)
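An abbreviated sketch of the migrated trait (only the methods discussed above; the real Accumulator trait has more, and Scalar is the wrapper sketched earlier):

use arrow::array::ArrayRef;
use datafusion_common::Result;

trait Accumulator {
    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()>;
    // Was Result<ScalarValue>: a bare logical value could not say which
    // physical array type the aggregate result must materialize as.
    fn evaluate(&mut self) -> Result<Scalar>;
    // Was Result<Vec<ScalarValue>>, for the same reason.
    fn state(&mut self) -> Result<Vec<Scalar>>;
}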

I think there are many parts like this in DF, and maybe we can only start removing ScalarValue variants once these parts have been migrated to use Scalar. Otherwise, we might unintentionally lose type information.

Getting Array from ScalarValue is valid since physical type is tightly coupled by ScalarValue. When we switch to Scalar, ScalarValue no longer contains physical type, so we can only transform Scalar which contains DataType to Array.

+1

I think the approaches you mentioned are only possible if Scalar exists?

Yes, all of the above only applies to a scenario where we have Scalar for the "physical scalar" and ScalarValue for the logical value.

Do you have any idea that is possible to work on main branch?

I don't think this particular approach works on main as we must have this distinction between "physical" and "logical" Scalar. Then once we have this distinction, we would need to migrate stuff like the Accumulator example to Scalar. Once that has been done we can start removing the physical types from ScalarValue.

However, I've been experimenting with a few other approaches that might work in practice on main. These approaches aren't fleshed out but maybe you find some of them interesting:

ColumnarValue is not necessarily a Physical Object

Some experiments for applying this strategy when removing ScalarValue::Utf8View and ScalarValue::LargeUtf8 (tests not passing): https://github.com/tobixdev/datafusion/compare/main...tobixdev:datafusion:alternative-logical-types?expand=1

A ColumnarValue is either a physical object (ArrayRef) or a logical object (ScalarValue). We can remove variants from ScalarValue because it is a logical value and must not have a DataType. However, somehow we must then retain the type information to support things like casting to arrow types.

One approach to solve this issue is to use PhysicalExpr as a point where we always have a DataType. Within a PhysicalExpr we would have the ability to determine the physical type of a ScalarValue and ColumnarValue. Outside of a PhysicalExpr we have no physical type for a ScalarValue, and we can obtain the physical type of a ColumnarValue only if it is an ArrayRef.

Some notable changes in the diff that you can look at:

  • datafusion/expr/src/udf.rs: Add physical argument types to ScalarFunctionArgs (a rough sketch follows the notes below)
  • datafusion/functions/src/string/bit_length.rs: An example that can use the return_type for physical type information
  • datafusion/functions/src/core/arrowtypeof.rs: An example that uses the physical types of the arguments

Some notes on this approach:

  • Once we remove variants from ScalarValue we should still alter its API to always take a DataType when creating physical objects, as mentioned above. The API for creating ScalarValues should probably be altered too to avoid confusion (e.g., ScalarValue::from(DataType::Utf8View)). However, losing the type information is not as critical as in the Scalar approach, as the PhysicalExpr is responsible for retaining it.
  • It shouldn't break that much existing code.
  • We would also have to adjust the API of ColumnarValue as, for example, data_type() can only return an Option<DataType>.
  • ColumnarValue being something between "physical" and "logical" seems a bit off.
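For reference, a hedged sketch of the shape of that experiment (field names assumed; see the diff above for the real code):

use arrow::array::ArrayRef;
use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

// A scalar is logical; an array is physical and carries its own DataType.
enum ColumnarValue {
    Array(ArrayRef),
    Scalar(ScalarValue),
}

// The PhysicalExpr layer supplies the physical types, so a logical scalar
// argument can still be materialized with the expected encoding.
struct ScalarFunctionArgs {
    args: Vec<ColumnarValue>,
    arg_types: Vec<DataType>,
    return_type: DataType,
}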

Logical Overlays over ScalarValue

In this approach ScalarValue (and ColumnarValue) would still be a physical object. Based on this we introduce an API that can extract the logical values from the physical values. Basically, this is a generalization of try_as_str from @alamb.

One way of implementing this approach would be to have an enum that mirrors NativeType (maybe LogicalScalar?) and adds a value to it. This enum could also include a variant for extension types. Then we would have one scalar type that mirrors the physical values and one that mirrors the logical values. I think having this as an enum is good, as it then supports pattern matching (contrary to just calling try_as_...).

One possible API to obtain this could be:

let scalar_value = ...;
match scalar_value.into_logical_scalar() {
    LogicalScalar::Null => ...,
    LogicalScalar::Boolean(false) => ...,
    LogicalScalar::String(str) => ...,
}
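The enum itself could mirror NativeType with attached values, roughly (the variant set here is illustrative only):

use std::collections::HashMap;

// Mirrors NativeType but carries the value; could also grow a variant
// for extension types.
enum LogicalScalar {
    Null,
    Boolean(bool),
    Int64(i64),
    String(String),
    // Well-known Rust types instead of arrow arrays:
    Struct(HashMap<String, LogicalScalar>),
}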

Furthermore, I would also like a generalized version of try_as_str like try_as_logical<LogicalString>(). LogicalString may be similar to ArrowPrimitiveType (e.g., UInt64Type). The idea is just to obtain a logical value (associated type of LogicalString?) from a physical one which could also be implemented in a trait to allow for extension types.

I cannot really say much about the benefits and drawbacks of this approach as I haven't experimented in this direction. However, I find it interesting as it retains compatibility with all approaches that assume ScalarValue to be a physical value. Users basically would opt into better ergonomics (not handling Utf8View and Option<String>) with logical values.

I'd like to experiment with the latter solution a bit. I'd also love to hear your opinions and ideas.

@jayzhan211
Contributor

Having a true "logical" scalar sounds like a good idea, since both the current ScalarValue and Scalar are still tightly coupled with DataType. If we can bring LogicalScalar into LogicalPlan and transform it to ScalarValue only when we need to, maybe we could reduce the usage of ScalarValue we have now.

@findepi findepi marked this as a duplicate of #11513 Feb 10, 2025
@tobixdev tobixdev mentioned this issue Feb 11, 2025
@jayzhan211
Contributor

jayzhan211 commented Feb 16, 2025

@tobixdev @notfilippo

I found LogicalScalar to be a largely duplicated concept with only minor differences from ScalarValue, mainly aimed at reducing complexity. However, the reduction is minimal—for instance, eliminating try_as_str alone doesn’t seem to justify introducing LogicalScalar.

Furthermore, if we separate logical and physical types in LogicalPlan and PhysicalPlan, we would need to implement coercion and casting logic at both layers. Since this isn’t currently the case, adding such complexity doesn’t seem worthwhile.

As I previously thought about here, it may be a better approach to retain the "physical-type"—or more precisely, the "decoding-type" (arrow::DataType)—within the logical layer. We can still use simplified types like NativeType when beneficial, as we already do in function signatures.

Currently the type model in DataFusion, in both logical and physical plans, relies purely on arrow's DataType enum. But whilst this choice makes sense for its physical execution engine (DataTypes map 1:1 with the physical array representation, defining implicit rules for storage, access, and retrieval), it has flaws when dealing with its logical plans. This is due to the fact that some logically equivalent array types in the Arrow format have very different DataTypes – for example a logical string array can be represented in Arrow arrays of DataType Utf8, Dictionary(Utf8), RunEndEncoded(Utf8), and StringView (without mentioning the different index types that can be specified for dictionaries and REE arrays).

To solve the issue mentioned, having functions like try_as_str or NativeType makes more sense to me now. We still need to deal with different-but-similar DataTypes in function bodies, but that seems reasonable to me given that a function's invoke body can have its own optimized version for different decoding types.
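For example, with the existing try_as_str helper (assuming its Option<Option<&str>> shape on current main), one arm covers Utf8, LargeUtf8, and Utf8View alike:

use datafusion_common::ScalarValue;

fn describe(scalar: &ScalarValue) -> String {
    match scalar.try_as_str() {
        Some(Some(s)) => format!("string: {s}"),
        Some(None) => "null string".to_string(),
        None => "not a string scalar".to_string(),
    }
}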

In summary, I think we should keep ScalarValue as it is. And utilize NativeType or TypeClass (similar to TypeSignatureClass but not for Signature only).

@findepi
Copy link
Member

findepi commented Feb 17, 2025

Furthermore, if we separate logical and physical types in LogicalPlan and PhysicalPlan, we would need to implement coercion and casting logic at both layers

Casting yes (e.g. for constant folding).
Coercion no -- "coercion", at least as I understand it, is a SQL language feature, applied by the analyzer (or query builder, #14618), and it should not exist once we have a fully resolved plan such as a physical plan (#12723).

As I previously thought about here, it may be a better approach to retain the "physical-type"—or more precisely, the "decoding-type" (arrow::DataType)—within the logical layer.

This may make sense in current architecture and with current design.
@jayzhan211 wdyt about #12720? This would make DF processing more flexible and more akin to other systems (like e.g. Trino, Velox or Snowflake FWICT).

@tobixdev
Contributor

Thanks for having a look at the PR (#14617, mentioning here for reference) and your input on this approach @jayzhan211 !

I agree that adding this type of complexity simply for eliminating try_as_str etc. is not worthwhile. However, I think that an approach with LogicalScalar can provide benefits in the long run where something similar to PhysicalScalar(DataType, LogicalScalar) replaces ScalarValue. Basically the result of this transition would be equal to Scalar and ScalarValue in the original approach but with a different migration process (not adapting ScalarValue, but replacing it step by step with either LogicalScalar or PhysicalScalar).

I found LogicalScalar to be a largely duplicated concept with only minor differences from ScalarValue, mainly aimed at reducing complexity. However, the reduction is minimal—for instance, eliminating try_as_str alone doesn’t seem to justify introducing LogicalScalar.

Here are the benefits that I'd see in LogicalScalar (I have not checked whether these are implementable):

  • Removing the Option from variants (e.g., ScalarValue::UInt8(None)) that we may not need in a logical setting (see the example in WIP: Add LogicalScalar #14617, where the None cases can be removed). I am relying here on being able to extract the necessary type information from the schema or record batches.
  • Removing variants that only carry physical type information (e.g., ScalarValue::LargeUtf8 and ScalarValue::Dictionary)
  • Work with well-known rust types instead of arrow arrays (e.g., HashMap instead of StructArray in ScalarValue::Struct) for scalars. Maybe this can make implementing UDFs easier.

Some benefits that could become relevant in the future and may be more difficult to implement on ScalarValue:

  • Implementing Scalars of extension types by implementing a trait. For example, this could allow users to use rust enums for modelling scalars of arrow unions.

However, these are probably subjective and some of them are certainly underexplored so I can only make a guess that they would result in a net benefit.

Casting yes (eg for constant folding)

I think with LogicalScalar and PhysicalScalar most of the regular casting logic would only be required in LogicalScalar, while casting to arrow arrays would be part of PhysicalScalar.

In summary, I think we should keep ScalarValue as it is. And utilize NativeType or TypeClass (similar to TypeSignatureClass but not for Signature only).

I think you're right that we can get some of these benefits by using NativeType. If we believe that LogicalScalar is a step in the wrong direction I can close #14617 and we can explore different avenues.

@jayzhan211
Contributor

jayzhan211 commented Feb 17, 2025

I am relying here that we can extract the necessary type information from the schema or record batches

This is probably not true. A scalar is a free constant, unlike a column, which has a DataType defined in the table.

However, these are probably subjective and some of them are certainly underexplored so I can only make a guess that they would result in a net benefit.

A question I have is: When is LogicalScalar actually helpful?

The LogicalScalar is used in Expr::Literal and is later converted into a ScalarValue during physical plan execution. However, it can't be used directly in ColumnarValue::Scalar or PhysicalExpr::Literal. Converting it to ScalarValue adds a slight overhead, but for most types, LogicalScalar and ScalarValue are nearly identical. The main differences arise with Dictionary, the Utf8 variants, or possibly REE in the future, making the conversion seem unnecessary. So, why not just use ScalarValue from the start and create a utility function to simplify it?

  • Removing the Option from variants (e.g., ScalarValue::UInt8(None)) that we may not need in a logical setting (see the example in WIP: Add LogicalScalar #14617, where the None cases can be removed). I am relying here on being able to extract the necessary type information from the schema or record batches.
  • Removing variants that only carry physical type information (e.g., ScalarValue::LargeUtf8 and ScalarValue::Dictionary)
  • Work with well-known rust types instead of arrow arrays (e.g., HashMap instead of StructArray in ScalarValue::Struct) for scalars. Maybe this can make implementing UDFs easier.

I think we can achieve similar benefits with methods like ScalarValue::from() or ScalarValue::new_struct_from_hash_map.
Btw, HashMap to ScalarValue::Struct seems interesting, maybe we can add such a method

Even if we can convert to ScalarValue, since we don't have the DataType, LogicalScalar::String can only be converted to ScalarValue::Utf8, not ScalarValue::Utf8View or ScalarValue::Dictionary(_, Utf8). Given that type coercion resolves the DataType of the Expr, not only the LogicalType, it is problematic if we can't convert the scalar string to the specific StringViewArray (problematic because we can't run a kernel with a StringArray and a StringViewArray).

To avoid needing the DataType for ScalarValue, we can explore two possible solutions. The reason we need the DataType for ScalarValue is to ensure that the conversion to the correct ArrayRef for arrow::kernel execution happens seamlessly.

Here are two ways to eliminate the need for DataType in this context:

  1. Support kernel functions for both Array and Scalar. For example, we could have comparisons like comparison(StringViewArray, rust::String) or arithmetic operations like add(Dictionary(k, I32Array), i64).

  2. Introduce a casting mechanism in the physical layer before calling the kernel function.

Both of these approaches are similar to what is described in issue #12720.

If we no longer require DataType for ScalarValue, then bringing LogicalScalar into DataFusion becomes much easier.
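A minimal sketch of approach 2, assuming arrow's cast kernel supports the target encoding: the scalar is materialized with its canonical type and coerced right before the kernel call.

use std::sync::Arc;
use arrow::array::{ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // The scalar's canonical physical representation...
    let scalar_array: ArrayRef = Arc::new(StringArray::from(vec!["datafusion"]));
    // ...coerced to the column's physical type just before the kernel runs,
    // so the kernel never sees mismatched encodings.
    let coerced = cast(&scalar_array, &DataType::Utf8View)?;
    assert_eq!(coerced.data_type(), &DataType::Utf8View);
    Ok(())
}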

@tobixdev
Contributor

I see your point. It's definitely worth exploring the direction of improving ergonomics with helper functions.

I'll close #14617 for now so it doesn't clutter the PR list. It will still be there if we ever need a LogicalScalar again.

@findepi
Member

findepi commented Feb 19, 2025

I am relying here that we can extract the necessary type information from the schema or record batches

This is probably not true. Scalar is a free constant unlike column that has DataType in the defined table.

Which may involve a bit of computation, especially for large expressions like deep AND or OR trees (#12604)
(in #9375 we avoided stack overflow, not the computation).

A question I have is: When is LogicalScalar actually helpful?

The LogicalScalar is used in Expr::Literal and is later converted into a ScalarValue during the physical plan execution.

It's not helpful if logical planning remains tied to the arrow DataType, as it does today.
I think we wanted to have LogicalScalar to decouple logical planning from arrow type system.

Even if we can convert to ScalarValue, since we don't have the DataType, LogicalScalar::String can only be converted to ScalarValue::Utf8, not ScalarValue::Utf8View or ScalarValue::Dictionary(_, Utf8). Given that type coercion resolves the DataType of the Expr, not only the LogicalType, it is problematic if we can't convert the scalar string to the specific StringViewArray

This is a chicken-and-egg problem.
Coercion resolves DataType, because that's the (current) type system for LP.
I believe the purpose of this issue (#12622) is "Logical operators during logical planning should unquestionably not have access to the physical type information".
So the end state would be: LP operating on "logical types" (without distinguishing Utf8 from Utf8View, from Dictionary(_, Utf8), from REE). The coercions operate on the same logical types. The constants in the plan are also expressed in terms of these logical types.

@jayzhan211
Contributor

jayzhan211 commented Feb 19, 2025

First, we need to remove DataType from type coercion in the logical layer. Type coercion should only occur between NativeType values at this stage, while DataType coercion should be deferred to the physical plan or runtime (#12720).

Once this is done, we can introduce LogicalScalar and replace Expr::Literal(ScalarValue) with Expr::Literal(LogicalScalar). Whether LogicalScalar improves the implementation will become clear after completing the first step.

An open question remains: Should type coercion be handled in the physical optimizer, or should we implement kernels for types that share the same NativeType but have different DataType values? The best approach likely involves both—using kernels when optimizations are possible and falling back on type coercion when they are not.

@findepi
Member

findepi commented Feb 19, 2025

Should type coercion be handled in the physical optimizer,

Did you mean physical planner?
Physical plans are statically typed and use DataType. If we go from LP to PP, we need to use types correctly (just as we need to use correct types when going from SQL text to LP, #14618)

implement kernels for types that share the same NativeType but have different DataType values?

I think we should start with what is simpler, i.e. the current state, where PP is statically typed with DataType.
later, IMO, we should relax that, and this is covered by #12720

@jayzhan211
Contributor

jayzhan211 commented Feb 19, 2025

Did you mean physical planner?
Physical plans are statically typed and use DataType. If we go from LP to PP, we need to use types correctly (just as we need to use correct types when going from SQL text to LP)

For example, the existing type coercion coerces strings to one exact DataType, so if we have DataType::Utf8 and DataType::Utf8View, we end up with both as DataType::Utf8View.

We can change to:

The type coercion in LP would only care about NativeType::String, and we could end up with both DataType::Utf8 and DataType::Utf8View. We consider these resolved types in LP. Then, we continue type coercion in PP, where we can coerce DataType::Utf8 to DataType::Utf8View for performance reasons or whatever.

BUT, we don't do coercion in LP between types that have different NativeTypes, like Int and String; so even i32 and i64 is not allowed.
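In other words, a hypothetical LP-level coercion keyed on NativeType only (helper name assumed):

use datafusion_common::types::NativeType;

// Identical logical types unify to themselves; everything else, including
// Int32 vs Int64, is rejected at the logical layer.
fn coerce_logical(lhs: &NativeType, rhs: &NativeType) -> Option<NativeType> {
    if lhs == rhs {
        Some(lhs.clone())
    } else {
        None
    }
}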

Does this make sense to you?

Did you mean physical planner?

we can have coercion in the physical optimizer. Or somewhere else, maybe 🤔

@findepi
Member

findepi commented Feb 19, 2025

The type coercion in LP only take care about NativeType::String and we end up with DataType::Utf8 and DataType::Utf8View. We consider these resolved types in LP.

If LP only cares about "native types" (aka logical types), then it doesn't know & doesn't care whether later it will be Utf8 or Utf8View.
We can think of "logical types" as just an equivalence relation over physical types, but that's not how the code & abstractions should look.

So, for this issue to be complete, we need the LP to be "decoupled" from physical types, i.e. not use arrow DataType at all.
