Update TO_DATE, TO_TIMESTAMP scalar functions to support LargeUtf8, Utf8View #12928

Closed
Omega359 opened this issue Oct 15, 2024 · 1 comment · Fixed by #12929
Labels
enhancement New feature or request

Comments

@Omega359
Contributor

Omega359 commented Oct 15, 2024

Is your feature request related to a problem or challenge?

Part of #11752 and #11790

Currently, calling the TO_DATE or TO_TIMESTAMP* UDFs with a Utf8View argument fails. Once this issue is fixed, those calls should succeed.

> create table ts_utf8_data(ts varchar(100), format varchar(100)) as values
  ('2020-09-08 12/00/00+00:00', '%Y-%m-%d %H/%M/%S%#z'),
  ('2031-01-19T23:33:25+05:00', '%+'),
  ('08-09-2020 12:00:00+00:00', '%d-%m-%Y %H:%M:%S%#z'),
  ('1926632005', '%s'),
  ('2000-01-01T01:01:01+07:00', '%+');
0 row(s) fetched. 
Elapsed 0.062 seconds.

> create table ts_utf8view_data as
select arrow_cast(ts, 'Utf8View') as ts, arrow_cast(format, 'Utf8View') as format from ts_utf8_data;
0 row(s) fetched. 
Elapsed 0.010 seconds.

> SELECT to_timestamp(t.ts, t.format),
       to_timestamp_seconds(t.ts, t.format),
       to_timestamp_millis(t.ts, t.format),
       to_timestamp_micros(t.ts, t.format),
       to_timestamp_nanos(t.ts, t.format)
       from ts_utf8view_data as t;
Execution error: to_timestamp function unsupported data type at index 1: Utf8View

We are working to add complete StringView support in DataFusion, which permits potentially much faster processing of string data. See #10918 for more background.

Today, most DataFusion string functions support DataType::Utf8 and DataType::LargeUtf8. When such a function is called with a StringView argument, DataFusion casts the argument back to DataType::Utf8, which is expensive.

To realize the full speed of StringView, we need to ensure that all string functions support DataType::Utf8View directly.
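
To make the cost of that cast concrete, here is a minimal std-only sketch. It assumes the Arrow Utf8View layout, in which short strings (up to 12 bytes) are stored inline in the view and longer strings are referenced by buffer index and offset, while Utf8 stores all strings in one contiguous data buffer with offsets. The names `View`, `Utf8Layout`, `resolve`, and `to_utf8_layout` are hypothetical and simplified; they are not the arrow-rs types or its cast kernel.

```rust
// Hypothetical, simplified model of the two layouts; not the arrow-rs API.

/// A Utf8View-style value: short strings live inline in the view,
/// long strings point into one of several shared data buffers.
enum View {
    Inline(String),
    Ref { buffer: usize, offset: usize, len: usize },
}

/// A Utf8-style array: one contiguous data buffer plus offsets.
struct Utf8Layout {
    offsets: Vec<i32>,
    data: Vec<u8>,
}

/// Look up the bytes a view refers to without copying them.
fn resolve<'a>(view: &'a View, buffers: &'a [Vec<u8>]) -> &'a [u8] {
    match view {
        View::Inline(s) => s.as_bytes(),
        View::Ref { buffer, offset, len } => &buffers[*buffer][*offset..*offset + *len],
    }
}

/// Converting views to the Utf8 layout copies every string into `data`.
/// This per-value copy is the work a function avoids by reading views directly.
fn to_utf8_layout(views: &[View], buffers: &[Vec<u8>]) -> Utf8Layout {
    let mut offsets = vec![0i32];
    let mut data = Vec::new();
    for view in views {
        data.extend_from_slice(resolve(view, buffers));
        offsets.push(data.len() as i32);
    }
    Utf8Layout { offsets, data }
}
```

A function that works on views directly only needs something like `resolve`, which borrows the bytes in place; the conversion step in `to_utf8_layout` is what casting adds on top.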

Describe the solution you'd like

Update the TO_DATE and TO_TIMESTAMP* functions to support DataType::Utf8View directly.

Describe alternatives you've considered

No response

Additional context

The typical steps are:

1. Write some tests showing the function doesn't support Utf8View (see the tests in [string_view.slt](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/string_view.slt) to ensure the arguments are not being cast).
2. Change the Signature of the function to accept Utf8View in addition to Utf8/LargeUtf8.
3. Update the implementation of the function to operate on Utf8View.
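
As a rough illustration of the signature and implementation steps, the std-only sketch below models argument type validation before and after the change. Everything in it is an assumption made for illustration: `ArgType` stands in for arrow's `DataType`, `validate_arg_types` is a hypothetical helper, and the error text is only loosely modeled on the message shown earlier in this issue. It is not DataFusion's actual validation code.

```rust
// Hypothetical stand-in for arrow's DataType; only the variants needed here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArgType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Hypothetical helper: reject any argument whose type is not in `accepted`.
fn validate_arg_types(
    name: &str,
    args: &[ArgType],
    accepted: &[ArgType],
) -> Result<(), String> {
    for (idx, arg) in args.iter().enumerate() {
        if !accepted.contains(arg) {
            return Err(format!(
                "{name} function unsupported data type at index {idx}: {arg:?}"
            ));
        }
    }
    Ok(())
}

/// Accepted string types before the change (assumed for this sketch).
const BEFORE: [ArgType; 2] = [ArgType::Utf8, ArgType::LargeUtf8];

/// Accepted string types after the change: Utf8View is handled directly.
const AFTER: [ArgType; 3] = [ArgType::Utf8, ArgType::LargeUtf8, ArgType::Utf8View];
```

With `BEFORE`, a call such as `validate_arg_types("to_timestamp", &[ArgType::Utf8View, ArgType::Utf8View], &BEFORE)` returns an error, while the same call with `AFTER` succeeds. In the real function, the implementation must then also read values from the view array rather than relying on a cast.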

Example PRs

- Update to use an arrow kernel that already supports StringView: #11787
- Change the implementation to support StringView directly: #11676
- Change implementation (option 2): #11556

Omega359 added the enhancement (New feature or request) label on Oct 15, 2024
@Omega359
Contributor Author

take

Omega359 changed the title from "Update TO_DATE, TO_TIMESTAMP scalar functions to support Utf8View" to "Update TO_DATE, TO_TIMESTAMP scalar functions to support LargeUtf8, Utf8View" on Oct 15, 2024