[Epic] Native StringView support for string functions #11790

Open
21 tasks done
Tracked by #11752
alamb opened this issue Aug 2, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Aug 2, 2024

Is your feature request related to a problem or challenge?

We are working to add complete StringView support in DataFusion, which permits potentially much faster processing of string data. See #10918 for more background.

Today, most DataFusion string functions support DataType::Utf8 and DataType::LargeUtf8. When called with a StringView argument, DataFusion casts the argument back to DataType::Utf8, which is expensive.

To realize the full speed of StringView, we need to ensure that all string functions support DataType::Utf8View directly.
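
To illustrate why the cast back to Utf8 is costly, here is a minimal sketch using only the Rust standard library. The `View` and `Utf8Like` structs and the `cast_views_to_utf8` function are hypothetical stand-ins, not Arrow or DataFusion APIs, and the layout is simplified (real Utf8View values also inline short strings and can reference several buffers). Under those assumptions, it shows that building the contiguous offsets-plus-data layout means copying the bytes of every string:

```rust
/// Hypothetical simplified view: an (offset, len) window into a shared byte buffer.
/// This is an illustration only, not Arrow's actual view layout.
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// Hypothetical stand-in for the Utf8 layout: one contiguous data buffer plus offsets.
#[derive(Debug, PartialEq)]
struct Utf8Like {
    offsets: Vec<i32>,
    data: Vec<u8>,
}

/// Converting views to the contiguous layout copies the bytes of every string.
fn cast_views_to_utf8(buffer: &[u8], views: &[View]) -> Utf8Like {
    let mut offsets = Vec::with_capacity(views.len() + 1);
    let mut data = Vec::new();
    offsets.push(0);
    for v in views {
        // This copy is the per-string cost that native Utf8View support avoids.
        data.extend_from_slice(&buffer[v.offset..v.offset + v.len]);
        offsets.push(data.len() as i32);
    }
    Utf8Like { offsets, data }
}
```

A function that accepts the views directly can skip this conversion step entirely.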

Describe the solution you'd like

Port all string functions

Describe alternatives you've considered

No response

Additional context

See coordination plan with @tshauck and myself here: #11787 (comment)

@alamb alamb added the enhancement New feature or request label Aug 2, 2024
@alamb
Contributor Author

alamb commented Aug 12, 2024

One thing I have noticed during implementation is that some functions, such as ltrim/rtrim/btrim, could be more efficient if they produced Utf8View as output in addition to accepting it as input.

For example, as discussed in #11920 (comment) from @Kev1n8, it is probably a good idea to always generate StringView as output (rather than StringArray), since that could avoid a copy.

I am thinking that once the string functions support StringView as input, we can do a second pass and optimize some functions so they produce StringView as output.
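
As a rough illustration of why view output can avoid a copy for trim-style functions, the sketch below uses a hypothetical, simplified `View` type (an offset and length into a shared buffer), not Arrow's real view layout or any DataFusion API. Under that assumption, an ltrim over spaces only needs to adjust each window, and the string bytes stay where they are:

```rust
/// Hypothetical simplified view: an (offset, len) window into a shared byte buffer.
/// This is an illustration only, not Arrow's actual view layout.
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// ltrim (spaces only) over views: each window is narrowed, no string bytes are copied.
fn ltrim_views(buffer: &[u8], views: &[View]) -> Vec<View> {
    views
        .iter()
        .map(|v| {
            let bytes = &buffer[v.offset..v.offset + v.len];
            // Count leading ASCII spaces; a space byte never occurs inside a
            // multi-byte UTF-8 sequence, so the new start is a valid boundary.
            let skipped = bytes.iter().take_while(|&&b| b == b' ').count();
            View {
                offset: v.offset + skipped,
                len: v.len - skipped,
            }
        })
        .collect()
}

/// Resolve a view to a &str for inspection.
fn view_str<'a>(buffer: &'a [u8], v: &View) -> &'a str {
    std::str::from_utf8(&buffer[v.offset..v.offset + v.len]).expect("valid UTF-8")
}
```

Producing a contiguous StringArray from the same input would instead require writing the trimmed bytes into a new data buffer.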

@2010YOUY01
Contributor

Inspired by @Omega359's great PR #11941, I have some suggestions on testing Utf8View support for functions:

Although most implementations are adapted from existing ones, execution takes a different path, so I think comprehensive end-to-end tests are still needed.
The good news is that sqllogictests already exist for the original string functions; the only thing to do is duplicate the existing tests with Utf8View.

Here are examples of how to adapt existing test cases for Utf8View input:

  1. For functions that take a scalar value, use arrow_cast(), as in https://github.com/apache/datafusion/pull/11941/files#diff-51757b2b1d0a07b88551d88eabeba7f74e11b5217e44203ac7c6f613c0221196
  2. For functions that read from a table, string columns can be converted to Utf8View columns like this:
    # Table with the different combinations of column types
    statement ok
    create table test as
    SELECT
    arrow_cast(column1, 'Utf8') as column1_utf8,
    arrow_cast(column2, 'Utf8') as column2_utf8,
    arrow_cast(column1, 'LargeUtf8') as column1_large_utf8,
    arrow_cast(column2, 'LargeUtf8') as column2_large_utf8,
    arrow_cast(column1, 'Utf8View') as column1_utf8view,
    arrow_cast(column2, 'Utf8View') as column2_utf8view,
    arrow_cast(column1, 'Dictionary(Int32, Utf8)') as column1_dict,
    arrow_cast(column2, 'Dictionary(Int32, Utf8)') as column2_dict
    FROM test_source;

@alamb
Contributor Author

alamb commented Aug 21, 2024

We are making pretty good progress here -- just a few more functions left 🚀


2 participants