Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36250: [MATLAB] Add arrow.array.StringArray class #36366

Merged
merged 17 commits into from
Jul 7, 2023

Conversation

kevingurney
Copy link
Member

@kevingurney kevingurney commented Jun 28, 2023

Rationale for this change

Thanks to @sgilmore10's recent changes to enable UTF-8 <-> UTF-16 string conversions, we can now add support for creating Arrow String arrays (UTF-8 encoded) from MATLAB string arrays (UTF-16 encoded).

What changes are included in this PR?

  1. Added new arrow.array.StringArray class that can be constructed from MATLAB string and cellstr types. Note: We explicitly decided to not support char arrays for the time being.
  2. Factored out code for extracting "raw" const uint8_t* from a MATLAB logical Data Array into a new function bit::unpacked_as_ptr so that it can be reused across multiple Array Proxy classes. See [MATLAB] Use bit::unpacked_as_ptr in timestamp_array.cc #36335.
  3. Added new arrow.type.StringType type class and associated arrow.type.ID.String enum value.
  4. Enabled support for creating RecordBatch objects from MATLAB tables containing string data.
  5. Updated arrow::matlab::array::proxy::Array::toString code to convert from UTF-8 to UTF-16 for display in MATLAB.

Examples

Most MATLAB string arrays round-trip

>> matlabArray = ["A"; "B"; "C"]

matlabArray = 

  3x1 string array

    "A"
    "B"
    "C"

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray = 

[
  "A",
  "B",
  "C"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)          

matlabArrayRoundTrip = 

  3x1 string array

    "A"
    "B"
    "C"

>> isequal(matlabArray, matlabArrayRoundTrip)

ans =

  logical

   1

MATLAB string(missing) Values get mapped to null by default

>> matlabArray = ["A"; string(missing); "C"]

matlabArray = 

  3x1 string array

    "A"
    <missing>
    "C"

>> arrowArray = arrow.array.StringArray(matlabArray) 

arrowArray = 

[
  "A",
  null,
  "C"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray) 

matlabArrayRoundTrip = 

  3x1 string array

    "A"
    <missing>
    "C"

>> isequaln(matlabArray, matlabArrayRoundTrip)

ans =

  logical

   1

Unicode characters round-trip

>> matlabArray = ["😊"; "🌲"; ""]

matlabArray = 

  3×1 string array

    "😊"
    "🌲"
    ""

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray = 

[
  "😊",
  "🌲",
  ""
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip = 

  3×1 string array

    "😊"
    "🌲"
    ""

Create StringArray from cellstr

>> matlabArray = {'red'; 'green'; 'blue'}

matlabArray =

  3×1 cell array

    {'red'  }
    {'green'}
    {'blue' }

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray = 

[
  "red",
  "green",
  "blue"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip = 

  3×1 string array

    "red"
    "green"
    "blue"

Create RecordBatch from MATLAB string data

>> matlabTable = table(["😊"; "🌲"; ""])

matlabTable =

  3×1 table

    Var1
    ____

    "😊"
    "🌲"
    "" 

>> arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable)

arrowRecordBatch = 

Var1:   [
    "😊",
    "🌲",
    ""
  ]

>> matlabTableRoundTrip = toMATLAB(arrowRecordBatch)

matlabTableRoundTrip =

  3×1 table

    Var1
    ____

    "😊"
    "🌲"
    "" 

>> isequaln(matlabTable, matlabTableRoundTrip)

ans =

  logical

   1

Are these changes tested?

Yes.

  1. Added new tStringArray test class.
  2. Added new tStringType test class.
  3. Extended tRecordBatch test class to verify support for MATLAB tables which contain string data (see above).

Are there any user-facing changes?

Yes.

  1. Users can now create arrow.array.StringArray objects from MATLAB string arrays and cellstrs.
  2. Users can now create arrow.type.StringType objects.
  3. Users can now construct RecordBatch objects from MATLAB tables that contain string data.

Future Directions

  1. The implementation of this initial version of StringArray is relatively simple in that it does not include a BinaryArray class hierarchy. In the future, we will likely want to refactor StringArray to inherit from a more general abstract BinaryArray class hierarchy.
  2. Following on from 1., we will ideally want to add support for LargeStringArray, BinaryArray, and LargeBinaryArray, and FixedLengthBinaryArray by creating common infrastructure for representing binary types. This initial version of StringArray helps to solidify the user-facing design and provide a shorter term solution to working with string data, since it is quite common.
  3. It may make sense to change the arrow.type.Type hierarchy (e.g. arrow.type.StringType) in the future to delegate to C++ Proxy classes under the hood. See: [MATLAB] Create proxy classes for the DataType class hierarchy  #36363.
  4. Use bit::unpacked_as_ptr in other classes. See [MATLAB] Use bit::unpacked_as_ptr in timestamp_array.cc #36335.
  5. Look for more ways to optimize the conversion from MATLAB UTF-16 encoded string data to Arrow UTF-8 encoded string data (e.g. by avoiding unnecessary data copies).

Notes

  1. Thank you @sgilmore10 for your help with this pull request!

@kevingurney kevingurney marked this pull request as ready for review June 28, 2023 20:04
@kevingurney kevingurney requested a review from assignUser as a code owner June 28, 2023 20:04
@kou
Copy link
Member

kou commented Jun 28, 2023

Could you rebase on main?

matlab/src/cpp/arrow/matlab/array/proxy/string_array.cc Outdated Show resolved Hide resolved
strings.reserve(array_length);
for (size_t i = 0; i < array_length; ++i) {
auto string_array = std::static_pointer_cast<arrow::StringArray>(array);
const auto& str_utf8 = string_array->GetString(i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using GetView() instead of GetString() to avoid copying?

Suggested change
const auto& str_utf8 = string_array->GetString(i);
const auto& str_utf8 = string_array->GetView(i);

(Ah, we should have used std::string_view for UTF8StringToUTF16()... cc: @sgilmore10 )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @kou. I can submit a followup PR to add both UTF8StringToUTF16(std::string_view) and UTF16StringToUTF8(std::u16string_view).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened a PR with this change. See #36485.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, @kou!

Thanks @sgilmore10 for the PR!

I'll make this change as soon as #36485 is merged in.

matlab/src/cpp/arrow/matlab/bit/unpack.cc Outdated Show resolved Hide resolved
matlab/src/cpp/arrow/matlab/array/proxy/string_array.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Jun 30, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 5, 2023
@kevingurney
Copy link
Member Author

kevingurney commented Jul 5, 2023

Thanks for the review @kou! My apologies for the delay responding to feedback. I was away for a few days.

kou pushed a commit that referenced this pull request Jul 6, 2023
…pt `string_views` (#36485)

### Rationale for this change

As @ kou pointed out in #36366, it's better to pass read-only string data via  as string_views than as const references. I recently submitted two new functions called `UTF8StringToUTF16` and `UTF16StringToUTF8`, both of which accept const references. This PR changes the signatures of those functions to accept string_views.

### What changes are included in this PR?

Changed the function signatures of `UTF8StringToUTF16` and `UTF16StringToUTF8`:

1. `arrow::Result<std::u16string> UTF8StringToUTF16(const std::string& source)` ->
    `arrow::Result<std::u16string> UTF8StringToUTF16(std::string_view source)` 

2. `arrow::Result<std::string> UTF16StringToUTF8(const std::u16string& source)` ->
    `arrow::Result<std::string> UTF16StringToUTF8(std::u16string_view source)` 

### Are these changes tested?

Uses existing tests in `utf8_util_test.cc`

### Are there any user-facing changes?

No.

Only the MATLAB Interface uses these functions, so  I don't expect these changes to break anyone's code:

1. Search [results](https://github.com/search?q=repo%3Aapache%2Farrow%20UTF8StringToUTF16&type=code) for `UTF8StringToUTF16`
2. Search [results](https://github.com/search?q=repo%3Aapache%2Farrow+UTF16StringToUTF8&type=code) for `UTF16StringToUTF8`

* Closes: #36483

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@kou
Copy link
Member

kou commented Jul 6, 2023

No problem. :-)

I've merged #36485 . Could you use it?

kevingurney and others added 14 commits July 6, 2023 09:17
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
… Array into bit::unpacked_as_ptr API.

Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
2. Uncomment error macros.

Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
2. Add basic MATLAB StringArray tests.
3. Update RecordBatch to support StringArray.

Co-authored-by: Sarah Gilmore <silgmore@mathworks.com>
… StringArray.

Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
mapped to the empty string value ("") when InferNulls=false.
2. Add tests to verify that an error is thrown if a char array is passed
to the StringArray constructor.
3. Indent comments in tStringArray.
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
…a copies.

Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
kevingurney and others added 3 commits July 6, 2023 09:17
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
@sgilmore10
Copy link
Member

Hi @kou,

Kevin's out today, but I just pushed the change to use GetView instead of GetString.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@kou kou merged commit 80b7658 into apache:main Jul 7, 2023
@kou kou removed the awaiting change review Awaiting change review label Jul 7, 2023
@github-actions github-actions bot added the awaiting merge Awaiting merge label Jul 7, 2023
@kevingurney
Copy link
Member Author

+1

@conbench-apache-arrow
Copy link

Conbench analyzed the 6 benchmark runs on commit 80b76584.

There were 3 benchmark results indicating a performance regression:

The full Conbench report has more details.

@sgilmore10 sgilmore10 deleted the GH-36250 branch August 21, 2023 18:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[MATLAB] Add arrow.array.StringArray class
3 participants