-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36250: [MATLAB] Add arrow.array.StringArray
class
#36366
Conversation
Could you rebase on main? |
strings.reserve(array_length); | ||
for (size_t i = 0; i < array_length; ++i) { | ||
auto string_array = std::static_pointer_cast<arrow::StringArray>(array); | ||
const auto& str_utf8 = string_array->GetString(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using GetView()
instead of GetString()
to avoid copying?
const auto& str_utf8 = string_array->GetString(i); | |
const auto& str_utf8 = string_array->GetView(i); |
(Ah, we should have used std::string_view
for UTF8StringToUTF16()
... cc: @sgilmore10 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point @kou. I can submit a followup PR to add both UTF8StringToUTF16(std::string_view)
and UTF16StringToUTF8(std::u16string_view)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened a PR with this change. See #36485.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, @kou!
Thanks @sgilmore10 for the PR!
I'll make this change as soon as #36485 is merged in.
Thanks for the review @kou! My apologies for the delay responding to feedback. I was away for a few days. |
…pt `string_views` (#36485) ### Rationale for this change As @ kou pointed out in #36366, it's better to pass read-only string data via as string_views than as const references. I recently submitted two new functions called `UTF8StringToUTF16` and `UTF16StringToUTF8`, both of which accept const references. This PR changes the signatures of those functions to accept string_views. ### What changes are included in this PR? Changed the function signatures of `UTF8StringToUTF16` and `UTF16StringToUTF8`: 1. `arrow::Result<std::u16string> UTF8StringToUTF16(const std::string& source)` -> `arrow::Result<std::u16string> UTF8StringToUTF16(std::string_view source)` 2. `arrow::Result<std::string> UTF16StringToUTF8(const std::u16string& source)` -> `arrow::Result<std::string> UTF16StringToUTF8(std::u16string_view source)` ### Are these changes tested? Uses existing tests in `utf8_util_test.cc` ### Are there any user-facing changes? No. Only the MATLAB Interface uses these functions, so I don't expect these changes to break anyone's code: 1. Search [results](https://github.com/search?q=repo%3Aapache%2Farrow%20UTF8StringToUTF16&type=code) for `UTF8StringToUTF16` 2. Search [results](https://github.com/search?q=repo%3Aapache%2Farrow+UTF16StringToUTF8&type=code) for `UTF16StringToUTF8` * Closes: #36483 Authored-by: Sarah Gilmore <sgilmore@mathworks.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
No problem. :-) I've merged #36485 . Could you use it? |
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
… Array into bit::unpacked_as_ptr API. Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
2. Uncomment error macros. Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
2. Add basic MATLAB StringArray tests. 3. Update RecordBatch to support StringArray. Co-authored-by: Sarah Gilmore <silgmore@mathworks.com>
… StringArray. Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
("") before converting to Arrow format.
mapped to the empty string value ("") when InferNulls=false.
2. Add tests to verify that an error is thrown if a char array is passed to the StringArray constructor. 3. Indent comments in tStringArray.
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
…a copies. Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Hi @kou, Kevin's out today, but I just pushed the change to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
+1 |
Conbench analyzed the 6 benchmark runs on commit There were 3 benchmark results indicating a performance regression:
The full Conbench report has more details. |
Rationale for this change
Thanks to @sgilmore10's recent changes to enable UTF-8 <-> UTF-16 string conversions, we can now add support for creating Arrow
String
arrays (UTF-8 encoded) from MATLABstring
arrays (UTF-16 encoded).What changes are included in this PR?
arrow.array.StringArray
class that can be constructed from MATLABstring
andcellstr
types. Note: We explicitly decided to not supportchar
arrays for the time being.const uint8_t*
from a MATLABlogical
Data Array into a new functionbit::unpacked_as_ptr
so that it can be reused across multiple ArrayProxy
classes. See [MATLAB] Usebit::unpacked_as_ptr
intimestamp_array.cc
#36335.arrow.type.StringType
type class and associatedarrow.type.ID.String
enum value.RecordBatch
objects from MATLABtable
s containingstring
data.arrow::matlab::array::proxy::Array::toString
code to convert from UTF-8 to UTF-16 for display in MATLAB.Examples
Most MATLAB
string
arrays round-tripMATLAB
string(missing)
Values get mapped tonull
by defaultUnicode characters round-trip
Create
StringArray
fromcellstr
Create
RecordBatch
from MATLABstring
dataAre these changes tested?
Yes.
tStringArray
test class.tStringType
test class.tRecordBatch
test class to verify support for MATLABtable
s which containstring
data (see above).Are there any user-facing changes?
Yes.
arrow.array.StringArray
objects from MATLABstring
arrays andcellstr
s.arrow.type.StringType
objects.RecordBatch
objects from MATLABtable
s that containstring
data.Future Directions
StringArray
is relatively simple in that it does not include aBinaryArray
class hierarchy. In the future, we will likely want to refactorStringArray
to inherit from a more general abstractBinaryArray
class hierarchy.LargeStringArray
,BinaryArray
, andLargeBinaryArray
, andFixedLengthBinaryArray
by creating common infrastructure for representing binary types. This initial version ofStringArray
helps to solidify the user-facing design and provide a shorter term solution to working withstring
data, since it is quite common.arrow.type.Type
hierarchy (e.g.arrow.type.StringType
) in the future to delegate to C++Proxy
classes under the hood. See: [MATLAB] Create proxy classes for theDataType
class hierarchy #36363.bit::unpacked_as_ptr
in other classes. See [MATLAB] Usebit::unpacked_as_ptr
intimestamp_array.cc
#36335.Notes
arrow.array.StringArray
class #36250