[MATLAB] Create proxy classes for the DataType class hierarchy #36363

Closed
sgilmore10 opened this issue Jun 28, 2023 · 1 comment · Fixed by #36419

@sgilmore10

Describe the enhancement requested

In our initial DataType submission, we did not use proxy classes to represent DataType objects. However, we've realized that it would be helpful to have proxy classes for the DataType class hierarchy in order to support more functionality.

Component(s)

MATLAB

@sgilmore10

take

kou added a commit that referenced this issue Jul 7, 2023
### Rationale for this change

Thanks to @ sgilmore10's [recent changes to enable UTF-8 <-> UTF-16 string conversions](#36167), we can now add support for creating Arrow `String` arrays (UTF-8 encoded) from MATLAB `string` arrays (UTF-16 encoded).

### What changes are included in this PR?

1. Added new `arrow.array.StringArray` class that can be constructed from MATLAB [`string`](https://www.mathworks.com/help/matlab/ref/string.html?s_tid=doc_ta) and [`cellstr`](https://www.mathworks.com/help/matlab/ref/cellstr.html) types. **Note**: We explicitly decided to *not* support [`char`](https://www.mathworks.com/help/matlab/ref/char.html?s_tid=doc_ta) arrays for the time being.
2. Factored out code for extracting "raw" `const uint8_t*` from a MATLAB `logical` Data Array into a new function `bit::unpacked_as_ptr` so that it can be reused across multiple Array `Proxy` classes. See #36335.
3. Added new `arrow.type.StringType` type class and associated `arrow.type.ID.String` enum value.
4. Enabled support for creating `RecordBatch` objects from MATLAB `table`s containing `string` data.
5. Updated `arrow::matlab::array::proxy::Array::toString` code to convert from UTF-8 to UTF-16 for display in MATLAB.
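The UTF-16 to UTF-8 direction mentioned above can be illustrated with a small hand-rolled encoder. This is only a sketch of the general technique, not the conversion code used in this PR; the function name `utf16_to_utf8` is hypothetical, and it assumes that malformed input (an unpaired surrogate) should raise an exception.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the PR): encode a UTF-16 string as UTF-8.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            // Combine the pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        // Emit 1 to 4 bytes depending on the size of the code point.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Characters such as "😊" occupy two UTF-16 code units (a surrogate pair) but four UTF-8 bytes, so element counts differ between the two encodings.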

**Examples**

*Most MATLAB `string` arrays round-trip*

```matlab
>> matlabArray = ["A"; "B"; "C"]

matlabArray = 

  3x1 string array

    "A"
    "B"
    "C"

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray = 

[
  "A",
  "B",
  "C"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)          

matlabArrayRoundTrip = 

  3x1 string array

    "A"
    "B"
    "C"

>> isequal(matlabArray, matlabArrayRoundTrip)

ans =

  logical

   1
```

*MATLAB `string(missing)` values are mapped to `null` by default*

```matlab
>> matlabArray = ["A"; string(missing); "C"]

matlabArray = 

  3x1 string array

    "A"
    <missing>
    "C"

>> arrowArray = arrow.array.StringArray(matlabArray) 

arrowArray = 

[
  "A",
  null,
  "C"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray) 

matlabArrayRoundTrip = 

  3x1 string array

    "A"
    <missing>
    "C"

>> isequaln(matlabArray, matlabArrayRoundTrip)

ans =

  logical

   1

```

*Unicode characters round-trip*

```matlab
>> matlabArray = ["😊"; "🌲"; "➞"]

matlabArray = 

  3×1 string array

    "😊"
    "🌲"
    "➞"

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray = 

[
  "😊",
  "🌲",
  "➞"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip = 

  3×1 string array

    "😊"
    "🌲"
    "➞"
```

*Create `StringArray` from `cellstr`*

```matlab
>> matlabArray = {'red'; 'green'; 'blue'}

matlabArray =

  3×1 cell array

    {'red'  }
    {'green'}
    {'blue' }

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray = 

[
  "red",
  "green",
  "blue"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip = 

  3×1 string array

    "red"
    "green"
    "blue"
```

*Create `RecordBatch` from MATLAB `string` data*

```matlab
>> matlabTable = table(["😊"; "🌲"; "➞"])

matlabTable =

  3×1 table

    Var1
    ____

    "😊"
    "🌲"
    "➞" 

>> arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable)

arrowRecordBatch = 

Var1:   [
    "😊",
    "🌲",
    "➞"
  ]

>> matlabTableRoundTrip = toMATLAB(arrowRecordBatch)

matlabTableRoundTrip =

  3×1 table

    Var1
    ____

    "😊"
    "🌲"
    "➞" 

>> isequaln(matlabTable, matlabTableRoundTrip)

ans =

  logical

   1
```

### Are these changes tested?

Yes.

1. Added new `tStringArray` test class.
2. Added new `tStringType` test class.
3. Extended `tRecordBatch` test class to verify support for MATLAB `table`s which contain `string` data (see above).

### Are there any user-facing changes?

Yes.

1. Users can now create `arrow.array.StringArray` objects from MATLAB `string` arrays and `cellstr`s.
2. Users can now create `arrow.type.StringType` objects.
3. Users can now construct `RecordBatch` objects from MATLAB `table`s that contain `string` data.

### Future Directions

1. The implementation of this initial version of `StringArray` is relatively simple in that it does not include a `BinaryArray` class hierarchy. In the future, we will likely want to refactor `StringArray` to inherit from a more general abstract `BinaryArray` class hierarchy.
2. Following on from 1., we will ideally want to add support for `LargeStringArray`, `BinaryArray`, `LargeBinaryArray`, and `FixedLengthBinaryArray` by creating common infrastructure for representing binary types. This initial version of `StringArray` helps to solidify the user-facing design and provides a shorter-term solution for working with `string` data, since it is quite common.
3. It may make sense to change the `arrow.type.Type` hierarchy (e.g. `arrow.type.StringType`) in the future to delegate to C++ `Proxy` classes under the hood. See: #36363.
4. Use `bit::unpacked_as_ptr` in other classes. See #36335.
5. Look for more ways to optimize the conversion from MATLAB UTF-16 encoded string data to Arrow UTF-8 encoded string data (e.g. by avoiding unnecessary data copies).

### Notes

1. Thank you @ sgilmore10 for your help with this pull request!
* Closes: #36250

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Kevin Gurney <kevin.p.gurney@gmail.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Sarah Gilmore <silgmore@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
kevingurney pushed a commit that referenced this issue Jul 12, 2023
…chy (#36419)

### Rationale for this change

In the original pull request that added the MATLAB `arrow.type.<Type>Type` classes (e.g. `arrow.type.Float32Type`), we did not implement these classes as proxies. At the time, we weren't sure whether doing so would be advantageous, but we now realize it will be for composite data structures, e.g. `Schema`, `StructArray`, and `ListArray`.

### What changes are included in this PR?

1. All classes within the `arrow.type.Type` class hierarchy are now implemented as proxies.

### Are these changes tested?

Yes, we had existing tests for these classes. 

### Are there any user-facing changes?

No.

### Future Directions

1. In a follow-up pull request, we plan to integrate the proxy type classes and the array classes so that they share the same underlying C++ `arrow::DataType` object. We thought doing so in this change would cause too much code churn.

### Notes

Thank you @ kevingurney for the help!

* Closes: #36363

Lead-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: sgilmore10 <74676073+sgilmore10@users.noreply.github.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
@kevingurney kevingurney added this to the 14.0.0 milestone Jul 12, 2023
kevingurney pushed a commit that referenced this issue Jul 17, 2023
…ay` subclasses from existing proxy ids (#36731)

### Rationale for this change

Now that issue #36363 is closed via PR #36419, we can initialize the `Type` property of `arrow.array.Array` subclasses from existing proxy ids. Currently, we create a new proxy `Type` object whose underlying `arrow::DataType` is semantically equal to, but not the same object as, the `arrow::DataType` owned by the Array proxy. It would be preferable for the `Type` and `Array` proxy classes to refer to the same `arrow::DataType` object (i.e. the same object on the heap).
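The distinction in the rationale above, between an equal object and the same object, can be sketched in plain C++. The `DataType` and `Array` structs below are hypothetical stand-ins for the Arrow classes, and both helper functions are illustrative names rather than code from this PR.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for arrow::DataType; only the name is compared.
struct DataType {
    std::string name;
    bool Equals(const DataType& other) const { return name == other.name; }
};

// Hypothetical stand-in for an Array that owns its DataType via shared_ptr.
struct Array {
    std::shared_ptr<DataType> type;
};

// Earlier behavior described above: build a fresh DataType that is merely equal.
std::shared_ptr<DataType> make_equal_copy(const Array& array) {
    return std::make_shared<DataType>(*array.type);
}

// Preferred behavior: hand out the pointer the Array already owns.
std::shared_ptr<DataType> share_existing(const Array& array) {
    return array.type;
}
```

With `make_equal_copy`, `Equals` returns true but the pointers differ; with `share_existing`, both pointers refer to one heap object and only the reference count changes.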

### What changes are included in this PR?

1. Upgraded `libmexclass` to commit [d04f88d](mathworks/libmexclass@d04f88d). In this commit, we added a static "make-like" function to `Proxy` called `create`.
2. Modified the constructors of all `Type` objects to expect a single `Proxy` object as input. This is a breaking change: clients are no longer expected to build `Type` objects via their constructors. Instead, we introduced standalone functions that clients can use to construct `Type` objects, e.g. `arrow.type.int8`, `arrow.type.string`, and `arrow.type.timestamp`. These functions create the `Proxy` objects that are passed to the `Type` constructors. Below is an example of the new workflow for creating `Type` objects.

```matlab
>> timestampType = arrow.type.timestamp(TimeUnit="second", TimeZone="America/New_York")

timestampType = 

  TimestampType with properties:

    ID: Timestamp
```
NOTE: We plan on enhancing the display to show the `TimeUnit` and `TimeZone` properties. 

3. Made `Type` a [dependent](https://www.mathworks.com/help/matlab/matlab_oop/access-methods-for-dependent-properties.html) property on `arrow.array.Array`. The `get.Type` method constructs a `Type` object on demand by making a proxy that wraps the same `arrow::DataType` object stored within the `arrow::Array`.
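As a rough C++ analogue of the dependent `Type` property in item 3, the getter below stores nothing extra and builds a new wrapper on every call. `DataType`, `TypeProxy`, and `ArrayProxy` are hypothetical stand-ins, not the actual Arrow or `libmexclass` classes.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for arrow::DataType.
struct DataType {
    std::string name;
};

// Hypothetical stand-in for a Type proxy: it wraps the DataType, it does not copy it.
struct TypeProxy {
    std::shared_ptr<DataType> type;
};

// Hypothetical stand-in for an Array proxy.
class ArrayProxy {
public:
    explicit ArrayProxy(std::shared_ptr<DataType> type) : type_(std::move(type)) {}

    // Analogue of a dependent property getter: no TypeProxy is cached, a new
    // one is built on each call, and each one wraps the same DataType.
    TypeProxy type() const { return TypeProxy{type_}; }

private:
    std::shared_ptr<DataType> type_;
};
```

Two calls to `type()` return distinct `TypeProxy` values whose `type` members compare equal as pointers, which is the sharing the item describes.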

### Are these changes tested?

Yes, updated existing tests.

### Are there any user-facing changes?

Yes, we added new standalone functions for creating `Type` objects. The table below maps each standalone function to the `Type` object it returns:

| Standalone Function | Output Type Object |
|----------------------|---------------------|
|`arrow.type.boolean`| `arrow.type.BooleanType`|
|`arrow.type.int8`| `arrow.type.Int8Type`|
|`arrow.type.int16`| `arrow.type.Int16Type`|
|`arrow.type.int32`| `arrow.type.Int32Type`|
|`arrow.type.int64`| `arrow.type.Int64Type`|
|`arrow.type.uint8`| `arrow.type.UInt8Type`|
|`arrow.type.uint16`| `arrow.type.UInt16Type`|
|`arrow.type.uint32`| `arrow.type.UInt32Type`|
|`arrow.type.uint64`| `arrow.type.UInt64Type`|
|`arrow.type.string`| `arrow.type.StringType`|
|`arrow.type.timestamp`| `arrow.type.TimestampType`|

### Notes

Thanks @ kevingurney for the advice!
* Closes: #36652

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this issue Jul 20, 2023
…hierarchy (apache#36419)

chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this issue Jul 20, 2023
…ay.Array` subclasses from existing proxy ids (apache#36731)

R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…hierarchy (apache#36419)

R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…ay.Array` subclasses from existing proxy ids (apache#36731)
