Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ergonomic field and schema creation with Metadata #1091

Closed
alamb opened this issue Dec 22, 2021 · 7 comments · Fixed by #2239
Closed

Ergonomic field and schema creation with Metadata #1091

alamb opened this issue Dec 22, 2021 · 7 comments · Fixed by #2239
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Dec 22, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There are a few issues that make it annoying to work with Schema and Field metadata:

  1. there is no way to create and set metadata on fields or schema in one statement; Instead a mut Field has to be created
  2. There is no constructor for create a field with metadata (and thus it is inconsistent with Schema which has new_with_schema)
  3. The accessors are different. Schema provides a &HashMap and Field provides a Option<&BtreeMap>

Describe the solution you'd like

My proposal will be two parts:

  1. Some simpler API functions (like with_metatada())
  2. A more invasive API change to make the metadata handling more uniform

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@alamb alamb added arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog labels Dec 22, 2021
@jorgecarleitao
Copy link
Member

In arrow2 we noticed the same problem.

The solution we took was to make all members of Schema public - there is no invariant betweem them, so there is nothing that users can break by accessing individual members.

@alamb
Copy link
Contributor Author

alamb commented Dec 23, 2021

@jorgecarleitao I looked at the Field and Schema in arrow2. They also use the seemingly inconsistent Option<BTreeMap<String, String>> and HashMap<String, String>. Do you remember if there is a reason for that inconsistency (both the Option and also the different Map structure)?

https://github.com/jorgecarleitao/arrow2/blob/main/src/datatypes/schema.rs#L30
https://github.com/jorgecarleitao/arrow2/blob/main/src/datatypes/field.rs#L39

@jorgecarleitao
Copy link
Member

Oh, sorry, I skipped the 3rd item by mistake. I do not remember the reason. Had a quick look at the spec:

  • IPC format: each metadata is an array of key pairs (here and here).
  • C data interface: list of key pairs here.

I suspect that in both cases they are meant to represent unique keys without a specific order, but it is not clear to me atm.

imo it is very much worth investigating if we can get away with a common representation under the integration tests - as you wrote, much easier to reason about.

@jorgecarleitao
Copy link
Member

I played with this idea and I think that for us the main difference between them is whether Hash is implemented: BtreeMap implements Hash, while HashMap does not.

Because some variants of DataType contain Box<Field> and we want DataType to implement Hash (e.g. DataFusion and others depend on this), imo Field needs to stick to BtreeMap.

The Hash requirement does not exist for Schema and thus it uses HashMap. I agree that the Option bit not needed.

@alamb
Copy link
Contributor Author

alamb commented Dec 27, 2021

👍 @jorgecarleitao

In arrow-rs there is code to handle serializing the metadata to/from json, which is also slightly different for the two.

I wonder if we implemented something similar to https://docs.rs/tonic/0.6.2/tonic/metadata/struct.MetadataMap.html

struct MetadataMap {
...
}

Which implements whatever semantics we needed

@alamb
Copy link
Contributor Author

alamb commented Jan 2, 2022

Looks like there is a related PR in arrow2: jorgecarleitao/arrow2#715

@jorgecarleitao
Copy link
Member

I was waiting for another (soon to be PR) before reporting here, but basically it is possible to remove Option and also to replace HashMap in Schema by BTreeMap, thereby making both use the same data structure.

There is a larger change going on there, jorgecarleitao/arrow2#717, that removes RecordBatch (in line with the requirement coming from this issue: apache/datafusion#1248 and jorgecarleitao/arrow2#673) - I am waiting for that to land before PRing the HashMap -> BTreeMap, since the broader observation is that Schema can be greatly simplified as there are no internal invariants on it (basically everything inside can be made public and we do not need any accessors for it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants