Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 494- Define "map" XDM type #511

Merged
merged 7 commits into from
Sep 27, 2018
Merged

Issue 494- Define "map" XDM type #511

merged 7 commits into from
Sep 27, 2018

Conversation

kstreeter
Copy link
Collaborator

This PR addresses issue #494. It introduces the concept of a "storage hint" that can be applied to a model or property definition to give storage and transmission systems additional hints on how to efficiently handle the data, without changing the semantic meaning of the underlying property.

The initial use of this is to support storage of "map-like" data in XDM. This support is described in the docs.

@kstreeter
Copy link
Collaborator Author

@cmathis, @harleensahni, @lrosenthol, @ogoldman please take a look. This proposal is based on the conversation on the initial design given by @cmathis, modified based on a conversation I had with @lrosenthol and @ogoldman earlier this week.

@cmathis
Copy link
Collaborator

cmathis commented Sep 19, 2018

Doesn't the meta:xdmType = map already tell you it should be treated as a map?

For example:

properties: {
    "mapField": {
         "type":"object",
         "meta:xdmType": "map",
         "additionalProperties": {
              "type":"string"
         }
     }
}

If a customer sends data in the form of JSON it would look like:
mapField: {key1:value1, key2:value2, key3:value3}

The consumer can see that mapField should be interpreted as a Map and store all the key/value pairs that way.

The other way this could be stored is as a struct which is more of a traditional object. In this case could we continue using the meta:xdmType (which everyone is looking for) and instead have meta:xdmType = object to imply this data should be stored as a dynamically growing struct/object?

@kstreeter
Copy link
Collaborator Author

Hey @cmathis, after speaking with @lrosenthol, @ogoldman a couple of concerns were raised with that approach:

  • We have discussed the "xdmType" concept for scalar fields, but this is first time we have tried to apply it to an object. There is a concern that we are opening the door for allowing semantic details to be defined via "xdmType" that really should be defined using actual schema definitions. I agree that this creates a gray area: what is a type, versus what is a schema? For example, why is a "measure" a schema and not a type?

  • It was observed that the representation we proposed for map is semantically equivalent to a JSON object (which is also a map). We are not defining new semantics for the type, simply giving an explicit signal that the keyspace is unconstrained and should be stored appropriately.

  • It was observed that so far the "map" concept is the only modifier we have for objects, so it might make sense to solve for the specific case, versus more generally opening up the type mechanism.

@cmathis
Copy link
Collaborator

cmathis commented Sep 20, 2018

So what is the xdmType for a field defined as:

properties: {
    "mapField": {
         "type":"object",
         "meta:storageHint": "xdm:storeAsMap"
         "additionalProperties": {
              "type":"string"
         }
     }
}

@lrosenthol
Copy link
Collaborator

There isn't one @cmathis - that's the point. We don't want to use meta:xdmType on an object, bcause of the concerns raised above. All objects are either a defined XDM schema or an arbitrary map of properties.

@cmathis
Copy link
Collaborator

cmathis commented Sep 20, 2018

So let's say I have an XDM schema like this:

properties: {
   field1: {
        type: object,
        properties: {
            subField1: {
                type: string
            }, 
            subField2: {
                type: integer
            }
       }
   },
   field2: {
        type: object,
        additionalProperties: {
            type: string
        }
   },
   field3: {
       type: integer,
       minimum: -128,
       maximum: 128,
       meta:xdmType: byte,
  }
}

If I'm trying to map this schema into a parquet schema, how do I know that field1 should be a struct, but field2 should be a map? I'm assuming this is where the meta:storageHint comes in, but now we have two ways to identify specific types. In the case of field3 I look at the meta:xdmType, but now your asking the user to know about this meta:storageHint as well. We need to be consistent one way or the other.

From the registry we use to not have an xdmType for the JSON schema type:array and type:object fields and that resulted in tickets which users flagged as a bug and demanded that every field have an xdmType. I cannot surface fields that do not have explicit data types that are exposed the same way across all fields.

@lrosenthol
Copy link
Collaborator

The way you know what should be treated as a struct vs. a map is meta:storageHint - that's exactly what it is for (and why it is named that way). Nothing more (or less).

meta:xdmType serves a completely different purpose.

xdmType was proposed as a way to provide something equivalent to JSON-LD's @type on an entry in a JSON schema - meaning that you want to give it some rich semantic meaning defined in some other standard. date is a good example, in that we could (theoretically) get rid of it and just use a type:string + xdmType::xsd:date. The current 'real' example is BigNumber.

@lrosenthol
Copy link
Collaborator

We are being 100% consistent.

meta:storageHint is what the name implies - a hint to a storage system (eg. parquet) about how it should store it - columns or not. It has nothing to do with the type of data.

meta:xdmType on the other hand is also what the name implies - the actual "rich" type of the scalar field to address providing those rich/extended type semantics.

@cmathis
Copy link
Collaborator

cmathis commented Sep 21, 2018

I understand. I guess I was considering a "map" as a type of scaler (which I agree it's not). I just wish it was a little more intuitive. Could we use meta:storageHint for everything - xdm:storeAsByte, xdm:storeAsLong, xdm:storeAsDate, etc. This is really what users are using the xdmType for today - how should I represent this field in my database, parquet, etc.

@ogoldman
Copy link
Collaborator

I'm with @cmathis on this one: call it storageHint, and use it consistently, instead of xdmType. (Otherwise, we're not really using the JSON Schema type system; we're using two type systems at once. In that case we may as well introduce a map type. But I fear this will be endlessly complicated to maintain.)

@lrosenthol
Copy link
Collaborator

@cmathis is there something we can do with the naming of the two keys which might help you?

@lrosenthol
Copy link
Collaborator

@ogoldman The (known) problem with the JSON Schema typing model is that it is limited and the JSON-LD community is solving it the wrong way (IMO) via the @type key in the JSON data....but we want that information in the schema. As such, we have to define our own method - in this case meta:xdmType. As long as we are consistent, I don't see the problem.

@cmathis
Copy link
Collaborator

cmathis commented Sep 21, 2018

@lrosenthol - This may be one of those areas where we pull in Product Management or some of the end users (individuals from Vasanthi Holtcamp's data on-boarding team) to get their opinion. When we started talking about Map support in XDM, I assumed it would be in the form of an xdmType=map. I may be the only one that thought this way. Would be interesting to see how PM was planning to define a map field in the UI editor and if they planned on adding a Map type next to the existing Integer, Long, Double, etc.

This may be a separate discussion, but as @ogoldman mentioned the xdmType has introduced a second type system which has been a pain. As you know on the XC side users are expecting an XDM schema to be translatable into Postgress tables, Parquet, CosmoDB, Java objects, you name it. This is what demanded the creation of a richer set of data types (xdmType), but it's not an intuitive process for users trying to define these types in JSON schema. According to the documentation (https://wiki.corp.adobe.com/pages/viewpage.action?spaceKey=DMSArchitecture&title=XDM+Architecture#XDMArchitecture-XDMDataTypes) XDM supports long, date, short, integer, etc. I'm constantly responding to Jira tickets that say "I tried to define a schema with "type":"long" and the registry API throws an error" (which it does because its not valid JSON schema). The same happens from users of the UI who define a field of type "Long" who then write up bugs because the schema ended up with a "type":"integer" (the UI translates into a JSON schema equivalent). We are trying to battle the problem with documentation, but I feel like it's a never ending pain point. The registry today injects a meta:xdmType on every field to avoid some of the confusion and many of the components have switched to looking at it rather then the type attribute. Even if we dropped meta:xdmType and transitioned to every field having a meta:storageHint it still becomes another type system as the basic JSON schema type will never be sufficient for the XC needs.

@cfraser
Copy link
Contributor

cfraser commented Sep 22, 2018

Can we circle back on why meta:xdmType is insufficient? It's currently metadata that makes it easier to understand what is being implied by the JSON Schema constraints on a field.

It seems like a natural fit for map fields as it would still be metadata that made it easy to understand what was implied by the additionalProperties structure in an object field. While we have no non-scalar analogs at the moment I would expect that if we wanted to define further complex object types that we would simply add another xdmType hint to help clients understand how the object should be treated. I don't understand why we need to add the complexity of a new meta property in this instance.

@kstreeter
Copy link
Collaborator Author

I definitely feel that we need to keep the definition of "meta:xdmType" for scalar properties. As @lrosenthol observed, that tag is intended on giving clear and specific semantics to the JSON types, where the JSON type system is too broadly defined compared to the type systems we interop with. Those definitions are type definitions, in the same way that the JSON types are, although they are carefully defined to also be fully compatible with the underlying JSON types.

For me, the only question is whether a "map" is really a new type, or is just a modifier on the object type. This PR says the later: maps are still objects, but with an unconstrained keyset which may be desirable to store differently than a regular object. But we could also go with the other interpretation, that map is a type distinct from object, in which case we would include it directly in "xdmType".

@lrosenthol
Copy link
Collaborator

So let's say that we opened up meta:xdmType to objects - that will introduce two additional "oddities" that will need to be corrected in documentation anyway.

1 - Why do we allow it for objects and not arrays?
2 - Since the only allowed value for meta:xdmType on an object would be map - we have to explain why that.

So either we document what we have today (+ storageHint) or we switch to something new & different and start documenting that. Either way, your documentation & support issue doesn't go away (IMO).

@cmathis
Copy link
Collaborator

cmathis commented Sep 24, 2018

@lrosenthol - I'm not sure what you mean by oddity number 1. Arrays in XDM are defined as a fields with type:array and have a meta:xdmType:array.

Yes, because we are using JSON schema to represent richer data types we are going to need a documentation no matter what we do. Users need to know that sometimes a raw type:string may end up with an xdmType:date (because of theformat attribute supplied) . Same thing for why a raw field defined as type:integer may end up with an xdmType of byte, long, etc. (because of the min/max attributes). This whole discussion is if we apply a similar translation process (based on the translation of the other attributes) and allow fields with type:object to end up with an xdmType:map,

Anyone apposed to me posting a poll in the #xdm-questions channel to see what the broader audience would expect to see?

@cmathis
Copy link
Collaborator

cmathis commented Sep 24, 2018

Just spoke with @kstreeter a little about this. If we go the route of using the xdmType for defining richer scaler types only, then every consumer of XDM schemas (ETL connectors, Data Ingestion, Unified Profile, Campaign, etc.) that need to map the fields to a storage format will need to do the following:

if (xdmType) use xdmType
else if (storageHint) use storageHint
else use type

You basically have three locations to check now. If everyone feels like this is the best way to go then so be it, but let's please send this out as a formal announcement to all platform users so they are aware of this requirement. We will need to update the ETL vendor documentation as well. This needs to be published as the formal way to find the storage format (some would say actual data type) so I do not get tickets against the registry saying that every field is suppose to have an xdmType which is the way it's been.

@kstreeter kstreeter added this to the 0.9.6 milestone Sep 24, 2018
Corrected typos, no semantic change.
@cdegroot-adobe
Copy link
Contributor

@cmathis, @kstreeter, @lrosenthol -> I agree with the PR as it stands.

Just catching up on this. The salient question to me after reading the thread is do we decide to:
a) follow correct theoretical design principles that will be lost on many users of the system, or
b) optimize for simplicity and place the property in a location that on first consideration seems like a logical place to put it.

My perspective is that we need to adhere consistently to the the best technical design (a). Intermingling schema definition with value formats is wrong.

Usability is important but this will just make people who understand JSON-Schema/JSON-LD feel that they cannot trust their natural understanding of XDM.

Usability will have to be delivered by helpful UX and for APIs effective validation with complete error messaging.

@cfraser - to your comment about what will we then do if/when we have other non-scalar types. The pattern for when we have a concrete new complex type is to externalize it as its own schema file for inclusion, via an $ref. I am hopeful we will both see constrained usage of the map feature, as it should never be used to avoid defining real schema and that we don't have other cases like this coming up.

@shonesadler
Copy link
Contributor

Catching up on this as well and a few comments:

  • Similar to @cmathis I would vote for leveraging xdmType rather than creating yet another syntax/construct that we need to manage (ie. storageHint)
  • A Map is a dataType and is distinct from a Struct in that it supports arbitrary key->value mappings that are not known ahead of time. This semantic is expressed in JSON Schema through the use of additionalProperties. To that extent we could detect whether a field needs to be stored as Map or Struct purely based on whether the JSON Schema allows for additionalProperties.
  • JSON supports both Maps and Structs with a more general "object" type. However, the type distinction is valid in language bindings (e.g. Java/Scala) and frameworks such as Spark. The choice in how these types are stored impacts how the related data is queried. For Example: Maps can be accessed using an indexed syntax:
  • myDataFrame.selectExpr("tags['key']")
  • or keys convered to columns with:
  • myDataFrame.select("explode(tags)")
    Point is we are not just talking about how the data is spread out on disk, but also the contract by which clients can interact with that data.

@lrosenthol
Copy link
Collaborator

Arrays in XDM are defined as a fields with type:array and have a meta:xdmType:array.

@cmathis That is incorrect. xdmType is only used when you are specifying something other (or more specifically, additional) to the actual type. As such, you should never have the construct above. If there is software currently generating such or existing schemas with that - they need to be fixed.

@lrosenthol
Copy link
Collaborator

Since I seem to be outnumbered here, I am willing to concede on using xdmType for doing map on objects - provided that it is the only allowed value on objects and that we continue to not allow it for arrays.

@kstreeter
Copy link
Collaborator Author

thanks @lrosenthol that is reasonable we will exclude arrays

@kstreeter kstreeter changed the title Issue 494- Storage hint definition to support map type Issue 494- Define "map" XDM type Sep 27, 2018
@kstreeter kstreeter merged commit 41cb4e8 into master Sep 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants