Add support for multiple data types and schemas in Kamelets #1980

nicolaferraro · 2021-02-01T17:18:03Z

Seeking for help on improving the Kamelet model before going full on the Kamelet catalog effort.

Currently the model expects that one declares the default input/output of a Kamelet in the spec -> types -> in / out field, like:

# ...
  types:
    out:
      mediaType: application/json
      schema:
        #...

The same for Kamelets that consume an input, but the property is named in.

The meaning of those types is simply stated:

A Kamelet produces output with the format specified in out
A Kamelet consumes an input with the format specified in in

That has unfortunately some drawbacks, one of which is that a Kamelet must have a single data type as output (for sources) and/or a single data type for input.
Many implementations of Kamelets that produce JSON data, in fact, have the following route snippet in the flow part so far:

# ...
      steps:
      - marshal:
          json: {}

So e.g., if we go full with the Kamelet catalog and add support for them in camel-kafka-connector, I expect soon to have a salesforce-source-json and a salesforce-source-avro to overcome this limitation. But it's not ideal.

I think we should allow a Kamelet to have a default input/output format, without forcing users to use that one: they may have choices.

I was thinking to something like this:

# ...
  types:
    out:
      mediaType: application/json
      schema:
        #...
      camel:
        dataFormat: json-jackson

The dataFormat option tells the operator to automatically add camel:jackson and the marshalling step when the Kamelet is used in a KameletBinding.

For in, this translates into:

# ...
  types:
    in:
      mediaType: application/json
      schema:
        #...
      camel:
        dataFormat: json-jackson
        className: org.apache.camel.xxx.MyClass

This should add the unmarshalling to a specific (optional) class.

In case we want this behavior to be common for KameletBinding and standard integration, this should be better implemented at runtime level.

Now the question is how to deal with the case of multiple input/output data types.

A possibility would be to add another level of description:

# ...
  types:
    out:
      default: json
      json:
        mediaType: application/json
        schema:
          # the JSON schema
        camel:
          dataFormat: json-jackson
      avro:
        mediaType: application/avro
        schema:
          # the Avro schema
        camel:
          dataFormat: avro
          className: org.apache.camel.xxx.MyClass

That would break a bit the current schema, but it will provide more options in the future.

Having the possibility to choose, a user can specify the format option in a KameletBinding (that we're going to reserve, like we did for id), to select an input/output format that is different from the default (maybe including none, to obtain the original data in advanced use cases).

In case this should work also in the standard integration, we may use the following syntax:

from("kamelet:salesforce-source?format=avro").to("....")

From the operator side, the required libraries for Avro will be added, but the runtime should enhance the route with a loader/customizer.

Wdyt @lburgazzoli, @astefanutti , @davsclaus, @squakez

The text was updated successfully, but these errors were encountered:

lburgazzoli · 2021-02-01T17:26:52Z

Maybe we can do something like this:

steps:
    - marshal: "{{format}}"

This should be a reference to a in/out schema, the operator can then create the properties to configure the data format via properties

heiko-braun · 2021-02-02T07:08:47Z

I like this idea and would add the possibility for:

references to external schemas (I.e. in schema registry)
Schemas attached to the message in transit (I.e in header)

nicolaferraro · 2021-02-02T08:02:16Z

Maybe we can do something like this:
steps:
    - marshal: "{{format}}"
This should be a reference to a in/out schema, the operator can then create the properties to configure the data format via properties

Yes, this would avoid having to write the implementation at runtime side, leaving also field to the user to implement custom transformations in the flow.

nicolaferraro · 2021-02-02T08:18:05Z

I like this idea and would add the possibility for:

references to external schemas (I.e. in schema registry)

Schemas attached to the message in transit (I.e in header)

Yeah, these are concerns we need to address now as well. The schema prop is set to JSONSchema currently, but we need to address other kinds of schemas, including schemas located elsewhere.

I think it's a good time to deprecate spec -> types and provide something like spec -> dataTypes just to build a different tree and provide oob migration from old Kamelets.

For the "schema in header", I think it's a good idea for sources. We can make sure the operator passes the location of the schema in a configuration property and, in case the schema is inline, it also mount it as a file in the pod, so that the header can always be an URL. The Kamelet runtime may also bind that property into a header. The destination (or an intermediate step) can the use that URL to do stuff.

Wdyt @lburgazzoli ?

astefanutti · 2021-02-02T08:54:28Z

Let me have the layman in me try summarising the discussion (at least for my brain to wrap this up 🤯):

We integrate the DataFormat concept into the Kamelet model,
The Kamelet writer can specify a set of DataFormats, and it's up to her/him to declare a property in the Kamelet schema, and use that property in the Kamelet Flow to do the IO conversion,
The Camel K runtime provides (auto-)configuration of DataFormats, so that the Kamelet writer can reference them by name in the Kamelet Flow,
We need to revisit how to associate schemas to DataFormats.

Is my understanding correct?

lburgazzoli · 2021-02-02T09:02:11Z

I like this idea and would add the possibility for:

references to external schemas (I.e. in schema registry)

Schemas attached to the message in transit (I.e in header)

Yeah, these are concerns we need to address now as well. The schema prop is set to JSONSchema currently, but we need to address other kinds of schemas, including schemas located elsewhere.

I think it's a good time to deprecate spec -> types and provide something like spec -> dataTypes just to build a different tree and provide oob migration from old Kamelets.

Maybe we can use "schemes" instead.

For the "schema in header", I think it's a good idea for sources. We can make sure the operator passes the location of the schema in a configuration property and, in case the schema is inline, it also mount it as a file in the pod, so that the header can always be an URL. The Kamelet runtime may also bind that property into a header. The destination (or an intermediate step) can the use that URL to do stuff.

I think we can improve data formats in general, as example:

they can automatically compute the schema at runtime if not provided and store the result in an header
they can use a provided scheme to validate that the marshalled/unmarshalled data conforms with the given schema

We can then define a specific schema like:

avro:
  media-type: application/avro
  schema:
    # the avro schema inline|reference
  data-format:
    # optional, if not provided use the scheme id 
    id: "avro"
    properties:
      class-name: org.apache.camel.xxx.MyClass
      compute-schema: true|false
      ...
  dependencies:
    - camel-avro
    - mvn:org.acme/my-artifact/1.0.0

Wdyt @lburgazzoli ?

lburgazzoli · 2021-02-02T09:03:35Z

Let me have the layman in me try summarising the discussion (at least for my brain to wrap this up exploding_head):

* We integrate the DataFormat concept into the Kamelet model,

* The Kamelet writer can specify a set of DataFormats, and it's up to her/him to declare a property in the Kamelet schema, and use that property in the Kamelet Flow to do the IO conversion,

* The Camel K runtime provides (auto-)configuration of DataFormats, so that the Kamelet writer can reference them by name in the Kamelet Flow,

* We need to revisit how to associate schemas to DataFormats.

Is my understanding correct?

looks correct :)

astefanutti · 2021-02-02T09:14:42Z

+1 then 😄!

lburgazzoli · 2021-02-02T09:24:43Z

Thinking a little bit more, wonder if this new schema/data-format thing is something we can define as a dedicated custom resource (which we can eventually embed in the kamelet) but could also be something we could use for the dynamic computation of schemes.

Eventually, schema registries can watch and duck type those resource to automatically load them.

squakez · 2021-02-02T09:38:41Z

Interesting ideas. I like the concept and I think we should understand if/how is possible to externalize such data formats. If the data formats are an external entity they could be reusable and keep the Kamelet design simpler. Finally, if you think about it, a format is just a view of the data, so it makes sense not to be part of the logic.

davsclaus · 2021-02-03T12:20:54Z

Yeah good idea to have support for multiple data types, especially when its common on kafka land to have avro, json, types.

For kamelets it would also be good if we could generate documentation (ascii doc files) to use for the website / kamelet repository. And in that documentation we can then easily grab the data types and prominently show in the docs what types are supported.

davsclaus · 2021-02-03T12:46:24Z

Btw do we have any thoughts on schema-less kamelets? For example if you just use a kamelet to route data from one messaging system to another between queues. And dont really want/need to specify any schema, as the data is just "raw".

lburgazzoli · 2021-02-03T13:15:19Z

Btw do we have any thoughts on schema-less kamelets? For example if you just use a kamelet to route data from one messaging system to another between queues. And dont really want/need to specify any schema, as the data is just "raw".

Yep schema is not always required and to be honest for Camel it may not even needed (except for some components like kafka) so it is mainly a tooling related information.

nicolaferraro · 2021-02-04T09:09:25Z

Let's do another iteration on this...

I'm thinking to your comments and I like the idea of having stuff also as CRs. I remember some brainstorming with @lburgazzoli about how dynamic schemas may work in this model. The idea was to let Kamelets define their schemas, if known in advance, but also let KameletBindings redefine them, if needed.

DataFormats are generic in Camel, but when talking about connectors (a.k.a. Kamelets), I think it's better for the Kamelet to enumerate all the possible dataformats it supports. E.g. @davsclaus was talking about sources that can only produce binary data (i.e. no dataformat), but there are many other examples: e.g. a "hello world" string cannot be transformed into FHIR data by simply plugging the FHIR JSON dataformat, as well as not all data is suitable for CSV encoding..

I also see that we're talking about formats and schemas as if they were the same thing, but even if they are related (i.e. dataFormat + Kamelet [+ Binding Properties] may imply a Schema), maybe we can do a better job in treating them as separate entities.

I think the following model may be good for the in-Kamelet specification of a "format":

kind: Kamelet
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-source
# ... 
spec:
  definition:
    properties:
      format:
        title: Format
        type: string
        enum:
        - JSON
        - Avro
        default: JSON
# ... 
formats:
- name: JSON
  # optional, useful in case of in/out Kamelets
  scope: out
  schema:
    mediaType: "application/json"
    data: # the JSON schema inline
    url: # alternative link to the shema
    ref: # alternative Kubernetes reference to the schema (see below)
      name: # ...
  # the source produces JSON by default, no libs or transformations needed

- name: Avro
  schema:
    type: avro-schema
    mediaType: "application/avro"
    data: # the avro schema inline
    url: # alternative link to the schema
    ref: # alternative Kubernetes reference to the schema (see below)
      name: # ...
  dataFormat:
    # optional, but if not provided "no format" is assumed
    id: "avro"
    properties: # only if "id" is present
      class-name: org.apache.camel.xxx.MyClass
      compute-schema: true|false
      # ...
    dependencies:
    - camel:jackson
    - camel:avro
    - mvn:org.acme/my-artifact/1.0.0

You can notice the scope property that allows to define the specific details of transformations for input and output of a particular format. I'd not complicate life and assume that users will choose only 1 format using the standard format property (not an inputFormat and outputFormat). So if I choose CSV, the Kamelet will consume and produce CSV. Anyway, the shape (schema) of the input CSV can be different from the one of the output CSV (and that's described in the Kamelet).

The schema here is declared inline in the Kamelet, to make it self-contained, but we can create also a Schema CR:

kind: Schema
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: my-avro-schema
spec:
  type: avro-schema
  mediaType: application/avro
  data: # the avro schema inline
  url: # alternative URL reference
  # no, ref is forbidden here

Structure is almost the same as the inline version.

The binding can use the predefined schema:

kind: KameletBinding
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-to-channel
spec:
  source:
    kind: Kamelet
    apiVersion: camel.apache.org/v1alpha1
    name: chuck-source
    properties:
      # may have been omitted, since it's the default
      format: JSON
  sink:
    # ...

The binding above will produce objects in JSON format with the inline definition of the schema. The one below is using a custom schema:

kind: KameletBinding
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-to-channel
spec:
  source:
    kind: Kamelet
    apiVersion: camel.apache.org/v1alpha1
    name: chuck-source
    properties:
      # since there's no inline format named "my-avro", it refers to the external one
      format: Avro
    schema:
      # since it's a source, we assume this is the schema of the output
      ref:
        name: my-avro-schema
      # or alternatively also inline
      data: #...
      url: # ...
  sink:
    # ...

This mechanism may be used also in cases where the schema can be computed dynamically before running the integration. In this case, an external entity saves the schema in a CR and references it in the KameletBinding.

For the use case of using the Schema CR to sync external entities (like registries), it's possible, but we should think more about that because of edge cases: sometimes the schema is known only at runtime and sometimes it varies from message to message. In that cases, it's the integration itself that needs to update the registries. Probably it would be cleaner if it's the integration that always updates the registry.

lburgazzoli · 2021-02-04T09:36:18Z

Let's do another iteration on this...

I'm thinking to your comments and I like the idea of having stuff also as CRs. I remember some brainstorming with @lburgazzoli about how dynamic schemas may work in this model. The idea was to let Kamelets define their schemas, if known in advance, but also let KameletBindings redefine them, if needed.

DataFormats are generic in Camel, but when talking about connectors (a.k.a. Kamelets), I think it's better for the Kamelet to enumerate all the possible dataformats it supports. E.g. @davsclaus was talking about sources that can only produce binary data (i.e. no dataformat), but there are many other examples: e.g. a "hello world" string cannot be transformed into FHIR data by simply plugging the FHIR JSON dataformat, as well as not all data is suitable for CSV encoding..

I also see that we're talking about formats and schemas as if they were the same thing, but even if they are related (i.e. dataFormat + Kamelet [+ Binding Properties] may imply a Schema), maybe we can do a better job in treating them as separate entities.

I think the following model may be good for the in-Kamelet specification of a "format":
kind: Kamelet
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-source
# ... 
spec:
  definition:
    properties:
      format:
        title: Format
        type: string
        enum:
        - JSON
        - Avro
        default: JSON
# ... 
formats:
- name: JSON
  # optional, useful in case of in/out Kamelets
  scope: out
  schema:
    mediaType: "application/json"
    data: # the JSON schema inline
    url: # alternative link to the shema
    ref: # alternative Kubernetes reference to the schema (see below)
      name: # ...
  # the source produces JSON by default, no libs or transformations needed

- name: Avro
  schema:
    type: avro-schema
    mediaType: "application/avro"
    data: # the avro schema inline
    url: # alternative link to the schema
    ref: # alternative Kubernetes reference to the schema (see below)
      name: # ...
  dataFormat:
    # optional, but if not provided "no format" is assumed
    id: "avro"
    properties: # only if "id" is present
      class-name: org.apache.camel.xxx.MyClass
      compute-schema: true|false
      # ...
    dependencies:
    - camel:jackson
    - camel:avro
    - mvn:org.acme/my-artifact/1.0.0
You can notice the scope property that allows to define the specific details of transformations for input and output of a particular format. I'd not complicate life and assume that users will choose only 1 format using the standard format property (not an inputFormat and outputFormat). So if I choose CSV, the Kamelet will consume and produce CSV. Anyway, the shape (schema) of the input CSV can be different from the one of the output CSV (and that's described in the Kamelet).

I think we could also have a case where we want the data format to automatically compute the schema i.e. from a pojo, so basically a formats whiteout the schema section.

The schema here is declared inline in the Kamelet, to make it self-contained, but we can create also a Schema CR:
kind: Schema
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: my-avro-schema
spec:
  type: avro-schema
  mediaType: application/avro
  data: # the avro schema inline
  url: # alternative URL reference
  # no, ref is forbidden here
Structure is almost the same as the inline version.

The binding can use the predefined schema:
kind: KameletBinding
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-to-channel
spec:
  source:
    kind: Kamelet
    apiVersion: camel.apache.org/v1alpha1
    name: chuck-source
    properties:
      # may have been omitted, since it's the default
      format: JSON
  sink:
    # ...
The binding above will produce objects in JSON format with the inline definition of the schema. The one below is using a custom schema:
kind: KameletBinding
apiVersion: camel.apache.org/v1alpha1
metadata:
  name: chuck-to-channel
spec:
  source:
    kind: Kamelet
    apiVersion: camel.apache.org/v1alpha1
    name: chuck-source
    properties:
      # since there's no inline format named "my-avro", it refers to the external one
      format: Avro
    schema:
      # since it's a source, we assume this is the schema of the output
      ref:
        name: my-avro-schema
      # or alternatively also inline
      data: #...
      url: # ...
  sink:
    # ...
This mechanism may be used also in cases where the schema can be computed dynamically before running the integration. In this case, an external entity saves the schema in a CR and references it in the KameletBinding.

For the use case of using the Schema CR to sync external entities (like registries), it's possible, but we should think more about that because of edge cases: sometimes the schema is known only at runtime and sometimes it varies from message to message. In that cases, it's the integration itself that needs to update the registries. Probably it would be cleaner if it's the integration that always updates the registry.

Yep, we don't need to publish each schema up-front but for pre-computed scheme (either because they are known at runtime or because they are computed before running the integration), we should store them as CR so other can eventually consume them.

lburgazzoli · 2021-02-04T09:41:46Z

I guess there may be some confusion from a user pov as you can define multiple in and multiple out schemes, how do we validate that ? Having an in/out formats separations would allow to define such semantic and validation, at CRD level.

This may also work the other way around, if an external tool creates a CR with the schema, then camel-k can consume it without the need to generate it.

But I agree, this is low priority.

nicolaferraro · 2021-02-04T10:10:28Z

Yeah, the schema associated inline with the format was intended to be optional, present only if known in advance.

When we think about sources I think there's no confusion: user chooses a format during binding and that's it. In general, we should validate there's only one dataFormat per <scope, format> pair.

The problem arises when you think to sinks: a Telegram sink may accept an image, a video, a text or a structured JSON.
We can let the user choose the format at binding time for the moment. But we can also think in the future to allow selecting all of them, or just a subset. But in general a sink may have multiple input types, that can be disambiguated at runtime via the mediaType.

- Introduce data type converters - Add data type processor to auto convert exchange message from/to given data type - Let user choose which data type to use (via Kamelet property) - Add data type registry and annotation based loader to find data type implementations by component scheme and name Relates to CAMEL-18698 and apache/camel-k#1980

- Enable service discovery for data type converter resources in order to enable factory finder mechanism when resolving data type implementations - Add proper quarkus-maven-plugin build time properties for quarkus.camel.* properties - Fixes the way camel-quarkus build time properties are set (set properties on the quarkus-maven-plugin instead of using generic Maven system properties) - Explicitly add quarkus.camel.service.discovery.include-patterns for data type converter resources in order to enable lazy loading of Kamelets data type implementations Relates to apache#1980

- Enable service discovery for data type converter resources in order to enable factory finder mechanism when resolving data type implementations - Add proper quarkus-maven-plugin build time properties for quarkus.camel.* properties - Fixes the way camel-quarkus build time properties are set (set properties on the quarkus-maven-plugin instead of using generic Maven system properties) - Explicitly add quarkus.camel.service.discovery.include-patterns for data type converter resources in order to enable lazy loading of Kamelets data type implementations Relates to #1980

- Adds input/output/error data type spec to Kamelet CRD. The data type specifications provides additional information to the user such as what kind of input is required to use a Kamelet and what output is produced by the Kamelet. - The data type specifications can be used by tooling and data type converters to improve the overall usability of Kamelets

- Adds input/output/error data type spec to Kamelet CRD. The data type specifications provides additional information to the user such as what kind of input is required to use a Kamelet and what output is produced by the Kamelet. - The data type specifications can be used by tooling and data type converters to improve the overall usability of Kamelets - Deprecate former types field and the EventTypeSpec

- Support data type reference in KameletBinding that automatically adds data type action Kamelet to the resulting integration template flow - Allow the user to specify the data types for output/input on Kamelet references in a binding - Camel K automatically adds respective steps (using the data-type-action Kamelet) in order to apply the data type conversion logic

- Support data type reference in KameletBinding that automatically adds data type action Kamelet to the resulting integration template flow - Allow the user to specify the data types for output/input on Kamelet references in a binding - Camel K automatically adds respective steps (using the data-type-action Kamelet) in order to apply the data type conversion logic - Update YAKS 0.14.3

- Adds input/output/error data type spec to Kamelet CRD. The data type specifications provides additional information to the user such as what kind of input is required to use a Kamelet and what output is produced by the Kamelet. - The data type specifications can be used by tooling and data type converters to improve the overall usability of Kamelets - Deprecate former types field and the EventTypeSpec

- Support data type reference in KameletBinding that automatically adds data type action Kamelet to the resulting integration template flow - Allow the user to specify the data types for output/input on Kamelet references in a binding - Camel K automatically adds respective steps (using the data-type-action Kamelet) in order to apply the data type conversion logic - Update YAKS 0.14.3

nicolaferraro added the kind/feature New feature or request label Feb 1, 2021

nicolaferraro added this to the 1.4.0 milestone Feb 1, 2021

astefanutti added the area/kamelets label Feb 2, 2021

nicolaferraro assigned nicolaferraro and unassigned nicolaferraro Feb 3, 2021

nicolaferraro modified the milestones: 1.4.0, 1.5.0 Mar 16, 2021

nicolaferraro modified the milestones: 1.5.0, 1.6.0 Jul 5, 2021

nicolaferraro modified the milestones: 1.6.0, 1.7.0 Sep 7, 2021

nicolaferraro modified the milestones: 1.7.0, 1.8.0 Nov 15, 2021

This was referenced Nov 24, 2022

Add data type converter factory finder discovery in camel-quarkus #3844

Closed

Add DataTypeRegistry as bean in Camel context #3845

Closed

christophd mentioned this issue Dec 15, 2022

fix(#3844): Add service discovery for Kamelet data type converter #3912

Merged

christophd mentioned this issue Feb 1, 2023

feat: Camel K 2023 roadmap apache/camel-website#971

Merged

christophd mentioned this issue Mar 3, 2023

fix(#1980): Support data types in Kamelets #4100

Merged

squakez added the status/wip Work in progress label Mar 6, 2023

squakez closed this as completed in #4100 Mar 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for multiple data types and schemas in Kamelets #1980

Add support for multiple data types and schemas in Kamelets #1980

nicolaferraro commented Feb 1, 2021

lburgazzoli commented Feb 1, 2021

heiko-braun commented Feb 2, 2021

nicolaferraro commented Feb 2, 2021

nicolaferraro commented Feb 2, 2021

astefanutti commented Feb 2, 2021

lburgazzoli commented Feb 2, 2021 •

edited

Loading

lburgazzoli commented Feb 2, 2021

astefanutti commented Feb 2, 2021

lburgazzoli commented Feb 2, 2021

squakez commented Feb 2, 2021

davsclaus commented Feb 3, 2021

davsclaus commented Feb 3, 2021

lburgazzoli commented Feb 3, 2021

nicolaferraro commented Feb 4, 2021

lburgazzoli commented Feb 4, 2021

lburgazzoli commented Feb 4, 2021 •

edited

Loading

nicolaferraro commented Feb 4, 2021

Add support for multiple data types and schemas in Kamelets #1980

Add support for multiple data types and schemas in Kamelets #1980

Comments

nicolaferraro commented Feb 1, 2021

lburgazzoli commented Feb 1, 2021

heiko-braun commented Feb 2, 2021

nicolaferraro commented Feb 2, 2021

nicolaferraro commented Feb 2, 2021

astefanutti commented Feb 2, 2021

lburgazzoli commented Feb 2, 2021 • edited Loading

lburgazzoli commented Feb 2, 2021

astefanutti commented Feb 2, 2021

lburgazzoli commented Feb 2, 2021

squakez commented Feb 2, 2021

davsclaus commented Feb 3, 2021

davsclaus commented Feb 3, 2021

lburgazzoli commented Feb 3, 2021

nicolaferraro commented Feb 4, 2021

lburgazzoli commented Feb 4, 2021

lburgazzoli commented Feb 4, 2021 • edited Loading

nicolaferraro commented Feb 4, 2021

lburgazzoli commented Feb 2, 2021 •

edited

Loading

lburgazzoli commented Feb 4, 2021 •

edited

Loading