The change in #1593 made the marquez-api jar incompatible with code that depended on the LineageEvent class and its related classes. Any code that depended on those models must now be rewritten to rely on the OpenLineage.* models, which have a very different construction model and therefore require a major rewrite effort.
Moreover, the current OpenLineage API has introduced new fields in the InputDataset and OutputDataset models that were never present in the Marquez implementation of the OpenLineage models. The LineageEvent model is annotated with @JsonIgnoreProperties, so any new fields in the JSON are silently dropped during deserialization. Simply reverting the LineageEvent models would therefore leave the Marquez backend incompatible with the new OpenLineage models, because new facets would be dropped from the model before it is stored.
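To make the data-loss problem concrete, here is a minimal, dependency-free sketch. It does not use Jackson or the real Marquez classes: a plain Map stands in for the parsed JSON, and StrictDataset, LossyRoundTripDemo, and the sample values are hypothetical names chosen for illustration. The inputFacets key is just an example of a field the model does not declare.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a dataset model that declares only two fields.
final class StrictDataset {
    final String namespace;
    final String name;

    StrictDataset(String namespace, String name) {
        this.namespace = namespace;
        this.name = name;
    }

    // Mimics deserialization with unknown properties ignored:
    // only declared fields are read, everything else is silently skipped.
    static StrictDataset fromJson(Map<String, Object> json) {
        return new StrictDataset((String) json.get("namespace"), (String) json.get("name"));
    }

    // Mimics serialization: only declared fields can be written back.
    Map<String, Object> toJson() {
        Map<String, Object> out = new LinkedHashMap<>();
        out.put("namespace", namespace);
        out.put("name", name);
        return out;
    }
}

final class LossyRoundTripDemo {
    // Parse and re-serialize, as a store-then-read cycle would.
    static Map<String, Object> roundTrip(Map<String, Object> json) {
        return StrictDataset.fromJson(json).toJson();
    }

    public static void main(String[] args) {
        Map<String, Object> incoming = new LinkedHashMap<>();
        incoming.put("namespace", "db");
        incoming.put("name", "orders");
        incoming.put("inputFacets", Map.of("rows", 42)); // not declared by the model

        // prints "false": the undeclared field did not survive the round trip
        System.out.println(roundTrip(incoming).containsKey("inputFacets"));
    }
}
```

The known fields survive, but anything the model does not declare is gone by the time the event is written back out.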
I think we should revert #1593 and alter the models to support unknown fields. Some options for this are:
1. Add a Map<String, Object> field annotated with @JsonAnySetter so that any unknown fields are added to the map rather than dropped.
This requires little work up front and offers backward and forward compatibility, as any unknown fields are automatically supported. There is some maintainability concern, as we need to update the Marquez model alongside the OL one.
2. Extend or wrap (using @JsonUnwrapped) Jackson's ObjectNode so that objects are automatically deserialized into JsonNodes, with setters and getters written to work with the expected properties through a compatible API.
This is the most up-front work, but offers the most compatibility and the least maintenance. Each model is backward and forward compatible with any event POSTed and will always be serialized back into an exact replica of the original event. Accessor methods must be hand-written to replace the lombok-generated ones in order to maintain API compatibility.
3. Wrap the new OpenLineage model classes with the existing Marquez models.
This provides the binary compatibility we need while avoiding the maintenance issue of synchronizing the Marquez models with the OpenLineage ones. The payload would always be deserialized into OpenLineage models (so we can receive and store the data even if the Marquez model is never updated). However, we still need to maintain the compatibility layer (the accessor methods), and we are still limited to the fields defined in the version of the OL library deployed with Marquez. Moreover, the OL API for constructing events is cumbersome to use in a case like this. Each model class must be instantiated by an instance of the OpenLineage class, which is itself instantiated with the appropriate producer field. Thus, we can't simply instantiate a new Job or JobFacet and expect the accompanying OpenLineage.Job or OpenLineage.JobFacets class to be instantiated, as there needs to be a shared OpenLineage instance to actually create the instances. This is easy enough to accomplish for model instances that are created purely from Marquez (e.g., with a static utility instance), but it makes it very difficult to build a processing workflow, such as one that clones a model and adds a new facet (while maintaining the original models' producer fields) before handing off to another processor.
4. Write a custom deserializer to automatically add the raw JSON string to the LineageEvent object.
This is the least work and solves the most immediate problem: the data serialized and stored in the lineage_events table is incomplete. However, it makes it impossible to process objects that have unknown fields. For example, a workflow that copies a LineageEvent and adds another facet to the Run before passing it on to storage or another processor would immediately lose information. It also offers no additional maintainability support, as the Marquez models must always be updated to stay in sync with the OL models.
Of the four options, the first offers the most compatibility and flexibility, preserving forward and backward compatibility with a relatively low maintenance burden.
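As a rough illustration of the catch-all map approach, the sketch below keeps undeclared properties in a map and writes them back on serialization. It is dependency-free and only mimics the behavior: a plain Map stands in for parsed JSON, and FlexibleDataset, setUnknown, and AnySetterDemo are hypothetical names. With Jackson, the setUnknown method is where @JsonAnySetter would go, and a matching @JsonAnyGetter on the map accessor would write the extra properties back out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: two declared fields plus a catch-all map.
final class FlexibleDataset {
    String namespace;
    String name;

    // Holds every property the model does not declare.
    final Map<String, Object> additionalProperties = new LinkedHashMap<>();

    // With Jackson, this two-argument method would carry @JsonAnySetter.
    void setUnknown(String key, Object value) {
        additionalProperties.put(key, value);
    }

    // Mimics deserialization: known keys go to fields, the rest to the map.
    static FlexibleDataset fromJson(Map<String, Object> json) {
        FlexibleDataset d = new FlexibleDataset();
        for (Map.Entry<String, Object> e : json.entrySet()) {
            switch (e.getKey()) {
                case "namespace":
                    d.namespace = (String) e.getValue();
                    break;
                case "name":
                    d.name = (String) e.getValue();
                    break;
                default:
                    d.setUnknown(e.getKey(), e.getValue());
            }
        }
        return d;
    }

    // Mimics serialization: declared fields first, then the captured extras.
    Map<String, Object> toJson() {
        Map<String, Object> out = new LinkedHashMap<>();
        out.put("namespace", namespace);
        out.put("name", name);
        out.putAll(additionalProperties);
        return out;
    }
}

final class AnySetterDemo {
    static Map<String, Object> roundTrip(Map<String, Object> json) {
        return FlexibleDataset.fromJson(json).toJson();
    }

    public static void main(String[] args) {
        Map<String, Object> incoming = new LinkedHashMap<>();
        incoming.put("namespace", "db");
        incoming.put("name", "orders");
        incoming.put("inputFacets", Map.of("rows", 42)); // not declared by the model

        // prints "true": the round trip preserves the undeclared field
        System.out.println(roundTrip(incoming).equals(incoming));
    }
}
```

A processor that copies such an object and adds a facet keeps whatever it did not understand, since the extras travel with the model instead of being discarded at parse time.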
Thanks for the great write-up, @collado-mike. Whatever approach we go with, I think Marquez should eventually use the OpenLineage server-specific models defined for consumption; see OpenLineage/OpenLineage#67. That said, I'd favor option 1. Using Map<String, Object> to capture any additional properties that are not part of the core OpenLineage RunEvent class gives us enough flexibility to access facets. For option 2, I'd like to avoid hand-written methods or classes in favor of classes generated by OpenLineage, and the same goes for the remaining options.