Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Builder Record Abstraction #5

72 changes: 72 additions & 0 deletions templates/000-databuilder-record-abstraction.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
- Feature Name: Add a abstraction layer to the databuilder's record model
- Start Date: 2020-10-02
- RFC PR: [amundsen-io/rfcs#5](https://github.com/amundsen-io/rfcs/pull/5)

- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000) (leave this empty for now)

# Add a abstraction layer to the databuilder model

## Summary

Currently the databuilder's records are all subclasses of the class `Neo4jCsvSerializable` and the output of these models are Dictionaries formatted for neo4j. This RFC proposes this coupling is broken and neo4j serialization is removed from the models and moved into the the loader portion.

## Motivation

While SeatGeek was in the process of writing the Neptune integration portion for the databuilder a problem came up of seperating Neo4j from the databuilder's records. The way Neo4j is receives nodes and edges is different from the way Neptune or other graph databases do. So SeatGeek wanted to come up with a process that:
> Maintained feature parity between the datastores.
> Did not add additional overhead for adding new features to the databuilder
> Made it easier to add other data stores in the future.

## Guide-level Explanation (aka Product Details)

This is less of a feature and more of a refactor. For this rfc this section is duplicate of the `Reference-level Explanation` section.

## UI/UX-level Explanation

This would be a backend only change no UI level changes are required.

## Reference-level Explanation (aka Technical Details)

Each of the models in the databuilder are built off of the `GraphSerializable` class. The `GraphSerializable` class is an abstract class that is responsible for validating and returning the nodes and relationships of its subclasses. The interface on the `GraphSerializable` is `next_node` with a return type of `Union[GraphNode, None]` and `next_relation` with a return type of `Union[GraphRelationship, None]`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense


GraphNode would be defined as named tuple with the attributes:
- key: str
- label: str
- attributes: Dict

GraphRelationship would be a named tuple with the attributes:
- start_label: str
- end_label: str
- start_key: str
- end_key: str
- forward_label: str
- reverse_label: str
- attributes: Dict

The loader would be then be responsible for taking these graph nodes and graph relationships and converting them to the format expected by the publisher.

The unpolished changes mentioned above can be found: https://github.com/AndrewCiambrone/amundsendatabuilder/pull/6

## Drawbacks

This adds a level of complexity because it adds an additional layer to the databuilder. Instead of having the neo4j serialization built into the models it becomes a separate process. This also would add a small performance degradation because the models are producing named tuples and then they are being convered into another format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you elaborate a bit or maybe you could create the pr and point me to the place of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah what I was trying to get at here is that there is a slight performance degradation because you are now converting from a named tuple to a dictionary.

https://github.com/amundsen-io/amundsendatabuilder/pull/380/files#diff-1e0a50145a1e195178c2554706e02567df347f84dd1953e2b9308065de371bf7R21

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see, it should be fine.



## Alternatives

1. Duplicate all the models for each datastore but that opens the possibility of missing features in one datastore over the other.

2. Translating from neo4j models into Neptune. However this would require future contributions to remember to make sure the changes are compatible with the neptune translator.

## Prior art

N/A

## Unresolved questions

> What parts of the design do you expect to resolve through the RFC process before this gets merged?
General agreement over the approach

## Future possibilities

This is the first step in introducing other data stores into the databuilder. In the future SeatGeek would like to introduce the other portions of the Neptune integration that they have been working on but since this part of the change is a bit of a overhaul of all the models, its important to get this portion in first.