Lineage Stage 0 #24

Closed
wants to merge 21 commits into from
11 changes: 11 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,11 @@
# Codeowners file by GitHub
# Reference: https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/about-code-owners
# Each line is a file pattern followed by one or more owners.
# Order is important; the last matching pattern takes the most
# precedence.

# These owners will be the default owners for everything in
# the repo. Unless a later match takes precedence,
# @amundsen-io/amundsen-committers will be requested for
# review when someone opens a pull request.
* @amundsen-io/amundsen-committers
84 changes: 84 additions & 0 deletions rfcs/000-abstract-serializable-for-relational-db.md
@@ -0,0 +1,84 @@
- Feature Name: Add abstract serializable for relational databases in databuilder models
- Start Date: 2021-01-21
- RFC PR: [amundsen-io/rfcs#21](https://github.com/amundsen-io/rfcs/pull/21)
- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000) (leave this empty for now)

# Add abstract serializable for relational databases in databuilder models

## Summary

This RFC adds a new abstract serializable to databuilder models so that metadata can be loaded into relational databases. It
brings two major changes.

- Add a new abstraction layer that acts as the serializable and generates the next record.
- Add a record iterator, and a function that returns the next record instance built with [Amundsenrds](https://github.com/amundsen-io/amundsenrds) models, to the applicable [databuilder models](https://github.com/amundsen-io/amundsendatabuilder/tree/master/databuilder/models).

## Motivation

Currently, the databuilder models only work for graph databases, through node and relation iterators. To support a relational database as a backend store
in Amundsen, this RFC aims to make the databuilder model a centralized model that works for both graph and relational databases.
Amundsenrds contains the ORM models for relational databases. The applicable databuilder models would inherit the new abstract relational DB serializable class
to build a record iterator and yield record instances using Amundsenrds.

## Guide-level Explanation (aka Product Details)

The new abstraction layer `TableSerializable` will work similarly to the current `GraphSerializable`. It will declare an abstract method that applicable databuilder models implement to provide a record iterator.
The record objects produced through `TableSerializable` can then be extracted by the future relational DB data loader.
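
As a rough illustration of the idea above, the sketch below shows one way such an abstraction layer could look. It is not the proposed implementation: `TableRecord` is a hypothetical stand-in for an Amundsenrds ORM model, and `ExampleTableModel` and its `_create_record_iterator()` helper are invented for the example. Only the names `TableSerializable`, `create_next_record()`, and `next_record()` come from this RFC.

```python
import abc
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class TableRecord:
    """Hypothetical stand-in for an Amundsenrds ORM model."""
    rk: str
    name: str


class TableSerializable(abc.ABC):
    """Abstraction layer for models that can emit relational records."""

    @abc.abstractmethod
    def create_next_record(self) -> Optional[TableRecord]:
        """Return the next record, or None when no records remain."""

    def next_record(self) -> Optional[TableRecord]:
        # Thin wrapper so a loader only needs to call next_record().
        return self.create_next_record()


class ExampleTableModel(TableSerializable):
    """Hypothetical databuilder model that emits a single table record."""

    def __init__(self, key: str, name: str) -> None:
        self.key = key
        self.name = name
        # The record iterator is created once, at initialization time.
        self._record_iterator = self._create_record_iterator()

    def _create_record_iterator(self) -> Iterator[TableRecord]:
        yield TableRecord(rk=self.key, name=self.name)

    def create_next_record(self) -> Optional[TableRecord]:
        # next() with a default returns None once the iterator is exhausted.
        return next(self._record_iterator, None)
```

A loader would call `next_record()` repeatedly until it returns `None`, which mirrors how the node and relation iterators are consumed today.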

## UI/UX-level Explanation

N/A

## Reference-level Explanation (aka Technical Details)

(1). `TableSerializable`: It mainly contains the methods `create_next_record()` and `next_record()`.

(2). `record_iterator`: The applicable databuilder models will inherit `TableSerializable` as well as the existing `GraphSerializable`.
We will treat the current databuilder model as a centralized model with a `record_iterator` created in its initialization function. The next method
for `record_iterator` will yield record instances by calling models in Amundsenrds.
For example (a partial change in `_create_next_record()` of [TableMetadata](https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/models/table_metadata.py)):

```python
def _create_next_record(self):
    # Omitted: code snippet yielding the Database, Cluster, and Schema record instances.

    # Table
    yield RDBModels.Table(
        rk=self._get_table_key(),
        name=self.name,
        is_view=self.is_view,
        schema_rk=self._get_schema_key()
    )

    # Omitted: code snippet yielding the rest.
```
Note:
Because of the foreign key constraints among the ORM models, record instances must be yielded in topological order
(e.g., in TableMetadata, yield metadata in the order Database -> Cluster -> Schema -> Table -> Column).
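
The ordering requirement can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: the `Record` dataclass stands in for the Amundsenrds ORM models, the key formats are made up, and `respects_foreign_keys()` is only a toy check that a parent key has been emitted before any record that refers to it.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical record: `parent_rk` plays the role of a foreign key."""
    kind: str
    rk: str
    parent_rk: Optional[str] = None


def table_metadata_records(database: str, cluster: str, schema: str,
                           table: str, columns: List[str]) -> Iterator[Record]:
    """Yield records parent-first: Database, Cluster, Schema, Table, Column."""
    db_rk = f"database://{database}"
    yield Record("Database", db_rk)
    cluster_rk = f"{database}://{cluster}"
    yield Record("Cluster", cluster_rk, db_rk)
    schema_rk = f"{cluster_rk}.{schema}"
    yield Record("Schema", schema_rk, cluster_rk)
    table_rk = f"{schema_rk}/{table}"
    yield Record("Table", table_rk, schema_rk)
    for column in columns:
        yield Record("Column", f"{table_rk}/{column}", table_rk)


def respects_foreign_keys(records: Iterable[Record]) -> bool:
    """Return True if every parent key appears before the records that use it."""
    seen = set()
    for record in records:
        if record.parent_rk is not None and record.parent_rk not in seen:
            return False
        seen.add(record.rk)
    return True
```

Emitting the same records in reverse order fails the check, which is the situation a relational loader would hit as a foreign key violation.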

## Drawbacks

For some property changes in a databuilder model, such as a data type change, we have to take a potential schema update in
Amundsenrds into consideration. The schema migration will rely on Alembic upgrade/downgrade migrations.

## Alternatives

Instead of adding a new abstract `TableSerializable`, update the current `GraphSerializable` into a general serializable containing
node, relation, and record method definitions. The concern is that some databuilder models are not applicable to relational DBs, like
[Neo4jESLastUpdated](https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/models/neo4j_es_last_updated.py). For those, and
for future models that will only be pushed to a graph db, adding a dummy record iterator would be redundant.

## Prior art

N/A

## Unresolved questions

N/A

## Future possibilities

With the new serializable abstraction layer in databuilder models, we will add a new serializer, data loader, and publisher
to support a specific relational database, such as MySQL, as the metadata store.
180 changes: 180 additions & 0 deletions rfcs/000-lineage-stage-0.md
@@ -0,0 +1,180 @@
- Feature Name: lineage_stage_0
- Start Date: 2021-02-22
- RFC PR: [amundsen-io/rfcs#24](https://github.com/amundsen-io/rfcs/pull/24)
- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000)
# Amundsen Lineage - Stage 0

## Summary


Currently, Amundsen has no way of surfacing lineage information for tables and columns. The idea for this first iteration is to show upstream and downstream tables and columns to users through the Table Details page, so they can explore the current resource's lineage and navigate to related resources in Amundsen.
The first iteration is meant to be a fast implementation of the feature that we can get feedback on and improve in future iterations.

## Motivation

Lineage is essential to improving data discovery in Amundsen because it allows users to know where the data for a given resource is coming from as well as where this data is used downstream.


## Guide-level Explanation (aka Product Details)

> Explain the proposal as if it was already included in Amundsen and you were teaching it to an Amundsen user. That generally means:

> Introducing new named concepts.
> Explaining the feature largely in terms of examples.
> Explaining how Amundsen users should think about the feature, and how it should impact the way they use Amundsen. It should explain the impact as concretely as possible.
> If applicable, provide deprecation warnings, or migration guidance.
> For implementation-oriented RFCs, this section should focus on how maintainers should think about the change, and give examples of its concrete impact. For policy RFCs, this section should provide an example-driven introduction to the policy, and explain its impact in concrete terms.

### New Concepts
- Lineage:
- Upstream:
- Downstream:

Those implementing Amundsen should keep in mind that this feature is meant to give them a way to surface their existing lineage data by calling the service that contains that data from the metadata service. This iteration won't provide a model to persist lineage in neo4j, but rather a gateway to lineage data so it can be included in lineage API responses and displayed in the frontend. It is also important to understand that the feature will be disabled by default and can be enabled through configuration.


## UI/UX-level Explanation

> Explain the UI changes that your proposal would need (if applicable). This could mean:

> Provide a high level diagram or wireframe about the change.
> If applicable, suggest error, empty and loading states for the change.

## Reference-level Explanation (aka Technical Details)
### Architecture

![Lineage Stage 0 Architecture](assets/lineage_arch.png)
Implementing this feature will require defining a Lineage API on the metadata service for Tables and Columns. When the API is called, it will make calls to neo4j and to whatever the source of lineage data is. An interface needs to be created to interact with an implementer's lineage service in a generic way. The data from the calls to these services will be put together to form the lineage response as defined below.
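
One possible shape for that interface is sketched below. It is an illustration only, not the proposed API: `BaseLineageClient`, `InMemoryLineageClient`, `build_table_lineage_response()`, and the plain `metadata` dictionary (standing in for the table details that would come from neo4j) are all hypothetical names. The response keys follow the table lineage response described in this RFC.

```python
import abc
from typing import Dict, List


class BaseLineageClient(abc.ABC):
    """Hypothetical generic interface to an implementer's lineage service."""

    @abc.abstractmethod
    def get_table_lineage(self, table_key: str, direction: str,
                          depth: int) -> List[Dict]:
        """Return entities shaped like {"table": <key>, "level": <int>}."""


class InMemoryLineageClient(BaseLineageClient):
    """Toy implementation backed by a dict of table -> direct upstream tables."""

    def __init__(self, upstream_edges: Dict[str, List[str]]) -> None:
        self._upstream = upstream_edges

    def _neighbors(self, key: str, direction: str) -> List[str]:
        if direction == "upstream":
            return self._upstream.get(key, [])
        # Downstream tables are those that list `key` as an upstream.
        return [k for k, ups in self._upstream.items() if key in ups]

    def get_table_lineage(self, table_key: str, direction: str,
                          depth: int) -> List[Dict]:
        entities: List[Dict] = []
        seen = {table_key}
        frontier = [table_key]
        # Breadth-first walk, one level per iteration, up to `depth` levels.
        for level in range(1, depth + 1):
            next_frontier = []
            for key in frontier:
                for neighbor in self._neighbors(key, direction):
                    if neighbor not in seen:
                        seen.add(neighbor)
                        entities.append({"table": neighbor, "level": level})
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return entities


def build_table_lineage_response(client: BaseLineageClient,
                                 metadata: Dict[str, Dict],
                                 table_key: str, direction: str,
                                 depth: int) -> Dict:
    """Merge lineage entities with per-table metadata (source, badges, usage)."""
    entities = [
        {**entity, **metadata.get(entity["table"], {})}
        for entity in client.get_table_lineage(table_key, direction, depth)
    ]
    return {
        "key": table_key,
        "direction": direction,
        "lineage_entities_upstream": entities if direction == "upstream" else [],
        "lineage_entities_downstream": entities if direction == "downstream" else [],
    }
```

Keeping the lineage lookup behind an abstract class means the metadata service only depends on the method signature, so each implementer can plug in a client for their own lineage store.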

### Backend Implementation

#### Table Lineage API

_The table details page must list X levels of downstream and upstream datasets, showing each dataset's name, level, source (database), and badges, on the DOWNSTREAM and UPSTREAM tabs. These datasets should also be sortable by usage._

When the user clicks the DOWNSTREAM or UPSTREAM tab on the table details page, one of two requests to metadata will be executed, containing the lineage direction (upstream/downstream) and depth (levels):

```https://amundsenmetadata.com/table/current_table_key/lineage?direction=upstream&depth=1```

OR

```https://amundsenmetadata.com/table/current_table_key/lineage?direction=downstream&depth=1```

The lineage call will then return a response:

```
{
  "key": "current_table_key",
  "direction": "upstream",
  "lineage_entities_upstream": [
    {
      "table": "table_key1",
      "level": 1,
      "source": "hive",
      "badges": ["coco", "beta"],
      "usage": 234
    },
    ...
  ],
  "lineage_entities_downstream": []
}
```

OR

```
{
  "key": "current_table_key",
  "direction": "downstream",
  "lineage_entities_upstream": [],
  "lineage_entities_downstream": [
    {
      "table": "table_key2",
      "level": 1,
      "source": "hive",
      "badges": [],
      "usage": 45
    },
    ...
  ]
}
```
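
The requirement that datasets be sortable by usage could be met on the consumer side with something like the following sketch. The helper name `sort_lineage_by_usage()` and the sample payload are hypothetical. Treating a missing `usage` value as 0 is also an assumption, not something this RFC specifies.

```python
import json
from typing import Dict, List

# Hypothetical payload following the table lineage response shape above.
SAMPLE_RESPONSE = json.loads("""
{
  "key": "current_table_key",
  "direction": "upstream",
  "lineage_entities_upstream": [
    {"table": "table_key1", "level": 1, "source": "hive", "badges": [], "usage": 45},
    {"table": "table_key2", "level": 1, "source": "hive", "badges": ["beta"], "usage": 234},
    {"table": "table_key3", "level": 1, "source": "hive", "badges": []}
  ],
  "lineage_entities_downstream": []
}
""")


def sort_lineage_by_usage(response: Dict, descending: bool = True) -> List[Dict]:
    """Return the entities for the response's direction, ordered by usage."""
    entities = response[f"lineage_entities_{response['direction']}"]
    # Entities without a usage count are treated as usage 0 (an assumption).
    return sorted(entities, key=lambda e: e.get("usage", 0), reverse=descending)


most_used_first = sort_lineage_by_usage(SAMPLE_RESPONSE)
```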

#### Column Lineage API
_The expanded view of a column in the table details page must display lists of upstream and downstream columns at the same time._

When the user expands a column to see more details, a request to metadata will be executed as follows:


```https://amundsenmetadata.com/table/current_table_key/column/column_name/lineage?direction=both&depth=1```


and the lineage call will return a response:

```
{
  "key": "current_table_key/current_column_name",
  "direction": "both",
  "lineage_entities_upstream": [
    {
      "key": "table_key1/column_name1",
      "level": 1,
      "source": "hive",
      "usage": 234
    },
    ...
  ],
  "lineage_entities_downstream": [
    {
      "key": "table_key2/column_name2",
      "level": 1,
      "source": "hive",
      "usage": 45
    },
    ...
  ]
}
```
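
For reference, the table and column lineage request URLs shown in this section could be assembled by a small client helper like the sketch below. The function names and the direction validation are hypothetical, and the keys are inserted into the path verbatim, as in the example URLs. Whether keys need escaping is left open here.

```python
from urllib.parse import urlencode

# Directions that appear in the example requests in this RFC.
VALID_DIRECTIONS = {"upstream", "downstream", "both"}


def _lineage_query(direction: str, depth: int) -> str:
    if direction not in VALID_DIRECTIONS:
        raise ValueError(f"unknown lineage direction: {direction!r}")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return urlencode({"direction": direction, "depth": depth})


def table_lineage_url(base_url: str, table_key: str,
                      direction: str, depth: int = 1) -> str:
    """Build the table lineage request URL."""
    query = _lineage_query(direction, depth)
    return f"{base_url}/table/{table_key}/lineage?{query}"


def column_lineage_url(base_url: str, table_key: str, column_name: str,
                       direction: str = "both", depth: int = 1) -> str:
    """Build the column lineage request URL."""
    query = _lineage_query(direction, depth)
    return f"{base_url}/table/{table_key}/column/{column_name}/lineage?{query}"
```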


## Drawbacks

> Why should we _not_ do this?

> Please consider:
> Implementation cost, both in terms of code size and complexity
> Integration of this feature with other existing and planned features
> The impact on onboarding and learning about Amundsen
> Cost of migrating existing Amundsen installations (is it a breaking change?)
> If there are tradeoffs to choosing any path. Attempt to identify them here.

## Alternatives

> Why is this design the best in the space of possible designs?
> What other designs have been considered and what is the rationale for not choosing them?
> What is the impact of not doing this?

## Prior art

> Discuss prior art, both the good and the bad, in relation to this proposal. A few examples of what this can include are:

> Does this feature exist in other data search applications and what experience have their community had?
> For community proposals: Is this done by some other community and what were their experiences with it?
> Papers: Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background.

> This section is intended to encourage you as an author to think about the lessons from other projects, provide readers of your RFC with a fuller picture. If there is no prior art, that is fine - your ideas are interesting to us whether they are brand new or if it is an adaptation from other projects.

## Unresolved questions

> What parts of the design do you expect to resolve through the RFC process before this gets merged?
> What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
> What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?

## Future possibilities

> Think about what the natural extension and evolution of your proposal would be and how it would affect the project as a whole in a holistic way. Also consider how this all fits into the roadmap for the project and of the relevant sub-team.
> This is also a good place to "dump ideas", if they are out of scope for the RFC you are writing but otherwise related.
> If you have tried and cannot think of any future possibilities, you may simply state that you cannot think of anything.
- Persist lineage data on neo4j: create extractors for databuilder library to extract the data and publish it
- Implement lineage graph view for better discovery experience
- Introduce a Task entity in Amundsen to surface pipeline tasks between tables and columns and to understand which tasks are responsible for generating tables