Lineage Stage 0 #25

Merged
merged 9 commits into from
Mar 9, 2021
Binary file added assets/025/column-lineage-preview.png
Binary file added assets/025/lineage-arch.png
Binary file added assets/025/table-lineage-preview.png
153 changes: 153 additions & 0 deletions rfcs/025-lineage-stage-0.md
@@ -0,0 +1,153 @@
- Feature Name: lineage_stage_0
- Start Date: 2021-02-22
- RFC PR: [amundsen-io/rfcs#25](https://github.com/amundsen-io/rfcs/pull/25)
- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000)
# Amundsen Lineage - Stage 0

## Summary


Currently Amundsen doesn't have a way of surfacing lineage information for tables and columns. The idea for this first iteration is to have a way to show upstream and downstream tables and columns to users through the Table Details page so they can explore the current resource's lineage as well as navigate to related resources in Amundsen.
Member

FYI, we could actually use programmatic descriptions to surface lineage. I think what is lacking is a graph UI to surface the lineage intuitively.

The first iteration is meant to be a fast implementation of the feature that we can get feedback on and improve in future iterations.

## Motivation

Lineage is essential to improving data discovery in Amundsen because it allows users to know where the data for a given resource is coming from as well as where this data is used downstream.


## Guide-level Explanation (aka Product Details)

### New Concepts
- Lineage: Lineage is a term that describes the flow of data from one entity to another. While this term can broadly include everything from services, events, ETLs, and dashboards, we will focus on table-to-table and column-to-column data lineage in this RFC.
- Upstream: Upstream is a relative term that describes data sources from which we inherit. Data flows from upstream to downstream.
- Downstream: Downstream is a relative term that describes data entities which consume our data.

This feature will expose upstream and downstream tables and columns within the `Table Details` page.

Those implementing Amundsen should keep in mind that this feature is meant to provide them with a way to surface their existing lineage data by calling the service containing that data from the metadata service. This iteration won't provide a model to persist lineage on neo4j, but rather a gateway to lineage data so it can be included in lineage API responses and displayed in the frontend. It is also important to understand that the feature will be disabled by default and can be enabled through configuration.
allisonsuarez marked this conversation as resolved.


## UI/UX-level Explanation

![Table Lineage Preview](../assets/025/table-lineage-preview.png)

We will add two additional tabs to the `Table Details` page, `Upstream` and `Downstream`. The `Upstream` tab will list the tables from which the current table inherits data, and the `Downstream` tab will list the tables that consume its data. This allows users to view a table's lineage in a very simple manner.

![Column Lineage Preview](../assets/025/column-lineage-preview.png)
Member

Any reason we don't go with a lineage tab next to the column and dashboard tabs, and then have upstream/downstream subtabs underneath?

I think there's a preference to avoid nested tabs when possible. At some point if we have too many tabs then we can reconsider nesting or other options.

Contributor

For me it feels weird to break it into two tabs. Lineage is, after all, a big picture of data flow, so it would make sense to have it in one place. That would make it easier for the end user to analyze.

Member

Yeah, +1 on @mgorsk1, I felt the same about having a single lineage tab with nested tabs for upstream/downstream (not sure how much complexity that adds to the FE implementation). @danwom @allisonsuarez could you share more details about the preferences?

Member

+1 for not using the nested tabs where possible.

For Stage 0, it will be a list of items for each direction, which IMHO is okay to be placed in different tabs. But yes, for the next milestone, where we'll have a graph/chart of the complete lineage (and not the list of tables), that must be placed in one view under 1 tab to have a clear view of the complete lineage.

Contributor

@mgorsk1 mgorsk1 Mar 4, 2021

I am not really advocating for nested tabs, as this sounds more like an implementation preference than a UX question, and I'm fine avoiding it if we don't like nesting.

But is nesting the only way this could be implemented in a single tab? I could even see something like a single list (sorted by upstream and then downstream) with table names and an icon/abbreviation showing whether the table is upstream or downstream.


Additionally we will add lineage information at the column level, viewable by expanding column metadata.

These features will only appear when the lineage feature is enabled.

## Reference-level Explanation (aka Technical Details)
### Architecture

![Lineage Stage 0 Architecture](../assets/025/lineage-arch.png)

Implementing this feature will require defining a Lineage API on the metadata service for Tables and Columns. When the API is called, it will make calls to neo4j and to whatever the source of lineage data is. An interface needs to be created to interact with an implementer's lineage service in a generic way. The data from the calls to these services will be put together to form the lineage response as defined below.
Member

Would this generic API call be made to any of the metadata proxies? or is this simply a configurable function which users will be able to extend themselves?
Suppose a user has a proxy set to Atlas. For Stage 0, will this endpoint call a function inside the atlas_proxy?

Contributor Author

Yeah on the Base Proxy class I added a get_lineage method that can be implemented and will be called for any proxy when the endpoint is hit https://github.com/amundsen-io/amundsenmetadatalibrary/blob/master/metadata_service/proxy/base_proxy.py#L160

Contributor

@mgorsk1 mgorsk1 Mar 3, 2021

So the idea is that with this method any lineage provider (like Atlas, Marquez, Spline or OpenLineage) can be supported, right? Any proxy would support any lineage provider really.

Member

@feng-tao feng-tao Mar 3, 2021

@mgorsk1 I think this is just step 0; we should bring (or I assume we will bring) the backend model with ingestion in step 1/2. The reason we are doing it this way is that Lyft still uses a 3rd-party vendor lineage service. It is easier to implement in steps like these: focus on FE first with a backend proxy, then build a backend model/ingestion with a push mechanism in a subsequent step.

Contributor Author

Yeah we will hopefully add a backend model for lineage in the future so that we can query the db directly for lineage rather than making ad hoc calls to a provider from metadata. In that case we would have to add support for lineage extraction on databuilder.

Contributor

Clear, for me support for external service is more than fine, just wanted to make sure I got it right.
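
To make the thread above concrete, here is a minimal sketch of a `get_lineage` hook on a base proxy with one provider behind it. Only the method name `get_lineage` comes from the discussion; the class names, the keyword arguments, and the in-memory edge list standing in for an external lineage service are all assumptions made for illustration, and the entity dictionaries are simplified to `key` and `level`.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class BaseLineageProxy(ABC):
    """Hypothetical stand-in for the metadata service's base proxy class."""

    @abstractmethod
    def get_lineage(self, *, key: str, direction: str, depth: int) -> Dict[str, Any]:
        """Return upstream and/or downstream entities for `key`."""


class InMemoryLineageProxy(BaseLineageProxy):
    """Illustrative provider: a list of (upstream_key, downstream_key) edges
    plays the role of the implementer's external lineage service."""

    def __init__(self, edges: List[Tuple[str, str]]) -> None:
        self._edges = edges

    def _walk(self, key: str, depth: int, upstream: bool) -> List[Dict[str, Any]]:
        # Breadth-first walk, one level per iteration, up to `depth` levels.
        found: List[Dict[str, Any]] = []
        seen = {key}
        frontier = [key]
        for level in range(1, depth + 1):
            next_frontier: List[str] = []
            for node in frontier:
                for src, dst in self._edges:
                    if upstream and dst == node:
                        neighbor = src
                    elif not upstream and src == node:
                        neighbor = dst
                    else:
                        continue
                    if neighbor not in seen:
                        seen.add(neighbor)
                        found.append({"key": neighbor, "level": level})
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return found

    def get_lineage(self, *, key: str, direction: str, depth: int) -> Dict[str, Any]:
        if direction not in ("upstream", "downstream", "both"):
            raise ValueError(f"unsupported direction: {direction}")
        want_up = direction in ("upstream", "both")
        want_down = direction in ("downstream", "both")
        return {
            "key": key,
            "direction": direction,
            "upstream_entities": self._walk(key, depth, True) if want_up else [],
            "downstream_entities": self._walk(key, depth, False) if want_down else [],
        }
```

A proxy backed by Atlas or a vendor service would replace `_walk` with calls to that service, while the endpoint keeps calling `get_lineage` in the same way.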


### Backend Implementation

#### Table Lineage API

_The table details page must list X levels of downstream and upstream datasets on the DOWNSTREAM and UPSTREAM tabs, showing each dataset's name, level, source (database), and badges. These datasets should also be sortable by usage._

When the user clicks the DOWNSTREAM or UPSTREAM tab on the table details page, one of two requests is sent to the metadata service, carrying the lineage direction (upstream/downstream) and depth (levels):

```https://amundsenmetadata.com/table/current_table_key/lineage?direction=upstream&depth=1```
OR
```https://amundsenmetadata.com/table/current_table_key/lineage?direction=downstream&depth=1```
The lineage call will return a response:
```
{
  "key": "current_table_key",
  "direction": "upstream",
  "upstream_entities": [
    {
      "table": "table_key1",
      "level": 1,
      "source": "hive",
      "badges": ["core", "beta"],
      "usage": 234
    },
    ...
  ],
  "downstream_entities": []
}
```
OR
```
{
  "key": "current_table_key",
  "direction": "downstream",
  "upstream_entities": [],
  "downstream_entities": [
    {
      "table": "table_key2",
      "level": 1,
      "source": "hive",
      "badges": [],
      "usage": 45
    },
    ...
  ]
}
```
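
As a rough illustration of the table lineage request and the "sortable by usage" requirement, the sketch below builds the request URL and orders the returned entities. The host and path come from the example URLs above; the function names, the choice to percent-encode the table key, and the default of `0` for a missing `usage` value are assumptions, not part of the RFC.

```python
from typing import Any, Dict, List
from urllib.parse import quote, urlencode

# Host taken from the example URLs in this section.
METADATA_BASE = "https://amundsenmetadata.com"


def table_lineage_url(table_key: str, direction: str, depth: int = 1) -> str:
    """Build the table lineage URL for one direction (hypothetical helper)."""
    if direction not in ("upstream", "downstream"):
        raise ValueError(f"unsupported direction: {direction}")
    query = urlencode({"direction": direction, "depth": depth})
    # Assumption: the table key is percent-encoded into a single path segment.
    return f"{METADATA_BASE}/table/{quote(table_key, safe='')}/lineage?{query}"


def entities_by_usage(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the entities for the response's direction, most used first."""
    entities = response[f"{response['direction']}_entities"]
    return sorted(entities, key=lambda entity: entity.get("usage", 0), reverse=True)
```

Sorting on the client keeps the API response shape unchanged; an implementer could equally sort in the metadata service before returning the list.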
#### Column Lineage API
_The expanded view of a column in the table details page must display lists of upstream and downstream columns at the same time._
When the user expands a column to see more details, a single request for both directions is sent to the metadata service as follows:
```https://amundsenmetadata.com/table/current_table_key/column/column_name/lineage?direction=both&depth=1```
and the lineage call will return a response:
```
{
  "key": "current_table_key/current_column_name",
  "direction": "both",
  "upstream_entities": [
    {
      "key": "table_key1/column_name1",
      "level": 1,
      "source": "hive",
      "usage": 234
    },
    ...
  ],
  "downstream_entities": [
    {
      "key": "table_key2/column_name2",
      "level": 1,
      "source": "hive",
      "usage": 45
    },
    ...
  ]
}
```
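
The column response identifies each entity by a combined `table_key/column_name` key. A minimal sketch of how a consumer might split those keys and flatten the response into display rows follows. The helper names are hypothetical, and splitting on the last `/` is an assumption that holds only if column names never contain a slash.

```python
from typing import Any, Dict, List, Tuple


def split_column_lineage_key(key: str) -> Tuple[str, str]:
    """Split 'table_key/column_name' on the LAST slash, since a table key
    may itself contain slashes (assumption: column names do not)."""
    table_key, sep, column_name = key.rpartition("/")
    if not sep or not table_key or not column_name:
        raise ValueError(f"not a column lineage key: {key!r}")
    return table_key, column_name


def column_lineage_rows(response: Dict[str, Any]) -> List[Tuple[str, str, str, int]]:
    """Flatten a column lineage response into
    (direction, table_key, column_name, level) rows, upstream first."""
    rows: List[Tuple[str, str, str, int]] = []
    for direction in ("upstream", "downstream"):
        for entity in response.get(f"{direction}_entities", []):
            table_key, column_name = split_column_lineage_key(entity["key"])
            rows.append((direction, table_key, column_name, entity["level"]))
    return rows
```

Returning both lists in one response lets the expanded column view render upstream and downstream columns together without a second round trip.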
## Drawbacks
> Why should we _not_ do this?
> Please consider:
> Implementation cost, both in terms of code size and complexity
> Integration of this feature with other existing and planned features
> The impact on onboarding and learning about Amundsen
> Cost of migrating existing Amundsen installations (is it a breaking change?)
> If there are tradeoffs to choosing any path. Attempt to identify them here.
## Alternatives
> Why is this design the best in the space of possible designs?
> What other designs have been considered and what is the rationale for not choosing them?
> What is the impact of not doing this?
## Prior art
> Discuss prior art, both the good and the bad, in relation to this proposal. A few examples of what this can include are:
> Does this feature exist in other data search applications and what experience have their community had?
> For community proposals: Is this done by some other community and what were their experiences with it?
> Papers: Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background.
> This section is intended to encourage you as an author to think about the lessons from other projects, provide readers of your RFC with a fuller picture. If there is no prior art, that is fine - your ideas are interesting to us whether they are brand new or if it is an adaptation from other projects.
## Unresolved questions
> What parts of the design do you expect to resolve through the RFC process before this gets merged?
> What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
> What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?
## Future possibilities
> Think about what the natural extension and evolution of your proposal would be and how it would affect the project as a whole in a holistic way. Also consider how this all fits into the roadmap for the project and of the relevant sub-team.
> This is also a good place to "dump ideas", if they are out of scope for the RFC you are writing but otherwise related.
> If you have tried and cannot think of any future possibilities, you may simply state that you cannot think of anything.
- Persist lineage data on neo4j: create extractors for the databuilder library to extract the data and publish it
- Implement a lineage graph view for a better discovery experience
- Introduce a Task entity in Amundsen to surface pipeline tasks between tables and columns and to understand which tasks are responsible for generating tables