Mode Dashboard extractor with Generic REST API Query (#194)
* Initial check in on REST API Query

* Working version

* docstring

* Update

* Update

* Make unit test happy

* Update docstring

* Update

* Update

* Update

* Adding unit tests

* Updated README.md

* jsonpath_rw to extra_requires
jinhyukchang authored Feb 20, 2020
1 parent 9556b18 commit 01a0f96
Showing 16 changed files with 696 additions and 63 deletions.
25 changes: 25 additions & 0 deletions README.md
@@ -395,3 +395,28 @@ Callback interface is built upon a [Observer pattern](https://en.wikipedia.org/w
Publisher is the first component to adopt Callback: a registered Callback is called when the publish either succeeds or fails. To register a callback, Publisher provides the [register_call_back](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/publisher/base_publisher.py#L50 "register_call_back") method.

One use case is an Extractor that needs to commit when the job is finished (e.g., Kafka). By registering a callback with Publisher, the Extractor can safely commit once the publish succeeds by implementing its commit logic in the [on_success](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/callback/call_back.py#L18 "on_success") method.

### REST API Query
Databuilder now has a generic REST API Query capability, and queries can be joined to each other.
Most extraction currently comes from a database or data warehouse that is queryable via SQL. However, not all metadata sources provide access to their database; most of them instead provide a REST API for consuming their metadata.

REST APIs come with two challenges:

1. There is no explicit standard for REST APIs. Here, we conform to the majority of cases (HTTP call with JSON payload and response) while staying open to extension for different authentication schemes, different pagination styles, etc.
2. You rarely get what you want from a single REST API call. You usually need to stitch (JOIN) multiple REST API calls together to get the information you want.

To solve these challenges, we introduce [RestApiQuery](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/rest_api/rest_api_query.py).

RestApiQuery:
1. Assumes that the REST API uses HTTP(S) calls with the GET method (RestApiQuery's intention is **read**, not write), with basic HTTP auth supported out of the box. There are extension points for other authentication schemes such as OAuth, and for pagination, etc.
2. Supports value extraction: usually you want only a subset of the response you get from a REST API call. To extract the values you want, RestApiQuery uses [JSONPath](https://goessner.net/articles/JsonPath/), which is to JSON what XPath is to XML.
3. Lets you JOIN multiple RestApiQuery instances together.
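
As a rough illustration of value extraction, the sketch below pulls `token`, `name`, and `description` out of a nested response using plain Python. It does not use JSONPath or RestApiQuery itself: `extract_fields` and the sample payload are hypothetical, and the payload shape only mirrors the `_embedded.spaces[*].[token,name,description]` expression used by ModeDashboardExtractor.

```python
from typing import Any, Dict, List


def extract_fields(response: Dict[str, Any],
                   container_path: List[str],
                   source_keys: List[str],
                   field_names: List[str]) -> List[Dict[str, Any]]:
    """Walk container_path into the response, then keep only source_keys, renamed to field_names."""
    node = response
    for key in container_path:
        node = node[key]
    return [
        {field: item.get(source) for source, field in zip(source_keys, field_names)}
        for item in node
    ]


# Hypothetical response, shaped like a "spaces" listing (not captured from a real API).
sample_response = {
    '_embedded': {
        'spaces': [
            {'token': 'abc123', 'name': 'Finance', 'description': 'Finance reports', 'state': 'active'},
            {'token': 'def456', 'name': 'Growth', 'description': None, 'state': 'active'},
        ]
    }
}

records = extract_fields(
    sample_response,
    container_path=['_embedded', 'spaces'],
    source_keys=['token', 'name', 'description'],
    field_names=['dashboard_group_id', 'dashboard_group', 'dashboard_group_description'],
)
```

Each resulting record carries only the renamed fields; anything else in the response (here, `state`) is dropped.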

More detail on the JOIN operation in RestApiQuery:
1. It joins multiple RestApiQuery instances together by accepting the prior RestApiQuery as a constructor argument, following the [Decorator pattern](https://en.wikipedia.org/wiki/Decorator_pattern).
2. In a REST API, the URL is what locates the resource we want. Here, JOIN simply means that we find a resource **based on the identifier that another query's result has**. In other words, when RestApiQuery forms the URL, it uses the previous query's result to compute it, `e.g: Previous record: {"dashboard_id": "foo"}, URL before: http://foo.bar/dashboard/{dashboard_id} URL after compute: http://foo.bar/dashboard/foo`
With this pattern, RestApiQuery supports 1:1 and 1:N JOIN relationships.
(GROUP BY or any other aggregation, as well as sub-query joins, are not supported.)
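
The URL-based JOIN can be sketched without any network access. This is a minimal illustration of the idea, not RestApiQuery's implementation: `join_query`, `fake_fetch`, and the canned responses are hypothetical. It also assumes that a previous record is dropped when the lookup returns no rows, which may differ from what RestApiQuery does.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def join_query(previous_records: Iterable[Record],
               url_template: str,
               fetch: Callable[[str], List[Record]]) -> Iterator[Record]:
    """Yield one merged record per (previous record, fetched row) pair."""
    for prev in previous_records:
        # Fill the URL placeholders from the previous query's result.
        url = url_template.format(**prev)
        for row in fetch(url):
            merged = dict(prev)
            merged.update(row)
            yield merged


# Canned responses standing in for real HTTP GET calls (hypothetical data).
FAKE_RESPONSES = {
    'http://foo.bar/dashboard/foo': [{'report_id': 'r1'}, {'report_id': 'r2'}],
    'http://foo.bar/dashboard/baz': [{'report_id': 'r3'}],
}


def fake_fetch(url: str) -> List[Record]:
    return FAKE_RESPONSES.get(url, [])


seed = [{'dashboard_id': 'foo'}, {'dashboard_id': 'baz'}]
joined = list(join_query(seed, 'http://foo.bar/dashboard/{dashboard_id}', fake_fetch))
```

The `foo` record fans out into two joined records (1:N), while `baz` yields exactly one (1:1). Chaining further queries amounts to passing `joined` in as the next `previous_records`.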

To see it in action, take a peek at [ModeDashboardExtractor](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/extractor/dashboard/mode_dashboard_extractor.py).

98 changes: 98 additions & 0 deletions databuilder/extractor/dashboard/mode_dashboard_extractor.py
@@ -0,0 +1,98 @@
import logging

from pyhocon import ConfigTree, ConfigFactory # noqa: F401
from requests.auth import HTTPBasicAuth
from typing import Any # noqa: F401

from databuilder import Scoped
from databuilder.extractor.base_extractor import Extractor
from databuilder.extractor.restapi.rest_api_extractor import RestAPIExtractor, REST_API_QUERY, MODEL_CLASS, \
    STATIC_RECORD_DICT
from databuilder.rest_api.base_rest_api_query import RestApiQuerySeed
from databuilder.rest_api.rest_api_query import RestApiQuery

# CONFIG KEYS
ORGANIZATION = 'organization'
MODE_ACCESS_TOKEN = 'mode_user_token'
MODE_PASSWORD_TOKEN = 'mode_password_token'

LOGGER = logging.getLogger(__name__)


class ModeDashboardExtractor(Extractor):
    """
    An Extractor that extracts core metadata on Mode dashboard. https://app.mode.com/
    It extracts a list of reports, each consisting of:
    Dashboard group name (Space name)
    Dashboard group id (Space token)
    Dashboard group description (Space description)
    Dashboard name (Report name)
    Dashboard id (Report token)
    Dashboard description (Report description)
    Other information such as report run, owner, chart name, query name is in a separate extractor.
    """

    def init(self, conf):
        # type: (ConfigTree) -> None

        self._conf = conf

        restapi_query = self._build_restapi_query()
        self._extractor = RestAPIExtractor()
        rest_api_extractor_conf = Scoped.get_scoped_conf(conf, self._extractor.get_scope()).with_fallback(
            ConfigFactory.from_dict(
                {
                    REST_API_QUERY: restapi_query,
                    MODEL_CLASS: 'databuilder.models.dashboard_metadata.DashboardMetadata',
                    STATIC_RECORD_DICT: {'product': 'mode'}
                }
            )
        )

        self._extractor.init(conf=rest_api_extractor_conf)

    def extract(self):
        # type: () -> Any

        return self._extractor.extract()

    def get_scope(self):
        # type: () -> str

        return 'extractor.mode_dashboard'

    def _build_restapi_query(self):
        # type: () -> RestApiQuery
        """
        Build REST API Query. To get Mode Dashboard metadata, it needs to call two APIs (spaces API and reports
        API) and join them together.
        :return: A RestApiQuery that provides Mode Dashboard metadata
        """

        spaces_url_template = 'https://app.mode.com/api/{organization}/spaces?filter=all'
        reports_url_template = 'https://app.mode.com/api/{organization}/spaces/{dashboard_group_id}/reports'

        # Seed query record for next query api to join with
        seed_record = [{'organization': self._conf.get_string(ORGANIZATION)}]
        seed_query = RestApiQuerySeed(seed_record=seed_record)

        params = {'auth': HTTPBasicAuth(self._conf.get_string(MODE_ACCESS_TOKEN),
                                        self._conf.get_string(MODE_PASSWORD_TOKEN))}

        # Spaces
        # JSONPath expression: it goes into the array located in _embedded.spaces and then extracts token, name,
        # and description
        json_path = '_embedded.spaces[*].[token,name,description]'
        field_names = ['dashboard_group_id', 'dashboard_group', 'dashboard_group_description']
        spaces_query = RestApiQuery(query_to_join=seed_query, url=spaces_url_template, params=params,
                                    json_path=json_path, field_names=field_names)

        # Reports
        # JSONPath expression: it goes into the array located in _embedded.reports and then extracts token, name,
        # and description
        json_path = '_embedded.reports[*].[token,name,description]'
        field_names = ['dashboard_id', 'dashboard_name', 'description']
        reports_query = RestApiQuery(query_to_join=spaces_query, url=reports_url_template, params=params,
                                     json_path=json_path, field_names=field_names, skip_no_result=True)
        return reports_query
70 changes: 70 additions & 0 deletions databuilder/extractor/restapi/rest_api_extractor.py
@@ -0,0 +1,70 @@
import logging
import importlib
from typing import Any, Dict, Iterator  # noqa: F401

from pyhocon import ConfigTree # noqa: F401

from databuilder.extractor.base_extractor import Extractor
from databuilder.rest_api.base_rest_api_query import BaseRestApiQuery # noqa: F401


REST_API_QUERY = 'restapi_query'
MODEL_CLASS = 'model_class'

# Static record that will be merged into every extracted record.
# For example, DashboardMetadata requires the product name (a static name) of the Dashboard, which the REST API
# does not provide. You can add {'product': 'mode'} so that it is included in the record.
STATIC_RECORD_DICT = 'static_record_dict'

LOGGER = logging.getLogger(__name__)


class RestAPIExtractor(Extractor):
    """
    An Extractor that calls one or more REST APIs to extract the data.
    This extractor almost entirely depends on RestApiQuery.
    """

    def init(self, conf):
        # type: (ConfigTree) -> None

        self._restapi_query = conf.get(REST_API_QUERY)  # type: BaseRestApiQuery
        self._iterator = None  # type: Iterator[Dict[str, Any]]
        self._static_dict = conf.get(STATIC_RECORD_DICT, dict())
        LOGGER.info('static record: {}'.format(self._static_dict))

        model_class = conf.get(MODEL_CLASS, None)
        if model_class:
            module_name, class_name = model_class.rsplit(".", 1)
            mod = importlib.import_module(module_name)
            self.model_class = getattr(mod, class_name)

    def extract(self):
        # type: () -> Any

        """
        Fetch one result row from RestApiQuery, convert to {model_class} if specified before
        returning.
        :return:
        """

        if not self._iterator:
            self._iterator = self._restapi_query.execute()

        try:
            record = next(self._iterator)
        except StopIteration:
            return None

        if self._static_dict:
            record.update(self._static_dict)

        if hasattr(self, 'model_class'):
            return self.model_class(**record)

        return record

    def get_scope(self):
        # type: () -> str

        return 'extractor.restapi'
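
The model-class handling in `init` and `extract` above can be tried in isolation. The following is a minimal sketch under stated assumptions, not part of the commit: `load_class` and `to_model` are hypothetical helper names, and `types.SimpleNamespace` is only a stand-in for a real model class such as DashboardMetadata.

```python
import importlib
from typing import Any, Dict, Optional


def load_class(dotted_path: str) -> type:
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, class_name = dotted_path.rsplit('.', 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def to_model(record: Dict[str, Any],
             static_dict: Dict[str, Any],
             model_class_path: Optional[str] = None) -> Any:
    """Merge static values into the record, then build the model if a class path is given."""
    # Copy first so the caller's record is not mutated (the extractor above updates in place).
    merged = dict(record)
    merged.update(static_dict)
    if model_class_path:
        return load_class(model_class_path)(**merged)
    return merged


# 'types.SimpleNamespace' is a stand-in for a real model class (assumption for illustration).
model = to_model({'dashboard_name': 'Revenue'}, {'product': 'mode'}, 'types.SimpleNamespace')
plain = to_model({'dashboard_name': 'Revenue'}, {'product': 'mode'})
```

Without a class path the merged dictionary is returned as-is, which matches the fallback at the end of `extract`.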
105 changes: 80 additions & 25 deletions databuilder/models/dashboard_metadata.py
@@ -1,6 +1,6 @@
from collections import namedtuple

from typing import Iterable, Any, Union, Iterator, Dict, Set # noqa: F401
from typing import Any, Union, Iterator, Dict, Set, Optional # noqa: F401

# TODO: We could separate TagMetadata from table_metadata to own module
from databuilder.models.table_metadata import TagMetadata
@@ -15,7 +15,10 @@

class DashboardMetadata(Neo4jCsvSerializable):
"""
Dashboard metadata that contains dashboardgroup, tags, description, userid and lastreloadtime.
Dashboard metadata that contains dashboard group name, dashboardgroup description, dashboard description,
along with tags, owner userid and lastreloadtime.
(Owner ID and last reload time will be supported by separate Extractor later on with more information)
It implements Neo4jCsvSerializable so that it can be serialized to produce
Dashboard, Tag, Description, Lastreloadtime and relation of those. Additionally, it will create
Dashboardgroup with relationships to Dashboard. If users exist in neo4j, it will create
@@ -24,23 +27,27 @@ class DashboardMetadata(Neo4jCsvSerializable):
Lastreloadtime is the time when the Dashboard was last reloaded.
"""
DASHBOARD_NODE_LABEL = 'Dashboard'
DASHBOARD_KEY_FORMAT = '{dashboard_group}://{dashboard_name}'
DASHBOARD_KEY_FORMAT = '{product}_dashboard://{cluster}.{dashboard_group}/{dashboard_name}'
DASHBOARD_NAME = 'name'

DASHBOARD_DESCRIPTION_NODE_LABEL = 'Description'
DASHBOARD_DESCRIPTION = 'description'
DASHBOARD_DESCRIPTION_FORMAT = '{dashboard_group}://{dashboard_name}/_description'
DASHBOARD_DESCRIPTION_FORMAT = \
'{product}_dashboard://{cluster}.{dashboard_group}/{dashboard_name}/_description'
DASHBOARD_DESCRIPTION_RELATION_TYPE = 'DESCRIPTION'
DESCRIPTION_DASHBOARD_RELATION_TYPE = 'DESCRIPTION_OF'

DASHBOARD_GROUP_NODE_LABEL = 'Dashboardgroup'
DASHBOARD_GROUP_KEY_FORMAT = 'dashboardgroup://{dashboard_group}'
DASHBOARD_GROUP_KEY_FORMAT = '{product}_dashboard://{cluster}.{dashboard_group}'
DASHBOARD_GROUP_DASHBOARD_RELATION_TYPE = 'DASHBOARD'
DASHBOARD_DASHBOARD_GROUP_RELATION_TYPE = 'DASHBOARD_OF'

DASHBOARD_GROUP_DESCRIPTION_KEY_FORMAT = '{product}_dashboard://{cluster}.{dashboard_group}/_description'

DASHBOARD_LAST_RELOAD_TIME_NODE_LABEL = 'Lastreloadtime'
DASHBOARD_LAST_RELOAD_TIME = 'value'
DASHBOARD_LAST_RELOAD_TIME_FORMAT = '{dashboard_group}://{dashboard_name}/_lastreloadtime'
DASHBOARD_LAST_RELOAD_TIME_FORMAT =\
'{product}_dashboard://{cluster}.{dashboard_group}/{dashboard_name}/_lastreloadtime'
DASHBOARD_LAST_RELOAD_TIME_RELATION_TYPE = 'LAST_RELOAD_TIME'
LAST_RELOAD_TIME_DASHBOARD_RELATION_TYPE = 'LAST_RELOAD_TIME_OF'

@@ -60,50 +67,78 @@ def __init__(self,
dashboard_group, # type: str
dashboard_name, # type: str
description, # type: Union[str, None]
last_reload_time, # type: str
user_id, # type: str
tags # type: List
last_reload_time=None, # type: Optional[str]
user_id=None, # type: Optional[str]
tags=None, # type: List
cluster='gold', # type: str
product='', # type: Optional[str]
dashboard_group_id=None, # type: Optional[str]
dashboard_id=None, # type: Optional[str]
dashboard_group_description=None, # type: Optional[str]
**kwargs
):
# type: (...) -> None

self.dashboard_group = dashboard_group
self.dashboard_name = dashboard_name
self.dashboard_group_id = dashboard_group_id if dashboard_group_id else dashboard_group
self.dashboard_id = dashboard_id if dashboard_id else dashboard_name
self.description = description
self.last_reload_time = last_reload_time
self.user_id = user_id
self.tags = tags
self.product = product
self.cluster = cluster
self.dashboard_group_description = dashboard_group_description
self._node_iterator = self._create_next_node()
self._relation_iterator = self._create_next_relation()

def __repr__(self):
# type: () -> str
return 'DashboardMetadata({!r}, {!r}, {!r}, {!r}, {!r}, {!r}, {!r}' \
return 'DashboardMetadata({!r}, {!r}, {!r}, {!r}, {!r}, {!r}, {!r}, {!r}, {!r})' \
.format(self.dashboard_group,
self.dashboard_name,
self.description,
self.last_reload_time,
self.user_id,
self.tags
self.tags,
self.dashboard_group_id,
self.dashboard_id,
self.dashboard_group_description
)

def _get_dashboard_key(self):
# type: () -> str
return DashboardMetadata.DASHBOARD_KEY_FORMAT.format(dashboard_group=self.dashboard_group,
dashboard_name=self.dashboard_name)
return DashboardMetadata.DASHBOARD_KEY_FORMAT.format(dashboard_group=self.dashboard_group_id,
dashboard_name=self.dashboard_id,
cluster=self.cluster,
product=self.product)

def _get_dashboard_description_key(self):
# type: () -> str
return DashboardMetadata.DASHBOARD_DESCRIPTION_FORMAT.format(dashboard_group=self.dashboard_group,
dashboard_name=self.dashboard_name)
return DashboardMetadata.DASHBOARD_DESCRIPTION_FORMAT.format(dashboard_group=self.dashboard_group_id,
dashboard_name=self.dashboard_id,
cluster=self.cluster,
product=self.product)

def _get_dashboard_group_description_key(self):
# type: () -> str
return DashboardMetadata.DASHBOARD_GROUP_DESCRIPTION_KEY_FORMAT.format(dashboard_group=self.dashboard_group_id,
cluster=self.cluster,
product=self.product)

def _get_dashboard_group_key(self):
# type: () -> str
return DashboardMetadata.DASHBOARD_GROUP_KEY_FORMAT.format(dashboard_group=self.dashboard_group)
return DashboardMetadata.DASHBOARD_GROUP_KEY_FORMAT.format(dashboard_group=self.dashboard_group_id,
cluster=self.cluster,
product=self.product)

def _get_dashboard_last_reload_time_key(self):
# type: () -> str
return DashboardMetadata.DASHBOARD_LAST_RELOAD_TIME_FORMAT.format(dashboard_group=self.dashboard_group,
dashboard_name=self.dashboard_name)
dashboard_name=self.dashboard_id,
cluster=self.cluster,
product=self.product)

def _get_owner_key(self):
# type: () -> str
@@ -131,6 +166,12 @@ def _create_next_node(self):
DashboardMetadata.DASHBOARD_NAME: self.dashboard_group,
}

# Dashboard group description
if self.dashboard_group_description:
yield {NODE_LABEL: DashboardMetadata.DASHBOARD_DESCRIPTION_NODE_LABEL,
NODE_KEY: self._get_dashboard_group_description_key(),
DashboardMetadata.DASHBOARD_DESCRIPTION: self.dashboard_group_description}

# Dashboard description node
if self.description:
yield {NODE_LABEL: DashboardMetadata.DASHBOARD_DESCRIPTION_NODE_LABEL,
@@ -160,6 +201,17 @@ def create_next_relation(self):
def _create_next_relation(self):
# type: () -> Iterator[Any]

# Dashboard group > Dashboard group description relation
if self.dashboard_group_description:
yield {
RELATION_START_LABEL: DashboardMetadata.DASHBOARD_GROUP_NODE_LABEL,
RELATION_END_LABEL: DashboardMetadata.DASHBOARD_DESCRIPTION_NODE_LABEL,
RELATION_START_KEY: self._get_dashboard_group_key(),
RELATION_END_KEY: self._get_dashboard_group_description_key(),
RELATION_TYPE: DashboardMetadata.DASHBOARD_DESCRIPTION_RELATION_TYPE,
RELATION_REVERSE_TYPE: DashboardMetadata.DESCRIPTION_DASHBOARD_RELATION_TYPE
}

# Dashboard group > Dashboard relation
yield {
RELATION_START_LABEL: DashboardMetadata.DASHBOARD_NODE_LABEL,
@@ -205,14 +257,17 @@ def _create_next_relation(self):
}

# Dashboard > Dashboard owner relation
others = [
RelTuple(start_label=DashboardMetadata.DASHBOARD_NODE_LABEL,
end_label=DashboardMetadata.OWNER_NODE_LABEL,
start_key=self._get_dashboard_key(),
end_key=self._get_owner_key(),
type=DashboardMetadata.DASHBOARD_OWNER_RELATION_TYPE,
reverse_type=DashboardMetadata.OWNER_DASHBOARD_RELATION_TYPE)
]
others = []

if self.user_id:
others.append(
RelTuple(start_label=DashboardMetadata.DASHBOARD_NODE_LABEL,
end_label=DashboardMetadata.OWNER_NODE_LABEL,
start_key=self._get_dashboard_key(),
end_key=self._get_owner_key(),
type=DashboardMetadata.DASHBOARD_OWNER_RELATION_TYPE,
reverse_type=DashboardMetadata.OWNER_DASHBOARD_RELATION_TYPE)
)

for rel_tuple in others:
if rel_tuple not in DashboardMetadata.serialized_rels: