Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(datasets): Add /api/v1/dataset/{pk}/column endpoint #22332

Conversation

john-bodley
Copy link
Member

@john-bodley john-bodley commented Dec 5, 2022

SUMMARY

Per SIP-64 this PR adds the /api/v1/dataset/{pk}/column endpoints for obtaining just the columns associated with a dataset to aid the performance of the native filters.

As indicated in the SIP the /api/v1/dataset/{pk} endpoint is highly inefficient when fetching a dataset comprising of a large number of columns and/or metrics, i.e., specifically at Airbnb we have a virtual dataset which houses thousands of metrics and dimensions which results in an interim query which computes the cross product of # metrics x # columns x # owners—resulting in millions of rows.

Given the native filters only use the columns/metrics independently from the /api/v1/dataset/{pk} endpoint this PR simply adds a new endpoint to fetch only the columns metadata associated with said dataset.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

Added unit tests. Also tested the new endpoint with a dataset comprising of ~ 7,000 columns and the /api/v1/dataset/{pk}/column endpoint returned in ~ 1 second whereas /api/v1/dataset/{pk} timed out.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@@ -257,7 +257,7 @@
"AnnotationLayerRestApi.get_list": {
"properties": {
"changed_by": {
"$ref": "#/components/schemas/AnnotationLayerRestApi.get_list.User1"
"$ref": "#/components/schemas/AnnotationLayerRestApi.get_list.User"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regenerated by running superset update-api-docs.

@john-bodley john-bodley force-pushed the john-bodley--api-datasets-columns branch 4 times, most recently from f4c2c43 to a4aa974 Compare December 6, 2022 01:17
@@ -36,14 +38,99 @@
class DatasetColumnsRestApi(BaseSupersetModelRestApi):
datamodel = SQLAInterface(TableColumn)

include_route_methods = {"delete"}
allow_browser_login = True
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ABC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably check and move allow_browser_login = True into BaseSupersetModelRestApi since it seems it's used everywhere, I'll make a PR for it

@john-bodley john-bodley marked this pull request as ready for review December 6, 2022 01:31
@john-bodley john-bodley force-pushed the john-bodley--api-datasets-columns branch from a4aa974 to 939d13e Compare December 6, 2022 02:10
@codecov
Copy link

codecov bot commented Dec 6, 2022

Codecov Report

Merging #22332 (e10ca2b) into master (d41cb66) will decrease coverage by 0.64%.
The diff coverage is 78.26%.

❗ Current head e10ca2b differs from pull request most recent head 6845322. Consider uploading reports for the commit 6845322 to get more accurate results

@@            Coverage Diff             @@
##           master   #22332      +/-   ##
==========================================
- Coverage   66.75%   66.11%   -0.65%     
==========================================
  Files        1847     1847              
  Lines       70562    70577      +15     
  Branches     7742     7737       -5     
==========================================
- Hits        47106    46664     -442     
- Misses      21455    21906     +451     
- Partials     2001     2007       +6     
Flag Coverage Δ
hive 52.53% <76.19%> (?)
mysql ?
postgres ?
presto 52.43% <76.19%> (+<0.01%) ⬆️
python 79.67% <76.19%> (-1.34%) ⬇️
sqlite 76.48% <76.19%> (-0.01%) ⬇️
unit 50.92% <76.19%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/datasets/columns/api.py 70.00% <76.19%> (-30.00%) ⬇️
...tersConfigModal/FiltersConfigForm/ColumnSelect.tsx 76.92% <100.00%> (ø)
superset/datasets/commands/create.py 30.61% <0.00%> (-69.39%) ⬇️
superset/datasets/commands/update.py 25.00% <0.00%> (-69.05%) ⬇️
superset/datasets/commands/bulk_delete.py 33.33% <0.00%> (-53.34%) ⬇️
superset/datasets/columns/commands/delete.py 44.11% <0.00%> (-52.95%) ⬇️
superset/datasets/metrics/commands/delete.py 44.11% <0.00%> (-52.95%) ⬇️
superset/datasets/dao.py 44.21% <0.00%> (-50.35%) ⬇️
superset/sql_validators/postgres.py 50.00% <0.00%> (-50.00%) ⬇️
superset/datasets/commands/delete.py 48.57% <0.00%> (-48.58%) ⬇️
... and 45 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@rusackas
Copy link
Member

rusackas commented Dec 6, 2022

Thanks for taking this performance issue under your wing. I'm hopeful this can mean other orgs with this issue can take this new filters feature to production, and move us another step toward 3.0 and the end of Filter Boxes :D

@john-bodley john-bodley force-pushed the john-bodley--api-datasets-columns branch 3 times, most recently from 7684198 to f1bd7f2 Compare December 6, 2022 18:44
Copy link
Member

@ktmud ktmud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why cypress is failing but the change makes sense to me

@@ -98,17 +98,17 @@ export function ColumnSelect({
}
if (datasetId != null) {
cachedSupersetGet({
endpoint: `/api/v1/dataset/${datasetId}`,
endpoint: `/api/v1/dataset/${datasetId}/column`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
endpoint: `/api/v1/dataset/${datasetId}/column`,
endpoint: `/api/v1/dataset/${datasetId}/columns`,

Should this end point be /columns?

Copy link
Member

@michael-s-molina michael-s-molina Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on columns as it represents a set of columns.

I added a ticket to Superset 3.0 project board to refactor the endpoints to use the plural form.

Copy link
Member Author

@john-bodley john-bodley Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ktmud and @michael-s-molina I agree that it should be columns and not column—it should also be datasets and not dataset, however per,

Screenshot 2022-12-09 at 6 53 13 AM

we already use the singular form, hence using column ensures consistency which I sense is more important. Per @michael-s-molina's comment I think changing to plural forms should be a bulk operation.

rison_data.setdefault("filters", [])
rison_data["filters"].append({"col": "table_id", "opr": "eq", "value": pk})
rison_data["page_size"] = -1
return self.get_list_headless(**kwargs)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dpgaspar, @michael-s-molina, or @villebro I would love your input on this as I'm not overly familiar with the new FAB API and/or how best to define said endpoint. I came across the pattern by prodding around.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use something more like: https://github.com/apache/superset/blob/master/superset/annotation_layers/annotations/api.py#L129

why the rison_data["page_size"] = -1 because the frontend is not handling pagination yet?

also we probably need to apply DatasourceFilter here, or users will be able to view dataset columns from datasets their not authorized to view

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dpgaspar I added the rison_data["page_size"] = -1 as the /api/v1/dataset/{pk} response is non-paginated and I wanted to ensure consistency. Additionally this endpoint is currently solely used in a non-paginated way, i.e., it's not powering any /*/list/ endpoints.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the pattern I see it being only used twice and it only applies one rison override (as opposed to two which is needed in our case). I think there's merit in refactoring if it's used more than once. Furthermore this example is actually a misnomer in terms of naming, i.e., the function name has been copied and pasted as is even though it's applying the report_schedule as opposed to layer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I Agree let's refactor this after. Still think we need to apply the DatasourceFilter here, have you checked it?

method_permission_name = MODEL_API_RW_METHOD_PERMISSION_MAP

list_columns = [ # See DatasetRestApi.show_select_columns.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not overly sure the difference between list_columns and show_columns but I thought that it was prudent that the table_columns columns returned was consistent with the columns field in the /api/v1/dataset response.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list_columns is for get_list and show_columns is for get item. Not great naming, at the time I aimed for compatibility with the previous MovelView classes on FAB to allow for a smooth migration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use this opportunity to add comments in the code explaining the difference?

Copy link
Member

@dpgaspar dpgaspar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great @john-bodley! thank you

rison_data.setdefault("filters", [])
rison_data["filters"].append({"col": "table_id", "opr": "eq", "value": pk})
rison_data["page_size"] = -1
return self.get_list_headless(**kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use something more like: https://github.com/apache/superset/blob/master/superset/annotation_layers/annotations/api.py#L129

why the rison_data["page_size"] = -1 because the frontend is not handling pagination yet?

also we probably need to apply DatasourceFilter here, or users will be able to view dataset columns from datasets their not authorized to view

@@ -36,14 +38,99 @@
class DatasetColumnsRestApi(BaseSupersetModelRestApi):
datamodel = SQLAInterface(TableColumn)

include_route_methods = {"delete"}
allow_browser_login = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably check and move allow_browser_login = True into BaseSupersetModelRestApi since it seems it's used everywhere, I'll make a PR for it

class_permission_name = "Dataset"
include_route_methods = {RouteMethod.DELETE, RouteMethod.GET, RouteMethod.GET_LIST}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we just need the RouteMethod.GET_LIST

method_permission_name = MODEL_API_RW_METHOD_PERMISSION_MAP

list_columns = [ # See DatasetRestApi.show_select_columns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list_columns is for get_list and show_columns is for get item. Not great naming, at the time I aimed for compatibility with the previous MovelView classes on FAB to allow for a smooth migration

@john-bodley john-bodley force-pushed the john-bodley--api-datasets-columns branch from f1bd7f2 to 6845322 Compare December 8, 2022 19:55
@rusackas
Copy link
Member

rusackas commented Dec 8, 2022

Kicking CI 🤞

@john-bodley john-bodley changed the title chore(datasets): Add /api/v1/dataset/{pk}/column endpoint chore(datasets): Add /api/v1/dataset/{pk}/[column|metric] endpoints Dec 9, 2022
@john-bodley john-bodley changed the title chore(datasets): Add /api/v1/dataset/{pk}/[column|metric] endpoints chore(datasets): Add /api/v1/dataset/{pk}/column endpoint Dec 9, 2022
@john-bodley
Copy link
Member Author

john-bodley commented Dec 10, 2022

Regrettably when examining the initial problem I missed that the /api/v1/dataset/{pk} endpoint is also used as part of the native filters to fetch the metrics + dataset metadata here.

I thought it was prudent to return to the drawing board and close this PR in favor of trying to optimize the underlying FAB/SQLAlchemy logic when fetching a model with its associated relationships. @dpgaspar hopefully dpgaspar/Flask-AppBuilder#1959 is a step in the right direction.

All going well I'll follow up with another PR which updates the existing /api/v1/dataset/{pk} queries to filter on only the relevant dataset, columns, and metrics fields which should further improve query performance and reduce the response payload size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants