-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: tweaks for gremlin support #60
feat: tweaks for gremlin support #60
Conversation
934e115
to
7cb2175
Compare
7cb2175
to
86a0704
Compare
add optional key parameter to column, table. rename Statistics -> Stat. add fixtures, streaming, utils to common. Update mypy to 761 to fix false errors. Fix flake8 complaints about extra line in __init__.py files Signed-off-by: Joshua Hoskins <hoskins@squareup.com>
86a0704
to
013735c
Compare
amundsen_common/fixtures.py
Outdated
|
||
|
||
class Fixtures: | ||
counter = 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be useful to add some documentation somewhere on how to use this Fixtures class (a summary of some kind so that people don't have to read all the code to figure it out)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- document fixtures
This looks really good to me @friendtocephalopods! Had one comment about potentially adding more documentation for the casual browser. Also I know that I have tried to upstream that Statistics -> Stat change before. I'd love to see it go through :) |
This looked good to me as well. Nice work! One top-level question: why the rename of the Statistics class? It might be nice to call out the rationale for that change since it's less obvious than the other diffs in this branch (imo). |
Oh, good question! Each |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for putting up the pr. haven't time to go through, put some initial comments, let me know what you think.
amundsen_common/fixtures.py
Outdated
from amundsen_common.models.user import User | ||
|
||
|
||
class Fixtures: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the fixtures only for testing purpose? if so, we should move it under the tests folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, it's for testing! But it's intended to be useful in multiple repos to create test objects easily. I don't think the tests/ folder gets carried over cleanly when importing in other repos.
Maybe amundsen_common/tests/fixtures.py would be preferable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, amundsen_common/tests/fixtures.py would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- amundsen_common/tests/fixtures.py
@@ -61,26 +61,27 @@ class Meta: | |||
|
|||
|
|||
@attr.s(auto_attribs=True, kw_only=True) | |||
class Statistics: | |||
class Stat: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately, this will break all the other services (https://github.com/amundsen-io/amundsenmetadatalibrary/search?q=Statistics&unscoped_q=Statistics) . I know this is not perfect. But let's keep it as it is and do it differently (e.g add another class named Stat and mark Statistics
deprecated etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the suggestion is to change it all over to a new Stat
class here and only remove Statistics
after references are gone in the other repos? Sure, that works!
- new class, not rename
stat_type: str | ||
stat_val: Optional[str] = None | ||
start_epoch: Optional[int] = None | ||
end_epoch: Optional[int] = None | ||
|
||
|
||
class StatisticsSchema(AttrsSchema): | ||
class StatSchema(AttrsSchema): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for the rename
@@ -151,6 +153,7 @@ class Table: | |||
cluster: str | |||
schema: str | |||
name: str | |||
key: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you share more what the key is for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! This is the table_uri. We store it on the object in the graph for indexing anyhow, so this saves us from having to recompute it everywhere each time we fetch it.
If we fully roll this out and make it non-optional, it also lets us not recompute the key downstream in the frontend as well: https://github.com/amundsen-io/amundsenfrontendlibrary/blob/ee0474fdcce6c8c1b01c9032b5cd2750b1d0d92c/amundsen_application/api/utils/metadata_utils.py#L110
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
amundsen_common/utils/streams.py
Outdated
return 1 | ||
|
||
|
||
class PeekingIterator(Iterator[V]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I don't have much context for all the code. If it is for gremlin lib, could we move it to https://github.com/amundsen-io/amundsengremlin ? The common lib is only used to store common model for all the repos. Let me know what you thinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah...I was wondering myself if this is useful in anyone else's databuilder. We use this in the gremlin code, but also for some of our databuilder jobs to chunk the results (e.g. groups of 1,000 or 10,000. Unlimited size batches can be problematic in some circumstances). Do y'all ever use airflow to explicitly set batch sizes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We define the batch / transaction size through config during neo4j ingestion (https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/publisher/neo4j_csv_publisher.py#L95). Otherwise the neo4j will hang without batch size setting. So given we haven't had a clear use case, I would suggest moving the code to gremlin repo? we could always use gremlin as an extra library deps in databuilder if we plan to enhance and build neptune publisher there. cc @AndrewCiambrone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, sounds good to me! Thanks for the colour, Tao!
- streams to gremlin repo
Remove streams Move fixtures to amundsen_common/tests Stat as additional class, deprecate Statistics Signed-off-by: Joshua Hoskins <hoskins@squareup.com>
b7e75ef
to
06f1f73
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm mostly, a few more nits, let me know what you think
setup.cfg
Outdated
@@ -35,6 +35,7 @@ check_untyped_defs = true | |||
disallow_any_generics = true | |||
disallow_incomplete_defs = true | |||
disallow_untyped_defs = true | |||
ignore_missing_imports = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you share more on why we set this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course! When I run mypy on the entire repo, I get it complaining at me about three libraries missing stubs.
e.g.
setup.py:4: error: No library stub file for module 'setuptools'
amundsen_common/models/user.py:7: error: Cannot find implementation or library stub for module named 'marshmallow'
amundsen_common/models/user.py:8: error: Cannot find implementation or library stub for module named 'marshmallow_annotations.ext.attrs'
I can set it specifically for those libraries in the cfg though, which is a less stringent option.
amundsen_common/utils/utils.py
Outdated
# Copyright Contributors to the Amundsen project. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
from typing import Optional, TypeVar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we move the utils folder to gremlin as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
- move the utils folder to gremlin
amundsen_common/utils/__init__.py
Outdated
@@ -0,0 +1,2 @@ | |||
# Copyright Contributors to the Amundsen project. | |||
# SPDX-License-Identifier: Apache-2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
@@ -151,6 +153,7 @@ class Table: | |||
cluster: str | |||
schema: str | |||
name: str | |||
key: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
remove utils, make mypy ignore missing imports module-specific Signed-off-by: Joshua Hoskins <hoskins@squareup.com>
a318f64
to
3f1557a
Compare
thanks |
🙏 |
Summary of Changes
add optional key parameter to column, table. rename Statistics class to Stat.
add fixtures, streaming, utils to common.
Update mypy to 761 to fix false errors.
Added ignore_missing_imports to ignore marshmallow dependency typing complaints.
Fix flake8 complaints about extra line in init.py files.
Added type ignore for thorny mypy complaint in User.py
As with the rest of the gremlin code I'm upstreaming, this was a team effort. Kudos to @dsimms, @alran, @worldwise001.
CheckList
Make sure you have checked all steps below to ensure a timely review.