-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validating historical features against reference dataset with "great expectations" profiler #2243
Conversation
e1d4e59
to
5d222c2
Compare
Codecov Report
@@ Coverage Diff @@
## master #2243 +/- ##
===========================================
- Coverage 85.41% 58.59% -26.83%
===========================================
Files 112 116 +4
Lines 9573 9777 +204
===========================================
- Hits 8177 5729 -2448
- Misses 1396 4048 +2652
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
f4588e5
to
45c83b2
Compare
Why not making it a parameter to the RetrievalJob then ? Especially if there is only one small difference between the 2 objects. It's not obvious to me. In the following example, job = fs.get_historical_features(
...,
validation_reference=fs
.get_saved_dataset("my_reference_dataset")
.as_reference(profiler=my_profiler)
) it's not clear what job = fs.get_historical_features(
...,
validation_reference=BigQueryReferenceDataset("my_project.my_table.my_dataset")
) Our users like to try things out manually in Jupyter notebooks (or locally) before writing the code running in prod. Otherwise the rest of the PR (based on the description) is clean 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left couple of comments. Really like the direction of DQM.
docs/reference/dqm.md
Outdated
``` | ||
|
||
### Dataset profile | ||
Currently, Feast supports only [great expectation's](https://greatexpectations.io/) [ExpectationSuite](https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/core/expectation_suite/index.html#great_expectations.core.expectation_suite.ExpectationSuite) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this normal that we already make reference to the "legacy doc" of GE ? Referring to https://legacy.docs.greatexpectations.io
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link API Reference
from docs.greatexpectations.io
leads to this legacy documentation. see https://docs.greatexpectations.io/docs/reference/api_reference/
@@ -0,0 +1,126 @@ | |||
import json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to folder /dqm/
? Or would the following sdk/python/feast/profilers/ge_profiler.py
be enough ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's doesn't sound very clear what is feast.profilers
module for. So I thought feast.dqm
would be a useful namespace.
sdk/python/feast/dqm/utils.py
Outdated
from feast.on_demand_feature_view import OnDemandFeatureView | ||
|
||
|
||
class RetrievalJobWithValidation(RetrievalJob): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am really not a big fan of maintaining 2 objects "RetrievalJob" that are that similar ...
It sounds like an optional parameter validation_reference
in the current RetrievalJob would be better IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we should be able to use DQM and ODFV together. So maintaining that logic in the same object looks more long term to me.
My expectation would be if DQM and ODFV are used together, it would be processed as follow
- Run
odfv.get_transformed_features_df()
for each ODFV transformation - Validate the dataset against its reference
- Return the dataset
Steps 1. and 2. are optional to me and can be used alone and together.
WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed with Matt on having one implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated as suggested.
sdk/python/feast/dqm/utils.py
Outdated
from feast.on_demand_feature_view import OnDemandFeatureView | ||
|
||
|
||
class RetrievalJobWithValidation(RetrievalJob): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed with Matt on having one implementation
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Co-authored-by: Danny Chiao <d.chiao@gmail.com> Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
ffd260e
to
1ca7eaa
Compare
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Co-authored-by: Danny Chiao <d.chiao@gmail.com> Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
472c53b
to
0f87c7b
Compare
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
validation_reference expects instance of ValidationReference which consists of saved dataset and profiler. So you can't just point to a dataset. Profiler is also required (to generate
|
Btw you are adding |
@woop, yes. |
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
I'm also working on tutorial that supposed to provide more comprehensive ready-to-use code example. Please find it here: https://github.com/feast-dev/dqm-tutorial/blob/main/validating-historical-features.ipynb |
Signed-off-by: pyalex <moskalenko.alexey@gmail.com>
Yea I have also run into that issue a lot. I really hoped that we could solve it without sprinkling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: adchia, pyalex The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: pyalex moskalenko.alexey@gmail.com
What this PR does / why we need it:
This PR introduces validation of historical features. If parameter
validation_reference
is passed toRetrievalJob.to_df
orRetrievalJob.to_arrow
- resulting dataset will be validated against provided reference. If all expectations are met - data will be returned to user, otherwiseValidationFailed
exception with details about validation failures will be raised. Validation reference is an instance of classValidationReference
with saved dataset and user defined profiler inside.In addition this PR adds dataset profiler API. In order to compare datasets we need to generate profile from reference dataset and then run this profile against tested dataset. Profiler - is a function that accepts dataset as input and returns its profile as result (see API declaration)
Currently only one profiler available - based on integration with
great_expectation
.As you can see, right now profiler is part of user code. It is defined right before usage. In the future it could be defined along with features, feature views, etc and stored in feature definition repository and subsequently committed to the Feast registry (and read from there). Relevant proto message (and serialization / deserialization method) were added in this PR.
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: