
Validating historical features against reference dataset with "great expectations" profiler #2243

Merged: 17 commits merged into feast-dev:master on Feb 2, 2022

Conversation

@pyalex (Collaborator) commented Jan 27, 2022

Signed-off-by: pyalex moskalenko.alexey@gmail.com

What this PR does / why we need it:

This PR introduces validation of historical features. If the parameter validation_reference is passed to RetrievalJob.to_df or RetrievalJob.to_arrow, the resulting dataset is validated against the provided reference. If all expectations are met, the data is returned to the user; otherwise a ValidationFailed exception with details about the validation failures is raised. A validation reference is an instance of the class ValidationReference, which wraps a saved dataset and a user-defined profiler.

from great_expectations.dataset import Dataset
from great_expectations.core.expectation_suite import ExpectationSuite

from feast.dqm.profilers.ge_profiler import ge_profiler

@ge_profiler
def my_profiler(dataset: Dataset) -> ExpectationSuite:
    dataset.expect_column_max_to_be_between("column", 1, 2)
    return dataset.get_expectation_suite()

job = fs.get_historical_features(...)

# validation is executed at this point; ValidationFailed is raised here on failure
job.to_df(
    validation_reference=fs
        .get_saved_dataset("my_reference_dataset")
        .as_reference(profiler=my_profiler))  
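
A minimal sketch of handling a failed validation; the feast.dqm.errors import path matches this PR's file list, and printing the exception is just illustrative:

from feast.dqm.errors import ValidationFailed

try:
    df = job.to_df(
        validation_reference=fs
            .get_saved_dataset("my_reference_dataset")
            .as_reference(profiler=my_profiler))
except ValidationFailed as exc:
    # the exception carries details about the failed expectations
    print(exc)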

In addition, this PR adds a dataset profiler API. In order to compare datasets, we need to generate a profile from the reference dataset and then run this profile against the tested dataset. A profiler is a function that accepts a dataset as input and returns its profile as a result (see the API declaration).
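
For illustration, a rough sketch of the shape of that API; the class and method names below are assumed for this sketch rather than copied from the PR:

import abc

import pandas as pd


class ValidationReport(abc.ABC):
    """Outcome of running a profile against a tested dataset (details omitted)."""


class Profile(abc.ABC):
    @abc.abstractmethod
    def validate(self, dataset: pd.DataFrame) -> ValidationReport:
        """Run this profile against a tested dataset."""


class Profiler(abc.ABC):
    @abc.abstractmethod
    def analyze_dataset(self, dataset: pd.DataFrame) -> Profile:
        """Generate a profile from a reference dataset."""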

Currently only one profiler is available, based on the integration with Great Expectations.

As you can see, right now the profiler is part of user code and is defined right before usage. In the future it could be defined alongside features, feature views, etc., stored in the feature definition repository, and subsequently committed to the Feast registry (and read from there). The relevant proto message (and serialization / deserialization methods) were added in this PR.

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

(1) `RetrievalJob.to_df` and `RetrievalJob.to_arrow` have a new (optional) parameter `validation_reference`.
(2) New API for creating a validation reference from a saved dataset (`SavedDataset.as_reference`).

codecov-commenter commented Jan 27, 2022

Codecov Report

Merging #2243 (059444e) into master (f6712f0) will decrease coverage by 26.82%.
The diff coverage is 53.73%.


@@             Coverage Diff             @@
##           master    #2243       +/-   ##
===========================================
- Coverage   85.41%   58.59%   -26.83%     
===========================================
  Files         112      116        +4     
  Lines        9573     9777      +204     
===========================================
- Hits         8177     5729     -2448     
- Misses       1396     4048     +2652     
Flag              Coverage Δ
integrationtests  ?
unittests         58.59% <53.73%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
sdk/python/feast/infra/offline_stores/file.py 84.56% <ø> (-12.09%) ⬇️
...dk/python/tests/integration/e2e/test_validation.py 35.18% <35.18%> (ø)
...python/feast/infra/offline_stores/offline_store.py 69.04% <37.50%> (-18.10%) ⬇️
sdk/python/feast/dqm/profilers/ge_profiler.py 58.73% <58.73%> (ø)
sdk/python/feast/saved_dataset.py 50.87% <62.50%> (-17.50%) ⬇️
sdk/python/feast/dqm/errors.py 66.66% <66.66%> (ø)
sdk/python/feast/dqm/profilers/profiler.py 70.83% <70.83%> (ø)
.../integration/online_store/test_online_retrieval.py 17.39% <0.00%> (-82.61%) ⬇️
.../integration/online_store/test_universal_online.py 15.00% <0.00%> (-82.50%) ⬇️
sdk/python/tests/utils/online_read_write_test.py 20.68% <0.00%> (-79.32%) ⬇️
... and 63 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6712f0...059444e.

@MattDelac (Collaborator) commented Jan 30, 2022

RetrievalJobWithValidation has the same interface and behaviour as RetrievalJob with only one difference: once data is retrieved (after calling .to_df()), it will be validated against the provided validation_reference

Why not make it a parameter of RetrievalJob then? Especially since there is only one small difference between the two objects; it's not obvious to me.
I am worried we would need to maintain RetrievalJob and RetrievalJobWithValidation separately even though they might have 90% in common.


In the following example,

job = fs.get_historical_features(
    ...,
    validation_reference=fs
        .get_saved_dataset("my_reference_dataset")
        .as_reference(profiler=my_profiler)
)

it's not clear what validation_reference expects. Could I manually point it to a BigQuery dataset (or whatever equivalent)? For example, if I want to just try things out, could I do something like

job = fs.get_historical_features(
    ...,
    validation_reference=BigQueryReferenceDataset("my_project.my_table.my_dataset")
)

Our users like to try things out manually in Jupyter notebooks (or locally) before writing the code that runs in prod.


Otherwise the rest of the PR (based on the description) is clean 👍

@MattDelac (Collaborator) left a comment:
Left a couple of comments. Really like the direction of DQM.


### Dataset profile
Currently, Feast supports only [great expectation's](https://greatexpectations.io/) [ExpectationSuite](https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/core/expectation_suite/index.html#great_expectations.core.expectation_suite.ExpectationSuite)
Collaborator:

Is it normal that we already reference the legacy docs of GE? Referring to https://legacy.docs.greatexpectations.io

Collaborator Author:

The "API Reference" link on docs.greatexpectations.io leads to this legacy documentation; see https://docs.greatexpectations.io/docs/reference/api_reference/

@@ -0,0 +1,126 @@
import json
Collaborator:

Do we need the /dqm/ folder? Or would sdk/python/feast/profilers/ge_profiler.py be enough?

Collaborator Author:

It doesn't sound very clear what a feast.profilers module would be for, so I thought feast.dqm would be a useful namespace.

from feast.on_demand_feature_view import OnDemandFeatureView


class RetrievalJobWithValidation(RetrievalJob):
@MattDelac (Collaborator) commented Jan 30, 2022:

I am really not a big fan of maintaining two "RetrievalJob" objects that are that similar ...
It sounds like an optional parameter validation_reference on the current RetrievalJob would be better IMO.

Collaborator:

Also, we should be able to use DQM and ODFV together, so maintaining that logic in the same object looks more long-term to me.

My expectation would be that if DQM and ODFV are used together, it would be processed as follows:

  1. Run odfv.get_transformed_features_df() for each ODFV transformation
  2. Validate the dataset against its reference
  3. Return the dataset

Steps 1 and 2 are optional to me and can be used alone or together (see the sketch below).

WDYT?
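
To make the proposed flow concrete, here is a rough sketch of a single to_df handling both optional steps. This is not the actual Feast implementation: _to_df_internal, on_demand_feature_views, and the .profile / .is_success attributes are all assumptions.

from feast.dqm.errors import ValidationFailed

class RetrievalJob:
    def to_df(self, validation_reference=None):
        df = self._to_df_internal()  # raw feature retrieval
        # step 1 (optional): apply on-demand feature view transformations
        for odfv in self.on_demand_feature_views or []:
            df = df.join(odfv.get_transformed_features_df(df))
        # step 2 (optional): validate against the reference profile
        if validation_reference is not None:
            report = validation_reference.profile.validate(df)
            if not report.is_success:
                raise ValidationFailed(report)
        # step 3: return the transformed, validated dataset
        return df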

Collaborator:

agreed with Matt on having one implementation

Collaborator Author:

Updated as suggested.

@adchia adchia assigned MattDelac and unassigned achals Jan 30, 2022
@adchia adchia removed the request for review from felixwang9817 January 30, 2022 17:05

pyalex and others added 9 commits February 1, 2022 10:35
pyalex and others added 2 commits February 1, 2022 11:20
(all commits signed off by pyalex <moskalenko.alexey@gmail.com>; two co-authored by Danny Chiao <d.chiao@gmail.com>)
@pyalex (Collaborator, Author) commented Feb 1, 2022

it's not clear what validation_reference expects. Could I manually point it to a BigQuery dataset (or whatever equivalent)?

validation_reference expects an instance of ValidationReference, which consists of a saved dataset and a profiler. So you can't just point it at a dataset; the profiler is also required (to generate the ExpectationSuite).
To make life easier, I added the SavedDataset.as_reference method, which creates a validation reference from a dataset with the provided profiler:

dataset = store.get_saved_dataset(name='my_dataset')
reference = dataset.as_reference(profiler=my_profiler)

job = store.get_historical_features(...)
job.to_df(validation_reference=reference)
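
For the notebook use case, an end-to-end sketch: persist a reference dataset once, then reuse it for validation. create_saved_dataset and SavedDatasetFileStorage (and their parameters) are assumptions based on the saved-dataset API this PR builds on; entity_df, features and my_profiler are placeholders defined elsewhere.

from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

store = FeatureStore(repo_path=".")

# persist a reference dataset once (name and storage are illustrative)
store.create_saved_dataset(
    from_=store.get_historical_features(entity_df=entity_df, features=features),
    name="my_dataset",
    storage=SavedDatasetFileStorage(path="my_dataset.parquet"),
)

# later: turn it into a validation reference and validate a fresh retrieval
reference = store.get_saved_dataset(name="my_dataset").as_reference(profiler=my_profiler)
df = store.get_historical_features(entity_df=entity_df, features=features).to_df(
    validation_reference=reference)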

@MattDelac

@woop (Member) commented Feb 1, 2022

Btw you are adding __init__.py files. Is that intentional?

@pyalex (Collaborator, Author) commented Feb 1, 2022

@woop, yes. The __init__.py files are required for correct imports. I've been struggling for a while with import conflicts: since we now have ipython as a CI dependency (somehow great expectations can't work without it) and it has a tests module at the root level, it can be loaded earlier than our tests module (due to the absence of __init__.py).
So I got an import error on this line: from tests.data.data_creator import create_dataset

@pyalex (Collaborator, Author) commented Feb 1, 2022

I'm also working on a tutorial that is supposed to provide a more comprehensive, ready-to-use code example. Please find it here: https://github.com/feast-dev/dqm-tutorial/blob/main/validating-historical-features.ipynb

@woop (Member) commented Feb 1, 2022

@woop, yes. The __init__.py files are required for correct imports. I've been struggling for a while with import conflicts: since we now have ipython as a CI dependency (somehow great expectations can't work without it) and it has a tests module at the root level, it can be loaded earlier than our tests module (due to the absence of __init__.py). So I got an import error on this line: from tests.data.data_creator import create_dataset

Yeah, I have also run into that issue a lot. I really hoped that we could solve it without sprinkling __init__.py everywhere, but if you've tried everything else then I guess we don't have a choice.

@adchia (Collaborator) left a comment:

/lgtm

@feast-ci-bot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adchia, pyalex

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit 6d7f6e5 into feast-dev:master Feb 2, 2022
@adchia adchia linked an issue Apr 21, 2022 that may be closed by this pull request
Linked issue: Data quality monitoring