
Validating historical features against reference dataset with "great expectations" profiler #2243

Merged: 17 commits merged into feast-dev:master on Feb 2, 2022

Conversation

@pyalex (Collaborator) commented Jan 27, 2022

Signed-off-by: pyalex moskalenko.alexey@gmail.com

What this PR does / why we need it:

This PR introduces validation of historical features. If the parameter validation_reference is passed to RetrievalJob.to_df or RetrievalJob.to_arrow, the resulting dataset is validated against the provided reference. If all expectations are met, the data is returned to the user; otherwise a ValidationFailed exception with details about the validation failures is raised. A validation reference is an instance of the class ValidationReference, which wraps a saved dataset and a user-defined profiler.

from great_expectations.dataset import Dataset
from great_expectations.core.expectation_suite import ExpectationSuite

from feast.dqm.profilers.ge_profiler import ge_profiler

@ge_profiler
def my_profiler(dataset: Dataset) -> ExpectationSuite:
    dataset.expect_column_max_to_be_between("column", 1, 2)
    return dataset.get_expectation_suite()

job = fs.get_historical_features(...)

# validation is executed at this point; ValidationFailed is raised here on failure
job.to_df(
    validation_reference=fs
        .get_saved_dataset("my_reference_dataset")
        .as_reference(profiler=my_profiler))  
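
A minimal sketch of handling a failed validation; the feast.dqm.errors import path matches this PR's file list, and printing the exception is just illustrative:

from feast.dqm.errors import ValidationFailed

try:
    df = job.to_df(
        validation_reference=fs
            .get_saved_dataset("my_reference_dataset")
            .as_reference(profiler=my_profiler))
except ValidationFailed as exc:
    # the exception carries details about the failed expectations
    print(exc)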

In addition, this PR adds a dataset profiler API. In order to compare datasets, we need to generate a profile from the reference dataset and then run this profile against the tested dataset. A profiler is a function that accepts a dataset as input and returns its profile as a result (see the API declaration).
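
For illustration, a rough sketch of the shape of that API; the class and method names below are assumed for this sketch rather than copied from the PR:

import abc

import pandas as pd


class ValidationReport(abc.ABC):
    """Outcome of running a profile against a tested dataset (details omitted)."""


class Profile(abc.ABC):
    @abc.abstractmethod
    def validate(self, dataset: pd.DataFrame) -> ValidationReport:
        """Run this profile against a tested dataset."""


class Profiler(abc.ABC):
    @abc.abstractmethod
    def analyze_dataset(self, dataset: pd.DataFrame) -> Profile:
        """Generate a profile from a reference dataset."""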

Currently only one profiler is available, based on the integration with Great Expectations.

As you can see, right now the profiler is part of user code and is defined right before usage. In the future it could be defined alongside features, feature views, etc., stored in the feature definition repository, and subsequently committed to the Feast registry (and read from there). The relevant proto message (and serialization / deserialization methods) were added in this PR.

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

(1) `RetrievalJob.to_df` and `RetrievalJob.to_arrow` have a new (optional) parameter `validation_reference`.
(2) New API for creating a validation reference from a saved dataset (`SavedDataset.as_reference`).

codecov-commenter commented Jan 27, 2022

Codecov Report

Merging #2243 (059444e) into master (f6712f0) will decrease coverage by 26.82%.
The diff coverage is 53.73%.


@@             Coverage Diff             @@
##           master    #2243       +/-   ##
===========================================
- Coverage   85.41%   58.59%   -26.83%     
===========================================
  Files         112      116        +4     
  Lines        9573     9777      +204     
===========================================
- Hits         8177     5729     -2448     
- Misses       1396     4048     +2652     
Flag              Coverage Δ
integrationtests  ?
unittests         58.59% <53.73%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
sdk/python/feast/infra/offline_stores/file.py 84.56% <ø> (-12.09%) ⬇️
...dk/python/tests/integration/e2e/test_validation.py 35.18% <35.18%> (ø)
...python/feast/infra/offline_stores/offline_store.py 69.04% <37.50%> (-18.10%) ⬇️
sdk/python/feast/dqm/profilers/ge_profiler.py 58.73% <58.73%> (ø)
sdk/python/feast/saved_dataset.py 50.87% <62.50%> (-17.50%) ⬇️
sdk/python/feast/dqm/errors.py 66.66% <66.66%> (ø)
sdk/python/feast/dqm/profilers/profiler.py 70.83% <70.83%> (ø)
.../integration/online_store/test_online_retrieval.py 17.39% <0.00%> (-82.61%) ⬇️
.../integration/online_store/test_universal_online.py 15.00% <0.00%> (-82.50%) ⬇️
sdk/python/tests/utils/online_read_write_test.py 20.68% <0.00%> (-79.32%) ⬇️
... and 63 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6712f0...059444e.

@MattDelac (Collaborator) commented Jan 30, 2022

RetrievalJobWithValidation has the same interface and behaviour as RetrievalJob with only one difference: once data is retrieved (after calling .to_df()), it will be validated against the provided validation_reference

Why not make it a parameter of RetrievalJob then? Especially since there is only one small difference between the two objects; it's not obvious to me.
I am worried we would need to maintain RetrievalJob and RetrievalJobWithValidation separately even though they might have 90% in common.


In the following example,

job = fs.get_historical_features(
    ...,
    validation_reference=fs
        .get_saved_dataset("my_reference_dataset")
        .as_reference(profiler=my_profiler)
)

it's not clear what validation_reference expects. Could I manually point it to a BigQuery dataset (or whatever equivalent)? For example, if I want to just try things out, could I do something like

job = fs.get_historical_features(
    ...,
    validation_reference=BigQueryReferenceDataset("my_project.my_table.my_dataset")
)

Our users like to try things out manually in Jupyter notebooks (or locally) before writing the code that runs in prod.


Otherwise the rest of the PR (based on the description) is clean 👍

@MattDelac (Collaborator) left a comment:
Left a couple of comments. Really like the direction of DQM.


### Dataset profile
Currently, Feast supports only [great expectation's](https://greatexpectations.io/) [ExpectationSuite](https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/core/expectation_suite/index.html#great_expectations.core.expectation_suite.ExpectationSuite)
Collaborator:

Is it normal that we already reference the legacy docs of GE? Referring to https://legacy.docs.greatexpectations.io

Collaborator Author:

The "API Reference" link on docs.greatexpectations.io leads to this legacy documentation; see https://docs.greatexpectations.io/docs/reference/api_reference/

@@ -0,0 +1,126 @@
import json
Collaborator:

Do we need the /dqm/ folder? Or would sdk/python/feast/profilers/ge_profiler.py be enough?

Collaborator Author:

It doesn't sound very clear what a feast.profilers module would be for, so I thought feast.dqm would be a useful namespace.

from feast.on_demand_feature_view import OnDemandFeatureView


class RetrievalJobWithValidation(RetrievalJob):
@MattDelac (Collaborator) commented Jan 30, 2022:

I am really not a big fan of maintaining two "RetrievalJob" objects that are that similar ...
It sounds like an optional parameter validation_reference on the current RetrievalJob would be better IMO.

Collaborator:

Also, we should be able to use DQM and ODFV together, so maintaining that logic in the same object looks more long-term to me.

My expectation would be that if DQM and ODFV are used together, it would be processed as follows:

  1. Run odfv.get_transformed_features_df() for each ODFV transformation
  2. Validate the dataset against its reference
  3. Return the dataset

Steps 1 and 2 are optional to me and can be used alone or together (see the sketch below).

WDYT?
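
To make the proposed flow concrete, here is a rough sketch of a single to_df handling both optional steps. This is not the actual Feast implementation: _to_df_internal, on_demand_feature_views, and the .profile / .is_success attributes are all assumptions.

from feast.dqm.errors import ValidationFailed

class RetrievalJob:
    def to_df(self, validation_reference=None):
        df = self._to_df_internal()  # raw feature retrieval
        # step 1 (optional): apply on-demand feature view transformations
        for odfv in self.on_demand_feature_views or []:
            df = df.join(odfv.get_transformed_features_df(df))
        # step 2 (optional): validate against the reference profile
        if validation_reference is not None:
            report = validation_reference.profile.validate(df)
            if not report.is_success:
                raise ValidationFailed(report)
        # step 3: return the transformed, validated dataset
        return df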

Collaborator:

agreed with Matt on having one implementation

Collaborator Author:

Updated as suggested.

@adchia adchia assigned MattDelac and unassigned achals Jan 30, 2022
@adchia adchia removed the request for review from felixwang9817 January 30, 2022 17:05

pyalex and others added 9 commits February 1, 2022 10:35
pyalex and others added 2 commits February 1, 2022 11:20
(all commits signed off by pyalex <moskalenko.alexey@gmail.com>; two co-authored by Danny Chiao <d.chiao@gmail.com>)
@pyalex (Collaborator, Author) commented Feb 1, 2022

it's not clear what validation_reference expects. Could I manually point it to a BigQuery dataset (or whatever equivalent)?

validation_reference expects an instance of ValidationReference, which consists of a saved dataset and a profiler. So you can't just point it at a dataset; the profiler is also required (to generate the ExpectationSuite).
To make life easier, I added the SavedDataset.as_reference method, which creates a validation reference from a dataset with the provided profiler:

dataset = store.get_saved_dataset(name='my_dataset')
reference = dataset.as_reference(profiler=my_profiler)

job = store.get_historical_features(...)
job.to_df(validation_reference=reference)
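
For the notebook use case, an end-to-end sketch: persist a reference dataset once, then reuse it for validation. create_saved_dataset and SavedDatasetFileStorage (and their parameters) are assumptions based on the saved-dataset API this PR builds on; entity_df, features and my_profiler are placeholders defined elsewhere.

from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

store = FeatureStore(repo_path=".")

# persist a reference dataset once (name and storage are illustrative)
store.create_saved_dataset(
    from_=store.get_historical_features(entity_df=entity_df, features=features),
    name="my_dataset",
    storage=SavedDatasetFileStorage(path="my_dataset.parquet"),
)

# later: turn it into a validation reference and validate a fresh retrieval
reference = store.get_saved_dataset(name="my_dataset").as_reference(profiler=my_profiler)
df = store.get_historical_features(entity_df=entity_df, features=features).to_df(
    validation_reference=reference)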

@MattDelac

@woop (Member) commented Feb 1, 2022

Btw you are adding __init__.py files. Is that intentional?

@pyalex (Collaborator, Author) commented Feb 1, 2022

@woop, yes. The __init__.py files are required for correct imports. I've been struggling for a while with import conflicts: since we now have ipython as a CI dependency (somehow great expectations can't work without it) and it has a tests module at the root level, it can be loaded earlier than our tests module (due to the absence of __init__.py).
So I got an import error on this line: from tests.data.data_creator import create_dataset

@pyalex (Collaborator, Author) commented Feb 1, 2022

I'm also working on a tutorial that is supposed to provide a more comprehensive, ready-to-use code example. Please find it here: https://github.com/feast-dev/dqm-tutorial/blob/main/validating-historical-features.ipynb

@woop (Member) commented Feb 1, 2022

@woop, yes. The __init__.py files are required for correct imports. I've been struggling for a while with import conflicts: since we now have ipython as a CI dependency (somehow great expectations can't work without it) and it has a tests module at the root level, it can be loaded earlier than our tests module (due to the absence of __init__.py). So I got an import error on this line: from tests.data.data_creator import create_dataset

Yeah, I have also run into that issue a lot. I really hoped that we could solve it without sprinkling __init__.py everywhere, but if you've tried everything else then I guess we don't have a choice.

@adchia (Collaborator) left a comment:

/lgtm

@feast-ci-bot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adchia, pyalex

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit 6d7f6e5 into feast-dev:master Feb 2, 2022
@adchia adchia linked an issue Apr 21, 2022 that may be closed by this pull request
Linked issue: Data quality monitoring