Great Expectations is the leading tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams.
- Automated data profiling: the library profiles your data to gather basic statistics and automatically generates a suite of Expectations based on what it observes in the data.
- Data validation: an Expectation Suite passes or fails, and returns any unexpected values that failed a test.
- Data Docs: renders Expectations as clean, human-readable HTML documentation containing both Expectation Suites and data Validation Results.
- Diverse Datasources and Store backends: supports various datasources such as Pandas dataframes, Spark dataframes, and SQL databases via SQLAlchemy.
Artifacts produced:
- Expectation Suite (JSON)
- Data Docs (HTML report)
- Validation run report
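For a sense of what the Expectation Suite JSON artifact looks like, here is a minimal, hypothetical example (the suite name, column name, and expectation chosen here are illustrative, not taken from the actual generated suite):

```python
import json

# Hypothetical minimal Expectation Suite, shaped like the JSON files
# Great Expectations writes under great_expectations/expectations/.
suite = {
    "expectation_suite_name": "example_suite",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "YEAR MFR"},
        }
    ],
    "meta": {"great_expectations_version": "0.15.46"},
}

print(json.dumps(suite, indent=2))
```

Each entry in `expectations` names an expectation type plus its kwargs; editing the suite (as done later in this walkthrough) is mostly adding, removing, or tweaking these entries.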
Refer: Getting started with Great Expectations
pip install great_expectations
great_expectations --version
Output: great_expectations, version 0.15.46
great_expectations init
Change the working directory to the newly created great_expectations directory:
cd great_expectations
Copy the csv into great_expectations/data
Files:
faa_registration.csv
great_expectations datasource new
Input the following at the prompts:
- 1 (Local File)
- 1 (Pandas)
- data (relative path to the datasets)
This opens a Jupyter notebook.
- Change the datasource_name var to nyc_yellow_taxi_trip_data
- Update example_yaml to ignore all non-csv files:

```python
example_yaml = f"""
name: {datasource_name}
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: data
    default_regex:
      group_names:
        - data_asset_name
      pattern: (.*)\.csv
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    assets:
      my_runtime_asset_name:
        batch_identifiers:
          - runtime_batch_identifier_name
"""
print(example_yaml)
```
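The `default_regex` block is what makes the connector skip non-csv files: only filenames matching the pattern become data assets, and the captured group supplies the asset name. A stdlib-only sketch of that matching behavior (this is just the regex logic, not Great Expectations code):

```python
import re

# A pattern like the one in example_yaml: capture the base name,
# match only files ending in .csv.
pattern = re.compile(r"(.*)\.csv")

files = ["faa_registration.csv", "notes.txt", "planes_2020.csv"]

# Files that do not match the pattern are ignored by the data connector;
# the captured group becomes the data asset name.
assets = [m.group(1) for f in files if (m := pattern.fullmatch(f))]
print(assets)  # ['faa_registration', 'planes_2020']
```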
- Save the datasource configuration
- Close the Jupyter notebook
- Wait for the terminal to show: Saving file at /datasource_new.ipynb
great_expectations suite new
Input the following at the prompts:
- 3 (Automatically, using a profiler)
- 1 (index of the file faa_registration.csv)
- faa_registration_suite (suite name)
This opens a Jupyter notebook.
- Change the datasource_name var to spy_plane_data
- Update exclude_column_names to:

```python
exclude_column_names = [
    "N-NUMBER",
    "SERIAL NUMBER",
    "MFR MDL CODE",
    "ENG MFR MDL",
    # "YEAR MFR",
    "TYPE REGISTRANT",
    "NAME",
    "STREET",
    "STREET2",
    "CITY",
    "STATE",
    "ZIP CODE",
    "REGION",
    "COUNTY",
    "COUNTRY",
    # "LAST ACTION DATE",
    # "CERT ISSUE DATE",
    "CERTIFICATION",
    "TYPE AIRCRAFT",
    "TYPE ENGINE",
    "STATUS CODE",
    "MODE S CODE",
    "FRACT OWNER",
    "AIR WORTH DATE",
    "OTHER NAMES(1)",
    "OTHER NAMES(2)",
    "OTHER NAMES(3)",
    "OTHER NAMES(4)",
    "OTHER NAMES(5)",
    # "EXPIRATION DATE",
    # "UNIQUE ID",
    "KIT MFR",
    "KIT MODEL",
    "MODE S CODE HEX",
    "X35",
]
```
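The profiler skips every column listed in exclude_column_names, so the commented-out entries (e.g. "YEAR MFR", "EXPIRATION DATE") are the ones that remain profiled. The effect, sketched in plain Python with a hypothetical subset of the FAA columns:

```python
# Hypothetical subset of the FAA registration columns (for illustration only).
all_columns = ["N-NUMBER", "YEAR MFR", "NAME", "EXPIRATION DATE", "UNIQUE ID"]

# Columns listed (uncommented) in exclude_column_names are skipped by the profiler.
exclude_column_names = ["N-NUMBER", "NAME"]

profiled = [c for c in all_columns if c not in exclude_column_names]
print(profiled)  # ['YEAR MFR', 'EXPIRATION DATE', 'UNIQUE ID']
```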
- Run all cells to create the default expectations and analyze the result
- Wait for the terminal to show: Saving file at /*.ipynb
- Modify the expectations as needed

Modified the JSON file great_expectations/expectations/faa_registration_suite.json and kept only the necessary expectations.

great_expectations suite edit faa_registration_suite
Input the following at the prompt (! SYS ERROR, COULD NOT LOAD THE NOTEBOOK):
- 1 (Manually, without interacting with a sample batch of data (default))

Updated to:
This Expectation suite currently contains 4 total Expectations across 1 columns.
great_expectations checkpoint new planes_features_checkpoint_v0.1
This opens a Jupyter notebook.
- Run all cells
- The validation report opens in a new page
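The checkpoint notebook centers on a YAML config roughly like the following (a sketch assuming Great Expectations 0.15.x SimpleCheckpoint defaults; the datasource and asset names follow this walkthrough and may differ in your generated notebook):

```yaml
name: planes_features_checkpoint_v0.1
config_version: 1.0
class_name: SimpleCheckpoint
validations:
  - batch_request:
      datasource_name: spy_plane_data
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: faa_registration
    expectation_suite_name: faa_registration_suite
```

Running the checkpoint validates the named batch against faa_registration_suite and refreshes Data Docs with the result.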
- great_expectations/data
- great_expectations/expectations/*.json
- great_expectations/uncommitted/data_docs/*
- great_expectations/uncommitted/*.ipynb
Resource: https://git-scm.com/docs/gitignore
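One plausible .gitignore for the paths listed above (an assumption about intent: raw data and generated notebooks/docs stay out of version control, while the suite JSON under expectations/ is usually committed so the team shares the expectations):

```
# Raw datasets and generated artifacts (adjust to your team's policy)
great_expectations/data/
great_expectations/uncommitted/data_docs/
great_expectations/uncommitted/*.ipynb
```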