Skip to content

Data Profiling and Basic Quality Assessment tool, for the very beginning phase of your project.

License

Notifications You must be signed in to change notification settings

otiai10/dataviper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

84 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation


dataviper

PyPI version GitHub Action codecov Python 3.6

dataviper is a SQL-based tool to get the basic data preparation done in easy way, with doing

  • Create "Data Profile" report of a table
  • One-hot encode for "Categorical Columns" and create a "one-hot" table
  • Check "Joinability" between 2 tables
  • // TODO: and more

Example

pip install dataviper

Your main.py will look like this

from dataviper import Client
from dataviper.source import MySQL

client = Client(source=MySQL({
    'host': 'localhost',
    'user': 'root',
    'password': 'xxxxxx',
    'database': 'your_database'
}))

with client.connect():
    profile = client.profile('Your_Table')
    profile.to_excel()
python3 main.py
# Then you will get 'profile_Your_Table.xlsx' 🤗

image

Why?

It's known that "Data Profiling" needs to be done with scanning all the rows in a table. If you try to do this naively by pandas or any libraries which internally use pandas, it's not avoidable to use bunch of memory of your local machine and freeze your work.

dataviper is a SQL-based Data Profiling tool, which simply and dynamically generates SQLs and lets the database machine do the annoying calculation.

With dataviper, you don't have to have massive local computer. All you need are a stable network and reachable SQL db.

Data Sources

You can choose your data source from

  • SQL Server
    • profile
    • pivot
    • joinability
    • histogram
  • MySQL
    • profile
    • pivot
    • joinability
    • histogram
  • PostgreSQL
  • CSV
    • profile
    • pivot
    • joinability
    • histogram
  • Excel

APIs

profile

Create "Data Profile" excel file of a specified table.

When you have Sales table like this

id region sales_type price rep_id
1 jp phone 240 115723
2 us web 90 125901
3 jp web 560 8003
4 us shop 920 182234
5 jp NULL 90 92231
6 us shop 180 100425
7 us shop 70 52934

do

with client.connect() as conn:
    table_name = 'Sales'
    profile = client.profile(table_name, example_count=3)
    profile.to_excel()

then you will get profile_Sales.xlsx file with

column_name data_type null_count null_% unique_count unique_% min max avg std example_top_3 example_last_3
id bigint 0 0 7 100.00 1 7 4.0 2.0 [1,2,3] [5,6,7]
region varchar 0 0 2 28.57 [jp,us,jp] [jp,us,us]
sales_type varchar 1 14.28 3 42.85 [phone,web,web] [None,shop,shop]
price int 0 0 6 85.71 70 920 307.1428 295.379 [240,90,560] [90,180,70]
rep_id int 0 0 7 100.00 8003 182234 96778.7142 51195.79065 [115723,125901,8003] [92231,100425,52934]

pivot

Spread categorical columns to N binary columns.

When you have Sales table like above, do

with client.connect() as conn:
    table_name = 'Sales'
    key = 'id'
    categorical_columns = ['region', 'sales_type']
    profile = client.get_schema(table_name)
    client.pivot(profile, key, categorical_columns)

then you will get Sales_PIVOT_YYYYmmddHHMM table with

id region_jp region_us sales_type_phone sales_type_web sales_type_shop
1 1 0 1 0 0
2 0 1 0 1 0
3 1 0 0 1 0
4 0 1 0 0 1
5 1 0 0 0 0
6 0 1 0 0 1
7 0 1 0 0 1

joinability

Count how much 2 tables can be joined.

When you have Sales table like above, and Reps table like this

id name tenure
8003 Hiromu 9
8972 Ochiai 6
52934 Taro 1
92231 otiai10 2
100425 Hanako 7
125901 Chika 3
182234 Mary 5
199621 Jack 1

do

with client.connect() as conn:
    report = client.joinability(on={'Sales': 'rep_id', 'Reps': 'id'})
    report.to_excel()

then you will get join_Sales_Reps.xlsx file with

table key total match match_% drop drop_%
Sales [rep_id] 7 6 85.714 1 14.285
Reps [id] 8 6 75.00 2 25.00

Issues and TODOs

About

Data Profiling and Basic Quality Assessment tool, for the very beginning phase of your project.

Resources

License

Stars

Watchers

Forks

Packages

No packages published