dataviper
is a SQL-based tool that gets basic data preparation done the easy way. It can:
- Create "Data Profile" report of a table
- One-hot encode "Categorical Columns" and create a "one-hot" table
- Check "Joinability" between 2 tables
- // TODO: and more
Install it with pip:

```sh
pip install dataviper
```
Your `main.py` will look like this:
```python
from dataviper import Client
from dataviper.source import MySQL

client = Client(source=MySQL({
    'host': 'localhost',
    'user': 'root',
    'password': 'xxxxxx',
    'database': 'your_database'
}))

with client.connect():
    profile = client.profile('Your_Table')
    profile.to_excel()
```
```sh
python3 main.py
# Then you will get 'profile_Your_Table.xlsx' 🤗
```
It's known that data profiling requires scanning all the rows in a table. If you try to do this naively with pandas, or any library that uses pandas internally, it will inevitably eat up a lot of memory on your local machine and freeze your work.
dataviper is a SQL-based data profiling tool: it simply and dynamically generates the SQL and lets the database server do the annoying calculation. With dataviper, you don't need a massive local machine. All you need is a stable network and a reachable SQL database.
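To make the contrast concrete, here is a rough sketch of the two approaches; the SQLAlchemy/pymysql setup and the SQL shown are illustrative only, not what dataviper generates internally:

```python
import pandas as pd
from sqlalchemy import create_engine  # illustrative setup only; any driver works

engine = create_engine('mysql+pymysql://root:xxxxxx@localhost/your_database')

# Naive approach: pull every row over the network and profile it locally.
# Memory usage grows with the size of the table.
df = pd.read_sql('SELECT * FROM Sales', engine)
print(df['price'].agg(['min', 'max', 'mean', 'std']))

# SQL-based approach: push the aggregation to the database and fetch only
# a tiny result set. This is the idea dataviper builds on.
stats = pd.read_sql(
    'SELECT MIN(price) AS min_price, MAX(price) AS max_price, '
    'AVG(price) AS avg_price, STDDEV(price) AS std_price FROM Sales',
    engine,
)
print(stats)
```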
You can choose your data source from the following (an example of switching sources comes right after the list):

- SQL Server
  - profile
  - pivot
  - joinability
  - histogram
- MySQL
  - profile
  - pivot
  - joinability
  - histogram
- PostgreSQL
- CSV
  - profile
  - pivot
  - joinability
  - histogram
- Excel
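Switching sources only changes the object you pass to Client. The import below is an assumption modeled on the MySQL example above; check the package for the exact class name and config keys of each source:

```python
from dataviper import Client
from dataviper.source import SQLServer  # assumed name, mirroring the MySQL import above

# Hypothetical SQL Server config following the same dict shape as the MySQL example.
client = Client(source=SQLServer({
    'host': 'localhost',
    'user': 'sa',
    'password': 'xxxxxx',
    'database': 'your_database'
}))
```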
Create "Data Profile" excel file of a specified table.
When you have Sales
table like this
id | region | sales_type | price | rep_id |
---|---|---|---|---|
1 | jp | phone | 240 | 115723 |
2 | us | web | 90 | 125901 |
3 | jp | web | 560 | 8003 |
4 | us | shop | 920 | 182234 |
5 | jp | NULL | 90 | 92231 |
6 | us | shop | 180 | 100425 |
7 | us | shop | 70 | 52934 |
do:

```python
with client.connect() as conn:
    table_name = 'Sales'
    profile = client.profile(table_name, example_count=3)
    profile.to_excel()
```
then you will get a `profile_Sales.xlsx` file with:
column_name | data_type | null_count | null_% | unique_count | unique_% | min | max | avg | std | example_top_3 | example_last_3 |
---|---|---|---|---|---|---|---|---|---|---|---|
id | bigint | 0 | 0 | 7 | 100.00 | 1 | 7 | 4.0 | 2.0 | [1,2,3] | [5,6,7] |
region | varchar | 0 | 0 | 2 | 28.57 | | | | | [jp,us,jp] | [jp,us,us] |
sales_type | varchar | 1 | 14.28 | 3 | 42.85 | | | | | [phone,web,web] | [None,shop,shop] |
price | int | 0 | 0 | 6 | 85.71 | 70 | 920 | 307.1428 | 295.379 | [240,90,560] | [90,180,70] |
rep_id | int | 0 | 0 | 7 | 100.00 | 8003 | 182234 | 96778.7142 | 51195.79065 | [115723,125901,8003] | [92231,100425,52934] |
Spread categorical columns into N binary columns (one-hot encoding).

When you have the `Sales` table like above, do:
```python
with client.connect() as conn:
    table_name = 'Sales'
    key = 'id'
    categorical_columns = ['region', 'sales_type']
    profile = client.get_schema(table_name)
    client.pivot(profile, key, categorical_columns)
```
then you will get a `Sales_PIVOT_YYYYmmddHHMM` table with:
id | region_jp | region_us | sales_type_phone | sales_type_web | sales_type_shop |
---|---|---|---|---|---|
1 | 1 | 0 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 1 | 0 |
3 | 1 | 0 | 0 | 1 | 0 |
4 | 0 | 1 | 0 | 0 | 1 |
5 | 1 | 0 | 0 | 0 | 0 |
6 | 0 | 1 | 0 | 0 | 1 |
7 | 0 | 1 | 0 | 0 | 1 |
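As a sanity check on a small sample, the same one-hot encoding can be reproduced locally with pandas; this is exactly the in-memory work that dataviper instead pushes to the database for large tables:

```python
import pandas as pd

# A small in-memory copy of the Sales table above.
sales = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'region': ['jp', 'us', 'jp', 'us', 'jp', 'us', 'us'],
    'sales_type': ['phone', 'web', 'web', 'shop', None, 'shop', 'shop'],
})

# One-hot encode the categorical columns; the NULL sales_type of row 5
# simply gets 0 in every sales_type_* column, matching the table above.
onehot = pd.get_dummies(sales, columns=['region', 'sales_type'], dtype=int)
print(onehot)
```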
Count how well 2 tables can be joined.

When you have the `Sales` table like above, and a `Reps` table like this:
id | name | tenure |
---|---|---|
8003 | Hiromu | 9 |
8972 | Ochiai | 6 |
52934 | Taro | 1 |
92231 | otiai10 | 2 |
100425 | Hanako | 7 |
125901 | Chika | 3 |
182234 | Mary | 5 |
199621 | Jack | 1 |
do:

```python
with client.connect() as conn:
    report = client.joinability(on={'Sales': 'rep_id', 'Reps': 'id'})
    report.to_excel()
```
then you will get a `join_Sales_Reps.xlsx` file with:
table | key | total | match | match_% | drop | drop_% |
---|---|---|---|---|---|---|
Sales | [rep_id] | 7 | 6 | 85.714 | 1 | 14.285 |
Reps | [id] | 8 | 6 | 75.00 | 2 | 25.00 |
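Judging from the numbers, the report counts, for each table, how many of its rows find at least one match on the join key. A tiny plain-Python check of the figures above:

```python
# Join-key values taken from the two tables above.
sales_rep_ids = [115723, 125901, 8003, 182234, 92231, 100425, 52934]
reps_ids = [8003, 8972, 52934, 92231, 100425, 125901, 182234, 199621]

# For each side, count the rows whose key appears on the other side.
sales_match = sum(1 for x in sales_rep_ids if x in set(reps_ids))  # 6 of 7
reps_match = sum(1 for x in reps_ids if x in set(sales_rep_ids))   # 6 of 8

print(sales_match / len(sales_rep_ids) * 100)  # 85.714... -> match_% for Sales
print(reps_match / len(reps_ids) * 100)        # 75.0      -> match_% for Reps
```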