-
Notifications
You must be signed in to change notification settings - Fork 0
Data Model
Swapna Y edited this page May 7, 2021
·
8 revisions
Data Warehouse consists of data zones.
- Landing Zone (tables and files)
- Processed Zone (Fact tables)
- Analytics data (Aggregation and query tables to help run business/analytics query faster)
- Contains files extracted from various data sources. csv files downloaded from sources and raw tables.
ls
CL.csv EXC.csv IPGP.csv MSCI.csv Unemployment_By_State_0.json
CLX.csv EXPD.csv IQV.csv MSFT.csv Unemployment_By_State_25.json
CMA.csv EXPE.csv IR.csv MSI.csv Unemployment_By_State_50.json
- Example raw file structure:Raw data is true to source.Data is either not cleaned at all or minimally cleaned so that can be dumped in to raw tables.
mysql> describe covid_usa_raw;
+-----------------+--------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+--------+------+-----+---------+-------+
| submission_date | text | YES | | NULL | |
| state | text | YES | | NULL | |
| tot_cases | int | YES | | NULL | |
| conf_cases | double | NO | | NULL | |
| prob_cases | double | NO | | NULL | |
| new_case | int | YES | | NULL | |
| pnew_case | double | NO | | NULL | |
| tot_death | int | YES | | NULL | |
| conf_death | double | NO | | NULL | |
| prob_death | double | NO | | NULL | |
| new_death | int | YES | | NULL | |
| pnew_death | double | NO | | NULL | |
| created_at | text | YES | | NULL | |
| consent_cases | text | YES | | NULL | |
| consent_deaths | text | YES | | NULL | |
+-----------------+--------+------+-----+---------+-------+
15 rows in set (0.00 sec)
describe unemployment_rate_by_state_raw;
+--------------------------------+--------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------------+--------+------+-----+---------+-------+
| message | text | YES | | NULL | |
| responseTime | bigint | YES | | NULL | |
| status | text | YES | | NULL | |
| Results_series_seriesID | text | YES | | NULL | |
| Results_series_data_period | text | YES | | NULL | |
| Results_series_data_periodName | text | YES | | NULL | |
| Results_series_data_value | text | YES | | NULL | |
| Results_series_data_year | text | YES | | NULL | |
+--------------------------------+--------+------+-----+---------+-------+
*** minimally transformed json data:
----------------+
| message | responseTime | status | Results_series_seriesID | Results_series_data_period | Results_series_data_periodName | Results_series_data_value | Results_series_data_year |
+----------------------------------------------------------------------+--------------+-------------------+-------------------------+----------------------------+--------------------------------+---------------------------+--------------------------+
| Year range has been reduced to the system-allowed limit of 10 years. | 255 | REQUEST_SUCCEEDED | ST5600000000000 | M12 | December | 5.0 | 2016 |
| Year range has been reduced to the system-allowed limit of 10 years. | 255 | REQUEST_SUCCEEDED | ST5600000000000 | M11 | November | 5.1 | 2016 |
Kimball's dimensional modeling is used.It is also called Star Schema data model. Main advantages being:
- Queries are simpler
- Easier business insights reporting
- Better-performing queries:
- Provides data to OLAP systems
- Allows n number of data sources without being complex Project Data model can be found here
Transformed fact table after cleaning:
describe unemployment_rate_by_state_fact;
+-------------------+-----------------+------+-----+-------------------+-------------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+-----------------+------+-----+-------------------+-------------------+
| id | bigint unsigned | NO | PRI | NULL | auto_increment |
| date_id | bigint unsigned | YES | MUL | NULL | |
| state_id | bigint unsigned | YES | MUL | NULL | |
| unemployment_rate | decimal(4,2) | YES | | NULL | |
| submission_date | timestamp | NO | | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
+-------------------+-----------------+------+-----+-------------------+-------------------+
***Transformed data
select * from unemployment_rate_by_state_fact limit 2;
+------+---------+----------+-------------------+---------------------+
| id | date_id | state_id | unemployment_rate | submission_date |
+------+---------+----------+-------------------+---------------------+
| 8075 | 144 | 39 | 3.00 | 2007-12-01 00:00:00 |
| 8076 | 246 | 39 | 3.20 | 2016-06-01 00:00:00 |
+------+---------+----------+-------------------+---------------------+
Example Star schema from By SqlPac at English Wikipedia, CC BY-SA 3.0, Link
These are the aggregated data tables.For example in current project Table Home_Prices_fact
has millions of rows and average time to run queries is very high. So we created tables with aggregated data so queries could be run on the intermediate tables instead.
select * from home_prices_monthly where state='CA' limit 3;
+------+-------+------------+----------------+-----------+------+-------+----------------+
| id | state | state_name | inventory_type | avg_price | year | month | inventory_date |
+------+-------+------------+----------------+-----------+------+-------+----------------+
| 8233 | CA | California | single_family | 491028.00 | 2020 | 10 | 2020-10-31 |
| 8234 | CA | California | bottom_tier | 376970.00 | 2020 | 4 | 2020-04-30 |
| 8235 | CA | California | mid_tier | 447156.00 | 2019 | 6 | 2019-06-29 |
+------+-------+------------+----------------+-----------+------+-------+----------------+