Skip to content

Data Model

Swapna Y edited this page May 7, 2021 · 8 revisions

Data Warehouse consists of data zones.

  • Landing Zone (tables and files)
  • Processed Zone (Fact tables)
  • Analytics data (Aggregation and query tables to help run business/analytics query faster)

Landing Zone

  • Contains files extracted from various data sources. csv files downloaded from sources and raw tables.
ls
CL.csv				EXC.csv				IPGP.csv			MSCI.csv						Unemployment_By_State_0.json
CLX.csv				EXPD.csv			IQV.csv				MSFT.csv						Unemployment_By_State_25.json
CMA.csv				EXPE.csv			IR.csv				MSI.csv							Unemployment_By_State_50.json
  • Example raw file structure:Raw data is true to source.Data is either not cleaned at all or minimally cleaned so that can be dumped in to raw tables.
mysql> describe covid_usa_raw;
+-----------------+--------+------+-----+---------+-------+
| Field           | Type   | Null | Key | Default | Extra |
+-----------------+--------+------+-----+---------+-------+
| submission_date | text   | YES  |     | NULL    |       |
| state           | text   | YES  |     | NULL    |       |
| tot_cases       | int    | YES  |     | NULL    |       |
| conf_cases      | double | NO   |     | NULL    |       |
| prob_cases      | double | NO   |     | NULL    |       |
| new_case        | int    | YES  |     | NULL    |       |
| pnew_case       | double | NO   |     | NULL    |       |
| tot_death       | int    | YES  |     | NULL    |       |
| conf_death      | double | NO   |     | NULL    |       |
| prob_death      | double | NO   |     | NULL    |       |
| new_death       | int    | YES  |     | NULL    |       |
| pnew_death      | double | NO   |     | NULL    |       |
| created_at      | text   | YES  |     | NULL    |       |
| consent_cases   | text   | YES  |     | NULL    |       |
| consent_deaths  | text   | YES  |     | NULL    |       |
+-----------------+--------+------+-----+---------+-------+
15 rows in set (0.00 sec)
describe unemployment_rate_by_state_raw;
+--------------------------------+--------+------+-----+---------+-------+
| Field                          | Type   | Null | Key | Default | Extra |
+--------------------------------+--------+------+-----+---------+-------+
| message                        | text   | YES  |     | NULL    |       |
| responseTime                   | bigint | YES  |     | NULL    |       |
| status                         | text   | YES  |     | NULL    |       |
| Results_series_seriesID        | text   | YES  |     | NULL    |       |
| Results_series_data_period     | text   | YES  |     | NULL    |       |
| Results_series_data_periodName | text   | YES  |     | NULL    |       |
| Results_series_data_value      | text   | YES  |     | NULL    |       |
| Results_series_data_year       | text   | YES  |     | NULL    |       |
+--------------------------------+--------+------+-----+---------+-------+

*** minimally transformed json data:


----------------+
| message                                                              | responseTime | status            | Results_series_seriesID | Results_series_data_period | Results_series_data_periodName | Results_series_data_value | Results_series_data_year |
+----------------------------------------------------------------------+--------------+-------------------+-------------------------+----------------------------+--------------------------------+---------------------------+--------------------------+
| Year range has been reduced to the system-allowed limit of 10 years. |          255 | REQUEST_SUCCEEDED | ST5600000000000         | M12                        | December                       | 5.0                       | 2016                     |
| Year range has been reduced to the system-allowed limit of 10 years. |          255 | REQUEST_SUCCEEDED | ST5600000000000         | M11                        | November                       | 5.1                       | 2016                     |

Transformed data:

Kimball's dimensional modeling is used.It is also called Star Schema data model. Main advantages being:

  • Queries are simpler
  • Easier business insights reporting
  • Better-performing queries:
  • Provides data to OLAP systems
  • Allows n number of data sources without being complex Project Data model can be found here

Transformed fact table after cleaning:

describe unemployment_rate_by_state_fact;
+-------------------+-----------------+------+-----+-------------------+-------------------+
| Field             | Type            | Null | Key | Default           | Extra             |
+-------------------+-----------------+------+-----+-------------------+-------------------+
| id                | bigint unsigned | NO   | PRI | NULL              | auto_increment    |
| date_id           | bigint unsigned | YES  | MUL | NULL              |                   |
| state_id          | bigint unsigned | YES  | MUL | NULL              |                   |
| unemployment_rate | decimal(4,2)    | YES  |     | NULL              |                   |
| submission_date   | timestamp       | NO   |     | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
+-------------------+-----------------+------+-----+-------------------+-------------------+

***Transformed data

 select * from unemployment_rate_by_state_fact limit 2;
+------+---------+----------+-------------------+---------------------+
| id   | date_id | state_id | unemployment_rate | submission_date     |
+------+---------+----------+-------------------+---------------------+
| 8075 |     144 |       39 |              3.00 | 2007-12-01 00:00:00 |
| 8076 |     246 |       39 |              3.20 | 2016-06-01 00:00:00 |
+------+---------+----------+-------------------+---------------------+

Example Star schema from By SqlPac at English Wikipedia, CC BY-SA 3.0, Link Wikipedia

Analytics data:

These are the aggregated data tables.For example in current project Table Home_Prices_fact has millions of rows and average time to run queries is very high. So we created tables with aggregated data so queries could be run on the intermediate tables instead.

select * from home_prices_monthly where state='CA' limit 3;
+------+-------+------------+----------------+-----------+------+-------+----------------+
| id   | state | state_name | inventory_type | avg_price | year | month | inventory_date |
+------+-------+------------+----------------+-----------+------+-------+----------------+
| 8233 | CA    | California | single_family  | 491028.00 | 2020 |    10 | 2020-10-31     |
| 8234 | CA    | California | bottom_tier    | 376970.00 | 2020 |     4 | 2020-04-30     |
| 8235 | CA    | California | mid_tier       | 447156.00 | 2019 |     6 | 2019-06-29     |
+------+-------+------------+----------------+-----------+------+-------+----------------+