Incremental, partitioned session dim and fact tables. Allows for dimensional modeling on very large installs #175

Merged 8 commits on Apr 12, 2023
3 changes: 2 additions & 1 deletion README.md
@@ -25,7 +25,8 @@ Features include:
| stg_ga4__session_conversions_daily | Produces daily counts of conversions per session. The list of conversion events to include is configurable (see documentation below) |
| stg_ga4__sessions_traffic_sources | Finds the first source, medium, campaign, content, paid search term (from UTM tracking), and default channel grouping for each session. |
| dim_ga4__user_pseudo_ids | Dimension table for user devices as indicated by user_pseudo_ids. Contains attributes such as first and last page viewed.|
-| dim_ga4__sessions | Dimension table for sessions which contains useful attributes such as geography, device information, and acquisition data |
+| dim_ga4__sessions | Dimension table for sessions which contains useful attributes such as geography, device information, and acquisition data. Can be expensive to run on large installs (see `dim_ga4__sessions_daily`) |
+| dim_ga4__sessions_daily | Query-optimized session dimension table that is incremental and partitioned on date. Assumes that each partition is contained within a single day |
| fct_ga4__pages | Fact table for pages which aggregates common page metrics by page_location, date, and hour. |
| fct_ga4__sessions_daily | Fact table for session metrics, partitioned by date. A single session may span multiple rows given that sessions can span multiple days. |
| fct_ga4__sessions | Fact table that aggregates session metrics across days. This table is not partitioned, so be mindful of performance/cost when querying. |
6 changes: 0 additions & 6 deletions models/marts/core/core.yml
@@ -19,12 +19,6 @@ models:
       - name: session_key
         tests:
           - unique
-  - name: fct_ga4__sessions_daily
-    description: Incremental session metrics model providing aggregate metrics per day such as number of pageviews and event value accrued. Each row represents 1 day of metrics for a single session.
-    columns:
-      - name: session_partition_key
-        tests:
-          - unique
   - name: fct_ga4__pages
     description: Incremental model with page metrics such as visits, users, new_users, entrances and exits as well as configurable conversion counts. Each row is grouped by page_location, event_date_dt, and hour.
     columns:
185 changes: 185 additions & 0 deletions models/marts/core/dim_ga4__sessions_daily.sql
@@ -0,0 +1,185 @@
{% if var('static_incremental_days', false ) %}
{% set partitions_to_replace = ['current_date'] %}
{% for i in range(var('static_incremental_days')) %}
{% set partitions_to_replace = partitions_to_replace.append('date_sub(current_date, interval ' + (i+1)|string + ' day)') %}
{% endfor %}
{{
config(
materialized = 'incremental',
incremental_strategy = 'insert_overwrite',
tags = ["incremental"],
partition_by={
"field": "session_partition_date",
"data_type": "date",
"granularity": "day"
},
partitions = partitions_to_replace
)
}}
{% else %}
{{
config(
materialized = 'incremental',
incremental_strategy = 'insert_overwrite',
tags = ["incremental"],
partition_by={
"field": "session_partition_date",
"data_type": "date",
"granularity": "day"
}
)
}}
{% endif %}

with event_dimensions as
(
select
user_pseudo_id,
session_key,
session_partition_key,
event_date_dt as session_partition_date,
event_timestamp,
page_path,
page_location,
page_hostname,
page_referrer,
geo_continent,
geo_country,
geo_region,
geo_city,
geo_sub_continent,
geo_metro,
stream_id,
platform,
device_category,
device_mobile_brand_name,
device_mobile_model_name,
device_mobile_marketing_name,
device_mobile_os_hardware_model,
device_operating_system,
device_operating_system_version,
device_vendor_id,
device_advertising_id,
device_language,
device_is_limited_ad_tracking,
device_time_zone_offset_seconds,
device_browser,
device_web_info_browser,
device_web_info_browser_version,
device_web_info_hostname,
user_campaign,
user_medium,
user_source,
from {{ref('stg_ga4__events')}}
where event_name != 'first_visit'
and event_name != 'session_start'
{% if is_incremental() %}
{% if var('static_incremental_days', false ) %}
and event_date_dt in ({{ partitions_to_replace | join(',') }})
{% else %}
and event_date_dt >= _dbt_max_partition
{% endif %}
{% endif %}
)
,traffic_sources as (
select
session_partition_key,
session_source,
session_medium,
session_campaign,
session_content,
session_term,
session_default_channel_grouping,
session_source_category
from {{ref('stg_ga4__sessions_traffic_sources_daily')}}
where 1=1
{% if is_incremental() %}
{% if var('static_incremental_days', false ) %}
and session_partition_date in ({{ partitions_to_replace | join(',') }})
{% else %}
and session_partition_date >= _dbt_max_partition
{% endif %}
{% endif %}
)
{% if var('derived_session_properties', false) %}
,session_properties as (
select
* except (session_partition_date)
from {{ref('stg_ga4__derived_session_properties_daily')}}
where 1=1
{% if is_incremental() %}
{% if var('static_incremental_days', false ) %}
and session_partition_date in ({{ partitions_to_replace | join(',') }})
{% else %}
and session_partition_date >= _dbt_max_partition
{% endif %}
{% endif %}
)
{% endif %}
,session_dimensions as
(
select
distinct -- Distinct call will, in effect, group by session_partition_key
stream_id
,session_key
,session_partition_key
,session_partition_date
,FIRST_VALUE(event_timestamp IGNORE NULLS) OVER (session_partition_window) AS session_partition_start_timestamp
,FIRST_VALUE(page_path IGNORE NULLS) OVER (session_partition_window) AS landing_page_path
,FIRST_VALUE(page_location IGNORE NULLS) OVER (session_partition_window) AS landing_page_location
,FIRST_VALUE(page_hostname IGNORE NULLS) OVER (session_partition_window) AS landing_page_hostname
,FIRST_VALUE(page_referrer IGNORE NULLS) OVER (session_partition_window) AS referrer
,FIRST_VALUE(geo_continent IGNORE NULLS) OVER (session_partition_window) AS geo_continent
,FIRST_VALUE(geo_country IGNORE NULLS) OVER (session_partition_window) AS geo_country
,FIRST_VALUE(geo_region IGNORE NULLS) OVER (session_partition_window) AS geo_region
,FIRST_VALUE(geo_city IGNORE NULLS) OVER (session_partition_window) AS geo_city
,FIRST_VALUE(geo_sub_continent IGNORE NULLS) OVER (session_partition_window) AS geo_sub_continent
,FIRST_VALUE(geo_metro IGNORE NULLS) OVER (session_partition_window) AS geo_metro
,FIRST_VALUE(platform IGNORE NULLS) OVER (session_partition_window) AS platform
,FIRST_VALUE(device_category IGNORE NULLS) OVER (session_partition_window) AS device_category
,FIRST_VALUE(device_mobile_brand_name IGNORE NULLS) OVER (session_partition_window) AS device_mobile_brand_name
,FIRST_VALUE(device_mobile_model_name IGNORE NULLS) OVER (session_partition_window) AS device_mobile_model_name
,FIRST_VALUE(device_mobile_marketing_name IGNORE NULLS) OVER (session_partition_window) AS device_mobile_marketing_name
,FIRST_VALUE(device_mobile_os_hardware_model IGNORE NULLS) OVER (session_partition_window) AS device_mobile_os_hardware_model
,FIRST_VALUE(device_operating_system IGNORE NULLS) OVER (session_partition_window) AS device_operating_system
,FIRST_VALUE(device_operating_system_version IGNORE NULLS) OVER (session_partition_window) AS device_operating_system_version
,FIRST_VALUE(device_vendor_id IGNORE NULLS) OVER (session_partition_window) AS device_vendor_id
,FIRST_VALUE(device_advertising_id IGNORE NULLS) OVER (session_partition_window) AS device_advertising_id
,FIRST_VALUE(device_language IGNORE NULLS) OVER (session_partition_window) AS device_language
,FIRST_VALUE(device_is_limited_ad_tracking IGNORE NULLS) OVER (session_partition_window) AS device_is_limited_ad_tracking
,FIRST_VALUE(device_time_zone_offset_seconds IGNORE NULLS) OVER (session_partition_window) AS device_time_zone_offset_seconds
,FIRST_VALUE(device_browser IGNORE NULLS) OVER (session_partition_window) AS device_browser
,FIRST_VALUE(device_web_info_browser IGNORE NULLS) OVER (session_partition_window) AS device_web_info_browser
,FIRST_VALUE(device_web_info_browser_version IGNORE NULLS) OVER (session_partition_window) AS device_web_info_browser_version
,FIRST_VALUE(device_web_info_hostname IGNORE NULLS) OVER (session_partition_window) AS device_web_info_hostname
,FIRST_VALUE(user_campaign IGNORE NULLS) OVER (session_partition_window) AS user_campaign
,FIRST_VALUE(user_medium IGNORE NULLS) OVER (session_partition_window) AS user_medium
,FIRST_VALUE(user_source IGNORE NULLS) OVER (session_partition_window) AS user_source
from event_dimensions
WINDOW session_partition_window AS (PARTITION BY session_partition_key ORDER BY event_timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
,join_traffic_source as (
select
session_dimensions.*,
session_source,
session_medium,
session_campaign,
session_content,
session_term,
session_default_channel_grouping,
session_source_category
from session_dimensions
left join traffic_sources sessions_traffic_sources using (session_partition_key)
)
,join_session_properties as (
select
*
from join_traffic_source
{% if var('derived_session_properties', false) %}
-- If derived session properties have been assigned as variables, join them on the session_partition_key
left join session_properties using (session_partition_key)
{% endif %}
)

-- Collapse exact duplicate rows left over after the joins
select distinct * from join_session_properties
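
To make the static branch concrete, here is a sketch of what the incremental filter in `event_dimensions` might render to, assuming `static_incremental_days` is set to 2. The project and dataset names are hypothetical placeholders for whatever `ref('stg_ga4__events')` resolves to in a given environment, and the select list is shortened for readability. Because `range(2)` yields 0 and 1, the list holds `current_date` plus two `date_sub` expressions, so three daily partitions are replaced on each run.

```sql
-- Sketch only: assumes static_incremental_days = 2 and an incremental run.
-- `my_project.my_dataset` is a hypothetical location, not taken from the PR.
select
    session_partition_key,
    event_date_dt as session_partition_date
from `my_project.my_dataset.stg_ga4__events`
where event_name != 'first_visit'
    and event_name != 'session_start'
    -- Rendered from: partitions_to_replace | join(',')
    and event_date_dt in (current_date,date_sub(current_date, interval 1 day),date_sub(current_date, interval 2 day))
```

When the variable is unset, the `else` branch applies instead and the filter becomes `event_date_dt >= _dbt_max_partition`, so only partitions at or after the latest existing one are rebuilt.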
22 changes: 22 additions & 0 deletions models/marts/core/dim_ga4__sessions_daily.yml
@@ -0,0 +1,22 @@
version: 2

models:
  - name: dim_ga4__sessions_daily
    description: >
      Incremental, partitioned dimension table for session partitions. Partitioned on session_partition_date for improved query optimization when filtering on date.
      Contains context useful for filtering sessions such as acquisition source, medium, and campaign.
      Each row represents a daily session partition (as opposed to a session).
      Unique on session_partition_key.
    columns:
      - name: session_partition_key
        description: >
          Unique key assigned to session partitions, which are daily partitions of a session. In GA4, sessions can span multiple days.
          To improve query performance, it's easier to work with 'session partitions', which can be filtered/partitioned by date.
        tests:
          - unique
      - name: session_key
        description: >
          Unique key assigned to sessions. Sessions can span multiple dates/partitions.
      - name: session_partition_date
        description: >
          Date associated with the session_partition_key. Used to partition the table. Filter on this column to optimize query cost and performance.
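
As a rough illustration of the filtering advice in the column descriptions, the query below joins the daily dimension and fact tables and restricts both to the same date range so that BigQuery can prune partitions on each side. The project and dataset names and the date range are hypothetical, and the sketch assumes `session_partition_count_page_views` is passed through to the final output of `fct_ga4__sessions_daily`, which this diff does not show.

```sql
-- Sketch only: `my_project.my_dataset` and the dates are placeholders.
select
    dim.session_default_channel_grouping,
    count(distinct dim.session_key) as sessions,  -- a session spanning two days is counted once
    sum(fct.session_partition_count_page_views) as page_views
from `my_project.my_dataset.dim_ga4__sessions_daily` as dim
inner join `my_project.my_dataset.fct_ga4__sessions_daily` as fct
    on fct.session_partition_key = dim.session_partition_key
where dim.session_partition_date between date '2023-04-01' and date '2023-04-07'
    -- Repeat the date filter on the fact table so its partitions are pruned too
    and fct.session_partition_date between date '2023-04-01' and date '2023-04-07'
group by dim.session_default_channel_grouping
order by sessions desc
```

Joining on `session_partition_key` rather than `session_key` keeps the join one-to-one, since both tables are tested as unique on that key.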
60 changes: 44 additions & 16 deletions models/marts/core/fct_ga4__sessions_daily.sql
@@ -1,34 +1,57 @@
-{{
-config(
-materialized = 'incremental',
-incremental_strategy = 'insert_overwrite',
-tags = ["incremental"],
-partition_by={
-"field": "session_partition_date",
-"data_type": "date",
-"granularity": "day"
-}
-)
-}}
+{% if var('static_incremental_days', false ) %}
+{% set partitions_to_replace = ['current_date'] %}
+{% for i in range(var('static_incremental_days')) %}
+{% set partitions_to_replace = partitions_to_replace.append('date_sub(current_date, interval ' + (i+1)|string + ' day)') %}
+{% endfor %}
+{{
+config(
+materialized = 'incremental',
+incremental_strategy = 'insert_overwrite',
+tags = ["incremental"],
+partition_by={
+"field": "session_partition_date",
+"data_type": "date",
+"granularity": "day"
+},
+partitions = partitions_to_replace
+)
+}}
+{% else %}
+{{
+config(
+materialized = 'incremental',
+incremental_strategy = 'insert_overwrite',
+tags = ["incremental"],
+partition_by={
+"field": "session_partition_date",
+"data_type": "date",
+"granularity": "day"
+}
+)
+}}
+{% endif %}

with session_metrics as (
select
session_key,
session_partition_key,
user_pseudo_id,
stream_id,
-min(event_date_dt) as session_partition_date, -- Used only as a method of partitioning sessions within this incremental table. Does not represent the true session start date
+min(event_date_dt) as session_partition_date, -- Date of the session partition. Not the true session start date, since a GA4 session can span multiple days
min(event_timestamp) as session_partition_min_timestamp,
countif(event_name = 'page_view') as session_partition_count_page_views,
countif(event_name = 'purchase') as session_partition_count_purchases,
sum(event_value_in_usd) as session_partition_sum_event_value_in_usd,
ifnull(max(session_engaged), 0) as session_partition_max_session_engaged,
sum(engagement_time_msec) as session_partition_sum_engagement_time_msec
from {{ref('stg_ga4__events')}}
--- Give 1 extra day to ensure we begin aggregation at the start of a session
where session_key is not null
{% if is_incremental() %}
-and event_date_dt >= DATE_SUB(_dbt_max_partition, INTERVAL 1 DAY)
+{% if var('static_incremental_days', false ) %}
+and event_date_dt in ({{ partitions_to_replace | join(',') }})
+{% else %}
+and event_date_dt >= _dbt_max_partition
+{% endif %}
{% endif %}
group by 1,2,3,4
)
@@ -38,8 +61,13 @@ with session_metrics as (
,
session_conversions as (
select * from {{ref('stg_ga4__session_conversions_daily')}}
+where 1=1
{% if is_incremental() %}
-where session_partition_date >= DATE_SUB(_dbt_max_partition, INTERVAL 1 DAY)
+{% if var('static_incremental_days', false ) %}
+and session_partition_date in ({{ partitions_to_replace | join(',') }})
+{% else %}
+and session_partition_date >= _dbt_max_partition
+{% endif %}
{% endif %}
),
join_metrics_and_conversions as (
19 changes: 19 additions & 0 deletions models/marts/core/fct_ga4__sessions_daily.yml
@@ -0,0 +1,19 @@
version: 2

models:
  - name: fct_ga4__sessions_daily
    description: >
      Incremental fact table with metrics related to daily session partitions.
    columns:
      - name: session_partition_key
        description: >
          Unique key assigned to session partitions, which are daily partitions of a session. In GA4, sessions can span multiple days.
          To improve query performance, it's easier to work with 'session partitions', which can be filtered/partitioned by date.
        tests:
          - unique
      - name: session_key
        description: >
          Unique key assigned to sessions. Sessions can span multiple dates/partitions.
      - name: session_partition_date
        description: >
          Date associated with the session_partition_key. Used to partition the table. Filter on this column to optimize query cost and performance.
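
Because a session can span several partitions, session-level totals have to be rolled up from the daily rows. The following is a minimal sketch of that roll-up, assuming the metric columns defined in the `session_metrics` CTE are present in the model's final output (the diff does not show the final select). The project and dataset names and the seven-day window are hypothetical.

```sql
-- Sketch only: `my_project.my_dataset` and the lookback window are placeholders.
select
    session_key,
    min(session_partition_date) as first_partition_date,
    count(*) as partition_count,  -- greater than 1 when the session crosses midnight
    sum(session_partition_count_page_views) as count_page_views,
    sum(session_partition_sum_engagement_time_msec) as sum_engagement_time_msec
from `my_project.my_dataset.fct_ga4__sessions_daily`
-- Filtering on the partition column limits the scan to recent partitions
where session_partition_date >= date_sub(current_date(), interval 7 day)
group by session_key
```

A session that began before the window is only partially represented in this result, so widen the filter by a day if complete sessions matter.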