Feature: Add optional lookback_interval param to insert_by_period materialization #394

Closed
wants to merge 23 commits into from
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -2,4 +2,4 @@
target/
dbt_modules/
logs/
venv/
venv/
10 changes: 7 additions & 3 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,11 @@
# dbt-utils v0.7.0 (unreleased)
# dbt-utils Next
## Features
* Add new `lookback_interval` parameter to the [`insert_by_period`](https://github.com/dbt-labs/dbt-utils#insert_by_period-source) materialization ([#394](https://github.com/dbt-labs/dbt-utils/pull/394) from [@GClunies](https://github.com/GClunies)). This allows the materialization to be used with sessionization models built by the [dbt Labs Segment package](https://github.com/dbt-labs/segment).

# dbt-utils v0.7.0
## Breaking changes

### 🚨 New dbt version

dbt v0.20.0 or greater is required for this release. If you are not ready to upgrade, consider using a previous release of this package. In accordance with the version upgrade, this package release includes breaking changes to:
- Generic (schema) tests
- `dispatch` functionality
@@ -45,7 +48,8 @@ If you were relying on the position to match up your optional arguments, this ma
* Add `slugify` macro, and use it in the pivot macro. :rotating_light: This macro uses the `re` module, which is only available in dbt v0.19.0+. As a result, this feature introduces a breaking change. ([#314](https://github.com/fishtown-analytics/dbt-utils/pull/314))

## Under the hood
* Update the default implementation of concat macro to use `||` operator ([#373](https://github.com/fishtown-analytics/dbt-utils/pull/314) [@ChristopheDuong](https://github.com/ChristopheDuong)). Note this may be a breaking change for spark users.
* Update the default implementation of concat macro to use `||` operator ([#373](https://github.com/fishtown-analytics/dbt-utils/pull/373) from [@ChristopheDuong](https://github.com/ChristopheDuong)). Note this may be a breaking change for adapters that support `concat()` but not `||`, such as Apache Spark.
- Use `power()` instead of `pow()` in `generate_series()` and `haversine_distance()` as they are synonyms in most SQL dialects, but some dialects only have `power()` ([#354](https://github.com/fishtown-analytics/dbt-utils/pull/354) from [@swanderz](https://github.com/swanderz))

# dbt-utils v0.6.6

96 changes: 93 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -542,7 +542,7 @@ Arguments:
```

```sql
-- Returns the list sorted my most recently observed
-- Returns the list sorted by most recently observed
{% set payment_methods = dbt_utils.get_column_values(
table=ref('stg_payments'),
column='payment_method',
@@ -1049,7 +1049,7 @@ Should a run of a model using this materialization be interrupted, a subsequent

Progress is logged in the command line for easy monitoring.

**Usage:**
##### Simple usage:
```sql
{{
config(
@@ -1072,14 +1072,104 @@ with events as (

```

##### Usage with dbt Segment package models:
Sessionization models built by the [dbt Segment package](https://github.com/dbt-labs/segment) (or similar approaches) can be very large and complex, in particular the [`segment_web_page_views__sessionized`](https://github.com/dbt-labs/segment/blob/master/models/sessionization/segment_web_page_views__sessionized.sql) model. This makes them natural candidates for the `insert_by_period` materialization during initial builds.

Unfortunately, the simple usage shown above will not work for the `segment_web_page_views__sessionized` model. This is caused by some [unusually complex SQL](https://github.com/dbt-labs/segment/blob/master/models/sessionization/segment_web_page_views__sessionized.sql) that handles re-sessionization of user page views in the model logic:
```sql
-- filename: segment_web_page_views__sessionized.sql

with pageviews as (
/*
This CTE selects ALL page views for users seen some specified number of
hours (set by the `segment_sessionization_trailing_window` variable)
before the last timestamp in the {{ this }} model. These users' page views
are up for re-sessionization.
*/
select * from {{ref('segment_web_page_views')}}

{% if is_incremental() %}

where anonymous_id in (

select distinct anonymous_id

from {{ref('segment_web_page_views')}}
where cast(tstamp as datetime) >= (

select
{{ dbt_utils.dateadd(
'hour',
-var('segment_sessionization_trailing_window'),
'max(tstamp)'
) }}

from {{ this }})

)

{% endif %}

),

... sessionization logic continues ...

select * from session_ids
```
To apply the `insert_by_period` materialization to the `segment_web_page_views__sessionized` model, set the optional `lookback_interval` parameter, which extends each period's window to look back an additional interval before the start of the period. This is analogous to the `segment_sessionization_trailing_window` variable used by the dbt Segment package. The SQL above can then be replaced with:
```sql
-- filename: segment_web_page_views__sessionized.sql

{{
config(
materialized = "insert_by_period",
period = "week",
lookback_interval = "2 days",
timestamp_field = "tstamp",
start_date = "2018-01-01",
stop_date = "2018-06-01",
unique_key = "id",
sort = "tstamp",
dist = "id"
)
}}

with pageviews as (

select *

from {{ref('segment_web_page_views')}}
where anonymous_id in (

select distinct anonymous_id

from {{ref('segment_web_page_views')}}
where __PERIOD_FILTER_WITH_LOOKBACK__ -- This will be replaced with a filter in the materialization code that includes the additional lookback interval.

)

),

... sessionization logic continues ...

-- You can still use __PERIOD_FILTER__ in the same model too
select *
from session_ids
where __PERIOD_FILTER__
```
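
As a rough illustration of the placeholder mechanism, the Python sketch below substitutes both placeholders with timestamp predicates for a single period. It is not the materialization's code: the helper `render_period_sql`, the half-open window bounds, and the timestamp format are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def render_period_sql(sql, timestamp_field, period_start, period_end, lookback):
    """Replace both placeholders with concrete predicates for one period.

    Assumption: windows are half-open, [start, end). The real macro may
    use different bounds or casts.
    """
    fmt = "%Y-%m-%d %H:%M:%S"

    def window(start):
        return (
            f"({timestamp_field} >= '{start.strftime(fmt)}' "
            f"and {timestamp_field} < '{period_end.strftime(fmt)}')"
        )

    # The lookback filter starts earlier than the plain period filter.
    sql = sql.replace("__PERIOD_FILTER_WITH_LOOKBACK__", window(period_start - lookback))
    sql = sql.replace("__PERIOD_FILTER__", window(period_start))
    return sql


template = (
    "select * from pageviews where anonymous_id in ("
    "select anonymous_id from pageviews where __PERIOD_FILTER_WITH_LOOKBACK__"
    ") and __PERIOD_FILTER__"
)

rendered = render_period_sql(
    template,
    timestamp_field="tstamp",
    period_start=datetime(2018, 1, 8),
    period_end=datetime(2018, 1, 15),
    lookback=timedelta(days=2),
)
print(rendered)
```

The two placeholder names do not collide under plain string replacement, because `__PERIOD_FILTER__` is not a substring of `__PERIOD_FILTER_WITH_LOOKBACK__`.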

**Configuration values:**
* `period`: period to break the model into, must be a valid [datepart](https://docs.aws.amazon.com/redshift/latest/dg/r_Dateparts_for_datetime_functions.html) (default='Week')
* `timestamp_field`: the column name of the timestamp field that will be used to break the model into smaller queries
* `start_date`: literal date or timestamp - generally choose a date that is earlier than the start of your data
* `stop_date`: literal date or timestamp (default=current_timestamp)
* `lookback_interval`: an additional interval to look back before the start of each period during materialization. Must be a valid interval literal (e.g., `'2 days'`). The value should be:
  * *Equivalent in time* to the `segment_sessionization_trailing_window` used in the Segment dbt package.
  * Greater than the maximum session duration you would reasonably expect in your users' sessions.
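
To see what `lookback_interval` does to each period, the following Python sketch enumerates the windows for the example config above (weekly periods from 2018-01-01 to 2018-06-01 with a 2 day lookback). The function `period_windows` and its half-open windows are assumptions for illustration; the materialization builds its filters in SQL and its exact bounds may differ.

```python
from datetime import datetime, timedelta


def period_windows(start, stop, period, lookback):
    """Return (lookback_start, period_start, period_end) for each period.

    Assumption: the final period is truncated at `stop`.
    """
    windows = []
    period_start = start
    while period_start < stop:
        period_end = min(period_start + period, stop)
        windows.append((period_start - lookback, period_start, period_end))
        period_start = period_end
    return windows


windows = period_windows(
    start=datetime(2018, 1, 1),
    stop=datetime(2018, 6, 1),
    period=timedelta(weeks=1),
    lookback=timedelta(days=2),
)

for lookback_start, period_start, period_end in windows[:3]:
    print(lookback_start.date(), period_start.date(), period_end.date())
```

Each period's lookback window overlaps the tail of the previous period, which is what lets sessions that straddle a period boundary be recomputed with their earlier page views in scope.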

**Caveats:**
* This materialization is compatible with dbt 0.10.1.
* This materialization is compatible with dbt >= 0.10.1.
* This materialization has been written for Redshift.
* This materialization can only be used for a model where records are not expected to change after they are created.
* Any model post-hooks that use `{{ this }}` will fail using this materialization. For example:
32 changes: 32 additions & 0 deletions integration_tests/data/materializations/data_pageviews.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
id,tstamp,anonymous_id
1,2020-01-01,50f42e4d-535d-4117-934f-bdf2deff1234
2,2020-01-02,50f42e4d-535d-4117-934f-bdf2deff1234
3,2020-01-03,50f42e4d-535d-4117-934f-bdf2deff1234
4,2020-01-04,f164ffb3-b789-4d93-b280-4c35f50abcde
5,2020-01-05,f164ffb3-b789-4d93-b280-4c35f50abcde
6,2020-01-06,50f42e4d-535d-4117-934f-bdf2deff1234
7,2020-01-07,50f42e4d-535d-4117-934f-bdf2deff1234
8,2020-01-08,f164ffb3-b789-4d93-b280-4c35f50abcde
9,2020-01-09,50f42e4d-535d-4117-934f-bdf2deff1234
10,2020-01-10,50f42e4d-535d-4117-934f-bdf2deff1234
11,2020-01-11,f164ffb3-b789-4d93-b280-4c35f50abcde
12,2020-01-12,f164ffb3-b789-4d93-b280-4c35f50abcde
13,2020-01-13,f164ffb3-b789-4d93-b280-4c35f50abcde
14,2020-01-14,f164ffb3-b789-4d93-b280-4c35f50abcde
15,2020-01-15,50f42e4d-535d-4117-934f-bdf2deff1234
16,2020-01-16,f164ffb3-b789-4d93-b280-4c35f50abcde
17,2020-01-17,f164ffb3-b789-4d93-b280-4c35f50abcde
18,2020-01-18,50f42e4d-535d-4117-934f-bdf2deff1234
19,2020-01-19,50f42e4d-535d-4117-934f-bdf2deff1234
20,2020-01-20,50f42e4d-535d-4117-934f-bdf2deff1234
21,2020-01-21,f164ffb3-b789-4d93-b280-4c35f50abcde
22,2020-01-22,f164ffb3-b789-4d93-b280-4c35f50abcde
23,2020-01-23,50f42e4d-535d-4117-934f-bdf2deff1234
24,2020-01-24,f164ffb3-b789-4d93-b280-4c35f50abcde
25,2020-01-25,50f42e4d-535d-4117-934f-bdf2deff1234
26,2020-01-26,f164ffb3-b789-4d93-b280-4c35f50abcde
27,2020-01-27,50f42e4d-535d-4117-934f-bdf2deff1234
28,2020-01-28,f164ffb3-b789-4d93-b280-4c35f50abcde
29,2020-01-29,50f42e4d-535d-4117-934f-bdf2deff1234
30,2020-01-30,f164ffb3-b789-4d93-b280-4c35f50abcde
31,2020-01-31,50f42e4d-535d-4117-934f-bdf2deff1234
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
id,tstamp,anonymous_id,page_view_number,previous_tstamp,period_of_inactivity,new_session,session_number,session_id
1,2020-01-01,50f42e4d-535d-4117-934f-bdf2deff1234,1,,,1,1,50f42e4d-535d-4117-934f-bdf2deff1234--1
2,2020-01-02,50f42e4d-535d-4117-934f-bdf2deff1234,2,2020-01-01,86400,0,1,50f42e4d-535d-4117-934f-bdf2deff1234--1
3,2020-01-03,50f42e4d-535d-4117-934f-bdf2deff1234,3,2020-01-02,86400,0,1,50f42e4d-535d-4117-934f-bdf2deff1234--1
4,2020-01-04,f164ffb3-b789-4d93-b280-4c35f50abcde,1,,,1,1,f164ffb3-b789-4d93-b280-4c35f50abcde--1
5,2020-01-05,f164ffb3-b789-4d93-b280-4c35f50abcde,2,2020-01-04,86400,0,1,f164ffb3-b789-4d93-b280-4c35f50abcde--1
6,2020-01-06,50f42e4d-535d-4117-934f-bdf2deff1234,4,2020-01-03,259200,1,2,50f42e4d-535d-4117-934f-bdf2deff1234--2
7,2020-01-07,50f42e4d-535d-4117-934f-bdf2deff1234,5,2020-01-06,86400,0,2,50f42e4d-535d-4117-934f-bdf2deff1234--2
8,2020-01-08,f164ffb3-b789-4d93-b280-4c35f50abcde,3,2020-01-05,259200,1,2,f164ffb3-b789-4d93-b280-4c35f50abcde--2
9,2020-01-09,50f42e4d-535d-4117-934f-bdf2deff1234,6,2020-01-07,172800,1,3,50f42e4d-535d-4117-934f-bdf2deff1234--3
10,2020-01-10,50f42e4d-535d-4117-934f-bdf2deff1234,7,2020-01-09,86400,0,3,50f42e4d-535d-4117-934f-bdf2deff1234--3
11,2020-01-11,f164ffb3-b789-4d93-b280-4c35f50abcde,4,2020-01-08,259200,1,3,f164ffb3-b789-4d93-b280-4c35f50abcde--3
12,2020-01-12,f164ffb3-b789-4d93-b280-4c35f50abcde,5,2020-01-11,86400,0,3,f164ffb3-b789-4d93-b280-4c35f50abcde--3
13,2020-01-13,f164ffb3-b789-4d93-b280-4c35f50abcde,6,2020-01-12,86400,0,3,f164ffb3-b789-4d93-b280-4c35f50abcde--3
14,2020-01-14,f164ffb3-b789-4d93-b280-4c35f50abcde,7,2020-01-13,86400,0,3,f164ffb3-b789-4d93-b280-4c35f50abcde--3
15,2020-01-15,50f42e4d-535d-4117-934f-bdf2deff1234,8,2020-01-10,432000,1,4,50f42e4d-535d-4117-934f-bdf2deff1234--4
16,2020-01-16,f164ffb3-b789-4d93-b280-4c35f50abcde,8,2020-01-14,172800,1,4,f164ffb3-b789-4d93-b280-4c35f50abcde--4
17,2020-01-17,f164ffb3-b789-4d93-b280-4c35f50abcde,9,2020-01-16,86400,0,4,f164ffb3-b789-4d93-b280-4c35f50abcde--4
18,2020-01-18,50f42e4d-535d-4117-934f-bdf2deff1234,9,2020-01-15,259200,1,5,50f42e4d-535d-4117-934f-bdf2deff1234--5
19,2020-01-19,50f42e4d-535d-4117-934f-bdf2deff1234,10,2020-01-18,86400,0,5,50f42e4d-535d-4117-934f-bdf2deff1234--5
20,2020-01-20,50f42e4d-535d-4117-934f-bdf2deff1234,11,2020-01-19,86400,0,5,50f42e4d-535d-4117-934f-bdf2deff1234--5
21,2020-01-21,f164ffb3-b789-4d93-b280-4c35f50abcde,10,2020-01-17,345600,1,5,f164ffb3-b789-4d93-b280-4c35f50abcde--5
22,2020-01-22,f164ffb3-b789-4d93-b280-4c35f50abcde,11,2020-01-21,86400,0,5,f164ffb3-b789-4d93-b280-4c35f50abcde--5
23,2020-01-23,50f42e4d-535d-4117-934f-bdf2deff1234,12,2020-01-20,259200,1,6,50f42e4d-535d-4117-934f-bdf2deff1234--6
24,2020-01-24,f164ffb3-b789-4d93-b280-4c35f50abcde,12,2020-01-22,172800,1,6,f164ffb3-b789-4d93-b280-4c35f50abcde--6
25,2020-01-25,50f42e4d-535d-4117-934f-bdf2deff1234,13,2020-01-23,172800,1,7,50f42e4d-535d-4117-934f-bdf2deff1234--7
26,2020-01-26,f164ffb3-b789-4d93-b280-4c35f50abcde,13,2020-01-24,172800,1,7,f164ffb3-b789-4d93-b280-4c35f50abcde--7
27,2020-01-27,50f42e4d-535d-4117-934f-bdf2deff1234,14,2020-01-25,172800,1,8,50f42e4d-535d-4117-934f-bdf2deff1234--8
28,2020-01-28,f164ffb3-b789-4d93-b280-4c35f50abcde,14,2020-01-26,172800,1,8,f164ffb3-b789-4d93-b280-4c35f50abcde--8
29,2020-01-29,50f42e4d-535d-4117-934f-bdf2deff1234,15,2020-01-27,172800,1,9,50f42e4d-535d-4117-934f-bdf2deff1234--9
30,2020-01-30,f164ffb3-b789-4d93-b280-4c35f50abcde,15,2020-01-28,172800,1,9,f164ffb3-b789-4d93-b280-4c35f50abcde--9
31,2020-01-31,50f42e4d-535d-4117-934f-bdf2deff1234,16,2020-01-29,172800,1,10,50f42e4d-535d-4117-934f-bdf2deff1234--10
6 changes: 6 additions & 0 deletions integration_tests/dbt_project.yml
Original file line number Diff line number Diff line change
@@ -19,6 +19,12 @@ clean-targets: # directories to be removed by `dbt clean`
- "target"
- "dbt_modules"

vars:
# Sessionization inactivity cutoff: if the time gap between page/track calls
# exceeds this # of seconds, start a new session.
# Used in `test_insert_by_period_with_lookback.sql`
segment_inactivity_cutoff: 60 * 60 * 24 # 1 day for simplicity

dispatch:
- macro_namespace: 'dbt_utils'
search_order: ['dbt_utils_integration_tests', 'dbt_utils']
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
{{
config(
materialized = 'view',
enabled=(target.type == 'redshift')
)
}}

select * from {{ ref('expected_pageviews__sessionized') }}
5 changes: 5 additions & 0 deletions integration_tests/models/materializations/schema.yml
Original file line number Diff line number Diff line change
@@ -5,3 +5,8 @@ models:
tests:
- dbt_utils.equality:
compare_model: ref('expected_insert_by_period')

- name: test_insert_by_period_with_lookback
tests:
- dbt_utils.equality:
compare_model: ref('expected_insert_by_period_with_lookback')
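
The `dbt_utils.equality` test added here compares the model's rows with the expected relation. Conceptually it passes when neither relation contains a row that the other lacks. A minimal Python sketch of that check follows, assuming rows are hashable tuples; the function name `equality_diff` is hypothetical and this is not the macro's SQL.

```python
def equality_diff(model_rows, expected_rows):
    """Return the rows present in one relation but missing from the other."""
    model_set, expected_set = set(model_rows), set(expected_rows)
    return {
        "model_minus_expected": model_set - expected_set,
        "expected_minus_model": expected_set - model_set,
    }


# Two rows taken from the expected sessionized seed (id, tstamp, session_id).
expected = [
    (1, "2020-01-01", "50f42e4d-535d-4117-934f-bdf2deff1234--1"),
    (2, "2020-01-02", "50f42e4d-535d-4117-934f-bdf2deff1234--1"),
]

# Row order does not matter, so a reordered copy still matches.
matching = equality_diff(list(reversed(expected)), expected)

# A model that is missing a row shows up in `expected_minus_model`.
mismatched = equality_diff(expected[:1], expected)
print(matching, mismatched)
```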
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@
{{
config(
materialized = 'insert_by_period',
period = 'week',
lookback_interval = '2 days',
timestamp_field = 'tstamp',
start_date = '2019-12-31',
stop_date = '2020-02-01',
unique_key = 'id',
sort = 'tstamp',
dist = 'id',
enabled=(target.type == 'redshift')
)
}}

with pageviews as (

select *

from {{ref('data_pageviews')}}
where anonymous_id in (

select distinct anonymous_id

from {{ref('data_pageviews')}}
where __PERIOD_FILTER_WITH_LOOKBACK__

)

),

numbered as (
--This CTE is responsible for assigning an all-time page view number for a
--given anonymous_id. We don't need to do this across devices because the
--whole point of this field is for sessionization, and sessions can't span
--multiple devices.

select
*,

row_number() over (
partition by anonymous_id
order by tstamp
) as page_view_number

from pageviews

),

lagged as (

--This CTE is responsible for simply grabbing the last value of `tstamp`.
--We'll use this downstream to do timestamp math--it's how we determine the
--period of inactivity.

select
*,

lag(tstamp) over (
partition by anonymous_id
order by page_view_number
) as previous_tstamp

from numbered

),

diffed as (

--This CTE simply calculates `period_of_inactivity`.

select
*,
{{ dbt_utils.datediff('previous_tstamp', 'tstamp', 'second') }} as period_of_inactivity

from lagged

),

new_sessions as (

--This CTE calculates a single 1/0 field--if the period of inactivity prior
--to this page view was greater than `segment_inactivity_cutoff` (or there is
--no prior page view), the value is 1, otherwise it's 0. We'll use this to
--calculate the user's session #.

select
*,

case
when period_of_inactivity <= {{var('segment_inactivity_cutoff')}} then 0 -- Cutoff is set to 1 day (86,400s) in dbt_project.yml
else 1
end as new_session

from diffed

),

session_numbers as (

--This CTE calculates a user's session number (1, 2, 3) from `new_session`.
--This single field is the whole point of the prior series of
--calculations.

select
*,

sum(new_session) over (
partition by anonymous_id
order by page_view_number
rows between unbounded preceding and current row
) as session_number

from new_sessions

),

session_ids as (

--This CTE assigns a globally unique session id based on the combination of
--`anonymous_id` and `session_number`.

select
*,
anonymous_id || '--' || session_number as session_id

{# Replaced below with above to simplify test #}

{# {{dbt_utils.star(ref('data_insert_by_period_with_lookback'))}},
page_view_number,
{{dbt_utils.surrogate_key(['anonymous_id', 'session_number'])}} as session_id #}

from session_numbers

)

select *
from session_ids
where __PERIOD_FILTER__
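
For readers who want to check the expected seed by hand, the CTE chain above (`numbered`, `lagged`, `diffed`, `new_sessions`, `session_numbers`, `session_ids`) can be sketched in Python. This is only an approximation of the SQL under stated assumptions: the function `sessionize` is hypothetical, period filtering is ignored, and the input is the first nine rows of `data_pageviews.csv`.

```python
from datetime import datetime

INACTIVITY_CUTOFF_SECONDS = 60 * 60 * 24  # 1 day, as set in dbt_project.yml

USER_A = "50f42e4d-535d-4117-934f-bdf2deff1234"
USER_B = "f164ffb3-b789-4d93-b280-4c35f50abcde"


def sessionize(pageviews, cutoff_seconds=INACTIVITY_CUTOFF_SECONDS):
    """Assign page view numbers, session numbers, and session ids per user."""
    state = {}  # anonymous_id -> (previous_tstamp, page_view_number, session_number)
    rows = []
    for pv_id, tstamp, anonymous_id in sorted(pageviews, key=lambda r: r[1]):
        previous_tstamp, page_view_number, session_number = state.get(
            anonymous_id, (None, 0, 0)
        )
        page_view_number += 1
        inactivity = (
            None
            if previous_tstamp is None
            else int((tstamp - previous_tstamp).total_seconds())
        )
        # Mirrors the CASE expression: a NULL gap falls through to `else 1`.
        new_session = 0 if inactivity is not None and inactivity <= cutoff_seconds else 1
        session_number += new_session
        state[anonymous_id] = (tstamp, page_view_number, session_number)
        rows.append(
            {
                "id": pv_id,
                "page_view_number": page_view_number,
                "period_of_inactivity": inactivity,
                "new_session": new_session,
                "session_id": f"{anonymous_id}--{session_number}",
            }
        )
    return rows


users = [USER_A, USER_A, USER_A, USER_B, USER_B, USER_A, USER_A, USER_B, USER_A]
pageviews = [(i + 1, datetime(2020, 1, i + 1), user) for i, user in enumerate(users)]

sessions = sessionize(pageviews)
for row in sessions:
    print(row["id"], row["session_id"])
```

The output for these nine rows lines up with the `session_id` column in the expected sessionized seed earlier in this diff.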