Incremental, partitioned session dim and fact tables. Allows for dimensional modeling on very large installs #175

Merged
merged 8 commits into from
Apr 12, 2023

Conversation

adamribaudo-velir
Collaborator

@adamribaudo-velir adamribaudo-velir commented Apr 7, 2023

Description & motivation

Current Issue:

Building dim_ga4__sessions requires very expensive window functions that run against the entire event table. GA4 sessions can span multiple days, but the primary method available to us for reducing query cost is to work incrementally with N days of data. These two facts conflict: when working with N days of data, you may be working with the second half of a session that started on an earlier date, or with the first half of a session that ends on a later date.
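
To make the conflict concrete, here is a minimal Python sketch (not the package's dbt SQL). The event tuples and the `truncated_sessions` helper are hypothetical names chosen for illustration. It finds sessions that have events both inside and outside an N-day processing window, which are the sessions an incremental run would see only in part.

```python
from datetime import date

# Hypothetical (session_key, event_date) pairs; not the package's actual schema.
events = [
    ("s1", date(2023, 4, 5)),  # s1 starts before the window...
    ("s1", date(2023, 4, 6)),  # ...and continues inside it
    ("s2", date(2023, 4, 6)),
    ("s3", date(2023, 4, 7)),
]

def truncated_sessions(events, window_start, window_end):
    """Return session keys that have events both inside and outside the window."""
    inside, outside = set(), set()
    for session_key, event_date in events:
        if window_start <= event_date <= window_end:
            inside.add(session_key)
        else:
            outside.add(session_key)
    # A session in both sets is only partially visible to the incremental run.
    return inside & outside

cut = truncated_sessions(events, date(2023, 4, 6), date(2023, 4, 7))
print(cut)
```

With a two-day window starting on 2023-04-06, only `s1` is cut, because its first event falls on the day before the window opens.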

Solution:

The only way to align a date-partitioned strategy with session dimensions is to revert to the assumption used by GA3 that sessions are wholly contained within a date. This PR adds a series of _daily models that make this assumption. Users of the package can decide whether to work with multi-day sessions (at the expense of higher query costs) or with query-efficient daily sessions.

Summary of changes:

  • New concept: session partition, which is a session split into date-grain records. A 2-day session will have 2 session partitions
  • Added stg_ga4__sessions_traffic_sources_daily which finds acquisition sources for daily session partitions
  • Added stg_ga4__derived_session_properties_daily which finds session properties within session partitions
  • Added dim_ga4__sessions_daily which finds the first value of all dimension columns, windowing on session partition
  • Made stg_ga4__session_conversions_daily incremental and partitioned (was already grouping by day)
  • Made fct_ga4__sessions_daily incremental and partitioned
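
The session partition idea in the list above can be sketched in plain Python. This is an illustration only, not the package's SQL: the event fields (`session_key`, `event_date`, `event_timestamp`, `source`) and both helper functions are assumed names. Events are grouped by session and date, and each partition keeps the first non-null value of a dimension column, which mirrors "first value, windowing on session partition".

```python
from collections import defaultdict

# Hypothetical events; field names are assumptions, not the package's columns.
events = [
    {"session_key": "s1", "event_date": "2023-04-06", "event_timestamp": 100, "source": "google"},
    {"session_key": "s1", "event_date": "2023-04-06", "event_timestamp": 200, "source": None},
    {"session_key": "s1", "event_date": "2023-04-07", "event_timestamp": 300, "source": "newsletter"},
    {"session_key": "s2", "event_date": "2023-04-07", "event_timestamp": 150, "source": "direct"},
]

def session_partition_key(event):
    """A session partition is identified by the session plus the event date."""
    return (event["session_key"], event["event_date"])

def build_session_partitions(events):
    """Group events into session partitions and take first values per partition."""
    grouped = defaultdict(list)
    for event in events:
        grouped[session_partition_key(event)].append(event)

    partitions = {}
    for key, partition_events in grouped.items():
        partition_events.sort(key=lambda e: e["event_timestamp"])
        # First non-null source within this partition only.
        first_source = next(
            (e["source"] for e in partition_events if e["source"] is not None), None
        )
        partitions[key] = {
            "first_source": first_source,
            "event_count": len(partition_events),
        }
    return partitions

partitions = build_session_partitions(events)
for key, row in sorted(partitions.items()):
    print(key, row)
```

Session `s1` spans two dates, so it yields two partitions, and each partition reports its own first source. That is the trade-off of the daily assumption: the second day of a session is treated as if it began there.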

Checklist

  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have run dbt test and python -m pytest . to validate existing tests

@adamribaudo-velir adamribaudo-velir changed the title Incremental, daily partitioned session dimension and fact tables Incremental, query-optimized session dimension and fact tables. Apr 7, 2023
@adamribaudo-velir adamribaudo-velir changed the title Incremental, query-optimized session dimension and fact tables. Incremental, partitioned session dimension and fact tables. Allows for dimensional modeling on very large installs Apr 9, 2023
@adamribaudo-velir adamribaudo-velir changed the title Incremental, partitioned session dimension and fact tables. Allows for dimensional modeling on very large installs Incremental, partitioned session dim and fact tables. Allows for dimensional modeling on very large installs Apr 9, 2023
@adamribaudo-velir adamribaudo-velir marked this pull request as ready for review April 9, 2023 16:17
@willbryant
Contributor

Honestly, I do not understand the point of separating dim & fct tables for a single entity like sessions; it's not like the dim table keys anything long-lived. Since this PR doubles the number of tables, is this a good time to combine them?

@adamribaudo-velir
Collaborator Author

I agree there isn't much value in separating these dim & fct tables as they are today, but I'd like to research comparable Kimball modeling scenarios before making a final call. It's a separate concern from what's being addressed in this PR, so let me open an issue and we can move the conversation there.

@dgitis
Collaborator

dgitis commented Apr 11, 2023

I approved these changes, but also wanted to add that I think we should have a view model on top of these session tables that handles cross-partition aggregation.

I would go so far as to suggest that we give our session model or models the int prefix rather than fct or dim and move them out of the marts folder and put the view in the marts folder so there's no confusion about which model to use.

This would result in sessions breaking when they are at the start or the end of the date range being queried but would leave partitions within the range unbroken.

I think this is a good default behavior that people can choose to override if they are willing to pay the performance price.

@adamribaudo-velir
Copy link
Collaborator Author

Agree that the organization could use more thought.

I had a realization after completing this: It would be easy to create a dim_ga4__sessions model on top of dim_ga4__sessions_daily that uses its expensive window functions across session partitions, instead of events. So you recoup the GA4 multi-day sessions at a fraction of the cost compared to the current implementation which scans all events. It'd still be expensive, but way less so.
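
A rough Python sketch of that roll-up idea, under stated assumptions: the partition rows and their field names (`session_key`, `partition_date`, `first_source`, `event_count`) are hypothetical stand-ins for whatever dim_ga4__sessions_daily exposes, and `rollup_sessions` is an illustrative helper, not a package model. The point is that the "first value" logic runs over a handful of partition rows per session instead of over every event.

```python
from collections import defaultdict

# Hypothetical daily session-partition rows; column names are assumptions.
session_partitions = [
    {"session_key": "s1", "partition_date": "2023-04-06", "first_source": "google", "event_count": 2},
    {"session_key": "s1", "partition_date": "2023-04-07", "first_source": "newsletter", "event_count": 1},
    {"session_key": "s2", "partition_date": "2023-04-07", "first_source": "direct", "event_count": 1},
]

def rollup_sessions(partitions):
    """Recombine daily session partitions into one row per multi-day session."""
    by_session = defaultdict(list)
    for partition in partitions:
        by_session[partition["session_key"]].append(partition)

    sessions = {}
    for session_key, parts in by_session.items():
        # ISO-formatted dates sort correctly as strings.
        parts.sort(key=lambda p: p["partition_date"])
        sessions[session_key] = {
            "start_date": parts[0]["partition_date"],
            "end_date": parts[-1]["partition_date"],
            # First non-null value across partitions, earliest date first.
            "first_source": next(
                (p["first_source"] for p in parts if p["first_source"] is not None), None
            ),
            "event_count": sum(p["event_count"] for p in parts),
            "partition_count": len(parts),
        }
    return sessions

sessions = rollup_sessions(session_partitions)
for key, row in sorted(sessions.items()):
    print(key, row)
```

Here `s1` is restored as a single two-day session whose source comes from its earliest partition.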

Just putting that idea out there for now.

@adamribaudo-velir adamribaudo-velir merged commit 2933aef into main Apr 12, 2023
@adamribaudo-velir adamribaudo-velir deleted the dim-session-partition branch April 12, 2023 02:13
@willbryant
Contributor

> I had a realization after completing this: It would be easy to create a dim_ga4__sessions model on top of dim_ga4__sessions_daily that uses its expensive window functions across session partitions, instead of events. So you recoup the GA4 multi-day sessions at a fraction of the cost compared to the current implementation which scans all events. It'd still be expensive, but way less so.

This is what I did in my own project some months back. Works well enough.


3 participants