Skip to content

Latest commit

 

History

History
338 lines (316 loc) · 15 KB

TODO.md

File metadata and controls

338 lines (316 loc) · 15 KB

TODO

Low LOE

Macro Changes:

  • Consider using the following dbt.utils in place of the current url-related macros:
  • Consider recoupling the extract_hostname_from_url & extract_query_string_from_url into a singular parse_url() macro --> parse_url([hostname|query]):
    • Checks for remove_query_parameters automatically & simpler than having multiple macros with specific names.

Medium LOE

Adding Metrics & Dimensions:

Individual Metrics:

  • USER-SCOPE Metrics:
    • days_active = COUNT(DISTINCT event_date) AS day_count: The number of days a user has been active on your website or application.
    • ?event_count? = COUNT(*[events]) AS event_count: The total number of a events a user has triggered on your webiste or application.
    • Event-related:
      • avg_events_per_session
      • lifetime_events
    • Pageview-related:
      • avg_page_views_per_session
      • avg_page_view_duration
    • Session-related:
      • avg_session_duration
      • lifetime_session_duration SEE BELOW IN User Lifetime Metrics.
  • SESSION-SCOPE Metrics:
    • Event-related:
      • avg_events_per_session
      • lifetime_events
    • Pageview-related:
      • avg_page_view_duration
  • PAGE-SCOPE Metrics:
    • Expand on Entrances & Exits:
      • entrance_rate: The percentage of sessions that started on a page or screen (Entrances / Sessions).
      • exit_rate: The percentage of sessions that ended on a page or screen (Exits / Sessions).
    • Event-related:
      • events_per_page
    • Mirroring GA4 UI:
      • views, just stick with page_views?
      • users
      • new_users
      • views_per_user
      • total_engagement_duration
      • avg_engagement_duration
      • unique_user_scrolls
      • ?event_count -- DECIDE HOW TO BREAK THESE ALL OUT --
      • ?conversions -- DECIDE HOW TO HANDLE GENERAL CONVERSION_EVENT LOGIC --
      • ?total_revenue -- CHOOSE 1 NAME, event_value, total_evet_value, total_revenue? --
  • IN GENERAL CONSIDER ADDING THESE FOR EACH METRIC WHERE APPLICABLE:
    • avg_[entity]
    • total_[entity]
    • first_[entity]
    • last_[entity]

Groups of Metrics:

  • User Lifetime Metrics:
    • lifetime_engaged_sessions: The number of engaged sessions a user had since they first visited your website or application.
    • lifetime_engagemenet_duration:The length of time since a user's first visit that the user was active on your website or application while it was in the foreground.
    • lifetime_session_duration: The total duration of user sessions, from their first session until the current session expires, including time when your website or application is in the background.
    • lifetime_ad_revenue: The ad revenue you generate from someone since their first visit to your website or app.
    • LTV or lifetime_value: Lifetime value (LTV) shows the total revenue from purchases on your website or application. You can use the data to determine how valuable users are based on additional revenue you generate.
    • lifetime_sessions: ALREADY HAVE THIS - sessions @ User Level
    • lifetime_transactions ALREADY HAVE THIS - purchases @ User Level
  • ...WAY MORE...
  • See Automatically Collected Events in GA4.

High LOE

Creating A Dynamic Macro To Handle All Events:

REASONING: This will help the setup / configuartion to be almost instant. Where every event_name in the stg_ga4_events model will be automatically detected and create subsequent models for them.

CURRENT PROGRESS: I am so close, but no cigar yet. I have almost figured out how to dynamically handle all event_params and their assosiated value.<dtype>_type through macros, one solved, I can implement it to build one enormous Staging Model for all non statically handled events.

  • [ ]

Add Tests & Packages:

Transition from dbt Cloud --> dbt Core:

  • A LOT TO DO FOR THIS...

Add Integration Tests:

Other General Issues:

  • Decide on considerations for handling certain event_params, such as:
    • Google click-related: gclsrc and gclid.
      • See here for more info.
    • Others like: debug_mode, term, and ?clean_event.
  • Decide how to handle default events (e.g., click, scroll, etc.)


COMPLETED

Adding & Fixing Docs:

  • Docs to Add:
    • Add a singular _metric_definitions.md file to root of the models folder. See here for inspo.
      • Essentially each metric definition would be in this format: {% docs %} <Enter your metric definition here, like this. [Source](https://like-to-source-file.here/)> {% enddocs %}
      • Need to update as metrics & dimensions are added, removed, or changed.
    • Add in this.
    • Restructure to be:
      • A single __schema__.yml file, for all sources & models at the root of A NEW docs folder.
        • i.e., combine _core__models.yml, _ga4__models.yml & _ga4__sources.yml.
        • REMOVE _core__models.yml, _ga4__models.yml & _ga4__sources.yml.
      • A single __macros__.yml file.
        • Rename _macros.yml, may choose to join this with the __schema__.yml file later if we can find fix for models failing to compile.
      • A single __docs__.md file, for all model, metric, & macro definitions/descriptions at the root of A NEW docs folder.
        • i.e., combine _core__docs.md, _metric_definitions.md, & _docs.md.
        • REMOVE _core__docs.md, _metric_definitions.md, & _docs.md.
      • Then also move __overview__.md to the root of A NEW docs folder.
  • Docs to Fix:
    • Change __overview__.md to simply be a high-level overview with links to say the README.md, Projet Style Guide, and the whatever else here for the project.
      • See here for inspo.

Metric & Dimension Renaming:

The general Fixes are as follows:

num_[entity] --> 
total_[entity] OR
[entity] (**Recommended**)

// EX: num_page_views --> page_views
  • dim_ga4__users table:
    • num_sessions --> sessions
    • num_page_views --> pages_views
    • num_purchases --> purchases
  • dim_ga4__sessions table:
  • fct_ga4__pages table:
    • Consider if we should handle hour differently?
    • total_time_on_page --> time_on_page --> total_engagement_duration
  • fct_ga4__sessions table:
    • count_page_views --> page_views
    • Consider changing sum_event_value_in_usd --> event_value, will want to allign with UA/Internal usage.
    • session_engaged --> engaged_sessions?
  • ga4__events table:
    • Inlcude the is_ prefix to all boolean fields (e.g. session_enagaged, engaged_session_event, entrances), to be like: is_page_view, is_purchase, etc.
    • Add an additiona page_path metric.

Restructing Certain Staging Models --> Intermediate Models:

REASONING: Currently there is only a Staging and Mart Models, and no Intermediate layer. Additionally of the following models are breaking the underlying rule of Staging Models: NO JOINS & NO AGGREGATION / RE-GRAINING.

  • Handling conversion_events:
    • 1. ADD A stg_ga4__event_conversions UPSTREAM INSTEAD OF HAVING BOTH stg_ga4__page_conversions AND stg_ga4__session_conversions.
    • 2. MOVE stg_ga4__page_conversions --> int_ga4__page_conversions, AND PULL FROM stg_ga4__event_conversions.
    • 3. MOVE stg_ga4__session_conversions --> int_ga4__session_conversions, AND PULL FROM stg_ga4__event_conversions.
  • Handling FIRST & LAST events:
    • stg_ga4__sessions_first_last_pageviews
    • stg_ga4__users_first_last_pageviews
    • stg_ga4__users_first_last_events
  • Handling user_properties:
    • stg_ga4__users_properties
    • stg_ga4__derived_users_properties
  • Handling all event_* related models:
    • FIX stg_ga4__event_page_view, it is trying to join with something up stream.

Mart Table Restructuring:

REASONING: Wide & Denomalized. Unlike old school warehousing, in the modern data stack storage is cheap and it’s compute that is expensive and must be prioritized as such, packing these into very wide denormalized concepts that can provide everything somebody needs about a concept as a goal.

  • dim_ga4__sessions & fct_ga4__sessions --> ga4__sessions
  • dim_ga4__users --> ga4__users
  • fct_ga4__pages --> ga4__pages
  • ALSO, HONESTLY SHOULD INCLUDE AN ENRICHED & UNNESTED ga4__events TABLE AS WELL.

Status Overview:

  • Models:
    • ga4__events
      • event_date
      • event_timestamp
      • event_name
      • event_params
      • user_properties
      • user_ltv
      • device
      • geo
      • traffic_source
      • ecommerce
      • items
      • keys
        • user_key
        • session_key
        • page_key
        • event_key
      • ``
      • ``
      • ``
      • ``
      • ``
      • ``
    • ga4__pages
      • date
      • hour
      • page_path
      • page_title
      • page_key
      • page_views
      • users
      • pageviews_per_user
      • new_users
      • sessions
      • pageviews_per_session
      • entrances
      • entrance_rate
      • exits
      • exit_rate
      • total_engagement_time
      • avg_engagement_duration
    • ga4__sessions
      • user_key
      • session_key
      • session_date
      • session_timestamp
      • is_new_user
      • ga_session_number
      • is_engaged_session
      • landing_page
      • default_channel_grouping
      • source
      • medium
      • campaign
      • country
      • city
      • device_category
      • operating_system
      • session_duration
      • page_views
      • session_event_count
      • avg_event_per_page
      • session_value
    • ga4__users
  • Macros:
    • get_position
      • get_first
      • get_last
    • [x]
  • Metrics & Dimensions:
    • user_properties
    • user_ltv
    • device
      • device_category
      • mobile_brand_name
      • mobile_model_name
      • mobile_marketing_name
      • mobile_os_hardware_model
      • operating_system
      • operating_system_version
      • vendor_id
      • advertising_id
      • language
      • is_limited_ad_tracking
      • time_zone_offset_seconds
      • browser
      • browser_version
    • geo
      • continent
      • sub_continent
      • country
      • region
      • city
      • metro
    • traffic_source
      • source
      • medium
      • campaign
    • ecommerce
    • items
      • item_id
      • item_name
      • item_brand
      • item_variant
      • item_category
      • item_category2
      • item_category3
      • item_category4
      • item_category5
      • price
      • quantity
      • item_revenue
      • item_refund
      • coupon
      • affiliation
      • location_id
      • item_list_id
      • item_list_name
      • item_list_index
      • promotion_id
      • promotion_name
      • creative_name
      • creative_slot

OTHER CONSIDERATIONS & IDEAS

  • Possible Good Addition: Identify Bot Traffic
  • Check this out.
  • IDEALS FOR WHY I AM MAKING SOME OF MY CHOICES REVOLVE AROUND: IMPLICIT VS EXPLICIT
    • TO WRITE ON THIS EXTENSIVELY ONCE PROJECT IS MORE MATURE...
  • See this Stack Overflow Comment, as this may be a better way to implement the unest_params macro:
{%- call statement('my_statement', fetch_result=True) -%}
      SELECT my_field FROM my_table
{%- endcall -%}

{%- set my_var = load_result('my_statement')['data'][0][0] -%}
  • Consider adding some Metrics / Dimensions for User-Scoped illustrating their most_freq & unique_num_of for the following:

    • device & device_category, etc.
    • location
    • SURE SOME OTHER GREAT ONES, WILL BRAINSTORM / WHITESTORM WITH OTEHRS.
  • CHECK THIS RESOURCE OUT: Data Modeling for a Customer 360 View.

  • Monitoring Core Web Vitals in BigQuery.

  • Churn Prediction Using GA4 and BQML.

  • ADD IN ABITRARY SCALES OF 1 TO 10 FOR SAY SOMETHING LIKE user_activity_scale OR user_engagement_scale.

    • WHERE YOU COULD USE ~ APPROX_QUANTILES(<metrric_name>, 100)[OFFSET(<percentage_amount>)] AS p<percentage_amount>.
    • SEE Sample Queries for Audiences FOR MORE INSPORATION.
  • IDEA FOR HOW TO HANDLE DYNAMIC UNNEST_BY_KEY FOR ALL NON HARD-CODED EVENT MODELS:

    • ESSENTIALLY SHOULD PASS A exclude_event_params VARIABLE LIST TO EXLUDE FOR EACH EVENT, MOST LIKELY INCLUDING THE FOLLOWING:
      • ['ga_session_id', 'page_location', 'ga_session_number', 'session_engaged', 'engagement_time_msec', 'page_title', 'page_refferr', 'source', 'medium', 'campaign']
  • These example queries for GA4 are perfect!

    • USE THIS, PERFECT SOLUTION!!! : SELECT COALESCE(value.int_value, value.float_value, value.double_value) FROM:
(
SELECT COALESCE(value.int_value, value.float_value, value.double_value)
FROM UNNEST(event_params)
WHERE key = 'value'
) AS event_value
- IMPLEMENTATION WOULD PROB LOOK LIKE:
{% if value.string_value not null%}
value.string_value
{% else %}
COALESCE(value.int_value, value.float_value, value.double_value)
{% endif %}