
add drop and recompute media model #1

Conversation

agnessnowplow
Contributor

The media model is written in a drop-and-recompute fashion until the integration details are finalised. It is unlikely to remain a standalone model; depending on the player, it should use either the web model or the mobile model (Roku) as a base.

The schema.yml describes what each model does, but as a whole the idea is that all relevant event-level fields are collected under 'interactions', which is aggregated to a page view level, which is the basis for play events. There are further downstream models which aggregate this data from different perspectives. The most important model is 'media_stats', which aggregates the page view level data to a video/audio level with the key video analytics KPIs in mind.

Please note that currently it only takes test data from the web model, so some key fields from Roku events are missing; this will result in failures for the 'not_null' tests and incorrect or missing calculations for the relevant data in the models. This should resolve itself as soon as we have test data available and can populate the correct ids that are currently missing.

Contributor

@emielver emielver left a comment


Great job Agnes, it's coming along really well! I think there are a couple of main points to look out for that are purely cosmetic:

  • Standardising indentation across the board
  • Making sure files end on a newline
  • Adding as when providing aliases for columns

Then there's replacing the :: casting format with the various {{ dbt_utils.type_X }} macros, to ensure that this model is compatible with all of the different databases that are supported.
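As a minimal sketch of that change (the columns here are taken from the diff snippets below purely for illustration):

```sql
-- Redshift/Postgres-specific casting:
--   select mp.duration::float as duration
-- Cross-database equivalent using the dbt_utils type macros:
select
    cast(mp.duration as {{ dbt_utils.type_float() }}) as duration,
    cast(me.html_id as {{ dbt_utils.type_string() }}) as media_id
```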

I'm also not sure if it's worth implementing fully incremental logic now, but it seems like a waste to continually recompute so many metrics around users and videos. If that's out of scope, that's fine too.

I think breaking down the models into different subsections and therefore directories could also help with the organisation of the project, although that's not necessary.

I think we can also add some more variables into dbt_project.yml to parameterise some settings to make it easier to modify.

Some of my comments/questions are probably quite ignorant since I haven't worked with the media dataset before, so feel free to ignore them if they don't make sense, but overall I'm excited to see how this project continues to evolve!

@@ -0,0 +1,195 @@
{{
Contributor


Is this just for redshift/postgres? If so, should we specify a subdirectory for redshift within the models folder? Is the initial version only supporting redshift/postgres, or everything? If everything, consider clustering and partitioning in the config.

Contributor Author

@agnessnowplow agnessnowplow Feb 22, 2022


initially it is for redshift/postgres but for sure we can add the additional configs to support the rest


@bill-warner bill-warner left a comment


Overall I think this is a really good effort. Well done.

I have left plenty of comments throughout but here are some points I would like to highlight:

  • I am still a little unclear on the play time calculations. I suggest we work through the comments, make any changes/clarifications, and then revisit this part.
  • I think we need to think through the session and user level tables. Do we include these in the package as standalone tables, do we provide snippets showing how you might add these cols to the corresponding table in the web package, or do we provide models that do this for you? I am not sure at this point; I think we probably need to experiment with that a little. What we already have at a sessions and users level is really useful, we just need to think about how we surface it.
  • How do we integrate Roku? It is tricky because Roku's session_id is taken from com_snowplowanalytics_snowplow_client_session_1, whereas all web traffic's session id is domain_sessionid. As a result, the mobile base module will surface all Roku events, whereas the web base module will surface the remaining media events. How we handle that I am not sure at this stage. It might be that we just focus on one or the other for now.

Once you have had some time to digest the comments and make any required changes, I suggest we sit down as a three and discuss any outstanding questions and think through the points above.

e.domain_userid,
coalesce(y.player_id, me.html_id, r.content_title) as media_id, -- replace content_title with id field when available
round(mp.duration) as duration,
coalesce(y.player_id, me.html_id, r.content_title) as title, -- replace ids with title fields when available


Is this not just the same as media_id?

Contributor Author


For Roku we have a content_id and a content_title as well. As we have just decided to split Roku from the web models, I will remove this (which was intended as a placeholder for now) and keep the id only.

event_type,
start_tstamp,
media_start_pos,
lead(start_tstamp, 1) over(partition by page_view_id, media_id order by start_tstamp) end_tstamp,


You are essentially partitioning by play_id, right? What do you think about introducing the surrogate key play_id earlier, in this model, rather than in the plays_by_pageview model? That way it is exposed as early as possible. We could also use the surrogate_key utils macro rather than the hash macro. Very similar under the hood, but it ensures all cols are cast to strings and replaces NULLs (which would otherwise result in a null pk).

Contributor Author


Thanks for your explanation, it makes sense. In the reworked model it will be added to interactions already.


Sorry I guess my point here was that it might be cleaner to have the surrogate key defined earlier within the interactions model so it is more obvious what the partition actually represents.

Contributor Author


I wanted to do that of course but the issue I faced was {{ dbt_utils.surrogate_key(['p.page_view_id', coalesce(y.player_id, me.html_id) ]) }} play_id doesn't work. Do you have a workaround for this perhaps? The only thing I can think of is to add another CTE after prep and before the partitioning CTE just for the surrogate key calculation.


I think you just need to wrap the coalesce in quotes i.e. {{ dbt_utils.surrogate_key(['p.page_view_id', 'coalesce(y.player_id, me.html_id)' ]) }} play_id.

f.event_id,
f.event_type,
f.start_tstamp,
case when f.end_tstamp is null then p.start_tstamp else f.end_tstamp end end_tstamp,


Is this step really required? In the previous CTE, find_nulls_round_1, if instead of case when event_type in ('pause', 'paused', 'ended') then coalesce(end_tstamp, start_tstamp) else end_tstamp end end_tstamp you had coalesce(end_tstamp, start_tstamp) would you get the same result as this two step NULL identification?
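In other words, the suggestion is that a single coalesce in find_nulls_round_1 may cover both steps (a sketch, assuming every event type should fall back to its own start_tstamp):

```sql
select
    event_id,
    event_type,
    start_tstamp,
    -- replaces the two-step NULL identification: any missing end_tstamp
    -- falls back to the event's own start_tstamp, regardless of event_type
    coalesce(end_tstamp, start_tstamp) as end_tstamp
from prep
```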

Contributor Author


I tried to highlight it in the comment but essentially part2 was just a placeholder to eventually calculate the end timestamps of events that are not a paused or ended event and take them from the web model's pageview table's end_tstamps to get a more accurate view. Not sure if it is worth the extra calculation and dependence though.


Yeh I guess you could consider either taking the end_tstamp from the page views table or use the page ping events to calculate the end_tstamp within this model

Contributor Author


I left it with the simplified version, meaning I set the end_tstamp of the last event equal to its start_tstamp, because I found that when I added the end_tstamps from the page view table, the values I got were earlier than the start_tstamp of the last event.


from {{ ref("interactions") }}

where event_type = 'ready'


Do you think we should include impressions in the plays_by_pageview model? I know strictly speaking it wouldn't be just plays then, but just thinking out loud.

Contributor Author


As we agreed in the group discussion, I have changed the logic to fit the impressions into plays_by_pageview. This has led me to realise I need to reconsider the KPIs for media_stats in general. Currently I only have a rough idea of which metric needs which logic (or more than one) out of the following 4 levels of filtering. For unique users, as a base we could take:

1) all interactions, including cases where there is only a ready event within the page view, so no actual interaction with the video
2) all page views where some interaction with the video happened, but not necessarily a play event
3) interactions within page views with SOME play_time_sec
4) interactions with valid plays (currently longer than 30 min)

You will see I opted for calculating both 3) unique_users_with_plays and 4) unique_users_with_valid_plays, whereas for completion_rate I only consider option 3) (completion_rate_by_play).

@agnessnowplow
Contributor Author

Thank you both again for the reviews. I have reworked the model taking into consideration the recommendations you suggested. I have removed Roku from the model for now so that it is cleaner and the tests don't fail anymore. I have also added documentation for the columns so that we can be clear on the definitions behind the KPIs. There are still a handful of open points that need further discussion; I tried to reply to comments wherever needed. Feel free to resolve any open points you raised to get a clearer view of what is actually left to discuss / rework.

@bill-warner

Thanks for making these changes. I have started to review the code base again but stopped because I realised it doesn't actually run e2e. There is one issue in the mp_events model where you filter on page_view_id rather than pvc.id. The remaining issues I am unsure on. Apologies in advance if it is me misconfiguring the model, but I have left the configs as is. Once it is running I will have another look :)

@agnessnowplow
Contributor Author

Apologies, the join is now fixed. The rest of the models run fine.

@bill-warner

bill-warner commented Mar 2, 2022

I am still getting an error. Any idea why that is?


I think it is the way I was running the package. It wasn't executing the pivot_base model as part of the run, which seems slightly odd since it is downstream of the mp_events model but hey. It works now!

@agnessnowplow
Contributor Author

I was just about to comment something similar, but I'm glad it works now! I could only reproduce your issue if I first ran pivot_base without 10 in snowplow__percent_progress_boundaries and then tried to run plays by page view with a different configuration where 10 was added to the list. They are indeed dependent.


@bill-warner bill-warner left a comment


I like the changes. Good work. I think this is shaping up nicely.

I am still slightly unsure about the play time calculation. I think it is less to do with your logic, more with the underlying data. I have pulled out some rows of data which I had questions on here. I have left comments in the relevant cells.

More generally on this time calculation it seems like we have two approaches to this. I have outlined these below and my initial thoughts on the two. I am not wedded to either method yet but I would be interested to get both your thoughts on this.

1. Using tstamp of media events (Current implementation)

Pros:

  • Can calc abs time between events.
  • Assuming clean data, should be the most accurate calc.

Cons:

  • Relatively expensive to use lag/lead functions across event level data
  • The user must be tracking a specific set of events for this to work i.e. play, pause, seek, ended. Unsure how tracking additional media events will affect this calc.
  • Reliant on accurate tstamps and sequence of events being correct.
  • If there is only a single play event, you won't have any watch time. Could use page pings to accommodate this fact.
  • Complex logic, difficult to follow assumptions e.g. how seek events are being treated, what happens if elapsed time > play time

2. Use more granular percentage boundaries (e.g. every 10%)

Pros:

  • Less prone to significant error (I think)
  • Cheaper to calc
  • Would allow for more granular retention curves.
  • Only need to capture boundary events for this calc to work.
  • Accuracy can be increased by increasing number of boundaries.

Cons:

  • Purely an estimate of play time, rather than abs time.
  • Increases number of percentage boundary events
  • Hard to calc things that require fine grain measurements such as is_valid_play which could be play_time_sec > 1s
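A sketch of how option 2 could estimate play time from the furthest boundary reached (the table and column names here are assumptions, not the actual model):

```sql
select
    play_id,
    max(duration) as duration,
    -- with boundaries every 10%, reaching the 70% event on a 300s video
    -- yields an estimated 210s of play time; finer boundaries = less error
    max(percent_progress) / 100.0 * max(duration) as play_time_sec_estimated
from percent_progress_events
group by 1
```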

I haven't gone into the minutiae on this review as I feel the decision we make on the above point will impact the design a bit.

@@ -120,21 +87,24 @@ select
d.play_time_sec,
d.play_time_sec_muted,
d.play_time_sec_estimated,
d.retention_rate,
case when d.play_time_sec > 0 then true else false end is_played,


I feel like this is going to miss plays which just have a single play event and nothing else?! Perhaps it is better to define this on play events. WDYT?

Contributor Author


sure, we could do a count of plays as the basis of the logic to be 100% correct

min(start_tstamp) as first_play,
max(start_tstamp) as last_play,
count(*) as valid_plays,
count(distinct domain_userid) as unique_users_with_valid_plays,


I would be tempted to just call this users_w_valid_plays

Contributor Author


In the end I decided to leave this out and only make a distinction between users (who clicked play) and users with complete plays, as I find it slightly superfluous, but I shortened with to w in all the key metric names.


{% if not loop.last %}
{% for element in var("snowplow__percent_progress_boundaries") %}
{% if element < 0 or element > 100 %}


Probably cleaner to move these checks into a macro, like get_percentage_boundaries(tracked_boundaries), where you validate the inputted boundaries, append 100 to the array if needed, then return the array. You could call this macro from here and iterate over the results as you have done. Shout if this doesn't make sense. Can explain in a bit more detail
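A sketch of what that macro might look like (the name comes from the comment above; the body is an assumption about how the validation and the append of 100 could be done):

```sql
{% macro get_percentage_boundaries(tracked_boundaries) %}

    {# Validate the configured boundaries and make sure 100 is included #}
    {% set boundaries = [] %}

    {% for element in tracked_boundaries %}
        {% if element < 0 or element > 100 %}
            {{ exceptions.raise_compiler_error("Boundary " ~ element ~ " is outside the valid range 0-100") }}
        {% endif %}
        {% do boundaries.append(element) %}
    {% endfor %}

    {% if 100 not in boundaries %}
        {% do boundaries.append(100) %}
    {% endif %}

    {{ return(boundaries) }}

{% endmacro %}
```

The model would then iterate over {{ get_percentage_boundaries(var("snowplow__percent_progress_boundaries")) }} instead of the raw variable.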

Contributor Author


This is done now, thanks for the suggestion.

coalesce(sum(i.play_time_sec_amended)/ nullif(max(i.duration), 0), 0) as retention_rate

sum(case when i.is_muted then i.play_time_sec_amended else 0 end) as play_time_sec_muted,
sum(i.playback_rate * i.play_time_sec) / nullif(sum(i.play_time_sec), 0) as avg_playback_rate,


What is the reason to use play_time_sec for this calc but play_time_sec_amended elsewhere?

Contributor Author


play_time_sec_amended is renamed to play_time_sec in downstream models. It would probably be cleaner if the play_time_sec in this model were called play_time_sec_original and play_time_sec_amended were called play_time_sec right away.

Contributor Author


This is not relevant anymore. I have only kept one play_time_sec calculation which is the estimate based on the percent progresses reached, as discussed earlier.

group by 1, 2, 3, 4, 5

group by 1, 2, 3, 4

)

, valid_play_stats as (


So is the plan to have this media stats model as drop and recompute, i.e. reprocess all plays each time? We might be able to be cunning and avoid that by using {{ this }} to self-reference the table, then adding the new data to the old data. However, this approach will likely not work for metrics that require count distincts.
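A sketch of the {{ this }} pattern being described (table and column names are illustrative), including the caveat that count distinct metrics can't be combined additively this way:

```sql
{{ config(materialized='incremental', unique_key='media_id') }}

select
    media_id,
    count(*) as plays,
    max(start_tstamp) as last_play
from {{ ref("plays_by_pageview") }}

{% if is_incremental() %}
-- only scan plays that arrived since the last run; note that with
-- unique_key the new row replaces the old one rather than adding to it,
-- so additive metrics still need to be merged with the values in {{ this }},
-- and count(distinct ...) metrics cannot be merged at all
where start_tstamp > (select max(last_play) from {{ this }})
{% endif %}

group by 1
```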

Contributor Author


Will try my best to come up with something along these lines. For the drop and recompute model I did not plan to make this change yet.

from {{ ref("plays_by_pageview") }}
select
media_id,
min(start_tstamp) as first_play,


I feel like these should either be called first_valid_play or we move the first/last play tstamps calc to the prep CTE i.e. consider both valid and non valid plays when calculating first and last plays. WDYT?

Contributor Author


I think both could be useful, especially now that we decided to reconsider the playtime calculation so that valid plays will become less accurate in certain scenarios

@agnessnowplow
Contributor Author

I have reworked the model according to the less expensive and less error-prone play_time_sec calculation, which is based on the percentage progresses reached. The more boundaries users track, the more accurate the estimates should be. This simplified the model quite a bit after all, but it also meant I reconsidered the KPIs (some I renamed, some I removed, some I changed the logic for). I have changed the logic for avg_retention_rate, which will be the average of the latest percentage progress reached before any seek events, to see where the interest drops. This is different from 'avg_percentage_played', which is the total play_time_sec (including rewatches) divided by the media duration.

@emielver emielver closed this May 13, 2022
@emielver
Contributor

emielver commented May 13, 2022

Closed since we have the incremental model instead. (#2)

@emielver emielver deleted the dev/drop_and_recompute_media_model branch August 26, 2022 14:16