High Flow and Occupancy Detection and Correction #506
Conversation
…ers for flow and occupancy
From a high level, this logic seems reasonable to me. I do wonder if it would be worth breaking out the outlier removal into a separate intermediate model: I could imagine that more than one "performance" model would want to use the version with outliers removed. What do you think @thehanggit ?
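For illustration, such a split might look roughly like the sketch below. The model filename and the is_high_flow_outlier flag are assumptions for the example, not taken from this PR; only the ref() target appears in the actual code under review.

```sql
-- hypothetical intermediate model, e.g. int_clearinghouse__detector_agg_no_outliers.sql
-- downstream "performance" models would select from this instead of the raw aggregate
select
    detector_id,
    sample_date,
    volume_sum,
    occupancy_avg
from {{ ref('int_clearinghouse__detector_agg_five_minutes') }}
-- is_high_flow_outlier is a hypothetical flag assumed to be computed upstream
where not coalesce(is_high_flow_outlier, false)
```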
monthly_stats as (
    select
        detector_id,
        date_trunc('month', sample_date) as month,
        avg(volume_sum) as volume_mean,
        stddev(volume_sum) as volume_stddev,
        -- consider using max_capacity
        percentile_cont(0.95) within group (order by volume_sum) as volume_95th,
        percentile_cont(0.95) within group (order by occupancy_avg) as occupancy_95th
    from five_minute_agg
    group by detector_id, date_trunc('month', sample_date)
),
I don't think this covers the intended time period: five_minute_agg is an incremental model, so in normal usage it only covers the past two days. For the most part, this "monthly" aggregate will only have two days of data in it.
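To make the concern concrete, an incremental dbt model typically filters to a trailing window on scheduled runs, so any month-level group by over it only sees that window. A minimal sketch (not the actual five_minute_agg source; the upstream model name and two-day window are assumed for illustration):

```sql
{{ config(materialized='incremental') }}

select
    detector_id,
    sample_date,
    volume_sum,
    occupancy_avg
from {{ ref('stg_clearinghouse__detector_raw') }}  -- hypothetical upstream model
{% if is_incremental() %}
    -- on incremental runs only the trailing days are scanned, so a
    -- date_trunc('month', ...) group would hold roughly two days of rows
    where sample_date >= dateadd(day, -2, current_date)
{% endif %}
```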
In that case, I can use the preceding month's statistics (mean, std) to detect outliers in the current month.
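A join along these lines could express that idea. This is a sketch only: it assumes the monthly stats are materialized in their own non-incremental model, and it parenthesizes the z-score numerator.

```sql
select
    fa.detector_id,
    fa.sample_date,
    fa.volume_sum,
    (fa.volume_sum - ms.volume_mean) / nullifzero(ms.volume_stddev) as volume_zscore
from five_minute_agg as fa
inner join monthly_stats as ms
    on
        fa.detector_id = ms.detector_id
        -- join each row to the stats of the month *before* it,
        -- since the current month is incomplete on incremental runs
        and ms.month = dateadd(month, -1, date_trunc('month', fa.sample_date))
```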
I think that's fine, but the incremental logic would have to change: the logic here doesn't really work for any date, because you are never actually computing monthly stats (five_minute_agg only has two days of data in it).
The updated approach will be to create a separate high flow outlier detection model in the clearinghouse schema, which will need to account for the incremental logic scenario @ian-r-rose mentions in the previous comment. I can see a scenario where incrementality is not needed, but we will see what @thehanggit comes up with. Looking forward to seeing the update!
Sounds good!
Thank you @ian-r-rose for taking a look! I talked with @kengodleskidot just now, and we may want to detect and fix the true outliers in the clearinghouse. The outliers in the performance model might be attributable to imputation (the "fake" outliers), and we want to leave that part to the imputation model. I will develop a separate outlier removal model in the clearinghouse folder, because it would also influence the g-factor speed calculation.
…nd fixed the logic for the incremental model. Please check the logic and see if it's reasonable, so we can move forward with connecting the imputed data to downstream models.
…-pems into hang_high_flow
Nice work @thehanggit! I don't want my comments to hold up this PR, since I do not believe they would change the output of the model, but feel free to reach out if you would like to discuss any of my comments.
    occupancy_avg
from {{ ref('int_clearinghouse__detector_agg_five_minutes') }}
where
    sample_date >= dateadd(month, -1, date_trunc('month', current_date))
I believe one of these values is a timestamp and the other is a date. Do you run into any issues when performing this comparison?
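Snowflake will generally coerce a date to a timestamp for such comparisons, but if it does cause trouble, an explicit cast makes the intent unambiguous, e.g.:

```sql
-- casting the left-hand side to date removes any implicit
-- timestamp/date coercion in the comparison
where sample_date::date >= dateadd(month, -1, date_trunc('month', current_date))
```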
case
    when
        fa.volume_sum - ms.volume_mean / nullifzero(ms.volume_stddev) > 3
    then coalesce(ms.volume_95th, 173)
I recommend adding a note here explaining why a value of 173 is being used.
case
    when
        fa.occupancy_avg > ms.occupancy_95th
    then coalesce(ms.occupancy_95th, 0.8)
Adding a similar note here on why 0.8 is being used would be beneficial.
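A sketch of both suggestions together is below. The comment text is placeholder; the real rationale for 173 and 0.8 has to come from the author. Note also that, as quoted, the volume condition divides only ms.volume_mean by the stddev due to operator precedence; a z-score needs parentheses around the numerator, as written here.

```sql
case
    when (fa.volume_sum - ms.volume_mean) / nullifzero(ms.volume_stddev) > 3
        -- 173: fallback volume cap used when no 95th percentile exists for
        -- this detector/month (document where this number comes from)
        then coalesce(ms.volume_95th, 173)
    else fa.volume_sum
end as volume_corrected,
case
    when fa.occupancy_avg > ms.occupancy_95th
        -- 0.8: fallback occupancy cap when no 95th percentile is available
        -- (document the rationale here as well)
        then coalesce(ms.occupancy_95th, 0.8)
    else fa.occupancy_avg
end as occupancy_corrected
```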
    detector_id,
    sample_date
from {{ ref('int_diagnostics__detector_status') }}
where status = 'Good'
Should you add a date timeframe to this where clause, similar to the previous CTE? Otherwise, you are probably grabbing a very large data set that may affect your model's performance.
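That could mirror the filter from the aggregation CTE quoted earlier, e.g.:

```sql
select
    detector_id,
    sample_date
from {{ ref('int_diagnostics__detector_status') }}
where
    status = 'Good'
    -- restrict the scan to the same window as the five-minute aggregate
    and sample_date >= dateadd(month, -1, date_trunc('month', current_date))
```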
@@ -16,11 +16,10 @@ detector_agg as (
         station_id,
         lane,
         station_type,
         volume_sum,
         occupancy_avg,
-        speed_weighted,
I see you are removing speed_weighted from this model, will this cause any downstream issues?
@thehanggit this model timed out after taking 10 hours today -- I think we may need to disable it and take a closer look at the performance characteristics.
For sure, I will take some time to check the queries tomorrow. I think the mean and variance calculation based on one month of data might contribute to it.
@ian-r-rose I checked the query profile by running a test in a worksheet. It turns out the aggregate (not sure, but it may relate to the variance calculation?) is expensive. I switched from monthly data to weekly data, reducing the size by half. Another way is to use the 95th percentile only to detect outliers, which I tested before; it also performs well for high flow value detection.
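For reference, the percentile-only variant would drop the avg/stddev aggregates entirely, roughly as below (a sketch under the weekly-window change described above; the CTE name is illustrative):

```sql
weekly_stats as (
    select
        detector_id,
        date_trunc('week', sample_date) as week,
        -- thresholds only: no mean/stddev, which the profiling above
        -- suggested were the expensive part of the aggregate
        percentile_cont(0.95) within group (order by volume_sum) as volume_95th,
        percentile_cont(0.95) within group (order by occupancy_avg) as occupancy_95th
    from five_minute_agg
    group by detector_id, date_trunc('week', sample_date)
)
```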
This is related to Issue #278