total_vaccinations and daily_vaccinations for Saudi Arabia do not match #333

Closed
nibble0101 opened this issue Jan 18, 2021 · 7 comments

@nibble0101

nibble0101 commented Jan 18, 2021

In one of the issues here, which I have failed to locate, I was made to understand that daily_vaccinations values are estimated from total_vaccinations values using interpolation for countries which do not report daily vaccination figures. Below is an extract of vaccination figures for Saudi Arabia.

image

My understanding is that since 2021-01-08, 2021-01-09 and 2021-01-10 have no total_vaccinations values, the daily_vaccinations figures for those days are estimated from the reported total_vaccinations figures for 2021-01-07 and 2021-01-11, which are 137862 and 178337 respectively. If the intermediate values are interpolated between those two totals, then the cumulative sum of the estimates shouldn't exceed the second total. But in this case, 137862 + 23990 + 19366 + 17055 + 15667 equals 213940, which is much greater than 178337. What am I missing here, @edomt?
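
The arithmetic in the question can be checked with a few lines of JavaScript. This is only a sketch: the variable names are illustrative, and the figures are the ones quoted above.

```javascript
// Reported total on 2021-01-07 and the four daily_vaccinations values quoted in the issue.
const startTotal = 137862;
const dailyValues = [23990, 19366, 17055, 15667];
const reportedEndTotal = 178337; // reported total on 2021-01-11

// Add the daily values on top of the starting total.
const impliedTotal = dailyValues.reduce((sum, v) => sum + v, startTotal);
const excess = impliedTotal - reportedEndTotal;

console.log(impliedTotal); // 213940
console.log(excess);       // 35603
```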

@edomt
Collaborator

edomt commented Jan 19, 2021

You're missing the last step of the process, which is the 7-day rolling average! :)
This is better explained here: https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/

daily_vaccinations: new doses administered per day (7-day smoothed). For countries that don't report data on a daily basis, we assume that doses changed equally on a daily basis over any periods in which no data was reported. This produces a complete series of daily figures, which is then averaged over a rolling 7-day window.

@nibble0101
Author

So essentially, between each pair of reported totals, you estimate daily vaccinations by distributing the increase equally across the days in that interval. Since some intervals are shorter than 7 days, the moving average spans multiple intervals, which is why the cumulative total of the daily estimates can exceed the actual reported total, as in the case above. From my understanding of a 7-day moving average, in this case you would only have 4 data points for the rolling average, which, going by your estimates, run from 2021-01-14 to the end. I am struggling to understand how you arrived at the estimates before 2021-01-14, that is, from 2021-01-08 to 2021-01-13.

@edomt
Collaborator

edomt commented Jan 19, 2021

That's because our 7-day window allows partial data. So the earliest result will be based on 1 point, the second one will be the average of 2 points, etc. And from the 7th one onwards, it's the average of the last 7 points.
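
A minimal sketch of a rolling average that allows partial windows, as described in this comment. The function name `rollingMeanPartial` and the sample series are hypothetical; this is not the project's actual code, just one way to express the idea.

```javascript
// Rolling mean that allows partial windows: the i-th result averages
// the last `windowSize` values up to and including index i, or fewer
// if fewer are available so far.
function rollingMeanPartial(values, windowSize = 7) {
  return values.map((_, i) => {
    const start = Math.max(0, i - windowSize + 1);
    const slice = values.slice(start, i + 1);
    return slice.reduce((sum, v) => sum + v, 0) / slice.length;
  });
}

// Hypothetical daily series, chosen only to make the averages easy to read.
const sampleDaily = [10, 20, 30, 40, 50, 60, 70, 80];
console.log(rollingMeanPartial(sampleDaily));
// [10, 15, 20, 25, 30, 35, 40, 50]
```

The first six results average 1 to 6 points, and from the 7th result onwards each one averages exactly the last 7 points.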

@nibble0101
Author

nibble0101 commented Jan 19, 2021

Pardon my ignorance. If I understood your explanation correctly, you calculate a 1-day, 2-day, ..., n-day, ..., 7-day average at each n-th day, and then continue with 7-day moving averages from the 7th data point onwards. If that interpretation is correct, don't you think the above figures are incorrect? I would expect the first 4 figures to be the same. From 2021-01-08 to 2021-01-11, a span of 4 days:

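const estimatedDailyVaccination = (178337 - 137862) / 4; // 10118.75 doses per day
// Averaging identical values gives the same value again, so every partial average is 10118.75.
const smoothedEstimatedDailyVaccination1 = estimatedDailyVaccination; // For day 1
const smoothedEstimatedDailyVaccination2 = (estimatedDailyVaccination + estimatedDailyVaccination) / 2; // For day 2
// And so on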

@edomt
Collaborator

edomt commented Jan 19, 2021

No worries, it'll give me a good opportunity to explain this process fully and redirect people here if the same questions arise later. Here's the step-by-step process, based on our current data for Saudi Arabia:

image

  • Step 1: we fill the gaps by assuming a perfectly linear progression between distant totals (same number of vaccinations on each missing day).
  • Step 2: we calculate the daily difference between each of these new totals to get a series of daily vaccinations.
  • Step 3: we apply a 7-day rolling average with partial window where necessary. In this example, the cells in blue (F9 to F13) are complete averages based on 7 days of data. On the other hand, the red cells (F3 to F8) are partial averages based on what's available so far. F8 is the average of 6 available days (E3 to E8), F7 is the average of 5 available days (E3 to E7), all the way down to F3 that is the "average" of only 1 available day (E3).
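
Steps 1 and 2 above can be sketched in JavaScript as follows. The function names `fillTotalsLinearly` and `dailyDifferences` are hypothetical, missing days are assumed to be represented as `null`, and the only real figures used are the two Saudi Arabia totals quoted earlier in this thread (137862 on 2021-01-07 and 178337 on 2021-01-11). This is an illustration of the described method, not the project's actual code.

```javascript
// Step 1: replace each run of nulls with values on a straight line
// between the surrounding reported totals. Leading or trailing nulls
// (with no reported total on one side) are left as null.
function fillTotalsLinearly(totals) {
  const filled = totals.slice();
  let prev = -1; // index of the most recent reported total
  for (let i = 0; i < filled.length; i++) {
    if (filled[i] === null) continue;
    if (prev >= 0 && i - prev > 1) {
      const step = (filled[i] - filled[prev]) / (i - prev);
      for (let j = prev + 1; j < i; j++) {
        filled[j] = filled[prev] + step * (j - prev);
      }
    }
    prev = i;
  }
  return filled;
}

// Step 2: difference between each filled total and the previous one.
function dailyDifferences(filled) {
  return filled.slice(1).map((v, i) => v - filled[i]);
}

// Totals for 2021-01-07 to 2021-01-11, with the three unreported days as null.
const totals = [137862, null, null, null, 178337];
const filledTotals = fillTotalsLinearly(totals);
const dailyFromTotals = dailyDifferences(filledTotals);

console.log(filledTotals);    // [137862, 147980.75, 158099.5, 168218.25, 178337]
console.log(dailyFromTotals); // [10118.75, 10118.75, 10118.75, 10118.75]
```

Within this one gap every daily value is 10118.75. The published figures differ because Step 3 then averages over a window that also includes the daily values from before 2021-01-08.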

Here's the complete spreadsheet, where I've left all the formulas in the cells, so you can check how I arrived at each number: saudi_arabia_example.xlsx

@nibble0101
Author

nibble0101 commented Jan 19, 2021

Thanks for the spreadsheet and its formulas. I was actually using the same method, except that I started the rolling averages from the first estimated value, whereas you start from the lower observed value.

@lizhiwei1994

Sorry to trouble you. Is there a function that can do this whole process (filling the gaps and smoothing)?
For example, given a numeric vector a that has some missing values, the function would return b, the smoothed vector with no missing values:

b = function(a)
