-
Notifications
You must be signed in to change notification settings - Fork 490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Count of all file downloads from metrics API doesn't always match what's in UI (on old homepage metrics bar) #4970
Comments
I thought the difference might be caused by the query I think the API uses (or some version of it) and the number of entries in the database's guestbookresponse table that don't have timestamps (responsetimes). But is that likely for newer Dataverse installations like QDR's that probably don't have entries in its guestbookresponse table with no timestamps? Here's the query: select to_char(date_trunc('month', guestbookresponse.responsetime), 'Mon YYYY') as months, count(guestbookresponse.id) AS new_datasets, For Harvard Dataverse, if you remove |
@jggautier For QDR, it looks like the GUI counts entries where dvobject.publicationdate is null. Removing that clause from the query above makes the counts match. |
Thanks @qqmyers. Does that mean that for QDR, downloading files that are in unpublished datasets increments the download count in the UI? That would be a bug :( |
@jggautier - maybe one that's fixed though - #4637 . I suspect we have some legacy counts from before that was fixed. |
Oh right! Can we remove from the guestbookresponse table the download counts created by the bug @qqmyers described in #4637? It looks like doing that will make QDR's UI and API counts match. The bug accounts for a small part of the mismatch for Harvard Dataverse. So I think we're aware of three issues with the counts, and if possible I think we should:
#4637 caused Dataverse to count downloads of files in datasets that weren't published. I think we can find those downloads in the guestbookresponse table entries by querying for (1) entries that were created for currently unpublished datasets and (2) entries created before the currently published dataset was published: where dvobject.publicationdate > guestbookresponse.responsetime and dvobject.publicationdate is not null
The other situation @qqmyers described in #4637 is double counting. If there was a way to find entries in the guestbookresponse table where the same file was downloaded by the same person within the same second, removing those could make the count more accurate (it might correct more than just the counts created by the bug). How do sessionids work? I dug into the sessionid column a little to see if we could use it to represent one user, but it looks like sometimes sessionids are assigned to more than one user. Or maybe we could use the authenticateduserids with each entry. Since the double counting described in #4637 happened on draft datasets, the user would've been logged in, so the entries created by that bug would have authenticateduserids. But even if nothing can be done about the double counting, it seems that removing counts from unpublished datasets will make QDR's UI and API counts match (and hopefully other installations').
On Harvard Dataverse, most of the difference between the counts from the UI and the API are from guestbookresponse entries with no timestamps (#3324). The UI is counting those, and the API is not. I think the counts without timestamps should be counted by the API. @pdurbin suggested adding a timestamp to the guestbookresponse table. It could be an obviously fake timestamp. Also in #3324 @landreev suggested:
If we add fake timestamps to entries in the guestbookresponse table that have no timestamps, then when all of the counts are displayed by month, in addition to each month there would a group with that fake timestamp that we could describe as @landreev suggested. Or if we leave them null, we could remove the clause (When the total counts are displayed, I would think we don't have to explain that the dates of some counts are unknown.) |
Here's the query I've used to find file downloads (and any other downloadtypes e.g. "explores") that are counted two or more times within the same second:
|
Two other issues that may be at play:
|
Looking at the initial issues in this story, it seems that QDR's counts match now. The only other issue I was able to identify is that we are not counting undated historic download counts in the per month query. This fix will count those undated results if the month queried is on/after the oldest dated record. Alongside this, all the metrics table entries for downloads should be cleared so they are requeried with the updated values. |
@matthew-a-dunlap assigning to you and pulling back to team dev based on our discussion |
Numbers match on copy of prod db. Closing but will merge when custom home page passes. |
For some, but not all, Dataverse repositories running Dataverse 4.9.1-4.9.2, the count of all file downloads we get from the new metrics API doesn't match the download count of all files in an installation that's displayed in the metrics bar on the homepage:
vs.
{"status":"OK","data":{"count":11120}}
Dataverses running 4.9.1-4.9.2 with different counts
Harvard Dataverse
UAL Dataverse
Qualitative Data Repository
Dataverses running 4.9.1-4.9.2 with matching acounts
The text was updated successfully, but these errors were encountered: