Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Count of all file downloads from metrics API doesn't always match what's in UI (on old homepage metrics bar) #4970

Closed
jggautier opened this issue Aug 16, 2018 · 10 comments

Comments

@jggautier
Copy link
Contributor

jggautier commented Aug 16, 2018

For some, but not all, Dataverse repositories running Dataverse 4.9.1-4.9.2, the count of all file downloads we get from the new metrics API doesn't match the download count of all files in an installation that's displayed in the metrics bar on the homepage:

sefsef vs. {"status":"OK","data":{"count":11120}}

Dataverses running 4.9.1-4.9.2 with different counts

Dataverses running 4.9.1-4.9.2 with matching acounts

@jggautier
Copy link
Contributor Author

jggautier commented Aug 16, 2018

I thought the difference might be caused by the query I think the API uses (or some version of it) and the number of entries in the database's guestbookresponse table that don't have timestamps (responsetimes). But is that likely for newer Dataverse installations like QDR's that probably don't have entries in its guestbookresponse table with no timestamps?

Here's the query:

select to_char(date_trunc('month', guestbookresponse.responsetime), 'Mon YYYY') as months, count(guestbookresponse.id) AS new_datasets,
sum(count(guestbookresponse.id)) over (order by date_trunc('month', guestbookresponse.responsetime)) as cumulative
from guestbookresponse
join dvobject on dvobject.id = guestbookresponse.datafile_id
where dvobject.publicationdate is not null
and guestbookresponse.responsetime is not null
group by date_trunc('month', responsetime)
order by date_trunc('month', responsetime) desc
limit 12;

For Harvard Dataverse, if you remove and guestbookresponse.responsetime is not null, the cumulative total includes the entries with no (null) responsetimes, which is closer to the count shown in the front page's metrics bar.

@qqmyers
Copy link
Member

qqmyers commented Aug 17, 2018

@jggautier For QDR, it looks like the GUI counts entries where dvobject.publicationdate is null. Removing that clause from the query above makes the counts match.

@jggautier
Copy link
Contributor Author

Thanks @qqmyers. Does that mean that for QDR, downloading files that are in unpublished datasets increments the download count in the UI? That would be a bug :(

@qqmyers
Copy link
Member

qqmyers commented Aug 17, 2018

@jggautier - maybe one that's fixed though - #4637 . I suspect we have some legacy counts from before that was fixed.

@jggautier
Copy link
Contributor Author

jggautier commented Aug 20, 2018

Oh right! Can we remove from the guestbookresponse table the download counts created by the bug @qqmyers described in #4637?

It looks like doing that will make QDR's UI and API counts match. The bug accounts for a small part of the mismatch for Harvard Dataverse.

So I think we're aware of three issues with the counts, and if possible I think we should:

  • remove counts created by the first two issues
  • make sure the query used for the UI and for the API count are similar (so that they handle the counts with no timestamps in the same way). I think the UI count is coming from a query that counts all rows in the guestbookresponse table. That query should be similar to the query the API uses (same where clauses).
  1. Counts from unpublished datasets

#4637 caused Dataverse to count downloads of files in datasets that weren't published. I think we can find those downloads in the guestbookresponse table entries by querying for (1) entries that were created for currently unpublished datasets and (2) entries created before the currently published dataset was published:

where dvobject.publicationdate > guestbookresponse.responsetime and dvobject.publicationdate is not null
(the API query ignores counts from currently unpublished datasets, but the UI is counting those)

  1. Counts from double counting

The other situation @qqmyers described in #4637 is double counting. If there was a way to find entries in the guestbookresponse table where the same file was downloaded by the same person within the same second, removing those could make the count more accurate (it might correct more than just the counts created by the bug). How do sessionids work? I dug into the sessionid column a little to see if we could use it to represent one user, but it looks like sometimes sessionids are assigned to more than one user. Or maybe we could use the authenticateduserids with each entry. Since the double counting described in #4637 happened on draft datasets, the user would've been logged in, so the entries created by that bug would have authenticateduserids.

But even if nothing can be done about the double counting, it seems that removing counts from unpublished datasets will make QDR's UI and API counts match (and hopefully other installations').

  1. Counts with no timestamps

On Harvard Dataverse, most of the difference between the counts from the UI and the API are from guestbookresponse entries with no timestamps (#3324). The UI is counting those, and the API is not. I think the counts without timestamps should be counted by the API.

@pdurbin suggested adding a timestamp to the guestbookresponse table. It could be an obviously fake timestamp.

Also in #3324 @landreev suggested:

when we generate access reports/otherwise display this data, we can think of presenting it in some sensible way: like, instead of listing all these downloads with no recorded times, we should probably just say "plus N downloads were recorded before [earliest download date recorded]; no further information is available about those prehistoric downloads, sorry for the inconvenience."

If we add fake timestamps to entries in the guestbookresponse table that have no timestamps, then when all of the counts are displayed by month, in addition to each month there would a group with that fake timestamp that we could describe as @landreev suggested.

Or if we leave them null, we could remove the clause guestbookresponse.responsetime is not null from the query that the API uses. And when all of the counts are displayed by month, there would be a null group that we could describe as @landreev suggested.

(When the total counts are displayed, I would think we don't have to explain that the dates of some counts are unknown.)

@jggautier
Copy link
Contributor Author

jggautier commented Aug 23, 2018

Here's the query I've used to find file downloads (and any other downloadtypes e.g. "explores") that are counted two or more times within the same second:

select email, name, authenticateduser_id, responsetime, sessionid, datafile_id, downloadtype, guestbook_id
from guestbookresponse g1
where authenticateduser_id is not null --assuming we would look for only "downloads" from logged in users
and exists (
	select 1
	from guestbookresponse g2
	where 

	--find "downloads" that occurred at the same time
	g2.responsetime = g1.responsetime

	--OR find "downloads" within one second
	--g2.responsetime > g1.responsetime - interval '1 sec' 
	--and g2.responsetime < g1.responsetime + interval '1 sec'
	
	--from the same logged in user
	and g2.authenticateduser_id = g1.authenticateduser_id

	--for the same file
	and g2.datafile_id = g1.datafile_id
	
	--where the guestbookresponse entries aren't the same
	and g2.id <> g1.id
	)

--to show the results are what we want
order by sessionid, datafile_id, responsetime;

@matthew-a-dunlap
Copy link
Contributor

matthew-a-dunlap commented Jan 9, 2019

Two other issues that may be at play:

  • The metrics caching. Most of the queries default to 7 days unless configured.
  • There may be weirdness in how the metrics compare dates. Not sure but worth investigating. (works as expected)

@djbrooke djbrooke changed the title Count of all file downloads from metrics API doesn't always match what's in UI (on homepage metrics bar) Count of all file downloads from metrics API doesn't always match what's in UI (on old homepage metrics bar) Jan 9, 2019
@matthew-a-dunlap matthew-a-dunlap self-assigned this Jan 9, 2019
@matthew-a-dunlap
Copy link
Contributor

Looking at the initial issues in this story, it seems that QDR's counts match now. The only other issue I was able to identify is that we are not counting undated historic download counts in the per month query. This fix will count those undated results if the month queried is on/after the oldest dated record.

Alongside this, all the metrics table entries for downloads should be cleared so they are requeried with the updated values.

@djbrooke
Copy link
Contributor

@matthew-a-dunlap assigning to you and pulling back to team dev based on our discussion

@kcondon
Copy link
Contributor

kcondon commented Feb 1, 2019

Numbers match on copy of prod db. Closing but will merge when custom home page passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants