Count of all file downloads from metrics API doesn't always match what's in UI (on old homepage metrics bar) #4970

jggautier · 2018-08-16T19:56:08Z

For some, but not all, Dataverse repositories running Dataverse 4.9.1-4.9.2, the count of all file downloads we get from the new metrics API doesn't match the download count of all files in an installation that's displayed in the metrics bar on the homepage:

(homepage metrics bar count; screenshot not preserved) vs. {"status":"OK","data":{"count":11120}}

Dataverses running 4.9.1-4.9.2 with different counts

Harvard Dataverse
- Homepage count: https://dataverse.harvard.edu
- API count: https://dataverse.harvard.edu/api/info/metrics/downloads/toMonth
UAL Dataverse
- Homepage count: https://dataverse.library.ualberta.ca
- API count: https://dataverse.library.ualberta.ca/api/info/metrics/downloads/toMonth
Qualitative Data Repository
- Homepage count: https://data.qdr.syr.edu
- API count: https://data.qdr.syr.edu/api/info/metrics/downloads/toMonth

Dataverses running 4.9.1-4.9.2 with matching counts

UNB Libraries Dataverse
- Homepage count: https://dataverse.lib.unb.ca
- API count: https://dataverse.lib.unb.ca/api/info/metrics/downloads/toMonth

jggautier · 2018-08-16T19:59:20Z

I thought the difference might be caused by the query I think the API uses (or some version of it), combined with the number of entries in the database's guestbookresponse table that don't have timestamps (responsetimes). But is that likely for newer Dataverse installations like QDR's, which probably don't have entries without timestamps in their guestbookresponse tables?

Here's the query:

select to_char(date_trunc('month', guestbookresponse.responsetime), 'Mon YYYY') as months, count(guestbookresponse.id) AS new_datasets,
sum(count(guestbookresponse.id)) over (order by date_trunc('month', guestbookresponse.responsetime)) as cumulative
from guestbookresponse
join dvobject on dvobject.id = guestbookresponse.datafile_id
where dvobject.publicationdate is not null
and guestbookresponse.responsetime is not null
group by date_trunc('month', responsetime)
order by date_trunc('month', responsetime) desc
limit 12;

For Harvard Dataverse, if you remove the clause "and guestbookresponse.responsetime is not null", the cumulative total includes the entries with no (null) responsetimes, which brings it closer to the count shown in the front page's metrics bar.
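
To illustrate the effect of that clause, here is a minimal sketch using Python's sqlite3 module. It assumes heavily simplified stand-ins for the dvobject and guestbookresponse tables (the real ones live in PostgreSQL and have many more columns), and every row is invented for illustration. The helper name count_downloads is hypothetical.

```python
import sqlite3

# Toy stand-ins for the real PostgreSQL tables; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dvobject (id integer primary key, publicationdate text);
    create table guestbookresponse (
        id integer primary key, datafile_id integer, responsetime text);
    insert into dvobject values (1, '2016-01-01');
    insert into guestbookresponse (datafile_id, responsetime) values
        (1, '2018-07-03 10:00:00'),
        (1, '2018-08-01 09:30:00'),
        (1, null),   -- legacy download with no timestamp
        (1, null);   -- legacy download with no timestamp
""")

BASE = """
    select count(g.id)
    from guestbookresponse g
    join dvobject d on d.id = g.datafile_id
    where d.publicationdate is not null
"""

def count_downloads(include_undated: bool) -> int:
    """Count downloads of published files, optionally keeping null responsetimes."""
    sql = BASE if include_undated else BASE + " and g.responsetime is not null"
    return conn.execute(sql).fetchone()[0]

print("With the clause (dated rows only):", count_downloads(False))
print("Without the clause (all rows):    ", count_downloads(True))
```

With this toy data the two totals differ by exactly the number of rows that have a null responsetime, which is the kind of gap described above.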

qqmyers · 2018-08-17T15:03:53Z

@jggautier For QDR, it looks like the GUI counts entries where dvobject.publicationdate is null. Removing that clause from the query above makes the counts match.

jggautier · 2018-08-17T15:51:43Z

Thanks @qqmyers. Does that mean that for QDR, downloading files that are in unpublished datasets increments the download count in the UI? That would be a bug :(

qqmyers · 2018-08-17T16:00:15Z

@jggautier - maybe one that's fixed though: #4637. I suspect we have some legacy counts from before that was fixed.

jggautier · 2018-08-20T12:19:36Z

Oh right! Can we remove from the guestbookresponse table the download counts created by the bug @qqmyers described in #4637?

It looks like doing that will make QDR's UI and API counts match. The bug accounts for a small part of the mismatch for Harvard Dataverse.

So I think we're aware of three issues with the counts, and if possible I think we should:

- remove counts created by the first two issues
- make sure the query used for the UI and for the API count are similar (so that they handle the counts with no timestamps in the same way). I think the UI count is coming from a query that counts all rows in the guestbookresponse table. That query should be similar to the query the API uses (same where clauses).

Counts from unpublished datasets

#4637 caused Dataverse to count downloads of files in datasets that weren't published. I think we can find those downloads in the guestbookresponse table entries by querying for (1) entries that were created for currently unpublished datasets and (2) entries created before the currently published dataset was published:

where dvobject.publicationdate > guestbookresponse.responsetime and dvobject.publicationdate is not null
(the API query ignores counts from currently unpublished datasets, but the UI is counting those)
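
A minimal sketch of how both cases might be selected together, again using Python's sqlite3 with invented, simplified tables. One assumption to flag: the clause as posted matches only case (2), because a comparison against a null publicationdate never evaluates to true, so this sketch adds an explicit "publicationdate is null" check for case (1).

```python
import sqlite3

# Simplified, invented stand-ins for the real PostgreSQL tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dvobject (id integer primary key, publicationdate text);
    create table guestbookresponse (
        id integer primary key, datafile_id integer, responsetime text);
    insert into dvobject values
        (1, '2018-03-01 00:00:00'),   -- published file
        (2, null);                    -- file in a still-unpublished dataset
    insert into guestbookresponse (id, datafile_id, responsetime) values
        (10, 1, '2018-02-15 12:00:00'),   -- before publication (case 2)
        (11, 1, '2018-04-02 08:00:00'),   -- after publication (legitimate)
        (12, 2, '2018-05-05 17:00:00');   -- unpublished dataset (case 1)
""")

# ISO-formatted text timestamps compare correctly as strings in SQLite.
suspect_ids = [row[0] for row in conn.execute("""
    select g.id
    from guestbookresponse g
    join dvobject d on d.id = g.datafile_id
    where d.publicationdate is null              -- case 1
       or d.publicationdate > g.responsetime     -- case 2
    order by g.id
""")]
print(suspect_ids)
```

Entries with a null responsetime on a published file are not selected by either condition, so undated legacy downloads would need to be handled separately.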

Counts from double counting

The other situation @qqmyers described in #4637 is double counting. If there was a way to find entries in the guestbookresponse table where the same file was downloaded by the same person within the same second, removing those could make the count more accurate (it might correct more than just the counts created by the bug). How do sessionids work? I dug into the sessionid column a little to see if we could use it to represent one user, but it looks like sometimes sessionids are assigned to more than one user. Or maybe we could use the authenticateduserids with each entry. Since the double counting described in #4637 happened on draft datasets, the user would've been logged in, so the entries created by that bug would have authenticateduserids.

But even if nothing can be done about the double counting, it seems that removing counts from unpublished datasets will make QDR's UI and API counts match (and hopefully other installations').

Counts with no timestamps

On Harvard Dataverse, most of the difference between the counts from the UI and the API is from guestbookresponse entries with no timestamps (#3324). The UI is counting those, and the API is not. I think the counts without timestamps should be counted by the API.

@pdurbin suggested adding a timestamp to the guestbookresponse table. It could be an obviously fake timestamp.

Also in #3324 @landreev suggested:

when we generate access reports/otherwise display this data, we can think of presenting it in some sensible way: like, instead of listing all these downloads with no recorded times, we should probably just say "plus N downloads were recorded before [earliest download date recorded]; no further information is available about those prehistoric downloads, sorry for the inconvenience."

If we add fake timestamps to entries in the guestbookresponse table that have no timestamps, then when all of the counts are displayed by month, in addition to each month there would be a group with that fake timestamp that we could describe as @landreev suggested.

Or if we leave them null, we could remove the clause "guestbookresponse.responsetime is not null" from the query that the API uses. Then, when all of the counts are displayed by month, there would be a null group that we could describe as @landreev suggested.

(When the total counts are displayed, I would think we don't have to explain that the dates of some counts are unknown.)
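
As a rough sketch of the "leave them null" option, the following groups invented rows by month with Python's sqlite3 and labels the null group along the lines @landreev suggested. The table is a simplified stand-in and the wording of the label is only an example.

```python
import sqlite3

# Simplified, invented stand-in for the guestbookresponse table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table guestbookresponse (id integer primary key, responsetime text);
    insert into guestbookresponse (responsetime) values
        ('2018-06-11 10:00:00'), ('2018-06-20 11:00:00'),
        ('2018-07-02 09:00:00'), (null), (null), (null);
""")

# strftime returns null for a null responsetime, so undated rows form one group.
rows = conn.execute("""
    select strftime('%Y-%m', responsetime), count(id)
    from guestbookresponse
    group by strftime('%Y-%m', responsetime)
""").fetchall()
rows.sort(key=lambda r: (r[0] is None, r[0] or ""))  # dated months first, null group last

# min() skips nulls, so this is the earliest recorded download time.
earliest = conn.execute("select min(responsetime) from guestbookresponse").fetchone()[0]

report = []
for month, downloads in rows:
    if month is None:
        report.append(f"plus {downloads} downloads recorded before {earliest[:10]} (date unknown)")
    else:
        report.append(f"{month}: {downloads}")
print("\n".join(report))
```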

jggautier · 2018-08-23T23:30:26Z

Here's the query I've used to find file downloads (and any other downloadtypes e.g. "explores") that are counted two or more times within the same second:

select email, name, authenticateduser_id, responsetime, sessionid, datafile_id, downloadtype, guestbook_id
from guestbookresponse g1
where authenticateduser_id is not null --assuming we would look for only "downloads" from logged in users
and exists (
	select 1
	from guestbookresponse g2
	where 

	--find "downloads" that occurred at the same time
	g2.responsetime = g1.responsetime

	--OR find "downloads" within one second
	--g2.responsetime > g1.responsetime - interval '1 sec' 
	--and g2.responsetime < g1.responsetime + interval '1 sec'
	
	--from the same logged in user
	and g2.authenticateduser_id = g1.authenticateduser_id

	--for the same file
	and g2.datafile_id = g1.datafile_id
	
	--where the guestbookresponse entries aren't the same
	and g2.id <> g1.id
	)

--to show the results are what we want
order by sessionid, datafile_id, responsetime;

matthew-a-dunlap · 2019-01-09T17:34:37Z

Two other issues that may be at play:

- The metrics caching. Most of the queries are cached for 7 days by default unless configured otherwise.
- ~~There may be weirdness in how the metrics compare dates. Not sure but worth investigating.~~ (works as expected)

matthew-a-dunlap · 2019-01-10T17:46:16Z

Looking at the initial issues in this story, it seems that QDR's counts match now. The only other issue I was able to identify is that we are not counting undated historic download counts in the per month query. This fix will count those undated results if the month queried is on/after the oldest dated record.

Alongside this, all the metrics table entries for downloads should be cleared so they are requeried with the updated values.
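
The rule described above can be sketched as follows. This is not the code from the actual fix, only an illustration of the stated behavior, using Python's sqlite3 with an invented, simplified table. The function name downloads_to_month is hypothetical.

```python
import sqlite3

# Simplified, invented stand-in for the guestbookresponse table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table guestbookresponse (id integer primary key, responsetime text);
    insert into guestbookresponse (responsetime) values
        ('2018-06-11 10:00:00'), ('2018-07-02 09:00:00'), (null), (null);
""")

def downloads_to_month(year_month: str) -> int:
    """Cumulative downloads up to and including year_month ('YYYY-MM')."""
    dated = conn.execute(
        "select count(*) from guestbookresponse "
        "where strftime('%Y-%m', responsetime) <= ?",
        (year_month,),
    ).fetchone()[0]
    oldest = conn.execute(
        "select min(strftime('%Y-%m', responsetime)) from guestbookresponse"
    ).fetchone()[0]
    undated = 0
    # Undated rows are added only once the queried month reaches the oldest dated record.
    if oldest is not None and year_month >= oldest:
        undated = conn.execute(
            "select count(*) from guestbookresponse where responsetime is null"
        ).fetchone()[0]
    return dated + undated

for month in ("2018-05", "2018-06", "2018-07"):
    print(month, downloads_to_month(month))
```

In this toy data the undated rows first appear in the June total, the month of the oldest dated record, and are carried forward in every later month.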

djbrooke · 2019-01-10T19:46:49Z

@matthew-a-dunlap assigning to you and pulling back to team dev based on our discussion

kcondon · 2019-02-01T22:34:12Z

Numbers match on copy of prod db. Closing but will merge when custom home page passes.

jggautier added Type: Bug a defect Feature: Metrics + Reports labels Aug 16, 2018

scolapasta mentioned this issue Jan 9, 2019

Dynamic Custom Homepage - ROUND TWO #5445

Closed

22 tasks

djbrooke added the Status: Ready label Jan 9, 2019

djbrooke assigned jggautier Jan 9, 2019

djbrooke added ready for estimation and removed Status: Ready labels Jan 9, 2019

djbrooke changed the title ~~Count of all file downloads from metrics API doesn't always match what's in UI (on homepage metrics bar)~~ Count of all file downloads from metrics API doesn't always match what's in UI (on old homepage metrics bar) Jan 9, 2019

djbrooke unassigned jggautier Jan 9, 2019

djbrooke removed the ready for estimation label Jan 9, 2019

matthew-a-dunlap self-assigned this Jan 9, 2019

matthew-a-dunlap added a commit that referenced this issue Jan 9, 2019

Include undated guestbooks in download count #4970

dfd7ff2

djbrooke added Status: Development and removed Status: This/Next Sprint labels Jan 10, 2019

matthew-a-dunlap added a commit that referenced this issue Jan 10, 2019

Download count undated only after first dated #4970

b150321

matthew-a-dunlap mentioned this issue Jan 10, 2019

4970 metrics counts off #5453

Closed

5 tasks

matthew-a-dunlap added Status: Code Review and removed Status: Development labels Jan 10, 2019

matthew-a-dunlap removed their assignment Jan 10, 2019

djbrooke assigned matthew-a-dunlap Jan 10, 2019

djbrooke added Status: Development and removed Status: Code Review labels Jan 10, 2019

matthew-a-dunlap mentioned this issue Jan 10, 2019

Add Metrics API to get datasets by subject AND to month #5398

Closed

matthew-a-dunlap added a commit that referenced this issue Jan 14, 2019

Fix historic download counts #4970 #5447

7b7c9dd

mheppler mentioned this issue Jan 17, 2019

5445 dynamic custom hmpg redux #5475

Merged

5 tasks

matthew-a-dunlap removed their assignment Jan 28, 2019

matthew-a-dunlap added Status: Code Review and removed Status: Development labels Jan 28, 2019

djbrooke assigned sekmiller Jan 28, 2019

djbrooke added this to the 4.11 - Preservation Integrations milestone Jan 28, 2019

sekmiller removed their assignment Jan 30, 2019

djbrooke assigned matthew-a-dunlap Jan 30, 2019

matthew-a-dunlap removed their assignment Jan 30, 2019

djbrooke assigned djbrooke and TaniaSchlatter and unassigned TaniaSchlatter Jan 30, 2019

mheppler added Status: QA and removed Status: Code Review labels Feb 1, 2019

mheppler unassigned djbrooke Feb 1, 2019

kcondon self-assigned this Feb 1, 2019

kcondon closed this as completed Feb 1, 2019

kcondon removed the Status: QA label Feb 1, 2019

jggautier mentioned this issue Oct 7, 2024

Adjust Metrics API's "metrics/filedownloads" endpoint to include downloads where download date is unknown #10911

Open
