GuestBook bug? 633,795 downloads with no timestamps #3324

Closed
raprasad opened this issue Aug 30, 2016 · 9 comments
Labels: Component: Code Infrastructure (formerly "Feature: Code Infrastructure"); Type: Bug (a defect)

Comments

@raprasad
Contributor

raprasad commented Aug 30, 2016

Bug 1

Update on bug 1: @kcondon mentioned that these may be downloads ported from the 3.x system.

I have a July snapshot of the data. I just noticed that 633,795 entries in the guestbookresponse table have a NULL responsetime. That is fully 1/3 of the guestbookresponse entries.

In other words 633,795 file downloads have no timestamp attached (and don't appear in the metrics visualizations).

Both SQL statements below return 633,795:

select count(id) from guestbookresponse where responsetime is NULL;
select count(id) from guestbookresponse where responsetime is NULL and downloadtype='download';
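A minimal sketch of the same check, run against an in-memory SQLite table instead of the real PostgreSQL database. The toy rows and the `null_share` helper are hypothetical; only the table and column names come from the queries above.

```python
import sqlite3


def null_share(conn):
    """Return (entries without responsetime, total entries, share without it)."""
    # count(responsetime) skips NULLs, so the difference is the NULL count.
    missing, total = conn.execute(
        "select count(*) - count(responsetime), count(*) from guestbookresponse"
    ).fetchone()
    share = missing / total if total else 0.0
    return missing, total, share


# Toy stand-in for the guestbookresponse table (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table guestbookresponse ("
    " id integer primary key, responsetime text, downloadtype text)"
)
conn.executemany(
    "insert into guestbookresponse (responsetime, downloadtype) values (?, ?)",
    [
        (None, "download"),
        ("2015-04-05 10:00:00", "download"),
        ("2016-07-01 09:30:00", "download"),
    ],
)

missing, total, share = null_share(conn)
print(f"{missing} of {total} entries have no responsetime ({share:.0%})")
```

Computing the share in one query avoids running two separate counts and dividing by hand.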

Fix:

  • Make the database field responsetime NOT NULL
  • Check the code
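To illustrate the first proposed fix: a NOT NULL constraint makes the database reject any new entry that lacks a timestamp. The sketch below uses SQLite and a simplified, hypothetical table definition. On the real table, rows that already contain NULL would have to be dealt with before such a constraint could be added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema; the real guestbookresponse table has more columns.
conn.execute(
    "create table guestbookresponse ("
    " id integer primary key,"
    " responsetime text not null,"
    " downloadtype text)"
)

# An entry with a timestamp is accepted.
conn.execute(
    "insert into guestbookresponse (responsetime, downloadtype)"
    " values ('2016-08-30 12:00:00', 'download')"
)

# An entry without a timestamp violates the constraint and is rejected.
try:
    conn.execute(
        "insert into guestbookresponse (responsetime, downloadtype)"
        " values (NULL, 'download')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("select count(id) from guestbookresponse").fetchone()[0]
print(rejected, stored)
```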

@djbrooke
Contributor

djbrooke commented Sep 6, 2016

@kcondon - you spurred a quick discussion about this during backlog grooming. You said that this may be as expected and not a bug? Can you leave some details here? Thanks!

@raprasad
Contributor Author

raprasad commented Sep 6, 2016

@djbrooke @kcondon : There were two bugs in this ticket. To avoid confusion, I separated the 2nd one (API downloads not being counted at all) into this ticket: #3331

@djbrooke
Contributor

djbrooke commented Sep 7, 2016

Thanks @raprasad. So if this one is expected behavior (if confirmed by @kcondon), we'll close this one out and keep the other around.

@kcondon
Contributor

kcondon commented Sep 7, 2016

Yes, this is likely due to migration and expected behavior. Leonid would best be able to confirm since I believe he worked on this aspect of migration.

@djbrooke
Contributor

@landreev - can you take a look and see if this is expected? Thanks!

@landreev
Contributor

So yes, this is expected behavior, in the sense that these are grandfathered-in, older download entries for which we don't have dates/timestamps recorded. (Before we started logging individual downloads, we only had download counters.)

There is nothing we can do about it - it's just missing data. However, when we generate access reports/otherwise display this data, we can think of presenting it in some sensible way: like, instead of listing all these downloads with no recorded times, we should probably just say "plus N downloads were recorded before [earliest download date recorded]; no further information is available about those prehistoric downloads, sorry for the inconvenience."
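One way the suggested wording could be produced in a report is sketched below. This is only an illustration: the `download_summary` helper, the toy rows, and the exact phrasing are hypothetical, and timestamps are assumed to be stored as ISO-formatted text so that `min()` returns the earliest one.

```python
import sqlite3


def download_summary(conn):
    """Summarize dated downloads and fold undated ones into a single note."""
    dated, undated, earliest = conn.execute(
        "select count(responsetime),"
        "       count(*) - count(responsetime),"
        "       min(responsetime)"
        " from guestbookresponse"
    ).fetchone()
    line = f"{dated} downloads with recorded times"
    if undated and earliest is not None:
        # Keep only the date part (YYYY-MM-DD) of the earliest timestamp.
        line += f", plus {undated} downloads recorded before {earliest[:10]}"
    elif undated:
        line += f", plus {undated} downloads with no recorded time"
    return line


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table guestbookresponse (id integer primary key, responsetime text)"
)
conn.executemany(
    "insert into guestbookresponse (responsetime) values (?)",
    [(None,), (None,), ("2008-07-31 08:15:00",), ("2015-04-05 10:00:00",)],
)

summary = download_summary(conn)
print(summary)
```

The cutoff date is taken from the data itself here; a fixed date could be passed in instead if the earliest recorded timestamp turns out not to be the right boundary.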

@jggautier
Contributor

I'm measuring the rate of downloads over time for a particular dataverse on Harvard Dataverse, and trying to account for download entries with no timestamps. Thought others digging into this issue might find these details helpful:

It seems fair to say that only download entries associated with files migrated from 3.x to 4.0 have no timestamps. The migration happened in April 2015. Among the download entries with no timestamps (i.e. guestbookresponse.responsetime is null), the latest createdate (dvobject.createdate) of the associated files is April 23, 2015.
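The check described above can be sketched as a join between the two tables. The join column `guestbookresponse.datafile_id` is an assumption about the schema (the comment names only `dvobject.createdate` and `guestbookresponse.responsetime`), and the rows are made-up toy data in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table dvobject (id integer primary key, createdate text);
    -- datafile_id is an assumed foreign key to dvobject.id
    create table guestbookresponse (
        id integer primary key,
        datafile_id integer references dvobject(id),
        responsetime text
    );
    insert into dvobject values
        (1, '2014-11-02 00:00:00'),
        (2, '2015-04-23 00:00:00'),
        (3, '2016-01-15 00:00:00');
    insert into guestbookresponse (datafile_id, responsetime) values
        (1, NULL),
        (2, NULL),
        (2, '2015-06-01 12:00:00'),
        (3, '2016-02-01 09:00:00');
    """
)

# Latest creation date among files that have at least one undated download.
latest = conn.execute(
    """
    select max(dvobject.createdate)
    from guestbookresponse
    join dvobject on dvobject.id = guestbookresponse.datafile_id
    where guestbookresponse.responsetime is null
    """
).fetchone()[0]
print(latest)
```

If every undated download belongs to a file created on or before the migration, this maximum should not fall after the migration date.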

> we should probably just say "plus N downloads were recorded before [earliest download date recorded]; no further information is available about those prehistoric downloads, sorry for the inconvenience."

The earliest download date (guestbookresponse.responsetime) recorded is 2008-07-31. This makes me think that Dataverse was adding timestamps to download entries long before Harvard Dataverse's April 2015 migration to Dataverse 4.0. So instead of "plus N downloads were recorded before [earliest download date recorded]," would it be more accurate to say "plus N downloads were recorded before April 2015"?

@scolapasta
Contributor

There was tracking of download dates in DVN 3 (i.e. before April 2015), but not from the beginning of the project (i.e. 2006). So saying "before April 2015" doesn't make sense, since we do have recorded dates before that. The entries without timestamps would all be (I think) before the earliest recorded date.

@jggautier
Contributor

jggautier commented Apr 4, 2018

Spoke with @scolapasta, who suggested also looking at the guestbookresponse.id that the database assigns to each guestbook entry. The id number increases chronologically:

| | smallest id | largest id |
|---|---|---|
| no timestamp [1] | 1001 | 1954821 |
| has timestamp [2] | 2 | 3829250 [3] |

Just to be sure:
Number of guestbook entries whose ids are greater than 1954821 and that have no timestamp [4]: 0

The earliest timestamp of the guestbook entries whose ids are greater than 1954821 [5] is 2015-04-05.

It seems fair to say that:

  • All of the downloads with no recorded dates happened before April 5, 2015.
  • All of the downloads on and after April 5, 2015 have dates.

[1] select min(id), max(id)
from guestbookresponse
where guestbookresponse.responsetime is null

[2] select min(id), max(id)
from guestbookresponse
where guestbookresponse.responsetime is not null

[3] The copy of the database I query was updated April 2, 2018, so data recorded after April 2 isn't included.

[4] select count(*)
from guestbookresponse
where guestbookresponse.id > 1954821
and guestbookresponse.responsetime is null

[5] select min(responsetime)
from guestbookresponse
where guestbookresponse.id > 1954821
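The id-based reasoning above can be reproduced end to end with a short script. This is a sketch on made-up rows in SQLite, not the Harvard Dataverse data; it derives the boundary id from the table instead of hard-coding 1954821, and it relies on the same assumption as the comment, namely that ids increase chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table guestbookresponse (id integer primary key, responsetime text)"
)
conn.executemany(
    "insert into guestbookresponse (id, responsetime) values (?, ?)",
    [
        (1, "2008-07-31 08:15:00"),  # dated entry with a low id
        (2, None),                   # undated (migrated) entries
        (3, None),
        (4, "2015-04-05 08:00:00"),  # first dated entry after the boundary
        (5, "2016-07-01 09:30:00"),
    ],
)

# Largest id among entries without a timestamp (the boundary).
boundary = conn.execute(
    "select max(id) from guestbookresponse where responsetime is null"
).fetchone()[0]

# Sanity check: no undated entries above the boundary.
later_nulls = conn.execute(
    "select count(*) from guestbookresponse"
    " where id > ? and responsetime is null",
    (boundary,),
).fetchone()[0]

# Earliest timestamp above the boundary: the cutoff date for undated entries.
earliest_later = conn.execute(
    "select min(responsetime) from guestbookresponse where id > ?",
    (boundary,),
).fetchone()[0]

print(boundary, later_nulls, earliest_later)
```

Under that assumption, every undated entry predates `earliest_later`, which is the kind of cutoff the two conclusions above rely on.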
