Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Archive ActionLogRecord entries #5916

Closed
landreev opened this issue Jun 5, 2019 · 15 comments · Fixed by #8292
Closed

Archive ActionLogRecord entries #5916

landreev opened this issue Jun 5, 2019 · 15 comments · Fixed by #8292
Assignees

Comments

@landreev
Copy link
Contributor

landreev commented Jun 5, 2019

In the current implementation, action log entries keep accumulating unimpeded.
A quick (and somewhat anti-scientific) check suggests that our own prod. database is more than 50% action log entries by size, by now.

(the test performed was comparing the total output of pg_dump of the full database, vs. pg_dump --table actionlogrecord. It's "anti-scientific" because the output is generated as sql text that does not correspond 1:1 to the size of the data structures stored. But the ratio should be about right as a ballpark).

@djbrooke
Copy link
Contributor

djbrooke commented Jul 31, 2019

  • We'll bring it into a sprint and decide trim vs. clean/archive.
  • If we archive, we should make sure that @jggautier can still do what he does as far monitoring curation activities
  • Investigate if the enhanced logging that we added for @pameyer will be of use.

@mheppler
Copy link
Contributor

@landreev in Slack, 11:58 AM Tue Jan 26, 2021

actionlogrecords; we absolutely need to start purging that crap.

💩

@jggautier
Copy link
Contributor

jggautier commented Jan 26, 2021

Thought it might be helpful to describe how I've used the info in this actionlogrecord database table:

To help quantify curation activities on the Harvard Dataverse repository for an infographic that used to be posted to the IQSS website's metrics page. The metrics for that page are now being pulled from RT tickets, specifically how the tickets are tagged. I think the curation team is meeting this week to talk about this (e.g. how well is our use of the RT ticket tags, both the tagging and the counting of tags, representing curation activity?), so I'm not sure if what's in the actionlogrecord could continue to be used for this.

Also I used the actionlogrecord table to get a sense of and report the number of spam Dataverse collections in Harvard Dataverse that were deleted during the great spamming of 2020.

And the table is used in a few helpful queries in that Dataverse Database Queries Google Doc, like for counting the number of times that the Private URL and Submit for Review features have been used, which I think went into decisions about prioritizing GitHub issues related to those features.

In all of these cases, I don't think more than one or two year's worth of information has been needed. If the actionlogrecord table is trimmed, could the last two years be retained?

I was also told a few years ago, although I forgot who said it, that the actionlogrecord table wasn't really designed and isn't maintained for uses like these. That makes perfect sense, and I've always wondered how these uses of the actionlogrecord table intersect with other efforts, like talks about an integration with a curation tool developed at Yale and work on sending to and using event data from Google Analytics. This is all to say that maybe using the actionlogrecord in the ways I described should be considered temporary while more maintainable ways of getting this information are explored.

@djbrooke
Copy link
Contributor

  • Address for Harvard Dataverse Repository
  • Add a release note about how other installations should handle this

@landreev
Copy link
Contributor Author

landreev commented Nov 17, 2021

I'd like to address this quickly, before the upcoming Postgres upgrade and transfer.
But to be clear, I'm talking less about "add a mechanism", and more about "archive the older records quickly". Which is in line with what's described in the last comment above (from April).

So, no automated mechanisms in the application - just dumping the table and saving the backup on S3, in case anyone needs to look up past entries. And adding 2 words in the admin guide: "btw, this table is going to be constantly growing; so it'll end up being huge on a busy installation. This is how we trim it once in a while: ..."

@djbrooke djbrooke changed the title Add a mechanism to trim and/or archive ActionLogRecord entries Archive ActionLogRecord entries Nov 17, 2021
@djbrooke
Copy link
Contributor

djbrooke commented Nov 17, 2021

  • This will potentially reduce downtime, and will allow @jggautier to run queries more quickly
  • @jggautier should still be able to do what he needs - he currently accesses from AWS (thanks @kcondon for the set up). @landreev can provide instructions if needed.
  • Could potentially go in https://guides.dataverse.org/en/5.8/admin/maintenance.html We should add to the Guides in an appropriate location.
  • There's some speculation that this is slowing things down, so if possible we should test before/after

@landreev
Copy link
Contributor Author

As was mentioned in the opening comment, the size of the exported tables in SQL format is not an accurate representation of the database space use, as stored. Still, somewhat notable that, as of now, the actionlogrecord table accounts for almost 3/4 of the total SQL dump of the prod. db (32GB out of 44GB). Up from 1/2 back when the issue was opened.

@landreev
Copy link
Contributor Author

OK, a serious bloodletting has been administered to the database - purged everything up to 2021. Will post info on where the older records have been archived.

@djbrooke
Copy link
Contributor

Thanks @landreev. I'll move this over to in progress on the board. It sounds like what's remaining is:

  • Sharing the access information internally with the IQSS team
  • Updating the Guides (the maintenance page was mentioned above) to provide information for other installations

@landreev
Copy link
Contributor Author

All good; except for the maintenance page reference, above/in the post-planning update 5 days above - not sure what that was about. It may have been something from another conversation that ended up in this issue by mistake. (?)

@djbrooke
Copy link
Contributor

@landreev I understood that we wanted other installations to be able to go through the same process that we just did when their DBs start to grow because of the actionlogrecord entries, so we wanted to provide something on that page. If this is routine for a sysadmin and/or will be a relatively unique process for each installation depending on their setup, I'd be OK not adding the docs. What do you think?

@landreev
Copy link
Contributor Author

Oh, ok - yes, I was going to document it.
But the 'maintenance page' in the guide, linked above - that's a sample "sorry, we are down for maintenance" static page that we supply. (not a section in the guide on how to maintain your Dataverse installation...)

I'll find a spot for a section on this in the admin guide somewhere.

@djbrooke
Copy link
Contributor

@landreev - oh yes, you're correct, that's not the right spot. Thanks for documenting and putting it somewhere more appropriate!

@landreev
Copy link
Contributor Author

As I mentioned last week, I'm now wondering if we should add a simple auto-cleanup option to the application, as part of this issue, while we are at it.

The way it would work, an installation could specify a cutoff time period (as a number of days, probably?) - and then all the actionlogrecords older than that would be erased automatically every night, by a timer job. We would emphasize in the documentation/release note that it would be 100% the responsibility of the administrator to archive the records outside of the database before they are erased, in case they see any value in preserving them. I can see how having this option could be handy for an installation owner who doesn't see much of a need for keeping these around, and would prefer not having to worry/think about the table growing out of control.

For an installation that does not enable the option, nothing changes, the records live on forever, until/unless manually removed.

I'm curious what others think. If this feels too out of scope, I'm happy to document the problem and the manual solution in the guide and close the issue, as was the plan we agreed on earlier.

@pdurbin
Copy link
Member

pdurbin commented Nov 29, 2021

My two cents are that I'm ok with this issue being about documenting the problem and the fix.

A future issue could be about automating the fix.

landreev added a commit that referenced this issue Dec 9, 2021
landreev added a commit that referenced this issue Dec 9, 2021
landreev added a commit that referenced this issue Dec 9, 2021
landreev added a commit that referenced this issue Dec 13, 2021
kcondon added a commit that referenced this issue Dec 13, 2021
doc entries on trimming the actionlogrecord table. (#5916)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants