Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: store hashed remote address on journals #13595

Closed

Conversation

miketheman
Copy link
Member

@miketheman miketheman commented May 5, 2023

Another step towards #8158 .

Starts storing only hashed IP addresses in JournalEntry.submitted_from text column from merge-time forward for any new Journals.

Provides a CLI command:

$ python -m warehouse hashing journal-entry -h
Usage: python -m warehouse hashing journal-entry [OPTIONS]

  Hash `journals.submitted_from` column with salt

Options:
  -s, --salt TEXT            Pass value instead of prompting for salt
  -b, --batch-size INTEGER   Number of rows to hash at a time  [default:
                             10000]
  -st, --sleep-time INTEGER  Number of seconds to sleep between batches
                             [default: 1]
  --continue-until-done      Continue hashing until all rows are hashed
  -h, --help                 Show this message and exit.

@miketheman miketheman requested a review from a team as a code owner May 5, 2023 23:47
@miketheman miketheman force-pushed the miketheman/journal-entry-hashed-1 branch from 6016541 to ae36a83 Compare May 5, 2023 23:52
@miketheman miketheman marked this pull request as draft May 8, 2023 12:07
@miketheman miketheman added the blocked Issues we can't or shouldn't get to yet label May 8, 2023
@miketheman miketheman removed the blocked Issues we can't or shouldn't get to yet label May 24, 2023
Signed-off-by: Mike Fiedler <miketheman@gmail.com>
Hopefully there's enough operator toggles to tune this run,
as there's a large amount of records in the database.
This should hopefully allow for balancing load vs locking.

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
@miketheman miketheman force-pushed the miketheman/journal-entry-hashed-1 branch from bde4bd7 to 8c122f8 Compare May 24, 2023 21:18
@miketheman miketheman marked this pull request as ready for review May 24, 2023 21:19
@miketheman miketheman enabled auto-merge (squash) May 25, 2023 17:18
@miketheman miketheman disabled auto-merge May 25, 2023 17:18

# Update the rows
session.add_all(unhashed_rows)
session.commit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking deeply about one thing, which is: Do we want to ensure that an IpAddress object and corresponding row exists in the DB before we obliterate the data?

probably not is my gut reaction. since for all data since #6339 (August, 2019) there already exists a corresponding ProjectEvent that has it.

Copy link
Member

@ewdurbin ewdurbin May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm now with the date in mind.... and the fact that the IP addresses exist on ProjectEvents... dropping the column altogether may be the right call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As much as I enjoyed writing the CLI tool (and the tests for it!) I'd be much happier to remove the column instead. I'll work that up instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miketheman
Copy link
Member Author

Closing, the alternative was to drop it like it was hot.

@miketheman miketheman closed this May 25, 2023
@miketheman miketheman deleted the miketheman/journal-entry-hashed-1 branch May 25, 2023 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants