Establish retention policy and retent old data #3532

Open
htgoebel opened this issue Apr 3, 2018 · 7 comments
Labels
needs discussion a product management/policy issue maintainers and users should discuss

Comments

@htgoebel

htgoebel commented Apr 3, 2018

I found that PyPI stores IP addresses and exact action dates for several years, e.g.

create | Nov 24, 2008, 11:33:29 AM | htgoebel from 79.207.178.171

According to the privacy policy there are four reasons to store the data. FMPOV none of these reasons requires storing this exact data for almost 10 years. The day and user might be of interest, but the exact time and IP address for sure are not.

Please establish a retention policy for deleting old data and then delete this old data. Thanks!

Background: As you might know, the European General Data Protection Regulation (GDPR) requires all services offered in the European market to have a retention policy. Also, a European court has decided that IP (v4) addresses are personal data too.

@dstufft
Member

dstufft commented Apr 3, 2018

I just want to ack this request, and say that I don't know the answer to this yet, will have to do some internal research to figure out what applies here.

@brainwane
Contributor

@dstufft has done some research and we're waiting for him to update this issue with what he learned. :) Thanks for the report @htgoebel.

@brainwane brainwane added the needs discussion a product management/policy issue maintainers and users should discuss label Jun 11, 2019
@brainwane
Contributor

@dstufft could you please reply to this thread with your data? cc @ewdurbin with his PSF hat on.

Per our meeting today @nlhkabu is going to do some research on this toward #5863, understanding best practices & prior art in other similar sites. Simply Secure may have some good resources on this.

@nlhkabu
Contributor

nlhkabu commented Aug 1, 2019

I couldn't find anything on Simply Secure, but I did manage to find a couple of other sources:

GDPR

Recital 39 of the GDPR states that the period for which the personal data is stored should be limited to a strict minimum and that time limits should be established by the data controller for deletion of the records (referred to as erasure in the GDPR) or for a periodic review.

Organisations must therefore ensure personal data is securely disposed of when no longer needed. This will reduce the risk that it will become inaccurate, out of date or irrelevant.

National Cyber Security Centre (UK Gov)

This is a very useful guide: https://www.ncsc.gov.uk/guidance/introduction-logging-security-purposes

Are logs held for long enough to answer incident questions?
For each log source you hold, you need to decide how long to store the data. This will depend on a number of factors including the cost and availability of storage, and the volume and usefulness of different data types (see Logging source section below). In general, we recommend that you hold logs which allow you to answer the incident questions from step 2 for a minimum of 6 months. The M-Trends 2018 report suggests that the average time to detect a cyber attack is 101 days and it's not uncommon for this figure to be significantly longer, so you may wish to store for longer if budget allows. Review and fine-tune as necessary.

Prior art

  • GitHub: "The security log lists the last 50 actions or those performed within the last 90 days."
  • Google: up to 6 months, depending on the log
  • GitLab: authentication log kept for all time (as far as I can tell; I couldn't find their policy, but I have logs dating back 3 years on my personal account)
  • npm: keeps logs for 90 days, although as far as I can tell from my personal account, these are not exposed to the end user

@htgoebel
Author

htgoebel commented Aug 1, 2019

Please note that the retention policy must not only include server log files, but also the action log for each of the packages.

@woodruffw
Member

> FMPOV none of these reasons requires storing this exact data for almost 10 years. The day and user might be of interest, but the exact time and IP address for sure are not.

FWIW, exact time and IP address do serve a forensic purpose: they make it easier to triage and establish provenance when doing a postmortem. As an example:

Project Foo has had 50 releases, 45 of which came from an IP range publicly associated with a hosting provider (probably CI) and published within 5 minutes of midnight at timezone X (probably a cronjob). The last 5 releases came from varying IPs, some of which show up in blacklists, and upload times indicate timezone Y.
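To make that scenario concrete, here is a minimal sketch of the kind of check that exact timestamps and IPs allow. Everything in it is an assumption for illustration: the `is_unusual` helper, the `CI_NETWORK` range (taken from the RFC 5737 documentation addresses), and the choice of UTC as "timezone X" are hypothetical and not part of PyPI or Warehouse.

```python
import ipaddress
from datetime import datetime, timezone

# Assumed network range of the CI provider (RFC 5737 documentation range,
# not a real uploader).
CI_NETWORK = ipaddress.ip_network("198.51.100.0/24")


def is_unusual(uploaded_at, source_ip, network=CI_NETWORK, window_minutes=5):
    """Return True if an upload breaks the project's usual pattern.

    The usual pattern is assumed to be: source IP inside `network` and an
    upload time within `window_minutes` of midnight in the timezone of
    `uploaded_at`.
    """
    in_network = ipaddress.ip_address(source_ip) in network

    # Seconds since the last midnight, and distance to the nearest midnight.
    seconds = uploaded_at.hour * 3600 + uploaded_at.minute * 60 + uploaded_at.second
    distance = min(seconds, 24 * 3600 - seconds)
    near_midnight = distance <= window_minutes * 60

    return not (in_network and near_midnight)


# Hypothetical uploads: one cron-style release and one outlier.
usual = (datetime(2018, 1, 1, 0, 2, tzinfo=timezone.utc), "198.51.100.7")
outlier = (datetime(2018, 1, 2, 14, 30, tzinfo=timezone.utc), "203.0.113.9")

flags = [is_unusual(*usual), is_unusual(*outlier)]
```

Once the IP and the time of day are gone from the record, neither half of this check can be computed, which is the trade-off under discussion here.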

In terms of policy, it might make sense to research (if any research exists?) the average time between package breach and discovery/triage and use that (with a sufficient window) as the baseline for removing IPs and exact timestamps.
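A rough sketch of what such a policy could look like in code, assuming a hypothetical `JournalEntry` record and a 180-day window (roughly the 6-month minimum quoted from the NCSC guidance earlier in this thread). None of these names come from the Warehouse codebase; the actual window would have to come out of the research discussed here.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed retention window; a placeholder, not a decided policy.
RETENTION = timedelta(days=180)


@dataclass(frozen=True)
class JournalEntry:
    """Hypothetical shape of one action-log row."""
    action: str
    user: str
    submitted_at: datetime
    ip_address: Optional[str]


def redact_if_expired(entry, now, retention=RETENTION):
    """Keep the day and user, but drop the exact time and IP of old entries."""
    if now - entry.submitted_at <= retention:
        return entry  # still inside the window: keep full detail
    # Truncate the timestamp to midnight of the same day and remove the IP.
    day_only = entry.submitted_at.replace(hour=0, minute=0, second=0, microsecond=0)
    return replace(entry, submitted_at=day_only, ip_address=None)


now = datetime(2018, 4, 3, tzinfo=timezone.utc)
old = JournalEntry("create", "example-user",
                   datetime(2008, 11, 24, 11, 33, 29, tzinfo=timezone.utc),
                   "192.0.2.1")  # documentation address, not a real uploader
redacted = redact_if_expired(old, now)
```

The same function would apply to per-package action logs as well as server logs, since it only depends on the age of the record.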

@htgoebel
Author

> FWIW, exact time and IP address do serve a forensic purpose:

Keep in mind: Privacy is a Human Right, but there is no right to forensics.

  • Do you need forensics after 10 years?!
  • Has there ever been a need for such forensics?
  • If there has been a need: Was this information actually required to solve the forensic case? Were other means actually tried?
  • Does this need legally take precedence over the person's right to have the data deleted?

> if any research exists?

Obviously there has been none for the last 10 years. Thus there is no need to keep this data.

Keeping data just for the vague case that someone, at some point, might be interested in it is not a justification; it is data retention without a legal basis.

According to the EU GDPR, neither forensics nor research is a reason to give data retention legal precedence over the person's rights.


5 participants