feat: Improving caching, making a full NVD mirror available #2577

Closed
terriko opened this issue Jan 23, 2023 · 15 comments
Labels
gsoc Tasks related to our participation in Google Summer of Code

Comments

@terriko
Contributor

terriko commented Jan 23, 2023

What we're currently doing as far as cache goes:

What I think might be useful:

  1. Generate a full mirror of the NVD data both for testing and for use.
    • I think making the yearly json files here would be particularly helpful for addressing the "first run" problem where populating the initial database or any database that's too out of date can be problematic.
    • I wouldn't bother with some of the other historical formats, since JSON is the only one we actually care about.
    • We could probably (a) make json files and check them in to github (b) pre-parse a database and check it in to github and (c) update the github cache all in the same job to reduce maintenance overhead and be sure that they're all in sync.
  2. Possibly do the same for other data sources if licensing allows.
  3. Configure cve-bin-tool to use our own mirror(s) as an option.
  4. Make one of our own sources the default for cve-bin-tool.
    • Mostly for user experience improvements: that first run of cve-bin-tool is problematic for a lot of users.
    • If we're keeping the mirror in github, this would potentially be a significant performance improvement for folk running tests in github actions specifically, as well as being better than the current rate limits.
    • the pre-parsed version may make more sense than a full json mirror, but a full json mirror is probably easier for others to audit, so it's possible both would be good?
  5. If making it the default doesn't feel right, we could also use it as a "seed" to start the data and then rely on NVD thereafter. e.g.
    • If there's nothing in ~/.cache/cve-bin-tool get a copy of the pre-parsed database to put in there first. (e.g. during first run and during -u now)
    • If the database is more than X days out of date, try to update to what's in the cve-bin-tool mirror before querying NVD for final values.
  6. Provide explicit instructions/scripts for other groups to manage their own mirrors if desired.
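The "seed" behaviour in item 5 can be sketched as a small decision function. This is only an illustration under assumptions: `plan_update`, the step names, and the `cve.db` filename are hypothetical and not taken from cve-bin-tool's code, and the database age is approximated by the file's modification time.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import List, Optional

# Hypothetical filename: the real cache layout in cve-bin-tool may differ.
DB_NAME = "cve.db"


def plan_update(
    cache_dir: Path, max_age_days: int, now: Optional[datetime] = None
) -> List[str]:
    """Return the ordered update steps for the "seed" idea in item 5."""
    now = now or datetime.now()
    db_path = cache_dir / DB_NAME

    if not db_path.exists():
        # Nothing cached yet: start from the pre-parsed mirror copy.
        return ["seed_from_mirror", "update_from_nvd"]

    # Approximate the database age by the file's modification time.
    age = now - datetime.fromtimestamp(db_path.stat().st_mtime)
    if age > timedelta(days=max_age_days):
        # Too stale: catch up from the mirror, then let NVD supply final values.
        return ["update_from_mirror", "update_from_nvd"]

    # Recent enough: a normal NVD update is sufficient.
    return ["update_from_nvd"]
```

In every branch the NVD update runs last, which matches the later point that the NVD update should clobber any discrepancies in the mirror.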

Potential problems:

  1. What if everyone tries to use our mirror?
    • I think this is not impossible -- if it turns out to be too much of a burden on our license with github I think we'd plan to have a cross-industry group (e.g. OpenSSF) provide a mirror instead and we could switch to that. But I think we'd be better served by setting something up ourselves to see how we'd want it to work first even if we hope to hand it off (and potentially so we could hand the scripts to someone else).
  2. What if people don't want to trust our mirror?
    • Then they should be able to configure it to never be used.
    • The nvd update should clobber any discrepancies.
    • What tools should we provide to help people validate the data?
  3. What if caching in github actions remains broken/sporadic?
    • not sure what our best option is then. Possibly running it more than daily?

Any thoughts? I'm mentally trying this out as a potential GSoC project but I'm not sure if it's quite the right size/complexity for that, so thoughts on that as well as the general technical/social challenges.

@terriko terriko added the gsoc Tasks related to our participation in Google Summer of Code label Jan 23, 2023
@b31ngd3v
Contributor

hi @terriko, is it exclusively for gsoc contributors? or open for everyone? if it's open, i would love to work on this.

@terriko
Contributor Author

terriko commented Jan 24, 2023

@b31ngd3v If you're able to work on it right now, go ahead and we'll find something else for the gsoc folk. I think we're going to want this sooner rather than later if possible.

@b31ngd3v
Contributor

b31ngd3v commented Jan 25, 2023

I'll start working on it then 👍🏻

@b31ngd3v
Contributor

b31ngd3v commented Feb 6, 2023

hi @terriko, looks like we can't push the database to github

image

@terriko
Contributor Author

terriko commented Feb 6, 2023

Not entirely surprising though I was hoping we wouldn't hit that point for a while. We'll have to see if chopping it up makes sense or if we need other storage options.

@terriko
Contributor Author

terriko commented Feb 6, 2023

This is, incidentally, a pro for the "make a bunch of json files that can be re-loaded" theory, since they could and would be chopped into more manageable year/month chunks.
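A minimal sketch of the year/month chunking idea, assuming each exported record is a dict carrying an ISO 8601 `published` timestamp. The field name and the `chunk_by_month` helper are assumptions for illustration, not the format the actual export uses.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def chunk_by_month(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group records into "YYYY/MM" buckets using their published date."""
    chunks: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        # Assumes a timestamp such as "2023-02-06T12:00:00" (hypothetical field).
        year, month = record["published"][:7].split("-")
        chunks[f"{year}/{month}"].append(record)
    return dict(chunks)
```

Each bucket could then be written to its own file (for example `2023/02.json`), which keeps individual files small and means a routine update only rewrites the most recent chunks.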

@b31ngd3v
Contributor

b31ngd3v commented Mar 1, 2023

I was having health issues and was busy with university exams. I'll continue working on this issue with more speed now, and sorry for the delay!

@terriko
Contributor Author

terriko commented Mar 15, 2023

Summarizing some thoughts here so they don't wind up buried in #2807 and #2811

@b31ngd3v has gotten us to the point where we have a json export, so now we need to figure out

  1. where to store the mirror data
  2. how to keep the mirror up to date
  3. how to use the mirror in cve-bin-tool

For parts 1 & 2: mirror data has the potential to get big and messy, and is potentially not the greatest for git since every single change winds up in the tree forever (even if you can't see it, the data is in the git history). BUT having the history opens some interesting options for research and the ability of others to examine vulnerability data and do things like verify the validity of the mirror over time, which are advantages that we might want. Plus, github gives us some space to play around and a CI system that I don't have to set up.

So, I've set up https://github.com/sec-data/mirror-sandbox as a repo for us to experiment with scripts without "tainting" the existing cve-bin-tool repo. Since "sec-data" is a personal free org, I can add anyone I want to it, so I'll add @b31ngd3v and @anthonyharrison to it now. (you should have emails shortly)

Longer-term: I've approached the micro mirror team about distributing our json mirror on their servers. They currently handle mirroring for a number of Linux distributions and open source projects, and they've got machines in data centers across the US and are starting to build out more globally. Once we get the mirror scripts working and are able to use the data in cve-bin-tool, we can basically hand that off to them and let them replicate, and they'll be able to watch the traffic and see what's happening. I'll probably add some of those folk to the sec-data org as well.

@terriko
Contributor Author

terriko commented Mar 15, 2023

So then the next question is how should we use this mirror once it's set up?

What I was envisioning was something like this:

  • mirror gets data
  • cve-bin-tool defaults to using the mirror (so no one needs to get a nvd key on first run of the product, which is currently a large barrier to entry)
  • cve-bin-tool has options to configure the mirror(s) in use (presumably to choose a more "local" one, but also allowing people to use an internal company one or share a cache across machines in an air-gapped network)
  • cve-bin-tool provides an option to skip the mirror and go directly to nvd (i.e. the current default behaviour)
  • in future: we figure out how to also deal with mirroring of gad/redhat/etc. and maybe how to configure mirroring of each of those separately/together

So if we wanted to do that, we need some config options:

  • providing a list of mirrors
  • maybe some options about default mirrors
  • options for failover if mirrors are broken (inaccessible, content is invalid)
  • probably some failover options if mirrors are out of date too (do we use the same 24-hour rule, allow this to be configured separately, something else?)
  • maybe a way to pull from multiple mirrors at once?
  • an option to revert to the current behaviour (using NVD directly)
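The failover options above could look roughly like the following. It is a sketch under assumptions: `fetch_with_failover` is a hypothetical helper, the `fetch` and `is_valid` callables are supplied by the caller (so no particular HTTP library or validation scheme is implied), and a result of `("nvd", None)` is just one possible way to signal "fall back to querying NVD directly".

```python
from typing import Callable, Iterable, Optional, Tuple


def fetch_with_failover(
    mirrors: Iterable[str],
    fetch: Callable[[str], bytes],
    is_valid: Callable[[bytes], bool],
    fallback: str = "nvd",
) -> Tuple[str, Optional[bytes]]:
    """Try each configured mirror in order and return (source, data)."""
    for url in mirrors:
        try:
            data = fetch(url)
        except OSError:
            # Mirror is inaccessible: move on to the next one.
            continue
        if is_valid(data):
            return url, data
        # Content was invalid or out of date: also move on.
    # No mirror worked, so tell the caller to use the fallback source.
    return fallback, None
```

Passing an empty mirror list gives the "revert to the current behaviour" option for free, since the function then returns the fallback immediately.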

I'd been thinking about this specifically with NVD since that's our biggest barrier to entry, but we should also consider:

  • having mirrors for each data source in separate directories
  • allowing configuration options to use the mirror for all/some sources

And we probably need to consider some basic info provided per-mirror with the json files:

  1. Original data source
  2. License
  3. Time of last update
  4. Any signatures, etc. for validation?
  5. where to find our mirroring code
  6. How to set up your own
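One way to picture the per-mirror info is a small metadata record stored next to the json files. The `MirrorMetadata` class and all of its field names are hypothetical, chosen only to mirror the six items listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class MirrorMetadata:
    """Hypothetical per-mirror metadata covering the six items above."""

    source: str            # 1. original data source
    license: str           # 2. license of the mirrored data
    last_updated: str      # 3. time of last update (ISO 8601 string)
    signature_files: List[str] = field(default_factory=list)  # 4. signatures
    mirror_code_url: str = ""   # 5. where to find the mirroring code
    setup_docs_url: str = ""    # 6. how to set up your own mirror

    def to_json(self) -> str:
        # Sorted keys keep the file stable between runs, which helps with diffs.
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

A mirror job could write this out as something like `metadata.json` on every update, so clients can check the update time before deciding whether a mirror is too stale to use.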

@anthonyharrison
Contributor

anthonyharrison commented Mar 15, 2023 via email

@terriko
Contributor Author

terriko commented Mar 16, 2023

Checksums: probably?

In an ideal world, this mirroring system would be 100% automated with no human in the loop unless it can't update for some reason, but a checksum to make sure you downloaded correctly seems still useful. I'm not sure how valuable a signature would be since we'd likely be blindly signing whatever we download rather than attesting to its reliability, but it could fill the same niche if we wanted.
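For the "did you download correctly" check, a SHA-256 comparison is probably enough. The sketch below assumes the mirror publishes a hex digest alongside each file; the helper names are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large json files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def download_is_intact(path: Path, expected_hex: str) -> bool:
    """Compare against the published digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Note that this only detects a corrupted or truncated download. It says nothing about whether the mirror itself is trustworthy, which is where a signature would come in.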

@terriko
Contributor Author

terriko commented Mar 22, 2023

Okay, had a bit of a chat with my mirroring expert:

  • the mirrors would like it if we pgp signed things. That would provide both a "we have all the data" check and a "this mirror is not tampering with the data" check.
  • We could maybe do the signing in github actions using the built-in secrets ability. (If it doesn't work in Actions I have access to VMs that could be used for building the mirror instead.)
  • cve-bin-tool probably wants to check the signatures when we load in data (again, for rogue mirrors)
  • I would want to generate a special signing key just for this data, which would not be used for anything else. I'm not sure how key management would work for this yet, but I expect we'd start with a temp pgp key for testing that one of us generated, and potentially shift to something with actual processes and stuff.
  • I think we also want a jsonschema as another integrity check

Now, it's a little debatable what the automated signature is really going to mean in terms of data quality and integrity with respect to NVD:

  • we can validate the NVD jsonschema. From experience, it's broken a few times a week, but a mirror job could potentially retry until it's fixed and, if it's still invalid, throw some error / refuse to sign / file a report with NVD / something (currently we mostly have to ignore this because it happens so often, but we have on occasion reported it and gotten things fixed)
  • I'm not sure what else we can validate via the NVD API2 but we should figure out what else and do so
  • Other data sources will be different.
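The "retry until it's fixed, otherwise refuse to sign" idea might be sketched like this. The schema check itself is left as a caller-supplied `is_valid` function, so nothing here assumes a particular jsonschema library; `fetch_until_valid` and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional


def fetch_until_valid(
    fetch: Callable[[], dict],
    is_valid: Callable[[dict], bool],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[dict]:
    """Re-fetch until the data passes validation or attempts run out."""
    for attempt in range(attempts):
        data = fetch()
        if is_valid(data):
            return data
        if attempt < attempts - 1:
            # Give the upstream feed some time to be fixed before retrying.
            time.sleep(delay_seconds)
    # Still invalid: the caller should refuse to sign and raise an alert.
    return None
```

A mirror job would only proceed to the signing step when this returns data, and would otherwise keep serving the last signed copy.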

@b31ngd3v
Contributor

@terriko @anthonyharrison what if we use gpg's clearsign feature?

image

@terriko
Contributor Author

terriko commented Mar 27, 2023

@b31ngd3v yeah, I think that's likely the one we need to use. Basically for the mirroring folk it makes their lives easiest if we use whatever the distro folk use, and pgp is it.
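For reference, clearsigning and verifying with the `gpg` command line could be wrapped roughly as below. This is a sketch, not the code that ended up in cve-bin-tool: the helper names and the key id argument are hypothetical, and only standard gpg options (`--batch`, `--yes`, `--local-user`, `--output`, `--clearsign`, `--verify`) are used.

```python
import subprocess
from typing import List


def clearsign_command(json_path: str, key_id: str) -> List[str]:
    """Build a gpg command that writes a clearsigned copy to <file>.asc."""
    return [
        "gpg", "--batch", "--yes",
        "--local-user", key_id,           # dedicated signing key (assumed id)
        "--output", json_path + ".asc",
        "--clearsign", json_path,
    ]


def verify_command(signed_path: str) -> List[str]:
    """Build a gpg command that checks the signature on a clearsigned file."""
    return ["gpg", "--batch", "--verify", signed_path]


def run(command: List[str]) -> bool:
    """Run a gpg command and report success; needs gpg on the PATH."""
    return subprocess.run(command, check=False).returncode == 0
```

A clearsigned file keeps the json readable as plain text with the signature appended, so people auditing the mirror can still inspect the data without any gpg tooling.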

terriko pushed a commit that referenced this issue Jun 1, 2023
* feat: add sign with pgp flag while exporting json data

* feat: verify sign while importing the json data

* feat: update FETCH_JSON_DB to use pgp signing

* fix: update test_fetch_json_db.py

* fix: existing broken tests

* fix: change the file extension to `.asc`

* fix: removed `signed: true`

* fix: update tests

* fix: update tests
@terriko
Contributor Author

terriko commented Aug 23, 2023

I think we're good to close this one alongside #3181

@terriko terriko closed this as completed Aug 23, 2023