feat: Improving caching, making a full NVD mirror available #2577

terriko · 2023-01-23T21:26:20Z

Previously discussed in Discussion: what's next? #2451 and elsewhere.

What we're currently doing as far as cache goes:

Each copy of cve-bin-tool keeps a local data cache (updated daily by default)
We keep a daily cache in github actions for tests (well, daily in theory -- right now it's only actually updating every 5 days or so. See https://github.com/intel/cve-bin-tool/actions/workflows/update-cache.yml)
We have a script that also keeps a copy of our parsed cve database. We aren't currently using it for anything. (see https://github.com/intel/cve-bin-tool/actions/workflows/export_data.yml )

What I think might be useful:

Generate a full mirror of the NVD data both for testing and for use.
- I think making the yearly json files here would be particularly helpful for addressing the "first run" problem where populating the initial database or any database that's too out of date can be problematic.
- I wouldn't bother with some of the other historical formats, since JSON is the only one we actually care about.
- We could probably (a) make json files and check them in to github (b) pre-parse a database and check it in to github and (c) update the github cache all in the same job to reduce maintenance overhead and be sure that they're all in sync.
Possibly do the same for other data sources if licensing allows.
Configure cve-bin-tool to use our own mirror(s) as an option.
Make one of our own sources the default for cve-bin-tool.
- Mostly for user experience improvements: that first run of cve-bin-tool is problematic for a lot of users.
- If we're keeping the mirror in github, this would potentially be a significant performance improvement for folk running tests in github actions specifically, as well as being better than the current rate limits.
- the pre-parsed version may make more sense than a full json mirror, but a full json mirror is probably easier for others to audit, so it's possible both would be good?
If making it the default doesn't feel right, we could also use it as a "seed" to populate the initial data and then use NVD thereafter, e.g.:
- If there's nothing in ~/.cache/cve-bin-tool get a copy of the pre-parsed database to put in there first. (e.g. during first run and during -u now)
- If the database is more than X days out of date, try to update to what's in the cve-bin-tool mirror before querying NVD for final values.
Provide explicit instructions/scripts for other groups to manage their own mirrors if desired.
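The "seed" idea above can be sketched as a small decision function. This is only an illustration, not how cve-bin-tool works today: the step names, the `plan_update` function and the 7-day threshold (standing in for "X days") are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Placeholder for the "X days" threshold in the proposal; the real value is undecided.
MAX_AGE = timedelta(days=7)


def plan_update(
    db_mtime: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_AGE,
) -> list[str]:
    """Return the ordered update steps for a local cache.

    db_mtime is the last-modified time of the local database,
    or None when nothing exists in the cache directory yet.
    """
    if db_mtime is None:
        # First run: fetch the pre-parsed database, then let NVD finalize it.
        return ["seed_from_mirror", "update_from_nvd"]
    if now - db_mtime > max_age:
        # Too stale: catch up from the mirror before querying NVD.
        return ["update_from_mirror", "update_from_nvd"]
    # Fresh enough: the normal incremental NVD update is sufficient.
    return ["update_from_nvd"]
```

In every branch the NVD update runs last, which matches the later point that the NVD update should clobber any discrepancies.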

Potential problems:

What if everyone tries to use our mirror?
- I think this is not impossible -- if it turns out to be too much of a burden on our license with github I think we'd plan to have a cross-industry group (e.g. OpenSSF) provide a mirror instead and we could switch to that. But I think we'd be better served by setting something up ourselves to see how we'd want it to work first even if we hope to hand it off (and potentially so we could hand the scripts to someone else).
What if people don't want to trust our mirror?
- Then they should be able to configure it to never be used.
- The nvd update should clobber any discrepancies.
- What tools should we provide to help people validate the data?
What if caching in github actions remains broken/sporadic?
- not sure what our best option is then. Possibly running it more than daily?

Any thoughts? I'm mentally trying this out as a potential GSoC project but I'm not sure if it's quite the right size/complexity for that, so thoughts on that as well as the general technical/social challenges.

b31ngd3v · 2023-01-24T03:25:17Z

hi @terriko, is it exclusively for gsoc contributors? or open for everyone? if it's open, i would love to work on this.

terriko · 2023-01-24T22:35:39Z

@b31ngd3v If you're able to work on it right now, go ahead and we'll find something else for the gsoc folk. I think we're going to want this sooner rather than later if possible.

b31ngd3v · 2023-01-25T05:19:36Z

I'll start working on it then 👍🏻

b31ngd3v · 2023-02-06T18:03:06Z

hi @terriko, looks like we can't push the database to github

terriko · 2023-02-06T18:47:33Z

Not entirely surprising though I was hoping we wouldn't hit that point for a while. We'll have to see if chopping it up makes sense or if we need other storage options.

terriko · 2023-02-06T18:48:38Z

This is, incidentally, a pro for the "make a bunch of json files that can be re-loaded" theory, since they could and would be chopped into more manageable year/month chunks.
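As a rough sketch of what "chopped into year/month chunks" could look like: the record layout (a `published` field holding an ISO 8601 timestamp) and the `YYYY/MM.json` file layout are assumptions made for illustration, not the format the export script actually uses.

```python
import json
from collections import defaultdict
from pathlib import Path


def chunk_by_month(records):
    """Group records by the YYYY-MM prefix of their 'published' timestamp."""
    chunks = defaultdict(list)
    for record in records:
        chunks[record["published"][:7]].append(record)
    return dict(chunks)


def write_chunks(records, out_dir):
    """Write one JSON file per month under out_dir/YYYY/MM.json."""
    out_dir = Path(out_dir)
    written = []
    for key, chunk in sorted(chunk_by_month(records).items()):
        year, month = key.split("-")
        path = out_dir / year / f"{month}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        # sort_keys keeps the output stable, so git diffs stay small.
        path.write_text(json.dumps(chunk, indent=2, sort_keys=True))
        written.append(path)
    return written
```

Smaller per-month files stay well below per-file size limits, and only the months that changed need to be re-downloaded.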

b31ngd3v · 2023-03-01T21:58:54Z

I was having health issues and was busy with university exams. I'll pick this issue back up and move faster now. Sorry for the delay!

terriko · 2023-03-15T19:01:10Z

Summarizing some thoughts here so they don't wind up buried in #2807 and #2811

@b31ngd3v has gotten us to the point where we have a json export, so now we need to figure out

1. where to store the mirror data
2. how to keep the mirror up to date
3. how to use the mirror in cve-bin-tool

For parts 1 & 2: mirror data has the potential to get big and messy, and is potentially not the greatest for git since every single change winds up in the tree forever (even if you can't see it, the data is in the git history). BUT having the history opens some interesting options for research and the ability of others to examine vulnerability data and do things like verify the validity of the mirror over time, which are advantages that we might want. Plus, github gives us some space to play around and a CI system that I don't have to set up.

So, I've set up https://github.com/sec-data/mirror-sandbox as a repo for us to experiment with scripts without "tainting" the existing cve-bin-tool repo. Since "sec-data" is a personal free org, I can add anyone I want to it, so I'll add @b31ngd3v and @anthonyharrison to it now. (you should have emails shortly)

Longer-term: I've approached the micro mirror team about distributing our json mirror on their servers. They currently handle mirroring for a number of Linux distributions and open source projects, and they've got machines in data centers across the US and are starting to build out more globally. Once we get the mirror scripts working and are able to use the data in cve-bin-tool, we can basically hand that off to them and let them replicate, and they'll be able to watch the traffic and see what's happening. I'll probably add some of those folk to the sec-data org as well.

terriko · 2023-03-15T19:33:50Z

So then the next question is how should we use this mirror once it's set up?

What I was envisioning was something like this:

- mirror gets data
- cve-bin-tool defaults to using the mirror (so no one needs to get an nvd key on first run of the product, which is currently a large barrier to entry)
- cve-bin-tool has options to configure the mirror(s) in use (presumably to choose a more "local" one, but also allowing people to use an internal company one or share a cache across machines in an air-gapped network)
- cve-bin-tool provides an option to skip the mirror and go directly to nvd (i.e. the current default behaviour)
- in future: we figure out how to also deal with mirroring of gad/redhat/etc. and maybe how to configure mirroring of each of those separately/together

So if we wanted to do that, we need some config options:

- providing a list of mirrors
- maybe some options about default mirrors
- options for failover if mirrors are broken (inaccessible, content is invalid)
- probably some failover options if mirrors are out of date too (do we use the same 24-hour rule, allow this to be configured separately, something else?)
- maybe a way to pull from multiple mirrors at once?
- an option to revert to the current behaviour (using NVD directly)

I'd been thinking about this specifically with NVD since that's our biggest barrier to entry, but we should also consider:

- having mirrors for each data source in separate directories
- allowing configuration options to use the mirror for all/some sources
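To make the failover options concrete, here is a minimal sketch of one possible shape. `MirrorConfig`, `pick_source` and the `probe` callback are hypothetical names invented for this example. The 24-hour default mirrors the daily update mentioned earlier in the thread, but whether mirrors should share that rule is still an open question.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class MirrorConfig:
    # Mirrors are tried in order; an empty list means "no mirrors configured".
    mirrors: list[str] = field(default_factory=list)
    # A mirror older than this is treated as out of date and skipped.
    max_age: timedelta = timedelta(hours=24)
    # False reverts to the current behaviour (query NVD directly).
    use_mirrors: bool = True


def pick_source(
    config: MirrorConfig,
    probe: Callable[[str], Optional[datetime]],
    now: datetime,
) -> str:
    """Return the first healthy, fresh mirror URL, or "nvd" as the fallback.

    probe(url) returns the mirror's last-update time, or None when the
    mirror is unreachable or its content is invalid.
    """
    if config.use_mirrors:
        for url in config.mirrors:
            last_update = probe(url)
            if last_update is not None and now - last_update <= config.max_age:
                return url
    return "nvd"
```

Per-source mirrors could reuse the same structure with one `MirrorConfig` per data source.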

And we probably need to consider some basic info provided per-mirror with the json files:

1. Original data source
2. License
3. Time of last update
4. Any signatures, etc. for validation?
5. Where to find our mirroring code
6. How to set up your own
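One way to carry that per-mirror info is a small metadata file stored next to the json data. The sketch below is only a suggestion: the function name and every key name are made up for illustration, and no such file format has been agreed on in this thread.

```python
import json
from datetime import datetime, timezone


def build_mirror_metadata(
    source,
    license_name,
    updated,
    mirror_code_url,
    setup_docs_url,
    signature_file=None,
):
    """Collect the per-mirror fields listed above into a JSON-serializable dict."""
    return {
        "source": source,
        "license": license_name,
        # Always store the timestamp in UTC so mirrors are comparable.
        "last_updated": updated.astimezone(timezone.utc).isoformat(),
        "signature_file": signature_file,
        "mirror_code": mirror_code_url,
        "setup_docs": setup_docs_url,
    }


if __name__ == "__main__":
    # The license and docs values below are placeholders, not real data.
    meta = build_mirror_metadata(
        source="NVD",
        license_name="PLACEHOLDER",
        updated=datetime.now(timezone.utc),
        mirror_code_url="https://github.com/sec-data/mirror-sandbox",
        setup_docs_url="PLACEHOLDER",
    )
    print(json.dumps(meta, indent=2))
```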

anthonyharrison · 2023-03-15T20:30:38Z

We should add checksums to the data (is that what signatures means?) to add some integrity checks to the data.


terriko · 2023-03-16T17:04:18Z

Checksums: probably?

In an ideal world, this mirroring system would be 100% automated with no human in the loop unless it can't update for some reason, but a checksum to make sure you downloaded correctly seems still useful. I'm not sure how valuable a signature would be since we'd likely be blindly signing whatever we download rather than attesting to its reliability, but it could fill the same niche if we wanted.
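A "did the download arrive intact" check along these lines needs only the standard library. This is a sketch assuming the mirror publishes a SHA-256 hex digest for each file; the function names and the choice of SHA-256 are assumptions, not decisions made in this thread.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large mirror files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True when the file's SHA-256 matches the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A checksum like this only detects corruption in transit. It says nothing about who produced the data, which is the gap a signature would fill.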

terriko · 2023-03-22T00:27:00Z

Okay, had a bit of a chat with my mirroring expert:

- the mirrors would like it if we pgp signed things. That would provide both a "we have all the data" check and a "this mirror is not tampering with the data" check.
- We could maybe do the signing in github actions using the built-in secrets ability. (If it doesn't work in Actions I have access to VMs that could be used for building the mirror instead.)
- cve-bin-tool probably wants to check the signatures when we load in data (again, for rogue mirrors)
- I would want to generate a special signing key just for this data, which would not be used for anything else. I'm not sure how key management would work for this yet, but I expect we'd start with a temp pgp key for testing that one of us generated, and potentially shift to something with actual processes and stuff.
- I think we also want a jsonschema as another integrity check

Now, it's a little debatable what the automated signature is really going to mean in terms of data quality and integrity with respect to NVD:

- we can validate the NVD jsonschema. From experience, it's broken a few times a week, but a mirror job could potentially retry until it's fixed and throw an error, refuse to sign, file a report with NVD, or something similar if it's invalid (currently we mostly have to ignore this because it happens so often, but we have on occasion reported it and gotten things fixed)
- I'm not sure what else we can validate via the NVD API2 but we should figure out what else and do so
- Other data sources will be different.
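The "retry until it's fixed, otherwise refuse to sign" idea could look roughly like this. Everything here is hypothetical: `fetch` and `validate` are injected callables, and `validate` is assumed to raise `ValueError` on bad data, so a real JSON Schema validator would need a thin wrapper to fit.

```python
import time


class InvalidFeedError(Exception):
    """Raised when the feed never validated, so the job must not sign it."""


def fetch_valid(fetch, validate, attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Fetch data until it validates, or raise after `attempts` tries.

    fetch() returns the parsed feed; validate(data) raises ValueError
    when the data does not match the expected schema.
    """
    last_error = None
    for attempt in range(attempts):
        data = fetch()
        try:
            validate(data)
        except ValueError as exc:
            last_error = exc
        else:
            return data
        if attempt < attempts - 1:
            # Give upstream some time to fix the feed before retrying.
            sleep(delay_seconds)
    raise InvalidFeedError(f"feed still invalid after {attempts} attempts") from last_error
```

The signing step would run only on the value returned by `fetch_valid`, so an invalid feed fails the job instead of being published.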

b31ngd3v · 2023-03-27T14:48:49Z

@terriko @anthonyharrison what if we use gpg clearsign feature?

terriko · 2023-03-27T18:28:46Z

@b31ngd3v yeah, I think that's likely the one we need to use. Basically for the mirroring folk it makes their lives easiest if we use whatever the distro folk use, and pgp is it.
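For reference, clearsigning with the `gpg` command line can be driven from Python roughly as below. This is a sketch, not the code that was eventually merged: the helper names and the `key_id` parameter are made up for illustration. It relies on `gpg --clearsign FILE` writing its output to `FILE.asc`.

```python
import subprocess


def clearsign_command(path, key_id):
    """Build the gpg argv that clearsigns `path` with the given key."""
    return ["gpg", "--batch", "--yes", "--local-user", key_id, "--clearsign", path]


def verify_command(signed_path):
    """Build the gpg argv that verifies a clearsigned file."""
    return ["gpg", "--batch", "--verify", signed_path]


def clearsign(path, key_id):
    """Sign `path` and return the name of the .asc file gpg writes."""
    subprocess.run(clearsign_command(path, key_id), check=True)
    return path + ".asc"


def verify(signed_path):
    """Return True when gpg accepts the signature (exit status 0)."""
    return subprocess.run(verify_command(signed_path)).returncode == 0
```

A clearsigned file keeps the JSON readable as plain text with the signature appended, so a mirror can serve a single file per chunk.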

* feat: add sign with pgp flag while exporting json data
* feat: verify sign while importing the json data
* feat: update FETCH_JSON_DB to use pgp signing
* fix: update test_fetch_json_db.py
* fix: existing broken tests
* fix: change the file extension to `.asc`
* fix: removed `signed: true`
* fix: update tests
* fix: update tests

terriko · 2023-08-23T16:39:58Z

I think we're good to close this one alongside #3181

terriko added the gsoc label (Tasks related to our participation in Google Summer of Code) Jan 23, 2023

b31ngd3v mentioned this issue Mar 2, 2023

feat: import and export database as json (#2577) #2774

Merged

terriko pushed a commit that referenced this issue Mar 10, 2023

feat: import and export database as json (#2577) (#2774)

63543d0

This was referenced Mar 10, 2023

fix: export database ci #2807

Merged

feat: pull updates from mirror with --use-mirror flag #2811

Merged

b31ngd3v mentioned this issue Apr 2, 2023

feat: add support for pgp signing (#2577) #2882

Merged

terriko closed this as completed Aug 23, 2023
