Manage vulnerability sources in database #123

Closed · haikoschol opened this issue Oct 22, 2019 · 2 comments

@haikoschol (Collaborator)

Add a model for available data sources where we can keep metadata such as the license under which the vulnerability data is published. This would make that kind of information easier to maintain than keeping it in the code. Once we have importers for standardized formats like OVAL, it would also allow adding further data sources that follow such a standard simply by adding a new entry to the corresponding DB table.

An implementation of this would benefit from a bit of refactoring, extracting some kind of base class for vulnerability sources.
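A minimal sketch of what such a model could look like, assuming Django; the model and field names below are made up for illustration:

```python
# Hypothetical sketch only; the model and field names are made up for illustration.
from django.db import models


class VulnerabilitySource(models.Model):
    name = models.CharField(max_length=100, unique=True)         # e.g. "ubuntu", "npm"
    data_license = models.CharField(max_length=100, blank=True)  # e.g. an SPDX license id
    data_url = models.URLField(blank=True)                       # where the data is fetched from
```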

@haikoschol (Collaborator, Author) commented Jan 30, 2020

Some quick notes:
My initial idea was to have a base class called Importer or something like that, which defines an interface and contains common logic from which the implementations for the various data sources inherit.
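Purely as a sketch, with made-up method names and an invented UbuntuImporter as an example subclass, that could look roughly like this:

```python
# Rough sketch only; the method names, signatures and the fetch/process split are assumptions.
from typing import Any, Iterable


class Importer:
    """Defines the interface and holds logic common to all data sources."""

    def fetch(self) -> Iterable[Any]:
        """Retrieve raw advisory data from the source."""
        raise NotImplementedError

    def process(self, raw: Iterable[Any]) -> Iterable[dict]:
        """Parse raw data into a common intermediate format."""
        raise NotImplementedError

    def run(self) -> None:
        """Common logic: fetch, process, then store the results."""
        for record in self.process(self.fetch()):
            ...  # turn the record into model objects and save them


class UbuntuImporter(Importer):
    """A data source specific implementation would override fetch() and process()."""
```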

An alternative would be to have only one Importer, which is a Django model containing fields like the following (a rough sketch of such a model follows the list):

  • name (e.g. "ubuntu", "npm", etc.)
  • license (license of the vulnerability data)
  • last_run
  • one or two fields that hold the names of the classes containing the data source specific code. That code can be roughly categorized as "fetching data" and "processing data".
  • possibly a JSON field with configuration for the data source specific code
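
A sketch of such a model; the field names, types and the dotted-path idea for referencing classes are assumptions:

```python
# Sketch only; field names, types and the dotted-path approach are assumptions.
from django.db import models


class Importer(models.Model):
    name = models.CharField(max_length=100, unique=True)    # e.g. "ubuntu", "npm"
    license = models.CharField(max_length=100, blank=True)  # license of the vulnerability data
    last_run = models.DateTimeField(null=True, blank=True)
    # Dotted paths to the classes holding the data source specific code,
    # roughly split into "fetching data" and "processing data".
    fetcher_class = models.CharField(max_length=200)
    processor_class = models.CharField(max_length=200)
    # Configuration for the data source specific code.
    # (JSONField availability depends on the Django version and database backend.)
    config = models.JSONField(default=dict, blank=True)
```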

Decoupling the fetching from the processing seems to help with using fixtures for testing, but I'm not sure there is enough to do for the fetching to warrant a separate class. That is not very important though, as long as the interface consumed by the Importer separates these two steps clearly. One might expect providing only the data newer than a given timestamp (to be used with Importer.last_run) to be part of the "fetching" step as well, but in many cases that will require parsing the data format.

The processing step would take whatever format the data source uses (OVAL XML, one huge JSON document, one YAML document per advisory, etc.) and transform it into a common intermediate format, which the Importer then turns into model objects and stores in the database.
The intermediate format also helps with testing (no need to run a full import and then assert that the expected rows have been written to the database), and it allows us to keep iterating on the data model without having to constantly update all importers. (We might need a new name for the classes that implement the data source specific logic if there is also one common class called Importer.)
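
For illustration, the intermediate format could be as simple as a dataclass; the name and fields here are made up and would have to follow the actual data model:

```python
# Hypothetical intermediate format; the actual fields would follow the data model.
from dataclasses import dataclass, field
from typing import List


@dataclass
class AdvisoryRecord:
    vulnerability_id: str  # e.g. a CVE id
    summary: str = ""
    impacted_packages: List[str] = field(default_factory=list)
    resolved_packages: List[str] = field(default_factory=list)
    reference_urls: List[str] = field(default_factory=list)
```

Each data source specific processor would yield such records, and tests could assert on them directly without touching the database.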

All of this should be done in a "streaming" fashion, probably using generators, and written to the DB reasonably efficiently (probably using bulk inserts).
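
A sketch of how that could look; batched(), to_model_objects() and the Vulnerability model are hypothetical names:

```python
# Sketch of a streaming import with batched bulk inserts.
# batched(), to_model_objects() and the Vulnerability model are hypothetical names.
from itertools import islice


def batched(records, batch_size=500):
    """Yield lists of up to batch_size items from any iterable or generator."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def store(records):
    """records is a generator of intermediate format objects."""
    for batch in batched(records):
        Vulnerability.objects.bulk_create(to_model_objects(batch))
```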

Requirements for the new and improved import process are (see the sketch after this list for the first two points):

  • Preventing duplicates in the database
  • Ability to deal with updated information (e.g. a preliminary security advisory that is later updated with more details)
  • Reasonable efficiency & performance (e.g. using bulk inserts instead of making a few SELECT queries to check for existing data, and possibly timestamps on it, before issuing a single INSERT)
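
One possible way to cover the first two points, purely illustrative and assuming a hypothetical Vulnerability model with a unique identifier field:

```python
# Illustrative only; the Vulnerability model and its fields are made up.
from django.db import models


class Vulnerability(models.Model):
    identifier = models.CharField(max_length=50, unique=True)  # unique constraint prevents duplicates
    summary = models.TextField(blank=True)


def upsert(batch):
    """batch is a list of intermediate format records."""
    # Insert new rows in bulk, silently skipping identifiers that already exist
    # (ignore_conflicts requires Django 2.2+ and relies on the unique constraint).
    Vulnerability.objects.bulk_create(
        [Vulnerability(identifier=r.vulnerability_id, summary=r.summary) for r in batch],
        ignore_conflicts=True,
    )
    # Updated advisories still need a second pass, e.g. update_or_create()
    # or a bulk UPDATE keyed on the identifier.
```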

@haikoschol (Collaborator, Author)

closed by #152
