Manage vulnerability sources in database #123

Closed · haikoschol opened this issue Oct 22, 2019 · 2 comments

@haikoschol (Collaborator)

Add a model for available data sources where we can keep metadata such as the license under which the vulnerability data is published. This would make that kind of information easier to maintain than keeping it in the code. Once we have importers for standardized formats like OVAL, it would also allow adding further data sources that follow such a standard simply by adding a new entry to the corresponding DB table.

An implementation of this would benefit from a bit of refactoring, extracting some kind of base class for vulnerability sources.
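A minimal sketch of what such a model could look like, assuming Django; the model and field names below are made up for illustration:

```python
# Hypothetical sketch only; the model and field names are made up for illustration.
from django.db import models


class VulnerabilitySource(models.Model):
    name = models.CharField(max_length=100, unique=True)         # e.g. "ubuntu", "npm"
    data_license = models.CharField(max_length=100, blank=True)  # e.g. an SPDX license id
    data_url = models.URLField(blank=True)                       # where the data is fetched from
```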

@haikoschol (Collaborator, Author) commented Jan 30, 2020

Some quick notes:
My initial idea was to have a base class called Importer or something like that, which defines an interface and contains common logic from which the implementations for the various data sources inherit.
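Purely as a sketch, with made-up method names and an invented UbuntuImporter as an example subclass, that could look roughly like this:

```python
# Rough sketch only; the method names, signatures and the fetch/process split are assumptions.
from typing import Any, Iterable


class Importer:
    """Defines the interface and holds logic common to all data sources."""

    def fetch(self) -> Iterable[Any]:
        """Retrieve raw advisory data from the source."""
        raise NotImplementedError

    def process(self, raw: Iterable[Any]) -> Iterable[dict]:
        """Parse raw data into a common intermediate format."""
        raise NotImplementedError

    def run(self) -> None:
        """Common logic: fetch, process, then store the results."""
        for record in self.process(self.fetch()):
            ...  # turn the record into model objects and save them


class UbuntuImporter(Importer):
    """A data source specific implementation would override fetch() and process()."""
```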

An alternative would be to have only one Importer, which is a Django model containing fields like the following (a rough sketch of such a model follows the list):

  • name (e.g. "ubuntu", "npm", etc.)
  • license (license of the vulnerability data)
  • last_run
  • one or two fields that hold the names of the classes containing the data source specific code. That code can be roughly categorized as "fetching data" and "processing data".
  • possibly a JSON field with configuration for the data source specific code
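
A sketch of such a model; the field names, types and the dotted-path idea for referencing classes are assumptions:

```python
# Sketch only; field names, types and the dotted-path approach are assumptions.
from django.db import models


class Importer(models.Model):
    name = models.CharField(max_length=100, unique=True)    # e.g. "ubuntu", "npm"
    license = models.CharField(max_length=100, blank=True)  # license of the vulnerability data
    last_run = models.DateTimeField(null=True, blank=True)
    # Dotted paths to the classes holding the data source specific code,
    # roughly split into "fetching data" and "processing data".
    fetcher_class = models.CharField(max_length=200)
    processor_class = models.CharField(max_length=200)
    # Configuration for the data source specific code.
    # (JSONField availability depends on the Django version and database backend.)
    config = models.JSONField(default=dict, blank=True)
```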

Decoupling the fetching from the processing seems to help with using fixtures for testing, but I'm not sure there is enough to do for the fetching to warrant a separate class. That is not very important though, as long as the interface consumed by the Importer separates these two steps clearly. One might expect providing only the data newer than a given timestamp (to be used with Importer.last_run) to be part of the "fetching" step as well, but in many cases that will require parsing the data format.

The processing step would take whatever format the data source uses (OVAL XML, one huge JSON document, one YAML document per advisory, etc.) and transform it into a common intermediate format, which the Importer then turns into model objects and stores in the database.
The intermediate format also helps with testing (no need to run a full import and then assert that the expected rows have been written to the database), and it allows us to keep iterating on the data model without having to constantly update all importers. (We might need a new name for the classes that implement the data source specific logic if there is also one common class called Importer.)
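
For illustration, the intermediate format could be as simple as a dataclass; the name and fields here are made up and would have to follow the actual data model:

```python
# Hypothetical intermediate format; the actual fields would follow the data model.
from dataclasses import dataclass, field
from typing import List


@dataclass
class AdvisoryRecord:
    vulnerability_id: str  # e.g. a CVE id
    summary: str = ""
    impacted_packages: List[str] = field(default_factory=list)
    resolved_packages: List[str] = field(default_factory=list)
    reference_urls: List[str] = field(default_factory=list)
```

Each data source specific processor would yield such records, and tests could assert on them directly without touching the database.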

All of this should be done in a "streaming" fashion, probably using generators, and written to the DB reasonably efficiently (probably using bulk inserts).
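
A sketch of how that could look; batched(), to_model_objects() and the Vulnerability model are hypothetical names:

```python
# Sketch of a streaming import with batched bulk inserts.
# batched(), to_model_objects() and the Vulnerability model are hypothetical names.
from itertools import islice


def batched(records, batch_size=500):
    """Yield lists of up to batch_size items from any iterable or generator."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def store(records):
    """records is a generator of intermediate format objects."""
    for batch in batched(records):
        Vulnerability.objects.bulk_create(to_model_objects(batch))
```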

Requirements for the new and improved import process are (see the sketch after this list for the first two points):

  • Preventing duplicates in the database
  • Ability to deal with updated information (e.g. a preliminary security advisory that is later updated with more details)
  • Reasonable efficiency & performance (e.g. using bulk inserts instead of making a few SELECT queries to check for existing data, and possibly timestamps on it, before issuing a single INSERT)
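
One possible way to cover the first two points, purely illustrative and assuming a hypothetical Vulnerability model with a unique identifier field:

```python
# Illustrative only; the Vulnerability model and its fields are made up.
from django.db import models


class Vulnerability(models.Model):
    identifier = models.CharField(max_length=50, unique=True)  # unique constraint prevents duplicates
    summary = models.TextField(blank=True)


def upsert(batch):
    """batch is a list of intermediate format records."""
    # Insert new rows in bulk, silently skipping identifiers that already exist
    # (ignore_conflicts requires Django 2.2+ and relies on the unique constraint).
    Vulnerability.objects.bulk_create(
        [Vulnerability(identifier=r.vulnerability_id, summary=r.summary) for r in batch],
        ignore_conflicts=True,
    )
    # Updated advisories still need a second pass, e.g. update_or_create()
    # or a bulk UPDATE keyed on the identifier.
```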

@haikoschol (Collaborator, Author)

closed by #152
