Design on-disk storage structure for packages and vulnerabilities data #3

Closed
pombredanne opened this issue Feb 9, 2024 · 4 comments · Fixed by aboutcode-org/vulnerablecode#1609

@pombredanne
Contributor

See the attached zip for a design discussed with @ziadhany and @TG1999:
federatedcode-data-structure.zip
The approach would be to have separate trees/repos for package metadata and vulnerability metadata, with cross references in both directions: from packages to vulnerabilities in the packages tree, and from vulnerabilities to packages in the vulnerabilities tree.

The file tree would look more or less like this:

./aboutcode-vulnerabilities-1223
./aboutcode-vulnerabilities-1223/3434
./aboutcode-vulnerabilities-1223/3434/VCID-1223-3434-34343
./aboutcode-vulnerabilities-1223/3434/VCID-1223-3434-34343/advisories
./aboutcode-vulnerabilities-1223/3434/VCID-1223-3434-34343/VCID-1223-3434-34343.yml
./aboutcode-packages-ed5
./aboutcode-packages-ed5/maven
./aboutcode-packages-ed5/maven/org.apache.log4j
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.4
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/vulnerabilities.yml
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.3
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.3/ossf-scorecard
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.3/ossf-scorecard/scorecard.json
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.3/spdx
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.3/cyclonedx
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.3/scancode-toolkit
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.3/scancode-toolkit/scancode-toolkit-scan.json
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.3/clearlydefined-curation
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.3/vulnerabilities.yml
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.3/osselot
./aboutcode-packages-ed5/maven/org.apache.log4j/log4j-core/versions/1.2.3/osselot/osselot-spdx.json
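
As a rough illustration of the layout above, here is a minimal Python sketch that builds both kinds of paths. The function names and the `repo_hash` argument are hypothetical and only for illustration; how the repo suffix (such as `ed5`) is computed is not specified in the tree, so it is passed in as-is. The sketch also assumes that the vulnerabilities repo suffix and sub-directory are the first and second VCID segments, as the example tree suggests.

```python
from pathlib import PurePosixPath


def vulnerability_path(vcid: str) -> PurePosixPath:
    """Return the YAML path for a VCID such as VCID-1223-3434-34343.

    Assumption: the repo suffix is the first VCID segment and the
    sub-directory is the second segment, as in the example tree.
    """
    _prefix, first, second, _third = vcid.split("-")
    repo = f"aboutcode-vulnerabilities-{first}"
    return PurePosixPath(repo, second, vcid, f"{vcid}.yml")


def package_version_path(
    repo_hash: str, pkg_type: str, namespace: str, name: str, version: str
) -> PurePosixPath:
    """Return the per-version vulnerabilities.yml path for a package.

    Assumption: repo_hash (for example "ed5") is computed elsewhere.
    An empty namespace is simply skipped.
    """
    parts = [f"aboutcode-packages-{repo_hash}", pkg_type, namespace, name,
             "versions", version, "vulnerabilities.yml"]
    return PurePosixPath(*[part for part in parts if part])


print(vulnerability_path("VCID-1223-3434-34343"))
print(package_version_path("ed5", "maven", "org.apache.log4j", "log4j-core", "1.2.3"))
```

The two `print` calls reproduce paths that appear in the tree above.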
@ziadhany
Collaborator

@pombredanne @TG1999
What does ed5 stand for in ./aboutcode-packages-ed5/maven/org.apache.log4j?

@pombredanne
Contributor Author

pombredanne commented Jul 15, 2024

@ziadhany re:

What does ed5 stand for in ./aboutcode-packages-ed5/maven/org.apache.log4j?

Sorry for the late reply. We have discussed it since: this is a hash.

@ziadhany
Collaborator

@pombredanne yes, we discussed this, and I have updated pull request #1206 to match the new file structure:

./aboutcode-vulnerabilities-12/34/VCID-1223-3434-34343/VCID-1223-3434-34343.yml

However, I'm still concerned about the performance of this script. I believe there's a more efficient method to detect updates on VulnerableCode. At the very least, we should aim to minimize the number of queries in this script.

@pombredanne pombredanne self-assigned this Sep 11, 2024
@pombredanne
Contributor Author

@ziadhany the performance of the script should now be fine with the updates I applied.

There are a few things that I would like to consider further:

  • Have fewer Git repositories for packages: right now 8192 repos is likely too many.
  • Use the package type (aka ecosystem) as part of the package path key, because I would typically be most interested in a few ecosystems, not all of them all the time.

Here are the latest counts as of today (using some data from @edebill's modulecount, thank you Erik!):

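Package count ignoring versions

| Count     | What                                                    |
|-----------|---------------------------------------------------------|
| 6,076,210 | app pkgs spread over ~25 app pkg ecosystems             |
| 2,500,000 | sys pkgs spread over ~50 Linux, BSD and related distros |
| 75        | ecosystems overall                                      |
| 3,349,708 | npm pkgs                                                |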

npm is the largest ecosystem with ~3.35M packages and about 16M package versions.
We can use 5M packages and 20M versions as an upper bound for a single ecosystem.

We want to have the data for about 4K to 5K packages stored in any one Git repo,
so 5M/5K would be about 1000 repos, or about 2**10 = 1024 repos for each large ecosystem,
and more like 100 repos or fewer for other ecosystems, or just one repo for small ecosystems.
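
The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration: `repos_needed` is a hypothetical helper, and rounding the repo count up to the next power of two is an assumption made so that the count maps to a whole number of hash bits.

```python
def repos_needed(package_count: int, packages_per_repo: int = 5_000) -> tuple[int, int]:
    """Return (hash_bits, repo_count) for an ecosystem.

    repo_count is rounded up to a power of two so that it can be
    addressed with hash_bits bits of a hash.
    """
    # Ceiling division: number of repos if each holds packages_per_repo packages.
    raw_repos = -(-package_count // packages_per_repo)
    # Smallest number of bits such that 2**bits >= raw_repos.
    hash_bits = (raw_repos - 1).bit_length() if raw_repos > 1 else 0
    return hash_bits, 2 ** hash_bits


print(repos_needed(5_000_000))  # (10, 1024)
print(repos_needed(500_000))    # (7, 128)
```

With 5M packages and 5K packages per repo this gives 10 bits and 1024 repos, matching the estimate above.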

The purl hash function https://github.com/aboutcode-org/vulnerablecode/blob/e273c67e7b48e09337de00367cabd23ceb566604/aboutcode/hashid/__init__.py#L290C31-L290C41 could become aware of the ecosystem and produce different hashes for different ecosystems. It could also be aware of the namespace, if need be, to split things up for deb and rpm distros.

So there could be three or four ecosystem tiers:

  • super large ecosystems using 10 bits of hash and 1024 repos: one ecosystem (npm) with 5M packages
  • large ecosystems using 7 bits of hash and 128 repos: about ten ecosystems (pypi, maven, golang, perl, ruby, nuget, php) with about 500K packages each
  • medium ecosystems using 5 bits of hash and 32 repos: about ten distro ecosystems (rpm, deb) with about 50K packages each, or these could use the large tier instead
  • small ecosystems using 0 bits of hash and a single repo
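
To make the tiers above concrete, here is a minimal Python sketch of an ecosystem-aware repo index. It is not the existing `hashid` function: SHA-256 is used as a stand-in hash, and the `TIER_BITS` and `ECOSYSTEM_TIER` tables, the `repo_index` name, and the key format are all assumptions for illustration. Only the bit counts per tier come from the list above.

```python
import hashlib

# Hash bits per tier, taken from the tiers listed above.
TIER_BITS = {"super-large": 10, "large": 7, "medium": 5, "small": 0}

# Hypothetical assignment of package types to tiers, for illustration only.
ECOSYSTEM_TIER = {
    "npm": "super-large",
    "pypi": "large",
    "maven": "large",
    "deb": "medium",
    "rpm": "medium",
}


def repo_index(pkg_type: str, namespace: str, name: str) -> int:
    """Return the repo number for a package, in range(2 ** bits) for its tier.

    Unknown package types fall back to the "small" tier (a single repo).
    """
    bits = TIER_BITS[ECOSYSTEM_TIER.get(pkg_type, "small")]
    if bits == 0:
        return 0
    # Stand-in hash: SHA-256 over an assumed "type/namespace/name" key.
    key = f"{pkg_type}/{namespace}/{name}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Take the first 16 bits of the digest and keep the top `bits` bits.
    first_16_bits = int.from_bytes(digest[:2], "big")
    return first_16_bits >> (16 - bits)


print(repo_index("npm", "", "lodash"))
print(repo_index("maven", "org.apache.log4j", "log4j-core"))
print(repo_index("cran", "", "ggplot2"))
```

Because the package type selects the number of bits, someone interested in a single ecosystem would only need to look at that ecosystem's repos.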
