Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: 1944 - update purl generation to use a consistent groupID #2033

Merged
merged 18 commits into from
Aug 22, 2023

Conversation

spiffcs
Copy link
Contributor

@spiffcs spiffcs commented Aug 17, 2023

Summary

This PR aims to improve 3 related aspects of syft's java cataloging. Pulls in #2032

  1. Separate the logic for CPE and PURL generation. PURL generation needs a single answer for groupID based on a priority of discovering the field. CPE generation still uses multiple potential groupID to populate the candidate cpe.

  2. Improve GroupID detection. Currently syft does not use any hierarchy for GroupID detection and treats all sources as equal. It treats fields from the manifest file with priority. We should try to obtain a GroupID answer from Pom Properties, then the Pom Project and finally, if a GroupID has not been found, the Manifest. Manifest should not return an answer if one was found in PomProject, and PomProject should not return a GroupID answer if PomProperties found one.

Note: we've removed the child matches parent by virtual path case because the virtual path now has more uniqueness that depends on a packages metadata. The parent did not fulfill this requirement and thus a new package was created, virtual paths no longer match, and the two packages were not merged

kzantow and others added 2 commits August 16, 2023 15:32
Signed-off-by: Keith Zantow <kzantow@gmail.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
@spiffcs spiffcs force-pushed the 1944-consistent-purl branch from f03e747 to f8a6582 Compare August 17, 2023 04:43
@spiffcs
Copy link
Contributor Author

spiffcs commented Aug 17, 2023

@westonsteimel I combined the work from 2032 and the single groupID answer code together in this branch - this does not alter any CPE generation or change how java packages are deduped - I'll be updating those in a separate PR

I'm adding two commits here -

  1. the first one f8a6582 does not pull from any manifest fields
    • db crawler results: partial
      • 12347 total jars
      • 3493 correct
  2. The second commit fe85967 will add back the manifest logic after I've taken a look at the results from the java db crawler to see what this update looks like with/without
    • db crawler results: partial
      • 10323 total jars
      • 1258 correct

Syft Main

db crawler results : currently running:

  • 9598 total jars
  • 2383 correct

^ I have to run these again but my ISP just went out and I don't want to pull this down over data 😬

Latest Comparison

{
  "commit": "1944-consistent-purl",
  "correct": 95164,
  "incorrect": 69402
}
{
  "commit": "main",
  "correct": 65722,
  "incorrect": 98246
}

@github-actions
Copy link

github-actions bot commented Aug 17, 2023

Benchmark Test Results

Benchmark results from the latest changes vs base branch
goos: linux%0Agoarch: amd64%0Apkg: github.com/anchore/syft/test/integration%0Acpu: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz%0A                                                              │ ./.tmp/benchmark-f5a4700.txt │%0A                                                              │            sec/op            │%0AImagePackageCatalogers/alpmdb-cataloger-2                                        15.43m ± 5%25%0AImagePackageCatalogers/apkdb-cataloger-2                                         964.8µ ± 5%25%0AImagePackageCatalogers/binary-cataloger-2                                        279.5µ ± 5%25%0AImagePackageCatalogers/dpkgdb-cataloger-2                                        827.8µ ± 6%25%0AImagePackageCatalogers/dotnet-portable-executable-cataloger-2                    30.13µ ± 4%25%0AImagePackageCatalogers/go-module-binary-cataloger-2                              138.0µ ± 4%25%0AImagePackageCatalogers/java-cataloger-2                                          17.69m ± 7%25%0AImagePackageCatalogers/graalvm-native-image-cataloger-2                          138.7µ ± 8%25%0AImagePackageCatalogers/javascript-package-cataloger-2                            549.6µ ± 6%25%0AImagePackageCatalogers/nix-store-cataloger-2                                     407.1µ ± 3%25%0AImagePackageCatalogers/php-composer-installed-cataloger-2                        1.097m ± 1%25%0AImagePackageCatalogers/portage-cataloger-2                                       699.9µ ± 4%25%0AImagePackageCatalogers/python-package-cataloger-2                                4.625m ± 3%25%0AImagePackageCatalogers/r-package-cataloger-2                                     316.8µ ± 3%25%0AImagePackageCatalogers/rpm-db-cataloger-2                                        777.2µ ± 2%25%0AImagePackageCatalogers/ruby-gemspec-cataloger-2                                  1.278m ± 2%25%0AImagePackageCatalogers/sbom-cataloger-2                                          161.0µ ± 2%25%0Ageomean                                                                          679.8µ%0A%0A                                                              │ ./.tmp/benchmark-f5a4700.txt │%0A                                                              │             B/op             │%0AImagePackageCatalogers/alpmdb-cataloger-2                                       5.133Mi ± 0%25%0AImagePackageCatalogers/apkdb-cataloger-2                                        184.7Ki ± 0%25%0AImagePackageCatalogers/binary-cataloger-2                                       30.46Ki ± 0%25%0AImagePackageCatalogers/dpkgdb-cataloger-2                                       141.3Ki ± 0%25%0AImagePackageCatalogers/dotnet-portable-executable-cataloger-2                   3.695Ki ± 0%25%0AImagePackageCatalogers/go-module-binary-cataloger-2                             9.906Ki ± 0%25%0AImagePackageCatalogers/java-cataloger-2                                         2.782Mi ± 0%25%0AImagePackageCatalogers/graalvm-native-image-cataloger-2                         8.594Ki ± 0%25%0AImagePackageCatalogers/javascript-package-cataloger-2                           83.81Ki ± 0%25%0AImagePackageCatalogers/nix-store-cataloger-2                                    38.94Ki ± 0%25%0AImagePackageCatalogers/php-composer-installed-cataloger-2                       155.3Ki ± 0%25%0AImagePackageCatalogers/portage-cataloger-2                                      109.8Ki ± 0%25%0AImagePackageCatalogers/python-package-cataloger-2                               986.0Ki ± 0%25%0AImagePackageCatalogers/r-package-cataloger-2                                    42.90Ki ± 0%25%0AImagePackageCatalogers/rpm-db-cataloger-2                                       170.9Ki ± 0%25%0AImagePackageCatalogers/ruby-gemspec-cataloger-2                                 123.2Ki ± 0%25%0AImagePackageCatalogers/sbom-cataloger-2                                         14.20Ki ± 0%25%0Ageomean                                                                         92.45Ki%0A%0A                                                              │ ./.tmp/benchmark-f5a4700.txt │%0A                                                              │          allocs/op           │%0AImagePackageCatalogers/alpmdb-cataloger-2                                        88.06k ± 0%25%0AImagePackageCatalogers/apkdb-cataloger-2                                         4.034k ± 0%25%0AImagePackageCatalogers/binary-cataloger-2                                         848.0 ± 0%25%0AImagePackageCatalogers/dpkgdb-cataloger-2                                        2.911k ± 0%25%0AImagePackageCatalogers/dotnet-portable-executable-cataloger-2                     132.0 ± 0%25%0AImagePackageCatalogers/go-module-binary-cataloger-2                               281.0 ± 0%25%0AImagePackageCatalogers/java-cataloger-2                                          40.35k ± 0%25%0AImagePackageCatalogers/graalvm-native-image-cataloger-2                           228.0 ± 0%25%0AImagePackageCatalogers/javascript-package-cataloger-2                            1.264k ± 0%25%0AImagePackageCatalogers/nix-store-cataloger-2                                      820.0 ± 0%25%0AImagePackageCatalogers/php-composer-installed-cataloger-2                        3.845k ± 0%25%0AImagePackageCatalogers/portage-cataloger-2                                       2.194k ± 0%25%0AImagePackageCatalogers/python-package-cataloger-2                                16.13k ± 0%25%0AImagePackageCatalogers/r-package-cataloger-2                                      851.0 ± 0%25%0AImagePackageCatalogers/rpm-db-cataloger-2                                        3.914k ± 0%25%0AImagePackageCatalogers/ruby-gemspec-cataloger-2                                  2.291k ± 0%25%0AImagePackageCatalogers/sbom-cataloger-2                                           394.0 ± 0%25%0Ageomean                                                                          1.996k

@spiffcs spiffcs force-pushed the 1944-consistent-purl branch from 2ce6400 to fe85967 Compare August 17, 2023 05:20
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
@spiffcs spiffcs force-pushed the 1944-consistent-purl branch from fe85967 to 9f3157c Compare August 17, 2023 05:46
@spiffcs spiffcs marked this pull request as ready for review August 17, 2023 06:10
@spiffcs spiffcs added the enhancement New feature or request label Aug 17, 2023
spiffcs and others added 6 commits August 18, 2023 06:19
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
@spiffcs spiffcs changed the title 1944 consistent purl feat: 1944 - update purl generation to use a consistent groupID Aug 18, 2023
// 2. The group ID from the POM project
// 3. The group ID from a select map of known group IDs
// 3. The group ID from the Java manifest
func GroupIDFromJavaMetadata(pkgName string, metadata pkg.JavaMetadata) (groupID string) {
Copy link
Contributor

@wagoodman wagoodman Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of these new functions probably don't belong in the cpe package since they have been replicated to not affect the CPE generation logic today. My vote would be to put this under the java cataloger package (as near duplicate functions) and unexport all of the functions (I'd duplicate the looksLikeGroupID function too... since there isn't a specific definition for this and the impl may change based on the use case).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I parsed this comment correctly - check f1e05da

If there are some edits or things to move around from here we can do a quick sync on the package organization and exports/imports

Comment on lines 384 to 385
parentKey := fmt.Sprintf("%s:%s:%s", groupID, parentPkg.Name, parentPkg.Version)
pomProjectKey := fmt.Sprintf("%s:%s:%s", pomProperties.GroupID, pomProperties.ArtifactID, pomProperties.Version)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be trying to correlate the parent name and the current pom properties artifact ID and changing the virtual path when this occurs. Can you leave a code comment to why this is important? (and this might look test worthy)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with commentary and test updates in 06bd7b9

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to clarify: why is the parent package name and pom properties artifact ID being compared? (instead of artifact id vs artifact id or name vs name?)

Copy link
Contributor Author

@spiffcs spiffcs Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment explaining the significance and why we use the values we're using: 1bbb6c5

Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Copy link
Contributor

@kzantow kzantow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay -- I meant to submit this a few days ago!

syft/pkg/cataloger/common/cpe/java.go Outdated Show resolved Hide resolved
syft/pkg/cataloger/common/cpe/java.go Show resolved Hide resolved
syft/pkg/cataloger/common/cpe/java.go Outdated Show resolved Hide resolved
syft/pkg/cataloger/common/cpe/java.go Outdated Show resolved Hide resolved
syft/pkg/cataloger/common/cpe/java.go Outdated Show resolved Hide resolved
syft/pkg/cataloger/java/archive_parser.go Outdated Show resolved Hide resolved
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
@spiffcs
Copy link
Contributor Author

spiffcs commented Aug 21, 2023

Latest run: 1e4f54a

{
  "commit": "1944-consistent-purl",
  "correct": 96337,
  "incorrect": 71162
}

Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
@spiffcs
Copy link
Contributor Author

spiffcs commented Aug 22, 2023

Ratio when removing ArtifactID: 77a6fd4

INFO:root:{
  "commit": "1944-consistent-purl",
  "correct": 86641,
  "incorrect": 128203
}

Conclusion is we had less correct by about 10,000 with a larger sample size so I'm reverting the commit.

@spiffcs spiffcs force-pushed the 1944-consistent-purl branch from 05f7f67 to c3b3e79 Compare August 22, 2023 13:09
This reverts commit 77a6fd4.

Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
@spiffcs spiffcs force-pushed the 1944-consistent-purl branch from c3b3e79 to 1bbb6c5 Compare August 22, 2023 13:30
Copy link
Contributor

@wagoodman wagoodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the deep dive on this one @spiffcs 🙌

@spiffcs spiffcs merged commit ee121cf into main Aug 22, 2023
@spiffcs spiffcs deleted the 1944-consistent-purl branch August 22, 2023 14:47
GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
…ore#2033)

Separate the logic for CPE and PURL generation. 

PURL generation needs a single answer for groupID based on a priority of discovering the field. 
CPE generation still uses multiple potential groupID to populate the candidate cpe.

Improve GroupID detection. 

Currently syft does not use any hierarchy for GroupID detection and treats all sources as equal. 
It treats fields from the manifest file with priority. This change adds a hierarchy to the fields and returns a single answer based on that hierarchy.
---------
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Keith Zantow <kzantow@gmail.com>
Co-authored-by: Keith Zantow <kzantow@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants