Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add file catalogers to selection configuration #3505

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

wagoodman
Copy link
Contributor

@wagoodman wagoodman commented Dec 6, 2024

Description

This PR adds file catalogers to the cataloger selection configuration:

$ syft cataloger list
Default selections: 1
  • 'all'
Selection expressions: 0
┌───────────────────────────┬───────────────────────┐
│ FILE CATALOGER            │ TAGS                  │
├───────────────────────────┼───────────────────────┤
│ file-content-cataloger    │ content, file         │
│ file-digest-cataloger     │ digest, file          │
│ file-executable-cataloger │ binary-metadata, file │
│ file-metadata-cataloger   │ file, file-metadata   │
└───────────────────────────┴───────────────────────┘
┌────────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────┐
│ PACKAGE CATALOGER                      │ TAGS                                                                      │
├────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────┤
(same as today...)
└────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────┘

To ensure configuration is backwards compatible, file is added to the list of default tags, no matter if the user attempts to override this:

# try and set the default set to only "image", leaving out "file" implicitly
syft cataloger list --override-default-catalogers image
[0000]  WARN adding 'file' tag to the default cataloger selection, to override add '-file' to the cataloger selection request
...

Once the user specifies -file the warning does not appear:

# explicitly remove catalogers with 'file' tag
$ syft cataloger list --override-default-catalogers image --select-catalogers -file     
Default selections: 1
  • 'image'
Selection expressions: 1
  • '-file' (remove)
No file catalogers selected

(show package cataloger list, just as today...)

Expressions in the past were applied to all package cataloger tasks, and there was one group of tasks where selections could be applied. This PR adds the ability to apply the selection request to different groups of tasks, applying only the tags and names which a task in the group responds to. Any tags or names that don't apply to tasks within a group are ignored. Any tags or names that were in the original request but were not used against any group results in an error (stop from cataloging). This effectively splits the original user selection request dynamically amongst the group of tasks.

What does this mean in terms of syntax rules? The syntax for cataloger selection does not change. This means that package catalogers and file catalogers can be operated on independently without risk of affecting the one another.

Take for example:

syft python:latest --select-catalogers python
...

Today this uses the default tag (image) and subselects to package catalogers that have the python tag and file cataloging stays intact (as evidence of "file metadata", "file digests" and "executables"):

...
 ✔ Cataloged contents                                                                                                                                                                                           11663d4f2372df9c95bc55815c98527fd9f7267062c95b37424f485b2cc9df39
   ├── ✔ Packages                        [2 packages]  
   ├── ✔ File metadata                   [4 locations]  
   ├── ✔ File digests                    [4 files]  
   └── ✔ Executables                     [1,425 executables]  
NAME       VERSION  TYPE     
mercurial  6.3.2    python    
pip        24.2     python  

Without operating on groups independently and splitting the request, we'd see the same command result in NOT running any file catalogers:

...
 ✔ Cataloged contents                                                                                                                                                                                           11663d4f2372df9c95bc55815c98527fd9f7267062c95b37424f485b2cc9df39
   └── ✔ Packages                        [2 packages]  
NAME       VERSION  TYPE     
mercurial  6.3.2    python    
pip        24.2     python  

Instead we want a much more explicit expression to remove file cataloging:

syft python:latest --select-catalogers 'python,-file'

This is made possible by always injecting file as one of the default tags used for all cataloger selections.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

@wagoodman wagoodman added the enhancement New feature or request label Dec 6, 2024
@wagoodman wagoodman requested a review from a team December 6, 2024 20:26
@wagoodman wagoodman self-assigned this Dec 6, 2024
Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
@wagoodman wagoodman force-pushed the file-cataloger-selection branch from d919d15 to bd11b5b Compare December 6, 2024 21:42
@willmurphyscode
Copy link
Contributor

I'd like to understand a little more context here. It seems like SYFT_FILE_METADATA_SELECTION=none has the same effect on main today as passing --select-catalogers -file will after this. Is that correct?

@wagoodman
Copy link
Contributor Author

That is correct. I also think it's pretty non-obvious to folks that SYFT_FILE_METADATA_SELECTION affects more than just the file metadata cataloger. This PR addresses only this.

@@ -62,6 +70,142 @@ func Select(allTasks []Task, selectionRequest pkgcataloging.SelectionRequest) ([
return finalTasks, selection, nodes.Validate()
}

// ensureDefaultSelectionHasFiles ensures that the default selection request has the "file" tag, as this is a required
// for backwards compatibility (when catalogers we only for packages and not for separate groups of tasks).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: s/we only for packages/were only for packages

removals := strset.New(selectionRequest.RemoveNamesOrTags...)
missingFileIshTag := !defaultNamesOrTags.Has(filecataloging.FileTag) && !defaultNamesOrTags.Has("all") && !defaultNamesOrTags.Has("default")
if missingFileIshTag && !removals.Has(filecataloging.FileTag) {
log.Warnf("adding '%s' tag to the default cataloger selection, to override add '-%s' to the cataloger selection request", filecataloging.FileTag, filecataloging.FileTag)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit like it says Warning: doing what Syft always did, but now we feel weird about it. Does this need to be a log.Warn?

Can we just document somewhere that some amount of file cataloging is on unless removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what this is trying to convey is what to do next when there is surprising behavior. The rule of thumb for logging here is to only use warn when we have something to direct the user to do to correct a situation. This feels like the right spot for that since we're taking an action to be backwards compatible but could be surprising to folks that already have the sub-selection operation down.

To your point though, I should also update the readme with this context (but I think the warning is good to have too).

}

// SelectInGroups is a convenience function that allows for selecting tasks from multiple groups of tasks. The original
// request is splut into sub-requests, where only tokens that are relevant to the given group of tasks are considered.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo s/splut/split

@willmurphyscode
Copy link
Contributor

I have sort of a general question:

Currently, file cataloger selection is controlled by SYFT_FILE_METADATA_SELECTION, which can be all or owned-by-package or none, where none seems to do the same thing as --select-catalogers -file would in this PR. However, it's possible to make contradictory selections, and people may already have .file.metadata_selection set in their Syft configs. How is the cataloger selection meant to interact with that?

@wagoodman
Copy link
Contributor Author

Good point -- though it's similar to having a package cataloger configuration to enable data enrichment then de-selecting the cataloger. I think we should allow the action to occur (override configuration with cataloger selection) but we could log-warn in these kinds of situations.

@spiffcs spiffcs mentioned this pull request Dec 12, 2024
4 tasks
@popey
Copy link
Contributor

popey commented Dec 12, 2024

Note from live stream to take on board feedback from #3468 also.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

3 participants