
DAOS-623 ci: Add a workflow for Trivy scan #14623

Merged: 17 commits, Aug 2, 2024
Changes from 7 commits
84 changes: 84 additions & 0 deletions .github/workflows/triviy.yml
@@ -0,0 +1,84 @@
name: Trivy scan

on:
  workflow_dispatch:
  schedule:
    - cron: '46 8 * * 0'
@brianjmurrell (Contributor) commented on Jun 25, 2024:

Why a schedule here instead of just on landing? I.e.:

Suggested change:
-  schedule:
-    - cron: '46 8 * * 0'
+  push:
+    branches: ["master"]

This way we know exactly which landing may have caused a regression and don't have to bisect 24 hours' worth of landings to find the one that is causing a failure -- assuming anyone even notices the failed result of a scheduled test.

@grom72 (Contributor, PR author) replied:

There are two factors in Trivy scans:

  • new code that enables new use cases and opens new potential vulnerability areas
  • new vulnerabilities detected in packages we already use

I assume that the second bullet will be the root cause of the majority of incidents.

@brianjmurrell (Contributor) replied:

If I'm understanding the first bullet correctly, that is something the developer introduced with their PR, so they need to resolve it there.

Even the second bullet does not represent a case we don't already experience. Any one of our linters, for example, can start flagging pre-existing code that it never used to, simply because a new version of the linter is introduced and starts objecting to existing constructs. When this happens we have to fix the code it flags, and it fails PRs. If it didn't fail PRs, the issue would be ignored and landed, and then everyone's PRs would start flagging the offending (yet pre-existing) code.

@grom72 (Contributor, PR author) replied:

The risk is that a random PR can be affected by a newly detected vulnerability purely through time coincidence.
You can make a change in C code, and your PR will be blocked because a new Java/Python vulnerability is detected.

@brianjmurrell (Contributor) replied:

I understand the risk, and we already accept that kind of risk and resolution process with our other checkers. We frequently see unrelated linting issues creep into PRs due to updated linting tools, spelling checkers, etc.

Those are either corrected in the PR that found them, or somebody is assigned the task of getting master updated with the correction. But it's only because they block progress that they get fixed. Otherwise nobody takes on the task (and the distraction) of addressing them, because they don't halt progress. History has proven this to us.

PRs that are blocked wait for the update and rebase, or they can be force-landed if gatekeepers are willing.

Given that we accept this process for less serious issues like spell-checking and code linting, we must surely accept it for actual security vulnerabilities that are detected!

  pull_request:
    branches: ["master", "release/**"]

# Declare default permissions as nothing.
permissions: {}

jobs:
  trivy-scan:
    name: Trivy scan
    runs-on: ubuntu-22.04

    steps:
      - name: Install trivy package
        run: |
          sudo apt-get install wget apt-transport-https gnupg lsb-release
          wget -qO - https://aquasecurity.github.io/trivy-repo/deb/public.key | sudo apt-key add -
          echo deb https://aquasecurity.github.io/trivy-repo/deb $(lsb_release -sc) main | \
            sudo tee -a /etc/apt/sources.list.d/trivy.list
          sudo apt-get update
          sudo apt-get install trivy

      - name: Checkout code
        uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1
        with:
          persist-credentials: false

      - name: Checkout latest trivy configuration
        uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
        with:
          ref: grom72/trivy # 'grom72/trivy' to be changed to 'master' before the merge
          path: trivy
          persist-credentials: false

      - name: Update trivy configuration
        run: |
          cp -f -r ./trivy/utils/trivy ./utils
          rm -rf ./trivy

      - name: Scan with trivy
        run: |
          trivy fs -c utils/trivy/trivy.yaml -f table --dependency-tree \
            --skip-files "src/client/java/hadoop-daos/pom.xml" \
            --show-suppressed --exit-code 1 .

      # generate trivy report only if no errors detected
      - name: Generate trivy report file extension
        # github.base_ref-based for PR and github.ref_name-based for schedule/workflow_dispatch
@brianjmurrell (Contributor) commented:

Does using landing-time testing instead of a schedule resolve this either/or for github.base_ref/github.ref_name?

In either case, see below for a more concise solution.
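
One possible way to collapse the either/or, shown here only as an illustration and not necessarily the more concise solution referenced above, is the || operator in GitHub Actions expressions, which falls back to github.ref_name when github.base_ref is empty:

      - name: Generate trivy report file extension
        id: gen_extension
        run: |
          # Illustrative sketch: github.base_ref is only set for pull_request events,
          # so the || expression picks github.ref_name for schedule/workflow_dispatch runs.
          EXTENSION=$(echo "${{ github.base_ref || github.ref_name }}" | sed -e 's/release\///' -e 's/\//_/')
          echo "EXTENSION=$EXTENSION" >> $GITHUB_OUTPUT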

@grom72 (Contributor, PR author) replied on Jun 25, 2024:

I do not want to block PRs because of detected vulnerabilities, as the solution is to update the affected package to the proper version or, as we did with Hadoop, to create a suppression.

Once this PR has landed, it would be great to add a status badge to monitor the Trivy scan, as we have done for other regular actions:

(existing badges: License, Coverity Scan Build Status, Build, Codespell, Doxygen)

@brianjmurrell (Contributor) replied:

> I do not want to block PRs because of detected vulnerabilities,

Why not? That is how we operate here. If some new "check" comes up as failed, it needs to be fixed before the PR can land. Otherwise we propagate that failure to everyone else's PRs once they rebase on the branch that allowed the PR with the failure in it to land. That is a huge waste of everyone else's time.

> It would be great to add a status badge to monitor the Trivy scan, as we have done for other regular actions

Nobody monitors these badges. All of the checks behind these badges fail PRs (and even landing branches) when they start failing. That is the only way failures are noticed here.

Even scheduled (i.e. cron) failures will go unnoticed until they start showing up in PRs, by which time it becomes somebody's job to start bisecting and otherwise hunting for the source of the failure. That somebody will most assuredly come from the SRE team, disrupting whatever tasks they had going on and adding zero value to their OKRs.

Failures need to be alerted on and fixed when and where they happen, by whoever introduces them.

@grom72 (Contributor, PR author) replied:

The problem is that the Trivy scan might fail not because of new code but because of a new vulnerability detected in code unrelated to the PR.
There are two goals (from the SDLe process perspective):

  • prevent any new vulnerability from being introduced by developers (a rare case)
  • detect new vulnerabilities that are reported by subcomponent owners

The second is the more likely case.

@brianjmurrell (Contributor) replied:

> The problem is that the Trivy scan might fail not because of new code but because of a new vulnerability detected in code unrelated to the PR.

Yes, that is understood. As described above, we already accept this process model with our other checks and validations.

The problem is that we do not have processes that support allowing new (even unrelated) issues to "slide" and land so that they are addressed later. Nobody ever owns fixing those, and they continue to pile up until the pile is so big that it's a nightmare. So the only process that works for us is to have new -- even unrelated -- issues block progress, forcing somebody to own getting them fixed.

        id: gen_extension
        run: |
          EXTENSION=$(echo "${{ github.base_ref }}" | sed -e 's/release\///' -e's/\//_/' )
          if [ -z "${EXTENSION}" ]; then
            EXTENSION=$(echo "${{ github.ref_name }}" | sed -e 's/release\///' -e's/\//_/' )
          fi
          echo "EXTENSION=$EXTENSION" >> $GITHUB_OUTPUT

      - name: Generate trivy report
        run: |
          trivy fs -c utils/trivy/trivy.yaml \
            --skip-files "src/client/java/hadoop-daos/pom.xml" \
            --show-suppressed \
            --output trivy-report-daos.${{ steps.gen_extension.outputs.EXTENSION }}.txt .

      - name: Print trivy report
        run: cat trivy-report-daos.${{ steps.gen_extension.outputs.EXTENSION }}.txt

      - name: Prepare the report to be uploaded to the GitHub artifact store
        run: |
          mkdir report
          cp trivy-report-daos.${{ steps.gen_extension.outputs.EXTENSION }}.txt report
          cp utils/trivy/.trivyignore report/trivyignore.txt

      - name: Upload the report to the GitHub artifact store
        uses: actions/upload-artifact@65462800fd760344b1a7b4382951275a0abb4808 # v4.3.3
        env:
          EXTENSION: ${{ steps.gen_extension.outputs.EXTENSION }}
        with:
          path: report/*
          name: trivy-report-daos.${{ steps.gen_extension.outputs.EXTENSION }}
31 changes: 31 additions & 0 deletions utils/trivy/.trivyignore
@brianjmurrell (Contributor) commented:

This seems like a lot of ignored CVEs. Are we allowed to ignore this many? Some of them are HIGH even.

@grom72 (Contributor, PR author) replied:

  1. Trivy is here because of the SDL process.
  2. It has been agreed with the SDL people that we can ignore all Hadoop-related CVEs.
  3. We do not release any Java code, so we can practically ignore all Java-related issues.
  4. I want to get the workflow running first, and then clean up all required issues (in a separate PR).
  5. An example of a failed build with suppressed ignores: https://github.com/grom72/daos/actions/runs/10159867708/job/28095002601#step:8:31

@brianjmurrell (Contributor) replied:

> An example of a failed build with suppressed ignores: grom72/daos/actions/runs/10159867708/job/28095002601#step:8:31

Nice. Is there a reason we don't take advantage of the whole Code Scanning workflow in GitHub with this action?

https://github.com/aquasecurity/trivy-action#using-trivy-with-github-code-scanning
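
For reference, a minimal sketch of what the Code Scanning integration behind that link might look like; the step names below are illustrative, the inputs are taken from the trivy-action README, and none of this is part of the PR diff (the job would also need the security-events: write permission):

      - name: Run Trivy in SARIF mode (illustration, not part of this diff)
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'

      - name: Upload SARIF results to GitHub Code Scanning
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'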

@grom72 (Contributor, PR author) replied on Jul 30, 2024:

The workflow supports a few things:

  • SDL evidence generation, where I also want ignored issues documented
  • clear info in the log when something goes wrong
  • clear info in the log about ignored issues

SARIF scan added. Results of the run with expected failures are here (see the first two issues):
https://github.com/daos-stack/daos/security/code-scanning?query=pr%3A14623+is%3Aclosed

@@ -0,0 +1,31 @@
## Ignored hadoop related CVE
## CVE-2023-52428,MEDIUM,,"Denial of Service in Connect2id Nimbus JOSE+JWT","com.nimbusds:nimbus-jose-jwt","9.8.1","9.37.2",https://avd.aquasec.com/nvd/cve-2023-52428
CVE-2023-52428
## CVE-2023-39410,HIGH,7.5,"apache-avro: Apache Avro Java SDK: Memory when deserializing untrusted data in Avro Java SDK","org.apache.avro:avro","1.7.7","1.11.3",https://avd.aquasec.com/nvd/cve-2023-39410
CVE-2023-39410
## CVE-2024-25710,HIGH,5.5,"commons-compress: Denial of service caused by an infinite loop for a corrupted DUMP file","org.apache.commons:commons-compress","1.21","1.26.0",https://avd.aquasec.com/nvd/cve-2024-25710
CVE-2024-25710
## CVE-2024-26308,HIGH,5.5,"commons-compress: OutOfMemoryError unpacking broken Pack200 file","org.apache.commons:commons-compress","1.21","1.26.0",https://avd.aquasec.com/nvd/cve-2024-26308
CVE-2024-26308
## CVE-2024-29131,MEDIUM,,"commons-configuration: StackOverflowError adding property in AbstractListDelimiterHandler.flattenIterator()","org.apache.commons:commons-configuration2","2.8.0","2.10.1",https://avd.aquasec.com/nvd/cve-2024-29131
CVE-2024-29131
## CVE-2024-29133,MEDIUM,,"commons-configuration: StackOverflowError calling ListDelimiterHandler.flatten(Object, int) with a cyclical object tree","org.apache.commons:commons-configuration2","2.8.0","2.10.1",https://avd.aquasec.com/nvd/cve-2024-29133
CVE-2024-29133
## CVE-2022-40150,HIGH,7.5,"jettison: memory exhaustion via user-supplied XML or JSON data","org.codehaus.jettison:jettison","1.1","1.5.2",https://avd.aquasec.com/nvd/cve-2022-40150
CVE-2022-40150
## CVE-2022-45685,HIGH,7.5,"jettison: stack overflow in JSONObject() allows attackers to cause a Denial of Service (DoS) via crafted JSON data","org.codehaus.jettison:jettison","1.1","1.5.2",https://avd.aquasec.com/nvd/cve-2022-45685
CVE-2022-45685
## CVE-2022-45693,HIGH,7.5,"jettison: If the value in map is the map's self, the new new JSONObject(map) cause StackOverflowError which may lead to dos","org.codehaus.jettison:jettison","1.1","1.5.2",https://avd.aquasec.com/nvd/cve-2022-45693
CVE-2022-45693
## CVE-2023-1436,HIGH,7.5,"jettison: Uncontrolled Recursion in JSONArray","org.codehaus.jettison:jettison","1.1","1.5.4",https://avd.aquasec.com/nvd/cve-2023-1436
CVE-2023-1436
## CVE-2022-40149,MEDIUM,7.5,"jettison: parser crash by stackoverflow","org.codehaus.jettison:jettison","1.1","1.5.1",https://avd.aquasec.com/nvd/cve-2022-40149
CVE-2022-40149
## CVE-2023-34455,HIGH,7.5,"snappy-java: Unchecked chunk length leads to DoS","org.xerial.snappy:snappy-java","1.1.8.2","1.1.10.1",https://avd.aquasec.com/nvd/cve-2023-34455
CVE-2023-34455
## CVE-2023-43642,HIGH,7.5,"snappy-java: Missing upper bound check on chunk length in snappy-java can lead to Denial of Service (DoS) impact","org.xerial.snappy:snappy-java","1.1.8.2","1.1.10.4",https://avd.aquasec.com/nvd/cve-2023-43642
CVE-2023-43642
## CVE-2023-34453,MEDIUM,7.5,"snappy-java: Integer overflow in shuffle leads to DoS","org.xerial.snappy:snappy-java","1.1.8.2","1.1.10.1",https://avd.aquasec.com/nvd/cve-2023-34453
CVE-2023-34453
## CVE-2023-34454,MEDIUM,7.5,"snappy-java: Integer overflow in compress leads to DoS","org.xerial.snappy:snappy-java","1.1.8.2","1.1.10.1",https://avd.aquasec.com/nvd/cve-2023-34454
CVE-2023-34454
29 changes: 29 additions & 0 deletions utils/trivy/csv.tpl
@@ -0,0 +1,29 @@
{{ range . }}
Trivy Vulnerability Scan Results ({{- .Target -}})
VulnerabilityID,Severity,CVSS Score,Title,Library,Vulnerable Version,Fixed Version,Information URL,Triage Information
{{ range .Vulnerabilities }}
{{- .VulnerabilityID }},
{{- .Severity }},
{{- range $key, $value := .CVSS }}
{{- if (eq $key "nvd") }}
{{- .V3Score -}}
{{- end }}
{{- end }},
{{- quote .Title }},
{{- quote .PkgName }},
{{- quote .InstalledVersion }},
{{- quote .FixedVersion }},
{{- .PrimaryURL }}
{{ else -}}
No vulnerabilities found at this time.
{{ end }}
Trivy Dependency Scan Results ({{ .Target }})
ID,Name,Version,Notes
{{ range .Packages -}}
{{- quote .ID }},
{{- quote .Name }},
{{- quote .Version }}
{{ else -}}
No dependencies found at this time.
{{ end }}
{{ end }}
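
For context, a template like this is normally consumed through Trivy's template output format; an illustrative invocation (an assumed step, not part of this diff and not how the PR's workflow currently calls Trivy) might look like:

      - name: Generate CSV report (illustration)
        run: |
          # Render the scan results through the Go template above into a CSV file.
          trivy fs -c utils/trivy/trivy.yaml \
            --format template --template "@utils/trivy/csv.tpl" \
            --output trivy-report-daos.csv .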