Refine CVE check in scs-0210-v2 test script. #526

Open
mbuechse opened this issue Mar 18, 2024 · 13 comments · May be fixed by #779
Labels: Container, SCS is standardized, SCS-VP10

@mbuechse
Contributor

mbuechse commented Mar 18, 2024

The test script currently does not really check whether any patch-level update that targets any critical CVEs is deployed in time.

Furthermore, the standard is a bit vague about whether this part is actually required or recommended.

Thirdly, could you make some kind of suggestion of how to best integrate with CVE check tools? For instance, the test script could accept a log file by one of these tools and just verify that the tool ran fine. You could then add this to the standard as a recommendation; I think we might get this in even with the now stable standard because it wouldn't turn any compliant clouds non-compliant.

@martinmo
Member

Note to myself: while working on #476 I noticed that the VersionRange class doesn't cover all use cases nicely. Without workarounds, it covers the cases:

  • a single version is affected (upper_version is None)
  • a range is affected, but both ends must be known

It doesn't cover the case that all versions prior to upper_version are affected, which I have worked around by using version 0.0.0 as the lower version. Nor does it cover the opposite case, where all versions from a lower bound onward are affected (which would mean that no patched version is available or known).
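
A minimal sketch of how open-ended ranges could be represented without the 0.0.0 workaround. This is not the existing VersionRange class: the name OpenVersionRange, the parse_version helper, and the choice of inclusive bounds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse 'v1.27.2' or '1.27.2' into a comparable tuple."""
    major, minor, patch = text.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


@dataclass(frozen=True)
class OpenVersionRange:
    """Hypothetical range of affected versions; None leaves a bound open."""

    lower: Optional[Version] = None  # inclusive; None: all earlier versions
    upper: Optional[Version] = None  # inclusive; None: no patched version known

    def __contains__(self, version: Version) -> bool:
        if self.lower is not None and version < self.lower:
            return False
        if self.upper is not None and version > self.upper:
            return False
        return True
```

In this sketch a single affected version is expressed by setting both bounds to the same value, so all four cases (single version, closed range, open lower end, open upper end) go through the same membership test.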

@martinmo
Member

And a general note: I would actually like to replace our custom CVE retrieval and parsing with an existing library, if possible. An obvious candidate I want to evaluate in this regard is cvelib.

@mbuechse
Contributor Author

@martinmo Yes, very well. Also (as stated in the description of this issue) we might require the use of some external CVE check tool, if you can find something appropriate.

@mbuechse
Contributor Author

Oh, and one other note: maybe in the course of this issue, you can also try to make the check work for any given date, instead of just the current, so that unit tests can work without monkeypatching.
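
One way this could look, as a sketch only: the reference date becomes an optional parameter that defaults to the current date. The function name is_update_overdue and the grace_period parameter are hypothetical and not taken from the test script or the standard.

```python
from datetime import date, timedelta
from typing import Optional


def is_update_overdue(
    patch_release_date: date,
    grace_period: timedelta,
    today: Optional[date] = None,
) -> bool:
    """Return True if the grace period after a patch release has elapsed.

    `today` defaults to the current date; tests can pass a fixed value
    instead of monkeypatching the clock.
    """
    reference = today if today is not None else date.today()
    return reference > patch_release_date + grace_period
```

A unit test can then call the function with today=date(2024, 3, 9) and get the same result on every run.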

@martinmo
Member

martinmo commented Mar 26, 2024

I did some research on the CVE/vulnerability scanning part of this issue.

An additional candidate for a Python CVE query library is nvdlib, which uses the "National Vulnerability Database". But there is a big caveat: the database is not reliable at the moment (see https://heise.de/-9656574, for example).

However, before digging deeper into the Python libraries, I decided to look for vulnerability scanning solutions in the K8s ecosystem:

  1. Because even if we have a good CVE query library, we still need to scan the K8s cluster ourselves and match the results. I am sure this is an already solved problem.
  2. Furthermore, while experimenting with this I noticed that our current approach has another shortcoming. We just compare the K8s version of one particular component when we connect to the cluster with the kubernetes-asyncio package, and not the complete cluster. (Nodes could, in theory, run slightly different versions of kubelet and container images in the kube-system namespace, such as kube-apiserver.)

The proper way to address point 2 would be to create an inventory and check it, for example with the cluster-inventory plugin for sonobuoy or the KBOM ("Kubernetes Bill of Materials").
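
As a much simpler illustration of the inventory idea (not a replacement for sonobuoy or KBOM), the kubelet versions of all nodes can be collected from the parsed output of kubectl get nodes -o json. The function name kubelet_versions is made up for this sketch; status.nodeInfo.kubeletVersion is the field the Kubernetes Node object uses.

```python
from collections import Counter
from typing import Any, Dict


def kubelet_versions(node_list: Dict[str, Any]) -> Counter:
    """Count kubelet versions in the parsed `kubectl get nodes -o json` output."""
    return Counter(
        item["status"]["nodeInfo"]["kubeletVersion"]
        for item in node_list.get("items", [])
    )
```

If the resulting counter has more than one key, the nodes do not all run the same kubelet version, and a check against a single version would be incomplete.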

A promising solution to tackle points 2 and 1 seems to be trivy, which conveniently is Apache-2.0 licensed. For example, the experimental trivy k8s subcommand can be used to scan a cluster. I successfully tried the following on a test cluster:

trivy k8s --report=summary --scanners=vuln cluster
trivy k8s --format=json --scanners=vuln --namespace=kube-system all
trivy k8s --scanners=vuln --namespace=kube-system --format=json -o result.json nodes,pods

JSON output is supported, which means we can further process the information.
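
A sketch of such processing, written so that it does not depend on the exact report layout: it walks the parsed JSON recursively and treats every object that has both a VulnerabilityID and a Severity key as a finding. Those two key names are an assumption about the report format and should be checked against the output of the trivy version in use; the function names are made up for this example.

```python
from collections import Counter
from typing import Any, Dict, Iterator


def iter_vulnerabilities(node: Any) -> Iterator[Dict[str, Any]]:
    """Recursively yield every dict that looks like a vulnerability entry."""
    if isinstance(node, dict):
        # Assumed key names; verify against the actual report.
        if "VulnerabilityID" in node and "Severity" in node:
            yield node
        for value in node.values():
            yield from iter_vulnerabilities(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_vulnerabilities(item)


def count_by_severity(report: Any) -> Counter:
    """Count findings per severity level in a parsed JSON report."""
    return Counter(v["Severity"] for v in iter_vulnerabilities(report))
```

The input would be the result of json.load() on a file such as result.json from the last command above.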

Trivy can also be run in a K8s-native fashion as an operator (trivy-operator). However, I think this doesn't make sense if we only test short-lived clusters that exist only during the conformance tests.

@martinmo
Member

Today I brought the question about which scanning tools could be used into the Team Container call. However, because of holidays/vacation, we were only two people and this couldn't be discussed with a broader audience (I'll try again in the next week if necessary).

In the meantime, I picked up another tool that I will evaluate: kubescape (https://kubescape.io/).

@martinmo
Member

martinmo commented Apr 4, 2024

I performed some evaluation of more CVE scanner tools for K8s. Unfortunately, most of them are not suited for our purpose: they either do not scan cluster components (e.g., the kubelet or apiserver) or they cannot easily be included in a CI pipeline (some of them are nice UI dashboards).

So all in all, trivy seems to be the best option. One thing to keep in mind though: the trivy k8s CLI is still experimental and the format of the JSON export may not be stable.

Furthermore, yesterday I prototyped with the Python libraries cvelib and nvdlib to see how much of an effort the library approach is:

cvelib is not suitable because it doesn't provide a sophisticated search by product name. It is aimed more at security professionals who want to assign/reserve/issue CVEs (e.g., the cve_api module provides functions to publish and reserve entries and to look up a specific CVE by its id). Furthermore, an API key is needed to interact with the CVE Services API.

nvdlib could be used if the trivy solution doesn't work out. It is more effort than the trivy solution but still an improvement over our current custom solution. Some facts:

  • Requests are rate limited when no API key is used (6 s delay between requests)
  • We can use searchCVE(…) with cpeName kwarg.
  • CPE (Common Platform Enumeration) is a standardized way to identify affected products
  • CPE Dictionary XML is the official listing where we can get the (partial) CPE (https://nvd.nist.gov/products/cpe)

For example, knowing that our cluster runs v1.27.2, with

import nvdlib

results = nvdlib.searchCVE(
  cpeName='cpe:2.3:a:kubernetes:kubernetes:1.27.2',
  isVulnerable=True,
  cvssV3Severity="HIGH",
  limit=10
)

we can get the CVEs this version is affected by. The library also wraps the CVE records data in a nice data structure.

According to my research, it should be sufficient to search only for

cpe:2.3:a:kubernetes:kubernetes:<version>:<update>

The <version> part is something like 1.27.2 and the <update> part is used for prereleases and should be - instead of * (wildcard). I grepped through the CPE dictionary and only found that historically, the apiserver had a separate <product> in its CPE (it was cpe:2.3:a:kubernetes:apiserver), but only until v1.25rc1.
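
The CPE pattern described above could be produced by a small helper like the following. The function name kubernetes_cpe is hypothetical, and whether nvdlib or the NVD API accepts this partial form as cpeName is not verified here.

```python
def kubernetes_cpe(version: str, update: str = "-") -> str:
    """Build the partial CPE name for a Kubernetes release.

    `update` is "-" for regular releases, as described above; a leading
    "v" in the version (as in "v1.27.2") is dropped.
    """
    return f"cpe:2.3:a:kubernetes:kubernetes:{version.lstrip('v')}:{update}"
```

For a cluster reporting v1.27.2, this yields cpe:2.3:a:kubernetes:kubernetes:1.27.2:-.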

@martinmo
Member

martinmo commented Apr 4, 2024

In today's Team Container call I brought the issue up again. Sven confirmed that trivy is a good approach. It was decided that I will try the trivy approach with an MVP first. If it doesn't work out, I can still switch to the library approach.

FTR, we also had a short discussion whether the standard is feasible for CSPs, i.e., whether the timeframes that are set out are too short. It was concluded that it is feasible and that in reality a CSP needs to react quickly anyway. Also, in practice, critical K8s vulnerabilities do not appear often. (Nevertheless, this issue here can be tackled independently as it just deals with the implementation of the check.)

@martinmo
Member

martinmo commented Apr 8, 2024

For the prototype using trivy with the k8s subcommand my first goal was to "narrow"/filter the command invocation as much as possible. With the --help flag and some trial and error I arrived at:

trivy k8s --scanners=vuln --components=infra --report=summary \
    --severity=HIGH --exit-code=1 --format=json -o trivy-cluster-infra-scan.yml cluster

The "narrowing" happens with the --scanners=vuln, --components=infra and --severity=HIGH flags. I could not find a JSON schema for the resulting output; however, the format is simple enough and is codified in the ConsolidatedReport struct (because of --report=summary) in https://github.com/aquasecurity/trivy/blob/v0.50.1/pkg/k8s/report/report.go.
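
For use from a Python test script, the invocation above could be wrapped roughly as follows. This is a sketch: the function names are made up, the flags are copied from the command above, and it assumes the trivy binary is on the PATH and a kubeconfig is already set up.

```python
import subprocess
from typing import List


def build_trivy_command(output_path: str, severity: str = "HIGH") -> List[str]:
    """Assemble the narrowed `trivy k8s` invocation shown above."""
    return [
        "trivy", "k8s",
        "--scanners=vuln",
        "--components=infra",
        "--report=summary",
        f"--severity={severity}",
        "--exit-code=1",
        "--format=json",
        "-o", output_path,
        "cluster",
    ]


def run_trivy_scan(output_path: str) -> bool:
    """Run the scan and return True if trivy exited with code 0.

    With --exit-code=1, a non-zero exit code can mean that findings were
    reported or that trivy itself failed; this sketch does not
    distinguish between the two.
    """
    result = subprocess.run(build_trivy_command(output_path), check=False)
    return result.returncode == 0
```

Keeping the command construction separate from the subprocess call means the flag list can be unit tested without a cluster.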

Now I have two problems:

  • Testing the trivy invocations, I quickly reached the Docker Hub rate limit (TOOMANYREQUESTS: You have reached your pull rate limit.). I can raise the limit a bit by using a Docker Hub account. However, I'm still concerned about this. It seems the images are not cached. Flags such as --offline-scan and --skip-db-update didn't help.
  • When tested against a cluster with the latest patch release (Kubernetes v1.27.12), I still get findings with severity "HIGH". For example, my kube-proxy pods run the image registry.k8s.io/kube-proxy:v1.27.12 and get flagged as vulnerable because of an issue in runc (CVE-2024-21626). This is unexpected noise, and I am concerned about how to filter or handle it correctly.

@martinmo
Member

martinmo commented Jun 4, 2024

FTR, the EOL check failed for the first time in the Zuul E2E tests for cluster-stacks because I did not update the k8s-eol-data.yml after 1.30 was released. Relying on manually maintained data alone is prone to this error. Thus, a new requirement for the refinement worked on in this issue: add some kind of interpolation of EOLs if data is missing (releases happen every four months on the 28th).
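
The interpolation could be sketched as follows, assuming that each newer minor release reaches EOL four months after its predecessor and always on the 28th. Both function names are hypothetical, and the four-month step is taken from the remark above rather than from an official schedule.

```python
from datetime import date


def add_months(day: date, months: int) -> date:
    """Shift a date by whole months, pinning the day to the 28th."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, 28)


def estimate_eol(known_minor: int, known_eol: date, minor: int) -> date:
    """Estimate the EOL of 1.<minor> from the last known EOL entry.

    Assumes one minor release every four months, so each minor version
    step shifts the EOL date by four months.
    """
    return add_months(known_eol, 4 * (minor - known_minor))
```

The check would only fall back to such an estimate when k8s-eol-data.yml has no entry for the version in question, and it would ideally log that the date is an estimate.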

martinmo added the Container label on Jun 7, 2024
martinmo mentioned this issue on Jun 10, 2024 (29 tasks)
@piobig2871

piobig2871 commented Oct 10, 2024

During my research on this task I tried to dive into all of the mentioned technologies.

I started by setting up the K8s cluster with yaook, unfortunately without any success, so I moved to a kind-based approach: I installed OpenStack on kind, and on that OpenStack I was able to use CAPI.

EDIT: After an attempt to create a Kubernetes cluster on top of OpenStack using yaook, there was an error, which was also found by @michal-gubricky:

TASK [bootstrap/ssh-known-hosts : Trigger SSH certificate renewal] **********************************************************************************************************************
fatal: [managed-k8s-gw-0]: FAILED! => changed=false 
  msg: |-
    Unable to start service renew-ssh-certificates: Job for renew-ssh-certificates.service failed because the control process exited with error code.
    See "systemctl status renew-ssh-certificates.service" and "journalctl -xeu renew-ssh-certificates.service" for details.

The logs of renew-ssh-certificates.service on the gateway instance showed this error:

Error writing data to auth/yaook/nodes/login: Put "https://127.0.0.1:32769/v1/auth/yaook/nodes/login": dial tcp 127.0.0.1:32769: connect: connection refused
failed to log into vault!

mbuechse linked a pull request on Oct 18, 2024 that will close this issue
@piobig2871

piobig2871 commented Oct 18, 2024

What I have done right now is restore the original standard text and drop the changes based on the review comment.

As for the code, several changes were made:

- Integrated Trivy for scanning Kubernetes pod images for security vulnerabilities.
- Fixed issue with ClusterInfo object being incorrectly passed where kubeconfig path was expected.
- Added logging improvements to provide clearer insights during version compliance checks.
- Refined the code structure to handle K8s image scanning and cluster versioning in an async manner.

I also ran into problems with SSL certificates on my side; for macOS users there is a simple solution:

/Applications/Python\ 3.10/Install\ Certificates.command

What held me up most was this error:
AttributeError: 'ClusterInfo' object has no attribute 'split'
Therefore I added a kubeconfig field to ensure that I pass the path to the kubeconfig instead of the ClusterInfo object, which indeed has no split() attribute.
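
To illustrate the kind of mix-up behind this error, here is a sketch with a stand-in class. The fields of ClusterInfo and the helper kubeconfig_filename are hypothetical and do not reproduce the actual test script.

```python
from dataclasses import dataclass


@dataclass
class ClusterInfo:
    """Hypothetical stand-in for the script's ClusterInfo class."""

    version: str
    name: str
    kubeconfig: str  # path to the kubeconfig file, kept next to the metadata


def kubeconfig_filename(kubeconfig_path: str) -> str:
    """Helper that expects a path string, not a ClusterInfo object."""
    return kubeconfig_path.split("/")[-1]
```

Calling kubeconfig_filename(info) with a ClusterInfo instance raises the AttributeError quoted above, whereas kubeconfig_filename(info.kubeconfig) works because a string is passed.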

EDIT: I haven't run into the Docker Hub rate limit myself, but I checked what the limits are:

  • 100 pulls per 6 hours from a single IP address for unauthenticated users.
  • 200 pulls per 6 hours for a single authenticated Docker Hub account.

@piobig2871

The code is waiting for review.
