Problem
It is very uncommon for a user to have only a single cloud account, Kubernetes cluster, or Docker environment, and the user experience of having to install VMClarity once per environment is poor: discovered assets and detected security findings are only accessible from the environment in which they were found.
Goal
It should be possible to install VMClarity's control plane once and get a single pane of glass for all assets and security findings across a user's environments. The user should be able to configure and manage scans across all environments from that single control plane, but the asset scans themselves should remain distributed and run in the environment of the asset being scanned.
Solution
There is currently a 1:1 relationship between the Orchestrator and the enabled Provider. To satisfy the goal, we need to be able to connect multiple Providers to the VMClarity control plane.
To keep scanning distributed, secure, and stateless, we will split the orchestrator component into two parts: the orchestrator itself, which is deployed with the control plane, and a provider, which is deployed into each connected environment.
When you deploy the VMClarity control plane, the orchestrator will be deployed, but there will be no providers.
For each environment (AWS account, Kubernetes cluster, Docker daemon) that you want to connect to VMClarity, a provider will be deployed into that environment. The provider will connect back to the VMClarity control plane, feeding the API with discovered assets and watching for AssetScan objects that target assets belonging to that environment.
Each provider will consist of two parts: the shared provider-runtime library and an environment-specific provider driver. A provider is an instance of the provider-runtime library initialised with that specific driver. Each provider will be a separate Go module to avoid inter-provider Go module conflicts.
An example of the main cmd function for a provider might look like the following sketch. The package paths and the provider-runtime API used here (its Config, New and Run names, and the AWS driver package) are illustrative assumptions, not a finalised interface:
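```go
package main

import (
	"context"
	"log"
	"os"

	// NOTE: these import paths are placeholders for illustration only.
	providerruntime "github.com/openclarity/vmclarity/provider-runtime"
	awsdriver "github.com/openclarity/vmclarity-provider-aws/driver"
)

func main() {
	ctx := context.Background()

	// The driver implements the environment-specific discovery and scanning
	// hooks for this provider (AWS in this example).
	driver, err := awsdriver.New(ctx)
	if err != nil {
		log.Fatalf("failed to initialise AWS driver: %v", err)
	}

	// The provider-runtime implements the generic logic: registering the
	// provider with the control plane, reporting discovered assets, and
	// watching for AssetScans assigned to this provider's ID.
	runtime, err := providerruntime.New(providerruntime.Config{
		APIServerAddress: os.Getenv("VMCLARITY_APISERVER_ADDRESS"),
		Driver:           driver,
	})
	if err != nil {
		log.Fatalf("failed to initialise provider runtime: %v", err)
	}

	if err := runtime.Run(ctx); err != nil {
		log.Fatalf("provider runtime exited with error: %v", err)
	}
}
```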
The provider-runtime library will be developed alongside the control plane VMClarity code to ensure consistency between its logic and the orchestrator logic.
There should be a well-defined support policy between the provider-runtime and the control plane. The current recommendation is N-1, meaning that providers can be one provider-runtime version behind the control plane to which they are connecting. This allows the control plane to be upgraded first and the providers afterwards, avoiding the need to upgrade everything at once, while keeping the support window small enough that we shouldn't be limited by backward-compatibility issues.
Each provider should compile into an independent binary.
Each provider will also be responsible for maintaining its own installation mechanism for the environment it supports; for example, the AWS provider should maintain a CloudFormation template and the Kubernetes provider should maintain a Helm chart. These are separate from the control plane installation methods, which will be maintained in the core VMClarity repo.
The VMClarity API will be extended to include a new object type "Provider" which represents a Provider installed in an environment. As with all objects in the VMClarity API, it will have a unique UUID which will be used to identify that provider.
Providers will initially only have a small number of fields; an illustrative sketch of the object shape is given below.
Assets in VMClarity will need to be extended to include a relationship list of any providers that discover that asset. When a provider discovers an asset, it will add its unique ID to that asset's provider list.
When a provider is removed from the system, its ID will also be removed from every asset's provider list.
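A minimal sketch of what these objects might look like, assuming Go structs with JSON tags; only the unique ID and the asset's provider list come from this proposal, the remaining names are placeholders:

```go
package models

// Provider represents a provider installed in an environment.
// Illustrative sketch only: apart from the unique ID, the field names
// here are assumptions, not the finalised API schema.
type Provider struct {
	// Unique UUID assigned by the VMClarity API, used to identify the provider.
	Id string `json:"id"`
	// Human-readable name for the environment (assumed field).
	DisplayName string `json:"displayName,omitempty"`
}

// Asset is extended with a relationship list of providers that discovered it.
type Asset struct {
	Id string `json:"id"`
	// IDs of all providers that have discovered this asset. A provider appends
	// its own ID on discovery; the entry is removed when the provider is
	// removed from the system.
	Providers []string `json:"providers,omitempty"`
	// ... existing asset fields elided ...
}
```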
A new orchestrator controller which handles AssetScans will be added, with two responsibilities:
Process AssetScans in "Pending" and, based on the target Asset's provider list, assign a provider ID to the AssetScan in a new field "ProviderID". Once this has happened, the AssetScan is moved to "Scheduled" and is ready for the responsible provider to pick it up.
Process AssetScans in "Pending", "Scheduled", "ReadyToScan" and "InProgress", and move them to "Aborted" if they have exceeded the configured timeout. This runs in the control plane to ensure that AssetScans are timed out even if the responsible provider has gone offline.
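A rough sketch of the controller's two responsibilities, under the assumptions above; the type names, states, and selection policy are placeholders, not the final implementation:

```go
package orchestrator

import (
	"errors"
	"time"
)

// AssetScan is a simplified placeholder for the real API type; only the
// fields needed for this sketch are shown.
type AssetScan struct {
	State      string    // "Pending", "Scheduled", "ReadyToScan", "InProgress", ...
	ProviderID string    // new field: the provider assigned to run this scan
	CreatedAt  time.Time // used here for the timeout check
}

// assignProvider covers responsibility one: move a Pending AssetScan to
// Scheduled by picking a provider from the target asset's provider list.
func assignProvider(scan *AssetScan, assetProviders []string) error {
	if scan.State != "Pending" {
		return nil
	}
	if len(assetProviders) == 0 {
		return errors.New("no provider has discovered this asset")
	}
	// Placeholder policy: pick the first provider in the list; a real
	// implementation could take provider health or load into account.
	scan.ProviderID = assetProviders[0]
	scan.State = "Scheduled"
	return nil
}

// abortIfTimedOut covers responsibility two: abort scans that have exceeded
// the configured timeout, even if the responsible provider has gone offline.
func abortIfTimedOut(scan *AssetScan, timeout time.Duration, now time.Time) {
	switch scan.State {
	case "Pending", "Scheduled", "ReadyToScan", "InProgress":
		if now.Sub(scan.CreatedAt) > timeout {
			scan.State = "Aborted"
		}
	}
}
```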
The helper services (trivy-server, grype-server, etc.) can be deployed on the control plane, in the provider's environment, or externally somewhere else. The location of these services will be a configuration input per provider, not determined by the control plane.
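For illustration, per-provider configuration might look something like the sketch below; the field names are assumptions, not a finalised configuration surface:

```go
// ProviderConfig is an illustrative sketch of per-provider settings.
type ProviderConfig struct {
	// Address of the VMClarity control plane API this provider reports to.
	APIServerAddress string

	// Addresses of the helper services used by asset scans started by this
	// provider. They may point at services on the control plane, at services
	// deployed alongside the provider, or at external deployments.
	TrivyServerAddress string
	GrypeServerAddress string
}
```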
Alternatives considered
One alternative is for the provider installed into the cloud account to expose a gRPC endpoint directly, with all the orchestration logic remaining in the control plane. This is not favoured, as it requires the cloud account and control plane to expose an internet endpoint and to allow an ingress connection from the control plane to the provider. In the provider-runtime architecture, all communication from the cloud account is initiated from inside the cloud account towards the control plane, which means there is no need for any of the components to be publicly addressable or accessible.