SCS K8s cluster standardization #181

Closed
21 of 65 tasks
garloff opened this issue Oct 10, 2022 · 9 comments
Assignees
Labels
Container (Issues or pull requests relevant for Team 2: Container Infra and Tooling), epic (Issues that are spread across multiple sprints), longterm (Issues or pull requests that are relevant for longterm support), SCS is standardized, standardization (Standards & Certification)

Comments

@garloff
Member

garloff commented Oct 10, 2022

As DevOps team (=SCS user), I want to have the ability to create and use clusters on many different SCS-compliant container providers, where all relevant properties are either predefined by the SCS standard or can be controlled by a provider-independent cluster-settings.yaml file.
Relevant properties are those that tend to create trouble for application deployment, e.g. k8s versions, CNI features, persistent volumes, ingress/load-balancers, anti-affinity rules (keeping k8s nodes off the same host) ...

These properties should either be fixed by SCS (and then of course only evolve slowly over time) or be controllable by the customer (via a standardized, provider-independent cluster-params.yaml). For the controllable properties, we mandate existence and syntax, and we may mandate all or some of the supported options. In any case, the supported options need to be discoverable (and the mechanism for discoverability should include the fixed properties as well).
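The split between fixed and controllable properties, with discoverability covering both, could look roughly like the following minimal sketch. The property names, the option values, and the `discover` and `validate` helpers are hypothetical and only serve as illustration; nothing here is defined by an SCS standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    """One cluster property as a provider might advertise it (hypothetical schema)."""
    name: str
    fixed: bool      # True: predefined by the standard, not settable by the customer
    options: tuple   # supported values; a fixed property lists exactly one


# Illustrative catalogue only; names and values are assumptions, not SCS definitions.
CATALOGUE = (
    Property("network_policy_support", fixed=True, options=(True,)),
    Property("kubernetes_version", fixed=False, options=("1.27", "1.28")),
    Property("control_plane_count", fixed=False, options=(1, 3, 5)),
)


def discover(catalogue=CATALOGUE):
    """Return every property, fixed or controllable, with its supported options."""
    return {p.name: {"fixed": p.fixed, "options": list(p.options)} for p in catalogue}


def validate(params, catalogue=CATALOGUE):
    """Check customer-supplied parameters and return a list of error messages."""
    known = {p.name: p for p in catalogue}
    errors = []
    for key, value in params.items():
        prop = known.get(key)
        if prop is None:
            errors.append(f"unknown property: {key}")
        elif prop.fixed:
            errors.append(f"property is fixed by the standard: {key}")
        elif value not in prop.options:
            errors.append(f"unsupported value for {key}: {value!r}")
    return errors
```

A customer-facing tool could call `discover()` to list what a provider offers and `validate()` before submitting a parameters file, so that unsupported options are rejected early rather than at cluster creation time.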

Note that there is value in standardizing things that are not mandatory, in order for providers to use the same name/semantics for same things. (Obviously optional features may become mandatory for providers in the future if we decide so.)

Hints:

Extensibility: We allow for extensions, but they must be clearly distinguishable from standardized properties.
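As a rough illustration of keeping extensions distinguishable, the sketch below partitions a settings mapping by key. The `x-` prefix and the function name are assumptions made for this example; the issue does not prescribe any particular marker.

```python
EXTENSION_PREFIX = "x-"  # hypothetical marker; the issue does not prescribe one


def split_settings(settings, standardized_keys):
    """Partition settings into standardized, extension, and unrecognized keys."""
    standardized, extensions, unrecognized = {}, {}, {}
    for key, value in settings.items():
        if key in standardized_keys:
            standardized[key] = value
        elif key.startswith(EXTENSION_PREFIX):
            extensions[key] = value
        else:
            # Neither standardized nor marked as an extension: likely a typo
            # or an unmarked provider-specific key that should be flagged.
            unrecognized[key] = value
    return standardized, extensions, unrecognized
```

With a convention like this, a provider-independent tool can pass extensions through untouched while still rejecting keys that are neither standardized nor explicitly marked.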

This epic should list the standardization proposals / ADRs as issues that we as SCS community want to define as relevant for SCS compliance. Some of the proposals might not make it for a v1 of the SCS standard (because they are not ready, deemed not important enough, or downgraded to recommendations). The individual proposed properties / ADRs should come with a rationale and with (ideally comprehensive) conformance tests. We want to evolve the reference implementation(s) in parallel to the standardization, but intellectually keep a clear distinction between standards and implementation.

We need to create conformance tests for these properties; it is useful to define standards in terms of tests that must pass. (Test-driven standardization!) Obviously, using existing test suites (such as CNCF/sonobuoy or aqua/kube-bench) and possibly contributing to them is a good way to do this.
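To make "a standard defined by a test that must pass" concrete, here is a minimal sketch for the anti-affinity property mentioned earlier. It is not the actual SCS conformance test; the function name and the node-to-host mapping are assumptions, and how a host identifier is obtained is provider-specific and left out.

```python
from collections import Counter


def check_anti_affinity(node_to_host):
    """Pass if no two k8s nodes share a physical host.

    node_to_host maps a node name to an identifier of the host it runs on.
    """
    counts = Counter(node_to_host.values())
    shared = sorted(host for host, n in counts.items() if n > 1)
    return {"passed": not shared, "shared_hosts": shared}
```

For example, `check_anti_affinity({"cp-0": "host-a", "cp-1": "host-a"})` fails and names `host-a` as the shared host, which gives a provider something actionable beyond a plain pass/fail.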

Inspiration for the list below:

Individual topics for standardization:

Networking

  • Standardize k8s networking policies (CNI)

    • Description: CNI capabilities / k8s network policy support
    • Current state: Blocked
    • Discussion #211
  • Service type LoadBalancer with externalTrafficPolicy: Local

    • Description: Service type LoadBalancer with externalTrafficPolicy: Local needs to work out of the box
    • Current state: Unknown
    • Discussion #212
  • Ingress Support (OPTIONAL)

    • Description:
    • Current state: Action required

Container Registry

  • Container registry feature overview

    • Description: Container registry (OPT-IN), Container registry: Create overview of needed and desirable features and map OSS solutions against it
    • Current state: Completed
    • Decision Record #263
  • Registry Standard from DR SCS-0212

    • Description: Derive a standard from the DR created in the previous registry issue. Split the already existing document into a standard and a Decision Record only concerning the SCS cluster.
    • Current state: Waiting for next issue
    • Decision Record | Standard #270
    • Test #662

Meta

  • Supported k8s versions

    • Version: scs-0210-v1
    • Description:
    • Current state: Completed
    • Standard #219
  • K8s version support period

    • Version: scs-0210-v2
    • Description: Include the K8s version support period into the SCS standards
    • Current state: Completed
    • Standard #386
    • Test #505
      • Conformance tests #488
      • Improve Tests !499
      • Restore scs-0210-v1 conformance tests !503
  • KaaS ControlPlane/worker machine flavors

    • Description: ControlPlane and Worker machine flavors and counts (translation from SCS flavors needed for non-SCS IaaS?)
    • Current state: Backlog
    • Discussion #421
  • Cluster management API

    • Description:
    • Current state: Action required

Automation

  • KaaS Cluster Management Gitops Controller

    • Description: Gitops controller for Cluster Mgmt
    • Current state: Backlog
    • Discussion #419
  • KaaS Gitops/CI tooling

    • Description: Gitops/CI tooling (flux/argo)?? (OPTIONAL)
    • Current state: Doing
    • Discussion #420

Identity Management

  • Understand the requirements towards the IdP Broker to support the container layer

    • Description: Identity federation via OIDC, Understand the requirements towards the IdP Broker to support the container layer
    • Current state: Backlog
    • Discussion #194
  • Implement Machine Identities

    • Description: Implement Machine Identities
    • Current state: Backlog
    • Discussion #163
  • KaaS IAM federation with ID broker

    • Description: IAM federation with ID broker (keycloak in our current ref impl)?
    • Current state: Backlog
    • Discussion #417

Logging & Metrics

  • Metrics server support (OPT-OUT)(OPTIONAL)

    • Description:
    • Current state: Backlog
    • Discussion #224
  • Logging/Monitoring/Tracing features? (OPTIONAL)

    • Description:
    • Current state: Backlog
    • Discussion #418

Security & Robustness

  • Forwarding-porting and retesting of upstream intel patchset for SGX and OpenStack

    • Description: Kube API access controls, Add ability to limit access to k8s API k8s-cluster-api-provider
    • Current state: Doing
    • Issue #246
  • K8s cluster baseline security setup K8s cluster hardening

    • Description: Baseline security setups: External CA, protected kubeAPI, security patching for nodes?
    • Current state: Doing
    • Standard #415
    • Standard update #475
  • Move Keycloak onto kubernetes powered runtime on management plane

    • Description: Control plane backup/ maintenance, etcd maintenance k8s-cluster-api-provider
    • Current state: Backlog
    • Issue #258
  • KaaS Optional Cert-Manager

    • Description: Cert manager (OPTIONAL)
    • Current state: Backlog
    • Discussion #416
  • Distributed K8s nodes to ensure Anti-Affinity

    • Version: scs-0214-v2
    • Description: Anti-affinity policies for k8s nodes (for control-plane and -- possibly distinctly -- for workers)
    • Current state: Doing
    • Decision Record #226
    • Standard #434
    • Standard v2 #494
    • Test #477
    • Test updates #489
    • Follow-up for stabilization standards/#639
  • KaaS Robustness features

    • Description: Robustness features: Rate limiting kube-api, etcd compaction/defragmentation, etcd backup, CA expiration avoidance, node-problem-detector
    • Current state: Waiting for next issue
    • Standard #414
    • Test #549

Storage

  • Standardize additional storage classes
    • Issue #214
    • Decision Record
    • Standard

Tests

Definition of Done:

  • We have a number of individual standards agreed and have reference implementations ready (or have otherwise created confidence that we can get them ready soon and without any potential blockers). Agreement includes reaching out to relevant communities, potentially also outside of the current SCS universe.
  • We have agreed on the subset of standards that we want to pull into a v1 of SCS-standard k8s platform
  • The included standards have good coverage by conformance tests
  • There is documentation on the standard, with links to individual ADRs
@garloff garloff added the Container Issues or pull requests relevant for Team 2: Container Infra and Tooling label Oct 10, 2022
@garloff garloff self-assigned this Oct 10, 2022
@garloff garloff moved this to Backlog in Sovereign Cloud Stack Oct 10, 2022
@garloff garloff moved this from Backlog to Refined Stories in Sovereign Cloud Stack Oct 10, 2022
@garloff garloff added this to the R4 (v5.0.0) milestone Oct 10, 2022
@garloff garloff added epic Issues that are spread across multiple sprints longterm Issues or pull requests that are relevant for longterm support standardization Standards & Certification labels Oct 10, 2022
@fkr fkr added the SCS is standardized SCS is standardized label Oct 18, 2022
@fkr fkr changed the title EPIC: SCS K8s cluster standardization SCS K8s cluster standardization Oct 18, 2022
@tibeer tibeer mentioned this issue Mar 29, 2023
@jschoone jschoone removed this from the R4 (v5.0.0) milestone Apr 28, 2023
@mbuechse

@jschoone @garloff It seems that this existing epic does for the CaaS track what I intended the new epic SovereignCloudStack/standards#285 to do for the IaaS track. I guess it remains to compare the description here with the table https://input.scs.community/tqKlv1Z_Srmi5e5o76CxhQ?view#KaaS-Layer I took from Kurt's slides and maybe update accordingly? For instance, two standards have already been ticked off, even though we still need to implement the conformance tests -- @cah-hbaum will write the corresponding issues, and so I could add those to this epic. Please tell me if you disagree with anything I just wrote.

@mbuechse

mbuechse commented Jun 8, 2023

Comparison between this epic and the table from Kurt's ALASCA talk slides

Please check what should be added here or what I did wrong @garloff @jschoone.

@garloff
Member Author

garloff commented Jun 28, 2023

TL;DR: I want them all to be considered and discussed.
Not all of them necessarily become a mandatory standard. Maybe some of them don't even become a recommendation.

Comparison between this epic and the table from Kurt's ALASCA talk slides

* Present in this epic, but missing in the slides (really? or did I just fail to align them?)
  
  * LBs don't require special annotations (upstream nginx deployment works out of the box): Service type LoadBalancer with externalTrafficPolicy: Local needs to work out of the box [Service type LoadBalancer with externalTrafficPolicy: Local needs to work out of the box SovereignCloudStack/issues#212](https://github.com/SovereignCloudStack/issues/issues/212)

The thing here is that upstream nginx uses externalTrafficPolicy: Local and assumes that

(1) Traffic is routed only to the nodes that run the nginx container. This requires a health monitor to be configured, which on many LBs (including the Octavia one) requires a special annotation or a changed default.

(2) The original client IP is visible and not obscured by the LB -- L2/L3 LB instead of L4
Yet, the occm tends to prefer HTTP L7 health checks ...
Discussion here is on #212 and numerous subsequent issues, indeed.
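Assumption (1) can be pictured with a small model: with externalTrafficPolicy: Local, only nodes that host a backend pod should stay in the load balancer's rotation. The sketch below builds such a Service manifest as a plain dict, using standard Kubernetes Service fields, and models which nodes a health monitor should consider healthy. The helper names and the example values are illustrative assumptions, and no provider-specific annotations are shown.

```python
def loadbalancer_service(name, selector, port):
    """Build a Service manifest (as a dict) of type LoadBalancer with Local policy."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",
            "externalTrafficPolicy": "Local",
            "selector": selector,
            "ports": [{"port": port, "targetPort": port, "protocol": "TCP"}],
        },
    }


def healthy_nodes(all_nodes, pod_nodes):
    """Nodes a health monitor should keep in rotation: those running a backend pod."""
    with_pod = set(pod_nodes)
    return [node for node in all_nodes if node in with_pod]
```

Without a working health monitor, the load balancer would also send traffic to nodes outside `healthy_nodes(...)`, where the Local policy drops it instead of forwarding it to another node.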

  * ControlPlane and Worker machine flavors and counts (translation from SCS flavors needed for non-SCS IaaS?)

For both ControlPlane and Worker nodes, the number of nodes and the flavors need to be configurable. The mandatory SCS flavors need to be accepted for the latter. (Sidenote: This is a cluster-management feature, not a cluster property -- the latter being something you can rely on once a cluster exists.)

* Present in the slides, but missing in this epic:
  
  * CNCF conformance tests (not linked to any issue so far)

We have the sonobuoy binary installed on the management cluster and run it to test the workload clusters for CNCF conformance. So we have tooling to test CNCF conformance and we want to require CNCF conformance for all clusters.
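A wrapper around such a run might look like the sketch below. It assumes the `sonobuoy` binary is on the PATH and that a kubeconfig for the workload cluster is available; the function names and the kubeconfig path are hypothetical, and this is not how the reference implementation invokes it.

```python
import shutil
import subprocess


def sonobuoy_run_command(kubeconfig, mode="certified-conformance"):
    """Build the argv for a sonobuoy conformance run against one workload cluster."""
    return ["sonobuoy", "run", "--kubeconfig", kubeconfig, "--mode", mode, "--wait"]


def run_conformance(kubeconfig):
    """Run sonobuoy if the binary is on PATH; return its exit code, or None if absent."""
    if shutil.which("sonobuoy") is None:
        return None
    return subprocess.run(sonobuoy_run_command(kubeconfig), check=False).returncode
```

Keeping the command construction separate from its execution makes it easy to log or review exactly what would be run for each workload cluster.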

  * K8s version support period (not linked to any issue so far)
    * note: "Offered K8s version recency" is present as [Supported k8s versions SovereignCloudStack/issues#219](https://github.com/SovereignCloudStack/issues/issues/219)

We have a standard on this: scs-0210-v1. Maybe we need to amend it so that providers must not drop support for a minor k8s version before upstream ends security support (~14 months after a release). And maybe we should recommend that, for managed clusters, the provider warns users when one of their clusters enters the extended support period (after ~12 months) and aligns the needed upgrades?
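The two approximate windows mentioned here (~12 months until the extended support period, ~14 months until upstream security support ends) can be turned into a simple check. This is a minimal sketch under those rough figures; the function names and status labels are assumptions, and the release date has to be supplied by the caller.

```python
from datetime import date


def add_months(day, months):
    """Shift a date by whole months, clamping to day 28 to stay valid in any month."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, min(day.day, 28))


def support_status(release_date, today, warn_after=12, end_after=14):
    """Classify a minor version by its age relative to the approximate windows."""
    if today >= add_months(release_date, end_after):
        return "unsupported"
    if today >= add_months(release_date, warn_after):
        return "extended"
    return "supported"
```

A provider could run such a check per cluster and send the suggested warning once the status changes to `extended`, leaving roughly two months to plan the upgrade.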

  * Identity federation via OIDC, [Understand the requirements towards the IdP Broker to support the container layer SovereignCloudStack/issues#194](https://github.com/SovereignCloudStack/issues/issues/194)
  * Machine identities, [Implement Machine Identities SovereignCloudStack/issues#163](https://github.com/SovereignCloudStack/issues/issues/163)
  * Control plane backup/ maintenance, [etcd maintenance k8s-cluster-api-provider#258](https://github.com/SovereignCloudStack/k8s-cluster-api-provider/issues/258)
  * Kube API access controls, [Add ability to limit access to k8s API k8s-cluster-api-provider#246](https://github.com/SovereignCloudStack/k8s-cluster-api-provider/issues/246)
  * Container registry (opt-in), [Container registry: Create overview of needed and desirable features and map OSS solutions against it. SovereignCloudStack/issues#263](https://github.com/SovereignCloudStack/issues/issues/263)
  * Cluster management API, [SCS K8s cluster standardization SovereignCloudStack/issues#181](https://github.com/SovereignCloudStack/issues/issues/181)
  * Gitops controller for Cluster Mgmt (not linked to any issue so far)

We had some concepts written down for this -- and determined that this should be optional (for the customer).
This should become a requirement to the to-be-developed cluster stacks: Have the ability for the cluster-parameters to be pulled from a git repo (using tooling like flux or Argo).

Please check what should be added here or what I did wrong @garloff @jschoone.

I did not check these for completeness, but everything above looks desirable to me.

Note:
I believe we have two kinds of standards here:
(1) What are the properties of the created clusters?

  • Things like the LB properties, CSI presence, CNI support for network policies, CNI security properties, IAM integration, anti-affinity, ...
  • These should be formulated such that users of non-capi solutions (like e.g. Gardener or maybe even AKS) can fulfill these, so SCS compatible clusters can be created there.

(2) What is the standardized parameter format and API to create, modify and delete clusters?

  • Things like capi, flavors, choice of machine numbers, ... belong here.
  • Obviously, non-capi based solutions will not be able to fulfill this (unless they really want to create an API emulation layer ...)

@mbuechse

mbuechse commented Jul 6, 2023

@garloff I amended the description of this issue with everything that hadn't been in there. Maybe we can now go ahead and group the items a bit, like I did in SovereignCloudStack/standards#285.

@cah-hbaum

I updated the epic and grouped everything a bit more together. But I think in the long run, something like a table would be better, since the "pre"-work for the standard issues is done in other issues or over multiple ones.
I can make a table here, so that the whole thing gets grouped better, if that is desired.

@cah-hbaum

I created individual issues for nearly all points not yet covered by previous issues. I left a few open, since they seemed way too general and broad.

@mbuechse

@cah-hbaum That sounds great! I also like the new structure in the description above. 👍👍👍

@cah-hbaum

cah-hbaum commented Oct 6, 2023

Short term

Medium term

Long term

Not enough information

Blocked

Already working on

@martinmo
Member

Closing in favor of SovereignCloudStack/standards#615.

@github-project-automation github-project-automation bot moved this from Refined Stories to Done in Sovereign Cloud Stack Jul 17, 2024
Projects
Status: Done
Development

No branches or pull requests

6 participants