Replies: 14 comments 81 replies
-
@ncdc @joelanford please share with the team
-
Was this supposed to be here? https://github.com/operator-framework/operator-controller/discussions/140
-
I think https://github.com/operator-framework/operator-controller/discussions/140 is related (it also talks about multi-tenant clusters), yet still different, as it is focused on a subset of use cases. Specifically, in this ticket we are not focused on namespaced CatalogSources at all; as you know, we don't use namespaced CatalogSources to build OLMv0 multi-tenant deployments. In this ticket I want to elaborate on how we could move forward from the operator-controller and operand modelling perspective (API Bundles, RBAC management, how tenant versions are controlled, etc.).
-
Reading through the Kubernetes multi-tenancy doc, a few things stand out:
IMO, we're fighting an uphill battle with any architecture that attempts to provide the illusion of multi-tenancy when, in fact, there is none. I think a better way to describe the proposal here would be "coordination of cluster-scoped APIs for multiple tenants sharing a control plane". This is control plane single-tenancy with a coordination layer. How does that strike folks? Does that framing seem more or less correct?
To me, it's about more than just technical possibilities and the API/controller split. It's about the direction of the Kubernetes community, the expectations of users, the assumptions of operator authors, the interactions tenants have with control planes, the interactions control planes have with tenants, and the larger set of use cases we're trying to solve in OLMv1. I'm honestly somewhat worried that we would build something to solve the coordination problem, but then this scenario happens:
But if that's the eventual outcome, is all of this extra complexity we place on ourselves and our users worth it?
-
I've been thinking about this some more, and I have some ideas for discussion. As Joe mentioned, Kubernetes does not afford true multi-tenancy. Anything we do today must recognize that. If we want a tenant to be entirely self-sufficient (namely, they never need to ask a cluster admin for help installing an operator into the tenant's namespace), I can come up with these options:
Regardless of what option(s) we offer, I do think we can probably find ways for multiple Operators to co-own CRDs. As long as developers don't make breaking API changes in newer releases, OLM can orchestrate when to apply CRD updates as different operator versions are installed.
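A minimal sketch of what that orchestration could look like, assuming each co-owner's CRD manifest carries its bundle version in an annotation (the annotation name and the whole mechanism are hypothetical, not an existing OLM convention):

```go
// Sketch only: one way OLM could pick which co-owned CRD manifest to apply.
package crdownership

import (
	"fmt"

	"github.com/Masterminds/semver/v3"
	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

// bundleVersionAnnotation is a made-up annotation carrying the semver of the
// bundle that shipped a given CRD manifest; it is NOT an existing OLM convention.
const bundleVersionAnnotation = "olm.example/bundle-version"

// pickCRDToApply returns the CRD manifest shipped by the highest bundle version
// among all installed co-owners, i.e. the one OLM would apply as long as newer
// bundles never make breaking API changes.
func pickCRDToApply(candidates []*apiextensionsv1.CustomResourceDefinition) (*apiextensionsv1.CustomResourceDefinition, error) {
	var best *apiextensionsv1.CustomResourceDefinition
	var bestVersion *semver.Version
	for _, crd := range candidates {
		raw, ok := crd.Annotations[bundleVersionAnnotation]
		if !ok {
			return nil, fmt.Errorf("CRD %s is missing the %s annotation", crd.Name, bundleVersionAnnotation)
		}
		v, err := semver.NewVersion(raw)
		if err != nil {
			return nil, fmt.Errorf("CRD %s: invalid version %q: %w", crd.Name, raw, err)
		}
		if best == nil || v.GreaterThan(bestVersion) {
			best, bestVersion = crd, v
		}
	}
	if best == nil {
		return nil, fmt.Errorf("no candidate CRD manifests provided")
	}
	return best, nil
}
```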
-
FYI, a few of us have been brainstorming how to improve namespace-scoped operators, try to do multi-tenancy as safely as possible, etc. The working doc is https://docs.google.com/document/d/1xTu7XadmqD61imJisjnP9A6k38_fiZQ8ThvZSDYszog/edit.
-
After ruminating on this over the holidays, I don't believe that trying to program logic for "multi-tenancy" is feasible. I am not saying, however, that multi-tenancy with OLM is not possible. If you know what you're doing and you want it, you can have it. Please continue reading for much more detail.
We have been using the term "multi-tenancy" to mean a few things. In particular, these are the use cases that are primarily associated with it:
Let's start with 1. OLM v0 makes this possible because it has full cluster-admin permissions. OLM acts as a deputy, using its super powers to perform operations that a non-admin cannot do. This includes installing CRDs, creating ClusterRoles and ClusterRoleBindings, and anything else that is typically more privileged than what a namespace admin is allowed to perform.

If we want to design a secure system that doesn't grant privilege escalations via a deputized cluster-admin, we have to start by removing OLM's cluster-admin permissions. Following the principle of least privilege, we strip OLM's permissions down to strictly what it requires; namely, reconciling ClusterExtensions (read + update status, generally). Any action OLM performs at a user's request (e.g., install an operator by way of a ClusterExtension) must be done using credentials the user has access to. This is where Kubernetes already has an established pattern to use: the service account. Anyone who can create a pod implicitly has access to all the service accounts in that namespace. We should do the same with OLM: anyone who has permission to create an Extension in a namespace must designate a service account on the Extension, and OLM must use that service account when creating/updating/deleting manifests associated with that Extension.

This necessarily means that someone, or something, must grant all the permissions needed to manage the lifecycle of the operator to the user creating the Extension (or create a service account with those permissions). This is the way to keep OLM secure by default. Those wanting a slightly less secure-by-default experience could conceivably write a controller that automates a deputization process and eliminates the need to ask a cluster-admin to either do the install or grant the permissions. In other words, there could be an automation that creates the appropriate RBAC permissions and binds them to a service account whenever an Extension is created. You'd probably also want to have some sort of configuration around policy: which users can create Extensions all the time, which ones need review + approval, etc.

Now on to 2 and 3 (install/upgrade multiple copies of an operator). The API space in Kubernetes is global and singular: it is not possible to install multiple copies of the same API simultaneously. It is not possible to serve one set of APIs to one user and a different set of APIs to another user. It is not possible to serve one "form" of an API to one user and a variation of that form to a different user. These truths apply to both built-in APIs and APIs that are added to Kubernetes via CRDs and aggregated API servers.

We debated multiple "creative" solutions to work around these truths. One of these was the idea of trying to manage shared ownership of CRDs. Assuming all the appropriate permissions are in place, whenever an additional copy of an Extension is created, OLM could determine the "latest" version of the manifest for a CRD and apply that. Unfortunately, this is impossible for multiple reasons:
Based on the findings above, I’d like to propose the following paths forward:
-
The joining of CRDs and Controllers into one package is one of the key things that has made managing operators difficult. My take is that we should be trying to unwind that, and ask users to install APIs and Controllers separately, every time. If a tenant on a cluster wants to install an "Operator", first they need to use admin permissions to install the API package (or get someone to do it), then they can install the controller package themselves for their own namespace.
We could generate the two packages quite easily in existing projects by updating the operator-sdk to generate them for the catalog. An administrative end user could still just ask to install a Controller and have everything else done. A tenant would get an error about the API package not being installed. OLM's job would then be to manage the dependencies between them.
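To illustrate the dependency handling this implies, here is a rough sketch; every type and field name below is hypothetical, not an existing OLM or operator-sdk API:

```go
// Sketch only: hypothetical catalog metadata for shipping APIs (CRDs) and
// controllers as separate packages, plus the dependency check OLM would run.
package apisplit

import "fmt"

type PackageKind string

const (
	APIPackage        PackageKind = "api"        // CRDs + conversion webhooks only, cluster-scoped install
	ControllerPackage PackageKind = "controller" // Deployment, RBAC, etc., installable per tenant namespace
)

type Package struct {
	Name     string
	Kind     PackageKind
	Requires []string // names of API packages this controller needs
}

// checkInstallable reports whether a controller package can be installed given
// the packages already present on the cluster. A tenant asking for a controller
// whose API package is missing gets an actionable error rather than a silent failure.
func checkInstallable(pkg Package, installed map[string]Package) error {
	if pkg.Kind != ControllerPackage {
		return nil
	}
	for _, dep := range pkg.Requires {
		if _, ok := installed[dep]; !ok {
			return fmt.Errorf("controller package %q requires API package %q, which is not installed; ask a cluster admin to install it first", pkg.Name, dep)
		}
	}
	return nil
}
```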
-
CRD upgrade safety checks belong in the operator-sdk api build and scorecard; we should try to catch these issues before they ever hit the cluster. Having a second layer of defense on-cluster isn't a bad thing, though.
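A minimal sketch of one such check, using the upstream apiextensions types; it only covers dropped served/stored versions, and a real implementation would also diff schemas for removed fields and type changes:

```go
// Sketch only: a pre-flight CRD upgrade safety check of the kind that could run
// in operator-sdk / scorecard at build time and again on-cluster as a second
// layer of defense.
package crdcheck

import (
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

// checkUpgrade returns the problems found when replacing oldCRD with newCRD.
func checkUpgrade(oldCRD, newCRD *apiextensionsv1.CustomResourceDefinition) []error {
	var problems []error

	newVersions := map[string]bool{}
	for _, v := range newCRD.Spec.Versions {
		newVersions[v.Name] = true
	}

	// Dropping a version still listed in status.storedVersions strands any
	// objects persisted at that version in etcd.
	for _, stored := range oldCRD.Status.StoredVersions {
		if !newVersions[stored] {
			problems = append(problems, fmt.Errorf("stored version %q was removed; existing objects would become unreadable", stored))
		}
	}

	// Dropping a served version breaks every client still talking to it.
	for _, v := range oldCRD.Spec.Versions {
		if v.Served && !newVersions[v.Name] {
			problems = append(problems, fmt.Errorf("served version %q was removed", v.Name))
		}
	}
	return problems
}
```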
-
Let's discuss more the concept of API packages being separated. I think this might be a key to splitting the issues into 'smaller' problems, or at least different approaches might be defined for each sub-problem... Why can't we say that API packages MUST follow semver, otherwise they won't be applied to a cluster? And then, have a process at deploy time doing the CRD evolution checking. From there, there needs to be a discussion about mutating and validating webhooks, whose usage we could perhaps reduce in favor of doing similar logic in the application layer (controller code). For validation, that would be the .status update approach.
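For the "validation in the controller" idea, a minimal sketch of surfacing validation results through a status condition instead of a validating webhook; the `Validated` condition type is an assumption made for this example:

```go
// Sketch only: validating in the controller and reporting the result on
// .status instead of rejecting the object in a validating webhook.
package statusvalidation

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setValidatedCondition records the validation outcome on the object's
// conditions. While the condition is False the controller would stop
// reconciling further, and the user reads the reason from the status.
func setValidatedCondition(conditions *[]metav1.Condition, observedGeneration int64, validationErr error) {
	cond := metav1.Condition{
		Type:               "Validated",
		Status:             metav1.ConditionTrue,
		Reason:             "SpecValid",
		ObservedGeneration: observedGeneration,
	}
	if validationErr != nil {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "SpecInvalid"
		cond.Message = validationErr.Error()
	}
	meta.SetStatusCondition(conditions, cond)
}
```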
-
Splitting out #269 (reply in thread) into a new thread:
Naming is obviously hard 😄. If we take the meaning behind the word "extension" out of the equation, do you think that conceptually it makes sense to have 2 distinctly named APIs that both, at the end of the day, represent a declarative expression for getting yaml from a package installed on a cluster? Besides the name, are there differences, in either spec fields or functional behavior, that would justify 2 different APIs? That's what I'd like to see us come up with before going down the two-APIs path.
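To make the question concrete, a purely hypothetical sketch (none of these types are the actual operator-controller API): if the namespace-scoped and cluster-scoped variants ended up with identical specs, that would be an argument for a single API differing only in scope:

```go
// Sketch only: none of these types are the actual operator-controller API.
// If the namespace-scoped and cluster-scoped resources share one spec, the
// "two APIs" really differ only in scope and defaulting.
package twoapis

// PackageSpec is the shared declarative request: "get YAML from this package
// installed on the cluster". Field names are illustrative.
type PackageSpec struct {
	PackageName        string `json:"packageName"`
	Version            string `json:"version,omitempty"`
	Channel            string `json:"channel,omitempty"`
	ServiceAccountName string `json:"serviceAccountName"`
}

// A hypothetical namespace-scoped spec and cluster-scoped spec that are
// field-for-field identical; any justification for two APIs would have to
// come from behavior, not shape.
type ExtensionSpec struct {
	PackageSpec `json:",inline"`
}

type ClusterExtensionSpec struct {
	PackageSpec `json:",inline"`
}
```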
-
I have a question related to the adoption of kapp-controller as part of OLMv1. I think only after I studied kapp-controller did I understand why the current proposal looks the way Andy described earlier... It looks like we are saying that deployment of k8s resources (CRDs, the operator Deployment, other stuff that is part of the bundle) will be done via kapp-controller, via an impersonated ServiceAccount specified in the CR representing the 'operator' - the current name is the Extension CR. And then, perhaps an Extension can be an API-only bundle or could have controller code, with the watched namespaces specified in some parameter of such an Extension CR. It would be up to the user creating the Extension to make sure the ServiceAccount has sufficient rights. If all written above is true, let me ask: what is the value-add of OLMv1 vs direct adoption of kapp-controller or Helm charts? Can you please clarify what is in the OLMv1 scope:
Thanks!
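For reference, a minimal sketch of the impersonation mechanism described above, using client-go; whether OLMv1 itself or kapp-controller underneath builds this client is exactly the open question, and the controller's own identity would additionally need RBAC permission to impersonate service accounts:

```go
// Sketch only: building a client that acts as the tenant-provided
// ServiceAccount via Kubernetes impersonation, so every manifest the bundle
// installs is limited to what that ServiceAccount is allowed to do.
package impersonation

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// clientForServiceAccount returns a clientset whose requests are impersonated
// as the given ServiceAccount; creates/updates/deletes of bundle manifests
// would all go through a client like this.
func clientForServiceAccount(base *rest.Config, namespace, name string) (kubernetes.Interface, error) {
	cfg := rest.CopyConfig(base)
	cfg.Impersonate = rest.ImpersonationConfig{
		UserName: fmt.Sprintf("system:serviceaccount:%s:%s", namespace, name),
	}
	return kubernetes.NewForConfig(cfg)
}
```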
-
Opening another thread on another proposal which I'd like to discuss.
It simply uses different ServiceAccounts to deploy CRD bundles vs controller bundles. What we could and should discuss is how to make the cluster admin aware of CRDs being deployed, and what change management, approvals, etc. are required for this. I know the diagram might be a bit unclear at this point, as I didn't have much time to put this in writing; apologies, I will refine as we go and will shortly upload its draw.io sources for any edits. @ncdc @joelanford @stevekuznetsov et al. (sorry if you felt omitted, not intentional)
-
I want to surface in this ticket the discussion which took place via emails, a couple of conference calls, and f2f discussions.
The aim of this ticket is to discuss the fundamental differences OLM v1 introduces relative to OLM v0 when it comes to supporting and handling multi-tenant clusters. Specifically for large software vendors who have implemented a significant number of operators (like IBM, with 250+ operators) which can be deployed at the tenant-level scope (both operators and operands), we need to understand the migration strategy from OLM v0 to OLM v1.
The purpose of this document is to capture the terminology and enterprise use cases of IBM Cloud Pak customers with OLM-based operators.
Glossary
Cluster Admin - typically a member of the IT Infrastructure Ops team, responsible for providing the Kubernetes cluster infrastructure to the individual tenant teams.
Cluster Admins are responsible for:
Tenant - a team within the customer's organisation which is independent from other teams in the same company.
Tenants, aka Lines of Business, are responsible for providing value via deployment and usage of business apps like IBM Cloud Paks. This aligns most closely with the Kubernetes Teams multi-tenancy use case.
Tenants are provided a set of Kubernetes `namespaces`, and their users are granted namespace admin roles for those namespaces, plus roles to deploy operators into their own namespaces. Two types of Tenant namespaces are relevant here:
Tenant Admin - A Kubernetes user that has permissions to administer Kubernetes resources within one or more of the Control Plane namespaces that collectively represent the tenant. These users do not have privileges to affect other tenants.
IBM Cloud Pak - a suite of logically related business applications, using operators for their deployment and lifecycle management. Each of the applications might have one or more operators (typically one top-level operator and several nested operators).
Typically:
Separation of control-and-data - the deployment topology where a tenant's application is split into separate `namespaces`: one for operators and one or more for their operands. Customers set up firewalls (network policies) to block traffic to the k8s API Server from the operand namespaces.
Workload isolation - the deployment pattern where multiple tenants can deploy their applications like Cloud Paks (both operators and operands) independently from each other and manage their lifecycle independently from each other. It is acceptable to have multiple copies of the same operator at different versions, managed by different tenants.
Some cluster resources (`etcd`, `StorageClasses`, etc.) and some services (like Monitoring or Cert-manager) are cluster-level services. Yet, they shall be resilient to the noisy neighbour as much as possible.
Relevant Kubernetes Resources that are in scope of Tenant Namespaced isolation:
Scenarios
Typically, tenants are provided access to one or more `namespaces` which are under their control. Tenants deploy operator(s) into their `namespaces`. A single tenant has a single instance of any operator. It is accepted that a single operator watches multiple namespaces, as long as those namespaces belong to the same tenant.
Workload isolation scenarios
Cloud Pak operators can be installed either in `AllNamespace` mode (`openshift-operators`), meaning that there is just a single tenant on the cluster, or in `OwnNamespace` mode whenever there are multiple tenants on the cluster. It is expected to have multiple tenants each running the same Cloud Pak, most probably each at a different version (aka dev/test deployments).
Operator dependencies
IBM Cloud Paks leverage two types of Statically Defined OLM operator dependencies:
IBM Cloud Paks also create Operators dynamically (via IBM Operand Deployment Lifecycle Manager, ODLM), which enables auto-provisioning Operators and Operands (CRs) on demand when required, typically for shared capabilities/components (like user identity and access management, or the common UI platform experience). With the Cloud Pak 3.0 architecture, these shared capabilities/components are deployed as individual instances per tenant.
CatalogSource management
Currently, `CatalogSources` for IBM Cloud Paks are deployed by the Cluster Admin as global catalogs in `openshift-marketplace`, but this leads to issues when CatalogSources are updated, causing uncontrolled operator upgrades across tenant namespaces. Mitigations leverage catalog source pinning, usage of manual approval mode, and exploration of private (namespaced) CatalogSources (each tenant having its own CatalogSource).
Proposal for OLM v1 discussion
Introduction of the API Bundles
De-couple the API Bundle from the Operator controller bundle. Have a semver-versioned API Bundle which is cluster-scoped and registers only CRDs and their conversion webhooks (if needed). The Operator controller bundle can be deployed either in `All Namespace` mode (`openshift-operators`) or into each of the tenant namespaces. It is acceptable for controller operators to be deployed into multiple namespaces, each at a different version. There shall be a way for a controller operator to define a compatibility version range on the API Bundle(s). The controller code itself would be responsible for making sure the individual CRs are properly structured (TBD whether validation webhooks are really required) and for reacting accordingly by providing proper `.status` updates.
The API Bundle shall provide backwards and forwards compatibility as much as possible - and ideally there shall be a tool / method of validating the CRD evolution.
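A minimal sketch of how such a compatibility range could be enforced at install time, assuming both bundles expose plain semver metadata (the metadata itself is hypothetical; the check is ordinary semver constraint matching):

```go
// Sketch only: enforcing the compatibility version range a controller bundle
// declares against the installed, cluster-scoped API Bundle.
package apibundle

import (
	"fmt"

	"github.com/Masterminds/semver/v3"
)

// checkAPICompatibility verifies that the installed API Bundle version
// satisfies the range declared by the controller bundle, e.g. ">=1.2.0 <2.0.0".
func checkAPICompatibility(installedAPIBundleVersion, declaredRange string) error {
	installed, err := semver.NewVersion(installedAPIBundleVersion)
	if err != nil {
		return fmt.Errorf("API Bundle version %q is not valid semver: %w", installedAPIBundleVersion, err)
	}
	constraint, err := semver.NewConstraint(declaredRange)
	if err != nil {
		return fmt.Errorf("invalid compatibility range %q: %w", declaredRange, err)
	}
	if !constraint.Check(installed) {
		return fmt.Errorf("installed API Bundle %s does not satisfy the controller's requirement %q", installed, declaredRange)
	}
	return nil
}
```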
Ideally there shall be some migration tool (part of `operator-sdk`) which would take an existing OLM v0 bundle and separate it into two OLM v1 bundles: one with the APIs (CRDs) and one with the rest of the code, properly defining the dependency relationship. Perhaps it could even be executed automatically for backwards compatibility with OLM v0 operators on OLM v1 OCP clusters.
RBAC
There shall be a way in OLM v1 to define the RBAC for the tenant-level operator, based on prescriptive input (a list of `namespaces`) which defines the topology of the given tenant (a sort of `OperatorGroup`, which defines the `WATCH_NAMESPACES` for the controller and the `namespaces` to create RBAC for). Something like the `IBM Namespace Scope operator` or the `Oria` operator. Whenever a tenant is just a single namespace, no topology definition shall be required - defaults should be assumed and RBAC properly created based on the operator controller bundle metadata. Tenant admins shall delegate the CRUD of RBAC to OLM v1, based on the operator metadata and the topology definition. Such RBAC shall be easily auditable - via a few kubectl commands.
Customers would deploy the actual controller operators, which should automatically load the related API Bundles (TBD compatibility checking).
Dependency management
Dependencies (TBD whether we need them) shall deploy dependent operators in the same mode and namespace as the requesting operator, AND configure RBAC using the same topology definition.
Dependency resolution should be executed in the scope of the tenant (one or more `namespaces`).
Catalog and Subscription management
`CatalogSource` shall be tenant-level.
An update of a `CatalogSource` shall not impact other tenants.
There shall still be a concept of `Subscription` which allows subscribing to fixes. There shall be some equivalent of approval mode, but perhaps not working (like in OLM v0) at the level of a `namespace` (like `InstallPlan`), but rather at the level of a `tenant` (a set of namespaces).
There shall be a way to preview the available upgrade and what is involved with the upgrade (i.e. whether any additional cluster-level dependencies are introduced).
Related Info
TODOs
`Channel` - channels group all the fixes and updates to the semver-compatible version range
`Channels` and API Bundles, if any