KEP - Add a Filter plugin to ensure that non-GPU pods are not scheduled on GPU nodes #812

Open
wants to merge 1 commit into base: master

Conversation

@Zeel-Patel commented Oct 23, 2024


This plugin is widely used at Uber, and we think it can be useful for the open source community.

KEP doc as requested in PR #788.


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the do-not-merge/release-note-label-needed, cncf-cla: yes, and needs-ok-to-test labels on Oct 23, 2024
@k8s-ci-robot
Contributor

Hi @Zeel-Patel. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Zeel-Patel
Once this PR has been reviewed and has the lgtm label, please assign ffromani for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/L label on Oct 23, 2024

netlify bot commented Oct 23, 2024

Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

🔨 Latest commit: bed72f0
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/6718c1d59c391d000856303a

@ffromani
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Oct 25, 2024
@k8s-ci-robot
Contributor

@Zeel-Patel: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-scheduler-plugins-verify | Commit: bed72f0 | Details: link | Required: true | Rerun command: /test pull-scheduler-plugins-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +401 to +402
The drawback in this approach is that, either pod creator(client) needs to add tolerations or it needs to be added by
kubernetes cluster manager through mutating admission controller without client's knowledge.

Contributor

ok, but this applies also to a filter plugin, doesn't it? arguably any non-core, non-default setting needs to be added by kubernetes cluster manager, and can be unknown for whatever reason to the pod submitter(s).


In the filter plugin-based approach, pod submitters can focus solely on defining their resource requirements (e.g., nvidia.com/gpu) without needing to know about specific node taints or tolerations in the pod spec. This allows the plugin to handle those details automatically, reducing the configuration burden on the pod creator.
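For illustration, here is a minimal sketch (not taken from the PR) of what the pod submitter would write under the filter-plugin approach: only a GPU resource request, with no toleration, node affinity, or scheduler name. The nvidia.com/gpu name is the extended resource exposed by NVIDIA's device plugin; the pod and image names are hypothetical.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// A pod that only states its GPU requirement; nothing scheduler-specific.
	pod := v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-trainer"}, // hypothetical name
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "trainer",
				Image: "registry.example.com/trainer:latest", // hypothetical image
				Resources: v1.ResourceRequirements{
					Limits: v1.ResourceList{
						// Extended resource advertised by the GPU device plugin.
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
	out, _ := yaml.Marshal(pod)
	fmt.Println(string(out))
}
```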

Contributor

This is true only if the filter plugin is part of the default scheduler profile.
AFAICT that will not be the case until the plugin is merged into the main k8s codebase and accepted as part of the default scheduler profile, and neither step, especially the latter, happens quickly (as the standard process mandates).

At the very least for the time being, then, the pod submitter would need to know which schedulerName to use.

Contributor

It is debatable whether there is unanimous desire for preventing scheduling of Pods not using GPU resources to nodes that have GPUs. Sometimes, you might want to do that, but sometimes, you might want to save a buck and have some lower priority Pods land in such nodes also, when resources permit. I would be a little bit surprised if such a filter plugin became a part of the default scheduler profile rapidly.

In typical CSP Kubernetes environments one doesn't get to change the scheduler settings. Any solution based on scheduler plugins will for a long time (if not forever) have to be using a mutating webhook which injects the special scheduler name plus of course you need an extra pod for running the special scheduler. You can't get away from using the mutating webhook, unless you accept modifying all workloads manually with the scheduler name. No difference there if you choose to go for scheduler plugins or if you choose to do labels and node affinity, so the least hassle would be to deploy node feature discovery and a mutating webhook without a need for scheduler changes.

In terms of resource usage, I bet the mutating webhook + NFD is a smaller burden than a mutating webhook + extra scheduler Pod. NFD isn't doing complex scheduling, it is pretty simple.


Our northstar goal is to make this plugin available as part of the default scheduler profile in the k8s-scheduler binary. I would like to understand the reservations with that approach. We are fine with the time it will take, and we would like to follow the full process.

Until this plugin is part of the k8s-scheduler binary, an alternate (temporary) approach would be to make this plugin available in this repo; people who are interested in using it can rebuild the k8s-scheduler binary by placing the plugin code in the https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/framework/plugins directory. We are providing this choice to users who want to use this plugin; of course, it will work only after it is added to the scheduler profile configuration.

Contributor

Until this plugin is part of the k8s-scheduler binary, an alternate (temporary) approach would be to make this plugin available in this repo; people who are interested in using it can rebuild the k8s-scheduler binary by placing the plugin code in the https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/framework/plugins directory.

Understood. Arguably, requiring the cluster admin to patch, rebuild, and maintain their scheduler binary is a burden which needs to be taken into account when we evaluate the pros and cons of this approach vs taints and tolerations.


It is debatable whether there is unanimous desire for preventing scheduling of Pods not using GPU resources to nodes that have GPUs. Sometimes, you might want to do that, but sometimes, you might want to save a buck and have some lower priority Pods land in such nodes also, when resources permit. I would be a little bit surprised if such a filter plugin became a part of the default scheduler profile rapidly.

Allowing non-GPU workloads on GPU nodes can be practical in on-prem environments with underutilized resources. However, in cloud settings, aggressively downscaling GPU nodes and spawning non-GPU nodes for general workloads is often more cost-effective, as GPU resources are typically pricier. For those who don't need to exclude standard pods from GPU nodes to save costs, this plugin remains optional—they can choose not to use it if they prefer a more flexible GPU node utilization approach.

Contributor

@ffromani Nov 4, 2024

Allowing non-GPU workloads on GPU nodes can be practical in on-prem environments with underutilized resources. However, in cloud settings, aggressively downscaling GPU nodes and spawning non-GPU nodes for general workloads is often more cost-effective, as GPU resources are typically pricier. For those who don't need to exclude standard pods from GPU nodes to save costs, this plugin remains optional—they can choose not to use it if they prefer a more flexible GPU node utilization approach.

This makes sense, but it is not clear to me how this would be the case if we make this plugin (as IIUC is the goal) part of the default profile. OTOH, using it in a secondary, non-default profile would be straightforward, but that would again require changes to the pod spec.

@kumariitr Nov 4, 2024

Understood. Arguably, requiring the cluster admin to patch, rebuild, and maintain their scheduler binary is a burden which needs to be taken into account when we evaluate the pros and cons of this approach vs taints and tolerations.

At least at Uber, we prefer to put such burden / extra responsibilities on the cluster-admin team and not on the customers / users of the platform. There may be more people / companies who would like to take this approach. Taints / tolerations add burden on the cluster-admin team and on customer teams, and also require coordination when hardware / nodes are managed by one team (cluster-admin) and workloads / pods are managed by multiple different teams. This coordination becomes complex in a multi-team setup in a large organisation.
So, adding a new approach of using a scheduler plugin helps many companies / people in the industry. The choice will still be theirs, whether they take this approach or the taints-tolerations approach.

Contributor

I still need to review the latest updates, but based on the comments I read, I still don't see great benefits over the existing taints and tolerations mechanism. Still, provided we explore the alternatives and acknowledge them in the KEP, I'm not against adding the plugin to this repo. This is (IIUC :) ), after all, the place where experimentation is expected to happen.

I agree the benefits would really surface, as written in another comment, only when this plugin is both merged upstream AND part of the default profile. Both steps will require large buy-in.

OTOH, a prototype will help the conversation, so that is another reason not to oppose the merge.
Will review again and comment more deeply ASAP.

The drawback in this approach is that, either pod creator(client) needs to add tolerations or it needs to be added by
kubernetes cluster manager through mutating admission controller without client's knowledge.
Taints are on nodes and tolerations are on pods. Submitter of pods and managers of nodes are two different teams sitting
in different orgs. So one can not manage both the taints and tolerations. And it will increase the chances of errors.

Contributor

I'm not against this PR for strong reasons, but I still struggle to see the benefits over taints + tolerations + a mutating webhook, because I don't see (yet) how the drawbacks listed here don't apply to the extra-filter scenario.
IOW: it seems to me the new filter will hit pretty much the same scenarios.

Member

+1


We’ve evaluated the taints and tolerations approach as well. However, from the standpoint of managing additional node configurations, it requires cluster admins to add and maintain specific taints on GPU nodes and tolerations on GPU-using pods. In contrast, the filter plugin handles these configurations automatically, significantly reducing the administrative burden.
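To make the comparison concrete, below is a sketch of the two pieces the taints-and-tolerations alternative has to keep in sync: a taint that cluster admins apply to every GPU node, and the matching toleration every GPU-consuming pod must carry (or have injected by a mutating webhook). The taint key and value are illustrative assumptions, not from the PR or any standard component.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Taint the cluster-admin team would have to apply to every GPU node.
	// The key and value are illustrative, not mandated by any component.
	gpuTaint := v1.Taint{
		Key:    "example.com/gpu-only",
		Value:  "true",
		Effect: v1.TaintEffectNoSchedule,
	}

	// Matching toleration that every GPU-consuming pod must carry, either
	// added by the pod author or injected by a mutating admission webhook.
	gpuToleration := v1.Toleration{
		Key:      "example.com/gpu-only",
		Operator: v1.TolerationOpEqual,
		Value:    "true",
		Effect:   v1.TaintEffectNoSchedule,
	}

	fmt.Printf("node taint: %+v\npod toleration: %+v\n", gpuTaint, gpuToleration)
}
```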

Contributor

One can label GPU-nodes with something like NFD and then inject node affinity to Pods with a webhook preventing scheduling to those nodes unless the Pod really uses GPU resources.

Automated, low maintenance. That's what I'd do in order to avoid maintaining a scheduler build. The gotcha is if some gitops tool will not allow using such a webhook. But usually gitops can be configured.
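A sketch of the node affinity term such a webhook might inject into pods that request no GPUs, keeping them off nodes labelled as having a GPU. The label key nvidia.com/gpu.present is commonly published for NVIDIA GPUs by GPU feature discovery; treat it, and the rest of the snippet, as an assumption rather than something defined in this PR.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Affinity a mutating webhook could add to pods that request no GPUs:
	// "do not schedule me onto nodes labelled as having a GPU".
	affinity := v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						// Label assumed to be set by NFD / GPU feature discovery.
						// NotIn also matches nodes that do not carry the label.
						Key:      "nvidia.com/gpu.present",
						Operator: v1.NodeSelectorOpNotIn,
						Values:   []string{"true"},
					}},
				}},
			},
		},
	}
	out, _ := yaml.Marshal(affinity)
	fmt.Println(string(out))
}
```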

Contributor

+1 to @uniemimu

Can we have a summary of the administrative burden for both approaches?
I'm still lacking enough evidence that this approach brings major (and thus worthy) benefits.

Contributor

Especially your column for a patched scheduler is missing a few details.

For the Setup Effort, you will need to add the burden of defining the scheduler name for every workload, unless you manage to replace the existing default scheduler, which is not typically possible in CSP environments. If you don't do those workload definition changes by hand, add a webhook. If you do it by hand for every workload, you could do manual changes for workload node affinity by hand also in the left column for NFD. So either remove webhook from the left column or add it to the middle column.

For the ongoing maintenance about labels, the GPU vendors tend to have their own set of labels which has been rather stable, but varies between vendors. The dominating market leader has NFD installation as an option in its GPU device plugin helm chart. It's not like it would be an effort to set up and maintain.

The node-level overhead for the patched scheduler version will only be "none" in case you replace the existing default scheduler, which is usually not possible in CSP managed kubernetes environments. So you typically need an extra node, possibly a big one, if the cluster is big. The NFD worker daemonset is small and in the case of those GPU nodes where you didn't want extra Pods, it shouldn't force you to increase your node counts unless 5 millicores and 128MB of memory is too much. On other nodes then, there is that amount of unnecessary resource usage. So the row there in my opinion would be "typically 0 new nodes required, plausibly 1 extra node" | "typically 1 extra node required" | "0"

From the overall hassle point of view having a default scheduler forcing this filtering for all users by default is of course the simplest, and the right column rightfully shows that.

What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented in sig-scheduling?

Contributor

@ffromani Nov 4, 2024

What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented in sig-scheduling?

+1 again. I can see a way to have this code accepted in kubernetes-sigs eventually, but changing the default profile is a much more impactful change which would require extensive discussion and evaluation.

Will review the rest shortly.

@kumariitr Nov 4, 2024

I think the whole discussion is revolving around "CSP managed kubernetes environments". Why so? There are companies (especially the larger companies) who manage their own Kubernetes environments / cluster setups. For them it makes sense to use hardware from the cloud as IaaS and then install Kubernetes on top of it themselves. Not everyone is willing to use GKE, EKS, or any other "CSP managed kubernetes environment".
Let's talk about open source Kubernetes and not conflate it with CSP goals.


What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented in sig-scheduling?

If that's part of the process before merging a new plugin into this repo, then let's follow the process. We can present this to sig-scheduling; that's a good suggestion. @gkeesh7 @Zeel-Patel please prepare for that.

Contributor

From where I'm standing, I don't oppose having the plugin here in this repository. I just don't see much benefit myself, since I believe I could do the same with less of a maintenance burden, basically without having to keep building a custom scheduler. But from the perspective of providing an alternative for those who really prefer building their own scheduler, I get it.

There's still yet another approach available for achieving the same thing, and that is scheduler extenders. With extenders you wouldn't have to rebuild the whole scheduler, you'd only have to build the extender and hook it up. The filter functionality is available in extenders. The downside of those is that you still need to configure the scheduler, and you still need to maintain your own scheduler extender container. Plus you take a performance hit in scheduling, which can be mitigated by the use of caches and only passing in node names to the extender. But you still take a measurable performance hit.

My view is that real benefits in terms of making things simple start to only surface if you get this to Kubernetes upstream all the way into the default profile where it would be enabled by default, and that requires sig-scheduling to buy the idea.
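For completeness, here is a rough sketch of what the extender's filter verb mentioned above could look like as an HTTP handler. All names, the path, and the port are assumptions; for simplicity it expects full Node objects (nodeCacheCapable=false), so it skips the node-name-only optimisation discussed above, and the extender section would still have to be added to the scheduler configuration.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	v1 "k8s.io/api/core/v1"
	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

const gpuResource = v1.ResourceName("nvidia.com/gpu") // assumed GPU resource name

func podRequestsGPU(pod *v1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits[gpuResource]; ok && !q.IsZero() {
			return true
		}
		if q, ok := c.Resources.Requests[gpuResource]; ok && !q.IsZero() {
			return true
		}
	}
	return false
}

// filter implements the extender "filter" verb: it drops GPU nodes for pods
// that do not request GPUs and passes everything else through unchanged.
func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	if args.Nodes == nil {
		// This sketch expects full Node objects (nodeCacheCapable=false).
		http.Error(w, "expected ExtenderArgs.Nodes to be populated", http.StatusBadRequest)
		return
	}
	result := extenderv1.ExtenderFilterResult{
		Nodes:       &v1.NodeList{},
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	wantsGPU := podRequestsGPU(args.Pod)
	for _, node := range args.Nodes.Items {
		gpu, hasGPU := node.Status.Allocatable[gpuResource]
		if hasGPU && !gpu.IsZero() && !wantsGPU {
			result.FailedNodes[node.Name] = "non-GPU pod not allowed on GPU node"
			continue
		}
		result.Nodes.Items = append(result.Nodes.Items, node)
	}
	_ = json.NewEncoder(w).Encode(result)
}

func main() {
	http.HandleFunc("/filter", filter)           // URL path is illustrative
	log.Fatal(http.ListenAndServe(":8888", nil)) // port is illustrative
}
```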

daemonset pod does not request any GPUs. Because of the filter plugin explained above, daemonset pods can get stuck in
pending state forever.

So, the scheduler plugin will exclude the pods which contain the GPU device-plugin container image from filtering

Member

Maybe we should exclude all daemonset pods?
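The KEP text above keys the exemption on the GPU device-plugin container image; the suggestion here would broaden it to anything controlled by a DaemonSet. A sketch of that check (illustrative, not from the PR):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isDaemonSetPod reports whether the pod is controlled by a DaemonSet, so a
// GPU-node filter could skip it entirely instead of matching on container
// images such as the GPU device plugin's.
func isDaemonSetPod(pod *v1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Controller != nil && *ref.Controller && ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

func main() {
	ctrl := true
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "apps/v1",
				Kind:       "DaemonSet",
				Name:       "nvidia-device-plugin", // hypothetical DaemonSet name
				Controller: &ctrl,
			}},
		},
	}
	fmt.Println(isDaemonSetPod(pod)) // true
}
```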

Comment on lines +228 to +233
For a given pod resource, Filter plugin will check all the nodes in the cluster. It will check
whether the node is GPU node or not by checking node’s Allocatable field. If a node has an allocatable
GPU resource, it will consider that node as GPU node. 'Allocatable' on a Kubernetes
node is defined as the amount of compute resources that are available for pods. Note that node
resources which have been allocated to pods running on that node will not be subtracted from the
node allocatable resources.
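To ground the quoted design, a compact sketch of such a Filter plugin is shown below. It is illustrative only: the plugin name and the nvidia.com/gpu resource name are assumptions, the constructor and registration wiring are omitted, and framework signatures can differ slightly between Kubernetes releases.

```go
package gpufilter

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// Name is a hypothetical plugin name, not the one used in the PR.
	Name            = "GPUNodeFilter"
	gpuResourceName = v1.ResourceName("nvidia.com/gpu") // assumed GPU resource
)

// GPUNodeFilter rejects GPU nodes for pods that do not request any GPUs.
type GPUNodeFilter struct{}

var _ framework.FilterPlugin = &GPUNodeFilter{}

func (pl *GPUNodeFilter) Name() string { return Name }

// Filter looks at the node's Allocatable to decide whether it is a GPU node,
// and at the pod's container requests/limits to decide whether it wants GPUs.
func (pl *GPUNodeFilter) Filter(_ context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	gpu, nodeHasGPU := node.Status.Allocatable[gpuResourceName]
	if !nodeHasGPU || gpu.IsZero() {
		return nil // not a GPU node, nothing to filter
	}
	if podRequestsGPU(pod) {
		return nil // GPU pod on a GPU node is fine
	}
	return framework.NewStatus(framework.Unschedulable, "non-GPU pod not allowed on GPU node")
}

func podRequestsGPU(pod *v1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits[gpuResourceName]; ok && !q.IsZero() {
			return true
		}
		if q, ok := c.Resources.Requests[gpuResourceName]; ok && !q.IsZero() {
			return true
		}
	}
	return false
}
```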

Contributor

Assuming there's no high pod churn, and assuming the working set of pods is relatively stable (e.g. pods tend to be long-running), would https://github.com/kubernetes-sigs/descheduler help in this use case?
Maybe not; let's discuss in the Alternatives section.


Microservice use cases have low pod churn in general, but batch workload use cases have very high pod churn.

Contributor

Fine. This deserves to be recorded in the Alternatives section, explaining why the proposed solution is better than the descheduler, and in which scenarios.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
7 participants