KEP - Add a Filter plugin to ensure that non-GPU pods are not scheduled on GPU nodes #812

Open
wants to merge 1 commit into base: master

Conversation

@Zeel-Patel commented Oct 23, 2024


This plugin is widely used at Uber, and we think it can be useful for the open source community.

KEP doc as requested in PR #788.


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the do-not-merge/release-note-label-needed, cncf-cla: yes, and needs-ok-to-test labels on Oct 23, 2024
@k8s-ci-robot
Contributor

Hi @Zeel-Patel. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Zeel-Patel
Once this PR has been reviewed and has the lgtm label, please assign ffromani for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/L label on Oct 23, 2024

netlify bot commented Oct 23, 2024

Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

🔨 Latest commit: bed72f0
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/6718c1d59c391d000856303a

@ffromani
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Oct 25, 2024
@k8s-ci-robot
Contributor

@Zeel-Patel: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-scheduler-plugins-verify | Commit: bed72f0 | Details: link | Required: true | Rerun command: /test pull-scheduler-plugins-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +401 to +402
The drawback in this approach is that, either pod creator(client) needs to add tolerations or it needs to be added by
kubernetes cluster manager through mutating admission controller without client's knowledge.

Contributor

ok, but this applies also to a filter plugin, doesn't it? arguably any non-core, non-default setting needs to be added by kubernetes cluster manager, and can be unknown for whatever reason to the pod submitter(s).


In the filter plugin-based approach, pod submitters can focus solely on defining their resource requirements (e.g., nvidia.com/gpu) without needing to know about specific node taints or tolerations in the pod spec. This allows the plugin to handle those details automatically, reducing the configuration burden on the pod creator.
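For illustration, here is a minimal sketch (not taken from the PR) of what the pod submitter would write under the filter-plugin approach: only a GPU resource request, with no toleration, node affinity, or scheduler name. The nvidia.com/gpu name is the extended resource exposed by NVIDIA's device plugin; the pod and image names are hypothetical.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// A pod that only states its GPU requirement; nothing scheduler-specific.
	pod := v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-trainer"}, // hypothetical name
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "trainer",
				Image: "registry.example.com/trainer:latest", // hypothetical image
				Resources: v1.ResourceRequirements{
					Limits: v1.ResourceList{
						// Extended resource advertised by the GPU device plugin.
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
	out, _ := yaml.Marshal(pod)
	fmt.Println(string(out))
}
```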

Contributor

This is true only if the filter plugin is part of the default scheduler profile.
AFAICT that will not be the case until the plugin is merged into the main k8s codebase and accepted as part of the default scheduler profile, and neither step, especially the latter, happens quickly (as the standard process mandates).

At the very least for the time being, then, the pod submitter would need to know which schedulerName to use.

Contributor

It is debatable whether there is unanimous desire for preventing scheduling of Pods not using GPU resources to nodes that have GPUs. Sometimes, you might want to do that, but sometimes, you might want to save a buck and have some lower priority Pods land in such nodes also, when resources permit. I would be a little bit surprised if such a filter plugin became a part of the default scheduler profile rapidly.

In typical CSP Kubernetes environments one doesn't get to change the scheduler settings. Any solution based on scheduler plugins will for a long time (if not forever) have to be using a mutating webhook which injects the special scheduler name plus of course you need an extra pod for running the special scheduler. You can't get away from using the mutating webhook, unless you accept modifying all workloads manually with the scheduler name. No difference there if you choose to go for scheduler plugins or if you choose to do labels and node affinity, so the least hassle would be to deploy node feature discovery and a mutating webhook without a need for scheduler changes.

In terms of resource usage, I bet the mutating webhook + NFD is a smaller burden than a mutating webhook + extra scheduler Pod. NFD isn't doing complex scheduling, it is pretty simple.


Our northstar goal is to make this plugin available as part of the default scheduler profile in the k8s-scheduler binary. I would like to understand the reservations with that approach. We are fine with the time it will take, and we would like to follow the full process.

Until this plugin is part of the k8s-scheduler binary, an alternate (temporary) approach would be to make this plugin available in this repo; people who are interested in using it can rebuild the k8s-scheduler binary by placing the plugin code in the https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/framework/plugins directory. We are providing this choice to users who want to use this plugin; of course, it will work only after it is added to the scheduler profile configuration.

Contributor

Until this plugin is part of the k8s-scheduler binary, an alternate (temporary) approach would be to make this plugin available in this repo; people who are interested in using it can rebuild the k8s-scheduler binary by placing the plugin code in the https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/framework/plugins directory.

Understood. Arguably, requiring the cluster admin to patch, rebuild, and maintain their scheduler binary is a burden which needs to be taken into account when we evaluate the pros and cons of this approach vs taints and tolerations.


It is debatable whether there is unanimous desire for preventing scheduling of Pods not using GPU resources to nodes that have GPUs. Sometimes, you might want to do that, but sometimes, you might want to save a buck and have some lower priority Pods land in such nodes also, when resources permit. I would be a little bit surprised if such a filter plugin became a part of the default scheduler profile rapidly.

Allowing non-GPU workloads on GPU nodes can be practical in on-prem environments with underutilized resources. However, in cloud settings, aggressively downscaling GPU nodes and spawning non-GPU nodes for general workloads is often more cost-effective, as GPU resources are typically pricier. For those who don't need to exclude standard pods from GPU nodes to save costs, this plugin remains optional—they can choose not to use it if they prefer a more flexible GPU node utilization approach.

Contributor

@ffromani Nov 4, 2024

Allowing non-GPU workloads on GPU nodes can be practical in on-prem environments with underutilized resources. However, in cloud settings, aggressively downscaling GPU nodes and spawning non-GPU nodes for general workloads is often more cost-effective, as GPU resources are typically pricier. For those who don't need to exclude standard pods from GPU nodes to save costs, this plugin remains optional—they can choose not to use it if they prefer a more flexible GPU node utilization approach.

This makes sense, but it is not clear to me how this would be the case if we make this plugin (as IIUC is the goal) part of the default profile. OTOH, using it in a secondary, non-default profile would be straightforward, but that would again require changes to the pod spec.

@kumariitr Nov 4, 2024

Understood. Arguably, requiring the cluster admin to patch, rebuild, and maintain their scheduler binary is a burden which needs to be taken into account when we evaluate the pros and cons of this approach vs taints and tolerations.

At least at Uber, we prefer to put such burden / extra responsibilities on the cluster-admin team and not on the customers / users of the platform. There may be more people / companies who would like to take this approach. Taints / tolerations add burden on the cluster-admin team and on customer teams, and also require coordination when hardware / nodes are managed by one team (cluster-admin) and workloads / pods are managed by multiple different teams. This coordination becomes complex in a multi-team setup in a large organisation.
So, adding a new approach of using a scheduler plugin helps many companies / people in the industry. The choice will still be theirs, whether they take this approach or the taints-tolerations approach.

Contributor

I still need to review the latest updates, but based on the comments I read, I still don't see great benefits over the existing taints and tolerations mechanism. Still, provided we explore the alternatives and acknowledge them in the KEP, I'm not against adding the plugin to this repo. This is (IIUC :) ), after all, the place where experimentation is expected to happen.

I agree the benefits would really surface, as written in another comment, only when this plugin is both merged upstream AND part of the default profile. Both steps will require large buy-in.

OTOH, a prototype will help the conversation, so that is another reason not to oppose the merge.
Will review again and comment more deeply ASAP.

The drawback in this approach is that, either pod creator(client) needs to add tolerations or it needs to be added by
kubernetes cluster manager through mutating admission controller without client's knowledge.
Taints are on nodes and tolerations are on pods. Submitter of pods and managers of nodes are two different teams sitting
in different orgs. So one can not manage both the taints and tolerations. And it will increase the chances of errors.

Contributor

I'm not against this PR for strong reasons, but I still struggle to see the benefits over taints + tolerations + a mutating webhook, because I don't see (yet) how the drawbacks listed here don't apply to the extra-filter scenario.
IOW: it seems to me the new filter will hit pretty much the same scenarios.

Member

+1


We’ve evaluated the taints and tolerations approach as well. However, from the standpoint of managing additional node configurations, it requires cluster admins to add and maintain specific taints on GPU nodes and tolerations on GPU-using pods. In contrast, the filter plugin handles these configurations automatically, significantly reducing the administrative burden.
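To make the comparison concrete, below is a sketch of the two pieces the taints-and-tolerations alternative has to keep in sync: a taint that cluster admins apply to every GPU node, and the matching toleration every GPU-consuming pod must carry (or have injected by a mutating webhook). The taint key and value are illustrative assumptions, not from the PR or any standard component.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Taint the cluster-admin team would have to apply to every GPU node.
	// The key and value are illustrative, not mandated by any component.
	gpuTaint := v1.Taint{
		Key:    "example.com/gpu-only",
		Value:  "true",
		Effect: v1.TaintEffectNoSchedule,
	}

	// Matching toleration that every GPU-consuming pod must carry, either
	// added by the pod author or injected by a mutating admission webhook.
	gpuToleration := v1.Toleration{
		Key:      "example.com/gpu-only",
		Operator: v1.TolerationOpEqual,
		Value:    "true",
		Effect:   v1.TaintEffectNoSchedule,
	}

	fmt.Printf("node taint: %+v\npod toleration: %+v\n", gpuTaint, gpuToleration)
}
```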

Contributor

One can label GPU-nodes with something like NFD and then inject node affinity to Pods with a webhook preventing scheduling to those nodes unless the Pod really uses GPU resources.

Automated, low maintenance. That's what I'd do in order to avoid maintaining a scheduler build. The gotcha is if some gitops tool will not allow using such a webhook. But usually gitops can be configured.
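A sketch of the node affinity term such a webhook might inject into pods that request no GPUs, keeping them off nodes labelled as having a GPU. The label key nvidia.com/gpu.present is commonly published for NVIDIA GPUs by GPU feature discovery; treat it, and the rest of the snippet, as an assumption rather than something defined in this PR.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Affinity a mutating webhook could add to pods that request no GPUs:
	// "do not schedule me onto nodes labelled as having a GPU".
	affinity := v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						// Label assumed to be set by NFD / GPU feature discovery.
						// NotIn also matches nodes that do not carry the label.
						Key:      "nvidia.com/gpu.present",
						Operator: v1.NodeSelectorOpNotIn,
						Values:   []string{"true"},
					}},
				}},
			},
		},
	}
	out, _ := yaml.Marshal(affinity)
	fmt.Println(string(out))
}
```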

Contributor

+1 to @uniemimu

Can we have a summary of the administrative burden for both approaches?
I'm still lacking enough evidence that this approach brings major (and thus worthy) benefits.

Contributor

Especially your column for a patched scheduler is missing a few details.

For the Setup Effort, you will need to add the burden of defining the scheduler name for every workload, unless you manage to replace the existing default scheduler, which is not typically possible in CSP environments. If you don't do those workload definition changes by hand, add a webhook. If you do it by hand for every workload, you could do manual changes for workload node affinity by hand also in the left column for NFD. So either remove webhook from the left column or add it to the middle column.

For the ongoing maintenance about labels, the GPU vendors tend to have their own set of labels which has been rather stable, but varies between vendors. The dominating market leader has NFD installation as an option in its GPU device plugin helm chart. It's not like it would be an effort to set up and maintain.

The node-level overhead for the patched scheduler version will only be "none" in case you replace the existing default scheduler, which is usually not possible in CSP managed kubernetes environments. So you typically need an extra node, possibly a big one, if the cluster is big. The NFD worker daemonset is small and in the case of those GPU nodes where you didn't want extra Pods, it shouldn't force you to increase your node counts unless 5 millicores and 128MB of memory is too much. On other nodes then, there is that amount of unnecessary resource usage. So the row there in my opinion would be "typically 0 new nodes required, plausibly 1 extra node" | "typically 1 extra node required" | "0"

From the overall hassle point of view having a default scheduler forcing this filtering for all users by default is of course the simplest, and the right column rightfully shows that.

What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented in sig-scheduling?

Contributor

@ffromani Nov 4, 2024

What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented in sig-scheduling?

+1 again. I can see a way to have this code accepted in kubernetes-sigs eventually, but changing the default profile is a much more impactful change which would require extensive discussion and evaluation.

Will review the rest shortly.

@kumariitr Nov 4, 2024

I think the whole discussion is revolving around "CSP managed kubernetes environments". Why so? There are companies (especially the larger companies) who manage their own Kubernetes environments / cluster setups. For them it makes sense to use hardware from the cloud as IaaS and then install Kubernetes on top of it themselves. Not everyone is willing to use GKE, EKS, or any other "CSP managed kubernetes environment".
Let's talk about open source Kubernetes and not conflate it with CSP goals.


What the table doesn't show is whether there is widespread desire to have this filtering happen for all non-GPU workloads. I know you want it, but does everybody? That's a pretty big change to the current way of scheduling. Has this been presented in sig-scheduling?

If that's part of the process before merging a new plugin into this repo, then let's follow the process. We can present this to sig-scheduling; that's a good suggestion. @gkeesh7 @Zeel-Patel please prepare for that.

Contributor

From where I'm standing, I don't oppose having the plugin here in this repository. I just don't see much benefit myself, since I believe I could do the same with less of a maintenance burden, basically without having to keep building a custom scheduler. But from the perspective of providing an alternative for those who really prefer building their own scheduler, I get it.

There's still yet another approach available for achieving the same thing, and that is scheduler extenders. With extenders you wouldn't have to rebuild the whole scheduler, you'd only have to build the extender and hook it up. The filter functionality is available in extenders. The downside of those is that you still need to configure the scheduler, and you still need to maintain your own scheduler extender container. Plus you take a performance hit in scheduling, which can be mitigated by the use of caches and only passing in node names to the extender. But you still take a measurable performance hit.

My view is that real benefits in terms of making things simple start to only surface if you get this to Kubernetes upstream all the way into the default profile where it would be enabled by default, and that requires sig-scheduling to buy the idea.
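For completeness, here is a rough sketch of what the extender's filter verb mentioned above could look like as an HTTP handler. All names, the path, and the port are assumptions; for simplicity it expects full Node objects (nodeCacheCapable=false), so it skips the node-name-only optimisation discussed above, and the extender section would still have to be added to the scheduler configuration.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	v1 "k8s.io/api/core/v1"
	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

const gpuResource = v1.ResourceName("nvidia.com/gpu") // assumed GPU resource name

func podRequestsGPU(pod *v1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits[gpuResource]; ok && !q.IsZero() {
			return true
		}
		if q, ok := c.Resources.Requests[gpuResource]; ok && !q.IsZero() {
			return true
		}
	}
	return false
}

// filter implements the extender "filter" verb: it drops GPU nodes for pods
// that do not request GPUs and passes everything else through unchanged.
func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	if args.Nodes == nil {
		// This sketch expects full Node objects (nodeCacheCapable=false).
		http.Error(w, "expected ExtenderArgs.Nodes to be populated", http.StatusBadRequest)
		return
	}
	result := extenderv1.ExtenderFilterResult{
		Nodes:       &v1.NodeList{},
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	wantsGPU := podRequestsGPU(args.Pod)
	for _, node := range args.Nodes.Items {
		gpu, hasGPU := node.Status.Allocatable[gpuResource]
		if hasGPU && !gpu.IsZero() && !wantsGPU {
			result.FailedNodes[node.Name] = "non-GPU pod not allowed on GPU node"
			continue
		}
		result.Nodes.Items = append(result.Nodes.Items, node)
	}
	_ = json.NewEncoder(w).Encode(result)
}

func main() {
	http.HandleFunc("/filter", filter)           // URL path is illustrative
	log.Fatal(http.ListenAndServe(":8888", nil)) // port is illustrative
}
```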

daemonset pod does not request any GPUs. Because of the filter plugin explained above, daemonset pods can get stuck in
pending state forever.

So, the scheduler plugin will exclude the pods which contain the GPU device-plugin container image from filtering

Member

Maybe we should exclude all daemonset pods?
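The KEP text above keys the exemption on the GPU device-plugin container image; the suggestion here would broaden it to anything controlled by a DaemonSet. A sketch of that check (illustrative, not from the PR):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isDaemonSetPod reports whether the pod is controlled by a DaemonSet, so a
// GPU-node filter could skip it entirely instead of matching on container
// images such as the GPU device plugin's.
func isDaemonSetPod(pod *v1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Controller != nil && *ref.Controller && ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

func main() {
	ctrl := true
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "apps/v1",
				Kind:       "DaemonSet",
				Name:       "nvidia-device-plugin", // hypothetical DaemonSet name
				Controller: &ctrl,
			}},
		},
	}
	fmt.Println(isDaemonSetPod(pod)) // true
}
```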

Comment on lines +228 to +233
For a given pod resource, Filter plugin will check all the nodes in the cluster. It will check
whether the node is GPU node or not by checking node’s Allocatable field. If a node has an allocatable
GPU resource, it will consider that node as GPU node. 'Allocatable' on a Kubernetes
node is defined as the amount of compute resources that are available for pods. Note that node
resources which have been allocated to pods running on that node will not be subtracted from the
node allocatable resources.
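To ground the quoted design, a compact sketch of such a Filter plugin is shown below. It is illustrative only: the plugin name and the nvidia.com/gpu resource name are assumptions, the constructor and registration wiring are omitted, and framework signatures can differ slightly between Kubernetes releases.

```go
package gpufilter

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// Name is a hypothetical plugin name, not the one used in the PR.
	Name            = "GPUNodeFilter"
	gpuResourceName = v1.ResourceName("nvidia.com/gpu") // assumed GPU resource
)

// GPUNodeFilter rejects GPU nodes for pods that do not request any GPUs.
type GPUNodeFilter struct{}

var _ framework.FilterPlugin = &GPUNodeFilter{}

func (pl *GPUNodeFilter) Name() string { return Name }

// Filter looks at the node's Allocatable to decide whether it is a GPU node,
// and at the pod's container requests/limits to decide whether it wants GPUs.
func (pl *GPUNodeFilter) Filter(_ context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	gpu, nodeHasGPU := node.Status.Allocatable[gpuResourceName]
	if !nodeHasGPU || gpu.IsZero() {
		return nil // not a GPU node, nothing to filter
	}
	if podRequestsGPU(pod) {
		return nil // GPU pod on a GPU node is fine
	}
	return framework.NewStatus(framework.Unschedulable, "non-GPU pod not allowed on GPU node")
}

func podRequestsGPU(pod *v1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits[gpuResourceName]; ok && !q.IsZero() {
			return true
		}
		if q, ok := c.Resources.Requests[gpuResourceName]; ok && !q.IsZero() {
			return true
		}
	}
	return false
}
```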

Contributor

Assuming there's no high pod churn, and assuming the working set of pods is relatively stable (e.g. pods tend to be long-running), would https://github.com/kubernetes-sigs/descheduler help in this use case?
Maybe not; let's discuss in the Alternatives section.


Microservice use cases have low pod churn in general, but batch workload use cases have very high pod churn.

Contributor

Fine. This deserves to be recorded in the Alternatives section, explaining why the proposed solution is better than the descheduler, and in which scenarios.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
7 participants