Tutorial on Dynamic GPU Partitioning with MIG to Maximize the Utilization of GPUs in Kubernetes

nos, an open-source module to maximize GPU utilization in Kubernetes

Partitioning is a way to divide GPU resources into smaller slices. This allows Pods to be scheduled only on the memory/compute resources they actually need, thus increasing GPU utilization and reducing infrastructure costs in Kubernetes clusters.

To minimize infrastructure expenses, it’s crucial to use GPU accelerators in the most efficient way. One method to achieve this is by dividing the GPU into smaller partitions, called slices, so that containers can request only the strictly necessary resources. Some workloads may only require a minimal amount of the GPU’s compute and memory, so having the ability in Kubernetes to divide a single GPU into multiple slices, which can be requested by individual containers, is essential.

This is particularly relevant for large Kubernetes clusters used for running Artificial Intelligence (AI) and High-Performance Computing (HPC) workloads, where inefficiencies in GPU utilization can have a significant impact on infrastructure expenses. These inefficiencies are generally due to lightweight tasks that do not fully utilize a GPU, such as inference servers and Jupyter Notebooks used for preliminary data and model exploration.

For example, researchers from the European Organization for Nuclear Research (CERN) published a blog post about how they used MIG GPU partitioning to address low GPU utilization caused by spiky workloads running High Energy Physics (HEP) simulations and by code inefficiencies that lock whole GPUs.

The NVIDIA GPU Operator enables using MIG in Kubernetes, but it alone is insufficient to ensure efficient GPU partitioning. In this article, we examine the reasons for this and offer a more effective solution for using MIG in Kubernetes: Dynamic MIG Partitioning.

MIG support in Kubernetes

MIG support in Kubernetes is provided by the NVIDIA device plugin, which exposes MIG devices, i.e. isolated GPU partitions, either as generic nvidia.com/gpu resources or as specific resource types such as nvidia.com/mig-1g.10gb.

Manually managing MIG devices through nvidia-smi is impractical, so NVIDIA provides a tool called nvidia-mig-parted. This allows cluster admins to declaratively define the set of MIG devices they need on all GPUs on a node. The tool automatically manages the GPU partitioning state to match the desired configuration. For instance, below is an example of configurations taken from the nvidia-mig-parted GitHub repository:
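A minimal sketch of such a configuration file, abbreviated and following the format of the example config in the nvidia-mig-parted repository. The file name mig-parted-config.yaml is an arbitrary choice made for this illustration:

```shell
# Write an abbreviated nvidia-mig-parted configuration to a local file.
# Each named entry under mig-configs declares the MIG devices wanted on a set of GPUs.
cat > mig-parted-config.yaml <<'EOF'
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false

  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7

  all-3g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2
EOF
```

Each entry is a named, complete description of the partitioning state of the node's GPUs, which is why switching between entries replaces the whole layout rather than adding a single device.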

In Kubernetes, cluster admins generally do not use nvidia-mig-parted directly; they use it through the NVIDIA GPU Operator.

This operator further simplifies the application of MIG configurations. After creating a ConfigMap defining a set of allowed MIG configurations, the NVIDIA GPU Operator only requires you to label the nodes with nvidia.com/mig.config and specify as value the name of the specific configuration you want to apply on that node.

For instance, referring to the configuration defined above, we could apply the config all-3g.20gb to the node node-1 as follows:
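A minimal sketch of that labeling step, wrapped in a small helper. The function name apply_mig_config is made up for this example, while the label key nvidia.com/mig.config is the one described above:

```shell
# Apply a named MIG configuration to a node by setting the
# nvidia.com/mig.config label that the NVIDIA GPU Operator watches.
apply_mig_config() {
  node="$1"     # node name, e.g. node-1
  config="$2"   # name of an entry in the allowed configurations, e.g. all-3g.20gb
  # --overwrite lets us replace a previously applied configuration name.
  kubectl label nodes "$node" "nvidia.com/mig.config=$config" --overwrite
}

# Usage: apply_mig_config node-1 all-3g.20gb
```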

Static MIG configurations cause poor usability

The NVIDIA GPU Operator has a significant limitation: MIG devices are created through static configurations.

This means that the cluster admin has to first define all the possible MIG configurations they think might be required in the cluster, and then manually apply them to each node according to their needs.

This way of managing MIG devices easily leads to inefficient GPU utilization and to time wasted by the cluster admin on changing MIG configurations. GPU memory and compute requirements vary from Pod to Pod and change over time. To achieve optimal GPU utilization as new Pods with different MIG resource requests are created, the cluster admin would have to constantly find and apply the most suitable configuration for each node of the cluster, which is very impractical.

As a trivial example, suppose we need to schedule multiple Pods that each require 20 GB of GPU memory. We would create and apply a configuration that provides only nvidia.com/mig-3g.20gb devices on all the GPUs in our cluster, since it makes full use of the GPU memory. Later, however, the cluster receives a request to create some Pods that require fewer resources, say 10 GB of GPU memory, corresponding to the MIG profile nvidia.com/mig-2g.10gb. These Pods will not be scheduled until the cluster admin changes the label of at least one node, applying a MIG configuration that provides the requested profile.

Complications do not end here. While a certain configuration might provide the required MIG resources, at the same time it might also remove some of the devices that are currently in use by containers. In such cases it is up to the cluster admins to find or create the most suitable configuration and to ensure it does not delete any of the devices in use, which introduces significant operational costs.

This approach simply does not scale. With the NVIDIA GPU Operator alone, it is impractical to constantly adjust the MIG configurations to workload demand, resulting in both unused MIG devices and pending Pods.

Let’s see how we can solve this issue with Dynamic MIG Partitioning.

Dynamic MIG Partitioning

Dynamic MIG Partitioning automates the creation and deletion of MIG profiles based on real-time requirements of the workloads in the cluster, ensuring that the optimal MIG configuration is always applied to the available GPUs.

To apply dynamic partitioning, we need to use nos, an open-source module that runs alongside the NVIDIA GPU Operator and makes MIG partitioning dynamic.

You can think of nos as a Cluster Autoscaler for GPUs: instead of scaling up the number of nodes and GPUs, it dynamically partitions them to maximize their utilization, freeing up spare GPU capacity. You can then schedule more Pods or reduce the number of GPU nodes, cutting infrastructure costs.

With nos, there is no need to manually create and manage MIG configurations. Simply submit your Pods to the cluster and the requested MIG devices are automatically provisioned.

Let’s explore how nos and Dynamic MIG Partitioning work in practice.

Prerequisites

As already mentioned, nos does not replace the NVIDIA GPU Operator but works alongside it. Hence, you need to install the NVIDIA GPU Operator first, meeting two requirements:

mig.strategy must be set to mixed, so that every different MIG profile is exposed to Kubernetes as a specific resource type
migManager must be disabled

If not already done, you can install the NVIDIA GPU Operator with Helm as follows:
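A minimal sketch of such an installation, applying the two requirements above as chart values. The helper name install_gpu_operator, the gpu-operator namespace and the generated release name are choices made for this example; you may also want to pin a chart version:

```shell
# Install the NVIDIA GPU Operator with the settings nos requires:
# mixed MIG strategy and the MIG manager disabled.
install_gpu_operator() {
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  helm repo update
  helm install --wait --generate-name \
    --namespace gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set mig.strategy=mixed \
    --set migManager.enabled=false
}

# Usage: install_gpu_operator
```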

By default, MIG mode is not activated on NVIDIA GPUs. So, first you need to enable MIG on all the GPUs you want to be partitioned. You can do this by SSHing into the node and running the following command for each GPU, where <index> corresponds to their respective index:
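A minimal sketch of that step, to be run on the GPU node itself. The helper name enable_mig is made up for this example, and it assumes nvidia-smi is installed on the node and that you have sudo rights:

```shell
# Enable MIG mode on each GPU index passed as an argument.
enable_mig() {
  for index in "$@"; do
    # -i selects the GPU by index, -mig 1 turns MIG mode on for it.
    sudo nvidia-smi -i "$index" -mig 1
  done
}

# Usage, for a node with two GPUs: enable_mig 0 1
```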

Depending on the type of machine you are using, it may be necessary to reboot the node after this operation. For more information and troubleshooting, you can refer to the NVIDIA MIG User Guide.

Installation

Once you have installed the NVIDIA GPU Operator and enabled MIG mode on your GPUs, you can simply install nos as follows:
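A rough sketch of the installation with Helm. The chart reference and the nebuly-nos namespace below are assumptions based on how the project has been distributed, and install_nos is a name made up for this example, so check the nos documentation for the current chart location and version before running it:

```shell
# Install nos from its Helm chart.
# ASSUMPTION: chart reference and namespace may differ from the current nos release.
install_nos() {
  helm install oci://ghcr.io/nebuly-ai/helm-charts/nos \
    --namespace nebuly-nos --create-namespace \
    --generate-name
}

# Usage: install_nos
```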

That’s it! Now you are ready to activate Dynamic MIG Partitioning on your nodes.

Dynamic partitioning in action

First, you need to tell nos which nodes it should manage GPU partitioning for. Label those nodes as follows:
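A minimal sketch of the labeling step. The label key and value nos.nebuly.com/gpu-partitioning=mig are an assumption taken from the nos documentation as we remember it, and enable_dynamic_mig is a name made up for this example, so verify the label against the version of nos you installed:

```shell
# Mark one or more nodes as "MIG nodes" managed by nos.
# ASSUMPTION: label key/value as documented by nos; verify for your version.
enable_dynamic_mig() {
  for node in "$@"; do
    kubectl label nodes "$node" "nos.nebuly.com/gpu-partitioning=mig" --overwrite
  done
}

# Usage: enable_dynamic_mig <node-name> [<node-name> ...]
```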

This label marks a node as a “MIG node”, delegating the management of MIG devices of all the node’s GPUs to nos.

After that, you can submit workloads requesting MIG resources. nos will automatically find and apply the best MIG configuration on the GPUs of the nodes you previously marked as “MIG nodes”, creating the missing MIG devices requested by Pods and deleting the unnecessary unused ones.

Let’s take a look at a simple example of nos in action.

Assume we are operating a simple cluster with two nodes, one of which has a single NVIDIA A100 80GB. We have already enabled MIG mode on that GPU, so we can enable automatic partitioning for that node:

The output of kubectl describe node aks-gpua100-24975740-vmss000000 shows that the node does not have any available MIG resources, since no MIG device has been requested or created yet:
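To check this from the command line without reading the full node description, a small helper can filter it for MIG resource names. The function name list_mig_resources is made up for this example:

```shell
# Print only the lines of the node description that mention MIG resources,
# or a short message when the node reports none.
list_mig_resources() {
  kubectl describe node "$1" | grep 'nvidia.com/mig-' \
    || echo "no MIG resources reported on $1"
}

# Usage: list_mig_resources <node-name>
```

Running the same helper again after nos has partitioned the GPU is a quick way to see which MIG devices were created.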

Let’s create some Pods requesting MIG resources. In this case, we create a deployment with 5 replicas of a Pod with a container requesting a GPU slice of 10 GB of memory.
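A minimal sketch of such a deployment, written to a local file. The names mig-demo and sleepy, the file name and the busybox image are placeholders chosen for this example; a real workload would use its own GPU-enabled image:

```shell
# Write a Deployment with 5 replicas, each requesting one 1g.10gb MIG device.
cat > demo-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mig-demo            # placeholder name
  namespace: demo
spec:
  replicas: 5
  selector:
    matchLabels:
      app: mig-demo
  template:
    metadata:
      labels:
        app: mig-demo
    spec:
      containers:
        - name: sleepy
          image: busybox:stable          # placeholder image, does not use the GPU
          command: ["sh", "-c", "sleep 3600"]
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1  # one GPU slice with 10 GB of memory
EOF
```

You can then create the namespace with kubectl create namespace demo and submit the file with kubectl apply -f demo-deployment.yaml.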

There are now 5 pending Pods in the namespace demo, requesting a total of five nvidia.com/mig-1g.10gb resources which are not yet available in the cluster:

Within a few seconds, nos detects these pending Pods and tries to create the requested resources by selecting the most suitable MIG configuration. In this example, nos applies a configuration that provides five 1g.10gb devices and one 2g.20gb device:

If we check the state of the Pods once again, we can see that they are now in the Running state:

Note that, besides the 1g.10gb devices, nos also created an additional 2g.20gb device. This is because each MIG GPU model only supports a specific set of configurations, and, in this scenario, the best configuration that met the required devices also included the 2g.20gb device. Keep in mind that:

nos selects the configuration that allows scheduling the highest number of pending Pods, which it determines through a scheduling simulation run by its internal scheduler.
MIG devices that are already in use are never deleted. Any MIG configuration that would require the deletion of these devices is rejected.
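One way to sanity-check the configuration applied in this example is to count compute slices. In a MIG profile name, the number before the g is the number of compute slices the device occupies, and an A100 exposes seven of them. The helper below is a rough illustration of that arithmetic only; total_compute_slices is a made-up name, and this is not how nos itself evaluates configurations:

```shell
# Sum the compute slices used by a set of MIG devices.
# Each argument has the form <profile>:<count>, e.g. 1g.10gb:5.
total_compute_slices() {
  total=0
  for item in "$@"; do
    profile=${item%%:*}        # e.g. 1g.10gb
    count=${item##*:}          # e.g. 5
    slices=${profile%%g.*}     # digits before "g", e.g. 1
    total=$((total + slices * count))
  done
  echo "$total"
}

total_compute_slices 1g.10gb:5 2g.20gb:1   # prints 7
```

Five 1g.10gb devices and one 2g.20gb device together occupy all seven compute slices of the GPU, so the applied configuration leaves no compute capacity unpartitioned.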

Conclusions

The possibility of requesting GPU slices is crucial for improving GPU utilization and cutting down infrastructure costs.

NVIDIA MIG makes it possible to create fully isolated GPU instances with dedicated memory and compute resources, but the support in Kubernetes provided by the NVIDIA GPU Operator is not enough if we want to achieve operational excellence. Static configurations do not automatically adjust to the changing demands of workloads and thus are inadequate to provide each Pod with the GPU slices it requires, especially when workloads demand a variety of slices, in terms of memory and compute, that changes over time.

nos overcomes the limitations of the NVIDIA GPU Operator's static configurations through Dynamic GPU Partitioning, which increases GPU utilization and reduces the operational burden of manually defining and applying MIG configurations on the cluster nodes.

It is worth noting that NVIDIA MIG has its limitations and is not the only partitioning technology, nor the only way to increase the utilization in a Kubernetes cluster. Specifically, MIG is only supported on newer architectures (Ampere and Hopper) and does not offer fine-grained GPU partitioning, meaning it is not possible to create GPU slices with arbitrary memory and compute resources.

To overcome these limitations, nos also offers Dynamic GPU Partitioning through NVIDIA Multi-Process Service (MPS), a partitioning technology that is supported by almost all NVIDIA GPUs and allows creating slices with any desired amount of memory. You can find more information on Dynamic MPS partitioning here.

Resources