Fractional GPUs in Kubernetes

The GenAI revolution has led to a surge in GPU demand across the industry. Companies want to train, fine-tune, and deploy LLMs at massive scale, which has reduced availability and driven up prices for the latest GPUs. Companies running workloads on public clouds have suffered from high prices and growing uncertainty around GPU availability.

These new realities make utilizing available GPUs to the fullest extent absolutely critical. Partitioning or sharing a single GPU between multiple processes helps with this, and implementing it on top of Kubernetes is a winning combination: we get autoscaling and a sophisticated scheduler to help optimize GPU utilization.

Options for sharing GPUs

To share a single GPU between multiple workloads in Kubernetes, we have the following options:


Multi-Instance GPU (MIG)

Multi-Instance GPU (MIG) allows GPUs based on the NVIDIA Ampere architecture (such as the NVIDIA A100) to be securely partitioned into separate GPU instances for CUDA applications. Each partition is fully memory- and compute-isolated and can provide predictable throughput and latency.

A single NVIDIA A100 GPU can be partitioned into up to 7 isolated GPU instances. Each partition appears as a separate GPU to the software running on a partitioned node. Other MIG-supported GPUs and their supported partition counts are listed in the NVIDIA MIG documentation.
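Once a node is partitioned, a pod requests a MIG slice much like it would request a full GPU. A minimal sketch, assuming the nvidia-device-plugin runs with the `mixed` MIG strategy (which exposes each profile as its own resource); the pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  containers:
    - name: cuda-workload
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi", "-L"]    # the container sees only its assigned slice
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1     # resource name under the 'mixed' MIG strategy
```

Under the alternative `single` strategy, slices are instead advertised under the plain `nvidia.com/gpu` resource name.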


Pros:

  • Full compute and memory isolation that can support predictable latency and throughput
  • The nvidia-device-plugin for Kubernetes has native support for MIG


Cons:

  • Only supported on recent GPUs such as the A100, H100, and A30, which limits the hardware options
  • The number of partitions has a hard limit of 7 for most architectures, which is restrictive when running many small workloads with limited memory and compute requirements

Time slicing

Time slicing enables multiple workloads to be scheduled on the same GPU. Compute time is shared between the processes, which are interleaved in time. A cluster administrator can configure a cluster or node to advertise a certain number of replicas per GPU, which reconfigures the nodes accordingly.


Pros:

  • No upper limit to the number of pods that can share a single GPU
  • Works with older generations of NVIDIA GPUs


Cons:

  • No memory or fault isolation. There is no built-in way to ensure a workload doesn’t overrun the memory assigned to it.
  • Time slicing gives equal time to all running processes, so a pod running multiple processes can hog the GPU far more than intended
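To make the first point concrete: with time slicing, every replica is advertised under the same `nvidia.com/gpu` resource name, so a pod requesting one replica gets an unisolated share of the physical card, and nothing stops it from allocating the device's entire memory. A minimal pod sketch (name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: time-sliced-pod
spec:
  containers:
    - name: worker
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1   # one time-sliced replica, not a whole isolated GPU
```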

Time slicing Demo

Let's go through a short walkthrough of how we can use time slicing on Azure Kubernetes Service (AKS). We start with an already existing Kubernetes cluster.

  1. Add a GPU-enabled node pool to the cluster -

    $ az aks nodepool add \
        --name <nodepool-name> \
        --resource-group <resource-group-name> \
        --cluster-name <cluster-name> \
        --node-vm-size Standard_NC4as_T4_v3 \
        --node-count 1

    This adds a new node pool with a single node carrying one NVIDIA T4 GPU to the existing AKS cluster. This can be verified by running the following -

    $ kubectl get nodes <gpu-node-name> -o 'jsonpath={.status.allocatable.nvidia\.com\/gpu}'
  2. Install the NVIDIA GPU operator -

    $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
        && helm repo update
    $ helm install gpu-operator nvidia/gpu-operator \
        -n gpu-operator --create-namespace \
        --set driver.enabled=false \
        --set toolkit.enabled=false \
        --set operator.runtimeClass=nvidia-container-runtime
  3. Once the operator is installed, we create a time slicing configuration and configure the whole cluster to slice the GPU resources where available -

    $ kubectl apply -n gpu-operator -f - <<EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config
    data:
      any: |-
        version: v1
        flags:
          migStrategy: none
        sharing:
          timeSlicing:
            renameByDefault: false
            failRequestsGreaterThanOne: false
            resources:
              - name: nvidia.com/gpu
                replicas: 10
    EOF
    # Reconfigure the GPU operator to pick up the config map
    $ kubectl patch clusterpolicy/cluster-policy \
        -n gpu-operator --type merge \
        -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
  4. Verify that the existing node has been successfully reconfigured -

    $ kubectl get nodes <gpu-node-name> -o 'jsonpath={.status.allocatable.nvidia\.com\/gpu}'
  5. We can verify the configuration by creating a deployment with 4 replicas, each requesting 2 GPU resources -

    $ kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: time-slicing-verification
      labels:
        app: time-slicing-verification
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: time-slicing-verification
      template:
        metadata:
          labels:
            app: time-slicing-verification
        spec:
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          hostPID: true
          containers:
            - name: cuda-sample-vector-add
              image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
              command: ["/bin/bash", "-c", "--"]
              args:
                - while true; do /cuda-samples/vectorAdd; done
              resources:
                limits:
                  nvidia.com/gpu: 2
    EOF

Verify that all the pods of this deployment have come up on the same, already created node, i.e. the single T4 GPU was able to accommodate all of them.
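One quick way to check this, using the app label from the deployment above: list the pods with the node each was scheduled on and confirm the NODE column shows the same GPU node for all four.

```shell
$ kubectl get pods -l app=time-slicing-verification -o wide
```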


Conclusion

The GenAI revolution has changed the landscape of GPU requirements and made responsible resource utilization more critical than ever. Both approaches outlined here have shortcomings, but in the current climate there is no way around being deliberate about GPU costs.