In-Place Vertical Scaling for LLM Workloads in Kubernetes

By Parminder Singh

Stateless microservices are the gold standard for scaling applications in the cloud: they can scale out and back in quickly and absorb massive bursts of traffic. LLM inference engines are nominally stateless, but low-latency serving relies on preserving the KV cache across requests. In this post, I look at how Kubernetes 1.35 advances in-place vertical scaling of running pods. The approach applies to any workload where you want to scale resources without restarting the process, with LLM inference being a particularly good fit.

Hot air balloons. Photo by R M on Unsplash.

To maintain a conversation or analyze a large codebase, an LLM relies on the KV (Key-Value) Cache. This is essentially the model's working memory of the current context. For large context windows, the KV cache can grow to several gigabytes and is typically held in GPU memory (VRAM), with additional host memory used by the inference runtime.
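As a rough back-of-envelope, the per-token KV footprint is 2 (K and V) x layers x KV heads x head dimension x bytes per element. The sketch below uses illustrative model dimensions, not any specific model, to show how quickly this adds up:

# Rough KV cache sizing; the model dimensions below are illustrative assumptions.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for the separate K and V tensors; fp16/bf16 = 2 bytes per element
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# e.g. 32 layers, 8 KV heads of dimension 128, holding a 32k-token context:
print(kv_cache_bytes(32, 8, 128, 32_000) / 1024**3)  # ~3.9 GiB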

When we handle a burst of LLM traffic with horizontal scaling, we hit the classic cold-start problem. A new pod starts with an empty KV cache, and before it can generate a single token for a user it must re-process the entire prompt prefix. For the user, this means the time to first token (TTFT) spikes.

Externalizing Cache

While externalizing the cache to Redis or a disaggregated KV store is a valid pattern, it introduces a network-to-device-memory bottleneck. Modern context windows are massive, and moving a multi-gigabyte cache over a network link can be slower than recomputing the prefill for long-context, latency-sensitive workloads once you include serialization, GPU copies, and coordination overhead. Kubernetes in-place resize does not change GPU memory capacity, but it avoids process restarts, allowing GPU-resident KV cache to be preserved.

For ultra-low latency, the "State" (the cache) typically needs to stay in local device memory.
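To make the tradeoff concrete, here is a rough comparison with made-up but plausible numbers; the bandwidth and prefill throughput are assumptions, not benchmarks:

# Shipping a KV cache over the network vs. recomputing the prefill.
cache_gib = 4                    # KV cache to move (GiB)
link_gbps = 10                   # effective network bandwidth (Gbit/s)
prefill_tokens = 32_000          # prompt prefix length
prefill_tok_per_s = 10_000       # assumed prefill throughput of the engine

transfer_s = cache_gib * 8.59 / link_gbps      # 1 GiB ~= 8.59 Gbit, raw transfer only
recompute_s = prefill_tokens / prefill_tok_per_s

print(f"transfer ~{transfer_s:.1f}s vs recompute ~{recompute_s:.1f}s")
# Raw transfer alone (~3.4s) is already in the same ballpark as recompute (~3.2s);
# serialization, host-to-GPU copies, and coordination push it further.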

We can now mutate CPU and memory requests and limits on a running pod without a restart. The kubelet coordinates an in-place resource update (cgroup changes plus container runtime support), though the outcome still depends on the runtime and on available node capacity. For LLM workloads, this means the inference process can scale CPU and host memory without a restart, avoiding a flush of in-memory state such as the KV cache, as long as the GPU remains attached and the process stays alive. There is a nuance, though: growing memory is generally workload- and runtime-dependent, and shrinking memory is best-effort, which LLM inference does not rely on anyway.
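How the kubelet reacts to a resize can also be declared per resource on the container via resizePolicy. A minimal sketch, assuming the defaults (NotRequired is already the default for both resources), just makes the no-restart expectation explicit:

spec:
  containers:
  - name: kv-simulator
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # resize CPU without restarting the container
    - resourceName: memory
      restartPolicy: NotRequired      # default; RestartContainer would force a restart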

Demo

In the demo below, I've created a simple containerized service to simulate this behavior.

Install k8s 1.35 locally (I used minikube).

minikube start --kubernetes-version=v1.35.0 --cpus=4 --memory=6g
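In-place resize has been enabled by default since it went beta in Kubernetes 1.33, so no extra flags are needed on 1.35. If you test against an older alpha release, the feature gate has to be turned on explicitly, roughly like this:

# Only needed on pre-1.33 clusters where InPlacePodVerticalScaling is still alpha:
minikube start --kubernetes-version=v1.32.0 \
  --feature-gates=InPlacePodVerticalScaling=true \
  --cpus=4 --memory=6g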

This simple Python script loads data into an in-memory cache the first time the API is hit and serves it from the cache on subsequent requests.

import os
import time
from concurrent.futures import ProcessPoolExecutor

from fastapi import FastAPI

app = FastAPI()
KV_CACHE = {}  # stands in for the engine's in-memory KV cache

def cpu_work(n: int) -> int:
    total = 0
    for i in range(n):
        total += i
    return total

@app.get("/predict")
def predict(request_id: str = "default"):
    start_time = time.time()
    
    is_cold = False
    if request_id not in KV_CACHE:
        is_cold = True
        time.sleep(5) # Simulate the "Prefill" penalty
        KV_CACHE[request_id] = "context_loaded"
    
    # simulate CPU work based on the number of cores
    cpu_cores = os.cpu_count() or 1
    work_per_core = 10_000_000

    with ProcessPoolExecutor(max_workers=cpu_cores) as executor:
        futures = [
            executor.submit(cpu_work, work_per_core)
            for _ in range(cpu_cores)
        ]
        for f in futures:
            f.result()
    
    duration = time.time() - start_time
    
    return {
        "status": "Warm" if not is_cold else "Cold Start",
        "latency_seconds": round(duration, 2),
        "cache_state": "Preserved"
    }

Build the image inside minikube's Docker daemon.

eval $(minikube docker-env)

cat <<EOF > Dockerfile
FROM python:3.9-slim
RUN pip install fastapi uvicorn psutil
COPY app.py .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]
EOF

docker build -t kv-simulator:v1 .

Deploy this image to minikube.

kubectl apply -f kv-simulator.yaml

kv-simulator.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kv-simulator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kv-simulator
  template:
    metadata:
      labels:
        app: kv-simulator
    spec:
      containers:
      - name: kv-simulator
        image: kv-simulator:v1
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
          limits:
            cpu: "1"
            memory: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: kv-simulator-svc
spec:
  selector:
    app: kv-simulator
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80

Port forward the service to localhost.

kubectl port-forward svc/kv-simulator-svc 9999:80

I ran the test twice and got the following numbers for the response time:

curl http://localhost:9999/predict | jq

Partial output:

1st:

{
  "status": "Cold Start",
  "latency_seconds": 12.14,
  "cache_state": "Preserved"
}

2nd:

{
  "status": "Warm",
  "latency_seconds": 6.71,
  "cache_state": "Preserved"
}

This simulates how an LLM serving stack behaves when its cache is empty. If I now scale horizontally, the first request served by the new replica will again be a cold start.
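If you want to see that with this demo, add a replica and hit the new pod directly (a hypothetical follow-up, not part of the run above). Note that kubectl port-forward against a Service pins to a single backing pod, so target the new pod by name:

kubectl scale deployment kv-simulator --replicas=2
kubectl get pods -l app=kv-simulator          # note the name of the new pod
kubectl port-forward pod/<new-pod-name> 9998:80
curl http://localhost:9998/predict | jq       # reports "Cold Start" again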

Next, I resized the pod in place, from 1 CPU / 4Gi to 4 CPUs / 6Gi. In a production environment, AI workloads are likely to run in the Guaranteed QoS class (where requests == limits). Guaranteed QoS reduces eviction risk under memory pressure and makes CPU/memory allocation more predictable.

POD_NAME=$(kubectl get pods -l app=kv-simulator -o jsonpath='{.items[0].metadata.name}')
kubectl patch pod $POD_NAME --subresource resize -p \
'{
  "spec": {
    "containers": [
      {
        "name": "kv-simulator",
        "resources": {
          "requests": {"cpu": "4", "memory": "6Gi"},
          "limits": {"cpu": "4", "memory": "6Gi"}
        }
      }
    ]
  }
}'

You can run the following to check that the pod's status reflects the newer generation after the resize.

kubectl get pod $POD_NAME -o jsonpath='{.status.observedGeneration}'
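To confirm the new values were actually applied, and not just accepted, you can also inspect the container status, which reports the resources the kubelet has applied; a resize that cannot be satisfied yet surfaces as a PodResizePending or PodResizeInProgress condition:

# Actual resources applied by the kubelet, as reported in the container status:
kubectl get pod $POD_NAME -o jsonpath='{.status.containerStatuses[0].resources}'

# Pending or in-progress resizes show up as pod conditions:
kubectl describe pod $POD_NAME | grep -A2 -i resize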

Right after the resize, I ran the same test again.

curl http://localhost:9999/predict | jq

{
  "status": "Warm",
  "latency_seconds": 1.11,
  "cache_state": "Preserved"
}

This demo shows two things. First, in-place resize avoids process restarts, which is a prerequisite for preserving in-memory state such as the KV cache. Second, increasing CPU limits improves request latency for CPU-bound inference paths, since the process can immediately take advantage of additional cores without a restart. Performance impact depends on whether you were CPU or memory constrained.

For the demo I resized the pod manually in place. Vertical Pod Autoscaler (VPA) support for in-place resize is still limited.