Vertical scaling

Stateless microservices are the gold standard for scaling applications in the cloud: because any replica can serve any request, services can scale up and down freely and absorb massive bursts of traffic. LLM inference engines are stateless in principle, but low-latency serving relies on preserving the KV cache across requests, which makes restarting a process to change its resources expensive. In this post, I look at how Kubernetes 1.35 advances in-place vertical scaling of running pods. The approach applies to any workload whose resources you want to scale without restarting the process, with LLM inference being a particularly good fit.
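
To make the mechanics concrete, here is a minimal sketch of what an in-place resize looks like from a client, assuming a cluster where in-place pod resize is available. It patches the pod's resize subresource via client-go instead of deleting and recreating the pod; the namespace, pod name (`inference-0`), container name (`llm`), and CPU figures are placeholders, not anything from a real deployment.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Strategic merge patch: containers are merged by name, so only the
	// "llm" container's CPU request/limit is changed.
	patch := []byte(`{"spec":{"containers":[{"name":"llm",` +
		`"resources":{"requests":{"cpu":"8"},"limits":{"cpu":"8"}}}]}}`)

	// Target the pod's "resize" subresource so the kubelet adjusts the
	// running container in place rather than restarting it.
	pod, err := clientset.CoreV1().Pods("default").Patch(
		context.TODO(), "inference-0",
		types.StrategicMergePatchType, patch,
		metav1.PatchOptions{}, "resize",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println("resized:", pod.Name)
}
```

The same change can be made interactively with `kubectl patch --subresource resize` on recent kubectl versions; either way, the key point is that the kubelet adjusts the running container's resource limits while the process, and its KV cache, stays alive.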