An agent runs my GPU cluster


For three weeks an agent has been running my GPU cluster. It scales Blackwell cards up and down, picks nodes, and frees silicon when nobody is rendering.

Not autonomous magic. It runs the loop I used to babysit by hand. Here is the honest version of what that taught me, with the parts LinkedIn was too short to hold.

[Image: A GPU rack orchestrated by an agent]

The first lesson: idle GPUs that are still blocked

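A render UI grabs a GPU the moment its process starts, not when you click “render”. The pod claims the device when it is scheduled, and PyTorch sets up its CUDA context as soon as startup code touches the device, which for most UIs happens while they load, long before the first job. So an interface nobody is using still owns a card all day.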

That sounds trivial until you only have so many cards and a queue of training jobs waiting behind a web UI that has had zero traffic since lunch.

The naive fix is “scale the deployment to zero when idle”. The problem: nothing wakes it back up. A user opens the URL, lands on a 502, and you are the one getting pinged.

KEDA scale-from-zero, wired to the ingress

The pattern that actually works is the KEDA HTTP add-on. An interceptor sits in front of the service. The first request after idle does this:

flowchart LR
    A[HTTP request] --> B[KEDA interceptor]
    B -->|pod at 0| C[scale 0 to 1]
    C --> D[DRA driver allocates GPU]
    D --> E[pod serves request]
    E -->|idle window elapses| F[scale back to 0]
    F -->|GPU released| G[card free for training]

The interceptor holds the request open during the cold start, so the user waits instead of seeing an error. Cold start is about two minutes for a heavy render image. I will take two minutes over a card sitting dead for a day, every single time.
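To make the "hold instead of fail" behaviour concrete, here is a minimal Python sketch of the idea. It is not the KEDA interceptor's code, only an illustration: `hold_until_ready`, `is_ready`, `scale_up` and the timeout values are names and numbers I made up for the example.

```python
import time
from typing import Callable


def hold_until_ready(
    is_ready: Callable[[], bool],
    scale_up: Callable[[], None],
    timeout_s: float = 300.0,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hold one request while the backend cold-starts.

    Returns True once the backend is ready (forward the request),
    False if the timeout elapses first (answer with an error).
    All names and defaults here are hypothetical.
    """
    if is_ready():
        return True              # warm path: forward immediately
    scale_up()                   # cold path: ask for 0 -> 1 exactly once
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_ready():
            return True          # pod came up while we were holding
        sleep(poll_s)
    return False                 # gave up: caller should return an error
```

The clock and sleep functions are injectable so the two-minute wait can be simulated instead of actually waited for. The timeout has to be longer than the worst cold start, otherwise the user is back to seeing errors.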

A minimal HTTPScaledObject looks like this:

apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: render-ui
spec:
  hosts:
    - render.example.internal
  scaleTargetRef:
    deployment: render-ui   # deprecated in newer add-on releases in favor of name/kind/apiVersion
    service: render-ui
    port: 8080
  replicas:
    min: 0          # the whole point: zero when idle
    max: 1
  scaledownPeriod: 1800   # 30 min idle, then release the GPU

The deployment itself carries no replica management of its own. KEDA owns the 0-to-1 decision based on whether traffic is flowing through the interceptor.
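The rule KEDA applies here is simple enough to write down. The sketch below is my own restatement of it in Python, not KEDA's implementation; `desired_replicas` and its parameters are hypothetical names, and the 1800-second default mirrors the `scaledownPeriod` above.

```python
def desired_replicas(
    now_s: float,
    last_request_s: float,
    in_flight: int,
    scaledown_period_s: float = 1800.0,
) -> int:
    """Return 1 while there is traffic or the idle window is still open, else 0.

    A simplified 0/1 model of the decision, assuming max replicas is 1.
    """
    if in_flight > 0:
        return 1                                  # a request is being served or held
    idle_for = now_s - last_request_s
    return 1 if idle_for < scaledown_period_s else 0
```

With the defaults, a UI that last saw a request 1000 seconds ago keeps its pod, and one that has been quiet for 1800 seconds or more gives the card back.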

The part that got genuinely interesting: scheduling

Here is the mistake I made early, and it cost me half my throughput without a single error in the logs.

I let the scheduler pick “first available” GPU. On a box with several cards sitting on PCIe links of different widths, that hands you a mismatched set. Tensor-parallel work then spreads across cards that do not share a clean, wide link, and your NCCL all-reduce quietly chokes on an x4 link while the other cards wait.

No crash. No warning. Just decode throughput sitting at half of what the hardware can do, and you staring at it wondering why.
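Since nothing in the logs points at the link, the quickest check is to ask the driver directly. A minimal sketch, assuming `nvidia-smi` is on the path: it queries the current and maximum PCIe link width per GPU and flags the narrow ones. The helper names and the sample output in the usage note are made up for illustration.

```python
import csv
import io
import subprocess

QUERY = "index,pcie.link.width.current,pcie.link.width.max"


def read_link_widths() -> str:
    """Run nvidia-smi and return its CSV output (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def narrow_links(csv_text: str, wanted_width: int = 16) -> list[int]:
    """Return the indices of GPUs whose current link is narrower than wanted."""
    narrow = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue                      # skip blank or malformed lines
        index, current, _maximum = (field.strip() for field in row)
        if not (index.isdigit() and current.isdigit()):
            continue                      # skip values the driver could not report
        if int(current) < wanted_width:
            narrow.append(int(index))
    return narrow
```

Feeding it a made-up four-card output such as `"0, 16, 16\n1, 16, 16\n2, 4, 16\n3, 16, 16\n"` flags card 2, which is exactly the kind of card a "first available" pick will happily include.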

The fix is to stop treating GPUs as interchangeable units. Kubernetes 1.36’s Dynamic Resource Allocation lets you express what you actually need: not “a GPU”, but “a coherent set on a wide link”. With the NVIDIA DRA driver you write a ResourceClaimTemplate whose selector filters on link width, so tensor-parallel pods land on an x16 quartet instead of whatever was free first.

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: tp4-coherent-x16
spec:
  spec:
    devices:
      requests:
        - name: gpus
          deviceClassName: gpu.nvidia.com
          count: 4
          selectors:
            - cel:
                # attribute names depend on what the driver publishes; check the ResourceSlices
                expression: "device.attributes['pcie'].linkWidth == 16"

Clean NCCL all-reduce, no x4 bottleneck, back to full decode throughput. Same hardware. The only thing that changed was telling the scheduler the truth about what the workload needs.
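The difference between the two policies fits in a few lines. This is a toy model in Python, not how the Kubernetes scheduler or the DRA driver allocates devices; the `Gpu` type and both selection functions are hypothetical and exist only to show why the filter matters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gpu:
    index: int
    link_width: int      # PCIe lanes on this card's link, e.g. 4 or 16
    free: bool = True


def first_available(gpus: list[Gpu], count: int) -> Optional[list[Gpu]]:
    """The naive policy: take whatever is free, in order."""
    free = [g for g in gpus if g.free]
    return free[:count] if len(free) >= count else None


def coherent_set(gpus: list[Gpu], count: int, width: int = 16) -> Optional[list[Gpu]]:
    """Only accept free cards on a link of the requested width."""
    wide = [g for g in gpus if g.free and g.link_width == width]
    return wide[:count] if len(wide) >= count else None


def slowest_link(selection: list[Gpu]) -> int:
    """A collective is paced by its narrowest participant."""
    return min(g.link_width for g in selection)
```

On a box with cards at widths 16, 4, 16, 16, 16, the naive pick includes the x4 card and the whole group is paced by it. The filtered pick skips it, and if there are not enough wide cards it returns nothing, so the pod waits instead of running at half speed.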

[Image: PCIe lanes, one clean x16 highlighted]

The model that reloads itself every time

The next thing that bled throughput was quieter than a bad lane, because it never showed up as an error either. It showed up as latency I had started treating as normal.

My render checkpoints live on a shared NFS volume, mounted read-only into every render pod. That part is deliberate: weights belong on storage the whole cluster can reach, never baked into a host or an image. ComfyUI and Wan2GP both mount the exact same model share, so there is one copy of a multi-gigabyte checkpoint and no drift between engines.

The trap is what happens on a cold pod. The first render streams tens of gigabytes of checkpoint off NFS and into GPU memory before it draws a single frame. Do that on every request and your “two minute cold start” turns into two minutes of disk and PCIe traffic you pay again and again.

The fix is to stop reloading. Once a pod is warm I keep the model resident in VRAM for the whole idle window, not just for the one request that woke it. The pod that a scale-up starts loads the checkpoint eagerly at startup; every render after that runs against weights already resident in GPU memory.

spec:
  replicas: 0            # idle: card is free for training
  template:
    spec:
      containers:
        - name: render
          env:
            - name: EAGER_LOAD_MODEL   # load on startup, not on first frame
              value: "1"
          # readiness waits out the one-time NFS->VRAM load,
          # so the pod only goes Ready once weights are resident
          readinessProbe:
            httpGet: { path: /readyz, port: 8080 }
            failureThreshold: 40       # generous: model load is slow, once

So whether it is Wan2GP picking an LTX checkpoint or ComfyUI running a graph, the NFS read is paid exactly once per warm window. After that the weights are hot. The card streams from VRAM, not from a network share. When the idle timer finally fires, the pod scales to zero and the whole card goes back to the training queue.
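Inside the process, "load once, then stay resident" can be as small as the sketch below. It is not how ComfyUI or Wan2GP manage their models; `ResidentModel` and the `loader` callback are hypothetical, and the loader stands in for whatever actually reads the checkpoint into VRAM.

```python
import threading
from typing import Any, Callable


class ResidentModel:
    """Load a model once per process and keep it for every later request."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader            # hypothetical: reads the checkpoint
        self._lock = threading.Lock()
        self._ready = threading.Event()
        self._model: Any = None

    def load(self) -> None:
        """Eager load at startup. Safe to call more than once."""
        with self._lock:
            if not self._ready.is_set():
                self._model = self._loader()   # the expensive part, done once
                self._ready.set()

    def is_ready(self) -> bool:
        """What a readiness endpoint would report."""
        return self._ready.is_set()

    def get(self) -> Any:
        """Return the resident model, loading it first if nobody has yet."""
        if not self._ready.is_set():
            self.load()
        return self._model
```

Calling `load()` at startup and wiring `is_ready()` to the readiness endpoint gives the behaviour described above: the pod only reports Ready once the weights are resident, and every `get()` after that is free.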

The pattern underneath is the same one as the idle UI: tell the system the truth about cost. Loading a checkpoint is expensive and should happen once. A web UI with no traffic should own nothing. Neither of those is clever. Both were just me, at some point, not having said so out loud to the scheduler.

What the agent does, and what it does not

This is the line that matters to me, so I want to be precise about it.

The agent executes. It reconciles desired state, wakes pods, frees GPUs, retries a failed rollout, and tells me when something has drifted from what I declared. It runs the boring loop I would otherwise be watching by hand.

The agent does not decide the architecture. I do.

It is the hands, not the brain. Anyone showing you a fully autonomous datacenter that needs no human judgment is showing you a demo, not a system they trust with real money. The interesting engineering is in drawing that line well: what is genuinely toil, and what is a decision that deserves a human who will own the consequences.
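The line between hands and brain shows up clearly in code. Here is a minimal sketch of one reconcile pass, with made-up names throughout: the agent compares what I declared with what is running, acts on the mechanical differences, and reports the rest. It never edits the declaration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str      # "scale" (agent acts) or "report_drift" (human decides)
    target: str
    detail: str


def reconcile(desired: dict[str, int], observed: dict[str, int]) -> list[Action]:
    """Compare declared replica counts with observed ones.

    Reads `desired` but never modifies it: the declaration belongs to
    the human, the convergence belongs to the agent.
    """
    actions: list[Action] = []
    for name, want in sorted(desired.items()):
        have = observed.get(name)
        if have is None:
            actions.append(Action("report_drift", name, "declared but not found"))
        elif have != want:
            actions.append(Action("scale", name, f"{have} -> {want}"))
    for name in sorted(set(observed) - set(desired)):
        actions.append(Action("report_drift", name, "running but not declared"))
    return actions
```

Anything that was never declared comes back as a report rather than an action. Deleting an unknown workload is a decision, so it goes to the person who owns the consequences.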

[Image: A pod scaling from zero to one and back]

Local and cloud, not local versus cloud

One more honest piece, because the timeline is full of absolutism right now.

This runs on my own hardware and through the cloud. Local where latency, cost and privacy win. Cloud where I need a frontier model or a burst of capacity I do not own. We run our strongest model through a managed cloud service and I am not embarrassed about it.

The setup that actually works is both. People selling you “leave the cloud” or “cloud only” are selling you half a stack. The hard part was never the GPU. It is the routing: deciding, per request, what runs where.
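A per-request routing decision can start out very small. The sketch below is an illustration under my own assumptions, not the router I run: the `Request` fields, the three outcomes and the rule order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    model: str
    private: bool = False    # data that must not leave my hardware


def route(req: Request, local_models: set[str], free_local_gpus: int) -> str:
    """Return "local", "cloud" or "reject" for one request."""
    runs_locally = req.model in local_models
    if req.private:
        # Privacy wins over everything: queue locally or refuse.
        return "local" if runs_locally else "reject"
    if runs_locally and free_local_gpus > 0:
        return "local"       # cheaper and closer when a card is free
    return "cloud"           # frontier model, or a burst I do not own
```

The order of the checks is the policy. Putting privacy first means a private request waits for a local card rather than spilling to the cloud, and that ordering is exactly the kind of decision that stays with a human.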

What it actually changed

Letting an agent run the cluster did not make me lazy. It made me sharper about what is worth deciding myself and what is just toil.

The toil belongs to the agent now. The judgment stays mine.

That is the part the hype keeps getting backwards.