We rejected bin-packing and FIFO for a weighted fair queuing approach with topology-aware placement. This post covers the math, the trade-offs, and why latency-sensitive inference workloads changed everything.

GPU scheduling is a solved problem — if your workloads are homogeneous, your cluster is a single rack, and you do not care about latency. In that world, bin-packing maximizes utilization, FIFO ensures fairness, and everyone goes home happy. Our world is different. We schedule 1,798 GPUs across three data centers, running a mix of latency-sensitive inference workloads (p95 under 12ms), throughput-oriented training jobs (running for days), and interactive development sessions (engineers iterating on model architectures). The workloads differ in duration by roughly eight orders of magnitude (millisecond-scale inference requests to multi-day training runs), in GPU requirements by three orders of magnitude (1 to 1,000 GPUs), and in latency sensitivity by four orders of magnitude (1ms to 10,000ms). No single scheduling policy can optimize for all of these simultaneously. This article presents the algorithm we developed, the trade-offs we chose, and the performance we achieved.
We evaluated three baseline algorithms before designing our own. Bin-packing assigns workloads to the smallest number of GPU groups that can satisfy their requirements, maximizing utilization. On our fleet, bin-packing achieved 89% average GPU utilization — impressive, but with a fatal flaw: inference workloads were frequently queued behind long-running training jobs, with p99 inference latency of 340ms — 28x our target. The problem is that bin-packing treats all workloads equally; it has no concept of latency urgency. FIFO scheduling processes workloads in arrival order, ensuring fairness. On our fleet, FIFO achieved p99 inference latency of 45ms — better, but still 3.8x our target, and GPU utilization dropped to 61% because small inference workloads fragmented the cluster, leaving large contiguous GPU groups unavailable for training jobs. Dominant Resource Fairness (DRF) allocates resources proportionally across tenants, ensuring that no tenant is starved. DRF achieved 73% utilization with p99 inference latency of 180ms — the worst of both worlds for our requirements, because proportionality does not account for latency sensitivity.
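To make the DRF result concrete, here is a toy sketch of the DRF selection rule: the next allocation goes to the tenant with the smallest dominant share, regardless of how latency-sensitive its workloads are. The tenant names, CPU capacity, and allocations below are made up for illustration; they are not our fleet's actual numbers.

```python
# Toy DRF selection: pick the tenant with the smallest dominant share.
# Capacities and allocations here are illustrative, not production values.
CAPACITY = {"gpus": 1798, "cpus": 50000}

def dominant_share(allocated: dict, capacity: dict = CAPACITY) -> float:
    # A tenant's dominant share is its largest fractional claim on any resource.
    return max(allocated[r] / capacity[r] for r in capacity)

def drf_pick_next(tenants: dict) -> str:
    # tenants maps tenant name -> current allocation per resource.
    return min(tenants, key=lambda t: dominant_share(tenants[t]))

tenants = {
    "inference-team": {"gpus": 40, "cpus": 400},
    "training-team": {"gpus": 900, "cpus": 9000},
}
print(drf_pick_next(tenants))  # -> "inference-team"
```

Note that the selection key is purely a resource fraction: an urgent inference request and a batch training task from the same tenant look identical to DRF, which is exactly why proportionality alone could not meet our latency target.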
Our algorithm, which we call Topology-Aware Weighted Fair Queuing (TAWFQ), combines three mechanisms. The first mechanism is weighted fair queuing (WFQ), which assigns each workload a weight based on its latency sensitivity class. Inference workloads receive weight 100, interactive development receives weight 10, and training receives weight 1. When the scheduler selects the next workload to run, it picks the one with the highest weighted deficit — where deficit is the difference between the service a workload should have received (proportional to its weight) and the service it actually received. This ensures that inference workloads are scheduled within milliseconds of arrival, even if a training job is currently occupying the target GPUs, because the inference workload's deficit grows 100x faster than the training workload's. The preemption cost is bounded: inference workloads typically require 1-8 GPUs, and the preemption overhead for a training job on those GPUs is approximately 2 seconds (checkpoint, save state, release resources). The net effect is that inference p99 latency drops from 340ms (bin-packing) to 11.3ms (TAWFQ) — a 30x improvement — at the cost of a 7% reduction in training throughput due to preemption overhead.
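A minimal sketch of the weighted-deficit selection step might look like the following. The class names, weights, and bookkeeping fields are illustrative simplifications rather than the production scheduler's actual data structures, and the deficit definition used here (fair share of total service granted minus service received) is one of several equivalent formulations.

```python
from dataclasses import dataclass

# Latency-sensitivity classes and their weights, as described above.
WEIGHTS = {"inference": 100, "interactive": 10, "training": 1}

@dataclass
class Workload:
    name: str
    wclass: str                    # "inference", "interactive", or "training"
    service_received: float = 0.0  # GPU-seconds granted so far

class WFQScheduler:
    def __init__(self):
        self.pending: list[Workload] = []
        self.total_service = 0.0   # GPU-seconds granted across all workloads

    def submit(self, w: Workload) -> None:
        self.pending.append(w)

    def deficit(self, w: Workload) -> float:
        # A workload's fair share of all service granted so far is proportional
        # to its weight; the deficit is how far behind that share it has fallen.
        total_weight = sum(WEIGHTS[p.wclass] for p in self.pending)
        fair_share = self.total_service * WEIGHTS[w.wclass] / total_weight
        return fair_share - w.service_received

    def select_next(self) -> Workload | None:
        # Pick the pending workload with the largest deficit. An inference
        # workload's fair share (and hence its deficit) grows 100x faster than
        # a training workload's, so it wins the next scheduling decision quickly.
        return max(self.pending, key=self.deficit, default=None)

    def account(self, w: Workload, gpu_seconds: float) -> None:
        w.service_received += gpu_seconds
        self.total_service += gpu_seconds
```

This is essentially deficit-based fair queuing; virtual-time WFQ and deficit round-robin differ in bookkeeping detail but produce the same qualitative ordering, which is all the preemption behavior above depends on.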
The second mechanism is topology-aware placement. When the scheduler selects a workload for execution, it must choose which physical GPUs to assign. A naive placement algorithm might assign any available GPUs, but GPU interconnect topology matters enormously for performance. A training job that requires 8 GPUs on the same NVLink domain achieves 94% of peak training throughput. The same job split across two NVLink domains achieves 62%. Split across two racks, it achieves 34%. Our topology-aware placement formulates GPU allocation as a constrained optimization problem: minimize inter-GPU communication latency (expressed as a function of NVLink bandwidth, PCIe topology, and network path) subject to jurisdiction constraints, cooling capacity, and power budget. The solver uses a greedy heuristic with a local search refinement — it finds an initial placement by greedily selecting the lowest-latency GPU group, then improves it by swapping GPUs between groups if the swap reduces total communication cost. For the current fleet size of 1,798 GPUs, the solver runs in under 50ms, which is negligible compared to the scheduling interval of 100ms.
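A toy version of that greedy-plus-local-search heuristic, ignoring the jurisdiction, cooling, and power constraints and using a generic pairwise cost matrix in place of the real NVLink/PCIe/network model, could look like this:

```python
import itertools

def comm_cost(group, cost):
    # Total pairwise communication cost of a candidate GPU group, where
    # cost[a][b] encodes interconnect distance between GPUs a and b.
    return sum(cost[a][b] for a, b in itertools.combinations(group, 2))

def greedy_placement(free_gpus, k, cost):
    # Seed with the GPU closest (in aggregate) to all other free GPUs, then
    # repeatedly add the free GPU that adds the least communication cost.
    assert k <= len(free_gpus)
    seed = min(free_gpus, key=lambda g: sum(cost[g][h] for h in free_gpus if h != g))
    group, remaining = [seed], set(free_gpus) - {seed}
    while len(group) < k:
        nxt = min(remaining, key=lambda g: sum(cost[g][h] for h in group))
        group.append(nxt)
        remaining.remove(nxt)
    return group, remaining

def local_search(group, remaining, cost):
    # Refinement pass: swap a placed GPU for an unplaced one whenever the swap
    # lowers total communication cost; stop when no swap improves the placement.
    improved = True
    while improved:
        improved = False
        for inside in list(group):
            for outside in list(remaining):
                candidate = [g for g in group if g != inside] + [outside]
                if comm_cost(candidate, cost) < comm_cost(group, cost):
                    group = candidate
                    remaining = (remaining - {outside}) | {inside}
                    improved = True
                    break
            if improved:
                break
    return group
```

The real solver prunes the swap search and enforces the hard constraints during candidate generation, but the structure is the same: a cheap greedy seed followed by cost-reducing local moves, which is what keeps the runtime under the 100ms scheduling interval.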
The third mechanism is gang scheduling with graceful degradation. Many AI workloads require all requested GPUs simultaneously — a training job that requests 64 GPUs cannot make progress with 63. This is the gang scheduling problem, and it is the primary source of fragmentation in GPU clusters. Our approach schedules all GPUs in a gang simultaneously or not at all, but with a graceful degradation option: if the full gang cannot be scheduled within a configurable timeout (default: 30 seconds for training, 5 seconds for inference), the scheduler offers a reduced allocation. A 64-GPU training job that cannot get 64 GPUs might accept 48 GPUs with a corresponding reduction in batch size and throughput. This is not ideal, but it is far better than waiting indefinitely while GPUs sit idle because a perfect placement cannot be found. In practice, graceful degradation reduces average queue time for training jobs by 62% and increases fleet utilization by 8 percentage points, because the GPUs that would have been held in reserve for a full gang allocation are instead productively employed at a reduced scale.
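A simplified sketch of that two-phase logic follows. The timeouts mirror the defaults quoted above, while the GangRequest fields and the try_reserve callback are assumptions standing in for the real allocator interface.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

GANG_TIMEOUT = {"training": 30.0, "inference": 5.0}  # seconds, defaults quoted above

@dataclass
class GangRequest:
    wclass: str          # "training" or "inference"
    gpus_requested: int
    min_acceptable: int  # smallest degraded allocation the job will accept

def schedule_gang(req: GangRequest,
                  try_reserve: Callable[[int], Optional[list]],
                  poll_interval: float = 0.1) -> Optional[list]:
    # try_reserve(n) returns a list of n GPU ids if n can be reserved together, else None.
    deadline = time.monotonic() + GANG_TIMEOUT[req.wclass]
    # Phase 1: all-or-nothing. Keep retrying the full gang until the timeout expires.
    while time.monotonic() < deadline:
        gpus = try_reserve(req.gpus_requested)
        if gpus is not None:
            return gpus
        time.sleep(poll_interval)
    # Phase 2: graceful degradation. Walk down to the largest allocation the job
    # will still accept (e.g. 48 GPUs instead of the requested 64).
    for size in range(req.gpus_requested - 1, req.min_acceptable - 1, -1):
        gpus = try_reserve(size)
        if gpus is not None:
            return gpus
    return None  # nothing acceptable right now; the job stays queued
```

In practice the degraded size would likely be snapped to topology-friendly shapes rather than decremented one GPU at a time; the control flow, not the step size, is the point here.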
The combined performance of TAWFQ on our production fleet: 87% average GPU utilization (2 points below bin-packing, 26 points above FIFO), p99 inference latency of 11.3ms (30x below bin-packing, 4x below FIFO), training throughput within 7% of theoretical maximum (the preemption overhead for inference scheduling), and zero tenant starvation events across 14 months of operation. The scheduling overhead — the CPU time spent computing placement decisions — is 0.3% of cluster compute capacity, well within our budget. TAWFQ is not optimal for any single metric. It is Pareto-optimal for the set of metrics that matter to our business: utilization, inference latency, training throughput, and fairness. Finding that Pareto-optimal point required understanding our workload distribution deeply and making explicit trade-offs that more generic algorithms avoid. In GPU scheduling, as in most engineering problems, there are no free lunches — only well-chosen compromises.