Jelly48

When the GPU is 99% overhead: measuring WebGPU dispatch cost in a tiny physics engine

June 10, 2026

Jelly48 is a soft-body 2048: every tile is an XPBD jelly simulated at 600 substeps a second. We built the engine GPU-first: Rust, wgpu, and the whole solver in WGSL compute shaders. Then we profiled it at actual game scale and moved the physics back to the CPU. This post walks through the numbers that changed our minds.

The setup

A full board is 16 tiles × 16 particles = 256 particles and ~500 constraints — a tiny workload. The GPU pipeline ran it the canonical way: per substep, an integrate kernel, graph-colored constraint-solve dispatches, contact detection and response, boundary, velocities. At the game's settings (10 substeps, 6 constraint colors) that's 176 compute dispatches per frame, all in one command encoder, one submit.

What we measured

Timing full GPU completion per frame on an Apple M3 (native Metal via wgpu, 600 frames, 12-tile board, 192 particles), sweeping the substep count to vary the dispatch count:

| substeps | dispatches/frame | avg frame |
|---|---|---|
| 1 | 23 | 1.96 ms |
| 4 | 74 | 2.42 ms |
| 10 (game) | 176 | 4.67 ms |
| 16 | 278 | 5.32 ms |
| 32 | 550 | 8.15 ms |

That's a clean linear fit: frame ≈ 1.7 ms fixed + ~12 µs per dispatch. Two control results made the diagnosis unambiguous. First, growing the board from 12 to 16 bodies (33% more work in every kernel) did not measurably change the frame time: the workload is invisible next to the per-dispatch cost. Second, the per-dispatch slope held whether the kernels did real work or almost none. Each dependent dispatch serializes the GPU pipeline, and with only 1–6 workgroups in flight the hardware never even fills one shader core. At this scale the pipeline is roughly 99% dispatch overhead and 1% physics.
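For readers who want to check the slope themselves, here is a minimal sketch of an ordinary least-squares fit over the five measured rows. The names `SAMPLES` and `fit_line` are invented for this example, and the post does not say how the original fit was computed, so treat the result as a sanity check and not as the method we used.

```rust
/// (dispatches per frame, average frame time in ms), copied from the table above.
const SAMPLES: [(f64, f64); 5] = [
    (23.0, 1.96),
    (74.0, 2.42),
    (176.0, 4.67),
    (278.0, 5.32),
    (550.0, 8.15),
];

/// Ordinary least-squares fit of y = intercept + slope * x.
/// Returns (intercept, slope). Hypothetical helper, not part of the engine.
fn fit_line(samples: &[(f64, f64)]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean_x = samples.iter().map(|s| s.0).sum::<f64>() / n;
    let mean_y = samples.iter().map(|s| s.1).sum::<f64>() / n;
    // Covariance and variance sums (unnormalized; the factor cancels in the ratio).
    let sxy: f64 = samples.iter().map(|s| (s.0 - mean_x) * (s.1 - mean_y)).sum();
    let sxx: f64 = samples.iter().map(|s| (s.0 - mean_x).powi(2)).sum();
    let slope = sxy / sxx;
    (mean_y - slope * mean_x, slope)
}
```

Calling `fit_line(&SAMPLES)` gives a slope close to 12 µs per dispatch and a fixed cost a little under 2 ms. The exact intercept depends on how the points are weighted, so it need not match the rounded figure quoted above.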

Browsers make it worse: published measurements put WebGPU's CPU-side encode cost at ~32 µs per dispatch on Safari/Metal and around a millisecond per dispatch on Firefox. Multiply by 176 and that is roughly 5.6 ms of encoding per frame on Safari and on the order of 176 ms on Firefox, so the architecture simply doesn't fit inside a mobile frame budget.

The fix that didn't need a GPU

We wrote the same solver as plain sequential Rust — no SIMD, single thread, brute-force broadphase — and benchmarked the identical worst-case frame (256 particles, 512 constraints, 10 substeps, full contact detection every substep) natively and as WebAssembly in both browser engines:

| target | full frame |
|---|---|
| native (M3) | 209 µs |
| wasm, JavaScriptCore (Safari) | 306 µs |
| wasm, V8 (Chrome) | 453 µs |

The wasm tax is only ~1.5–2.2× over native — while the GPU path's overhead gets worse in a browser. Through the engine's real API the CPU backend runs the game frame in 0.7 ms vs 4.0 ms for the GPU pipeline on the same machine, with dead-flat tails (p99 0.74 ms vs 9.9 ms) — and it deletes the async GPU readback entirely, so game logic reads positions with zero latency. As a bonus, a steady ~1 ms of CPU work per frame is far kinder to a phone's battery than trickle-feeding a GPU 176 times per frame.
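To give a feel for what "plain sequential Rust" means here, the following is a minimal sketch of an XPBD substep loop. It is not the engine's code. It assumes 2D particles and distance constraints only, with one constraint pass per substep, and it leaves out contact detection, contact response, and the boundary step. The types `Particle` and `Distance` and the function `step` are illustrative names.

```rust
/// One particle in 2D. `inv_mass == 0.0` pins it in place.
#[derive(Clone, Copy)]
struct Particle {
    pos: [f32; 2],
    prev: [f32; 2],
    vel: [f32; 2],
    inv_mass: f32,
}

/// XPBD distance constraint between particles `a` and `b`.
struct Distance {
    a: usize,
    b: usize,
    rest: f32,
    compliance: f32, // inverse stiffness; 0.0 is rigid
}

/// Advance the system by `dt` seconds using `substeps` XPBD substeps,
/// one constraint pass per substep, all on a single thread.
fn step(ps: &mut [Particle], cs: &[Distance], gravity: [f32; 2], dt: f32, substeps: u32) {
    let h = dt / substeps as f32;
    for _ in 0..substeps {
        // Integrate: predict new positions from velocities.
        for p in ps.iter_mut().filter(|p| p.inv_mass > 0.0) {
            p.prev = p.pos;
            for k in 0..2 {
                p.vel[k] += gravity[k] * h;
                p.pos[k] += p.vel[k] * h;
            }
        }
        // Solve constraints one after another (Gauss-Seidel order), so no
        // graph coloring is needed on the CPU.
        for c in cs {
            let (pa, pb) = (ps[c.a], ps[c.b]);
            let d = [pa.pos[0] - pb.pos[0], pa.pos[1] - pb.pos[1]];
            let len = (d[0] * d[0] + d[1] * d[1]).sqrt();
            let w = pa.inv_mass + pb.inv_mass;
            if len < 1e-9 || w == 0.0 {
                continue;
            }
            // With one pass per substep the accumulated multiplier starts at
            // zero, so the XPBD update reduces to -C / (w + compliance / h^2).
            let dlambda = -(len - c.rest) / (w + c.compliance / (h * h));
            for k in 0..2 {
                let n = d[k] / len;
                ps[c.a].pos[k] += pa.inv_mass * dlambda * n;
                ps[c.b].pos[k] -= pb.inv_mass * dlambda * n;
            }
        }
        // Derive velocities from the change in position.
        for p in ps.iter_mut().filter(|p| p.inv_mass > 0.0) {
            for k in 0..2 {
                p.vel[k] = (p.pos[k] - p.prev[k]) / h;
            }
        }
    }
}
```

Every phase that was a separate GPU dispatch becomes an ordinary loop here, which is why the per-dispatch cost disappears: the only synchronization is the program order of a single thread.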

What we'd tell past us

The crossover is real and it is much higher than intuition says: scalar CPU wasm beats a many-dispatch GPU pipeline until somewhere in the thousands of particles. The engine now picks a backend by particle count — CPU below the threshold, the GPU pipeline (which scales fine; its cost is per-dispatch, not per-particle) above it. GPU compute isn't slow; small sequential GPU compute is. If your per-dispatch work can't keep a shader core busy for longer than the dispatch costs to issue, batch it, fuse it — or just let the CPU have it.
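The backend switch described above can be sketched in a few lines. This is an illustration only: the threshold value is hypothetical (the post only says "somewhere in the thousands"), and the `gpu_available` flag is an assumption added here so the example has a fallback path.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cpu,
    Gpu,
}

/// Hypothetical crossover point. The real value should come from measuring
/// both backends on the target hardware.
const GPU_THRESHOLD_PARTICLES: usize = 4096;

/// Pick the CPU solver for small scenes and the GPU pipeline for large ones.
/// `gpu_available` is an assumed flag for devices without usable GPU compute.
fn pick_backend(particle_count: usize, gpu_available: bool) -> Backend {
    if gpu_available && particle_count >= GPU_THRESHOLD_PARTICLES {
        Backend::Gpu
    } else {
        Backend::Cpu
    }
}
```

With these assumptions, a full 256-particle board always lands on the CPU path, and only a scene well past the threshold pays the per-dispatch cost.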

The game all of this powers is free, in your browser, and considerably squishier than these tables suggest.

Play Jelly48 →