We shipped a win condition that was geometrically impossible
We're building a second game on the Jelly48 engine: a soft-body Suika-like called Jellygon, coming to iOS. Polygon pieces — triangle through hendecagon, one side per tier — drop into a cup, squish, and fuse when same-tier pieces touch. Making the top-tier piece, two tier-7 decagons fusing and bursting, is the game's watermelon moment. We had tuned the piece sizes for feel, validated stability with soak tests, and built the celebration animation. Then a simple question came up in review: is there actually enough room in the cup to do this?
The answer was no. Not "very hard" — geometrically impossible. No human or bot would ever have seen the win screen. This post is the procedure that caught it, the fix, and the experiment that turned the tuning constant into a dose-response curve we can now adjust with data instead of vibes.
The napkin math said "tight but fine" — the napkin was wrong
Piece sizes grow geometrically: scale = 0.62 · g^tier, with growth g = 1.26 at the time. We measured the actual spawned rings rather than estimating from circumradius, because flat-sided polygons are narrower than their bounding circle:
| tier | width | area | cumulative area, tiers 0..tier |
|---|---|---|---|
| 5 | 3.80 | 10.9 | 25.4 |
| 6 | 4.82 | 17.7 | 43.1 |
| 7 | 5.95 | 28.6 | 71.6 |
| 8 | 7.72 | 45.9 | — |
The cup is 8.4 wide with 9.8 of usable height: 82.3 area. One of every tier 0–7 — the worst-case build inventory — is 71.6, i.e. 87% occupancy. Tight but achievable, we concluded, since soft bodies pack better than rigid ones and every merge in a cascade frees ~19% of the merging pair's area.
That analysis was arithmetically right and practically wrong, because the binding constraint isn't total area — it's that the 7→8 leg concentrates the load into three enormous, badly-packing blobs. To fuse the second tier-7 you must hold the first one (28.6) plus two tier-6s (17.7 each) at the moment they chain: 64 of 82 area in three pieces, before counting any of the working inventory that built them. Three giant rounded polygons in a rectangle pack far worse than the 87% headline suggested.
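The arithmetic behind both readings can be reproduced in a few lines. This is a minimal sketch that only re-derives the numbers quoted above from the measured table and the cup dimensions; the constant and function names are made up for illustration and are not taken from the engine.

```rust
// Measured areas from the table above (tier 6, tier 7) and the cumulative
// area of one piece of every tier 0..=7.
const AREA_T6: f64 = 17.7;
const AREA_T7: f64 = 28.6;
const CUMULATIVE_0_TO_7: f64 = 71.6;

// Cup dimensions quoted in the post.
const CUP_WIDTH: f64 = 8.4;
const CUP_USABLE_HEIGHT: f64 = 9.8;

/// Usable cup area: width times usable height.
fn cup_area() -> f64 {
    CUP_WIDTH * CUP_USABLE_HEIGHT
}

/// Fraction of the cup covered by a set of piece areas.
fn occupancy(piece_areas: &[f64]) -> f64 {
    piece_areas.iter().sum::<f64>() / cup_area()
}

/// The headline number: one piece of every tier 0..=7 in the cup.
fn headline_occupancy() -> f64 {
    occupancy(&[CUMULATIVE_0_TO_7])
}

/// The binding case: one tier-7 plus two tier-6 pieces held at once.
fn final_leg_occupancy() -> f64 {
    occupancy(&[AREA_T7, AREA_T6, AREA_T6])
}

/// Fraction of a merging pair's combined area that disappears when two
/// pieces of `piece_area` fuse into one piece of `merged_area`.
fn area_freed_by_merge(piece_area: f64, merged_area: f64) -> f64 {
    1.0 - merged_area / (2.0 * piece_area)
}
```

With the table's values, `headline_occupancy()` comes out near 0.87, `final_leg_occupancy()` near 0.78 (that is the 64 of 82 held in just three pieces), and `area_freed_by_merge(AREA_T6, AREA_T7)` near 0.19. None of this models packing, which is the point: the area totals look fine while the shapes do not fit.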
Bots as the falsifier
Paper math can't settle a packing question; play can. We wrote a headless player against the real engine — same physics, same rules, 60 Hz — that aims and drops deliberately: target the highest same-tier partner (merges fire on contact), otherwise park in a size-sorted slot. One game runs in about two seconds.
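The drop policy described above can be sketched roughly as follows. This is an illustration, not the harness's actual code: the `Piece` struct, the coordinate convention (x across the cup, y pointing up), the function names, and the evenly spaced left-to-right parking slots are all assumptions, and "highest" is read here as the topmost matching piece on the board.

```rust
/// Hypothetical view of a piece on the board: its tier and center position.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Piece {
    tier: u8,
    x: f32,
    y: f32,
}

// Cup width from the post; tiers run 0..=8.
const CUP_WIDTH: f32 = 8.4;
const MAX_TIER: u8 = 8;

/// Picks the x position for the next drop: aim at the topmost piece of the
/// same tier if one exists, otherwise park in that tier's slot.
fn choose_drop_x(next_tier: u8, board: &[Piece]) -> f32 {
    let partner = board
        .iter()
        .filter(|p| p.tier == next_tier)
        .max_by(|a, b| a.y.total_cmp(&b.y));
    match partner {
        Some(p) => p.x,
        None => parking_slot_x(next_tier),
    }
}

/// Assumed parking layout: one slot per tier, smallest tier on the left,
/// each slot centered in an equal share of the cup width.
fn parking_slot_x(tier: u8) -> f32 {
    let slots = f32::from(MAX_TIER) + 1.0;
    let index = f32::from(tier.min(MAX_TIER));
    (index + 0.5) / slots * CUP_WIDTH
}
```

A noisy variant in the spirit of the `noisy` strategy would add a small random offset to the returned x before dropping; that is left out here to keep the sketch deterministic.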
First experiment, 12 runs across three strategy variants: every run reached tier 7 in ~45 drops, and every run died 10–20 drops later while building the second 7 — exactly at the predicted bottleneck. Zero wins.
So we built the real harness: 11 strategies × 100 seeds, with bug oracles watching every frame (particle-velocity explosions, NaN positions, pieces tunneling out of the cup, merge-reject storms). The strategies deliberately span competent to abusive:
- greedy / flat / patient / noisy — partner-targeting play, from rigid to human-like (noisy adds aim error)
- rush / jitter — drop-timing abuse: release mid-aim, re-aim every 5 frames
- split / walls / center / stack / random — board-topology abuse
`STRATS=greedy,flat SEEDS=1-100 cargo run --release --example playtest`
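A per-frame oracle of the kind listed above might look like the following. Treat it as a sketch under stated assumptions: `Particle`, `Flag`, `Limits`, and both functions are hypothetical names, the thresholds are parameters rather than the harness's real values, and the merge-reject-storm oracle is omitted because it needs counters across frames rather than a single-frame check.

```rust
/// Hypothetical soft-body particle: 2D position and velocity.
#[derive(Debug, Clone, Copy)]
struct Particle {
    pos: [f32; 2],
    vel: [f32; 2],
}

/// Kinds of invariant violation this sketch can report.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Flag {
    NonFinitePosition,
    VelocityExplosion,
    EscapedCup,
}

/// Assumed limits: a speed ceiling and an axis-aligned cup bounding box.
struct Limits {
    max_speed: f32,
    cup_min: [f32; 2],
    cup_max: [f32; 2],
}

/// Returns the first violated invariant for one particle, if any.
fn check_particle(p: &Particle, limits: &Limits) -> Option<Flag> {
    // NaN or infinite coordinates make every later comparison meaningless,
    // so they are checked first.
    if !p.pos[0].is_finite() || !p.pos[1].is_finite() {
        return Some(Flag::NonFinitePosition);
    }
    let speed = p.vel[0].hypot(p.vel[1]);
    if !speed.is_finite() || speed > limits.max_speed {
        return Some(Flag::VelocityExplosion);
    }
    let outside = (0..2).any(|i| p.pos[i] < limits.cup_min[i] || p.pos[i] > limits.cup_max[i]);
    if outside {
        return Some(Flag::EscapedCup);
    }
    None
}

/// Checks every particle in a frame and returns (particle index, flag) pairs.
fn check_frame(particles: &[Particle], limits: &Limits) -> Vec<(usize, Flag)> {
    particles
        .iter()
        .enumerate()
        .filter_map(|(i, p)| check_particle(p, limits).map(|flag| (i, flag)))
        .collect()
}
```

Calling `check_frame` once per simulated frame and logging the seed, strategy, and frame number alongside any flags is enough to make each finding replayable.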
At the original growth value the verdict was unanimous: 0 wins, with a characteristic tier-histogram wall at 7. The win condition was dead code.
The fix is one constant — and the response curve is steep
The honest lever is the growth ratio itself: smaller tier-to-tier growth shrinks the late-game blobs relative to the cup. We stepped it down and re-swept. First attempt, 1.26 → 1.24: still 0 wins in 200 runs. We mispredicted this — the napkin said 1.24 would open the door. It barely moved it.
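To see how much a small change in growth shrinks a late-game piece, the scale formula can be evaluated directly. The sketch below uses only the formula scale = 0.62 · g^tier from earlier in the post and assumes that area goes with the square of linear scale; the measured rings need not follow that exactly, and the function names are illustrative.

```rust
/// Base scale from the post's formula: scale = 0.62 * g^tier.
const BASE_SCALE: f64 = 0.62;

/// Linear scale of a piece at `tier` for growth ratio `growth`.
fn scale(tier: i32, growth: f64) -> f64 {
    BASE_SCALE * growth.powi(tier)
}

/// Area of a tier at `growth` relative to the same tier at
/// `baseline_growth`, assuming area scales with the square of `scale`.
fn relative_area(tier: i32, growth: f64, baseline_growth: f64) -> f64 {
    (scale(tier, growth) / scale(tier, baseline_growth)).powi(2)
}
```

Under that assumption, `relative_area(7, 1.24, 1.26)` is roughly 0.80 and `relative_area(7, 1.22, 1.26)` roughly 0.64. Shrinking the pieces is necessary, but as the sweep shows, the area ratio alone does not predict when wins start to appear.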
At 1.22 the first wins in the project's history appeared — and fittingly, the first strategy to crack it was noisy, the one that plays like a human, misses included. The deterministic bots kept tiling themselves into the same doomed layouts; aim error explored better boards.
That steepness deserved a proper experiment instead of one more guess. We made the growth factor an env hook and swept seven values × 4 win-capable strategies × 50 seeds — 1,400 games, ~30 minutes on a laptop, 7 parallel jobs:
| growth | bot win rate (200 runs each) |
|---|---|
| 1.24 | 0% |
| 1.23 | 1% |
| 1.22 (ship) | 4% |
| 1.21 | 8% |
| 1.20 | 21% |
| 1.19 | 21% |
| 1.18 | 38% |
A steep dose-response curve that never decreases as growth shrinks: the win rate roughly doubles per 0.01 step on average, with one flat step between 1.20 and 1.19 (21% at both). Two practical consequences. First, difficulty is now a calibrated dial: when human playtest data says "too hard," we know roughly what one notch buys. Second, the steepness is a warning: a constant you'd happily nudge by 0.05 "to taste" swings the win rate by an order of magnitude. Tune in 0.01 steps, against human data.
Bots are a lower bound on skill: they do no lookahead and no inventory sequencing, so we expect a deliberate human to multiply that 4% several-fold. That would land engaged players at a win every handful of good games, a repeatable climax rather than a one-time trophy (in Jellygon the top piece bursts and the run continues; the score game is chasing repeat bursts).
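Both halves of the experiment are small: an environment override for the growth constant, and a pass over the results to read off the step-to-step ratios. The sketch below is one way to write them, with assumptions flagged: the variable name `JELLY_GROWTH`, the fallback rules, and all function names are hypothetical, and the win rates are copied from the table as whole percentages.

```rust
use std::env;

/// Growth value used when no override is supplied (the shipped value).
const DEFAULT_GROWTH: f64 = 1.22;

/// Sweep results from the table: (growth, bot win rate in percent).
const SWEEP: [(f64, f64); 7] = [
    (1.24, 0.0),
    (1.23, 1.0),
    (1.22, 4.0),
    (1.21, 8.0),
    (1.20, 21.0),
    (1.19, 21.0),
    (1.18, 38.0),
];

/// Parses an optional override, falling back to the default when it is
/// absent, unparsable, or not a finite ratio above 1.
fn parse_growth(raw: Option<&str>) -> f64 {
    raw.and_then(|s| s.trim().parse::<f64>().ok())
        .filter(|g| g.is_finite() && *g > 1.0)
        .unwrap_or(DEFAULT_GROWTH)
}

/// Reads the override from the hypothetical `JELLY_GROWTH` variable.
fn growth_from_env() -> f64 {
    parse_growth(env::var("JELLY_GROWTH").ok().as_deref())
}

/// Ratio of each row's win rate to the previous row's, or `None` where
/// the previous rate is zero and the ratio is undefined.
fn step_ratios(rows: &[(f64, f64)]) -> Vec<Option<f64>> {
    rows.windows(2)
        .map(|pair| {
            let (prev, next) = (pair[0].1, pair[1].1);
            if prev > 0.0 { Some(next / prev) } else { None }
        })
        .collect()
}
```

For the table above, `step_ratios(&SWEEP)` gives an undefined first step (from 0%), then 4, 2, about 2.6, 1, and about 1.8, which is the "roughly doubling, with one flat step" shape. Keeping the parsing in a pure function makes the hook testable without touching the process environment.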
The same harness found a real bug in Jelly48
To check the methodology generalizes, we ported the harness to Jelly48 — the soft-body 2048 you can play right now — with swipe-policy strategies instead of aim positions: corner discipline, input-mashing every 12 frames, gravity sloshing, lateral pendulum. 350 runs.
It flagged something the test suite never had: velocity spikes of 47–96 units/s — pieces visibly snapping when pinched in packed boards under brisk input — in ~5% of runs. Steady play never exceeds ~34. The existing regression test runs 25 seconds; the spikes first appear around sim-minute four.
Two things made this finding actionable within the hour:
- Determinism gave us an exact differential. Jelly48's spawn RNG is fixed, so replaying the same move sequence on the engine version last deployed versus head produced bit-identical output — proving the spikes were latent in production, not a regression from recent engine work.
- The cure already existed: a per-substep velocity cap built for Jellygon's pile ejections. Set to 40 (clear of the 34 normal-play ceiling, under the 47 pathology floor), the five worst seeds re-ran clean and a 60-run confirmation sweep showed zero flags. The fix is live in the game today.
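Both points above reduce to very little code. The sketch below shows one plausible shape for each, with assumptions: the post does not say whether the cap rescales the velocity vector or clamps components, so rescaling the magnitude is a guess, and the flat `f32` trace compared by `first_divergence` is a hypothetical recording format.

```rust
/// Cap value from the post: above the ~34 seen in steady play, below the
/// 47 where the pathological spikes start.
const MAX_SPEED: f32 = 40.0;

/// Rescales a velocity so its magnitude never exceeds `max_speed`,
/// preserving direction. Non-finite speeds fail the comparison and are
/// returned unchanged, leaving them for an oracle to flag.
fn cap_velocity(vel: [f32; 2], max_speed: f32) -> [f32; 2] {
    let speed = vel[0].hypot(vel[1]);
    if speed > max_speed {
        let k = max_speed / speed;
        [vel[0] * k, vel[1] * k]
    } else {
        vel
    }
}

/// Compares two recorded traces bit for bit and returns the index of the
/// first difference. A length mismatch counts as a difference at the end
/// of the shorter trace. `None` means the traces are bit-identical.
fn first_divergence(a: &[f32], b: &[f32]) -> Option<usize> {
    a.iter()
        .zip(b)
        .position(|(x, y)| x.to_bits() != y.to_bits())
        .or_else(|| (a.len() != b.len()).then(|| a.len().min(b.len())))
}
```

Comparing `to_bits()` rather than using `==` is deliberate: it distinguishes `0.0` from `-0.0` and treats identical NaN payloads as equal, which is what "bit-identical" means for a replay diff.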
Takeaways
- A win condition is a testable claim. If your game has a hardest achievement, write the bot that tries to do it. Ours had never been done when we shipped the animation for it.
- Bot win rate is a tuning instrument, not a difficulty target. Treat it as a reproducible lower bound and measure the curve, not a point: the shape (ours roughly doubles per 0.01 step) tells you how carefully to tune.
- Strategy diversity beats strategy quality. The human-like noisy bot won before the smarter deterministic ones; abuse strategies (input mashing, gravity sloshing) found the production bug. Both kinds earn their CPU time.
- Long-horizon oracles catch what regression tests can't. Our spike bug needed four sim-minutes to manifest; no 25-second test would ever see it. Full-game runs with invariant checks are cheap — ~2 s/game — so run thousands.
- Determinism is a superpower. Fixed seeds turned "did the engine change behavior?" into a bit-identical diff, and "is this finding real?" into a replayable artifact.
Jellygon ships on iOS soon. Its sibling — same engine, same bots watching over it — is free in your browser right now.
Play Jelly48 →