Parameter Golf — Experiment Lab
OpenAI Model Craft Challenge · Nathan Maine · March 19–25, 2026
Disclaimer — Unofficial & Independent
This dashboard is not affiliated with, endorsed by, or officially associated with OpenAI, RunPod, or the Parameter Golf competition organizers. It is solely an independent attempt by one participant to document, visualize, and analyze personal experiment data from a public open-source competition. All competition data referenced is derived from publicly available GitHub pull requests and the author's own runs. Interpretations of rules, rankings, and technique effectiveness reflect the author's personal understanding and may not represent official positions.
0 — Community Leaderboard (456 Submissions)
| Rank | PR# | Author | Submission Name | val_bpb | Size (MB) | <16MB | Date | Status |
|---|---|---|---|---|---|---|---|---|
How to read: CLOSED submissions were flagged by organizers (often illegal TTT). OPEN submissions are likely valid. MERGED submissions are official baselines or infrastructure PRs. Data sourced from public submission.json files in each PR. Filter to "Open Only" for the realistic leaderboard.
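For reference, here is a minimal sketch of how such a leaderboard could be assembled from the GitHub API, assuming each PR branch carries a submission.json at its root with val_bpb and size_mb fields. The repo slug, field names, and pagination below are placeholder assumptions, not the dashboard's actual pipeline.

```python
import requests

# Placeholder repo slug -- the real competition repository may differ.
REPO = "openai/parameter-golf"
API = f"https://api.github.com/repos/{REPO}"

def fetch_open_leaderboard(max_pages: int = 10):
    """Collect (pr_number, val_bpb, size_mb) from every OPEN pull request
    that exposes a submission.json on its head branch, then keep only
    entries under the 16 MB cap and sort by val_bpb (lower is better)."""
    rows = []
    for page in range(1, max_pages + 1):
        prs = requests.get(f"{API}/pulls",
                           params={"state": "open", "per_page": 100, "page": page}).json()
        if not prs:
            break
        for pr in prs:
            if not pr["head"]["repo"]:          # fork deleted -> nothing to fetch
                continue
            raw = (f"https://raw.githubusercontent.com/"
                   f"{pr['head']['repo']['full_name']}/{pr['head']['sha']}/submission.json")
            resp = requests.get(raw)
            if resp.status_code != 200:
                continue
            sub = resp.json()
            rows.append((pr["number"], sub.get("val_bpb"), sub.get("size_mb")))
    legal = [r for r in rows if r[1] is not None and r[2] is not None and r[2] < 16.0]
    return sorted(legal, key=lambda r: r[1])
```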
1a — All Submissions by Date (352 PRs)
Each dot is a submission. Color = BPB range. Hover for details. The competition launched March 18 — watch the BPB scores drop as competitors iterated.
1b — Top 20 Verified (Open + Under 16MB)
Only submissions that are OPEN and under 16MB. This is the realistic leaderboard — what would actually count if scoring closed today.
1c — Open vs Closed: The Illegal TTT Cliff
The story: Nearly everything below ~1.10 BPB was closed by organizers for using multi-epoch TTT (training on validation data). Issue #402 redrew the map — the real competition is above the cliff.
1d — BPB by Technique Category
Techniques extracted from submission names. Each box shows the distribution. TTT dominates the low end but most were banned. QAT + architectural tricks (U-Net, XSA, EMA) are the legal frontier.
2a — My BPB Score Progression
Reading the chart: Lower BPB is better. The green zone shows verified legal scores. The red zone below 1.10 contains submissions using illegal multi-epoch TTT — all closed by organizers. Our PR #406 at 1.1287 is top 5 on the verified leaderboard.
2b — My Experiment Log (46+ Runs)
| Run | Day | Date | Pod | Config / PR Base | val_bpb | step_ms | Steps | Size | <16MB | Legal | Submit? |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PR#273 | 2 | Mar 20 | Pod 1 (1×H100) | 10L, Int6 QAT, SmearGate, SWA | 1.1575 | — | — | ~15MB | Yes | Yes | Submitted |
| PR#385 | 2 | Mar 20 | Pod 1 (1×H100) | 11L, WD=0.04, SWA=0.4 | 1.1488 | — | — | ~15MB | Yes | Yes | Submitted |
| PR#406 | 3 | Mar 21 | Pod 1 (1×H100) | XSA4 + EMA + Int6 QAT | 1.1287 | 82 | 7,300 | ~15MB | Yes | Yes | Submitted ★ |
| Auto×80 | 4 | Mar 22 | Pod 1 (1×H100) | Autoresearch (sed corruption) | CRASHED | — | — | — | — | — | Failed |
| Auto-1 | 4 | Mar 22 | Pod 1 (1×H100) | Baseline (1×H100) | 1.4230 | ~550 | ~500 | ~15MB | Yes | Yes | Baseline |
| T1 | 5 | Mar 23 | Pod 2 (1×H100) | PR#406 baseline | 2.1809 | 572 | 525 | ~15MB | Yes | Yes | 1×H100 |
| T2 | 5 | Mar 23 | Pod 2 (1×H100) | + reduce-overhead compile | 2.1787 | 568 | 529 | ~15MB | Yes | Yes | 1×H100 |
| T3 | 5 | Mar 23 | Pod 2 (1×H100) | PR#505 full arch | 4.7651 | 945 | 318 | ~19MB | No | Yes | Too slow |
| T1b | 5 | Mar 23 | Pod 3 KC (742TF) | PR#406 baseline | 2.1809 | 572 | 525 | ~15MB | Yes | Yes | 1×H100 |
| T2b | 5 | Mar 23 | Pod 3 KC (742TF) | + reduce-overhead | 2.1787 | 568 | 529 | ~15MB | Yes | Yes | 1×H100 |
| T3b | 5 | Mar 23 | Pod 3 KC (742TF) | PR#505 full | 4.7651 | 945 | 318 | ~19MB | No | Yes | Too slow |
| T4 | 5 | Mar 23 | Pod 3 KC (742TF) | stride=32 eval | TIMEOUT | — | — | — | — | — | Failed |
| T5 | 5 | Mar 23 | Pod 3 KC (742TF) | 13 layers | TIMEOUT | — | — | — | — | — | Failed |
| ICE-1 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#505 no TTT | 1.1279 | 133 | 4,490 | 19.8MB | No | Yes | Over size |
| ICE-2 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#462 10ep TTT | 1.0689 | 72 | 8,278 | 19.2MB | No | Illegal | Banned |
| ICE-3 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#77 LoRA TTT | 1.1951 | 51 | 11,822 | 15.9MB | Yes | Yes | Poor score |
| ICE-4 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#462 KV=4 | 1.0723 | 69 | 8,744 | 17.5MB | No | Illegal | Banned |
| ICE-5 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#532 codebook+huffman | CRASHED | 112 | 5,368 | 14.6MB | Yes | — | Crash |
| ICE-7 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | KV=4 MLP=1536 | 1.0827 | 64 | 9,369 | 16.06MB | 60KB over | Illegal | Both |
| ICE-8 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | + BigramHash=7168 | 1.0842 | 64 | 9,396 | 16.07MB | Over | Illegal | Both |
| ICE-9 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | + FP16_EMBED=0 | 1.0820 | 64 | 9,385 | 16.18MB | Over | Illegal | Both |
| ICE-10 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | dim=496 (size fit) | 1.0935 | 70 | 8,578 | 15.37MB | Yes | Illegal | Illegal TTT |
| KC2-A | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 WD=0.05 | 1.1390 | ~133 | ~4,500 | 18.5MB | No | Yes | Over size |
| KC2-B | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#77 legal LoRA TTT | 1.2063 | ~50 | ~12K | ~15MB | Yes | Yes | Wrong arch |
| KC2-C | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 dim=496 no TTT | 1.1366 | ~130 | ~4,500 | 16.09MB | 93KB over | Yes | 93KB over! |
| KC2-D | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 full no TTT | 1.1217 | ~133 | ~4,500 | 19.8MB | No | Yes | Over size |
| KC2-S1 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 1337) | 1.1305 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse |
| KC2-S2 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 42) | 1.1309 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse |
| KC2-S3 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 7) | 1.1310 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse |
Note: 1×H100 runs (Days 2-5) show higher BPB because they get far fewer training steps in the time budget. Only 8×H100 results are competition-comparable. The 80+ autoresearch crashes are collapsed into one row — $43 lesson learned.
3 — Technique Effectiveness
Int6 QAT
Quantization-aware training. Table stakes — every top submission uses it.
REQUIRED · ~0.5MB model size reduction
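A minimal sketch of what int6 QAT can look like: symmetric per-tensor fake quantization with a straight-through estimator applied to weights in the forward pass. The exact recipe in top submissions (per-channel scales, rounding mode, which layers are quantized) may differ.

```python
import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor int6 fake quantization with a straight-through
    estimator: the forward pass sees quantized weights, the backward pass
    treats the rounding as if it never happened."""
    qmax = 2 ** (6 - 1) - 1                      # 31 for signed int6
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()                # straight-through estimator
```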
reduce-overhead compile
torch.compile mode. -25ms/step on slow GPUs, -4ms on fast. Free win.
PROVEN · -25ms/step (slow GPU)
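Enabling it is a one-liner; the model below is a stand-in, not the competition architecture.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).cuda()
# reduce-overhead uses CUDA graphs to trim per-step launch overhead;
# max-autotune crashed in our setup (CUDAGraph vs. rotary cache), so we stay here.
compiled = torch.compile(model, mode="reduce-overhead")
out = compiled(torch.randn(8, 512, device="cuda"))
```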
EMA 0.997
Exponential moving average of weights during training. Small but consistent improvement.
HELPFUL · small quality gain
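A sketch of weight EMA as we understand it: a shadow copy updated after each optimizer step with decay 0.997, used only for evaluation. Buffers (e.g., norm statistics) are not tracked here, and the competition code may fold this into the optimizer instead.

```python
import copy
import torch

class WeightEMA:
    """Exponential moving average of model parameters (decay 0.997).
    Train with the live model, evaluate with the shadow copy."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.997):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * live, called after every step
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```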
Star-ReLU (not SwiGLU)
ReLU² + learnable scale/bias. Used in top submissions. PR #505 title was misleading.
TOP-TIER · used in #1 no-TTT submission
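A sketch of the activation as commonly defined (ReLU squared with a learnable scale and bias); the initial values used in PR #505 are an assumption here.

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """Star-ReLU: scale * relu(x)**2 + bias, with scale and bias learned."""
    def __init__(self, scale: float = 1.0, bias: float = 0.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale))
        self.bias = nn.Parameter(torch.tensor(bias))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * torch.relu(x) ** 2 + self.bias
```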
Late QAT
Enable QAT when LR drops below 15% threshold. Better than constant QAT.
BETTER · vs constant QAT
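A sketch of the trigger, assuming "15% threshold" means 15% of the peak learning rate; the exact definition in the configs we borrowed from may differ.

```python
def qat_active(current_lr: float, peak_lr: float, threshold: float = 0.15) -> bool:
    """Late QAT trigger: train in full precision early, switch fake
    quantization on only once the LR has decayed below threshold * peak."""
    return current_lr <= threshold * peak_lr

# Inside the training loop (scheduler, peak_lr, enable_fake_quant are assumed names):
#   if qat_active(scheduler.get_last_lr()[0], peak_lr):
#       enable_fake_quant(model)
```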
Legal LoRA TTT (score-first)
Single-pass adaptation at eval. Score each token first, then train on it. Never sees the same data twice. PR #77 proved -0.037 BPB.
PROVEN · -0.037 BPB (needs lr=0.01)
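A sketch of the score-first loop. It assumes an already-built optimizer wrapping only the LoRA parameters (PR #77 reportedly needs lr=0.01) and a loss_fn(model, chunk) callable returning the next-token loss; both are stand-ins for whatever the PR actually does.

```python
import torch

def score_first_ttt(model, val_chunks, optimizer, loss_fn):
    """Legal single-pass TTT: every validation chunk is scored BEFORE the
    model trains on it, and no chunk is revisited, so no score ever
    benefits from data the model has already seen."""
    total_loss, total_tokens = 0.0, 0
    for chunk in val_chunks:                    # one ordered pass over the val set
        with torch.no_grad():                   # 1) score with the current weights
            total_loss += loss_fn(model, chunk).item() * chunk.numel()
            total_tokens += chunk.numel()
        optimizer.zero_grad(set_to_none=True)   # 2) only now adapt on that chunk
        loss_fn(model, chunk).backward()
        optimizer.step()
    return total_loss / total_tokens            # mean per-token loss (convert to BPB downstream)
```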
XSA4 (Extended Self-Attention)
Adds quality but costs +17ms/step overhead. Net effect marginal.
MIXED · +17ms/step overhead
U-Net Skip Connections
5 encoder + 6 decoder layers with learned gates. Helps quality but adds model size.
MIXED · quality ↑ but size ↑
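A sketch of a single gated skip; the real architecture wires 5 encoder outputs into 6 decoder layers, and the gate parameterization below (a scalar initialized at zero) is an assumption.

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    """U-Net-style skip: add the saved encoder activation to the decoder
    stream, scaled by a learned scalar gate that starts at zero."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, decoder_x: torch.Tensor, encoder_x: torch.Tensor) -> torch.Tensor:
        return decoder_x + self.gate * encoder_x
```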
BigramHash
Hash-based bigram embedding. Minimal measurable impact on quality in our tests.
MINIMAL · no clear benefit
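A sketch of a hash-based bigram embedding; the 7168-slot table matches the ICE-8 config above, but the hash function itself is an arbitrary stand-in.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash the (previous, current) token pair into a small table and add
    the looked-up vector to the regular token embedding."""
    def __init__(self, table_size: int = 7168, dim: int = 512):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)
        self.table_size = table_size

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        prev = torch.roll(tokens, shifts=1, dims=-1)
        prev[..., 0] = 0                                    # no bigram at position 0
        idx = (prev * 1000003 + tokens) % self.table_size   # cheap multiplicative hash
        return self.table(idx)
```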
Custom Kernels (THOR, REAPER...)
Fused Triton/CUDA operations. 1.17×–1.70× speedup but only during eval, not training.
EVAL ONLY · doesn't speed training
Multi-epoch TTT
10-epoch AdamW on validation data. -0.059 BPB but BANNED by organizers (Issue #402).
BANNED · all sub-1.10 submissions closed
torch.compile max-autotune
CUDAGraph conflicts with rotary cache. Crashes every time. Use reduce-overhead instead.
CRASHES · CUDAGraph conflicts
4 — Cost & Efficiency
Pod 1 — Days 2–4 (1×H100)
$90
Includes $43 overnight waste
Pod 2 — Day 5 (1×H100)
$7
3 targeted tests
Pod 3 KC — Day 5 (1×H100)
$15
742 TFLOPS — best single GPU
Pod 4 Iceland — Day 6 (8×H100)
$65
10 experiments — most productive session
Pod 5 KC2 — Day 6 (8×H100)
$35
7 runs including 3-seed verification
Overnight Waste
$43
80+ crashed autoresearch experiments
5 — Pod Performance Comparison
| Pod | Location | GPU Config | GEMM (TFLOPS) | Step ms (same code) | Cost/hr | Verdict |
|---|---|---|---|---|---|---|
| Pod 1 | Unknown | 1×H100 SXM | Not measured | 498–592 ms | ~$2.69 | Variable |
| Pod 2 | Unknown | 1×H100 SXM | Not measured | 553–578 ms | ~$2.69 | Consistent |
| Pod 3 | Kansas City, MO | 1×H100 SXM | 742 TFLOPS | 568–572 ms | $2.69 | Exceptional |
| Pod 4 | Reykjavík, Iceland | 8×H100 SXM | 733 TFLOPS | 45–133 ms | $21.52 | Top-tier |
| Pod 5 | Kansas City (KC2) | 8×H100 SXM | ~730 est | 60–133 ms | ~$21.52 | Solid |
Key insight: GPU quality varies significantly even within the same GPU model. Pod 3 hit 742 TFLOPS, well above the ~275 TFLOPS reference. This is why we built the pod benchmarking tool.
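The benchmarking tool boils down to timing a large matmul; here is a minimal version (matrix size, dtype, and iteration count are arbitrary choices, not the tool's actual defaults).

```python
import time
import torch

def gemm_tflops(n: int = 8192, iters: int = 50, dtype=torch.bfloat16) -> float:
    """Measure achieved GEMM throughput on the current GPU in TFLOPS."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):                           # warm-up
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    return 2 * n ** 3 * iters / elapsed / 1e12   # 2*n^3 FLOPs per n x n matmul
```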
6 — Key Discoveries
Day 2 — March 20
"Step speed is the #1 bottleneck"
Every millisecond per step = fewer total training steps = worse BPB. Discovered after 17 experiments showed speed beats depth.
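The arithmetic: with a fixed wall-clock budget, total steps scale inversely with step time. A hypothetical 10-minute budget (the real budget is not confirmed here) roughly reproduces the step counts in the experiment log above.

```python
# Hypothetical 10-minute budget; step times taken from the experiment log.
budget_ms = 10 * 60 * 1000
for step_ms in (82, 133, 572):
    print(f"{step_ms} ms/step -> {budget_ms // step_ms} steps")
# 82 ms/step  -> 7317 steps   (cf. PR#406: 7,300 steps)
# 133 ms/step -> 4511 steps   (cf. ICE-1: 4,490 steps)
# 572 ms/step -> 1048 steps
```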
Day 3 — March 21
"XSA4 adds quality but costs speed"
Extended Self-Attention improved per-step quality but +17ms/step nearly negated the gain. The tradeoff warning sign.
Day 4 — March 22
"Autoresearch overnight = $43 lesson in automation"
80+ experiments crashed due to sed corruption, OOM errors, and eval timeouts. Led to: "use bash scripts, not Claude Code, for automated loops."
Day 5 — March 23
"reduce-overhead = free -25ms/step"
torch.compile reduce-overhead mode confirmed working where max-autotune crashes. -25ms on slow GPUs, -4ms on fast ones. Zero cost.
Day 5 — March 23
"GPU quality matters more than compile mode"
Pod 3 at 742 TFLOPS showed reduce-overhead only saved 4ms vs 25ms on slower hardware. Pod selection is the real optimization.
Day 6 — March 24
"Half the leaderboard was using illegal TTT"
Issue #402 revealed multi-epoch TTT is banned. Every sub-1.10 submission was closed. Our #9 became top 5 on verified rankings overnight.
Day 6 — March 24
"Our PR #406 is actually top 5 verified"
1.1287 BPB beats every submission except verified legal TTT entries around 1.12. We were closer to winning than we thought.
Day 6 — March 24
"Custom kernels help eval, not training"
Deep-read of PR #376's kernel library revealed all fused operations only accelerate evaluation. Training path uses standard torch.compile.
Day 6 — March 24
"93KB over the size limit = so close yet so far"
Run KC2-C: PR#505 dim=496 produced 16.09MB — just 93KB over the 16MB limit. Best legal no-TTT architecture but can't submit it.