Parameter Golf — Experiment Lab

OpenAI Model Craft Challenge · Nathan Maine · March 19–25, 2026

Disclaimer — Unofficial & Independent This dashboard is not affiliated with, endorsed by, or officially associated with OpenAI, RunPod, or the Parameter Golf competition organizers. It is solely an independent attempt by one participant to document, visualize, and analyze personal experiment data from a public open-source competition. All competition data referenced is derived from publicly available GitHub pull requests and the author's own runs. Interpretations of rules, rankings, and technique effectiveness reflect the author's personal understanding and may not represent official positions.
352
Total Submissions
275
Open (Valid)
61
Closed (Invalid)
263
Under 16MB
36
Days Left
46+
My Experiments
0 — Community Leaderboard (352 Submissions)
352 shown
Rank PR# Author Submission Name val_bpb Size (MB) <16MB Date Status
How to read: CLOSED submissions were flagged by organizers (often illegal TTT). OPEN submissions are likely valid. MERGED submissions are official baselines or infrastructure PRs. Data sourced from public submission.json files in each PR. Filter to "Open Only" for the realistic leaderboard.
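The scraping step described above can be sketched as a small parser. This is an illustrative sketch only: the field names (`name`, `val_bpb`, `size_mb`) are assumptions inferred from the leaderboard columns, not the official submission.json schema.

```python
import json

def parse_submission(raw: str) -> dict:
    """Parse one submission.json blob pulled from a PR.

    Field names here are guesses based on the dashboard columns;
    the real schema may differ.
    """
    data = json.loads(raw)
    size_mb = float(data["size_mb"])
    return {
        "name": data.get("name", "unknown"),
        "val_bpb": float(data["val_bpb"]),
        "size_mb": size_mb,
        "under_16mb": size_mb < 16.0,  # the competition size gate
    }
```

Sorting the parsed entries by `val_bpb` ascending, after filtering to open PRs, reproduces the "realistic leaderboard" view.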
1a — All Submissions by Date (352 PRs)
Each dot is a submission. Color = BPB range. Hover for details. The competition launched March 18 — watch the BPB scores drop as competitors iterated.
1b — Top 20 Verified (Open + Under 16MB)
Only submissions that are OPEN and under 16MB. This is the realistic leaderboard — what would actually count if scoring closed today.
1c — Open vs Closed: The Illegal TTT Cliff
The story: Nearly everything below ~1.10 BPB was closed by organizers for using multi-epoch TTT (training on validation data). Issue #402 redrew the map — the real competition is above the cliff.
1d — BPB by Technique Category
Techniques extracted from submission names. Each box shows the distribution. TTT dominates the low end but most were banned. QAT + architectural tricks (U-Net, XSA, EMA) are the legal frontier.
2a — My BPB Score Progression
Reading the chart: Lower BPB is better. The green zone shows verified legal scores. The red zone below 1.10 contains submissions using illegal multi-epoch TTT — all closed by organizers. Our PR #406 at 1.1287 is top 5 on the verified leaderboard.
2b — My Experiment Log (46+ Runs)
Run | Day | Date | Pod | Config / PR Base | val_bpb | step_ms | Steps | Size | <16MB | Legal | Submit?
PR#273 | 2 | Mar 20 | Pod 1 (1×H100) | 10L, Int6 QAT, SmearGate, SWA | 1.1575 | — | — | ~15MB | Yes | Yes | Submitted
PR#385 | 2 | Mar 20 | Pod 1 (1×H100) | 11L, WD=0.04, SWA=0.4 | 1.1488 | — | — | ~15MB | Yes | Yes | Submitted
PR#406 | 3 | Mar 21 | Pod 1 (1×H100) | XSA4 + EMA + Int6 QAT | 1.1287 | 82 | 7,300 | ~15MB | Yes | Yes | Submitted ★
Auto×80 | 4 | Mar 22 | Pod 1 (1×H100) | Autoresearch (sed corruption) | CRASHED | — | — | — | — | — | Failed
Auto-1 | 4 | Mar 22 | Pod 1 (1×H100) | Baseline (1×H100) | 1.4230 | ~550 | ~500 | ~15MB | Yes | Yes | Baseline
T1 | 5 | Mar 23 | Pod 2 (1×H100) | PR#406 baseline | 2.1809 | 572 | 525 | ~15MB | Yes | Yes | 1×H100
T2 | 5 | Mar 23 | Pod 2 (1×H100) | + reduce-overhead compile | 2.1787 | 568 | 529 | ~15MB | Yes | Yes | 1×H100
T3 | 5 | Mar 23 | Pod 2 (1×H100) | PR#505 full arch | 4.7651 | 945 | 318 | ~19MB | No | Yes | Too slow
T1b | 5 | Mar 23 | Pod 3 KC (742TF) | PR#406 baseline | 2.1809 | 572 | 525 | ~15MB | Yes | Yes | 1×H100
T2b | 5 | Mar 23 | Pod 3 KC (742TF) | + reduce-overhead | 2.1787 | 568 | 529 | ~15MB | Yes | Yes | 1×H100
T3b | 5 | Mar 23 | Pod 3 KC (742TF) | PR#505 full | 4.7651 | 945 | 318 | ~19MB | No | Yes | Too slow
T4 | 5 | Mar 23 | Pod 3 KC (742TF) | stride=32 eval | TIMEOUT | — | — | — | — | — | Failed
T5 | 5 | Mar 23 | Pod 3 KC (742TF) | 13 layers | TIMEOUT | — | — | — | — | — | Failed
ICE-1 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#505 no TTT | 1.1279 | 133 | 4,490 | 19.8MB | No | Yes | Over size
ICE-2 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#462 10ep TTT | 1.0689 | 72 | 8,278 | 19.2MB | No | Illegal | Banned
ICE-3 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#77 LoRA TTT | 1.1951 | 51 | 11,822 | 15.9MB | Yes | Yes | Poor score
ICE-4 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#462 KV=4 | 1.0723 | 69 | 8,744 | 17.5MB | No | Illegal | Banned
ICE-5 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#532 codebook+huffman | CRASHED | 112 | 5,368 | 14.6MB | Yes | — | Crash
ICE-7 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | KV=4 MLP=1536 | 1.0827 | 64 | 9,369 | 16.06MB | 60KB over | Illegal | Both
ICE-8 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | + BigramHash=7168 | 1.0842 | 64 | 9,396 | 16.07MB | Over | Illegal | Both
ICE-9 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | + FP16_EMBED=0 | 1.0820 | 64 | 9,385 | 16.18MB | Over | Illegal | Both
ICE-10 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | dim=496 (size fit) | 1.0935 | 70 | 8,578 | 15.37MB | Yes | Illegal | Illegal TTT
KC2-A | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 WD=0.05 | 1.1390 | ~133 | ~4,500 | 18.5MB | No | Yes | Over size
KC2-B | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#77 legal LoRA TTT | 1.2063 | ~50 | ~12K | ~15MB | Yes | Yes | Wrong arch
KC2-C | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 dim=496 no TTT | 1.1366 | ~130 | ~4,500 | 16.09MB | 93KB over | Yes | 93KB over!
KC2-D | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 full no TTT | 1.1217 | ~133 | ~4,500 | 19.8MB | No | Yes | Over size
KC2-S1 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 1337) | 1.1305 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse
KC2-S2 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 42) | 1.1309 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse
KC2-S3 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 7) | 1.1310 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse
Note: 1×H100 runs (Days 2-5) show higher BPB because they get far fewer training steps in the time budget. Only 8×H100 results are competition-comparable. The 80+ autoresearch crashes are collapsed into one row — $43 lesson learned.
3 — Technique Effectiveness
Int6 QAT
Quantization-aware training. Table stakes — every top submission uses it.
REQUIRED · ~0.5MB model size reduction
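As a rough illustration of what int6 QAT does in the forward pass (a generic sketch, not the competition codebase's actual implementation):

```python
def fake_quant_int6(w: float, scale: float) -> float:
    """Symmetric int6 fake-quantization of one weight.

    The forward pass snaps each weight to one of 2^6 = 64 levels
    (clamped to the signed range [-32, 31]); the backward pass treats
    this as identity (straight-through estimator), so training learns
    weights that survive 6-bit rounding at export time.
    """
    q = round(w / scale)
    q = max(-32, min(31, q))  # int6 signed range
    return q * scale
```

The ~0.5MB size saving comes from storing 6-bit codes plus per-tensor scales instead of 16-bit floats at checkpoint time.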
reduce-overhead compile
torch.compile mode. -25ms/step on slow GPUs, -4ms on fast. Free win.
PROVEN · -25ms/step (slow GPU)
EMA 0.997
Exponential moving average of weights during training. Small but consistent improvement.
HELPFUL · small quality gain
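The update rule is simple enough to show inline; this is a generic sketch over plain lists, standing in for the per-tensor update a real training loop would run:

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current.

    Evaluating with the EMA copy instead of the raw weights smooths
    late-training noise; decay=0.997 matches the setting named above.
    """
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]
```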
Star-ReLU (not SwiGLU)
ReLU² + learnable scale/bias. Used in top submissions. PR #505 title was misleading.
TOP-TIER · used in #1 no-TTT submission
Late QAT
Enable QAT when LR drops below 15% threshold. Better than constant QAT.
BETTER · vs constant QAT
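The trigger can be expressed as a one-line gate; the function name and exact comparison are illustrative, only the 15%-of-peak threshold comes from the note above:

```python
def qat_active(lr_now: float, lr_peak: float, threshold: float = 0.15) -> bool:
    """Late QAT gate: train in full precision while the LR is still high,
    then switch fake-quantization on once the schedule has decayed below
    `threshold` of the peak LR (15% per the note above)."""
    return lr_now < threshold * lr_peak
```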
Legal LoRA TTT (score-first)
Single-pass adaptation at eval. Score token, then train. Never see same data twice. PR #77 proved -0.037 BPB.
PROVEN · -0.037 BPB (needs lr=0.01)
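The legality hinges entirely on loop order, which can be sketched generically. `score_fn` and `update_fn` are placeholders standing in for the real model's loss and LoRA step, not functions from the competition repo:

```python
def score_first_ttt(chunks, score_fn, update_fn):
    """Score-first test-time training loop.

    Each eval chunk is scored with the *current* weights BEFORE the
    model adapts on it, and every chunk is visited exactly once, so no
    recorded score ever reflects training on that chunk's own data.
    """
    total_bits, total_units = 0.0, 0
    for chunk in chunks:
        bits, n = score_fn(chunk)   # score before adapting on this chunk
        total_bits += bits
        total_units += n
        update_fn(chunk)            # one adaptation pass, then move on
    return total_bits / total_units  # aggregate bits per unit (e.g. BPB)
```

The banned variant differs exactly here: multi-epoch TTT revisits chunks after having trained on them, so scores reflect memorization of the validation data.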
XSA4 (Extended Self-Attention)
Adds quality but costs +17ms/step overhead. Net effect marginal.
MIXED · +17ms/step overhead
U-Net Skip Connections
5 encoder + 6 decoder with learned gates. Helps quality but adds model size.
MIXED · quality ↑ but size ↑
BigramHash
Hash-based bigram embedding. Minimal measurable impact on quality in our tests.
MINIMAL · no clear benefit
Custom Kernels (THOR, REAPER...)
Fused Triton/CUDA operations. 1.17×–1.70× speedup but only during eval, not training.
EVAL ONLY · doesn't speed training
Multi-epoch TTT
10-epoch AdamW on validation data. -0.059 BPB but BANNED by organizers (Issue #402).
BANNED · all sub-1.10 submissions closed
torch.compile max-autotune
CUDAGraph conflicts with rotary cache. Crashes every time. Use reduce-overhead instead.
CRASHES · CUDAGraph conflicts
4 — Cost & Efficiency
Pod 1 — Days 2–4 (1×H100)
$90
Includes $43 overnight waste
Pod 2 — Day 5 (1×H100)
$7
3 targeted tests
Pod 3 KC — Day 5 (1×H100)
$15
742 TFLOPS — best single GPU
Pod 4 Iceland — Day 6 (8×H100)
$65
10 experiments — most productive session
Pod 5 KC2 — Day 6 (8×H100)
$35
7 runs including 3-seed verification
Overnight Waste
$43
80+ crashed autoresearch experiments
$5.57
Cost per successful experiment
$8,889
Cost per BPB point gained
0.0288
BPB improvement (total)
56%
Experiments with no usable result
5 — Pod Performance Comparison
Pod | Location | GPU Config | GEMM (TFLOPS) | Step ms (same code) | Cost/hr | Verdict
Pod 1 | Unknown | 1×H100 SXM | Not measured | 498–592 ms | ~$2.69 | Variable
Pod 2 | Unknown | 1×H100 SXM | Not measured | 553–578 ms | ~$2.69 | Consistent
Pod 3 | Kansas City, MO | 1×H100 SXM | 742 | 568–572 ms | $2.69 | Exceptional
Pod 4 | Reykjavík, Iceland | 8×H100 SXM | 733 | 45–133 ms | $21.52 | Top-tier
Pod 5 | Kansas City (KC2) | 8×H100 SXM | ~730 (est.) | 60–133 ms | ~$21.52 | Solid
Key insight: GPU quality varies significantly even within the same GPU model. Pod 3 hit 742 TFLOPS — well above the ~275 TFLOPS reference. This is why we built the pod benchmarking tool.
6 — Key Discoveries
Day 2 — March 20
"Step speed is the #1 bottleneck"
Every millisecond per step = fewer total training steps = worse BPB. Discovered after 17 experiments showed speed beats depth.
Day 3 — March 21
"XSA4 adds quality but costs speed"
Extended Self-Attention improved per-step quality but +17ms/step nearly negated the gain. The tradeoff warning sign.
Day 4 — March 22
"Autoresearch overnight = $43 lesson in automation"
80+ experiments crashed due to sed corruption, OOM errors, and eval timeouts. Led to: "use bash scripts, not Claude Code, for automated loops."
Day 5 — March 23
"reduce-overhead = free -25ms/step"
torch.compile reduce-overhead mode confirmed working where max-autotune crashes. -25ms on slow GPUs, -4ms on fast ones. Zero cost.
Day 5 — March 23
"GPU quality matters more than compile mode"
Pod 3 at 742 TFLOPS showed reduce-overhead only saved 4ms vs 25ms on slower hardware. Pod selection is the real optimization.
Day 6 — March 24
"Half the leaderboard was using illegal TTT"
Issue #402 revealed multi-epoch TTT is banned. Every sub-1.10 submission was closed. Our #9 became top 5 on verified rankings overnight.
Day 6 — March 24
"Our PR #406 is actually top 5 verified"
1.1287 BPB beats every submission except verified legal TTT entries around 1.12. We were closer to winning than we thought.
Day 6 — March 24
"Custom kernels help eval, not training"
Deep-read of PR #376's kernel library revealed all fused operations only accelerate evaluation. Training path uses standard torch.compile.
Day 6 — March 24
"93KB over the size limit = so close yet so far"
Run KC2-C: PR#505 dim=496 produced 16.09MB — just 93KB over the 16MB limit. Best legal no-TTT architecture but can't submit it.