Parameter Golf — Experiment Lab

OpenAI Model Craft Challenge · Nathan Maine · March 19–25, 2026

Disclaimer — Unofficial & Independent This dashboard is not affiliated with, endorsed by, or officially associated with OpenAI, RunPod, or the Parameter Golf competition organizers. It is solely an independent attempt by one participant to document, visualize, and analyze personal experiment data from a public open-source competition. All competition data referenced is derived from publicly available GitHub pull requests and the author's own runs. Interpretations of rules, rankings, and technique effectiveness reflect the author's personal understanding and may not represent official positions.
352
Total Submissions
275
Open (Valid)
61
Closed (Invalid)
263
Under 16MB
36
Days Left
46+
My Experiments
0 — Community Leaderboard (352 Submissions)
352 shown
Rank PR# Author Submission Name val_bpb Size (MB) <16MB Date Status
How to read: CLOSED submissions were flagged by organizers (often illegal TTT). OPEN submissions are likely valid. MERGED submissions are official baselines or infrastructure PRs. Data sourced from public submission.json files in each PR. Filter to "Open Only" for the realistic leaderboard.
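The scraping step described above can be sketched as a small parser. This is an illustrative sketch only: the field names (`name`, `val_bpb`, `size_mb`) are assumptions inferred from the leaderboard columns, not the official submission.json schema.

```python
import json

def parse_submission(raw: str) -> dict:
    """Parse one submission.json blob pulled from a PR.

    Field names here are guesses based on the dashboard columns;
    the real schema may differ.
    """
    data = json.loads(raw)
    size_mb = float(data["size_mb"])
    return {
        "name": data.get("name", "unknown"),
        "val_bpb": float(data["val_bpb"]),
        "size_mb": size_mb,
        "under_16mb": size_mb < 16.0,  # the competition size gate
    }
```

Sorting the parsed entries by `val_bpb` ascending, after filtering to open PRs, reproduces the "realistic leaderboard" view.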
1a — All Submissions by Date (352 PRs)
Each dot is a submission. Color = BPB range. Hover for details. The competition launched March 18 — watch the BPB scores drop as competitors iterated.
1b — Top 20 Verified (Open + Under 16MB)
Only submissions that are OPEN and under 16MB. This is the realistic leaderboard — what would actually count if scoring closed today.
1c — Open vs Closed: The Illegal TTT Cliff
The story: Nearly everything below ~1.10 BPB was closed by organizers for using multi-epoch TTT (training on validation data). Issue #402 redrew the map — the real competition is above the cliff.
1d — BPB by Technique Category
Techniques extracted from submission names. Each box shows the distribution. TTT dominates the low end but most were banned. QAT + architectural tricks (U-Net, XSA, EMA) are the legal frontier.
2a — My BPB Score Progression
Reading the chart: Lower BPB is better. The green zone shows verified legal scores. The red zone below 1.10 contains submissions using illegal multi-epoch TTT — all closed by organizers. Our PR #406 at 1.1287 is top 5 on the verified leaderboard.
2b — My Experiment Log (46+ Runs)
Run | Day | Date | Pod | Config / PR Base | val_bpb | step_ms | Steps | Size | <16MB | Legal | Submit?
PR#273 | 2 | Mar 20 | Pod 1 (1×H100) | 10L, Int6 QAT, SmearGate, SWA | 1.1575 | — | — | ~15MB | Yes | Yes | Submitted
PR#385 | 2 | Mar 20 | Pod 1 (1×H100) | 11L, WD=0.04, SWA=0.4 | 1.1488 | — | — | ~15MB | Yes | Yes | Submitted
PR#406 | 3 | Mar 21 | Pod 1 (1×H100) | XSA4 + EMA + Int6 QAT | 1.1287 | 82 | 7,300 | ~15MB | Yes | Yes | Submitted ★
Auto×80 | 4 | Mar 22 | Pod 1 (1×H100) | Autoresearch (sed corruption) | CRASHED | — | — | — | — | — | Failed
Auto-1 | 4 | Mar 22 | Pod 1 (1×H100) | Baseline (1×H100) | 1.4230 | ~550 | ~500 | ~15MB | Yes | Yes | Baseline
T1 | 5 | Mar 23 | Pod 2 (1×H100) | PR#406 baseline | 2.1809 | 572 | 525 | ~15MB | Yes | Yes | 1×H100
T2 | 5 | Mar 23 | Pod 2 (1×H100) | + reduce-overhead compile | 2.1787 | 568 | 529 | ~15MB | Yes | Yes | 1×H100
T3 | 5 | Mar 23 | Pod 2 (1×H100) | PR#505 full arch | 4.7651 | 945 | 318 | ~19MB | No | Yes | Too slow
T1b | 5 | Mar 23 | Pod 3 KC (742TF) | PR#406 baseline | 2.1809 | 572 | 525 | ~15MB | Yes | Yes | 1×H100
T2b | 5 | Mar 23 | Pod 3 KC (742TF) | + reduce-overhead | 2.1787 | 568 | 529 | ~15MB | Yes | Yes | 1×H100
T3b | 5 | Mar 23 | Pod 3 KC (742TF) | PR#505 full | 4.7651 | 945 | 318 | ~19MB | No | Yes | Too slow
T4 | 5 | Mar 23 | Pod 3 KC (742TF) | stride=32 eval | TIMEOUT | — | — | — | — | — | Failed
T5 | 5 | Mar 23 | Pod 3 KC (742TF) | 13 layers | TIMEOUT | — | — | — | — | — | Failed
ICE-1 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#505 no TTT | 1.1279 | 133 | 4,490 | 19.8MB | No | Yes | Over size
ICE-2 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#462 10ep TTT | 1.0689 | 72 | 8,278 | 19.2MB | No | Illegal | Banned
ICE-3 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#77 LoRA TTT | 1.1951 | 51 | 11,822 | 15.9MB | Yes | Yes | Poor score
ICE-4 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#462 KV=4 | 1.0723 | 69 | 8,744 | 17.5MB | No | Illegal | Banned
ICE-5 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#532 codebook+huffman | CRASHED | 112 | 5,368 | 14.6MB | Yes | — | Crash
ICE-7 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | KV=4 MLP=1536 | 1.0827 | 64 | 9,369 | 16.06MB | 60KB over | Illegal | Both
ICE-8 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | + BigramHash=7168 | 1.0842 | 64 | 9,396 | 16.07MB | Over | Illegal | Both
ICE-9 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | + FP16_EMBED=0 | 1.0820 | 64 | 9,385 | 16.18MB | Over | Illegal | Both
ICE-10 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | dim=496 (size fit) | 1.0935 | 70 | 8,578 | 15.37MB | Yes | Illegal | Illegal TTT
KC2-A | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 WD=0.05 | 1.1390 | ~133 | ~4,500 | 18.5MB | No | Yes | Over size
KC2-B | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#77 legal LoRA TTT | 1.2063 | ~50 | ~12K | ~15MB | Yes | Yes | Wrong arch
KC2-C | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 dim=496 no TTT | 1.1366 | ~130 | ~4,500 | 16.09MB | 93KB over | Yes | 93KB over!
KC2-D | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 full no TTT | 1.1217 | ~133 | ~4,500 | 19.8MB | No | Yes | Over size
KC2-S1 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 1337) | 1.1305 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse
KC2-S2 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 42) | 1.1309 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse
KC2-S3 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 7) | 1.1310 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse
Note: 1×H100 runs (Days 2-5) show higher BPB because they get far fewer training steps in the time budget. Only 8×H100 results are competition-comparable. The 80+ autoresearch crashes are collapsed into one row — $43 lesson learned.
3 — Technique Effectiveness
Int6 QAT
Quantization-aware training. Table stakes — every top submission uses it.
REQUIRED · ~0.5MB model size reduction
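As a rough illustration of what int6 QAT does in the forward pass (a generic sketch, not the competition codebase's actual implementation):

```python
def fake_quant_int6(w: float, scale: float) -> float:
    """Symmetric int6 fake-quantization of one weight.

    The forward pass snaps each weight to one of 2^6 = 64 levels
    (clamped to the signed range [-32, 31]); the backward pass treats
    this as identity (straight-through estimator), so training learns
    weights that survive 6-bit rounding at export time.
    """
    q = round(w / scale)
    q = max(-32, min(31, q))  # int6 signed range
    return q * scale
```

The ~0.5MB size saving comes from storing 6-bit codes plus per-tensor scales instead of 16-bit floats at checkpoint time.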
reduce-overhead compile
torch.compile mode. -25ms/step on slow GPUs, -4ms on fast. Free win.
PROVEN · -25ms/step (slow GPU)
EMA 0.997
Exponential moving average of weights during training. Small but consistent improvement.
HELPFUL · small quality gain
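The update rule is simple enough to show inline; this is a generic sketch over plain lists, standing in for the per-tensor update a real training loop would run:

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current.

    Evaluating with the EMA copy instead of the raw weights smooths
    late-training noise; decay=0.997 matches the setting named above.
    """
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]
```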
Star-ReLU (not SwiGLU)
ReLU² + learnable scale/bias. Used in top submissions. PR #505 title was misleading.
TOP-TIER · used in #1 no-TTT submission
Late QAT
Enable QAT when LR drops below 15% threshold. Better than constant QAT.
BETTER · vs constant QAT
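The trigger can be expressed as a one-line gate; the function name and exact comparison are illustrative, only the 15%-of-peak threshold comes from the note above:

```python
def qat_active(lr_now: float, lr_peak: float, threshold: float = 0.15) -> bool:
    """Late QAT gate: train in full precision while the LR is still high,
    then switch fake-quantization on once the schedule has decayed below
    `threshold` of the peak LR (15% per the note above)."""
    return lr_now < threshold * lr_peak
```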
Legal LoRA TTT (score-first)
Single-pass adaptation at eval. Score token, then train. Never see same data twice. PR #77 proved -0.037 BPB.
PROVEN · -0.037 BPB (needs lr=0.01)
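The legality hinges entirely on loop order, which can be sketched generically. `score_fn` and `update_fn` are placeholders standing in for the real model's loss and LoRA step, not functions from the competition repo:

```python
def score_first_ttt(chunks, score_fn, update_fn):
    """Score-first test-time training loop.

    Each eval chunk is scored with the *current* weights BEFORE the
    model adapts on it, and every chunk is visited exactly once, so no
    recorded score ever reflects training on that chunk's own data.
    """
    total_bits, total_units = 0.0, 0
    for chunk in chunks:
        bits, n = score_fn(chunk)   # score before adapting on this chunk
        total_bits += bits
        total_units += n
        update_fn(chunk)            # one adaptation pass, then move on
    return total_bits / total_units  # aggregate bits per unit (e.g. BPB)
```

The banned variant differs exactly here: multi-epoch TTT revisits chunks after having trained on them, so scores reflect memorization of the validation data.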
XSA4 (Extended Self-Attention)
Adds quality but costs +17ms/step overhead. Net effect marginal.
MIXED · +17ms/step overhead
U-Net Skip Connections
5 encoder + 6 decoder with learned gates. Helps quality but adds model size.
MIXED · quality ↑ but size ↑
BigramHash
Hash-based bigram embedding. Minimal measurable impact on quality in our tests.
MINIMAL · no clear benefit
Custom Kernels (THOR, REAPER...)
Fused Triton/CUDA operations. 1.17×–1.70× speedup but only during eval, not training.
EVAL ONLY · doesn't speed training
Multi-epoch TTT
10-epoch AdamW on validation data. -0.059 BPB but BANNED by organizers (Issue #402).
BANNED · all sub-1.10 submissions closed
torch.compile max-autotune
CUDAGraph conflicts with rotary cache. Crashes every time. Use reduce-overhead instead.
CRASHES · CUDAGraph conflicts
4 — Cost & Efficiency
Pod 1 — Days 2–4 (1×H100)
$90
Includes $43 overnight waste
Pod 2 — Day 5 (1×H100)
$7
3 targeted tests
Pod 3 KC — Day 5 (1×H100)
$15
742 TFLOPS — best single GPU
Pod 4 Iceland — Day 6 (8×H100)
$65
10 experiments — most productive session
Pod 5 KC2 — Day 6 (8×H100)
$35
7 runs including 3-seed verification
Overnight Waste
$43
80+ crashed autoresearch experiments
$5.57
Cost per successful experiment
$8,889
Cost per BPB point gained
0.0288
BPB improvement (total)
56%
Experiments with no usable result
5 — Pod Performance Comparison
Pod | Location | GPU Config | GEMM (TFLOPS) | Step ms (same code) | Cost/hr | Verdict
Pod 1 | Unknown | 1×H100 SXM | Not measured | 498–592 ms | ~$2.69 | Variable
Pod 2 | Unknown | 1×H100 SXM | Not measured | 553–578 ms | ~$2.69 | Consistent
Pod 3 | Kansas City, MO | 1×H100 SXM | 742 | 568–572 ms | $2.69 | Exceptional
Pod 4 | Reykjavík, Iceland | 8×H100 SXM | 733 | 45–133 ms | $21.52 | Top-tier
Pod 5 | Kansas City (KC2) | 8×H100 SXM | ~730 (est.) | 60–133 ms | ~$21.52 | Solid
Key insight: GPU quality varies significantly even within the same GPU model. Pod 3 hit 742 TFLOPS — well above the ~275 TFLOPS reference. This is why we built the pod benchmarking tool.
6 — Key Discoveries
Day 2 — March 20
"Step speed is the #1 bottleneck"
Every millisecond per step = fewer total training steps = worse BPB. Discovered after 17 experiments showed speed beats depth.
Day 3 — March 21
"XSA4 adds quality but costs speed"
Extended Self-Attention improved per-step quality but +17ms/step nearly negated the gain. The tradeoff warning sign.
Day 4 — March 22
"Autoresearch overnight = $43 lesson in automation"
80+ experiments crashed due to sed corruption, OOM errors, and eval timeouts. Led to: "use bash scripts, not Claude Code, for automated loops."
Day 5 — March 23
"reduce-overhead = free -25ms/step"
torch.compile reduce-overhead mode confirmed working where max-autotune crashes. -25ms on slow GPUs, -4ms on fast ones. Zero cost.
Day 5 — March 23
"GPU quality matters more than compile mode"
Pod 3 at 742 TFLOPS showed reduce-overhead only saved 4ms vs 25ms on slower hardware. Pod selection is the real optimization.
Day 6 — March 24
"Half the leaderboard was using illegal TTT"
Issue #402 revealed multi-epoch TTT is banned. Every sub-1.10 submission was closed. Our #9 became top 5 on verified rankings overnight.
Day 6 — March 24
"Our PR #406 is actually top 5 verified"
1.1287 BPB beats every submission except verified legal TTT entries around 1.12. We were closer to winning than we thought.
Day 6 — March 24
"Custom kernels help eval, not training"
Deep-read of PR #376's kernel library revealed all fused operations only accelerate evaluation. Training path uses standard torch.compile.
Day 6 — March 24
"93KB over the size limit = so close yet so far"
Run KC2-C: PR#505 dim=496 produced 16.09MB — just 93KB over the 16MB limit. Best legal no-TTT architecture but can't submit it.