Parameter Golf — Experiment Lab
OpenAI Model Craft Challenge · Nathan Maine · March 19 - April 21, 2026
Disclaimer — Unofficial & Independent
This dashboard is not affiliated with, endorsed by, or officially associated with OpenAI, RunPod, or the Parameter Golf competition organizers. It is solely an independent attempt by one participant to document, visualize, and analyze personal experiment data from a public open-source competition. All competition data referenced is derived from publicly available GitHub pull requests and the author's own runs. Interpretations of rules, rankings, and technique effectiveness reflect the author's personal understanding and may not represent official positions.
0 — Community Leaderboard (1,683 Submissions)
Default view: All submissions
This leaderboard defaults to showing every submission regardless of status. Use the highlighted filter dropdown below to narrow the view to Legal-only, Record Eligible, Closed, or other categories. Submissions flagged as Illegal (multi-epoch TTT, score-every-epoch) or Suspect (closed by organizers, BPB < 0.5 indicating likely n-gram cache exploits) are shown by default but visually distinguished. Classifications are our best interpretation of the issue #402 guidelines and may not be 100% accurate.
▼ Change the view
Use the dropdown to filter submissions. Default shows all.
1,683 shown
| Rank ↕ | PR# ↕ | Author ↕ | Submission Name ↕ | val_bpb ↕ | Size (MB) ↕ | <16MB ↕ | Date ↕ | Status ↕ | TTT Legal ↕ |
|---|---|---|---|---|---|---|---|---|---|
How to read: CLOSED submissions were flagged by organizers (often illegal TTT). OPEN submissions are likely valid. MERGED submissions are official baselines or infrastructure PRs. Data sourced from public submission.json files in each PR. Use "Record Eligible" to see only Legal + Open + Under 16MB submissions. Use "Over 16MB / Unknown Size" to find submissions that may exceed the size limit. If your submission's size is showing incorrectly, open an issue on the dashboard repo.
TTT Legality (per issue #402):
Legal: Score-first TTT. Score each chunk under no_grad, then adapt only on already-scored tokens. Per-document independent, with no cross-document leakage (sketched below).
Illegal: Multi-epoch TTT. Training on val then re-scoring, score-every-epoch keeping the minimum NLL, or any pre-eval adaptation.
Suspect: Closed by organizers, or BPB < 0.5 (likely n-gram cache exploit).
Full ruling
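A minimal sketch of the score-first pattern under this reading (illustrative only; `model`, `optimizer`, the chunking, and the per-document reset are stand-ins, not the competition harness):

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, optimizer, documents, chunk_size=512):
    """Score-first TTT sketch: each chunk is scored BEFORE the model ever
    trains on it, and weights are reset between documents so nothing leaks
    across document boundaries. (A strict version would also reset the
    optimizer state; kept simple here.)"""
    total_nll, total_tokens = 0.0, 0
    for doc in documents:                      # per-document independence
        saved = {k: v.clone() for k, v in model.state_dict().items()}
        for start in range(0, len(doc) - 1, chunk_size):
            chunk = doc[start:start + chunk_size + 1]
            inputs, targets = chunk[:-1], chunk[1:]
            with torch.no_grad():              # 1) score first (this counts toward BPB)
                logits = model(inputs.unsqueeze(0))
                total_nll += F.cross_entropy(logits.squeeze(0), targets,
                                             reduction="sum").item()
            total_tokens += targets.numel()
            # 2) only now adapt, on tokens that have already been scored
            logits = model(inputs.unsqueeze(0))
            loss = F.cross_entropy(logits.squeeze(0), targets)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        model.load_state_dict(saved)           # reset before the next document
    return total_nll / total_tokens            # nats/token; convert to BPB separately
```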
1a — Legal Submissions by Date
Each dot is a legal submission (suspect and illegal TTT filtered out). Color = BPB range. Hover for details. The competition launched March 18.
1b — Top 20 Verified (Legal + Open + Under 16MB)
Only submissions that are Legal, OPEN, and under 16MB. Suspect and illegal TTT submissions are excluded. This is the realistic leaderboard - what would actually count if scoring closed today.
1c — Legal vs Closed vs Suspect: The TTT Cliff
The story: Nearly everything below ~1.10 BPB was closed by organizers for using multi-epoch TTT (training on validation data). Issue #402 redrew the map. Since April, the landscape has shifted further: SLOT legality is now also in question (PR #1240 showed standard SLOT to be illegal; Issue #1336 asks about a causal variant). The cliff is no longer just about TTT.
1d — BPB by Technique Category (Legal Only)
Techniques extracted from submission names. Each box shows the distribution. TTT dominates the low end but most were banned. QAT + architectural tricks (U-Net, XSA, EMA) remain strong, but parallel residuals and depth recurrence have emerged as new top techniques on the legal frontier.
2 — My BPB Score Progression
Reading the chart: Lower BPB is better. The green zone shows verified legal scores. The red zone below 1.10 contains submissions using illegal multi-epoch TTT, all closed by organizers. Our best verified-legal number is 1.1048 (#1287, no SLOT). #1291 at 1.0925 is parked pending the SLOT legality ruling (Issue #1336). We retracted #1193, #406, and #1127 for TTT-on-val on April 13. Important correction: between April 15-19 a byte-counting bug in build_sentencepiece_luts (tracked in Issue #1719) was discovered in the GDN-family submissions. The `+1` on leading-space bytes counts those bytes twice, inflating the byte denominator and under-reporting BPB by roughly 17%. PRs #1632, #1672, #1681, #1687, #1705, #1711, #1734 all self-closed after canonical rescoring put them in the 1.18-1.22 range. #1698 at a reported 1.00995 is still open but flagged. The real canonical frontier right now sits around 1.057 (dexhunter #1693, Casefold V4 + Multi-Phase Global SGD TTT). 10 days to deadline.
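A quick sanity check on the direction of that error, as a worked illustration only (the ~17.46% factor and the #1734 numbers come from the forensic thread; the byte totals here are placeholders):

```python
# How an inflated byte denominator flatters BPB (BPB = total_nll_bits / total_bytes,
# so over-counting bytes makes the reported score look better than it is).
nll_bits = 1.187 * 1_000_000          # canonical: 1.187 BPB over a nominal 1M bytes
true_bytes = 1_000_000
buggy_bytes = true_bytes * 1.1746     # leading-space bytes counted twice (~17.46% extra)

print(nll_bits / true_bytes)          # ~1.187  (canonical rescoring)
print(nll_bits / buggy_bytes)         # ~1.011  (roughly what the buggy LUT reported on #1734)
```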
2 - My Experiment Log (127+ Runs)
| Run ↕ | Day ↕ | Date ↕ | Pod ↕ | Config / PR Base ↕ | val_bpb ↕ | step_ms ↕ | Steps ↕ | Size ↕ | <16MB ↕ | Legal ↕ | Submit? ↕ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PR#273 | 2 | Mar 20 | Pod 1 (1×H100) | 10L, Int6 QAT, SmearGate, SWA | 1.1575 | — | — | ~15MB | Yes | Yes | Submitted |
| PR#385 | 2 | Mar 20 | Pod 1 (1×H100) | 11L, WD=0.04, SWA=0.4 | 1.1488 | — | — | ~15MB | Yes | Yes | Submitted |
| PR#406 | 3 | Mar 21 | Pod 1 (1×H100) | XSA4 + EMA + Int6 QAT | 1.1287 | 82 | 7,300 | ~15MB | Yes | Yes | Submitted ★ |
| Auto×80 | 4 | Mar 22 | Pod 1 (1×H100) | Autoresearch (sed corruption) | CRASHED | — | — | — | — | — | Failed |
| Auto-1 | 4 | Mar 22 | Pod 1 (1×H100) | Baseline (1×H100) | 1.4230 | ~550 | ~500 | ~15MB | Yes | Yes | Baseline |
| T1 | 5 | Mar 23 | Pod 2 (1×H100) | PR#406 baseline | 2.1809 | 572 | 525 | ~15MB | Yes | Yes | 1×H100 |
| T2 | 5 | Mar 23 | Pod 2 (1×H100) | + reduce-overhead compile | 2.1787 | 568 | 567 | ~15MB | Yes | Yes | 1×H100 |
| T3 | 5 | Mar 23 | Pod 2 (1×H100) | PR#505 full arch | 4.7651 | 945 | 318 | ~19MB | No | Yes | Too slow |
| T1b | 5 | Mar 23 | Pod 3 KC (742TF) | PR#406 baseline | 2.1809 | 572 | 525 | ~15MB | Yes | Yes | 1×H100 |
| T2b | 5 | Mar 23 | Pod 3 KC (742TF) | + reduce-overhead | 2.1787 | 568 | 567 | ~15MB | Yes | Yes | 1×H100 |
| T3b | 5 | Mar 23 | Pod 3 KC (742TF) | PR#505 full | 4.7651 | 945 | 318 | ~19MB | No | Yes | Too slow |
| T4 | 5 | Mar 23 | Pod 3 KC (742TF) | stride=32 eval | TIMEOUT | — | — | — | — | — | Failed |
| T5 | 5 | Mar 23 | Pod 3 KC (742TF) | 13 layers | TIMEOUT | — | — | — | — | — | Failed |
| ICE-1 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#505 no TTT | 1.1279 | 133 | 4,490 | 19.8MB | No | Yes | Over size |
| ICE-2 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#462 10ep TTT | 1.0689 | 72 | 8,278 | 19.2MB | No | Illegal | Banned |
| ICE-3 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#77 LoRA TTT | 1.1951 | 51 | 11,822 | 15.9MB | Yes | Yes | Poor score |
| ICE-4 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#462 KV=4 | 1.0723 | 69 | 8,744 | 17.5MB | No | Illegal | Banned |
| ICE-5 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#532 codebook+huffman | CRASHED | 112 | 5,368 | 14.6MB | Yes | — | Crash |
| ICE-7 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | KV=4 MLP=1536 | 1.0827 | 64 | 9,369 | 16.06MB | 60KB over | Illegal | Both |
| ICE-8 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | + BigramHash=7168 | 1.0842 | 64 | 9,396 | 16.07MB | Over | Illegal | Both |
| ICE-9 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | + FP16_EMBED=0 | 1.0820 | 64 | 9,385 | 16.18MB | Over | Illegal | Both |
| ICE-10 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | dim=496 (size fit) | 1.0935 | 70 | 8,578 | 15.37MB | Yes | Illegal | Illegal TTT |
| KC2-A | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 WD=0.05 | 1.1390 | ~133 | ~4,500 | 18.5MB | No | Yes | Over size |
| KC2-B | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#77 legal LoRA TTT | 1.2063 | ~50 | ~12K | ~15MB | Yes | Yes | Wrong arch |
| KC2-C | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 dim=496 no TTT | 1.1366 | ~130 | ~4,500 | 16.09MB | 93KB over | Yes | 93KB over! |
| KC2-D | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 full no TTT | 1.1217 | ~133 | ~4,500 | 19.8MB | No | Yes | Over size |
| KC2-S1 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 1337) | 1.1305 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse |
| KC2-S2 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 42) | 1.1309 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse |
| KC2-S3 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 7) | 1.1310 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse |
| 5090-1 | 8 | Mar 26 | RTX 5090 Reykjavik (180TF) | PR#414 baseline (SDPA fallback) | 2.3155 | 326 | 100 | ~15MB | Yes | Yes | 1×GPU test |
| 5090-2 | 8 | Mar 26 | RTX 5090 Reykjavik (180TF) | PR#549 LeakyReLU² + Banking | 2.3280 | 385 | 100 | ~15MB | Yes | Yes | 1×GPU test |
| PHANTOM | 8 | Mar 26 | RTX 5090 Reykjavik (180TF) | PHANTOM kernel (fused Linear+LoRA) | N/A | — | — | — | — | — | 0.41x slower |
| NG-1× | 8 | Mar 26 | H100 Mumbai (734TF) | PR#900 N-gram+Dirichlet (1×H100, 200 steps) | 0.1191 | 237 | 200 | 8.7MB | Yes | Yes | N-gram works! |
| PR#948-s1 | 8 | Mar 26 | 8×H100 Rancho Cordova (749TF) | Order-15 Dirichlet + Phrase (seed 1337) | 0.11555 | 146 | 3,827 | 15.1MB | Yes | Yes | Submitted ★ |
| PR#948-s2 | 8 | Mar 26 | 8×H100 Rancho Cordova (749TF) | Order-15 Dirichlet + Phrase (seed 42) | 0.11556 | 146 | 3,827 | 15.1MB | Yes | Yes | Submitted ★ |
| PR#948-s3 | 8 | Mar 26 | 8×H100 Rancho Cordova (749TF) | Order-15 Dirichlet + Phrase (seed 2025) | 0.11556 | 146 | 3,827 | 15.1MB | Yes | Yes | Submitted ★ |
| ABL-1 | 8 | Mar 27 | H100 Kansas City (740TF) | Baseline (control) | 0.11906 | 237 | 200 | ~8.7MB | Yes | Yes | Control |
| ABL-2 | 8 | Mar 27 | H100 Kansas City (740TF) | + Two-pass rescore | 0.11906 | 237 | 200 | ~8.7MB | Yes | Yes | No change |
| ABL-3 | 8 | Mar 27 | H100 Kansas City (740TF) | N-gram order 20 (was 15) | 0.11873 | 237 | 200 | ~8.7MB | Yes | Yes | Winner! -0.00033 |
| ABL-4 | 8 | Mar 27 | H100 Kansas City (740TF) | Int5 quantization | 0.11906 | 237 | 200 | ~8.7MB | Yes | Yes | No change |
| ABL-5 | 8 | Mar 27 | H100 Kansas City (740TF) | Comp alpha=0.30 | 0.11906 | 237 | 200 | ~8.7MB | Yes | Yes | No change |
| ABL-6 | 8 | Mar 27 | H100 Kansas City (740TF) | zstd compression | CRASHED | — | — | — | — | — | Not supported |
| PR#968-s1 | 8 | Mar 27 | 8×H100 Montréal (747TF) | Order-20 Dirichlet + Phrase (seed 1337) | 0.11544 | 177 | 3,170 | 15.1MB | Yes | Yes | Submitted ★ #1 |
| PR#968-s2 | 8 | Mar 27 | 8×H100 Montréal (747TF) | Order-20 Dirichlet + Phrase (seed 42) | 0.11546 | 177 | 3,170 | 15.1MB | Yes | Yes | Submitted ★ #1 |
| PR#968-s3 | 8 | Mar 27 | 8×H100 Montréal (747TF) | Order-20 Dirichlet + Phrase (seed 2025) | 0.11545 | 177 | 3,170 | 15.1MB | Yes | Yes | Submitted ★ #1 |
| KS-v1 | 10 | Mar 30 | Pod 6 RTX 5090 Iceland | Universal Transformer (22 iter, 90% sparse) | 1.8134 | 560 | 1,071 | 2.9MB | Yes | Yes | Research |
| KS-v2 | 10 | Mar 30 | Pod 6 RTX 5090 Iceland | UT-12 + 50% sparse + TTT | 1.4390 | 564 | 1,064 | 2.9MB | Yes | Yes | Submitted ★ |
| KS-v3 | 10 | Mar 30 | Pod 6 RTX 5090 Iceland | UT-12 dim=768, no sparse | 1.5212 | 981 | 610 | 6.5MB | Yes | Yes | Research |
| DIFF-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | Text Diffusion (MDLM) hybrid AR+diff | 3.3801 | 2,311 | 79 | 5.3MB | Yes | Yes | Submitted ★ |
| ADAPT-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | Random Linear Map Adapters | 2.2017 | 609 | 296 | 10.5MB | Yes | Yes | Submitted ★ |
| JEPA-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | LLM-JEPA (Joint Embedding) | 2.2020 | 682 | 265 | 9.1MB | Yes | Yes | Submitted ★ |
| MAMBA-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | Mamba SSM Hybrid 3:1 | 3.3168 | ~600 | ~300 | 5.3MB | Yes | Yes | Submitted ★ |
| HNET-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | H-Net Dynamic Chunking | 1.6393 | 624 | 289 | 8.5MB | Yes | Yes | Research |
| MEGA-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | Triton Megakernels (broken) | 3.3164 | 274 | 657 | 5.3MB | Yes | Yes | CRASHED |
| MEGA-2 | 11 | Mar 31 | Pod 6 RTX 5090 Iceland | Megakernel (fixed, fullgraph=True) | 1.3560 | 616 | 975 | 12.3MB | Yes | Yes | Submitted ★ |
| HNET-2 | 11 | Mar 31 | Pod 7 RTX 5090 Iceland | H-Net + TTT (10 min) | 1.3587 | 624 | 964 | 8.6MB | Yes | Yes | Submitted ★ |
| CTRL-1 | 11 | Mar 31 | Pod 7 RTX 5090 Iceland | Control baseline (no tricks) | 1.3577 | 606 | 958 | 12.3MB | Yes | Yes | Baseline |
| H200-1 | 11 | Mar 31 | Pod 8 H200 SXM (742TF) | SOTA+Triton (FA2, GPTQ test) | 5.6466 | 1,106 | 543 | 5.9MB | Yes | Yes | GPTQ test |
| #1019-s42 | 14 | Apr 2 | Pod Montreal 778T | PR#1019 baseline | 1.1265 | 114 | 5,261 | 15.89MB | Yes | Yes | Baseline |
| #1019-s1337 | 14 | Apr 2 | Pod Montreal 778T | PR#1019 + FA3 fix | 1.1266 | 115 | 5,197 | 15.87MB | Yes | Yes | Baseline |
| #1176-SLOT | 14 | Apr 2 | Pod Montreal 778T | PR#1176 SLOT QK4.0 | 1.1147 | 101 | 5,900 | 15.68MB | Yes | Yes | Matched SOTA |
| #1218-s42 | 14 | Apr 2 | Pod Montreal 778T | Vocab4096 MLP4x WD.085 | 1.1039 | 130 | 4,807 | 15.95MB | Yes | Yes | Submitted ★ |
| #1218-s1337 | 14 | Apr 2 | Pod Montreal 778T | Same | 1.1054 | 130 | 4,701 | 15.93MB | Yes | Yes | Submitted ★ |
| #1218-s2025 | 14 | Apr 2 | Pod Montreal 778T | Same | 1.1052 | 130 | 4,758 | 15.96MB | Yes | Yes | Submitted ★ |
| #1218+SLOT-s42 | 15 | Apr 3 | Pod Iceland 802T | Vocab4096 MLP4x SLOT | 1.0947 | 115 | 5,165 | 15.95MB | Yes | Yes | Submitted ★ |
| #1218+SLOT-s1337 | 15 | Apr 3 | Pod Iceland 802T | Same | 1.0913 | 100 | 5,890 | 15.93MB | Yes | Yes | Submitted ★ |
| #1218+SLOT-s2025 | 15 | Apr 3 | Pod Iceland 802T | Same | 1.0915 | 100 | 5,900 | 15.95MB | Yes | Yes | Submitted ★ |
| #1287-s42 | 15 | Apr 3 | Pod Iceland 802T | Vocab4096 MLP4x NO SLOT | 1.1048 | 130 | 4,807 | 15.95MB | Yes | Yes | Submitted |
| PROTEUS-baseline | 16-17 | Apr 4-5 | DGX Spark GB10 | sp1024 baseline 1000 steps | 1.4601 | - | 1,000 | 8.99MB | Yes | Yes | Local only |
| PROTEUS-parallel | 16-17 | Apr 4-5 | DGX Spark GB10 | Parallel+INT5+SLOT 1000 steps | 1.4479 | - | 1,000 | 8.21MB | Yes | Yes | Local only |
| Ablation-A | 17-18 | Apr 5-6 | DGX Spark GB10 | Baseline 500 steps | 1.5734 | - | 500 | 7.55MB | Yes | Yes | Local only |
| Ablation-C | 17-18 | Apr 5-6 | DGX Spark GB10 | Parallel only 500 steps | 1.5559 | - | 500 | 7.58MB | Yes | Yes | Local only |
| Ablation-F | 17-18 | Apr 5-6 | DGX Spark GB10 | Parallel+SLOT 500 steps | 1.5557 | - | 500 | 6.67MB | Yes | Yes | Local only |
| UT-1 (6 iters) | 20 | Apr 9 | DGX Spark GB10 | Universal Transformer 1 block x 6 iters | 3.2483 | 707 | 200 | - | Yes | Yes | Research PR #1193 |
| UT-2 (24 iters) | 20 | Apr 9 | DGX Spark GB10 | Universal Transformer 1 block x 24 iters | 3.2490 | 2,734 | 200 | 1.45MB | Yes | Yes | Research PR #1193 |
| DIFF-1 (70/30) | 20 | Apr 9 | DGX Spark GB10 | Text Diffusion 70% AR / 30% diff | 2.4195 | 1,388 | 200 | 6.90MB | Yes | Yes | Research PR #1194 |
| DIFF-2 (50/50) | 20 | Apr 9 | DGX Spark GB10 | Text Diffusion 50/50 | 2.4194 | 997 | 200 | 6.90MB | Yes | Yes | Research PR #1194 |
| DIFF-3 (pure AR) | 20 | Apr 9 | DGX Spark GB10 | Pure AR reference | 2.4194 | 997 | 200 | 6.90MB | Yes | Yes | Research PR #1194 |
| RND-1 (default 0.5%) | 20 | Apr 9 | DGX Spark GB10 | Random Adapters default | 2.5123 | 894 | 200 | 10.49MB | Yes | Yes | Research PR #1195 |
| RND-2 (wider rank 8) | 20 | Apr 9 | DGX Spark GB10 | Random Adapters wider | 2.6323 | 881 | 200 | 1.36MB | Yes | Yes | Research PR #1195 |
| RND-3 (5% unfrozen) | 20 | Apr 9 | DGX Spark GB10 | Random Adapters 5% trainable | 2.5120 | 895 | 200 | 10.49MB | Yes | Yes | Research PR #1195 |
| RND-4 (progressive) | 20 | Apr 9 | DGX Spark GB10 | Random Adapters progressive unfreeze | 2.5122 | 894 | 200 | 10.49MB | Yes | Yes | Research PR #1195 |
| JEPA-1 (10%) | 20 | Apr 9 | DGX Spark GB10 | JEPA 10% auxiliary weight | 2.2323 | 498 | 200 | 7.04MB | Yes | Yes | Research PR #1196 |
| JEPA-2 (30%) | 20 | Apr 9 | DGX Spark GB10 | JEPA 30% auxiliary weight | 2.2322 | 496 | 200 | 7.04MB | Yes | Yes | Research PR #1196 |
| JEPA-3 (50%) | 20 | Apr 9 | DGX Spark GB10 | JEPA 50% auxiliary weight | 2.2322 | 497 | 200 | 7.04MB | Yes | Yes | Research PR #1196 |
| SSM-1 (1:1 ratio) | 20 | Apr 9 | DGX Spark GB10 | Mamba SSM 1:1 SSM:Attention | 2.0295 | 37,492 | 200 | 11.42MB | Yes | Yes | Research PR #1197 |
| SSM-4 (state=64) | 20 | Apr 9 | DGX Spark GB10 | Mamba SSM larger state dim (partial) | 2.1816 | 55,824 | 100 | - | Yes | Yes | Research PR #1197 |
| HNET-1 (default) | 20 | Apr 9 | DGX Spark GB10 | H-Net default chunker | 2.0558 | 513 | 200 | 7.37MB | Yes | Yes | Research PR #1191 |
| HNET-2 (large chunker) | 20 | Apr 9 | DGX Spark GB10 | H-Net large chunker | 2.0559 | 513 | 200 | 7.37MB | Yes | Yes | Research PR #1191 |
| HNET-3 (boundary reg) | 20 | Apr 9 | DGX Spark GB10 | H-Net boundary regularizer | 2.0558 | 514 | 200 | 7.37MB | Yes | Yes | Research PR #1191 |
| MEGA-1 (baseline) | 20 | Apr 9 | DGX Spark GB10 | Megakernel baseline 9L d=512 | 2.2147 | 584 | 200 | 7.13MB | Yes | Yes | Research PR #1192 |
| MEGA-2 (d=640) | 20 | Apr 9 | DGX Spark GB10 | Megakernel wider d=640 | 2.1592 | 903 | 200 | 10.25MB | Yes | Yes | Research PR #1192 |
| MEGA-3 (11 layers) | 20 | Apr 9 | DGX Spark GB10 | Megakernel deeper 11L | 2.1961 | 714 | 200 | 8.61MB | Yes | Yes | Research PR #1192 |
Note: 1xGPU runs show higher BPB due to fewer training steps. Only 8xH100 results are competition-comparable. The N-gram cache revolution (Day 8) took us from 1.1287 to 0.11545, a 10x improvement. Days 14-15: Vocab 4096 + MLP 4.0x + SLOT pushed neural model to 1.0913 BPB. Total: 127+ experiments across 13 pods + DGX Spark, ~$360 spent. April 9: 22-run research expansion covering all 7 OpenAI-requested architectures (Universal Transformer, Text Diffusion, Random Adapters, JEPA, Mamba SSM, H-Net, Megakernels).
3 — Technique Effectiveness
Int6 QAT
Quantization-aware training. Table stakes — every top submission uses it.
REQUIRED · ~0.5MB model size reduction
reduce-overhead compile
torch.compile mode. -25ms/step on slow GPUs, -4ms on fast. Free win.
PROVEN · -25ms/step (slow GPU)
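For reference, enabling it is a one-line change (a sketch assuming a CUDA device and any nn.Module):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).cuda()
# "reduce-overhead" captures CUDA graphs where safe; max-autotune crashed against
# the rotary cache in our runs, so this is the mode to standardize on.
model = torch.compile(model, mode="reduce-overhead")
```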
EMA 0.997
Exponential moving average of weights during training. Small but consistent improvement.
HELPFUL · small quality gain
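A minimal sketch of the update rule at decay 0.997 (keep a frozen copy of the model and update it after every optimizer step; buffers omitted for brevity):

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.997):
    # ema <- decay * ema + (1 - decay) * current, applied parameter-wise
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage sketch: ema_model = copy.deepcopy(model).eval(); after each optimizer.step()
# call ema_update(ema_model, model); evaluate/export the EMA copy, not the raw weights.
```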
Star-ReLU (not SwiGLU)
ReLU² + learnable scale/bias. Used in top submissions. PR #505 title was misleading.
TOP-TIER · used in #1 no-TTT submission
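A minimal sketch of the activation as described (cf. MetaFormer's StarReLU; the init values here are illustrative, not the submission's exact settings):

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """ReLU^2 with a learnable scale and bias: s * relu(x)**2 + b."""
    def __init__(self, scale=1.0, bias=0.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale))
        self.bias = nn.Parameter(torch.tensor(bias))

    def forward(self, x):
        return self.scale * torch.relu(x).square() + self.bias
```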
Late QAT
Enable QAT when LR drops below 15% threshold. Better than constant QAT.
BETTER · vs constant QAT
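A hedged sketch of the gating logic; `enable_fake_quant()` is a hypothetical hook standing in for whatever QAT toggle the training loop actually exposes:

```python
def maybe_enable_qat(model, optimizer, peak_lr, threshold=0.15):
    """Late QAT sketch: turn on fake-quant only once the LR schedule has
    decayed below threshold * peak_lr, instead of quantizing from step 0."""
    current_lr = optimizer.param_groups[0]["lr"]
    if current_lr < threshold * peak_lr and not getattr(model, "qat_on", False):
        model.enable_fake_quant()   # hypothetical hook: start quantization-aware training
        model.qat_on = True
```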
Legal LoRA TTT (score-first)
Single-pass adaptation at eval. Score token, then train. Never see same data twice. PR #77 proved -0.037 BPB.
PROVEN · -0.037 BPB (needs lr=0.01)
XSA4 (Extended Self-Attention)
Adds quality but costs +17ms/step overhead. Net effect marginal.
MIXED · +17ms/step overhead
U-Net Skip Connections
5 encoder + 6 decoder with learned gates. Helps quality but adds model size.
MIXED · quality ↑ but size ↑
BigramHash
Hash-based bigram embedding. Minimal measurable impact on quality in our tests.
MINIMAL · no clear benefit
Custom Kernels (THOR, REAPER...)
Fused Triton/CUDA operations. 1.17×–1.70× speedup but only during eval, not training.
EVAL ONLY · doesn't speed training
Multi-epoch TTT
10-epoch AdamW on validation data. -0.059 BPB but BANNED by organizers (Issue #402).
BANNED · all sub-1.10 submissions closed
torch.compile max-autotune
CUDAGraph conflicts with rotary cache. Crashes every time. Use reduce-overhead instead.
CRASHES · CUDAGraph conflicts
PHANTOM Kernel (Fused Linear+LoRA)
Custom Triton kernel fusing base linear + LoRA. Rank-8 matmuls too small to benefit — cuBLAS already optimal.
0.41× SLOWER · dead end confirmed on RTX 5090
N-gram Cache (orders 2-20)
Hash table of token patterns from already-scored validation data. Backward-looking, score-first, causal. THE paradigm shift.
GAME CHANGER · 1.17 → 0.115 BPB (10× improvement)
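A toy illustration of the backward-looking, score-first cache (counts only, longest-match prediction; the real submissions hash contexts and add smoothing):

```python
from collections import defaultdict

class NGramCache:
    """Predict the next token from contexts seen EARLIER in the already-scored
    validation stream; add the current token only AFTER it has been scored."""

    def __init__(self, max_order=15):
        self.max_order = max_order
        self.counts = defaultdict(lambda: defaultdict(int))   # context -> next -> count

    def predict(self, history):
        # Longest matching context wins in this toy version.
        for n in range(min(self.max_order, len(history)), 0, -1):
            ctx = tuple(history[-n:])
            if ctx in self.counts:
                nxt = self.counts[ctx]
                total = sum(nxt.values())
                return {tok: c / total for tok, c in nxt.items()}
        return None                                            # fall back to the neural model

    def update(self, history, token):
        # Called only after `token` has been scored, keeping the cache causal.
        for n in range(1, min(self.max_order, len(history)) + 1):
            self.counts[tuple(history[-n:])][token] += 1
```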
Dirichlet Smoothing (per-order OBCL)
Bayesian posterior: p = (count + c × prior) / (total + c). Learned concentration per order: 50.0 (bigrams) → 1.86 (14-grams).
CRITICAL · -0.06 BPB vs linear interpolation
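The formula above as code, with toy numbers (the learned per-order concentrations in the submission range from ~50 for bigrams down to ~1.86 for 14-grams):

```python
def dirichlet_smoothed(count, total, prior, concentration):
    """Per-order Dirichlet smoothing: p = (count + c * prior) / (total + c),
    where `prior` is the lower-order (or uniform) probability of the token and
    `c` is the learned concentration for this n-gram order."""
    return (count + concentration * prior) / (total + concentration)

# Toy numbers: context seen 4 times, token seen once, uniform prior over 256 bytes.
p = dirichlet_smoothed(count=1, total=4, prior=1 / 256, concentration=50.0)
```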
Phrase Cache (suffix matching)
Match 16-20 token suffixes with Dirichlet smoothing. Captures long-range repetition that 15-gram can't reach.
PROVEN · -0.11 BPB additional improvement
Complementary Training
Downweight loss on N-gram-predictable tokens during training. Focus neural model on hard tokens.
PROVEN · -0.07 BPB (alpha=0.50)
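A sketch of the idea, assuming the n-gram cache's probability of the true token is available per position; the exact weighting rule used in the submission may differ:

```python
import torch.nn.functional as F

def complementary_loss(logits, targets, ngram_prob, alpha=0.5):
    """Complementary training sketch: per-token CE is downweighted where the
    n-gram cache already predicts the target well, so the gradient budget goes
    to the hard tokens. `ngram_prob` has the same shape as `targets`."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1),
                         reduction="none")
    weight = 1.0 - alpha * ngram_prob.view(-1)      # predictable tokens get less weight
    return (weight * ce).sum() / weight.sum()
```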
Order 20 (vs 15)
Extending N-gram orders from 15 to 20 gave -0.00033 BPB in ablation, -0.00011 in competition.
SMALL WIN · -0.00011 BPB (confirmed via 6-test ablation)
H-Net Dynamic Chunking
Learned boundary prediction between tokens. 263K extra params, matched baseline within 0.001 BPB.
PROVEN · baseline-matching with learned tokenization
Fused Triton Megakernels
RMSNorm + LeakyReLU² eval kernels. Beat baseline by 0.0017 BPB.
PROVEN · -0.0017 BPB improvement
Full Hessian GPTQ
Cholesky error compensation. Works on Hopper (H200/H100), crashes on consumer GPUs (5090). Best results with AR self-gen calibration data.
HOPPER ONLY · H100/H200 exclusive, works with AR self-gen calibration
Universal Transformer
Shared-weight block looped 12×. Saves 90% block params but loses 0.08 BPB vs flat. Confirms PR #363 findings.
RESEARCH VALUE · depth recurrence confirmed inferior at this scale
Adaptive Density
Sparse→dense curriculum. Faster early steps but net BPB worse.
MIXED · faster early steps, worse final BPB
Echo Training
Self-distillation from EMA. Crashed due to autograd conflicts. Needs architectural rethink.
INCOMPLETE · autograd conflicts
Text Diffusion (MDLM)
Hybrid AR+diffusion. 3.38 BPB. Too slow (2.3s/step) for 10-min window. Signs of life only.
RESEARCH ONLY · 3.38 BPB, 2.3s/step
Random Linear Map Adapters
Frozen orthogonal weights. 2.20 BPB. Pretraining on frozen random projections doesn't work.
NEGATIVE RESULT · 2.20 BPB
LLM-JEPA
Joint embedding prediction. 2.20 BPB. JEPA benefits are invisible on raw BPB pretraining metric.
NEGATIVE RESULT · 2.20 BPB
Mamba SSM Hybrid
Pure PyTorch SSM. 3.32 BPB. No custom kernels = too slow. Needs CUDA implementation.
RESEARCH ONLY · 3.32 BPB, needs CUDA kernels
Depth Recurrence
Confirmed independently: flat models beat looped models at this scale. Also confirmed by PR #363.
DEAD END · flat > looped at 16MB scale
SLOT (per-batch delta optimization)
Eval-time per-batch delta optimization. Consistent gain on top of any base model with zero training changes needed. Standard SLOT ruled illegal by PR #1240 (100% causal violation). Context-only SLOT variant under review (Issue #1336).
PROVEN · -0.007 BPB eval-time gain, legality under review
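For clarity on why standard SLOT fails the causality test, here is a hedged sketch of the per-batch delta optimization (hypothetical `hidden_states()`/`head` accessors): the delta is fit on the very tokens that are then scored.

```python
import torch
import torch.nn.functional as F

def slot_score(model, batch_inputs, batch_targets, steps=3, lr=0.01):
    """Standard SLOT sketch (the variant PR #1240 found causally illegal)."""
    with torch.no_grad():
        h = model.hidden_states(batch_inputs)          # hypothetical: pre-head activations
    delta = torch.zeros(1, 1, h.size(-1), device=h.device, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):                             # fit delta on the eval batch itself
        loss = F.cross_entropy(model.head(h + delta).flatten(0, 1),
                               batch_targets.flatten())
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                              # score with the fitted delta
        return F.cross_entropy(model.head(h + delta).flatten(0, 1),
                               batch_targets.flatten())
```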
Vocab 4096 + MLP 4.0x
Bigger vocabulary and wider FFN combined with aggressive weight decay (0.085). Better compression per parameter.
BREAKTHROUGH · 1.1048 base (3-seed mean), beats SOTA without eval tricks
Brotli-11 Compression
Saves ~400KB vs LZMA, enabling larger models to fit under the 16MB submission limit.
PROVEN · ~400KB savings vs LZMA, enables bigger models under 16MB
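A minimal comparison sketch, assuming the `brotli` package is installed (the checkpoint path is a placeholder; quality 11 is Brotli's maximum setting):

```python
import brotli
import lzma

with open("model_checkpoint.bin", "rb") as f:   # placeholder checkpoint path
    raw = f.read()

b = brotli.compress(raw, quality=11)            # max quality: slow, smallest output
x = lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)
print(len(b), len(x))                           # Brotli-11 saved ~400KB on our checkpoints
```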
Parallel Residuals
Dual-stream attention/MLP lanes with learnable route vector and lane_merge. From PR #1204, #1289.
PROVEN · -0.0175 BPB (DGX Spark ablation)
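A sketch of the parallel-lane block, loosely modeled on the PR description; the exact lane_merge parameterization in #1204/#1289 may differ:

```python
import torch
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """Attention and MLP read the same normalized input in parallel (instead of
    sequentially) and are merged by a learnable route vector."""
    def __init__(self, dim, n_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.lane_merge = nn.Parameter(torch.full((2,), 0.5))   # learnable route vector

    def forward(self, x, attn_mask=None):
        h = self.norm(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        m = self.mlp(h)
        return x + self.lane_merge[0] * a + self.lane_merge[1] * m
```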
4 — Cost & Efficiency
Pod 1 — Days 2–4 (1×H100)
$90
Includes $43 overnight waste
Pod 2 — Day 5 (1×H100)
$7
3 targeted tests
Pod 3 KC — Day 5 (1×H100)
$15
742 TFLOPS — best single GPU
Pod 4 Iceland — Day 6 (8×H100)
$65
10 experiments — most productive session
Pod 5 KC2 — Day 6 (8×H100)
$35
7 runs including 3-seed verification
Overnight Waste
$43
80+ crashed autoresearch experiments
RTX 5090 — Day 8 (Reykjavik)
$5
PR#414 vs #549 + PHANTOM kernel test
H100 Mumbai — Day 8 (1×H100)
$5
First N-gram test — 0.1191 BPB!
8×H100 Rancho Cordova — Day 8
$35
3-seed PR#948 — #1 tied at 0.11556
H100 Kansas City — Day 8 (ablation)
$8
6-test ablation battery — found order 20
8×H100 Montréal — Day 8
$35
3-seed PR#968 — NEW #1 at 0.11545
Broken Pods (Amsterdam)
$15
2 pods failed — critical error + broken CUDA
Pod 6 — Days 10-11 (1×RTX 5090 Iceland)
$5.50
Kitchen Sink tuning + megakernel experiments
Pod 7 — Days 10-11 (1×RTX 5090 Iceland)
$7.00
7 research architecture runs + overnight
Pod 8 — Day 11 (1×H200 SXM)
$13.00
GPTQ validation on Hopper + SOTA test
Pod Montreal — Day 14 (8×H100 778T)
~$15
3 baseline + SLOT + 3x #1218 seeds
Pod Iceland — Day 15 (8×H100 802T)
~$15
3x #1218+SLOT seeds, best pod ever (802T)
DGX Spark - Days 16-19 (Local GB10)
$0
10 PROTEUS ablation runs, free local compute
5 — Pod Performance Comparison
| Pod | Location | GPU Config | GEMM (TFLOPS) | Step ms (same code) | Cost/hr | Verdict |
|---|---|---|---|---|---|---|
| Pod 1 | Unknown | 1×H100 SXM | Not measured | 498–592 ms | ~$2.69 | Variable |
| Pod 2 | Unknown | 1×H100 SXM | Not measured | 553–578 ms | ~$2.69 | Consistent |
| Pod 3 | Kansas City, MO | 1×H100 SXM | 742 TFLOPS | 568–572 ms | $2.69 | Exceptional |
| Pod 4 | Reykjavík, Iceland | 8×H100 SXM | 733 TFLOPS | 45–133 ms | $21.52 | Top-tier |
| Pod 5 | Kansas City (KC2) | 8×H100 SXM | ~730 est | 60-133 ms | ~$21.52 | Solid |
| Pod Montreal (Apr 2) | Montreal, QC | 8×H100 SXM | 778 TFLOPS | ~130 ms (#1218) | ~$2.69/hr | Exceptional |
| Pod Iceland (Apr 3) | Reykjavik, IS | 8×H100 SXM | 802 TFLOPS | ~100 ms (#1218 cached) | ~$2.69/hr | Best Ever |
| DGX Spark (Apr 4-6) | Local (10G LAN) | 1x GB10 128GB | N/A (unified) | ~6000 ms (no compile) | $0/hr | Free + Always On |
Key insight: GPU quality varies significantly even within the same model. Pod 3 hit 742 TFLOPS - well above the ~275 TFLOPS reference. This is why we built the pod benchmarking tool. The DGX Spark GB10 (local, free) runs at ~6s/step without torch.compile but the parallel residuals architecture is 2.3x faster than sequential, partially compensating.
6 — Key Discoveries
Day 2 — March 20
"Step speed is the #1 bottleneck"
Every millisecond per step = fewer total training steps = worse BPB. Discovered after 17 experiments showed speed beats depth.
Day 3 — March 21
"XSA4 adds quality but costs speed"
Extended Self-Attention improved per-step quality but +17ms/step nearly negated the gain. The tradeoff warning sign.
Day 4 — March 22
"Autoresearch overnight = $43 lesson in automation"
80+ experiments crashed due to sed corruption, OOM errors, and eval timeouts. Led to: "use bash scripts, not Claude Code, for automated loops."
Day 5 — March 23
"reduce-overhead = free -25ms/step"
torch.compile reduce-overhead mode confirmed working where max-autotune crashes. -25ms on slow GPUs, -4ms on fast ones. Zero cost.
Day 5 — March 23
"GPU quality matters more than compile mode"
Pod 3 at 742 TFLOPS showed reduce-overhead only saved 4ms vs 25ms on slower hardware. Pod selection is the real optimization.
Day 6 — March 24
"Half the leaderboard was using illegal TTT"
Issue #402 revealed multi-epoch TTT is banned. Every sub-1.10 submission was closed. Our #9 became top 5 on verified rankings overnight.
Day 6 — March 24
"Our PR #406 is actually top 5 verified"
1.1287 BPB beats every submission except verified legal TTT entries around 1.12. We were closer to winning than we thought.
Day 6 — March 24
"Custom kernels help eval, not training"
Deep-read of PR #376's kernel library revealed all fused operations only accelerate evaluation. Training path uses standard torch.compile.
Day 6 — March 24
"93KB over the size limit = so close yet so far"
Run KC2-C: PR#505 dim=496 produced 16.09MB — just 93KB over the 16MB limit. Best legal no-TTT architecture but can't submit it.
Day 10 — March 30
"Research Sprint: all 7 OpenAI requests in 48 hours"
Implemented all 7 of OpenAI's requested research directions in 48 hours. 11,810 lines of code across 8 training scripts. Only entrant to attempt comprehensive coverage.
Day 10 — March 30
"Depth Recurrence Confirmed Dead"
Our Universal Transformer experiments independently confirmed PR #363's finding: shared-weight architectures lose ~0.08 BPB vs flat models at 16MB scale.
Day 10 — March 30
"H-Net Surprise: learned chunking matches baseline"
Learned dynamic chunking (H-Net) matched baseline within 0.001 BPB using only 263K extra parameters. Suggests learned tokenization is a viable research direction.
Day 11 — March 31
"Triton Kernels Debug: autograd broke, eval-only fix"
First attempt at training-time Triton kernels broke autograd (no gradients flowing). Fix: eval-only Triton + fullgraph=True torch.compile. Net result: 0.0017 BPB improvement.
Day 11 — March 31
"GPTQ on Hopper — H200 only"
Full Hessian GPTQ Cholesky decomposition crashes on RTX 5090 (consumer GPU precision) but works perfectly on H200 SXM (Hopper). This is an H100/H200-only technique.
Day 11 — March 31
"Brain Trust Gaps: 12 major knowledge gaps"
Queried 343K-chunk expert knowledge base. Found zero coverage on GPTQ, Triton kernel authoring, Mamba-3, and competition meta-techniques. 12 major knowledge gaps identified.
Day 11 — March 31
"Partition Function Challenge: n-gram scores may be invalid"
PR #1147 mathematically proved that Dirichlet/n-gram cache approaches (including our PRs #948/#968) produce invalid BPB due to unnormalized probability distributions. Our n-gram scores may be invalidated.
Day 14-15 — April 2-3
"Vocab 4096 + MLP 4.0x beats SOTA without eval tricks"
PR #1218 config with vocab 4096, MLP 4.0x expansion, and WD 0.085 achieved 1.1048 BPB (3-seed mean) on base model alone. No eval-time tricks needed to beat verified SOTA.
Day 14-15 — April 2-3
"SLOT gives consistent -0.007 BPB on top of base"
Per-batch delta optimization at eval time. Stacks cleanly on any trained model. Combined with #1218, pushed to 1.0925 mean (3-seed) on Iceland pod.
Day 15 — April 3
"Iceland pod lottery win: 802 TFLOPS + cached compilation = 100ms/step"
Best pod we have ever used. 802 TFLOPS GEMM performance combined with cached Triton compilation gave 100ms/step, enabling 5,900 training steps in 10 minutes.
Day 14-15 — April 2-3
"Weight decay correlates with compressibility (R^2 ~0.99)"
Higher weight decay (0.085) produces more compressible checkpoints. This enables bigger models to fit under 16MB with Brotli-11 compression. Nearly perfect linear correlation.
Day 16-17 - April 4
"PROTEUS integration: 4 features ported in one session"
Integrated parallel residuals, mixed INT5/INT6 quant, score-first TTT, and CPU test suite from PROTEUS v1.6 (PR #1289). All 22 CPU tests passing.
Day 17-19 - April 5-6
"Parallel residuals: the clear winner (-0.0175 BPB)"
7-run overnight ablation on DGX Spark isolated feature contributions. Parallel residuals delivered -0.0175 train_bpb and 2.3x throughput. INT5 saves 0.9 MB. SLOT adds nothing on top of parallel.
Day 17-19 - April 5-6
"SLOT legality crisis: standard SLOT proven illegal"
PR #1240 showed 100% causal violation rate for standard SLOT. Our #1291 (1.0925) uses standard SLOT. Issue #1336 pending ruling on causal SLOT variant. Safe fallback: #1287 at 1.1048.
Day 19 - April 6
"sp4096 tokenizer not officially approved"
Discussion on PR #1333 revealed sp4096 has been adopted by ~8 PRs but never formally approved by maintainers. Adds risk to all sp4096 submissions including ours.
Day 19 - April 6
"PR #1334 shows the legal ceiling: 1.0897 with zero eval tricks"
aryanbhosale's Track A submission uses our base stack (#1287) + depth recurrence + parallel residuals + MuonEq-R. 1.0897 BPB with no SLOT, no TTT, no n-gram. Credits our PR #1287.
Day 22 - April 9
"22-run research expansion: all 7 OpenAI-requested architectures ablated in one night"
Overnight DGX Spark run covering Universal Transformer, Text Diffusion, Random Adapters, JEPA, Mamba SSM, H-Net, and Megakernels. 20 of 22 runs successful. Published raw data as public gist for community reuse.
Day 22 - April 9
"Surprise: SSM wins on raw BPB despite being 50x slower per step"
Mamba-style SSM hit 2.0295 BPB with only ~5 effective training steps. Pure PyTorch implementation bottlenecks everything at 37s/step. Fast Triton selective scan kernel would be competitive.
Day 22 - April 9
"Hyperparameters inside an architecture often do nothing"
JEPA weight (10/30/50%), diffusion ratio (70/30, 50/50, pure AR), H-Net chunker config all produced identical BPB to 4 decimal places. Architectures either work or they do not. Tuning does not rescue them.
Day 23 - April 10
"The landscape shifted fast: we are no longer top 5 Track A"
23 new PRs in last 3 days. samacqua #1530 at 1.0734 is new Track A leader. msisovic #1529 at 1.0744, aryanbhosale #1540 at 1.0777 close behind. SP8192 vocab emerging as new standard. Our #1287 at 1.1048 needs another push.
Day 26 - April 13
"First sub-1.01 legal submission: dexhunter #1586 hits 1.0000 BPB"
Per-Layer Adaptive GPTQ Clip + int7 Embedding. Clean round number. The 1.00 BPB milestone is broken. Hkoyuer #1632 right behind with GDN-Hybrid (Gated DeltaNet + SWA) at 1.0274. 78 new PRs in 3 days, the final week is accelerating hard.
Day 26 - April 13
"Self-flagged 3 PRs for TTT-on-val after @MatoTeziTanka audit"
@MatoTeziTanka (PROTEUS #1289) flagged PR #1193 and #406 on April 11-12 for multi-epoch TTT on val_tokens. Audited our other submissions proactively and found the same pattern in #1127. Retracted all three, kept pre-SDTTT numbers on #406 (1.1455 legal), filed clean resubmission #1554. Added compliance notes to #948/#968 n-gram PRs (different class, awaiting maintainer ruling).
Day 26 - April 13
"RunPod $100 credit + PR #116 on containers repo"
Closed the loop on ticket #35283 after 12 days of back-and-forth. Received $100 credit. @max4c (RunPod engineer) built PR #116 on runpod/containers based on my PR #115, crediting my root cause analysis explicitly. Public GitHub pressure worked where private support tickets stalled.
Day 26 - April 13
"Submitted peft PR #3154: Gemma 4 dispatch error UX fix"
Closed the loop on huggingface/peft #3129 after thread got sidetracked. Built minimal repro on Spark, verified @BenjaminBossan's .linear regex workaround, submitted PR #3154 adding an actionable hint to PEFT's dispatch error for wrapper modules like Gemma4ClippableLinear. Full TestLoraInitialization suite passes with zero regressions.
Day 27 - April 14
"All 7 research PRs + #1554 rated LOOKS CLEAN by independent audit"
@MatoTeziTanka (PROTEUS author) independently reviewed #1191, #1192, #1193, #1194, #1195, #1196, #1197, and #1554. All came back LOOKS CLEAN - MERGE. PR #1554 received manual hand-review (not auto-classified). He retracted his own earlier citation of #1416/#1423 as legal, corrected to #1413 (dexhunter). Called the training-slice TTT approach "structurally the cleanest." Tagged to OpenAI maintainers. No response yet - merge rate is ~3%.
Day 28 - April 15
"peft PR #3154 closed, broader fix invited"
@BenjaminBossan agreed the helpful error message is valuable, but the Gemma 4 specific fix became redundant after PR #3136 merged. Closed #3154 and invited a broader redesign: general "invalid target module" error handler for all PEFT methods (LoRA, IA3, LoKr, LoHa, OFT), starting with LoRA but designed to extend. Scope agreed, tests will be written first.
Day 30-32 - April 15-19
"The byte-LUT bug: the GDN sub-1.01 cluster was an accounting artifact"
Between April 15-19, @tejasnaladala, @dexhunter, and @bigbag forensically identified a byte-counting defect in build_sentencepiece_luts that originated in PR #1545 and propagated through every GDN-family inheritance. Leading-space bytes were being counted twice, inflating the BPB byte denominator and making reported scores ~17.46% lower than canonical. PRs #1545, #1576, #1632, #1672, #1681, #1687, #1705, #1711, #1734 all self-closed after canonical rescoring. The reported "sub-1.01 cluster" did not hold: yahya010's own canonical check on #1734 put the reported 1.01080 at ~1.187. #1698 at a reported 1.00995 remains open but flagged. Issue #1719 tracks the bug class.
"The real canonical frontier: casefold tokenizers + MP-SGD TTT at 1.057-1.072"
After filtering out byte-buggy inheritance and pending-legality SLOT submissions, the legitimate open frontier lands in the 1.057-1.072 band. Top clean legal open: dexhunter #1693 (1.05733, Casefold V4 + AttnOutGate + Multi-Phase Global SGD TTT), #1670 (1.05970, same family), codemath3000 #1585 (1.0639, Casefold + Parallel Residuals), romeerp #1756 (1.06505, CaseOps + Recurrence Depth Curriculum), mikeapedia #1578 (1.0668, Custom Casefold), yahya010 #1727 (1.07217, SP8192 MP-SGD 4-phase), jorge-asenjo #1700 (1.07219, same base). Custom casefold-style tokenization + Multi-Phase SGD TTT is doing more work than the rest of the stack.
Day 33 - April 20
"Archive synced: 1,683 PRs captured, full-fidelity"
Pulled PRs #1633 through #1758 (123 new, 2 skipped as not-PRs, 0 failures). 1,919 new files downloaded, every PR diff preserved 100%. Local archive now 2.3 GB, NAS mirror at 2.6 GB. With 10 days to deadline, the archive represents the full corpus of community research to date.
Day 34 - April 21
"DGX Spark dev environment stood up; FA3 on Blackwell is a dead end, FA2 is the workaround"
Brought up a local experiment environment on DGX Spark (GB10, SM12.1, aarch64, CUDA 13.2 forward-compat) using nvcr.io/nvidia/pytorch:26.03-py3. Built FA3 from source in the NGC container per the NVIDIA engineer workaround on the developer forums. The build succeeds after ~50 minutes and import flash_attn_interface works, but any flash_attn_func call fails at runtime with "no kernel image is available for execution on the device". The build log shows every kernel object compiled for sm80 or sm90 only. Reported as a data point back to the NVIDIA developer forum thread and to FA upstream issue #1969. Working solution: use the pre-installed flash_attn 2.7.4 in the same NGC container, which runs bit-exact to torch.nn.functional.scaled_dot_product_attention on GB10. The pattern is consistent with AI2 open-instruct already excluding flash-attn on aarch64.
"Surgical MP-SGD port from #1626 into #1667 base"
Ported dexhunter #1626's Multi-Phase Global SGD TTT functions (eval_val_ttt_phased, train_val_ttt_global_sgd_distributed, plus five helpers) into MarioPaerle #1667's base (vanilla SP8192 + SmearGate + AttnOutGate). Added a PHASED_TTT_ENABLED gate so one file runs single-phase or multi-phase TTT. Compute-capability detection forces FA2 on Blackwell automatically, and a torch.compile disable flag works around GB10's 101KB shared-memory limit on Triton kernels. Audited each of Issue #1017's four conditions against the ported code.
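A hedged sketch of that compute-capability gate (the function name and fallback order are illustrative; the FA3 source build here only contains sm80/sm90 kernels):

```python
import torch

def pick_attention_backend():
    """Select an attention backend by compute capability: FA3 only where its
    kernels exist (Ampere/Hopper), otherwise FA2, otherwise plain SDPA."""
    major, _ = torch.cuda.get_device_capability()
    if major in (8, 9):                  # sm80 / sm90: the FA3 build has kernels
        try:
            import flash_attn_interface  # noqa: F401  (FA3)
            return "fa3"
        except ImportError:
            pass
    try:
        import flash_attn                # noqa: F401  (FA2, pre-installed in the NGC image)
        return "fa2"
    except ImportError:
        return "sdpa"                    # torch.nn.functional.scaled_dot_product_attention
```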
"Overnight ablation pipeline running: 9 experiments queued on Spark"
Pipeline compares pr1667 baseline, no-TTT baseline, MP-SGD 1-phase, MP-SGD 3-phase, Tapered WD, MP-SGD + Tapered WD, and gate-ablations (SmearGate off, AttnOutGate off, both off). Each experiment timeboxed to 60 min at reduced GB10 scale. Scale gap vs 8xH100 production means absolute BPB values do not match the leaderboard, but relative comparisons are apples-to-apples across configurations. 30-min health-check cycle runs alert-on-failure only. Full results in the next update.