Parameter Golf — Experiment Lab
OpenAI Model Craft Challenge · Nathan Maine · March 19 - April 21, 2026
Disclaimer — Unofficial & Independent
This dashboard is not affiliated with, endorsed by, or officially associated with OpenAI, RunPod, or the Parameter Golf competition organizers. It is solely an independent attempt by one participant to document, visualize, and analyze personal experiment data from a public open-source competition. All competition data referenced is derived from publicly available GitHub pull requests and the author's own runs. Interpretations of rules, rankings, and technique effectiveness reflect the author's personal understanding and may not represent official positions.
0 — Community Leaderboard (1,683 Submissions)
Default view: All submissions
This leaderboard defaults to showing every submission regardless of status. Use the highlighted filter dropdown below to narrow the view to Legal-only, Record Eligible, Closed, or other categories. Submissions flagged as Illegal (multi-epoch TTT, score-every-epoch) or Suspect (closed by organizers, BPB < 0.5 indicating likely n-gram cache exploits) are shown by default but visually distinguished. Classifications are our best interpretation of the issue #402 guidelines and may not be 100% accurate.
▼ Change the view
Use the dropdown to filter submissions. Default shows all.
1,683 shown
| Rank ↕ | PR# ↕ | Author ↕ | Submission Name ↕ | val_bpb ↕ | Size (MB) ↕ | <16MB ↕ | Date ↕ | Status ↕ | TTT Legal ↕ |
|---|---|---|---|---|---|---|---|---|---|
How to read: CLOSED submissions were flagged by organizers (often illegal TTT). OPEN submissions are likely valid. MERGED submissions are official baselines or infrastructure PRs. Data sourced from public submission.json files in each PR. Use "Record Eligible" to see only Legal + Open + Under 16MB submissions. Use "Over 16MB / Unknown Size" to find submissions that may exceed the size limit. If your submission's size is showing incorrectly, open an issue on the dashboard repo.
TTT Legality (per issue #402):
Legal: Score-first TTT. Score each chunk under no_grad, then adapt only on already-scored tokens. Per-document independent, with no cross-document leakage (sketched below).
Illegal: Multi-epoch TTT. Training on val then re-scoring, score-every-epoch keeping the minimum NLL, or any pre-eval adaptation.
Suspect: Closed by organizers, or BPB < 0.5 (likely n-gram cache exploit).
Full ruling
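A minimal sketch of the score-first pattern under this reading (illustrative only; `model`, `optimizer`, the chunking, and the per-document reset are stand-ins, not the competition harness):

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, optimizer, documents, chunk_size=512):
    """Score-first TTT sketch: each chunk is scored BEFORE the model ever
    trains on it, and weights are reset between documents so nothing leaks
    across document boundaries. (A strict version would also reset the
    optimizer state; kept simple here.)"""
    total_nll, total_tokens = 0.0, 0
    for doc in documents:                      # per-document independence
        saved = {k: v.clone() for k, v in model.state_dict().items()}
        for start in range(0, len(doc) - 1, chunk_size):
            chunk = doc[start:start + chunk_size + 1]
            inputs, targets = chunk[:-1], chunk[1:]
            with torch.no_grad():              # 1) score first (this counts toward BPB)
                logits = model(inputs.unsqueeze(0))
                total_nll += F.cross_entropy(logits.squeeze(0), targets,
                                             reduction="sum").item()
            total_tokens += targets.numel()
            # 2) only now adapt, on tokens that have already been scored
            logits = model(inputs.unsqueeze(0))
            loss = F.cross_entropy(logits.squeeze(0), targets)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        model.load_state_dict(saved)           # reset before the next document
    return total_nll / total_tokens            # nats/token; convert to BPB separately
```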
1a — Legal Submissions by Date
Each dot is a legal submission (suspect and illegal TTT filtered out). Color = BPB range. Hover for details. The competition launched March 18.
1b — Top 20 Verified (Legal + Open + Under 16MB)
Only submissions that are Legal, OPEN, and under 16MB. Suspect and illegal TTT submissions are excluded. This is the realistic leaderboard - what would actually count if scoring closed today.
1c — Legal vs Closed vs Suspect: The TTT Cliff
The story: Nearly everything below ~1.10 BPB was closed by organizers for using multi-epoch TTT (training on validation data). Issue #402 redrew the map. Since April, the landscape has shifted further: SLOT legality is now also in question (PR #1240 showed standard SLOT to be illegal; Issue #1336 asks about a causal variant). The cliff is no longer just about TTT.
1d — BPB by Technique Category (Legal Only)
Techniques extracted from submission names. Each box shows the distribution. TTT dominates the low end but most were banned. QAT + architectural tricks (U-Net, XSA, EMA) remain strong, but parallel residuals and depth recurrence have emerged as new top techniques on the legal frontier.
2 — My BPB Score Progression
Reading the chart: Lower BPB is better. The green zone shows verified legal scores. The red zone below 1.10 contains submissions using illegal multi-epoch TTT, all closed by organizers. Our best verified-legal number is 1.1048 (#1287, no SLOT). #1291 at 1.0925 is parked pending the SLOT legality ruling (Issue #1336). We retracted #1193, #406, and #1127 for TTT-on-val on April 13. Important correction: between April 15-19 a byte-counting bug in build_sentencepiece_luts (tracked in Issue #1719) was discovered in the GDN-family submissions. The `+1` on leading-space bytes counts those bytes twice, inflating the byte denominator and under-reporting BPB by roughly 17%. PRs #1632, #1672, #1681, #1687, #1705, #1711, #1734 all self-closed after canonical rescoring put them in the 1.18-1.22 range. #1698 at a reported 1.00995 is still open but flagged. The real canonical frontier right now sits around 1.057 (dexhunter #1693, Casefold V4 + Multi-Phase Global SGD TTT). 10 days to deadline.
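A quick sanity check on the direction of that error, as a worked illustration only (the ~17.46% factor and the #1734 numbers come from the forensic thread; the byte totals here are placeholders):

```python
# How an inflated byte denominator flatters BPB (BPB = total_nll_bits / total_bytes,
# so over-counting bytes makes the reported score look better than it is).
nll_bits = 1.187 * 1_000_000          # canonical: 1.187 BPB over a nominal 1M bytes
true_bytes = 1_000_000
buggy_bytes = true_bytes * 1.1746     # leading-space bytes counted twice (~17.46% extra)

print(nll_bits / true_bytes)          # ~1.187  (canonical rescoring)
print(nll_bits / buggy_bytes)         # ~1.011  (roughly what the buggy LUT reported on #1734)
```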
2 - My Experiment Log (127+ Runs)
| Run ↕ | Day ↕ | Date ↕ | Pod ↕ | Config / PR Base ↕ | val_bpb ↕ | step_ms ↕ | Steps ↕ | Size ↕ | <16MB ↕ | Legal ↕ | Submit? ↕ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PR#273 | 2 | Mar 20 | Pod 1 (1×H100) | 10L, Int6 QAT, SmearGate, SWA | 1.1575 | — | — | ~15MB | Yes | Yes | Submitted |
| PR#385 | 2 | Mar 20 | Pod 1 (1×H100) | 11L, WD=0.04, SWA=0.4 | 1.1488 | — | — | ~15MB | Yes | Yes | Submitted |
| PR#406 | 3 | Mar 21 | Pod 1 (1×H100) | XSA4 + EMA + Int6 QAT | 1.1287 | 82 | 7,300 | ~15MB | Yes | Yes | Submitted ★ |
| Auto×80 | 4 | Mar 22 | Pod 1 (1×H100) | Autoresearch (sed corruption) | CRASHED | — | — | — | — | — | Failed |
| Auto-1 | 4 | Mar 22 | Pod 1 (1×H100) | Baseline (1×H100) | 1.4230 | ~550 | ~500 | ~15MB | Yes | Yes | Baseline |
| T1 | 5 | Mar 23 | Pod 2 (1×H100) | PR#406 baseline | 2.1809 | 572 | 525 | ~15MB | Yes | Yes | 1×H100 |
| T2 | 5 | Mar 23 | Pod 2 (1×H100) | + reduce-overhead compile | 2.1787 | 568 | 567 | ~15MB | Yes | Yes | 1×H100 |
| T3 | 5 | Mar 23 | Pod 2 (1×H100) | PR#505 full arch | 4.7651 | 945 | 318 | ~19MB | No | Yes | Too slow |
| T1b | 5 | Mar 23 | Pod 3 KC (742TF) | PR#406 baseline | 2.1809 | 572 | 525 | ~15MB | Yes | Yes | 1×H100 |
| T2b | 5 | Mar 23 | Pod 3 KC (742TF) | + reduce-overhead | 2.1787 | 568 | 567 | ~15MB | Yes | Yes | 1×H100 |
| T3b | 5 | Mar 23 | Pod 3 KC (742TF) | PR#505 full | 4.7651 | 945 | 318 | ~19MB | No | Yes | Too slow |
| T4 | 5 | Mar 23 | Pod 3 KC (742TF) | stride=32 eval | TIMEOUT | — | — | — | — | — | Failed |
| T5 | 5 | Mar 23 | Pod 3 KC (742TF) | 13 layers | TIMEOUT | — | — | — | — | — | Failed |
| ICE-1 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#505 no TTT | 1.1279 | 133 | 4,490 | 19.8MB | No | Yes | Over size |
| ICE-2 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#462 10ep TTT | 1.0689 | 72 | 8,278 | 19.2MB | No | Illegal | Banned |
| ICE-3 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#77 LoRA TTT | 1.1951 | 51 | 11,822 | 15.9MB | Yes | Yes | Poor score |
| ICE-4 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#462 KV=4 | 1.0723 | 69 | 8,744 | 17.5MB | No | Illegal | Banned |
| ICE-5 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | PR#532 codebook+huffman | CRASHED | 112 | 5,368 | 14.6MB | Yes | — | Crash |
| ICE-7 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | KV=4 MLP=1536 | 1.0827 | 64 | 9,369 | 16.06MB | 60KB over | Illegal | Both |
| ICE-8 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | + BigramHash=7168 | 1.0842 | 64 | 9,396 | 16.07MB | Over | Illegal | Both |
| ICE-9 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | + FP16_EMBED=0 | 1.0820 | 64 | 9,385 | 16.18MB | Over | Illegal | Both |
| ICE-10 | 6 | Mar 24 | Pod 4 Iceland (8×H100) | dim=496 (size fit) | 1.0935 | 70 | 8,578 | 15.37MB | Yes | Illegal | Illegal TTT |
| KC2-A | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 WD=0.05 | 1.1390 | ~133 | ~4,500 | 18.5MB | No | Yes | Over size |
| KC2-B | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#77 legal LoRA TTT | 1.2063 | ~50 | ~12K | ~15MB | Yes | Yes | Wrong arch |
| KC2-C | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 dim=496 no TTT | 1.1366 | ~130 | ~4,500 | 16.09MB | 93KB over | Yes | 93KB over! |
| KC2-D | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#505 full no TTT | 1.1217 | ~133 | ~4,500 | 19.8MB | No | Yes | Over size |
| KC2-S1 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 1337) | 1.1305 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse |
| KC2-S2 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 42) | 1.1309 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse |
| KC2-S3 | 6 | Mar 24 | Pod 5 KC2 (8×H100) | PR#462 1ep TTT (seed 7) | 1.1310 | ~70 | ~8,500 | 15.4MB | Yes | Yes | Worse |
| 5090-1 | 8 | Mar 26 | RTX 5090 Reykjavik (180TF) | PR#414 baseline (SDPA fallback) | 2.3155 | 326 | 100 | ~15MB | Yes | Yes | 1×GPU test |
| 5090-2 | 8 | Mar 26 | RTX 5090 Reykjavik (180TF) | PR#549 LeakyReLU² + Banking | 2.3280 | 385 | 100 | ~15MB | Yes | Yes | 1×GPU test |
| PHANTOM | 8 | Mar 26 | RTX 5090 Reykjavik (180TF) | PHANTOM kernel (fused Linear+LoRA) | N/A | — | — | — | — | — | 0.41x slower |
| NG-1× | 8 | Mar 26 | H100 Mumbai (734TF) | PR#900 N-gram+Dirichlet (1×H100, 200 steps) | 0.1191 | 237 | 200 | 8.7MB | Yes | Yes | N-gram works! |
| PR#948-s1 | 8 | Mar 26 | 8×H100 Rancho Cordova (749TF) | Order-15 Dirichlet + Phrase (seed 1337) | 0.11555 | 146 | 3,827 | 15.1MB | Yes | Yes | Submitted ★ |
| PR#948-s2 | 8 | Mar 26 | 8×H100 Rancho Cordova (749TF) | Order-15 Dirichlet + Phrase (seed 42) | 0.11556 | 146 | 3,827 | 15.1MB | Yes | Yes | Submitted ★ |
| PR#948-s3 | 8 | Mar 26 | 8×H100 Rancho Cordova (749TF) | Order-15 Dirichlet + Phrase (seed 2025) | 0.11556 | 146 | 3,827 | 15.1MB | Yes | Yes | Submitted ★ |
| ABL-1 | 8 | Mar 27 | H100 Kansas City (740TF) | Baseline (control) | 0.11906 | 237 | 200 | ~8.7MB | Yes | Yes | Control |
| ABL-2 | 8 | Mar 27 | H100 Kansas City (740TF) | + Two-pass rescore | 0.11906 | 237 | 200 | ~8.7MB | Yes | Yes | No change |
| ABL-3 | 8 | Mar 27 | H100 Kansas City (740TF) | N-gram order 20 (was 15) | 0.11873 | 237 | 200 | ~8.7MB | Yes | Yes | Winner! -0.00033 |
| ABL-4 | 8 | Mar 27 | H100 Kansas City (740TF) | Int5 quantization | 0.11906 | 237 | 200 | ~8.7MB | Yes | Yes | No change |
| ABL-5 | 8 | Mar 27 | H100 Kansas City (740TF) | Comp alpha=0.30 | 0.11906 | 237 | 200 | ~8.7MB | Yes | Yes | No change |
| ABL-6 | 8 | Mar 27 | H100 Kansas City (740TF) | zstd compression | CRASHED | — | — | — | — | — | Not supported |
| PR#968-s1 | 8 | Mar 27 | 8×H100 Montréal (747TF) | Order-20 Dirichlet + Phrase (seed 1337) | 0.11544 | 177 | 3,170 | 15.1MB | Yes | Yes | Submitted ★ #1 |
| PR#968-s2 | 8 | Mar 27 | 8×H100 Montréal (747TF) | Order-20 Dirichlet + Phrase (seed 42) | 0.11546 | 177 | 3,170 | 15.1MB | Yes | Yes | Submitted ★ #1 |
| PR#968-s3 | 8 | Mar 27 | 8×H100 Montréal (747TF) | Order-20 Dirichlet + Phrase (seed 2025) | 0.11545 | 177 | 3,170 | 15.1MB | Yes | Yes | Submitted ★ #1 |
| KS-v1 | 10 | Mar 30 | Pod 6 RTX 5090 Iceland | Universal Transformer (22 iter, 90% sparse) | 1.8134 | 560 | 1,071 | 2.9MB | Yes | Yes | Research |
| KS-v2 | 10 | Mar 30 | Pod 6 RTX 5090 Iceland | UT-12 + 50% sparse + TTT | 1.4390 | 564 | 1,064 | 2.9MB | Yes | Yes | Submitted ★ |
| KS-v3 | 10 | Mar 30 | Pod 6 RTX 5090 Iceland | UT-12 dim=768, no sparse | 1.5212 | 981 | 610 | 6.5MB | Yes | Yes | Research |
| DIFF-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | Text Diffusion (MDLM) hybrid AR+diff | 3.3801 | 2,311 | 79 | 5.3MB | Yes | Yes | Submitted ★ |
| ADAPT-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | Random Linear Map Adapters | 2.2017 | 609 | 296 | 10.5MB | Yes | Yes | Submitted ★ |
| JEPA-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | LLM-JEPA (Joint Embedding) | 2.2020 | 682 | 265 | 9.1MB | Yes | Yes | Submitted ★ |
| MAMBA-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | Mamba SSM Hybrid 3:1 | 3.3168 | ~600 | ~300 | 5.3MB | Yes | Yes | Submitted ★ |
| HNET-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | H-Net Dynamic Chunking | 1.6393 | 624 | 289 | 8.5MB | Yes | Yes | Research |
| MEGA-1 | 10 | Mar 30 | Pod 7 RTX 5090 Iceland | Triton Megakernels (broken) | 3.3164 | 274 | 657 | 5.3MB | Yes | Yes | CRASHED |
| MEGA-2 | 11 | Mar 31 | Pod 6 RTX 5090 Iceland | Megakernel (fixed, fullgraph=True) | 1.3560 | 616 | 975 | 12.3MB | Yes | Yes | Submitted ★ |
| HNET-2 | 11 | Mar 31 | Pod 7 RTX 5090 Iceland | H-Net + TTT (10 min) | 1.3587 | 624 | 964 | 8.6MB | Yes | Yes | Submitted ★ |
| CTRL-1 | 11 | Mar 31 | Pod 7 RTX 5090 Iceland | Control baseline (no tricks) | 1.3577 | 606 | 958 | 12.3MB | Yes | Yes | Baseline |
| H200-1 | 11 | Mar 31 | Pod 8 H200 SXM (742TF) | SOTA+Triton (FA2, GPTQ test) | 5.6466 | 1,106 | 543 | 5.9MB | Yes | Yes | GPTQ test |
| #1019-s42 | 14 | Apr 2 | Pod Montreal 778T | PR#1019 baseline | 1.1265 | 114 | 5,261 | 15.89MB | Yes | Yes | Baseline |
| #1019-s1337 | 14 | Apr 2 | Pod Montreal 778T | PR#1019 + FA3 fix | 1.1266 | 115 | 5,197 | 15.87MB | Yes | Yes | Baseline |
| #1176-SLOT | 14 | Apr 2 | Pod Montreal 778T | PR#1176 SLOT QK4.0 | 1.1147 | 101 | 5,900 | 15.68MB | Yes | Yes | Matched SOTA |
| #1218-s42 | 14 | Apr 2 | Pod Montreal 778T | Vocab4096 MLP4x WD.085 | 1.1039 | 130 | 4,807 | 15.95MB | Yes | Yes | Submitted ★ |
| #1218-s1337 | 14 | Apr 2 | Pod Montreal 778T | Same | 1.1054 | 130 | 4,701 | 15.93MB | Yes | Yes | Submitted ★ |
| #1218-s2025 | 14 | Apr 2 | Pod Montreal 778T | Same | 1.1052 | 130 | 4,758 | 15.96MB | Yes | Yes | Submitted ★ |
| #1218+SLOT-s42 | 15 | Apr 3 | Pod Iceland 802T | Vocab4096 MLP4x SLOT | 1.0947 | 115 | 5,165 | 15.95MB | Yes | Yes | Submitted ★ |
| #1218+SLOT-s1337 | 15 | Apr 3 | Pod Iceland 802T | Same | 1.0913 | 100 | 5,890 | 15.93MB | Yes | Yes | Submitted ★ |
| #1218+SLOT-s2025 | 15 | Apr 3 | Pod Iceland 802T | Same | 1.0915 | 100 | 5,900 | 15.95MB | Yes | Yes | Submitted ★ |
| #1287-s42 | 15 | Apr 3 | Pod Iceland 802T | Vocab4096 MLP4x NO SLOT | 1.1048 | 130 | 4,807 | 15.95MB | Yes | Yes | Submitted |
| PROTEUS-baseline | 16-17 | Apr 4-5 | DGX Spark GB10 | sp1024 baseline 1000 steps | 1.4601 | - | 1,000 | 8.99MB | Yes | Yes | Local only |
| PROTEUS-parallel | 16-17 | Apr 4-5 | DGX Spark GB10 | Parallel+INT5+SLOT 1000 steps | 1.4479 | - | 1,000 | 8.21MB | Yes | Yes | Local only |
| Ablation-A | 17-18 | Apr 5-6 | DGX Spark GB10 | Baseline 500 steps | 1.5734 | - | 500 | 7.55MB | Yes | Yes | Local only |
| Ablation-C | 17-18 | Apr 5-6 | DGX Spark GB10 | Parallel only 500 steps | 1.5559 | - | 500 | 7.58MB | Yes | Yes | Local only |
| Ablation-F | 17-18 | Apr 5-6 | DGX Spark GB10 | Parallel+SLOT 500 steps | 1.5557 | - | 500 | 6.67MB | Yes | Yes | Local only |
| UT-1 (6 iters) | 20 | Apr 9 | DGX Spark GB10 | Universal Transformer 1 block x 6 iters | 3.2483 | 707 | 200 | - | Yes | Yes | Research PR #1193 |
| UT-2 (24 iters) | 20 | Apr 9 | DGX Spark GB10 | Universal Transformer 1 block x 24 iters | 3.2490 | 2,734 | 200 | 1.45MB | Yes | Yes | Research PR #1193 |
| DIFF-1 (70/30) | 20 | Apr 9 | DGX Spark GB10 | Text Diffusion 70% AR / 30% diff | 2.4195 | 1,388 | 200 | 6.90MB | Yes | Yes | Research PR #1194 |
| DIFF-2 (50/50) | 20 | Apr 9 | DGX Spark GB10 | Text Diffusion 50/50 | 2.4194 | 997 | 200 | 6.90MB | Yes | Yes | Research PR #1194 |
| DIFF-3 (pure AR) | 20 | Apr 9 | DGX Spark GB10 | Pure AR reference | 2.4194 | 997 | 200 | 6.90MB | Yes | Yes | Research PR #1194 |
| RND-1 (default 0.5%) | 20 | Apr 9 | DGX Spark GB10 | Random Adapters default | 2.5123 | 894 | 200 | 10.49MB | Yes | Yes | Research PR #1195 |
| RND-2 (wider rank 8) | 20 | Apr 9 | DGX Spark GB10 | Random Adapters wider | 2.6323 | 881 | 200 | 1.36MB | Yes | Yes | Research PR #1195 |
| RND-3 (5% unfrozen) | 20 | Apr 9 | DGX Spark GB10 | Random Adapters 5% trainable | 2.5120 | 895 | 200 | 10.49MB | Yes | Yes | Research PR #1195 |
| RND-4 (progressive) | 20 | Apr 9 | DGX Spark GB10 | Random Adapters progressive unfreeze | 2.5122 | 894 | 200 | 10.49MB | Yes | Yes | Research PR #1195 |
| JEPA-1 (10%) | 20 | Apr 9 | DGX Spark GB10 | JEPA 10% auxiliary weight | 2.2323 | 498 | 200 | 7.04MB | Yes | Yes | Research PR #1196 |
| JEPA-2 (30%) | 20 | Apr 9 | DGX Spark GB10 | JEPA 30% auxiliary weight | 2.2322 | 496 | 200 | 7.04MB | Yes | Yes | Research PR #1196 |
| JEPA-3 (50%) | 20 | Apr 9 | DGX Spark GB10 | JEPA 50% auxiliary weight | 2.2322 | 497 | 200 | 7.04MB | Yes | Yes | Research PR #1196 |
| SSM-1 (1:1 ratio) | 20 | Apr 9 | DGX Spark GB10 | Mamba SSM 1:1 SSM:Attention | 2.0295 | 37,492 | 200 | 11.42MB | Yes | Yes | Research PR #1197 |
| SSM-4 (state=64) | 20 | Apr 9 | DGX Spark GB10 | Mamba SSM larger state dim (partial) | 2.1816 | 55,824 | 100 | - | Yes | Yes | Research PR #1197 |
| HNET-1 (default) | 20 | Apr 9 | DGX Spark GB10 | H-Net default chunker | 2.0558 | 513 | 200 | 7.37MB | Yes | Yes | Research PR #1191 |
| HNET-2 (large chunker) | 20 | Apr 9 | DGX Spark GB10 | H-Net large chunker | 2.0559 | 513 | 200 | 7.37MB | Yes | Yes | Research PR #1191 |
| HNET-3 (boundary reg) | 20 | Apr 9 | DGX Spark GB10 | H-Net boundary regularizer | 2.0558 | 514 | 200 | 7.37MB | Yes | Yes | Research PR #1191 |
| MEGA-1 (baseline) | 20 | Apr 9 | DGX Spark GB10 | Megakernel baseline 9L d=512 | 2.2147 | 584 | 200 | 7.13MB | Yes | Yes | Research PR #1192 |
| MEGA-2 (d=640) | 20 | Apr 9 | DGX Spark GB10 | Megakernel wider d=640 | 2.1592 | 903 | 200 | 10.25MB | Yes | Yes | Research PR #1192 |
| MEGA-3 (11 layers) | 20 | Apr 9 | DGX Spark GB10 | Megakernel deeper 11L | 2.1961 | 714 | 200 | 8.61MB | Yes | Yes | Research PR #1192 |
Note: 1xGPU runs show higher BPB due to fewer training steps. Only 8xH100 results are competition-comparable. The N-gram cache revolution (Day 8) took us from 1.1287 to 0.11545, a 10x improvement. Days 14-15: Vocab 4096 + MLP 4.0x + SLOT pushed neural model to 1.0913 BPB. Total: 127+ experiments across 13 pods + DGX Spark, ~$360 spent. April 9: 22-run research expansion covering all 7 OpenAI-requested architectures (Universal Transformer, Text Diffusion, Random Adapters, JEPA, Mamba SSM, H-Net, Megakernels).
3 — Technique Effectiveness
Int6 QAT
Quantization-aware training. Table stakes — every top submission uses it.
REQUIRED · ~0.5MB model size reduction
reduce-overhead compile
torch.compile mode. -25ms/step on slow GPUs, -4ms on fast. Free win.
PROVEN · -25ms/step (slow GPU)
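For reference, enabling it is a one-line change (a sketch assuming a CUDA device and any nn.Module):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).cuda()
# "reduce-overhead" captures CUDA graphs where safe; max-autotune crashed against
# the rotary cache in our runs, so this is the mode to standardize on.
model = torch.compile(model, mode="reduce-overhead")
```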
EMA 0.997
Exponential moving average of weights during training. Small but consistent improvement.
HELPFUL · small quality gain
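A minimal sketch of the update rule at decay 0.997 (keep a frozen copy of the model and update it after every optimizer step; buffers omitted for brevity):

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.997):
    # ema <- decay * ema + (1 - decay) * current, applied parameter-wise
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage sketch: ema_model = copy.deepcopy(model).eval(); after each optimizer.step()
# call ema_update(ema_model, model); evaluate/export the EMA copy, not the raw weights.
```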
Star-ReLU (not SwiGLU)
ReLU² + learnable scale/bias. Used in top submissions. PR #505 title was misleading.
TOP-TIER · used in #1 no-TTT submission
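A minimal sketch of the activation as described (cf. MetaFormer's StarReLU; the init values here are illustrative, not the submission's exact settings):

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """ReLU^2 with a learnable scale and bias: s * relu(x)**2 + b."""
    def __init__(self, scale=1.0, bias=0.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale))
        self.bias = nn.Parameter(torch.tensor(bias))

    def forward(self, x):
        return self.scale * torch.relu(x).square() + self.bias
```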
Late QAT
Enable QAT when LR drops below 15% threshold. Better than constant QAT.
BETTER · vs constant QAT
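A hedged sketch of the gating logic; `enable_fake_quant()` is a hypothetical hook standing in for whatever QAT toggle the training loop actually exposes:

```python
def maybe_enable_qat(model, optimizer, peak_lr, threshold=0.15):
    """Late QAT sketch: turn on fake-quant only once the LR schedule has
    decayed below threshold * peak_lr, instead of quantizing from step 0."""
    current_lr = optimizer.param_groups[0]["lr"]
    if current_lr < threshold * peak_lr and not getattr(model, "qat_on", False):
        model.enable_fake_quant()   # hypothetical hook: start quantization-aware training
        model.qat_on = True
```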
Legal LoRA TTT (score-first)
Single-pass adaptation at eval. Score token, then train. Never see same data twice. PR #77 proved -0.037 BPB.
PROVEN · -0.037 BPB (needs lr=0.01)
XSA4 (Extended Self-Attention)
Adds quality but costs +17ms/step overhead. Net effect marginal.
MIXED · +17ms/step overhead
U-Net Skip Connections
5 encoder + 6 decoder with learned gates. Helps quality but adds model size.
MIXED · quality ↑ but size ↑
BigramHash
Hash-based bigram embedding. Minimal measurable impact on quality in our tests.
MINIMAL · no clear benefit
Custom Kernels (THOR, REAPER...)
Fused Triton/CUDA operations. 1.17×–1.70× speedup but only during eval, not training.
EVAL ONLY · doesn't speed training
Multi-epoch TTT
10-epoch AdamW on validation data. -0.059 BPB but BANNED by organizers (Issue #402).
BANNED · all sub-1.10 submissions closed
torch.compile max-autotune
CUDAGraph conflicts with rotary cache. Crashes every time. Use reduce-overhead instead.
CRASHES · CUDAGraph conflicts
PHANTOM Kernel (Fused Linear+LoRA)
Custom Triton kernel fusing base linear + LoRA. Rank-8 matmuls too small to benefit — cuBLAS already optimal.
0.41× SLOWER · dead end confirmed on RTX 5090
N-gram Cache (orders 2-20)
Hash table of token patterns from already-scored validation data. Backward-looking, score-first, causal. THE paradigm shift.
GAME CHANGER · 1.17 → 0.115 BPB (10× improvement)
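A toy illustration of the backward-looking, score-first cache (counts only, longest-match prediction; the real submissions hash contexts and add smoothing):

```python
from collections import defaultdict

class NGramCache:
    """Predict the next token from contexts seen EARLIER in the already-scored
    validation stream; add the current token only AFTER it has been scored."""

    def __init__(self, max_order=15):
        self.max_order = max_order
        self.counts = defaultdict(lambda: defaultdict(int))   # context -> next -> count

    def predict(self, history):
        # Longest matching context wins in this toy version.
        for n in range(min(self.max_order, len(history)), 0, -1):
            ctx = tuple(history[-n:])
            if ctx in self.counts:
                nxt = self.counts[ctx]
                total = sum(nxt.values())
                return {tok: c / total for tok, c in nxt.items()}
        return None                                            # fall back to the neural model

    def update(self, history, token):
        # Called only after `token` has been scored, keeping the cache causal.
        for n in range(1, min(self.max_order, len(history)) + 1):
            self.counts[tuple(history[-n:])][token] += 1
```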
Dirichlet Smoothing (per-order OBCL)
Bayesian posterior: p = (count + c × prior) / (total + c). Learned concentration per order: 50.0 (bigrams) → 1.86 (14-grams).
CRITICAL · -0.06 BPB vs linear interpolation
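The formula above as code, with toy numbers (the learned per-order concentrations in the submission range from ~50 for bigrams down to ~1.86 for 14-grams):

```python
def dirichlet_smoothed(count, total, prior, concentration):
    """Per-order Dirichlet smoothing: p = (count + c * prior) / (total + c),
    where `prior` is the lower-order (or uniform) probability of the token and
    `c` is the learned concentration for this n-gram order."""
    return (count + concentration * prior) / (total + concentration)

# Toy numbers: context seen 4 times, token seen once, uniform prior over 256 bytes.
p = dirichlet_smoothed(count=1, total=4, prior=1 / 256, concentration=50.0)
```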
Phrase Cache (suffix matching)
Match 16-20 token suffixes with Dirichlet smoothing. Captures long-range repetition that 15-gram can't reach.
PROVEN · -0.11 BPB additional improvement
Complementary Training
Downweight loss on N-gram-predictable tokens during training. Focus neural model on hard tokens.
PROVEN · -0.07 BPB (alpha=0.50)
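A sketch of the idea, assuming the n-gram cache's probability of the true token is available per position; the exact weighting rule used in the submission may differ:

```python
import torch.nn.functional as F

def complementary_loss(logits, targets, ngram_prob, alpha=0.5):
    """Complementary training sketch: per-token CE is downweighted where the
    n-gram cache already predicts the target well, so the gradient budget goes
    to the hard tokens. `ngram_prob` has the same shape as `targets`."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1),
                         reduction="none")
    weight = 1.0 - alpha * ngram_prob.view(-1)      # predictable tokens get less weight
    return (weight * ce).sum() / weight.sum()
```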
Order 20 (vs 15)
Extending N-gram orders from 15 to 20 gave -0.00033 BPB in ablation, -0.00011 in competition.
SMALL WIN · -0.00011 BPB (confirmed via 6-test ablation)
H-Net Dynamic Chunking
Learned boundary prediction between tokens. 263K extra params, matched baseline within 0.001 BPB.
PROVEN · baseline-matching with learned tokenization
Fused Triton Megakernels
RMSNorm + LeakyReLU² eval kernels. Beat baseline by 0.0017 BPB.
PROVEN · -0.0017 BPB improvement
Full Hessian GPTQ
Cholesky error compensation. Works on Hopper (H200/H100), crashes on consumer GPUs (5090). Best results with AR self-gen calibration data.
HOPPER ONLY · H100/H200 exclusive, works with AR self-gen calibration
Universal Transformer
Shared-weight block looped 12×. Saves 90% block params but loses 0.08 BPB vs flat. Confirms PR #363 findings.
RESEARCH VALUE · depth recurrence confirmed inferior at this scale
Adaptive Density
Sparse→dense curriculum. Faster early steps but net BPB worse.
MIXED · faster early steps, worse final BPB
Echo Training
Self-distillation from EMA. Crashed due to autograd conflicts. Needs architectural rethink.
INCOMPLETE · autograd conflicts
Text Diffusion (MDLM)
Hybrid AR+diffusion. 3.38 BPB. Too slow (2.3s/step) for 10-min window. Signs of life only.
RESEARCH ONLY · 3.38 BPB, 2.3s/step
Random Linear Map Adapters
Frozen orthogonal weights. 2.20 BPB. Pretraining on frozen random projections doesn't work.
NEGATIVE RESULT · 2.20 BPB
LLM-JEPA
Joint embedding prediction. 2.20 BPB. JEPA benefits are invisible on raw BPB pretraining metric.
NEGATIVE RESULT · 2.20 BPB
Mamba SSM Hybrid
Pure PyTorch SSM. 3.32 BPB. No custom kernels = too slow. Needs CUDA implementation.
RESEARCH ONLY · 3.32 BPB, needs CUDA kernels
Depth Recurrence
Confirmed independently: flat models beat looped models at this scale. Also confirmed by PR #363.
DEAD END · flat > looped at 16MB scale
SLOT (per-batch delta optimization)
Eval-time per-batch delta optimization. Consistent gain on top of any base model with zero training changes needed. Standard SLOT ruled illegal by PR #1240 (100% causal violation). Context-only SLOT variant under review (Issue #1336).
PROVEN · -0.007 BPB eval-time gain, legality under review
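For clarity on why standard SLOT fails the causality test, here is a hedged sketch of the per-batch delta optimization (hypothetical `hidden_states()`/`head` accessors): the delta is fit on the very tokens that are then scored.

```python
import torch
import torch.nn.functional as F

def slot_score(model, batch_inputs, batch_targets, steps=3, lr=0.01):
    """Standard SLOT sketch (the variant PR #1240 found causally illegal)."""
    with torch.no_grad():
        h = model.hidden_states(batch_inputs)          # hypothetical: pre-head activations
    delta = torch.zeros(1, 1, h.size(-1), device=h.device, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):                             # fit delta on the eval batch itself
        loss = F.cross_entropy(model.head(h + delta).flatten(0, 1),
                               batch_targets.flatten())
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                              # score with the fitted delta
        return F.cross_entropy(model.head(h + delta).flatten(0, 1),
                               batch_targets.flatten())
```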
Vocab 4096 + MLP 4.0x
Bigger vocabulary and wider FFN combined with aggressive weight decay (0.085). Better compression per parameter.
BREAKTHROUGH · 1.1048 base (3-seed mean), beats SOTA without eval tricks
Brotli-11 Compression
Saves ~400KB vs LZMA, enabling larger models to fit under the 16MB submission limit.
PROVEN · ~400KB savings vs LZMA, enables bigger models under 16MB
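A minimal comparison sketch, assuming the `brotli` package is installed (the checkpoint path is a placeholder; quality 11 is Brotli's maximum setting):

```python
import brotli
import lzma

with open("model_checkpoint.bin", "rb") as f:   # placeholder checkpoint path
    raw = f.read()

b = brotli.compress(raw, quality=11)            # max quality: slow, smallest output
x = lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)
print(len(b), len(x))                           # Brotli-11 saved ~400KB on our checkpoints
```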
Parallel Residuals
Dual-stream attention/MLP lanes with learnable route vector and lane_merge. From PR #1204, #1289.
PROVEN · -0.0175 BPB (DGX Spark ablation)
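A sketch of the parallel-lane block, loosely modeled on the PR description; the exact lane_merge parameterization in #1204/#1289 may differ:

```python
import torch
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """Attention and MLP read the same normalized input in parallel (instead of
    sequentially) and are merged by a learnable route vector."""
    def __init__(self, dim, n_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.lane_merge = nn.Parameter(torch.full((2,), 0.5))   # learnable route vector

    def forward(self, x, attn_mask=None):
        h = self.norm(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        m = self.mlp(h)
        return x + self.lane_merge[0] * a + self.lane_merge[1] * m
```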
4 — Cost & Efficiency
Pod 1 — Days 2–4 (1×H100)
$90
Includes $43 overnight waste
Pod 2 — Day 5 (1×H100)
$7
3 targeted tests
Pod 3 KC — Day 5 (1×H100)
$15
742 TFLOPS — best single GPU
Pod 4 Iceland — Day 6 (8×H100)
$65
10 experiments — most productive session
Pod 5 KC2 — Day 6 (8×H100)
$35
7 runs including 3-seed verification
Overnight Waste
$43
80+ crashed autoresearch experiments
RTX 5090 — Day 8 (Reykjavik)
$5
PR#414 vs #549 + PHANTOM kernel test
H100 Mumbai — Day 8 (1×H100)
$5
First N-gram test — 0.1191 BPB!
8×H100 Rancho Cordova — Day 8
$35
3-seed PR#948 — #1 tied at 0.11556
H100 Kansas City — Day 8 (ablation)
$8
6-test ablation battery — found order 20
8×H100 Montréal — Day 8
$35
3-seed PR#968 — NEW #1 at 0.11545
Broken Pods (Amsterdam)
$15
2 pods failed — critical error + broken CUDA
Pod 6 — Days 10-11 (1×RTX 5090 Iceland)
$5.50
Kitchen Sink tuning + megakernel experiments
Pod 7 — Days 10-11 (1×RTX 5090 Iceland)
$7.00
7 research architecture runs + overnight
Pod 8 — Day 11 (1×H200 SXM)
$13.00
GPTQ validation on Hopper + SOTA test
Pod Montreal — Day 14 (8×H100 778T)
~$15
3 baseline + SLOT + 3x #1218 seeds
Pod Iceland — Day 15 (8×H100 802T)
~$15
3x #1218+SLOT seeds, best pod ever (802T)
DGX Spark - Days 16-19 (Local GB10)
$0
10 PROTEUS ablation runs, free local compute
5 — Pod Performance Comparison
| Pod | Location | GPU Config | GEMM (TFLOPS) | Step ms (same code) | Cost/hr | Verdict |
|---|---|---|---|---|---|---|
| Pod 1 | Unknown | 1×H100 SXM | Not measured | 498–592 ms | ~$2.69 | Variable |
| Pod 2 | Unknown | 1×H100 SXM | Not measured | 553–578 ms | ~$2.69 | Consistent |
| Pod 3 | Kansas City, MO | 1×H100 SXM | 742 TFLOPS | 568–572 ms | $2.69 | Exceptional |
| Pod 4 | Reykjavík, Iceland | 8×H100 SXM | 733 TFLOPS | 45–133 ms | $21.52 | Top-tier |
| Pod 5 | Kansas City (KC2) | 8×H100 SXM | ~730 est | 60-133 ms | ~$21.52 | Solid |
| Pod Montreal (Apr 2) | Montreal, QC | 8×H100 SXM | 778 TFLOPS | ~130 ms (#1218) | ~$2.69/hr | Exceptional |
| Pod Iceland (Apr 3) | Reykjavik, IS | 8×H100 SXM | 802 TFLOPS | ~100 ms (#1218 cached) | ~$2.69/hr | Best Ever |
| DGX Spark (Apr 4-6) | Local (10G LAN) | 1x GB10 128GB | N/A (unified) | ~6000 ms (no compile) | $0/hr | Free + Always On |
Key insight: GPU quality varies significantly even within the same model. Pod 3 hit 742 TFLOPS - well above the ~275 TFLOPS reference. This is why we built the pod benchmarking tool. The DGX Spark GB10 (local, free) runs at ~6s/step without torch.compile but the parallel residuals architecture is 2.3x faster than sequential, partially compensating.
6 — Key Discoveries
Day 2 — March 20
"Step speed is the #1 bottleneck"
Every millisecond per step = fewer total training steps = worse BPB. Discovered after 17 experiments showed speed beats depth.
Day 3 — March 21
"XSA4 adds quality but costs speed"
Extended Self-Attention improved per-step quality but +17ms/step nearly negated the gain. The tradeoff warning sign.
Day 4 — March 22
"Autoresearch overnight = $43 lesson in automation"
80+ experiments crashed due to sed corruption, OOM errors, and eval timeouts. Led to: "use bash scripts, not Claude Code, for automated loops."
Day 5 — March 23
"reduce-overhead = free -25ms/step"
torch.compile reduce-overhead mode confirmed working where max-autotune crashes. -25ms on slow GPUs, -4ms on fast ones. Zero cost.
Day 5 — March 23
"GPU quality matters more than compile mode"
Pod 3 at 742 TFLOPS showed reduce-overhead only saved 4ms vs 25ms on slower hardware. Pod selection is the real optimization.
Day 6 — March 24
"Half the leaderboard was using illegal TTT"
Issue #402 revealed multi-epoch TTT is banned. Every sub-1.10 submission was closed. Our #9 became top 5 on verified rankings overnight.
Day 6 — March 24
"Our PR #406 is actually top 5 verified"
1.1287 BPB beats every submission except verified legal TTT entries around 1.12. We were closer to winning than we thought.
Day 6 — March 24
"Custom kernels help eval, not training"
Deep-read of PR #376's kernel library revealed all fused operations only accelerate evaluation. Training path uses standard torch.compile.
Day 6 — March 24
"93KB over the size limit = so close yet so far"
Run KC2-C: PR#505 dim=496 produced 16.09MB — just 93KB over the 16MB limit. Best legal no-TTT architecture but can't submit it.
Day 10 — March 30
"Research Sprint: all 7 OpenAI requests in 48 hours"
Implemented all 7 of OpenAI's requested research directions in 48 hours. 11,810 lines of code across 8 training scripts. Only entrant to attempt comprehensive coverage.
Day 10 — March 30
"Depth Recurrence Confirmed Dead"
Our Universal Transformer experiments independently confirmed PR #363's finding: shared-weight architectures lose ~0.08 BPB vs flat models at 16MB scale.
Day 10 — March 30
"H-Net Surprise: learned chunking matches baseline"
Learned dynamic chunking (H-Net) matched baseline within 0.001 BPB using only 263K extra parameters. Suggests learned tokenization is a viable research direction.
Day 11 — March 31
"Triton Kernels Debug: autograd broke, eval-only fix"
First attempt at training-time Triton kernels broke autograd (no gradients flowing). Fix: eval-only Triton + fullgraph=True torch.compile. Net result: 0.0017 BPB improvement.
Day 11 — March 31
"GPTQ on Hopper — H200 only"
Full Hessian GPTQ Cholesky decomposition crashes on RTX 5090 (consumer GPU precision) but works perfectly on H200 SXM (Hopper). This is an H100/H200-only technique.
Day 11 — March 31
"Brain Trust Gaps: 12 major knowledge gaps"
Queried 343K-chunk expert knowledge base. Found zero coverage on GPTQ, Triton kernel authoring, Mamba-3, and competition meta-techniques. 12 major knowledge gaps identified.
Day 11 — March 31
"Partition Function Challenge: n-gram scores may be invalid"
PR #1147 mathematically proved that Dirichlet/n-gram cache approaches (including our PRs #948/#968) produce invalid BPB due to unnormalized probability distributions. Our n-gram scores may be invalidated.
Day 14-15 — April 2-3
"Vocab 4096 + MLP 4.0x beats SOTA without eval tricks"
PR #1218 config with vocab 4096, MLP 4.0x expansion, and WD 0.085 achieved 1.1048 BPB (3-seed mean) on base model alone. No eval-time tricks needed to beat verified SOTA.
Day 14-15 — April 2-3
"SLOT gives consistent -0.007 BPB on top of base"
Per-batch delta optimization at eval time. Stacks cleanly on any trained model. Combined with #1218, pushed to 1.0925 mean (3-seed) on Iceland pod.
Day 15 — April 3
"Iceland pod lottery win: 802 TFLOPS + cached compilation = 100ms/step"
Best pod we have ever used. 802 TFLOPS GEMM performance combined with cached Triton compilation gave 100ms/step, enabling 5,900 training steps in 10 minutes.
Day 14-15 — April 2-3
"Weight decay correlates with compressibility (R^2 ~0.99)"
Higher weight decay (0.085) produces more compressible checkpoints. This enables bigger models to fit under 16MB with Brotli-11 compression. Nearly perfect linear correlation.
Day 16-17 - April 4
"PROTEUS integration: 4 features ported in one session"
Integrated parallel residuals, mixed INT5/INT6 quant, score-first TTT, and CPU test suite from PROTEUS v1.6 (PR #1289). All 22 CPU tests passing.
Day 17-19 - April 5-6
"Parallel residuals: the clear winner (-0.0175 BPB)"
7-run overnight ablation on DGX Spark isolated feature contributions. Parallel residuals delivered -0.0175 train_bpb and 2.3x throughput. INT5 saves 0.9 MB. SLOT adds nothing on top of parallel.
Day 17-19 - April 5-6
"SLOT legality crisis: standard SLOT proven illegal"
PR #1240 showed 100% causal violation rate for standard SLOT. Our #1291 (1.0925) uses standard SLOT. Issue #1336 pending ruling on causal SLOT variant. Safe fallback: #1287 at 1.1048.
Day 19 - April 6
"sp4096 tokenizer not officially approved"
Discussion on PR #1333 revealed sp4096 has been adopted by ~8 PRs but never formally approved by maintainers. Adds risk to all sp4096 submissions including ours.
Day 19 - April 6
"PR #1334 shows the legal ceiling: 1.0897 with zero eval tricks"
aryanbhosale's Track A submission uses our base stack (#1287) + depth recurrence + parallel residuals + MuonEq-R. 1.0897 BPB with no SLOT, no TTT, no n-gram. Credits our PR #1287.
Day 22 - April 9
"22-run research expansion: all 7 OpenAI-requested architectures ablated in one night"
Overnight DGX Spark run covering Universal Transformer, Text Diffusion, Random Adapters, JEPA, Mamba SSM, H-Net, and Megakernels. 20 of 22 runs successful. Published raw data as public gist for community reuse.
Day 22 - April 9
"Surprise: SSM wins on raw BPB despite being 50x slower per step"
Mamba-style SSM hit 2.0295 BPB with only ~5 effective training steps. Pure PyTorch implementation bottlenecks everything at 37s/step. Fast Triton selective scan kernel would be competitive.
Day 22 - April 9
"Hyperparameters inside an architecture often do nothing"
JEPA weight (10/30/50%), diffusion ratio (70/30, 50/50, pure AR), H-Net chunker config all produced identical BPB to 4 decimal places. Architectures either work or they do not. Tuning does not rescue them.
Day 23 - April 10
"The landscape shifted fast: we are no longer top 5 Track A"
23 new PRs in last 3 days. samacqua #1530 at 1.0734 is new Track A leader. msisovic #1529 at 1.0744, aryanbhosale #1540 at 1.0777 close behind. SP8192 vocab emerging as new standard. Our #1287 at 1.1048 needs another push.
Day 26 - April 13
"First sub-1.01 legal submission: dexhunter #1586 hits 1.0000 BPB"
Per-Layer Adaptive GPTQ Clip + int7 Embedding. Clean round number. The 1.00 BPB milestone is broken. Hkoyuer #1632 right behind with GDN-Hybrid (Gated DeltaNet + SWA) at 1.0274. 78 new PRs in 3 days, the final week is accelerating hard.
Day 26 - April 13
"Self-flagged 3 PRs for TTT-on-val after @MatoTeziTanka audit"
@MatoTeziTanka (PROTEUS #1289) flagged PR #1193 and #406 on April 11-12 for multi-epoch TTT on val_tokens. Audited our other submissions proactively and found the same pattern in #1127. Retracted all three, kept pre-SDTTT numbers on #406 (1.1455 legal), filed clean resubmission #1554. Added compliance notes to #948/#968 n-gram PRs (different class, awaiting maintainer ruling).
Day 26 - April 13
"RunPod $100 credit + PR #116 on containers repo"
Closed the loop on ticket #35283 after 12 days of back-and-forth. Received $100 credit. @max4c (RunPod engineer) built PR #116 on runpod/containers based on my PR #115, crediting my root cause analysis explicitly. Public GitHub pressure worked where private support tickets stalled.
Day 26 - April 13
"Submitted peft PR #3154: Gemma 4 dispatch error UX fix"
Closed the loop on huggingface/peft #3129 after thread got sidetracked. Built minimal repro on Spark, verified @BenjaminBossan's .linear regex workaround, submitted PR #3154 adding an actionable hint to PEFT's dispatch error for wrapper modules like Gemma4ClippableLinear. Full TestLoraInitialization suite passes with zero regressions.
Day 27 - April 14
"All 7 research PRs + #1554 rated LOOKS CLEAN by independent audit"
@MatoTeziTanka (PROTEUS author) independently reviewed #1191, #1192, #1193, #1194, #1195, #1196, #1197, and #1554. All came back LOOKS CLEAN - MERGE. PR #1554 received manual hand-review (not auto-classified). He retracted his own earlier citation of #1416/#1423 as legal, corrected to #1413 (dexhunter). Called the training-slice TTT approach "structurally the cleanest." Tagged to OpenAI maintainers. No response yet - merge rate is ~3%.
Day 28 - April 15
"peft PR #3154 closed, broader fix invited"
@BenjaminBossan agreed the helpful error message is valuable, but the Gemma 4 specific fix became redundant after PR #3136 merged. Closed #3154 and invited a broader redesign: general "invalid target module" error handler for all PEFT methods (LoRA, IA3, LoKr, LoHa, OFT), starting with LoRA but designed to extend. Scope agreed, tests will be written first.
Day 30-32 - April 15-19
"The byte-LUT bug: the GDN sub-1.01 cluster was an accounting artifact"
Between April 15-19, @tejasnaladala, @dexhunter, and @bigbag forensically identified a byte-counting defect in build_sentencepiece_luts that originated in PR #1545 and propagated through every GDN-family inheritance. Leading-space bytes were being counted twice, inflating the BPB byte denominator and making reported scores ~17.46% lower than canonical. PRs #1545, #1576, #1632, #1672, #1681, #1687, #1705, #1711, #1734 all self-closed after canonical rescoring. The reported "sub-1.01 cluster" did not hold: yahya010's own canonical check on #1734 put the reported 1.01080 at ~1.187. #1698 at a reported 1.00995 remains open but flagged. Issue #1719 tracks the bug class.
"The real canonical frontier: casefold tokenizers + MP-SGD TTT at 1.057-1.072"
After filtering out byte-buggy inheritance and pending-legality SLOT submissions, the legitimate open frontier lands in the 1.057-1.072 band. Top clean legal open: dexhunter #1693 (1.05733, Casefold V4 + AttnOutGate + Multi-Phase Global SGD TTT), #1670 (1.05970, same family), codemath3000 #1585 (1.0639, Casefold + Parallel Residuals), romeerp #1756 (1.06505, CaseOps + Recurrence Depth Curriculum), mikeapedia #1578 (1.0668, Custom Casefold), yahya010 #1727 (1.07217, SP8192 MP-SGD 4-phase), jorge-asenjo #1700 (1.07219, same base). Custom casefold-style tokenization + Multi-Phase SGD TTT is doing more work than the rest of the stack.
Day 33 - April 20
"Archive synced: 1,683 PRs captured, full-fidelity"
Pulled PRs #1633 through #1758 (123 new, 2 skipped as not-PRs, 0 failures). 1,919 new files downloaded, every PR diff preserved 100%. Local archive now 2.3 GB, NAS mirror at 2.6 GB. With 10 days to deadline, the archive represents the full corpus of community research to date.
Day 34 - April 21
"DGX Spark dev environment stood up; FA3 on Blackwell is a dead end, FA2 is the workaround"
Brought up a local experiment environment on DGX Spark (GB10, SM12.1, aarch64, CUDA 13.2 forward-compat) using nvcr.io/nvidia/pytorch:26.03-py3. Built FA3 from source in the NGC container per the NVIDIA engineer workaround on the developer forums. The build succeeds after ~50 minutes and import flash_attn_interface works, but any flash_attn_func call fails at runtime with "no kernel image is available for execution on the device". The build log shows every kernel object compiled for sm80 or sm90 only. Reported as a data point back to the NVIDIA developer forum thread and to FA upstream issue #1969. Working solution: use the pre-installed flash_attn 2.7.4 in the same NGC container, which runs bit-exact to torch.nn.functional.scaled_dot_product_attention on GB10. The pattern is consistent with AI2 open-instruct already excluding flash-attn on aarch64.
"Surgical MP-SGD port from #1626 into #1667 base"
Ported dexhunter #1626's Multi-Phase Global SGD TTT functions (eval_val_ttt_phased, train_val_ttt_global_sgd_distributed, plus five helpers) into MarioPaerle #1667's base (vanilla SP8192 + SmearGate + AttnOutGate). Added a PHASED_TTT_ENABLED gate so one file runs single-phase or multi-phase TTT. Compute-capability detection forces FA2 on Blackwell automatically, and a torch.compile disable flag works around GB10's 101KB shared-memory limit on Triton kernels. Audited each of Issue #1017's four conditions against the ported code.
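A hedged sketch of that compute-capability gate (the function name and fallback order are illustrative; the FA3 source build here only contains sm80/sm90 kernels):

```python
import torch

def pick_attention_backend():
    """Select an attention backend by compute capability: FA3 only where its
    kernels exist (Ampere/Hopper), otherwise FA2, otherwise plain SDPA."""
    major, _ = torch.cuda.get_device_capability()
    if major in (8, 9):                  # sm80 / sm90: the FA3 build has kernels
        try:
            import flash_attn_interface  # noqa: F401  (FA3)
            return "fa3"
        except ImportError:
            pass
    try:
        import flash_attn                # noqa: F401  (FA2, pre-installed in the NGC image)
        return "fa2"
    except ImportError:
        return "sdpa"                    # torch.nn.functional.scaled_dot_product_attention
```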
"Overnight ablation pipeline running: 9 experiments queued on Spark"
Pipeline compares pr1667 baseline, no-TTT baseline, MP-SGD 1-phase, MP-SGD 3-phase, Tapered WD, MP-SGD + Tapered WD, and gate-ablations (SmearGate off, AttnOutGate off, both off). Each experiment timeboxed to 60 min at reduced GB10 scale. Scale gap vs 8xH100 production means absolute BPB values do not match the leaderboard, but relative comparisons are apples-to-apples across configurations. 30-min health-check cycle runs alert-on-failure only. Full results in the next update.