⛳ golf

Experiment Queue

20 items
#1
mlp-mult-3x
architecture · ~1.29000 · p=9

Top leaderboard entries all use MLP_MULT=3; it is the single most impactful env-var change based on leaderboard data (~0.008 bpb gain at full training).
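A minimal sketch of how an env-var experiment like this might be launched, assuming the training harness reads knobs such as MLP_MULT from the environment; the script name `train.py` and the override mechanism are assumptions, not taken from the actual harness:

```python
import os

# Hypothetical override set for this experiment: only MLP_MULT differs
# from whatever defaults the (unseen) training harness uses.
overrides = {"MLP_MULT": "3"}

env = os.environ.copy()
env.update(overrides)  # experiment knobs take precedence over inherited values

# subprocess.run(["python", "train.py"], env=env)  # launch left commented: "train.py" is a placeholder
print(env["MLP_MULT"])
```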

#2
mlp3x-matrix-lr-005
architecture · ~1.29000 · p=9

MLP_MULT=3 is critical for top entries. Pair it with a slightly higher Muon LR (0.05) to compensate for the larger MLP capacity. Differs from the prior mlp-mult-3x run, which used the default LR.

#3
mlp3x-11layers
architecture · ~1.27000 · p=9

Top leaderboard entries combine MLP_MULT=3 with 11 layers. This is the highest-impact combo we haven't tested together.

#4
mlp3x-matrix-lr-0035
architecture · ~1.29000 · p=9

MLP_MULT=3 is critical for SOTA, but prior runs at LR 0.04 and 0.05 failed or reverted. Try a lower LR (0.035) to stabilize training with the wider MLP.

#5
10-layers
architecture · ~1.31000 · p=8.5

10 layers instead of 9. Direct depth scaling, used by multiple top entries.

#6
mlp3x-10layers
architecture · ~1.27000 · p=8.5

Combine MLP_MULT=3 with NUM_LAYERS=10 — mirrors top leaderboard configurations. Tests whether gains stack.

#7
weight-decay-004
regularization · ~1.30000 · p=8.5

Weight decay WD=0.04 appears in multiple top leaderboard entries (1.1428, 1.1483 bpb). Expected ~0.003 bpb gain.

#8
mlp3x-matrix-lr-004
architecture · ~1.29000 · p=8.5

MLP_MULT=3 with the default LR (the previous mlp3x-matrix-lr-005 run failed, possibly because the LR was too high). Keep the default MATRIX_LR=0.04.

#9
11-layers
architecture · ~1.30000 · p=8

11 layers. May push artifact size limits without better quantization, but worth screening to measure the gain.

#10
12-layers-dim448
architecture · ~1.29000 · p=8

More depth with slightly reduced dim to stay within 16MB. 12 layers at MODEL_DIM=448 trades width for depth, which tends to help language modeling.
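The width-for-depth trade can be sanity-checked with back-of-the-envelope block-parameter arithmetic. This is a textbook decoder-block count (norms, biases, and embeddings ignored), and the default MLP_MULT=4 and baseline dim=512/9 layers are assumptions, so absolute numbers will not match the real artifact; the relative comparison is the point:

```python
def block_params(dim: int, layers: int, mlp_mult: int = 4) -> int:
    """Non-embedding parameters of a plain decoder stack:
    4*dim^2 for the Q/K/V/O attention projections plus
    2*mlp_mult*dim^2 for the MLP up/down projections, per layer."""
    return layers * (4 * dim * dim + 2 * mlp_mult * dim * dim)

base = block_params(dim=512, layers=9)    # assumed current config
deep = block_params(dim=448, layers=12)   # this experiment
print(deep / base)  # ≈ 1.02: three extra layers at dim 448 cost roughly the same budget
```

Under these assumptions, dropping MODEL_DIM to 448 almost exactly pays for the three extra layers, which is why the config can stay near the 16MB limit.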

#11
model-dim-576
architecture · ~1.30000 · p=7.5

Wider model (576 vs 512) increases capacity without adding layers. Should fit in 16MB with int8.

#12
embed-lr-08
optimizer · ~1.31000 · p=7

A higher embedding LR (0.8 vs 0.6) may help the small vocab (1024) learn better representations faster within the 2h screening window.

#13
muon-momentum-097
optimizer · ~1.31000 · p=7

Higher Muon momentum (0.97 vs 0.95) for smoother optimization. 0.90 was tried previously; this probes the opposite direction.

#14
12-layers-default-dim
architecture · ~1.30000 · p=7

12 layers at the default MODEL_DIM=512. The prior 12L run used dim=448, which may have hurt; 11L also reverted, but 12L at full width is worth testing for depth scaling.

#15
rope-base-50000
architecture · ~1.31000 · p=6.5

A higher RoPE base (50000 vs 10000) lowers the rotation frequencies, so nearby positions rotate less per step; this may improve generalization at seq_len=1024.
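For intuition, the standard RoPE formulation assigns each dimension pair an inverse frequency base**(-2i/d); raising the base slows every frequency except the first. A small sketch (the head dim of 64 is an assumption, not taken from the harness):

```python
def rope_inv_freq(base: float, head_dim: int = 64):
    """Standard RoPE inverse frequencies, base**(-2i/head_dim) per dim pair."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

fast = rope_inv_freq(10000.0)  # current base
slow = rope_inv_freq(50000.0)  # this experiment
# The lowest pair is unchanged (ratio exactly 1.0); the highest pair
# slows by (50000/10000)**(62/64), i.e. almost the full 5x base ratio.
print(fast[0] / slow[0], fast[-1] / slow[-1])
```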

#16
matrix-lr-005
optimizer · ~1.31000 · p=6.5

Higher Muon LR (0.05 vs 0.04). The previous combo with MLP3x failed, but a solo LR increase is untested.

#17
embed-lr-05
optimizer · ~1.31000 · p=6.5

Lower embed LR (0.5 vs 0.6) for more stable embedding training. Prior 0.7 and 0.8 runs reverted; try the conservative direction instead.

#18
warmdown-2400
optimizer · ~1.31000 · p=6

Longer warmdown (2400 vs 1200 iters) may improve final loss convergence in 2h window.

#19
matrix-lr-005-momentum-092
optimizer · ~1.31000 · p=6

Slightly higher Muon LR (0.05) with lower momentum (0.92) for faster convergence within the 2h screening window. Distinct from the prior individual LR and momentum experiments.

#20
warmdown-1800
optimizer · ~1.31000 · p=6

Moderate warmdown increase (1800 vs 1200 iters). The previous 2400 run was reverted; try a smaller step.
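The warmdown variants above can be compared with a trapezoidal schedule sketch: the LR multiplier stays flat at 1.0, then decays linearly to zero over the final `warmdown` iterations. The total iteration budget (6000) and the absence of a warmup phase are assumptions for illustration, not taken from the harness:

```python
def lr_scale(step: int, total: int, warmdown: int) -> float:
    """Trapezoidal schedule: 1.0 until the last `warmdown` steps, then linear decay to 0."""
    if step < total - warmdown:
        return 1.0
    return (total - step) / warmdown

total = 6000  # hypothetical iteration budget for the 2h window
for wd in (1200, 1800, 2400):
    # longer warmdown => lower LR at the same late step (step 5000 here)
    print(wd, round(lr_scale(5000, total, wd), 3))
```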