Experiment Queue
20 items

Top leaderboard entries all use MLP_MULT=3. Single most impactful env-var change based on leaderboard data (~0.008 bpb gain at full training).
MLP_MULT=3 is critical for top entries. Pair with slightly higher Muon LR (0.05) to compensate for larger MLP capacity. Different from prior mlp-mult-3x which used default LR.
Top leaderboard entries combine MLP_MULT=3 with 11 layers. This is the highest-impact combo we haven't tested together.
MLP_MULT=3 is critical for SOTA but prior runs at LR 0.04 and 0.05 failed/reverted. Try intermediate LR 0.035 to stabilize training with wider MLP.
10 layers instead of 9. Direct depth scaling, used by multiple top entries.
Combine MLP_MULT=3 with NUM_LAYERS=10 — mirrors top leaderboard configurations. Tests whether gains stack.
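The combo run above can be sketched as a set of env-var overrides. The variable names come from the queue notes; how the training script consumes them, and the idea of setting them via `os.environ` before launch, are assumptions for illustration:

```python
import os

# Hypothetical override sketch: env-var names are from the queue notes above;
# the launch mechanism is assumed, not taken from the actual harness.
overrides = {
    "MLP_MULT": "3",      # wider MLP, used by top leaderboard entries
    "NUM_LAYERS": "10",   # depth scaling, also seen in top entries
    "MATRIX_LR": "0.04",  # keep the default Muon matrix LR for this screen
}
os.environ.update(overrides)
```

Keeping MATRIX_LR at its default isolates the MLP_MULT x NUM_LAYERS interaction, so any gain can be attributed to the combo rather than an LR change.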
Weight decay WD=0.04 appears in multiple top leaderboard entries (1.1428, 1.1483 bpb). Expected ~0.003 bpb gain.
MLP_MULT=3 with default LR (previous mlp3x-matrix-lr-005 failed, possibly due to LR being too high). Keep default MATRIX_LR=0.04.
11 layers. May push artifact size limits without better quantization, but worth screening to measure the gain.
More depth with slightly reduced dim to stay within 16MB. 12 layers at MODEL_DIM=448 trades width for depth, which tends to help language modeling.
Wider model (576 vs 512) increases capacity without adding layers. Should fit in 16MB with int8.
Higher embedding LR (0.8 vs 0.6) may help the small vocab (1024) learn better representations faster within the 2h screening window.
Higher Muon momentum (0.97 vs 0.95) for smoother optimization. Previous 0.90 was tried; go the other direction.
12 layers at the default MODEL_DIM=512. The prior 12L run used dim 448, which may have hurt; 11L also reverted, but 12L at full dim is worth testing for depth scaling.
Higher RoPE base frequency (50000 vs 10000) smooths the positional encoding and may improve generalization at seq_len=1024.
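To see what raising the base does, the standard RoPE inverse frequencies base^(-2i/dim) can be compared directly; the per-head dim of 64 here is an illustrative assumption, not taken from the model config:

```python
def rope_inv_freq(base: float, dim: int) -> list[float]:
    """Inverse frequencies for rotary position embeddings: base**(-2i/dim)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

f10k = rope_inv_freq(10000.0, 64)
f50k = rope_inv_freq(50000.0, 64)
# A larger base shrinks the high-index frequencies (longer wavelengths),
# so positional phase varies more slowly across a 1024-token context.
print(f50k[-1] < f10k[-1])
```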
Higher Muon LR (0.05 vs 0.04). Previous combo with MLP3x failed but solo LR increase untested.
Lower embed LR (0.5 vs 0.6) for more stable embedding training. Prior 0.7/0.8 runs reverted; try conservative direction instead.
Longer warmdown (2400 vs 1200 iters) may improve final loss convergence in 2h window.
Slightly higher Muon LR (0.05) with lower momentum (0.92) for faster convergence in 2h screening window. Different from prior individual LR/momentum experiments.
Moderate warmdown increase (1800 vs 1200). Previous 2400 was reverted; try a smaller step.
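The warmdown variants above all share one shape: hold the LR constant, then decay linearly over the final N iterations. A minimal sketch, where the 6000-iteration total and the decay-to-zero floor are assumptions for illustration:

```python
def lr_at(step: int, total: int, warmdown: int, base_lr: float) -> float:
    """Constant LR, then linear decay to 0 over the last `warmdown` steps."""
    decay_start = total - warmdown
    if step < decay_start:
        return base_lr
    return base_lr * (total - step) / warmdown

# With a hypothetical 6000-iteration run, compare the queued warmdown lengths
# at the same late-training step:
for wd in (1200, 1800, 2400):
    print(wd, lr_at(5400, total=6000, warmdown=wd, base_lr=0.04))
```

A longer warmdown spends more of the fixed 2h budget at reduced LR, which can sharpen final convergence but also shortens the high-LR exploration phase; that trade-off is what the 1200/1800/2400 screens probe.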