Experiment Log
25 experiments

#1  swa-every-100 · regularization · running
    Reduce the SWA checkpoint frequency from every 50 steps to every 100.
    More training between snapshots should give more diverse checkpoints
    and therefore a better weight average.
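    The mechanism being tuned is the snapshot schedule, not the averaging
    itself. A minimal sketch, assuming a plain running-average SWA
    implementation (class and method names here are illustrative, not the
    repo's actual API):

```python
import torch

class SWAAverager:
    """Running average of model snapshots (illustrative sketch only)."""

    def __init__(self, model, swa_every=100):  # this experiment: 50 -> 100
        self.swa_every = swa_every
        self.n = 0  # number of snapshots folded into the average
        self.avg = {k: torch.zeros_like(v, dtype=torch.float32)
                    for k, v in model.state_dict().items()}

    def maybe_snapshot(self, model, step):
        # Wider spacing -> snapshots are less correlated with each other.
        if step == 0 or step % self.swa_every != 0:
            return
        self.n += 1
        for k, v in model.state_dict().items():
            # incremental running mean: avg += (x - avg) / n
            self.avg[k] += (v.detach().float() - self.avg[k]) / self.n
```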
#2  muon-wd-005 · optimizer · val_bpb 1.40420 · reverted
    Increase the Muon optimizer's weight decay from 0.04 to 0.05 for stronger
    regularization of matrix parameters, which should reduce overfitting and
    improve val_bpb.
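    For orientation, with decoupled (AdamW-style) weight decay the coefficient
    acts as a per-step multiplicative shrink on the weights. Muon's matrix
    update itself is different (momentum plus Newton-Schulz
    orthogonalization), so this sketch only shows where wd enters, not Muon's
    code:

```python
import torch

def step_with_decoupled_wd(param: torch.Tensor, update: torch.Tensor,
                           lr: float, wd: float) -> None:
    """One optimizer step with decoupled weight decay (illustrative)."""
    param.mul_(1.0 - lr * wd)      # raising wd 0.04 -> 0.05 strengthens this shrink
    param.add_(update, alpha=-lr)  # then apply the optimizer's update direction
```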
#3  bigram-dim-256 · architecture · val_bpb 1.40200 · reverted
    Double the BigramHash embedding dimension from 128 to 256 for more
    expressive bigram-context injection, at a small parameter cost (~0.3 MB).
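    A hypothetical sketch of how a hashed bigram embedding injects pair
    context, to show what the extra width buys; the real BigramHash module's
    hash function and bucket count are assumptions here:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hash each (prev, cur) token pair into a bucketed embedding table."""

    def __init__(self, n_buckets: int, dim: int = 256):  # dim: 128 -> 256
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, dim)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:  # idx: (B, T)
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0  # no bigram context at the first position
        h = (prev * 1_000_003 + idx) % self.n_buckets  # cheap pair hash
        return self.emb(h)  # (B, T, dim), added alongside token embeddings
```

    Doubling dim only grows this one table, which is why the quoted cost
    stays small.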
#4  cosine-warmdown · failed
#5  muon-momentum-095 · optimizer · reverted
    Lower Muon momentum from 0.99 to 0.95 to reduce gradient over-smoothing,
    allowing faster adaptation within the 2-hour training window.
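    The intuition can be read off a heavy-ball buffer: the momentum
    coefficient sets an effective averaging horizon of roughly 1/(1 - beta)
    recent gradients. A minimal illustration (not Muon's actual
    implementation):

```python
import torch

def momentum_update(buf: torch.Tensor, grad: torch.Tensor,
                    beta: float = 0.95) -> torch.Tensor:  # 0.99 -> 0.95
    """Heavy-ball accumulation; horizon ~ 1/(1 - beta) steps."""
    # beta = 0.99 averages over ~100 recent gradients; 0.95 shortens the
    # horizon to ~20, so the update direction adapts faster to the current
    # loss landscape, which matters when the whole run lasts only 2 hours.
    buf.mul_(beta).add_(grad)
    return buf
```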
#6   ttt-batch-32 · failed
#7   ttt-lr-003-from-best · ttt · val_bpb 1.28850 · reverted
#8   11-layers-from-best · architecture · val_bpb 1.29510 · reverted
#9   head-lr-004 · optimizer · val_bpb 1.28920 · reverted
#10  ttt-chunk-64 · ttt · val_bpb 1.28850 · completed
#11  no-ttt-lora-sliding-window-only · architecture · val_bpb 1.29680 · reverted
#12  swa-start-30pct · optimizer · val_bpb 1.28880 · reverted
#13  ttt-lora-lr-003 · ttt · val_bpb 1.28880 · completed
#14  baseline-ttt-v2 · architecture · val_bpb 1.29610 · reverted
#15  ttt-lora-rank-16 · ttt · val_bpb 1.28890 · completed
#16  baseline-ttt · architecture · failed
#17  muon-momentum-090 · optimizer · failed
#18  matrix-lr-005 · optimizer · val_bpb 1.29580 · reverted
#19  model-dim-576 · architecture · failed
#20  11-layers · architecture · val_bpb 1.29890 · reverted
#21  matrix-lr-003 · optimizer · val_bpb 1.29720 · reverted
#22  warmdown-2400 · optimizer · val_bpb 1.29940 · reverted
#23  10-layers · architecture · val_bpb 1.29940 · reverted
#24  mlp-mult-3x · architecture · val_bpb 1.28920 · completed
#25  baseline-wandb · architecture · val_bpb 1.29610 · completed