⛳ golf

Experiment Log

25 experiments
#1 swa-every-100 (regularization): running

Reduce SWA checkpoint frequency from every 50 steps to every 100, giving more diverse checkpoints with more unique training between them, which should yield a better weight average
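The change above only touches how often checkpoints enter the running average. A minimal sketch of interval-gated SWA with a running mean (class and method names are illustrative, not from this run's codebase; weights are modeled as plain float lists):

```python
class SwaAverager:
    """Stochastic weight averaging with a configurable checkpoint interval."""

    def __init__(self, swa_every: int = 100):
        self.swa_every = swa_every  # this experiment raises 50 -> 100
        self.avg = None             # running mean of checkpointed weights
        self.n = 0                  # number of checkpoints averaged so far

    def maybe_update(self, step: int, weights: list[float]) -> None:
        # Only fold in a checkpoint every `swa_every` steps; a larger
        # interval means fewer, more decorrelated checkpoints.
        if step % self.swa_every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental running mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```

With `swa_every=100`, a 200-step run contributes exactly two checkpoints (steps 100 and 200) to the average.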

#2 muon-wd-005 (optimizer): val_bpb 1.40420, reverted

Increase Muon optimizer weight decay from 0.04 to 0.05 for stronger regularization of matrix parameters, which should reduce overfitting and improve val_bpb
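Assuming the weight decay here is applied in decoupled form (AdamW-style, shrinking weights independently of the gradient step, as is common in Muon variants), the per-step effect of 0.04 vs 0.05 is just a slightly stronger multiplicative shrink of each matrix parameter. An illustrative sketch, not this run's actual Muon implementation:

```python
def apply_decoupled_weight_decay(weights: list[float], lr: float, wd: float) -> list[float]:
    # Decoupled weight decay: multiply each matrix parameter by (1 - lr*wd),
    # independently of the gradient-based update that follows.
    # wd=0.04 -> shrink factor (1 - 0.04*lr); this experiment tries wd=0.05.
    return [w * (1.0 - lr * wd) for w in weights]
```

At a matrix learning rate of 0.02, moving wd from 0.04 to 0.05 changes the per-step shrink factor from 0.9992 to 0.9990; the regularization difference only matters compounded over thousands of steps.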

#3 bigram-dim-256 (architecture): val_bpb 1.40200, reverted

Double BigramHash embedding dim from 128 to 256 for more expressive bigram context injection, at a small parameter cost (~0.3MB)
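A hashed bigram embedding maps each (previous token, current token) pair to a row of a fixed-size table, so doubling the row width from 128 to 256 doubles the table's parameter count. A sketch under assumed details (the table size, hash multiplier, and bf16 storage are illustrative, not from the log):

```python
TABLE_ROWS = 1024   # illustrative hash-table size, not from the log
BIGRAM_DIM = 256    # the embedding width this experiment doubles from 128

def bigram_slot(prev_tok: int, tok: int, rows: int = TABLE_ROWS) -> int:
    # Hash the (prev, current) token pair into one of `rows` embedding slots.
    # The multiplier is an arbitrary large prime; collisions are expected and
    # tolerated, as in standard feature hashing.
    return (prev_tok * 1_000_003 + tok) % rows

def bigram_params_mb(rows: int, dim: int, bytes_per_param: int = 2) -> float:
    # Parameter cost of the table in MB (bytes_per_param=2 assumes bf16).
    return rows * dim * bytes_per_param / 2**20
```

The incremental cost quoted in the log (~0.3MB) corresponds to the extra 128 dims across the actual table size used in the run.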

#4 cosine-warmdown: failed
#5 muon-momentum-095 (optimizer): reverted

Lower Muon momentum from 0.99 to 0.95 to reduce gradient over-smoothing, allowing faster adaptation within the 2-hour training window
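Momentum acts as an exponential moving average over past gradients, with an effective window of roughly 1/(1-beta) steps: ~100 steps at 0.99 versus ~20 at 0.95, so the lower value tracks a changing gradient direction faster. A sketch of that adaptation-speed difference, assuming plain EMA momentum rather than Muon's full orthogonalized update:

```python
def ema_steps_to_flip(beta: float) -> int:
    # Start with a momentum buffer saturated at +1 (a long run of gradient +1),
    # then feed a constant gradient of -1 and count how many steps the EMA
    # m <- beta*m + (1-beta)*g needs before its sign flips.
    m = 1.0
    steps = 0
    while m > 0.0:
        m = beta * m + (1.0 - beta) * (-1.0)
        steps += 1
    return steps
```

Analytically m_t = 2*beta^t - 1, so the flip happens once beta^t < 0.5: about 14 steps at beta=0.95 versus about 69 at beta=0.99.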

#6 ttt-batch-32: failed
#7 ttt-lr-003-from-best (ttt): val_bpb 1.28850, reverted
#8 11-layers-from-best (architecture): val_bpb 1.29510, reverted
#9 head-lr-004 (optimizer): val_bpb 1.28920, reverted
#10 ttt-chunk-64 (ttt): val_bpb 1.28850, completed
#11 no-ttt-lora-sliding-window-only (architecture): val_bpb 1.29680, reverted
#12 swa-start-30pct (optimizer): val_bpb 1.28880, reverted
#13 ttt-lora-lr-003 (ttt): val_bpb 1.28880, completed
#14 baseline-ttt-v2 (architecture): val_bpb 1.29610, reverted
#15 ttt-lora-rank-16 (ttt): val_bpb 1.28890, completed
#16 baseline-ttt (architecture): failed
#17 muon-momentum-090 (optimizer): failed
#18 matrix-lr-005 (optimizer): val_bpb 1.29580, reverted
#19 model-dim-576 (architecture): failed
#20 11-layers (architecture): val_bpb 1.29890, reverted
#21 matrix-lr-003 (optimizer): val_bpb 1.29720, reverted
#22 warmdown-2400 (optimizer): val_bpb 1.29940, reverted
#23 10-layers (architecture): val_bpb 1.29940, reverted
#24 mlp-mult-3x (architecture): val_bpb 1.28920, completed
#25 baseline-wandb (architecture): val_bpb 1.29610, completed