Model Training

Training
Parameters: ~1M
Context: 2K tokens
Layers: 6
Heads: 4
Model Dimension: 160
FFN Dimension: 256
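For readers who want to poke at the numbers, the hyperparameters above can be collected into a small config object. This is only an illustrative sketch; the field names are placeholders I chose, and only the values come from the card above.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values from the training card above; field names are illustrative.
    context_len: int = 2048   # "2K" context
    n_layers: int = 6
    n_heads: int = 4
    d_model: int = 160
    d_ffn: int = 256

    @property
    def head_dim(self) -> int:
        # 160 / 4 = 40 dimensions per attention head
        return self.d_model // self.n_heads

cfg = ModelConfig()
assert cfg.d_model % cfg.n_heads == 0
print(cfg.head_dim)  # 40
```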

Architecture Features

Recurrent Memory (Chunk-GRU): Enabled
Precision Codebook (Output Bias): Enabled (2111 params)
Makeshift MTP: Enabled (horizons 2, 3, 4; weight 0.3; see the loss sketch after this list)
Gradient Checkpointing: Disabled
Torch Compile: Disabled
Chunked Attention: Enabled
Flash Attention: Enabled
Repetition Penalty: Disabled (1.0)
Tied Embeddings: Enabled
Output Logit Bias: Enabled
Word Token Loss Boost: Enabled (3x)
Response-Start Boost: Enabled (3x, first 20 tokens)
Entropy Regularization: Disabled
QK-Norm (RMSNorm): Enabled
SwiGLU FFN: Enabled
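To make the MTP and loss-boost settings concrete, here is a rough sketch of how they could combine into a single training loss. It is an assumption-heavy illustration, not the project's actual code: the tensor shapes, the `word_token_mask`, the `response_start` index, and the way the per-token weights multiply are all guesses; only the horizons (2, 3, 4), the 0.3 MTP weight, the 3x boosts, and the 20-token window come from the list above.

```python
import torch
import torch.nn.functional as F

MTP_HORIZONS = (2, 3, 4)   # "Makeshift MTP" horizons from the list above
MTP_WEIGHT = 0.3           # auxiliary loss weight
WORD_BOOST = 3.0           # "Word Token Loss Boost (3x)"
START_BOOST = 3.0          # "Response-Start Boost (3x, first 20 tokens)"
START_WINDOW = 20

def training_loss(logits, mtp_logits, targets, word_token_mask, response_start):
    """Hypothetical composite loss.

    logits:          (B, T, V) next-token logits
    mtp_logits:      dict horizon -> (B, T, V) logits predicting the target
                     that many extra steps ahead
    targets:         (B, T) next-token target ids
    word_token_mask: (B, T) bool, True where the target starts a word
    response_start:  (B,) position where the assistant response begins
    """
    B, T, V = logits.shape

    # Per-token cross-entropy for the primary next-token objective.
    ce = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1),
                         reduction="none").reshape(B, T)

    # Boost word-start tokens and the first tokens of the response.
    weights = torch.ones_like(ce)
    weights = torch.where(word_token_mask, weights * WORD_BOOST, weights)
    pos = torch.arange(T, device=ce.device).unsqueeze(0)   # (1, T)
    start = response_start.unsqueeze(1)                     # (B, 1)
    in_window = (pos >= start) & (pos < start + START_WINDOW)
    weights = torch.where(in_window, weights * START_BOOST, weights)
    loss = (ce * weights).mean()

    # Auxiliary multi-token-prediction losses at horizons 2, 3, 4.
    for h in MTP_HORIZONS:
        aux = F.cross_entropy(mtp_logits[h][:, :-h].reshape(-1, V),
                              targets[:, h:].reshape(-1))
        loss = loss + MTP_WEIGHT * aux
    return loss
```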

Recurrent Memory (Chunk-GRU)

Chunk Size: 8
Memory Dim: 32
Cell Type: GRU
Layers: 4
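Below is a minimal sketch of what a chunk-wise GRU memory with these settings might look like, assuming each 8-token chunk is mean-pooled into a 32-dim summary, a 4-layer GRU carries the running memory, and the memory is added back into the next chunk. The mechanism is a guess on my part; only the chunk size, memory dim, cell type, and layer count come from the card above.

```python
import torch
import torch.nn as nn

class ChunkGRUMemory(nn.Module):
    """Illustrative chunk-wise recurrent memory (not the project's exact code)."""

    def __init__(self, d_model=160, mem_dim=32, chunk_size=8, gru_layers=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.to_mem = nn.Linear(d_model, mem_dim)    # chunk summary -> memory space
        self.gru = nn.GRU(mem_dim, mem_dim, num_layers=gru_layers, batch_first=True)
        self.from_mem = nn.Linear(mem_dim, d_model)  # memory -> residual added to tokens

    def forward(self, x):
        # x: (B, T, d_model); T is assumed to be a multiple of chunk_size
        B, T, D = x.shape
        chunks = x.view(B, T // self.chunk_size, self.chunk_size, D)
        mem = torch.zeros(B, 1, D, device=x.device)  # memory seen by the first chunk
        h = None                                     # stacked-GRU hidden state
        outs = []
        for i in range(chunks.size(1)):
            outs.append(chunks[:, i] + mem)          # condition chunk i on the running memory
            summary = self.to_mem(chunks[:, i].mean(dim=1, keepdim=True))  # (B, 1, mem_dim)
            mem_state, h = self.gru(summary, h)
            mem = self.from_mem(mem_state)           # (B, 1, d_model), broadcast over next chunk
        return torch.cat(outs, dim=1)                # (B, T, d_model)

# Example: a batch of 2 sequences, 32 tokens of 160-dim hidden states (4 chunks of 8).
memory = ChunkGRUMemory()
print(memory(torch.randn(2, 32, 160)).shape)  # torch.Size([2, 32, 160])
```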

Training Datasets

HuggingFaceFW/fineweb_100BT (Pretraining)
mattwesney/General_Inquiry_Thinking-Chain-Of-Thought (Instruction Tuning)
tatsu-lab/alpaca (Instruction Tuning)
databricks/databricks-dolly-15k (Instruction Tuning)
TeichAI/Step-3.5-Flash-2600x (Generalization)
TeichAI/convo-v1 (Generalization, 2x)
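If you want to pull these corpora yourself, a sketch using the Hugging Face `datasets` library is below. Treat it as a starting point under stated assumptions: the `train` split names, streaming availability, and the simple repeat-by-weight handling of the 2x convo-v1 pass are guesses, and some repos (the fineweb 100BT sample in particular) may need an explicit config name. A real run would interleave and shuffle the sources rather than concatenate them.

```python
from datasets import load_dataset

# (repo id, role, weight) as listed above; convo-v1 is weighted 2x.
CORPORA = [
    ("HuggingFaceFW/fineweb_100BT", "pretraining", 1),
    ("mattwesney/General_Inquiry_Thinking-Chain-Of-Thought", "instruction tuning", 1),
    ("tatsu-lab/alpaca", "instruction tuning", 1),
    ("databricks/databricks-dolly-15k", "instruction tuning", 1),
    ("TeichAI/Step-3.5-Flash-2600x", "generalization", 1),
    ("TeichAI/convo-v1", "generalization", 2),
]

def iter_training_examples():
    """Yield (role, example) pairs, repeating each corpus by its weight."""
    for repo_id, role, weight in CORPORA:
        ds = load_dataset(repo_id, split="train", streaming=True)  # split name is an assumption
        for _ in range(weight):
            for example in ds:
                yield role, example
```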

Training Log

2026-03-01 [Running] Training on RTX 5090 with torch.compile disabled and chunked attention active
2026-02-28 [Done] Tokenizer vocabulary scan completed; 256 new characters added
2026-02-27 [Done] Model initialized with ~1M parameters
2026-02-26 [Done] Checkpoint conversion pipeline verified

Want to follow along with the training adventures?

Read the Blog