The Wasted Precision of the Output Layer
We spend a lot of time optimizing attention mechanisms. We prune weights in the middle layers. We quantize activations to save memory during inference. Yet a massive inefficiency sits at the very end of the network, and we almost completely ignore it.
The output layer. That big matrix multiplication that turns hidden states into vocabulary logits. For hidden size d and vocabulary size V, that is a d × V weight matrix applied at every single decoding step. It is huge. It is expensive. And honestly, it is kind of dumb.
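To make the cost concrete, here is a minimal PyTorch sketch of a standard output head. The sizes (`d_model = 4096`, `vocab_size = 50_000`) are my own illustrative assumptions, not numbers from this post:

```python
import torch
import torch.nn as nn

# A standard output head: one dense projection from hidden states to the vocabulary.
# Sizes are illustrative assumptions, not taken from the post.
d_model = 4096        # hidden size (assumed)
vocab_size = 50_000   # vocabulary size (assumed)

lm_head = nn.Linear(d_model, vocab_size, bias=False)

hidden = torch.randn(1, d_model)        # final hidden state for one position
logits = lm_head(hidden)                # shape (1, vocab_size)
probs = torch.softmax(logits, dim=-1)   # vocabulary probabilities

# Every decoding step pays for the full d_model x vocab_size matmul.
print(f"output-layer parameters: {d_model * vocab_size:,}")  # 204,800,000
```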
The Codebook Alternative
What if, instead of predicting directly into a 500-dimensional vocabulary space, we first project down to a small codebook and then look up the actual tokens? It is like putting a translator between the model's thoughts and the final output.
We call it a precision codebook: 16 dimensions of learned codes that get mapped back up to the full vocabulary. Is it more accurate? Sometimes. Is it faster? Usually. Is it cooler? Definitely.
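Here is a minimal sketch of what such a factorized head could look like, again in PyTorch. The class name `CodebookHead`, the two-linear factorization, and every size except the 16-dimensional code are assumptions for illustration, not the actual implementation:

```python
import torch
import torch.nn as nn

class CodebookHead(nn.Module):
    """Sketch of a factorized output head: hidden state -> 16-dim code -> vocabulary.

    The architecture here is an assumption about the idea described above,
    not the post's implementation.
    """

    def __init__(self, d_model: int, code_dim: int, vocab_size: int):
        super().__init__()
        self.to_code = nn.Linear(d_model, code_dim, bias=False)           # project down to codes
        self.code_to_vocab = nn.Linear(code_dim, vocab_size, bias=False)  # map codes to tokens

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        code = self.to_code(hidden)        # (..., code_dim): the "translated" representation
        return self.code_to_vocab(code)    # (..., vocab_size): logits over the vocabulary

# Same illustrative sizes as the dense head above; code_dim=16 comes from the post.
head = CodebookHead(d_model=4096, code_dim=16, vocab_size=50_000)
logits = head(torch.randn(1, 4096))

# Parameters: d_model*code_dim + code_dim*vocab_size
#   = 4096*16 + 16*50_000 = 865,536   (vs 204,800,000 for the dense head)
```

Under these assumed sizes, the head shrinks from roughly 205M parameters to under 1M. The flip side is that every prediction gets squeezed through a 16-dimensional bottleneck, which is presumably why the accuracy win is only "sometimes."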