Back to Blog
Makeshift MTP: Predicting the Future on a Budget
Multi-token prediction sounds fancy. Really, it is just the model trying to do its homework before the teacher assigns it. Sometimes it works. Sometimes it does not. But it always tries.
The idea is simple: instead of predicting one token at a time, predict several tokens ahead. During training, the model learns to predict the tokens at positions t+1, t+2, t+3, and so on from the same context. During inference, we can either use all of these predictions at once or keep only the one we trust most.
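To make the training setup concrete, here is a minimal sketch of how the targets shift. The helper `mtp_targets` is hypothetical (not from the post): for each position t in a token sequence, it collects the next few tokens as prediction targets, which is the data layout a multi-token prediction loss would be computed over.

```python
def mtp_targets(tokens, horizon=3):
    """Hypothetical helper: for each position t, gather the tokens at
    t+1 .. t+horizon as prediction targets.

    Positions near the end of the sequence get fewer targets, since
    the future tokens simply do not exist.
    Returns a list of (position, [target tokens]) pairs.
    """
    pairs = []
    for t in range(len(tokens) - 1):
        targets = tokens[t + 1 : t + 1 + horizon]
        pairs.append((t, targets))
    return pairs


# Toy sequence of token ids.
pairs = mtp_targets([5, 9, 2, 7, 1], horizon=3)
for t, targets in pairs:
    print(t, targets)
# Position 0 targets [9, 2, 7]; position 3, near the end, targets only [1].
```

A single-token-prediction model is just the `horizon=1` special case; the loss would then sum the per-horizon cross-entropies over each target list.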
We call it "makeshift" because it is not the elegant solution. The elegant solution would be a model that inherently understands sequence. But we are working with what we have, which is a transformer that mostly just wants to predict the next word and occasionally surprise us.