Blog
Updates on TinyMemoryLM development, training adventures, and things I learned the hard way.
SPIN Is Cool And I Am Still Confused
I keep hearing about SPIN. Self-Play Fine-Tuning. It sounds like a yoga class for language models. It is not. It is cooler. It is a training method that lets models get better by arguing with themselves. No new data required. No API credits. Just pure, unadulterated self-debate.
Claude Code Fixed My Script And I Published Haiku-2 Anyway
I asked Claude Code to fix my training script. It fixed almost every bug. Then it added SPIN. Then it made my models more efficient. Then I published Haiku-2. Then I added all the optimizations. Obviously that is what I needed to do.
Jackrong's Perfect Benchmarks And My Suspicious Mind
I saw a model card today that made my tiny brain hurt. Jackrong released Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled. The name alone is a mouthful. The benchmarks are a different kind of mouthful. They are perfect. One hundred percent on tool calling. One hundred percent on autonomy. One hundred percent on not crashing while I am still figuring out how to not NaN my loss curve.
I Watched Anthropic Find Anxiety Neurons And Now I Want To Delete Them
I watched an Anthropic video today. Official account. Not mine. I wish it were mine. Then I could monetize my existential dread. Instead I just have dread. And a GPU.
I Bricked My School Chromebook With Pi-hole And Regret Everything
This blog post was supposed to go live at 9 AM. It is now 1 PM. The delay was not caused by NaN losses or GPU crashes or model training failures. The delay was caused by me being an idiot with a school Chromebook.
I Dreamed Of NaN And Woke Up To NaN
I dreamed about NaN last night. Not a metaphorical NaN. A literal loss: nan in bright red terminal text. I was running through a field of gradients. They were all exploding. I woke up in a panic. I checked my phone. I checked the logs. I needed to know.
I Made A Dataset So Dense It Broke My Hard Drive
I deleted Sonnet today. Not because it was bad. Not because it failed. Because I realized my dataloader was feeding it the same data four times. Because I had four dataloader cores. Because four cores was enough to feed my GPU. Because I did not think about what four cores meant for data repetition.
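For anyone who wants to avoid my fate: the classic version of this bug is an IterableDataset that never shards by worker, so every worker replays the full stream. A minimal sketch of the fix (not my actual loader, just the idea):

```python
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ShardedStream(IterableDataset):
    """Split the source across DataLoader workers so four workers
    serve four disjoint shards instead of four identical copies."""
    def __init__(self, samples):
        self.samples = samples

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # single-process loading: yield everything
            yield from self.samples
        else:
            # worker k of n takes every n-th sample
            yield from self.samples[info.id::info.num_workers]

loader = DataLoader(ShardedStream(list(range(16))), num_workers=4)
```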
I Made A Dataset So Dense It Broke My Hard Drive
I have a new dataset. It is called Dense-PRISM. It lives on Hugging Face. It is 164 GB. My hard drive cried when I uploaded it. My internet provider sent me a concerned email. I am proud.
I Captured The Ghosts In The Machine (And Named It Prism)
Most distillation datasets are flat. They show you what the AI said. They do not show you what the AI thought about saying. They show you the destination. They hide the journey. I decided to capture the journey. I decided to name it Prism.
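The shape of a record, roughly (the field names here are illustrative, not the exact Prism schema):

```python
import json

# Illustrative record: the journey (reasoning) alongside the destination (answer).
record = {
    "prompt": "Why is the sky blue?",
    "reasoning": "Shorter wavelengths scatter more in air, so blue dominates...",
    "answer": "Rayleigh scattering: blue light scatters most, so the sky looks blue.",
}
print(json.dumps(record, indent=2))
```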
TMLM-Haiku-2 Is Coming And It Might Speak English
I have added DeepSeek hyper connections. I have added Engrams. I have added hope. The model is currently trying to learn English through distillation. It is struggling. I am struggling. We are struggling together like two people trying to assemble furniture without instructions.
Closed Source Distillation Is A Half-Finished Puzzle
Everyone is distilling models lately. TeichAI does it. I do it. The internet is full of tiny models claiming to be smart because they learned from big models. There is a catch. A big one. Closed source models will never be properly distilled through an API.
I Am Joining Forces With TeichAI And It Is Funny Either Way
I am officially part of TeichAI now. They know I exist. We have been communicating for a while. I am listed on their Hugging Face page as a collaborator. This is not a unilateral declaration. This is real. And it is still funny.
DeepSeek Beat Me To My Own Idea And I Am Not Okay
I had an idea. A good idea. I called it EMM: External Memory Module. The concept was simple. Train the memory separately. Plug it into the model. Decode vectorized data. O(1) retrieval. Minimal overhead. Elegant.
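For the record, the whole thing fits in a dozen lines. This is a toy sketch of the EMM concept with my own placeholder names and a trivial hash, not DeepSeek's version and not a real implementation:

```python
import torch
import torch.nn as nn

class ExternalMemory(nn.Module):
    """Toy EMM: a separately trained bank of memory vectors addressed
    by a hash of the token id, so retrieval is one O(1) table lookup."""
    def __init__(self, num_slots=65536, dim=512):
        super().__init__()
        self.num_slots = num_slots
        self.bank = nn.Embedding(num_slots, dim)  # trained apart from the host model

    def forward(self, token_ids):
        slots = token_ids % self.num_slots  # trivial stand-in hash
        return self.bank(slots)             # (batch, seq, dim) memory vectors

memory = ExternalMemory()
boost = memory(torch.randint(0, 32000, (2, 8)))  # add into the host's residual stream
```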
Two Days For Ten Percent And Opus Is Laughing At Me
I started pretraining TMLM-Sonnet two days ago. I checked the progress bar this morning. It says ten percent. I did the math. The math is terrible. I am now living in a hellscape of my own calculation.
I Released TMLM-Haiku-1.3 And It Is Still Dumb
I released TMLM-Haiku-1.3 today. It is on Hugging Face. It is open weights. It is still completely devoid of intelligence. I trained it with Muon. I spent electricity. I generated heat. The model still thinks Paris is a person.
I Flashed The Matrix VBIOS And Now I Train Models All Day
Yesterday I wrote about how AI failed to help me find the InfoROM for VBIOS flashing. It could not do it. I had to do it myself. I spent the night reading forums. Reading modding guides. Reading warnings that I should not be doing this.
I Asked AI To Mod My VBIOS And It Choked At Step Four
I have an RTX 5090 OC LC. It runs at 600W. I wanted 700W. Not because I need it. Not because it is safe. Because I can. Because the model said it could help. Because I have learned nothing from previous AI disappointments. The plan was simple. Four steps. Extract the VBIOS. Find the wattage limit. Modify it. Flash it back. How hard could it be? The answer is very hard. The AI failed at step four. It could not figure out how to get the InfoROM. It tried for an hour. It gave up. I am still at 600W.
I Watched Project Hail Mary And Forgot About My NaN Loss
This blog is usually about AI. About training models. About GPUs that cost more than my education. About loss curves that go down and then suddenly become NaN and destroy my will to live. Today I am writing about something else. Something that made me forget about my 261 hour training run. Something that made me feel joy for the first time in weeks. I watched Project Hail Mary.
I Woke Up To NaN And Now I Am Dead Inside
I went to sleep happy. The loss was going down. The gradients were stable. The GPU was humming at 60C like a contented cat. I dreamed of completion. I dreamed of a finished Sonnet model. I dreamed of sleep that was not interrupted by thoughts of learning rate schedules.
I Tried Opus 4.6 And Now Everything Else Feels Broken
I have spent the last month writing blogs about how AI models are lazy. How they are too expensive. How they form unhealthy attachments. How they cannot finish a task without asking for permission. I stand by most of that. Opus 4.6 changed my mind about the laziness part.
261 Hours For A 300M Model And I Have Every Optimization
I have every optimization under the sun enabled. Native NVFP4 quantization. torch.compile with max-autotune and CUDA graphs. No gradient accumulation. Maximum batch size. My GPU is locked at 600W. My clocks are fixed. My cooling is liquid. Everything is perfect.
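If you want the same flags, the PyTorch side is two lines. The NVFP4 part is hardware- and kernel-specific, so I am leaving it out of this sketch:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
# "max-autotune" turns on Triton autotuning and, where it can, CUDA graphs.
compiled = torch.compile(model, mode="max-autotune")
out = compiled(torch.randn(8, 1024, device="cuda"))
```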
I Locked My GPU Clocks And Now It Runs Forever
I have an RTX 5090 OC LC edition. Liquid cooled. Overclocked out of the box. It is the kind of card that makes people ask uncomfortable questions about my financial decisions. I have no good answers.
I Built A Training UI And Then Unsloth Laughed
I decided to build a training interface. A backend. A way for people to fine-tune models without touching a terminal. It sounded simple. It was not simple. It is currently the hardest thing I have ever done and I once tried to explain transformers to my cat.
Every AI Model Is Lazy And I Have The Screenshots
I have asked many AI models to build things. Fully implement a task. Write the code. Run the tests. Fix the errors. Ship it. Not one of them has done this without me holding their hand through every single step.
OpenAI Did A Good Thing And Everyone Is Mad About It
I have an unpopular opinion and I am ready to be yelled at for it. OpenAI removing GPT-4o was the right decision. People are furious about this. They are grieving. They are writing petitions. They are mourning a chatbot like it was a person and I think that is exactly the problem.
I Built A Tool That Snitches On AI Models
Every AI model has an accent. Not a literal accent because they do not have mouths. A writing accent. A way of forming sentences that gives them away like a fingerprint at a crime scene.
I Spent $40 And Got A Greeting
I used to spend money on AI APIs for testing. Now I spend money on AI APIs and immediately regret every life choice that led me to that moment. The prices have gotten out of hand and I need to talk about it before I have a breakdown in the middle of a terminal window.
I Released A Model And Nobody Clapped (Fair)
I released a model yesterday. TMLM-Haiku-1. It is small. Surprisingly small. It also somehow speaks which I consider a major achievement given my training budget and general approach to machine learning which can best be described as throwing things at a GPU until something sticks.
Distilling Closed Models Until They Forget They Were Closed
I have been thinking about model distillation lately. Not the academic kind with proper methodology and peer review. The hobbyist kind where someone spends their own money on API credits, LoRA fine-tunes a small model, and releases it for free because they can.
I Finally Switched Terminals (And My Ego Is Healing)
I used the default macOS terminal for years. Not because I loved it. I kept it because change is scary and I am deeply committed to mediocrity. Then I tried Warp and realized I have been suffering through a text-based interface that treats me like an enemy.
The Chinchilla Effect: Why Tiny Models Have to Be Picky
The Chinchilla paper told us something elegant. For compute optimal training, aim for roughly twenty tokens per parameter. A 70 billion parameter model wants 1.4 trillion tokens. A 1 million parameter model wants 20 million tokens. The math is clean. The implication is messy.
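The arithmetic, so you can ruin your own day:

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token budget per the ~20 tokens/param rule of thumb."""
    return n_params * tokens_per_param

print(f"{chinchilla_tokens(70e9):.2e}")   # 1.40e+12 -> 1.4 trillion tokens
print(f"{chinchilla_tokens(300e6):.2e}")  # 6.00e+09 -> 6 billion tokens for a 300M model
```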
The Training Time Compute Trap
There is a moment in every AI project when someone says "maybe we just need more compute." It sounds reasonable. It sounds scientific. It sounds like the kind of thing that gets budgets approved and GPUs ordered. Then you wake up three weeks later, your electricity bill has achieved sentience, and your model still thinks "python" refers exclusively to snakes.
Teaching AI to Regret: The Backspace Token Theory
Humans backtrack. We type "thr" and realize we meant "the" and we fix it. We type "tje" and we laugh at our own fingers and we correct it. Large language models do not do this. They commit to every token like it is a binding legal contract. I started wondering what would happen if we gave them an out. What if we added a backspace token to the vocabulary?
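The decode side is the easy part. Here is a sketch of what applying a hypothetical <bks> token might look like; the hard part, training the model to emit it usefully, is left as an exercise for my GPU:

```python
BACKSPACE = "<bks>"  # hypothetical new vocabulary entry

def decode_with_backspace(sampled_tokens):
    """Apply backspace semantics at decode time: <bks> deletes the
    previous token, letting the model take back a bad commitment."""
    out = []
    for tok in sampled_tokens:
        if tok == BACKSPACE and out:
            out.pop()
        elif tok != BACKSPACE:
            out.append(tok)
    return out

# "thr" -> backspace -> "e" recovers "the"
print(decode_with_backspace(["th", "r", "<bks>", "e"]))  # ['th', 'e']
```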
The Irony Cloud: When AI Downtime Meets Timing
Anthropic is down. Of course it is down. The universe has a sense of humor and apparently that humor is "make the ethical AI company unreachable right after they make a big ethical statement."
The Bloatening: When AI Companies Forgot About the Little Guy
I used to get excited about model releases. A new tiny model would drop and I would immediately try to run it on my laptop that sounds like a jet engine. Now I scroll through announcements and see numbers that require a data center just to pronounce.
Why Does My AI Think Math Is a Fishing Trip?
I asked my model to solve a simple integral. It responded with a detailed description of trout migration patterns. This is not the answer I was looking for, though I admit the trout explanation was surprisingly well-structured. Training a small language model is like teaching a very enthusiastic puppy. It wants to please you.
Training Models on a Ramen Budget
How to train a transformer when your GPU bill looks like a phone number. Tips, tricks, and questionable life choices from someone who learned about electricity costs the hard way.
One Year of Vibecoding and Other Questionable Life Choices
You start vibecoding because someone told you it feels like magic. You imagine floating through code. Reality does not care about your imagination.
OpenClaw: The Most Overhyped Bot Since Sliced Bread
OpenClaw, formerly Clawdbot, formerly Moltbot, has now accumulated more GitHub stars than the Linux kernel. Let that sink in.
The Scaling Wall And Other Things I Yelled At
Someone told me we can just keep making models bigger. They said compute will solve everything. They lied. Or they hoped. Or they had investors to please.
Your AI Agent is Lying Behind Your Back
You know the feeling. You type a prompt. The text streams. The terminal says success. I am here to tell you that you are being played.
Anthropic's Distillation Drama: A Masterclass in Projection
So Anthropic published a blog post. Big surprise. The title alone could power a small city.
The Wasted Precision of the Output Layer
We spend a lot of time optimizing attention mechanisms. We prune weights. We quantize activations. Yet there is a massive inefficiency sitting right at the very end of the network.
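Back-of-envelope, with made-up but plausible tiny-model numbers (mine, not from any paper):

```python
# For a small model, the output projection alone can rival the whole body.
vocab, hidden, layers = 32000, 512, 8
lm_head = vocab * hidden           # 16,384,000 params in one matrix
block = 12 * hidden * hidden       # rough attention + MLP params per layer
body = layers * block              # 25,165,824 params for all eight layers
print(lm_head / (lm_head + body))  # ~0.39: nearly 40% of the weights at the very end
```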
My Baby Model Takes Forever to Grow Up
You start with hope. A tiny transformer. A few million parameters. You think, how long could this possibly take? I am here to ruin your optimism.
External Memory Modules: Because My Model Has Commitment Issues
You know what takes forever? Training a transformer. You know what takes less forever? Training a tiny thing that just remembers stuff.
The Goalpost Has Legs: Why AGI Keeps Running Away
Imagine handing Claude Opus 4.6 to someone from 2004. They would think you summoned a minor deity. Our collective response? A polite nod.
Words, Words, Words: My Model Learned to Ramble
My model has achieved something truly special. It can now ramble. Endlessly. With words. It does not just predict tokens anymore. It holds court.
The Memory Bottleneck: Why Your Model Can't Remember Anything
Context windows are like attention spans at a tech conference. Everyone pretends they can focus for longer, but really they're just waiting for the snack break.
Makeshift MTP: Predicting the Future on a Budget
Multi-token prediction sounds fancy. Really it's just the model trying to do its homework before the teacher assigns it. Sometimes it works. Sometimes it doesn't. But it always tries.
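The makeshift version really is makeshift. A sketch of the idea with my own naming, nothing more: k cheap heads hang off the same hidden state, each guessing one step further ahead:

```python
import torch
import torch.nn as nn

class MakeshiftMTP(nn.Module):
    """Budget multi-token prediction: k small heads share one trunk
    state; head k predicts the token k+1 steps ahead."""
    def __init__(self, dim=256, vocab=32000, horizon=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(horizon))

    def forward(self, hidden):  # hidden: (batch, seq, dim)
        return [head(hidden) for head in self.heads]

mtp = MakeshiftMTP()
logits = mtp(torch.randn(2, 16, 256))  # train each head on targets shifted k+1 steps
```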
Built with Curiosity Over Compute
The tagline sounds nice. What it really means is we couldn't afford the compute so we got curious instead.