When Better Reasoning Traces Hurt
Tldr;I spent some time trying to make a small language model better at math by being more careful about its training data. The model got worse. I made the data more careful. The model got worse. I tried to fix the fix. It got worse a third time. This is a write-up of why and of a result I genuinely did not see coming: the same intervention that hurts a 0.5B model helps a 3B model, and helps a 7B model more. The data never changed. The student did.
There was this classic llm failure mode that was trending recently. Basically, when asked the question
I want to wash my car. The car wash is 50 metres away. Should I walk or drive?
Many SOTA llms suggested that the subject should walk to the car wash. Here’s what the latest version of chatgpt answered when i tried the query
It had completely ignored the first sentence which was that I wanted to wash the car. Examples like this is why there’s a constant debate about whether these models have reasoning capabilities or not. I can’t really explain this particular failure mode (my best guess with failures like this is that the problem is not that the model doesn’t reason but it doesn’t have a good model of what it is reasoning about in this case how the world works and this is a classic instance of how language is not sufficient to help the model completely understand the world) but I was curious about how to fix them. The obvious solution here is to finetune the model with similar examples and reasoning traces that emphasizes the constraint. This led me to thinking about a targeted data quality approach.
Data curation during SFT when models learn to imitate reasoning, majorly consists of generating synthetic reasoning from bigger models and filtering for correctness. People rarely question the quality of the reasoning trace, but are some reasoning traces genuinely better teacher than others? and can we use feedbacks from model’s failures to improve reasoning traces thereby improving the performance of the model? That’s the question this post is about. And the short answer turned out to be more interesting than yes or no.
Similar works like STaR and SuperCorrect ….
The setup
The pipeline involves using a stronger model (Qwen2.5-72B) to solve GSM8K problems, filter the correct ones whose reasoning traces conform to specific defines criteria judged by another stronger model and then fine-tuning a smaller model on the resulting traces. The smaller model is a base model and we use standard finetuning hyperparameters since this is just a validation run.
Same problems, same teacher, same hyperparameters across runs. The only thing that varies is the prompt used to ask the teacher how to write. That isolates trace style as the variable. All answers are correct, all traces come from the same family of weights, all training recipes are identical. The only thing changing is the reasoning style the teacher was told to produce.
I designed four prompts, and the four datasets they produced will follow you through the rest of this post:
- Normal — “solve the problem step by step.” Nothing else.
- Gold — explicitly addresses the failure modes I’d catalogued in the base model. Restate the problem. Translate tricky wording. Restate intermediate values when reusing them. Check at the end that you answered what was asked. Numbered structure.
- Titanium — a strict schema, the kind of thing a programmer would design. Goal on top, every fact labelled
F1, F2…, every translation labelledX1, X2…, every calculation labelledC1, C2…, answer at the bottom. No prose at all. - Diamond — careful attention to failure points, but as natural prose. No labels, no headers, no schema. Inline arithmetic.
Here’s the same problem in all four registers:
Problem: Julie is reading a 120-page book. Yesterday she read 12 pages, and today she read twice as many pages as yesterday. If she wants to read half of the remaining pages tomorrow, how many pages should she read tomorrow?
Normal:
Yesterday, Julie read 12 pages. Today, she read twice as many pages as yesterday, so she read 2 × 12 = 24 pages. The total number of pages she has read so far is 12 + 24 = 36 pages. The number of pages remaining in the book is 120 − 36 = 84 pages. Julie wants to read half of the remaining pages tomorrow, so she should read 84 / 2 = 42 pages. #### 42
Gold:
Julie is reading a book that has 120 pages. Yesterday, she read 12 pages, and today she read twice as many pages as she read yesterday. We need to find out how many pages she should read tomorrow.
Step 1: Calculate the number of pages Julie read today. Pages read today = 2 × 12 = 24
[… continues for four numbered steps …]
The question asked for the number of pages Julie should read tomorrow. My answer is 42.
42
Titanium:
G: Number of pages Julie should read tomorrow F1: Julie is reading a 120-page book F2: Yesterday, she read 12 pages F3: Today, she read twice as many pages as yesterday C1: pages_read_today = 2 × 12 = 24 C2: total_pages_read = 12 + 24 = 36 C3: remaining_pages = 120 − 36 = 84 C4: half_remaining_pages = 84 / 2 = 42 ANSWER: 42
42
Diamond:
Yesterday, Julie read 12 pages. Today, she read twice as many pages as yesterday, which is 2 × 12 = 24 pages. So far, she has read 12 + 24 = 36 pages. The book has 120 pages, so the remaining pages are 120 − 36 = 84 pages. If she wants to read half of the remaining pages tomorrow, she should read 84 / 2 = 42 pages. #### 42
To my eye, Gold and Diamond are clearly the most carefully written. Titanium is the most rigorous in the formal systems sense ,we use variable names so it should be difficult to omit quantities. Normal is the laziest of the four. If you’d shown me these four traces and asked which would teach a small student best, I’d have ranked them in roughly that order, with Titanium and Gold competing for the top but the results did not agree.
The result, in one chart
At 0.5B, the laziest dataset wins. Every careful, well motivated intervention I designed made things worse. Titanium, the dataset I was most confident about, hit exactly the unfine-tuned base model’s accuracy.
At 3B, the ranking flips at the top. Gold overtakes Normal by 2.4 points. Diamond and Titanium are still below, and at 7B we see a similar picture.
There’s a clean shape here, but it isn’t the one I started with. Before I get to what I think it means, let me walk through the experiments in order. The wrong moves were the instructive ones.
Let there be Gold
I started by reading every wrong answer the 0.5B base model produced on 50 GSM8K problems. The errors weren’t random, they clustered into patterns that recurred over and over:
- comparative wording inverted,
- premises forgotten by step three,
- intermediate values dropped,
- scopes confused (treating a per-day rate as a total) etc.
These felt like teachable mistakes. The kind a careful tutor would address. So I built Gold around them. Restate the problem at the top, to force engagement with what’s actually being asked. Translate tricky comparative wording before using it, to make relation direction explicit. Restate values when reusing them, so the trace never relies on the model holding state across many tokens. Completion check at the end. I asked the 72B teacher to solve 1,700 GSM8K problems in this style, filtered for correctness and traces that followed the prompt instructions, fine-tuned the 0.5B on the result.
Training looked beautiful. Gold’s training loss was lower than Normal’s but the eval told a different story:
| Dataset | 0.5B accuracy |
|---|---|
| Normal | 49.7% |
| Gold | 47.3% |
Gold was worse. The dataset I’d designed specifically to address the model’s failure modes performed worse than the one I’d generated with no instructions at all. This was a bit suprising and disappointing but after looking at about 30 outputs of the model it wasn’t hard to see that Gold’s structure (numbered headers, restatement preamble) seemed to be competing with it’s reasoning ability. The structure is different from what the model had seen during pretraining hence the model was spending some of its limited capacity learning Gold’s format instead of Gold’s math.
Doubling down: And then there was Titanium
The fix, I thought, was to commit harder. If the model was struggling with Gold’s prose-around-structure, give it a cleaner structure. Make every part of the trace mechanical, unambiguous, named. Take away the prose entirely.
That gave me Titanium. Every fact on its own line with an F1 label. Every translation with X1. Every calculation with C1. Goal at the top, answer at the bottom. The model literally cannot drop a constraint, because each one has its own row with its own identifier. There is no ambiguity for it to misread.
Titanium scored 33.3%.
That’s the base model. To one decimal place. A full fine-tuning run on a clean, schematic, perfectly labelled dataset accomplished nothing measurable. The model learned to predict F2: after F1: and went home. We’ll discuss more about the failures of Titanium after looking at Diamond dataset but at this point it was obvious that the structure was hurting the model’s ability to learn at least at this scale.
Be Fruitful: Diamond
So Gold failed because its structure pushed traces off distribution. Titanium failed because its schema turned reasoning into a fill in form. What I needed, clearly, was Normal’s prose distribution with Gold’s pedagogical care, and absolutely no schema to imitate.
I called that Diamond. The teacher prompt was roughly: “Write naturally, with inline arithmetic. Be explicit at exactly the decision points where small models fail — ratio direction, comparative wording, state tracking, scope. No bullets. No numbered steps. No labels. Just careful prose.”
This had to work. It was essentially Normal with the dial of careful attention turned up. No format to imitate, no schema cost, just better English at the points where the model was known to slip.
Diamond scored 45.7%.
Below Normal. Again.
The fourth day
Three carefully designed datasets in a row, and the version with no instructions at all was still beating them.
At this point I had to seriously consider that I had been wrong about something more fundamental than which formatting style to pick.
Let’s Look at the training loss for the four runs:
Starting with the panel on the left, the train loss, two things are easy to read off it. First, Normal and Gold sit at the top of the band, plateauing around 0.09–0.10. Second, Titanium and Diamond sit lower, around 0.05–0.07. Titanium ends up just under Diamond, but on the train side the gap between them is small.
The fact that Titanium reaches very low train loss is unsurprising once you’ve stared at a Titanium trace. The schema (F1:, C2:, ANSWER:, the structural punctuation) gives the model a huge amount of token-level predictability. Once it’s seen F1: it knows F2: is coming. The schema is, mechanically, just easier to fit. The fact that Diamond is sitting down there with Titanium is the part that surprised me. Diamond has no schema at all. So why is Diamond down with Titanium on train loss instead of up with Normal and Gold? My guess is that Diamond’s prompt creates a different kind of predictability which is not structural, but stylistic. The teacher was instructed to write in a tight, compressed, inline-arithmetic register, and across 1,700 traces that register is consistent. Normal and Gold, by contrast, were generated with much looser prompts: across their traces, the same idea (“she has read a total of”) might appear as “altogether, she’s read,” “the total is so far,” “in total she’s read,” or any of a dozen other phrasings. Diamond’s prose is more uniform across examples, so once the model picks up its rhythms, predicting the next token within that style gets cheap.
Now look at the eval panel on the right. The Titanium-Diamond gap, which was small on training, opens up clearly on held out data. This second observation is the one that pinned down what was actually going on for me. If both Titanium and Diamond were just “narrow distributions are easy to fit,” they should track each other on eval too but they don’t. Titanium’s eval loss advantage over Diamond is bigger than its train loss advantage. Something is letting the model predict held out Titanium examples more efficiently than it predicts held out Diamond examples. That something is the schema’s positional predictability. Stylistic narrowness (Diamond) is generic, once you’ve learned the register, you have to roughly guess what content comes next at every position. Schematic narrowness (Titanium) is much stronger, once you’ve learned the template, you know not just that the next token will be in a particular style, but that it sits at a known slot: after C2: comes an arithmetic expression of a known shape; after ANSWER: comes a single number followed by a newline. The schema acts like positional anchors, which generalises perfectly to any other Titanium formatted example regardless of the specific math. Diamond’s prose register doesn’t give you anchors of that kind. You still have to predict the next word in the flow.
Both of these make loss go down. Neither of them has anything to do with the model getting better at math. This was the first big update for me, and it’s the kind that feels obvious in retrospect: low SFT loss does not mean good reasoning supervision. It means the dataset is more compressible, either stylistically, or schematically, or both. The math content is approximately constant across all four of these datasets. What varies is the prose around it. And the prose is what loss is dominated by, because it’s most of the tokens.
I had been judging traces by how clear and pedagogically sound they looked to me. Like a teacher writing for a textbook. But a 0.5B model doesn’t read. It predicts tokens. If most of the variance in a trace lives in stylistic and structural conventions rather than in reasoning, then most of the supervision is being spent on those conventions, and the model becomes very good at producing them while getting only the crumbs of the reasoning.
Reframing: two different concepts of “quality”
There were two interpretations I had to weigh against each other, and they pointed in very different directions.
The first: data quality just doesn’t matter at this scale. 0.5B parameters, ~1,700 examples the model can only extract the most basic signal (“produce something that looks like a math solution and ends with ####”) and any quality differences between traces sit below the noise floor.
This didn’t quite fit. If quality genuinely didn’t matter, Titanium should have landed near Normal. Both are correct. Both are valid math solutions. Both have an answer at the end. Instead, Titanium landed exactly on the base model, which means training on it actively neutralised the signal that Normal extracted.
The second: I had the concept of quality wrong.
I’d been optimising for traces that would be good for a human learner; clear, pedagogical, explicit, and well structured. The kind of solution I’d want a student to be shown. But a 0.5B model is not a human learner. It does not read. It does not benefit from named variables and explicit restatement the way a high school student does. It updates weights to predict next tokens.
For a small student, a “good” trace might mean something almost orthogonal. It might mean a trace that sits close to the distribution of math text the model saw in pretraining. With enough text to actually shift weights. With no special structure to consume capacity learning a new format.
Normal hits all three of those, almost by accident. It looks like math solutions on the open internet, which is where most of the model’s pretraining math came from. It has enough prose to deliver meaningful supervision. It has no special structure to imitate. The thing that made Normal lazy from my teacher’s perspective is precisely what made it effective from the model’s.
I started calling this student-learnable quality to distinguish it from human readable quality. They aren’t the same axis. They can actively trade off. And I’d been pushing hard on the first while quietly destroying the second.
What scale changed
That was a satisfying story to tell myself. But it had testable consequences, and the obvious test was: if the binding constraint at 0.5B is really distributional accessibility i.e the trace’s proximity to what the model has already seen, then at a larger scale, where the model has slack on capacity, the picture should change. Carefully written traces should start helping again, because the model can finally afford to absorb their structural overhead.
I ran the same four datasets at 3B. Then at 7B. Here are all the numbers:
| Student | Base | Normal | Gold | Titanium | Diamond |
|---|---|---|---|---|---|
| 0.5B | 33.3 | 49.7 | 47.3 | 33.3 | 45.7 |
| 3B | 74.0 | 85.3 | 87.7 | 70.0 | 81.3 |
| 7B* | — | 88.3 | 90.3 | 81.3 | 87.7 |
The 7B base evaluation pipeline was misbehaving and produced uninterpretable numbers; I’m reporting only the SFT runs, which share an internally consistent pipeline.
Two things flip between 0.5B and 3B. Gold overtakes Normal. And Titanium falls below the unfine-tuned base. At 7B, Gold’s lead widens, and Titanium recovers somewhat but still trails the rest of the field by about seven points.
The Gold result is the one I expected eventually, once I had the right frame. With more capacity, the model can absorb the structural overhead of pedagogical scaffolding, and the actual teaching content (restatement, completion check, explicit relations) finally becomes extractable. The picture I now carry around looks like this:
Two factors that score traces differently.
Distributional accessibility is how close the trace style sits to what the model saw in pretraining. Normal scores high. Diamond medium-high. Gold medium-low. Titanium very low.
Teaching content is how much actual reasoning signal lives in the trace beyond the bare requirement that it be correct. Gold is high (restatement and completion checks do real work). Normal and Diamond are medium. Titanium is low labels do not reason, they label.
At small scale, accessibility binds. The model cannot afford the distribution-shift cost, so the ranking tracks accessibility almost perfectly: Normal > Gold > Diamond > Titanium.
At larger scale, accessibility has slack. The model has enough capacity to absorb new structure, the cost gets paid easily, and the ranking starts to track teaching content instead: Gold > Normal > Diamond > Titanium.
This is actually similar to the result found in the paper “Small models struggle to learn from strong reasoners “
This story makes three predictions you can check.
First, a schema with no teaching content should hurt at every scale, and the hurt should compound with capacity. More capacity means more thorough format learning, which displaces math behaviour the base model already had. Titanium does exactly this — worst at 0.5B, worse than base at 3B, and at 7B it recovers only enough to still trail the other runs.
Second, the crossover where Gold overtakes Normal should happen somewhere between 0.5B and 3B. I haven’t run intermediate sizes.
Third, a pedagogical intervention that preserves the natural prose distribution should sit between Gold and Normal — gaining some teaching content but giving up some of the explicit scaffolding. Diamond looks exactly like that.
Did targeting failures actually work?
The original motivation for Gold was that the 0.5B made specific mistakes, and Gold’s prompt explicitly addressed each of them. So there’s a fair question to ask, separately from overall accuracy: did Gold actually reduce the targeted failure modes, even on runs where overall accuracy didn’t move?
For the 3B student, where the deltas are large enough to read above the noise, the answer is mostly yes:
Gold improves on the failure modes its prompt was designed to address. Relation inversions drop by 10 (Normal manages 7). State tracking by 7. Answer overwriting by 9. The categories where Gold helps most are precisely the ones the prompt asked the teacher to handle carefully. The intervention worked. The intervention worked. It just had to compete against accessibility costs that ate most of its margin.
Titanium tells the opposite story, and it’s clarifying. Titanium doesn’t merely fail to teach, it creates errors that didn’t exist before. Aggregation errors jump by 14. Relation inversions go up by 4 versus base. Loop degeneration appears as a category that essentially doesn’t exist for the other datasets.
Diamond is the messiest. It improves operation omission by 7 (best of any dataset) but regresses on aggregation by 6. My read is that Diamond’s compressed prose style sometimes packs too many sub operations into single sentences, and the model loses the thread when it tries to reproduce that compression at inference time. This is a guess. I haven’t pulled enough traces to confirm.
The cleanest takeaway is that Gold’s intervention did work as designed at 3B. The places it targeted are the places it improved. The reason Gold’s overall accuracy gain over Normal is only +2.4 points isn’t that the intervention was ineffective — it’s that the gains kept getting partly cancelled by accessibility costs the small student couldn’t fully afford.
How problem difficulty interacts
One more breakdown I find interesting:
On Easy problems, Gold wins by a comfortable margin: 92% to Normal’s 87%. On Medium, Normal edges ahead (89% to 86%). On Hard, Gold and Normal tie at 75%.
The lazy reading is that Gold’s pedagogical scaffolding helps when problems are easy enough that the model can still use the scaffolding and stops helping when problems are hard enough that the scaffolding becomes overhead. I don’t think that’s quite right. If scaffolding were just overhead, Gold should fall below Normal on hard problems, not tie. The pattern is more like: Gold helps where there’s headroom and ties where neither approach has enough leverage. The intervention seems to be additive on top of a substrate of capability, it can convert “almost right” into “right,” but it can’t manufacture capability the model doesn’t yet have.
Titanium sits flat at base level performance across all three difficulties. Its pathology isn’t problem specific. The model just isn’t learning math from these traces, regardless of how hard the math is.
Where I might be wrong
I want to be explicit about where this story could break, because it’s an interpretation of a small number of single seed runs and I haven’t fully closed the loop.
Token budget. Different datasets have different total token counts:
Gold has the most tokens, by a clear margin. Normal sits in the middle. Titanium and Diamond are the lean ones. You could try to tell a competing story where the rankings are basically about supervised token count i.e more tokens, more learning, but the data doesn’t cooperate. Diamond has the fewest tokens of any dataset and still beats Titanium at every scale. Titanium has fewer tokens than Normal and fewer than Gold, but it loses to both by far more than the token gap would predict and at 3B, training on Titanium leaves the model worse than the base it started from, which no amount of “not enough supervision” explains. Whatever Titanium is losing on, it isn’t supervision volume. The cleanest token-budget experiment I still owe is a head-to-head between Normal and Diamond at matched token count, since Diamond’s deficit there is the largest. If Diamond closes the gap with Normal when matched on tokens, then “preserving prose distribution while adding pedagogical care” is a real direction. If it doesn’t, Diamond is “Normal with extra steps” and I have less to say.
Single seeds. All these numbers are single runs. The 2.4 point Gold over Normal margin at 3B is exactly the kind of effect that could shrink or invert under reasonable seed variation. I don’t know my noise floor. Until I do, treat the magnitudes as suggestive and the shape as the robust part;the small scale ranking flipping at larger scale, Titanium being uniformly bad, the directional consistency from 3B to 7B. The shape, I’d bet on. The decimal places, less so.
What I’d actually try next
One consequence of this result is that we might be leaving some data on the table from targeted improvements on vanilla cot. Is there a world where we can train a model on the same dataset multiple times with improvements in the reasoning traces each round based on the observed failures of the previous round and without overfitting or collapsing the model?
References
- Li et al., “Small Models Struggle to Learn from Strong Reasoners,” arXiv:2502.12143
- Kim et al., “In Their Own Words: Reasoning Traces Tailored for Small Models Make Them Better Reasoners,” arXiv:2509.22230
- Costello et al., “Think, Prune, Train, Improve: Scaling Reasoning without Scaling Models,” arXiv:2504.18116