Fine-Tuning Llama 3.2 3B on Medical QA: When a Better Loss Number Produced a Worse Model

Week 3 gave us a working fine-tuned Llama 3.2 3B model for medical QA: one epoch, one dataset, clear improvement over base. Week 4 was supposed to be the upgrade, with more data from two sources, two epochs of training, and a cleaner setup overall. The eval loss dropped from 2.495 to 2.275. By that number alone, the run was a success. The generated answers said otherwise: the model collapsed.

The Plan: Four Changes Over Week 3

The strategy combined ChatDoctor (conversational patient-doctor QA) with MedAlpaca WikiDoc (encyclopedic clinical reference) in an 8,000-to-4,000 row split. Training ran for two full epochs instead of one on a Kaggle T4 instance, taking roughly four hours. The team also switched to greedy decoding for reproducible evaluation and made a critical tokenizer change: using Llama 3.2's built-in padding token <|finetune_right_pad_id|> (token 128004) instead of adding a custom pad token that had bloated the Week 3 adapter file from ~50MB to 3.19GB.
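To make the data mix concrete, here is a minimal sketch of how a two-source training set with fixed row counts could be assembled. The function names, the seed, and the stand-in row lists are assumptions for illustration; only the 8,000 and 4,000 row counts come from the plan above.

```python
import random


def build_mix(sources, counts, seed=42):
    """Sample counts[name] rows from each source, then shuffle the union."""
    rng = random.Random(seed)
    mixed = []
    for name, rows in sources.items():
        k = counts[name]
        if k > len(rows):
            raise ValueError(f"{name} has only {len(rows)} rows, asked for {k}")
        # Tag every row with its source so the mix can be audited later.
        mixed.extend((name, row) for row in rng.sample(rows, k))
    rng.shuffle(mixed)
    return mixed


def mix_fractions(counts):
    """Share of the final dataset contributed by each source."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Row counts from the Week 4 plan; the row contents here are placeholders.
week4_counts = {"chatdoctor": 8000, "wikidoc": 4000}
placeholder_sources = {
    "chatdoctor": list(range(10000)),
    "wikidoc": list(range(6000)),
}
week4_mix = build_mix(placeholder_sources, week4_counts)
print(len(week4_mix), mix_fractions(week4_counts))
```

By row count, an 8,000-to-4,000 split is two-thirds conversational and one-third encyclopedic. Printing the fractions before training is a cheap way to confirm the mix is the one you intended.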

The Pad Token Fix That Actually Worked

Week 3 suffered an unexpected side effect: adding a custom pad token required resizing the embedding layer, and PEFT saved that entire resized layer alongside the LoRA adapters. Switching to Llama 3.2's native padding token eliminated the embedding resize entirely. The Week 4 adapter came out at ~50MB—exactly what was expected. Before reaching for add_special_tokens, check whether your model already ships with an appropriate reserved token.
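The check suggested above can be sketched in a few lines. This is an illustration rather than the exact code used in the project: the helper name is hypothetical, and the toy vocabulary stands in for what a Hugging Face tokenizer returns from get_vocab(). Only the pad token string and its id (128004) come from the article.

```python
def pick_existing_pad_token(vocab, candidates=("<|finetune_right_pad_id|>",)):
    """Return (token, id) for the first candidate already in the vocabulary.

    Returns None when no candidate exists, which is the only case where
    adding a new token (and resizing the embeddings) is worth considering.
    """
    for token in candidates:
        if token in vocab:
            return token, vocab[token]
    return None


# Toy vocabulary: the ids 0 and 1 are made up, 128004 is from the article.
toy_vocab = {"hello": 0, "world": 1, "<|finetune_right_pad_id|>": 128004}

found = pick_existing_pad_token(toy_vocab)
print(found)

# With a real tokenizer the same idea would look roughly like this:
#   found = pick_existing_pad_token(tokenizer.get_vocab())
#   if found is not None:
#       tokenizer.pad_token = found[0]   # no resize_token_embeddings() call needed
```

Because the token already has a row in the embedding matrix, nothing about the model's shape changes, so the saved adapter should contain only the LoRA weights.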

Training Run: Numbers That Lied

The loss curve looked textbook-perfect. Step 150 showed train 2.499, eval 2.474. By step 1,050—the final checkpoint—train sat at 2.231 and eval at 2.275. Train and eval tracked each other closely with no divergence indicating overfitting. Mean token accuracy climbed to 0.515. Every metric screamed improvement.
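To put those loss values in more intuitive terms, they can be converted to perplexity. This sketch assumes the reported losses are mean per-token cross-entropy in nats, which is the usual convention but is not stated in the training logs quoted here.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (nats)."""
    return math.exp(mean_cross_entropy)


week3_eval = 2.495   # Week 3 eval loss
week4_eval = 2.275   # Week 4 eval loss at step 1,050
week4_train = 2.231  # Week 4 train loss at step 1,050

ppl_week3 = perplexity(week3_eval)
ppl_week4 = perplexity(week4_eval)
train_eval_gap = week4_eval - week4_train

print(f"Week 3 eval perplexity: {ppl_week3:.2f}")
print(f"Week 4 eval perplexity: {ppl_week4:.2f}")
print(f"Week 4 train/eval gap:  {train_eval_gap:.3f}")
```

Under that assumption, eval perplexity falls from about 12.1 to about 9.7, with a train/eval gap of only 0.044. On paper that is a healthy run, which is exactly why the regression was easy to miss.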

The Regression Nobody Predicted

Test questions under greedy decoding told a different story. The diabetes answer began coherently, then collapsed into gibberish: 'Eye yawning, Eye yawns, Eye years, Eye yolks, Eye yummy, Eye yogurt.' Even under deterministic decoding, supposedly the stable option, the model degraded mid-generation. The heart attack answer spiraled into a runaway list that drifted from cardiac symptoms into sore throats and ear pain. The hypertension answer confidently recommended atenolol as first-line therapy, which is clinically incorrect: beta-blockers are not first-line for uncomplicated hypertension.

Diagnosing Two Separate Failures

The first failure came from decoding. The no_repeat_ngram_size=3 setting was meant to prevent repetition loops but achieved the opposite. Once the model had generated a three-token sequence such as 'consult your doctor,' it could never emit that exact sequence again in the same answer. When the model wanted to close a list by repeating a natural ending pattern, the rule forbade it, forcing a different token each time until the output drifted into nonsense.

The second failure ran deeper. Two epochs on data weighted roughly two-to-one by row count between conversational ChatDoctor (8,000 rows) and encyclopedic WikiDoc (4,000 rows) had overfit the model to list generation. On questions inviting enumeration (symptoms, drugs), the model started a list and could not stop, eventually confabulating invented medications like 'artuzofloxacin.' The loss curve never revealed this because eval loss measures next-token prediction against reference text: every prediction is conditioned on a correct, human-written prefix, never on the model's own earlier output. A model can excel at that while getting worse at producing coherent, bounded, truthful answers on its own.
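The n-gram ban is easier to reason about with a small simulation. The function below is a simplified re-implementation of the rule, not the library's own code, and it treats whole words as tokens, whereas real tokenizers work on subword ids. The function name and the example sentence are assumptions for illustration.

```python
def banned_next_tokens(generated, n=3):
    """Tokens that would complete an n-gram already present in `generated`."""
    if n < 2:
        raise ValueError("this sketch only handles n >= 2")
    if len(generated) < n:
        return set()  # no complete n-gram exists yet, so nothing is banned
    # The last n-1 tokens are the prefix of whatever n-gram comes next.
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            # This prefix was seen before; the token that followed it is now forbidden.
            banned.add(generated[i + n - 1])
    return banned


history = ["please", "consult", "your", "doctor", ".",
           "please", "consult", "your"]
print(banned_next_tokens(history, n=3))
```

After the second "consult your", the token "doctor" is forbidden, so the model has to pick whatever comes next in its ranking. In a long list with a recurring closing pattern, each such forced substitution moves the text a little further from anything sensible.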

Three Fixes That Actually Worked

Rebalancing the dataset attacked the root behavior. WikiDoc dropped from 4,000 to 1,500 rows while ChatDoctor rose from 8,000 to 8,500, landing at 85% narrative prose and 15% encyclopedic content. ChatDoctor's conversational format trains the model toward bounded, flowing responses rather than open-ended enumeration.

Expanding the LoRA target modules addressed the factual recall failure. Attention layers (q_proj, k_proj, v_proj, o_proj) route information, while feed-forward layers (up_proj, down_proj, gate_proj) are often described as storing and retrieving factual knowledge, somewhat like a key-value memory. Week 3 had frozen the feed-forward layers entirely; adding them let fine-tuning adjust where facts live rather than only how attention routes them.
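The effect of the target module change can be sketched without loading a model. The module paths below follow the naming used by the Hugging Face Llama implementation, which is an assumption here, and the suffix-matching helper is a simplified stand-in for how an adapter library selects layers, not its actual code.

```python
ATTENTION_ONLY = ["q_proj", "k_proj", "v_proj", "o_proj"]
ATTENTION_PLUS_MLP = ATTENTION_ONLY + ["gate_proj", "up_proj", "down_proj"]


def layer_module_names(layer_idx):
    """Linear-layer paths for one decoder block (assumed naming scheme)."""
    prefix = f"model.layers.{layer_idx}"
    attn = [f"{prefix}.self_attn.{p}" for p in ("q_proj", "k_proj", "v_proj", "o_proj")]
    mlp = [f"{prefix}.mlp.{p}" for p in ("gate_proj", "up_proj", "down_proj")]
    return attn + mlp


def adapted_modules(module_names, target_modules):
    """Modules whose final path component matches one of the targets."""
    return [
        name for name in module_names
        if any(name == t or name.endswith("." + t) for t in target_modules)
    ]


names = layer_module_names(0)
print(len(adapted_modules(names, ATTENTION_ONLY)), "of", len(names), "adapted (attention only)")
print(len(adapted_modules(names, ATTENTION_PLUS_MLP)), "of", len(names), "adapted (attention + MLP)")
```

With the attention-only list, every feed-forward projection in the block stays frozen. In practice a list like ATTENTION_PLUS_MLP would be passed as the target_modules argument of a PEFT LoraConfig; rank, alpha, and dropout are left out because the article does not state them.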

Generation Settings: The Fix Nobody Talks About

Removing no_repeat_ngram_size entirely eliminated the constraint that had forced token drift. Setting eos_token_id explicitly to <|eot_id|> gave the model a stop signal it could honor, replacing the implicit stopping behavior that had failed before. A repetition_penalty of 1.3 discouraged loops without hard n-gram bans, and capping max_new_tokens at 256 put a hard ceiling on any answer that still failed to stop.
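Here is a rough sketch of those settings, plus a simplified version of what a repetition penalty does to the logits. The keyword names are parameters of the Hugging Face generate method and the values come from the paragraph above. The helper functions are hypothetical, and the penalty function is an illustrative re-implementation of the common rule (divide positive logits, multiply negative ones), not library code.

```python
def generation_kwargs(eot_id):
    """Decoding settings described above; eot_id is the id of <|eot_id|>."""
    return {
        "do_sample": False,          # greedy decoding for reproducible evaluation
        "max_new_tokens": 256,       # hard ceiling on answer length
        "repetition_penalty": 1.3,   # soft discouragement, no hard bans
        "eos_token_id": eot_id,      # explicit stop signal
        # no_repeat_ngram_size is deliberately absent
    }


def apply_repetition_penalty(logits, seen_token_ids, penalty=1.3):
    """Make already-generated tokens less likely without forbidding them."""
    out = list(logits)
    for tok in set(seen_token_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


# Tokens 0 and 1 were already generated; token 2 was not.
print(apply_repetition_penalty([2.6, -1.3, 0.5], seen_token_ids=[0, 1]))

# With a real tokenizer, the stop id could be looked up like this:
#   eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
#   model.generate(**inputs, **generation_kwargs(eot_id))
```

The key difference from the n-gram ban is that a penalized token can still win if the model is confident enough, so a natural closing phrase remains available.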

The Result: Clean Answers, Reproducible Outputs

The degeneration vanished completely. The hypertension answer now listed four real antihypertensive drug classes (ACE inhibitors, ARBs, beta-blockers, calcium channel blockers) and stopped cleanly, though beta-blockers are still not first-line for uncomplicated hypertension. The malaria answer named actual treatments (artemether-lumefantrine, chloroquine, mefloquine) and terminated properly. The heart attack answer, which had failed in every previous run including Week 3, finally produced seven correct cardiac warning signs without confabulation or drift. Running the same question twice under greedy decoding now produces byte-for-byte identical output, the reproducibility that makes claims defensible.
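The byte-for-byte claim is easy to turn into a check. In this sketch, generate_fn is a hypothetical callable that wraps the model and returns the decoded answer as a string; the two stub functions only exist to show the check passing and failing.

```python
import hashlib
import itertools


def output_digest(text):
    """SHA-256 of the UTF-8 bytes, so any single-byte difference is caught."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_reproducible(generate_fn, prompt, runs=2):
    """True when every run of generate_fn(prompt) yields identical bytes."""
    digests = {output_digest(generate_fn(prompt)) for _ in range(runs)}
    return len(digests) == 1


# Stub standing in for greedy decoding: same input, same output.
def deterministic_stub(prompt):
    return "Answer to: " + prompt


# Stub standing in for sampling: the output changes on every call.
_counter = itertools.count()


def varying_stub(prompt):
    return f"Answer {next(_counter)} to: {prompt}"


print(is_reproducible(deterministic_stub, "What are the symptoms of diabetes?"))
print(is_reproducible(varying_stub, "What are the symptoms of diabetes?"))
```

Storing the digest alongside each evaluation answer also makes it obvious later if a library upgrade or hardware change silently alters the output.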

Key Lessons for Fine-Tuning Small Models

Lower eval loss does not mean a better model. Eval loss measures next-token prediction accuracy on your validation set. It does not measure factual accuracy, coherence, or whether the model knows when to stop generating. The Week 4 two-epoch model achieved the best loss and produced the worst generation quality of any version so far.

Generation settings are not an afterthought. The same model weights produced either total collapse or clean answers depending entirely on decoding configuration. The no_repeat_ngram_size ban, meant to prevent repetition, actively drove the degeneration. Half the battle with a small model is how you decode it.
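One cheap supplement to reading outputs by hand is a smoke check that flags obviously degenerate answers. The heuristic below is an assumption, not something used in the runs described here: the function name, the 0.3 threshold, and the healthy sample sentence are all illustrative, and the check will not catch factual errors.

```python
import re
from collections import Counter


def degeneration_flags(text, n_new_tokens, max_new_tokens=256,
                       ended_with_eos=True, max_word_share=0.3):
    """Flag answers that ran to the token cap or are dominated by one word."""
    words = re.findall(r"[a-z']+", text.lower())
    top_share = max(Counter(words).values()) / len(words) if words else 0.0
    return {
        # Reaching the cap without an end-of-sequence token suggests a runaway answer.
        "hit_token_cap": n_new_tokens >= max_new_tokens and not ended_with_eos,
        # One word making up a large share of the answer suggests a loop.
        "dominant_word": top_share > max_word_share,
    }


collapsed = "Eye yawning, Eye yawns, Eye years, Eye yolks, Eye yummy, Eye yogurt"
healthy = "Common symptoms include increased thirst, frequent urination, and fatigue."

print(degeneration_flags(collapsed, n_new_tokens=256, ended_with_eos=False))
print(degeneration_flags(healthy, n_new_tokens=14))
```

On the collapsed diabetes fragment quoted earlier, "Eye" accounts for half the words, so the check fires. Running something like this over a fixed set of test questions after every training run gives an early warning that the loss curve cannot.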

The Bottom Line

This week exposed a trap that catches even experienced practitioners: optimizing for loss metrics without validating actual output quality leads to models that look better in logs and perform worse in production. Manual testing of generated outputs isn't optional—it's the only way to catch regressions that eval loss hides. For anyone fine-tuning small models on domain-specific tasks, the generation configuration deserves as much attention as the training pipeline itself.
