AI Model Fine-Tuning Data Guide: Quality, Formats & Flywheel.

The fine-tuning data guide engineers need: how much data, 4 sources, 3 formats, model collapse risk from synthetic data, and the data flywheel that builds a better model every quarter.

Let me tell you about a team that did everything right and still ended up with a model worse than the one they started with.

They were building a customer support assistant for a software company. They had access to four years of support tickets and over 10,000 resolved conversations between their agents and customers. Real data. Actual domain content. It's exactly what every fine-tuning guide says to collect.

They trained the model. The loss curves looked good. Validation loss came down steadily. They ran a few manual spot checks. It seemed fine. They deployed.

Users noticed within a week. The model confidently gave outdated answers. It contradicted itself on the same question asked two different ways. It used the writing style of whichever agent had written the most tickets, a person who had left the company two years ago.

The data was real. It was domain-specific. There was plenty of it. It was also four years of inconsistency written by twelve different people with different styles, different levels of accuracy, and different ideas about what a good answer looked like. The model learned the average of twelve opinions. None of those opinions were entirely right. The average of twelve inconsistent opinions is not a good model.

The Insight That Changes Everything

Before a single data collection decision, you need to understand one thing about how fine-tuning works.

The training process does not know which of your examples are good and which are sloppy. It gives every example equal weight. If you provide 1,000 excellent examples and 200 sloppy ones, the model does not learn from the excellent 1,000 and ignore the sloppy 200.

It minimises the average loss across all 1,200. Each of the sloppy 200 examples casts a vote against the direction the good examples are pulling.

A fine-tuned model learns the average of your data, not the ideal of your task.
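A toy illustration of the averaging effect, under a loud simplification: real fine-tuning minimises token-level cross-entropy, not squared error, and the numeric targets here are invented stand-ins for "good" (1.0) and "sloppy" (0.0) answers. The point it sketches is only that minimising an average loss lands on the average of the data.

```python
def best_constant_prediction(targets):
    """Return the single prediction that minimises mean squared error.

    For squared error, that minimiser is the arithmetic mean of the targets.
    """
    return sum(targets) / len(targets)


# Assumption for illustration: 1.0 stands for the ideal answer, 0.0 for a sloppy one.
targets = [1.0] * 1000 + [0.0] * 200

prediction = best_constant_prediction(targets)
print(round(prediction, 3))  # 0.833
```

The 200 sloppy examples are one sixth of the dataset, and they pull the learned value one sixth of the way from the ideal toward the sloppy answer.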

This is why "garbage in, garbage out" is more serious for fine-tuning than for most software systems. In a database, a bad record is a bad record. You can filter it out, correct it, flag it. In a fine-tuned model, the bad record has already been woven into 7 billion adjusted parameters. You cannot reverse it selectively. You can only dilute it with more good examples — or retrain from scratch.

Quality test for every training example: Ask yourself, would I be comfortable if the model learned only this example and nothing else? If the answer is no, that example should not be in your dataset.

This test is more useful than any automated filter. It forces you to think about each example as a standalone lesson rather than one row in a JSONL file.

How Much Data Do You Actually Need?

Most articles tell you to collect 500 to 10,000 examples. That range is so wide it is almost useless. Here is a more precise picture from research and production experience.

For most domain adaptation tasks: 500 high-quality examples.

This number appears repeatedly across the fine-tuning literature and production case studies. Five hundred high-quality examples tend to outperform 5,000 mediocre ones on domain-specific tasks.

The mechanism is straightforward: more bad examples dilute the training signal from good ones, pulling the average loss in the wrong direction.

Minimum floor: 200 to 300 examples.

Fewer than 200 examples and the signal is too weak to produce consistent improvement. The model adjusts parameters, but not enough to overcome the base model's existing tendencies. You get inconsistent improvements that do not generalise.

Where the ceiling is:

For a narrow, well-defined task, such as extracting a specific type of information or generating responses in a specific format, you can reach the quality ceiling with 1,000 to 2,000 examples. Beyond that, additional examples yield only marginal gains.

The ceiling is not about how many examples you have. It is about how well those examples cover the distribution of inputs the model will face in production.

The distribution coverage principle:

Your 500 to 1,000 examples need to cover the range of inputs your model will see. Not just the easy cases. Not just the cases you remembered to include. A model trained only on clean, well-formatted customer questions will fail when real users send poorly phrased, abbreviation-heavy, context-free messages, because those inputs look nothing like what it was trained on.

Sample your training data from the same distribution as your production inputs. If you do not yet have production data, this means building realistic examples that intentionally include the messy, ambiguous, and edge-case inputs you know are coming.
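One way to act on the coverage principle is to sample by category so the training mix follows the production mix. The sketch below assumes each example carries a hypothetical "category" label and that you can estimate the production shares yourself; neither comes from the article.

```python
import random
from collections import defaultdict


def sample_to_match(examples, target_mix, total, seed=0):
    """Draw roughly `total` examples whose category shares follow `target_mix`.

    examples:   list of dicts with a "category" key (hypothetical field name).
    target_mix: dict mapping category -> fraction of production traffic.
    Per-category counts are rounded, so the result can differ from `total` by a few.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for example in examples:
        by_category[example["category"]].append(example)

    sampled = []
    for category, fraction in target_mix.items():
        wanted = round(total * fraction)
        pool = by_category.get(category, [])
        if len(pool) < wanted:
            # A shortfall is a coverage gap: collect or write more of this category.
            raise ValueError(f"need {wanted} '{category}' examples, have {len(pool)}")
        sampled.extend(rng.sample(pool, wanted))
    rng.shuffle(sampled)
    return sampled


pool = [{"category": "clean", "text": f"c{i}"} for i in range(10)]
pool += [{"category": "messy", "text": f"m{i}"} for i in range(10)]
subset = sample_to_match(pool, {"clean": 0.3, "messy": 0.7}, total=10)
```

Raising on a shortfall is deliberate: silently under-sampling a category would hide exactly the gap this principle is meant to expose.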

The Four Data Sources — What Works and What the Research Now Shows

You have four places training data can come from. Each has a different cost, quality profile, and failure mode.

Source 1 — Real User Data (Highest Signal)

Real interactions between your system and actual users are the gold standard. They capture exactly the distribution of inputs the model will face. They contain the messiness, the ambiguity, and the edge cases that synthetic data rarely reproduces.

The problem is that you usually need a production system to generate real user data, and you are still building the system that needs the data. This is the cold start problem. Most teams do not have real user data before their first model.

The practical path: if you have a related existing product or workflow, mine it. Customer support transcripts. Internal tool usage logs. Email threads. Sales call transcripts. Any existing human-to-system or human-to-human interaction in your domain is a candidate.

Filter it and format it. Always start from what already exists before reaching for synthetic generation.

Source 2 — Synthetic Data Generated by a Frontier Model (Scalable, Risky)

Using GPT-4o, Claude Opus, or Gemini to generate training examples is fast and cheap. You can generate 2,000 examples in an afternoon. This is genuinely useful for getting past the cold start problem.

But two specific risks apply.

The first is quality drift. The generating model has its own opinions and biases. If you prompt it to generate customer support responses, it generates the support responses that GPT-4o would write — not necessarily the responses your domain requires. The generated examples look plausible but may not reflect your actual requirements.

The second is the model collapse risk. Research published at ACL 2026 documented what happens when you iteratively fine-tune models on synthetic data: the output distribution narrows. The model produces outputs that cluster around the most common patterns in the synthetic data.

Iterative fine-tuning on the model's own outputs gradually strips away the capability to handle unusual inputs. Using a single source model for all your synthetic data accelerates this collapse.

The mitigation: use multiple frontier models to generate synthetic data. GPT-4o and Claude Opus generating different examples from the same prompts produce more diverse data than either alone. Multi-source synthetic data significantly mitigates distribution collapse compared to data from a single source.

The ratio that works: If you have 200 real examples and need 2,000 total, generate 1,800 synthetic examples. Have a human review a random sample of 100 to 200 of the synthetic examples against your quality standard. Keep only those that pass. Aim for a real-to-synthetic ratio of no less than 1:10, that is, at least one real example for every ten synthetic ones. Over-weighting synthetic tokens beyond this ratio hurts performance due to diversity deterioration.
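The arithmetic behind the ratio can be written down as a small planning helper. This is a sketch: the function name and the returned keys are invented for illustration, and the default of ten synthetic examples per real one simply encodes the 1:10 guideline above.

```python
def plan_synthetic(n_real, target_total, max_synthetic_per_real=10):
    """Work out how many synthetic examples to generate without exceeding the ratio cap."""
    needed = max(target_total - n_real, 0)
    cap = n_real * max_synthetic_per_real
    return {
        "synthetic_to_generate": min(needed, cap),
        # Examples you still lack after hitting the cap; fill these with real data.
        "shortfall": max(needed - cap, 0),
        "within_ratio": needed <= cap,
    }


plan = plan_synthetic(n_real=200, target_total=2000)
print(plan)  # {'synthetic_to_generate': 1800, 'shortfall': 0, 'within_ratio': True}
```

With only 100 real examples the same target would leave a shortfall of 900, which is a signal to collect more real data rather than to generate more synthetic data.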

Source 3 — Human-Labelled Examples (Expensive, Highest Quality)

Humans writing or reviewing examples against a clear rubric produce the highest-quality training data. It is also the most expensive and slowest approach.

The economics: professional annotators with domain expertise cost $15 to $50 per example for complex tasks. 500 examples at $30 each is $15,000, more than the compute cost of training the model.

This is appropriate when the task is high-stakes and the quality requirements are very high. It is not appropriate when the task is well-defined enough that synthetic data can approximate it.

The efficiency shortcut: instead of having humans write examples from scratch, have them review and correct synthetic examples. A human who can review and mark synthetic examples as pass, fix, or reject works five to ten times faster than a human writing examples from scratch. The output quality approaches human-written. The cost approaches synthetic.
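The pass, fix, or reject workflow could be applied to a dataset roughly as follows. The record fields ("id", "output") and the shape of the verdicts mapping are assumptions made for this sketch, not a format the article prescribes.

```python
def apply_review(examples, verdicts):
    """Keep passed examples, substitute corrected outputs for fixed ones, drop the rest.

    examples: list of dicts with "id" and "output" keys (hypothetical schema).
    verdicts: dict mapping id -> (decision, corrected_output_or_None).
    Examples with no verdict are treated as rejected, so nothing unreviewed slips in.
    """
    kept = []
    for example in examples:
        decision, corrected = verdicts.get(example["id"], ("reject", None))
        if decision == "pass":
            kept.append(example)
        elif decision == "fix":
            if not corrected:
                raise ValueError(f"'fix' verdict for {example['id']} has no corrected output")
            kept.append({**example, "output": corrected})
        elif decision != "reject":
            raise ValueError(f"unknown decision: {decision!r}")
    return kept


synthetic = [
    {"id": "a", "input": "reset pw", "output": "Go to Settings."},
    {"id": "b", "input": "refund?", "output": "No refunds ever."},
    {"id": "c", "input": "2fa help", "output": "Open the app."},
]
verdicts = {"a": ("pass", None), "b": ("fix", "Refunds are available within 30 days.")}
reviewed = apply_review(synthetic, verdicts)
```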

Source 4 — Distillation From a Larger Model (Your Secret Weapon)

Model distillation means using a larger, more capable model as the teacher and training your smaller model to reproduce its outputs. You prompt the large model with your production inputs and capture its responses as training data. Then you fine-tune your smaller model on those input-output pairs.

This approach sits between synthetic generation and real user data. You are generating data from a strong model's actual outputs on realistic inputs, rather than having a model generate hypothetical examples. The resulting data tends to be of higher quality than purely synthetic generation and easier to produce than human labelling.

The critical requirement: the teacher model must be genuinely better at the task than your target model. Distilling from a model that is only marginally better produces marginal improvement.
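A minimal sketch of the capture step, assuming the teacher is reachable through some function you supply. The `teacher` callable here is a placeholder, not a real client library; in practice it would wrap whatever API your larger model exposes.

```python
import json


def distill(inputs, teacher, system_prompt):
    """Pair each production input with the teacher's answer as a chat-style training record.

    teacher: any callable (system_prompt, user_text) -> str. Hypothetical interface.
    """
    records = []
    for user_text in inputs:
        answer = teacher(system_prompt, user_text)
        if not answer or not answer.strip():
            continue  # skip empty teacher outputs instead of training on them
        records.append({
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_text},
                {"role": "assistant", "content": answer.strip()},
            ]
        })
    return records


def to_jsonl(records):
    """Serialise records as one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in teacher for demonstration only.
def fake_teacher(system_prompt, user_text):
    return "" if user_text == "???" else f"Answer to: {user_text}"


pairs = distill(["reset pw", "???"], fake_teacher, "You are a customer support agent.")
```

The teacher's outputs still deserve the same human review as any other synthetic data before they enter the training set.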

The Three Data Formats — Pick One and Never Mix

Every fine-tuning framework accepts training data in a specific format. The three you will encounter most often are ChatML, ShareGPT, and Alpaca.

These are not interchangeable. Mixing formats in the same dataset confuses the model: it sees different structural patterns and cannot learn a consistent mapping between input and output. Pick one format before you collect a single example and use it for everything.

ChatML — the format most directly matching how modern chat models are trained:

{
  "messages": [
    {"role": "system", "content": "You are a customer support agent..."},
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "To reset your password, go to..."}
  ]
}

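Use ChatML when you are fine-tuning a chat or instruction model. The role-and-content messages layout shown here is accepted by most fine-tuning frameworks, which render it with each model's own chat template. Qwen3 uses ChatML tokens natively, while Llama 3, Mistral, and Gemma define their own templates.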

ShareGPT — a multi-turn conversation format with a different structure:

{
  "conversations": [
    {"from": "human", "value": "How do I reset my password?"},
    {"from": "gpt", "value": "To reset your password, go to..."}
  ]
}

ShareGPT is useful for multi-turn conversations where you want to capture the full exchange rather than just single turns. The Unsloth and FastChat ecosystems expect this format.
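If existing data arrives in ShareGPT form but you have standardised on the messages layout, a small converter keeps the dataset in one format. The role mapping below follows the common ShareGPT convention (human, gpt, system); treat it as an assumption and check it against your own files.

```python
# Common ShareGPT speaker labels mapped to chat roles (assumed convention).
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record):
    """Convert one ShareGPT record into the role/content messages layout."""
    messages = []
    for turn in record["conversations"]:
        speaker = turn["from"]
        if speaker not in ROLE_MAP:
            raise ValueError(f"unexpected speaker label: {speaker!r}")
        messages.append({"role": ROLE_MAP[speaker], "content": turn["value"]})
    return {"messages": messages}


converted = sharegpt_to_messages({
    "conversations": [
        {"from": "human", "value": "How do I reset my password?"},
        {"from": "gpt", "value": "To reset your password, go to..."},
    ]
})
```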

Alpaca — a simpler single-turn instruction format:

{
  "instruction": "Extract the key obligations from this contract...",
  "input": "[contract text here]",
  "output": "Key obligations: 1. Delivery within 30 days..."
}

Alpaca suits single-turn structured tasks — extraction, classification, transformation. No system prompt required. Simpler to produce and validate.
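The "pick one format" rule can be checked mechanically before training. This sketch infers the format of each JSONL record from its top-level keys, as they appear in the three samples above, and refuses a file that mixes formats. The function names are illustrative.

```python
import json


def detect_format(record):
    """Guess the dataset format of one record from its top-level keys."""
    if "messages" in record:
        return "chatml"
    if "conversations" in record:
        return "sharegpt"
    if "instruction" in record and "output" in record:
        return "alpaca"
    return "unknown"


def check_single_format(jsonl_text):
    """Return the one format used in the file, or raise if formats are mixed or unknown."""
    lines_by_format = {}
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        fmt = detect_format(json.loads(line))
        lines_by_format.setdefault(fmt, []).append(line_no)
    if len(lines_by_format) != 1 or "unknown" in lines_by_format:
        raise ValueError(f"expected exactly one known format, found: {lines_by_format}")
    return next(iter(lines_by_format))


clean = '{"messages": []}\n{"messages": []}'
mixed = '{"messages": []}\n{"instruction": "x", "input": "", "output": "y"}'
print(check_single_format(clean))  # chatml
```

Running this as the first step of a training script is cheap, and the error message reports which line numbers belong to which format.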

The format rule: the format you train on must match the format you use at inference time. If you train in ChatML with a system prompt, you must include the same system prompt structure at inference. Mismatch between training and inference format is one of the most common causes of unexpectedly poor results from an otherwise well-trained model.

The Production Distribution Gap: Why Your Model Fails With Real Users

You wrote clean, clear, well-structured training examples. You tested the model against other clean, clear examples. It worked well. You deployed. Users sent messy, abbreviated, context-free inputs and the model failed.

This is the production distribution gap. It is the most common failure mode for fine-tuned models, and it is almost entirely preventable.

Training distribution describes the types of inputs in your training data. Production distribution describes the types of inputs real users actually send. These are almost never the same. Training examples are written by people who know what the system needs.

Production inputs are written by people who do not know or care what the system needs.

The differences are concrete and consistent across every domain.

Real user inputs are shorter. They omit context that seems obvious to the writer. They contain typos, abbreviations, and informal language. They arrive in fragments — a word, a partial sentence, a screenshot described in three words.

If none of your training examples look like this, your model has never learned to handle it.

The fix is straightforward but requires deliberate effort:

Include intentionally messy examples. Write training examples that mirror how users actually communicate, not how you wish they would communicate.
Collect from existing interfaces. If you have a search bar, a chat interface, or any form users interact with, sample real inputs from it — even if you have to generate the expected outputs manually.
Include the edge cases you are afraid of. Ambiguous requests. Requests that should be declined. Requests in the wrong domain. A model that has never seen these in training will handle them unpredictably.
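When no real inputs are available yet, clean examples can be roughened programmatically as a stopgap. The sketch below lowercases, strips trailing punctuation, swaps in abbreviations, and drops one letter to imitate a typo. The abbreviation table is invented for illustration, and this kind of noise is a crude stand-in for sampled real user inputs, not a replacement for them.

```python
import random

# Hypothetical abbreviation table; replace with what your users actually type.
ABBREVIATIONS = {"password": "pw", "please": "pls", "account": "acct", "because": "bc"}


def messify(text, rng):
    """Return a rougher variant of a clean input: lowercase, abbreviated, one typo."""
    words = text.lower().rstrip("?.!").split()
    words = [ABBREVIATIONS.get(word, word) for word in words]
    # Simulate a typo by deleting one interior letter from a word longer than 3 letters.
    candidates = [i for i, word in enumerate(words) if len(word) > 3]
    if candidates:
        i = rng.choice(candidates)
        j = rng.randrange(1, len(words[i]) - 1)
        words[i] = words[i][:j] + words[i][j + 1:]
    return " ".join(words)


rng = random.Random(0)  # seeded so the augmented dataset is reproducible
messy = messify("Please reset my password?", rng)
```

Pair each roughened input with the same expected output as its clean original, so the model learns that both phrasings deserve the same answer.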

The Data Quality Filter — Five Questions Before You Train

Run this review on a random sample of 50 examples before starting any training run. An afternoon of human review catches most quality problems that automated tools miss.

For each example, answer these five questions:

Is the output something I would be proud to show a user? Not "is it acceptable" — proud. Any example below that standard is a vote against the quality you are trying to teach.
Does the output format exactly match what the model should produce in production? Missing fields, extra fields, wrong punctuation conventions, inconsistent capitalisation — the model will learn these too.
Would this example mislead the model about what the task requires? An example that is correct in isolation but inconsistent with other examples in the dataset teaches the model that inconsistency is acceptable.
Does this example represent an input type the model will actually see in production, or only the ideal version? Training examples that are all perfectly formed teach the model to fail on realistic inputs.
If the model learned only this example and forgot everything else, would it be a better model or a worse one? This is the single most clarifying question in data quality review.

Any example that fails two or more of these tests should be removed or corrected before training.
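The review can be kept honest with a little tooling: draw the sample at random rather than by hand, and apply the two-failure rule uniformly. The question keys below are shorthand labels invented for this sketch.

```python
import random

# Shorthand labels for the five review questions (names are illustrative).
QUESTIONS = (
    "proud_to_show",
    "format_matches",
    "not_misleading",
    "realistic_input",
    "improves_model_alone",
)


def draw_review_sample(examples, k=50, seed=0):
    """Pick up to k examples uniformly at random for human review."""
    rng = random.Random(seed)
    return rng.sample(examples, min(k, len(examples)))


def needs_action(answers, max_failures=1):
    """True if the example fails two or more questions (a missing answer counts as a failure).

    answers: dict mapping question label -> bool, where True means the example passes.
    """
    failures = sum(1 for question in QUESTIONS if not answers.get(question, False))
    return failures > max_failures


sample = draw_review_sample(list(range(1200)))
one_failure = {q: True for q in QUESTIONS} | {"format_matches": False}
two_failures = one_failure | {"realistic_input": False}
```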

The Model Collapse Warning — What Happens When Your Synthetic Data Is Too Similar?

ACL 2026 research documented something that matters for every team building a data pipeline with synthetic generation.

When you fine-tune a model on synthetic data generated by a single source (one model, one style, one pattern), two things happen. Output quality as measured by standard benchmarks remains acceptable. Adversarial robustness, the model's ability to handle unusual, out-of-distribution, or challenging inputs, decreases significantly.

More concerning: iterative fine-tuning on a model's own outputs creates a narrowing effect. Each generation of synthetic data looks a bit more like the previous generation. The base model has lower perplexity on the outputs — they are becoming more predictable — but the long tail of capability is being lost.

This is model collapse: a model that performs well on the common case and catastrophically on the unusual case.

The practical implication for your data pipeline:

Do not generate all your synthetic data from a single model. Use at least two frontier models with different generation prompts. The resulting diversity in synthetic examples significantly mitigates collapse and produces better models at the same data volume.

Do not fine-tune repeatedly on your own model's outputs unless you have strong evaluation coverage of edge cases. Each iteration where your model's outputs become your training data is a step toward a narrower, more brittle model.
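A cheap way to watch for narrowing is to track a lexical diversity score across generations of synthetic data. The sketch below computes distinct-n, the share of n-grams that are unique. It is a widely used rough proxy, not necessarily the measure used in the cited research, and whitespace tokenisation is a simplifying assumption.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams across all texts that are unique (1.0 means no repetition)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of length n over the tokens.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


repetitive = ["to reset your password", "to reset your password"]
varied = ["to reset your password", "open settings and choose security"]
print(distinct_n(repetitive))  # 0.5
print(distinct_n(varied))      # 1.0
```

A score that falls from one batch of synthetic data to the next is a prompt to add another source model or more real examples before retraining.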

The Data Flywheel — Building a Pipeline That Gets Better Over Time

Every team that builds a fine-tuned model faces the same reality three months later: the inputs have drifted, the model's performance has degraded, and the training data is stale.

The teams that handle this well do one thing differently from the start. They instrument their production systems to capture what the model cannot handle.

The minimum viable data flywheel has four components:

Capture. Log every production inference — input, output, and any available signal of correctness. A user copying the output is a positive signal. A user immediately submitting the same query again is a negative signal. User corrections are the strongest signal of all.
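The Capture step might start as one structured log line per inference. The field names below are assumptions chosen to mirror the three signals just described; adapt them to whatever your interface can actually observe.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class InferenceLog:
    """One production inference plus the correctness signals available for it."""
    input: str
    output: str
    copied: bool = False                   # positive signal: user copied the output
    resubmitted: bool = False              # negative signal: same query sent again
    user_correction: Optional[str] = None  # strongest signal: user supplied a fix
    timestamp: float = field(default_factory=time.time)

    def needs_review(self):
        """Flag records that should go into the weekly labelling queue."""
        return self.resubmitted or self.user_correction is not None


def to_jsonl_line(log):
    """Serialise one log record as a single JSON line."""
    return json.dumps(asdict(log), ensure_ascii=False)


good = InferenceLog(input="reset pw", output="Go to Settings.", copied=True)
bad = InferenceLog(input="refund?", output="No refunds.", user_correction="30-day refunds apply.")
```

Appending these lines to a file or queue gives the Label step a ready-made list of real inputs the model struggled with.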

Label. Once a week, a human reviews the flagged low-confidence or user-corrected outputs and creates correct examples. This produces the highest-quality training data available: real inputs the model struggled with, paired with the correct answer.

Retrain. Quarterly, or when performance metrics drop below a threshold, retrain on the accumulated examples. Each retraining cycle includes real-world edge cases from the previous quarter. The model improves on exactly the cases it previously failed on.

Evaluate. After each retraining, and before you deploy the new model, run your fixed evaluation set: the same 100 to 200 examples you used before the first deployment. The score on this fixed set should increase each cycle. If it does not, the new training data is introducing a regression.
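The fixed-set check can be expressed as a simple gate. This sketch assumes exact-match scoring and a `model_fn` callable that maps an input string to an output string. Both are simplifications: most real tasks need a task-specific metric, and the function names are illustrative.

```python
def exact_match_score(model_fn, eval_set):
    """Share of fixed evaluation examples the model answers exactly right.

    eval_set: list of dicts with "input" and "expected" keys (hypothetical schema).
    """
    correct = sum(
        1 for example in eval_set
        if model_fn(example["input"]).strip() == example["expected"].strip()
    )
    return correct / len(eval_set)


def passes_gate(previous_score, new_score):
    """Allow deployment only if the retrained model beats the previous score."""
    return new_score > previous_score


eval_set = [
    {"input": "reset pw", "expected": "Go to Settings."},
    {"input": "refund?", "expected": "Refunds within 30 days."},
]
old_answers = {"reset pw": "Go to Settings.", "refund?": "No refunds."}
new_answers = {"reset pw": "Go to Settings. ", "refund?": "Refunds within 30 days."}

old_score = exact_match_score(old_answers.get, eval_set)
new_score = exact_match_score(new_answers.get, eval_set)
print(old_score, new_score, passes_gate(old_score, new_score))  # 0.5 1.0 True
```

Keeping the evaluation set frozen is what makes scores comparable across cycles, so new production examples should go into training data, not into this set.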

The compound effect is significant. Teams that start this flywheel from day one have dramatically better models at the one-year mark than teams that treat data as a one-time collection problem. The model improves on real failures rather than hypothetical ones.

The One Thing to Do Before You Collect a Single Example

Write the quality rubric first.

Before you open a spreadsheet, before you write a prompt to generate synthetic data, before you export support tickets, write down in one paragraph what a perfect training example looks like for your task. What does the output format look like exactly? What tone? What level of detail? What does the model do when the input is ambiguous?

Then write down what a bad example looks like. Too long. Wrong format. Inconsistent with company policy. Answers the wrong question.

Pin both descriptions to the wall. Every example that enters your dataset must match the first description and avoid the second. Every human reviewer must have read both before they review a single example.

The rubric takes 30 minutes to write. The time it saves in review, retraining, and debugging is measured in weeks.

What's Next?

Article 3, the next in this series, will cover the training side: which fine-tuning techniques to use, how to configure the training run, and how to read the loss curves so you know when to stop. But the technique will only take you as far as your data lets it. The ceiling of your fine-tuned model is set by the quality of your training data, not the sophistication of your training configuration.

Build Your Own AI is a series for engineers and technical leaders building custom AI systems — from first decision to production deployment. @pramodchandrayan on Medium.

Article 1 covers the decision framework: when to use prompting, RAG, fine-tuning, or train from scratch. In Article 3 we will cover fine-tuning techniques: LoRA vs QLoRA vs full fine-tuning, and how to read training metrics.

Sources: ACL 2026: Synthetic Eggs in Many Baskets — The Impact of Synthetic Data Diversity on LLM Fine-Tuning (arXiv:2511.01490). arXiv:2410.15226 — On the Diversity of Synthetic Data and its Impact on Training LLMs. Databricks: A Practical Guide to LLM Fine-Tuning. Spheron Blog: How to Fine-Tune LLMs in 2026 (March 2026). SitePoint: Fine-Tune Local LLMs 2026 (March 2026). Label Your Data: Synthetic Data vs Real Data for ML Training (March 2026). AI in Plain English: High-Quality Dataset Curation for LLM Fine-Tuning (October 2025).
