Solutions/AI Model Fine-Tuning, Deployment & Evaluation Systems

Fine-Tuning FundamentalsUpdated 8 May 2026

Choosing Between Fine-Tuning Approaches: LoRA, QLoRA, and Full

For PMs reviewing a fine-tuning proposal. Which approach is right for your project, what each one costs, anti-patterns to watch for, and three worked examples.

How do you decide between LoRA, QLoRA, and full fine-tuning?

For roughly 95% of production projects, LoRA is the right default. It delivers comparable quality to full fine-tuning at about 1% of the compute cost and 3x the iteration speed. QLoRA is the same idea, plus a memory trick that lets you fit large models on a single GPU. Full fine-tuning is rarely worth its cost; treat any proposal that calls for it as an exception that needs documented evidence.

If You Remember Nothing Else

For roughly 95% of production fine-tuning projects, the right approach is LoRA. It costs about 1% of what full fine-tuning costs, runs in hours instead of days, and produces models with comparable quality on real production tasks. QLoRA is the same idea plus a memory trick for fitting large models on a single GPU. Full fine-tuning is the exception that needs documented evidence, not the default.

If you're approving a budget request and the proposal calls for full fine-tuning, your first question is "what specific quality limit did LoRA hit?" If your team can't answer that with a benchmark number, the proposal isn't ready to approve.

Recommendations by Situation

What's right for your project depends on your specific shape. Find the row that matches you.

Your situation	What to do	Why
Startup, small dataset (under 5K examples), tight runway	LoRA on a 7B model	Cheapest path to a working model; iteration speed lets you fail fast; small dataset overfits less with LoRA
Enterprise with rich production data (50K+ examples) and ML capacity	LoRA on a 7B or 13B model	Quality is fine; the savings let you run experiments instead of one expensive bet
Need to ship a domain copilot in 6 weeks	LoRA with a stable base model (Llama 3.1 8B)	Iteration speed is the constraint; LoRA gives you 5x to 10x more experiments per week
Multi-tenant SaaS with per-customer fine-tuning	LoRA, keep adapters separate (don't merge)	One base model serves dozens of tenants; dramatic infrastructure cost savings
Memory-constrained (single GPU available, want to use a 70B model)	QLoRA	Only practical option below a multi-GPU budget
Latency budget under 100ms p99, willing to sacrifice some quality	LoRA on a small model (Phi-3 or Llama 3.1 8B)	Smaller is faster; LoRA preserves the speed advantage
Compliance-sensitive, on-prem only	LoRA or QLoRA on Llama or Mistral	Adapter weights are tiny and easy to audit; full fine-tuning produces large model copies that are harder to track
Multilingual workload (especially Asian languages)	LoRA on Qwen 2.5, not Llama	Base model choice matters here; LoRA approach still applies
Worried about overfitting	LoRA	Restricted parameter space is a built-in safety belt against overfitting
Scientific or research-grade quality is the goal	Full fine-tuning, with explicit evidence	Acceptable if cost isn't the binding constraint and you have a measurable quality target
Quality has plateaued on LoRA after 30+ experiments	Full fine-tuning, after diagnosing why LoRA plateaued	Sometimes the dataset is the issue, not the method; check before doubling cost
Specific benchmark gap LoRA can't close	Full fine-tuning, justified by the gap measurement	Document the gap explicitly; treat it as the exception

Worked Examples

Example 1: Customer support copilot for a fintech (5,000 labeled tickets)

A mid-stage fintech wants to build an internal copilot for their support team. They have 5,000 historical tickets with high-quality agent responses. Privacy rules mean the model has to run on their infrastructure.

The right approach: LoRA on Llama 3.1 8B. Total project budget: about $300 in compute (running 30 experiments at $5 to $15 each). Timeline: 3 weeks from dataset preparation to first production-quality model.

What they nearly got wrong: their first proposal called for full fine-tuning on a 70B model "for the best possible quality." That would have been $25,000 in compute, a multi-GPU cluster they didn't have, and a model too large for their inference setup. The smaller 8B model was sufficient because the task is narrow (support ticket responses, not open-ended writing).

What to remember: when the task is narrow and the dataset is moderate, the smaller model with LoRA is almost always the right call.

Example 2: Domain expert assistant for a legal team (15,000 examples)

An enterprise legal team wants an AI assistant that handles 12 distinct practice areas (contract review, IP, employment, and so on). Each area has different terminology and rules.

The right approach: LoRA on Llama 3.1 70B (with QLoRA so it fits on a single H100), with separate LoRA adapters for each practice area. Inference uses the same base model but loads the appropriate adapter per query. Total budget: about $1,200 in compute. Timeline: 6 weeks.

Why QLoRA: the base model needs to be 70B because legal reasoning is harder than support replies. QLoRA lets the team fit it on a single GPU instead of needing a multi-GPU cluster, which their cloud budget didn't support.

Why separate adapters: 12 fully fine-tuned 70B models would have been 12x the storage and 12x the GPU memory at inference. Separate LoRA adapters are tiny (about 50MB each) and swap in milliseconds. The architectural savings are larger than the training savings.

What to remember: when you have multiple variants of the same task, LoRA's small adapter files unlock multi-tenant economics that full fine-tuning can't match.

Example 3: Internal copilot replacement for a proprietary system (50,000 examples, multi-language)

A logistics company has been using a frontier API for an internal copilot. Costs hit $80,000 per month at scale. They have 50,000 high-quality examples of past usage and want to bring this in-house.

The right approach: QLoRA on Qwen 2.5 32B (the workload is multilingual, including Mandarin). Total fine-tuning budget: about $4,000 in compute. Timeline: 8 weeks including evaluation and rollout. Self-hosted inference on a 2x A100 instance: about $4,500 per month. Net savings: about $75,000 per month.

Why this is more involved: a real production migration from a frontier API to a self-hosted model has to validate quality at parity, build the inference and serving infrastructure, plan rollback, and run a phased rollout. The fine-tuning is maybe 30% of the project.

What to remember: at scale, fine-tuning + self-hosting beats frontier APIs decisively on cost. The fine-tuning approach itself (LoRA, QLoRA) is rarely the hard part; the operational work around it is.

Anti-Patterns to Watch For

When you're reviewing a fine-tuning proposal, these are the specific patterns that should make you push back.

"We'll do full fine-tuning to get the best quality"

What it looks like: a proposal that defaults to full fine-tuning without comparing against LoRA. Often paired with high compute budgets and multi-GPU cluster requirements.

Why it's wrong: 95% of production tasks see no measurable quality difference between LoRA and full fine-tuning. The team is guessing, not measuring.

How to redirect: ask for a 1-day comparison. Run a small LoRA fine-tune on a sample of the data, evaluate against the same target. If LoRA hits the quality bar, the project ships at 1% of the cost. If it doesn't, the team has the evidence they need.

"We need GPUs because we're doing fine-tuning"

What it looks like: requests for multi-GPU clusters or H100 instances for projects on small models with small datasets.

Why it's wrong: most fine-tuning workloads at LoRA economics fit on a single A100 or even a consumer GPU. The team is sizing infrastructure for full fine-tuning when LoRA is the actual approach.

How to redirect: ask "what's the actual GPU memory needed for a LoRA run on this model?" The answer for a 7B model is usually 24GB or less. A 70B model with QLoRA fits on a single 80GB GPU. Multi-GPU is rarely necessary below 70B.

"We'll experiment with hyperparameters"

What it looks like: vague iteration plans with no defined experiment count or per-experiment cost.

Why it's wrong: hyperparameter tuning is a real activity but it should be bounded. Without a budget, "experimenting" can consume 100+ runs without converging.

How to redirect: ask the team to commit to a number of experiments (typically 10 to 30) and a per-experiment cost. This forces explicit thinking about which hyperparameters matter and which don't.

"Let's start with full fine-tuning, fall back to LoRA if it's too expensive"

What it looks like: ambitious-sounding plans that frame full fine-tuning as the primary path.

Why it's wrong: the order is backwards. LoRA should be the primary path; full fine-tuning is the fallback if LoRA's quality is genuinely insufficient (which is rare).

How to redirect: invert the framing. "Let's start with LoRA. We'll only consider full fine-tuning if a measured benchmark on the test set shows LoRA falls short by more than X points."

"We'll skip the validation set"

What it looks like: project plans that don't carve out a clear test set, or plans that retrain on the test set after results come in.

Why it's wrong: without a held-out test set the team can never tell if the model actually improved. Reported quality numbers will mislead you.

How to redirect: insist on a locked test set carved out before training starts. The test set never moves; the team can never train on it. Reported quality always comes from this locked set.

When NOT to Fine-Tune at All

A meaningful fraction of "we should fine-tune" proposals are wrong. The honest answer for some workloads is don't fine-tune. Specific situations:

Your data updates frequently and the model needs current knowledge. Fine-tuning bakes in training-time facts. Use RAG (retrieval) instead so the model can reference current data.
You have under 500 high-quality examples. Fine-tuning needs more than this; the model overfits or fails to generalize. Stronger prompts on a frontier API typically beat a weak fine-tune.
The task is open-ended generation (long-form writing, creative tasks). Fine-tuning works less well here than on narrow tasks.
The team doesn't have evaluation infrastructure in place. Fine-tuning without measurement is wishful thinking; build the eval first.
Your unit economics depend on the cost-per-token of frontier APIs being too high. Fine-tuning + self-hosting only pays back at sustained high volume; below 10M tokens per day, the savings rarely justify the engineering investment.

If any of these apply, the right move is RAG, prompt engineering, a stronger base model, or simply not building this feature yet. A "no, fine-tuning isn't the right tool here" answer can save a project from a costly detour.

What to Ask Your Engineering Team

The short list of questions for the architecture review:

Why are you proposing full fine-tuning instead of LoRA? If the answer is "it's the standard approach" or "for best quality," push back; both are weak answers.
What's our model size and dataset size? Below 30B parameters and below 500K examples, LoRA is almost certainly the right call.
How many training experiments do you expect to run? Multiply by per-experiment cost to get the real project budget.
How will we validate that the chosen approach is good enough before scaling? A small benchmark in week 1 saves weeks of misdirection.
What's the deployment story? If multi-tenant or multiple variants, keeping LoRA adapters separate dramatically simplifies the architecture.
What's the fallback plan if quality is weaker than expected? "More compute" is rarely the answer; usually the dataset needs work.
Have we built the evaluation harness yet? If not, fine-tuning is premature.

Cost & Timeline Quick Reference

Realistic ranges for the most common project shapes:

Project shape	Approach	Compute budget	Timeline
7B model, 5K examples, single use case	LoRA on 1x A100	$50 to $300	2 to 4 weeks
7B model, 50K examples, single use case	LoRA on 1x A100	$200 to $1,000	4 to 6 weeks
13B to 30B model, 50K examples, complex domain	LoRA on 1x to 2x A100	$500 to $2,500	6 to 8 weeks
70B model, 50K examples, hard task	QLoRA on 1x H100	$2,000 to $8,000	8 to 12 weeks
7B to 13B model, multi-tenant (10+ variants)	LoRA per tenant on shared base	$1,000 to $5,000 total	8 to 12 weeks
Production frontier API replacement, 100K+ examples	QLoRA + self-hosted serving	$5,000 to $25,000 + serving infra	12 to 20 weeks
Research-grade quality, no cost ceiling	Full fine-tuning on multi-GPU	$20,000 to $200,000	12 to 24 weeks

The Bottom Line

LoRA is the right call for the large majority of production fine-tuning projects. QLoRA when memory is the binding constraint. Full fine-tuning only when there's documented benchmark evidence that the additional cost earns its keep, which is rare.

If your team's proposal calls for anything other than this, the burden is on them to explain why. "Best quality" and "industry standard" aren't reasons; they're hand-waves. Specific benchmark gaps, measured on a representative test set, are the only justification that should clear the budget.

Boolean & Beyond

AI Model Fine-Tuning, Deployment & Evaluation Systems · Updated 8 May 2026

Talk to our team

From guide to production

Need help building this?

Our team has hands-on experience implementing these systems. Book a free architecture call to discuss your specific requirements and get a clear delivery plan.

Book a free consultation Estimate cost

All AI Model Fine-Tuning, Deployment & Evaluation Systems guides