For PMs reviewing a fine-tuning proposal: which approach is right for your project, what each one costs, anti-patterns to watch for, and three worked examples.
For roughly 95% of production projects, LoRA is the right default. It delivers comparable quality to full fine-tuning at about 1% of the compute cost and 3x the iteration speed. QLoRA is the same idea, plus a memory trick that lets you fit large models on a single GPU. Full fine-tuning is rarely worth its cost; treat any proposal that calls for it as an exception that needs documented evidence.
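The ~1% figure isn't hand-waving; it falls out of the parameter math. LoRA learns two thin matrices in place of each full weight update. A quick sketch in plain Python (the layer width is typical of a 7B to 8B model; the rank is a common starting value, not a tuned one):

```python
# Why LoRA trains ~1% of the parameters: instead of updating a full
# d x d weight matrix W, it learns thin factors B (d x r) and A (r x d)
# and adds B @ A as the update. d below is a typical 8B-model layer width.
d, r = 4096, 16

full_update = d * d              # parameters full fine-tuning touches per matrix
lora_update = d * r + r * d      # parameters LoRA trains instead (B and A)

print(f"LoRA trains {lora_update / full_update:.2%} of this matrix")  # 0.78%
```

The same ratio holds roughly model-wide, which is where the order-of-magnitude compute savings come from.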
If you're approving a budget request and the proposal calls for full fine-tuning, your first question is "what specific quality limit did LoRA hit?" If your team can't answer that with a benchmark number, the proposal isn't ready to approve.
What's right depends on your project's shape: team size, data volume, timeline, and constraints. Find the row that matches you.
| Your situation | What to do | Why |
|---|---|---|
| Startup, small dataset (under 5K examples), tight runway | LoRA on a 7B model | Cheapest path to a working model; iteration speed lets you fail fast; small dataset overfits less with LoRA |
| Enterprise with rich production data (50K+ examples) and ML capacity | LoRA on a 7B or 13B model | Quality is fine; the savings let you run experiments instead of one expensive bet |
| Need to ship a domain copilot in 6 weeks | LoRA with a stable base model (Llama 3.1 8B) | Iteration speed is the constraint; LoRA gives you 5x to 10x more experiments per week |
| Multi-tenant SaaS with per-customer fine-tuning | LoRA, keep adapters separate (don't merge) | One base model serves dozens of tenants; dramatic infrastructure cost savings |
| Memory-constrained (single GPU available, want to use a 70B model) | QLoRA | Only practical option below a multi-GPU budget |
| Latency budget under 100ms p99, willing to sacrifice some quality | LoRA on a small model (Phi-3 or Llama 3.1 8B) | Smaller is faster; LoRA preserves the speed advantage |
| Compliance-sensitive, on-prem only | LoRA or QLoRA on Llama or Mistral | Adapter weights are tiny and easy to audit; full fine-tuning produces large model copies that are harder to track |
| Multilingual workload (especially Asian languages) | LoRA on Qwen 2.5, not Llama | Base model choice matters here; LoRA approach still applies |
| Worried about overfitting | LoRA | The restricted parameter space acts as built-in regularization against overfitting |
| Scientific or research-grade quality is the goal | Full fine-tuning, with explicit evidence | Acceptable if cost isn't the binding constraint and you have a measurable quality target |
| Quality has plateaued on LoRA after 30+ experiments | Full fine-tuning, after diagnosing why LoRA plateaued | Sometimes the dataset is the issue, not the method; check before doubling cost |
| Specific benchmark gap LoRA can't close | Full fine-tuning, justified by the gap measurement | Document the gap explicitly; treat it as the exception |
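The table's core logic compresses into a few lines. An illustrative sketch (`pick_approach` and its thresholds are invented for this example, not a substitute for reading the rows above):

```python
def pick_approach(model_size_b: int, gpus: int,
                  lora_plateau_evidence: bool = False) -> str:
    """Condensed decision logic from the table above (illustrative only).

    model_size_b: target base model size in billions of parameters
    gpus: number of 80GB-class GPUs available
    lora_plateau_evidence: True only with a documented benchmark gap
    """
    if lora_plateau_evidence:
        # Full fine-tuning is the exception that needs documented evidence.
        return "full fine-tuning"
    if model_size_b >= 30 and gpus <= 1:
        # Big model, single GPU: QLoRA's 4-bit base is the only practical fit.
        return "QLoRA"
    return "LoRA"  # the ~95% default

pick_approach(8, 1)           # "LoRA"
pick_approach(70, 1)          # "QLoRA"
pick_approach(70, 8, True)    # "full fine-tuning"
```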
A mid-stage fintech wants to build an internal copilot for their support team. They have 5,000 historical tickets with high-quality agent responses. Privacy rules mean the model has to run on their infrastructure.
The right approach: LoRA on Llama 3.1 8B. Total project budget: about $300 in compute (running 30 experiments at $5 to $15 each). Timeline: 3 weeks from dataset preparation to first production-quality model.
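For reference, the adapter configuration for such a run might look like this with Hugging Face `peft` (a config sketch only; the rank, alpha, and target modules are common starting points, not a tuned recipe for this dataset):

```python
from peft import LoraConfig

# Illustrative adapter config for a Llama 3.1 8B support-ticket fine-tune.
# r and lora_alpha are common defaults, not values tuned for this task.
config = LoraConfig(
    r=16,                     # adapter rank: the main capacity knob
    lora_alpha=32,            # scaling factor, often set to 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# Wrap the base model with peft.get_peft_model(base_model, config) and
# train as usual; only the adapter weights update.
```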
What they nearly got wrong: their first proposal called for full fine-tuning on a 70B model "for the best possible quality." That would have been $25,000 in compute, a multi-GPU cluster they didn't have, and a model too large for their inference setup. The smaller 8B model was sufficient because the task is narrow (support ticket responses, not open-ended writing).
What to remember: when the task is narrow and the dataset is moderate, the smaller model with LoRA is almost always the right call.
An enterprise legal team wants an AI assistant that handles 12 distinct practice areas (contract review, IP, employment, and so on). Each area has different terminology and rules.
The right approach: LoRA on Llama 3.1 70B (with QLoRA so it fits on a single H100), with separate LoRA adapters for each practice area. Inference uses the same base model but loads the appropriate adapter per query. Total budget: about $1,200 in compute. Timeline: 6 weeks.
Why QLoRA: the base model needs to be 70B because legal reasoning is harder than support replies. QLoRA lets the team fit it on a single GPU instead of needing a multi-GPU cluster, which their cloud budget didn't support.
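The memory trick is 4-bit quantization of the frozen base weights. With the `transformers` + `bitsandbytes` stack it's a loading option rather than a separate pipeline (a config sketch; these are the standard NF4 settings from the QLoRA recipe):

```python
import torch
from transformers import BitsAndBytesConfig

# QLoRA = LoRA adapters on top of a 4-bit-quantized frozen base model.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, the QLoRA paper's format
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
)
# Pass quantization_config=bnb to AutoModelForCausalLM.from_pretrained(...)
# and attach LoRA adapters as usual. A 70B base drops from ~140GB of fp16
# weights to ~35GB, which is why it fits a single 80GB GPU.
```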
Why separate adapters: 12 fully fine-tuned 70B models would have been 12x the storage and 12x the GPU memory at inference. Separate LoRA adapters are tiny (about 50MB each) and swap in milliseconds. The architectural savings are larger than the training savings.
What to remember: when you have multiple variants of the same task, LoRA's small adapter files unlock multi-tenant economics that full fine-tuning can't match.
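The storage arithmetic behind that claim, made explicit (approximate sizes; a fp16 70B checkpoint is assumed to be ~140GB):

```python
# One shared base model plus per-tenant adapters vs. 12 full copies.
FULL_70B_GB = 140        # approx. fp16 checkpoint size for a 70B model
ADAPTER_GB = 0.05        # ~50MB per LoRA adapter, as above
tenants = 12

full_ft_total = tenants * FULL_70B_GB                 # 1,680 GB on disk
adapter_total = FULL_70B_GB + tenants * ADAPTER_GB    # 140.6 GB on disk

print(f"{full_ft_total} GB vs {adapter_total:.1f} GB")
```

The same ratio applies to GPU memory at inference, since every tenant shares one resident base model.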
A logistics company has been using a frontier API for an internal copilot. Costs hit $80,000 per month at scale. They have 50,000 high-quality examples of past usage and want to bring this in-house.
The right approach: QLoRA on Qwen 2.5 32B (the workload is multilingual, including Mandarin). Total fine-tuning budget: about $4,000 in compute. Timeline: 8 weeks including evaluation and rollout. Self-hosted inference on a 2x A100 instance: about $4,500 per month. Net savings: about $75,000 per month.
Why this is more involved: a real production migration from a frontier API to a self-hosted model has to validate quality at parity, build the inference and serving infrastructure, plan rollback, and run a phased rollout. The fine-tuning is maybe 30% of the project.
What to remember: at scale, fine-tuning + self-hosting beats frontier APIs decisively on cost. The fine-tuning approach itself (LoRA, QLoRA) is rarely the hard part; the operational work around it is.
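The savings arithmetic, made explicit (all numbers from the example above; the one-time fine-tuning cost amortizes almost immediately):

```python
# Break-even math for replacing the frontier API with self-hosting.
api_cost_per_month = 80_000
serving_cost_per_month = 4_500
one_time_finetune = 4_000          # QLoRA compute from the example above

monthly_savings = api_cost_per_month - serving_cost_per_month   # 75,500/month
breakeven_months = one_time_finetune / monthly_savings          # well under a month

print(f"${monthly_savings:,}/month saved; pays back in {breakeven_months:.2f} months")
```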
When you're reviewing a fine-tuning proposal, these are the specific patterns that should make you push back.
What it looks like: a proposal that defaults to full fine-tuning without comparing against LoRA. Often paired with high compute budgets and multi-GPU cluster requirements.
Why it's wrong: 95% of production tasks see no measurable quality difference between LoRA and full fine-tuning. The team is guessing, not measuring.
How to redirect: ask for a 1-day comparison. Run a small LoRA fine-tune on a sample of the data, evaluate against the same target. If LoRA hits the quality bar, the project ships at 1% of the cost. If it doesn't, the team has the evidence they need.
What it looks like: requests for multi-GPU clusters or H100 instances for projects on small models with small datasets.
Why it's wrong: most fine-tuning workloads at LoRA economics fit on a single A100 or even a consumer GPU. The team is sizing infrastructure for full fine-tuning when LoRA is the actual approach.
How to redirect: ask "what's the actual GPU memory needed for a LoRA run on this model?" The answer for a 7B model is usually 24GB or less. A 70B model with QLoRA fits on a single 80GB GPU. Multi-GPU is rarely necessary below 70B.
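The back-of-envelope check fits in a few lines of Python. This estimates the frozen base weights only, so treat it as a floor, not a budget (real runs add activations, adapter optimizer state, and framework overhead):

```python
def base_weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough GPU memory for the frozen base weights alone, in GB.

    bits_per_weight: 16 for fp16/bf16, 4 for QLoRA's NF4 quantization.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(base_weight_memory_gb(7, 16))    # 7B in fp16:   14.0 GB -> fits a 24GB GPU
print(base_weight_memory_gb(70, 4))    # 70B in 4-bit: 35.0 GB -> fits an 80GB GPU
```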
What it looks like: vague iteration plans with no defined experiment count or per-experiment cost.
Why it's wrong: hyperparameter tuning is a real activity but it should be bounded. Without a budget, "experimenting" can consume 100+ runs without converging.
How to redirect: ask the team to commit to a number of experiments (typically 10 to 30) and a per-experiment cost. This forces explicit thinking about which hyperparameters matter and which don't.
What it looks like: ambitious-sounding plans that frame full fine-tuning as the primary path.
Why it's wrong: the order is backwards. LoRA should be the primary path; full fine-tuning is the fallback if LoRA's quality is genuinely insufficient (which is rare).
How to redirect: invert the framing. "Let's start with LoRA. We'll only consider full fine-tuning if a measured benchmark on the test set shows LoRA falls short by more than X points."
What it looks like: project plans that don't carve out a clear test set, or plans that retrain on the test set after results come in.
Why it's wrong: without a held-out test set the team can never tell if the model actually improved. Reported quality numbers will mislead you.
How to redirect: insist on a locked test set carved out before training starts. The test set never moves; the team can never train on it. Reported quality always comes from this locked set.
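A locked split is easy to enforce mechanically: hash each example's stable ID, so test-set membership is a property of the data, not of a file someone can regenerate. A minimal sketch (`example_id` stands in for whatever stable key your dataset has):

```python
import hashlib

def is_test_example(example_id: str, test_fraction: float = 0.1) -> bool:
    """Deterministic split: the same ID always lands on the same side,
    no matter when or where the split is computed. No RNG, no drift."""
    digest = hashlib.sha256(example_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < test_fraction

# Membership never moves across re-runs, machines, or retraining cycles:
assert is_test_example("ticket-00042") == is_test_example("ticket-00042")
```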
A meaningful fraction of "we should fine-tune" proposals is wrong. The honest answer for some workloads is: don't fine-tune. Watch for these situations:

- The gap is missing or stale knowledge, not behavior. Retrieval (RAG) serves fresh facts at query time; fine-tuning bakes them in, where they go stale.
- A well-crafted prompt with few-shot examples already reaches the quality bar. Prompt engineering is cheaper and reversible.
- The base model simply isn't capable enough for the task. A stronger base model moves the ceiling; fine-tuning a weak one doesn't.
- There's no clean dataset and no way to measure quality. The feature isn't ready to build yet.

If any of these apply, the right move is RAG, prompt engineering, a stronger base model, or simply not building this feature yet. A "no, fine-tuning isn't the right tool here" answer can save a project from a costly detour.
The short list of questions for the architecture review:

1. What specific quality limit did LoRA hit, and on what benchmark?
2. What GPU memory does a LoRA run on this model actually need?
3. How many experiments are budgeted, at what cost per experiment?
4. Where is the locked test set, and when was it carved out?
Realistic ranges for the most common project shapes:
| Project shape | Approach | Compute budget | Timeline |
|---|---|---|---|
| 7B model, 5K examples, single use case | LoRA on 1x A100 | $50 to $300 | 2 to 4 weeks |
| 7B model, 50K examples, single use case | LoRA on 1x A100 | $200 to $1,000 | 4 to 6 weeks |
| 13B to 30B model, 50K examples, complex domain | LoRA on 1x to 2x A100 | $500 to $2,500 | 6 to 8 weeks |
| 70B model, 50K examples, hard task | QLoRA on 1x H100 | $2,000 to $8,000 | 8 to 12 weeks |
| 7B to 13B model, multi-tenant (10+ variants) | LoRA per tenant on shared base | $1,000 to $5,000 total | 8 to 12 weeks |
| Production frontier API replacement, 100K+ examples | QLoRA + self-hosted serving | $5,000 to $25,000 + serving infra | 12 to 20 weeks |
| Research-grade quality, no cost ceiling | Full fine-tuning on multi-GPU | $20,000 to $200,000 | 12 to 24 weeks |
LoRA is the right call for the large majority of production fine-tuning projects. QLoRA when memory is the binding constraint. Full fine-tuning only when there's documented benchmark evidence that the additional cost earns its keep, which is rare.
If your team's proposal calls for anything other than this, the burden is on them to explain why. "Best quality" and "industry standard" aren't reasons; they're hand-waves. Specific benchmark gaps, measured on a representative test set, are the only justification that should clear the budget.
Boolean & Beyond
AI Model Fine-Tuning, Deployment & Evaluation Systems · Updated 8 May 2026