For PMs approving GPU infrastructure spend. How to choose between H100, A100, and L40S; size GPU memory correctly; and mix reserved, on-demand, and spot capacity for sustainable economics.
Match GPU to workload: H100 for high-volume training and frontier serving, A100 for general production, L40S for cost-optimized inference, L4 for low-volume and edge serving. Size memory to model + KV cache + 50% headroom. Mix capacity (60 to 80% reserved baseline + on-demand bursts + spot for batch). Cloud wins below 4 GPUs steady-state; on-prem wins above 32. Plan to re-evaluate annually as new GPU generations shift the matrix.
GPU planning is the single largest cost decision in most AI infrastructure projects. The decisions you make early — which hardware tier, how many GPUs, on-prem vs cloud, reserved vs on-demand — determine cost trajectories for years.
Match the GPU to the workload, not to prestige or what your engineers know. H100 is the flagship but A100 still works well for most production training. L40S is often the right inference GPU at lower cost. L4 fits low-volume workloads and edge servers.
Size memory to model + KV cache + 50% headroom. Mix capacity types (60 to 80% reserved baseline, plus on-demand for bursts, plus spot for batch jobs). Below 4 GPUs steady-state, cloud almost always wins. Above 32 GPUs sustained, on-prem typically wins on TCO.
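To make the sizing rule concrete, here is a minimal back-of-envelope sketch. The architecture figures (layer count, KV heads, head dimension) are illustrative, roughly matching an 8B-class model with grouped-query attention; substitute your own model's config, context length, and expected concurrency before trusting the output.

```python
def gpu_memory_gb(params_b, n_layers, n_kv_heads, head_dim,
                  context_len, concurrent_requests, bytes_per_param=2):
    """Estimate serving memory in GB: weights + KV cache + 50% headroom."""
    weights = params_b * 1e9 * bytes_per_param                             # model weights
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_param  # K and V
    kv_cache = kv_per_token * context_len * concurrent_requests
    return (weights + kv_cache) * 1.5 / 1e9                                # +50% headroom

# ~8B model, 32 layers, 8 KV heads, 4k context, 16 concurrent requests (illustrative)
print(f"{gpu_memory_gb(8, 32, 8, 128, 4096, 16):.1f} GB")  # ~37 GB -> fits a 48GB L40S
```

The same arithmetic at 70B scale is what pushes you onto 80GB-class cards, quantization, or multiple GPUs.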
| Your situation | Recommendation | Why |
|---|---|---|
| Production training, frontier-class models | H100 | FP8 support, fastest training, multi-GPU NVLink |
| Production training, mid-sized models (7B to 70B) | A100 | Workhorse generation; mature tooling; cost-effective for training |
| Inference only, 7B to 70B models | L40S or A100 | L40S is often 30 to 40% cheaper than A100 for inference |
| Low-volume inference, smaller models | L4 | Lower cost, lower power, sufficient below roughly 1,000 requests per minute |
| Embeddings or classification at high throughput | A10 or T4 | Older generations still cheap and effective |
| Edge GPU server (latency-critical, on-site) | L4 | Best small-form-factor option |
| Memory-constrained 70B+ deployment | H100 with FP8 quantization | 80GB single GPU vs needing multi-A100 |
| Multi-tenant LoRA serving | A100 80GB or L40S | Memory headroom for multiple adapters |
| Below 4 GPUs steady-state | Cloud | On-prem ops cost exceeds savings at this scale |
| 4 to 32 GPUs, mixed steady and burst | Hybrid (on-prem baseline + cloud burst) | Best cost-flexibility balance |
| Above 32 GPUs sustained demand | On-prem with operational capacity | TCO 3 to 5x lower than cloud reserved over 3 years |
| Compliance-driven workloads | On-prem or sovereign cloud regardless of cost | Often regulatory requirement |
A SaaS company processes about 60M tokens per day across customer-facing AI features. They want to migrate from frontier APIs to self-hosted to reduce costs.
The right setup: 2x L40S GPUs running fine-tuned Llama 3.1 8B. Reserved 1-year capacity on AWS at about $1,500 per GPU per month. Total infrastructure cost: about $3,000 per month. Frontier API alternative was costing $18,000 per month. Net savings: $15,000 per month, $180K per year.
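A quick sanity check of those numbers as a sketch; the blended API rate (about $10 per million tokens) is backed out from the stated $18,000 per month, so treat every rate here as illustrative rather than a quote.

```python
# Monthly cost comparison for this case (all rates illustrative).
tokens_per_day = 60e6
api_rate_per_m = 10.0                 # $ per 1M tokens, blended (assumed from the case)
api_monthly = tokens_per_day / 1e6 * api_rate_per_m * 30

gpus = 2
reserved_per_gpu_month = 1_500        # 1-yr reserved L40S, $/GPU/month
self_hosted_monthly = gpus * reserved_per_gpu_month

print(f"API ${api_monthly:,.0f}/mo vs self-hosted ${self_hosted_monthly:,.0f}/mo "
      f"-> saves ${api_monthly - self_hosted_monthly:,.0f}/mo")
```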
What worked: choosing L40S (not A100). For their inference-only workload, L40S delivered comparable throughput at 60% of A100 cost. The team didn't pay for training-grade hardware they wouldn't use.
What they nearly got wrong: defaulting to A100 because it's the "standard." For pure inference at this scale, L40S is the better fit. The team's engineer suggested A100 out of familiarity; the cost analysis caught it.
What to remember: don't default to the GPU your team knows. Match the hardware to the workload. Inference deployments often benefit from L40S over A100.
An enterprise wants to bring 60% of their AI workloads on-prem for compliance and cost reasons. Steady-state demand: 200 GPUs.
The right setup: H100 cluster on-prem with NVLink-connected nodes (8x H100 per node, 25 nodes). Capital cost: about $7M. 3-year operational cost (power, cooling, ops staff): about $4M. Total 3-year TCO: about $11M.
Cloud equivalent: 200x H100 reserved on AWS at about $5/hour each. 3-year cost: about $26M. Net savings: $15M over 3 years.
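The same comparison as a small sketch. The capex, opex, and reserved-rate inputs are the rough figures from this case, not vendor quotes; swap in your own numbers.

```python
HOURS_3Y = 24 * 365 * 3          # ~26,280 hours per GPU over 3 years

def on_prem_tco(gpu_count, capex_per_gpu=35_000, opex_3y=4e6):
    # ~$7M capital for 200 H100s (nodes, NVLink, networking) plus ~$4M
    # for power, cooling, colocation, and ops staff over 3 years
    return gpu_count * capex_per_gpu + opex_3y

def cloud_tco(gpu_count, reserved_rate=5.0):
    # reserved $/GPU-hour, billed for the full term
    return gpu_count * reserved_rate * HOURS_3Y

print(f"on-prem ~${on_prem_tco(200)/1e6:.0f}M vs cloud ~${cloud_tco(200)/1e6:.0f}M over 3 years")
```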
What worked: doing the math honestly. The analysis included not just hardware cost but operational overhead, power, cooling, ops staff, and the cost of the colocation facility. On-prem still won decisively.
What they nearly got wrong: ignoring the operational ramp-up time. On-prem GPU procurement and deployment took 4 months. The team had to rent cloud capacity in the interim, which added about $1M in transitional cost.
What to remember: above 32 GPUs sustained, on-prem typically wins on TCO. Plan for the procurement timeline; cloud bridges the gap during transition.
A content platform has spiky traffic: 15M tokens per day average, 4x that at peak. They want efficient capacity management.
The right setup: 60% reserved 1-year A100 capacity for steady-state baseline, 30% on-demand A100 capacity for predictable peaks (mornings, evenings), 10% spot A100 capacity for batch processing (overnight content generation). Combined cost: about $8,500 per month.
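The portfolio cost is easy to model. The GPU counts and hourly rates below are illustrative, chosen to land near the case's combined figure rather than taken from it; the point is the structure of the calculation.

```python
HOURS_PER_MONTH = 730

def portfolio_monthly_cost(tiers):
    # each tier: (gpu_count, $/GPU-hour, hours used per month)
    return sum(count * rate * hours for count, rate, hours in tiers)

tiers = [
    (4, 2.00, HOURS_PER_MONTH),   # reserved A100 baseline, always on
    (3, 4.00, 6 * 30),            # on-demand A100 for ~6 peak hours/day
    (2, 1.20, 8 * 30),            # spot A100 for overnight batch jobs
]
print(f"${portfolio_monthly_cost(tiers):,.0f}/month")  # ~$8,600
```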
What worked: capacity right-sizing across three tiers. Reserved instances at the baseline got 50% off list price. Spot for batch saved another 70% on that workload. On-demand for predictable peaks was the most expensive but only used when needed.
What they nearly got wrong: provisioning for peak everywhere. A pure-reserved peak-sized cluster would have wasted 70% of capacity during off-peak. Pure on-demand would have cost 2x reserved. The mix was the win.
What to remember: capacity is a portfolio decision, not a single choice. Mix reserved (steady), on-demand (peaks), and spot (batch) for cost-efficient operations.
What it looks like: GPU choice driven by recency rather than workload fit.
Why it's wrong: H100 is the most expensive GPU. For inference-only workloads or smaller models, A100 or L40S is often 50% cheaper at comparable performance.
How to redirect: ask "what specific feature of H100 do we need?" If the answer is "FP8 support" and you're not using FP8, the choice is wrong. If the answer is "we'll need it eventually," the purchase can wait until the need is actual.
What it looks like: cloud-only deployment without TCO analysis at scale.
Why it's wrong: above 32 GPUs of sustained demand, on-prem TCO is typically 3 to 5x lower than cloud reserved. The flexibility of cloud has real value but it's quantifiable; many teams overestimate it.
How to redirect: do the 3-year TCO analysis. Include ops staff, power, cooling, and procurement. If cloud still wins, fine. Often it doesn't.
What it looks like: capacity strategy driven by avoiding lock-in.
Why it's wrong: reserved capacity is typically 30 to 60% cheaper than on-demand. For predictable steady-state workloads, the lock-in is fine because you'd pay for the capacity anyway.
How to redirect: forecast usage at 12 months. The portion that's predictably steady should be reserved. Stay on-demand for genuinely unpredictable workloads.
What it looks like: capacity planning sized to maximum expected load.
Why it's wrong: capacity sized for peak sits 60 to 80% idle during normal load. The cost is real and avoidable.
How to redirect: provision for steady-state load with reserved capacity, layer on-demand or spot for peaks. The hybrid pattern saves substantially.
What it looks like: avoiding spot capacity due to interruption risk.
Why it's wrong: spot is the right choice for genuinely interruptible workloads. Batch jobs, fine-tuning runs, and overflow handling are all spot-eligible. The 50 to 70% cost savings compound.
How to redirect: classify workloads by interruption tolerance. Anything that can be retried or checkpointed is spot-eligible. The implementation work is one-time; the savings are recurring.
Despite the TCO math, cloud is still the right call when steady-state demand is small (below 4 GPUs), when the team has no operational capacity to run and maintain hardware, when demand is genuinely unpredictable, or when a procurement timeline would delay launch.
In these cases, paying the cloud premium is worth it. Revisit when scale or operations capacity changes.
Realistic 2026 cloud GPU costs (rates vary by provider and region):
| GPU | On-demand $/hr | 1-year reserved $/hr | 3-year reserved $/hr | Best for |
|---|---|---|---|---|
| H100 80GB | $5 to $10 | $3 to $6 | $2.50 to $5 | Frontier training, high-volume serving |
| A100 80GB | $3 to $5 | $1.50 to $3 | $1.20 to $2 | General production |
| L40S | $1.50 to $3 | $0.90 to $2 | $0.75 to $1.50 | Cost-optimized inference |
| L4 | $0.60 to $1.20 | $0.40 to $0.80 | $0.30 to $0.60 | Low-volume serving, edge |
| A10 / T4 | $0.40 to $0.90 | $0.25 to $0.60 | $0.20 to $0.45 | Embeddings, classification |
Reserved 1-year breaks even at about 6 to 9 months of utilization. Reserved 3-year wins for long-term predictable workloads. Spot is 50 to 70% off on-demand for interruptible workloads.
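A quick way to sanity-check the break-even rule against your own quotes. The rates below are mid-range examples from the table above, not a specific provider's pricing.

```python
def breakeven_months(reserved_rate, on_demand_rate, term_months=12):
    # a reservation is billed for the full term; it pays off once actual usage
    # at on-demand rates would have cost more than the reservation
    return term_months * reserved_rate / on_demand_rate

for gpu, reserved, on_demand in [("H100", 4.50, 7.50),
                                 ("A100", 2.25, 4.00),
                                 ("L40S", 1.40, 2.25)]:
    print(f"{gpu}: reserve if you'll use it more than "
          f"{breakeven_months(reserved, on_demand):.1f} months of the year")
# -> roughly 6 to 9 months across the board, matching the rule of thumb
```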
Match GPU to workload, not to prestige. H100 for training and frontier serving. A100 for general production. L40S for cost-optimized inference. L4 for low volume.
Mix capacity: reserved baseline (60 to 80%), on-demand for peaks, spot for batch. Below 4 GPUs steady-state, cloud wins; above 32 GPUs sustained, on-prem wins on TCO.
Re-evaluate annually. New GPU generations (H200, B100) will shift the matrix; today's right answer won't be next year's. Treat the re-evaluation as a recurring exercise, not a one-time decision.