For PMs approving GPU infrastructure spend. How to choose between H100, A100, and L40S; size GPU memory correctly; and mix reserved, on-demand, and spot capacity for sustainable economics.
Match GPU to workload: H100 for high-volume training and frontier serving, A100 for general production, L40S for cost-optimized inference, L4 for low-volume and edge serving. Size memory to model + KV cache + 50% headroom. Mix capacity (60 to 80% reserved baseline + on-demand bursts + spot for batch). Cloud wins below 4 GPUs steady-state; on-prem wins above 32. Plan to re-evaluate annually as new GPU generations shift the matrix.
GPU planning is the single largest cost decision in most AI infrastructure projects. The decisions you make early — which hardware tier, how many GPUs, on-prem vs cloud, reserved vs on-demand — determine cost trajectories for years.
Match the GPU to the workload, not to prestige or what your engineers know. H100 is the flagship but A100 still works well for most production training. L40S is often the right inference GPU at lower cost. L4 fits low-volume workloads and edge servers.
Size memory to model + KV cache + 50% headroom. Mix capacity types (60 to 80% reserved baseline, plus on-demand for bursts, plus spot for batch jobs). Below 4 GPUs steady-state, cloud almost always wins. Above 32 GPUs sustained, on-prem typically wins on TCO.
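To make the sizing rule concrete, here is a minimal back-of-envelope sketch. The architecture figures (layer count, KV heads, head dimension) are illustrative, roughly matching an 8B-class model with grouped-query attention; substitute your own model's config, context length, and expected concurrency before trusting the output.

```python
def gpu_memory_gb(params_b, n_layers, n_kv_heads, head_dim,
                  context_len, concurrent_requests, bytes_per_param=2):
    """Estimate serving memory in GB: weights + KV cache + 50% headroom."""
    weights = params_b * 1e9 * bytes_per_param                             # model weights
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_param  # K and V
    kv_cache = kv_per_token * context_len * concurrent_requests
    return (weights + kv_cache) * 1.5 / 1e9                                # +50% headroom

# ~8B model, 32 layers, 8 KV heads, 4k context, 16 concurrent requests (illustrative)
print(f"{gpu_memory_gb(8, 32, 8, 128, 4096, 16):.1f} GB")  # ~37 GB -> fits a 48GB L40S
```

The same arithmetic at 70B scale is what pushes you onto 80GB-class cards, quantization, or multiple GPUs.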
| Your situation | Recommendation | Why |
|---|---|---|
| Production training, frontier-class models | H100 | FP8 support, fastest training, multi-GPU NVLink |
| Production training, mid-sized models (7B to 70B) | A100 | Workhorse generation; mature tooling; cost-effective for training |
| Inference only, 7B to 70B models | L40S or A100 | L40S is often 30 to 40% cheaper than A100 for inference |
| Low-volume inference, smaller models | L4 | Lower cost, lower power, sufficient below roughly 1,000 requests per minute |
| Embeddings or classification at high throughput | A10 or T4 | Older generations still cheap and effective |
| Edge GPU server (latency-critical, on-site) | L4 | Best small-form-factor option |
| Memory-constrained 70B+ deployment | H100 with FP8 quantization | 80GB single GPU vs needing multi-A100 |
| Multi-tenant LoRA serving | A100 80GB or L40S | Memory headroom for multiple adapters |
| Below 4 GPUs steady-state | Cloud | On-prem ops cost exceeds savings at this scale |
| 4 to 32 GPUs, mixed steady and burst | Hybrid (on-prem baseline + cloud burst) | Best cost-flexibility balance |
| Above 32 GPUs sustained demand | On-prem with operational capacity | TCO 3 to 5x lower than cloud reserved over 3 years |
| Compliance-driven workloads | On-prem or sovereign cloud regardless of cost | Often regulatory requirement |
A SaaS company processes about 60M tokens per day across customer-facing AI features. They want to migrate from frontier APIs to self-hosted to reduce costs.
The right setup: 2x L40S GPUs running fine-tuned Llama 3.1 8B. Reserved 1-year capacity on AWS at about $1,500 per GPU per month. Total infrastructure cost: about $3,000 per month. Frontier API alternative was costing $18,000 per month. Net savings: $15,000 per month, $180K per year.
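A quick sanity check of those numbers as a sketch; the blended API rate (about $10 per million tokens) is backed out from the stated $18,000 per month, so treat every rate here as illustrative rather than a quote.

```python
# Monthly cost comparison for this case (all rates illustrative).
tokens_per_day = 60e6
api_rate_per_m = 10.0                 # $ per 1M tokens, blended (assumed from the case)
api_monthly = tokens_per_day / 1e6 * api_rate_per_m * 30

gpus = 2
reserved_per_gpu_month = 1_500        # 1-yr reserved L40S, $/GPU/month
self_hosted_monthly = gpus * reserved_per_gpu_month

print(f"API ${api_monthly:,.0f}/mo vs self-hosted ${self_hosted_monthly:,.0f}/mo "
      f"-> saves ${api_monthly - self_hosted_monthly:,.0f}/mo")
```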
What worked: choosing L40S (not A100). For their inference-only workload, L40S delivered comparable throughput at 60% of A100 cost. The team didn't pay for training-grade hardware they wouldn't use.
What they nearly got wrong: defaulting to A100 because it's the "standard." For pure inference at this scale, L40S is the better fit. The team's engineer suggested A100 out of familiarity; the cost analysis caught it.
What to remember: don't default to the GPU your team knows. Match the hardware to the workload. Inference deployments often benefit from L40S over A100.
An enterprise wants to bring 60% of their AI workloads on-prem for compliance and cost reasons. Steady-state demand: 200 GPUs.
The right setup: H100 cluster on-prem with NVLink-connected nodes (8x H100 per node, 25 nodes). Capital cost: about $7M. 3-year operational cost (power, cooling, ops staff): about $4M. Total 3-year TCO: about $11M.
Cloud equivalent: 200x H100 reserved on AWS at about $5/hour each. 3-year cost: about $26M. Net savings: $15M over 3 years.
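The same comparison as a small sketch. The capex, opex, and reserved-rate inputs are the rough figures from this case, not vendor quotes; swap in your own numbers.

```python
HOURS_3Y = 24 * 365 * 3          # ~26,280 hours per GPU over 3 years

def on_prem_tco(gpu_count, capex_per_gpu=35_000, opex_3y=4e6):
    # ~$7M capital for 200 H100s (nodes, NVLink, networking) plus ~$4M
    # for power, cooling, colocation, and ops staff over 3 years
    return gpu_count * capex_per_gpu + opex_3y

def cloud_tco(gpu_count, reserved_rate=5.0):
    # reserved $/GPU-hour, billed for the full term
    return gpu_count * reserved_rate * HOURS_3Y

print(f"on-prem ~${on_prem_tco(200)/1e6:.0f}M vs cloud ~${cloud_tco(200)/1e6:.0f}M over 3 years")
```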
What worked: doing the math honestly. The analysis included not just hardware cost but operational overhead, power, cooling, ops staff, and the cost of the colocation facility. On-prem still won decisively.
What they nearly got wrong: ignoring the operational ramp-up time. On-prem GPU procurement and deployment took 4 months. The team had to rent cloud capacity in the interim, which added about $1M in transitional cost.
What to remember: above 32 GPUs sustained, on-prem typically wins on TCO. Plan for the procurement timeline; cloud bridges the gap during transition.
A content platform has spiky traffic: 15M tokens per day average, 4x that at peak. They want efficient capacity management.
The right setup: 60% reserved 1-year A100 capacity for steady-state baseline, 30% on-demand A100 capacity for predictable peaks (mornings, evenings), 10% spot A100 capacity for batch processing (overnight content generation). Combined cost: about $8,500 per month.
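The portfolio cost is easy to model. The GPU counts and hourly rates below are illustrative, chosen to land near the case's combined figure rather than taken from it; the point is the structure of the calculation.

```python
HOURS_PER_MONTH = 730

def portfolio_monthly_cost(tiers):
    # each tier: (gpu_count, $/GPU-hour, hours used per month)
    return sum(count * rate * hours for count, rate, hours in tiers)

tiers = [
    (4, 2.00, HOURS_PER_MONTH),   # reserved A100 baseline, always on
    (3, 4.00, 6 * 30),            # on-demand A100 for ~6 peak hours/day
    (2, 1.20, 8 * 30),            # spot A100 for overnight batch jobs
]
print(f"${portfolio_monthly_cost(tiers):,.0f}/month")  # ~$8,600
```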
What worked: capacity right-sizing across three tiers. Reserved instances at the baseline got 50% off list price. Spot for batch saved another 70% on that workload. On-demand for predictable peaks was the most expensive but only used when needed.
What they nearly got wrong: provisioning for peak everywhere. A pure-reserved peak-sized cluster would have wasted 70% of capacity during off-peak. Pure on-demand would have cost 2x reserved. The mix was the win.
What to remember: capacity is a portfolio decision, not a single choice. Mix reserved (steady), on-demand (peaks), and spot (batch) for cost-efficient operations.
What it looks like: GPU choice driven by recency rather than workload fit.
Why it's wrong: H100 is the most expensive GPU. For inference-only workloads or smaller models, A100 or L40S is often 50% cheaper at comparable performance.
How to redirect: ask "what specific feature of H100 do we need?" If the answer is "FP8 support" and you're not using FP8, the choice is wrong. If the answer is "we'll need it eventually," the purchase can wait until the need is actual.
What it looks like: cloud-only deployment without TCO analysis at scale.
Why it's wrong: above 32 GPUs of sustained demand, on-prem TCO is typically 3 to 5x lower than cloud reserved. The flexibility of cloud has real value but it's quantifiable; many teams overestimate it.
How to redirect: do the 3-year TCO analysis. Include ops staff, power, cooling, and procurement. If cloud still wins, fine. Often it doesn't.
What it looks like: capacity strategy driven by avoiding lock-in.
Why it's wrong: reserved capacity is typically 30 to 60% cheaper than on-demand. For predictable steady-state workloads, the lock-in is fine because you'd pay for the capacity anyway.
How to redirect: forecast usage at 12 months. The portion that's predictably steady should be reserved. Stay on-demand for genuinely unpredictable workloads.
What it looks like: capacity planning sized to maximum expected load.
Why it's wrong: capacity sized for peak sits 60 to 80% idle during normal load. The cost is real and avoidable.
How to redirect: provision for steady-state load with reserved capacity, layer on-demand or spot for peaks. The hybrid pattern saves substantially.
What it looks like: avoiding spot capacity due to interruption risk.
Why it's wrong: spot is the right choice for genuinely interruptible workloads. Batch jobs, fine-tuning runs, and overflow handling are all spot-eligible. The 50 to 70% cost savings compound.
How to redirect: classify workloads by interruption tolerance. Anything that can be retried or checkpointed is spot-eligible. The implementation work is one-time; the savings are recurring.
Despite the TCO math, cloud is still the right call when steady-state demand is small (below 4 GPUs), when the team has no operational capacity to run and maintain hardware, when demand is genuinely unpredictable, or when a procurement timeline would delay launch.
In these cases, paying the cloud premium is worth it. Revisit when scale or operations capacity changes.
Realistic 2026 cloud GPU costs (rates vary by provider and region):
| GPU | On-demand $/hr | 1-year reserved $/hr | 3-year reserved $/hr | Best for |
|---|---|---|---|---|
| H100 80GB | $5 to $10 | $3 to $6 | $2.50 to $5 | Frontier training, high-volume serving |
| A100 80GB | $3 to $5 | $1.50 to $3 | $1.20 to $2 | General production |
| L40S | $1.50 to $3 | $0.90 to $2 | $0.75 to $1.50 | Cost-optimized inference |
| L4 | $0.60 to $1.20 | $0.40 to $0.80 | $0.30 to $0.60 | Low-volume serving, edge |
| A10 / T4 | $0.40 to $0.90 | $0.25 to $0.60 | $0.20 to $0.45 | Embeddings, classification |
Reserved 1-year breaks even at about 6 to 9 months of utilization. Reserved 3-year wins for long-term predictable workloads. Spot is 50 to 70% off on-demand for interruptible workloads.
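A quick way to sanity-check the break-even rule against your own quotes. The rates below are mid-range examples from the table above, not a specific provider's pricing.

```python
def breakeven_months(reserved_rate, on_demand_rate, term_months=12):
    # a reservation is billed for the full term; it pays off once actual usage
    # at on-demand rates would have cost more than the reservation
    return term_months * reserved_rate / on_demand_rate

for gpu, reserved, on_demand in [("H100", 4.50, 7.50),
                                 ("A100", 2.25, 4.00),
                                 ("L40S", 1.40, 2.25)]:
    print(f"{gpu}: reserve if you'll use it more than "
          f"{breakeven_months(reserved, on_demand):.1f} months of the year")
# -> roughly 6 to 9 months across the board, matching the rule of thumb
```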
Match GPU to workload, not to prestige. H100 for training and frontier serving. A100 for general production. L40S for cost-optimized inference. L4 for low volume.
Mix capacity: reserved baseline (60 to 80%), on-demand for peaks, spot for batch. Below 4 GPUs steady-state, cloud wins; above 32 GPUs sustained, on-prem wins on TCO.
Re-evaluate annually. New GPU generations (H200, B100) will shift the matrix; today's right answer won't be next year's. Treat the re-evaluation as a recurring exercise, not a one-time decision.