LLM pricing models decoded: tokens, context, fine-tuning
Understanding LLM costs starts with tokens. Tokens are chunks of text (≈4 characters, or about ¾ of a word). Pricing is quoted in dollars per token volume: historically per 1,000 tokens, now usually per 1 million, which is the convention used throughout this article.
Three main factors drive costs:
- Input tokens: What you send to the model (prompt + context).
- Output tokens: What the model generates.
- Context length: Longer context = more tokens = higher cost.
Additional pricing dimensions:
- Function calling: Structured outputs can increase tokens consumed.
- Embeddings: Needed for search, RAG, and classification.
- Fine-tuning: One-time training + higher per-token cost afterward.
Bottom line: total cost = (input tokens × input price) + (output tokens × output price), with prices quoted per 1M tokens.
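That formula is easy to sanity-check in code. A minimal Python sketch, with model prices passed in as parameters (quoted per 1M tokens):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of a single request; prices are USD per 1M tokens."""
    return (input_tokens * input_price
            + output_tokens * output_price) / 1_000_000

# 500 input + 200 output tokens at $0.50 / $1.50 per 1M tokens
print(request_cost(500, 200, 0.50, 1.50))  # 0.00055
```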
OpenAI pricing breakdown with real examples
OpenAI provides both low-cost models (GPT-3.5 Turbo) and premium models (the GPT-4 family, including GPT-4o).

GPT-4 vs GPT-3.5: when the price difference is worth it
- GPT-3.5 Turbo (16k context): $0.50 / 1M input tokens, $1.50 / 1M output tokens.
- GPT-4o (128k context): $2.50 / 1M input tokens, $10 / 1M output tokens.
- GPT-4 Turbo with vision adds multimodal input, but pricing scales similarly.
Example (Customer Support):
- Prompt: 500 tokens (input), Response: 200 tokens (output).
- Cost per conversation:
  - GPT-3.5: (500 × $0.50 + 200 × $1.50) / 1M ≈ $0.00055
  - GPT-4o: (500 × $2.50 + 200 × $10) / 1M ≈ $0.00325
GPT-4o is ~6× more expensive per conversation but provides more reliable reasoning. Worth it when higher accuracy means fewer escalations.
Hidden costs: function calling, embeddings, moderation
- Function calling: Adds extra tokens for structured JSON output.
- Embeddings: text-embedding-3-small costs $0.02 per 1M tokens. Cheap, but it adds up for large datasets.
- Moderation API: Free, but counts toward token usage in pipelines.
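To see how embedding costs "add up", here is a rough estimator built on the ≈4-characters-per-token heuristic from the top of this article (the corpus size in the example is hypothetical):

```python
EMBEDDING_PRICE = 0.02  # USD per 1M tokens (text-embedding-3-small)

def estimate_embedding_cost(total_chars: int,
                            chars_per_token: float = 4.0) -> float:
    """Approximate embedding cost from a raw character count."""
    tokens = total_chars / chars_per_token
    return tokens / 1_000_000 * EMBEDDING_PRICE

# Hypothetical corpus: 100,000 documents averaging 4,000 characters each
print(f"${estimate_embedding_cost(100_000 * 4_000):.2f}")  # $2.00
```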
Anthropic Claude: pricing and positioning vs OpenAI
Anthropic positions Claude 3 models as safer and more “steerable.”
- Claude 3 Haiku (200k context): $0.25 input / $1.25 output per 1M tokens.
- Claude 3 Sonnet (200k context): $3 input / $15 output.
- Claude 3 Opus (200k context): $15 input / $75 output.
Compared to OpenAI:
- Haiku ≈ GPT-3.5 in cost.
- Sonnet ≈ GPT-4 Turbo in cost/performance.
- Opus ≈ premium GPT-4 for heavy reasoning tasks.
Claude’s 200k context window is a differentiator, making it ideal for legal, research, or document-heavy workflows.
Local models: infrastructure costs vs APIs
Running LLMs locally (or self-hosted in the cloud) avoids per-token API fees but shifts costs to hardware + ops.
Hardware requirements for Llama 2/3, Code Llama
- Llama-2 7B: Needs ~16GB GPU VRAM.
- Llama-2 13B: Needs ~24–32GB VRAM.
- Llama-2 70B: Needs ~4×80GB GPUs (A100s or H100s).
- Code Llama models follow similar patterns.
With 8-bit or 4-bit quantization these requirements drop by half or more; even so, for small teams only 7B–13B models are practical on single GPUs.
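The VRAM figures above follow from a simple rule: weights take roughly (parameters × bytes per parameter), plus overhead for activations and the KV cache. A back-of-the-envelope sketch, where the 25% overhead factor is an assumption rather than a measured value:

```python
def vram_gb(params_billion: float, bits: int = 16,
            overhead: float = 1.25) -> float:
    """Rough VRAM needed to serve a model: weight size times overhead."""
    weight_gb = params_billion * bits / 8  # 1B params at 8 bits ~= 1 GB
    return weight_gb * overhead

for size in (7, 13, 70):
    print(f"Llama-2 {size}B: ~{vram_gb(size):.0f} GB at 16-bit, "
          f"~{vram_gb(size, bits=4):.0f} GB at 4-bit")
```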
Cost breakdown: GPU cloud vs on-premise
- Cloud GPU (A100 80GB): ~$2–3/hour → ~$1,500–2,000/month if always on.
- On-prem A100/H100 servers: $15k–$30k per card upfront, plus power + cooling.
- Optimized hosting (Lambda Labs, RunPod, Modal): Pay-per-use, but still $0.50–$2/hour depending on GPU.
Rule of thumb: Local models are cheaper only if you run consistently at scale. For ad-hoc tasks, API calls remain more cost-efficient.
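One way to apply that rule of thumb: compute the monthly token volume where an always-on GPU starts beating API pricing. The numbers below are illustrative midpoints from this section, not quotes, and ops labor is ignored:

```python
GPU_MONTHLY = 1800.0    # always-on cloud A100, midpoint of ~$1,500-2,000
API_PRICE_PER_M = 1.00  # assumed blended input/output price, USD per 1M tokens

breakeven_m_tokens = GPU_MONTHLY / API_PRICE_PER_M
print(f"Break-even: ~{breakeven_m_tokens:,.0f}M tokens/month")  # ~1,800M
# Below ~1.8B tokens/month, paying per token is cheaper than the GPU.
```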

Practical calculator: typical use cases
Customer support chatbot (1000 conversations/day)
- Avg conversation: 5 turns × 700 tokens each (500 in, 200 out).
- Daily tokens: 2.5M in + 1M out.
Costs/month (30 days):
- GPT-3.5: ~$83
- GPT-4o: ~$490
- Claude 3 Haiku: ~$56
- Local 13B (cloud GPU): ~$600+ infra
Content generation (100 articles/month)
- Avg article: 1500 input tokens + 1200 output tokens.
- Total/month: 150k in + 120k out.
Costs/month:
- GPT-3.5: <$1
- GPT-4o: ~$1.60
- Claude Sonnet: ~$2.25
- Local 13B: negligible per run, but infra costs still apply.
Code assistance (10-person dev team)
- Avg dev: 50 prompts/day × 800 tokens (600 in, 200 out).
- Monthly total (30 days): ~9M in + 3M out.
Costs/month:
- GPT-3.5: ~$9
- GPT-4o: ~$53
- Claude Sonnet: ~$72
- Local GPU (cloud): ~$1,500
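All three scenarios can be reproduced with one short script, using the per-request formula from earlier. Prices are the per-1M figures quoted above; monthly volumes come from the scenario assumptions:

```python
PRICES = {  # (input, output), USD per 1M tokens
    "gpt-3.5-turbo":   (0.50, 1.50),
    "gpt-4o":          (2.50, 10.00),
    "claude-3-haiku":  (0.25, 1.25),
    "claude-3-sonnet": (3.00, 15.00),
}

SCENARIOS = {  # monthly (input, output) tokens, in millions
    "support bot": (75.00, 30.00),  # 1,000 convs/day x 5 turns, 30 days
    "content gen": (0.15, 0.12),    # 100 articles/month
    "code assist": (9.00, 3.00),    # 10 devs x 50 prompts/day, 30 days
}

for scenario, (tokens_in, tokens_out) in SCENARIOS.items():
    for model, (price_in, price_out) in PRICES.items():
        cost = tokens_in * price_in + tokens_out * price_out
        print(f"{scenario:12s} {model:16s} ${cost:>8,.2f}/month")
```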
Cost optimization strategies
- Mix models: Use GPT-3.5/Claude Haiku for easy queries; upgrade to GPT-4/Claude Opus for hard cases (see the routing sketch after this list).
- Cache responses: Store frequent answers to cut API calls.
- Tune context length: Don’t send 10k tokens if 1k is enough.
- Batch embeddings: Lower per-call costs by chunking efficiently.
- Hybrid pipeline: Retrieval + smaller model for recall, larger model only for synthesis.
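A minimal sketch of the first two strategies combined, routing plus caching. The length-based difficulty heuristic and the `call_model` stub are placeholders for your own logic and API client, not a real SDK:

```python
import hashlib

_cache: dict[str, str] = {}

def call_model(model: str, prompt: str) -> str:
    """Stub standing in for a real API call (OpenAI, Anthropic, etc.)."""
    return f"[{model}] answer"

def answer(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:          # cache hit: zero API spend
        return _cache[key]
    # Naive routing: long, context-heavy prompts go to the premium model.
    model = "gpt-4o" if len(prompt) > 2_000 else "gpt-3.5-turbo"
    _cache[key] = call_model(model, prompt)
    return _cache[key]

print(answer("What are your support hours?"))  # routed to the cheap model
```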
Hybrid approaches: when to combine APIs and local
- APIs for high-quality reasoning, customer-facing accuracy.
- Local models for private data, cost control, or continuous workloads.
- Best of both: Local embeddings + API generation; or local small model + API fallback for tough queries.
ROI analysis: how much AI can justify in your budget
Rule of thumb: AI should save or earn at least 5× its cost.
- A $500/month chatbot is justified if it saves 20+ support hours.
- A $150/month code assistant is justified if it accelerates developer output by even 5%.
- Running local GPUs at $2k/month only makes sense if workload is consistent and mission-critical.
Without clear ROI, cheaper API-first strategies are safer for small to mid-size teams.
Key takeaways
- OpenAI: Strong balance of cost and ecosystem. GPT-3.5 is extremely cheap for most workloads.
- Anthropic Claude: Best for large-context use cases and safety-sensitive tasks.
- Local models: Only cost-effective if workloads are massive and continuous.
Careful math and ROI framing are essential. Many teams overpay because they never measure token usage, or because they underuse smaller models.
FAQs
Is GPT-4 always worth it?
No. For FAQs and simple tasks, GPT-3.5 or Claude Haiku are cheaper and good enough.
Do local models save money?
Only if you run them 24/7 at scale. Cloud GPU costs add up quickly.
Which provider is best for long documents?
Claude (200k context window).