You don’t always need the most powerful model. Almost every provider offers three tiers, and picking the right tier matters more than picking the right brand.
The Three Tiers
Almost every major provider offers roughly the same three tiers. The names change, but the pattern is universal: a flagship model that is the smartest and most expensive, a balanced mid-tier workhorse, and a small, fast model cheap enough to run at high volume.
Understanding Token Pricing
Every provider charges per million tokens. A token is roughly ¾ of a word — so 1 million tokens ≈ 750,000 words.
Pricing has two separate rates:
- Input tokens — what you send to the model (your prompt, documents, context)
- Output tokens — what the model generates back (usually 3–4× more expensive than input)
Two additional details that matter once you’re building:
- Reasoning tokens — some models spend tokens internally working through a problem before producing an answer. These are billed as output tokens and add to your bill.
- Cache hits — if you’re sending the same large context repeatedly (e.g. a system prompt or a long document), providers can cache it. A cache hit often costs only a fraction of the standard input price (around 10% with some providers) — a big saving at scale.
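A rough sketch of how these rates combine for a single request (the per-million prices and the 10% cache rate below are placeholder assumptions, not any provider’s actual pricing):

```python
# Rough cost model for one request. All prices are placeholders,
# not any particular provider's real rates.
def request_cost(
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int = 0,
    cached_tokens: int = 0,
    input_price: float = 3.00,     # USD per million input tokens (assumed)
    output_price: float = 15.00,   # USD per million output tokens (assumed)
    cache_discount: float = 0.10,  # cache hits billed at ~10% of the input rate (assumed)
) -> float:
    million = 1_000_000
    fresh_input = input_tokens - cached_tokens        # billed at the full input rate
    cost = fresh_input / million * input_price
    cost += cached_tokens / million * input_price * cache_discount
    # Reasoning tokens are billed as output, on top of the visible answer.
    cost += (output_tokens + reasoning_tokens) / million * output_price
    return cost

# Example: a 40K-token cached system prompt, 2K tokens of new input,
# 1K tokens of visible answer and 3K tokens of internal reasoning.
print(f"${request_cost(42_000, 1_000, reasoning_tokens=3_000, cached_tokens=40_000):.4f}")
```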
Subscription plans are heavily subsidized by providers to drive adoption; at ordinary day-to-day usage volumes, the same usage billed through the API can cost significantly more. Great if you’re an individual user — but know the difference when you’re building a product.
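As a back-of-envelope illustration (the plan price, per-token rates, and usage pattern below are all assumptions, not quotes from any provider):

```python
# Flat subscription vs. pay-per-token API, with illustrative numbers only.
PLAN_PRICE = 20.00      # USD per month, assumed flat subscription
INPUT_PRICE = 3.00      # USD per million input tokens (assumed)
OUTPUT_PRICE = 15.00    # USD per million output tokens (assumed)

chats_per_day = 60
input_tokens_per_chat = 2_000    # prompt plus conversation context
output_tokens_per_chat = 500

monthly_input = chats_per_day * input_tokens_per_chat * 30
monthly_output = chats_per_day * output_tokens_per_chat * 30

api_cost = (monthly_input * INPUT_PRICE + monthly_output * OUTPUT_PRICE) / 1_000_000
print(f"API: ${api_cost:.2f}/month  vs  subscription: ${PLAN_PRICE:.2f}/month")
```

With these assumed numbers the API comes out around $24/month against a $20 plan; change the usage pattern and the comparison flips, which is exactly why it’s worth running your own numbers.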
How to Read LLM Benchmarks
Benchmarks are standardized tests that score models on specific tasks. You’ll see them quoted everywhere in reviews and announcements. A few you’ll encounter:
- MMLU — tests general knowledge across 57 subjects (law, medicine, history, math…)
- HumanEval / MBPP — measures coding ability: can the model write correct programs?
- GPQA — graduate-level science questions designed to be “Google-proof”: hard to answer even with a search engine at hand
- MATH — competition-level math problem solving
- Chatbot Arena (LMSYS) — real humans compare two model responses and vote; the resulting Elo rating reflects actual human preference at scale
Benchmarks are a starting point, not a verdict. Models are increasingly trained on benchmark-adjacent data, inflating scores. More importantly, a model that tops the MMLU leaderboard may still fumble your specific task. The best evaluation is always your own: take 10–20 real examples from your use case and run them through competing models. What you observe beats any published number.
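A minimal way to run that comparison, sketched here against an OpenAI-compatible chat API (the model names and prompts are placeholders; swap in whichever SDKs and models you’re actually comparing):

```python
# Run the same real-world prompts through competing models and compare side by side.
# Assumes the `openai` Python package and an OpenAI-compatible endpoint;
# model names below are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["model-a-placeholder", "model-b-placeholder"]

# 10-20 real examples from your own use case, not benchmark questions.
examples = [
    "Summarize this support ticket in one sentence: ...",
    "Extract the invoice total from the text below: ...",
]

def ask(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for prompt in examples:
    print(f"\n=== {prompt[:60]} ===")
    for model in MODELS:
        print(f"\n[{model}]\n{ask(model, prompt)}")
```

At this scale you don’t need a scoring harness: reading the outputs next to each other is usually enough to see which model handles your task better.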