Your Bedrock proof of concept worked. The model was good, the API was clean, you shipped a feature, and users like it.
Then the bill arrived.
It was higher than the back-of-napkin number you used to greenlight the project, and the lines you can see on the invoice do not quite explain why. You are not alone, and the answer is not “Bedrock is expensive.”
The answer is that Bedrock pricing has five or six moving parts that are not obvious until you have lived through one or two billing cycles.
This guide is the one a senior AWS engineer would write for a senior AWS engineer. We walk through how Bedrock actually charges you in mid-2026, where the surprises live, the few levers that move the bill the most, and the times the answer is to not use Bedrock at all.
TL;DR: AWS Bedrock Pricing
How AWS Bedrock Pricing Actually Works (Also Known as Amazon Bedrock Pricing)
Bedrock is usage-based. There is no minimum, no subscription, and no AWS Bedrock free tier you can rely on for production (more on the free question further down). You pay for what you use, but “what you use” splits into several categories worth understanding before you forecast a bill.
Amazon Bedrock pricing (AWS and Amazon refer to the same service interchangeably) breaks down into four core modes, in plain terms:
- On-Demand: Pay per 1,000 input tokens, per 1,000 output tokens, per image, or per second of generated video. No commitment, no minimum, prices vary by model and AWS Region. This is where most teams start.
- Provisioned Throughput: Reserve dedicated capacity for a specific model in model units, billed hourly whether you use the capacity or not. This is what you switch to when usage gets predictable and large.
- Batch Inference: Run asynchronous jobs at 50% off the On-Demand rate. Suitable for jobs where latency does not matter (overnight summarization, document classification, periodic enrichment). For Amazon Nova specifically, AWS also offers a Flex tier at the same 50% discount, plus a Priority tier at approximately 1.75x Standard for mission-critical workloads that need preferential compute allocation.
- Prompt Caching: Cache repeated input context (system prompts, large knowledge snippets, few-shot examples) and pay a fraction of the input-token rate for the cached portion. Savings of up to 90% on the input side for context-heavy applications.
The four modes are not exclusive. A production system often uses three of them at once, and sometimes all four:
- On-Demand for live traffic
- Batch for overnight work
- Prompt Caching on the static parts of every request
- Provisioned Throughput (sometimes) on the model that handles the bulk of the load
Then there are the components that sit on top of the model and have their own pricing: Bedrock Knowledge Bases for RAG, Bedrock Agents for multi-step workflows, Bedrock Flows for visual orchestration, and Bedrock Guardrails for content moderation.
Each has a separate line on the bill.
Quick reference: the four AWS Bedrock pricing modes at a glance.
| Pricing mode | How you pay | Best for | Typical discount vs On-Demand |
|---|---|---|---|
| On-Demand | Per 1,000 input tokens, per 1,000 output tokens, per image, per second of video | Modest, bursty, or unpredictable usage; prototypes and early production | Baseline (no discount) |
| Provisioned Throughput | Hourly per model unit (no-commit, 1-month, or 6-month tiers) | Steady, high-volume, predictable workloads; required for fine-tuned or custom models | Variable; depends on utilization and commitment |
| Batch Inference (or Flex tier) | Same per-token structure, asynchronous | Overnight summarization, periodic enrichment, classification, embedding generation | 50% off |
| Prompt Caching | Cached input tokens billed at a fraction of standard rate | Applications with large static prefixes (system prompts, few-shot examples, fixed knowledge context) | Up to 90% off the cached portion |
On-Demand Pricing: The Default Starting Point
When you first call a Bedrock model through the API, you are on On-Demand. The model is shared multi-tenant, you pay per token, and there is no commitment.
The structure is the same across every text model: you pay separately for input tokens (your prompt, including any retrieved context and the system message) and output tokens (the model’s response). Output tokens generally cost 3x to 5x more than input tokens. This ratio is the single most important fact for cost forecasting, because most teams underweight output cost when they estimate spend.
A quick example using Amazon Nova Pro. As of June 2026 on the official AWS Nova pricing page, Nova Pro On-Demand is $0.80 per 1 million input tokens and $3.20 per 1 million output tokens, which works out to $0.0008 per 1,000 input tokens and $0.0032 per 1,000 output tokens.
So a typical chat exchange of 1,000 input tokens and 500 output tokens costs $0.0008 + $0.0016 = $0.0024 per call. At 100,000 calls per day, that is about $240 per day, or roughly $7,200 per month.
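The arithmetic above is easy to script so you can swap in other models or traffic levels. The following is a minimal sketch, assuming the Nova Pro rates quoted in this section and a 30-day month; the function name `cost_per_call` is illustrative, not part of any AWS SDK.

```python
def cost_per_call(input_tokens, output_tokens, input_per_million, output_per_million):
    """Return the On-Demand cost in USD for a single model call."""
    return (input_tokens * input_per_million
            + output_tokens * output_per_million) / 1_000_000


# Nova Pro On-Demand rates quoted above (USD per 1M tokens).
NOVA_PRO_INPUT = 0.80
NOVA_PRO_OUTPUT = 3.20

per_call = cost_per_call(1_000, 500, NOVA_PRO_INPUT, NOVA_PRO_OUTPUT)
per_day = per_call * 100_000   # 100,000 calls per day
per_month = per_day * 30       # assumes a 30-day month

print(f"per call:  ${per_call:.4f}")
print(f"per day:   ${per_day:,.2f}")
print(f"per month: ${per_month:,.2f}")
```

Note that the 500 output tokens account for two thirds of the per-call cost even though they are a third of the tokens.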
Verified Amazon Nova On-Demand pricing (US East, Standard tier, 2026).
| Nova model | Per 1M input tokens | Per 1M output tokens | Best for |
|---|---|---|---|
| Nova Micro | $0.035 | $0.14 | Simple classification, routing decisions, lightweight extraction |
| Nova Lite | $0.06 | $0.24 | Most chat, summarization, mid-complexity tasks |
| Nova Pro | $0.80 | $3.20 | Reasoning-heavy queries, agentic workflows, multimodal tasks |
| Nova Pro (latency optimized) | $1.00 | $4.00 | Real-time UX where every 100ms matters |
| Nova Premier | $2.50 | $12.50 | Highest-quality Nova for complex multi-step reasoning |
Note: Cache read input tokens are billed at 75% less than the on-demand input token price. Flex tier and Batch tier prices are roughly 50% off the Standard tier rates shown. Nova 2 generation (Nova 2 Lite, Nova 2 Pro, Nova 2 Omni) is available in Preview at different rates. Verify the latest rates on the official AWS Nova pricing page.
Verified Anthropic Claude pricing on Amazon Bedrock (Global Cross-region Inference, US East Ohio, 2026).
| Claude model | Per 1M input | Per 1M output | Batch input (50% off) | Batch output (50% off) | Cache read (~90% off input) |
|---|---|---|---|---|---|
| Claude Opus 4.8 | $5.00 | $25.00 | N/A | N/A | $0.50 |
| Claude Opus 4.7 | $5.00 | $25.00 | N/A | N/A | $0.50 |
| Claude Opus 4.6 | $5.00 | $25.00 | $2.50 | $12.50 | $0.50 |
| Claude Opus 4.5 | $5.00 | $25.00 | $2.50 | $12.50 | $0.50 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $1.50 | $7.50 | $0.30 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $1.50 | $7.50 | $0.30 |
| Claude Sonnet 4 | $3.00 | $15.00 | $1.50 | $7.50 | $0.30 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.50 | $2.50 | $0.10 |
| Claude 3.5 Haiku | $0.80 | $4.00 | N/A | N/A | $0.08 |
| Claude Mythos Preview | non-GA | non-GA | non-GA | non-GA | non-GA |
Geo and In-region cross-region inference rates are slightly higher (e.g., Sonnet 4.6 at $3.30 input / $16.50 output per 1M). Reserved Tier pricing is also available with 1-month and 3-month commitments for stable workloads. Verify the latest rates on the official AWS Bedrock pricing page.
Cost comparison worth knowing: Claude Sonnet 4.6 at $3.00 per 1M input tokens is roughly 4x more expensive than Nova Pro at $0.80, and 50x more expensive than Nova Lite at $0.06. That spread is the whole reason model routing matters. Sending a query to Sonnet 4.6 when Nova Lite would have answered correctly costs you 50x what it should have.
Anthropic’s Claude family on Bedrock follows the same pattern at a higher absolute cost.
- AWS Bedrock Anthropic pricing tracks Anthropic’s published API rates closely, with regional variation.
- Claude Opus 4.8 sits at the very top of the catalog ($5.00 per 1M input tokens, $25.00 per 1M output).
- Claude Sonnet 4.6 is the mainstream high-quality choice ($3.00 per 1M input, $15.00 per 1M output, with a 1M-token context window in preview).
- Claude Haiku 4.5 is the faster, cheaper option ($1.00 per 1M input, $5.00 per 1M output).
- Anthropic also has Mythos Preview available in gated research preview (non-GA, no published prices yet).
- Meta Llama and Mistral models offer alternative price-performance points.
Always verify the current rate on the official AWS Bedrock pricing page before you commit to a budget, because rates move and Region matters.
For embeddings (vector models like Amazon Titan Text Embeddings or Cohere Embed), you pay only on input tokens because the output is a vector, not generated text. Embedding costs are usually small per call, but RAG applications generate a lot of them, so they add up.
For image and video generation, the unit changes. Amazon Nova Canvas charges per image generated, with a higher rate for higher resolution or premium quality. Amazon Nova Reel charges per second of generated video. Stability AI models on Bedrock follow similar per-image patterns.
The pattern across all of these: you are paying for output. The way to cut On-Demand cost is almost always to produce less output, or to produce the same output with a cheaper model.
Provisioned Throughput: When It Pays Off
Provisioned Throughput is the answer for steady, high-volume workloads. You reserve model units (each guaranteeing a known token-per-minute throughput) and you pay an hourly rate whether the capacity is in use or idle.
The trade-off is straightforward. On-Demand is cheaper per token but has variable latency under load and no capacity guarantee.
Provisioned Throughput is more expensive per hour but offers predictable performance and predictable cost. There are three commitment tiers: no-commit (pay hourly, stop any time), 1-month commitment (lower hourly rate), and 6-month commitment (lowest hourly rate).
The honest math on whether to switch:
- If your daily token volume is modest and bursty, stay on On-Demand.
- If you are running a model at high, sustained utilization (think customer-facing chat with consistent traffic, or batch processing that runs nearly all day), price out the Provisioned Throughput rate against your On-Demand bill. The break-even is workload-specific, but as a directional rule, sustained workloads above roughly the equivalent of a few hundred thousand tokens per minute start to look cheaper on Provisioned Throughput.
- If you are using a fine-tuned or imported custom model, you do not have a choice. Custom models on Bedrock require Provisioned Throughput by design, because they cannot be shared in the multi-tenant On-Demand pool.
The mistake to avoid is over-reserving. Buying 6-month commitment capacity for traffic that is not yet stable is the single fastest way to overspend on Bedrock. Most teams should run on On-Demand long enough to see the real traffic shape before they commit.
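One way to run that comparison is to express your On-Demand traffic as a cost per hour and set it against the hourly price of the model units you would need. The sketch below assumes constant traffic and uses a placeholder model-unit rate, because this guide does not quote one; `PLACEHOLDER_RATE` and both function names are illustrative. It also assumes a single model unit can actually serve the traffic, which you would need to confirm for your model.

```python
def on_demand_cost_per_hour(input_tokens_per_min, output_tokens_per_min,
                            input_per_million, output_per_million):
    """On-Demand spend in USD for one hour of constant traffic."""
    per_minute = (input_tokens_per_min * input_per_million
                  + output_tokens_per_min * output_per_million) / 1_000_000
    return per_minute * 60


def provisioned_is_cheaper(on_demand_hourly, model_units, hourly_rate_per_unit,
                           utilization=1.0):
    """Compare average hourly costs.

    utilization is the fraction of hours in which the traffic actually occurs:
    Provisioned Throughput bills every hour, On-Demand only the busy ones.
    """
    return model_units * hourly_rate_per_unit < on_demand_hourly * utilization


# 200k input + 50k output tokens per minute at the Nova Pro rates quoted earlier.
hourly = on_demand_cost_per_hour(200_000, 50_000, 0.80, 3.20)
print(f"On-Demand: ${hourly:.2f}/hour")

# PLACEHOLDER rate, not an AWS price: substitute the real model-unit rate.
PLACEHOLDER_RATE = 15.0
print(provisioned_is_cheaper(hourly, 1, PLACEHOLDER_RATE))                   # traffic runs 24/7
print(provisioned_is_cheaper(hourly, 1, PLACEHOLDER_RATE, utilization=0.5))  # traffic runs half the day
```

The `utilization` argument is the point of the exercise: the same reservation that wins on round-the-clock traffic can lose once the traffic only fills half the day.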
Batch Inference and Prompt Caching: The Two Quiet Levers
Two pricing modes are easy to miss and often move the bill more than the model-choice decision.
Batch Inference runs your prompts asynchronously at half the On-Demand rate (AWS also has a Flex tier at the same 50% discount for supported models, with similar trade-offs).
The use cases are non-interactive:
- Overnight document summarization
- Periodic data enrichment
- Large-scale embedding generation
- End-of-day report generation
- Anything where the request and response do not need to happen in real time
If you can hold a request for hours, you can pay half.
The pattern that works: take the high-volume, latency-tolerant parts of your workload and move them to Batch. The first place to look is anything that runs on a cron schedule. The second is anything where a user kicks off a job and is happy to get an email when it is done.
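Moving a cron job to Batch mostly means writing its prompts to a file instead of calling the model inline. As far as the AWS documentation describes it, batch input is a JSONL file in S3 where each line carries a `recordId` and a `modelInput` in the chosen model's native request format. The sketch below only builds that file content; the record ID scheme and the `example_model_input` body are hypothetical placeholders, so check the current format for your model before using it.

```python
import json


def to_batch_jsonl(documents, build_model_input):
    """Serialize documents into JSONL text, one batch record per line."""
    lines = []
    for index, text in enumerate(documents, start=1):
        record = {
            "recordId": f"REC{index:08d}",           # illustrative ID scheme
            "modelInput": build_model_input(text),   # model-specific request body
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


def example_model_input(text):
    # HYPOTHETICAL body: replace with the native request format of your model.
    return {"task": "summarize", "document": text}


jsonl = to_batch_jsonl(["First report text.", "Second report text."],
                       example_model_input)
print(jsonl, end="")
```

Keeping the body builder as a separate function means the same serializer works if you later switch the batch job to a different model.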
Prompt Caching is the bigger lever for context-heavy applications. If your prompts include a large static prefix (a system prompt, a large set of few-shot examples, a long policy document, a knowledge base context) and the prefix repeats across requests, Bedrock can cache the prefix and bill the cached tokens at a fraction of the standard input-token rate.
Savings on the cached portion of the input can reach 90% for context-heavy applications.
This is the most underused optimization in production Bedrock deployments.
If your application sends the same 5,000-token system prompt with every request, you are probably paying for 5,000 input tokens every call when you could be paying for a tiny fraction of that.
Worth an audit if you are seeing a higher-than-expected input-token bill.
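To size the opportunity before that audit, compare the input-side cost of a call with and without the prefix served from cache. This is a rough sketch under stated assumptions: the Claude Sonnet 4.6 input rate from the table above, a 90% cache-read discount, and a 500-token dynamic portion chosen purely for illustration. It ignores any charge for writing to the cache and any cache expiry, so treat the result as an upper bound on savings.

```python
def input_cost(prefix_tokens, dynamic_tokens, input_per_million,
               cache_read_discount=0.0):
    """Input-side cost of one call in USD.

    cache_read_discount is the fraction taken off the input rate for the
    prefix when it is served from cache (0.0 means no caching).
    """
    prefix_rate = input_per_million * (1 - cache_read_discount)
    return (prefix_tokens * prefix_rate
            + dynamic_tokens * input_per_million) / 1_000_000


SONNET_INPUT = 3.00  # USD per 1M input tokens, from the table above

uncached = input_cost(5_000, 500, SONNET_INPUT)
cached = input_cost(5_000, 500, SONNET_INPUT, cache_read_discount=0.90)
saving = 1 - cached / uncached

print(f"uncached: ${uncached:.4f} per call")
print(f"cached:   ${cached:.4f} per call")
print(f"input-side saving: {saving:.0%}")
```

The saving on the whole input is lower than the 90% headline because the dynamic tokens are still billed at the full rate.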
The Hidden Costs: Knowledge Bases, Agents, Flows, Guardrails
Most teams budget for the model and forget the components that wrap around it. Each has its own line on the bill.
Bedrock Knowledge Bases is the managed RAG capability. The model invocation during retrieval is charged at standard inference rates.
The hidden cost is the vector store: if you use Amazon OpenSearch Serverless as the backing index (the default managed option), expect a baseline of a few hundred dollars per month minimum for the OCU capacity, even at low query volume.
For smaller workloads, alternatives like Amazon Aurora PostgreSQL with pgvector or self-managed OpenSearch can be cheaper but require more setup. There is also the embedding model cost during ingestion, which is paid once per document but recurs whenever you reindex.
Bedrock Agents orchestrate multi-step workflows where the model calls tools, retrieves context, and produces a final answer.
The agent invocation itself does not have a fixed per-call fee; what you pay is the sum of every model call the agent makes along the way.
Agents that make many tool calls and many model invocations per user request can be deceptively expensive if you are only counting the visible user interactions.
Bedrock Flows is the visual orchestration product for chaining prompts, conditions, and tools. The Flows-specific cost is roughly $0.035 per 1,000 visual node transitions, on top of the underlying model invocation costs.
For low-volume use it is trivial; for high-volume production workloads it is worth measuring.
Bedrock Guardrails charges $0.15 per 1,000 text units for content-filter evaluations on text content, plus $0.00075 per image for image content. This is per-evaluation, not per-call: if you run Guardrails on both the input and the output, you pay for both. The text-filter rate is modest per call but adds up at high request volume and large content windows.
None of these are unreasonable on their own. They become a problem when teams stack them all together without measuring each one’s contribution to the final number.
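A per-request breakdown makes each component's contribution visible. The sketch below uses the Flows and Guardrails rates quoted in this section and the Nova Pro token rates from earlier. The token counts, node-transition count, and character counts are made-up sample values, and the text-unit size of 1,000 characters is an assumption to confirm against the AWS pricing page.

```python
import math

FLOWS_PER_1K_TRANSITIONS = 0.035      # USD, as quoted above
GUARDRAILS_PER_1K_TEXT_UNITS = 0.15   # USD, as quoted above
CHARS_PER_TEXT_UNIT = 1_000           # ASSUMPTION: confirm on the pricing page


def model_cost(calls, input_per_million, output_per_million):
    """calls is a list of (input_tokens, output_tokens), one per model invocation."""
    total = sum(i * input_per_million + o * output_per_million for i, o in calls)
    return total / 1_000_000


def flows_cost(node_transitions):
    return node_transitions * FLOWS_PER_1K_TRANSITIONS / 1_000


def guardrails_cost(evaluated_char_counts):
    """One entry per evaluation; input and output are evaluated separately."""
    units = sum(math.ceil(chars / CHARS_PER_TEXT_UNIT) for chars in evaluated_char_counts)
    return units * GUARDRAILS_PER_1K_TEXT_UNITS / 1_000


# One user request handled by an agent that makes four Nova Pro calls (sample values).
agent_calls = [(1_200, 150), (1_800, 200), (2_500, 200), (3_000, 500)]
breakdown = {
    "model calls": model_cost(agent_calls, 0.80, 3.20),
    "flows": flows_cost(6),
    "guardrails": guardrails_cost([2_400, 1_800]),  # input text, then output text
}
for name, usd in breakdown.items():
    print(f"{name:12s} ${usd:.5f}")
print(f"{'total':12s} ${sum(breakdown.values()):.5f}")
```

In this sample the four model calls dominate, which is the usual shape for agents: the user sees one interaction, the bill sees four.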
Hidden costs at a glance.
| Component | What you pay | When it kicks in |
|---|---|---|
| Bedrock Knowledge Bases | Standard inference rates during retrieval, plus the vector store cost (OpenSearch Serverless typically a few hundred dollars per month minimum) | Any RAG application using Bedrock’s managed Knowledge Bases |
| Bedrock Agents | Sum of every model call the agent makes per user request | Multi-step agents that chain tool calls and model invocations |
| Bedrock Flows | $0.035 per 1,000 visual node transitions, on top of underlying model costs | High-volume Flows-based orchestration |
| Bedrock Guardrails (text) | $0.15 per 1,000 text units, charged per evaluation (input and output count separately) | Any production app with content moderation, billed on every evaluated call |
| Bedrock Guardrails (image) | $0.00075 per image processed | Image-content moderation |
| Cross-Region data transfer | Standard AWS egress rates | Multi-Region deployments or data sources in a different Region than the model |
Model Routing: The Single Biggest Optimization Lever
If you read one section of this guide, read this one.
The most effective cost optimization on Bedrock is not switching from On-Demand to Provisioned Throughput. It is not Batch. It is not Prompt Caching. Those are real and worth doing. But none of them move the bill as much as using the right-sized model for each request.
The pattern: send simple requests to a cheap, fast model. Send complex requests to a large, expensive model. Decide which is which programmatically.
A worked example with verified numbers: Suppose your application handles two kinds of queries: short factual lookups (about 70% of traffic) and complex multi-step reasoning (about 30% of traffic).
If you route everything to Claude Sonnet 4.6 ($3.00/M input) or Nova Premier ($2.50/M input), you pay top-tier rates on every call.
If you route the 70% to Claude Haiku 4.5 ($1.00/M input), Nova Lite ($0.06/M input), or even Nova Micro ($0.035/M input) for the simplest tasks, and reserve Sonnet 4.6 or Nova Premier only for the 30% that needs it, you can cut total model spend by 60-80% on the routed portion without changing user-facing quality on the queries that matter.
The implementation is a classifier or a router function that runs before the expensive model call. AWS also offers a native feature called Intelligent Prompt Routing that does this for you between models in the same family (Claude Sonnet 4.6 and Haiku 4.5, Llama 3.3 70B and 3.1 8B, or Nova Pro and Nova Lite).
AWS publishes that it can reduce costs by up to 30% without compromising accuracy. If you want the routing pattern but do not want to build the classifier yourself, this is the no-build option.
The trade-off is real. Routing adds a small amount of latency, a tiny upfront cost for the classifier, and architectural complexity. For low-volume applications it is overkill. For anything with meaningful traffic, it is the single biggest lever you have.
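A router can start far simpler than a trained classifier. The sketch below is one hypothetical heuristic (word count plus a few keyword hints), not the approach AWS Intelligent Prompt Routing uses, and the model names are plain labels rather than Bedrock model IDs. The second function estimates the blended input rate for the 70/30 split in the worked example; it covers input tokens only.

```python
# Plain labels for illustration; these are NOT Bedrock model IDs.
CHEAP_MODEL = "nova-lite"
CAPABLE_MODEL = "claude-sonnet-4.6"

# HYPOTHETICAL heuristic: phrases that suggest multi-step reasoning.
REASONING_HINTS = ("why", "compare", "explain", "step by step", "plan")


def route(query, max_simple_words=30):
    """Pick a model label for a query using length and keyword hints."""
    text = query.lower()
    is_long = len(text.split()) > max_simple_words
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    return CAPABLE_MODEL if (is_long or needs_reasoning) else CHEAP_MODEL


def blended_input_rate(simple_share, cheap_rate, capable_rate):
    """Average USD per 1M input tokens when simple_share of traffic goes cheap."""
    return simple_share * cheap_rate + (1 - simple_share) * capable_rate


print(route("What is the capital of France?"))
print(route("Compare these two contracts and explain the risk step by step"))

# 70% of traffic to Nova Lite ($0.06/M), 30% to Sonnet 4.6 ($3.00/M).
blended = blended_input_rate(0.70, 0.06, 3.00)
print(f"blended: ${blended:.3f}/M vs $3.00/M all-Sonnet "
      f"({1 - blended / 3.00:.0%} lower)")
```

Whatever classifier you use, test it on representative traffic first: a query misrouted to the cheap model costs quality, and one misrouted to the capable model only costs money.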
This is the pattern Avahi has used with Groopview, our anchor case study below.
Optimization levers, ranked by impact.
| Lever | Typical impact | Effort to implement | Risk |
|---|---|---|---|
| Right-size the model (routing) | High (often 30-50% lower model spend) | Medium (classifier + routing logic, or use AWS Intelligent Prompt Routing) | Low if you test the cheaper model on representative traffic first |
| Prompt Caching for static context | High for context-heavy apps (up to 90% off cached input) | Low (configuration, not architecture) | Very low |
| Batch Inference for offline jobs | Medium (50% off On-Demand on the batched portion) | Medium (queue-based architecture) | Very low |
| Switch sustained traffic to Provisioned Throughput | Medium for steady workloads | Medium (capacity planning + commitment) | Medium (over-reserving wastes money) |
| Reduce output token length (prompts, response shaping) | Medium (output tokens cost 3-5x more than input, so each token cut saves more) | Low (prompt engineering) | Low |
| Cache Guardrails-eligible outputs | Low to medium | Low | Very low |
AWS Bedrock Pricing Calculator: Forecasting Your Bill
The official AWS Pricing Calculator supports Amazon Bedrock and is the right tool for a first forecast. The fields that matter most:
- Model and Region (rates differ across Regions)
- Daily input tokens and daily output tokens (your most-used model and pricing mode)
- Knowledge Bases vector store size and query rate
- Guardrails evaluations per day
- Image or video generation volume, if applicable
The honest caveat: the calculator only shows you what AWS will charge for components you input. It does not catch the patterns that drive overspend (oversized models, uncached prompts, over-reserved throughput, Guardrails running on every input and every output).
For a real forecast on a real workload, instrument the application first, measure for a week or two on On-Demand, and forecast from observed usage. The calculator is a starting point, not the answer.
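Once you have a week of measurements, the forecast itself is a few lines. This is a minimal sketch assuming you log total input and output tokens per day; the seven daily figures are invented sample data, and the rates are the Nova Pro On-Demand rates quoted earlier.

```python
from statistics import mean


def forecast_monthly(daily_usage, input_per_million, output_per_million, days=30):
    """Project monthly On-Demand spend from observed daily token totals.

    daily_usage is a list of (input_tokens, output_tokens), one per observed day.
    """
    avg_in = mean(day[0] for day in daily_usage)
    avg_out = mean(day[1] for day in daily_usage)
    daily_cost = (avg_in * input_per_million + avg_out * output_per_million) / 1_000_000
    return daily_cost * days


# INVENTED sample data: one week of (input_tokens, output_tokens) per day.
observed_week = [
    (80_000_000, 40_000_000),
    (100_000_000, 50_000_000),
    (120_000_000, 60_000_000),
    (100_000_000, 50_000_000),
    (100_000_000, 50_000_000),
    (90_000_000, 45_000_000),
    (110_000_000, 55_000_000),
]

monthly = forecast_monthly(observed_week, 0.80, 3.20)
print(f"forecast: ${monthly:,.2f} per month")
```

An average hides weekly peaks, so it is worth running the same function on your busiest observed day as a ceiling.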
When NOT to Use Bedrock
Worth saying directly because no one else writes it.
Bedrock IS the right answer when:
- You want managed access to multiple frontier models behind one interface (Anthropic, Amazon Nova, Meta, Mistral, Cohere, and others without managing separate provider relationships).
- You need AWS-native integration with the rest of your stack (S3, Lambda, ECS, IAM, VPC, CloudWatch).
- You care about IAM controls, VPC isolation, and AWS-grade audit logging for compliance.
- The unit economics make sense at your scale once you apply routing, caching, and the right pricing mode.
Bedrock is NOT the right answer when:
- Your entire workload is one specific model at very high sustained volume and you can run it more cheaply elsewhere.
- You run Llama-family models at heavy sustained load (direct hosting on Amazon EC2 with Inferentia or a dedicated inference platform often beats Bedrock economics).
- Your use case fits a small open-source model and self-hosting on your own infrastructure is meaningfully cheaper.
- Your application only ever uses one provider’s models and going direct to that provider’s API wins on price (you give up the AWS-native security and integration story in exchange).
Bedrock optimizes for breadth (many models, one interface), governance (AWS-grade controls), and ease of integration. If those matter to you, Bedrock is the strongest option on the market. If you do not need them, run the math on alternatives before committing.
Real Result: How Groopview Cut AI Avatar Response Time by 80% with Dual-Nova Model Routing
Groopview builds an AI co-host avatar that processes text and images during live streams in real time. The product is latency-sensitive in a way that most enterprise AI is not: a slow response breaks the social-media experience the avatar is meant to enable.

The previous architecture routed every interaction through a single capable model. Response time was around 12 seconds. For a live-streaming product, that was too slow to feel real-time, and the model cost per interaction was higher than the unit economics could carry at scale.
We at Avahi rebuilt the architecture on Amazon Bedrock using a Dual-Nova orchestration framework, routing every request through a classifier that decides whether the query is simple or complex:
- Simple queries (most of the traffic) route to Amazon Nova Lite, the smaller and cheaper model.
- Complex queries route to Amazon Nova Pro, the larger, higher-quality model.
The stack: Amazon Bedrock (Nova Pro and Nova Lite), Amazon API Gateway, AWS Lambda, Amazon EC2 g6e GPU instances for the avatar rendering, Amazon S3, Amazon RDS, and Amazon CloudWatch for observability.
The result:
- AI avatar response time dropped from 12 seconds to about 2.5 seconds for simple queries (an 80% latency reduction).
- Complex queries return in about 7 seconds.
- Higher session stickiness and new revenue streams from the improved real-time experience.
- The cost-per-interaction dropped because the majority of traffic now hits the cheaper model.
This is the model-routing pattern made concrete. Same product, same models available, different architectural choice on which request hits which model.
Where AWS Funding Fits
The part most pricing guides cannot offer.
Avahi is an AWS Premier Tier Services Partner, and through our partnership with AWS, the proof of concept that proves your Bedrock cost case can be funded. Eligible companies may receive a funded PoC depending on the project, so you can model the actual cost on your actual workload before committing to a full build.
The structure that works: pick the workload you are evaluating, define the cost target, build a scoped PoC against your real traffic shape on Bedrock with the right routing and caching architecture, measure the bill against the forecast, and decide.
Make the Call With Avahi
Bedrock is the strongest managed-model platform on the market for teams that want frontier-model breadth with AWS-native controls. The bill becomes a problem only when teams default to the largest model, leave static prompts uncached, over-reserve throughput, and skip the routing layer that does the actual work of cost optimization.
The way to get a defensible cost forecast is not a calculator. It is a scoped PoC against your real workload that proves the routing, caching, and provisioning strategy works at your scale.
Start with a funded PoC on your highest-volume workload. Eligible companies may receive a funded PoC depending on the project.
FAQs: AWS Bedrock Pricing
How Much Does It Cost to Get Bedrock?
There is no fixed cost to get started with AWS Bedrock (or Amazon Bedrock; the service is named both ways). You enable the service in your AWS account, request access to the foundation models you want to use, and pay only for what you process: per 1,000 input and output tokens for text models, per image for image generation, per second for video, plus hourly rates for any reserved Provisioned Throughput. A small prototype touching a model like Amazon Nova Lite or Claude 3.5 Haiku can run for a few dollars a day. A production application can range from hundreds to many thousands of dollars per month depending on volume and model choice.
Is AWS Bedrock Free?
Bedrock is not free in the way Amazon S3 or AWS Lambda have free tiers you can rely on for production workloads. AWS does periodically offer credits and trial allowances for specific models, and individual model providers occasionally promote evaluation periods. For production deployments, assume Bedrock is paid usage from the first call. The right way to evaluate before committing is a scoped proof of concept, which can be partially or fully AWS-funded for eligible companies.
Is Bedrock Costly?
Bedrock is not categorically expensive. It becomes expensive in three predictable ways: defaulting to the largest model for every request when most do not need it, leaving large static prompts uncached, and over-reserving Provisioned Throughput before traffic is stable. Teams that route simple requests to small models, cache repeated context, and stay on On-Demand until traffic is predictable find that Bedrock costs are reasonable for what they deliver. (Reddit threads on AWS Bedrock pricing consistently surface the same three failure modes: oversized models, uncached prompts, and over-reserved throughput.) The architecture choices matter more than the rate card.
What Is AWS Bedrock Used For?
AWS Bedrock is a fully managed service that gives you access to foundation models from multiple providers (Anthropic Claude, Amazon Nova, Meta Llama, Mistral, Stability AI, Cohere, and others) behind one API. It is used to build generative AI applications including chatbots, RAG systems, document processing, image and video generation, agents, and content moderation, without the work of managing the underlying model infrastructure. Bedrock is the AWS-native answer to the question of how to use frontier models inside an environment that meets enterprise IAM, VPC, and audit requirements.
How Much Is Claude on Bedrock?
Claude pricing on AWS Bedrock follows the standard Bedrock pattern: priced per 1,000 input and output tokens, with output typically costing 3x to 5x more than input. Claude Sonnet 4.6 is the higher-quality, higher-cost option in the current lineup. Claude 3.5 Haiku is the faster, cheaper option for most tasks. Exact rates vary by Region and change as Anthropic releases new versions, so always verify the current rate on the official AWS Bedrock pricing page before forecasting.
What Is AWS Bedrock Provisioned Throughput Pricing?
Provisioned Throughput on Bedrock reserves dedicated capacity in model units, billed hourly per unit regardless of utilization. There are three commitment tiers: no-commit (highest hourly rate, stop any time), 1-month commitment (lower rate), and 6-month commitment (lowest rate). Each model has its own model-unit rate and its own commitment-tier savings. The break-even versus On-Demand depends on sustained utilization; for predictable high-volume workloads it usually wins, for bursty or unpredictable traffic it usually does not.
AWS Bedrock Guardrails Pricing: How Much Does It Cost?
Bedrock Guardrails are billed at $0.15 per 1,000 text units for content-filter evaluations on text content (plus $0.00075 per image for image content), as published on the AWS Bedrock pricing page as of June 2026. The pricing is per evaluation, so if you run Guardrails on both the input and the output of every model call, you pay for both evaluations. For low-volume applications this is trivial. For high-volume production deployments with large content windows, Guardrails can become a meaningful share of the total Bedrock bill and is worth measuring separately.
How Does AWS Bedrock LLM Pricing Work?
Bedrock LLM pricing has four modes: On-Demand (per 1,000 input and output tokens, no commitment), Provisioned Throughput (reserved capacity by the hour), Batch Inference (asynchronous processing at roughly half the On-Demand rate), and Prompt Caching (cached input context billed at a fraction of standard rates, up to 90% off). Most production deployments combine three of these: On-Demand for live traffic, Batch for overnight work, and Prompt Caching on static prompt prefixes. Provisioned Throughput comes in when usage is large, steady, and predictable.
Can Avahi Help Optimize My AWS Bedrock Bill?
Yes. Avahi is an AWS Premier Tier Services Partner that builds and optimizes generative AI architectures on AWS, including model routing, prompt caching, Batch migration, and right-sized Provisioned Throughput. Through our partnership with AWS, the proof of concept that proves the cost case can be funded. Eligible companies may receive a funded PoC depending on the project, so you can model the actual savings on your actual workload before committing to a rebuild.