AWS Bedrock Pricing: A Breakdown of Costs and How to Optimize Them

Your Bedrock proof of concept worked. The model was good, the API was clean, you shipped a feature, and users like it.

Then the bill arrived.

It was higher than the back-of-napkin number you used to greenlight the project, and the lines you can see on the invoice do not quite explain why. You are not alone, and the answer is not “Bedrock is expensive.”

The answer is that Bedrock pricing has five or six moving parts that are not obvious until you have lived through one or two billing cycles.

This guide is the one a senior AWS engineer would write for a senior AWS engineer. We walk through how Bedrock actually charges you in mid-2026, where the surprises live, the few levers that move the bill the most, and the times the answer is to not use Bedrock at all.

TL;DR: AWS Bedrock Pricing

Bedrock charges you four ways: On-Demand (per 1,000 tokens, per image, or per second of video), Provisioned Throughput (hourly per model unit), Batch Inference (50% off On-Demand), and Prompt Caching (up to 90% off the input-token portion for repeated context).
Hidden costs that catch teams off-guard: Knowledge Bases vector storage (typically a few hundred dollars per month minimum if you use OpenSearch Serverless as the vector store), Bedrock Guardrails at $0.15 per 1,000 text units, and Bedrock Flows at $0.035 per 1,000 visual node transitions.
The single biggest optimization lever is model routing: send simple requests to a small model like Amazon Nova Lite ($0.06 per 1M input tokens) or Nova Micro ($0.035 per 1M), and reserve a large model like Claude Sonnet 4.6 or Nova Premier only for the queries that need it. This alone can cut spend by half or more without touching the rest of the stack.
Bedrock pricing is not actually expensive for the right workload. It is expensive when teams default to the biggest model, leave large static prompts uncached, and overprovision throughput they do not need.
Want to model your specific workload on Bedrock against your current spend? Start with a funded PoC and get the numbers before you commit.

How AWS Bedrock Pricing Actually Works (Also Known as Amazon Bedrock Pricing)

Bedrock is usage-based. There is no minimum, no subscription, and no AWS Bedrock free tier you can rely on for production (more on the free question further down). You pay for what you use, but “what you use” falls into four pricing modes plus a set of add-on components, all worth understanding before you forecast a bill.

Amazon Bedrock pricing (“AWS Bedrock” and “Amazon Bedrock” are the same service) breaks down into four core modes, in plain terms:

On-Demand: Pay per 1,000 input tokens, per 1,000 output tokens, per image, or per second of generated video. No commitment, no minimum, prices vary by model and AWS Region. This is where most teams start.
Provisioned Throughput: Reserve dedicated capacity for a specific model in model units, billed hourly whether you use the capacity or not. This is what you switch to when usage gets predictable and large.
Batch Inference: Run asynchronous jobs at 50% off the On-Demand rate. Suitable for jobs where latency does not matter (overnight summarization, document classification, periodic enrichment). For Amazon Nova specifically, AWS also offers a Flex tier at the same 50% discount, plus a Priority tier at approximately 1.75x Standard for mission-critical workloads that need preferential compute allocation.
Prompt Caching: Cache repeated input context (system prompts, large knowledge snippets, few-shot examples) and pay a fraction of the input-token rate for the cached portion. Savings of up to 90% on the input side for context-heavy applications.

The four modes are not exclusive. A production system often uses three of them at once, and sometimes all four:

On-Demand for live traffic
Batch for overnight work
Prompt Caching on the static parts of every request
Sometimes, Provisioned Throughput on the model that handles the bulk of the load

Then there are the components that sit on top of the model and have their own pricing: Bedrock Knowledge Bases for RAG, Bedrock Agents for multi-step workflows, Bedrock Flows for visual orchestration, and Bedrock Guardrails for content moderation.

Each has a separate line on the bill.

Quick reference: the four AWS Bedrock pricing modes at a glance.

Pricing mode	How you pay	Best for	Typical discount vs On-Demand
On-Demand	Per 1,000 input tokens, per 1,000 output tokens, per image, per second of video	Modest, bursty, or unpredictable usage; prototypes and early production	Baseline (no discount)
Provisioned Throughput	Hourly per model unit (no-commit, 1-month, or 6-month tiers)	Steady, high-volume, predictable workloads; required for fine-tuned or custom models	Variable; depends on utilization and commitment
Batch Inference (or Flex tier)	Same per-token structure, asynchronous	Overnight summarization, periodic enrichment, classification, embedding generation	50% off
Prompt Caching	Cached input tokens billed at a fraction of standard rate	Applications with large static prefixes (system prompts, few-shot examples, fixed knowledge context)	Up to 90% off the cached portion

On-Demand Pricing: The Default Starting Point

When you first call a Bedrock model through the API, you are on On-Demand. The model is shared multi-tenant, you pay per token, and there is no commitment.

The structure is the same across every text model: you pay separately for input tokens (your prompt, including any retrieved context and the system message) and output tokens (the model’s response). Output tokens generally cost 3x to 5x more than input tokens. This ratio is the single most important fact for cost forecasting, because most teams underweight output cost when they estimate spend.

A quick example using Amazon Nova Pro. As of June 2026 on the official AWS Nova pricing page, Nova Pro On-Demand is $0.80 per 1 million input tokens and $3.20 per 1 million output tokens, which works out to $0.0008 per 1,000 input tokens and $0.0032 per 1,000 output tokens.

So a typical chat exchange of 1,000 input tokens and 500 output tokens costs $0.0008 + $0.0016 = $0.0024 per call. At 100,000 calls per day, that is about $240 per day, or roughly $7,200 per month.
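The same arithmetic as a minimal Python sketch, using the Nova Pro rates quoted above. The function name and the 30-day month are illustrative assumptions, not anything AWS publishes.

```python
def on_demand_cost(input_tokens, output_tokens, input_per_m, output_per_m):
    """Cost in USD of one call, given rates in USD per 1M tokens."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# Nova Pro On-Demand rates quoted above (USD per 1M tokens).
NOVA_PRO_INPUT = 0.80
NOVA_PRO_OUTPUT = 3.20

per_call = on_demand_cost(1_000, 500, NOVA_PRO_INPUT, NOVA_PRO_OUTPUT)
per_day = per_call * 100_000   # 100,000 calls per day
per_month = per_day * 30       # assumes a 30-day month

print(f"per call:  ${per_call:.4f}")
print(f"per day:   ${per_day:,.2f}")
print(f"per month: ${per_month:,.2f}")
```

Swap in your own token counts and rates to see how quickly the output side dominates the total.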

Verified Amazon Nova On-Demand pricing (US East, Standard tier, 2026).

Nova model	Per 1M input tokens	Per 1M output tokens	Best for
Nova Micro	$0.035	$0.14	Simple classification, routing decisions, lightweight extraction
Nova Lite	$0.06	$0.24	Most chat, summarization, mid-complexity tasks
Nova Pro	$0.80	$3.20	Reasoning-heavy queries, agentic workflows, multimodal tasks
Nova Pro (latency optimized)	$1.00	$4.00	Real-time UX where every 100ms matters
Nova Premier	$2.50	$12.50	Highest-quality Nova for complex multi-step reasoning

Note: Cache read input tokens are billed at 75% less than the On-Demand input token price. Flex tier and Batch tier prices are roughly 50% off the Standard tier rates shown. The Nova 2 generation (Nova 2 Lite, Nova 2 Pro, Nova 2 Omni) is available in Preview at different rates. Verify the latest rates on the official AWS Nova pricing page.

Verified Anthropic Claude pricing on Amazon Bedrock (Global Cross-region Inference, US East Ohio, 2026).

Claude model	Per 1M input	Per 1M output	Batch input (50% off)	Batch output (50% off)	Cache read (~90% off input)
Claude Opus 4.8	$5.00	$25.00	N/A	N/A	$0.50
Claude Opus 4.7	$5.00	$25.00	N/A	N/A	$0.50
Claude Opus 4.6	$5.00	$25.00	$2.50	$12.50	$0.50
Claude Opus 4.5	$5.00	$25.00	$2.50	$12.50	$0.50
Claude Sonnet 4.6	$3.00	$15.00	$1.50	$7.50	$0.30
Claude Sonnet 4.5	$3.00	$15.00	$1.50	$7.50	$0.30
Claude Sonnet 4	$3.00	$15.00	$1.50	$7.50	$0.30
Claude Haiku 4.5	$1.00	$5.00	$0.50	$2.50	$0.10
Claude 3.5 Haiku	$0.80	$4.00	N/A	N/A	$0.08
Claude Mythos Preview	non-GA	non-GA	non-GA	non-GA	non-GA

Geo and In-region cross-region inference rates are slightly higher (e.g., Sonnet 4.6 at $3.30 input / $16.50 output per 1M). Reserved Tier pricing is also available with 1-month and 3-month commitments for stable workloads. Verify the latest rates on the official AWS Bedrock pricing page.

Cost comparison worth knowing: Claude Sonnet 4.6 at $3.00 per 1M input tokens is roughly 4x more expensive than Nova Pro at $0.80, and 50x more expensive than Nova Lite at $0.06. That spread is the whole reason model routing matters. Sending a query to Sonnet 4.6 when Nova Lite would have answered correctly costs you 50x what it should have.

Anthropic’s Claude family on Bedrock follows the same pattern at a higher absolute cost.

AWS Bedrock Anthropic pricing tracks Anthropic’s published API rates closely, with regional variation.
Claude Opus 4.8 sits at the very top of the catalog ($5.00 per 1M input tokens, $25.00 per 1M output).
Claude Sonnet 4.6 is the mainstream high-quality choice ($3.00 per 1M input, $15.00 per 1M output, with a 1M-token context window in preview).
Claude Haiku 4.5 is the faster, cheaper option ($1.00 per 1M input, $5.00 per 1M output).
Anthropic also has Mythos Preview available in gated research preview (non-GA, no published prices yet).
Meta Llama and Mistral models offer alternative price-performance points. Always verify the current rate on the official AWS Bedrock pricing page before you commit to a budget, because rates move and Region matters.

For embeddings (vector models like Amazon Titan Text Embeddings or Cohere Embed), you pay only on input tokens because the output is a vector, not generated text. Embedding costs are usually small per call, but RAG applications generate a lot of them, so they add up.

For image and video generation, the unit changes. Amazon Nova Canvas charges per image generated, with a higher rate for higher resolution or premium quality. Amazon Nova Reel charges per second of generated video. Stability AI models on Bedrock follow similar per-image patterns.

The pattern across the generative models: you are paying mostly for output. The way to cut On-Demand cost is almost always to produce less output, or to produce the same output with a cheaper model.

Provisioned Throughput: When It Pays Off

Provisioned Throughput is the answer for steady, high-volume workloads. You reserve model units (each guaranteeing a known token-per-minute throughput) and you pay an hourly rate whether the capacity is in use or idle.

The trade-off is straightforward. On-Demand is cheaper per token but has variable latency under load and no capacity guarantee.

Provisioned Throughput is more expensive per hour but offers predictable performance and predictable cost. There are three commitment tiers: no-commit (pay hourly, stop any time), 1-month commitment (lower hourly rate), and 6-month commitment (lowest hourly rate).

The honest math on whether to switch:

If your daily token volume is modest and bursty, stay on On-Demand.
If you are running a model at high, sustained utilization (think customer-facing chat with consistent traffic, or batch processing that runs nearly all day), price out the Provisioned Throughput rate against your On-Demand bill. The break-even is workload-specific, but as a directional rule, sustained workloads above roughly the equivalent of a few hundred thousand tokens per minute start to look cheaper on Provisioned Throughput.
If you are using a fine-tuned or imported custom model, you do not have a choice. Custom models on Bedrock require Provisioned Throughput by design, because they cannot be shared in the multi-tenant On-Demand pool.

The mistake to avoid is over-reserving. Buying 6-month commitment capacity for traffic that is not yet stable is the single fastest way to overspend on Bedrock. Most teams should run on On-Demand long enough to see the real traffic shape before they commit.
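A rough way to price out the break-even is sketched below. The hourly rate is a placeholder, not a published AWS price (Provisioned Throughput rates vary by model and commitment tier), and the 2:1 input-to-output mix and 730-hour month are assumptions as well.

```python
HOURS_PER_MONTH = 730  # approximate average hours in a month

def provisioned_monthly_cost(model_units, hourly_rate_per_unit):
    """Monthly bill for reserved capacity, paid whether used or idle."""
    return model_units * hourly_rate_per_unit * HOURS_PER_MONTH

def blended_rate_per_m(input_per_m, output_per_m, input_share):
    """Average On-Demand price per 1M tokens for a given input/output mix."""
    return input_share * input_per_m + (1 - input_share) * output_per_m

def break_even_tokens(provisioned_monthly_usd, on_demand_blended_per_m):
    """Monthly token volume at which On-Demand spend equals the reserved bill."""
    return provisioned_monthly_usd / on_demand_blended_per_m * 1_000_000

# PLACEHOLDER hourly rate for one model unit; look up the real one for your model.
HYPOTHETICAL_HOURLY_RATE = 20.00

pt_monthly = provisioned_monthly_cost(1, HYPOTHETICAL_HOURLY_RATE)
# Nova Pro On-Demand rates from the table above, assuming 2 input tokens per output token.
blended = blended_rate_per_m(0.80, 3.20, input_share=2 / 3)
tokens = break_even_tokens(pt_monthly, blended)

print(f"reserved bill:      ${pt_monthly:,.0f} per month")
print(f"break-even volume:  {tokens:,.0f} tokens per month")
print(f"as sustained load:  {tokens / (HOURS_PER_MONTH * 60):,.0f} tokens per minute")
```

The break-even only matters if a single model unit can actually serve that volume, so check the unit's throughput for your model before drawing conclusions.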

Batch Inference and Prompt Caching: The Two Quiet Levers

Two pricing modes are easy to miss and can move the bill substantially without changing which model you call.

Batch Inference runs your prompts asynchronously at half the On-Demand rate (AWS also has a Flex tier at the same 50% discount for supported models, with similar trade-offs).

The use cases are non-interactive:

Overnight document summarization
Periodic data enrichment
Large-scale embedding generation
End-of-day report generation
Anything else where the request and response do not need to happen in real time

If you can hold a request for hours, you can pay half.

The pattern that works: take the high-volume, latency-tolerant parts of your workload and move them to Batch. The first place to look is anything that runs on a cron schedule. The second is anything where a user kicks off a job and is happy to get an email when it is done.
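To see what moving part of a workload to Batch is worth, here is a small sketch. The 50% discount comes from the pricing above; the $7,200 bill and the 40% batchable share are illustrative assumptions.

```python
BATCH_DISCOUNT = 0.50  # Batch Inference is billed at 50% off On-Demand

def cost_with_batch(on_demand_monthly_usd, batchable_fraction):
    """Monthly cost after moving a fraction of the workload to Batch."""
    live = on_demand_monthly_usd * (1 - batchable_fraction)
    batched = on_demand_monthly_usd * batchable_fraction * (1 - BATCH_DISCOUNT)
    return live + batched

current_bill = 7_200.00   # illustrative all-On-Demand monthly bill
batchable = 0.40          # hypothetical share of traffic that tolerates latency

new_bill = cost_with_batch(current_bill, batchable)
print(f"before: ${current_bill:,.2f}  after: ${new_bill:,.2f}")
print(f"savings: {1 - new_bill / current_bill:.0%}")
```

The overall saving is always half of the batchable share, so the work is in finding how much of your traffic really is latency-tolerant.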

Prompt Caching is the bigger lever for context-heavy applications. If your prompts include a large static prefix (a system prompt, a large set of few-shot examples, a long policy document, a knowledge base context) and the prefix repeats across requests, Bedrock can cache the prefix and bill the cached tokens at a fraction of the standard input-token rate.

Savings of up to 90% on the cached portion of the input are achievable for context-heavy applications, depending on the model.

This is the most underused optimization in production Bedrock deployments.

If your application sends the same 5,000-token system prompt with every request, you are probably paying for 5,000 input tokens every call when you could be paying for a tiny fraction of that.

Worth an audit if you are seeing a higher-than-expected input-token bill.
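A quick way to size the opportunity, as a sketch: compare the input cost per call with and without a cached prefix. The rates are the Claude Sonnet 4.6 figures from the table above; the 200-token dynamic portion is a hypothetical. The sketch deliberately ignores any cache-write charge and cache expiry, so treat the result as an upper bound.

```python
def input_cost_per_call(static_tokens, dynamic_tokens, input_per_m, cache_read_per_m=None):
    """Input-side cost in USD per call. If cache_read_per_m is given, the
    static prefix is billed at the cache-read rate instead of the full rate."""
    static_rate = input_per_m if cache_read_per_m is None else cache_read_per_m
    return (static_tokens * static_rate + dynamic_tokens * input_per_m) / 1_000_000

# Claude Sonnet 4.6 rates from the table above (USD per 1M tokens).
SONNET_INPUT = 3.00
SONNET_CACHE_READ = 0.30

uncached = input_cost_per_call(5_000, 200, SONNET_INPUT)
cached = input_cost_per_call(5_000, 200, SONNET_INPUT, SONNET_CACHE_READ)

print(f"uncached: ${uncached:.4f} per call")
print(f"cached:   ${cached:.4f} per call")
print(f"input-side savings: {1 - cached / uncached:.0%}")
```

The larger the static prefix relative to the per-request text, the closer the saving gets to the headline discount.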

The Hidden Costs: Knowledge Bases, Agents, Flows, Guardrails

Most teams budget for the model and forget the components that wrap around it. Each has its own line on the bill.

Bedrock Knowledge Bases is the managed RAG capability. The model invocation during retrieval is charged at standard inference rates.

The hidden cost is the vector store: if you use Amazon OpenSearch Serverless as the backing index (the default managed option), expect a baseline of a few hundred dollars per month minimum for the OCU capacity, even at low query volume.

For smaller workloads, alternatives like Amazon Aurora PostgreSQL with pgvector or self-managed OpenSearch can be cheaper but require more setup. There is also the embedding model cost during ingestion, which is paid once per document but recurs whenever you reindex.

Bedrock Agents orchestrate multi-step workflows where the model calls tools, retrieves context, and produces a final answer.

The agent invocation itself does not have a fixed per-call fee; what you pay is the sum of every model call the agent makes along the way.

Agents that make many tool calls and many model invocations per user request can be deceptively expensive if you are only counting the visible user interactions.
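To make that concrete, the sketch below sums the cost of every model call behind one user request. The four-step trace and its token counts are hypothetical; the rates are the Nova Pro figures quoted earlier.

```python
from dataclasses import dataclass

@dataclass
class ModelCall:
    """Token usage of a single model invocation inside an agent run."""
    input_tokens: int
    output_tokens: int

def agent_request_cost(calls, input_per_m, output_per_m):
    """Total USD cost of one user request: the sum over every model call."""
    total = sum(c.input_tokens * input_per_m + c.output_tokens * output_per_m
                for c in calls)
    return total / 1_000_000

# Hypothetical trace: context grows as tool results are appended each step.
trace = [
    ModelCall(1_500, 200),   # plan
    ModelCall(2_500, 150),   # after first tool call
    ModelCall(3_500, 150),   # after second tool call
    ModelCall(4_500, 500),   # final answer
]

cost = agent_request_cost(trace, input_per_m=0.80, output_per_m=3.20)
print(f"one agent request: ${cost:.4f} across {len(trace)} model calls")
```

In this made-up trace a single user request costs roughly five times the simple chat exchange priced earlier, which is why counting only visible interactions understates agent spend.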

Bedrock Flows is the visual orchestration product for chaining prompts, conditions, and tools. The Flows-specific cost is roughly $0.035 per 1,000 visual node transitions, on top of the underlying model invocation costs.

For low-volume use it is trivial; for high-volume production workloads it is worth measuring.

Bedrock Guardrails charges $0.15 per 1,000 text units for content-filter evaluations on text content, plus $0.00075 per image for image content. This is per-evaluation, not per-call: if you run Guardrails on both the input and the output, you pay for both. The text-filter rate is modest per call but adds up at high request volume and large content windows.

None of these are unreasonable on their own. They become a problem when teams stack them all together without measuring each one’s contribution to the final number.
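One way to measure each contribution is to estimate the add-ons separately, as in this sketch. The per-unit prices are the ones quoted above. The assumption that one text unit covers up to 1,000 characters, the character counts, the request volume, and the five node transitions per request are all illustrative and should be checked against current AWS documentation and your own traffic.

```python
import math

GUARDRAILS_TEXT_PER_1K_UNITS = 0.15    # USD per 1,000 text units (quoted above)
FLOWS_PER_1K_TRANSITIONS = 0.035       # USD per 1,000 node transitions (quoted above)
CHARS_PER_TEXT_UNIT = 1_000            # ASSUMPTION: one text unit = up to 1,000 characters

def guardrails_text_cost(chars_per_evaluation, evaluations):
    """USD cost of running a text content filter on `evaluations` payloads."""
    units = math.ceil(chars_per_evaluation / CHARS_PER_TEXT_UNIT) * evaluations
    return units / 1_000 * GUARDRAILS_TEXT_PER_1K_UNITS

def flows_cost(node_transitions):
    """USD cost of Flows node transitions, excluding model invocations."""
    return node_transitions / 1_000 * FLOWS_PER_1K_TRANSITIONS

monthly_requests = 3_000_000  # hypothetical: 100,000 requests per day

# Guardrails evaluated on both the input and the output of every request.
guardrails = (guardrails_text_cost(4_000, monthly_requests)     # input side
              + guardrails_text_cost(2_000, monthly_requests))  # output side
flows = flows_cost(5 * monthly_requests)  # hypothetical 5 transitions per request

print(f"Guardrails: ${guardrails:,.2f} per month")
print(f"Flows:      ${flows:,.2f} per month")
```

Put these next to the model line from your forecast and it becomes obvious which component deserves attention first.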

Hidden costs at a glance.

Component	What you pay	When it kicks in
Bedrock Knowledge Bases	Standard inference rates during retrieval, plus the vector store cost (OpenSearch Serverless typically a few hundred dollars per month minimum)	Any RAG application using Bedrock’s managed Knowledge Bases
Bedrock Agents	Sum of every model call the agent makes per user request	Multi-step agents that chain tool calls and model invocations
Bedrock Flows	$0.035 per 1,000 visual node transitions, on top of underlying model costs	High-volume Flows-based orchestration
Bedrock Guardrails (text)	$0.15 per 1,000 text units, charged per evaluation (input and output count separately)	Any production app with content moderation, billed on every evaluated call
Bedrock Guardrails (image)	$0.00075 per image processed	Image-content moderation
Cross-Region data transfer	Standard AWS egress rates	Multi-Region deployments or data sources in a different Region than the model

Model Routing: The Single Biggest Optimization Lever

If you read one section of this guide, read this one.

The most effective cost optimization on Bedrock is not switching from On-Demand to Provisioned Throughput. It is not Batch. It is not Prompt Caching. Those are real and worth doing. But none of them move the bill as much as using the right-sized model for each request.

The pattern: send simple requests to a cheap, fast model. Send complex requests to a large, expensive model. Decide which is which programmatically.

A worked example with verified numbers: Suppose your application handles two kinds of queries: short factual lookups (about 70% of traffic) and complex multi-step reasoning (about 30% of traffic).

If you route everything to Claude Sonnet 4.6 ($3.00/M input) or Nova Premier ($2.50/M input), you pay top-tier rates on every call.

If you route the 70% to Claude Haiku 4.5 ($1.00/M input), Nova Lite ($0.06/M input), or even Nova Micro ($0.035/M input) for the simplest tasks, and reserve Sonnet 4.6 or Nova Premier only for the 30% that needs it, spend on the routed portion falls by roughly two-thirds with Haiku 4.5, and by more than 95% with Nova Lite or Nova Micro, relative to Sonnet 4.6 at the listed rates, without changing user-facing quality on the queries that matter.
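The effect across all traffic can be estimated with a blended rate, sketched below. It uses the input rates listed above and assumes, as a simplification, that simple and complex queries carry similar token counts; if complex queries are longer, the overall saving will be smaller.

```python
def blended_input_rate(routes):
    """Average USD per 1M input tokens for a list of (traffic_share, rate) pairs."""
    if abs(sum(share for share, _ in routes) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * rate for share, rate in routes)

SONNET = 3.00      # Claude Sonnet 4.6, USD per 1M input tokens
HAIKU = 1.00       # Claude Haiku 4.5
NOVA_LITE = 0.06   # Amazon Nova Lite

baseline = blended_input_rate([(1.0, SONNET)])
with_haiku = blended_input_rate([(0.7, HAIKU), (0.3, SONNET)])
with_nova_lite = blended_input_rate([(0.7, NOVA_LITE), (0.3, SONNET)])

for label, rate in [("all Sonnet 4.6", baseline),
                    ("70% Haiku 4.5", with_haiku),
                    ("70% Nova Lite", with_nova_lite)]:
    print(f"{label:>15}: ${rate:.3f}/M input, {1 - rate / baseline:.0%} saved overall")
```

The same function works for output rates; run it on both sides and weight by your real input-to-output mix for a fuller picture.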

The implementation is a classifier or a router function that runs before the expensive model call. AWS also offers a native feature called Intelligent Prompt Routing that does this for you between models in the same family (Claude Sonnet 4.6 and Haiku 4.5, Llama 3.3 70B and 3.1 8B, or Nova Pro and Nova Lite).

AWS states that it can reduce costs by up to 30% without compromising accuracy. If you want the routing pattern but do not want to build the classifier yourself, this is the no-build option.

The trade-off is real. Routing adds a small amount of latency, a tiny upfront cost for the classifier, and architectural complexity. For low-volume applications it is overkill. For anything with meaningful traffic, it is the single biggest lever you have.
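For a sense of how small the router can start, here is a deliberately naive sketch: a keyword-and-length heuristic that picks a model ID before the call. The hint list, the 40-word threshold, and the function name are hypothetical, and the model IDs are assumptions to verify for your Region. A production router would more likely be a small trained classifier or a cheap model call, tested on representative traffic.

```python
# Assumed model identifiers; confirm the exact IDs available in your Region.
SMALL_MODEL_ID = "amazon.nova-lite-v1:0"
LARGE_MODEL_ID = "amazon.nova-pro-v1:0"

# Hypothetical phrases that suggest multi-step reasoning (plain substring match).
REASONING_HINTS = ("why", "compare", "explain", "step by step", "analyze", "plan")

def choose_model(query, max_simple_words=40):
    """Return the model ID to call: the large model for long or
    reasoning-flavored queries, the small model for everything else."""
    text = query.lower()
    is_long = len(text.split()) > max_simple_words
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    return LARGE_MODEL_ID if (is_long or needs_reasoning) else SMALL_MODEL_ID

print(choose_model("What time does the stream start?"))
print(choose_model("Compare these two pricing tiers and explain the trade-offs"))
```

The returned ID is what you would pass as the model identifier on the inference call. Log every routing decision so you can audit how often the cheap path was wrong before trusting it with more traffic.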

This is the pattern Avahi has used with Groopview, our anchor case study below.

Optimization levers, ranked by impact.

Lever	Typical impact	Effort to implement	Risk
Right-size the model (routing)	High (often 30-50% lower model spend)	Medium (classifier + routing logic, or use AWS Intelligent Prompt Routing)	Low if you test the cheaper model on representative traffic first
Prompt Caching for static context	High for context-heavy apps (up to 90% off cached input)	Low (configuration, not architecture)	Very low
Batch Inference for offline jobs	Medium (50% off On-Demand on the batched portion)	Medium (queue-based architecture)	Very low
Switch sustained traffic to Provisioned Throughput	Medium for steady workloads	Medium (capacity planning + commitment)	Medium (over-reserving wastes money)
Reduce output token length (prompts, response shaping)	Medium (each output token cut saves 3-5x as much as an input token)	Low (prompt engineering)	Low
Cache Guardrails-eligible outputs	Low to medium	Low	Very low

AWS Bedrock Pricing Calculator: Forecasting Your Bill

The official AWS Pricing Calculator supports Amazon Bedrock and is the right tool for a first forecast. The fields that matter most:

Model and Region (rates differ across Regions)
Daily input tokens and daily output tokens (your most-used model and pricing mode)
Knowledge Bases vector store size and query rate
Guardrails evaluations per day
Image or video generation volume, if applicable

The honest caveat: the calculator only shows you what AWS will charge for components you input. It does not catch the patterns that drive overspend (oversized models, uncached prompts, over-reserved throughput, Guardrails running on every input and every output).

For a real forecast on a real workload, instrument the application first, measure for a week or two on On-Demand, and forecast from observed usage. The calculator is a starting point, not the answer.
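Forecasting from observed usage can be as simple as the sketch below. The seven daily samples are made up for illustration; in practice you would pull them from your own logs or metrics. The rates are the Nova Pro figures quoted earlier, and the 30-day month is an assumption.

```python
from statistics import mean

def forecast_monthly_cost(daily_usage, input_per_m, output_per_m, days_in_month=30):
    """Project a monthly bill from observed (input_tokens, output_tokens) per day."""
    avg_in = mean(day[0] for day in daily_usage)
    avg_out = mean(day[1] for day in daily_usage)
    daily_cost = (avg_in * input_per_m + avg_out * output_per_m) / 1_000_000
    return daily_cost * days_in_month

# Hypothetical week of measured traffic: (input tokens, output tokens) per day.
observed = [
    (90_000_000, 45_000_000),
    (100_000_000, 50_000_000),
    (110_000_000, 55_000_000),
    (100_000_000, 50_000_000),
    (95_000_000, 50_000_000),
    (105_000_000, 50_000_000),
    (100_000_000, 50_000_000),
]

forecast = forecast_monthly_cost(observed, input_per_m=0.80, output_per_m=3.20)
print(f"forecast: ${forecast:,.2f} per month")
```

An average hides weekly seasonality and growth, so rerun the forecast as more data arrives rather than treating the first number as final.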

When NOT to Use Bedrock

Worth saying directly because no one else writes it.

Bedrock IS the right answer when:

You want managed access to multiple frontier models behind one interface (Anthropic, Amazon Nova, Meta, Mistral, Cohere, and others without managing separate provider relationships).
You need AWS-native integration with the rest of your stack (S3, Lambda, ECS, IAM, VPC, CloudWatch).
You care about IAM controls, VPC isolation, and AWS-grade audit logging for compliance.
The unit economics make sense at your scale once you apply routing, caching, and the right pricing mode.

Bedrock is NOT the right answer when:

Your entire workload is one specific model at very high sustained volume and you can run it more cheaply elsewhere.
You run Llama-family models at heavy sustained load (direct hosting on Amazon EC2 with Inferentia or a dedicated inference platform often beats Bedrock economics).
Your use case fits a small open-source model and self-hosting on your own infrastructure is meaningfully cheaper.
Your application only ever uses one provider’s models and going direct to that provider’s API wins on price (you give up the AWS-native security and integration story in exchange).

Bedrock optimizes for breadth (many models, one interface), governance (AWS-grade controls), and ease of integration. If those matter to you, Bedrock is the strongest option on the market. If you do not need them, run the math on alternatives before committing.

Real Result: How Groopview Cut AI Avatar Response Time by 80% with Dual-Nova Model Routing

Groopview builds an AI co-host avatar that processes text and images during live streams in real time. The product is latency-sensitive in a way that most enterprise AI is not: a slow response breaks the social-media experience the avatar is meant to enable.

The previous architecture routed every interaction through a single capable model. Response time was around 12 seconds. For a live-streaming product, that was too slow to feel real-time, and the model cost per interaction was higher than the unit economics could carry at scale.

We at Avahi rebuilt the architecture on Amazon Bedrock using a Dual-Nova orchestration framework, routing every request through a classifier that decides whether the query is simple or complex:

Simple queries (most of the traffic) route to Amazon Nova Lite, the smaller and cheaper model.
Complex queries route to Amazon Nova Pro, the larger, higher-quality model.

The stack: Amazon Bedrock (Nova Pro and Nova Lite), Amazon API Gateway, AWS Lambda, Amazon EC2 g6e GPU instances for the avatar rendering, Amazon S3, Amazon RDS, and Amazon CloudWatch for observability.

The result:

AI avatar response time dropped from 12 seconds to about 2.5 seconds for simple queries (roughly an 80% latency reduction).
Complex queries return in about 7 seconds.
Higher session stickiness and new revenue streams from the improved real-time experience.
The cost-per-interaction dropped because the majority of traffic now hits the cheaper model.

This is the model-routing pattern made concrete. Same product, same models available, different architectural choice on which request hits which model.


Where AWS Funding Fits

The part most pricing guides cannot offer.

Avahi is an AWS Premier Tier Services Partner, and through our partnership with AWS, the proof of concept that proves your Bedrock cost case can be funded. Eligible companies may receive a funded PoC depending on the project, so you can model the actual cost on your actual workload before committing to a full build.

The structure that works: pick the workload you are evaluating, define the cost target, build a scoped PoC against your real traffic shape on Bedrock with the right routing and caching architecture, measure the bill against the forecast, and decide.


Make the Call With Avahi

Bedrock is the strongest managed-model platform on the market for teams that want frontier-model breadth with AWS-native controls. The bill becomes a problem only when teams default to the largest model, leave static prompts uncached, over-reserve throughput, and skip the routing layer that does the actual work of cost optimization.

The way to get a defensible cost forecast is not a calculator. It is a scoped PoC against your real workload that proves the routing, caching, and provisioning strategy works at your scale.

Start with a funded PoC on your highest-volume workload. Eligible companies may receive a funded PoC depending on your project.

FAQs: AWS Bedrock Pricing

How Much Does It Cost to Get Bedrock?

There is no fixed cost to get started with AWS Bedrock (or Amazon Bedrock; the service is named both ways). You enable the service in your AWS account, request access to the foundation models you want to use, and pay only for what you process: per 1,000 input and output tokens for text models, per image for image generation, per second for video, plus hourly rates for any reserved Provisioned Throughput. A small prototype touching a model like Amazon Nova Lite or Claude 3.5 Haiku can run for a few dollars a day. A production application can range from hundreds to many thousands of dollars per month depending on volume and model choice.

Is AWS Bedrock Free?

Bedrock is not free in the way Amazon S3 or AWS Lambda have free tiers you can rely on for production workloads. AWS does periodically offer credits and trial allowances for specific models, and individual model providers occasionally promote evaluation periods. For production deployments, assume Bedrock is paid usage from the first call. The right way to evaluate before committing is a scoped proof of concept, which can be partially or fully AWS-funded for eligible companies.

Is Bedrock Costly?

Bedrock is not categorically expensive. It becomes expensive in three predictable ways: defaulting to the largest model for every request when most do not need it, leaving large static prompts uncached, and over-reserving Provisioned Throughput before traffic is stable. Teams that route simple requests to small models, cache repeated context, and stay on On-Demand until traffic is predictable find that Bedrock costs are reasonable for what they deliver. (Reddit threads on AWS Bedrock pricing consistently surface the same three failure modes: oversized models, uncached prompts, and over-reserved throughput.) The architecture choices matter more than the rate card.

What Is AWS Bedrock Used For?

AWS Bedrock is a fully managed service that gives you access to foundation models from multiple providers (Anthropic Claude, Amazon Nova, Meta Llama, Mistral, Stability AI, Cohere, and others) behind one API. It is used to build generative AI applications including chatbots, RAG systems, document processing, image and video generation, agents, and content moderation, without the work of managing the underlying model infrastructure. Bedrock is the AWS-native answer to the question of how to use frontier models inside an environment that meets enterprise IAM, VPC, and audit requirements.

How Much Is Claude on Bedrock?

Claude pricing on AWS Bedrock follows the standard Bedrock pattern: priced per 1,000 input and output tokens, with output typically costing 3x to 5x more than input. Claude Sonnet 4.6 is the higher-quality, higher-cost option in the current lineup. Claude 3.5 Haiku is the faster, cheaper option for most tasks. Exact rates vary by Region and change as Anthropic releases new versions, so always verify the current rate on the official AWS Bedrock pricing page before forecasting.

What Is AWS Bedrock Provisioned Throughput Pricing?

Provisioned Throughput on Bedrock reserves dedicated capacity in model units, billed hourly per unit regardless of utilization. There are three commitment tiers: no-commit (highest hourly rate, stop any time), 1-month commitment (lower rate), and 6-month commitment (lowest rate). Each model has its own model-unit rate and its own commitment-tier savings. The break-even versus On-Demand depends on sustained utilization; for predictable high-volume workloads it usually wins, for bursty or unpredictable traffic it usually does not.

AWS Bedrock Guardrails Pricing: How Much Does It Cost?

Bedrock Guardrails are billed at $0.15 per 1,000 text units for content-filter evaluations on text content (plus $0.00075 per image for image content), as published on the AWS Bedrock pricing page as of June 2026. The pricing is per evaluation, so if you run Guardrails on both the input and the output of every model call, you pay for both evaluations. For low-volume applications this is trivial. For high-volume production deployments with large content windows, Guardrails can become a meaningful share of the total Bedrock bill and is worth measuring separately.

How Does AWS Bedrock LLM Pricing Work?

Bedrock LLM pricing has four modes: On-Demand (per 1,000 input and output tokens, no commitment), Provisioned Throughput (reserved capacity by the hour), Batch Inference (asynchronous processing at roughly half the On-Demand rate), and Prompt Caching (cached input context billed at a fraction of standard rates, up to 90% off). Most production deployments combine three of these: On-Demand for live traffic, Batch for overnight work, and Prompt Caching on static prompt prefixes. Provisioned Throughput comes in when usage is large, steady, and predictable.

Can Avahi Help Optimize My AWS Bedrock Bill?

Yes. Avahi is an AWS Premier Tier Services Partner that builds and optimizes generative AI architectures on AWS, including model routing, prompt caching, Batch migration, and right-sized Provisioned Throughput. Through our partnership with AWS, the proof of concept that proves the cost case can be funded. Eligible companies may receive a funded PoC depending on the project, so you can model the actual savings on your actual workload before committing to a rebuild.
