Your AI feature is working. Demos go well, paying customers are using it, the dashboard looks healthy.
Then one Tuesday afternoon you push a new prompt template, traffic doubles for an hour, and the whole thing slows to a crawl. Some requests time out. A few return a quota error you have never seen before.
Your AWS bill, when it lands, is the second surprise.
If you are running an AI workload on AWS at any real scale, you are doing AI capacity planning, whether you have a plan or not. Most early-stage teams do not, because the MVP that won them their first users did not need one.
That is the moment this article is for. Here is what AI capacity planning is, the walls you hit on AWS as you grow, and how to plan for capacity before the next funding round, traffic spike, or quota limit decides for you.
TL;DR: AI Capacity Planning on AWS
What Is AI Capacity Planning?
AI capacity planning is the process of choosing models and sizing compute, throughput quotas, and supporting infrastructure to handle current and projected demand for an AI workload, without over-provisioning cost or under-provisioning reliability. On AWS, that means modeling token throughput, request concurrency, and service quotas alongside traditional compute and database capacity.
Traditional capacity planning sizes CPU, memory, and disk for a known load curve. AI capacity planning has to model three things at once: how the model behaves under load, how many tokens each request consumes, and how AWS service quotas limit you well before your hardware does.
That last part is what most teams miss. You can have spare capacity in every dimension that shows up on a CloudWatch dashboard and still get throttled because you hit a per-account token limit on a managed service.
Why AI Capacity Planning Breaks Most Startup Stacks
The stack that got you to your first thousand users was probably built for prototyping, not for capacity. Three things make AI workloads especially hard to capacity-plan after the fact.
Inference is bursty and concurrency-sensitive. A single request looks fast in testing. Fifty concurrent requests against the same endpoint can run at a fraction of that speed, because GPU memory, batching, and queueing change the math.
Token-based quotas hit before CPU does. On managed services like Amazon Bedrock, your account has request-per-minute (RPM) and tokens-per-minute (TPM) limits per model. Default quotas are tuned for evaluation, not production traffic, and you hit them long before your servers feel any strain.
Costs scale non-linearly. Prompt size, output length, and model choice each multiply the bill. A small bump in average prompt length, multiplied across every request, can change your monthly bill by 30% or more without any change in user count, which is why ongoing AWS cost optimization for AI workloads is now table stakes.
When all three compound, the workload that was fine last quarter starts failing in ways your monitoring was not built to catch.
The Four Capacity Walls You Hit on AWS
Each wall has a specific symptom and a specific fix, and most teams hit them in roughly the same order as they grow.
A short note first: these walls are not theoretical. They show up in real production traffic, often within weeks of a launch or a new feature flag, and they tend to compound. Hitting the quota wall hides the concurrency wall, and once you fix concurrency, the cost picture becomes the new constraint. The fastest way to plan capacity is to know which wall you are closest to right now.
The Quota Wall
Symptom: requests start returning ThrottlingException or capacity errors, even though CPU and memory are nowhere near saturated.
What is happening: you have run into your service-level quota for the AWS service you are using. On Bedrock, default per-account quotas are around 800 RPM and 600,000 TPM on a model like Llama 3.3 70B Instruct. In Avahi’s own performance testing, getting to 6,000 RPM and 36 million TPM took specially granted, elevated quotas that are not automatic.
Fix: identify your binding quota per service and model, then request increases through AWS Support well before you need them. Production quota approvals can take time.
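To find the binding quota, compare the request rate your TPM limit allows against the RPM limit itself. The sketch below is a minimal illustration using the default figures quoted above; the `Quota` class, the function name, and the per-request token counts are assumptions made for the example, not an AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    rpm: int  # requests per minute allowed by the account quota
    tpm: int  # tokens per minute allowed by the account quota


def max_sustainable_rpm(quota: Quota, avg_tokens_per_request: float) -> tuple[float, str]:
    """Return the request rate the quota allows and which limit binds first."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # Requests per minute you could send before exhausting the token quota.
    rpm_from_tokens = quota.tpm / avg_tokens_per_request
    if rpm_from_tokens < quota.rpm:
        return rpm_from_tokens, "TPM"
    return float(quota.rpm), "RPM"


default_quota = Quota(rpm=800, tpm=600_000)        # default figures quoted above
small = max_sustainable_rpm(default_quota, 500)    # hypothetical short requests
large = max_sustainable_rpm(default_quota, 3_000)  # hypothetical long requests
print(small)  # (800.0, 'RPM')
print(large)  # (200.0, 'TPM')
```

With these numbers, short requests are capped by RPM while long requests hit the TPM ceiling at a quarter of that request rate, which is why the binding quota has to be worked out per workload.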
The Concurrency Wall
Symptom: per-request latency is fine in isolation, but tail latency spikes hard as concurrent requests climb.
In Avahi’s testing, tokens-per-second on Llama 3.3 70B Instruct ranged from about 135 at low concurrency to roughly 119 at concurrency 50 with small prompts. With large prompts, the same workload delivered close to half the throughput of small prompts at concurrency 50.
Fix: stress-test at the concurrency levels you actually expect, not the average. Right-size models for each task, and route simple work to smaller models so the big model is not the choke point.
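One way to estimate "the concurrency levels you actually expect" is Little's law: the mean number of requests in flight equals the arrival rate times the mean time each request spends in the system. The sketch below applies it with a tail latency instead of the mean, which gives a deliberately conservative figure. The function names, the 1.5x headroom factor, and the traffic numbers are assumptions for illustration.

```python
import math


def expected_concurrency(requests_per_second: float, latency_seconds: float) -> float:
    """Little's law: requests in flight = arrival rate x time in system."""
    return requests_per_second * latency_seconds


def stress_levels(peak_rps: float, p99_latency_s: float, headroom: float = 1.5) -> list[int]:
    """Concurrency levels to test: half of expected, expected, and expected plus headroom."""
    base = expected_concurrency(peak_rps, p99_latency_s)
    return [max(1, math.ceil(base * factor)) for factor in (0.5, 1.0, headroom)]


# Hypothetical peak: 10 requests per second with a 4-second p99 latency.
levels = stress_levels(peak_rps=10, p99_latency_s=4.0)
print(levels)  # [20, 40, 60]
```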
The Cost Wall
Symptom: the AWS bill grows faster than your active user count.
This usually means your prompt sizes, retry logic, or model choice are doing more work per request than they need to. AI cost scales with tokens and model size, not request count, so a 20% bump in average context length is roughly a 20% bump in input-token spend, with no change in traffic.
Fix: tighten prompts, cache repeated requests, batch where you can, and right-size models per task. Cost work for AI workloads usually starts with model routing before infrastructure.
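A small per-request cost model makes the token arithmetic concrete. The prices below are hypothetical placeholders, not real Bedrock pricing, and the token counts are invented for illustration. Substitute the published per-token prices for your model.

```python
# Hypothetical prices in dollars per 1,000 tokens. Replace with your model's real pricing.
IN_PRICE_PER_1K = 0.001
OUT_PRICE_PER_1K = 0.002


def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, billed separately for input and output tokens."""
    return (input_tokens / 1000) * IN_PRICE_PER_1K + (output_tokens / 1000) * OUT_PRICE_PER_1K


baseline = cost_per_request(2_000, 500)  # representative request
longer = cost_per_request(2_400, 500)    # same request with 20% more context
increase_pct = (longer / baseline - 1) * 100

print(f"{baseline:.4f} -> {longer:.4f} per request (+{increase_pct:.1f}%)")
# 0.0030 -> 0.0034 per request (+13.3%)
```

In this example the total bill rises about 13% because output tokens are unchanged. The closer your spend is to pure input tokens, the closer the increase gets to the full 20%.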
The Latency Wall
Symptom: time-to-first-token (TTFT) stays stable for a while, then degrades sharply at higher concurrency, especially for large prompts.
Fix: cap context size, stream output where the user-facing flow allows it, and use autoscaling that adds capacity ahead of demand instead of after.
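To watch TTFT degrade, you first have to measure it on a streamed response. The sketch below times any iterable of streamed chunks; `measure_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for whatever streaming client you use. Note that it counts chunks, which are not necessarily tokens.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a streamed response: time to first chunk, total time, and chunk rate."""
    start = clock()
    first_chunk_at = None
    count = 0
    for _ in chunks:
        now = clock()
        if first_chunk_at is None:
            first_chunk_at = now  # first output seen by the user
        count += 1
    total = clock() - start
    return {
        "ttft_s": None if first_chunk_at is None else first_chunk_at - start,
        "total_s": total,
        "chunks": count,
        "chunks_per_s": count / total if total > 0 else 0.0,
    }


def fake_stream(n: int, delay_s: float) -> Iterator[str]:
    """Stand-in for a real streaming model response (assumption for the demo)."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"chunk-{i}"


print(measure_stream(fake_stream(3, 0.01)))
```

The `clock` parameter is injectable so the measurement logic can be checked with a deterministic clock before pointing it at a live endpoint.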
How to Plan AI Capacity on AWS (A Practical Approach)
There is no universal capacity number. There is, instead, a measured process you can run on your own workload.
- Baseline a single request. Measure time-to-first-token, time-to-last-token, tokens-per-second, and cost per request on a representative prompt. This is your reference point.
- Stress-test under concurrency. Run the same prompt at 1, 5, 10, 25, and 50 concurrent requests. Capture p50, p95, and p99 latency at each level. The gap between p50 and p99 is where users feel pain.
- Map workloads to the right AWS scaling primitive. Stateless web requests fit Lambda or Fargate. Long-running batch jobs fit ECS with auto scaling. Custom inference fits SageMaker endpoints with managed autoscaling. Foundation models fit Amazon Bedrock with attention to quotas.
- Request quota increases ahead of need. Identify your binding quota per model and service, calculate the production ceiling you expect, and file the increase request weeks before launch, not the day you go live.
- Add the right scaling primitives at the right layer. Use Auto Scaling Groups for EC2, target tracking on Application Load Balancer request counts, predictive scaling driven by historical CloudWatch metrics, caching at the application layer, and rate limiting at the API gateway.
Done in this order, you go from guessing to a documented plan you can defend in a board meeting or a security review.
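The stress-test step above can be sketched with nothing but the standard library. This is a minimal harness under stated assumptions: `fake_invoke` is a stand-in that sleeps instead of calling a model, each level is fired as a single wave and not as sustained load, and percentiles use the nearest-rank method. Swap in your own async client call to get real numbers.

```python
import asyncio
import math
import time


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


async def fake_invoke(prompt: str) -> str:
    """Stand-in for a real model call (assumption); replace with your client."""
    await asyncio.sleep(0.01)
    return "ok"


async def timed_call(invoke, prompt: str) -> float:
    """Return the wall-clock latency of one call in seconds."""
    start = time.perf_counter()
    await invoke(prompt)
    return time.perf_counter() - start


async def run_level(invoke, prompt: str, concurrency: int) -> dict:
    """Fire `concurrency` calls at once and summarize their latencies."""
    latencies = await asyncio.gather(
        *(timed_call(invoke, prompt) for _ in range(concurrency))
    )
    return {
        "concurrency": concurrency,
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
    }


async def main() -> list[dict]:
    results = []
    for level in (1, 5, 10, 25, 50):  # the levels suggested above
        results.append(await run_level(fake_invoke, "representative prompt", level))
    return results


results = asyncio.run(main())
for row in results:
    print(row)
```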
Which AWS Services Scale Automatically (and Which Do Not)?
A common reason capacity plans break is treating every AWS service as if it scales the same way. They do not.
| AWS service | Scales automatically? | What to know |
|---|---|---|
| AWS Lambda | Yes | Scales to thousands of concurrent executions; watch account concurrency limits |
| AWS Fargate (ECS) | Yes, with service auto scaling configured | Target tracking on CPU, memory, or custom metrics |
| Aurora Serverless v2 | Yes | Capacity units scale up and down with load |
| Amazon Bedrock | Managed, but quota-bound | No instances to scale; RPM and TPM quotas are the ceiling |
| SageMaker endpoints | With auto scaling policies | Real-time endpoints need target tracking or scheduled scaling configured |
| Amazon EC2 | Only with Auto Scaling Groups | Define launch templates, min/max/desired capacity, and scaling policies |
| S3, DynamoDB on-demand | Yes | Effectively unlimited for typical workloads |
Can you use Auto Scaling without a load balancer? Yes. Auto Scaling Groups can scale on any CloudWatch metric, such as SQS queue depth or a custom application metric, or on scheduled actions. Load balancers are common because most web traffic uses target tracking on request count or response time, but they are not a requirement.
The takeaway: pick the scaling primitive that fits each layer of your AI stack, and do not assume “managed” means “infinitely scalable.”
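For queue-driven workloads, a common way to scale without a load balancer is to track backlog per worker: divide queue depth by the number of messages each worker can clear within your latency target. The sketch below shows only that arithmetic, with hypothetical function names and numbers. In practice you would publish the backlog figure as a custom CloudWatch metric and let a target tracking policy act on it.

```python
import math


def acceptable_backlog_per_worker(max_wait_s: float, avg_processing_s: float) -> float:
    """Messages one worker can hold in its backlog and still meet the latency target."""
    return max_wait_s / avg_processing_s


def desired_workers(queue_depth: int, backlog_per_worker: float,
                    min_workers: int, max_workers: int) -> int:
    """Workers needed to drain the queue in time, clamped to the group's bounds."""
    needed = math.ceil(queue_depth / backlog_per_worker)
    return max(min_workers, min(max_workers, needed))


# Hypothetical workload: each inference job takes 0.5 s and may wait at most 10 s.
backlog = acceptable_backlog_per_worker(max_wait_s=10, avg_processing_s=0.5)
print(backlog)                                # 20.0
print(desired_workers(1_500, backlog, 2, 100))  # 75
print(desired_workers(1_500, backlog, 2, 50))   # 50 (capped by max_workers)
print(desired_workers(0, backlog, 2, 100))      # 2 (floor at min_workers)
```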
How to Use AI for Capacity Planning
There is a meta-version of this problem worth flagging, because it shows up in search.
You can use AI to forecast demand on your own workload. LLMs can read historical traffic logs and propose autoscaling policies. Forecasting models can predict daily and weekly load patterns from CloudWatch metrics, which feeds into predictive scaling. AI-driven load-testing tools can generate realistic traffic profiles instead of static scripts.
The honest framing: AI helps you model your AI’s behavior, but it does not replace baseline testing. You still need real measurements from your own workload before any model can usefully forecast it. Use AI to compress the analysis cycle, not to skip the measurement step.
Real Result: Vela Health Goes From MVP to Patient-Scale in 5 Weeks

Vela Health is a digital health startup whose AWS setup had grown organically. Their AI workloads were running on OpenAI with ChromaDB and FAISS for vector search, with no environment separation and no formal capacity baseline. It worked for development. It was not going to survive a patient-facing launch.
Avahi delivered the full graduation in five weeks. The work included:
- Establishing a multi-account landing zone with separated environments
- Implementing CI/CD via GitHub Actions with OIDC and zero hard-coded credentials
- Using ECS Fargate for the backend, RDS MySQL and ElastiCache Redis for data and queues, and Secrets Manager for secrets
- Migrating OpenAI workloads to Amazon Bedrock and replacing ChromaDB and FAISS with OpenSearch KNN
The result: a secure, scalable AWS platform ready for real patients, with autoscaling primitives where Vela Health needed them and a stack that no longer depended on one engineer’s tribal knowledge.
That is the difference between an MVP that gets you users and infrastructure that lets you keep them.
Real Result: Healthi Adjudicates Claims in Under a Minute With Agentic AI 
Healthi is an insurance technology company processing claims at a pace manual review cannot match. The capacity challenge was not just volume but latency under volume: claims decisions needed to land in under a minute, every time, even as request rates spiked.
Avahi designed an agentic AI workflow on AWS that delivers real-time adjudication while keeping spike-load costs in check.
The result: claims decisions in under a minute, on an architecture that absorbs traffic spikes without each spike showing up as a line item on the bill.
That is what AI capacity planning looks like in practice for a real-time workload.
Where AI Capacity Plans Quietly Break
A short list of patterns that show up in almost every audit Avahi runs.
- Planning for averages, not p99. Your average user is fine. The 99th percentile is the one who churns.
- Ignoring TPM and RPM quotas. The capacity ceiling lives at the model level, not the compute level.
- Picking the model first, capacity second. Frontier models are expensive at scale, and the cheapest model that meets the quality bar usually wins on unit economics.
- No caching layer. Repeated identical requests should not be paying for inference twice.
- No rate limiting. A misbehaving client or a bot run can drain your daily quota in an hour.
- Autoscaling without warm pools for cold-start-sensitive workloads. Scaling out is only useful if the new capacity is ready by the time the traffic arrives.
Most of these are quick to fix once you know where to look. The hard part is finding them before they find you.
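Two of the fixes above, a caching layer and rate limiting, are small enough to sketch. This is an illustrative in-memory version with hypothetical class names: the cache has no TTL or eviction and only suits requests where an identical prompt should return an identical answer, and the token bucket is per-process, not shared across instances.

```python
import hashlib
import time
from typing import Callable


class PromptCache:
    """Serve repeated identical (model, prompt) pairs without paying for inference twice."""

    def __init__(self, invoke: Callable[[str, str], str]):
        self._invoke = invoke
        self._store: dict[str, str] = {}
        self.misses = 0  # number of times the underlying model was actually called

    def get(self, model_id: str, prompt: str) -> str:
        key = hashlib.sha256(f"{model_id}\n{prompt}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._invoke(model_id, prompt)
        return self._store[key]


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the time elapsed since the last check, up to capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Demo with a stand-in model call and a controllable clock (both assumptions).
cache = PromptCache(lambda model_id, prompt: f"answer to {prompt}")
cache.get("example-model", "What is our refund policy?")
cache.get("example-model", "What is our refund policy?")
print(cache.misses)  # 1

now = [0.0]
bucket = TokenBucket(rate_per_second=1.0, capacity=2.0, clock=lambda: now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
print(burst)  # [True, True, False]
now[0] = 1.0
after_refill = bucket.allow()
print(after_refill)  # True
```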
Where AWS Funding Fits
Through our partnership with AWS, the work to harden an AI workload’s capacity model can often be funded. Eligible companies may receive a funded PoC depending on the project.
The low-risk way to start is a scoped PoC that targets the single biggest capacity bottleneck first, so you see the result on real workload metrics before committing to a broader build.
Plan Your AI Capacity on AWS With Avahi
AI capacity planning is one of those problems that is invisible until it is urgent. The fix is straightforward: measure your workload, map it to the right AWS primitives, and harden the layer that breaks first.
At Avahi we do this as an AWS Premier Tier Services Partner with six AWS Competencies, including Generative AI and Migration. Through our partnership with AWS, much of the work can be funded.
Start with a scoped PoC to plan and harden your AI workload’s capacity on AWS. Eligible companies may receive a funded PoC depending on the project.
FAQs on AI Capacity Planning and Scaling on AWS
How Do You Auto Scale in AWS?
You define an Auto Scaling Group for EC2, or configure service-level auto scaling for ECS services (including those on Fargate), Lambda provisioned concurrency, or SageMaker endpoints, then attach scaling policies. Common policies include target tracking (hold a metric at a target value), step scaling (react to threshold breaches), and scheduled scaling (anticipated load). Target tracking and step scaling are triggered by CloudWatch metrics and alarms; scheduled scaling runs at the times you set.
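As a concrete example, here is a sketch of a target tracking policy for an ECS service using the Application Auto Scaling API through boto3. The cluster and service names, the 60% CPU target, and the cooldowns are assumptions for illustration. The service must already be registered as a scalable target (via `register_scalable_target`) before the policy can be attached.

```python
def ecs_target_tracking_kwargs(cluster: str, service: str,
                               target_cpu_percent: float = 60.0,
                               scale_out_cooldown_s: int = 60,
                               scale_in_cooldown_s: int = 300) -> dict:
    """Build the arguments for Application Auto Scaling's put_scaling_policy call."""
    return {
        "PolicyName": f"{service}-cpu-target-tracking",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu_percent,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            "ScaleOutCooldown": scale_out_cooldown_s,
            "ScaleInCooldown": scale_in_cooldown_s,
        },
    }


def apply_policy(kwargs: dict) -> dict:
    """Send the policy to AWS. Needs boto3, credentials, and a registered scalable target."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    client = boto3.client("application-autoscaling")
    return client.put_scaling_policy(**kwargs)


# Hypothetical cluster and service names.
policy = ecs_target_tracking_kwargs("prod", "inference-api")
print(policy["ResourceId"])  # service/prod/inference-api
```

Scaling out faster than scaling in, as the cooldowns here do, is a common choice for latency-sensitive inference, though the right values depend on how long new tasks take to become ready.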
Which AWS Services Can Scale Automatically Without Intervention?
Lambda, Aurora Serverless v2, DynamoDB on-demand, and S3 scale automatically without manual capacity steps. Fargate removes instance management, but the ECS service running on it still needs service auto scaling configured to change its task count, and EC2 and SageMaker endpoints likewise need auto scaling policies. Bedrock is managed but bound by per-account RPM and TPM quotas, so it scales within its limits and not beyond them.
What Are the Challenges of Scaling AI?
Three things make AI workloads harder to scale than traditional compute: inference is bursty and concurrency-sensitive, token-based quotas constrain throughput at the model level, and costs grow non-linearly with prompt size and model choice. Planning for averages instead of p99 latency is the most common mistake.
Can We Use Auto Scaling Without a Load Balancer?
Yes. Auto Scaling Groups can scale on any CloudWatch metric, including SQS queue depth, custom application metrics, or a schedule. Load balancers are common for web traffic because target tracking on request count or response time is convenient, but they are not required for Auto Scaling to function.
How Long Does It Take to Re-Architect an AI Workload for Scale on AWS?
It depends on the starting point and the scope, but a focused build that targets the single biggest capacity bottleneck is typically a matter of weeks rather than months on AWS, because most of the heavy lifting is managed services rather than custom infrastructure. Vela Health, for example, completed a full graduation to production-ready in five weeks.
Can Avahi Fund the Work to Scale My AI Workload on AWS?
Yes, potentially. As an AWS Premier Tier Services Partner, Avahi can fund a scoped proof of concept that addresses your highest-priority scaling bottleneck first. Eligible companies may receive a funded PoC depending on the project.
