Traditional internet architecture solved latency with caching. Static content, images, JavaScript bundles—all pushed to edge nodes milliseconds from users. CDNs achieve 95-99% cache hit rates. The compute stays centralized; the content moves to the edge.

AI breaks this model completely.

Every inference requires real GPU cycles. You can’t cache a conversation. You can’t pre-compute a response to a question that hasn’t been asked. The token that completes a sentence depends on every token before it.

This means AI latency is dominated by one thing: distance to compute.

The Physics of Latency

Light travels through fiber at roughly two-thirds of its speed in a vacuum, about 200,000 km/s, and real-world routing adds overhead on top of that. A useful planning approximation: 133 km per millisecond of one-way latency, or about 665 km of reach per 10ms of round trip.

But round-trip time (RTT) is what matters. Every API call, every inference request, every token streamed back to a user requires a round trip. The user waits for the full cycle.

This gives us latency rings—concentric zones of user experience radiating from each data center:

  RTT     Radius        Coverage
  10ms    ~665 km       Same metro + adjacent cities
  20ms    ~1,330 km     Regional (Northeast corridor, West Coast)
  40ms    ~2,660 km     Coast-to-coast, barely
  60ms    ~3,990 km     Continental US + Canada/Mexico
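
As a quick sanity check, here’s a minimal sketch of that rule of thumb in Python; the 133 km-per-millisecond constant is the approximation above, not a measured figure:

```python
KM_PER_ONEWAY_MS = 133  # ~200 km/ms in fiber, discounted for real-world routing (approximation from the text)

def coverage_radius_km(rtt_ms: float) -> float:
    """Approximate straight-line reach for a given round-trip latency budget."""
    return (rtt_ms / 2) * KM_PER_ONEWAY_MS

for rtt in (10, 20, 40, 60):
    print(f"{rtt:>2}ms RTT -> ~{coverage_radius_km(rtt):,.0f} km")
# 10ms RTT -> ~665 km
# 20ms RTT -> ~1,330 km
# 40ms RTT -> ~2,660 km
# 60ms RTT -> ~3,990 km
```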

A data center in Northern Virginia covers NYC, Philadelphia, Pittsburgh, and Boston within 20ms. But Miami? That’s 35-40ms minimum. Phoenix? Over 50ms.

Humans Detect Latency

This isn’t abstract. Humans are latency detectors.

Research shows we perceive delays starting at around 100-120ms. Pauses over 200ms feel unnatural. Beyond 500ms, users get anxious. Past one second, the interaction feels broken.

And these thresholds stack.

Consider a voice AI assistant:

  • Audio to telephony network: 20-50ms
  • Network to cloud: 15-40ms (depends on distance)
  • Speech-to-text: 100-200ms
  • Model inference: 50-500ms
  • Text-to-speech: 100-200ms
  • Audio back to user: 35-75ms

Total: 320ms to over a second. And that’s the happy path—average latency under normal conditions.
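
Here’s a small sketch that sums those stages; the ranges are the illustrative figures listed above, not measurements from a specific stack:

```python
# Sum the per-stage ranges from the voice pipeline above. The figures are the
# article's illustrative ranges, not measurements from a particular system.
stages_ms = {
    "audio to telephony network": (20, 50),
    "network to cloud":           (15, 40),
    "speech-to-text":             (100, 200),
    "model inference":            (50, 500),
    "text-to-speech":             (100, 200),
    "audio back to user":         (35, 75),
}

best  = sum(lo for lo, _ in stages_ms.values())
worst = sum(hi for _, hi in stages_ms.values())
print(f"best case: {best}ms, worst case: {worst}ms")  # best case: 320ms, worst case: 1065ms
```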

Tail Latency Is What Breaks Things

Average latency is a vanity metric. What matters is P95 and P99—the experience of your unluckiest users.

A system with 200ms average latency might show:

  • P95: 800ms
  • P99: 2+ seconds
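
If you only track averages today, percentiles are cheap to add. A minimal sketch with synthetic, heavy-tailed data (the log-normal distribution here is purely for illustration):

```python
import math
import random

# Synthetic request latencies in ms with a heavy tail -- purely illustrative.
random.seed(0)
latencies = sorted(random.lognormvariate(5.0, 0.8) for _ in range(10_000))

def percentile(sorted_samples, p):
    """Nearest-rank percentile of an already-sorted sample."""
    rank = max(1, math.ceil(p / 100 * len(sorted_samples)))
    return sorted_samples[rank - 1]

mean = sum(latencies) / len(latencies)
print(f"mean: {mean:.0f} ms   p95: {percentile(latencies, 95):.0f} ms   p99: {percentile(latencies, 99):.0f} ms")
```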

Amazon found every 100ms of latency costs 1% in sales. But that was measuring averages. Tail latency damage is worse:

  • 7% conversion drop per 100ms at P95
  • 40% more hang-ups when voice agents exceed 1 second
  • 86% of users leave after two bad experiences

When your compute is concentrated in a handful of locations, tail latency isn’t an edge case. It’s a guarantee for users far from those locations.

The Concentration Problem

Here’s what the data shows: 95% of AI compute runs in just four locations. In the US, that’s effectively Northern Virginia (us-east-1), Oregon (us-west-2), and one or two others depending on provider.

This happened for good reasons. These regions have:

  • Cheap power
  • Existing fiber infrastructure
  • Regulatory familiarity
  • Established operations teams

But the result is a latency map where most of America sits in yellow and brown zones—30-50ms+ from the nearest GPU cluster. For real-time AI applications, that’s the difference between “magical” and “broken.”

The concentration is getting worse, not better. In one analysis of $60M weekly cloud spend, us-east-1 represented 49.9% of infrastructure—up 16.1% from the previous period. Fear of operational complexity is driving consolidation, even when the economics argue for distribution.

Making Trade-offs Visible

Every infrastructure placement decision is a business decision.

When a capacity planner chooses us-west-2 over deploying in Texas, they’re choosing which populations get responsive AI and which don’t. When they concentrate training and inference in one region, they’re accepting latency for some users in exchange for operational simplicity.

These aren’t wrong decisions—they’re trade-offs. The problem is they’re usually invisible.

Atlas exists to make them visible.

By mapping cloud regions against population centers and overlaying latency rings, you can see exactly what each placement decision means:

  • A 10ms latency difference affects 50 million users
  • A 5% efficiency gap at scale costs $3M monthly
  • That “temporary” cluster in us-east-1 that nobody wants to touch? It’s serving 41M people within the 10ms ring

When you visualize $60M/week in cloud spend through this lens, patterns emerge. The $78M H100 training cluster that “made sense at the time” is now 45% of total spend and geographically suboptimal. The us-east-1 dominance isn’t strategic—it’s inertia.
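
To make the overlay idea concrete, here’s a rough sketch: bucket each metro into a ring by great-circle distance to the nearest data center, using the reach-per-millisecond rule above. The coordinates, populations, and region list are illustrative placeholders, and straight-line distance understates real routed paths:

```python
from math import radians, sin, cos, asin, sqrt

KM_PER_RTT_MS = 66.5  # ~133 km per one-way millisecond => ~66.5 km per round-trip millisecond

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Illustrative placeholders: data center coordinates and metro (lat, lon, population).
data_centers = {
    "us-east-1 (N. Virginia)": (39.0, -77.5),
    "us-west-2 (Oregon)":      (45.6, -121.2),
}
metros = {
    "New York": (40.7, -74.0, 19_500_000),
    "Miami":    (25.8, -80.2, 6_100_000),
    "Phoenix":  (33.4, -112.1, 5_000_000),
    "Seattle":  (47.6, -122.3, 4_000_000),
}
rings_ms = (10, 20, 40, 60)

for metro, (mlat, mlon, pop) in metros.items():
    # Best-case RTT to the nearest data center via the distance rule of thumb.
    rtt = min(haversine_km(mlat, mlon, dc_lat, dc_lon) / KM_PER_RTT_MS
              for dc_lat, dc_lon in data_centers.values())
    ring = next((r for r in rings_ms if rtt <= r), None)
    label = f"{ring}ms ring" if ring else "beyond 60ms"
    print(f"{metro}: ~{rtt:.0f}ms RTT -> {label} ({pop:,} people)")
```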

The Latency Ring Mental Model

Latency rings provide a simple framework for capacity decisions:

10ms ring (~665 km): Premium user experience. Real-time voice, interactive AI, gaming. Only users in the same metro or adjacent cities get this.

20ms ring (~1,330 km): Good experience. Conversational AI works well. Most users won’t notice latency. Regional coverage: the Northeast corridor from a Virginia DC, the West Coast from Oregon.

40ms ring (~2,660 km): Acceptable for most applications. Non-real-time inference, batch processing, less latency-sensitive use cases. Barely covers coast-to-coast.

60ms+ ring: Degraded experience. Users notice delays. Voice AI feels sluggish. Only acceptable when there’s no alternative.

When evaluating a new region or provider, the question isn’t “what’s the cost per GPU-hour?” It’s “which populations move into a better latency ring, and what’s that worth?”

Connecting Latency to Economics

The real insight is connecting latency rings to economic value.

Population within each ring tells you user reach. But GDP within each ring tells you revenue potential. A data center covering 40M people in the Northeast corridor (high GDP per capita, high enterprise density) delivers different business value than one covering 40M people across rural regions.

Atlas correlates cloud spend with Census population data and BEA economic data to answer questions like:

  • What’s the GDP-weighted latency of our current infrastructure?
  • Which underserved populations would benefit most from a new region?
  • Is the premium for a California region justified by the user base it serves?

This turns infrastructure planning from “gut feel plus spreadsheets” into quantified trade-off analysis.
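
As a sketch of what a GDP-weighted latency number could look like, with placeholder figures rather than actual Census or BEA data:

```python
# GDP-weighted latency: each metro's RTT counts in proportion to its share of GDP.
# All figures below are illustrative placeholders, not Census or BEA data.
metros = [
    # (name, best RTT to current infrastructure in ms, annual GDP in $B)
    ("New York",  8, 2_100),
    ("Chicago",  25,   830),
    ("Miami",    38,   420),
    ("Phoenix",  52,   280),
]

total_gdp = sum(gdp for _, _, gdp in metros)
gdp_weighted_ms = sum(rtt * gdp for _, rtt, gdp in metros) / total_gdp
unweighted_ms = sum(rtt for _, rtt, _ in metros) / len(metros)  # for contrast
print(f"unweighted mean: {unweighted_ms:.1f} ms, GDP-weighted: {gdp_weighted_ms:.1f} ms")
```

With these placeholder numbers the GDP weighting pulls the average down, because the highest-GDP metro is already the best served; the interesting cases are the ones where the weighting pushes it the other way.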

What This Means for Capacity Planning

If you’re running AI infrastructure at scale, latency rings change how you think about several problems:

Region selection: Don’t just compare costs. Compare latency-weighted user coverage. A 20% cost premium for a region that moves 30M users from the 40ms ring to the 20ms ring might be the right trade-off.
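
A back-of-the-envelope version of that comparison, with every number below an assumption for illustration:

```python
# Is a 20% regional cost premium worth moving 30M users from the 40ms ring
# to the 20ms ring? Every figure below is an assumption for illustration.
users_moved          = 30_000_000   # users upgraded from the 40ms ring to the 20ms ring
value_per_user_month = 0.05         # assumed incremental value of the better ring, $/user/month
current_cost_month   = 4_000_000    # current monthly spend in the incumbent region, $
cost_premium         = 0.20         # candidate region costs 20% more

added_cost  = current_cost_month * cost_premium
added_value = users_moved * value_per_user_month
print(f"added cost: ${added_cost:,.0f}/mo  latency value: ${added_value:,.0f}/mo  "
      f"net: ${added_value - added_cost:,.0f}/mo")
```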

Multi-region strategy: Latency rings reveal coverage gaps. If your current regions leave major metros in the 40ms+ zone, you’re leaving user experience on the table.

Commitment optimization: Reserved instances and committed use discounts lock you into regions. Latency ring analysis helps you commit in the right places, not just the cheapest ones.

Capacity stranding: Unused quotas in well-positioned regions are worse than unused quotas in poorly-positioned ones. Prioritize filling capacity where the latency rings matter.

The Bottom Line

AI compute can’t be cached. Every inference needs real GPU cycles. That makes data center placement a user experience decision, not just an operations decision.

Latency rings make the trade-offs visible. When you see that your infrastructure leaves 100M users in degraded latency zones while concentrating spend in locations that made sense five years ago, the path forward becomes clearer.

Infrastructure decisions are business decisions. The companies that treat them that way—quantifying latency impact, correlating with economic value, making trade-offs explicit—will deliver better AI experiences than those still optimizing for cost alone.

A 10ms improvement doesn’t show up in your cloud bill. It shows up in user retention, conversion rates, and whether your AI feels magical or broken.