Jan 21, 2026

How to A/B Test New LLMs Using Shadow Traffic in Production

Amartya | CodeAnt AI Code Review Platform
Sonali Sood

Founding GTM, CodeAnt AI

You've tested your new LLM in staging. The benchmarks look great. But the moment you flip the switch in production, users start complaining about weird responses, and you're scrambling to roll back.

The gap between "works in testing" and "works for real users" is where shadow traffic and A/B testing come in. This guide walks through how to set up both methods, when to use each, and how to combine them for safe, confident LLM deployments.

What is Shadow Traffic Testing for LLMs

To safely test a new LLM in production, you first shadow deploy the candidate model by sending it live user requests without showing its output. Your production model keeps serving users normally. Meanwhile, the shadow model processes the same requests in parallel, and you log its responses for later analysis.

Think of it like a dress rehearsal. The new model performs with real data, but the audience (your users) only sees the main act. You collect metrics on quality, latency, cost, and accuracy from both models, then compare them side by side.

Shadow traffic testing gives you real-world validation without any user risk:

  • No user impact: Shadow responses stay hidden; users only see production output

  • Real-world data: Tests run against actual traffic patterns, not synthetic benchmarks

  • Risk-free evaluation: Compare new models against your baseline before any exposure

This approach works especially well when switching LLM providers, testing major prompt changes, or validating cost and latency tradeoffs.

What is A/B Testing for LLMs

A/B testing takes things a step further by actually exposing users to different variants. You split live traffic between a control variant (your current production model) and a treatment variant (the new candidate). Unlike shadow testing, users receive and interact with responses from both variants.

The goal here is measuring real user behavior. Click-through rates, satisfaction scores, task completion, and engagement all come into play. Shadow testing tells you if a model can perform well. A/B testing tells you if users prefer it.

| Factor | Shadow Traffic | A/B Testing |
| --- | --- | --- |
| User exposure | None | Partial |
| Measures real user behavior | No | Yes |
| Risk level | Low | Medium |
| Best for | Pre-deployment validation | Live optimization |

When to Use Shadow Traffic and A/B Tests

The right testing method depends on your goals and risk tolerance. Often, you'll use both in sequence.

Switching LLM providers or models

Shadow traffic is ideal when evaluating a new provider, say, moving from OpenAI to Anthropic. You validate response quality at scale without exposing users to potential regressions. Once shadow metrics look promising, you can proceed to A/B testing.

Testing prompt changes at scale

For prompt variations, the choice depends on what you're measuring. Shadow tests reveal output quality differences before going live. A/B tests measure how users actually respond to the differences.

Evaluating cost and latency tradeoffs

Shadow traffic reveals cost-per-request and latency differences between models without affecting your production SLAs. This is particularly useful when comparing models with different pricing structures or response times.

Validating behavior under production load

Both methods help stress-test new models with real traffic volumes. Shadow testing validates technical performance. A/B testing validates user experience under load.

Shadow Traffic vs A/B Testing for LLM Deployment

Understanding when to use each method, and how to combine them, is key to safe LLM deployment.

When shadow traffic works best

Shadow traffic works best when you require zero user risk. New model validation, provider migrations, and compliance-sensitive environments all fit this category. If you can't afford unexpected outputs reaching users, shadow testing is your starting point.

When A/B testing works best

A/B testing is the right choice when you need real user feedback signals. If your success metrics depend on user satisfaction, conversion rates, or task completion, shadow testing alone won't give you the full picture.

How to combine both methods

A phased approach reduces deployment risk significantly. First, run a shadow test to validate quality and performance. Once the shadow model meets your thresholds, proceed with an A/B test to measure real user impact. This combination catches both technical regressions and user experience issues.

How to Set Up Shadow Traffic for LLM Testing

Here's a practical walkthrough for implementing shadow traffic in your LLM pipeline.

1. Design your traffic mirroring architecture

Use a proxy or gateway pattern to intercept production requests and duplicate them to the shadow model. API gateways like Kong or service meshes like Istio work well here. The key is ensuring the shadow path doesn't block or slow the production path.
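
Gateway-level mirroring in Kong or Istio is mostly configuration work; if you would rather mirror at the application layer, the sketch below shows the basic shape in Python. The model-client functions are placeholders, not a specific provider SDK.

```python
import logging
import uuid
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("shadow")
_shadow_pool = ThreadPoolExecutor(max_workers=4)

# Hypothetical model clients -- replace with your real provider SDK calls.
def call_production_model(prompt: str) -> str:
    return "production response"

def call_shadow_model(prompt: str) -> str:
    return "shadow response"

def handle_request(prompt: str) -> str:
    """Serve the user from production and mirror the request to the shadow model."""
    request_id = str(uuid.uuid4())
    production_response = call_production_model(prompt)

    # Fire-and-forget: the shadow call runs on a worker thread so it can
    # never block or slow the user-facing response.
    _shadow_pool.submit(_shadow_call, request_id, prompt)

    return production_response

def _shadow_call(request_id: str, prompt: str) -> None:
    try:
        shadow_response = call_shadow_model(prompt)
        log.info("shadow request_id=%s response=%s", request_id, shadow_response)
    except Exception:
        # Shadow failures are logged, never surfaced to users.
        log.exception("shadow call failed request_id=%s", request_id)
```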

2. Configure async request routing

Shadow requests can't block production responses. Use asynchronous calls or message queues (RabbitMQ, Kafka, or cloud-native equivalents) to decouple shadow processing from the primary request-response path. This separation keeps your production latency unaffected.
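
Here is a minimal sketch of that decoupling, using an in-process asyncio queue as a stand-in for RabbitMQ or Kafka; the `call_shadow_model` argument is a placeholder for whatever async client you use.

```python
import asyncio
import json

# In production you would publish to RabbitMQ/Kafka or a cloud queue; an
# asyncio.Queue stands in here so the sketch stays self-contained.
shadow_queue: asyncio.Queue = asyncio.Queue(maxsize=10_000)

def enqueue_shadow_request(request_id: str, prompt: str) -> None:
    """Called from the hot path: hand the request off and return immediately."""
    try:
        shadow_queue.put_nowait({"request_id": request_id, "prompt": prompt})
    except asyncio.QueueFull:
        # Dropping a shadow request is fine; slowing a user request is not.
        pass

async def shadow_worker(call_shadow_model) -> None:
    """Runs outside the request/response path and drains the queue."""
    while True:
        item = await shadow_queue.get()
        try:
            response = await call_shadow_model(item["prompt"])
            print(json.dumps({"request_id": item["request_id"],
                              "shadow_response": response}))
        finally:
            shadow_queue.task_done()
```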

3. Store shadow responses for analysis

Log all shadow model outputs with request IDs for later comparison. Include metadata like timestamps, token counts, model version, and any relevant context. This data becomes your evaluation dataset.
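
A rough sketch of what such a record might look like, written as append-only JSONL keyed by request ID; the field names are illustrative, not a required schema.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ShadowRecord:
    request_id: str
    model_version: str
    prompt: str
    response: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    timestamp: float

def log_shadow_record(record: ShadowRecord, path: str = "shadow_responses.jsonl") -> None:
    # Append-only JSONL keyed by request_id, so shadow output can later be
    # joined against the production log for side-by-side comparison.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

# Example usage with placeholder values:
log_shadow_record(ShadowRecord(
    request_id="req-123",
    model_version="candidate-v2",
    prompt="Summarize this ticket...",
    response="...",
    prompt_tokens=412,
    completion_tokens=96,
    latency_ms=820.5,
    timestamp=time.time(),
))
```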

4. Compare shadow results against production

After collecting sufficient data, run evaluation scripts to compare outputs. Use automated scorers for metrics like relevance, coherence, and safety. Human evaluation adds another layer for subjective quality assessment.
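
A minimal comparison script might join the two logs on request ID and aggregate whatever scorers you have chosen; `score_response` below is a placeholder for your real relevance, coherence, and safety evaluators.

```python
import json
from statistics import mean

def load_jsonl(path: str) -> dict:
    """Index a JSONL log by request_id."""
    records = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            records[rec["request_id"]] = rec
    return records

def score_response(prompt: str, response: str) -> float:
    # Placeholder automated scorer -- swap in your own evaluators
    # or an LLM-as-a-judge call.
    return float(len(response) > 0)

def compare(production_path: str, shadow_path: str) -> None:
    prod = load_jsonl(production_path)
    shadow = load_jsonl(shadow_path)
    paired = [(prod[rid], shadow[rid]) for rid in prod.keys() & shadow.keys()]

    print("paired requests:", len(paired))
    print("prod mean latency ms:  ", mean(p["latency_ms"] for p, _ in paired))
    print("shadow mean latency ms:", mean(s["latency_ms"] for _, s in paired))
    print("prod mean score:  ", mean(score_response(p["prompt"], p["response"]) for p, _ in paired))
    print("shadow mean score:", mean(score_response(s["prompt"], s["response"]) for _, s in paired))
```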

Tip: Start with a small percentage of traffic (5-10%) to validate your shadow infrastructure before scaling up.

How to Run A/B Tests on LLM Prompts and Models

Once shadow testing builds confidence, A/B testing validates real-world user impact.

1. Define control and treatment variants

Document the exact differences between variants. This could be model version, prompt wording, temperature settings, or system instructions. Clear documentation prevents confusion during analysis.

2. Set up traffic splitting

Use feature flags or experiment platforms (LaunchDarkly, Eppo, PostHog) to route users to different variants. Random, unbiased assignment is essential for reliable results. Sticky assignment ensures users see consistent behavior within a session.
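
Experiment platforms handle assignment for you, but the underlying idea is simple enough to sketch: hash the user ID so assignment is uniform across users yet sticky for each user. The function below is illustrative, not tied to any particular platform.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float = 0.5) -> str:
    """Deterministic, sticky assignment: the same user always lands in the same
    variant for a given experiment, with no server-side state."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "treatment" if bucket < treatment_pct else "control"

# Example: route 10% of users to the candidate model.
variant = assign_variant(user_id="user-42",
                         experiment="new-model-rollout",
                         treatment_pct=0.10)
```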

3. Add evaluation scorers

Define what "better" means for your use case. Automated metrics like response relevance and coherence provide scale. Human evaluation catches nuances that automated scorers miss.

4. Run the experiment and collect data

Let the experiment run long enough for statistical significance. Resist the temptation to peek at results or stop early. Premature conclusions lead to false positives.

5. Analyze results and determine winners

Compare metrics across variants. Look for statistically significant differences before declaring a winner. Consider both primary metrics (user satisfaction) and guardrail metrics (latency, cost, error rates).
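
For rate-style metrics such as thumbs-up rate or task completion, a two-proportion z-test is one common way to check significance. A self-contained sketch, with made-up example numbers:

```python
from math import erf, sqrt

def two_proportion_z_test(success_a: int, total_a: int,
                          success_b: int, total_b: int) -> float:
    """Two-sided p-value for a difference in conversion-style rates
    between control (a) and treatment (b)."""
    p_a, p_b = success_a / total_a, success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Example: 540/5000 positive ratings on control vs 610/5000 on treatment.
p_value = two_proportion_z_test(540, 5000, 610, 5000)
print(f"p-value: {p_value:.4f}")  # below 0.05 suggests a real difference
```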

Key Metrics for LLM Performance Evaluation

Tracking the right metrics determines whether your testing actually tells you something useful.

Response quality and accuracy

Assess correctness, relevance, and helpfulness using automated scorers and human review. LLM-as-a-judge approaches (using one model to evaluate another) scale well but benefit from human calibration.
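
A minimal LLM-as-a-judge scorer can be as simple as a grading prompt plus strict output parsing; `call_judge_model` below is a placeholder for whatever client you use to call the judging model.

```python
JUDGE_PROMPT = """You are grading an AI assistant's answer.

Question:
{question}

Answer:
{answer}

Rate the answer's correctness, relevance, and helpfulness on a 1-5 scale.
Reply with only the number."""

def judge_response(question: str, answer: str, call_judge_model) -> int:
    """LLM-as-a-judge scorer. `call_judge_model` is a placeholder for your
    judging-model client (ideally a different model than the one under test)."""
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return max(1, min(5, int(raw.strip())))
    except ValueError:
        # Treat unparseable judgments as failures, then spot-check them by hand.
        return 1
```

Periodically compare a sample of judge scores against human ratings to keep the judge calibrated.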

Latency and throughput

Measure time-to-first-token and total response time. Both directly impact user experience, especially in interactive applications. Throughput matters for high-volume use cases.

Cost per request

Track token usage and API costs per request. Compare across models to identify cost-efficient options. A model that's 10% better but 3x more expensive might not be the right choice for your use case.
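
Cost per request is straightforward to compute from logged token counts; the per-1K-token prices below are made-up examples, so substitute your provider's current rates.

```python
def cost_per_request(prompt_tokens: int, completion_tokens: int,
                     input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the API cost of one request from token counts.
    Prices are illustrative -- plug in your provider's current per-1K-token rates."""
    return (prompt_tokens / 1000) * input_price_per_1k + \
           (completion_tokens / 1000) * output_price_per_1k

# Example with made-up rates: a 1,200-token prompt and a 300-token completion.
print(cost_per_request(1200, 300,
                       input_price_per_1k=0.0025,
                       output_price_per_1k=0.01))
# -> 0.006, i.e. roughly $0.006 per request
```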

Error rates and fallback frequency

Monitor failed requests, timeouts, and fallback triggers. A model with great average performance but frequent failures creates a poor user experience.

User satisfaction signals

For A/B tests, capture direct signals: thumbs up/down, regenerate clicks, explicit feedback, and task completion rates. User satisfaction signals are your ground truth for user preference.

How to Transition From Testing to Full Deployment

Testing is only valuable if it leads to confident deployment decisions.

1. Set promotion criteria before testing

Define clear thresholds before starting any test. What latency is acceptable? What quality score is required? What cost increase is tolerable? Pre-defined criteria prevent post-hoc rationalization.

2. Use feature flags for gradual rollout

Start with 1% of traffic, then 5%, then 10%. Gradual rollout catches issues that only appear at scale. Feature flags make rollback instant if problems emerge.

3. Monitor metrics post-rollout

Continue watching key metrics after promotion. Production behavior sometimes differs from test behavior. Early detection prevents major incidents.

4. Implement automated rollback triggers

Set up alerts that trigger automatic rollback if critical metrics degrade. If error rates spike or latency doubles, you want immediate action, not a page at 3 AM.
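
A rollback trigger can be a small check your monitoring loop runs on each evaluation window; the thresholds below are illustrative and should come from the promotion criteria you defined earlier.

```python
def should_roll_back(metrics: dict, baseline: dict) -> bool:
    """Return True when the candidate's live metrics degrade past preset limits.
    Thresholds here are examples -- derive them from your promotion criteria."""
    if metrics["error_rate"] > max(0.02, 2 * baseline["error_rate"]):
        return True
    if metrics["p95_latency_ms"] > 2 * baseline["p95_latency_ms"]:
        return True
    if metrics["cost_per_request"] > 1.5 * baseline["cost_per_request"]:
        return True
    return False

# Wire this into your monitoring loop: when it returns True, flip the feature
# flag back to the control model first, then notify the on-call.
if should_roll_back(
    metrics={"error_rate": 0.05, "p95_latency_ms": 4200, "cost_per_request": 0.004},
    baseline={"error_rate": 0.01, "p95_latency_ms": 1800, "cost_per_request": 0.003},
):
    print("rolling back to control model")
```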

Best Practices for Production LLM Testing

A few principles make the difference between testing that works and testing that misleads.

Isolate shadow traffic from user-facing latency

Never let shadow processing slow production responses. Asynchronous processing is non-negotiable. If your shadow infrastructure adds latency, users pay the price.

Use representative datasets

Test against traffic that reflects real-world usage. Cherry-picked examples or synthetic data can hide problems that only appear with actual user queries.

Account for LLM non-determinism

LLMs produce variable outputs for identical inputs. Run multiple evaluations per input to capture variance. A single comparison can be misleading.
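
One way to account for this is to sample each input several times and report the spread, not just the mean; `call_model` and `score` below are placeholders for your client and scorer.

```python
from statistics import mean, stdev

def evaluate_with_variance(prompt: str, call_model, score, n_samples: int = 5) -> dict:
    """Sample the same prompt several times and summarize the score spread.
    `call_model` and `score` are placeholders for your model client and scorer."""
    scores = [score(prompt, call_model(prompt)) for _ in range(n_samples)]
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if n_samples > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

# A wide stdev (or a low min) means a single shadow or A/B comparison on this
# prompt would be unreliable -- weigh conclusions by the spread, not the mean alone.
```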

Set cost budgets for parallel requests

Shadow traffic doubles your API costs. Set spending limits and monitor usage closely. A runaway shadow test can blow through your budget quickly.

Log everything for debugging

Capture full request/response pairs, model versions, and evaluation scores. When something goes wrong, and it will, comprehensive logs make debugging possible.

Ship LLM Changes Faster With Confidence

Teams that combine shadow traffic and A/B testing catch regressions before users notice them. The phased approach (shadow first, then A/B) balances speed with safety.

As you ship LLM-powered features, the code around your AI matters just as much as the models themselves. CodeAnt AI helps engineering teams maintain code quality and security across the entire development lifecycle, ensuring your LLM integrations stay reliable as they scale. To learn more, try it yourself with a 14-day free trial.

FAQs

Can shadow traffic tests run on open-source LLMs hosted locally?

How do A/B tests handle conversation history in multi-turn LLM applications?

What happens if the shadow model outperforms production during the test?

Does shadow traffic testing require separate infrastructure?
