AI Code Review
Jan 13, 2026
How to Safely Test New LLMs in Production Using Shadow Traffic and A/B Testing

Sonali Sood
Founding GTM, CodeAnt AI
Swapping out an LLM in production feels a lot like changing the engine on a plane mid-flight. One wrong move and your users notice immediately: degraded responses, higher latency, or worse, hallucinations that erode trust.
Shadow traffic and A/B testing give you a way to validate new models on real-world inputs before committing. This guide walks through how to set up both approaches, which metrics to track, and how to move from testing to a safe production rollout.
What is Shadow Traffic Testing for LLMs?
Shadow traffic testing lets you evaluate a new LLM on real production requests without affecting users. You duplicate live traffic to the candidate model, log its responses, and compare them to your current production model. Users only see outputs from the existing model, so there's zero risk to their experience.
Think of it like a dress rehearsal. The new model processes everything the production model does, but its responses never reach users. Instead, you capture and analyze them offline.
Here's how the mechanics work:
Traffic duplication: Real requests go to both production and candidate models at the same time
Silent evaluation: Candidate responses are logged but never shown to users
Comparative analysis: You compare responses offline to assess quality, latency, and accuracy
This approach gives you real-world data instead of synthetic test sets. You see exactly how the candidate handles the messy, unpredictable inputs your users actually send.
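Below is a minimal sketch of that duplication pattern at the application layer. The `call_model` wrapper is a hypothetical stand-in for your LLM client; the key idea is that the user only ever waits on the production call, while the candidate call runs in the background and its output is only logged.

```python
import asyncio
import json
import time

async def call_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's API client."""
    await asyncio.sleep(0)  # stand-in for the real network call
    return f"[{model_name}] response to: {prompt}"

async def log_shadow_result(prompt: str, candidate_output: str, latency_ms: float) -> None:
    """Persist the candidate's output for offline comparison; it is never shown to users."""
    print(json.dumps({"prompt": prompt, "candidate": candidate_output, "latency_ms": latency_ms}))

async def shadow_call(prompt: str) -> None:
    """Call the candidate silently; failures here must never affect the user path."""
    try:
        start = time.monotonic()
        output = await call_model("candidate-model", prompt)
        await log_shadow_result(prompt, output, (time.monotonic() - start) * 1000)
    except Exception:
        pass  # shadow traffic is observational only, so swallow errors

async def handle_request(prompt: str) -> str:
    production_output = await call_model("production-model", prompt)  # users see only this
    asyncio.create_task(shadow_call(prompt))  # fire-and-forget duplicate to the candidate
    return production_output

async def demo() -> None:
    print(await handle_request("Summarize this ticket"))
    await asyncio.sleep(0.1)  # give the background shadow task time to finish in this demo

asyncio.run(demo())
```

In a real service the same pattern usually lives in a gateway or middleware layer, but the invariant is identical: the candidate call is asynchronous, error-isolated, and write-only.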
What is A/B Testing for LLM Prompts and Models?
A/B testing splits live user traffic between a control (your current production model) and a treatment (the new variant). Unlike shadow traffic, users actually receive responses from the candidate model.
You might wonder why you'd take that risk. The answer is that some signals only show up when users interact with the output. Task completion rates, follow-up questions, and thumbs-up ratings reveal whether users prefer the new model's responses. Shadow testing can't capture that.
Key characteristics:
Live exposure: Users receive responses from either control or treatment
Random assignment: Traffic split ensures unbiased comparison
Direct measurement: Captures actual user behavior and satisfaction signals
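One common way to implement the random assignment is to hash a stable user identifier into a bucket, so the same user always lands in the same arm across requests. A small sketch (the salt and function name are illustrative):

```python
import hashlib

def assign_variant(user_id: str, treatment_pct: int = 50, salt: str = "llm-ab-test-1") -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing user_id plus a per-experiment salt keeps assignment stable across
    requests and servers; changing the salt reshuffles users for the next test.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in [0, 99]
    return "treatment" if bucket < treatment_pct else "control"

print(assign_variant("user-42"))  # same user always gets the same variant
```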
Shadow Traffic vs A/B Testing for LLMs
Key Differences in How Each Method Works
The core distinction: shadow testing is observational (no user impact), while A/B testing is experimental (real user exposure).
| Factor | Shadow Traffic Testing | A/B Testing |
| --- | --- | --- |
| User impact | None (users see only production responses) | Direct (users see candidate responses) |
| Risk level | Zero risk to user experience | Controlled risk with live exposure |
| What you measure | Model performance, latency, cost | User behavior, satisfaction, outcomes |
| Best for | Initial validation of new models | Final validation before full rollout |
How to Choose the Right Approach for Your Use Case
Shadow traffic comes first when risk is high or the new LLM is untested. A/B testing follows when you want real user feedback and the candidate has passed shadow validation.
Many teams run both in sequence. Shadow first to catch obvious regressions, then A/B to validate user experience before full rollout.
When to Use Shadow Traffic and A/B Testing
Deploying a New LLM Model
Swapping models, whether migrating from one provider to another or upgrading to a newer version, requires validation. Shadow traffic helps assess output quality differences before any user exposure. You'll catch regressions in tone, accuracy, or formatting that benchmarks might miss.
Changing Prompts or System Instructions
Prompt engineering changes can have unpredictable effects. A/B testing works well here since prompt changes are typically lower risk than full model swaps. You can measure user response directly and iterate quickly.
Optimizing for Cost or Latency
Teams often test smaller or faster models to reduce costs. Shadow traffic lets you measure latency and token usage on real workloads without impacting users. You might find that a "cheaper" model actually costs more per task due to lower token efficiency.
Validating Safety and Compliance Requirements
Regulated industries require evidence that new LLMs meet safety, security, and compliance standards. Shadow testing generates audit trails and logs for compliance review before production rollout.
How to Set Up Shadow Traffic for LLM Testing
1. Define Your Baseline and Candidate Models
The baseline is your current production model. The candidate is the new model you want to evaluate. Be explicit about which model version, provider, or prompt configuration you're comparing. Document everything; you'll thank yourself later.
2. Configure Traffic Routing Without User Impact
Set up your infrastructure to duplicate requests. The production model serves users while the candidate receives the same input asynchronously. You can implement this at the API gateway, load balancer, or application layer.
3. Set Up Logging and Response Capture
Capture both responses (production and candidate) with metadata like latency, token count, and timestamps. Structure your logs for easy comparison and analysis. Apply the same data handling and encryption policies you use for production data.
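One way to structure those logs is one JSON line per response, keyed by a shared request ID so production and candidate outputs can be joined later. The field names below are illustrative, not a required schema:

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict

@dataclass
class ShadowLogRecord:
    request_id: str          # shared by the production and candidate records
    timestamp: float
    model: str               # "production" or "candidate"
    prompt: str
    response: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int

def write_record(record: ShadowLogRecord, path: str = "shadow_log.jsonl") -> None:
    """Append one JSON line per response so the two sides can be joined offline."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

# Example: log both sides of a single shadowed request
rid = str(uuid.uuid4())
write_record(ShadowLogRecord(rid, time.time(), "production", "Summarize this ticket", "...", 420.0, 180, 95))
write_record(ShadowLogRecord(rid, time.time(), "candidate", "Summarize this ticket", "...", 610.0, 180, 120))
```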
4. Run Shadow Tests and Collect Data
Run shadow traffic long enough to capture diverse inputs, including edge cases and peak traffic patterns. Monitor for errors, timeouts, and cost accumulation during the shadow phase. One full business cycle is often a good minimum duration.
How to Run A/B Tests for LLM Prompts and Models
1. Create Control and Treatment Variants
The control is your current production prompt or model. The treatment is the new variant. Document exactly what differs between them: model, prompt text, temperature, or other parameters.
2. Split Live Traffic Between Variants
Randomly assign users or requests to control vs treatment. Start with a small traffic percentage to limit the blast radius if the treatment underperforms. Feature flags make this easy to manage and roll back.
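A minimal sketch of gating the treatment behind a configurable percentage: the environment variable name here is hypothetical, and setting it to 0 acts as an instant rollback without a code deploy.

```python
import hashlib
import os

def treatment_percentage() -> int:
    """Read the rollout percentage from config; 0 disables the treatment entirely."""
    return int(os.environ.get("LLM_TREATMENT_PCT", "5"))  # start small, e.g. 5%

def choose_model(user_id: str) -> str:
    pct = treatment_percentage()
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < pct:
        return "candidate-model"   # treatment arm
    return "production-model"      # control arm, and the only arm once pct is set to 0

print(choose_model("user-42"))
```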
3. Collect and Compare Results in Real Time
Stream results into a dashboard or evaluation platform where you can monitor key metrics side by side. Watch for regressions in quality, latency, or user behavior. Don't wait until the end to look at the data.
4. Determine Statistical Significance
Statistical significance means you have enough data to be confident the difference isn't due to chance. The required sample size depends on the magnitude of the difference you expect to detect. Smaller expected differences require more data.
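For a binary outcome like thumbs-up rate, a two-proportion z-test is a common way to check whether the observed difference is likely real. A rough standard-library sketch, meant as an illustration rather than a replacement for a proper experimentation framework:

```python
import math

def two_proportion_p_value(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two observed proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Example: 52% thumbs-up on 2,000 control requests vs 55% on 2,000 treatment requests
print(f"p-value: {two_proportion_p_value(1040, 2000, 1100, 2000):.3f}")
```

With these example numbers the p-value lands just above the conventional 0.05 cutoff, which usually means you keep collecting data rather than declare a winner.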
Key Metrics for LLM Evaluation During Testing
Latency and Response Time
Latency is the time from request to response. Slow responses degrade user experience and can cause timeouts in downstream systems. First-token latency matters especially for streaming responses.
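A sketch of measuring both first-token and total latency around a streaming call; `stream_completion` is a hypothetical stand-in for whatever streaming client you use:

```python
import time
from typing import Iterator

def stream_completion(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client that yields chunks as they arrive."""
    for chunk in ("Here ", "is ", "a ", "streamed ", "answer."):
        time.sleep(0.05)  # simulated network delay
        yield chunk

def timed_stream(prompt: str) -> dict:
    start = time.monotonic()
    first_token_ms = None
    chunks = []
    for chunk in stream_completion(prompt):
        if first_token_ms is None:
            first_token_ms = (time.monotonic() - start) * 1000  # time to first token
        chunks.append(chunk)
    return {
        "first_token_ms": round(first_token_ms, 1),
        "total_ms": round((time.monotonic() - start) * 1000, 1),
        "response": "".join(chunks),
    }

print(timed_stream("Explain shadow traffic in one sentence"))
```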
Cost per Request
Cost includes API fees, token usage, and compute. Shadow traffic doubles your API spend temporarily, so monitor it closely and set budget alerts.
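Per-request cost is mostly token counts multiplied by per-token prices. A small sketch with made-up prices (substitute your provider's current rates), which also shows how a nominally cheaper model that emits more output tokens per task narrows its own advantage:

```python
# Hypothetical prices in USD per 1M tokens; substitute your provider's current rates.
PRICES = {
    "production-model": {"input": 3.00, "output": 15.00},
    "candidate-model":  {"input": 0.50, "output": 1.50},
}

def cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    price = PRICES[model]
    return (prompt_tokens * price["input"] + completion_tokens * price["output"]) / 1_000_000

print(cost_usd("production-model", 1200, 300))  # 0.0081
print(cost_usd("candidate-model", 1200, 900))   # 0.00195
```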
Output Quality and Accuracy
This measures whether the LLM's response is correct, helpful, and well-formatted. It often requires automated scoring (LLM-as-judge) or human evaluation for a sample of responses.
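A sketch of an LLM-as-judge scorer. The `judge_client` argument is a placeholder for whatever provider SDK you use, and the rubric is deliberately simple:

```python
import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer from 1 (unusable) to 5 (excellent) for correctness and helpfulness.
Respond with JSON only: {{"score": <int>, "reason": "<one short sentence>"}}"""

def judge_answer(question: str, answer: str, judge_client) -> dict:
    """Score a sampled response with a judge model.

    judge_client is a placeholder: any callable that takes a prompt string and
    returns the judge model's raw text completion.
    """
    raw = judge_client(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"score": None, "reason": "judge returned malformed JSON"}

# Example with a stubbed judge, just to show the shape of the output
print(judge_answer("What is 2+2?", "4", lambda p: '{"score": 5, "reason": "Correct and concise."}'))
```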
Hallucination and Safety Scores
Hallucination is the generation of false or unsupported information. Safety scores measure harmful, biased, or inappropriate outputs. Both are critical for compliance and trust.
User Satisfaction Signals
A/B tests can capture thumbs up/down ratings, task completion rates, and follow-up queries. User satisfaction signals reveal whether users prefer the new model's responses, something shadow testing can't measure.
How to Move from LLM Testing to Production Deployment
1. Analyze Results Against Success Criteria
Before promoting a candidate, define clear pass/fail thresholds upfront. Compare shadow or A/B results against your criteria. If you didn't define success criteria before the test, you're making decisions based on intuition rather than data.
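One way to make those criteria explicit is a small threshold table checked mechanically against aggregated results. The metric names and numbers below are examples, not recommendations:

```python
# Example success criteria defined before the test (values are illustrative).
CRITERIA = {
    "quality_score_delta":    {"min": -0.02},  # judge score may drop at most 2 points of score
    "p95_latency_ms_delta":   {"max": 200},    # p95 latency may rise at most 200 ms
    "cost_per_request_delta": {"max": 0.0},    # cost must not increase
    "safety_violation_rate":  {"max": 0.001},  # hard cap regardless of other deltas
}

def passes(results: dict) -> bool:
    for metric, bounds in CRITERIA.items():
        value = results[metric]
        if "min" in bounds and value < bounds["min"]:
            return False
        if "max" in bounds and value > bounds["max"]:
            return False
    return True

print(passes({
    "quality_score_delta": 0.01,
    "p95_latency_ms_delta": 150,
    "cost_per_request_delta": -0.0004,
    "safety_violation_rate": 0.0002,
}))  # True: candidate meets every predefined threshold
```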
2. Run a Canary Deployment
A canary deployment routes a small percentage of real traffic to the new model while monitoring closely. This is the transition between A/B testing and full rollout.
3. Gradually Increase Traffic to the New Model
Use a progressive rollout pattern. Increase traffic in stages while watching for regressions. Don't go from 5% to 100% in one jump.
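A sketch of what that staged loop can look like: bump the traffic percentage, let it soak, and stop (or roll back) if any health check fails. `set_traffic_pct` and `metrics_healthy` are placeholders for your routing config and monitoring:

```python
import time

ROLLOUT_STAGES = [5, 25, 50, 100]   # percent of traffic on the new model
SOAK_SECONDS = 6 * 60 * 60          # observe each stage before moving on

def set_traffic_pct(pct: int) -> None:
    """Placeholder: update your router or feature flag to send pct% to the new model."""
    print(f"routing {pct}% of traffic to the new model")

def metrics_healthy() -> bool:
    """Placeholder: query monitoring for error rate, latency, and quality regressions."""
    return True

def progressive_rollout() -> None:
    for pct in ROLLOUT_STAGES:
        set_traffic_pct(pct)
        time.sleep(SOAK_SECONDS)    # in practice, a scheduler or pipeline step
        if not metrics_healthy():
            set_traffic_pct(0)      # roll back immediately on regression
            raise RuntimeError(f"rollout halted at {pct}% due to metric regression")
```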
4. Monitor Closely and Prepare for Rollback
Set up automated alerting and a one-click rollback mechanism. If metrics degrade, revert immediately. The ability to roll back quickly is more valuable than the ability to roll out quickly.
Best Practices for Safe LLM Testing in Production
Always have a rollback plan ready: Rollback should be instant and automated. Define your rollback trigger conditions before starting any test
Set clear success and failure thresholds: Define what "better" means before the test, including the specific metric improvements or regressions that trigger decisions
Limit blast radius with feature flags: Feature flags let you instantly disable the candidate without a code deploy
Automate LLM evaluation where possible: Manual review doesn't scale. Use automated scorers to evaluate outputs continuously
Ensure code quality in your testing infrastructure: Testing infrastructure is code—it benefits from reviews, tests, and quality checks. Platforms like CodeAnt AI help maintain quality and catch issues in evaluation pipelines before they reach production
Common Mistakes When Testing New LLMs
Relying only on synthetic test data: Curated test sets miss the diversity and messiness of real user inputs. Shadow traffic solves this by using actual production requests
Ignoring edge cases and long-tail inputs: Rare inputs often expose model weaknesses. Run tests long enough to capture low-frequency but high-impact scenarios
Underestimating the cost of shadow traffic: Shadow traffic doubles your LLM API costs during testing. Budget accordingly and set spending alerts
Skipping human evaluation for output quality: Automated metrics miss nuance. Include human review for a sample of responses, especially for subjective quality judgments
Build a Reliable LLM Testing Pipeline That Scales
Testing isn't a one-time event. It's an ongoing practice of continuous LLM evaluation as models evolve, prompts change, and user expectations shift.
Robust testing infrastructure requires the same engineering rigor as production systems. Code quality, automated reviews, and security scanning matter just as much in your evaluation code as in your application code. A bug in your testing pipeline can lead to bad deployment decisions that affect every user.