AI Code Review Benchmark: Results from 200,000 Real Pull Requests

Amartya | CodeAnt AI Code Review Platform
Sonali Sood

Founding GTM, CodeAnt AI

CODEANT AI · MARCH 2026 · BENCHMARK

AI code review tools have exploded in the past two years. Every major AI vendor now offers some version of automated pull request review. Tools promise to detect bugs, security issues, performance problems, and code quality regressions before code merges.

But until recently there was a fundamental problem: no independent benchmark measured how well these tools actually perform.

Most published comparisons were produced by the vendors themselves. Predictably, every benchmark declared its own tool the winner. That changed in March 2026 with the release of Martian’s Code Review Bench, the first independent evaluation framework designed specifically for AI code review systems.

The benchmark evaluates 17 AI code review tools across more than 200,000 real pull requests from open-source repositories. Instead of relying only on curated datasets, it measures something much closer to reality: which review comments developers actually act on.

In this first release, CodeAnt AI ranked #3 globally, achieving a 51.7% F1 score across real-world pull requests.

This article explains:

  • How the benchmark works

  • What precision, recall, and F1 mean for code review

  • Where CodeAnt AI ranked

  • Why different tools score differently

  • What the results reveal about the future of AI code review

The First Independent AI Code Review Benchmark

For years, evaluating AI code review tools was surprisingly difficult.

Traditional static analysis tools had long-established benchmarks and datasets. But modern AI-based review systems operate differently. They analyze code context, developer intent, architectural patterns, and repository history.

Measuring their performance requires something more sophisticated than a fixed bug dataset.

Martian approached the problem differently.

Their Code Review Bench was designed by a research team including engineers previously associated with:

  • DeepMind

  • Anthropic

  • Meta

Importantly, Martian does not build a competing code review product.

This independence is what makes the benchmark credible.

The benchmark analyzes:

  • 200,000+ real pull requests

  • thousands of GitHub repositories

  • 17 AI code review tools

Instead of manually labeling bugs, the benchmark observes developer behavior directly. If developers fix code after a review comment, that comment is treated as a meaningful signal.

If developers ignore or dismiss the comment, that signal is recorded as well. This creates one of the largest behavioral datasets ever used to evaluate automated code review.

How Martian’s Benchmark Works

Martian designed the benchmark using two complementary evaluation systems. This architecture solves many problems that have historically made AI evaluation unreliable.

1. Online Benchmark: Real Developer Behavior

The online benchmark tracks how developers respond to automated review comments.

Martian monitored open-source pull requests across GitHub repositories during January–February 2026. For each review comment generated by a tool, the benchmark asks:

Did the developer modify the code after the comment? If yes, the comment counts as a true positive. If the comment was ignored, dismissed, or resulted in no change, it contributes to the evaluation differently depending on the scenario.

This method removes a common source of bias in AI benchmarks.

There is:

  • no curated answer key

  • no manual annotation

  • no static list of bugs

Only real developer decisions. Because the dataset includes over 200,000 pull requests, it provides an unusually strong signal about how useful review comments actually are in practice.
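The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Martian's implementation: the `ReviewComment` record, its `code_changed_after` field, the tool names, and the sample data are all hypothetical. It covers only the "acted on" share of comments, since the article does not specify how ignored or dismissed comments are weighted.

```python
from dataclasses import dataclass


@dataclass
class ReviewComment:
    """Hypothetical record of one automated review comment."""
    tool: str
    code_changed_after: bool  # True if the developer modified the code in response


def acted_on_rate(comments: list[ReviewComment], tool: str) -> float:
    """Fraction of a tool's comments that were followed by a code change."""
    own = [c for c in comments if c.tool == tool]
    if not own:
        return 0.0
    acted = sum(c.code_changed_after for c in own)
    return acted / len(own)


# Made-up sample data, for illustration only.
sample = [
    ReviewComment("tool_a", True),
    ReviewComment("tool_a", False),
    ReviewComment("tool_a", True),
    ReviewComment("tool_a", True),
    ReviewComment("tool_b", False),
    ReviewComment("tool_b", True),
]

print(acted_on_rate(sample, "tool_a"))  # 0.75
print(acted_on_rate(sample, "tool_b"))  # 0.5
```

Under these assumptions, the acted-on rate plays the role of precision: comments followed by a code change are the true positives.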

2. Offline Benchmark: Controlled Comparison

While the online benchmark measures real-world behavior, Martian also runs a traditional offline evaluation.

In this setup:

  • All tools review the same 50 pull requests

  • Results are compared against a curated gold dataset

This allows researchers to compare tools under identical conditions.

The offline benchmark answers questions like:

  • Which tool identifies known bugs?

  • Which tool misses issues?

  • Which tool generates excessive false positives?

While smaller than the online dataset, this controlled environment helps validate the tools’ technical detection capabilities.
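A controlled comparison of this kind can be sketched as set arithmetic between the issues a tool reports and the issues in the gold set. The snippet below is an illustrative sketch only: the issue labels are invented, and matching by exact label is a simplifying assumption (the article does not describe how Martian matches comments to gold issues).

```python
from typing import NamedTuple


class OfflineScore(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float


def score_against_gold(reported: set[str], gold: set[str]) -> OfflineScore:
    """Compare the issues a tool reported with a curated gold set."""
    tp = len(reported & gold)   # known issues the tool found
    fp = len(reported - gold)   # reported issues that are not in the gold set
    fn = len(gold - reported)   # known issues the tool missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return OfflineScore(tp, fp, fn, precision, recall)


# Hypothetical issue labels for a single pull request.
gold_issues = {"null-deref", "sql-injection", "off-by-one", "race-condition"}
reported_issues = {"null-deref", "sql-injection", "unused-import"}

score = score_against_gold(reported_issues, gold_issues)
print(score)
```

Here the tool finds two of the four known issues (recall 0.5) and two of its three comments match the gold set (precision about 0.67), which maps directly onto the three questions listed above.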

Benchmark Results: The Global Leaderboard

The results below are sorted by F1 score, the benchmark’s primary metric. The table shows the 11 top-ranked tools of the 17 evaluated.

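| Rank | Tool                 | F1    | Precision | Recall |
|------|----------------------|-------|-----------|--------|
| #1   | Qodo Extended        | 64.3% | 62.3%     | 66.4%  |
| #2   | Augment              | 53.8% | 47.0%     | 62.8%  |
| #3   | CodeAnt AI           | 51.7% | 52.2%     | 51.1%  |
| #4   | Qodo                 | 47.9% | 42.6%     | 54.7%  |
| #5   | Cursor Bugbot        | 44.9% | 46.2%     | 43.8%  |
| #6   | Devin                | 42.9% | 50.5%     | 37.2%  |
| #7   | Cubic Dev            | 41.8% | 29.9%     | 69.3%  |
| #8   | Propel               | 41.6% | 46.0%     | 38.0%  |
| #9   | Greptile             | 40.6% | 31.6%     | 56.9%  |
| #10  | Claude Code Reviewer | 39.0% | 37.3%     | 40.9%  |
| #11  | GitHub Copilot       | 35.5% | 26.6%     | 53.3%  |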

Across thousands of repositories and hundreds of thousands of pull requests, CodeAnt AI ranked #3 globally, achieving:

  • 51.7% F1 score

  • 52.2% precision

  • 51.1% recall

What Precision, Recall, and F1 Mean in Code Review

Understanding these metrics is essential for interpreting the benchmark.

Precision

Precision measures how often a tool’s comments identify real problems. A precision of 52.2% means that slightly more than 1 in 2 comments generated by CodeAnt AI resulted in a real code change.

In practice, this reflects the signal-to-noise ratio. Low-precision tools produce many false alarms, and developers eventually ignore them. High-precision tools generate fewer, more trusted comments.

Recall

Recall measures how many of the real issues in a pull request the tool actually detects. CodeAnt AI’s recall of 51.1% means the system surfaces more than half of the issues present in the pull requests it analyzes.

Recall reflects coverage. A tool with low recall may stay quiet most of the time but miss critical bugs.

F1 Score

F1 combines precision and recall into a single balanced metric. It penalizes tools that optimize only one side of the tradeoff. A tool cannot achieve a high F1 score by:

  • detecting everything but generating noise

  • commenting rarely but missing bugs

The metric rewards systems that maintain both accuracy and coverage.
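To make the penalty concrete, the sketch below computes F1 as the harmonic mean of precision and recall and checks it against a few leaderboard rows. It assumes the benchmark uses this standard definition; the article does not state the formula, but the published numbers are consistent with it.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from the leaderboard above.
leaderboard = {
    "Qodo Extended": (62.3, 66.4, 64.3),
    "CodeAnt AI": (52.2, 51.1, 51.7),
    "Cubic Dev": (29.9, 69.3, 41.8),
    "GitHub Copilot": (26.6, 53.3, 35.5),
}

for tool, (p, r, reported) in leaderboard.items():
    arithmetic_mean = (p + r) / 2
    print(
        f"{tool}: computed F1 {f1_score(p, r):.1f}, "
        f"reported {reported}, arithmetic mean {arithmetic_mean:.1f}"
    )
```

The computed values agree with the reported ones to within about 0.1 percentage points, with the small differences presumably due to rounding in the published precision and recall. Cubic Dev shows the penalty clearly: it has the highest recall on the leaderboard (69.3%), yet its low precision (29.9%) pulls its F1 down to 41.8%, well below the simple average of the two numbers.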

Why AI Code Review Tools Struggle With This Balance

Automated code review has always faced a structural challenge. Developers dislike tools that generate too many false positives. But they also dislike tools that miss real problems. This creates a difficult optimization problem.

High Recall Strategy

Some tools aim to detect as many issues as possible. They comment frequently. The result is high recall but low precision. Developers begin ignoring the comments.

High Precision Strategy

Other tools comment only when extremely confident. This produces high precision. But recall drops sharply. Important issues remain undetected.

Balanced Systems

The most useful tools operate in the middle. They detect a large portion of issues while maintaining a high signal-to-noise ratio. This balance is what the F1 score attempts to measure.
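The three strategies can be illustrated with a confidence threshold. The sketch below is a toy model, not a description of how any listed tool works: it assumes each candidate comment carries a confidence score and is posted only if the score meets the threshold, and the candidate data is invented.

```python
def precision_recall_at(
    candidates: list[tuple[float, bool]], threshold: float
) -> tuple[float, float]:
    """Precision and recall if only candidates at or above the threshold are posted.

    Each candidate is (confidence, is_real_issue).
    """
    posted = [is_real for confidence, is_real in candidates if confidence >= threshold]
    total_real = sum(is_real for _, is_real in candidates)
    true_positives = sum(posted)
    precision = true_positives / len(posted) if posted else 0.0
    recall = true_positives / total_real if total_real else 0.0
    return precision, recall


def f1(precision: float, recall: float) -> float:
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Invented candidate comments: (model confidence, whether the issue is real).
CANDIDATES = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True), (0.70, True),
    (0.60, False), (0.50, True), (0.40, False), (0.30, False), (0.20, True),
]

for label, threshold in [("high precision", 0.9), ("balanced", 0.5), ("high recall", 0.1)]:
    p, r = precision_recall_at(CANDIDATES, threshold)
    print(f"{label:>14} (threshold {threshold}): P={p:.2f} R={r:.2f} F1={f1(p, r):.2f}")
```

With this made-up data, the strict threshold reaches precision 1.00 but recall of only 0.33, the permissive threshold reaches recall 1.00 at precision 0.60, and the middle threshold yields the highest F1 of the three (about 0.77).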

Why the Offline Benchmark May Underestimate Some Tools

Martian also highlights an important limitation in the current offline evaluation. The curated gold dataset used for the offline benchmark was initially built using datasets from two existing tools:

  • Augment

  • Greptile

While this allowed the benchmark to launch quickly, it introduces a structural bias. The dataset naturally reflects the categories of bugs those tools were originally designed to detect.

If another tool identifies a real issue not present in the gold set, the benchmark may classify that comment as a false positive.

In other words:

  • The tool may be correct.

  • But the dataset does not yet contain the issue.

Martian has already observed cases where comments initially marked as false positives were later confirmed to be legitimate bugs.
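The effect of an incomplete gold set on measured precision can be shown with a small example. The issue labels and both gold sets below are hypothetical, and exact-label matching is a simplifying assumption.

```python
def precision_against(reported: set[str], gold: set[str]) -> float:
    """Share of reported issues that appear in the gold set."""
    if not reported:
        return 0.0
    return len(reported & gold) / len(reported)


# Hypothetical issues one tool reported on a pull request.
reported = {"null-deref", "sql-injection", "pii-in-logs", "missing-test"}

# Hypothetical initial gold set, which does not yet contain "pii-in-logs".
initial_gold = {"null-deref", "sql-injection", "off-by-one"}

# The same gold set after "pii-in-logs" is confirmed as a legitimate bug.
expanded_gold = initial_gold | {"pii-in-logs"}

print(precision_against(reported, initial_gold))   # 0.5
print(precision_against(reported, expanded_gold))  # 0.75
```

The tool's output is identical in both runs. Only the answer key changed, which is why a tool that finds issue categories missing from the initial gold set can look noisier than it is.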

This is one reason the benchmark will continue evolving.

Why the Rankings May Change Over Time

The benchmark is designed to update regularly. Each monthly iteration introduces:

  • additional pull requests

  • expanded gold datasets

  • improved evaluation coverage

As the dataset becomes more representative of real-world code, offline and online results are expected to converge.

This means the leaderboard will likely shift over time.

Tools that detect broader categories of issues may benefit as the dataset expands.

Where CodeAnt AI Performs Particularly Well

Beyond the overall leaderboard, the benchmark also evaluates specific code review domains. In several categories, CodeAnt ranks at the top of the benchmark.

These include:

  • Security patch analysis

  • Testing issues in pull requests

  • Logging and PII leak detection

  • Large pull request review

Each of these areas is analyzed in detail in the following benchmark breakdowns:

  • Security Patch Detection Benchmark

  • Testing Issue Detection Benchmark

  • Logging and PII Detection Benchmark

  • Large Pull Request Review Benchmark

Each analysis explores how different AI systems perform on specific categories of engineering problems.

What the Benchmark Reveals About AI Code Review

The Martian benchmark provides an early look at how modern AI systems perform in real engineering environments.

Several patterns emerge from the results.

  • First, AI code review is still a rapidly evolving field. No tool currently achieves perfect precision or recall.

  • Second, tools differ significantly in how they balance signal and coverage. Some systems prioritize detection breadth, while others prioritize comment accuracy.

  • Third, real developer behavior is a powerful evaluation signal.

Traditional benchmarks often struggle to capture how tools behave in real workflows. By observing developer responses to comments, Martian’s benchmark measures the real impact of automated review systems. This approach may become the standard for evaluating developer tools going forward.

Conclusion: A New Standard for Evaluating AI Code Review

The Martian Code Review Bench is the first large-scale independent attempt to measure AI code review tools using real developer behavior. Across more than 200,000 pull requests, the benchmark reveals both the progress and the limitations of modern AI-assisted code review.

In this first release, CodeAnt AI ranked #3 globally with a 51.7% F1 score, balancing both precision and recall across thousands of repositories and engineering teams. More importantly, the benchmark introduces a transparent methodology that can evolve over time.

As datasets expand and evaluation methods improve, the leaderboard will continue to change. What will remain constant is the core objective: high-signal code review that developers trust and act on.

If you want to see what CodeAnt AI surfaces in your own repositories, you can install it in minutes and start reviewing pull requests today.
