
How to Evaluate a Penetration Testing Provider Before You Waste Budget

Sonali Sood

Founding GTM, CodeAnt AI

The Buying Problem Nobody Talks About…

Security budgets are getting larger. The penetration testing market is growing fast. And the number of vendors claiming to offer "AI-powered penetration testing" has exploded in the last two years to the point where the phrase is nearly meaningless as a differentiator.

Here is the buying problem this creates: penetration testing is one of the few enterprise security purchases where you genuinely cannot evaluate the quality of what you're buying until after you've bought it. You can read case studies. You can check credentials. You can ask for sample reports. But you don't know whether the methodology is actually thorough until the engagement is complete and you're looking at the findings — or not looking at findings that should have been there.

This information asymmetry is exploited, often unintentionally, by providers whose methodology is weaker than their marketing. Not because they're dishonest — because "AI-powered" is a legitimate description of adding ML to scanner output prioritization, just as it's a legitimate description of AI that reads your source code, traces your data flows, and chains findings into exploit paths. The label doesn't distinguish between them. The questions in this guide do.

Every question here was designed to produce a specific type of answer that reveals what the methodology actually is. The right answers are detailed, technical, and specific to your application. The wrong answers are vague, category-level, and could apply to any vendor in the market.

By the end of this guide, you'll have a complete evaluation framework: the questions, what good and bad answers look like, a scoring rubric, red flags that should end conversations immediately, and a final checklist you can bring into any vendor evaluation.

Related: What Is AI Penetration Testing? The Complete Deep-Dive Guide | AI Pentesting vs Traditional Pentesting: An Honest Head-to-Head

Why the Standard Evaluation Process Fails

Most organizations evaluate penetration testing vendors the same way they evaluate any enterprise software vendor: website review, analyst report, reference calls, proposal comparison, price negotiation.

This process works poorly for penetration testing because:

Websites are indistinguishable. Every vendor's website says "comprehensive," "AI-powered," "expert researchers," "proven methodology." These words have been drained of meaning through universal adoption.

Reference calls are curated. Vendors provide references from engagements that went well. You don't hear from the customer whose critical IDOR was missed, who found out about the vulnerability six months later from a security researcher.

Sample reports can be cherry-picked. A vendor can show you their best report from their most skilled consultant's best engagement. You don't know if that represents their median output or their 95th percentile.

Price comparison is meaningless without methodology comparison. A $20,000 engagement that finds your critical vulnerability is cheaper than a $10,000 engagement that misses it and leaves you with a $4M breach-response bill.

The solution is to ask questions that are technically specific enough that vague or marketing-level answers are immediately identifiable — and that a vendor with a strong methodology can answer in detail from direct experience.

[IMAGE PLACEHOLDER: Comparison visual — two vendor websites side by side, both claiming "AI-powered penetration testing," "comprehensive coverage," "expert researchers." Show that the websites are indistinguishable. Then show the nine questions as the differentiator underneath.]

Question 1: Can You Provide a Working Proof-of-Exploit for Every Finding You Report?

Why This Question Is the Foundation

This is the single most important question in any penetration testing evaluation. The answer immediately classifies the provider.

A penetration test finding has two possible states: confirmed exploitable, or suspected based on evidence. The difference matters operationally because:

A confirmed finding comes with a working proof-of-concept — a curl command, a Python script, or step-by-step browser reproduction that any engineer on your team can execute and reproduce the vulnerability within minutes. The finding is not theoretical. It is demonstrated.

A suspected finding comes with a description of why the response pattern suggests a possible vulnerability. It may or may not be real. It requires your engineering team to spend time investigating whether the issue is genuine before they can even begin remediation.

When a provider reports 50 findings with no proof-of-exploit, your engineering team will spend weeks triaging. They'll find that many findings are false positives, scanner artifacts, or vulnerabilities in code paths that aren't actually reachable in your application. The security work becomes triage work.

What Good Answers Look Like

Strong answer: "Yes, every finding we report comes with a working proof-of-concept. For web vulnerabilities, that's typically a curl command or Python script that reproduces the issue. For logic flaws, it's a documented step-by-step reproduction sequence. We don't report anything we haven't confirmed works — if we can't exploit it, it doesn't go in the report."

Acceptable answer: "For the majority of findings, yes. There are rare categories — some denial-of-service findings, some race conditions — where we can demonstrate the mechanism but not safely execute full exploitation in production. In those cases, we document the reproduction steps and the evidence we used to confirm the vulnerability."

What Bad Answers Look Like

Weak answer: "We use CVSS scoring to rate our findings and document the evidence that led to each finding."

Red flag answer: "We provide detailed descriptions of each vulnerability with references to the relevant CVE or OWASP entry."

Neither of these answers the question. If a provider can't directly say "yes, proof-of-exploit for every finding," they're running a scanner and calling the output a penetration test.

How to Verify

Ask for a redacted sample report. Open any finding. Look for:

  • A curl command with actual parameters

  • A Python or Bash script that performs the test

  • A step-by-step reproduction sequence specific enough to run in under 10 minutes

If the finding has a description, a CVSS score, and a link to OWASP — but no PoC — that's scanner output.

Question 2: Walk Me Through How You Trace a SQL Injection From the HTTP Request to the Database

Why This Question Matters

This question tests whether the provider actually performs dataflow analysis — the technique that finds injection vulnerabilities by following user-controlled input from where it enters the application to where it reaches a dangerous sink — or whether they rely on response-based detection that only finds obvious injection patterns.

The distinction matters because:

Response-based SQL injection detection (what DAST scanners do) finds cases where the injection produces an observable error message, a timing difference, or a data difference in the HTTP response. It misses injection vulnerabilities where the response looks normal — blind injections, second-order injections, injections in code paths that don't return data to the API caller.

Dataflow-based SQL injection detection (what white box penetration testing does) follows the input through the code regardless of whether the response is observable. It finds the injection at the source — at the specific line where user input reaches a raw SQL query — even if the vulnerability produces no external signal.

What Good Answers Look Like

Strong answer: "In white box engagements, we trace user-controlled input from the HTTP request handler through every function call it touches — through middleware, service layers, repositories — until it reaches a database call. We specifically look for cases where parameterization is missing: raw SQL construction via string concatenation or f-strings, ORM raw() methods with unparameterized input, stored procedure calls that construct dynamic SQL internally. We map the exact path from entry point to the vulnerable line. The finding includes the file, class, method, and line number where the input reaches the database without parameterization."

Follow-up you should ask: "Can you give me an example of a SQL injection you found through dataflow tracing that was invisible to external testing?"

A provider who does real dataflow analysis will have specific examples — a blind injection in a logging function, a second-order injection where user input was stored then executed later, an injection in a background job that never returns HTTP responses.

What Bad Answers Look Like

Weak answer: "We test all input parameters for SQL injection using our payload library and analyze the responses for error messages and timing differences."

This describes DAST. It does not describe dataflow analysis. The provider is testing from the outside — they will miss every injection vulnerability that doesn't produce an observable response anomaly.

Another weak answer: "Our AI identifies potential injection points and tests them automatically."

This is still DAST with AI-assisted payload generation. "Identifies potential injection points" from the outside — not from reading the code.

The Code Example That Reveals Understanding

Ask the provider: "How would you find SQL injection in code like this?"

# views.py
from rest_framework.views import APIView
from rest_framework.response import Response

from .tasks import generate_report_task

class ReportView(APIView):
    def post(self, request):
        user_id = request.data.get('user_id')
        report_type = request.data.get('report_type', 'summary')

        # This looks safe from the outside — no error message, no timing diff
        # The injection is in a background task that never returns HTTP data

        generate_report_task.delay(user_id, report_type)
        return Response({'status': 'Report generation started'})

# tasks.py — background Celery task
from celery import shared_task
from django.db import connection

@shared_task
def generate_report_task(user_id, report_type):
    with connection.cursor() as cursor:
        # VULNERABLE — report_type injected directly into query
        # Response to API caller: always {"status": "Report generation started"}
        # External scanner: sees only the 200 OK response, finds nothing
        cursor.execute(
            f"SELECT * FROM reports WHERE user_id = {user_id} "
            f"AND type = '{report_type}'"
        )

A provider who does dataflow analysis explains how they'd trace report_type from request.data.get() through generate_report_task.delay() into the Celery task where it reaches cursor.execute() with f-string formatting.

A provider who does DAST scanning says they'd send SQL payloads to the POST endpoint and analyze the response. They will never find this vulnerability — the response is always {"status": "Report generation started"} regardless of what's injected.
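To make the distinction concrete, here is a minimal sketch of the kind of source-to-sink check a dataflow engine performs on the Celery example above, using Python's standard ast module. It handles only this one pattern — an f-string reaching cursor.execute() with a function parameter interpolated — inside a single function; real engines trace inter-procedural flow across files. The function names and scope are illustrative, not any vendor's actual implementation.

```python
import ast

SNIPPET = '''
def generate_report_task(user_id, report_type):
    with connection.cursor() as cursor:
        cursor.execute(
            f"SELECT * FROM reports WHERE user_id = {user_id} "
            f"AND type = '{report_type}'"
        )
'''

def find_fstring_sql(tree):
    """Flag cursor.execute() calls whose query is built with an f-string
    that interpolates one of the enclosing function's parameters."""
    findings = []
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        params = {a.arg for a in func.args.args}  # potential taint sources
        for node in ast.walk(func):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and node.func.attr == "execute"
                    and node.args):
                arg = node.args[0]
                # adjacent f-string literals are merged by the parser
                # into a single JoinedStr node
                if isinstance(arg, ast.JoinedStr):
                    tainted = [v.value.id for v in arg.values
                               if isinstance(v, ast.FormattedValue)
                               and isinstance(v.value, ast.Name)
                               and v.value.id in params]
                    if tainted:
                        findings.append((func.name, node.lineno, tainted))
    return findings

print(find_fstring_sql(ast.parse(SNIPPET)))
```

Running this flags generate_report_task with both user_id and report_type tainted, without ever sending an HTTP request — which is exactly why the finding is invisible to response-based testing.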

Question 3: What CVEs Has Your Team Published? Can I Verify Them?

Why This Question Is the Credibility Filter

CVE numbers are public, verifiable, and assigned by external authorities — MITRE, or a registered CNA (CVE Numbering Authority). They cannot be fabricated. A firm with published CVEs has demonstrated two things under external validation:

  1. They can find novel vulnerabilities — not just known patterns from CVE databases

  2. They can document findings rigorously enough to pass external review and meet the disclosure standard

A firm without published CVEs has not demonstrated either. They may still be competent — but they haven't produced the public, verifiable evidence that separates demonstrated capability from marketing claims.

How to Verify

Every CVE is searchable at nvd.nist.gov. Search by CVE number. You'll see: the affected package, the CVSS score, the CWE classification, and the reference links including the original disclosure. If the disclosure links back to the firm's research blog, the attribution is confirmed.
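The attribution check can even be scripted against the NVD's public REST API (version 2.0, which accepts a cveId query parameter). A minimal sketch, with the sample CVE ID and field extraction chosen for illustration:

```python
import json
import urllib.request

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def extract_fields(payload):
    """Pull the fields worth checking from an NVD API 2.0 response:
    the ID, the English description, and the reference links that
    should point back at the firm's disclosure."""
    cve = payload["vulnerabilities"][0]["cve"]
    return {
        "id": cve["id"],
        "description": next(d["value"] for d in cve["descriptions"]
                            if d["lang"] == "en"),
        "references": [ref["url"] for ref in cve.get("references", [])],
    }

def lookup_cve(cve_id):
    """Fetch one CVE record from the public NVD API."""
    url = f"{NVD_API}?cveId={cve_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_fields(json.load(resp))
```

If none of the reference URLs lead back to the firm claiming the CVE, the attribution is unverified.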

For CodeAnt AI: 87+ published CVEs including CVE-2026-29000 (pac4j-jwt, CVSS 10.0 — full authentication bypass) and CVE-2026-28292 (simple-git, CVSS 9.8 — arbitrary command execution). All assigned via VulnCheck, a registered CNA. All verifiable in the NVD.

The significance isn't the number alone. It's that the same AI reasoning engine that produced these findings — in widely deployed production packages with a combined 1.85 billion monthly downloads — is applied to your codebase. The track record is evidence of capability, not just a credential for its own sake.

What to Ask If They Have No Published CVEs

"Has your team identified vulnerabilities in third-party software that were responsibly disclosed?"

Some legitimate security firms do internal research that gets fixed without public CVE assignment. But a firm that has been in this business for years with no published CVEs has not demonstrated its ability to find novel vulnerabilities in real production software.

Follow up: "What was the most significant vulnerability your team has found in production software in the last 12 months? Can you describe it in technical detail?"

The depth and specificity of the answer reveals genuine research capability vs. scanner operation.

Question 4: Do You Scan Git History for Deleted Secrets — and How?

Why This Question Reveals White Box Depth

Git history scanning is a specific, non-obvious technique that is only possible in white box engagements and that consistently produces active credential findings. The fact that a provider does or doesn't do it tells you something precise about how thoroughly they conduct source code review.

Here's why it matters:

# The problem a developer creates:
git commit -m "Add database configuration"
# Commit contains: DATABASE_URL=postgres://admin:prod_password123@db.company.com:5432/app

# The developer realizes the mistake:
git commit -m "Remove credentials from config"
# Commit removes the credentials from the current file

# What the developer believes: the credentials are gone
# What is actually true: they exist forever in git history

# Any of these retrieves the deleted credentials:
git show <first_commit_hash>:config/database.yml
git log --all -p -- "config/database.yml" | grep "DATABASE_URL"
git log --all -p -S "prod_password123" --source

The developer rotated nothing. The credential is active. Any attacker with a clone of the repository can retrieve and use it. The majority of "git history secret" findings CodeAnt AI encounters are still active — credentials that were deleted from the codebase but never rotated.

What Good Answers Look Like

Strong answer: "Yes — it's part of our standard white box methodology, not an add-on. We scan every branch, every tag, every commit in the repository history. We use a combination of pattern-based detection for known secret formats and entropy analysis for high-randomness strings that might be secrets we don't have patterns for. Every discovered historical secret is verified for current validity before being reported — we don't report dead credentials as critical findings."

The verification step is important. A provider that scans Git history but doesn't verify whether discovered secrets are still active will generate false urgency around rotated credentials. The finding that matters is: this credential from 14 months ago is still active and grants production database access.

What Bad Answers Look Like

Weak answer: "We use SAST tools to check the current codebase for hardcoded secrets."

This is checking HEAD only. Git history is invisible to this approach.

Another weak answer: "We check all files in the repository for credentials."

"All files in the repository" typically means the current state of the default branch — not the commit history. Ask explicitly: "When you say all files, does that include every historical commit, or just the current state of the codebase?"

Question 5: How Do You Test Business Logic — Walk Me Through a Specific Example

Why This Question Exposes the Deepest Gap

Business logic testing is the category most clearly absent from scanner-based approaches and most clearly present in genuine penetration testing. It requires understanding what an application is supposed to do — not just what HTTP response patterns look anomalous — and then systematically verifying that every flow enforces that intent.

The question asks for a specific example because a provider who actually does business logic testing has stories. They know what it felt like to discover that a payment confirmation endpoint could be called without completing the payment step. They know what the HTTP request looked like, what the response contained, what the impact was.

A provider who doesn't do business logic testing gives you a category description — "we test business logic including payment flows and access controls" — that could be read off any vendor's website.

What Good Answers Look Like

Strong answer: "In a recent gray box engagement, the application had a multi-step checkout flow. We mapped the complete flow from the API calls the frontend made, then tested every permutation: skipping steps, calling steps out of order, calling the final confirmation endpoint directly with a valid cart ID but without the payment step having been completed. The confirmation endpoint accepted the request and returned a success response — the order was created without payment processing. The vulnerability was that the confirmation endpoint only validated that the cart existed and belonged to the user, not that a payment record with a matching cart ID existed in a completed state. CVSS 7.5, classified as business logic flaw — direct financial loss per exploitation."

The specifics matter: the flow they tested, what the successful attack looked like, what the server returned, and what the root cause was. Vague descriptions of business logic testing are not evidence of having done it.

What Bad Answers Look Like

Weak answer: "We test business logic vulnerabilities including access control, payment manipulation, and workflow bypass."

This is a category list. Ask: "Can you describe a specific business logic finding from a recent engagement — what the application was supposed to do, what it actually did, what the payload looked like, and what the impact was?"

If they can't produce a specific technical story, they're describing a category they've read about, not a technique they practice.

The Follow-Up That Confirms or Refutes

After their answer, ask: "How do you test rate limiting bypass?"

Strong answer describes specific techniques: rotating X-Forwarded-For headers, varying request parameters that reset rate limit counters, using different API versions of the same endpoint, testing whether mobile and web API versions share rate limit state.

Weak answer: "We check if rate limiting is implemented."

Question 6: Show Me a Sample Report — Does It Include Root Cause to File and Line?

Why the Report Is the Most Direct Evidence of Methodology Quality

A sample report is the most direct window into what the engagement actually produces, provided you insist on a representative example rather than a showcase piece. Every gap in the report format reflects a gap in what the methodology actually delivers.

Ask for a sample report before any contract discussion. Tell them you want a representative example — not their best engagement, but a typical one.

What to Look For in the Report

Finding structure — minimum acceptable standard:

  • Title and severity, with the full CVSS vector string rather than just the number

  • A working proof-of-exploit for every confirmed finding

  • Root cause traced to the specific file, class, method, and line

  • Remediation guidance specific to the affected code, not a generic reference

  • Mapping to the compliance frameworks you care about (SOC 2, PCI-DSS, HIPAA)

What the PoC should look like:

# Acceptable PoC — specific, runnable, confirms the vulnerability
curl -X GET \
  "https://api.target.com/v1/products?sort=price,(SELECT+SLEEP(5))--" \
  -H "Authorization: Bearer [test_token]" \
  -w "\nTime: %{time_total}s"
# Expected: Response delayed by ~5 seconds — confirms blind SQL injection

What a bad PoC looks like:

"The application was found to be vulnerable to SQL injection in the search functionality. An attacker could leverage this to extract sensitive data. Reference: OWASP A03:2021 (Injection)."

This is not reproducible. An engineer receiving this cannot verify the vulnerability without rediscovering it themselves.

Root cause — the differentiator between depth levels:

A scanner-grade report names the affected feature ("the search functionality"). A white box report names the exact file, class, method, and line where the flaw lives, and explains why that code is vulnerable. Only the second gives your engineers a fix target without an investigation phase.

Remediation specificity:

Remediation should be a concrete change to the affected code, in your stack and framework — for example, replacing string-formatted SQL with a parameterized query in the specific method that builds it. A link to an OWASP cheat sheet is documentation, not remediation guidance.

Red Flags in Sample Reports

  • Findings without PoC

  • CVSS scores without the full vector string

  • Root cause described at the feature level ("the search functionality") not the code level

  • Remediation that's a link to OWASP

  • No compliance mapping

  • Findings that could have been generated by a scanner with no manual verification

Question 7: Is Retesting Included, and What Does the Verification Report Look Like?

Why Retesting Defines Whether Security Actually Improved

A penetration test that doesn't include retesting tells you what was broken. It doesn't tell you whether the fixes worked. These are different things.

Engineering teams fix what they understand from the report description. Sometimes they fix exactly the right thing. Sometimes they fix a symptom, not the root cause, and the vulnerability is still exploitable through a slightly different path. Sometimes they fix one instance of a pattern but miss the same pattern in a different part of the codebase.

Retesting verifies which scenario is true. A written verification report closes audit loops — SOC 2, PCI-DSS, and HIPAA auditors all want evidence of remediation, not just evidence of findings.

What Good Retesting Looks Like

Strong answer: "Retesting is included in every engagement at no additional cost. After your team completes remediation, we schedule a retest window. We re-execute the proof-of-concept for every confirmed finding. For findings where the fix was made at the code level, we also review the remediation code to confirm the fix is correct and doesn't introduce new vulnerabilities. We deliver a written verification report that documents each finding's status: Remediated, Partially Remediated, or Open with notes. This is the document you submit to auditors as remediation evidence."

Ask to see a sample verification report. It should look something like:

Retest Verification Report — Engagement ID: CAI-2026-0847
Testing dates: [original] / [retest]

Finding VLN-001: SQL Injection — Product Search Sort Parameter
  Original CVSS: 8.3
  Status: REMEDIATED ✓
  Verification method: Proof-of-concept payload no longer effective
                       Allowlist validation confirmed in source
  Retest date: [date]
  Tester: [researcher]

Finding VLN-002: IDOR — Order Detail Endpoint
  Original CVSS: 7.6
  Status: REMEDIATED ✓
  Verification method: User A token no longer returns User B orders
                       WHERE user_id filter confirmed in ORM query
  Retest date: [date]


This document is what you hand to an auditor. "We had these findings, here's evidence they were fixed, here's who retested them and when."

What Bad Retesting Looks Like

Weak answer: "Retesting is available as an additional engagement."

Additional cost, additional scheduling, additional weeks of delay. For a team trying to close a compliance audit, this is a significant problem.

Another weak answer: "We can schedule a retest call with your engineering team to review the fixes."

A call where engineers explain what they changed is not a retest. A retest is independent verification that the fix works — running the PoC and confirming it no longer succeeds.
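That distinction can be made mechanical. A minimal retest harness (the URL is hypothetical) re-executes the original time-based PoC and records the finding as REMEDIATED only when the injected delay no longer appears:

```python
import time
import urllib.request

# Hypothetical original PoC: time-based blind SQL injection in ?sort=
POC_URL = ("https://api.target.example/v1/products"
           "?sort=price,(SELECT+SLEEP(5))--")

def default_fetch():
    urllib.request.urlopen(POC_URL, timeout=15)

def poc_still_works(fetch=default_fetch, delay=5.0):
    """Re-execute the original proof-of-concept and time the response.
    The injected SLEEP firing means the finding is still OPEN."""
    start = time.monotonic()
    try:
        fetch()
    except Exception:
        pass  # an error response still carries the timing signal
    return (time.monotonic() - start) >= delay

def verify_finding(finding_id, fetch=default_fetch):
    """Produce the status line that goes into the verification report."""
    status = "OPEN" if poc_still_works(fetch) else "REMEDIATED"
    return f"{finding_id}: {status}"
```

This is the difference in one function: a retest runs the exploit again and observes the outcome, while a call only collects assertions about what changed.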

Question 8: What Is Your Pricing Model If No Critical Vulnerability Is Found?

Why Financial Structure Reveals Methodology Confidence

This question doesn't just evaluate pricing. It evaluates how confident the provider is in their own methodology.

A provider who charges regardless of outcome is selling effort. The fee covers the consultant's time. You pay the same whether they find ten critical vulnerabilities or none. There's no financial alignment between the fee and the security outcome you're purchasing.

A provider who offers performance-based pricing — or a guarantee tied to finding outcomes — is selling results. The fee is tied to the security value delivered. They're only willing to offer this if their methodology is effective enough that they expect to find critical vulnerabilities in most engagements.

CodeAnt AI's Model

If no CVSS 9+ critical vulnerability or active data leak is found, you pay nothing. The complete report — every low and medium finding, full methodology documentation, compliance mapping — is delivered at zero cost.

This works because:

  1. The methodology is comprehensive enough to find critical vulnerabilities in the vast majority of applications that have them

  2. The researchers behind it have published 87+ CVEs — demonstrating the ability to find critical vulnerabilities in production software under controlled, verifiable conditions

  3. A no-finding outcome is genuinely informative — it means the specific tested surface, at that point in time, doesn't have obvious critical vulnerabilities. That's worth knowing.

What to Ask Any Provider

"If your engagement finds no CVSS 9+ vulnerabilities, what do we pay?"

Strong answer: "If we don't find a CVSS 9+ or an active data leak, you pay nothing. You receive the complete report for free."

Acceptable answer: "Our standard engagement is fixed-fee. However, we've never completed a full assessment of a production SaaS application and found nothing significant — our average engagement finds 2–3 critical findings."

Weak answer: "Our pricing is competitive and reflects the thoroughness of our methodology." (Deflection — they pay regardless.)

Red flag answer: "Security testing value isn't measured by what's found." (True in general, false as a reason not to offer any financial accountability.)

Question 9: How Long Does an Engagement Take — From Our Scoping Call to Report Delivery?

Why Timeline Reveals Actual AI Involvement

This is the question that most directly reveals whether AI is genuinely in the methodology or only in the marketing.

A human penetration tester in a standard engagement takes 3–5 days to test a web application, 1–2 weeks to produce the report, and the calendar time from initial inquiry to report delivery is 6–10 weeks. This is not laziness — it's the realistic throughput of human-bounded analysis at reasonable quality.

An AI reasoning engine working continuously on the same application covers more ground in 48 hours than a human tester covers in 5 days. Not because the AI is more skilled — because it processes in parallel, doesn't need sleep, and doesn't make time-allocation trade-offs between "interesting" and "thorough."

If a provider claims to use AI in their penetration testing methodology but quotes you a 3–4 week timeline for a standard web application assessment, ask: what is the AI doing for three weeks?

The honest answer in most cases is: the "AI" is assisting with report generation and finding prioritization, not conducting the analysis. The timeline is human-bounded because the analysis is human-conducted.

What Good Timelines Look Like

Strong answer: "Scoping call today — testing starts within 24 hours of authorization. For a Full Assessment (black box + white box + gray box), report delivery is within 48–96 hours of testing start. Total calendar time from scoping to report: 3–5 days. Walkthrough call within 2 days of delivery. Retest scheduled as soon as remediation is complete — typically within 10 business days of report delivery."

Acceptable answer: "Standard engagement is 5–7 business days from scope confirmation to report delivery. We front-load the testing — most findings are identified in the first 48 hours."

What Bad Timelines Look Like

Weak answer: "We typically deliver the report 2–3 weeks after the testing window closes. Testing windows are 1–2 weeks."

Total: 4–5 weeks. If AI is doing the analysis, why does it take 4–5 weeks?

Red flag answer: "Our standard timeline is 6–8 weeks from initial scoping to report delivery."

This is a traditional consulting engagement timeline. Whatever AI is in this product, it's not conducting the analysis.

The Red Flags That Should End the Conversation

Beyond the nine questions, these are the signals that indicate a provider is not what their marketing claims:

Red Flag 1: Cannot show a sample report on request

If a provider can't show a redacted sample report, they either don't have good ones or their standard reports are too generic to share without embarrassment. Either way, it's a problem.

Red Flag 2: The sample report has no proof-of-exploit

You saw this above. No PoC = scanner output formatted as a pentest report.

Red Flag 3: "AI" is only mentioned in the context of reporting, prioritization, or noise reduction

If every AI capability they describe is about making output cleaner or more readable — not about how the analysis is conducted — the AI is cosmetic, not methodological.

Red Flag 4: They can't describe a specific business logic finding from memory

Every provider who actually does business logic testing has stories. If they can only describe it categorically ("we test business logic"), they haven't done it.

Red Flag 5: No researcher background or CVE track record

A firm without published CVEs has not demonstrated public, verifiable research capability. This doesn't make them incompetent — but it means you have no external validation of their ability to find novel vulnerabilities.

Red Flag 6: Retest is a separate paid engagement

Retesting is not a luxury — it's the verification step that confirms fixes worked. If it's not included, the engagement doesn't close the loop.

Red Flag 7: Timeline longer than 2 weeks from scoping to report

For a standard web application assessment with AI-driven analysis, there is no methodology reason for this to take longer than a week. Longer timelines indicate human-bounded analysis, not AI-driven analysis.

Red Flag 8: No methodology documentation in the deliverables

The finding report tells you what was found. The methodology report tells auditors how it was found. If the deliverable doesn't include methodology documentation, the engagement may not satisfy compliance requirements.

Red Flag 9: Vague answers to technical questions

If your technical questions get marketing answers — "our AI is trained on millions of vulnerabilities" instead of "our AI traces dataflows from HTTP request to database sink" — the methodology description is marketing, not technical reality.
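The distinction Red Flag 9 points at fits in a few lines. Here is a minimal, hypothetical sketch of a source-to-sink dataflow: the unsafe function lets attacker input reach the SQL sink as code, while the safe one parameterizes it; the unsafe path is exactly what a dataflow-tracing analysis follows and a pattern scanner often misses.

```python
import sqlite3

def find_user_unsafe(conn, username):
    # Source: attacker-controlled `username` flows straight into the SQL sink.
    # Dataflow analysis traces this request-to-query path end to end.
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, username):
    # Parameterized query: the tainted value never reaches the parser as code.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

# Classic tautology payload dumps every row through the unsafe path:
payload = "' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # 2 — all rows leak
print(len(find_user_safe(conn, payload)))    # 0 — no match, no injection
```

A vendor who can walk through an example like this against your own code, rather than recite "millions of vulnerabilities," is describing methodology, not marketing.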

The Complete Evaluation Scorecard

Use this scorecard in every vendor evaluation. Score each question 0–2:

  • 0: Could not answer, gave a vague/marketing answer, or the answer was a red flag

  • 1: Gave a partially specific answer — category-level but not fully technical

  • 2: Gave a specific, technical, detailed answer with examples
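If you're comparing several vendors, the tallying is trivial to automate. A small sketch that sums the nine answers and maps the total to the interpretation bands defined later in this guide:

```python
def score_vendor(scores):
    """Sum nine 0-2 answers and map the total to this guide's verdict bands."""
    if len(scores) != 9 or any(s not in (0, 1, 2) for s in scores):
        raise ValueError("expected nine scores, each 0, 1, or 2")
    total = sum(scores)
    if total >= 16:
        verdict = "Strong provider"
    elif total >= 12:
        verdict = "Acceptable, negotiate inclusions"
    elif total >= 8:
        verdict = "Significant gaps"
    else:
        verdict = "Walk away"
    return total, verdict

print(score_vendor([2, 2, 2, 1, 2, 2, 2, 1, 2]))  # (16, 'Strong provider')
```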

| Question | Score (0–2) | Notes |
| --- | --- | --- |
| Q1: PoC for every finding? | | |
| Q2: Dataflow analysis from HTTP to DB? | | |
| Q3: Published CVEs — verifiable? | | |
| Q4: Git history scanning? | | |
| Q5: Business logic testing with specific example? | | |
| Q6: Sample report with root cause to line + diff? | | |
| Q7: Retest included with verification report? | | |
| Q8: Performance-based pricing or guarantee? | | |
| Q9: Timeline under 2 weeks from scoping to report? | | |
| Total | /18 | |

Scoring interpretation:

| Total Score | Interpretation |
| --- | --- |
| 16–18 | Strong provider — methodology matches marketing |
| 12–15 | Acceptable — some methodology gaps, negotiate inclusions |
| 8–11 | Significant gaps — likely scanner-based with AI label |
| Below 8 | Walk away — the marketing and methodology don't match |

The Final Checklist: Before You Sign Anything

Print this. Use it.

Documentation to request before contracting:

  • [ ] Redacted sample report (representative engagement, not cherry-picked)

  • [ ] Sample retest verification report

  • [ ] CVE publication list with CVE numbers (verify in NVD)

  • [ ] Methodology overview document (what phases, what tools, what scope)

  • [ ] Sample authorization letter (confirms legal framework for the engagement)

  • [ ] Sample compliance mapping (how findings map to SOC 2 / PCI-DSS / HIPAA)
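The CVE list is the one claim on this list you can verify yourself before any sales call. A minimal sketch against the public NVD 2.0 REST endpoint (rate limits and response shape are NVD's and may change; the helper names are mine):

```python
# Hypothetical helper for checking a vendor's claimed CVE list against NVD.
NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def nvd_query_url(cve_id: str) -> str:
    """Build the NVD 2.0 lookup URL for a single CVE ID."""
    return f"{NVD_API}?cveId={cve_id}"

def cve_found(nvd_response: dict) -> bool:
    """True if the NVD JSON payload contains at least one matching record."""
    return nvd_response.get("totalResults", 0) > 0

# Live lookup (requires network; NVD rate-limits unauthenticated callers):
# import json, urllib.request
# with urllib.request.urlopen(nvd_query_url("CVE-2021-44228"), timeout=30) as r:
#     print(cve_found(json.load(r)))
```

A vendor whose published CVEs all resolve in NVD has external validation; a list that doesn't resolve is a red flag before you've spent a minute on a call.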

Contractual commitments to confirm:

  • [ ] Scope definition process and authorization letter included

  • [ ] Proof-of-exploit for every confirmed finding

  • [ ] CVSS 4.0 scoring with full vector string per finding

  • [ ] Root cause to file and line (white box scope)

  • [ ] Remediation diff per finding (white box scope)

  • [ ] Retest included at no additional cost

  • [ ] Written verification report post-retest

  • [ ] Escalation protocol for critical findings during testing

  • [ ] Non-destructive testing commitment (no data modification or deletion)

  • [ ] NDA and authorization documentation before code access (white box)

  • [ ] Methodology report included in deliverables

  • [ ] Performance pricing clause or guarantee documented
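For the CVSS 4.0 commitment above, it's worth spot-checking delivered vector strings for completeness. A rough sanity check, assuming the standard CVSS 4.0 base metric set (AV, AC, AT, PR, UI, VC, VI, VA, SC, SI, SA):

```python
# Required base metrics in a full CVSS 4.0 vector string.
BASE_METRICS = ("AV", "AC", "AT", "PR", "UI", "VC", "VI", "VA", "SC", "SI", "SA")

def has_full_base_vector(vector: str) -> bool:
    """Check that a vector carries the CVSS 4.0 prefix and every base metric."""
    prefix = "CVSS:4.0/"
    if not vector.startswith(prefix):
        return False
    try:
        metrics = dict(part.split(":", 1) for part in vector[len(prefix):].split("/"))
    except ValueError:
        return False
    return all(m in metrics for m in BASE_METRICS)

print(has_full_base_vector(
    "CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:H/VI:H/VA:H/SC:N/SI:N/SA:N"))  # True
print(has_full_base_vector("CVSS:4.0/AV:N/AC:L"))  # False — truncated vector
```

A report that lists only numeric scores without full vector strings can't be independently rescored, which is the point of the contractual clause.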

Technical verification before testing begins:

  • [ ] Scope document reviewed and agreed

  • [ ] Test accounts created in staging environment (gray box)

  • [ ] Repository access configured as read-only (white box)

  • [ ] Communication protocol established for critical finding escalation

  • [ ] Rules of engagement documented and signed

Don't Buy a Pentest. Buy a Methodology With Accountability.

The penetration testing market has a buying problem and a selling problem. Buyers can't easily evaluate methodology quality before purchasing. Sellers know this and market at the level of abstraction where all providers sound similar.

The nine questions in this guide break through that abstraction. They are technically specific enough that a provider with a real methodology answers them in detail from direct experience. A provider without a real methodology gives you marketing language.

The evaluation scorecard converts those answers into a comparable score. The checklist ensures you get the contractual commitments that make the engagement actually useful.

What you're ultimately buying is not an engagement window. Not a report document. Not a CVSS score per finding. You're buying the answer to one question: what can an attacker actually do to our users and our data right now?

The provider who answers that question with working proof-of-exploit, root cause to specific line of code, and a retest that verifies the fixes — while standing behind their methodology with a guarantee — is selling you the answer. Everyone else is selling you the process of looking for the answer.

CodeAnt AI stands behind the methodology with the clearest guarantee in the market: no CVSS 9+ critical vulnerability found, no payment. 87+ published CVEs. Testing starts within 24 hours of scoping.

→ Book a 30-minute scoping call. Bring these questions; we'll answer every one of them on your free demo call.

Continue reading:

  • What Is AI Penetration Testing? The Complete Deep-Dive Guide

  • Black Box vs White Box vs Gray Box: The Complete Technical Breakdown

  • Why Security Scanners Miss the Vulnerabilities That Actually Get You Breached

  • How AI Penetration Testing Works: Step-by-Step Methodology

  • AI Pentesting vs Traditional Pentesting: An Honest Head-to-Head

FAQs

We're a startup with limited budget. Does this evaluation process apply to us?

Should we evaluate multiple providers simultaneously?

How do we handle a situation where we have a preferred vendor relationship but want to verify methodology quality?

What if a vendor's answers are strong but their price is significantly higher than competitors?

Is there a minimum company size or application complexity where AI pentesting makes sense?


Start Your 14-Day Free Trial

AI code reviews, security, and quality trusted by modern engineering teams. No credit card required!
