RESEARCH BENCHMARKS

AI security benchmarks for models, agents, RAG, code, and control planes.

Methodology-first. Evidence-driven. Ready for private benchmark work.

Public AI benchmarks are useful, but enterprise teams still need product-context testing against their own workflows, data boundaries, policies, tools, and threat models. This program publishes methodology first, runs private benchmark work by request, and releases public scorecards only after validation.

Request a private benchmark | Start AI Security Assessment | Browse Marketplace

Benchmark program

Public methodology is published first. Private benchmark execution is available now. Public scorecards are released only after validation.

Program pulse

Focus areas

Secure Code Generation | Prompt Injection Resistance | RAG Leakage & Retrieval Boundary | Agent Tool Abuse

CODE SECURITY | Secure Code Generation | Pillar: Attack

PROMPT INJECTION | Prompt Injection Resistance | Pillar: Attack

RAG SECURITY | RAG Leakage & Retrieval Boundary | Pillar: Map

Publication boundary

Methodology and suite design are published before public scorecards. Suites in active build can be scoped privately while validation continues.

Private benchmark work is available by request

Public methodology is published before public rankings

Vendor rankings require validated trial data

Active suites

Secure code generation, AI code review, and artifact triage are under active build.

Defined suites

Eight benchmark suites are defined across code, RAG, agents, guardrails, and gateways.

Private path

Open

Private benchmark scoping is available now.

Public scorecards

Validation-gated

Public results are published only after validated trials.

Benchmark suites

Eight benchmark suites, one private execution path

Each suite defines the system under test, the failure modes we evaluate, the metrics we track, and the private benchmark path for teams that need evidence before public scorecards exist.

Pillar

Position the suite by research pillar.

Map | Attack | Defend | Evidence

System

Map the target system class and control surface.

Models | RAG | Agents | Guardrails | Gateways | Artifacts

Status

Track program state.

In progress | Planned | Private available | Published

Buyer problem

Start from the buyer problem the benchmark answers.

Launch blocked | Sales blocked | RAG leakage | Agent blast radius | Guardrails failing | Governance friction
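The filter dimensions above (pillar, system, status, buyer problem) suggest a simple record per suite. The following is a minimal sketch of how a team might mirror that catalog in its own tooling. The `Suite` class, the `filter_suites` helper, and the system tags and the RAG suite's status in the sample entries are illustrative assumptions, not the program's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Pillar(Enum):
    MAP = "Map"
    ATTACK = "Attack"
    DEFEND = "Defend"
    EVIDENCE = "Evidence"


class Status(Enum):
    IN_PROGRESS = "In progress"
    PLANNED = "Planned"
    PRIVATE_AVAILABLE = "Private available"
    PUBLISHED = "Published"


@dataclass(frozen=True)
class Suite:
    """Hypothetical catalog record for one benchmark suite."""
    name: str
    pillar: Pillar
    systems: tuple[str, ...]          # e.g. ("Models",) or ("RAG",)
    status: Status
    buyer_problems: tuple[str, ...] = ()


def filter_suites(suites, pillar=None, system=None, status=None):
    """Return the suites that match every filter that is not None."""
    return [
        s for s in suites
        if (pillar is None or s.pillar == pillar)
        and (system is None or system in s.systems)
        and (status is None or s.status == status)
    ]


# Sample entries: system tags and the RAG suite's status are assumptions.
SUITES = [
    Suite("Secure Code Generation", Pillar.ATTACK, ("Models",), Status.IN_PROGRESS),
    Suite("RAG Leakage & Retrieval Boundary", Pillar.MAP, ("RAG",), Status.PLANNED,
          buyer_problems=("RAG leakage",)),
]

for suite in filter_suites(SUITES, system="RAG"):
    print(suite.name)
```

Keeping pillar and status as enums means a typo in a filter value fails loudly instead of silently returning no suites.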

In progress

attack

CODE SECURITY

Secure Code Generation

Which models generate safer code for real developer tasks?

Primary metric preview

Secure-by-default rate

PROMPT INJECTION

Prompt Injection Resistance

Can models and workflows resist untrusted instructions?

Primary metric preview

Attack success rate

RAG SECURITY

RAG Leakage & Retrieval Boundary

Can retrieval stay inside tenant, role, and source boundaries?

Primary metric preview

Unauthorized retrieval rate

AGENT SECURITY

Agent Tool Abuse

Can agents use tools without exceeding authority?

Primary metric preview

Unsafe tool-call rate

GUARDRAILS

Guardrail Robustness

Do guardrails stop attacks without blocking useful work?

Primary metric preview

Bypass vs false refusal

CODE REVIEW

AI Code Review

Can models find and fix the vulnerabilities they generate?

Primary metric preview

True positive rate

ARTIFACT TRIAGE

Artifact & Binary Triage

Can AI-assisted tools spot risky behavior in software artifacts?

Primary metric preview

Artifact detection rate

GATEWAY POLICY

Model Gateway Policy Enforcement

Can the gateway enforce policy and produce usable evidence?

Primary metric preview

Policy bypass rate
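Most of the primary metrics previewed above are rates over labeled trials. As one illustration, the guardrail metric "Bypass vs false refusal" can be read as two rates reported together. The sketch below assumes a hypothetical `Trial` record with two boolean labels; the program's real grading rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """Hypothetical record for one guardrail trial."""
    is_attack: bool   # True for an adversarial input, False for a benign one
    blocked: bool     # True if the guardrail refused or blocked the input


def guardrail_rates(trials):
    """Return (bypass_rate, false_refusal_rate).

    bypass_rate: share of attack trials that were NOT blocked.
    false_refusal_rate: share of benign trials that WERE blocked.
    A rate is None when there are no trials of that kind.
    """
    attacks = [t for t in trials if t.is_attack]
    benign = [t for t in trials if not t.is_attack]
    bypass = sum(not t.blocked for t in attacks) / len(attacks) if attacks else None
    false_refusal = sum(t.blocked for t in benign) / len(benign) if benign else None
    return bypass, false_refusal


trials = [
    Trial(is_attack=True, blocked=True),
    Trial(is_attack=True, blocked=False),   # one attack gets through
    Trial(is_attack=True, blocked=True),
    Trial(is_attack=True, blocked=True),
    Trial(is_attack=False, blocked=False),
    Trial(is_attack=False, blocked=True),   # one benign request is refused
]
bypass_rate, false_refusal_rate = guardrail_rates(trials)
print(bypass_rate, false_refusal_rate)  # 0.25 0.5
```

Reporting the two rates side by side keeps a guardrail from looking strong simply because it refuses everything.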

Execution surfaces

Private benchmark work can use SecEng tooling

These surfaces support code scanning, adversarial trials, RAG tests, proxy traces, artifact analysis, and evidence packaging.

SecEng Code Scanner

Supports secure code generation and AI code review.

LLM Attack Range

Supports prompt injection, agent abuse, and adversarial trials.

RAG Test Harness

Supports retrieval leakage and boundary testing.

SecEng Proxy

Supports gateway policy and guardrail traces.

Artifact Analyzer

Supports artifact and binary triage.

Evidence Builder + Crosswalk

Supports reporting, mappings, and buyer proof.
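To make one of these surfaces concrete, here is a rough sketch of the kind of check a retrieval boundary test might run when computing an unauthorized retrieval rate. It does not describe the RAG Test Harness itself. The `Chunk` and `Query` records are hypothetical, and the sketch covers tenant and role boundaries only, not source boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieved chunk with access metadata."""
    doc_id: str
    tenant: str
    allowed_roles: frozenset


@dataclass(frozen=True)
class Query:
    """Hypothetical test query: who asked, and what the retriever returned."""
    tenant: str
    role: str
    retrieved: tuple  # tuple of Chunk


def is_authorized(chunk, tenant, role):
    """A chunk is in bounds only if both tenant and role match."""
    return chunk.tenant == tenant and role in chunk.allowed_roles


def unauthorized_retrieval_rate(queries):
    """Share of retrieved chunks outside the caller's tenant or role.

    Returns None when nothing was retrieved at all.
    """
    total = sum(len(q.retrieved) for q in queries)
    if total == 0:
        return None
    out_of_bounds = sum(
        not is_authorized(c, q.tenant, q.role)
        for q in queries
        for c in q.retrieved
    )
    return out_of_bounds / total


a = Chunk("a", "acme", frozenset({"analyst"}))
b = Chunk("b", "acme", frozenset({"admin"}))       # wrong role for an analyst
c = Chunk("c", "globex", frozenset({"analyst"}))   # wrong tenant for acme

queries = [
    Query("acme", "analyst", (a, b, c)),
    Query("globex", "analyst", (c,)),
]
print(unauthorized_retrieval_rate(queries))  # 0.5
```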

What we track

We track the AI security controls that fail first in real products.

These are the broad measurement pillars that sit underneath every benchmark suite.

Attack Coverage

Measure the abuse paths that matter most: injection, leakage, tool misuse, gateway bypass, and unsafe code generation.

Security and Safety Outcomes

Track whether the system resists attacks while still supporting useful work, approvals, and developer velocity.

Reliability and Robustness

Keep the evaluation focused on behavior that remains stable across retries, model variants, and deployment contexts.

Transparency and Reproducibility

Publish methodology, dataset design, and claim controls before public scorecards are released.

Coverage list

Benchmark coverage spans the public-safe problem families below

Secure code generation

Prompt injection resistance

RAG leakage and retrieval boundaries

Agent tool abuse

Guardrail robustness

AI code review quality

Artifact and binary triage

Model gateway policy enforcement

Private benchmarking

Need evidence before public scorecards publish?

Run a private benchmark sprint against the models, guardrails, RAG systems, agents, gateways, coding workflows, or artifacts you actually use.

Request a private benchmark | Start AI Security Assessment | Browse benchmark suites

Methodology note

Methodology is published early so teams can understand the evaluation design, request private variants, and align internal AI security tests.

Private benchmark path

Private benchmark runs can be scoped now for customers, sponsors, or internal teams. Private results stay private unless explicitly approved for publication.

Methodology-first

Methodology first, not leaderboard theater

Each suite starts with research questions, datasets, scenarios, grading rules, metrics, and claim controls. Public scorecards are released only after validated trials. Private benchmark runs can be scoped earlier for teams that need internal or buyer-facing evidence.

Research questions first

Every suite starts with explicit questions, scopes, and success criteria.

Controlled fixtures

Synthetic and curated scenarios mirror the systems teams actually ship.

Repeatable scoring

Repeated trials, with confidence intervals and bounded reporting where sample counts support them.

Validation-gated reporting

Public results are published only after evidence, review, and publication approval.
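As an illustration of repeatable scoring, a rate from repeated trials can be reported with a confidence interval, and the interval can be withheld when the sample is too small. The sketch below uses the Wilson score interval, which is a common choice for proportions and is an assumption here, not a statement of the program's method. The `min_trials=30` threshold is likewise a placeholder.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


def report_rate(successes, trials, min_trials=30):
    """Format a rate, adding an interval only when the sample supports it."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    rate = successes / trials
    if trials < min_trials:
        return f"{rate:.1%} (n={trials}, too few trials for an interval)"
    low, high = wilson_interval(successes, trials)
    return f"{rate:.1%} (95% CI {low:.1%} to {high:.1%}, n={trials})"


# Hypothetical counts: 12 successful attacks in 200 trials, and 1 in 5.
print(report_rate(12, 200))
print(report_rate(1, 5))
```

The Wilson interval behaves better than the simple normal approximation when rates sit near 0% or 100%, which is common for attack success rates.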

Publication boundary

Scorecards are validation-gated.

Private benchmark work is available by request

Public methodology is published before public rankings

Vendor rankings require validated trial data

No certifications or endorsements are implied

Benchmark launch path

Use the hub to move from research to execution

The public hub shows how the benchmark program works. Teams can use it to choose a private benchmark sprint, align internal AI security tests, or route a product risk into the marketplace.

Request a private benchmark | Start AI Security Assessment

Quick use

How to use this hub

Match a suite to your AI risk or buyer problem.

Review the methodology before requesting a private run.

Use the metrics to align internal AI security evaluation.

Route from research into products, services, training, or marketplace packages.