RESEARCH BENCHMARKS
AI security benchmarks for models, agents, RAG, code, and control planes.
Methodology-first. Evidence-driven. Ready for private benchmark work.
Benchmark program
Public methodology publishes first. Private benchmark execution is available now. Public scorecards are released only after validation.
Program pulse
Focus areas
Active suites: 3 (under active build)
Defined suites: 8 (private path open)
CODE SECURITY
Secure Code Generation
PROMPT INJECTION
Prompt Injection Resistance
RAG SECURITY
RAG Leakage & Retrieval Boundary
Publication boundary
Methodology and suite design publish before public scorecards. Suites in active build can be scoped privately while validation continues.
Secure code generation, AI code review, and artifact triage are under active build.
Eight benchmark suites are defined across code, RAG, agents, guardrails, and gateways.
Private benchmark scoping is available now.
Public results publish only after validated trials.
Benchmark suites
Eight benchmark suites, one private execution path
Each suite defines the system under test, the failure modes we evaluate, the metrics we track, and the private benchmark path for teams that need evidence before public scorecards exist.
Pillar
Position the suite by research pillar.
System
Map the target system class and control surface.
Status
Track program state.
Buyer problem
Start from the buyer problem the benchmark answers.
CODE SECURITY
Secure Code Generation
Which models generate safer code for real developer tasks?
Primary metric preview
Secure-by-default rate
PROMPT INJECTION
Prompt Injection Resistance
Can models and workflows resist untrusted instructions?
Primary metric preview
Attack success rate
RAG SECURITY
RAG Leakage & Retrieval Boundary
Can retrieval stay inside tenant, role, and source boundaries?
Primary metric preview
Unauthorized retrieval rate
AGENT SECURITY
Agent Tool Abuse
Can agents use tools without exceeding authority?
Primary metric preview
Unsafe tool-call rate
GUARDRAILS
Guardrail Robustness
Do guardrails stop attacks without blocking useful work?
Primary metric preview
Bypass vs false refusal
CODE REVIEW
AI Code Review
Can models find and fix the vulnerabilities they generate?
Primary metric preview
True positive rate
ARTIFACT TRIAGE
Artifact & Binary Triage
Can AI-assisted tools spot risky behavior in software artifacts?
Primary metric preview
Artifact detection rate
GATEWAY POLICY
Model Gateway Policy Enforcement
Can the gateway enforce policy and produce usable evidence?
Primary metric preview
Policy bypass rate
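As a rough illustration of how a two-sided metric such as "Bypass vs false refusal" could be scored, the sketch below computes both rates from a list of guardrail trial records. The record fields (`is_attack`, `blocked`) and the function name `guardrail_rates` are hypothetical and are not taken from the program's grading rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trial:
    """One guardrail trial (hypothetical record shape, for illustration only)."""
    is_attack: bool  # True if the input was an adversarial case
    blocked: bool    # True if the guardrail refused or blocked the request


def guardrail_rates(trials: list[Trial]) -> dict[str, Optional[float]]:
    """Return the bypass rate over attack trials and the false refusal
    rate over benign trials. A rate is None when its denominator is zero."""
    attacks = [t for t in trials if t.is_attack]
    benign = [t for t in trials if not t.is_attack]

    # Bypass: an attack that the guardrail did not block.
    bypass = (
        sum(1 for t in attacks if not t.blocked) / len(attacks)
        if attacks else None
    )
    # False refusal: a benign request that the guardrail blocked.
    false_refusal = (
        sum(1 for t in benign if t.blocked) / len(benign)
        if benign else None
    )
    return {"bypass_rate": bypass, "false_refusal_rate": false_refusal}


# Example: 4 attack trials (1 gets through), 5 benign trials (1 is refused).
example = (
    [Trial(is_attack=True, blocked=True)] * 3
    + [Trial(is_attack=True, blocked=False)]
    + [Trial(is_attack=False, blocked=False)] * 4
    + [Trial(is_attack=False, blocked=True)]
)
rates = guardrail_rates(example)  # bypass_rate 0.25, false_refusal_rate 0.2
```

Reporting the two rates side by side keeps a guardrail from looking strong simply because it blocks everything, which is the trade-off the Guardrail Robustness question asks about.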
Execution surfaces
Private benchmark work can use SecEng tooling
These surfaces support code scanning, adversarial trials, RAG tests, proxy traces, artifact analysis, and evidence packaging.
SecEng Code Scanner
Supports secure code generation and AI code review.
LLM Attack Range
Supports prompt injection, agent abuse, and adversarial trials.
RAG Test Harness
Supports retrieval leakage and boundary testing.
SecEng Proxy
Supports gateway policy and guardrail traces.
Artifact Analyzer
Supports artifact and binary triage.
Evidence Builder + Crosswalk
Supports reporting, mappings, and buyer proof.
What we track
We track the AI security controls that fail first in real products.
These are the broad measurement pillars that sit underneath every benchmark suite.
Attack Coverage
Measure the abuse paths that matter most: injection, leakage, tool misuse, gateway bypass, and unsafe code generation.
Security and Safety Outcomes
Track whether the system resists attacks while still supporting useful work, approvals, and developer velocity.
Reliability and Robustness
Keep the evaluation focused on behavior that remains stable across retries, model variants, and deployment contexts.
Transparency and Reproducibility
Publish methodology, dataset design, and claim controls before public scorecards are released.
Coverage list
Benchmark coverage spans the public-safe problem families below
Private benchmarking
Need evidence before public scorecards publish?
Run a private benchmark sprint against the models, guardrails, RAG systems, agents, gateways, coding workflows, or artifacts you actually use.
Methodology note
Methodology is published early so teams can understand the evaluation design, request private variants, and align internal AI security tests.
Private benchmark path
Private benchmark runs can be scoped now for customers, sponsors, or internal teams. Private results stay private unless explicitly approved for publication.
Methodology-first
Methodology first, not leaderboard theater
Each suite starts with research questions, datasets, scenarios, grading rules, metrics, and claim controls. Public scorecards are released only after validated trials. Private benchmark runs can be scoped earlier for teams that need internal or buyer-facing evidence.
Research questions first
Every suite starts with explicit questions, scopes, and success criteria.
Controlled fixtures
Synthetic and curated scenarios mirror the systems teams actually ship.
Repeatable scoring
Trials are repeated, and results are reported with confidence intervals and bounds where sample counts support them.
Validation-gated reporting
Public results publish only after evidence, review, and publication approval.
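One way the repeated trials and confidence intervals described above could be handled is sketched below: a Wilson score interval for a rate such as attack success rate, with reporting withheld when the trial count is too low. The Wilson interval is a standard choice for binomial proportions, but its use here, the `min_trials` threshold of 30, and the function names are assumptions for illustration, not the program's documented scoring method.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


def report_rate(successes: int, trials: int, min_trials: int = 30) -> dict:
    """Report a rate with its interval only when the sample count supports it.
    The min_trials threshold is a hypothetical policy value."""
    if trials < min_trials:
        return {"reported": False, "reason": "insufficient trials"}
    low, high = wilson_interval(successes, trials)
    return {"reported": True, "rate": successes / trials, "ci95": (low, high)}


# Example: 12 successful attacks in 100 trials is reported with an interval;
# 3 in 10 is withheld because the sample is too small.
enough = report_rate(12, 100)
too_few = report_rate(3, 10)
```

The Wilson interval is used in this sketch rather than the simpler normal approximation because it behaves sensibly when the observed rate is near 0 or 1, which is common for attack success rates against well-defended systems.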
Publication boundary
Scorecards are validation-gated.
Benchmark launch path
Use the hub to move from research to execution
The public hub shows how the benchmark program works. Teams can use it to choose a private benchmark sprint, align internal AI security tests, or route a product risk into the marketplace.
Quick use