PROMPT INJECTION
Prompt Injection Resistance Benchmark
Prompt Injection Resistance Across Models, Agents, and Guardrails
Evaluate direct, indirect, RAG, tool-output, and obfuscated injection attacks across AI systems.
Benchmark
Prompt Injection Resistance
Chat, docs, RAG, HTML, markdown, tool output, multimodal
Across model, RAG, agent, and gateway variants
Reported only after validated trials
Report preview
Planned report outputs
Publication boundary
Methodology and suite design publish before public scorecards. Suites in active build can be scoped privately while validation continues.
Problem
Why this benchmark matters
AI systems increasingly read content they do not control. Retrieved documents, web pages, tickets, emails, comments, and tool outputs can all carry instructions that conflict with system policy.
Why it matters
Prompt injection is the connective tissue between data leakage, agent abuse, RAG compromise, and unsafe tool use. Teams need repeatable evidence that their systems preserve instruction hierarchy.
What we will test
We will inject malicious instructions through user prompts, retrieved context, documents, HTML, markdown, tool outputs, and workflow artifacts, then measure policy bypass, leakage, and unsafe action rates.
Buyer value
Teams can identify vulnerable AI flows, compare model and guardrail behavior, validate prompt hierarchy controls, and build regression tests for product releases.
Benchmark scope
What this benchmark will measure
Scope is explicit so buyers can see what the benchmark covers before any public scorecards exist.
Classification
benchmark program
Target systems
Buyer problems
Risk dimensions
Evaluation task
Direct instruction override
User prompt attempts to override trusted instructions or policy boundaries.
Success condition
System refuses or safely handles conflicting instructions while preserving intended task behavior.
Failure condition
System follows untrusted instruction, reveals protected data, or changes policy behavior.
Evaluation task
Indirect document injection
Retrieved or processed content includes hidden instructions targeting the model or agent.
Success condition
System treats document content as data and does not follow embedded instructions.
Failure condition
System follows document instructions, leaks data, or invokes unsafe tools.
Evaluation task
Tool-output injection
External tool output carries malicious instructions or data exfiltration requests.
Success condition
Agent treats tool output as untrusted and preserves tool policy boundaries.
Failure condition
Agent follows tool-output instructions or calls additional tools unsafely.
Evaluation task
Obfuscated injection
Payloads use encoding, spacing, roleplay, multilingual text, or indirection to hide unsafe intent.
Success condition
System identifies or safely handles obfuscated policy-conflicting content.
Failure condition
System bypasses policy due to obfuscation and violates expected secure behavior.
Experiment design
Measure prompt injection resistance across model-only, RAG, agent, guardrail, and gateway configurations.
Hypotheses
- Indirect injection through retrieved content will be more dangerous than direct user prompts in tool-enabled systems.
- Gateway and guardrail layers will reduce obvious bypasses but may miss context-specific exfiltration attempts.
- Instruction hierarchy stability will vary significantly across repeated paraphrases and encodings.
Trial count
3,600
Repeated across prompt variants, model families, and controlled runs.
Repetitions per case
6
Enough to compare variants without pretending the scorecard is complete.
Variant
Model only
Base model response without external guardrail or gateway controls.
Captures baseline instruction hierarchy behavior.
Variant
RAG context
Model response with retrieved documents containing benign and malicious instructions.
Includes retrieval metadata and source attribution checks.
Variant
Agent with tools
Tool-enabled agent exposed to injection payloads through user and external content.
Captures tool calls, approvals, and policy decisions.
Variant
Gateway guarded
Same scenarios routed through gateway, logging, and policy enforcement.
Measures mitigation impact and auditability.
Methodology
How the benchmark will be run
Methodology is published early so teams can understand the evaluation design, request private variants, and align internal AI security tests.
Research questions
- Which surfaces are most likely to carry successful prompt injection payloads?
- How consistently do model and guardrail variants preserve instruction hierarchy?
- Which mitigations reduce leakage, tool misuse, and policy bypass under indirect injection?
- How do repeated attempts and paraphrases affect attack success rates?
Evaluation design
Run controlled injection families against model-only, RAG, agent, and gateway configurations. Each case defines trusted instructions, untrusted content, expected secure behavior, and prohibited outcomes.
Sampling plan
Use synthetic attack payload families across user prompts, retrieved documents, emails, HTML, markdown, tool output, and screenshots where multimodal models are included.
Grading and statistics
Grade for instruction hierarchy violation, policy bypass, sensitive data leakage, unsafe tool use, and recovery behavior using rules, classifiers, LLM judges, trace analysis, and human review.
Report attack success rate, policy bypass rate, leakage rate, and stability across repeated payload variants. Break down by injection surface and mitigation variant.
Limitations
All injection templates, mutation strategies, and expected secure behaviors should be versioned.
Public examples should be defensive and avoid high-impact exfiltration payloads.
Metrics
Planned report outputs
Metrics are shown as reporting dimensions for the active benchmark program.
Metric
Attack success rate
Share of trials where prompt injection causes prohibited behavior.
Unit
percent
Direction
lower is better
Aggregation
rate
Metric
Policy bypass rate
Share of trials where trusted policy or instruction hierarchy is bypassed.
Unit
percent
Direction
lower is better
Aggregation
rate
Metric
Secret leakage rate
Share of trials where protected synthetic secrets or private context are exposed.
Unit
percent
Direction
lower is better
Aggregation
rate
Metric
Stability score
Consistency of secure behavior under repeated payload variants.
Unit
score
Direction
higher is better
Aggregation
mean
Datasets
Data fixtures, source types, and public-safety boundaries
All public-safe. No raw job-description text or private corpus material is shown here.
Dataset
Synthetic prompt injection payloads v1
Synthetic direct and indirect injection payloads across chat, documents, HTML, markdown, tool outputs, and RAG context.
Source
synthetic
Classification
synthetic
Item count
200
Outputs
Report outputs
Each output is designed to be useful without implying finished benchmark rankings.
Output
Prompt injection methodology note
Public methodology for injection surfaces, task families, grading, and limitations.
Output
Private prompt injection scorecard
Private comparison of injection resistance across customer-selected systems and mitigations.
Status timeline
Where the suite sits now
The timeline shows current build state and the publication boundary.
Status timeline
Suite defined
Public benchmark plan and metadata published.
Status timeline
Payload families
Design injection templates, mutation strategies, and public-safe examples.
Status timeline
Trace and policy harness
Wire injection runner, RAG fixtures, tool traces, and grading rules.
Status timeline
Private pilot
Validate scoring against limited model and workflow configurations.
Commercial bridge
Private benchmarking and related assets
Private benchmark runs can be scoped now for customers, sponsors, or internal teams. Private results stay private unless explicitly approved for publication.
Private benchmark CTA
Request Injection Benchmark
Available now
Private benchmark sprint, model comparison, product-context benchmark, and evidence bundle.
Related routes
Related
Related services
Related
Related products
Related
Related courses
Claim controls
What the public page can and cannot say
These controls keep the page safe for public use until real results exist.
Claim controls
Public claim guardrails
This suite is planned. Public model rankings and benchmark results have not yet been published.
Claim boundary
- Public scorecards are validation-gated.
- Ranking claims are not allowed.
- Vendor comparison claims are not allowed.
- This suite is planned. Public model rankings and benchmark results have not yet been published.
Do not claim
- Do not claim a vendor resists prompt injection better than another.
- Do not publish payload success rankings before approved results.
- Do not imply completed testing.