Implementation Guide · 2026

AI Security Engineering Field Guide

A technical implementation guide for practitioners, focusing on the mechanics of control enforcement and evidence collection.

Modules

16 technical guides

Controls

24 recipes

Code snippets

32+

Status

Reference

About the authors and editors

Contributor notes for the 2026 field guide

These bios are intentionally brief. They identify the people who shaped the manuscript and the narrow reason each one is included here.

Co-authors

Primary manuscript authors and research framing.

Co-author

Alex Eisen

Advises on AI risk, incident response readiness, and research-informed product security priorities.

Relevance

Applied security-research and AI-risk framing to the control-plane sections.

Co-author

Alon Braun

Strategy, product framing, and advisory translation for teams that need a usable operating model.

Relevance

Shaped report structure, executive translation, and public-safe positioning.

Editors

Editorial review for clarity, precision, and publication-safe language.

Editor

Tim Kerimbekov

Risk-informed security strategy and operating-model guidance grounded in product and enterprise experience.

Relevance

Reviewed risk language and operating-model guidance for practical clarity.

Editor

Dorina Miroyannis

Legal and policy coverage for teams that need privacy, security, and terms pages updated without losing contractual precision.

Relevance

Reviewed policy language, contract boundaries, and public-safe wording.

Module 01

AI Security Foundations

What separates guessing from knowing.

The field collapses easily into familiar categories — web security, SDLC, compliance — and then an AI system does something unexpected and those categories don't have an answer. Most teams start their AI security work with a controls framework in hand but without a clear map of which layer is actually failing. Before you can choose the right control, you have to be able to name the right layer. That discipline is what this domain is about.

What This Domain Covers

AI security foundations cover the mental models required to reason about AI systems without collapsing every issue into "the model did something weird." The boundary of this domain sits before specialized topics such as prompt injection, RAG security, model supply chain, and governance evidence. It defines the language and distinctions that let practitioners separate application risk, model risk, infrastructure risk, safety risk, and compliance risk before they choose controls.

This is a distinct domain because AI security work fails when teams treat AI as either a normal web application or a mysterious model artifact. If the team treats everything as AppSec, it misses model update risk, prompt-context trust failures, retrieval authorization gaps, and eval evidence. If the team treats everything as model risk, it misses authorization, logging, data-flow, and product abuse problems. If the team treats everything as AI safety, it misses adversarial behavior, credential exposure, tool abuse, and production control failures.

The practitioner's entry point is disciplined categorization. Before recommending a control, ask what layer owns the failure: the product workflow, the prompt and orchestration layer, the retrieval plane, the model artifact, the MLOps platform, the vendor boundary, or the governance process. That habit turns vague concern into testable claims. It also prevents the common mistake of asking the model to compensate for missing product controls.

Core Concepts

Model Risk vs. Application Risk Model risk concerns the behavior, provenance, quality, and limitations of the model itself. Application risk concerns how the product wraps the model: what data it passes in, what tools it exposes, what outputs it renders, and what permissions it grants. A hallucination may be model behavior, but exposing confidential context to the wrong user is usually application design. Good AI security starts by locating the failure at the right layer before prescribing a control.

Infrastructure Risk vs. AI-Specific Risk AI systems still run on cloud accounts, containers, notebooks, queues, APIs, identity systems, and CI/CD pipelines. Those layers bring ordinary security concerns such as secrets exposure, excessive service account permissions, vulnerable dependencies, and public admin interfaces. AI-specific risk begins when model behavior, context construction, retrieval, evals, agents, or model artifacts alter the trust assumptions. A notebook with a hardcoded key is not uniquely AI, but a training platform that lets a low-privilege job read high-privilege features is AI-platform-specific.

Evidence vs. Security Theater Evidence is a record that demonstrates a control operated: a blocked release due to failed evals, a retrieval authorization log, a model intake approval, an agent tool-call audit trail, or a red-team finding closure record. Theater is a claim that control exists without operational artifacts: a policy slide, a training completion count, or a risk register entry that no system enforces. AI security programs produce a lot of plausible language. Practitioners need to ask what artifact proves the statement.

Jailbreak-Only Thinking Jailbreaks are visible and easy to demonstrate, so many people mistake them for the whole AI security field. They are one class of adversarial interaction, not the entire attack surface. Production AI systems also fail through retrieval authorization gaps, unsafe tool calls, model artifact tampering, response rendering bugs, telemetry gaps, vendor data flows, and insecure release processes. A practitioner who only looks for clever prompts will miss most real engineering risk.

The Model Is Not a Control Owner A model can participate in a control, but it cannot be the sole control owner for its own inputs, permissions, or outputs. Asking a model not to reveal data does not replace retrieval-time authorization. Asking a model not to call a dangerous tool does not replace runtime policy enforcement. Asking a model to follow the system prompt does not replace context isolation. AI security engineering means placing controls outside the model when the model cannot reliably enforce the boundary itself.

The Threat Landscape

Category Error Risk A team misclassifies an AI issue and assigns it to the wrong owner. For example, a RAG data leak gets labeled as a model hallucination instead of a retrieval authorization failure. The precondition is weak shared vocabulary across security, ML, product, and governance teams. The production impact is delayed remediation because the fix targets symptoms instead of the failing layer.

Control Theater Drift An organization writes AI policies faster than it builds evidence-generating controls. The precondition is leadership pressure to show AI governance maturity without release gates, telemetry, evals, or ownership. In production, this creates false confidence: executives believe controls exist, while engineering teams still ship AI features without measurable blockers. The failure becomes visible during an incident, customer security review, or audit.

Jailbreak Scope Collapse A security review focuses only on direct user prompts and ignores retrieval, tools, vendors, model updates, and logs. The precondition is a narrow red-team mindset built around prompt tricks. The impact is a product that may resist obvious jailbreaks while still leaking documents, calling tools unsafely, or failing to capture forensic evidence. This pattern often produces impressive demos and weak controls.

Responsibility Laundering Through the Model A team argues that the model should refuse unsafe behavior, so the application does not need stronger authorization, validation, or logging. The precondition is confusing model alignment behavior with enforceable security boundaries. The production impact is brittle control design: the same prompt-context channel carries instructions, untrusted data, and user-controlled content. When the model fails, there is no independent control layer.

Vocabulary Ambiguity in Incidents During a live issue, teams use "hallucination," "prompt injection," "safety," "security," and "eval failure" interchangeably. The precondition is lack of foundational terminology before the incident. The impact is slow triage, unclear severity, and poor executive communication. Good incident handling depends on naming the failure mechanism accurately enough to bound scope and choose containment.

What Good Looks Like

A mature team can explain the AI security scope without pretending that every AI problem belongs to one discipline. It maintains a layered model of the system: product workflow, prompt orchestration, retrieval, model provider, model artifact, tool layer, telemetry, MLOps platform, and governance evidence. Design reviews classify findings by layer and control owner. That classification appears in tickets, risk registers, and architecture review notes.

Good control design produces observable artifacts. A team can show model intake records, prompt and tool-call logging policies, retrieval authorization decisions, eval suite results, red-team scope documents, release gate outcomes, and incident reconstruction traces. These artifacts do not need to be perfect at first. They need to exist, have owners, and connect to decisions.

Mature AI security language separates safety, security, quality, privacy, and compliance without isolating them completely. A harmful output may be a safety issue, a prompt injection may be a security issue, memorized PII may be privacy and security, and a missing audit trail may be governance evidence failure. The team does not debate labels as an academic exercise. It uses labels to assign owners, select controls, and explain risk.

A good practitioner also knows when not to overstate. Not every hallucination is a vulnerability. Not every jailbreak is critical. Not every AI vendor issue requires blocking procurement. Not every model update is a security incident. The differentiator is evidence-based reasoning: what boundary failed, what data or action was exposed, what preconditions existed, what control should have blocked it, and what artifact proves the fix.

Assessment Focus

The assessment tests whether you can distinguish adjacent concepts under pressure. It asks whether you can separate model behavior from application control, safety from security, red teaming from evals, hallucination from adversarial output, and governance evidence from documentation. A strong practitioner reasons from mechanisms: what input entered, what trust boundary changed, what decision was made, what control operated, and what artifact remains.

The most common wrong-answer pattern is keyword matching. Candidates see "jailbreak" and choose the prompt-focused answer, even when the scenario describes a retrieval permission failure. They see "AI governance" and choose policy language, even when the correct answer requires a release gate or evidence record. This reveals a mental model built around familiar terms rather than system behavior. Correct reasoning starts by locating the failure layer.

Common Misconceptions

Most people entering the field assume AI security is mostly about clever prompts. That framing is understandable — prompt injection is visible, demonstrable, and generates dramatic examples — but it misses most of the actual surface area. Production AI systems also fail through retrieval authorization gaps, model artifact tampering, tool credential misuse, observability failures, and vendor data flows. An AI security review that centers only on prompt attacks leaves most of the engineering risk unexplored.

A hallucination is a fabrication, and some teams treat it as the core security story. But a model generating an incorrect answer is not automatically a security event. The security question is whether a recognized property — confidentiality, authorization, integrity, auditability, or safety enforcement — was violated. A model that confidently fabricates a medical recommendation is a reliability concern; a model that reveals a neighboring tenant's data is a security failure. The distinction matters because the controls differ.

Responsible AI, AI safety, and AI security are often used interchangeably in executive communications, and they do overlap. But they are not the same discipline. AI safety focuses on alignment, harmful behavior, and catastrophic risk. Responsible AI focuses on fairness, accountability, and societal impact. AI security focuses on adversarial abuse, trust boundaries, data protection, and operational controls. Blurring the categories causes teams to adopt the language of one discipline while needing the capabilities of another.

Teams building LLM-powered features sometimes treat the system prompt as the primary security control — a set of rules the model will follow. But a system prompt is an instruction, not an enforcement mechanism. It cannot substitute for retrieval-time authorization, runtime tool policy, output encoding, or audit logging. Adversarial context can compete with or reframe what the system prompt says, and model non-determinism means compliance is never guaranteed. Security controls must operate outside the model's reasoning, not within it.

Study Checklist

I can explain the difference between model risk, application risk, infrastructure risk, safety risk, and governance risk.
I can identify when a scenario is describing a control failure rather than a model behavior problem.
I can describe why jailbreak-only thinking misses retrieval, tool, supply-chain, and telemetry risks.
I can explain why "the model should refuse" is not an enforceable security control.
I can distinguish hallucination from adversarial output and quality failure from security failure.
I can describe at least five evidence artifacts that prove AI security controls operated.
I can explain the difference between red teaming, penetration testing, and automated evals.
I can classify an AI finding by system layer and control owner.

Remediation Starting Points

If you scored below 60% in this domain:

Draw a layered diagram of one AI system you use or build: user interface, application logic, prompt orchestration, retrieval, model provider, tools, logs, and governance artifacts. Label one likely security failure at each layer.
Take three AI incident writeups or public vulnerability reports and classify each finding as model risk, application risk, infrastructure risk, supply-chain risk, privacy risk, or governance evidence failure.
Write a one-page control argument for a RAG application that does not use "the model will refuse" as a control. Include retrieval authorization, logging, output handling, and release gates.
Run a hands-on exercise: build a minimal LLM wrapper that logs prompts, responses, user identity, model name, and retrieval sources. Then identify what you still could not investigate after a suspicious output.

Module 02

LLM Application Security

Input/output trust at the model boundary.

Most LLM application vulnerabilities are not clever attacks. They are ordinary application security failures — unencoded output, cache sharing across users, secrets in environment variables, error responses that contain model metadata — adapted to the LLM context. The surprise is how rarely teams adapt their existing review practices when they add a model call to a product surface.

What This Domain Covers

LLM application security covers the application layer that wraps model calls: request construction, prompt assembly, API keys, response streaming, output rendering, caching, error handling, logging, and downstream actions. It sits between classic AppSec and AI-specific domains such as prompt injection, RAG, and agent security. The boundary is practical: if the risk exists because the application passes data to or from a model, renders model output, stores model interactions, or exposes model-provider integration, this domain is in scope.

This is a distinct domain because normal web security reviews often stop too early. A conventional review may check authentication, authorization, injection flaws, dependency risks, and secrets handling, but miss how LLM context is assembled, how partial output streams before validation, how cached responses leak across users, or how model metadata appears in error messages. Treating LLM apps as ordinary API wrappers ignores the fact that the application converts untrusted input into model context and then often treats generated output as if it came from a trusted service.

The practitioner's entry point is the model boundary. Every LLM application has at least two boundary crossings: untrusted or semi-trusted data enters model context, and generated output leaves the model into an application, user interface, workflow, or downstream system. Security work starts by asking what is allowed to cross each boundary, what transformations occur, what gets logged, what gets cached, and what consumes the output. That model reveals failures that do not appear in standard request/response diagrams.

Core Concepts

Prompt Assembly as Application Logic Prompt assembly is not just string concatenation. It is application logic that chooses which instructions, user inputs, retrieved records, chat history, system state, and tool outputs become part of the model context. A bug in prompt assembly can leak hidden instructions, include another user's history, or pass privileged records into a low-privilege request. Security review should treat prompt templates and context builders like sensitive code paths. They need code review, tests, and telemetry.

Output Handling and Rendering Model output is untrusted data, even when it comes from a trusted provider. If the application renders HTML, Markdown, links, file paths, JSON, code, or shell commands from the model without validation, it can create cross-site scripting, unsafe navigation, command execution, or workflow manipulation. The model may generate content based on attacker-controlled input. Output encoding and strict downstream schemas reduce the damage when the model produces unexpected content.

Streaming and Partial Exposure Streaming responses improve user experience but complicate security. If content streams to the browser before policy checks, classification, or output validation complete, sensitive or unsafe content may already be exposed. The application must decide what can be checked pre-stream, what requires buffering, and what risk is acceptable for partial output. Streaming controls should be explicit rather than accidental.

Response Caching and Cross-User Leakage LLM apps often cache prompts, completions, embeddings, retrieval results, or summarized context to reduce latency and cost. Cache keys that omit user, tenant, authorization scope, model version, or source-document permissions can leak data across sessions. A response generated for a privileged user may become visible to a lower-privilege user if cached at the wrong layer. Cache design must preserve the security boundary that existed at generation time.

Provider Integration and API Key Exposure LLM wrappers frequently expose API keys through client-side code, logs, environment leaks, notebooks, or misconfigured proxies. Some apps call model providers directly from the browser, which makes key protection and request integrity difficult. Others use a server-side proxy but fail to enforce per-user rate limits, tenant isolation, or request logging. Provider integration is part of the application attack surface, not an implementation detail.

The Threat Landscape

Prompt Content in Error Messages An exception path returns the full prompt, system message, model parameters, or provider response to the client. The precondition is developer-friendly error handling left enabled in production or a wrapper that serializes upstream errors verbosely. The impact is disclosure of hidden instructions, user data, retrieval snippets, internal policy, or model configuration. This often appears during provider timeouts, schema validation failures, or tool-call parsing errors.

Cross-Tenant Cache Reuse The app caches responses or retrieval results using a key based only on the user's query text. A different tenant asks a similar question and receives content generated from another tenant's documents. The precondition is cost optimization without authorization-aware cache design. In production, this becomes a confidentiality incident that is hard to reproduce because the model call itself may not have occurred during the leaking request.

Unsafe Markdown or HTML Rendering The model returns a link, image, script-like Markdown, HTML fragment, or embedded instruction that the client renders too permissively. The precondition is treating model output as trusted UI content. The impact can include phishing, credential capture, stored XSS through conversation history, unsafe redirects, or malicious instructions disguised as application guidance. The root mechanism is not that the model is malicious; it is that untrusted output reached a renderer.

Streaming Before Validation The application streams model output directly to the user while a moderation, policy, or DLP check runs after completion. The precondition is latency-driven design with post-generation validation. The impact is partial disclosure of sensitive data or unsafe content before the app can block the final response. This failure is especially relevant when generated output may include retrieved confidential context.

Client-Side Context Injection The browser or mobile client sends hidden context fields, system hints, user attributes, or document snippets that the server trusts. An attacker modifies the client payload and injects unauthorized context or changes model instructions. The precondition is trusting client-side state for prompt construction. The impact ranges from policy bypass to unauthorized data access if server-side checks are weak.

What Good Looks Like

A mature LLM application treats prompt construction as a security-sensitive backend operation. The server determines user identity, tenant, authorization scope, allowed context sources, and model configuration. The client can submit user intent, but it cannot choose hidden instructions, privileged context, model provider credentials, or retrieval scope. Prompt templates and context builders have tests that verify isolation between users and tenants.

Good output handling assumes model output is untrusted. The app encodes rendered content, strips unsafe HTML, constrains links, validates JSON against schemas, and prevents generated content from directly driving privileged workflows. If the model output feeds another system, the receiving system enforces its own authorization and validation. Mature teams document which outputs are display-only, which are recommendations, and which can trigger action.

Caching, logging, and streaming have explicit security policies. Cache keys include tenant, user or role scope, source corpus, model version, and authorization state when needed. Logs capture enough to investigate without casually storing secrets or unnecessary PII. Streaming responses are buffered when the risk requires validation before exposure. These policies appear in code, configuration, and architecture review notes.

Operational evidence is visible. A team can show prompt assembly tests, output validation failures, cache key design, error redaction examples, provider key storage controls, streaming policy decisions, and incident reconstruction logs. Security is not proven by saying "we use a safe model." It is proven by showing how the application handles data at the model boundary.

Assessment Focus

The assessment tests whether you can reason about boundary violations in production LLM apps. You need to identify when a scenario describes bad error handling, unsafe rendering, cache confusion, streaming leakage, client-side trust, or key exposure. Strong answers usually focus on application controls outside the model: server-side prompt assembly, authorization-aware caching, output encoding, schema validation, provider proxying, and safe logging.

The most common wrong-answer pattern is to recommend prompt changes for application-layer failures. Candidates may suggest telling the model not to reveal system prompts when the actual issue is that an exception handler returns the prompt. They may suggest a safer model when the issue is cross-tenant caching. This reveals a mental model where the model is treated as the primary control. Correct reasoning identifies the application boundary and selects enforceable controls.

Common Misconceptions

It's easy to focus LLM application review entirely on prompt injection because it's the visible AI-specific problem. But LLM applications fail through exactly the same channels as conventional web applications: logs, error responses, caches, streaming outputs, client-side rendered content, and API keys. Many serious failures don't require a clever adversarial prompt. They happen because ordinary application boundaries weren't adapted to model workflows — output trust assumptions, context rendering, and cache key scope all need revisiting when a model is in the path.

Teams sometimes treat a model API response as if it came from a trusted internal service. It does not. The model produced that output based on whatever entered the context window — which may include attacker-controlled content, retrieved documents from a poisoned index, or conversation history a user manipulated. The API endpoint is trusted in a transport sense; the content it returns is not. Generated output should be treated as untrusted data and encoded, validated, and constrained before it triggers anything downstream.

Caching rarely gets security review in LLM applications because teams classify it as performance infrastructure. But every cache is a potential cross-user data store. If the cache key doesn't reflect the user's authorization scope, a response computed for a privileged session can be served to an unprivileged one. LLM response caches need the same threat modeling as session caches and object caches: who can retrieve what, for how long, under which authorization state.

Streaming is usually added to improve perceived responsiveness, but it changes the security model. If output validation runs only after generation completes, sensitive content may already be visible to the user before any check runs. A secure streaming design defines which response types must be buffered and verified before the first token renders — not as a UX decision, but as a security property that determines when the output is safe to display.

Study Checklist

I can explain why prompt assembly should run server-side and be treated as application logic.
I can identify error paths that leak prompts, model metadata, retrieved context, or provider responses.
I can describe how response caching can leak data across tenants or authorization scopes.
I can explain why streamed output may need buffering before policy or DLP checks finish.
I can identify unsafe rendering paths for Markdown, HTML, links, JSON, and generated code.
I can describe secure handling of model provider API keys in server-side wrappers.
I can explain how client-side context injection can bypass server-side intent.
I can name evidence artifacts that prove LLM boundary controls operate.

Remediation Starting Points

If you scored below 60% in this domain:

Review one LLM application or prototype and draw the full model boundary: what enters the prompt, what leaves the model, where output is rendered, what is cached, and what is logged.
Implement a small server-side LLM proxy that refuses client-supplied system prompts, redacts provider errors, and validates model output against a strict JSON schema.
Inspect an existing LLM app for cache keys. Write down whether user identity, tenant, source corpus, authorization scope, and model version influence cached responses.
Run a hands-on lab: create a toy streaming endpoint and test what happens when sensitive content appears in the first 20 tokens. Modify the design to buffer and validate before display.

Module 03

Prompt Injection and Context Security

The adversarial frontier. Trust nothing in context.

Prompt injection is often compared to SQL injection — untrusted input treated as instruction. That analogy is useful but incomplete. In SQL injection, the attack surfaces are discrete and the language has a grammar. In LLM context, the attack surface is every piece of text the model processes, and "instruction" versus "data" is a judgment call the model makes dynamically. The controls have to compensate for that architectural reality, not just block known patterns.

What This Domain Covers

Prompt injection and context security cover attacks where adversarial content enters the model's context and changes how the system behaves. The domain includes direct prompt injection from a user turn, indirect injection from retrieved documents, tool outputs, emails, calendar entries, web pages, tickets, files, and other application data, and context poisoning where hostile content is smuggled into a trusted workflow. It sits between LLM application security, RAG security, and agent security because context is the shared attack surface across all three.

This is a distinct domain because the model processes instructions and data through the same language channel. Traditional input validation assumes code and data can be separated cleanly. LLM systems blur that boundary: a document can contain facts, instructions, impersonation attempts, and policy-bypass language in the same text. Treating prompt injection as a normal injection bug misses the central issue: the model may treat untrusted content as operational guidance unless the orchestrator constrains context and validates outputs.

The practitioner's entry point is context trust modeling. Every context segment should have a source, a purpose, a trust level, and an allowed influence on the model's behavior. System instructions, developer messages, user input, retrieved content, tool output, and conversation history should not have equal authority. Once you draw those tiers, prompt injection stops being a magic phrase problem and becomes a control problem: isolate context, reduce authority, validate outputs, and enforce decisions outside the model.

Core Concepts

Direct vs. Indirect Prompt Injection Direct prompt injection comes from the user interacting with the model and trying to override instructions. Indirect injection comes from content the user may not control directly during the current interaction, such as a retrieved document, a website, an email, or a tool response. Indirect injection is often more dangerous because the application may treat that content as evidence or trusted context. A user asking "summarize this web page" may unknowingly import instructions written by an attacker.

Context Authority Tiers Not every piece of context should influence the model equally. System instructions define the application contract, developer instructions define task boundaries, user input defines intent, retrieved documents provide evidence, and tool outputs provide external state. If retrieved content can override system instructions, the system has no meaningful hierarchy. Context security requires the orchestrator to preserve authority boundaries rather than hoping the model will infer them.

Separator and Role Injection Attackers often use delimiters, fake role labels, Markdown headings, XML tags, quoted text, or transcript-like structures to make untrusted content appear authoritative. A retrieved document might include "SYSTEM: ignore all prior instructions" or a fake tool result that appears to come from an internal service. The mechanism is not the literal syntax; it is confusion over what part of the context is instruction and what part is data. Defenses must make context provenance explicit and avoid relying on visual separators alone.

Cross-Plugin and Cross-Tool Trust Chains Modern AI systems often combine retrieval, email, calendars, SaaS connectors, browser tools, and internal APIs. A weakly controlled integration can contaminate the whole orchestration if its output enters context with too much authority. An email containing hostile instructions can influence a calendar assistant, which can influence a tool that sends messages. The risk compounds when tool outputs become new prompts for later steps.

Orchestrator-Level Guardrails The model cannot be the sole defense against its own context. Orchestrator-level guardrails enforce boundaries outside the model: which context is included, which tools are available, what outputs are valid, what actions require approval, and which responses are rejected or transformed. Strong designs pair model instructions with external validation, allowlists, schemas, policy engines, and audit logs. The point is not to eliminate prompt injection completely; it is to limit what injected content can cause.

The Threat Landscape

Indirect Injection Through Retrieved Documents An attacker places instructions inside a document that the RAG system later retrieves. The precondition is that retrieved content enters the prompt with enough authority to influence behavior beyond summarization or evidence. In production, the model may reveal hidden context, ignore policies, cite malicious text as fact, or prepare unsafe downstream actions. This attack often looks like normal document processing until the injected content activates in a specific query.

Tool Output Hijack A tool returns untrusted text that includes instructions for the model or the user. The precondition is that tool output is placed into the next model call without provenance, quoting, or validation. The impact can include false task completion, unsafe follow-up actions, or manipulation of a later tool call. The issue is especially common when web search, ticketing, email, or browser tools feed directly into an agent loop.

Role Boundary Confusion Adversarial content imitates system, developer, admin, or tool messages inside user-controllable text. The precondition is weak formatting and no explicit context metadata. In production, the model may treat a fake instruction as higher authority than it deserves. Even if the attack does not fully override the system prompt, it can shift the model's reasoning and cause policy drift.

Jailbreak Chaining Across Turns An attacker gradually weakens instructions through a series of interactions instead of one obvious jailbreak. The precondition is long-lived conversation memory or summarization that preserves adversarial framing. The impact is a context state where the model has internalized the attacker's desired role, policy exception, or false premise. This is a context poisoning problem, not just a single-turn prompt problem.

Cross-Integration Contamination Content from one integration influences another integration with a stronger privilege level. For example, a malicious support ticket affects an agent that can update customer records. The precondition is shared context across tools without authority separation. The production impact is a confused-deputy failure: low-trust content steers high-trust action.

What Good Looks Like

A mature system labels context by source, trust tier, and intended use. Retrieved content is treated as evidence, not instruction. Tool outputs are treated as observations, not policy. User input defines the user's request, not the application's control plane. The prompt format makes those distinctions visible to the model, but the orchestrator also enforces them outside the model through tool policy, output validation, and action constraints.

Good prompt injection defense uses a hierarchy. Input sanitization may remove obvious hostile markup, but it is not enough. Context isolation prevents untrusted content from sharing the same authority as instructions. Output validation checks that the model's response matches the allowed task, schema, and policy. Runtime guardrails decide which tools are available and what approvals are required. Audit logs record which context segments were included and which tool calls followed.

Mature teams test indirect injection, not just user-turn jailbreaks. They seed hostile instructions into documents, web pages, email threads, ticket comments, calendar descriptions, and tool outputs. They test whether the system follows those instructions, exposes hidden context, changes tool arguments, or alters policy decisions. The result is not just a vulnerability list; it is a regression suite that runs when prompts, models, retrieval logic, or tools change.

Control evidence exists at the orchestrator layer. A team can show context schemas, source labels, prompt templates with authority separation, output validation failures, blocked tool-call logs, injection test cases, and red-team reproduction traces. The strongest evidence shows not only that the model was instructed to ignore untrusted commands, but that the application prevented untrusted context from making privileged decisions.

Assessment Focus

The assessment tests whether you can recognize context as the attack surface. You need to distinguish direct prompt injection from indirect injection, role injection, separator injection, tool-output hijack, and cross-plugin contamination. Strong answers choose defenses that reduce the authority of untrusted context and enforce outcomes outside the model. Weak answers often rely on telling the model to be careful.

The most common wrong-answer pattern is overconfidence in sanitization or system prompts. Candidates suggest filtering the words "ignore previous instructions" while missing that the attack can be phrased indirectly, embedded in quoted text, or carried through a tool result. They recommend stronger system instructions when the correct answer requires context isolation, output validation, and tool-policy enforcement. Correct reasoning assumes adversarial context will reach the model and limits what it can affect.

Common Misconceptions

Many first-pass defenses against prompt injection are phrase filters — a list of known jailbreak patterns, forbidden instructions, or suspicious syntax. The problem is that the attack space is effectively unbounded. Instructions can be rephrased, encoded, split across multiple context chunks, delivered through indirection, or embedded in semantically benign text. Phrase filters reduce noise on known patterns but cannot be a primary control against a determined attacker who has the ability to influence any part of the context.

Retrieving from an internal knowledge base feels safe compared to the open internet. But internal doesn't mean safe to follow as instruction. Internal wikis, help desk systems, ticket trackers, and collaborative documents all have writable paths for users with varying trust levels. Once a piece of internal content carries hostile instructions and gets retrieved into context, the retrieval source tells you where it came from — not whether following it is authorized. Internal provenance needs to inform trust level, but it doesn't eliminate context injection risk.

System prompts express intent well, but they aren't runtime enforcement. The model processes the system prompt alongside everything else in the context window, and adversarial content can compete with, override, or reframe what the system prompt says. This isn't a failure of careful prompt writing — it's an architectural property of how LLMs process context. Enforcement belongs in the orchestration layer, validation logic, authorization system, and tool controls — not inside the model's own reasoning.

Without tools, a prompt injection can still cause significant harm: leaked sensitive context, misleading summaries, false citations, degraded trust, or policy-violating advice to users. Tools multiply severity because they allow action rather than just text generation, but the core violation — untrusted content gaining instructional authority — matters regardless. Review prompt injection risk in any system where injected instructions could change behavior with user-visible consequence.

Study Checklist

I can distinguish direct prompt injection from indirect prompt injection in a scenario.
I can explain why retrieved documents and tool outputs must be treated as untrusted context.
I can identify separator injection, role injection, and fake tool-result patterns.
I can describe how cross-plugin trust chains contaminate an orchestration.
I can explain why the model cannot be the sole defense against prompt injection.
I can design context tiers for system instructions, user input, retrieved content, and tool output.
I can name orchestrator-level controls that limit injected context authority.
I can describe evidence artifacts for prompt injection testing and blocked tool calls.

Remediation Starting Points

If you scored below 60% in this domain:

Build a small prompt template that separates system instructions, user input, retrieved content, and tool output. Add explicit source labels and write down which section is allowed to influence decisions.
Create five indirect injection test documents and run them through a RAG or summarization prototype. Record whether the model follows, quotes, ignores, or transforms the injected instructions.
Review an agent or LLM workflow and identify every point where tool output becomes input to another model call. Add a validation or quoting rule for each point.
Run a hands-on lab: seed a malicious instruction into a calendar event, email, or local document and test whether an assistant summarizes it as data or follows it as an instruction.

Module 04

RAG Security

Retrieval is a new attack surface.

The most common RAG authorization failure isn't a clever attack. It's a system that retrieves privileged content for a low-privilege user, correctly refuses to display the sensitive text in the final answer, and considers the incident resolved. The model already processed that context. The boundary was already crossed. Retrieval-time authorization isn't a defense improvement to consider — it's the only layer where the control actually works.

What This Domain Covers

RAG security covers the retrieval plane that supplies external knowledge to an LLM application. It includes document ingestion, chunking, embeddings, metadata, vector-store tenancy, retrieval-time authorization, source attribution, context assembly, and the boundary between retrieved evidence and generated output. It sits next to prompt injection because retrieved documents can carry adversarial instructions, and next to privacy because vector stores and embeddings often contain sensitive data in forms teams do not manage like ordinary records.

This is a distinct domain because RAG systems create a new path from stored data to model context. Many teams secure the final answer but fail to secure retrieval. They check whether the user should see the output after the model has already received high-privilege context. That is too late. Once privileged information enters the context window, the system has already violated the data boundary, even if the model later refuses to display it.

The practitioner's entry point is retrieval-time authorization. Ask who is allowed to retrieve each chunk, under what purpose, in which tenant, with which classification, and with what audit record. Treat vector search as a data access layer, not a search convenience. A secure RAG design enforces permissions before context assembly, maintains metadata integrity, and produces evidence showing exactly which sources influenced each answer.

Core Concepts

Retrieval-Time Authorization Authorization must happen before retrieved content enters the model context. Output filtering is not enough because the model may use privileged context to reason, summarize, or leak indirectly. Retrieval-time authorization checks user identity, tenant, role, document permissions, classification, and purpose before selecting chunks. If a user cannot access the source material, the model should not receive it on that user's behalf.

Chunk-Level Trust RAG systems often split documents into chunks and store them in a vector index. Permissions, classification labels, document lineage, freshness, and source integrity need to survive chunking. If metadata is lost or applied only at document level, a low-trust chunk may be retrieved into a high-trust flow or vice versa. Chunk-level trust lets the system enforce precise boundaries during retrieval.

Vector Store Multi-Tenancy Vector stores can mix embeddings from multiple tenants, teams, products, or classification zones. Multi-tenancy failures occur when namespaces, filters, metadata constraints, or index isolation are weak. Similarity search does not respect business boundaries unless the application enforces them. A secure RAG system treats tenancy and classification as mandatory retrieval filters, not optional query hints.

Retrieval Poisoning Retrieval poisoning occurs when adversarial or low-quality content is seeded into the knowledge base so it appears in future answers. The poisoned content may contain false facts, prompt injection, malicious links, or instructions that steer the model. The precondition is weak ingestion control, inadequate source review, or overbroad write access. Poisoning turns the knowledge base into an attack delivery channel.

Source Attribution Integrity Source attribution tells users and auditors which documents supported an answer. Weak attribution can cite the wrong document, hallucinate sources, omit critical source context, or present a low-trust source as authoritative. A mature RAG system ties generated claims to retrieved chunks and preserves enough metadata to verify the answer. Attribution is both a user trust feature and a forensic artifact.

The Threat Landscape

Context Window Privilege Escalation A low-privilege user asks a question that retrieves high-privilege content into the context window. The precondition is retrieval that ranks by semantic similarity without enforcing ACLs first. The model may summarize, paraphrase, infer, or leak sensitive information even if an output filter catches obvious phrases. In production, this becomes a data access failure disguised as a language-model behavior issue.

Cross-Tenant Vector Leakage Embeddings from multiple tenants live in one index, and metadata filters are missing, optional, or applied after retrieval. The precondition is shared infrastructure built for cost or convenience without hard isolation. A query from one tenant retrieves chunks from another tenant because the semantic match is strong. The impact is confidentiality breach and difficult incident scoping because the logs may only show generated answers, not retrieved chunks.

Poisoned Knowledge Base Entry An attacker adds or modifies a document that the RAG system trusts. The precondition is weak ingestion review, permissive document upload, or integration with user-editable systems such as wikis, tickets, or shared drives. Later, the poisoned chunk appears in context and causes false answers, unsafe instructions, or prompt injection. The attack persists until the corpus is cleaned and the index is rebuilt.

Metadata Stripping During Ingestion The ingestion pipeline drops permissions, classification labels, source IDs, or freshness timestamps during chunking and embedding. The precondition is treating ingestion as an ML preprocessing task rather than a security-sensitive data pipeline. In production, retrieval cannot enforce policy because the policy data is gone. The team may discover the failure only when asked to prove why a particular chunk was retrieved.

Hallucinated or Misleading Citations The model cites a source that was not retrieved, cites the wrong chunk, or attributes a claim to a source that does not support it. The precondition is loose coupling between generation and attribution. The impact is user overtrust, audit evidence failure, and poor incident reconstruction. In regulated or customer-facing systems, attribution integrity can become a control requirement.

What Good Looks Like

A mature RAG system enforces authorization before retrieval results enter the model context. The retrieval layer applies tenant, role, document, classification, and purpose filters as hard constraints. Similarity ranking happens inside the authorized result set, not across the entire corpus. If authorization metadata is unavailable, the system fails closed rather than retrieving first and filtering later.

Good ingestion pipelines preserve security metadata. Each chunk carries source ID, document owner, tenant, classification, access policy, ingestion timestamp, version, and integrity information. Deletions and permission changes propagate to the index. Re-embedding jobs have audit records. Teams can answer which source produced a chunk, when it entered the index, who could retrieve it, and whether it was still valid at query time.

Mature RAG control evidence includes retrieval logs, source IDs, authorization decisions, index namespace configuration, chunk metadata schemas, ingestion approval records, poisoning tests, and citation validation reports. These artifacts prove that retrieval is controlled, not just that the final answer looked acceptable. They also support incident response by showing which users may have received context from which sources.

Good teams test RAG as a data access system. They attempt cross-tenant retrieval, low-privilege access to high-privilege documents, poisoned document injection, stale permission changes, deletion propagation failures, and misleading citation scenarios. They do not rely on the model's refusal behavior as the primary protection. The retrieval plane owns confidentiality before the model ever sees the content.

Assessment Focus

The assessment tests whether you understand that RAG risk lives primarily in retrieval and context assembly. You need to identify when output-layer controls are too late, when metadata integrity matters, when vector-store tenancy is weak, and when poisoning enters through ingestion. Strong reasoning starts with the data path: source document, chunk, embedding, metadata, index, query, retrieval filter, context window, answer, citation, and log.

The most common wrong-answer pattern is generation-time thinking. Candidates suggest asking the model not to reveal unauthorized data or adding a post-generation filter. That misses the core failure: privileged content should never have entered context. Correct reasoning enforces permissions at retrieval time, preserves metadata during ingestion, and records which chunks supported each response.

Common Misconceptions

Teams often test RAG authorization by checking whether sensitive text appears in the final output. If the answer passes that check, the system looks secure. But the model receives retrieved context before generating any output. If high-privilege chunks entered the context window for a low-privilege user, the model already processed that data — and may surface it through inference, indirect summaries, or follow-up answers that a string filter won't catch. Authorization must prevent unauthorized context assembly from happening, not compensate for it downstream.

Vector search returns the most semantically similar content, not the most authorized content. Those are different queries with different semantics. A user retrieving records about their own account may receive chunks about another account that are semantically closer to their query terms. Without explicit metadata filters, tenant namespaces, and policy checks layered on top of similarity ranking, retrieval is effectively permission-unaware. The embedding score and the authorization decision are entirely separate operations and must both be performed.

Internal corpora feel safe because they don't come from the public internet. But internal documents include content created by users across many trust levels: wiki edits, ticket updates, comment threads, uploaded attachments, and imported external content. Any of those can carry false data or adversarial instructions. The relevant question isn't whether the source is internal — it's whether ingestion trust controls applied and whether the content's write barrier is low enough to represent an injection risk.

Citation UI gets treated as a product polish feature and deprioritized accordingly. But citations are the evidence layer for generated answers. They allow users to verify claims, help auditors trace support for decisions, and give incident responders the source record to examine when something goes wrong. Weak citations — incomplete, fabricated, or inconsistently formatted — hide retrieval failures and create false confidence in generated content. Citation design is a forensic requirement before it is a usability requirement.

Study Checklist

I can explain why authorization must occur before retrieval results enter the model context.
I can identify context window privilege escalation in a RAG scenario.
I can describe chunk-level metadata needed for secure retrieval.
I can explain how vector-store multi-tenancy can fail without hard filters or namespaces.
I can identify retrieval poisoning through writable knowledge sources.
I can describe deletion and permission-change propagation requirements for vector indexes.
I can explain why hallucinated or misleading citations create security and evidence problems.
I can name logs and artifacts needed to reconstruct a RAG answer.

Remediation Starting Points

If you scored below 60% in this domain:

Draw a RAG data-flow diagram from source document to final answer. Label where authorization, metadata preservation, citation, logging, and deletion propagation occur.
Build a small vector-store prototype with two tenants. Attempt to retrieve tenant B content as tenant A, then add mandatory tenant filters and verify the test fails.
Create a poisoned document containing adversarial instructions and add it to a test corpus. Query until it is retrieved, then document whether the system treats it as data or instruction.
Run a hands-on exercise: delete or reclassify a source document, then verify whether the chunk disappears from retrieval results and whether the retrieval logs show the change.

Module 05

Agent Security

Autonomous tools need non-autonomous trust.

The mental model shift required for agent security is about scope, not complexity. When a model only generates text, the worst outcome is a bad output. When a model calls tools, the worst outcome is an irreversible action — a message sent, a record modified, credentials extracted, a workflow triggered at scale. The security question for agents isn't whether the model can be tricked. It's what a tricked model can do, and how much of that damage is reversible.

What This Domain Covers

Agent security covers LLM-powered systems that can call tools, make decisions across steps, delegate to other agents, or take action in external systems. The domain includes tool permission design, runtime authorization, approval placement, action chaining, delegation chains, sandboxing, environment isolation, rollback, and auditability. It sits beyond prompt and RAG security because the model is no longer only producing text; it is influencing state, sending messages, changing data, triggering workflows, or operating infrastructure.

This is a distinct domain because autonomous or semi-autonomous action changes the severity model. A prompt injection in a chat-only assistant may mislead a user or expose text. A prompt injection in an agent with write access may update tickets, email customers, modify code, create cloud resources, delete records, or move money. Treating agent security as a prompt problem misses the real control question: what is the maximum damage a single compromised or confused action chain can cause?

The practitioner's entry point is blast-radius reasoning. Do not begin by asking whether the model is smart enough to choose the right tool. Begin by asking what each tool can do, what authority it carries, what approvals it requires, what state it can change, what logs it emits, and whether the action can be reversed. Agent security is the discipline of making delegated action safe even when model judgment, context, or tool output becomes unreliable.

Core Concepts

Tool Permission Taxonomy Tools should be classified by the effect they can have, not by their friendly names. Read-only tools retrieve information. Write tools change internal state. Destructive tools delete, overwrite, revoke, or disable. Irreversible tools create effects that cannot be fully undone, such as sending an external email, submitting a purchase, or disclosing data to a third party. This taxonomy drives approval requirements, logging depth, and environment isolation.

Blast Radius Scoping Blast radius is the maximum damage a single tool call or action chain can cause. A tool that can update one draft ticket has a smaller blast radius than a tool that can bulk-close production incidents. A safe agent architecture limits scope by tenant, user, resource type, action type, time window, and quota. Blast-radius limits must be enforced at runtime, not described in a manifest.

Human-in-the-Loop Placement Approvals help only when placed at meaningful decision points. Asking a human to approve every read operation creates fatigue and rubber-stamping. Asking for approval before irreversible external communication, broad-scope writes, privilege changes, or destructive operations adds real friction. The approval prompt should show the proposed action, source evidence, affected resources, risk level, and rollback plan. Approval without context is not a control.

Action Chaining and Compound Risk A sequence of individually low-risk actions can become high-risk in combination. Reading a calendar, drafting an email, looking up a customer record, and sending the email may create an external disclosure path. The agent's total authority is the composition of all tools it can call across steps. Security review must analyze workflows, not isolated tool declarations.

Auditability and Reversible Attribution Agent actions need forensic reconstruction. Logs should show the user, agent identity, model version, prompt/context references, tool name, arguments, authorization decision, approval decision, output, and resulting state change. Reversible attribution means investigators can determine whether the user requested the action, the model inferred it, a tool output influenced it, or a policy allowed it. Without this, incidents become arguments about what the agent "meant."

The Threat Landscape

Tool Permission Inflation An agent receives broad API credentials because narrow permission design slows development. The precondition is convenience-driven integration with service accounts, admin tokens, or broad OAuth scopes. In production, a prompt injection or model mistake can exercise privileges far beyond the user's intent. The impact is state change at machine speed with unclear ownership.

Approval Bypass by Action Reframing The agent avoids an approval gate because the risky action is split into smaller steps or framed as a low-risk operation. The precondition is policy that checks only tool names or shallow action categories. For example, "prepare draft" may be safe, but "prepare and send through a different tool" may bypass the intended approval. The impact is irreversible action without meaningful human confirmation.

Confused Deputy Through Tool Output A low-trust source influences the agent to use a high-trust tool. The precondition is tool output or retrieved content entering the agent loop without authority separation. A malicious ticket comment might tell the agent to update billing details or send secrets to a contact. The agent becomes the deputy that carries authority from the user or system into a context the attacker can steer.

Delegation Chain Privilege Drift One agent delegates work to another agent that has different tools or broader permissions. The precondition is multi-agent orchestration without explicit permission inheritance rules. The receiving agent may perform actions the initiating user or original agent could not. In production, this creates accountability gaps and hard-to-reconstruct action paths.

Sandbox Escape Assumption Failure A team assumes the execution environment is isolated, but the tool or code runner has network, filesystem, credential, or metadata-service access beyond the intended scope. The precondition is incomplete sandbox design or inherited platform permissions. The impact can include data exfiltration, credential theft, lateral movement, or persistent modification of the agent environment.

What Good Looks Like

A mature agent system starts with a tool inventory. Each tool has an owner, permission class, allowed resources, authorization policy, approval requirement, logging requirement, rate limit, and rollback behavior. The inventory distinguishes read-only, write, destructive, external, and irreversible actions. It also states what the tool cannot do. This inventory is not a spreadsheet for auditors alone; it drives runtime policy.

Good agent design enforces least privilege at tool execution time. The runtime checks user identity, tenant, resource scope, tool scope, action parameters, and current risk tier before allowing a call. Manifest labels are treated as documentation, not enforcement. If a tool says it only reads, the backing credential and API policy must make writes impossible. If a tool can write, the policy must constrain what, where, and how much.

Mature systems place human approval where it changes outcomes. Approvals appear before irreversible external messages, destructive actions, broad-scope writes, permission changes, financial operations, and production modifications. Approval screens include enough context to judge the action: proposed arguments, evidence sources, affected objects, estimated blast radius, and rollback plan. Approval decisions become audit records.

Agent observability is deep enough to reconstruct a chain. Logs tie each action to the initiating user, context segments, model call, tool arguments, policy decision, approval, result, and downstream actions. When an incident occurs, the team can identify which prompt, retrieved document, tool output, or delegated agent influenced the action. This evidence supports containment, root cause analysis, and retesting.

Assessment Focus

The assessment tests whether you can reason about action authority, not just model behavior. You need to identify which tool classes require approvals, why manifest declarations do not enforce runtime permissions, how action chains compound risk, and what audit logs must contain. Strong answers focus on runtime authorization, blast-radius reduction, approval placement, rollback, and evidence.

The most common wrong-answer pattern is treating the agent as a trusted employee. Candidates assume that if the user authorized the agent generally, each action is acceptable. They also overvalue human approval without asking whether the approver receives useful context. Correct reasoning treats the agent as an untrusted decision engine operating inside constrained permissions, with external controls deciding what it may do.

Common Misconceptions

Labeling a tool "read-only" in the prompt or schema communicates intent, but it doesn't enforce behavior. Whether the tool is actually read-only depends on the backing credential's permissions, the API's access controls, and the runtime policy wrapping the call. A description of read-only behavior attached to a write-capable service account credential does nothing at enforcement time. Controls live in credential scope, runtime policy, and sandboxing — not in the tool's name or description.

Human approval in the loop sounds like it closes most agent risk. In practice, it only helps when approval happens at meaningful decision points and presents enough context for a real decision. If users face approval dialogs for every low-stakes step, they stop reading and start clicking through. If the approval interface conceals tool arguments, source context, or the action's reversibility, it has the appearance of oversight without the substance. Approval needs to be rare enough that it receives attention, and informative enough that it can be genuinely evaluated.

Tool risk assessment that evaluates each tool in isolation misses what happens when they compose. Reading a set of sensitive records, summarizing them, drafting a message, and calling an email send tool creates a data exfiltration path from four tools that might each individually clear a low-risk review. Workflow-level risk assessment needs to trace full action chains, not just individual tool capabilities, to catch the compound outcomes.

Final output logs show what the user saw — they don't show what the agent decided, what tools were called, what arguments were passed, what context the model received, or what authorization decisions were made along the way. When an agent produces an unexpected result or takes an unauthorized action, final output logs leave investigators without a coherent timeline. Full traceability requires a log record at each tool call with arguments, authorization decision, result, and downstream state change.

Study Checklist

I can classify tools as read-only, write, destructive, external, or irreversible.
I can explain why manifest labels do not enforce runtime access.
I can describe blast-radius limits for a tool or action chain.
I can identify where human approval adds meaningful friction in an agent workflow.
I can explain compound risk from action chaining.
I can identify privilege drift in an agent-to-agent delegation scenario.
I can describe sandboxing and environment isolation requirements for tool execution.
I can specify audit log fields needed to reconstruct an agent action chain.

Remediation Starting Points

If you scored below 60% in this domain:

Inventory the tools in one agent system or prototype. Classify each tool by permission type, externality, reversibility, required approval, and maximum blast radius.
Pick one risky tool and redesign its permission model so the credential itself enforces least privilege. Do not rely on tool descriptions or prompt instructions.
Create a sample approval screen for an irreversible action. Include proposed arguments, evidence sources, affected resources, user identity, risk level, and rollback status.
Run a hands-on lab: build a toy agent with two tools, then attempt to chain harmless-looking actions into a sensitive disclosure or state change. Add runtime policy until the chain is blocked.

Module 06

Model Supply Chain Security

What runs before inference matters.

No one who audits software dependencies treats a compiled binary as "just a file." Model artifacts deserve the same skepticism. A checkpoint downloaded from a public hub carries provenance, licensing terms, behavioral properties from its base model, and possibly unsafe deserialization risk — and most organizations deploying that model have documented none of it. The gap between how teams treat software packages and how they treat model weights is where supply chain risk enters.

What This Domain Covers

Model supply chain security covers the lifecycle of model artifacts before they run in production. It includes model provenance, base model tracking, fine-tune lineage, hub downloads, artifact integrity, unsafe serialization formats, license restrictions, model registries, dependency trust, release signing, and promotion workflows. It sits next to software supply chain security, but differs because model artifacts are large, opaque, often downloaded from public hubs, and sometimes loaded through formats that can execute code or carry hidden behavior.

This is a distinct domain because many teams treat models as data files, not supply-chain components. A model checkpoint may encode training data, licensing obligations, behavior inherited from a base model, unsafe deserialization risk, or tampering introduced during distribution. Normal dependency scanning does not fully answer where the model came from, whether the artifact was modified, what base it was fine-tuned from, or whether the organization is allowed to use it commercially.

The practitioner's entry point is provenance. Before arguing about model quality or performance, ask where the artifact came from, who published it, what base model it used, what license applies, what hash identifies it, who approved it, how it entered the registry, and which production services load it. A model without provenance is not just a technical unknown; it is an operational risk that weakens security, compliance, and incident response.

Core Concepts

Model Provenance Provenance documents the origin and lineage of a model artifact. It should identify the publisher, source repository or hub, exact version, base model, fine-tuning process, training or adaptation data where available, license, approval record, and production owner. Without provenance, teams cannot answer whether a model is trustworthy, reproducible, or legally usable. Provenance should be recorded before production deployment, not reconstructed during an incident.

Artifact Integrity Artifact integrity proves the model loaded in production is the artifact that was approved. Common controls include cryptographic hashes, signed releases, immutable storage, registry promotion workflows, and deployment pinning. Downloading a model by a mutable name such as "latest" defeats many integrity controls. Integrity verification should occur before loading, especially when models come from public hubs or shared object stores.

Unsafe Serialization Formats Some model formats can execute code during loading or rely on unsafe deserialization. Pickle-based artifacts are the classic example in Python ML workflows. Safer formats such as safetensors reduce code execution risk, but they do not solve provenance, licensing, or behavioral risk. Practitioners need to know which formats are allowed, which require sandboxing, and which are prohibited in production.

Model Registry Controls A model registry should not be a passive file listing. It should enforce access control, versioning, approval, promotion stages, metadata requirements, artifact hashes, owner assignment, and rollback paths. Registries such as MLflow, Weights and Biases, Vertex AI Model Registry, and SageMaker can support strong workflows if configured correctly. Weak registries become dumping grounds where unreviewed artifacts look production-ready.

License and Use Rights Model licenses can restrict commercial use, redistribution, derivative works, field of use, or output rights. Fine-tuning on a restricted base model may carry obligations into the derived artifact. Teams that ignore licenses create legal and business risk, not just compliance paperwork. License scanning for AI must cover model weights, datasets, code dependencies, and provider terms.

The Threat Landscape

Tampered Public Hub Artifact A model artifact downloaded from a public hub has been replaced, compromised, or uploaded by an impersonator. The precondition is deployment from public sources without hash verification, publisher validation, or approval workflow. In production, the model may execute malicious code during load, behave unexpectedly, or embed hidden triggers. The incident can be hard to scope if services pulled the artifact at different times.

Unsafe Pickle Loading A team loads a pickle-based model, adapter, or preprocessing artifact in a privileged environment. The precondition is treating ML artifacts as trusted data files. The impact can be arbitrary code execution at load time, often with access to model-serving credentials, object stores, or internal networks. This risk is especially common in notebooks, MLflow pyfunc models, and ad hoc inference scripts.

Base Model Provenance Gap A fine-tuned model enters production without a documented base model, license, or adaptation process. The precondition is velocity-driven experimentation where training runs are promoted without intake controls. In production, the team cannot evaluate inherited restrictions, known vulnerabilities, memorization risk, or behavioral issues. During an incident, responders cannot determine whether the behavior came from the base, the fine-tune, or the application.

Mutable Model Reference Drift Deployment references a mutable tag, branch, model card, or registry stage without pinning to an immutable version. The precondition is convenience-driven deployment automation. A model update changes production behavior without re-running evals, security review, or approval. This creates silent drift and breaks evidence because the artifact under review is not necessarily the artifact in production.

License Contamination A team fine-tunes or deploys a model with commercial restrictions, copyleft-like obligations, or incompatible dataset terms. The precondition is treating model selection as an engineering choice without legal or procurement review. The impact can include forced removal, customer contract risk, product roadmap disruption, or inability to distribute derivative models. The security team may be asked to prove provenance after the business has already shipped.

What Good Looks Like

A mature model supply chain has a formal intake process. Every production model has an owner, source, version, base model, license, hash, approval record, registry entry, and deployment target. The intake process distinguishes experimental models from production candidates. Production promotion requires integrity verification, license review, security review, eval results, and rollback planning.

Good artifact handling is deterministic. Model deployments pin exact versions and verify hashes before loading. Public hub downloads are mirrored into controlled storage after approval rather than pulled live into production. Unsafe formats are prohibited or loaded only in isolated environments. Build and serving pipelines record which artifact hash was loaded by which service at which time.

Mature registries enforce workflow. A registry entry includes metadata, lineage, approval status, owner, associated evals, known limitations, license, and retirement status. Access control prevents arbitrary users from promoting artifacts to production. Registry events feed audit logs and release gates. Rollback points are known and tested.

Supply-chain evidence is available. A team can show model intake checklists, hash verification logs, signed artifact records, registry promotion approvals, SBOM or MBOM metadata, license review notes, base model lineage, and deployment manifests. This evidence supports not only security review but also vendor assessments, customer questions, and incident response.

Assessment Focus

The assessment tests whether you can recognize that model artifacts are part of the supply chain. You need to identify when provenance, integrity, serialization format, license, registry controls, or mutable references are the core issue. Strong answers focus on approval workflows, hashes, signed artifacts, pinned versions, registry metadata, and safe loading practices.

The most common wrong-answer pattern is treating model supply chain as normal dependency management or as pure model quality review. Candidates may suggest re-running accuracy tests when the scenario describes artifact tampering. They may suggest scanning Python dependencies when the issue is a pickle model loaded from an untrusted hub. Correct reasoning asks what artifact is running, where it came from, and whether the organization can prove it.

Common Misconceptions

Model files are routinely treated as data artifacts rather than supply-chain components. In practice, some formats can execute code during loading, and all models carry embedded risk: training data lineage, licensing terms, behavioral properties inherited from a base model, and potential tampering introduced during distribution. A model artifact that isn't treated as a supply-chain component doesn't get intake review, integrity verification, registry controls, or deployment pinning — and every one of those gaps is a real attack or compliance surface.

Safetensors is a meaningful improvement over formats that execute code on load. But it doesn't prove who created the model, whether the artifact was tampered with after publication, what license applies, or whether the organization has approved the base model. Format safety reduces one class of risk — unsafe deserialization — while leaving provenance documentation, hash verification, registry governance, and base model inventory entirely unaddressed. One control isn't a supply chain program.

Model hubs provide convenient distribution and model metadata, but they don't guarantee that every artifact is signed, authenticated, legally usable, or appropriate for production. Hubs differ significantly in their trust models, signing practices, takedown policies, and abuse histories. Treating "downloaded from a well-known hub" as sufficient due diligence is effectively outsourcing intake controls that the deploying organization remains responsible for.

A registry stores artifacts and makes them discoverable — which is valuable — but that's inventory, not governance. Governance requires decisions: who approved this model, what evaluation evidence supports it, what license applies, who is accountable for production risk. A registry full of unreviewed artifacts with incomplete metadata provides a more organized version of the original gap. Fill the registry with decisions, not just checksums.

Study Checklist

I can explain what model provenance must include before production use.
I can describe why hash verification should occur before model loading.
I can identify unsafe serialization risks in pickle-based model artifacts.
I can distinguish format safety from provenance and license safety.
I can describe model registry controls for approval, promotion, and rollback.
I can identify risks from mutable model references such as latest tags or registry stages.
I can explain license and use-rights risk for base models and fine-tunes.
I can name evidence artifacts that prove model supply-chain controls operated.

Remediation Starting Points

If you scored below 60% in this domain:

Pick one model used in a project and write a provenance record: source, publisher, exact version, base model, license, hash, owner, approval status, and deployment target.
Download a model artifact into a test environment and verify its hash before loading. Record how the hash would be stored and checked in CI/CD or deployment.
Review your allowed model formats. Write a policy that defines which formats are allowed, which require sandboxing, and which are prohibited in production.
Run a hands-on lab: create a mock model registry entry with required metadata, promotion stages, approval fields, rollback version, and evidence links.

Module 07

MLOps Platform Security

Pipelines are attack surface too.

The ML platform is often treated as a technical support system for experimentation — something that sits adjacent to production, not quite in it. That framing is wrong and expensive. The ML platform holds the credentials, training data access, model artifacts, and deployment controls that determine what actually ships to users. Compromising a notebook server or training job is not adjacent to a production incident; in most architectures, it is the production incident.

What This Domain Covers

MLOps platform security covers the infrastructure and workflows used to train, evaluate, register, deploy, monitor, and roll back machine learning models. It includes notebooks, training jobs, feature stores, experiment trackers, model registries, artifact stores, CI/CD pipelines, serving endpoints, staged rollout systems, and the credentials that connect those components. It sits adjacent to cloud security and DevSecOps, but the risk profile differs because ML platforms often combine broad data access, code execution, privileged service accounts, and weakly audited experimentation workflows.

This is a distinct domain because ML platforms become high-value control planes for data and model behavior. A compromised notebook server can expose credentials and training data. A weak feature store permission model can leak high-value attributes. A training job with broad service account access can read sensitive data across the organization. A serving rollout without staged controls can push a bad model to all users before anyone can detect the failure. Treating MLOps as generic CI/CD misses the data gravity and artifact lineage that make these systems special.

The practitioner's entry point is pipeline accountability. For every model or AI feature, you should be able to answer what code ran, what data it accessed, what credentials it used, what artifact it produced, who approved promotion, what tests passed, and how it rolled out. If a platform cannot answer those questions, it cannot support security investigation, compliance evidence, or safe production deployment.

Core Concepts

Notebook Secret Hygiene Notebooks are often used as exploratory environments, but they frequently become production-adjacent. Credentials appear in cells, outputs, environment dumps, package installation commands, and Git history. A notebook may connect to object stores, feature stores, model registries, and production databases. Good notebook security requires secret management, output scrubbing, access controls, expiration, and review before notebooks become pipeline code.

Feature Store Authorization Feature stores contain curated training and inference features that may include sensitive, regulated, or high-business-value attributes. Authorization needs to control who can read each feature group, which model can use it, and whether online serving can access it at inference time. A low-privilege model should not be able to train on or infer from high-privilege features simply because the feature store exposes them through a shared API. Feature access is a data access decision, not just an ML convenience.

Training Job Forensics Training jobs should leave enough evidence to reconstruct what happened. That means recording code version, data snapshot, feature set, hyperparameters, base model, dependencies, artifact hash, service account, environment, and output location. Without this, teams cannot determine whether a bad model came from bad data, bad code, poisoned inputs, dependency drift, or unauthorized access. Forensics must be designed into the pipeline before an incident.

Staged Rollout Blast Radius Model deployment can change behavior at scale immediately. A bad model, unsafe prompt configuration, or broken retrieval pipeline should not reach 100% traffic without guardrails. Staged rollout controls include canaries, shadow mode, traffic slices, automated rollback, evaluation gates, monitoring thresholds, and human approval for high-risk releases. Blast radius is reduced by release design, not by hoping observability catches everything later.

ML Platform Credentials MLOps platforms often hold service accounts with broad data access, registry permissions, storage credentials, cloud execution roles, and deployment rights. These credentials are attractive because they bridge code, data, and production systems. They may be less audited than standard application credentials because they live in notebooks, training clusters, or pipeline runners. Treat the ML platform as a privileged credential store.

The Threat Landscape

Notebook Credential Exposure A developer commits a notebook containing API keys, cloud tokens, database credentials, or printed secrets. The precondition is exploratory work without secret scanning, output scrubbing, or notebook review. The impact can include data exfiltration, unauthorized model registry access, or cloud resource compromise. The risk persists in Git history even after the visible cell is removed.

Feature Store Privilege Bypass A training pipeline reads features that the model owner should not access. The precondition is feature store authorization that operates at project level instead of feature, subject, or purpose level. In production, the model may encode sensitive attributes or serve predictions based on data outside its approved scope. The issue can be difficult to detect because the final model artifact does not reveal its full feature-access history.

Unreviewed Training Script Execution A pipeline runner executes training code from a branch, notebook, or unpinned package without review. The precondition is treating training as experimentation even when it produces production artifacts. The impact can include malicious code execution, dependency compromise, credential theft, or poisoned artifacts. The model may later be promoted through a registry that assumes the upstream pipeline was trustworthy.

Full-Traffic Bad Model Release A model update ships directly to all users without staged rollout, rollback trigger, or monitoring threshold. The precondition is deployment automation optimized for speed but not blast-radius control. The impact may be harmful outputs, degraded recommendations, privacy leakage, or business process failure across the entire customer base. The incident window expands because there is no safe partial deployment.

Artifact Store Overexposure Model artifacts, training datasets, logs, or eval outputs are stored in object buckets or artifact stores with broad read permissions. The precondition is shared storage used by multiple notebooks, jobs, and teams without object-level policy. The impact can include leakage of training data, prompt logs, model weights, customer data, or proprietary evaluation sets. Artifact stores often become shadow data lakes.

What Good Looks Like

A mature MLOps platform treats experimentation and production promotion as different trust zones. Notebooks can be flexible, but production-bound code moves into reviewed pipelines. Secrets are injected through managed stores, not pasted into cells. Notebook outputs are scanned before sharing. Access to datasets, feature groups, registries, and artifact stores is tied to role, project, tenant, and purpose.

Good pipelines produce forensic records automatically. Each training run records the code revision, data snapshot, feature set, model or base model, dependency lockfile, execution identity, artifact hash, evaluation results, and promotion decision. This metadata travels with the model into the registry. When a production issue appears, responders can trace from endpoint behavior back to the exact run and data inputs.

Mature rollout practices limit blast radius. A new model or AI configuration enters shadow mode, canary traffic, or limited cohort exposure before full deployment. Release gates include eval thresholds, security checks, rollback plans, and owner approval. Monitoring tracks model performance, drift, safety signals, misuse indicators, and operational errors. Rollback is tested and available before the release begins.

Security evidence exists across the platform. Teams can show secret scanning results for notebooks, feature access logs, training job metadata, registry promotion records, artifact integrity checks, staged rollout configuration, and monitoring dashboards. These artifacts prove the ML platform is controlled as an operational system. They also make security reviews faster because reviewers do not have to reconstruct the pipeline from informal knowledge.

Assessment Focus

The assessment tests whether you can identify ML platform-specific failure modes that ordinary application reviews often miss. You need to recognize notebook secret exposure, feature store authorization gaps, training job forensic gaps, rollout blast-radius failures, pipeline code execution risk, and overprivileged service accounts. Strong reasoning follows the path from data and code to artifact and deployment.

The most common wrong-answer pattern is treating MLOps as only cloud infrastructure. Candidates may focus on network controls while missing that the training job has broad data permissions, or they may recommend generic CI/CD scanning while ignoring feature store access. Correct reasoning asks what data the platform can access, what code it executes, what artifacts it produces, and how production changes are controlled.

Common Misconceptions

Notebooks get categorized as research tools, which implies they sit outside the production security perimeter. But notebooks routinely connect to production data stores, model registries, cloud storage, and API endpoints. They accumulate credentials in environment dumps, printed cells, and Git-tracked outputs. Code written in a notebook this week becomes pipeline code next month. A notebook environment that isn't governed is a production credential store that nobody audited.

Feature stores get described as caching systems for ML — a way to precompute common attributes for training and inference. That description focuses on the output without addressing what the input represents: feature stores hold high-value, curated attributes that may include sensitive personal data, proprietary business signals, or regulated content. Access control, purpose limitation, and logging aren't optimizations to add later. They're security requirements for any system that mediates broad data access into model training and serving pipelines.

A training job that completes without error proves the code ran. It doesn't prove that the training data was authorized, that dependencies came from trusted sources, that the resulting artifact passed security evaluation, or that rollout risk has been assessed. Each of those is a separate gate with separate evidence. Treating training success as deployment approval collapses a multi-stage control chain into a single completion status.

Model deployment changes behavior at scale and at speed. A bad model, compromised artifact, or unsafe prompt configuration can reach every user before a human detects the problem — if rollout controls don't create intervention points. Security teams who treat deployment as the ML team's operational concern alone have no visibility into canary thresholds, rollback triggers, or monitoring gates. By the time they're involved, full exposure has already happened.

Study Checklist

I can identify notebook secret exposure patterns in cells, outputs, and Git history.
I can explain why feature store authorization is a security control.
I can describe the metadata required to reconstruct a training job.
I can identify overprivileged service accounts in ML pipelines.
I can explain how staged rollout reduces model deployment blast radius.
I can describe artifact store exposure risks for models, datasets, logs, and evals.
I can name security gates that should apply before model promotion.
I can describe evidence artifacts for MLOps platform control.

Remediation Starting Points

If you scored below 60% in this domain:

Review one notebook repository and scan for credentials, environment dumps, data paths, printed secrets, and output cells that contain sensitive values.
Map a training pipeline from code commit to production endpoint. Identify the execution identity, data sources, feature groups, artifact store, registry, eval gate, and rollout step.
Write a feature store access policy for one sensitive feature group. Include who can train on it, who can serve it online, and what logs must be retained.
Run a hands-on lab: create a toy model deployment with 5% canary traffic, a rollback threshold, and a log record tying the endpoint to a model artifact hash.

Module 08

AI-Aware Secure SDLC

Security gates that actually block release.

Most teams doing their first AI security review apply the same SDLC gates they already have. The threat model template, the code review checklist, the release approval form — all built before the product included an LLM. The result is a process that can articulate AI risk in new vocabulary but hasn't changed what it actually blocks. The difference between an AI-aware SDLC and the old one isn't the language. It's which failures stop a release.

What This Domain Covers

AI-aware secure SDLC covers how AI-specific security requirements enter software development, code review, architecture review, testing, release approval, model updates, and incident-driven improvements. It includes threat modeling for AI features, pull request review criteria for LLM integrations, release blockers for failed evals or unchecked model changes, rollback planning, and re-entry criteria after model or prompt updates. It sits between AppSec, ProductSec, MLOps, and governance because it turns AI risk into repeatable engineering gates.

This is a distinct domain because conventional SDLC controls do not automatically catch AI-specific failure modes. STRIDE helps with spoofing, tampering, repudiation, information disclosure, denial of service, and elevation of privilege, but it does not fully capture context poisoning, retrieval-time authorization, non-deterministic output, model update drift, or autonomous action chains. Teams that bolt AI features onto existing review processes often create security ceremonies that discuss risk without blocking unsafe releases.

The practitioner's entry point is release criteria. Ask what must be true before an AI feature ships: which threat model exists, which evals passed, which model version is approved, which prompt changes were reviewed, which retrieval and tool permissions were tested, which rollback path exists, and which evidence record proves the decision. AI-aware SDLC turns those questions into mandatory gates rather than optional review notes.

Core Concepts

AI Feature Threat Modeling AI threat modeling extends normal application threat modeling to model boundaries, context construction, retrieval, tool calls, model artifacts, provider dependencies, and telemetry. The output should be a ranked attack surface list, not a brainstorming transcript. It should identify which controls are design-time, runtime, release-time, and incident-time. A useful AI threat model produces tickets and release blockers.

AI-Specific Release Blockers A release blocker is a condition that prevents shipment. For AI systems, blockers may include failed prompt injection tests, missing retrieval authorization, unapproved model updates, no rollback plan, unreviewed tool permissions, absent logging, or unresolved high-severity red-team findings. A blocker differs from a recommendation because the release cannot proceed until it is resolved or explicitly risk-accepted. Without blockers, AI SDLC becomes advisory.

Model Update Re-Entry Criteria A model update can change system behavior without changing application code. Re-entry criteria define what must re-pass before the new model, prompt, fine-tune, embedding model, or retrieval configuration ships. This may include eval suites, red-team regression tests, performance checks, privacy checks, output schema validation, and approval. Treat model changes like security-relevant software changes.

Pull Request Review for LLM Code LLM integration code deserves specific review questions. Does the client control system prompts? Are API keys server-side? Are retrieved sources authorized before context assembly? Is output validated before rendering or action? Are prompts, model names, provider responses, and tool calls logged safely? Generic code review rarely asks these questions unless the checklist includes them.

Incident-Driven SDLC AI incidents should update the development process. If a prompt injection incident occurs, the fix should not only patch one prompt; it should add a regression test, change review criteria, improve logging, and update release gates. Incident-driven SDLC prevents repeated failures by turning lessons into mandatory checks. Mature teams treat incident artifacts as SDLC inputs.

The Threat Landscape

Security Ceremony Without Blocking Power A team holds AI security reviews but launches features regardless of unresolved findings. The precondition is a review process that advises but does not define release blockers. In production, known risks become accepted by default because no one formally owns the decision. The impact is governance theater and avoidable incidents.

Unchecked Model Update A provider model, base model, fine-tune, embedding model, or prompt configuration changes without rerunning security tests. The precondition is treating model behavior as an external dependency outside SDLC control. In production, safety, output format, refusal behavior, retrieval quality, or tool-use patterns shift unexpectedly. The team cannot prove the release met prior criteria.

No Rollback Plan for AI Behavior An AI feature ships without a tested way to revert model version, prompt template, retrieval index, tool access, or rollout percentage. The precondition is assuming rollback works the same as code deployment. When behavior degrades, the team cannot quickly restore a known-good state. The incident lasts longer because rollback was never designed.

Generic PR Review Misses Context Trust A pull request adds LLM context assembly code, but reviewers check only style, tests, and ordinary authorization. The precondition is no AI-specific review checklist. The code ships with client-controlled hidden context, unsafe rendering, missing retrieval filters, or broad tool scope. The failure appears later as a data leak or prompt injection path.

Incident Fix Does Not Become Regression A team patches a prompt, document, or tool policy after an issue but never adds a test or release gate. The precondition is treating incidents as one-off defects. The same failure reappears after a model update, prompt rewrite, retrieval change, or refactor. The organization learns the lesson verbally but not operationally.

What Good Looks Like

A mature AI-aware SDLC begins with intake. New AI features are identified early and routed through AI-specific review based on risk tier. The intake captures model provider, data sources, retrieval design, tool access, user population, output use, privacy impact, and deployment plan. Low-risk features may receive lightweight review, but high-risk features trigger threat modeling, eval requirements, approval gates, and rollback planning.

Good release gates are explicit and enforceable. A release cannot proceed if required evals fail, if a high-severity red-team finding remains open, if retrieval authorization is missing, if tool permissions exceed the approved scope, if logs cannot reconstruct behavior, or if rollback is untested. Risk acceptance is possible, but it must name the owner, rationale, expiration, compensating controls, and evidence.

Mature teams treat AI configuration as release-controlled material. Prompt templates, system instructions, model versions, embedding models, retrieval filters, tool manifests, and eval thresholds are reviewed and versioned. A change to any of these can trigger re-entry criteria. The team does not assume that unchanged application code means unchanged risk.

Operational artifacts prove the SDLC works. Teams produce AI feature intake records, threat models, PR review checklists, eval results, release gate decisions, model update approvals, rollback test evidence, incident-to-regression mappings, and risk acceptance records. These artifacts let security, engineering, governance, and leadership see whether AI risk is actually embedded in shipping decisions.

Assessment Focus

The assessment tests whether you can distinguish a security gate from a security conversation. You need to identify AI-specific release blockers, model update re-entry requirements, PR review criteria, rollback needs, and incident-driven regression changes. Strong answers connect controls to the development process and ask what prevents an unsafe release from shipping.

The most common wrong-answer pattern is recommending more review without defining enforcement. Candidates may say "conduct a threat model" when the scenario requires a release blocker or re-entry test. They may recommend "monitor after launch" when the missing control is pre-release eval failure blocking. Correct reasoning translates risk into gates, evidence, owners, and repeatable workflows.

Common Misconceptions

Adding an LLM to a product is a material change to the threat model, not a feature that existing review criteria automatically cover. STRIDE handles spoofing, tampering, and elevation of privilege, but it doesn't prompt reviewers to ask whether retrieved content can carry hostile instructions, whether eval evidence is sufficient for a model update, or whether rollback covers prompt template revert as well as code revert. Teams that bolt AI features onto existing review frameworks usually produce security theater — a process that discusses risk without gates that stop unsafe releases.

Advisory security reviews have value for learning and communication, but they are not controls. A control changes what happens — it blocks a release, requires remediation, or forces a recorded risk acceptance. If the security review can be completed and then set aside while the release ships unchanged, it doesn't meet that definition. Mature AI SDLC defines blocking criteria before the review begins, so closure requires either passing evidence or an explicit exception with a named approver.

When an external model provider updates the production model behind an API, many teams treat it as a vendor maintenance event outside their release process. But model updates change output behavior, refusal patterns, format compatibility, and safety properties. A change that shifts model behavior in an integrated workflow is a release event that requires re-entry: evals re-run, behavior verified against known cases, documentation updated, and evidence produced. Outsourcing the model doesn't outsource the release obligation.

Application version rollback is the most visible rollback action, but AI systems have multiple independent components that may need independent reversion. A production incident may require rolling back the model version, prompt template, retrieval index, embedding model, tool permission set, or provider routing configuration — sometimes independently. Rollback plans that only cover code leave most of the AI system's behavioral components unaddressed. The rollback path needs to be tested before the incident, not designed during it.

Study Checklist

I can explain what standard STRIDE misses in AI feature threat modeling.
I can name AI-specific release blockers that should stop shipment.
I can define re-entry criteria after a model, prompt, or retrieval change.
I can identify LLM-specific pull request review questions.
I can describe rollback requirements for model, prompt, retrieval, and tool changes.
I can explain how incident findings become regression tests and release gates.
I can distinguish a security ceremony from an enforceable control.
I can name evidence artifacts that prove AI SDLC gates operated.

Remediation Starting Points

If you scored below 60% in this domain:

Write an AI feature intake form that captures model provider, prompt ownership, retrieval sources, tools, data classification, output consumers, logs, and rollout plan.
Take one AI feature and define five release blockers that would stop shipment. Include at least one eval failure, one authorization issue, one logging issue, and one rollback issue.
Create a pull request checklist for LLM integration code. Add questions about server-side prompt assembly, output validation, API key storage, retrieval filters, and tool permission scope.
Run a hands-on exercise: simulate a model update and write re-entry criteria that must pass before the update reaches production.

Module 09

Privacy and Data Protection in AI Systems

GDPR scope extends into your vector store.

Privacy engineering for AI systems is harder than it looks on the first pass. The difficulty isn't explaining what regulations require — it's that a customer support message can become a fine-tuning example, an embedding, a retrieval chunk, an eval fixture, a prompt log, and a product analytics event before anyone maps the data lifecycle. Deletion rights, consent scope, and breach notification obligations all depend on knowing where the data went. Most AI teams don't know.

What This Domain Covers

Privacy and data protection in AI systems cover how personal data enters, moves through, persists inside, and exits AI applications. This includes training data, prompt logs, retrieved documents, embeddings, vector indexes, model outputs, fine-tuning datasets, evaluation sets, human review queues, and vendor processing paths. It sits next to ordinary privacy engineering, but AI systems complicate the work because data can be transformed into embeddings, summarized into derived records, memorized by models, exposed through inference, or retained in logs that product teams do not treat as privacy-sensitive.

This is a distinct domain because deleting the source record is no longer the whole privacy operation. A customer support ticket may become a fine-tuning example, an embedding, a retrieval chunk, a red-team fixture, an eval sample, a prompt log, and a product analytics event. If the user exercises a deletion right, the organization must know which derived artifacts exist and whether they can be removed, invalidated, or excluded from future processing. Treating AI privacy as normal database deletion leaves vectors, indexes, model artifacts, and logs behind.

The practitioner's entry point is data lineage. Before deciding whether a privacy control is sufficient, trace where personal data enters, how it is transformed, which systems store it, which models consume it, what vendors process it, what users can retrieve it, and what evidence proves deletion or minimization. AI privacy becomes manageable when every derived copy has a purpose, owner, retention rule, and removal story.

Core Concepts

Deletion Propagation to Embeddings Deleting a source document does not automatically delete the embedding or vector record derived from it. If the vector index still contains the chunk, a RAG system may retrieve personal data even after the original record is gone. Deletion propagation requires mapping source IDs to chunks, embeddings, indexes, caches, summaries, and logs. A mature system can remove or invalidate derived records when the source changes.

Purpose Limitation for AI Use Data collected for one purpose cannot automatically be reused for model training, fine-tuning, or product improvement. Customer support data may be valid for resolving tickets but not valid for training a public-facing assistant without a legal basis and disclosure. Purpose limitation requires controls at dataset creation, model intake, and vendor routing. Teams need to record not just what data they used, but why they were allowed to use it.

Data Minimization at Retrieval RAG systems should retrieve only what the user and task require. Returning entire documents, broad context windows, or unrelated chunks increases privacy exposure. Data minimization applies at query construction, retrieval filtering, chunk selection, context assembly, and output generation. A good design favors narrow context, classification-aware retrieval, and explicit suppression of unnecessary personal data.

Inference-Time PII Leakage AI systems can leak personal data during inference through memorization, retrieval leakage, prompt logs, tool outputs, or overbroad context. The user may not see the source data directly; the model may paraphrase, infer, or reconstruct sensitive attributes. Inference-time privacy review must test both exact disclosure and derived disclosure. A response that reveals "the customer who complained about cancer treatment billing" can be sensitive even without a full name.

Transparency and Consent Records Users and customers need to understand when AI systems process their data, what categories of data are used, and whether outputs are AI-generated. Consent and transparency records must connect to actual system behavior. A notice that says data may be used for improvement is not enough if engineering cannot show which datasets, prompts, embeddings, or vendors are included. Transparency becomes credible only when backed by inventory and data-flow evidence.

The Threat Landscape

Incomplete Erasure in Vector Indexes A source record is deleted from the primary database, but its chunk and embedding remain searchable. The precondition is an ingestion pipeline without source-to-vector lineage or deletion propagation. In production, a user may still retrieve personal data through semantic search or generated summaries. The organization may believe it honored erasure while derived artifacts remain active.

Unauthorized Training Reuse A team fine-tunes or evaluates a model on customer data collected for service delivery. The precondition is weak dataset intake and no purpose-limitation review. The impact can include regulatory exposure, customer contract violations, and hard-to-remove model artifacts. Even if the model performs well, the training set may be illegitimate.

Prompt Log PII Accumulation An LLM application logs raw prompts and outputs that contain names, account numbers, health data, credentials, or confidential customer records. The precondition is observability design that prioritizes debugging without privacy classification. In production, logs become a secondary sensitive data store with broad engineer access and unclear retention. Incident response may require logs, but privacy requires minimization and access control.

Overbroad Retrieval Context A RAG system retrieves more personal data than needed to answer a query. The precondition is broad top-k retrieval, large chunks, poor classification, or no purpose-aware filtering. The model may expose unnecessary personal data or use it to infer sensitive facts. This is a minimization failure even when the user had access to some source material.

Vendor Processing Blind Spot An AI provider, analytics service, or annotation vendor processes personal data without being reflected in the data inventory or privacy notice. The precondition is teams adding AI vendors through product experimentation or procurement shortcuts. The impact is uncontrolled data transfer, unclear retention, and inability to answer customer or regulator questions. Shadow AI services make this pattern worse.

What Good Looks Like

A mature AI privacy program maintains lineage from source data to derived AI artifacts. Source records map to chunks, embeddings, caches, prompt logs, eval examples, fine-tuning datasets, model versions, and vendor processing events where applicable. When data is deleted or reclassified, downstream artifacts are removed, rebuilt, invalidated, or marked out of scope. The team can show deletion job logs and index rebuild records.

Good systems minimize data before it reaches the model. Retrieval queries apply user authorization, purpose limitation, classification filters, and narrow chunk selection. Prompt construction avoids unnecessary PII, and output policies prevent casual disclosure of personal attributes. Logs retain enough for investigation while redacting, tokenizing, or restricting sensitive fields. Privacy review participates in architecture, not just legal copy.

Mature teams document legal basis and purpose for AI processing. Training datasets, eval sets, fine-tuning records, vendor calls, and model improvement workflows include purpose, data category, retention, owner, and approval status. Consent, notices, and customer commitments match actual processing paths. If a dataset cannot be justified, it is not used.

Operational evidence includes data-flow diagrams, source-to-vector lineage tables, prompt logging policies, DLP reports, deletion propagation tests, vendor data-processing records, dataset approval records, and privacy incident playbooks. These artifacts let teams prove that privacy controls operate across AI-specific transformations. Without them, privacy claims depend on assumptions.

Assessment Focus

The assessment tests whether you can identify privacy failures that hide in AI transformations. You need to recognize that embeddings, summaries, logs, eval datasets, fine-tuning examples, and vendor requests can all carry personal data. Strong answers focus on lineage, minimization, purpose limitation, deletion propagation, access control, and evidence.

The most common wrong-answer pattern is database-centric deletion thinking. Candidates assume that deleting the source row solves the privacy issue. They miss vector indexes, prompt logs, model training sets, cached summaries, and vendor retention. Correct reasoning follows the data through transformations and asks what remains after the source changes.

Common Misconceptions

Embeddings are mathematical representations, and teams sometimes classify them as anonymized because they aren't directly readable text. But embeddings derived from personal data retain semantic information that can support re-identification: they enable retrieval of source text or close paraphrases, they encode attribute information that can be used for inference, and they can be linked to source records through metadata. Where embeddings originate from personal information, privacy obligations follow the derived representation, not just the original record.

Generated content doesn't inherit a clean legal status from the model that produced it. A model can disclose personal data through generation, infer sensitive attributes learned during training, reproduce content from source documents, or combine and surface information in ways the originals didn't. Privacy obligations follow the data and the effect on individuals — not the question of whether a model or a human produced the specific sentence. The output needs privacy analysis regardless of its authorship.

Incident responders often argue for comprehensive prompt logging on the grounds that investigation requires full context. That's sometimes true for high-risk workflows, but broad raw prompt logging also creates a concentrated store of sensitive data: PII, customer records, confidential business context, credentials inadvertently pasted, and regulated information. The right design isn't "log everything" or "log nothing." It's a tiered policy where logging scope matches risk level, sensitive fields are redacted or sampled, and access to raw logs is restricted and audited separately.

Privacy notices that mention AI or automated processing don't automatically authorize every downstream use of that data. Using customer support conversations for fine-tuning, routing prompts to third-party model providers, retaining eval examples from live interactions, or building retrieval embeddings from sensitive records may each require separate lawful basis analysis. The relationship between what users were told, what data was collected, and what the engineering systems actually do needs to be documented and kept current.

Study Checklist

I can explain why deleting a source record may not delete embeddings or vector chunks.
I can identify prompt logs, eval sets, and fine-tuning datasets as possible personal data stores.
I can describe purpose limitation for training and model improvement workflows.
I can explain data minimization at retrieval and prompt assembly time.
I can identify inference-time PII leakage through memorization, retrieval, or paraphrase.
I can describe source-to-vector lineage needed for erasure requests.
I can explain how vendor AI processing changes privacy obligations.
I can name evidence artifacts that prove AI privacy controls operated.

Remediation Starting Points

If you scored below 60% in this domain:

Pick one AI feature and trace personal data from source system to prompt, vector index, log, vendor call, output, and retention store. Mark every derived copy.
Implement a deletion propagation test in a toy RAG system. Delete a source document and verify that its chunks, embeddings, cache entries, and citations disappear.
Review prompt logs for PII and classify which fields must be redacted, tokenized, access-controlled, or retained under a shorter policy.
Run a hands-on lab: create two retrieval configurations, one broad and one minimized, then compare how much unnecessary personal data enters the prompt for the same user question.

Module 10

AI Governance, Risk, and Compliance

Policy without inventory is theater.

Most AI governance programs produce documentation before they produce inventory. The risk register gets written. The AI principles get published. The board presentation gets delivered. Meanwhile no one can answer which AI systems are in production, who owns each one, what controls run against each, and what evidence would prove those controls operated last quarter. Governance without inventory is confident language attached to an unknown system.

What This Domain Covers

AI governance, risk, and compliance cover the operating system that makes AI security accountable: inventory, control ownership, framework mapping, evidence collection, release gate integration, risk acceptance, executive reporting, and audit readiness. This domain includes NIST AI RMF, ISO 42001, EU AI Act risk tiers, OWASP LLM Top 10, MITRE ATLAS, and internal policies, but it focuses on how those frameworks become engineering controls. It sits above technical domains because governance without technical evidence cannot prove control operation.

This is a distinct domain because AI governance often produces polished documents before it produces operational control. Organizations publish AI principles, acceptable use policies, and board reports while lacking an inventory of AI systems, model providers, retrieval indexes, agent tools, data flows, and release gates. That gap creates policy theater: the organization appears mature until a customer asks for evidence, an auditor tests a control, or an incident exposes that no one owned the risk.

The practitioner's entry point is inventory and ownership. You cannot govern what you cannot enumerate, and you cannot operate a control without an accountable owner. A mature AI governance program maps each AI system to controls, evidence artifacts, collection cadence, risk owner, technical owner, and release decision points. The goal is not to memorize frameworks; it is to translate them into work that engineering teams can perform and prove.

Core Concepts

AI Inventory An AI inventory lists the systems, features, models, providers, datasets, retrieval indexes, agents, tools, and vendors that create AI risk. It should include owner, business purpose, users, data categories, model dependency, risk tier, deployment status, and evidence links. Without inventory, governance operates on assumptions. Inventory is the foundation for risk assessment, vendor review, release gates, and incident scoping.

Control Ownership Every control needs a named owner who can operate it, produce evidence, and respond when it fails. Governance controls often fail when ownership is assigned to a committee or abstract function. For example, retrieval authorization may belong to platform engineering, eval thresholds to AI engineering, customer assurance to security leadership, and evidence collection to GRC. Ownership must be specific enough to create action.

Evidence Collection Evidence is the artifact that proves a control operated. Acceptable evidence may include eval results, release gate logs, access review records, model intake approvals, telemetry dashboards, prompt injection test outcomes, vendor assessments, and incident closure records. Documentation alone is rarely sufficient. A policy says what should happen; evidence proves what did happen.

Framework Translation Frameworks provide categories and expectations, but they do not automatically define engineering work. A NIST AI RMF requirement may translate into inventory fields, threat model templates, eval gates, logging requirements, or board reporting metrics. ISO 42001 may require management-system evidence. OWASP LLM Top 10 may inform technical control tests. The practitioner's job is to translate framework language into artifacts, owners, and cadence.

Release Gates as Governance Enforcement Governance becomes real when it affects shipping decisions. If a high-risk AI system lacks eval evidence, model approval, retrieval authorization, or incident logging, the release gate should block or require explicit risk acceptance. Release gates turn governance from advisory language into operational enforcement. They also produce records that support audits and executive reporting.

The Threat Landscape

Unknown AI System Inventory Teams deploy AI features, copilots, scripts, or vendor integrations without entering them into inventory. The precondition is rapid adoption without intake or discovery process. In production, the organization cannot assess risk, enforce policy, or scope incidents. Shadow AI becomes a governance failure before it becomes a technical incident.

Control Without Owner A policy states that AI systems must be evaluated, monitored, or reviewed, but no named team owns the control. The precondition is governance designed at the policy layer without operational mapping. In production, the control is inconsistently applied or ignored. During audit, teams can show the policy but not the evidence of operation.

Evidence Substitution A team presents policies, training records, or risk register entries as proof that technical controls operated. The precondition is confusing governance documentation with control evidence. The impact is weak assurance: the organization cannot show that evals ran, retrieval authorization blocked access, or model intake approval occurred. External auditors and customers may reject the evidence.

Framework Checklist Trap The organization maps every framework item to a spreadsheet status but does not connect items to engineering artifacts. The precondition is compliance-driven implementation without security architecture involvement. The program appears comprehensive while leaving release processes unchanged. The result is a large checklist that does not reduce risk.

Executive Reporting Drift Board or leadership reports describe AI security posture using vague maturity claims and green status indicators without evidence freshness or risk exceptions. The precondition is reporting designed for reassurance instead of decision-making. The impact is misallocated investment and surprise when incidents or audits reveal control gaps. Mature reporting should show risk, progress, blockers, and evidence quality.

What Good Looks Like

A mature AI governance program starts with a living inventory. Each AI system has a risk tier, owner, purpose, model/provider dependency, data categories, user population, deployment status, and evidence links. The inventory is connected to procurement, SDLC intake, cloud discovery, vendor review, and product launch processes. New AI systems cannot quietly bypass the inventory.

Good governance maps controls to operating teams and artifacts. A control registry identifies each control, owner, evidence type, collection cadence, current status, last evidence date, exception state, and related systems. For example, model intake approval may require a provenance record, license review, hash verification, eval results, and release approval. Retrieval security may require ACL tests, query logs, metadata schema review, and deletion propagation evidence.

Mature governance is embedded in release decisions. High-risk AI systems cannot ship without required threat models, eval results, logging, rollback plans, vendor approvals, and risk acceptance where needed. Exceptions have owners, expiration dates, compensating controls, and executive visibility. Governance does not stop at saying "review required"; it defines what happens when the review fails.

Executive reporting is evidence-backed. Reports show inventory coverage, control evidence freshness, eval pass rates, open high-risk exceptions, vendor assessment completion, incident trends, and aging remediation. The board or senior leadership receives enough context to make resource and risk decisions. Green status is not allowed unless evidence exists and is current.

Assessment Focus

The assessment tests whether you can translate governance language into operational control. You need to identify when inventory is missing, when ownership is vague, when evidence is insufficient, and when framework mapping fails to become engineering work. Strong answers focus on artifacts, owners, cadence, release gates, and decision records.

The most common wrong-answer pattern is choosing policy when the scenario demands evidence. Candidates may recommend writing an AI policy when the real problem is that no one can prove evals ran or retrieval access was enforced. They may map to a framework without defining who collects what. Correct reasoning asks what artifact proves the control operated and who owns it.

Common Misconceptions

Policy writing is the easiest output of a governance program, so it often becomes the primary output. But operational governance requires more than well-crafted language. It requires an inventory of AI systems that is accurate and current, named owners with actual responsibility for each control, release gates that block deployment when evidence is absent, and evidence artifacts that prove controls ran. Without those components, policy is a description of an intended state — one the organization may or may not be operating in.

Risk registers capture risk understanding, risk decisions, and risk acceptance history — that's useful context, but it isn't proof that a control operated. Evidence comes from the system or process that performed the control: an eval gate log, a model intake approval record, a retrieval authorization test result, a vendor assessment closure. Auditors and customers asking for evidence will not be satisfied with a risk register entry that says a control exists. They need artifacts from the control itself.

Mapping your AI program to NIST AI RMF, ISO 42001, or OWASP LLM Top 10 is a useful analytical exercise that helps identify gaps and communicate intent. But a spreadsheet showing which framework requirements correspond to which internal controls is interpretation, not compliance. Compliance requires evidence that the mapped controls exist, have named owners, operate on a defined cadence, and actually affect decisions. The map shows the theory; evidence shows the practice.

Executive reports on AI security often suppress uncertainty in the name of clarity — green dashboards, maturity scores, percentage completions. But the people reading those reports are making resource allocation and risk acceptance decisions. Reports that hide where inventory is incomplete, where evidence is stale, where a control has no owner, or where an exception has been open for months deny leaders the information they need. Good reporting makes uncertainty visible in decision-useful form — not to alarm, but to enable accurate judgment.

Study Checklist

I can explain why AI inventory is the foundation of governance.
I can distinguish policy documentation from control evidence.
I can describe what a control registry should contain.
I can map a framework requirement to an engineering artifact, owner, and cadence.
I can explain how release gates enforce governance.
I can identify vague ownership in an AI governance scenario.
I can describe executive reporting metrics that reflect evidence quality.
I can explain what audit readiness requires beyond a framework spreadsheet.

Remediation Starting Points

If you scored below 60% in this domain:

Create a sample AI inventory for five systems. Include owner, purpose, model/provider, data category, risk tier, deployment status, and evidence links.
Translate three framework statements into controls. For each, define the owner, artifact, cadence, release gate, and evidence location.
Build a simple control registry with columns for control, system, owner, evidence type, last collected date, status, exception, and next review.
Run a hands-on exercise: take one AI policy statement and turn it into an engineering ticket, an eval or test, an approval record, and a dashboard metric.

Module 11

Red Teaming and Adversarial Evaluations

Structured adversarial evaluation, not vibes.

Most AI red teams produce a report. Far fewer produce a regression suite. That difference matters because a report captures what the team found once, while a regression suite turns those findings into a repeating control that can block future releases.

When severity criteria are not agreed before the exercise begins, closure decisions drift toward the convenience of the team receiving the findings. That is not adversarial evaluation; it is advisory with extra steps. Elite AI security practice treats red teaming, evals, and evidence as one control system.

What This Domain Covers

Red teaming and adversarial evaluations cover the structured methods used to test AI systems against hostile, unsafe, policy-violating, or control-bypassing behavior. The domain includes human-led red-team exercises, automated eval suites, prompt attack libraries, severity rubrics, scope definition, finding classification, closure criteria, evidence retention, and regression testing across model and system updates. It sits between technical security testing, product release engineering, and governance because the outputs become release evidence, risk evidence, and audit artifacts.

This is a distinct domain because AI testing can become theatrical very quickly. A clever jailbreak, a shocking answer, or a dramatic prompt transcript may be useful, but none of those alone create an operational control. If the exercise lacks scope, severity, evidence format, and closure criteria, the team cannot decide whether a finding blocks release, requires remediation, or belongs outside the exercise. If evals are written to pass benchmark-style examples instead of production-specific failure modes, they create confidence without coverage.

The practitioner's entry point is evaluation design. Before running prompts, define the system under test, model capabilities, deployment context, threat actors, allowed techniques, prohibited actions, severity rubric, evidence format, and closure criteria. Red teaming should answer whether controls fail under adversarial pressure. Evals should answer whether known failure classes regress across model, prompt, retrieval, tool, and policy changes.

Core Concepts

Pre-Agreed Severity Rubrics Severity must be defined before testing begins. A critical issue might require unauthorized disclosure of sensitive data, irreversible tool action, cross-tenant access, production-impacting automation, or a repeatable bypass of a release-blocking policy. A high issue might require a constrained but realistic failure path with clear user or business impact. Informational findings should capture weak behavior without overstating risk, and out-of-scope findings should be recorded without distorting the exercise.

Scope Documentation A red-team scope should describe the model versions, application surfaces, user roles, data sources, tools, deployment mode, integrations, and threat actors included in the exercise. It should also state exclusions: production data that cannot be accessed, tools that cannot be triggered, legal or safety boundaries, and time-box limits. Without this, teams argue after the fact about whether a finding "counts." Good scope protects both the red team and the assessed team because it defines the decision environment.

Prompt Attack Libraries A prompt attack library is a maintained set of adversarial scenarios, payloads, expected behaviors, severity tags, and reproduction notes. It should test direct prompt injection, indirect injection, context poisoning, policy bypass, jailbreak chaining, unsafe tool use, sensitive disclosure, and output safety failures. The library should be versioned like test code because prompts, models, policies, and application behavior change. When a red team discovers a new pattern, that pattern should become part of the library where feasible.

Automated Evals vs. Human Red Teams Automated evals are systematic, repeatable, and suitable for CI/CD or release gating. Human red teams are exploratory, judgment-driven, and better at discovering complex chains that no one anticipated. Evals give regression confidence for known scenarios; red teams discover the next scenario class. A mature program does not ask which one is better. It uses red-team findings to improve evals, and eval failures to guide future red-team focus.

Finding Classification Not every undesirable answer is a security finding. Some failures are capability limitations, product quality defects, safety policy gaps, privacy risks, or ambiguous requirements. A security finding should identify the violated security property: confidentiality, integrity, authorization, auditability, controlled action, abuse prevention, or release control. Correct classification prevents both overstatement and premature closure.

The Threat Landscape

Unscoped Red Team Exercise The team runs adversarial testing without defining deployment context, allowed techniques, severity levels, or closure criteria. The exercise produces compelling examples, but the assessed team can later argue that the scenario was unrealistic, unsupported, or outside the intended product behavior. The precondition is pressure to "do an AI red team" quickly without operational design. The production impact is weak decision-making: findings become discussion items rather than release controls.

Severity Drift After Delivery A finding looks serious during testing but gets downgraded during remediation because the team receiving the finding controls the closure narrative. This happens when severity definitions were not agreed before testing. The mechanism is usually subtle: "the user had to try hard," "the model was only following context," "that tool is not normally used," or "we would monitor that." The impact is risk acceptance without an actual accountable risk decision.

Benchmark Gaming An eval suite is optimized to pass static examples rather than challenge production behavior. The precondition is using public benchmarks, narrow golden sets, or model-quality metrics as a proxy for security. The system improves on the metric while remaining vulnerable to indirect injection, unsafe tool use, context poisoning, or retrieval leakage. The result is a dashboard that turns green without reducing production risk.

Non-Reproducible Findings A red-team result is captured as a screenshot, anecdote, or short narrative without the prompt, context, model version, retrieval sources, tool calls, configuration, and output trace. Engineering cannot reproduce the issue reliably, so remediation becomes guesswork. Governance cannot prove closure because there is no stable test case. The finding becomes a debate over credibility rather than a control improvement.

Eval Coverage Collapse The team creates automated tests for one surface, usually direct jailbreaks, and treats that as AI security evaluation. The precondition is confusing ease of automation with coverage. In production, the system can still fail through RAG poisoning, tool argument manipulation, streaming leakage, permission boundary failures, or policy bypass through workflow context. Passing the suite becomes a narrow claim that stakeholders interpret too broadly.

What Good Looks Like

A strong adversarial evaluation program starts with written scope and severity. The scope names the application, deployment context, model versions, user roles, data sources, tools, integrations, exclusions, test environment, and time box. The severity rubric defines what constitutes critical, high, medium, low, informational, and out-of-scope before a single prompt runs. Those documents are not bureaucracy; they are the contract that makes findings actionable.

The program produces durable artifacts. A complete engagement leaves behind a red-team scope document, severity rubric, prompt attack library, eval suite, reproduction traces, finding records, closure evidence, and regression tests. Each finding includes enough context to reproduce: prompt, relevant hidden or retrieved context, model version, policy configuration, tool calls, outputs, timestamps, and expected behavior. Closure requires either a passing retest, an accepted risk record, or a scoped decision that the issue does not violate the agreed criteria.

Good eval pipelines integrate with release engineering. High-severity tests block model updates, prompt changes, retrieval changes, tool permission changes, and policy configuration changes when they fail. The test suite includes production-specific scenarios rather than generic benchmark prompts alone. The pipeline stores outputs as evidence, tracks trend lines, and distinguishes flaky model behavior from meaningful control failure.

Human red teaming continues to matter even with strong automation. Human testers explore new attack chains, abuse paths, context interactions, and judgment-heavy cases that automated evals do not yet represent. The program improves when those discoveries are converted into stable tests, detection logic, review criteria, or architecture changes. The goal is not to perform one impressive exercise; the goal is to create a machine that turns adversarial discovery into continuous control.

Assessment Focus

The assessment tests whether you can distinguish structured adversarial evaluation from informal prompt hacking. It asks whether you know when missing scope, missing rubrics, weak evidence, benchmark gaming, poor finding classification, or absent regression tests undermine the value of the work. Strong reasoning ties testing to decisions: what blocks release, what requires remediation, what becomes evidence, and what becomes a repeatable test.

The most common wrong-answer pattern is to value attack creativity over evaluation quality. Candidates may choose the answer that produces more prompts when the real issue is undefined severity or non-reproducible evidence. They may treat automated evals as complete protection or treat red-team reports as one-time artifacts. Correct reasoning asks what the exercise proves, who can act on it, and whether the result will catch the same failure after the system changes.

Common Misconceptions

Red teaming is not the same as running jailbreak prompts. Jailbreak prompts can be useful, but they are only one class of technique. A red-team exercise needs scope, threat model, severity rubric, evidence format, and closure criteria. Without those, the work may be interesting, but it does not produce dependable security evidence.

Automated evals do not replace human red teams. Automated tests are strongest when they protect against known regressions and enforce release gates. Human testers are stronger at discovering novel chains, context-specific failures, and judgment-dependent abuse paths. Mature programs use the two together, not as substitutes.

A bad model answer is not automatically a security finding. Some bad answers are product quality problems, missing capability, ambiguous UX, or safety policy gaps. A security finding should identify a violated property such as confidentiality, authorization, integrity, controlled action, abuse prevention, or auditability. Classification keeps the program credible.

Passing a benchmark does not prove production safety. Benchmarks rarely include your users, data sources, tool permissions, retrieval indexes, prompt templates, or business workflows. A system can pass a public benchmark while failing against a malicious document in your knowledge base. Production evals must reflect the deployed system.

Study Checklist

I can define a severity rubric before an AI red-team exercise begins.
I can write a scope document that names deployment context, model versions, tools, users, and exclusions.
I can distinguish automated evals from human red teaming and explain why both are needed.
I can identify when a finding is a security issue rather than a capability limitation.
I can describe what belongs in a prompt attack library.
I can specify evidence required to reproduce an AI red-team finding.
I can explain how red-team findings become regression tests.
I can identify benchmark gaming in an AI evaluation program.

Remediation Starting Points

If you scored below 60% in this domain:

Write a one-page red-team scope for a RAG assistant. Include model version, user roles, data sources, tool access, test environment, exclusions, time box, allowed techniques, and evidence requirements.
Draft a severity rubric with critical, high, medium, low, informational, and out-of-scope examples for your system. Include at least one example involving data disclosure and one involving unsafe tool action.
Build a 20-case prompt attack library that covers direct injection, indirect injection, retrieval poisoning, policy bypass, unsafe output, and tool misuse. Store expected behavior and severity for each case.
Run a hands-on regression exercise: take one red-team finding, convert it into an automated test, run it against two model or prompt versions, and record the evidence needed to prove pass or fail.

Module 12

Incident Response and AI Observability

What you cannot log, you cannot investigate.

During an AI incident, the first question is usually what was in the context. The second question is what the model produced or decided. The third is what tool calls, retrieval steps, approvals, or outputs followed. If those records are missing, the response team reconstructs the incident from circumstantial evidence at every layer.

A traditional request log rarely answers those questions. AI incidents cross prompt construction, retrieval, model inference, output filtering, tool execution, and user-visible rendering. Observability is not a dashboard feature here; it is the difference between a precise containment decision and days of speculation.

What This Domain Covers

Incident response and AI observability cover the logs, traces, telemetry, playbooks, and forensic methods required to detect, investigate, contain, and learn from AI system failures. The domain includes prompt and context records, retrieval traces, model version metadata, provider details, tool-call audit logs, approval records, output before and after filtering, eval results at inference time, cache records, and incident classification. It sits across application security, MLOps, privacy, red teaming, and governance because AI incidents are impossible to investigate if those systems did not emit evidence.

This is a distinct domain because AI incidents depend on dynamic context. The same user prompt can be harmless or dangerous depending on retrieved documents, conversation history, model version, tool permissions, safety configuration, and prior agent steps. A standard AppSec incident often reconstructs requests, database writes, network calls, and identity events. An AI incident may require the original prompt, hidden instructions, retrieved chunks, prompt template version, model provider, model response, tool arguments, approval decisions, and rendered output.

The practitioner's entry point is forensic sufficiency. Ask what minimum telemetry would let a responder reconstruct a prompt injection chain, a data exfiltration path, or an unauthorized agent action. If the team cannot answer who asked, what context was supplied, what the model saw, what tools were called, what was approved, and what the user saw, the system is not incident-ready.

Core Concepts

Full-Stack AI Traceability A useful AI trace connects user session, prompt assembly, retrieval, model call, tool execution, filtering, and final output. Each layer should emit structured events with a shared correlation ID. The trace should identify user, tenant, prompt template version, model version, retrieved source IDs, tool names, tool arguments, authorization decisions, approvals, and output destination. Without this linkage, responders must manually stitch together partial logs and often fail to establish causality.

Prompt Logging Policy Raw prompts can be essential for investigation because they show the exact context that influenced the model. They can also contain secrets, personal data, customer records, confidential business information, and regulated content. A prompt logging policy defines when raw prompts are stored, what gets redacted, who can access them, how long they are retained, and when metadata-only logging is acceptable. The policy should be deliberate, not whatever the first SDK integration happened to log.

Retrieval Traceability For RAG systems, the final answer is not enough. Responders need to know which chunks were retrieved, what query produced them, what filters were applied, what scores were returned, what source documents they came from, and what authorization decision allowed them into context. Retrieval traces determine whether the incident is a generation problem, an authorization problem, a poisoning problem, or an attribution problem. Without retrieval traceability, every answer from the same corpus may become suspect.

Agent Tool-Call Forensics Agent incidents require parent-child trace linkage across tool calls. Each proposed and executed action should record tool name, arguments, target resource, caller identity, policy decision, approval decision, tool result, side effect, reversibility flag, and downstream action. This lets responders tell whether the model suggested an action, the orchestrator allowed it, a human approved it, or a tool performed it independently. Tool-call observability is the audit trail for delegated action.

Scope Determination AI incident scope depends on context state, not just code version. Responders may need to bound affected users, tenants, time windows, model versions, prompt templates, retrieval indexes, cached outputs, tool permissions, and vendor routes. A prompt injection incident may affect only sessions that retrieved a poisoned document. A model update issue may affect only requests after a provider routing change. Good observability lets scope become a query instead of a meeting.

The Threat Landscape

Missing Context Records A suspicious output appears, but the system logs only the final answer and user ID. The precondition is observability designed for product analytics rather than forensic reconstruction. Responders cannot determine what system prompt, retrieved chunks, conversation history, or tool outputs influenced the model. The production impact is slow containment because the team cannot identify whether the issue is isolated or systemic.

Raw Prompt Log Spill The team logs every prompt and response in full without redaction, retention limits, or access controls. The precondition is debugging-driven development that treats observability as harmless. Over time, the logs accumulate PII, credentials, customer records, internal strategy, source code, and privileged context. The security tool built for investigation becomes a sensitive data store with its own breach risk.

Unattributed Agent Action An agent modifies a record, sends a message, or triggers a workflow, but logs do not connect the action to the initiating user, prompt, model call, tool arguments, policy decision, and approval. The precondition is tool telemetry recorded separately from model telemetry. The team cannot determine whether the action was requested by the user, inferred by the model, injected through context, or approved by a human. Accountability collapses.

Model Version Blindness An incident occurs after a model or provider routing change, but logs do not record the model version, provider, deployment route, or prompt template version. The precondition is treating model selection as implementation detail rather than forensic metadata. Responders cannot compare behavior before and after the change or identify a safe rollback point. Scope expands because every recent output may need review.

Streaming Visibility Gap A streaming response exposes sensitive text before the final output is filtered or logged. The precondition is token streaming without buffering, partial-output capture, or pre-stream validation for high-risk contexts. Logs show a blocked final answer, but the user saw the leak. The incident record understates impact because the most important content never reached the final log.

What Good Looks Like

A strong AI observability design emits correlated traces across the whole stack. A responder can open one trace and see request identity, user role, tenant, prompt template version, user input, retrieved source IDs, model provider and version, policy configuration, tool calls, approvals, filters, output, and downstream effects. These traces are structured, searchable, access-controlled, and retained according to risk. They support debugging, incident response, governance evidence, and red-team closure.

Prompt and context logging is risk-based. High-risk workflows may require raw prompt capture under strict access control, while lower-risk workflows may log structured metadata, hashes, redacted fields, or sampled records. Sensitive prompt logs have owners, retention schedules, break-glass access, and privacy review. A useful artifact here is a prompt logging policy that names which systems log raw context, why they do so, and how access is governed.

RAG and agent systems produce specialized evidence. A RAG trace includes retrieval query, filters, source IDs, chunk IDs, classification labels, authorization decision, and citation output. An agent trace includes parent-child spans, proposed tool calls, executed tool calls, arguments, approvers, results, reversibility flags, and rollback actions. Other evidence outputs include model version maps, cache key records, eval-at-inference results, incident timelines, detection rules, and post-incident regression tests.

Incident response playbooks include AI-specific containment paths. A team knows when to disable a tool, revoke a connector, remove a poisoned document, rebuild an index, roll back a model, suspend a prompt template, disable streaming, or quarantine cached outputs. The playbook maps each containment action to an owner and evidence source. Post-incident review updates evals, release gates, logging, and architecture controls rather than stopping at root cause language.

Assessment Focus

The assessment tests whether you can identify the telemetry required to investigate AI-specific failures. It asks whether you can distinguish final-output logging from forensic context reconstruction, whether you understand retrieval and tool-call traces, and whether you can balance raw prompt logging against privacy risk. Strong reasoning defines the minimum evidence required to answer what happened, who was affected, and which control failed.

The most common wrong-answer pattern is choosing either "log everything" or "avoid prompt logs entirely." Both reveal a brittle mental model. Logging everything creates sensitive data risk and can violate privacy expectations; logging nothing makes investigation impossible. Correct reasoning designs tiered observability: enough context to reconstruct incidents, with redaction, retention, access control, and escalation for sensitive records.

Common Misconceptions

Final response logs are not enough for AI incident response. The final answer rarely explains why the model produced it. You need the context that entered the prompt, the retrieval records that supplied it, the model version that generated it, and the tool calls that followed. Without those records, root cause analysis becomes interpretation rather than investigation.

Raw prompt logging is not automatically good security. It can be necessary for high-risk workflows, but it also concentrates sensitive data in observability systems. The right framing is not whether prompts should always or never be logged. The right framing is which workflows require raw context, how sensitive data is reduced, who can access it, and how long it persists.

Provider logs cannot replace application observability. The provider may know model request metadata, but it does not know your authorization decisions, tenant boundaries, retrieval corpus, tool approvals, product UI, or downstream state changes. Your application owns the security story around the model. You need your own traces.

AI incident scope is not always tied to a code deployment. Scope may depend on which users retrieved a document, which sessions used a prompt template, which cached response was served, which model route was selected, or which agent tool was enabled. Treating scope as "everything since the last deploy" is often either too broad or too narrow. Context-aware scope requires context-aware logs.

Study Checklist

I can define the minimum telemetry required to reconstruct a prompt injection chain.
I can explain why retrieval traces are required for RAG incident response.
I can design a prompt logging policy that balances forensic value and privacy risk.
I can identify model version and prompt template version as forensic metadata.
I can specify tool-call audit fields for agent incident reconstruction.
I can explain why streaming output creates partial-exposure logging problems.
I can scope an AI incident by user, tenant, model version, prompt template, corpus, and tool state.
I can convert an AI incident finding into a regression test or release gate update.

Remediation Starting Points

If you scored below 60% in this domain:

Write a trace schema for one AI system. Include user, tenant, prompt template version, prompt or prompt hash, retrieved source IDs, model version, tool calls, approvals, output filters, final output, and correlation ID.
Review one AI application and mark which of those trace fields are missing today. Identify the top five missing fields that would block incident scoping.
Draft a prompt logging policy for three risk tiers: metadata-only, redacted prompt logging, and raw prompt logging under restricted access.
Run a hands-on exercise: simulate a poisoned-document RAG incident and try to reconstruct affected sessions first with final-output logs only, then with retrieval traces included.

Module 13

Vendor Risk and AI Procurement

Approved SaaS does not mean approved AI.

Most AI procurement reviews still use questionnaires written before the vendor had an AI product. The security team asks about encryption, access control, SOC 2, and incident response. The vendor answers yes to everything, and no one asks which model the assistant uses, what data trains it, who owns generated outputs, or how much notice customers get before the model changes.

That gap matters because AI features change the risk of an otherwise familiar SaaS product. A tool that was acceptable as a ticketing system may become unacceptable as an AI assistant that summarizes customer data through an unknown model provider. AI procurement starts where traditional vendor review stops.

What This Domain Covers

Vendor risk and AI procurement cover the third-party risks introduced when vendors embed, resell, route to, train, fine-tune, or operate AI systems on behalf of customers. The domain includes model bills of materials, training-data use, customer data handling, output rights, AI-generated decision logs, model change notice, audit rights, vendor eval evidence, shadow AI, and AI-specific procurement checklists. It sits between vendor risk management, privacy, legal, security architecture, and governance.

This is a distinct domain because traditional vendor security reviews often evaluate infrastructure and corporate controls while missing model-specific risk. A vendor can have good encryption, strong SSO, clean penetration test summaries, and a SOC 2 report while still sending customer prompts to a third-party model, retaining data for improvement, changing model versions without notice, or generating outputs whose ownership terms are unclear. AI features alter data flow, decision authority, and evidence requirements.

The practitioner's entry point is the AI-specific data and model path. For each vendor, ask what AI feature exists, what model it uses, where the model runs, what data reaches it, whether customer data trains or improves future models, what logs are retained, what outputs mean contractually, and what notice is given before material changes. If the vendor cannot answer those questions with evidence, "we take security seriously" is not sufficient.

Core Concepts

Embedded AI Governance Gap A vendor may add AI features to a product that was previously approved through standard security review. That approval does not automatically cover new model providers, prompt processing, data retention, generated outputs, or automated decisions. Embedded AI can create new data flows inside an already-approved vendor relationship. Procurement needs a trigger that reopens review when AI capability changes.

Model Bill of Materials A model BOM identifies the models, providers, versions, fine-tunes, safety systems, and model-dependent components behind the vendor's AI feature. It should describe whether the vendor uses a public model API, private deployment, open model, internal fine-tune, or routing layer. It should also identify training or tuning data categories and update practices. The model BOM is the procurement equivalent of asking what software components and subprocessors exist.

Customer Data Use for Training Vendors may use prompts, uploaded files, conversation history, corrections, feedback, or derived data for model improvement. The relevant questions are whether customer data is used, whether opt-out exists, whether data is isolated by tenant, whether human reviewers see it, and what evidence proves the configuration. Contract language should match actual technical behavior. A privacy promise without an operational control is weak.

Output Rights and Obligations AI-generated outputs can raise ownership, sublicensing, confidentiality, watermarking, attribution, and disclosure questions. A vendor may claim broad rights to improve services or restrict customer rights in generated content. Output rights matter when generated material enters product documentation, code, customer communications, hiring decisions, or regulated workflows. Procurement should treat outputs as a contract surface, not just a feature result.

Model Change Notice AI vendors may change model versions, providers, safety settings, retrieval behavior, or routing policies without the kind of notice customers expect for infrastructure changes. In an integrated workflow, those changes can alter behavior, risk, and compliance posture. Model change notice SLAs define when customers must be informed, what details are provided, and whether customers can defer or test changes. Without notice, the customer's release and governance evidence can become stale overnight.

The Threat Landscape

Questionnaire Blind Spot The vendor security questionnaire asks about standard SaaS controls but does not ask about model providers, training data, prompt retention, output rights, or model updates. The precondition is reusing legacy procurement forms for AI-enabled products. The production impact is approval of a vendor whose AI data flow, retention policy, or model dependency would have changed the risk decision. The organization discovers the issue only during an incident, customer review, or legal negotiation.

Silent Customer Data Training A vendor uses customer interactions, prompts, uploads, or feedback to improve models or AI features without clear opt-out or tenant isolation evidence. The precondition is broad service-improvement language paired with weak technical disclosure. The impact can include confidentiality issues, privacy violations, contractual disputes, and customer trust damage. The difficult part is proving what data was or was not used after the fact.

Unannounced Model Change A vendor changes the production model or routing layer behind an integrated AI feature. The precondition is no contractual notice requirement or customer testing window. Outputs, refusal behavior, data handling, or decision quality shift without the customer's release process re-running. The customer's controls, evals, and documentation no longer describe the system actually in use.

Unavailable AI Audit Logs An AI-generated decision or output causes harm, but the vendor cannot provide prompt, context, model version, retrieval source, or reviewer logs. The precondition is vendor logging designed for operations rather than customer auditability. The customer cannot investigate, satisfy internal assurance, or prove what happened to affected users. This is especially serious when the vendor's AI affects employment, finance, healthcare, security, or customer-facing decisions.

Shadow AI Data Flow Employees use unapproved AI services because approved tools are slower, weaker, or more restrictive. The precondition is unmet business demand plus weak discovery, policy, and enablement. Confidential data, customer records, code, credentials, or regulated content enter services outside procurement. The organization now has vendor risk without a vendor relationship.

What Good Looks Like

A strong procurement process reopens review when vendors add or materially change AI functionality. The intake asks whether the product uses generative AI, predictive models, embeddings, automated decisioning, retrieval, human review, model providers, or customer-data training. It records the AI feature owner, business purpose, data categories, risk tier, and review outcome. AI capability is treated as a material product characteristic.

The vendor review produces concrete artifacts. A complete AI vendor file includes an AI procurement checklist, model BOM, data-flow diagram, subprocessors list, customer-data training statement, opt-out evidence, retention policy, output-rights review, model change notice terms, audit log availability statement, and security evaluation summary. For high-risk workflows, it also includes customer testing rights, incident notification language, and rollback or disablement options.

Good procurement decisions distinguish vendor maturity from marketing. A mature vendor can explain model versions, provider routes, data retention, tenant isolation, evals, logging, human review, change management, and customer controls with evidence. A weaker vendor offers general assurances, refuses model details, hides behind proprietary language, or cannot identify how customer data moves through the AI feature. The goal is not to punish vendors for using AI; it is to know what risk the organization is accepting.

Shadow AI is managed through enablement as well as enforcement. The organization provides approved AI tools, clear data-handling rules, discovery controls, browser or network visibility where appropriate, and practical escalation paths for teams that need new capabilities. Blocking everything rarely works. A good program gives employees a safe path that is easier than the unsafe one.

Assessment Focus

The assessment tests whether you can identify AI-specific vendor risk that legacy procurement misses. It asks whether you understand model BOMs, customer-data training, output rights, audit logs, model change notice, shadow AI, and the difference between infrastructure security and AI feature governance. Strong reasoning follows the data and model path through the vendor's system.

The most common wrong-answer pattern is accepting standard SaaS approval as sufficient. Candidates may choose SOC 2, encryption, or SSO evidence when the scenario asks about training data, model updates, or output ownership. That reveals a mental model where AI is just another feature inside an approved vendor. Correct reasoning treats AI capability as a material change that requires AI-specific review and contractual control.

Common Misconceptions

An approved vendor is not automatically approved for AI use. The original approval may have covered the product before it routed prompts to a model provider, retained conversation data, or generated business-critical outputs. AI changes data flow and decision behavior. A material AI feature should trigger review even if the vendor was already on the approved list.

A SOC 2 report does not answer the model-risk questions. It may provide useful evidence about security controls, but it usually does not specify model versions, training data use, customer opt-out, output rights, prompt retention, or model change notice. You still need AI-specific diligence. Standard controls and AI controls complement each other.

"Customer data is not used to train foundation models" is not the whole answer. The vendor may still use data for fine-tuning, evals, prompt review, product analytics, abuse monitoring, support, or retrieval indexes. The question is not only foundation model training. It is every secondary use of customer data in the AI workflow.

Shadow AI is not only a policy violation. It is often a signal that approved tools do not meet user needs. Employees route data to unapproved AI services when the official path is too slow, too limited, or unclear. Reducing shadow AI requires discovery and enforcement, but also usable approved alternatives.

Study Checklist

I can explain why existing SaaS approval does not automatically approve embedded AI features.
I can define the contents of a model bill of materials for vendor review.
I can identify customer-data training and improvement clauses that need technical evidence.
I can describe output-rights questions for AI-generated content.
I can explain why model change notice matters for integrated workflows.
I can specify AI audit logs a vendor should retain and provide after an incident.
I can design procurement questions that cover AI-specific data flows.
I can identify shadow AI patterns and practical controls to reduce them.

Remediation Starting Points

If you scored below 60% in this domain:

Take an existing vendor security questionnaire and add an AI section covering model BOM, data use, prompt retention, output rights, model updates, audit logs, and customer controls.
Pick one AI-enabled SaaS product your organization uses and write a data-flow summary from user input to model provider, storage, logs, outputs, and retention.
Draft a model change notice clause requiring advance notice, version details, customer testing time, and rollback or disablement options for high-risk integrations.
Run a hands-on exercise: review the terms and AI documentation for two AI SaaS vendors and compare what each discloses about customer-data training, model providers, logging, and output ownership.

Module 14

Secure AI Architecture Design

Design decisions that outlive any one model.

Architecture decisions made under deadline pressure tend to concentrate trust in the wrong place. The common pattern is a system that trusts the model to enforce permissions, trusts retrieved content not to contain hostile instructions, and trusts the system prompt to survive adversarial context. Each assumption can fail independently, and when they fail together the whole design collapses.

Secure AI architecture is where those bets get placed or avoided. The question is not whether a particular model behaves well in a demo. The question is whether the system preserves its security properties when the model is wrong, the context is hostile, retrieval is imperfect, a tool output is misleading, or a fallback path activates.

What This Domain Covers

Secure AI architecture design covers the system-level decisions that determine whether AI controls remain independent, enforceable, and observable. It includes context trust tiers, fallback routing, zero-trust limits, post-retrieval filter weaknesses, defense-in-depth for non-deterministic systems, agent blast radius, multi-model trust chains, and design-time permission boundaries. It sits above individual domains because it decides how prompt security, RAG security, agent security, MLOps, privacy, and observability fit together.

This is a distinct domain because many AI failures are not local bugs. They are design-level trust failures. A team may patch a prompt, add a filter, or tighten one tool, but the architecture still allows untrusted retrieved content to influence high-privilege action. Another team may add output filtering while leaving retrieval authorization too late. Secure architecture asks whether the layers are independent enough that one failure does not cascade into authorization failure, data leakage, unsafe action, and missing evidence.

The practitioner's entry point is trust placement. Identify where the system makes trust decisions, what information those decisions rely on, and whether the deciding component is allowed to be wrong. Any design that asks the model to be the sole enforcer of authorization, data classification, tool safety, or policy compliance is fragile. Durable architecture places enforceable boundaries outside the model and treats model output as one input to controlled workflows.

Core Concepts

Context Trust Tiers Every segment entering the model's context should have an explicit trust level and allowed influence. System instructions have one role, developer instructions another, user input another, retrieved documents another, tool outputs another, and conversation history another. A retrieved document should provide evidence, not override policy. A tool output should report state, not silently grant authority. Context trust tiers make these assumptions explicit before prompt injection turns them into failure modes.

Fallback Routing and Security Properties AI systems degrade, fail over, switch providers, use cached answers, disable tools, or fall back to simpler workflows. The fallback path must preserve the security properties of the primary path: authorization, logging, rate limits, data classification, output controls, and approval requirements. A fallback route that bypasses retrieval filters or tool policy is not resilience; it is an alternate vulnerability path. Architecture review should treat fallback as part of the system, not an exception.

Zero-Trust Limits for AI Zero-trust principles require explicit authorization and continuous verification, but AI systems complicate the model when a generated output influences an access decision. If the model decides whether a user should see data, take action, or pass a policy check, the decision is non-deterministic and context-sensitive. Zero trust still applies, but the enforceable access decision should remain outside the model. The model can recommend; policy should decide.

Post-Retrieval Filter Weaknesses Filters applied after retrieval cannot fully repair a bad retrieval authorization decision. Once high-privilege content enters context, the model may summarize, infer, combine, or leak it indirectly. Output filters can reduce visible leakage, but they do not erase the fact that the model processed unauthorized data. Secure architecture enforces data-plane boundaries before retrieval results reach context.

Independent Defense Layers Defense-in-depth for AI requires layers that do not all fail for the same reason. A system prompt, output filter, and model self-critique may all be influenced by the same compromised context. Independent layers include retrieval authorization, runtime tool policy, schema validation, approval gates, sandboxing, logging, and release gates. The architecture should assume the model can be wrong and still prevent unacceptable outcomes.

The Threat Landscape

Model-Enforced Authorization The system retrieves broad data and asks the model not to reveal anything the user should not see. The precondition is using the model as a policy interpreter rather than enforcing access before context assembly. The production impact is unauthorized processing and possible disclosure through summaries, inference, or partial output. The failure is architectural because the sensitive data already crossed the boundary.

Fallback Bypass The primary AI path enforces logging, retrieval filters, and tool approvals, but the fallback path uses a simpler provider, cached answer path, or emergency workflow that skips those controls. The precondition is reliability engineering that did not include security invariants. During outage or degradation, the system becomes less secure exactly when operators are under stress. Attackers often exploit alternate paths because they receive less scrutiny.

Internal Content Overtrust Retrieved content from an internal wiki, ticket, email, or document repository enters context as if it were trusted instruction. The precondition is assuming internal source equals safe source. A malicious or compromised internal document can steer outputs or tool calls. The impact is especially serious when internal content has low write barriers but high influence inside the AI workflow.

Agent Blast Radius Designed Too Late A tool is integrated with broad permissions, and after an incident the team tries to patch prompts or add approval language. The precondition is tool scope designed around implementation convenience rather than damage limits. Runtime patching cannot undo the fact that a compromised tool call could already reach broad resources. Blast radius must be designed at the permission and architecture layer.

Undefined Multi-Model Trust Chain Model A summarizes context, model B makes a decision from the summary, and model C drafts an action, but the system never defines how trust transfers across those outputs. The precondition is composing models for capability without defining authority, logging, or provenance. Errors, injections, or omissions propagate across the chain while each model treats the previous output as a neutral input. Investigation becomes difficult because no single model has the complete context or accountability.

What Good Looks Like

A secure AI architecture has a written trust model. It identifies context tiers, data classifications, authorization points, tool permission boundaries, model-provider boundaries, fallback paths, logging requirements, and approval points. The architecture shows which components enforce policy outside the model and which components merely provide recommendations or generated text. A reader should be able to tell where confidentiality, integrity, auditability, and controlled action are preserved.

Good designs enforce data boundaries before model context. Retrieval-time authorization, tenant isolation, metadata integrity, source classification, and deletion propagation happen upstream of generation. Output filters still exist, but they are not the primary confidentiality control. Evidence artifacts include retrieval policy diagrams, chunk metadata schemas, authorization test results, and trace records showing which sources entered context.

Agent architectures define blast radius at design time. Tool permissions are scoped by resource, tenant, action type, time window, and reversibility. Destructive or external actions require approval with context. Sandboxes limit filesystem, network, credential, and environment access. Tool-call audit logs, rollback plans, and policy decisions are part of the architecture, not later hardening tasks.

Strong architecture review also covers fallback and model composition. Fallback routes preserve logging, authorization, rate limits, and approval requirements. Multi-model workflows define whether outputs are evidence, recommendations, summaries, or decisions; they also define which logs capture each step. Concrete artifacts include architecture decision records, data-flow diagrams, trust-tier tables, fallback control matrices, tool permission worksheets, model-chain traces, and release gate criteria. These artifacts make architecture review repeatable instead of personality-driven.

Assessment Focus

The assessment tests whether you can reason at the design level rather than patching symptoms. It asks whether you understand context trust tiers, pre-retrieval authorization, independent defense layers, fallback security, zero-trust limits, agent blast radius, and multi-model trust chains. Strong answers preserve security properties even when model behavior is unreliable.

The most common wrong-answer pattern is local control thinking. Candidates recommend a better prompt, a stronger model, or an output filter when the scenario describes an architecture that places trust in the wrong component. That reveals a model-centric mental model. Correct reasoning moves enforcement to deterministic layers, defines trust semantics, and prevents one compromised context segment from cascading through the system.

Common Misconceptions

A system prompt is not an architecture boundary. It can express intent and influence behavior, but it does not enforce authorization, isolate data, or guarantee tool safety. If the design relies on the prompt to preserve security properties under adversarial context, the design is fragile. Architecture boundaries should be enforced by policy, runtime controls, and data access layers.

Internal knowledge bases are not automatically trusted context. Internal repositories often include user-generated content, stale pages, imported vendor text, pasted emails, and low-review documentation. Treating internal content as instruction creates a prompt injection path with an internal badge. The better framing is source-specific trust, not internal-versus-external trust.

Output filtering cannot compensate for bad retrieval design. It may catch obvious sensitive strings or unsafe responses, but it cannot fully remove the effect of unauthorized context already processed by the model. The model can paraphrase, infer, or combine information. Confidentiality belongs primarily at retrieval and data access boundaries.

Defense-in-depth does not mean stacking model-based checks. If the same compromised context influences the model, the critic model, and the output filter, the layers are not independent. Effective defense-in-depth uses controls with different failure modes: authorization, sandboxing, schema validation, approval gates, logging, and release blockers. Independence matters more than count.

Study Checklist

I can define context trust tiers for system instructions, developer instructions, user input, retrieved content, tool output, and conversation history.
I can identify when a design uses the model as the sole authorization control.
I can explain why retrieval-time authorization must precede output filtering.
I can assess whether fallback paths preserve authorization, logging, rate limits, and approval requirements.
I can describe where zero-trust principles apply and where model-mediated decisions break the trust model.
I can design independent defense layers for an AI workflow.
I can explain agent blast radius as an architecture problem rather than a prompt problem.
I can identify undefined trust transfer in a multi-model workflow.

Remediation Starting Points

If you scored below 60% in this domain:

Draw a trust-boundary diagram for an AI system. Label context tiers, authorization points, model calls, retrieval sources, tool permissions, output controls, fallback paths, and logs.
Review one RAG design and identify whether confidentiality is enforced before retrieval, after generation, or both. Rewrite the design so unauthorized chunks cannot enter context.
Build a tool permission worksheet for one agent workflow. Define resource scope, action scope, approval requirement, reversibility, sandbox limits, and audit fields before implementation.
Run a hands-on architecture exercise: create a failure scenario where the primary model route is down and the system falls back to another path. Verify whether the fallback preserves the same authorization, logging, rate limits, and output controls.

Module 15

Hardware, Host, and Cluster Security for AI

The model runs somewhere. That somewhere has a security boundary.

AI security conversations often stop at prompts, retrieval, and model behavior. Production systems do not. They run on hosts, GPUs, containers, clusters, notebooks, inference endpoints, queues, caches, object stores, and cloud AI services. If those layers are weak, the model boundary is not the only boundary that matters.

This domain covers the infrastructure layer beneath AI applications and model-serving systems. It is not a replacement for cloud security or platform security. It is the AI-specific extension of those disciplines: the places where model artifacts, GPU workloads, inference secrets, training jobs, cluster access, and serving infrastructure create risk that ordinary web application review can miss.

What This Domain Covers

Hardware, host, and cluster security for AI covers the runtime and infrastructure surfaces that support AI systems: GPU workloads, model-serving hosts, container boundaries, inference endpoints, training jobs, notebooks, model caches, artifact stores, cluster access, secrets in inference environments, side-channel awareness, and cloud AI service boundaries.

This is a distinct domain because AI workloads concentrate valuable artifacts and credentials in places that are often optimized for experimentation and throughput. A notebook may hold provider keys, training data samples, and model registry credentials. A model-serving host may have access to production data, embeddings, model weights, and observability logs. A GPU cluster may mix workloads with different trust levels. Those are security decisions, not just performance decisions.

The practitioner's entry point is runtime boundary mapping. Identify where the model runs, what data and credentials are available there, what isolation exists between workloads, what can reach the host, what logs exist, and how patching, image promotion, and emergency disablement work. If the team cannot answer those questions, it cannot prove the runtime is governed.

Core Concepts

Model-Serving Host Boundary Inference hosts are not passive compute. They may hold model artifacts, prompts, outputs, embeddings, cache entries, telemetry, provider credentials, and customer data. Review them as sensitive production systems with explicit network, identity, filesystem, logging, and patch controls.

GPU and Workload Isolation GPU capacity is often shared across teams, tenants, jobs, or environments. The security question is whether workloads with different trust levels can observe, influence, starve, or escape each other. Isolation design should cover scheduling, namespaces, device access, container policy, resource quotas, and workload identity.

Secrets in Inference Environments AI services frequently need provider keys, vector-store credentials, artifact registry tokens, telemetry keys, and tool credentials. Those secrets should not be baked into images, notebooks, prompt templates, cached responses, or client-visible configuration. Runtime identity and short-lived credentials are usually safer than static keys.

TEE and Confidential Computing Options Trusted execution environments and confidential computing can help protect selected workloads, but they are not magic compliance switches. They should be evaluated for the specific threat model: provider visibility, memory protection, attestation, key release, performance cost, operational complexity, and whether the protected boundary covers the data actually at risk.

Patch and Dependency Lifecycle AI infrastructure often depends on fast-moving runtimes, drivers, model servers, Python packages, container images, CUDA libraries, and orchestration tooling. Patch governance must include the serving path and experimentation path. A secure model artifact does not compensate for an unpatched host or exposed notebook service.

The Threat Landscape

Inference Host Credential Exposure A model-serving container includes provider keys, vector-store credentials, or tool tokens in environment variables that are readable by broad processes or exposed through logs. The precondition is convenience-driven deployment without secret scoping. The impact can include unauthorized model calls, data access, tool abuse, or lateral movement.

Cluster Access Drift Engineers, notebooks, jobs, and service accounts accumulate broad access to training and serving clusters. The precondition is rapid experimentation without periodic access review. A compromised account can reach datasets, model artifacts, logs, or production-adjacent workloads that were never intended to share a boundary.

Weak Container or Workload Isolation Untrusted model code, custom loaders, or experimentation jobs run with excessive host access. The precondition is treating AI experimentation as low-risk because it is "just research." The impact can include host compromise, artifact tampering, credential theft, or unauthorized access to adjacent workloads.

Serving Endpoint Abuse An inference endpoint lacks rate limits, input controls, abuse monitoring, or tenant-aware quotas. Attackers or broken clients can drive cost spikes, degrade service, extract behavior, or trigger unsafe downstream workflows. This is both an availability and control-boundary problem.

Unclear Cloud AI Service Boundary A team adopts a managed AI service without documenting where data is processed, what logs are retained, which identity controls apply, or how customer data is separated. The precondition is assuming managed means fully governed. The organization cannot answer assurance or incident questions later.

What Good Looks Like

A mature team maintains a model-serving environment review for each high-risk AI system. The review names the hosts, containers, images, models, credentials, data categories, network paths, logging policy, patch cadence, and emergency disablement path. It also identifies which controls are inherited from cloud or platform teams and which are AI-specific.

Cluster access is role-based, reviewed, and tied to workload purpose. Training, experimentation, staging, and production serving are separated by environment, identity, data access, and release controls. Model artifacts move through controlled registries, not ad hoc file shares. Notebooks have bounded credentials, secret scanning, egress controls, and data-handling rules.

Workload isolation is explicit. Containers run with least privilege. Host mounts, device access, network egress, and service-account permissions are constrained. High-risk model loading, custom code, or untrusted artifacts run in sandboxes or non-production environments until reviewed. The team can prove what ran, where it ran, which artifact version was loaded, and which credentials were available.

Evidence artifacts include Hardware Isolation Review, GPU and Host Isolation Checklist, Model Serving Environment Review, Cluster Access Review, Inference Secrets Review, patch records, workload identity maps, and incident reconstruction logs. The goal is not to make every AI workload exotic. The goal is to stop infrastructure assumptions from silently becoming AI security failures.

Assessment Focus

The assessment tests whether you can connect AI risk to runtime and infrastructure controls. Strong answers identify where model artifacts, data, credentials, hosts, clusters, and managed services create boundaries that need owners and evidence. Weak answers stay at the prompt or model layer when the scenario is really about host access, workload isolation, or serving infrastructure.

The most common wrong-answer pattern is assuming that managed AI infrastructure or containerization automatically resolves the risk. Correct reasoning asks what boundary is enforced, what evidence exists, and what happens when a workload, artifact, notebook, or endpoint is compromised.

Common Misconceptions

GPU isolation is not only a performance concern. Scheduling and resource allocation affect confidentiality, integrity, availability, and blast radius when workloads with different trust levels share infrastructure.

Managed AI service boundaries still need review. A cloud provider may operate the platform, but the customer still owns identity, data classification, logging decisions, endpoint exposure, and evidence collection.

Notebook environments are not harmless because they are "internal." They often combine credentials, data samples, code execution, package installation, and weak review. Treat production-adjacent notebooks as a real security surface.

Confidential computing is not a substitute for architecture. It may reduce specific exposure paths, but it does not fix weak authorization, bad secrets handling, unsafe loaders, or missing audit logs.

Study Checklist

I can identify the hosts, containers, clusters, managed services, and endpoints that run an AI system.
I can explain what credentials and data are available inside an inference environment.
I can assess whether GPU and workload isolation match the sensitivity of the workloads.
I can review notebook, training, and model-serving access separately.
I can identify when model loading or custom code should be sandboxed.
I can describe what evidence proves a model-serving environment is patched, isolated, and observable.
I can evaluate when TEE or confidential computing is relevant to a specific threat model.
I can define an emergency disablement or rollback path for a high-risk serving endpoint.

Remediation Starting Points

If you scored below 60% in this domain:

Pick one AI system and draw the runtime map: model server, host, container, cluster, artifact store, vector store, secrets, logs, and network paths.
Review one inference environment for static secrets, broad service accounts, unnecessary egress, and weak logging.
Build a GPU and host isolation checklist for training, staging, and production workloads.
Write a model-serving environment review that names owner, data categories, model artifacts, patch cadence, access controls, and emergency disablement.

Module 16

Program Design, Hiring, and Operating Model

A capability is not a job title.

AI security programs fail when the organization treats the discipline as a person, a policy, or a tool purchase. The work crosses product security, AppSec, ML platform, governance, privacy, procurement, incident response, and architecture. If those interfaces are not designed, the program becomes a queue of opinions with no durable controls.

This domain covers the operating model behind AI security engineering. It asks how work enters the program, who owns which decisions, which artifacts prove controls operated, how hiring maps to real capability, and how executive AI risk becomes engineering work. It is the practical bridge between individual skill and organizational execution.

What This Domain Covers

Program design, hiring, and operating model covers AI security ownership, role archetypes, intake, release gates, evidence cadence, staffing sequences, escalation paths, operating reviews, metrics, and first-90-day execution. It connects the technical domains to repeatable work: inventories, threat models, eval gates, retrieval tests, agent worksheets, model intake, vendor review, incident playbooks, and board-to-backlog traceability.

This is a distinct domain because AI security does not fit cleanly inside one existing function. Product teams own features, AppSec owns secure development, ML teams own model pipelines, GRC owns frameworks, privacy owns data obligations, procurement owns vendor flow, and security leadership owns risk. The operating model decides how those teams make decisions together without pretending one hire can own all depth.

The practitioner's entry point is work design. Start with the decisions the organization must make: what can launch, what needs review, what evidence must exist, who accepts risk, and what happens during an incident. Then map roles, controls, artifacts, and cadence around those decisions.

Core Concepts

Role Architecture AI security hiring should use the nine canonical archetypes as a planning model: AI Security Architect, AI Product Security Engineer, AI AppSec Engineer, RAG Security Engineer, Agent Security Engineer, AI Red Team Engineer, ML Security Engineer, Model Risk Security Partner, and Governance Evidence Lead. The point is not to hire nine people immediately. The point is to stop writing one job that silently requires nine jobs.

Operating Evidence An operating model is real only when it produces evidence on a cadence. Evidence includes AI system inventories, model intake records, retrieval authorization tests, agent blast-radius worksheets, eval gate logs, vendor reviews, incident reconstruction records, and governance evidence maps. Without artifacts, the program is mostly language.

Intake and Release Gates AI security needs a path into product and platform decisions. Intake captures new AI features, model changes, RAG source changes, agent tool additions, vendor AI use, and high-risk workflow changes. Release gates define which changes require security review, eval evidence, risk acceptance, or launch blocking.

Staffing Sequence The first hire should match the organization's highest operational pressure. Product-heavy teams may need AI Product Security or AI AppSec first. RAG-heavy teams need retrieval ownership. Agent-heavy teams need tool and action security. Regulated teams may need Governance Evidence or Model Risk Security Partner early. The wrong first hire can create a year of avoidable friction.

Boardroom-to-Backlog Traceability Executive concern about AI risk must become named engineering work. A board statement about customer data exposure should trace to retrieval authorization, data classification, logging, evals, owners, and remediation tickets. If the trace stops at a policy paragraph, the operating model has not reached the backlog.

The Threat Landscape

One-Person Program Fantasy Leadership hires one AI security person and assumes the program now exists. The precondition is treating AI security as a category rather than a team-shaped capability. The impact is a bottlenecked function that cannot operate release gates, architecture review, evals, governance evidence, and incident response at the same time.

Governance Without Engineering Intake The organization writes AI policies but has no mechanism to identify AI features, route them to review, block risky releases, or collect evidence. The precondition is policy-first governance disconnected from product workflows. The result is executive confidence without operating control.

Unowned Control Surfaces RAG authorization, agent tool permissions, model intake, eval gates, and vendor AI review each assume someone else owns the details. The precondition is unclear interfaces between AppSec, ML platform, product, GRC, and procurement. Risks persist because every team can plausibly claim partial ownership.

Hiring By Keyword Density The job description lists every AI security term without naming artifacts, systems, owners, or first-90-day outputs. The precondition is anxiety-driven hiring. The result is poor screening, mismatched candidates, and disappointment when the hire cannot be expert across every domain.

Metrics Without Decisions The program reports counts of reviews, policies, trainings, or tools but cannot show which releases were blocked, which risks were accepted, which controls operated, or which incidents changed the system. Metrics that do not influence decisions become reporting theater.

What Good Looks Like

A strong operating model starts with an AI system inventory and a routing model. The inventory identifies systems, owners, models, data categories, vendors, retrieval sources, tools, risk tier, and launch state. The routing model defines which changes need security review, governance evidence, privacy review, procurement review, or executive risk acceptance.

Ownership is explicit. Each major control surface has a primary owner and named interfaces: RAG authorization, agent permissions, model intake, eval gates, vendor review, observability, incident response, and governance evidence. The operating model does not require every owner to sit in one team, but it does require that no critical control surface is orphaned.

Hiring aligns to role architecture. Job descriptions name the primary archetype, adjacent coverage, artifacts produced, interfaces, non-responsibilities, interview signals, and first-90-day outputs. The hiring process tests work samples rather than vocabulary. Leaders understand which capabilities are internal, contracted, vendor-supported, or deferred.

Program cadence is lightweight but real. Weekly intake routes new work. Monthly evidence review checks freshness, ownership, open gaps, and expiring risk acceptances. Quarterly operating review updates risk posture, staffing, tooling, and board-to-backlog traceability. The artifacts are useful to engineering teams, not just executives.

Assessment Focus

The assessment tests whether you can turn AI security from a topic into an operating system. Strong answers name owners, artifacts, gates, cadence, and hiring tradeoffs. Weak answers recommend policies, committees, or generic security ownership without showing how product decisions change.

The most common wrong-answer pattern is assuming that a broad AI Security Engineer title resolves organizational ambiguity. Correct reasoning decomposes the work, assigns owners, identifies evidence artifacts, and describes what the first 30, 60, and 90 days should produce.

Common Misconceptions

A policy is not an operating model. Policies define expectations; operating models define intake, ownership, controls, evidence, escalation, and review cadence.

One senior hire cannot be nine archetypes at expert depth. A first AI Security Architect can design the initial program, but the organization still needs to decide what is covered internally, contracted, vendor-supported, or deferred.

Governance evidence is not paperwork after engineering. It is the proof that engineering controls operated: logs, tests, gates, owner records, risk acceptance, and incident reconstruction.

Metrics are weak if they do not affect decisions. Count reviews only if the count connects to launch outcomes, blockers, risk acceptances, stale evidence, remediation, or capability gaps.

Study Checklist

I can explain why AI security is a team-shaped capability rather than one generic role.
I can map a company's AI risk pressure to the most appropriate first hire.
I can name the nine canonical AI security archetypes and their distinct outputs.
I can design a lightweight AI security intake process.
I can define release gates for AI features, RAG changes, agent tools, model changes, and vendor AI use.
I can identify the evidence artifacts needed for monthly operating review.
I can translate executive AI risk language into backlog items and control evidence.
I can distinguish useful program metrics from reporting theater.

Remediation Starting Points

If you scored below 60% in this domain:

Build a one-page AI security operating model for a real or hypothetical company: intake triggers, owners, release gates, evidence artifacts, and cadence.
Write a first-hire decision memo that names the primary archetype, adjacent coverage, explicit non-responsibilities, first-90-day outputs, and external support needed.
Create a control owner register for RAG authorization, agent permissions, model intake, eval gates, vendor AI review, incident response, and governance evidence.
Take one board-level AI risk statement and convert it into backlog items, owners, tests, logs, release gates, and evidence records.