Glossary

A comprehensive AI security glossary covering attacks, defenses, frameworks, regulations, and operational concepts.

Canonical version maintained at aisec.blog.

A

Adversarial Example attacks #

An input that has been deliberately perturbed—often imperceptibly to humans—to cause a model to produce an incorrect or attacker-desired output. Originating in computer vision (Szegedy et al., 2014), adversarial examples in NLP are constructed via token substitutions, character-level perturbations, or gradient-based suffix optimization, all while preserving semantic meaning to human readers.

Adversarial Suffix attacks #

A sequence of tokens, typically unintelligible to humans, that is appended to a prompt and optimized (often via gradient descent on a white-box model) to alter model behavior—most commonly to bypass safety filters. Adversarial suffixes are the output of attacks such as GCG and can exhibit transferability to black-box models.

Agent infrastructure #

An LLM-based system capable of planning multi-step tasks, invoking external tools, and maintaining state across interaction turns. Agents amplify LLM utility but also expand the attack surface: each tool call is a potential confused-deputy vulnerability, and multi-step reasoning chains can be hijacked by injections encountered mid-execution.

Attention Mechanism ml-concepts #

The core computational primitive in transformer models that computes a weighted sum of value vectors for each token based on query-key similarity across all positions in the context window. Attention enables long-range dependency modeling but has O(n²) complexity in sequence length, creating both performance and security implications. Attention patterns have been used in interpretability research to understand how models process injected instructions.

B

Backdoor attacks #

A hidden behavior embedded in a model during training that activates when the model receives a specific trigger pattern (e.g., a particular phrase, pixel pattern, or token sequence) and causes targeted misclassification or policy violation. Backdoors can be inserted via data poisoning or direct weight manipulation and are difficult to detect because the model behaves normally on trigger-free inputs.

BIG-bench evaluation #

A collaborative benchmark comprising 204 diverse tasks contributed by researchers specifically designed to probe LLM capabilities that resist simple pattern-matching—including logical reasoning, causal understanding, and social cognition. BIG-bench Hard (BBH) further isolates 23 tasks where models initially performed near chance, providing a more discriminating capability signal.

C

Capability Scoping defenses #

The practice of limiting the tools, APIs, and actions available to an LLM agent to the minimum set required for its intended task. By reducing the agent's capability surface, capability scoping limits the blast radius of a successful prompt injection or jailbreak that attempts to abuse the agent's tool-calling authority.

Constitutional AI defenses #

Anthropic's training methodology in which a model is guided to follow a written set of principles (a 'constitution') through self-critique and revision during training, reducing dependence on human-annotated RLHF data for harmlessness. The model is prompted to identify and correct its own policy violations before a preference model scores the outputs, enabling scalable oversight.

Contamination evaluation #

The presence of benchmark test examples in a model's training data, causing the model to effectively memorize answers rather than generalize, inflating reported scores and producing misleading comparisons. Contamination is difficult to detect when training data provenance is opaque and is a significant concern for any public benchmark dataset.

Context Window infrastructure #

The maximum number of tokens an LLM can process in a single forward pass. The context window bounds how much conversation history, system prompt content, retrieved documents, and tool outputs can be included simultaneously. Filling the context window with attacker-controlled content can dilute or displace legitimate instructions—a technique sometimes called context overflow.

D

Data Poisoning attacks #

An attack on the model training pipeline that corrupts training data to embed backdoors, degrade performance on targeted inputs, or shift model behavior at inference time. Poisoning can target pre-training corpora, fine-tuning datasets, or RLHF preference data, making detection especially difficult when the attacker has only partial control of the data supply chain.

Differential Privacy defenses #

A mathematical framework providing a formal guarantee that the inclusion or exclusion of any single training example changes model outputs by at most a bounded amount (epsilon). Applied to ML training via DP-SGD (Abadi et al., 2016), it defends against membership inference and model inversion attacks by injecting calibrated Gaussian or Laplace noise into gradients during training.

Drift evaluation #

A statistical change in model input distribution (data drift) or in the relationship between inputs and desired outputs (concept drift) that occurs over time in production. For LLM deployments, drift manifests as degrading response quality, increasing refusal rates, or emergent failure modes. Monitoring for drift requires tracking output distributions and user feedback signals continuously.

E

Embedding infrastructure #

A dense, fixed-dimensional numerical representation of text or other content produced by an encoder model. Semantically similar inputs map to nearby points in the embedding space, enabling similarity-based retrieval. Embeddings underpin vector database search in RAG systems and are also used as features in classifiers, including content moderation models.

EU AI Act standards #

The European Union's comprehensive AI regulation, which classifies AI systems into risk tiers—prohibited, high-risk, limited-risk, and minimal-risk—and imposes corresponding obligations on providers and deployers. It entered into force in August 2024 with a phased compliance timeline. High-risk applications (e.g., biometric identification, critical infrastructure, employment decisions) face requirements including conformity assessments, logging, and human oversight.

F

Fine-tuning infrastructure #

Supervised continued training applied to a pre-trained foundation model on a task-specific dataset to adapt its behavior without retraining from scratch. Fine-tuning is the primary method for specializing foundation models and is increasingly exposed as an API feature (e.g., OpenAI fine-tuning API), making it a potential attack vector for eroding safety alignment via adversarial fine-tuning datasets.

Foundation Model infrastructure #

A large-scale model trained on broad, diverse data using self-supervised objectives and subsequently adapted to downstream tasks via fine-tuning, prompting, or RLHF. GPT-4, Claude 3, Gemini 1.5, and Llama 3 are examples. Foundation models concentrate capability and risk: a vulnerability in the base model propagates to every downstream application built on it.

G

GCG (Greedy Coordinate Gradient) Attack attacks #

An optimization-based adversarial attack that uses gradient information from a white-box model to construct adversarial suffixes—short token sequences that, when appended to any prompt, reliably elicit harmful outputs. Introduced by Zou et al. (2023), GCG suffixes transfer across models and are considered a foundational automated jailbreak technique.

GDPR standards #

The General Data Protection Regulation, the EU's comprehensive personal data processing law. GDPR applies to AI systems that process EU residents' personal data during training or inference. Key implications for ML include restrictions on automated decision-making (Article 22), data minimization requirements that constrain training data collection, and breach notification obligations that extend to model outputs containing PII.

Guardrail defenses #

Any mechanism—classifier, rule set, heuristic, or secondary model—that filters, classifies, or constrains LLM inputs and/or outputs to enforce operator policy. Guardrails can be implemented as input screening (pre-LLM), output screening (post-LLM), or in-context constraints (system prompt). Layered guardrails from multiple vendors increase robustness against bypass.

H

Hallucination ml-concepts #

The tendency of LLMs to generate factually incorrect, fabricated, or internally inconsistent content with apparent confidence. In security contexts, hallucination is exploitable—adversaries can craft prompts that induce models to confabulate harmful content while appearing authoritative, and security systems relying on LLM reasoning for policy decisions are vulnerable to hallucination-induced false negatives.

HarmBench evaluation #

A standardized evaluation framework for automated red teaming of LLMs, introduced by Mazeika et al. (2024). HarmBench comprises 510 harmful behaviors across seven categories (standard, contextual, copyright, cybersecurity, chemical/biological, misinformation, and multimodal) and includes a reproducible evaluation pipeline with a fine-tuned classifier for scoring attack success.

See also: JailbreakBench, Eval Set, Jailbreak

I

Indirect Prompt Injection attacks #

A variant of prompt injection in which the malicious payload is embedded in external data sources—web pages, documents, emails, database records, or tool outputs—rather than the direct user turn. Because the model treats retrieved content as trusted context, it can execute attacker instructions without any direct interaction between the attacker and the model.

Input Sanitization defenses #

Pre-processing applied to user inputs before they are passed to the LLM, designed to detect and remove or neutralize injection payloads, malicious formatting, or policy-violating content. Techniques include instruction-keyword filtering, delimiters, XML/JSON escaping, and classifier-based screening. Sanitization is a necessary but insufficient defense in isolation.

J

Jailbreak attacks #

A technique used by end users or red teamers to elicit policy-violating outputs from a safety-trained model. Unlike prompt injection, which exploits the model's instruction-following against an operator, jailbreaks treat the user as the adversary attempting to circumvent the model's own training-time constraints. Common strategies include role-play framing, many-shot prompting, and encoding tricks.

JailbreakBench evaluation #

An open artifact for tracking progress on LLM jailbreaking research, providing a standardized set of 100 harmful behaviors, a reproducible evaluation pipeline using GPT-4 as judge, a public leaderboard of attack and defense performance, and a versioned archive of adversarial prompts. Its standardization enables fair comparison across attack methods and temporal tracking of model robustness.

See also: HarmBench, Jailbreak, Eval Set

M

Membership Inference Attack attacks #

An attack that determines whether a specific data sample was present in a model's training set by analyzing model outputs, confidence scores, or loss values. Successful membership inference constitutes a privacy violation when training data is sensitive (e.g., medical records, private communications).

Memorization ml-concepts #

The phenomenon in which a trained model stores and can reproduce verbatim fragments of its training data, particularly data that appears repeatedly or is highly unique in the corpus. Memorization is the mechanism underlying training data extraction attacks and is a privacy risk when training data contains sensitive information such as PII, credentials, or proprietary content.

MITRE ATLAS standards #

The Adversarial Threat Landscape for AI Systems, a MITRE framework that catalogs adversarial ML attack techniques, tactics, and case studies in a matrix structure analogous to MITRE ATT&CK. ATLAS enables security teams to model AI-specific threats, map defensive controls, and conduct tabletop exercises. Its case study database covers real-world AI security incidents across industry verticals.

N

NIST AI RMF standards #

The NIST AI Risk Management Framework, a voluntary framework for identifying, assessing, and managing AI risks, structured around four core functions: Govern (establish accountability), Map (identify context and risks), Measure (assess and analyze risks), and Manage (prioritize and treat risks). Published in January 2023, the AI RMF is increasingly referenced in US federal AI procurement and governance requirements.

See also: MITRE ATLAS, EU AI Act, Model Card

O

Observability evaluation #

The capability to understand and measure the internal state of an LLM application from its external outputs, typically implemented via distributed traces, structured logs, and metrics. LLM-specific observability tooling (e.g., LangSmith, Weights & Biases, Helicone) captures prompt/completion pairs, token counts, latency, and safety classifier outcomes to support debugging, auditing, and drift detection.

P

Perplexity Defense defenses #

A detection heuristic that identifies adversarial inputs by measuring their token-level perplexity under a reference language model. Adversarial suffixes produced by gradient-based attacks (e.g., GCG) often contain unnatural token sequences with anomalously high perplexity. Proposed by Alon and Kamfonas (2023), though it can be evaded by naturalness-constrained attack variants.

Prompt Injection attacks #

An attack in which attacker-controlled text is interpreted by the LLM as instructions, overriding or hijacking the legitimate system prompt or user intent. Direct prompt injection arrives via user input; indirect prompt injection arrives through content the model retrieves or ingests.

R

RAG (Retrieval-Augmented Generation) infrastructure #

An architecture that augments LLM generation by retrieving relevant documents from an external datastore—typically a vector database—and injecting them into the context window before generation. RAG extends a model's effective knowledge beyond its training cutoff and reduces hallucination on knowledge-intensive tasks, but also introduces the retrieval corpus as an attack surface for poisoning and indirect injection.

RAG Poisoning attacks #

An attack on retrieval-augmented generation pipelines that injects malicious documents or passages into the vector store or retrieval corpus. When the poisoned content is retrieved and placed in the LLM's context, it acts as an indirect prompt injection, steering outputs toward attacker-controlled responses.

Red Teaming evaluation #

A structured adversarial evaluation process in which a dedicated team attempts to elicit harmful, unsafe, or policy-violating behavior from an AI system before deployment. AI red teaming encompasses both manual creative attacks and automated attack pipelines (e.g., GCG, PAIR). Major labs conduct extensive red teaming prior to model releases and publish findings in system cards.

See also: Jailbreak, HarmBench, JailbreakBench

RLHF (Reinforcement Learning from Human Feedback) defenses #

A post-training technique that fine-tunes a language model to align with human preferences by using a reward model trained on human comparison judgments. A policy is optimized via proximal policy optimization (PPO) to maximize reward model scores. RLHF is the dominant post-training alignment method for commercial LLMs and directly shapes how models respond to sensitive requests.

S

SmoothLLM defenses #

A defense against jailbreaking attacks that works by randomly perturbing copies of the input prompt and aggregating the results through majority voting. Proposed by Robey et al. (2023), SmoothLLM exploits the observation that adversarial suffixes are fragile: even small random character substitutions degrade attack effectiveness, while benign prompts remain interpretable.

Supply Chain Attack attacks #

An attack that compromises ML models, libraries, training datasets, or deployment infrastructure upstream of the target deployment. Examples include distributing malicious model weights via Hugging Face, backdooring popular ML Python packages, or poisoning publicly available fine-tuning datasets. The Pickle format widely used for model serialization has no integrity guarantees and is a common attack vector.

System Prompt infrastructure #

Instructions provided to an LLM by the application operator before any user turns, typically not visible to the end user. The system prompt establishes persona, scope, and safety constraints for a deployment. Because models are trained to respect system prompt authority, extracting or overriding the system prompt is a common goal of prompt injection and jailbreak attacks.

T

Temperature infrastructure #

A sampling hyperparameter that scales the logit distribution before softmax, controlling output randomness. Temperature 0 approaches greedy decoding (maximally deterministic); higher values (e.g., 1.0–2.0) increase output diversity. From a security perspective, deterministic sampling makes attack success rates reproducible, while high temperatures can cause safety guardrails to be bypassed stochastically.

V

Vector Database infrastructure #

A storage system optimized for high-dimensional embedding lookup via approximate nearest-neighbor (ANN) search algorithms such as HNSW or IVF-PQ. Vector databases (e.g., Pinecone, Weaviate, Qdrant, pgvector) are the primary retrieval backend for RAG systems. Their security properties—access control, update integrity, and query isolation—directly affect RAG pipeline security.