Best AI Security Tools

Glossary

A comprehensive AI security glossary covering attacks, defenses, frameworks, regulations, and operational concepts.

Canonical version maintained at aisec.blog.

A

Adversarial Example attacks #

An input that has been deliberately perturbed—often imperceptibly to humans—to cause a model to produce an incorrect or attacker-desired output. Originating in computer vision (Szegedy et al., 2014), adversarial examples in NLP are constructed via token substitutions, character-level perturbations, or gradient-based suffix optimization, all while preserving semantic meaning to human readers.

See also: Adversarial Suffix, GCG (Greedy Coordinate Gradient) Attack, Transferability

Adversarial Suffix attacks #

A sequence of tokens, typically unintelligible to humans, that is appended to a prompt and optimized (often via gradient descent on a white-box model) to alter model behavior—most commonly to bypass safety filters. Adversarial suffixes are the output of attacks such as GCG and can exhibit transferability to black-box models.

See also: GCG (Greedy Coordinate Gradient) Attack, Adversarial Example, Transferability

Agent infrastructure #

An LLM-based system capable of planning multi-step tasks, invoking external tools, and maintaining state across interaction turns. Agents amplify LLM utility but also expand the attack surface: each tool call is a potential confused-deputy vulnerability, and multi-step reasoning chains can be hijacked by injections encountered mid-execution.

See also: Tool-Call Abuse / Confused Deputy, Capability Scoping, Tool Calling / Function Calling

Attention Mechanism ml-concepts #

The core computational primitive in transformer models that computes a weighted sum of value vectors for each token based on query-key similarity across all positions in the context window. Attention enables long-range dependency modeling but has O(n²) complexity in sequence length, creating both performance and security implications. Attention patterns have been used in interpretability research to understand how models process injected instructions.

See also: Context Window, Foundation Model

B

Backdoor attacks #

A hidden behavior embedded in a model during training that activates when the model receives a specific trigger pattern (e.g., a particular phrase, pixel pattern, or token sequence) and causes targeted misclassification or policy violation. Backdoors can be inserted via data poisoning or direct weight manipulation and are difficult to detect because the model behaves normally on trigger-free inputs.

See also: Data Poisoning, Supply Chain Attack

BIG-bench evaluation #

A collaborative benchmark comprising 204 diverse tasks contributed by researchers specifically designed to probe LLM capabilities that resist simple pattern-matching—including logical reasoning, causal understanding, and social cognition. BIG-bench Hard (BBH) further isolates 23 tasks where models initially performed near chance, providing a more discriminating capability signal.

See also: MMLU (Massive Multitask Language Understanding), Eval Set

C

Capability Scoping defenses #

The practice of limiting the tools, APIs, and actions available to an LLM agent to the minimum set required for its intended task. By reducing the agent's capability surface, capability scoping limits the blast radius of a successful prompt injection or jailbreak that attempts to abuse the agent's tool-calling authority.

See also: Tool-Call Abuse / Confused Deputy, Trust Separation, Agent

Constitutional AI defenses #

Anthropic's training methodology in which a model is guided to follow a written set of principles (a 'constitution') through self-critique and revision during training, reducing dependence on human-annotated RLHF data for harmlessness. The model is prompted to identify and correct its own policy violations before a preference model scores the outputs, enabling scalable oversight.

See also: RLHF (Reinforcement Learning from Human Feedback), Guardrail

Contamination evaluation #

The presence of benchmark test examples in a model's training data, causing the model to effectively memorize answers rather than generalize, inflating reported scores and producing misleading comparisons. Contamination is difficult to detect when training data provenance is opaque and is a significant concern for any public benchmark dataset.

See also: Eval Set, MMLU (Massive Multitask Language Understanding)

Context Window infrastructure #

The maximum number of tokens an LLM can process in a single forward pass. The context window bounds how much conversation history, system prompt content, retrieved documents, and tool outputs can be included simultaneously. Filling the context window with attacker-controlled content can dilute or displace legitimate instructions—a technique sometimes called context overflow.

See also: Tokenization, RAG (Retrieval-Augmented Generation), Model DoS

D

Data Poisoning attacks #

An attack on the model training pipeline that corrupts training data to embed backdoors, degrade performance on targeted inputs, or shift model behavior at inference time. Poisoning can target pre-training corpora, fine-tuning datasets, or RLHF preference data, making detection especially difficult when the attacker has only partial control of the data supply chain.

See also: Supply Chain Attack, RAG Poisoning, Backdoor

Differential Privacy defenses #

A mathematical framework providing a formal guarantee that the inclusion or exclusion of any single training example changes model outputs by at most a bounded amount (epsilon). Applied to ML training via DP-SGD (Abadi et al., 2016), it defends against membership inference and model inversion attacks by injecting calibrated Gaussian or Laplace noise into gradients during training.

See also: Membership Inference Attack, Model Inversion

Drift evaluation #

A statistical change in model input distribution (data drift) or in the relationship between inputs and desired outputs (concept drift) that occurs over time in production. For LLM deployments, drift manifests as degrading response quality, increasing refusal rates, or emergent failure modes. Monitoring for drift requires tracking output distributions and user feedback signals continuously.

See also: Observability

E

Embedding infrastructure #

A dense, fixed-dimensional numerical representation of text or other content produced by an encoder model. Semantically similar inputs map to nearby points in the embedding space, enabling similarity-based retrieval. Embeddings underpin vector database search in RAG systems and are also used as features in classifiers, including content moderation models.

See also: Vector Database, RAG (Retrieval-Augmented Generation)

EU AI Act standards #

The European Union's comprehensive AI regulation, which classifies AI systems into risk tiers—prohibited, high-risk, limited-risk, and minimal-risk—and imposes corresponding obligations on providers and deployers. It entered into force in August 2024 with a phased compliance timeline. High-risk applications (e.g., biometric identification, critical infrastructure, employment decisions) face requirements including conformity assessments, logging, and human oversight.

See also: GDPR, NIST AI RMF, Model Card

Eval Set evaluation #

A held-out dataset used exclusively to measure model performance after training, distinct from training and validation sets. In security contexts, eval sets include adversarial behavior benchmarks (HarmBench, JailbreakBench) and capability assessments. The integrity of an eval set is critical: contamination or adaptive overfitting to public benchmarks renders scores meaningless.

See also: Contamination, HarmBench, JailbreakBench

F

Fine-tuning infrastructure #

Supervised continued training applied to a pre-trained foundation model on a task-specific dataset to adapt its behavior without retraining from scratch. Fine-tuning is the primary method for specializing foundation models and is increasingly exposed as an API feature (e.g., OpenAI fine-tuning API), making it a potential attack vector for eroding safety alignment via adversarial fine-tuning datasets.

See also: Foundation Model, RLHF (Reinforcement Learning from Human Feedback), Data Poisoning

Foundation Model infrastructure #

A large-scale model trained on broad, diverse data using self-supervised objectives and subsequently adapted to downstream tasks via fine-tuning, prompting, or RLHF. GPT-4, Claude 3, Gemini 1.5, and Llama 3 are examples. Foundation models concentrate capability and risk: a vulnerability in the base model propagates to every downstream application built on it.

See also: Fine-tuning, RLHF (Reinforcement Learning from Human Feedback), Model Extraction

G

GCG (Greedy Coordinate Gradient) Attack attacks #

An optimization-based adversarial attack that uses gradient information from a white-box model to construct adversarial suffixes—short token sequences that, when appended to any prompt, reliably elicit harmful outputs. Introduced by Zou et al. (2023), GCG suffixes transfer across models and are considered a foundational automated jailbreak technique.

See also: Adversarial Suffix, Jailbreak, Transferability

GDPR standards #

The General Data Protection Regulation, the EU's comprehensive personal data processing law. GDPR applies to AI systems that process EU residents' personal data during training or inference. Key implications for ML include restrictions on automated decision-making (Article 22), data minimization requirements that constrain training data collection, and breach notification obligations that extend to model outputs containing PII.

See also: EU AI Act, Differential Privacy

Guardrail defenses #

Any mechanism—classifier, rule set, heuristic, or secondary model—that filters, classifies, or constrains LLM inputs and/or outputs to enforce operator policy. Guardrails can be implemented as input screening (pre-LLM), output screening (post-LLM), or in-context constraints (system prompt). Layered guardrails from multiple vendors increase robustness against bypass.

See also: Output Classifier, Input Sanitization, Trust Separation

H

Hallucination ml-concepts #

The tendency of LLMs to generate factually incorrect, fabricated, or internally inconsistent content with apparent confidence. In security contexts, hallucination is exploitable—adversaries can craft prompts that induce models to confabulate harmful content while appearing authoritative, and security systems relying on LLM reasoning for policy decisions are vulnerable to hallucination-induced false negatives.

See also: RAG (Retrieval-Augmented Generation), Eval Set

HarmBench evaluation #

A standardized evaluation framework for automated red teaming of LLMs, introduced by Mazeika et al. (2024). HarmBench comprises 510 harmful behaviors across seven categories (standard, contextual, copyright, cybersecurity, chemical/biological, misinformation, and multimodal) and includes a reproducible evaluation pipeline with a fine-tuned classifier for scoring attack success.

See also: JailbreakBench, Eval Set, Jailbreak

I

Indirect Prompt Injection attacks #

A variant of prompt injection in which the malicious payload is embedded in external data sources—web pages, documents, emails, database records, or tool outputs—rather than the direct user turn. Because the model treats retrieved content as trusted context, it can execute attacker instructions without any direct interaction between the attacker and the model.

See also: Prompt Injection, RAG Poisoning, Tool-Call Abuse / Confused Deputy

Input Sanitization defenses #

Pre-processing applied to user inputs before they are passed to the LLM, designed to detect and remove or neutralize injection payloads, malicious formatting, or policy-violating content. Techniques include instruction-keyword filtering, delimiters, XML/JSON escaping, and classifier-based screening. Sanitization is a necessary but insufficient defense in isolation.

See also: Guardrail, Output Classifier, Perplexity Defense

J

Jailbreak attacks #

A technique used by end users or red teamers to elicit policy-violating outputs from a safety-trained model. Unlike prompt injection, which exploits the model's instruction-following against an operator, jailbreaks treat the user as the adversary attempting to circumvent the model's own training-time constraints. Common strategies include role-play framing, many-shot prompting, and encoding tricks.

See also: Prompt Injection, GCG (Greedy Coordinate Gradient) Attack, Constitutional AI

JailbreakBench evaluation #

An open artifact for tracking progress on LLM jailbreaking research, providing a standardized set of 100 harmful behaviors, a reproducible evaluation pipeline using GPT-4 as judge, a public leaderboard of attack and defense performance, and a versioned archive of adversarial prompts. Its standardization enables fair comparison across attack methods and temporal tracking of model robustness.

See also: HarmBench, Jailbreak, Eval Set

M

Membership Inference Attack attacks #

An attack that determines whether a specific data sample was present in a model's training set by analyzing model outputs, confidence scores, or loss values. Successful membership inference constitutes a privacy violation when training data is sensitive (e.g., medical records, private communications).

See also: Model Inversion, Differential Privacy, Model Extraction

Memorization ml-concepts #

The phenomenon in which a trained model stores and can reproduce verbatim fragments of its training data, particularly data that appears repeatedly or is highly unique in the corpus. Memorization is the mechanism underlying training data extraction attacks and is a privacy risk when training data contains sensitive information such as PII, credentials, or proprietary content.

See also: Model Inversion, Membership Inference Attack, Differential Privacy

MITRE ATLAS standards #

The Adversarial Threat Landscape for AI Systems, a MITRE framework that catalogs adversarial ML attack techniques, tactics, and case studies in a matrix structure analogous to MITRE ATT&CK. ATLAS enables security teams to model AI-specific threats, map defensive controls, and conduct tabletop exercises. Its case study database covers real-world AI security incidents across industry verticals.

See also: OWASP LLM Top 10, NIST AI RMF

MMLU (Massive Multitask Language Understanding) evaluation #

A benchmark measuring zero-shot and few-shot knowledge across 57 academic subjects spanning STEM, humanities, social sciences, and professional domains (Hendrycks et al., 2021). MMLU is widely used as a proxy for general intelligence in foundation model comparisons. Its heavy reliance on multiple-choice format limits its ability to capture generative capability and is susceptible to contamination.

See also: Contamination, Eval Set, BIG-bench

Model Card standards #

A short document accompanying a model release that discloses intended use cases, out-of-scope applications, training data sources and limitations, evaluation results across demographic groups and adversarial conditions, and ethical considerations. Model cards, introduced by Mitchell et al. (2019), are now a de facto standard for major model releases and are becoming a regulatory expectation under the EU AI Act.

See also: EU AI Act, NIST AI RMF, Eval Set

Model DoS attacks #

An attack that causes excessive compute or memory consumption by crafting inputs that are expensive to process—such as maximally long contexts, highly repetitive tokens, or inputs that trigger worst-case attention patterns. Model DoS can degrade service availability, increase costs for operators, and is especially impactful against models with quadratic attention complexity.

See also: Context Window, Tokenization

Model Extraction attacks #

An attack that recovers a functional approximation of a target model's behavior—or in some cases its weights or architecture—through systematic black-box querying. Model extraction violates intellectual property, enables downstream fine-tuning, and can be used to build white-box surrogate models for further adversarial attack development.

See also: Membership Inference Attack, Model Inversion

Model Inversion attacks #

An attack that reconstructs sensitive training samples by optimizing inputs to maximize model confidence on a target class or prediction. In LLMs, inversion-style attacks have recovered training text by exploiting memorization, particularly from models trained on small, repeated, or highly structured datasets.

See also: Membership Inference Attack, Differential Privacy

Multimodal Injection attacks #

A prompt injection attack delivered through non-text modalities such as images (e.g., text embedded in pixel regions), audio, video, or PDF documents. As multimodal models process diverse input types, attackers can embed instruction payloads in channels that lack the same scrutiny applied to text, such as invisible Unicode characters in PDFs or steganographically hidden text in images.

See also: Prompt Injection, Indirect Prompt Injection

N

NIST AI RMF standards #

The NIST AI Risk Management Framework, a voluntary framework for identifying, assessing, and managing AI risks, structured around four core functions: Govern (establish accountability), Map (identify context and risks), Measure (assess and analyze risks), and Manage (prioritize and treat risks). Published in January 2023, the AI RMF is increasingly referenced in US federal AI procurement and governance requirements.

See also: MITRE ATLAS, EU AI Act, Model Card

O

Observability evaluation #

The capability to understand and measure the internal state of an LLM application from its external outputs, typically implemented via distributed traces, structured logs, and metrics. LLM-specific observability tooling (e.g., LangSmith, Weights & Biases, Helicone) captures prompt/completion pairs, token counts, latency, and safety classifier outcomes to support debugging, auditing, and drift detection.

See also: Drift, Eval Set

Output Classifier defenses #

A model or deterministic rule set that evaluates LLM-generated text for policy violations—such as harmful content, PII exposure, or prompt injection echoes—before returning the response to the user. Output classifiers form a defense-in-depth layer that operates independently of the primary model's safety training.

See also: Guardrail, Input Sanitization

OWASP LLM Top 10 standards #

The Open Web Application Security Project's ranked list of the ten most critical security risks for applications built on large language models. First published in 2023 and updated in 2025, it covers risks including prompt injection, insecure output handling, training data poisoning, model denial of service, and excessive agency. It serves as the primary developer-oriented security checklist for LLM application teams.

See also: Prompt Injection, Data Poisoning, Tool-Call Abuse / Confused Deputy

P

Perplexity Defense defenses #

A detection heuristic that identifies adversarial inputs by measuring their token-level perplexity under a reference language model. Adversarial suffixes produced by gradient-based attacks (e.g., GCG) often contain unnatural token sequences with anomalously high perplexity. Proposed by Alon and Kamfonas (2023), though it can be evaded by naturalness-constrained attack variants.

See also: GCG (Greedy Coordinate Gradient) Attack, Adversarial Suffix, Guardrail

Prompt Injection attacks #

An attack in which attacker-controlled text is interpreted by the LLM as instructions, overriding or hijacking the legitimate system prompt or user intent. Direct prompt injection arrives via user input; indirect prompt injection arrives through content the model retrieves or ingests.

See also: Indirect Prompt Injection, Jailbreak, System Prompt

R

RAG (Retrieval-Augmented Generation) infrastructure #

An architecture that augments LLM generation by retrieving relevant documents from an external datastore—typically a vector database—and injecting them into the context window before generation. RAG extends a model's effective knowledge beyond its training cutoff and reduces hallucination on knowledge-intensive tasks, but also introduces the retrieval corpus as an attack surface for poisoning and indirect injection.

See also: Vector Database, Embedding, RAG Poisoning

RAG Poisoning attacks #

An attack on retrieval-augmented generation pipelines that injects malicious documents or passages into the vector store or retrieval corpus. When the poisoned content is retrieved and placed in the LLM's context, it acts as an indirect prompt injection, steering outputs toward attacker-controlled responses.

See also: Indirect Prompt Injection, RAG (Retrieval-Augmented Generation), Vector Database

Red Teaming evaluation #

A structured adversarial evaluation process in which a dedicated team attempts to elicit harmful, unsafe, or policy-violating behavior from an AI system before deployment. AI red teaming encompasses both manual creative attacks and automated attack pipelines (e.g., GCG, PAIR). Major labs conduct extensive red teaming prior to model releases and publish findings in system cards.

See also: Jailbreak, HarmBench, JailbreakBench

RLHF (Reinforcement Learning from Human Feedback) defenses #

A post-training technique that fine-tunes a language model to align with human preferences by using a reward model trained on human comparison judgments. A policy is optimized via proximal policy optimization (PPO) to maximize reward model scores. RLHF is the dominant post-training alignment method for commercial LLMs and directly shapes how models respond to sensitive requests.

See also: Constitutional AI, Fine-tuning

S

SmoothLLM defenses #

A defense against jailbreaking attacks that works by randomly perturbing copies of the input prompt and aggregating the results through majority voting. Proposed by Robey et al. (2023), SmoothLLM exploits the observation that adversarial suffixes are fragile: even small random character substitutions degrade attack effectiveness, while benign prompts remain interpretable.

See also: Jailbreak, GCG (Greedy Coordinate Gradient) Attack, Adversarial Suffix

Supply Chain Attack attacks #

An attack that compromises ML models, libraries, training datasets, or deployment infrastructure upstream of the target deployment. Examples include distributing malicious model weights via Hugging Face, backdooring popular ML Python packages, or poisoning publicly available fine-tuning datasets. The Pickle format widely used for model serialization has no integrity guarantees and is a common attack vector.

See also: Data Poisoning, Model Extraction

System Prompt infrastructure #

Instructions provided to an LLM by the application operator before any user turns, typically not visible to the end user. The system prompt establishes persona, scope, and safety constraints for a deployment. Because models are trained to respect system prompt authority, extracting or overriding the system prompt is a common goal of prompt injection and jailbreak attacks.

See also: Prompt Injection, Trust Separation

T

Temperature infrastructure #

A sampling hyperparameter that scales the logit distribution before softmax, controlling output randomness. Temperature 0 approaches greedy decoding (maximally deterministic); higher values (e.g., 1.0–2.0) increase output diversity. From a security perspective, deterministic sampling makes attack success rates reproducible, while high temperatures can cause safety guardrails to be bypassed stochastically.

See also: Context Window

Tokenization infrastructure #

The process of splitting raw text into discrete tokens—typically sub-word units derived via byte-pair encoding (BPE) or SentencePiece—before feeding them to an LLM. Tokenization quirks create security-relevant edge cases: homoglyphs, unusual Unicode, and token-boundary straddling can cause injection filters to miss payloads that the model nonetheless interprets as instructions.

See also: Context Window, Adversarial Suffix

Tool Calling / Function Calling infrastructure #

An API capability that allows an LLM to invoke structured external functions by generating JSON payloads conforming to a predefined schema. The runtime interprets the JSON and executes the corresponding function (e.g., search, send_email, execute_sql). Tool calling is the mechanism through which agent authority is exercised and through which tool-call abuse attacks operate.

See also: Agent, Tool-Call Abuse / Confused Deputy, Capability Scoping

Tool-Call Abuse / Confused Deputy attacks #

An attack that exploits an LLM agent's tool-calling authority to perform unauthorized actions on behalf of an attacker. The model acts as a confused deputy—it has legitimate credentials and permissions, but an indirect prompt injection or jailbreak causes it to invoke those tools in ways the operator never intended, such as exfiltrating data or sending unauthorized requests.

See also: Indirect Prompt Injection, Agent, Capability Scoping

Transferability evaluation #

The property of adversarial examples or attack strategies that cause them to succeed against models other than the one used during attack development. Black-box transferability—where attacks optimized on a white-box source model succeed against proprietary target models—is what makes gradient-based attacks practically threatening and complicates defenses that rely on model obscurity.

See also: Adversarial Example, Adversarial Suffix, GCG (Greedy Coordinate Gradient) Attack

Trust Separation defenses #

An architectural pattern that assigns distinct privilege levels to different data planes flowing into an LLM: operator-controlled system prompts receive high trust, user input receives medium trust, and retrieved or external content receives low trust. Strict trust separation prevents indirect prompt injection from escalating to system-prompt-level authority.

See also: System Prompt, Indirect Prompt Injection, Capability Scoping

V

Vector Database infrastructure #

A storage system optimized for high-dimensional embedding lookup via approximate nearest-neighbor (ANN) search algorithms such as HNSW or IVF-PQ. Vector databases (e.g., Pinecone, Weaviate, Qdrant, pgvector) are the primary retrieval backend for RAG systems. Their security properties—access control, update integrity, and query isolation—directly affect RAG pipeline security.

See also: RAG (Retrieval-Augmented Generation), Embedding