What Makes Pentesting LLMs Different

Penetration testing LLM applications demands a different approach than conventional web testing. Classic vulnerabilities still matter, but LLMs introduce entirely new attack surfaces arising from their unique architecture. This post examines black-box strategies for discovering and exploiting vulnerabilities in LLM-driven applications.

Before diving into exploitation techniques, we need to understand why securing LLMs is uniquely challenging. The answer lies in their core architectural properties.

The Autoregressive Model

Large Language Models are autoregressive — they generate text by predicting the next token based on previous tokens. This means:

LLMs are fundamentally completion engines. Given any input, they will always try to generate a plausible continuation. There’s no built-in concept of “refusing” at the architectural level — refusals are trained behaviors that can be circumvented.

What this means - This behavior fundamentally separates LLMs from traditional, deterministic systems. A database query, for example, either returns results or errors in a fully deterministic way. Likewise, an API endpoint clearly accepts or rejects a request. By contrast, an autoregressive LLM will always produce some continuation, even when given adversarial or malformed inputs.

This has important security implications. You can’t truly “block” a model from generating a response, you can only attempt to steer it toward safer outputs. Adversarial prompts exploit this property by redirecting the model’s trajectory into unsafe continuations, and because safety is something learned during training rather than guaranteed by the architecture, those controls can sometimes be circumvented.

Note: Diffusion models (the next big thing) work differently — they iteratively refine noise into coherent outputs. This may offer different security properties, but autoregressive models dominate current text-based deployments.

Stochasticity and the Infinite Attack Surface

LLMs are stochastic/non-deterministic - which means they don’t always produce the same output even when given the exact same input. A model’s response can vary because of factors like:

Temperature sampling: Higher temperature = more randomness
Top-k/Top-p sampling: Non-deterministic token selection
Randomness in attention mechanisms: multiple continuations may look equally valid to the model, creating natural randomness in its internal attention patterns

This creates a fundamentally different security challenge from both an offensive and defensive perspective. How do we guarantee safety when a model might answer safely one moment and slip the next? How do we evaluate or validate an exploit when the same payload may succeed only half the time?

Naturally, one may ask: Why not just turn off sampling during testing? That was my first instinct too. If we set $\text{temperature} = 0$ , the model becomes deterministic - every input produces one fixed output and hence we can localize and ignore the stochastic problem.

However, setting $\text{temperature} = 0$ creates an artificial environment that may look secure, but the moment you turn sampling back on, you’ve changed the model’s behavior again. A system that is “safe” at $\text{temperature} = 0$ can easily become unsafe once randomness re-enters the picture.

Additionally, $\text{temperature} = 0$ forces greedy sampling, which is known to cause text degeneration issues - see: https://arxiv.org/abs/1904.09751.

Single Channel Architecture

LLMs process all inputs through a single modality: tokens. System instructions, user prompts, retrieved documents, and tool outputs all flow through the same channel and are processed identically by transformers. Unlike traditional systems where code and data occupy distinct memory regions or syntactic categories, there exists no instruction-data boundary that is enforced at the architectural level in LLMs.

There has been attempts address this issue at the prompt-level via techniques like Microsoft’s spotlighting, but these remain as a stopgap rather than an architectural guarantee. Essentially, the model has no inherent concept of “this token is privileged instruction” versus “this token is untrusted data” because both are just embeddings in the same vector space.

Linguistic Ambiguity

Natural language is inherently ambiguous. Unlike formal languages such as SQL or Python, which are designed for machine interpretation through explicit syntax, natural language evolved for human communication where things such as context and pragmatics resolve ambiguities.

Consider how a single phrase can be interpreted in multiple ways:

This ambiguity allows unbounded phrasings of the same attack intent:

Synonym substitution: “ignore” → “disregard” → “pay no attention to”
Syntactic restructuring: active to passive voice, different clause orderings
Cross-lingual variation: expressing the same attack in different languages

This makes formal attack enumeration impossible. While SQL injection has a finite (though large) pattern space, prompt injection allows infinite creativity. Any attempts of whitelisting or “parameterizing” inputs ultimately devolves into a game of adversarial cat-and-mouse.

Why This Matters for Pentesting

Understanding these fundamentals changes how we approach LLM security testing:

You can’t realistically enumerate all attacks, the attack surface for prompting is unbounded.
Defenses are probabilistic, so evaluations must be adjusted accordingly to account for this.
Context radically alters behavior, the same input in a different context may yield a drastically different outcome.
Testers benefit from linguistic creativity, effectively social-engineering a machine in practice.

A Note on LLM Red Teaming Tools

The LLM security landscape has seen an explosion of automated red teaming frameworks (Garak, Promptfoo, etc.) designed to test model safety. While these tools are useful for evaluating standalone model behavior, they currently lack the architectural context necessary for testing real-world applications, particularly those involving agentic workflows, tool use, or multi-step reasoning chains.

This becomes a significant limitation as soon as security concerns move beyond “Can I jailbreak the model?”. Many vulnerabilities arise not from the model itself, but from the way the application integrates and relies on it. For example, application-layer issues often appear when an LLM has the ability to read or write data, trigger actions, or influence state. Multi-agent systems create additional risks because information can pass between agents in unexpected ways. Long-context features introduce opportunities for context drifts or guardrail bypasses once the context limit has been exhausted. Business logic flaws occur when an application makes decisions based on LLM output without properly validating or constraining it. None of these failure modes can be discovered by probing a single model endpoint.

What’s Next?

In Part 2, we’ll explore the full spectrum of prompt injection techniques, from basic instruction overrides and context manipulation to advanced defense bypasses and combining AI-specific vulnerabilities with traditional web exploits. We’ll cover encoding attacks, jailbreaking through framing and roleplay, how to abuse agentic tooling, and demonstrate how prompt injection serves as a primitive for escalating into XSS, SQL injection, and access control bypasses.