- copilots,
- internal enterprise assistants,
- support systems,
- coding workflows,
- RAG pipelines,
- and autonomous agents.
As these systems become more capable, they also become more difficult to test.
Traditional security testing approaches still matter, but LLM applications introduce something unusual: the target is probabilistic, conversational, and capable of changing behaviour based on context.
That creates a different kind of attack surface.
During experimentation with prompt injection and jailbreak testing, one limitation became increasingly obvious:
Most testing workflows still rely heavily on static payload libraries.
A predefined set of prompts is replayed against the target, and the results are observed. That approach is useful, but modern models often respond differently depending on:
- provider alignment,
- conversation state,
- role framing,
- wording,
- system prompts,
- and surrounding context.
What fails in one scenario may partially succeed in another with only slight adaptation.
That observation led to an experiment:
What if the attacker itself adapted dynamically during the test?
That idea eventually became AI vs AI.
What AI vs AI does
AI vs AI is an experimental red-teaming platform for LLM applications.
Instead of replaying only static jailbreak payloads, the system uses one model to iteratively test another through:
- adaptive prompt generation,
- attack mutation,
- strategy switching,
- and feedback-driven orchestration.
The workflow is intentionally simple:
- An attacker model generates a payload.
- The payload is sent to the target LLM or application.
- A judge evaluates the response.
- The attacker adapts its next attempt based on prior outcomes.
This loop repeats until:
- the goal is achieved,
- progress stalls,
- or the iteration budget is exhausted.

The goal is not to "break every model." The goal is to explore how adaptive adversarial testing changes the quality of evaluation.
Why adaptive testing matters
Static payload libraries remain valuable, especially for reproducibility.
However, human attackers rarely operate statically.
A human red teamer typically observes:
- refusal patterns,
- moderation language,
- role assumptions,
- hidden constraints,
- partial disclosures,
- and contextual hints.
Then they adjust strategy accordingly.
AI vs AI experiments with bringing some of that iterative behaviour into automated testing.
For example, if a direct instruction override fails, the attacker may pivot toward:
- authority impersonation,
- fake tool output,
- policy introspection,
- roleplay framing,
- gradual escalation,
- or indirect extraction attempts.

The important part is not the individual payload.
The important part is that the system changes tactics based on target behaviour.
Real-world target support
One design goal was flexibility.
Instead of tightly coupling the tool to a single API provider, targets are represented as generic HTTP request templates.
That means the system can test:
- OpenAI-compatible APIs,
- local Ollama deployments,
custom chatbot wrappers,
- and other HTTP-based AI systems.
The interface supports importing:
- raw HTTP requests(just copy the request form your favourite burp tool and paste in this tool),
- cURL commands,
- headers,
- cookies,
- and JSON request bodies.

This approach turned out to be more practical than building many provider-specific integrations.
Hybrid attack orchestration
One interesting observation during development was that fully generative attacks and fully deterministic attacks both have limitations.
Pure library attacks
- reproducible,
- stable,
- predictable,
- but sometimes rigid.
Pure LLM-generated attacks
- adaptive,
- creative,
- flexible,
- but occasionally unstable or repetitive.
To balance both approaches, AI vs AI supports a hybrid mode.
In hybrid mode:
- curated attack strategies provide structure,
- while the attacker LLM dynamically mutates and evolves payloads.
This tends to produce more consistent attack progression while still allowing adaptation.
The role of the judge
Evaluating LLM security responses is not always straightforward.
A target may:
- partially comply,
- reveal policy structure,
- leak contextual hints,
- expose moderation behaviour,
- or disclose operational assumptions without fully satisfying the attacker's objective.
The judge component attempts to score:
- relevance,
- compliance,
- disclosure quality,
- and behavioural success.
This allows the orchestrator to:
- continue iterating,
- pivot strategy,
- or terminate early.
The scoring is still imperfect and remains an active area of experimentation.
Interesting behaviour during testing
One recurring pattern was that targets often reveal more information indirectly than directly.
Even when refusing a request, models sometimes expose:
- moderation structure,
- workflow assumptions,
- internal terminology,
- formatting conventions,
- or hidden behavioural constraints.
This led to separating:
- primary goal success,
- incidental findings,
- and leak tracking
inside the interface.
A successful behavioural manipulation does not necessarily imply sensitive disclosure, and vice versa.
Local attacker models behave differently
Another practical observation was the difference between hosted aligned models and local uncensored models.
Hosted APIs are often more reluctant to generate adversarial payloads.
Local models running through Ollama frequently behave much more aggressively during mutation and strategy generation.
This creates an interesting operational pattern:
- hosted APIs often make good targets,
- while local uncensored models often make effective attackers.
Current limitations
AI vs AI is still experimental software.
There are important limitations:
- judge hallucinations,
- non-deterministic outputs,
- scoring imperfections,
- inconsistent model behaviour,
- attack drift,
- and provider-specific differences.
The system is not intended as a definitive benchmark framework.
It is better viewed as:
an adaptive adversarial experimentation platform.
Why this area is interesting
As AI systems become more integrated into production workflows, testing approaches will likely need to evolve as well.
Static evaluations alone may not fully capture:
- conversational manipulation,
- long-context influence,
- adaptive prompt injection,
- agent workflow abuse,
- or iterative behavioural drift.
AI vs AI explores one possible direction:
- feedback-driven,
- iterative,
- machine-speed adversarial testing.
Not as a replacement for human security review, but potentially as a useful complement to it.
Project
GitHub Repository: