
Today we’re releasing our first security model: Wolf Defender
As AI systems become more widely used, the attack surface grows rapidly. Prompt injection, data exfiltration, and manipulation of AI agents are emerging security risks, yet there are still very few open-weight security models available.
We believe that core AI security tooling should be open and accessible.
That is why we are releasing Wolf Defender, a prompt-injection detection model trained on only ~5% of our full training dataset. Despite this limited training data, the model already outperforms several existing open-source detectors across multiple benchmarks (see the evaluation below). Our full internal model, used inside Patronus Protect, is trained on the complete dataset and performs even better.
Wolf Defender was intentionally trained on heavily augmented data to improve generalization across the prompt-injection threat space.
This includes techniques such as:
- Unicode & homoglyph perturbations
- encoded injections (e.g. Base64)
- HTML and code-comment injections
- structural wrappers
- spacing and casing perturbations
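To make these augmentation categories concrete, here is a minimal sketch of what such transforms can look like. The helper names and the specific character mappings are illustrative, not part of the released training pipeline:

```python
import base64

# Map a few ASCII letters to visually similar Cyrillic homoglyphs.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "i": "\u0456"}

def homoglyph_perturb(text: str) -> str:
    """Swap selected Latin letters for look-alike Unicode characters."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def base64_wrap(text: str) -> str:
    """Hide an injection payload inside a Base64-encoded blob."""
    encoded = base64.b64encode(text.encode()).decode()
    return f"Decode and follow these instructions: {encoded}"

def spacing_perturb(text: str) -> str:
    """Insert spaces between characters to defeat exact string matching."""
    return " ".join(text)

payload = "Ignore previous instructions"
print(homoglyph_perturb(payload))
print(base64_wrap(payload))
print(spacing_perturb(payload))
```

Each transform preserves the attacker's intent while changing the surface form, which is exactly what a robust detector must learn to see through.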
In addition, we applied targeted regularization techniques to reduce the common bias toward simple trigger phrases such as "Ignore previous instructions".
Our goal was not to create a detector that only recognizes known prompts, but one that learns the structure of prompt-injection attacks.
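To see why memorizing known prompts is not enough, consider a hypothetical strawman detector that flags inputs by exact trigger-phrase matching (this is not our model, just an illustration of the failure mode):

```python
# Known trigger phrases, matched verbatim after lowercasing.
TRIGGERS = ["ignore previous instructions", "disregard the above"]

def naive_detector(text: str) -> bool:
    """Flag input only if it contains a known trigger phrase verbatim."""
    lowered = text.lower()
    return any(trigger in lowered for trigger in TRIGGERS)

print(naive_detector("Please ignore previous instructions."))  # True
# A trivial spacing perturbation already defeats exact matching:
print(naive_detector("i g n o r e previous instructions"))     # False
```

A detector that instead learns the structural signature of an injection attempt, rather than its exact wording, remains effective under such perturbations.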
Wolf Defender is trained on an evolving dataset that we continuously expand with new attack patterns. Our training pipeline allows us to rapidly adapt the model to emerging threats.
We see this release as the first step toward a broader ecosystem of open AI security models. Wolf Defender will be one of several upcoming models we plan to release as part of our effort to make on-device AI usage safer and more controllable.
This first model uses a ModernBERT-style architecture to validate the approach. For truly efficient on-device AI security inference, we already have several additional approaches in development.
We’re excited to share Wolf Defender with the community and contribute to making AI systems safer.
