Hirundo achieves superior security hardening against adversarial vulnerabilities with Gemma 4
Trained with weight-level defenses, Hirundo’s E4B variant delivers elite-tier protection that outclasses models over 100 times its size.
As Large Language Models (LLMs) move into production, prompt injection attacks—where adversaries manipulate inputs to override system instructions—remain a persistent security challenge. Traditional defenses often rely on “bigger is better” logic or fragmented guardrails.
Hirundo, an AI safety platform, challenged this assumption by building advanced, weight-level resistance directly onto the foundational architecture of Gemma 4. As a result, they demonstrated that a compact model can outperform raw scale, delivering production-grade security without sacrificing the speed and cost-efficiency of a smaller footprint.
Adversarial robustness at scale
Enterprises deploying LLMs face a difficult trade-off between model capability, cost, and security. While models over 100B+ parameters are often assumed to be more robust due to their scale, they remain susceptible to sophisticated “jailbreak” techniques that bypass safety training. Alternatively, while smaller models are appreciated for their efficiency, developers historically faced challenges implementing defense layers without sacrificing general utility.
Hirundo sought to prove that security is not a function of parameter count, but of the ability to apply precise behavioral control to an established, efficient model architecture. By targeting the specific weights susceptible to adversarial manipulation, they create a “secure-by-design” model without the latency or compute costs of massive architectures.
Preserving utility while hardening security
Instead of adding external filters that slow down inference and often follow rule-based logic, Hirundo applied structural safety alignment directly to the instruction-tuned base model. This process involves identifying and excising the internal representations that make the model comply with adversarial prompts, effectively “forgetting” susceptibility to prompt injection at the weight level. Gemma 4 E4B IT provided an optimal combination of performance, size, and baseline safety alignment, enabling rapid, secure iteration in resource-constrained environments.
A common fear with aggressive security hardening is the alignment tax, an expected degradation in general performance capabilities. Hirundo’s weight-optimization process achieved a 74.47% reduction in successful attacks relative to the base model, resulting in a final Attack Success Rate (ASR) of 4.78%. Crucially, this hardening strictly preserved the model’s high performance across standard utility benchmarks, including AIME25, LiveCodeBench, GPQA, IFBench, and SCICode, as well as benign benchmarks AutoPatchBench and CyberSOCEval.
Outperforming 600B+ models: efficiency beats scale
To validate the approach, Hirundo benchmarked their hardened Gemma model against industry-leading open-weights models using PurpleLlama CyberSecEval, a responsible AI benchmark suite that evaluates cybersecurity risks. Significantly larger models complied with adversarial overrides at a higher rate, failing to preserve their original system instructions under identical pressure.
The data highlights a critical security insight: raw scale alone offers little protection against targeted jailbreaks. DeepSeek V3.2-Exp, a 685B parameter model, exhibited a 73.33% failure rate—15.6x worse than the hardened Gemma model. Similarly, despite being 30 times larger, GPT-OSS-120B lagged behind with more than 3x the attack success rate, while the 235B Qwen model proved 10.8x more vulnerable.
By pairing the efficiency of Gemma with targeted security hardening, developers can now deploy models that are not only faster and cheaper to run but fundamentally more secure than larger models.