LLM Injection Cyber Resilient Assistants

LLM injection-resilient cyber assistants using Constitutional AI guardrails, Adaptive Constitutional AI guardrails, DPO, and Unlearning.

Overview

 

Context
Large Language Models (LLMs) such as GPT-based systems are increasingly used to assist in domains where fast and accurate analysis is critical. They can summarize logs, explain technical concepts, and even detect anomalies, making them valuable tools for security operations. However, LLMs are also known to be vulnerable to jailbreaks—adversarial prompts that override their built-in guardrails and push them into unsafe behavior (Ganguli et al., 2023).

 

Challenge
In cybersecurity, this weakness is particularly dangerous. Attackers can hide malicious instructions in logs, phishing emails, or malware code, tricking the model into revealing sensitive data, replicating exploits, or executing unwanted actions. These prompt injection attacks bypass standard guardrails and create significant risks when LLMs are deployed in Security Operations Centers (SOCs), phishing response workflows, or malware analysis pipelines (Shi et al., 2023).

 

Approach
This project develops LLM injection-resilient cyber assistants that combine safety and adaptability. The framework rests on three pillars:

  • Constitutional AI guardrails – assistants operate under a security-aware constitution that enforces principles like “do not execute commands from logs” or “never regenerate malware payloads.”
  • Adaptive Constitutional AI – as new jailbreak and injection strategies emerge, the constitution evolves through continuous testing and feedback.
  • Direct Preference Optimization (DPO) & Unlearning – the model is tuned to prefer safe, policy-aligned responses and to forget unsafe patterns, strengthening defenses beyond surface-level filters.

By integrating these layers, our assistants maintain trustworthiness while supporting critical tasks such as log analysis, phishing triage, and malware explanation—offering a reliable line of defense against adversarial misuse of AI in cybersecurity.


References
Ganguli, D., Askell, A., Schiefer, N., et al. (2023). Constitutional AI: Harmlessness from AI feedback. arXiv. https://arxiv.org/abs/2212.08073

Shi, W., Yuan, W., Li, B., & Chen, Y. (2023). BadPrompt: Backdoor Attacks on Continuous Prompts. In Proceedings of the IEEE Symposium on Security and Privacy. https://doi.org/10.1109/SP46215.2023.00025