Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

AI-driven self-evolving software: A promising path toward software automation

L. Cai, Y. Ren, Y. Zhang, J. Li

What papers don’t tell you: Recovering tacit knowledge for automated paper reproduction

L. Li, R. Wang, H. Song, Y. Mao, T. Zhang, Y. Wang, J. Fan, Y. Zhang, J. Ye, C. Zhang

DiffuGuard: How intrinsic safety is lost and found in diffusion large language models

Proceedings of the AAAI Conference on Artificial Intelligence

DAVSP: Safety alignment for large vision-language models via deep aligned visual safety prompt

Y. Zhang, J. Li, L. Cai, G. Li

Published: 2026

Z. Li, Z. Nie, Z. Zhou, Y. Liu, Y. Zhang, Y. Cheng, Q. Wen, K. Wang, Y. Guo, J. Zhang

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Jailbreak open-sourced large language models via enforced decoding

Zhang, H., Ku, L.-W., Martins, A., Srikumar, V.

Omni-safety under cross-modality conflict: Vulnerabilities, dynamics mechanisms and efficient alignment

K. Wang, Z. Li, Z. Zhou, Y. Zhang, Y. Mi, K. Yang, Y. Zhang, J. Dong, Z. Sun, Q. Li

Smoke and mirrors: Jailbreaking LLM-based code generation via implicit malicious prompts

S. Ouyang, Y. Qin, B. Lin, L. Chen, X. Mao, S. Wang

RedCodeAgent: Automatic red-teaming agent against diverse code agents

Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo Li

Beyond autoregression: An empirical study of diffusion large language models for code generation

MOCHA: Are code language models robust against multi-turn malicious coding prompts?

M. Wahed, X. Zhou, K. A. Nguyen, T. Yu, N. Diwan, G. Wang, D. Hakkani-Tür, I. Lourentzou

Published: 2025

C. Li, Y. Zhang, J. Li, L. Cai, G. Li

AAAI Conference on Artificial Intelligence (AAAI)

Security Attacks on LLM-based Code Completion Tools

Wen Cheng, Ke Sun, Xinyu Zhang, Wei Wang

Published: 2024.8.21

The rapid development of large language models (LLMs) has significantly advanced code completion capabilities, giving rise to a new generation of LLM-based Code Completion Tools (LCCTs). Unlike general-purpose LLMs, these tools possess unique workflows, integrating multiple information sources as input and prioritizing code suggestions over natural language interaction, which introduces distinct security challenges. Additionally, LCCTs often rely on proprietary code datasets for training, raising concerns about the potential exposure of sensitive data. This paper exploits these distinct characteristics of LCCTs to develop targeted attack methodologies on two critical security risks: jailbreaking and training data extraction attacks. Our experimental results expose significant vulnerabilities within LCCTs, including a 99.4% success rate in jailbreaking attacks on GitHub Copilot and a 46.3% success rate on Amazon Q. Furthermore, we successfully extracted sensitive user data from GitHub Copilot, including 54 real email addresses and 314 physical addresses associated with GitHub usernames. Our study also demonstrates that these code-based attack methods are effective against general-purpose LLMs, such as the GPT series, highlighting a broader security misalignment in the handling of code by modern LLMs. These findings underscore critical security challenges associated with LCCTs and suggest essential directions for strengthening their security frameworks. The example code and attack samples from our research are provided at https://github.com/Sensente/Security-Attacks-on-LCCTs.

Prompt injection, attack methods, LLM security

PackMonitor: Enabling zero package hallucinations through decoding-time monitoring

X. Liu, Y. Liu, Y. Zhang, J. Li, S.-M. Hu

XGrammar: Flexible and efficient structured generation engine for large language models

Y. Dong, C. F. Ruan, Y. Cai, R. Lai, Z. Xu, Y. Zhao, T. Chen

Lookahead-then-verify: Reliable constrained decoding for diffusion LLMs under context-free grammars

Transactions on Machine Learning Research

SynCode: LLM generation with grammar augmentation

S. Ugare, T. Suresh, H. Kang, S. Misailovic, G. Singh

Published: 2024

Llguidance

Microsoft

Published: 2025

Y. Zhang, Y. Li, Y. Liu, J. Li, X. Jia, Z. Li, G. Li

Structured decoding in vLLM: A gentle introduction

Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems

Using grammar masking to ensure syntactic validity in LLM-based modeling tasks

L. Netz, J. Reimer, B. Rumpe

Published: 2024

vLLM Blog

OpenAI Technical Report

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram

MiniMax M2.7: Early echoes of self-evolution

MiniMax

Qwen2.5 Coder Technical Report

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Dang, K., Yang, A., Men, R., Huang, F., Ren, X., Ren, X., Zhou, J., Lin, J.

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference

J. Ji, D. Hong, B. Zhang, B. Chen, J. Dai, B. Zheng, T. A. Qiu, J. Zhou, K. Wang, B. Li

Published: 2025

International Conference on Learning Representations (ICLR)

Citations: 2

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson

Published: 2024.6.10

The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model's generative distribution primarily over only its very first few output tokens. We refer to this issue as shallow safety alignment. In this paper, we present case studies to explain why shallow safety alignment can exist and provide evidence that current aligned LLMs are subject to this issue. We also show how these findings help explain multiple recently discovered vulnerabilities in LLMs, including the susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. Importantly, we discuss how this consolidated notion of shallow safety alignment sheds light on promising research directions for mitigating these vulnerabilities. For instance, we show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits. Finally, we design a regularized finetuning objective that makes the safety alignment more persistent against fine-tuning attacks by constraining updates on initial tokens. Overall, we advocate that future safety alignment should be made more than just a few tokens deep.

Prompt injection, safety alignment, LLM security

Decoupling safety into orthogonal subspace: Cost-efficient and performance-preserving alignment for large language models

Y. Mou, X. Zhou, Y. Luo, S. Zhang, W. Ye

The Llama 3 herd of models

LLaMa-Team

Qwen2.5 technical report

Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H.

AgentSpec: Customizable runtime enforcement for safe and reliable LLM agents

Structured Model Outputs

Haoyu Wang, Christopher M. Poskitt, Jun Sun

Exploiting prefix-tree in structured output interfaces for enhancing jailbreak attacking

Y. Li, Y. Xiong, J. Zhong, J. Zhang, J. Zhou, L. Zou

Computing Research Repository (CoRR)

Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms

Shuoming Zhang, Jiacheng Zhao, Ruiyuan Xu, Xiaobing Feng, Huimin Cui

Published: 2025.4.1

Content Warning: This paper may contain unsafe or harmful content generated by LLMs that may be offensive to readers. Large Language Models (LLMs) are extensively used as tooling platforms through structured output APIs, which ensure syntax compliance so that robust integration with existing software, such as agent systems, can be achieved. However, the very feature that enables grammar-guided structured output presents significant security vulnerabilities. In this work, we reveal a critical control-plane attack surface orthogonal to traditional data-plane vulnerabilities. We introduce Constrained Decoding Attack (CDA), a novel jailbreak class that weaponizes structured output constraints to bypass safety mechanisms. Unlike prior attacks focused on input prompts, CDA operates by embedding malicious intent in schema-level grammar rules (control-plane) while maintaining benign surface prompts (data-plane). We instantiate this with a proof-of-concept Chain Enum Attack, which achieves 96.2% attack success rates across proprietary and open-weight LLMs on five safety benchmarks with a single query, including GPT-4o and Gemini-2.0-flash. Our findings identify a critical security blind spot in current LLM architectures and urge a paradigm shift in LLM safety to address control-plane vulnerabilities, as current mechanisms focused solely on data-plane threats leave critical systems exposed.

Prompt injection, LLM security, bypassing LLM safety mechanisms

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Citations: 16

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

Published: 2023.8.8

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as a jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend against jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

LLM security, prompt injection, character role-play

Lockpicking LLMs: A logit-based jailbreak using token-level manipulation

Y. Li, Y. Liu, Y. Li, L. Shi, G. Deng, S. Chen, K. Wang

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Boosting Jailbreak Attack with Momentum

Yihao Zhang, Zeming Wei

Published: 2024.5.2

Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks, notably the well-documented jailbreak attack. Recently, the Greedy Coordinate Gradient (GCG) attack has demonstrated efficacy in exploiting this vulnerability by optimizing adversarial prompts through a combination of gradient heuristics and greedy search. However, the efficiency of this attack has become a bottleneck in the attacking process. To mitigate this limitation, in this paper we rethink the generation of adversarial prompts through an optimization lens, aiming to stabilize the optimization process and harness more heuristic insights from previous iterations. Specifically, we introduce the Momentum Accelerated GCG (MAC) attack, which incorporates a momentum term into the gradient heuristic. Experimental results showcase the notable enhancement achieved by MAC in gradient-based attacks on aligned language models. Our code is available at https://github.com/weizeming/momentum-attack-llm.

Prompt injection, attack methods, watermarking

Low-Resource Languages Jailbreak GPT-4

Citations: 4

Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach

Published: 2023.10.4

AI safety training and red-teaming of large language models (LLMs) are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguard through translating unsafe English inputs into low-resource languages. On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks. Other high-/mid-resource languages have significantly lower attack success rates, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affected speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLM users. Publicly available translation APIs enable anyone to exploit LLMs' safety vulnerabilities. Therefore, our work calls for more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.

Safety alignment, prompt injection, vulnerability detection

Jailbreaking black box large language models in twenty queries

Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., Wong, E.

Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martin Soto, Nathan Labenz, Owain Evans

International Conference on Learning Representations (ICLR)

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson

Published: 2023.10.6

Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5 Turbo on custom datasets also encourage this practice. But, what are the safety costs associated with such custom fine-tuning? We note that while existing safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI's APIs, making the model responsive to nearly any harmful instructions. Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing -- even if a model's initial safety alignment is impeccable, it is not necessarily maintained after custom fine-tuning. We outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.

Prompt injection, information-gathering methods, data collection

Computing Research Repository (CoRR)

JULI: Jailbreak Large Language Models by Self-Introspection

Jesson Wang, Zhanhao Hu, David Wagner

Published: 2025.5.17

Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as necessitating access to the model weights or the generation process. Since proprietary models through API-calling do not grant users such permissions, these attacks find it challenging to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities, using a tiny plug-in block, BiasNet. JULI relies solely on the knowledge of the target LLM's predicted token log probabilities. It can effectively jailbreak API-calling LLMs under a black-box setting and knowing only top-5 token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing state-of-the-art (SOTA) approaches across multiple metrics.

Bypassing LLM safety mechanisms, prompt injection, API security

SafeDPO: A simple approach to direct preference optimization with enhanced safety

G.-H. Kim, Y. J. Kim, B. Kim, H. Lee, K. Bae, Y. Jang, M. Lee

Conference on Neural Information Processing Systems (NeurIPS)

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Published: 2023.5.30

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

Reward mechanism design, alignment, reinforcement learning optimization

arXiv preprint

OpenCodeInstruct: A large-scale instruction tuning dataset for code LLMs

W. U. Ahmad, A. Ficek, M. Samadi, J. Huang, V. Noroozi, S. Majumdar, B. Ginsburg

Safety tax: Safety alignment makes your large reasoning models less reasonable

T. Huang, S. Hu, F. Ilhan, S. F. Tekin, Z. Yahn, Y. Xu, L. Liu

MiniMax M2.5: Built for real-world productivity

MiniMax

gpt-oss-120b & gpt-oss-20b model card

arXiv

OpenAI, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck