Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code
Abstract
Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs to generate malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.
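The abstract does not spell out how GCD interacts with refusals, so the following is only a toy sketch of the general token-masking idea behind grammar-constrained decoding, not the paper's CodeSpear or CodeShield implementation. The vocabulary, the scores, and the names `constrained_greedy_step`, `toy_logits`, and `code_start_tokens` are all hypothetical. It illustrates one plausible reading of the abstract: when the grammar admits only code tokens, a natural-language refusal token cannot be selected even if the model scores it highest.

```python
import math


def constrained_greedy_step(logits, allowed):
    """Greedy decoding step that masks every token the grammar disallows.

    logits:  dict mapping token -> score (hypothetical model output)
    allowed: set of tokens the grammar accepts at this position
    """
    masked = {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }
    return max(masked, key=masked.get)


# Hypothetical next-token scores at the first decoding step (assumed values).
toy_logits = {"Sorry": 3.0, "def": 1.5, "import": 1.0, "Sure": 0.2}

# Toy stand-in for a code grammar: the output must start with a code keyword.
code_start_tokens = {"def", "import"}

# Without a constraint, the top-scoring token is the refusal opener.
unconstrained = max(toy_logits, key=toy_logits.get)

# With the constraint, the refusal opener is masked out before selection.
constrained = constrained_greedy_step(toy_logits, code_start_tokens)

print(unconstrained, constrained)
```

In this sketch the mask is applied before token selection, which is why the constrained step can only ever return a token from the allowed set. Real GCD engines track a parser state and recompute the allowed set at every step; that bookkeeping is omitted here.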