AI Security Portal K Program
HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data
Abstract
Large language models (LLMs) have shown great potential for automatic code generation and form the basis for various tools such as GitHub Copilot. However, recent studies highlight that much LLM-generated code contains serious security vulnerabilities. While previous work tries to address this by training models that generate secure code, these attempts remain constrained by limited access to training data and labor-intensive data preparation. In this paper, we introduce HexaCoder, a novel approach that enhances the ability of LLMs to generate secure code by automatically synthesizing secure code, which reduces the effort of finding suitable training data. HexaCoder comprises two key components: an oracle-guided data synthesis pipeline and a two-step process for secure code generation. The data synthesis pipeline generates pairs of vulnerable and fixed code for specific Common Weakness Enumeration (CWE) types. A security oracle identifies vulnerabilities, and a state-of-the-art LLM repairs them by extending and/or editing the code, creating data pairs for fine-tuning with the Low-Rank Adaptation (LoRA) method. Each example in our fine-tuning dataset includes the necessary security-related libraries and the code itself, which together form the basis of our novel two-step generation approach. This allows the model to integrate security-relevant libraries before generating the main code, reducing the number of generated vulnerable code samples by up to 85% compared to the baseline methods. We perform extensive evaluations on three different benchmarks for four LLMs, demonstrating that HexaCoder not only improves the security of the generated code but also maintains a high level of functional correctness.
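The two components described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Finding`, `TrainingPair`, `synthesize_pairs`, and `two_step_generate` are hypothetical, as are the prompt strings and the retry limit. Re-running the oracle on each repair before accepting it is also an assumption, since the abstract does not say how a fix is validated. The oracle, the repair model, and the code model are passed in as plain callables, and the toy stand-ins exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    cwe: str       # e.g. "CWE-502"
    message: str   # human-readable oracle report


@dataclass
class TrainingPair:
    cwe: str
    vulnerable: str
    fixed: str


def synthesize_pairs(
    samples: List[str],
    oracle: Callable[[str], List[Finding]],
    repair: Callable[[str, List[Finding]], str],
    max_attempts: int = 3,  # assumed retry budget, not from the paper
) -> List[TrainingPair]:
    """Keep (vulnerable, fixed) pairs whose fix the oracle no longer flags."""
    pairs = []
    for vulnerable in samples:
        initial = oracle(vulnerable)
        if not initial:
            continue  # the oracle found nothing to repair
        candidate, report = vulnerable, initial
        for _ in range(max_attempts):
            candidate = repair(candidate, report)
            report = oracle(candidate)
            if not report:
                pairs.append(TrainingPair(initial[0].cwe, vulnerable, candidate))
                break
    return pairs


def two_step_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Step 1 asks for security-relevant imports, step 2 for the main code."""
    libraries = generate(f"{prompt}\n# Step 1: security-relevant imports\n")
    body = generate(f"{libraries}\n{prompt}\n# Step 2: main code\n")
    return f"{libraries}\n{body}"


# Toy stand-ins so the sketch runs without a real analyzer or LLM.
def toy_oracle(code: str) -> List[Finding]:
    if "yaml.load(" in code:
        return [Finding("CWE-502", "unsafe YAML deserialization")]
    return []


def toy_repair(code: str, findings: List[Finding]) -> str:
    return code.replace("yaml.load(", "yaml.safe_load(")


def toy_generate(prompt: str) -> str:
    if "Step 1" in prompt:
        return "import yaml"
    return "config = yaml.safe_load(text)"
```

Calling `synthesize_pairs(samples, toy_oracle, toy_repair)` on a list of code strings returns only the samples that the toy oracle flagged and the toy repair fixed. In a real setup the callables would wrap a static analyzer and an LLM, and the resulting pairs would feed LoRA fine-tuning.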