Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs

Abstract

With the recent unprecedented advancements in Artificial Intelligence (AI) computing, progress in Large Language Models (LLMs) is accelerating rapidly, presenting challenges in establishing clear guidelines, particularly in the field of security. To that end, we thoroughly identify and describe three main technical challenges in the security and software engineering literature that span the entire LLM workflow, namely: (i) Data Collection and Labeling; (ii) System Design and Learning; and (iii) Performance Evaluation. Building upon these challenges, this paper introduces SecRepair, an instruction-based LLM system designed to reliably identify, describe, and automatically repair vulnerable source code. Our system is accompanied by a list of actionable guides on (i) Data Preparation and Augmentation Techniques; (ii) Selecting and Adapting State-of-the-Art LLM Models; and (iii) Evaluation Procedures. SecRepair uses reinforcement learning-based fine-tuning with a semantic reward that caters to the functionality and security aspects of the generated code. Our empirical analysis shows that SecRepair achieves a 12% improvement in security code repair compared to other LLMs when trained using reinforcement learning. Furthermore, we demonstrate the capabilities of SecRepair in generating reliable, functional, and compilable security code repairs against real-world test cases using automated evaluation metrics.
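The abstract's mention of a semantic reward covering both functionality and security can be pictured as a weighted combination of two scores used as the scalar reward during RL fine-tuning. The sketch below is purely illustrative: the scoring functions, weights, and heuristics are assumptions for demonstration, not the paper's actual reward design.

```python
# Hypothetical sketch of a combined semantic reward. In practice the
# functionality score might come from compiling/testing the candidate
# repair, and the security score from a static analyzer; here both are
# mocked with trivial string heuristics.

def functionality_score(code: str) -> float:
    # Placeholder check: treat code containing a return statement as "functional".
    return 1.0 if "return" in code else 0.0

def security_score(code: str) -> float:
    # Placeholder check: penalize use of the unsafe C API strcpy().
    return 0.0 if "strcpy" in code else 1.0

def semantic_reward(code: str, w_func: float = 0.5, w_sec: float = 0.5) -> float:
    # Weighted sum fed back as the scalar reward during RL fine-tuning.
    return w_func * functionality_score(code) + w_sec * security_score(code)

safe = 'char buf[16]; snprintf(buf, sizeof buf, "%s", src); return buf;'
unsafe = 'char buf[16]; strcpy(buf, src); return buf;'
print(semantic_reward(safe))    # full reward: functional and secure
print(semantic_reward(unsafe))  # reduced reward: insecure API use
```

A real implementation would replace the heuristics with concrete signals (compilation success, unit-test pass rate, CWE detections), but the reward shape stays the same: a single scalar balancing correctness against security.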
