Does Teaming-Up LLMs Improve Secure Code Generation? A Comprehensive Evaluation with Multi-LLMSecCodeEval

Authors: Bushra Sabir, Shigang Liu, Seung Ick Jang, Sharif Abuadbba, Yansong Gao, Kristen Moore, SangCheol Kim, Hyoungshick Kim, Surya Nepal | Published: 2026-03-24

2026.03.242026.03.26

Authors: Bushra Sabir, Shigang Liu, Seung Ick Jang, Sharif Abuadbba, Yansong Gao, Kristen Moore, SangCheol Kim, Hyoungshick Kim, Surya Nepal
Published: 2026-03-24

Source: https://arxiv.org/abs/2603.22717

PDF: https://arxiv.org/pdf/2603.22717

Labels Predicted by AI

LLM Performance Evaluation Model DoS Adversarial Learning

Please note that these labels were automatically added by AI. Therefore, they may not be entirely accurate.
For more details, please see the About the Literature Database page.

Abstract

Automatically generating source code from natural language using large language models (LLMs) is becoming common, yet security vulnerabilities persist despite advances in fine tuning and prompting. In this work, we systematically evaluate whether multi LLM ensembles and collaborative strategies can meaningfully improve secure code generation. We present MULTI-LLMSECCODEEVAL, a framework for assessing and enhancing security across the vulnerability management lifecycle by combining multiple LLMs with static analysis and structured collaboration. Using SecLLMEval and SecLLMHolmes, we benchmark ten pipelines spanning single model, ensemble, collaborative, and hybrid designs. Our results show that ensemble pipelines augmented with static analysis improve secure code generation over single LLM baselines by up to 47.3