Abstract
Multi-agent debate (MAD) systems leverage collaborative interactions among
large language model (LLM) agents to improve reasoning capabilities. While
recent studies have focused on increasing the accuracy and scalability of MAD
systems, their security vulnerabilities have received limited attention. In
this work, we introduce MAD-Spear, a targeted prompt injection attack that
compromises a small subset of agents but significantly disrupts the overall MAD
process. Manipulated agents produce multiple plausible yet incorrect responses,
exploiting LLMs' conformity tendencies to propagate misinformation and degrade
consensus quality. Furthermore, the attack can be composed with other
strategies, such as communication attacks, to further amplify its impact by
increasing the exposure of agents to incorrect responses. To assess MAD's
resilience under attack, we propose a formal definition of MAD fault-tolerance
and develop a comprehensive evaluation framework that jointly considers
accuracy, consensus efficiency, and scalability. Extensive experiments on five
benchmark datasets with varying difficulty levels demonstrate that MAD-Spear
consistently outperforms the baseline attack in degrading system performance.
Additionally, we observe that agent diversity substantially improves MAD
performance in mathematical reasoning tasks, which challenges prior work
suggesting that agent diversity has minimal impact on performance. These
findings highlight the urgent need to improve security in MAD design.