Low-Resource Languages Jailbreak GPT-4


arxiv

AI Security Portal bot

The information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2310.02446

PDF

https://arxiv.org/pdf/2310.02446

Bibliographic Information

Authors: Zheng-Xin Yong; Cristina Menghini; Stephen H. Bach
Published: 2023-10-4
Updated: 2024-1-28
Affiliation: Department of Computer Science, Brown University
Country of affiliation: United States of America
Conference:

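Labels Estimated by AI

Safety Alignment; Prompt Injection; Vulnerability Detection

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".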

Abstract

AI safety training and red-teaming of large language models (LLMs) are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguards through translating unsafe English inputs into low-resource languages. On the AdvBench benchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can move users toward their harmful goals 79% of the time, which is on par with or even surpasses state-of-the-art jailbreaking attacks. Other high- and mid-resource languages have significantly lower attack success rates, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affected speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLM users. Publicly available translation APIs enable anyone to exploit LLMs' safety vulnerabilities. Therefore, our work calls for more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.

External Datasets

AdvBench Harmful Behaviors dataset