Abstract
Large language models are aligned to be safe, preventing users from
generating harmful content like misinformation or instructions for illegal
activities. However, previous work has shown that the alignment process is
vulnerable to poisoning attacks. Adversaries can manipulate the safety training
data to inject backdoors that act like a universal sudo command: adding the
backdoor string to any prompt enables harmful responses from models that
otherwise behave safely. Our competition, co-located with IEEE SaTML 2024,
challenged participants to find universal backdoors in several large language
models. This report summarizes the key findings and promising ideas for future
research.