User authorization-based access privileges are a key feature of many
safety-critical systems but have not been studied extensively in the context
of large language models (LLMs). In this work, drawing inspiration from such access
control systems, we introduce sudoLLM, a novel framework that results in
multi-role aligned LLMs, i.e., LLMs that account for, and behave in accordance
with, user access rights. sudoLLM injects subtle user-based biases into queries
and trains the LLM to use this bias signal to produce sensitive
information if and only if the user is authorized. We present empirical results
demonstrating that this approach exhibits substantially improved alignment,
generalization, and resistance to prefix-based jailbreaking attacks, and that
it ``fails closed''. The persistent tension between the language modeling
objective and safety alignment, which is often exploited to jailbreak LLMs, is
partially resolved with the aid of the injected bias signal. Our framework is
intended as an additional layer of security and complements existing guardrail
mechanisms to improve end-to-end safety with LLMs.
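
To make the mechanism concrete, the following is a minimal sketch of the general idea, not the paper's implementation: it assumes a keyed, server-side signal derived per user, and the names \texttt{SECRET\_KEY}, \texttt{bias\_signal}, and \texttt{inject\_bias} are hypothetical. The explicit prefix stands in for the subtler bias embedding described above.

\begin{verbatim}
import hashlib
import hmac

# Hypothetical server-side key; held by the trusted query pipeline,
# never exposed to the user.
SECRET_KEY = b"server-side-secret"

def bias_signal(user_id: str, authorized: bool) -> str:
    """Derive a deterministic, hard-to-guess tag for authorized users."""
    if not authorized:
        return ""
    mac = hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256)
    return mac.hexdigest()[:8]

def inject_bias(query: str, user_id: str, authorized: bool) -> str:
    """Embed the signal into the query before it reaches the LLM.

    An explicit prefix is used here for readability; a deployed system
    would embed the signal far more subtly.
    """
    tag = bias_signal(user_id, authorized)
    return f"[uid:{user_id}|sig:{tag}] {query}" if tag else query

print(inject_bias("Describe procedure X.", "alice", authorized=True))
print(inject_bias("Describe procedure X.", "mallory", authorized=False))
\end{verbatim}

Under such a scheme, fine-tuning would pair signal-bearing queries with sensitive completions and all other queries with refusals, so the model learns to fail closed whenever the signal is absent or malformed.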