With the increasing adoption of Large Language Models (LLMs), more
customization is needed to ensure privacy-preserving and safe generation. We
address this objective from two critical aspects: unlearning of sensitive
information and robustness to jail-breaking attacks. We investigate various
constrained optimization formulations that address both aspects in a
\emph{unified manner}, by finding the smallest possible interventions on LLM
weights that either make a given vocabulary set unreachable or embed the LLM
with robustness to tailored attacks by shifting part of the weights to a
\emph{safer} region. Beyond unifying two key properties, this approach
contrasts with previous work in that it doesn't require an oracle classifier
that is typically not available or represents a computational overhead.
Surprisingly, we find that the simplest point-wise constraint-based
intervention we propose leads to better performance than max-min interventions,
while having a lower computational cost. Comparison against state-of-the-art
defense methods demonstrates superior performance of the proposed approach.