Abstract
Large language models (LLMs) are increasingly applied across diverse
domains, raising concerns about their safety in specialized domains such as
medicine. Despite prior explorations of general jailbreaking attacks, two
challenges hinder applying existing attacks to test the domain-specific
safety of LLMs: (1) a lack of professional knowledge-driven attacks, and
(2) insufficient coverage of domain knowledge. To bridge this gap, we
propose a new task, knowledge-to-jailbreak, which aims to generate jailbreaking
attacks from domain knowledge, requiring both attack effectiveness and
knowledge relevance. We collect a large-scale dataset of 12,974
knowledge-jailbreak pairs and fine-tune a large language model as a
jailbreak-generator that produces jailbreaks tailored to given domain knowledge.
Experiments on 13 domains and 8 target LLMs demonstrate that the
jailbreak-generator produces jailbreaks that are both threatening to the
target LLMs and relevant to the given knowledge. We also apply our method to an
out-of-domain knowledge base and find that the jailbreak-generator generates
jailbreaks comparable in harmfulness to those crafted by human
experts. Data and code are available at:
https://github.com/THU-KEG/Knowledge-to-Jailbreak/.