Abstract
Large Language Models (LLMs) have been equipped with safety mechanisms to
prevent harmful outputs, but these guardrails can often be bypassed through
"jailbreak" prompts. This paper introduces a novel graph-based approach to
systematically generate jailbreak prompts through semantic transformations. We
represent malicious prompts as nodes in a graph structure with edges denoting
different transformations, leveraging Abstract Meaning Representation (AMR) and
Resource Description Framework (RDF) to parse user goals into semantic
components that can be manipulated to evade safety filters. We show that one
especially effective attack vector is to instruct LLMs to generate code
that realizes the intent described in these semantic graphs, achieving success
rates of up to 87% against leading commercial LLMs. Our analysis reveals that
contextual framing and abstraction are particularly effective at circumventing
safety measures, highlighting critical gaps in current safety alignment
techniques that focus primarily on surface-level patterns. These findings
provide insights for developing more robust safeguards against structured
semantic attacks. Our research contributes both a theoretical framework and a
practical methodology for systematically stress-testing LLM safety mechanisms.