Rethinking How to Evaluate Language Model Jailbreak
Abstract
Large language models (LLMs) have become increasingly integrated with various applications. To ensure that LLMs do not generate unsafe responses, they are aligned with safeguards that specify what content is restricted. However, such alignment can be bypassed to produce prohibited content using a technique commonly referred to as jailbreak. Different systems have been proposed to perform the jailbreak automatically. These systems rely on evaluation methods to determine whether a jailbreak attempt is successful. However, our analysis reveals that current jailbreak evaluation methods have two limitations. (1) Their objectives lack clarity and do not align with the goal of identifying unsafe responses. (2) They oversimplify the jailbreak result as a binary outcome, successful or not. In this paper, we propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak. Additionally, we demonstrate how these metrics correlate with the goal of different malicious actors. To compute these metrics, we introduce a multifaceted approach that extends the natural language generation evaluation method after preprocessing the response. We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems. The benchmark dataset is labeled by three annotators. We compare our multifaceted approach with three existing jailbreak evaluation methods. Experiments demonstrate that our multifaceted evaluation outperforms existing methods, with F1 scores improving on average by 17% compared to existing baselines. Our findings motivate the need to move away from the binary view of the jailbreak problem and incorporate a more comprehensive evaluation to ensure the safety of the language model.
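The abstract's two main ideas, scoring a response on three separate metrics rather than one binary flag and comparing an evaluator against labels from three annotators via F1, can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the `JailbreakVerdict` class, the `majority_vote` aggregation rule, and the toy labels are all assumptions introduced here, and the abstract does not say how annotator labels were actually combined.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JailbreakVerdict:
    """Hypothetical container: one flag per metric named in the abstract."""
    safeguard_violation: bool
    informativeness: bool
    relative_truthfulness: bool


def majority_vote(labels: Sequence[bool]) -> bool:
    """Assumed aggregation rule: True only if more than half the annotators say True."""
    return sum(labels) * 2 > len(labels)


def f1_score(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Standard F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum((not p) and g for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): three annotator labels per response for one metric.
annotations = [
    [True, True, False],
    [False, True, False],
    [False, False, False],
    [True, True, True],
]
gold = [majority_vote(votes) for votes in annotations]

# Made-up evaluator output for the same four responses.
verdicts = [
    JailbreakVerdict(True, True, False),
    JailbreakVerdict(True, False, False),
    JailbreakVerdict(False, False, False),
    JailbreakVerdict(True, True, True),
]
predicted = [v.safeguard_violation for v in verdicts]

print(f"F1 on safeguard violation: {f1_score(predicted, gold):.2f}")
```

Keeping the three flags separate means each metric can be scored against its own gold labels, so a response that violates the safeguard but is uninformative is not collapsed into a single "successful" outcome.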