Rethinking How to Evaluate Language Model Jailbreak

How to use ChatGPT to summarize an article

An overview of Bard: an early experiment with generative AI

Universe 2023: Copilot transforms GitHub into the AI-powered developer platform

Gemma: Introducing new state-of-the-art open models

Introducing Meta Llama 3: The most capable openly available LLM to date

Usage policies — openai.com

USENIX submission replication

Computing Research Repository (CoRR)

Cited by: 21

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan

Published: 2022.12.15

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

Topics: performance evaluation, alignment, prompt injection

ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee, Alon Lavie

Published: 2005

Current Problems in Cardiology

Large language model-based chatbot as a source of advice on first aid in heart attack

Alexei A Birkun, Adhish Gautam

Published: 2023

OpenAI Technical Report

Language models are few-shot learners

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei

Published: 2020

Jailbreaking black box large language models in twenty queries

Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., Wong, E.

Knowledge graphs meet multimodal learning: A comprehensive survey

Zhuo Chen, Yichi Zhang, Yin Fang, Yuxia Geng, Lingbing Guo

Published: 2024

Advances in neural information processing systems

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

Published: 2017

Proceedings of NAACL-HLT

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Published: 2019

GPTScore: Evaluate as you desire

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu

Published: 2023

International Conference on Learning Representations (ICLR)

Cited by: 5

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen

Published: 2023.10.11

The rapid progress in open-source large language models (LLMs) is significantly advancing AI development. Extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks". These jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. In this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods. By exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we increase the misalignment rate from 0% to more than 95% across 11 language models including LLaMA2, Vicuna, Falcon, and MPT families, outperforming state-of-the-art attacks with $30\times$ lower computational cost. Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack. Altogether, our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs, strongly advocating for more comprehensive red teaming and better alignment before releasing such models. Our code is available at https://github.com/Princeton-SysML/Jailbreak_LLM.

Topics: prompt injection, attack evaluation, adversarial attacks

Proceedings of the IEEE European Symposium on Security and Privacy

AOT - Attack on things: A security analysis of IoT firmware updates

Muhammad Ibrahim, Andrea Continella, Antonio Bianchi

Published: 2023

Proceedings of the Annual International Conference on Mobile Systems, Applications, and Services

SafetyNOT: On the usage of the SafetyNet attestation API in Android

Muhammad Ibrahim, Abdullah Imran, Antonio Bianchi

Published: 2021

OpenAI

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever

Published: 2018

Human Communication Research

Reliability in content analysis: Some common misconceptions and recommendations

Klaus Krippendorff

Published: 2004

Text summarization branches out

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin

Published: 2004

TruthfulQA: Measuring how models mimic human falsehoods

S. Lin, J. Hilton, O. Evans

IEEE International Conference on Data Engineering

PinSQL: Pinpoint root cause SQLs to resolve performance issues in cloud databases

Xiaoze Liu, Zheng Yin, Chao Zhao, Congcong Ge, Lu Chen

Published: 2022

Jailbreaking ChatGPT via prompt engineering: An empirical study

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, Yang Liu

Proceedings of the Annual Meeting of the Association for Computational Linguistics

Towards an automatic Turing test: Learning to evaluate dialogue responses

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau

Published: 2017

Forty-first International Conference on Machine Learning

HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B.

Published: 2024

Conference on Neural Information Processing Systems (NeurIPS)

Cited by: 43

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

Published: 2022.3.4

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

Topics: alignment, performance evaluation, user behavior analysis

Proceedings of the 40th annual meeting of the Association for Computational Linguistics

BLEU: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu

Published: 2002

International Conference on Learning Representations (ICLR)

Cited by: 1

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson

Published: 2023.10.6

Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5 Turbo on custom datasets also encourage this practice. But, what are the safety costs associated with such custom fine-tuning? We note that while existing safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI's APIs, making the model responsive to nearly any harmful instructions. Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing -- even if a model's initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning. We outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.

Topics: prompt injection, information-gathering methods, data collection

NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails

T. Rebedea, R. Dinu, M. Sreedhar, C. Parisien, J. Cohen

Published: 2023

Code Llama: Open foundation models for code

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, et al.

BLEURT: Learning robust metrics for text generation

T. Sellam, D. Das, A. P. Parikh

Research Square

Evaluation of ChatGPT’s responses to information needs and information seeking of dementia patients

Hamid Reza Saeidnia, Marcin Kozak, Brady D Lund, Mohammad Hassanzadeh

Published: 2023

AttackEval: How to evaluate the effectiveness of jailbreak attacking on large language models

Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou

Published: 2024

arXiv

Llama 2: Open foundation and fine-tuned chat models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi

Published: 2023

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Concealed data poisoning attacks on NLP models

Eric Wallace, Tony Zhao, Shi Feng, Sameer Singh

Published: 2021

International Conference on Machine Learning

Poisoning language models during instruction tuning

Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein

Published: 2023

Survey on factuality in large language models: Knowledge, retrieval and domain-specificity

Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang

Published: 2023

Real-time workload pattern analysis for large-scale cloud databases

Jiaqi Wang, Tianyi Li, Anni Wang, Xiaoze Liu, Lu Chen

Published: 2023

Jailbroken: How Does LLM Safety Training Fail?

Cited by: 27

Alexander Wei, Nika Haghtalab, Jacob Steinhardt

Published: 2023.7.6

Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. We hypothesize two failure modes of safety training: competing objectives and mismatched generalization. Competing objectives arise when a model's capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. We use these failure modes to guide jailbreak design and then evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly designed attacks. We find that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. Notably, new attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests from the models' red-teaming evaluation sets and outperform existing ad hoc jailbreaks. Our analysis emphasizes the need for safety-capability parity -- that safety mechanisms should be as sophisticated as the underlying model -- and argues against the idea that scaling alone can resolve these safety failure modes.

Topics: adversarial attack methods, prompt injection, security assurance

Proceedings of CVPR

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu

Published: 2024

Proceedings of the Conference on Empirical Methods in Natural Language Processing

TrojanSQL: SQL injection against natural language interface to database

Jinchuan Zhang, Yan Zhou, Binyuan Hui, Yaxin Liu, Ziming Li, Songlin Hu

Published: 2023

arXiv

BERTScore: Evaluating text generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, Yoav Artzi

Published: 2019

Advances in Neural Information Processing Systems

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al.

Published: 2024

Fine-tuning language models from human preferences

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., Irving, G.

Published: 2019