Reinforcement Learning

Robust Q-Learning under Corrupted Rewards

Authors: Sreejeet Maity, Aritra Mitra | Published: 2024-09-05
Algorithm
Convergence Guarantee
Reinforcement Learning

Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?

Authors: Mohammad Bahrami Karkevandi, Nishant Vishwamitra, Peyman Najafirad | Published: 2024-08-05
Prompt Injection
Reinforcement Learning
Adversarial Example

RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs

Authors: Xuan Chen, Yuzhou Nie, Lu Yan, Yunshu Mao, Wenbo Guo, Xiangyu Zhang | Published: 2024-06-13
LLM Security
Prompt Injection
Reinforcement Learning

CovRL: Fuzzing JavaScript Engines with Coverage-Guided Reinforcement Learning for LLM-based Mutation

Authors: Jueon Eom, Seyeon Jeong, Taekyoung Kwon | Published: 2024-02-19
Fuzzing
Reinforcement Learning
Evaluation Method

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez | Published: 2024-01-10 | Updated: 2024-01-17
Backdoor Attack
Prompt Injection
Reinforcement Learning

Reinforcement Unlearning

Authors: Dayong Ye, Tianqing Zhu, Congcong Zhu, Derui Wang, Kun Gao, Zewei Shi, Sheng Shen, Wanlei Zhou, Minhui Xue | Published: 2023-12-26 | Updated: 2024-09-09
Robustness
Reinforcement Learning
Complexity of the Environment

The Power of MEME: Adversarial Malware Creation with Model-Based Reinforcement Learning

Authors: Maria Rigaki, Sebastian Garcia | Published: 2023-08-31
Reinforcement Learning
Malicious Demo Construction
Adversarial Attack

Robust Lipschitz Bandits to Adversarial Corruptions

Authors: Yue Kang, Cho-Jui Hsieh, Thomas C. M. Lee | Published: 2023-05-29 | Updated: 2023-10-08
Reinforcement Learning
Adversarial Attack
Machine Learning Method

Attacks on Online Learners: a Teacher-Student Analysis

Authors: Riccardo Giuseppe Margiotta, Sebastian Goldt, Guido Sanguinetti | Published: 2023-05-18 | Updated: 2023-10-29
Backdoor Attack
Reinforcement Learning
Adversarial Example

ANALYSE — Learning to Attack Cyber-Physical Energy Systems With Intelligent Agents

Authors: Thomas Wolgast, Nils Wenninghoff, Stephan Balduin, Eric Veith, Bastian Fraune, Torben Woltjen, Astrid Nieße | Published: 2023-04-21
Cyber Attack
Reinforcement Learning
Attack Scenario Analysis