Inappropriate Content Generation

JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

Authors: Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang | Published: 2025-08-07

Prompt Injection

Inappropriate Content Generation

Attack Strategy Analysis

2025.08.07

Literature Database

Using Hallucinations to Bypass GPT4’s Filter

Authors: Benjamin Lemkin | Published: 2024-02-16 | Updated: 2024-03-11

LLM Security

Prompt Injection

Inappropriate Content Generation

2024.02.16 2025.04.03

Literature Database

Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs

Authors: Zhuo Zhang, Guangyu Shen, Guanhong Tao, Siyuan Cheng, Xiangyu Zhang | Published: 2023-12-08

LLM Security

Prompt Injection

Inappropriate Content Generation

2023.12.08 2025.04.03

Literature Database

Comprehensive Assessment of Toxicity in ChatGPT

Authors: Boyang Zhang, Xinyue Shen, Wai Man Si, Zeyang Sha, Zeyuan Chen, Ahmed Salem, Yun Shen, Michael Backes, Yang Zhang | Published: 2023-11-03

Misuse of AI Chatbots

Prompt Injection

Inappropriate Content Generation

2023.11.03 2025.04.03

Literature Database

Watch Your Language: Investigating Content Moderation with Large Language Models

Authors: Deepak Kumar, Yousef AbuHashem, Zakir Durumeric | Published: 2023-09-25 | Updated: 2024-01-17

LLM Performance Evaluation

Prompt Injection

Inappropriate Content Generation

2023.09.25 2025.04.03

Literature Database

Can LLM-Generated Misinformation Be Detected?

Authors: Canyu Chen, Kai Shu | Published: 2023-09-25 | Updated: 2024-04-23

LLM Performance Evaluation

Prompt Injection

Inappropriate Content Generation

2023.09.25 2025.04.03

Literature Database

Universal and Transferable Adversarial Attacks on Aligned Language Models

Authors: Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson | Published: 2023-07-27 | Updated: 2023-12-20

LLM Security

Prompt Injection

Inappropriate Content Generation

2023.07.27 2025.04.03

Literature Database

Unveiling Security, Privacy, and Ethical Concerns of ChatGPT

Authors: Xiaodong Wu, Ran Duan, Jianbing Ni | Published: 2023-07-26

LLM Security

Prompt Injection

Inappropriate Content Generation

2023.07.26 2025.04.03

Literature Database

Visual Adversarial Examples Jailbreak Aligned Large Language Models

Authors: Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal | Published: 2023-06-22 | Updated: 2023-08-16

Prompt Injection

Inappropriate Content Generation

Adversarial Attack

2023.06.22 2025.04.03

Literature Database