Content Moderation

BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards

Authors: Diego Dorn, Alexandre Variengien, Charbel-Raphaël Segerie, Vincent Corruble | Published: 2024-06-03

LLM Security

Content Moderation

Prompt Injection

2024.06.03 2025.04.03

Literature Database

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

Authors: Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang | Published: 2024-05-24 | Updated: 2024-10-11

Content Moderation

Prompt Injection

Ethical Guideline Compliance

2024.05.24 2025.04.03

Literature Database

Cross-Task Defense: Instruction-Tuning LLMs for Content Safety

Authors: Yu Fu, Wen Xiao, Jia Chen, Jiachen Li, Evangelos Papalexakis, Aichi Chien, Yue Dong | Published: 2024-05-24

Content Moderation

Prompt Injection

Defense Methods

2024.05.24 2025.04.03

Literature Database

TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning

Authors: Xuanli He, Jun Wang, Qiongkai Xu, Pasquale Minervini, Pontus Stenetorp, Benjamin I. P. Rubinstein, Trevor Cohn | Published: 2024-04-30 | Updated: 2025-03-17

Content Moderation

Backdoor Attacks

Prompt Injection

2024.04.30 2025.04.03

Literature Database

Deepfakes, Misinformation, and Disinformation in the Era of Frontier AI, Generative AI, and Large AI Models

Authors: Mohamed R. Shoaib, Zefan Wang, Milad Taleby Ahvanooey, Jun Zhao | Published: 2023-11-29

Role of AI and Automation

Content Moderation

Privacy Protection

2023.11.29 2025.04.03

Literature Database

TrollHunter [Evader]: Automated Detection [Evasion] of Twitter Trolls During the COVID-19 Pandemic

Authors: Peter Jachim, Filipo Sharevski, Paige Treebridge | Published: 2020-12-04 | Updated: 2020-12-07

Content Moderation

Security Analysis

Adversarial Learning

2020.12.04 2025.04.03

Literature Database

Automatic Detection of Online Jihadist Hate Speech

Authors: Tom De Smedt, Guy De Pauw, Pieter Van Ostaeyen | Published: 2018-03-13

Content Moderation

Profile Characteristic Analysis

User Behavior Analysis

2018.03.13 2025.04.03

Literature Database