評価手法

DeepSight: An All-in-One LM Safety Toolkit

Authors: Bo Zhang, Jiaxuan Guo, Lijun Li, Dongrui Liu, Sujin Chen, Guanxu Chen, Zhijie Zheng, Qihao Lin, Lewen Yan, Chen Qian, Yijin Zhou, Yuyao Wu, Shaoxiong Guo, Tianyi Du, Jingyi Yang, Xuhao Hu, Ziqi Miao, Xiaoya Lu, Jing Shao, Xia Hu | Published: 2026-02-12
プロンプトインジェクション
大規模言語モデル
評価手法

Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models

Authors: Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis | Published: 2026-02-12
プロンプトインジェクション
実験的検証
評価手法

TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection

Authors: Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Zou, Ling Lo, Sheng-Ping Yang, Yu-Wen Tseng, Kun-Hsiang Lin, Chia-Ling Chen, Yu-Ting Ta, Yan-Tsung Wang, Po-Ching Chen, Hongxia Xie, Hong-Han Shuai, Wen-Huang Cheng | Published: 2025-12-11
ハルシネーションの検知
モデルDoS
評価手法

LLM-Assisted AHP for Explainable Cyber Range Evaluation

Authors: Vyron Kampourakis, Georgios Kavallieratos, Georgios Spathoulas, Vasileios Gkioulos, Sokratis Katsikas | Published: 2025-12-11
XAI(説明可能なAI)
信頼性評価
評価手法

From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection

Authors: Chaomeng Lu, Bert Lagaisse | Published: 2025-12-11
モデルの頑健性保証
出力の有害度の算出
評価手法

When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

Authors: Devanshu Sahoo, Manish Prasad, Vasudev Majhi, Jahnvi Singh, Vinay Chamola, Yash Sinha, Murari Mandal, Dhruv Kumar | Published: 2025-12-11
インダイレクトプロンプトインジェクション
敵対的攻撃分析
評価手法

RHINO: Guided Reasoning for Mapping Network Logs to Adversarial Tactics and Techniques with Large Language Models

Authors: Fanchao Meng, Jiaping Gui, Yunbo Li, Yue Wu | Published: 2025-10-16
ネットワークトラフィック分析
バックドアモデルの検知
評価手法

Variables Ordering Optimization in Boolean Characteristic Set Method Using Simulated Annealing and Machine Learning-based Time Prediction

Authors: Minzhong Luo, Yudong Sun, Yin Long | Published: 2025-09-18
アルゴリズム
最適化手法
評価手法

ATLANTIS: AI-driven Threat Localization, Analysis, and Triage Intelligence System

Authors: Taesoo Kim, HyungSeok Han, Soyeon Park, Dae R. Jeong, Dohyeok Kim, Dongkwan Kim, Eunsoo Kim, Jiho Kim, Joshua Wang, Kangsu Kim, Sangwoo Ji, Woosun Song, Hanqing Zhao, Andrew Chin, Gyejin Lee, Kevin Stevens, Mansour Alharthi, Yizhuo Zhai, Cen Zhang, Joonun Jang, Yeongjin Jang, Ammar Askar, Dongju Kim, Fabian Fleischer, Jeongin Cho, Junsik Kim, Kyungjoon Ko, Insu Yun, Sangdon Park, Dowoo Baik, Haein Lee, Hyeon Heo, Minjae Gwon, Minjae Lee, Minwoo Baek, Seunggi Min, Wonyoung Kim, Yonghwi Jin, Younggi Park, Yunjae Choi, Jinho Jung, Gwanhyun Lee, Junyoung Jang, Kyuheon Kim, Yeonghyeon Cha, Youngjoon Kim | Published: 2025-09-18
セキュリティ分析
バグ修正手法
評価手法

LLM Jailbreak Detection for (Almost) Free!

Authors: Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, Jindong Gu | Published: 2025-09-18
大規模言語モデル
評価手法
透かし技術