Large Language Models Cannot Reliably Detect Vulnerabilities in JavaScript: The First Systematic Benchmark and Evaluation | Authors: Qingyuan Fei, Xin Liu, Song Li, Shujiang Wu, Jianwei Hou, Ping Chen, Zifeng Kang | Published: 2025-12-01 | Tags: Cybersecurity, Data-Driven Vulnerability Assessment, Hallucination
Auditing M-LLMs for Privacy Risks: A Synthetic Benchmark and Evaluation Framework | Authors: Junhao Li, Jiahao Chen, Zhou Feng, Chunyi Zhou | Published: 2025-11-05 | Tags: Hallucination, Privacy Violation, Privacy Protection
SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From | Authors: Yao Tong, Haonan Wang, Siquan Li, Kenji Kawaguchi, Tianyang Hu | Published: 2025-09-30 | Tags: Token Distribution Analysis, Hallucination, Model Performance Evaluation
Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark | Authors: Xinjie Shen, Mufei Li, Pan Li | Published: 2025-09-27 | Updated: 2025-10-13 | Tags: Hallucination, Privacy Enhancing Technology, Ethical Choice Evaluation
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLM | Authors: Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping | Published: 2025-09-22 | Tags: Hallucination, Weapon Design Methods, Fraud Techniques
Proof-Carrying Numbers (PCN): A Protocol for Trustworthy Numeric Answers from LLMs via Claim Verification | Authors: Aivin V. Solatorio | Published: 2025-09-08 | Tags: Hallucination, Efficient Proof System, Auditing Methods
AICrypto: A Comprehensive Benchmark for Evaluating Cryptography Capabilities of Large Language Models | Authors: Yu Wang, Yijian Liu, Liheng Ji, Han Luo, Wenjie Li, Xiaofei Zhou, Chiyun Feng, Puji Wang, Yuhan Cao, Geyuan Zhang, Xiaojian Li, Rongwu Xu, Yilei Chen, Tianxing He | Published: 2025-07-13 | Updated: 2025-09-30 | Tags: Algorithm, Hallucination, Prompt Validation
Using LLMs for Security Advisory Investigations: How Far Are We? | Authors: Bayu Fedra Abdullah, Yusuf Sulistyo Nugroho, Brittany Reid, Raula Gaikovina Kula, Kazumasa Shimari, Kenichi Matsumoto | Published: 2025-06-16 | Tags: Advice Provision, Hallucination, Prompt Leaking
DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response | Authors: Bilel Cherif, Tamas Bisztray, Richard A. Dubniczky, Aaesha Aldahmani, Saeed Alshehhi, Norbert Tihanyi | Published: 2025-05-26 | Tags: Hallucination, Model Performance Evaluation, Evaluation Method
VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation | Authors: Ethan TS. Liu, Austin Wang, Spencer Mateega, Carlos Georgescu, Danny Tang | Published: 2025-05-26 | Tags: Website Vulnerability, Hallucination, Dynamic Vulnerability Management