AICrypto: A Comprehensive Benchmark for Evaluating Cryptography Capabilities of Large Language Models

Source

https://arxiv.org/abs/2507.09580

PDF

https://arxiv.org/pdf/2507.09580

Paper Information

Author: Yu Wang, Yijian Liu, Liheng Ji, Han Luo, Wenjie Li, Xiaofei Zhou, Chiyun Feng, Puji Wang, Yuhan Cao, Geyuan Zhang, Xiaojian Li, Rongwu Xu, Yilei Chen, Tianxing He
Published: 2025-07-13
Updated: 2025-09-30
Affiliation: Institute of Information Engineering, Chinese Academy of Sciences
Country: China
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Algorithm, Hallucination, Prompt validation

These labels were automatically added by AI and may be inaccurate.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a variety of domains. However, their application to cryptography, a foundational pillar of cybersecurity, remains largely unexplored. To address this gap, we propose AICrypto, the first comprehensive benchmark designed to evaluate the cryptography capabilities of LLMs. The benchmark comprises 135 multiple-choice questions, 150 capture-the-flag (CTF) challenges, and 18 proof problems, covering a broad range of skills from factual memorization to vulnerability exploitation and formal reasoning. All tasks are carefully reviewed or constructed by cryptography experts to ensure correctness and rigor. To support automated evaluation of CTF challenges, we design an agent-based framework. We introduce strong human expert performance baselines for comparison across all task types. Our evaluation of 17 leading LLMs reveals that state-of-the-art models match or even surpass human experts in memorizing cryptographic concepts, exploiting common vulnerabilities, and completing routine proofs. However, our case studies reveal that they still lack a deep understanding of abstract mathematical concepts and struggle with tasks that require multi-step reasoning and dynamic analysis. We hope this work provides insights for future research on LLMs in cryptographic applications. Our code and dataset are available at https://aicryptobench.github.io/.

External Datasets

AICrypto