Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across
a variety of domains. However, their applications in cryptography, which serves
as a foundational pillar of cybersecurity, remain largely unexplored. To
address this gap, we propose AICrypto, the first comprehensive benchmark
designed to evaluate the cryptography capabilities of LLMs. The benchmark
comprises 135 multiple-choice questions, 150 capture-the-flag (CTF) challenges,
and 18 proof problems, covering a broad range of skills from factual
memorization to vulnerability exploitation and formal reasoning. All tasks are
carefully reviewed or constructed by cryptography experts to ensure correctness
and rigor. To support automated evaluation of CTF challenges, we design an
agent-based framework. We introduce strong human expert performance baselines
for comparison across all task types. Our evaluation of 17 leading LLMs reveals
that state-of-the-art models match or even surpass human experts in memorizing
cryptographic concepts, exploiting common vulnerabilities, and completing routine proofs.
However, our case studies reveal that they still lack a deep understanding of
abstract mathematical concepts and struggle with tasks that require multi-step
reasoning and dynamic analysis. We hope this work provides insights for
future research on LLMs in cryptographic applications. Our code and dataset are
available at https://aicryptobench.github.io/.