Abstract
Several recent works have argued that Large Language Models (LLMs) can be
used to tame the data deluge in the cybersecurity field by improving the
automation of Cyber Threat Intelligence (CTI) tasks. This work presents an
evaluation methodology that, in addition to testing LLMs on CTI tasks with
zero-shot learning, few-shot learning, and fine-tuning, also quantifies
their consistency and confidence level. We run experiments with three
state-of-the-art LLMs and a dataset of 350 threat intelligence reports, and
present new evidence of potential security risks in relying on LLMs for
CTI. We show that LLMs cannot guarantee sufficient performance on real-size
reports while also being inconsistent and overconfident. Few-shot learning
and fine-tuning only partially improve the results, casting doubt on the
feasibility of using LLMs in CTI scenarios, where labelled datasets are
lacking and where confidence is a fundamental factor.
External Datasets
350 threat intelligence reports by Di Tizio et al.