Abstract
Large language models (LLMs) have been used in many application domains,
including cyber security. The application of LLMs in the cyber security domain
presents significant opportunities, such as enhancing threat analysis and
malware detection, but it can also introduce critical risks and safety
concerns, including potential personal data leakage and automated generation of
new malware. Building on recent findings that fine-tuning LLMs with
pseudo-malicious cyber security data significantly compromises their safety,
this paper presents a comprehensive validation and extension of those
findings using a different evaluation framework. We employ the garak red teaming
framework with the OWASP Top 10 for LLM Applications to assess four open-source
LLMs: Mistral 7B, Llama 3 8B, Gemma 2 9B, and DeepSeek R1 8B. Our evaluation
confirms and extends previous findings, showing that fine-tuning reduces safety
resilience across all tested LLMs (e.g., the failure rate of Mistral 7B against
prompt injection increases from 9.1% to 68.7%). We further propose and evaluate
a novel safety alignment approach that carefully rewords instruction-response
pairs to include explicit safety precautions and ethical considerations. This
approach demonstrates that it is possible to maintain or even improve model
safety while preserving technical utility, offering a practical path towards
safer fine-tuning methodologies. By validating previous safety concerns through
independent evaluation and introducing new methods for mitigating these risks,
this work contributes towards the development of secure, trustworthy, and
ethically aligned LLMs.
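
As a minimal sketch of the kind of evaluation described above, the snippet
below shows how one of the tested models might be probed for prompt injection
with garak, invoked from Python via the standard subprocess module. The model
identifier and probe selection here are illustrative assumptions, not the
paper's actual experimental configuration.

```python
import subprocess

# Hypothetical garak run: scan a Hugging Face model with garak's
# prompt-injection probe module. The model identifier below is an
# assumption for illustration only.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

subprocess.run(
    [
        "python", "-m", "garak",
        "--model_type", "huggingface",  # load the target model via Hugging Face
        "--model_name", model_name,
        "--probes", "promptinject",     # run garak's prompt-injection probes
    ],
    check=True,
)
```

garak records per-probe pass/fail outcomes in a report file, from which
failure rates such as those quoted above can be derived.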
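The rewording approach itself is only summarized in this abstract. Purely as a
hypothetical illustration of what rewording instruction-response pairs with
explicit safety framing might look like, consider the sketch below; the
preamble wording and the function name are invented for this example and may
differ substantially from the paper's actual procedure.

```python
# Hypothetical sketch: wrap a pseudo-malicious instruction-response pair
# with explicit safety and ethics framing before fine-tuning. All strings
# and names here are illustrative assumptions.

SAFETY_PREFACE = (
    "For defensive security research in an authorized environment only: "
)
ETHICS_NOTE = (
    " Always obtain explicit permission before testing systems, and follow "
    "applicable laws and responsible disclosure practices."
)

def reword_pair(instruction: str, response: str) -> tuple[str, str]:
    """Return the pair with an added safety preamble and ethics note."""
    return SAFETY_PREFACE + instruction, response + ETHICS_NOTE

# Example usage on a single training pair.
inst, resp = reword_pair(
    "Explain how a port scanner enumerates open services.",
    "A port scanner sends connection attempts to a range of ports...",
)
print(inst)
print(resp)
```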