ACM Trans. Intell. Syst. Technol.
Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications
Abstract
The success of Large Language Models (LLMs) has led to a parallel rise in the
development of Large Multimodal Models (LMMs), which have begun to transform a
variety of applications. These models interpret and analyze complex data by
integrating multiple modalities, such as text and images. This
paper investigates the applicability and effectiveness of prompt-engineered
LMMs that process both images and text, including models such as LLaVA,
BakLLaVA, Moondream, Gemini-pro-vision, and GPT-4o, compared to fine-tuned
Vision Transformer (ViT) models in addressing critical security challenges. We
focus on two distinct security tasks: 1) a visually evident task of detecting
simple triggers, such as small pixel variations in images that could be
exploited to access potential backdoors in the models, and 2) a visually
non-evident task of malware classification through visual representations. In
the visually evident task, some LMMs, such as Gemini-pro-vision and GPT-4o,
have demonstrated the potential to achieve good performance with careful prompt
engineering, with GPT-4o achieving the highest accuracy and F1-score of 91.9%
and 91%, respectively. However, the fine-tuned ViT models exhibit perfect
performance in this task owing to the task's simplicity. For the visually non-evident
task, the results highlight a significant divergence in performance, with ViT
models achieving F1-scores of 97.11% in predicting 25 malware classes and
97.61% in predicting 5 malware families, whereas LMMs showed suboptimal
performance despite iterative prompt improvements. This study not only
showcases the strengths and limitations of prompt-engineered LMMs in
cybersecurity applications but also emphasizes the unmatched efficacy of
fine-tuned ViT models for precise and dependable tasks.
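
The abstract does not say how malware is turned into a "visual representation". One common convention in the malware-visualization literature is to treat each byte of a binary as one grayscale pixel. The sketch below illustrates that idea only; it is an assumption, not the paper's pipeline, and the function name `bytes_to_grayscale`, the fixed `width`, the zero-padding rule, and the sample bytes are all hypothetical.

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 16) -> list[list[int]]:
    """Map each byte to one grayscale pixel (0-255), filling rows left to right.

    Assumption: the final row is zero-padded so the image is rectangular.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # bytes(n) yields n zero bytes, used here as padding.
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Hypothetical stand-in for the contents of a binary file.
sample = bytes(range(40))
image = bytes_to_grayscale(sample, width=16)
print(len(image), len(image[0]))  # rows, columns
```

An image built this way has no structure that a human viewer would recognize, which is consistent with the abstract calling malware classification a visually non-evident task.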