AI Security Portal K Program
Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications
Abstract
The success of Large Language Models (LLMs) has led to a parallel rise in the development of Large Multimodal Models (LMMs), which interpret and analyze complex data by integrating multiple modalities such as text and images and have begun to transform a range of applications. This paper investigates the applicability and effectiveness of prompt-engineered LMMs that process both images and text, including LLaVA, BakLLaVA, Moondream, Gemini-pro-vision, and GPT-4o, compared with fine-tuned Vision Transformer (ViT) models in addressing critical security challenges. We focus on two distinct security tasks: 1) a visually evident task of detecting simple triggers, such as small pixel variations in images that could be exploited to access potential backdoors in the models, and 2) a visually non-evident task of malware classification through visual representations. In the visually evident task, some LMMs, such as Gemini-pro-vision and GPT-4o, show the potential to achieve good performance with careful prompt engineering, with GPT-4o achieving the highest accuracy and F1-score, 91.9% and 91%, respectively. However, the fine-tuned ViT models achieve perfect performance on this task because the task itself is simple. For the visually non-evident task, the results diverge significantly: ViT models achieve F1-scores of 97.11% in predicting 25 malware classes and 97.61% in predicting 5 malware families, whereas LMMs showed suboptimal performance despite iterative prompt improvements. This study not only showcases the strengths and limitations of prompt-engineered LMMs in cybersecurity applications but also emphasizes the unmatched efficacy of fine-tuned ViT models for precise and dependable tasks.
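To make the two kinds of input concrete, the following minimal sketch builds one image of each type. It is an illustration under stated assumptions, not the paper's pipeline: the abstract does not specify the trigger's shape or position, nor how malware bytes are mapped to pixels, so the function names `add_trigger` and `bytes_to_image`, the 3x3 bottom-right patch, the 28x28 canvas, and the row width of 16 are all hypothetical choices.

```python
import numpy as np


def add_trigger(image: np.ndarray, size: int = 3, value: int = 255) -> np.ndarray:
    """Return a copy of a grayscale image with a size x size patch set to `value`.

    Assumption: the patch sits in the bottom-right corner; the source only
    describes triggers as "small pixel variations".
    """
    stamped = image.copy()
    stamped[-size:, -size:] = value
    return stamped


def bytes_to_image(data: bytes, width: int = 16) -> np.ndarray:
    """Lay raw bytes out row by row as a 2-D uint8 grayscale image.

    Assumption: one byte per pixel, with the final row zero-padded.
    """
    buf = np.frombuffer(data, dtype=np.uint8)
    height = -(-len(buf) // width)  # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: len(buf)] = buf
    return padded.reshape(height, width)


# Visually evident case: the clean and triggered images differ in a small patch.
clean = np.zeros((28, 28), dtype=np.uint8)
triggered = add_trigger(clean)
changed_pixels = int((clean != triggered).sum())

# Visually non-evident case: a byte sequence rendered as an image.
malware_like = bytes_to_image(bytes(range(40)), width=16)
```

In the first case the difference between the two images is a localized patch that a person can see. In the second, the pixel values simply mirror byte values, so class differences are not obvious to the eye, which is consistent with the abstract's "visually non-evident" label.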