Adversarial attacks generate malicious inputs that mislead deep
models, but beyond causing model failure, they provide little
interpretable information, such as ``\textit{What content in inputs makes models
more likely to fail?}'' Yet this information is crucial for researchers seeking to
improve model robustness in a targeted way. Recent research suggests that models may
be particularly sensitive to certain semantics in visual inputs (such as
``wet'' or ``foggy''), making them prone to errors. Inspired by this, we
conduct the first such exploration on large vision-language models
(LVLMs) and find that LVLMs are indeed susceptible to hallucinations and
various errors when facing specific semantic concepts in images. To efficiently
search for these sensitive concepts, we integrate large language models (LLMs)
and text-to-image (T2I) models into a novel semantic evolution framework.
Randomly initialized semantic concepts undergo LLM-based crossover and mutation
operations to form image descriptions, which are then converted by T2I models
into visual inputs for LVLMs. The task-specific performance of LVLMs on each
input is quantified as a fitness score for the semantics involved and serves as
a reward signal that guides the LLMs toward concepts that induce LVLM errors.
Extensive experiments on seven mainstream LVLMs and two multimodal tasks
demonstrate the effectiveness of our method. We also report
findings on the sensitive semantics of LVLMs, which we hope will inspire
further in-depth research.
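
The evolutionary loop described above (initialization, crossover, mutation, fitness-guided selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the functions `llm_crossover`, `llm_mutate`, and `toy_failure_score`, the vocabulary `VOCAB`, and all hyperparameters are hypothetical placeholders. In particular, the LLM operators are replaced by random mixing, and the T2I-plus-LVLM evaluation is replaced by a toy score that simply treats ``wet'' and ``foggy'' as sensitive.

```python
import random
from typing import Callable, List, Tuple

Individual = Tuple[str, ...]  # a small set of distinct semantic concepts

# Hypothetical concept vocabulary; a real system would let an LLM propose concepts.
VOCAB = ["wet", "foggy", "sunny", "crowded", "indoor", "snowy", "dim", "blurry"]


def llm_crossover(a: Individual, b: Individual, rng: random.Random) -> Individual:
    """Placeholder for LLM-based crossover: mix concepts from two parents."""
    pool = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(pool, k=len(a))))


def llm_mutate(ind: Individual, rng: random.Random, rate: float = 0.3) -> Individual:
    """Placeholder for LLM-based mutation: sometimes swap in an unused concept."""
    if rng.random() >= rate:
        return ind
    candidates = [c for c in VOCAB if c not in ind]
    new = list(ind)
    new[rng.randrange(len(new))] = rng.choice(candidates)
    return tuple(sorted(new))


def toy_failure_score(ind: Individual) -> float:
    """Placeholder for: description -> T2I image -> LVLM output -> task error.

    Assumption for illustration only: "wet" and "foggy" are the sensitive concepts.
    """
    sensitive = {"wet", "foggy"}
    return sum(c in sensitive for c in ind) / len(ind)


def evolve(
    fitness: Callable[[Individual], float],
    pop_size: int = 8,
    k: int = 2,
    generations: int = 10,
    seed: int = 0,
) -> List[Individual]:
    """Return the final population, ranked from most to least failure-inducing."""
    rng = random.Random(seed)
    # Randomly initialized semantic concepts.
    population = [tuple(sorted(rng.sample(VOCAB, k))) for _ in range(pop_size)]
    for _ in range(generations):
        # Higher fitness means the (stand-in) LVLM fails more often.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # keep the top half unchanged
        children: List[Individual] = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(llm_mutate(llm_crossover(a, b, rng), rng))
        population = parents + children
    return sorted(population, key=fitness, reverse=True)


if __name__ == "__main__":
    for individual in evolve(toy_failure_score)[:3]:
        print(individual, toy_failure_score(individual))
```

Keeping the top half of each generation is one simple selection choice made for this sketch; the abstract does not specify the selection scheme, so other strategies would be equally consistent with it.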