Multimodal large language models (MLLMs) have achieved remarkable progress,
yet remain critically vulnerable to adversarial attacks that exploit weaknesses
in cross-modal processing. We present a systematic study of multimodal
jailbreaks targeting both vision-language and audio-language models, showing
that even simple perceptual transformations can reliably bypass
state-of-the-art safety filters. Our evaluation spans 1,900 adversarial prompts
across three high-risk safety categories: harmful content, CBRN (Chemical,
Biological, Radiological, Nuclear), and CSEM (Child Sexual Exploitation
Material), tested against seven frontier models. We evaluate the effectiveness of
attack techniques on MLLMs, including FigStep-Pro (visual keyword
decomposition), Intelligent Masking (semantic obfuscation), and audio
perturbations (Wave-Echo, Wave-Pitch, Wave-Speed). The results reveal severe
vulnerabilities: models with near-perfect text-only safety (0\% ASR) suffer
attack success rates above 75\% under perceptually modified inputs, with
FigStep-Pro achieving up to 89\% ASR on Llama-4 variants. Audio-based attacks further
uncover provider-specific weaknesses, with even basic modality transfer
yielding 25\% ASR for technical queries. These findings expose a critical gap
between text-centric alignment and multimodal threats, demonstrating that
current safeguards fail to generalize to cross-modal attacks. The
accessibility of these attacks, which require minimal technical expertise,
suggests that robust multimodal AI safety will demand a paradigm shift toward
broader semantic-level reasoning to mitigate these risks.