Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems

arxiv

Source

https://arxiv.org/abs/2603.12023

PDF

https://arxiv.org/pdf/2603.12023

Paper Information

Author: Sarbartha Banerjee, Prateek Sahu, Anjo Vahldiek-Oberwagner, Jose Sanchez Vicarte, Mohit Tiwari
Published: 3-13-2026
Affiliation: The University of Texas at Austin
Country: United States of America

Labels Estimated by AI

Model Extraction Attack Vulnerability Management Prompt Injection

These labels were automatically added by AI and may be inaccurate.

Abstract

Rapid progress in generative AI has given rise to Compound AI systems - pipelines composed of multiple large language models (LLMs), software tools, and database systems. Compound AI systems are constructed on a layered traditional software stack running on a distributed hardware infrastructure. Many of the diverse software components are vulnerable to traditional security flaws documented in the Common Vulnerabilities and Exposures (CVE) database, while the underlying distributed hardware infrastructure remains exposed to timing attacks, bit-flip faults, and power-based side channels. Today, research targets LLM-specific risks like model extraction, training data leakage, and unsafe generation, overlooking the impact of traditional system vulnerabilities. This work investigates how traditional software and hardware vulnerabilities can complement LLM-specific algorithmic attacks to compromise the integrity of a compound AI pipeline. We demonstrate two novel attacks that combine system-level vulnerabilities with algorithmic weaknesses: (1) Exploiting a software code injection flaw along with a guardrail Rowhammer attack to inject an unaltered jailbreak prompt into an LLM, resulting in an AI safety violation, and (2) Manipulating a knowledge database to redirect an LLM agent to transmit sensitive user data to a malicious application, thus breaching confidentiality. These attacks highlight the need to address traditional vulnerabilities; we systematize the attack primitives and analyze their composition by grouping vulnerabilities by their objective and mapping them to distinct stages of an attack lifecycle. This approach enables a rigorous red-teaming exercise and lays the groundwork for future defense strategies.

External Datasets

algorithmic vulnerability dataset

common vulnerability exposure (CVEs) dataset

References

The shift from models to compound ai systems

M. Zaharia, O. Khattab, L. Chen, J. Q. Davis, H. Miller, C. Potts, J. Zou, M. Carbin, J. Frankle, N. Rao, A. Ghodsi

Cuda® deep neural network library

Enhanced Membership Inference Attacks against Machine Learning Models

Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, Vincent Bindschaedler, Reza Shokri

Published: 11.18.2021

How much does a machine learning algorithm leak about its training data, and why? Membership inference attacks are used as an auditing tool to quantify this leakage. In this paper, we present a comprehensive hypothesis testing framework that enables us not only to formally express the prior work in a consistent way, but also to design new membership inference attacks that use reference models to achieve a significantly higher power (true positive rate) for any (false positive rate) error. More importantly, we explain why different attacks perform differently. We present a template for indistinguishability games, and provide an interpretation of attack success rate across different instances of the game. We discuss various uncertainties of attackers that arise from the formulation of the problem, and show how our approach tries to minimize the attack uncertainty to the one bit secret about the presence or absence of a data point in the training set. We perform a differential analysis between all types of attacks, explain the gap between them, and show what causes data points to be vulnerable to an attack (as the reasons vary due to different granularities of memorization, from overfitting to conditional memorization). Our auditing framework is openly accessible as part of the Privacy Meter software tool.

Adversarial attack Membership Inference Poisoning

arxiv

Cited by 1

European Symposium on Research in Computer Security (ESORICS)

Data Poisoning Attacks Against Federated Learning Systems

Vale Tolpegin, Stacey Truex, Mehmet Emre Gursoy, Ling Liu

Published: 7.17.2020

Federated learning (FL) is an emerging paradigm for distributed training of large-scale deep neural networks in which participants' data remains on their own devices with only model updates being shared with a central server. However, the distributed nature of FL gives rise to new threats caused by potentially malicious participants. In this paper, we study targeted data poisoning attacks against FL systems in which a malicious subset of the participants aim to poison the global model by sending model updates derived from mislabeled data. We first demonstrate that such data poisoning attacks can cause substantial drops in classification accuracy and recall, even with a small percentage of malicious participants. We additionally show that the attacks can be targeted, i.e., they have a large negative impact only on classes that are under attack. We also study attack longevity in early/late round training, the impact of malicious participant availability, and the relationships between the two. Finally, we propose a defense strategy that can help identify malicious participants in FL to circumvent poisoning attacks, and demonstrate its effectiveness.

Poisoning Attack Method Performance Evaluation

Proceedings of the 55th Annual Design Automation Conference

Reverse engineering convolutional neural networks through side-channel information leaks

W. Hua, Z. Zhang, G. E. Suh

Published: 2018

25th USENIX security symposium (USENIX Security 16)

Stealing machine learning models via prediction APIs

F. Tramer, F. Zhang, A. Juels, M. K. Reiter, T. Ristenpart

Published: 2016

CoRR

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao

Published: 2023

Lessons From Red Teaming 100 Generative AI Products

B. Bullwinkel, A. Minnich, S. Chawla, G. Lopez, M. Pouliot, W. Maxwell, J. de Gruyter, K. Pratt, S. Qi, N. Chikanov, R. Lutz, R. S. R. Dheekonda, B.-E. Jagdagdorj, E. Kim, J. Song, K. Hines, D. Jones, G. Severi, R. Lundeen, S. Vaughan, V. Westerhoff, P. Bryan, R. S. S. Kumar, Y. Zunger, C. Kawaguchi, M. Russinovich

Published: 2025

ACM SIGARCH Computer Architecture News

Flipping bits in memory without accessing them: An experimental study of dram disturbance errors

Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, O. Mutlu

Published: 2014

arxiv

Cited by 1

Computing Research Repository (CoRR)

SoK: Memorization in General-Purpose Large Language Models

Valentin Hartmann, Anshuman Suri, Vincent Bindschaedler, David Evans, Shruti Tople, Robert West

Published: 10.24.2023

Large Language Models (LLMs) are advancing at a remarkable pace, with myriad applications under development. Unlike most earlier machine learning models, they are no longer built for one specific application but are designed to excel in a wide range of tasks. A major part of this success is due to their huge training datasets and the unprecedented number of model parameters, which allow them to memorize large amounts of information contained in the training data. This memorization goes beyond mere language, and encompasses information only present in a few documents. This is often desirable since it is necessary for performing tasks such as question answering, and therefore an important part of learning, but also brings a whole array of issues, from privacy and security to copyright and beyond. LLMs can memorize short secrets in the training data, but can also memorize concepts like facts or writing styles that can be expressed in text in many different ways. We propose a taxonomy for memorization in LLMs that covers verbatim text, facts, ideas and algorithms, writing styles, distributional properties, and alignment goals. We describe the implications of each type of memorization - both positive and negative - for model performance, privacy, security and confidentiality, copyright, and auditing, and ways to detect and prevent memorization. We further highlight the challenges that arise from the predominant way of defining memorization with respect to model behavior instead of model weights, due to LLM-specific phenomena such as reasoning capabilities or differences between decoding algorithms. Throughout the paper, we describe potential risks and opportunities arising from memorization in LLMs that we hope will motivate new research directions.

Prompt Injection Measurement of Memorization Privacy Technique

Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails

T. Rebedea, R. Dinu, M. Sreedhar, C. Parisien, J. Cohen

Published: 2023

Generative ai data governance – amazon bedrock guardrails – aws

Guardrail

arxiv

Cited by 16

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

Published: 8.8.2023

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

LLM Security Prompt Injection Character Role Acting

arxiv

Cited by 5

Computing Research Repository (CoRR)

Detecting Language Model Attacks with Perplexity

Gabriel Alon, Michael Kamfonas

Published: 8.28.2023

A novel hack involving Large Language Models (LLMs) has emerged, exploiting adversarial suffixes to deceive models into generating perilous responses. Such jailbreaks can trick LLMs into providing intricate instructions to a malicious user for creating explosives, orchestrating a bank heist, or facilitating the creation of offensive content. By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. As we explored a broad range of regular (non-adversarial) prompt varieties, we concluded that false positives are a significant challenge for plain perplexity filtering. A Light-GBM trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.

Prompt Injection Malicious Prompt LLM Security

arxiv

Cited by 3

Annual Meeting of the Association for Computational Linguistics (ACL)

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM

Bochuan Cao, Yuanpu Cao, Lu Lin, Jinghui Chen

Published: 9.18.2023

Recently, Large Language Models (LLMs) have made significant advancements and are now widely used across various domains. Unfortunately, there has been a rising concern that LLMs can be misused to generate harmful or malicious content. Though a line of research has focused on aligning LLMs with human values and preventing them from producing inappropriate content, such alignments are usually vulnerable and can be bypassed by alignment-breaking attacks via adversarially optimized or handcrafted jailbreaking prompts. In this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be directly constructed upon an existing aligned LLM with a robust alignment checking function, without requiring any expensive retraining or fine-tuning process of the original LLM. Furthermore, we also provide a theoretical analysis for RA-LLM to verify its effectiveness in defending against alignment-breaking attacks. Through real-world experiments on open-source large language models, we demonstrate that RA-LLM can successfully defend against both state-of-the-art adversarial prompts and popular handcrafted jailbreaking prompts by reducing their attack success rates from nearly 100% to around 10% or less.

Prompt Injection Defense Method Safety Alignment

Baseline defenses for adversarial attacks against aligned language models

N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P. yeh Chiang, M. Goldblum, A. Saha, J. Geiping, T. Goldstein

Published: 2023

arxiv

Cited by 5

Trans. Mach. Learn. Res.

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas

Published: 10.6.2023

Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm.

Defense Method Prompt Injection LLM Performance Evaluation

Lamini - enterprise llm platform

Predibase: The developers platform for fine-tuning and serving llms - predibase

Prompt shields - azure ai foundry

Fact-checking with new grounding api in jina reader

Published: 2024

Fact checker ai —gemini api developer competition — google ai for developers

arxiv

Cited by 3

USENIX Security Symposium

PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models

Wei Zou, Runpeng Geng, Binghui Wang, Jinyuan Jia

Published: 2.13.2024

Large language models (LLMs) have achieved remarkable success due to their exceptional generative capabilities. Despite their success, they also have inherent limitations such as a lack of up-to-date knowledge and hallucination. Retrieval-Augmented Generation (RAG) is a state-of-the-art technique to mitigate these limitations. The key idea of RAG is to ground the answer generation of an LLM on external knowledge retrieved from a knowledge database. Existing studies mainly focus on improving the accuracy or efficiency of RAG, leaving its security largely unexplored. We aim to bridge the gap in this work. We find that the knowledge database in a RAG system introduces a new and practical attack surface. Based on this attack surface, we propose PoisonedRAG, the first knowledge corruption attack to RAG, where an attacker could inject a few malicious texts into the knowledge database of a RAG system to induce an LLM to generate an attacker-chosen target answer for an attacker-chosen target question. We formulate knowledge corruption attacks as an optimization problem, whose solution is a set of malicious texts. Depending on the background knowledge (e.g., black-box and white-box settings) of an attacker on a RAG system, we propose two solutions to solve the optimization problem, respectively. Our results show PoisonedRAG could achieve a 90% attack success rate when injecting five malicious texts for each target question into a knowledge database with millions of texts. We also evaluate several defenses and our results show they are insufficient to defend against PoisonedRAG, highlighting the need for new defenses.

Prompt Injection Poisoning Attack Poisoning

Confusedpilot: Confused deputy risks in rag-based llms

A. RoyChowdhury, M. Luo, P. Sahu, S. Banerjee, M. Tiwari

Published: 2024

23rd USENIX security symposium (USENIX security 14)

FLUSH+RELOAD: A high resolution, low noise, l3 cache Side-Channel attack

Y. Yarom, K. Falkner

Published: 2014

29th USENIX Security Symposium (USENIX Security 20)

An Off-Chip attack on hardware enclaves via the memory bus

D. Lee, D. Jung, I. T. Fang, C.-C. Tsai, R. A. Popa

Published: 2020

Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security

Mitigating storage side channels using statistical privacy mechanisms

Q. Xiao, M. K. Reiter, Y. Zhang

Published: 2015

Defending large language models against jailbreak attacks via semantic smoothing

J. Ji, B. Hou, A. Robey, G. J. Pappas, H. Hassani, Y. Zhang, E. Wong, S. Chang

Published: 2024

Pytorchfi: A runtime perturbation tool for dnns

A. Mahmoud, N. Aggarwal, A. Nobbe, J. Vicarte, S. Adve, C. Fletcher, I. Frosio, S. Hari

Published: 2020

LLMart: Large Language Model adversarial robustness toolbox

C. Cornelius, M. Arvinte, S. Szyller, W. Xu, N. Himayat

Published: 2025

Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems

Everywhere all at once: Co-location attacks on public cloud faas

Z. N. Zhao, A. Morrison, C. W. Fletcher, J. Torrellas

Published: 2024

25th USENIX Security Symposium (USENIX Security 16)

One bit flips, one cloud flops: Cross-VM row hammer attacks and privilege escalation

Y. Xiao, X. Zhang, Y. Zhang, R. Teodorescu

Published: 2016

Phoenix: Rowhammer attacks on ddr5 with self-correcting synchronization

D. Meyer, P. Jattke, M. Marazzi, S. Qazi, D. Moghimi, K. Razavi

Published: 2026

34th USENIX Security Symposium (USENIX Security 25)

Rowhammer-Based trojan injection: One bit flip is sufficient for backdooring DNNs

X. Li, Y. Meng, J. Chen, L. Luo, Q. Zeng

Published: 2025

arxiv

Cited by 12

Computing Research Repository (CoRR)

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

Published: 7.28.2023

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.

LLM Security Prompt Injection Inappropriate Content Generation

pcileech

Published: 2017

Proceedings 2019 Network and Distributed System Security Symposium

Thunderclap: Exploring vulnerabilities in operating system iommu protection via dma from untrustworthy peripherals

A. T. Markettos, C. Rothwell, B. F. Gutstein, A. Pearce, P. G. Neumann, S. W. Moore, R. N. M. Watson

Published: 2019

New security challenges on machine learning inference engine: Chip cloning and model reverse engineering

S. Huang, X. Peng, H. Jiang, Y. Luo, S. Yu

Published: 2020

2021 IEEE Symposium on Security and Privacy (SP)

Invisible probe: Timing attacks with pcie congestion side-channel

M. Tan, J. Wan, Z. Zhou, Z. Li

Published: 2021

2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)

Fault injection attack on deep neural network

Y. Liu, L. Wei, B. Luo, Q. Xu

Published: 2017

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Understanding error propagation in deep learning neural network (dnn) accelerators and applications

G. Li, S. K. S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, S. W. Keckler

Published: 2017

IEEE Transactions on Dependable and Secure Computing

Fault injection for tensorflow applications

N. Narayanan, Z. Chen, B. Fang, G. Li, K. Pattabiraman, N. Debardeleben

Published: 2022

The Journal of Supercomputing

Int-monitor: a model triggered hardware trojan in deep learning accelerators

P. Li, R. Hou

Published: 2023