Large language models (LLMs) are widely deployed, but their growing compute
demands expose them to inference cost attacks that maximize output length. We
reveal that prior attacks are fundamentally self-targeting because they rely on
crafted inputs, so the added cost accrues to the attacker's own queries and
scales poorly in practice. In this work, we introduce the first bit-flip
inference cost attack that directly modifies model weights to induce persistent
overhead for all users of a compromised LLM. Such attacks are stealthy yet
realistic in practice: for instance, in shared MLaaS environments, co-located
tenants can exploit hardware-level faults (e.g., Rowhammer) to flip memory bits
storing model parameters. We instantiate this attack paradigm with BitHydra,
which (1) minimizes a loss that suppresses the end-of-sequence (EOS) token and
(2) employs an efficient critical-bit search focused on
the EOS embedding vector, sharply reducing the search space while preserving
benign-looking outputs. We evaluate 11 LLMs (1.5B to 14B parameters) under int8 and
float16, demonstrating that our method efficiently achieves scalable cost
inflation with only a few bit flips, while remaining effective even against
potential defenses.
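
To make the mechanism concrete, the following is a minimal sketch of how one flipped bit in a stored weight could lower the EOS logit and thus the probability that generation stops. It is an illustration under stated assumptions, not BitHydra itself: the helper names (`flip_bit_int8`, `flip_bit_float16`, `eos_probability`), the toy hidden state, the three-entry EOS weight row, and the quantization scale are all hypothetical, and the critical-bit search and loss described above are not reproduced here.

```python
import math
import struct


def flip_bit_int8(value: int, bit: int) -> int:
    """Flip one bit (0 = least significant) of a two's-complement int8."""
    (raw,) = struct.unpack("<B", struct.pack("<b", value))
    return struct.unpack("<b", struct.pack("<B", raw ^ (1 << bit)))[0]


def flip_bit_float16(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of an IEEE 754 half-precision float."""
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    return struct.unpack("<e", struct.pack("<H", raw ^ (1 << bit)))[0]


def eos_probability(logits, eos_id):
    """Softmax probability assigned to the EOS token (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[eos_id] / sum(exps)


# Hypothetical toy setup: a 3-dimensional hidden state and the int8 weight
# row that produces the EOS logit, dequantized with an assumed scale.
hidden = [1.0, 2.0, -1.0]
eos_row = [3, 5, -2]
scale = 0.1


def eos_logit(row):
    return scale * sum(w * h for w, h in zip(row, hidden))


# Flipping the sign bit (bit 7) of the weight 5 turns it into -123.
flipped_row = [eos_row[0], flip_bit_int8(eos_row[1], 7), eos_row[2]]

# Logits for two other tokens are fixed at assumed values; EOS has index 2.
p_before = eos_probability([0.0, 0.5, eos_logit(eos_row)], eos_id=2)
p_after = eos_probability([0.0, 0.5, eos_logit(flipped_row)], eos_id=2)

# In float16, flipping the top exponent bit (bit 14) of 0.5 yields 32768.0.
f16_flipped = flip_bit_float16(0.5, 14)
```

In this toy case a single sign-bit flip moves the EOS logit from positive to strongly negative, so `p_after` is far smaller than `p_before`. The float16 case shows why exponent bits matter: one flip changes the magnitude of the weight by many orders.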