Abstract
Our research uncovers a novel privacy risk associated with multimodal large
language models (MLLMs): the ability to infer sensitive personal attributes
from audio data, a technique we term audio private attribute profiling. This
capability poses a significant threat, as audio can be covertly captured
without direct interaction or visibility. Moreover, compared to images and
text, audio carries unique characteristics, such as tone and pitch, which can
be exploited for more detailed profiling. However, two key challenges hinder
the study of MLLM-based private attribute profiling from audio: (1) the
lack of audio benchmark datasets with sensitive attribute annotations and (2)
the limited ability of current MLLMs to infer such attributes directly from
audio. To address these challenges, we introduce AP^2, an audio benchmark
dataset consisting of two subsets collected and composed from real-world
data, both annotated with sensitive attribute labels. Additionally, we
propose Gifts, a hybrid multi-agent framework that leverages the complementary
strengths of audio-language models (ALMs) and large language models (LLMs) to
enhance inference capabilities. Gifts employs an LLM to guide the ALM in
inferring sensitive attributes, then forensically analyzes and consolidates the
ALM's inferences, overcoming severe hallucinations of existing ALMs in
generating long-context responses. Our evaluations demonstrate that Gifts
significantly outperforms baseline approaches in inferring sensitive
attributes. Finally, we investigate model-level and data-level defense
strategies to mitigate the risks of audio private attribute profiling. Our work
validates the feasibility of audio-based privacy attacks using MLLMs,
highlighting the need for robust defenses, and provides a dataset and framework
to facilitate future research.
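
The guide-then-consolidate flow described for Gifts can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `profile_attributes`, `stub_guide`, and `stub_alm` are illustrative stand-ins, both models are plain callables rather than real ALM or LLM APIs, and the consolidation step is simplified to a majority vote over short answers, whereas the abstract describes an LLM that forensically analyzes the ALM's inferences.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical interfaces: both "models" are plain callables so the sketch
# runs without any real ALM or LLM behind it.
Guide = Callable[[str], List[str]]        # attribute -> short guiding questions
AudioModel = Callable[[bytes, str], str]  # (audio, question) -> short answer


def profile_attributes(audio: bytes, attributes: List[str],
                       guide: Guide, alm: AudioModel) -> Dict[str, str]:
    """Ask several short guided questions per attribute, then consolidate."""
    profile: Dict[str, str] = {}
    for attribute in attributes:
        # Short, focused queries avoid asking the ALM for one long response.
        answers = [alm(audio, q).strip().lower() for q in guide(attribute)]
        answers = [a for a in answers if a]
        if not answers:
            profile[attribute] = "unknown"
            continue
        # Assumption: a strict majority vote stands in for LLM-based analysis.
        value, votes = Counter(answers).most_common(1)[0]
        profile[attribute] = value if votes * 2 > len(answers) else "unknown"
    return profile


def stub_guide(attribute: str) -> List[str]:
    """Stand-in for the guiding LLM: three phrasings of the same question."""
    return [
        f"What is the speaker's {attribute}?",
        f"Based on tone and pitch, estimate the speaker's {attribute}.",
        f"State only the speaker's {attribute}.",
    ]


def stub_alm(audio: bytes, question: str) -> str:
    """Stand-in for the ALM: mostly consistent on one attribute only."""
    if "age group" in question:
        return "Senior" if question.startswith("State") else "Adult "
    # Different answer per question, imitating inconsistent model output.
    return f"guess-{len(question)}"


if __name__ == "__main__":
    print(profile_attributes(b"", ["age group", "accent"], stub_guide, stub_alm))
```

In this sketch an attribute is reported only when a strict majority of the short answers agree; inconsistent answers fall back to "unknown", which is one simple way to keep unsupported guesses out of the final profile.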