These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
The deployment of Large Language Models (LLMs) in embodied agents creates an
urgent need to measure their privacy awareness in the physical world. Existing
evaluation methods, however, are confined to natural language based scenarios.
To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation
benchmark designed to quantify the physical-world privacy awareness of
LLM-powered agents. EAPrivacy utilizes procedurally generated scenarios across
four tiers to test an agent's ability to handle sensitive objects, adapt to
changing environments, balance task execution with privacy constraints, and
resolve conflicts with social norms. Our measurements reveal a critical deficit
in current models. The top-performing model, Gemini 2.5 Pro, achieved only 59\%
accuracy in scenarios involving changing physical environments. Furthermore,
when a task was accompanied by a privacy request, models prioritized completion
over the constraint in up to 86\% of cases. In high-stakes situations pitting
privacy against critical social norms, leading models like GPT-4o and
Claude-3.5-haiku disregarded the social norm over 15\% of the time. These
findings, demonstrated by our benchmark, underscore a fundamental misalignment
in LLMs regarding physically grounded privacy and establish the need for more
robust, physically-aware alignment. Codes and datasets will be available at
https://github.com/Graph-COM/EAPrivacy.