Many mobile applications and virtual conversational agents now aim to
recognize and adapt to emotions. To enable this, data are transmitted from
users' devices and stored on central servers. Yet, these data contain sensitive
information that could be used by mobile applications without the user's consent
or, maliciously, by an eavesdropping adversary. In this work, we show how
multimodal representations trained for a primary task, here emotion
recognition, can unintentionally leak demographic information, thereby
overriding an opt-out option selected by the user. We analyze how this leakage
differs in representations obtained from textual, acoustic, and multimodal
data. We use an adversarial learning paradigm to unlearn the private
information present in a representation and investigate the effect of varying
the strength of the adversarial component on the primary task and on the
privacy metric, defined here as the inability of an attacker to predict
specific demographic information. We evaluate this paradigm on multiple
datasets and show that we can improve the privacy metric without significantly
impacting performance on the primary task. To the best of our
knowledge, this is the first work to analyze how the privacy metric differs
across modalities and how multiple privacy concerns can be tackled while still
maintaining performance on emotion recognition.
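A minimal sketch of how such an adversarial unlearning objective might be set up, assuming a PyTorch-style model in which a gradient reversal layer feeds the shared representation to a demographic adversary; names such as GradReverse, PrivateRepresentationModel, and the weighting factor lambd are illustrative assumptions, not details taken from this work.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class PrivateRepresentationModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_emotions, num_demographics, lambd=1.0):
        super().__init__()
        self.lambd = lambd                                             # strength of the adversarial component
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.emotion_head = nn.Linear(hidden_dim, num_emotions)       # primary task: emotion recognition
        self.adversary_head = nn.Linear(hidden_dim, num_demographics) # proxy for an attacker

    def forward(self, x):
        z = self.encoder(x)                              # shared representation
        emotion_logits = self.emotion_head(z)
        # Gradient reversal: the adversary still learns to predict demographics,
        # but the encoder receives the negated gradient and unlearns them.
        adv_logits = self.adversary_head(GradReverse.apply(z, self.lambd))
        return emotion_logits, adv_logits

# Joint objective: minimize the emotion loss; the reversed gradient from the
# adversary pushes the encoder toward representations from which demographic
# attributes are hard to recover.
model = PrivateRepresentationModel(input_dim=300, hidden_dim=128,
                                   num_emotions=4, num_demographics=2, lambd=0.5)
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 300)                       # dummy batch of input features
y_emotion = torch.randint(0, 4, (8,))
y_demog = torch.randint(0, 2, (8,))
emotion_logits, adv_logits = model(x)
loss = criterion(emotion_logits, y_emotion) + criterion(adv_logits, y_demog)
loss.backward()
```

In this sketch, increasing lambd strengthens the adversarial component, trading primary-task accuracy against the attacker's ability to recover demographic information.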