Audio and speech data are increasingly used in machine learning applications
such as speech recognition, speaker identification, and mental health
monitoring. However, the passive collection of this data by audio listening
devices raises significant privacy concerns. Fully homomorphic encryption (FHE)
offers a promising solution by enabling computations on encrypted data and
preserving user privacy. Despite its potential, prior attempts to apply FHE to
audio processing have faced challenges, particularly in securely computing
time-frequency representations, a critical step in many audio tasks.
Here, we addressed this gap by introducing a fully secure pipeline that
computes, with FHE and quantized neural network operations, four fundamental
time-frequency representations: Short-Time Fourier Transform (STFT), Mel
filterbanks, Mel-frequency cepstral coefficients (MFCCs), and gammatone
filters. Our methods also support the private computation of audio descriptors
and convolutional neural network (CNN) classifiers. In addition, we proposed
approximate STFT algorithms that reduce computation and bit usage for
statistical and machine learning analyses.
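To make the idea of a quantized STFT concrete, the following is a minimal
plaintext sketch in Python with NumPy. It is not the pipeline described in this
work and involves no encryption: it only illustrates how a power spectrogram
can be computed with integer-only arithmetic, the kind of operation an FHE
circuit can evaluate. The function names (quantize, quantized_stft_power), the
Hann window, the frame length, the hop size, and the 8-bit width are all
assumptions chosen for illustration.

```python
import numpy as np

def quantize(x, n_bits, scale):
    """Map floats in [-scale, scale] to signed integers on n_bits (hypothetical helper)."""
    q_max = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale * q_max), -q_max, q_max).astype(np.int64)

def quantized_stft_power(signal, frame_len=64, hop=32, n_bits=8):
    """Integer-only power spectrogram; all parameters are illustrative assumptions."""
    # Windowed DFT basis for the non-negative frequency bins.
    window = np.hanning(frame_len)
    k = np.arange(frame_len // 2 + 1)[:, None]
    n = np.arange(frame_len)[None, :]
    angle = 2 * np.pi * k * n / frame_len
    # Quantize the (public) basis and the (private) signal to n_bits integers.
    cos_q = quantize(np.cos(angle) * window, n_bits, 1.0)
    sin_q = quantize(-np.sin(angle) * window, n_bits, 1.0)
    sig_q = quantize(signal, n_bits, np.max(np.abs(signal)))
    # Slice the quantized signal into overlapping frames, one per column.
    n_frames = 1 + (len(sig_q) - frame_len) // hop
    frames = np.stack(
        [sig_q[i * hop : i * hop + frame_len] for i in range(n_frames)], axis=1
    )
    # Integer matrix products give the real and imaginary parts of the STFT.
    real = cos_q @ frames
    imag = sin_q @ frames
    # Squared magnitude avoids the square root, which is costly on integers.
    return real ** 2 + imag ** 2

# Usage: a pure tone centered on frequency bin 8 of a 64-sample frame.
t = np.arange(256)
tone = np.sin(2 * np.pi * 8 * t / 64)
power = quantized_stft_power(tone)
```

Lowering n_bits in this sketch coarsens both the basis and the signal, which
is one simple way to see the trade-off between bit usage and spectral accuracy.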
We ran experiments on the VocalSet and OxVoc datasets to demonstrate the fully
private computation of our approach. STFT approximation yielded significant
performance improvements both in private statistical analysis of audio
markers and in vocal exercise classification with CNNs. Our results reveal
that our approximations substantially reduce error rates compared to
conventional STFT implementations in FHE. We also demonstrated fully private
gender and vocal exercise classification from raw audio. Finally, we provided
a practical heuristic for parameter
selection, making quantized approximate signal processing accessible to
researchers and practitioners aiming to protect sensitive audio data.