As machine learning is applied ever more widely, striking a balance between
protecting the privacy of data and algorithm parameters and ensuring the
verifiability of machine learning remains a challenge. This study explores the
intersection of reinforcement learning and
data privacy, specifically addressing the Multi-Armed Bandit (MAB) problem with
the Upper Confidence Bound (UCB) algorithm. We introduce zkUCB, an innovative
algorithm that employs Zero-Knowledge Succinct Non-Interactive Arguments of
Knowledge (zk-SNARKs) to enhance UCB. zkUCB is carefully designed to safeguard
the confidentiality of training data and algorithmic parameters while keeping
UCB decision-making transparent. Experiments highlight zkUCB's superior
performance; we attribute its enhanced reward to judicious use of quantization
bits, which reduces information entropy in the decision-making process. zkUCB's
proof size and verification time scale linearly with the number of execution
steps, which shows that zkUCB balances data security against operational
efficiency. This approach contributes significantly to the ongoing
discourse on reinforcing data privacy in complex decision-making processes,
offering a promising solution for privacy-sensitive applications.
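
For orientation, the sketch below shows the standard UCB1 selection rule with an optional fixed-point rounding step, to illustrate what spending a given number of quantization bits on the decision index could look like. It is not the zkUCB construction: no zk-SNARK proof is generated, and the quantization scheme, the Bernoulli reward model, and the names `quantize`, `ucb_select`, and `run_bandit` are all assumptions made for illustration.

```python
import math
import random


def quantize(x, bits):
    """Round x to a fixed-point grid with `bits` fractional bits (assumed scheme)."""
    scale = 1 << bits
    return round(x * scale) / scale


def ucb_select(counts, sums, t, bits=None):
    """Return the arm with the largest UCB1 index, optionally quantized."""
    # Play every arm once before the confidence bound is defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    best_arm, best_index = 0, -math.inf
    for arm, n in enumerate(counts):
        # UCB1 index: empirical mean plus exploration bonus.
        index = sums[arm] / n + math.sqrt(2.0 * math.log(t) / n)
        if bits is not None:
            index = quantize(index, bits)
        # Strict comparison: ties go to the lowest-numbered arm.
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm


def run_bandit(probs, steps, bits=None, seed=0):
    """Simulate Bernoulli arms (hypothetical environment) for `steps` rounds."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    sums = [0.0] * len(probs)
    total = 0.0
    for t in range(1, steps + 1):
        arm = ucb_select(counts, sums, t, bits)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts
```

Calling `run_bandit([0.2, 0.8], 500, bits=8)` and `run_bandit([0.2, 0.8], 500)` with the same seed gives a rough way to compare quantized and unquantized selection. Any difference observed in this toy setting should not be read as a reproduction of the reported experiments.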