A Decade’s Battle on Dataset Bias: Are We There Yet?

Authors: Zhuang Liu, Kaiming He | Published: 2024-03-13 | Updated: 2025-03-03



Source: https://arxiv.org/abs/2403.08632

PDF: https://arxiv.org/pdf/2403.08632


Abstract

We revisit the “dataset classification” experiment suggested by Torralba & Efros (2011) a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from: e.g., we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be explained by memorization. We hope our discovery will inspire the community to rethink issues involving dataset bias.
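
To make the setup of a dataset classification task concrete, here is a minimal sketch in Python. It is not the authors' method: the paper trains neural networks on real images, whereas this toy uses synthetic one-dimensional features and a nearest-centroid classifier. The feature distribution, the 80/20 split, and all function names (`make_samples`, `fit_centroids`, `predict`, `accuracy`) are assumptions made for illustration only. The sketch shows just the structure of the task: every sample is labeled with the index of its source dataset, a classifier is fit on one split, and its accuracy on a held-out split is compared with chance (1/3 for three datasets).

```python
import random

# Dataset names come from the abstract; all data below is synthetic.
DATASETS = ["YFCC", "CC", "DataComp"]


def make_samples(n_per_dataset, seed=0):
    """Return shuffled (feature, label) pairs; label = index of the source dataset."""
    rng = random.Random(seed)
    samples = []
    for label in range(len(DATASETS)):
        for _ in range(n_per_dataset):
            # Hypothetical 1-D feature whose mean shifts per dataset, standing in
            # for whatever dataset-specific signature a real network might detect.
            samples.append((rng.gauss(float(label), 1.0), label))
    rng.shuffle(samples)
    return samples


def fit_centroids(train):
    """Compute the mean feature value of each dataset on the training split."""
    sums = [0.0] * len(DATASETS)
    counts = [0] * len(DATASETS)
    for x, y in train:
        sums[y] += x
        counts[y] += 1
    return [s / c for s, c in zip(sums, counts)]


def predict(centroids, x):
    """Predict the dataset whose centroid is closest to the feature x."""
    return min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))


def accuracy(centroids, data):
    """Fraction of samples whose source dataset is predicted correctly."""
    return sum(predict(centroids, x) == y for x, y in data) / len(data)


samples = make_samples(1000)
split = int(0.8 * len(samples))  # assumed 80/20 train/held-out split
train, val = samples[:split], samples[split:]

centroids = fit_centroids(train)
val_acc = accuracy(centroids, val)
chance = 1 / len(DATASETS)
print(f"held-out accuracy: {val_acc:.3f} (chance: {chance:.3f})")
```

Held-out accuracy well above chance indicates that the samples carry a signature of their source dataset. In this toy the signature is built in by construction, so the sketch says nothing about why real datasets are distinguishable.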