Abstract
Web-crawled datasets have enabled remarkable generalization capabilities in
recent image-text models such as CLIP (Contrastive Language-Image Pre-training)
or Flamingo, but little is known about the dataset creation processes. In this
work, we introduce a testbed of six publicly available data sources (YFCC,
LAION, Conceptual Captions, WIT, RedCaps, and Shutterstock) to investigate how
pre-training distributions induce robustness in CLIP. We find that the
performance of the pre-training data varies substantially across distribution
shifts, with no single data source dominating. Moreover, we systematically
study the interactions between these data sources and find that combining
multiple sources does not necessarily yield better models, but rather dilutes
the robustness of the best individual data source. We complement our empirical
findings with theoretical insights from a simple setting, where combining the
training data also results in diluted robustness. In addition, our theoretical
model provides a candidate explanation for the success of the CLIP-based data
filtering technique recently employed in the LAION dataset. Overall, our results
demonstrate that simply gathering a large amount of data from the web is not
the most effective way to build a pre-training dataset for robust
generalization, necessitating further study into dataset design. Code is
available at https://github.com/mlfoundations/clip_quality_not_quantity.
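For concreteness, the CLIP-based filtering mentioned above scores each candidate image-text pair by the cosine similarity of its CLIP image and text embeddings and discards low-scoring pairs. The following is a minimal sketch of that idea, assuming the open-source `clip` package (https://github.com/openai/CLIP); the 0.3 threshold mirrors the published LAION-400M recipe, while the function names and single-pair scoring are illustrative rather than the pipeline used in this paper.

```python
# Hedged sketch: CLIP-score filtering of image-text pairs.
# Keep a pair only if the cosine similarity of its CLIP embeddings
# exceeds a fixed threshold (0.3, as reported for LAION-400M).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

def filter_pairs(pairs, threshold=0.3):
    """Return only the (image_path, caption) pairs above the CLIP-score threshold."""
    return [(p, c) for p, c in pairs if clip_score(p, c) > threshold]
```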