This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Jesse_marqo on 2024-10-20 21:51:41+00:00.


We have finally released the Marqo Google Shopping 10 million dataset (Marqo-GS-10M) on Hugging Face. It is one of the largest and richest datasets for multimodal product retrieval!

  • 10M rows of query, product title, image and rank (1-100)
  • ~100k unique queries
  • ~5M unique products across fashion and home
  • Reflects real-world data and use cases and serves as a good benchmark for method development
  • Proper data splits: in-domain, novel query, novel document, and novel query with novel document.
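
To make the four splits concrete, here is a rough sketch of how query-document pairs can be bucketed once some queries and some documents are held out. This is only an illustration, not how Marqo-GS-10M was actually built: the function name `four_way_split`, the bucket names, and the 20% holdout fraction are all assumptions made up for this example.

```python
import random
from collections import defaultdict


def four_way_split(pairs, holdout_frac=0.2, seed=0):
    """Bucket (query, doc) pairs by whether the query and/or doc is held out.

    Hypothetical helper for illustration; not the dataset's actual procedure.
    """
    rng = random.Random(seed)
    queries = sorted({q for q, _ in pairs})
    docs = sorted({d for _, d in pairs})

    # Hold out a fraction of queries and documents as "novel" (unseen in training).
    novel_queries = set(rng.sample(queries, int(len(queries) * holdout_frac)))
    novel_docs = set(rng.sample(docs, int(len(docs) * holdout_frac)))

    buckets = defaultdict(list)
    for q, d in pairs:
        if q in novel_queries and d in novel_docs:
            name = "novel_query_novel_document"
        elif q in novel_queries:
            name = "novel_query"
        elif d in novel_docs:
            name = "novel_document"
        else:
            # Seen query and seen document; training pairs would also come from here.
            name = "in_domain"
        buckets[name].append((q, d))
    return dict(buckets)


if __name__ == "__main__":
    toy_pairs = [(f"q{i}", f"d{j}") for i in range(10) for j in range(10)]
    for name, rows in sorted(four_way_split(toy_pairs).items()):
        print(name, len(rows))
```

The point of splitting this way is that each bucket tests a different kind of generalization: to unseen queries, to unseen products, or to both at once.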

The dataset features detailed relevance scores for each query-document pair to facilitate future research and evaluation.
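
Graded relevance scores like these are typically used with a rank-aware metric such as nDCG. Below is a minimal sketch of that metric in plain Python. It assumes you already have a list of relevance scores in the order your system returned the products; how those scores map onto the dataset's rank field is not specified here, and the function names are invented for this example.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """nDCG for one query.

    relevances: graded relevance of each returned item, in system order.
    Assumes the list covers every judged item for the query, so the ideal
    ordering is simply the same scores sorted from high to low.
    """
    ideal = sorted(relevances, reverse=True)
    if k is not None:
        relevances, ideal = relevances[:k], ideal[:k]
    best = dcg(ideal)
    return dcg(relevances) / best if best > 0 else 0.0


if __name__ == "__main__":
    print(ndcg([3, 2, 1]))       # already in ideal order
    print(ndcg([1, 2, 3]))       # reversed order scores lower
    print(ndcg([1, 3, 2], k=2))  # only the top 2 positions count
```

Averaging `ndcg` over all queries in a split gives a single number per split that can be compared across methods.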

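# Install first: pip install datasets (in a notebook cell: !pip install datasets)
from datasets import load_dataset

# Download Marqo-GS-10M from the Hugging Face Hub
ds = load_dataset("Marqo/marqo-GS-10M")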

We curated this large-scale dataset as part of the publication of our training framework: Generalized Contrastive Learning (GCL).

Dataset:

GCL:

Paper: