
🗂️ Dataset & Curation

Snapr's training data was sourced, filtered, cleaned, sampled, and embedded at scale. This section documents the entire process.

🔗 Source

We used the Amazon Reviews 2023 dataset from McAuley Lab, a large-scale dataset of 2,811,408 items, focusing on 8 product categories:

  • Automotive
  • Electronics
  • Office Products
  • Tools & Home Improvement
  • Cell Phones & Accessories
  • Toys & Games
  • Appliances
  • Musical Instruments

Each product entry includes metadata such as price, title, description, and features.
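For reference, one category's product metadata can be pulled straight from the Hub with the datasets library. A minimal sketch, assuming the config naming of the public McAuley-Lab/Amazon-Reviews-2023 repository (Appliances shown as an example):

```python
from datasets import load_dataset

# Load product metadata for one category; the repository exposes one
# "raw_meta_<Category>" config per category, with a single "full" split.
meta = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",
    "raw_meta_Appliances",
    split="full",
    trust_remote_code=True,
)
print(meta[0]["title"], meta[0]["price"])
```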


🧹 Filtering Logic

Items were filtered using the following rules:

  • Price range: $0.50 ≤ price ≤ $999.49
  • Minimum text length: ≥ 300 characters
  • Tokenized prompt length: 150–160 tokens, measured with the LLaMA tokenizer. We chose this tokenizer because it encodes short numeric values (e.g., 123) as a single token, which makes token counting more stable and convenient for price-bearing text.
  • Noise removal: stripped boilerplate phrases and irrelevant product codes (see the sketch below)
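A minimal sketch of these filters, assuming string title/description fields and a Llama 3.1 tokenizer checkpoint (the docs say only "the LLaMA tokenizer"); the noise patterns shown are illustrative, not the project's actual list:

```python
import re
from transformers import AutoTokenizer

# Assumed tokenizer checkpoint (gated model, requires HF access).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

MIN_PRICE, MAX_PRICE = 0.50, 999.49
MIN_CHARS = 300
MIN_TOKENS, MAX_TOKENS = 150, 160

# Illustrative boilerplate patterns; the real noise list is project-specific.
NOISE = re.compile(r"(Batteries Included\?|Date First Available|\b[A-Z0-9]{10,}\b)")

def clean(text: str) -> str:
    """Strip boilerplate phrases and long product codes."""
    return NOISE.sub("", text)

def keep(item: dict) -> bool:
    """Apply the price, length, and token-count filters to one item."""
    price = item.get("price")  # assumed numeric (or numeric string) here
    if price is None or not (MIN_PRICE <= float(price) <= MAX_PRICE):
        return False
    text = clean(f"{item.get('title', '')} {item.get('description', '')}")
    if len(text) < MIN_CHARS:
        return False
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    return MIN_TOKENS <= n_tokens <= MAX_TOKENS
```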

🔄 Sampling Strategy

To ensure a balanced dataset:

  • All items were kept if either:
    • Price ≥ $240, or
    • Group size (items sharing the same rounded price) ≤ 1200
  • Otherwise, within each price group we:
    • Sampled up to 1200 items
    • Gave 5× weight to rare categories and 1× to overrepresented ones (e.g., Automotive), as sketched below
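A sketch of this sampling pass, under our reading that the two keep conditions are alternatives. The variable filtered_items, the random seed, and treating Automotive as the only 1× category are assumptions:

```python
from collections import defaultdict
import numpy as np

MAX_PER_SLOT = 1200
rng = np.random.default_rng(42)  # the seed here is an assumption

# Group items into "slots" by rounded price.
slots = defaultdict(list)
for item in filtered_items:      # output of the filtering step above
    slots[round(item["price"])].append(item)

curated = []
for price, group in slots.items():
    if price >= 240 or len(group) <= MAX_PER_SLOT:
        curated.extend(group)    # keep the whole group
    else:
        # 5x weight for rare categories, 1x for overrepresented ones;
        # we assume Automotive is the only 1x category, per the example.
        weights = np.array([1.0 if it["category"] == "Automotive" else 5.0
                            for it in group])
        weights /= weights.sum()
        idx = rng.choice(len(group), size=MAX_PER_SLOT, replace=False, p=weights)
        curated.extend(group[i] for i in idx)
```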

Final curated dataset size: 409,172 items


🧪 Train/Test Split

The dataset was randomly shuffled with a fixed seed (seed=42) and split:

  • Train set: 400,000 items
  • Test set: 2,000 items

This split is used primarily to train and evaluate the fine-tuned LLaMA model.
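The split itself is a straightforward shuffle-and-slice; only seed=42 and the set sizes come from the docs, the rest is a sketch:

```python
import random

random.seed(42)
random.shuffle(curated)          # curated: the 409,172-item list from above

train = curated[:400_000]        # 400,000 training items
test = curated[400_000:402_000]  # 2,000 held-out test items
```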


☁️ Storage & Hosting

The final dataset is pushed to the Hugging Face Hub.
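A sketch using the datasets library; the repository id is a placeholder, since the docs don't name the actual Hub repo:

```python
from datasets import Dataset, DatasetDict

# Placeholder repo id; substitute the project's real repository.
REPO_ID = "your-username/snapr-pricer-data"

DatasetDict({
    "train": Dataset.from_list(train),
    "test": Dataset.from_list(test),
}).push_to_hub(REPO_ID)
```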


🔍 Embeddings & ChromaDB

We used the intfloat/e5-small-v2 model to embed all product descriptions:

  • A "passage:" prefix was prepended to each input, as E5 models expect for documents
  • Embeddings were stored in ChromaDB (hosted on AWS)
  • Used for:
    • Retrieval in the RAG pipeline
    • Feature vectors in XGBoost model training
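A sketch of the embed-and-index loop with sentence-transformers and the Chroma HTTP client; the host, port, collection name, batch size, and metadata fields are assumptions:

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")

# Remote Chroma server; host, port, and collection name are placeholders.
client = chromadb.HttpClient(host="chroma.example.aws", port=8000)
collection = client.get_or_create_collection("products")

BATCH = 1024
for start in range(0, len(train), BATCH):
    batch = train[start:start + BATCH]
    # E5 models expect a "passage:" prefix on documents being indexed
    # (and a "query:" prefix at query time).
    texts = [f"passage: {item['description']}" for item in batch]
    vectors = model.encode(texts, normalize_embeddings=True)
    collection.add(
        ids=[str(start + i) for i in range(len(batch))],
        embeddings=vectors.tolist(),
        documents=[item["description"] for item in batch],
        metadatas=[{"price": item["price"], "category": item["category"]}
                   for item in batch],
    )
```

The same collection then serves both consumers listed above: nearest-neighbor lookups for the RAG pipeline, and raw embedding vectors as features for the XGBoost model.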