Dataset & Curation
Snapr's training data was sourced, filtered, cleaned, embedded, and sampled at scale. This section documents the entire process.
Related Notebooks:
- Data Curation
- Embeddings & ChromaDB
Source
We used the Amazon Reviews 2023 dataset by McAuley Lab, a large-scale dataset of 2,811,408 items, focusing on 8 product categories:
- Automotive
- Electronics
- Office Products
- Tools & Home Improvement
- Cell Phones & Accessories
- Toys & Games
- Appliances
- Musical Instruments
Each product entry includes metadata such as price, title, description, and features.
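A hedged sketch of loading one category's metadata from the Hugging Face Hub; the `raw_meta_<Category>` config and `full` split names follow the dataset card's conventions, and may differ from the project's actual loader.

```python
from datasets import load_dataset

# One of the eight categories; repeat per category as needed.
meta = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",
    "raw_meta_Electronics",
    split="full",
    trust_remote_code=True,
)
print(meta[0]["title"], meta[0]["price"])
```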
Filtering Logic
Items were filtered using the following rules (a code sketch follows the list):
- Price range: $0.50 ≤ price ≤ $999.49
- Minimum text length: ≥ 300 characters
- Tokenized prompt length: 150–160 tokens, measured with the LLaMA tokenizer, chosen because it encodes numeric values (e.g., 123) as a single token, making token estimation more stable and convenient for our use case
- Noise removal: stripped boilerplate phrases and irrelevant product codes
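A minimal sketch of these rules, assuming the raw items live in a pandas DataFrame `df` with `price` and `text` columns. The tokenizer checkpoint is an assumption (any LLaMA-family tokenizer works), and the token rule is read here as a keep-filter rather than a truncation budget.

```python
import pandas as pd
from transformers import AutoTokenizer

# Assumption: a LLaMA-family tokenizer; the exact (gated) checkpoint is not named in the text.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

def keep(row: pd.Series) -> bool:
    """Apply the price, text-length, and token-count rules to one item."""
    if not (0.50 <= row["price"] <= 999.49):
        return False
    if len(row["text"]) < 300:
        return False
    # Read as a filter on prompt token count; a truncation to the same
    # 150-160 budget would use the same tokenizer call.
    n_tokens = len(tokenizer.encode(row["text"], add_special_tokens=False))
    return 150 <= n_tokens <= 160

filtered = df[df.apply(keep, axis=1)]  # `df` is a hypothetical raw-items frame
```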
Sampling Strategy
To ensure a balanced dataset (sketched in code below):
- All items were kept if:
  - Price ≥ $240
  - Group size (by rounded price) ≤ 1200
- Otherwise:
  - Sampled up to 1200 items per price group
  - Gave 5× weight to rare categories, 1× to overrepresented ones (e.g., Automotive)
Final curated dataset size: 409,172 items
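A hedged sketch of the sampling pass, reusing the hypothetical `filtered` frame from above. It reads the keep-all rule as "either condition holds", and treats Automotive as the 1×-weight category, both assumptions based on the list above.

```python
import pandas as pd

def sample_group(group: pd.DataFrame, cap: int = 1200, seed: int = 42) -> pd.DataFrame:
    """Sample one rounded-price group down to at most `cap` items."""
    # Keep everything when either condition from the list above holds.
    if group["price_group"].iloc[0] >= 240 or len(group) <= cap:
        return group
    # Otherwise weight rare categories 5x, overrepresented ones 1x.
    weights = group["category"].map(lambda c: 1 if c == "Automotive" else 5)
    return group.sample(n=cap, weights=weights, random_state=seed)

sampled = (
    filtered.assign(price_group=filtered["price"].round())
            .groupby("price_group", group_keys=False)
            .apply(sample_group)
)
```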
Train/Test Split
The dataset was randomly shuffled with seed=42:
- Train set: 400,000 items
- Test set: 2,000 items
This split was used primarily to train and evaluate the fine-tuned LLaMA model.
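A minimal sketch of the split; only the seed and the split sizes come from the text, and taking the test items from positions 400,000–402,000 is an assumption.

```python
import random

random.seed(42)
items = sampled.to_dict("records")  # reuse the curated frame from above
random.shuffle(items)

train = items[:400_000]
test = items[400_000:402_000]
```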
Storage & Hosting
The final dataset is pushed to the Hugging Face Hub.
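A hedged sketch of the push using the `datasets` library; the repo id is a placeholder, not the project's actual repo.

```python
from datasets import Dataset, DatasetDict

dataset = DatasetDict({
    "train": Dataset.from_list(train),
    "test": Dataset.from_list(test),
})
# Placeholder repo id; requires a prior `huggingface-cli login`.
dataset.push_to_hub("your-username/snapr-curated-products")
```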
Embeddings & ChromaDB
We used the intfloat/e5-small-v2 model to embed all product descriptions (sketched after the list):
- "passage:" prefix applied for each input
- Embeddings were stored in ChromaDB (hosted on AWS)
- Used for:
  - Retrieval in the RAG pipeline
  - Feature vectors in XGBoost model training
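A minimal sketch of the embedding pass, assuming sentence-transformers and a remote Chroma server; the host, port, collection name, ids, and embedding normalization are assumptions, and batching is omitted for brevity.

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")

# Assumption: a Chroma server reachable over HTTP on AWS.
client = chromadb.HttpClient(host="your-aws-host", port=8000)
collection = client.get_or_create_collection("products")

texts = [item["text"] for item in train]  # hypothetical key; the curated description field
# E5 convention: documents take the "passage: " prefix.
embeddings = model.encode(
    [f"passage: {t}" for t in texts],
    normalize_embeddings=True,  # assumption: cosine-style similarity
)

collection.add(
    ids=[str(i) for i in range(len(texts))],
    embeddings=embeddings.tolist(),
    documents=texts,
)
```

At query time, the matching E5 convention is to prefix the search string with "query: " before embedding it.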