Dataset & Curation
Snapr's training data was sourced, filtered, cleaned, embedded, and sampled at scale. This section documents the entire process.
Related Notebooks:
- Data Curation
- Embeddings & ChromaDB
Source
We used the Amazon Reviews 2023 dataset by McAuley Lab, a large-scale dataset of 2,811,408 items, focusing on 8 product categories:
- Automotive
- Electronics
- Office Products
- Tools & Home Improvement
- Cell Phones & Accessories
- Toys & Games
- Appliances
- Musical Instruments
Each product entry includes metadata such as price, title, description, and features.
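A hedged sketch of loading one category's metadata from the Hugging Face Hub; the `raw_meta_<Category>` config and `full` split names follow the dataset card's conventions, and may differ from the project's actual loader.

```python
from datasets import load_dataset

# One of the eight categories; repeat per category as needed.
meta = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",
    "raw_meta_Electronics",
    split="full",
    trust_remote_code=True,
)
print(meta[0]["title"], meta[0]["price"])
```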
Filtering Logic
Items were filtered using the following rules (a code sketch follows the list):
- Price range: $0.50 ≤ price ≤ $999.49
- Minimum text length: ≥ 300 characters
- Tokenized prompt length: 150–160 tokens, measured with the LLaMA tokenizer, chosen because it encodes numeric values (e.g., 123) as a single token, making token estimation more stable and convenient for our use case
- Noise removal: stripped boilerplate phrases and irrelevant product codes
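A minimal sketch of these rules, assuming the raw items live in a pandas DataFrame `df` with `price` and `text` columns. The tokenizer checkpoint is an assumption (any LLaMA-family tokenizer works), and the token rule is read here as a keep-filter rather than a truncation budget.

```python
import pandas as pd
from transformers import AutoTokenizer

# Assumption: a LLaMA-family tokenizer; the exact (gated) checkpoint is not named in the text.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

def keep(row: pd.Series) -> bool:
    """Apply the price, text-length, and token-count rules to one item."""
    if not (0.50 <= row["price"] <= 999.49):
        return False
    if len(row["text"]) < 300:
        return False
    # Read as a filter on prompt token count; a truncation to the same
    # 150-160 budget would use the same tokenizer call.
    n_tokens = len(tokenizer.encode(row["text"], add_special_tokens=False))
    return 150 <= n_tokens <= 160

filtered = df[df.apply(keep, axis=1)]  # `df` is a hypothetical raw-items frame
```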
Sampling Strategy
To ensure a balanced dataset (sketched in code below):
- All items were kept if:
  - Price ≥ $240
  - Group size (by rounded price) ≤ 1200
- Otherwise:
  - Sampled up to 1200 items per price group
  - Gave 5× weight to rare categories, 1× to overrepresented ones (e.g., Automotive)
Final curated dataset size: 409,172 items
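A hedged sketch of the sampling pass, reusing the hypothetical `filtered` frame from above. It reads the keep-all rule as "either condition holds", and treats Automotive as the 1×-weight category, both assumptions based on the list above.

```python
import pandas as pd

def sample_group(group: pd.DataFrame, cap: int = 1200, seed: int = 42) -> pd.DataFrame:
    """Sample one rounded-price group down to at most `cap` items."""
    # Keep everything when either condition from the list above holds.
    if group["price_group"].iloc[0] >= 240 or len(group) <= cap:
        return group
    # Otherwise weight rare categories 5x, overrepresented ones 1x.
    weights = group["category"].map(lambda c: 1 if c == "Automotive" else 5)
    return group.sample(n=cap, weights=weights, random_state=seed)

sampled = (
    filtered.assign(price_group=filtered["price"].round())
            .groupby("price_group", group_keys=False)
            .apply(sample_group)
)
```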
Train/Test Split
The dataset was randomly shuffled with seed=42:
- Train set: 400,000 items
- Test set: 2,000 items
This split was used primarily to train and evaluate the fine-tuned LLaMA model.
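A minimal sketch of the split; only the seed and the split sizes come from the text, and taking the test items from positions 400,000–402,000 is an assumption.

```python
import random

random.seed(42)
items = sampled.to_dict("records")  # reuse the curated frame from above
random.shuffle(items)

train = items[:400_000]
test = items[400_000:402_000]
```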
Storage & Hosting
The final dataset is pushed to the Hugging Face Hub.
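A hedged sketch of the push using the `datasets` library; the repo id is a placeholder, not the project's actual repo.

```python
from datasets import Dataset, DatasetDict

dataset = DatasetDict({
    "train": Dataset.from_list(train),
    "test": Dataset.from_list(test),
})
# Placeholder repo id; requires a prior `huggingface-cli login`.
dataset.push_to_hub("your-username/snapr-curated-products")
```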
Embeddings & ChromaDB
We used the intfloat/e5-small-v2 model to embed all product descriptions (sketched after the list):
- "passage:" prefix applied for each input
- Embeddings were stored in ChromaDB (hosted on AWS)
- Used for:
  - Retrieval in the RAG pipeline
  - Feature vectors in XGBoost model training
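A minimal sketch of the embedding pass, assuming sentence-transformers and a remote Chroma server; the host, port, collection name, ids, and embedding normalization are assumptions, and batching is omitted for brevity.

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")

# Assumption: a Chroma server reachable over HTTP on AWS.
client = chromadb.HttpClient(host="your-aws-host", port=8000)
collection = client.get_or_create_collection("products")

texts = [item["text"] for item in train]  # hypothetical key; the curated description field
# E5 convention: documents take the "passage: " prefix.
embeddings = model.encode(
    [f"passage: {t}" for t in texts],
    normalize_embeddings=True,  # assumption: cosine-style similarity
)

collection.add(
    ids=[str(i) for i in range(len(texts))],
    embeddings=embeddings.tolist(),
    documents=texts,
)
```

At query time, the matching E5 convention is to prefix the search string with "query: " before embedding it.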