I am integrating a vector-based semantic search system into a B2B ecommerce platform’s product search, and I want to select the right text embedding model.
Use Case
User queries are often:
- Very short (1–4 words)
- Ungrammatical
- Misspelled
- Heavy on specifications or abbreviations (e.g., “m12 nut”, “2hp pump”, “ss tank 1000l”)
- Full of domain-specific technical terms
Each product has:
- Title
- Attribute fields (e.g., Material=SS, Voltage=220V)
- Description text
I need embeddings that capture semantic meaning across these fields and match them with noisy, spec-heavy queries.
Constraints / Setup
- English-only
- Running on GPU (model size not a constraint)
- Throughput: ~100 queries per second
- Retrieval backend not yet decided, but most likely Vespa
- Fine-tuning will come later; I first need a strong base embedding model
Questions
- Which open-source embedding models work best out of the box for ecommerce/product search?
- Are there any models that are trained or tuned specifically for ecommerce data?
- Should I embed (title + attributes + description) concatenated as a single document, or embed fields separately and combine?
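To make the last question concrete, the two options could be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_embed` is a placeholder (hashed character trigrams) standing in for whatever real embedding model ends up being used, and the product record, the `FIELD_WEIGHTS` values, and all function names are hypothetical.

```python
import hashlib
import math

DIM = 64  # toy dimensionality, chosen arbitrarily for this sketch


def toy_embed(text: str) -> list[float]:
    """Placeholder embedder (NOT a real model): hashed character trigrams,
    L2-normalized. Swap in the real model's encode call here."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        idx = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Both inputs are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def attrs_to_text(attrs: dict[str, str]) -> str:
    # Serialize attributes in the same Key=Value form used in the catalog.
    return ", ".join(f"{k}={v}" for k, v in attrs.items())


# Hypothetical product record with the three fields described above.
product = {
    "title": "Stainless steel hex nut M12",
    "attributes": {"Material": "SS", "Thread": "M12"},
    "description": "Corrosion-resistant hex nut for industrial fastening.",
}


# Option A: concatenate all fields into one document and embed once.
def embed_concatenated(p: dict) -> list[float]:
    text = f"{p['title']}. {attrs_to_text(p['attributes'])}. {p['description']}"
    return toy_embed(text)


# Option B: embed each field separately and combine scores at query time.
FIELD_WEIGHTS = {"title": 0.5, "attributes": 0.3, "description": 0.2}  # assumed values


def embed_fields(p: dict) -> dict[str, list[float]]:
    return {
        "title": toy_embed(p["title"]),
        "attributes": toy_embed(attrs_to_text(p["attributes"])),
        "description": toy_embed(p["description"]),
    }


def score_fields(query_vec: list[float], field_vecs: dict[str, list[float]]) -> float:
    # Weighted sum of per-field cosine similarities.
    return sum(FIELD_WEIGHTS[name] * cosine(query_vec, vec)
               for name, vec in field_vecs.items())


query_vec = toy_embed("ss nut m12")
score_a = cosine(query_vec, embed_concatenated(product))
score_b = score_fields(query_vec, embed_fields(product))
print(f"concatenated: {score_a:.3f}  per-field: {score_b:.3f}")
```

Option A stores one vector per product, while Option B stores one vector per field and needs a combination step at ranking time. The scores printed by the toy embedder say nothing about which option works better with a real model; the sketch only shows the shape of each approach.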
Example queries
- “2hp motor pump”
- “ss nut m12”
- “isi water tank 1000l”
- “sewing macine” (misspelled)
Any guidance or practical experience with embedding models for ecommerce search would be appreciated.