I’m building an AI Search system where a user types a query and the system performs a similarity search against a document corpus. While working on the initialization, I realized that both the query and the documents could benefit from preprocessing, optimization, and careful handling before any similarity computations are run.
Instead of figuring out all the details myself, I’m wondering if there’s a blueprint, best-practice guide, or reference implementation for building an end-to-end AI Search pipeline — from query/document preprocessing to embedding, indexing, and retrieval.
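To make the question concrete, here is roughly the flow I have in mind, as a minimal sketch. I’m assuming sentence-transformers for the embedding model and a plain in-memory cosine-similarity lookup purely for illustration; these are placeholder choices, not the libraries or model I’m committed to:

```python
# Rough sketch of the pipeline: preprocess -> embed -> "index" -> retrieve.
# Library and model choices here are illustrative assumptions only.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

def preprocess(text: str) -> str:
    """Basic cleanup: collapse whitespace, strip, lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()

# 1. Document corpus preprocessing
documents = [
    "How to configure the search index.",
    "Embedding models map text to dense vectors.",
    "Retrieval ranks documents by similarity to the query.",
]
clean_docs = [preprocess(d) for d in documents]

# 2. Embedding / "indexing" (here just a normalized matrix held in memory;
#    a vector database or ANN index would replace this at scale)
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_matrix = model.encode(clean_docs, normalize_embeddings=True)

# 3. Query time: preprocess, embed, score by cosine similarity, return top-k
def search(query: str, k: int = 2):
    q_vec = model.encode([preprocess(query)], normalize_embeddings=True)[0]
    scores = doc_matrix @ q_vec  # cosine similarity (vectors are unit-norm)
    top = np.argsort(-scores)[:k]
    return [(documents[i], float(scores[i])) for i in top]

print(search("how do I rank documents against a user query?"))
```

The part I’m least sure about is everything before the encode/indexing step: what preprocessing and optimization is actually recommended, and how the query and documents should be handled so the similarity computation behaves well.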
Any guidance, references, or examples would be greatly appreciated.
If I may offer some feedback: the documentation as a whole is confusing. It repeats the same information at different levels of detail, which makes it hard to follow, and even the numbering is inconsistent. The content itself is very helpful, but the way it is presented makes it nearly impossible to use effectively; even asking an LLM for help doesn’t resolve the confusion, which leaves it essentially unusable.
Hi @EroStefano, a while back we wrote a blog post about how to tackle this problem of contextual search over a document corpus using embeddings and other techniques; you can find it here.