Hi all,
I’m building a RAG (Retrieval-Augmented Generation) application over my dataset of reports. The goal: given a problem statement, return the reports that match it most closely.
Current Approach
- Chunking strategy:
  - Initially, I converted each report into one chunk.
  - Each chunk is vectorized, then stored in FAISS for dense retrieval.
  - Retrieval is done by embedding the problem statement and searching for top matches (rough sketch of this pipeline below).
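For reference, this is roughly what the current setup looks like. It's a simplified sketch, not my exact code: the model name, report texts, and k are placeholders.

```python
# Rough sketch of the current setup: one chunk per report, dense FAISS retrieval.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

reports = ["report text 1 ...", "report text 2 ..."]  # in reality: many long reports

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
embeddings = np.asarray(
    model.encode(reports, normalize_embeddings=True), dtype="float32"
)

# Inner product over L2-normalized vectors == cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = "problem statement goes here"
query_vec = np.asarray(
    model.encode([query], normalize_embeddings=True), dtype="float32"
)

k = min(5, index.ntotal)
scores, ids = index.search(query_vec, k)  # top-k dense retrieval, no score cutoff
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}", reports[i][:80])
```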
- Variants I tried:
  - Dense FAISS search only → Works, but sometimes returns unrelated reports.
  - Sparse search (BM25) → Slight improvement in keyword matching, but still misses some exact mentions.
  - Hybrid dense + sparse search → Combined scores, still inconsistent results (rough sketch of the score fusion below).
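The hybrid variant combines the two score lists roughly like this. Again a simplified sketch: the min-max normalization and the 0.5/0.5 weights are just what I picked by hand, and the dense scores are hard-coded placeholders standing in for the FAISS results above.

```python
# Rough sketch of the hybrid scoring: min-max normalize both score lists,
# then take a weighted sum. Weights and dense scores are placeholders.
import numpy as np
from rank_bm25 import BM25Okapi

def min_max(scores):
    scores = np.asarray(scores, dtype="float32")
    rng = scores.max() - scores.min()
    return (scores - scores.min()) / rng if rng > 0 else np.zeros_like(scores)

reports = ["report text 1 ...", "report text 2 ...", "report text 3 ..."]
query = "problem statement goes here"

# Sparse side: BM25 over whitespace-tokenized reports
bm25 = BM25Okapi([r.lower().split() for r in reports])
sparse_scores = bm25.get_scores(query.lower().split())

# Dense side: these would come from the FAISS search above (one score per report)
dense_scores = np.array([0.42, 0.31, 0.55], dtype="float32")  # placeholder values

combined = 0.5 * min_max(dense_scores) + 0.5 * min_max(sparse_scores)
for i in np.argsort(combined)[::-1]:
    print(f"{combined[i]:.3f}", reports[i][:60])
```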
- Keyword column approach:
  - I added a separate column with keywords extracted from the problem.
  - Retrieval sometimes improved, but it’s still not perfect: some unrelated reports are returned and, worse, some exact matches are not returned. (Rough sketch of how the keyword column is used below.)
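The keyword column is used along these lines. This is a minimal sketch that assumes a simple overlap score; the example keywords are made up, and the extraction itself is a separate step.

```python
# Rough sketch of the keyword-column idea: each report row carries a list of
# extracted keywords, and overlap with the query's keywords gives an extra score.
import pandas as pd

df = pd.DataFrame({
    "report": ["report text 1 ...", "report text 2 ..."],
    "keywords": [["pump", "vibration", "bearing"], ["valve", "leak", "pressure"]],
})

query_keywords = {"pump", "vibration", "noise"}  # extracted from the problem statement

def keyword_overlap(report_keywords):
    # Fraction of query keywords that also appear in the report's keyword column
    return len(query_keywords & set(report_keywords)) / len(query_keywords)

df["keyword_score"] = df["keywords"].apply(keyword_overlap)
print(df[["keywords", "keyword_score"]])
```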
Main Problems
- Low retrieval accuracy: irrelevant chunks sometimes show up in the top results.
- Missed obvious matches: even if the problem statement is literally mentioned in a report, that report is sometimes not returned.
- No control over the similarity threshold: FAISS returns the top-k results, but I’d like to set a minimum similarity score so irrelevant matches can be filtered out.
Questions
- Is there a better chunking strategy for long reports to improve retrieval accuracy?
- Are there embedding models better suited for exact + semantic matching (dense + keyword) in my case?
- How can I set a similarity threshold in FAISS so that results below a certain score are discarded?
- Any tips for re-ranking results after retrieval to boost accuracy?