How to Improve RAG Retrieval Accuracy and Control Similarity Threshold in FAISS / Hybrid Search

Hi all,

I’m building a RAG (Retrieval-Augmented Generation) application over a large collection of reports. The goal: given a problem statement, return the reports that most closely match it.

Current Approach

  1. Chunking strategy:

    • Initially, I converted each report into one chunk.

    • Each chunk is vectorized, then stored in FAISS for dense retrieval.

    • Retrieval is done by embedding the problem statement and searching for top matches.

  2. Variants I tried:

    • Dense FAISS search only → Works, but sometimes returns unrelated reports.

    • Sparse search (BM25) → Slight improvement in keyword matching, but still misses some exact mentions.

    • Hybrid dense + sparse search → Combined scores, still inconsistent results.

  3. Keyword column approach:

    • I added a separate column with keywords extracted from the problem.

    • Retrieval sometimes improved, but it is still not perfect: some unrelated reports are returned and, worse, some exact matches are not returned.
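
For reference, a minimal sketch of one common way to combine a dense and a sparse ranking, reciprocal rank fusion (RRF). It is not necessarily the combination used in the hybrid variant above. RRF uses only rank positions, so FAISS and BM25 scores, which live on different scales, do not need to be normalized against each other. The document ids and the helper name `reciprocal_rank_fusion` are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant; 60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks near the top.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["r7", "r2", "r9"]   # hypothetical ids from the FAISS search
sparse_hits = ["r2", "r4", "r7"]  # hypothetical ids from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Reports that appear in both lists ("r2", "r7") end up ahead of reports that only one retriever found, which is usually the behavior wanted from a hybrid setup.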

Main Problems

  • Low retrieval accuracy: Sometimes irrelevant chunks are in the top results.

  • Missed obvious matches: Even if the problem statement is literally mentioned in the report, it is sometimes not returned.

  • No control over similarity threshold: FAISS returns top-k results, but I’d like to set a minimum similarity score so irrelevant matches can be filtered out.

Questions

  1. Is there a better chunking strategy for long reports to improve retrieval accuracy?

  2. Are there embedding models better suited for exact + semantic matching (dense + keyword) in my case?

  3. How can I set a similarity threshold in FAISS so that results below a certain score are discarded?

  4. Any tips for re-ranking results after retrieval to boost accuracy?


1. “Semantic chunking” or so?
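
A rough sketch of the idea, in case it helps: split the report into sentences, compare each sentence with the previous one, and start a new chunk when the similarity drops. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `min_similarity=0.2` is an arbitrary cutoff that would need tuning on real reports.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(text, embed=toy_embed, min_similarity=0.2):
    """Group consecutive sentences; break where adjacent similarity drops."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        cur = embed(sentence)
        if cosine(prev, cur) < min_similarity:
            chunks.append([sentence])    # topic shift: start a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        prev = cur
    return [" ".join(chunk) for chunk in chunks]


report = (
    "The pump failed after the seal leaked. "
    "The seal leaked because the pump overheated. "
    "Invoices are due within thirty days. "
    "Late invoices are charged a fee."
)
chunks = semantic_chunks(report)
for chunk in chunks:
    print(chunk)
```

Each chunk would then be embedded and indexed on its own, keeping the report id alongside it so that chunk hits can be mapped back to whole reports.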

2. For example, this one. For more, I recommend searching on the MTEB leaderboard.

3. The FAISS wiki page “MetricType and distances”?
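
To make that concrete, a minimal sketch of filtering a top-k result by score. Which direction the comparison goes depends on the index metric: with an inner-product index over L2-normalized vectors the score is cosine similarity (higher is better), while an L2 index returns distances (lower is better; `IndexFlatL2` reports squared distances). The helper `filter_hits`, the threshold of 0.5, and the sample scores are all hypothetical, and the code runs on plain lists so it does not need FAISS installed.

```python
def filter_hits(scores, ids, threshold, higher_is_better=True):
    """Keep (id, score) pairs from one top-k result row that pass the threshold.

    higher_is_better=True  -> inner product / cosine similarity
    higher_is_better=False -> L2 distance, where smaller means closer
    """
    kept = []
    for score, doc_id in zip(scores, ids):
        if doc_id == -1:  # FAISS pads with -1 when fewer than k results exist
            continue
        passes = score >= threshold if higher_is_better else score <= threshold
        if passes:
            kept.append((int(doc_id), float(score)))
    return kept


# Hypothetical row from a top-4 search on an inner-product index
# built from L2-normalized vectors (so scores are cosine similarities).
scores_row = [0.91, 0.74, 0.38, 0.12]
ids_row = [12, 3, 40, 7]
relevant = filter_hits(scores_row, ids_row, threshold=0.5)
print(relevant)

# With FAISS, the row would come from something like:
#   faiss.normalize_L2(vectors)
#   index = faiss.IndexFlatIP(dim)
#   index.add(vectors)
#   faiss.normalize_L2(query)
#   D, I = index.search(query, 4)
#   relevant = filter_hits(D[0], I[0], threshold=0.5)
```

Some FAISS index types also support `range_search(x, radius)`, which returns every hit within the radius instead of a fixed top-k; whether that fits depends on the index in use. Either way, the cutoff value itself is best picked by looking at scores for known relevant and irrelevant reports.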


Thank you for your answer
