Fuzzy title matching

I have a few thousand rows of data related to movies, series and episodes. There is different representation of these titles for example the Friends-S3-E3or Friends-The one with the thumb or Friends-3.3.
I was wondering if we can’t train our own SLM, a very specific title match SLM to answer the basic questions of “tell me what title is represented by this string” or “Given that, this string is of the format -<E what is its real title?
Also, what approach should we follow to handle the scenario where we can have multiple language descriptions of the titles and our task is to get the title for unseen description.

Can we fine-tune a model which can handle the above problem description?

1 Like

You could train or fine-tune an SLM but this is not that simple. Today, there are a lot of light-weight models like the got-mini series or gemini-flash that are perfect for these tasks.

Furthermore, you can bring the costs down with some engineering tricks:

  • lets say you always want to match one new entity with an existing table / dataset: put the big table as the first LLM message (together with the system prompt) and activate caching (as this message is always the same) and then put the entity you want to match in the second llm message (no caching required). This way you will hit the cache for each row you want to match and save huge amount of token costs
  • also if your table is too big or the attention of the model is not sufficient for the table size, you can batch the table and try to match the entity with each batch (potentially resolve clashes afterwards)
  • you can also batch the entities so you always match a batch of entities against an existing large table
  • do some prompt engineering with a small sample to optimize you system prompt
  • you can also combine fuzzy matching / embedding with the LLM approach if you think that some of the entities will be matched like that (but choose a rather high threshold to avoid false positives)

I spent so much time on this because I built a merge tool for my company (disclaimer: I still work for them) that can do this either in a web app, api or sdk: everyrow-sdk/docs/reference/MERGE.md at main · futuresearch/everyrow-sdk · GitHub

1 Like