Using with postgres? - GitHub Issues

Tejaswgupta commented 4 months ago

I'm using Supabase for one of my projects, and it has about 2.3M rows. Currently the data is only fetched by certain attributes, as full-text search is pretty slow. Is there any way we can use bm25s with the existing infrastructure?

Thanks for your response.

xhluca commented 4 months ago

Although bm25s does not provide integrations with SQL, it should be fairly straightforward to pull the data you are interested in via a Python SQL client (e.g. sqlalchemy or psycopg2) and convert it into a list of strings, which you can then pass to bm25s.

As I have not used SQL in a while, I am not confident I can provide an example, but if you wish to contribute an example and add it to examples/, I'm happy to review your PR!
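As a starting point for such an example, a minimal sketch might look like the following. It assumes a hypothetical table `documents` with columns `id` and `body`, and a placeholder connection string; none of these come from this thread. `load_corpus` relies only on the standard DB-API 2.0 cursor methods that psycopg2 implements, and `search` uses the `bm25s.tokenize`, `BM25.index` and `BM25.retrieve` calls from the bm25s README.

```python
def load_corpus(conn, sql, batch_size=10_000):
    """Read (id, text) rows through any DB-API 2.0 connection (e.g. psycopg2)."""
    ids, texts = [], []
    cur = conn.cursor()
    try:
        cur.execute(sql)
        while True:
            rows = cur.fetchmany(batch_size)  # read the result set in batches
            if not rows:
                break
            for row_id, text in rows:
                ids.append(row_id)
                texts.append(text or "")  # treat NULL text as an empty document
    finally:
        cur.close()
    return ids, texts


def search(ids, texts, query, k=10):
    """Index `texts` with bm25s and return (row id, score) pairs for `query`."""
    import bm25s  # imported here so load_corpus works without bm25s installed

    retriever = bm25s.BM25()
    retriever.index(bm25s.tokenize(texts))
    # Without a `corpus` argument, retrieve() returns positions into `texts`.
    indices, scores = retriever.retrieve(bm25s.tokenize(query), k=min(k, len(texts)))
    return [(ids[i], float(s)) for i, s in zip(indices[0], scores[0])]


# Hypothetical usage (table, columns and connection string are placeholders):
#   import psycopg2
#   conn = psycopg2.connect("postgresql://user:password@host:5432/postgres")
#   ids, texts = load_corpus(conn, "SELECT id, body FROM documents")
#   conn.close()
#   print(search(ids, texts, "example query"))
```

Keeping the database ids in a list parallel to the texts means the ranked positions can be turned back into primary keys, so the full rows can be fetched from Postgres afterwards. In practice the index would be built once and reused across queries rather than rebuilt per search as in this sketch.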

Tejaswgupta commented 4 months ago

@xhluca thanks! I'll see if I can get a PR in.

I do want to know how BM25 works with HTML, though. My entire corpus is HTML text that is rendered directly on the client. I pre-process the text (removing HTML and extra spaces) before passing it to the pipeline. The problem is that when I send the ranked results, the client can't display them properly because the HTML structure is gone.

I can only think of two ways:

1. Removing the pre-processing, provided it doesn't result in lower accuracy.
2. Getting the original index of each element in the output, so I can create a simple mapping and then return the original HTML text array in ranked order.

Can you provide some suggestions on which of these would be feasible?

xhluca commented 4 months ago

I'm not an expert in HTML pre/post-processing, but option 2 seems reasonable. Also, do consider assigning a unique ID to each element in the HTML so you can identify where it was located.
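A rough sketch of option 2, under the assumption that the HTML documents are held in a Python list: the stripped text is indexed, and the positions returned by the retriever are used to look up the untouched HTML. The helper names `strip_html` and `rank_html` are made up for illustration. The stripping uses only the standard library's `html.parser`, which is a simplification (for example, it also keeps the contents of `<script>` and `<style>` tags).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text content of `html` with runs of whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with a space so adjacent blocks do not run together.
    return " ".join(" ".join(parser.parts).split())


def rank_html(html_docs, query, k=10):
    """Search over stripped text, but return the original HTML in ranked order."""
    import bm25s  # imported here so strip_html works without bm25s installed

    plain = [strip_html(doc) for doc in html_docs]
    retriever = bm25s.BM25()
    retriever.index(bm25s.tokenize(plain))
    # Without a `corpus` argument, retrieve() returns positions into `plain`,
    # which are also positions into `html_docs` because the lists are parallel.
    indices, scores = retriever.retrieve(bm25s.tokenize(query), k=min(k, len(plain)))
    return [(html_docs[i], float(s)) for i, s in zip(indices[0], scores[0])]
```

Because `plain` and `html_docs` share the same ordering, no separate mapping table is needed; if the documents come from a database, the same positions can index a parallel list of row ids instead.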

xhluca / bm25s

Using with postgres? #7