xhluca / bm25s

Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy
https://bm25s.github.io
MIT License

Using with postgres? #7

Closed Tejaswgupta closed 4 months ago

Tejaswgupta commented 4 months ago

I'm using Supabase for one of my projects, and it has about 2.3M rows. Currently the data is only fetched using certain attributes, as Full Text Search is pretty slow. Is there any way we can use bm25s with the existing infrastructure?

Thanks for your response.

xhluca commented 4 months ago

Although bm25s does not provide integrations with SQL, it should be fairly straightforward to pull the data you are interested in via a Python SQL client (e.g. sqlalchemy or psycopg2) and convert it into a list of strings, which you can then pass to bm25s.
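A minimal sketch of that approach, not an official integration. The table name `documents` and the columns `id` and `body` are hypothetical, as are the helper names `fetch_corpus`, `build_index` and `search`. The fetch helper accepts any DB-API 2.0 connection, so a connection from `psycopg2.connect(...)` should work; the bm25s calls follow the usage shown in the project README.

```python
def fetch_corpus(conn, query="SELECT id, body FROM documents ORDER BY id"):
    """Return (ids, texts) from any DB-API 2.0 connection.

    The default query uses a hypothetical table and column names;
    replace it with one that matches your own schema.
    """
    cur = conn.cursor()
    try:
        cur.execute(query)
        rows = cur.fetchall()
    finally:
        cur.close()
    ids = [row[0] for row in rows]
    # NULL bodies become empty strings so positions stay aligned with ids.
    texts = [row[1] or "" for row in rows]
    return ids, texts


def build_index(texts):
    """Tokenize the texts and build a BM25 index over them."""
    # Imported lazily so fetch_corpus can be used without bm25s installed.
    import bm25s

    corpus_tokens = bm25s.tokenize(texts, stopwords="en")
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)
    return retriever


def search(retriever, ids, query, k=5):
    """Return [(row_id, score), ...] for the top-k matches."""
    import bm25s

    query_tokens = bm25s.tokenize(query, stopwords="en")
    # Without a corpus argument, retrieve() returns positions in the index.
    # k is capped so it never exceeds the number of indexed documents.
    results, scores = retriever.retrieve(query_tokens, k=min(k, len(ids)))
    return [(ids[int(pos)], float(score))
            for pos, score in zip(results[0], scores[0])]


# Example usage (assumes psycopg2 is installed and a DSN of your own):
#   import psycopg2
#   conn = psycopg2.connect("postgresql://user:password@host:5432/dbname")
#   ids, texts = fetch_corpus(conn)
#   retriever = build_index(texts)
#   print(search(retriever, ids, "your query here", k=5))
```

Returning the primary keys rather than the text means the ranked rows can be fetched again from the database with a normal `WHERE id IN (...)` query. For a table with millions of rows, `fetchall()` loads everything into memory at once, so reading in batches may be preferable.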

As I have not used SQL in a while, I am not confident I can provide an example, but if you wish to contribute an example and add it to examples/, I'm happy to review your PR!

Tejaswgupta commented 4 months ago

@xhluca thanks! I'll see if I can get a PR in.

I do want to know how BM25 works with HTML, though. My entire corpus is HTML text, which is rendered directly on the client. I pre-process the text (removing HTML and extra spaces) before passing it to the pipeline. The problem is that when I send the ranked results, the client can't show them properly because the HTML structure is gone.

I can only think of two ways:

Can you provide some suggestions on which of these would be feasible?

xhluca commented 4 months ago

I'm not an expert in HTML pre/post-processing, but option 2 seems reasonable. Also, consider assigning a unique id to each element in the HTML so you can identify where it was located.
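One way the id suggestion could look in practice, as a rough sketch rather than a recommendation: keep the original HTML keyed by id, index only the stripped text, and map ranked positions back to the stored HTML before sending results to the client. The helper names (`strip_html`, `build_search_corpus`, `results_to_html`) are hypothetical, and the tag stripping uses only Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Text nodes are joined with a space, so adjacent elements do not fuse.
    # The trade-off is that a word split by inline tags gains a space.
    return " ".join(" ".join(parser.parts).split())


def build_search_corpus(rows):
    """rows: iterable of (doc_id, html) pairs.

    Returns (ids, texts, html_by_id). texts[i] is the stripped text for
    ids[i], and html_by_id keeps the untouched markup for rendering.
    """
    ids, texts, html_by_id = [], [], {}
    for doc_id, html in rows:
        ids.append(doc_id)
        texts.append(strip_html(html))
        html_by_id[doc_id] = html
    return ids, texts, html_by_id


def results_to_html(ranked_positions, ids, html_by_id):
    """Map ranked corpus positions back to the original HTML."""
    return [html_by_id[ids[pos]] for pos in ranked_positions]
```

Here `texts` is what would be tokenized and indexed, while the client still receives the original markup, so the ranking step never needs to know about HTML.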