Add stopwords for 10 new languages

bm777 commented 1 month ago

Add multi-language stopword support

This pull request addresses issue #32 by implementing support for stopwords in multiple languages.

Changes made:

Included stopword lists for the following languages:
- English
- German
- Dutch
- French
- Spanish
- Portuguese
- Italian
- Russian
- Swedish
- Norwegian
- Chinese

Updated the _infer_stopwords function in tokenization.py to handle the additional languages.

Implementation details:

Stopwords are now loaded based on the specified language, for example "de" for German:

import bm25s

corpus = [
    "Eine Katze ist eine Katze und schnurrt gerne",
    "Ein Hund ist der beste Freund des Menschen und liebt es zu spielen",
    "Ein Vogel ist ein wunderschönes Tier, das fliegen kann",
    "Ein Fisch ist ein Lebewesen, das im Wasser lebt und schwimmt",
]

tids = bm25s.tokenize(corpus, stopwords="de")

Users can still pass a custom stopword list for languages that are not built in.
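
As a rough illustration of the behaviour described above (a language code selects a built-in list, while an explicit list is used as given), here is a minimal standard-library sketch. It is not the actual _infer_stopwords code from tokenization.py: the names BUILTIN_STOPWORDS, resolve_stopwords and simple_tokenize are hypothetical, and the word lists are heavily truncated.

```python
from typing import Iterable, List, Optional, Set, Union

# Hypothetical, truncated lists for illustration; real lists are much longer.
BUILTIN_STOPWORDS = {
    "en": {"a", "and", "is", "the", "to"},
    "de": {"das", "der", "des", "ein", "eine", "es", "im", "ist", "und", "zu"},
}


def resolve_stopwords(stopwords: Optional[Union[str, Iterable[str]]]) -> Set[str]:
    """Turn a language code or a custom list into a set of stopwords."""
    if stopwords is None:
        return set()
    if isinstance(stopwords, str):
        key = stopwords.lower()
        if key not in BUILTIN_STOPWORDS:
            raise ValueError(f"No built-in stopwords for language {stopwords!r}")
        return set(BUILTIN_STOPWORDS[key])
    # Anything else is treated as a user-supplied list of words.
    return {word.lower() for word in stopwords}


def simple_tokenize(text: str, stopwords=None) -> List[str]:
    """Lowercase, split on whitespace, and drop stopwords."""
    stop = resolve_stopwords(stopwords)
    return [token for token in text.lower().split() if token not in stop]


tokens_de = simple_tokenize("Eine Katze ist eine Katze und schnurrt gerne", "de")
tokens_custom = simple_tokenize("a cat is a feline", ["a", "is"])
```

With the truncated German list, tokens_de keeps only the content words, and tokens_custom shows the same function accepting a hand-written list.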

Testing:

A baseline needs to be defined.

If any change made here needs to be modified, feel free to comment below.

Closes #32

xhluca commented 1 month ago

Thanks! If the tests pass I will merge.

bm777 commented 1 month ago

is it a stopwords issue?

this is the new STOPWORDS

xhluca commented 1 month ago

This is the error:

Finding newlines for mmindex:   0%|          | 0.00/8.11M [00:00<?, ?B/s]
Finding newlines for mmindex: 100%|██████████| 8.11M/8.11M [00:00<00:00, 268MB/s]

  0%|          | 0/5183 [00:00<?, ?it/s]
100%|██████████| 5183/5183 [00:00<00:00, 180124.76it/s]
.
======================================================================
FAIL: test_retrieve (tests.quick.test_retrieve.TestBM25SLoadingSaving)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/bm25s/bm25s/tests/quick/test_retrieve.py", line 70, in test_retrieve
    self.assertTrue(np.array_equal(ground_truth, results), f"Expected {ground_truth}, got {results}")
AssertionError: False is not true : Expected [[0]
 [0]], got [[2]
 [2]]

----------------------------------------------------------------------
Ran 33 tests in 71.591s

FAILED (failures=1, skipped=2)

I am away, so I can't really debug this in the next few days. I can look into it when I'm back. In the meantime, feel free to run the tests locally to see what is breaking (GitHub Actions doesn't seem to show everything).

bm777 commented 1 month ago

I'm debugging it...

aflip commented 1 month ago

Is it better to add the stopwords to the library or to add them locally? I would like to add Hindi and a few other Indian languages.

bm777 commented 1 month ago

@aflip Yes, you can add them locally by passing your own list:

tids = bm25s.tokenize(corpus, stopwords=["your list of stop words here"])

xhluca commented 1 month ago

You can also customize the regex template and use a custom stemmer, which makes it flexible for other languages.
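
To make that suggestion concrete, here is a small standard-library sketch of what a custom token pattern plus a custom stemmer can look like for Devanagari text. It does not call bm25s: the tokenize function below is a hypothetical stand-in, and the assumption is that the same pattern and callable could be passed to bm25s.tokenize through its token_pattern and stemmer arguments (check the signature of your installed version). The truncation "stemmer" is a deliberately crude placeholder, not a real Hindi stemmer.

```python
import re
from typing import Callable, List, Optional, Sequence

# Matches runs of characters from the Devanagari Unicode block (U+0900-U+097F).
DEVANAGARI_PATTERN = r"[\u0900-\u097F]+"


def truncate_stemmer(tokens: List[str]) -> List[str]:
    """Toy stemmer: keep the first four code points of each token."""
    return [token[:4] for token in tokens]


def tokenize(
    text: str,
    token_pattern: str,
    stopwords: Sequence[str] = (),
    stemmer: Optional[Callable[[List[str]], List[str]]] = None,
) -> List[str]:
    """Hypothetical stand-in: regex tokenization, stopword removal, stemming."""
    stop = set(stopwords)
    tokens = [t for t in re.findall(token_pattern, text) if t not in stop]
    return stemmer(tokens) if stemmer is not None else tokens


text = "यह एक बिल्ली है"  # "This is a cat"
hindi_stopwords = ["यह", "एक", "है"]  # tiny illustrative list

tokens = tokenize(text, DEVANAGARI_PATTERN, hindi_stopwords)
stemmed = tokenize(text, DEVANAGARI_PATTERN, hindi_stopwords, truncate_stemmer)
```

Matching on an explicit Unicode range keeps vowel signs and other combining marks attached to their words, which is one reason to swap the pattern for scripts like Devanagari.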

xhluca commented 1 month ago

Any update on the test failure here?

xhluca commented 1 month ago

Can you change STOPWORDS_EN to STOPWORDS_EN_PLUS? This will ensure that it is backward compatible. The tests should pass after that.

bm777 commented 1 month ago

@xhluca Sorry for being absent. I was working on a side project that required attention. If you believe it will pass, then I will do it now.

xhluca commented 1 month ago

It seems 522fbdc removed STOPWORDS_EN. We still need STOPWORDS_EN to be like the original (pre-PR), whereas STOPWORDS_EN_PLUS is what one can use if they want the enhanced stopwords you have added.
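
A minimal sketch of the arrangement being asked for, with both constants kept side by side. The word lists are truncated placeholders, the extended list is assumed here to be a superset of the original, and the name table with its "en_plus" alias is hypothetical rather than taken from the PR.

```python
# Hypothetical, truncated lists for illustration only.
STOPWORDS_EN = ("a", "and", "is", "the", "to")  # original list, left unchanged
STOPWORDS_EN_PLUS = STOPWORDS_EN + ("about", "above", "after")  # opt-in extension

# Hypothetical name table: the existing names keep pointing at the original list.
STOPWORDS_BY_NAME = {
    "en": STOPWORDS_EN,
    "english": STOPWORDS_EN,
    "en_plus": STOPWORDS_EN_PLUS,
}


def filter_tokens(tokens, name="en"):
    """Drop the stopwords registered under the given name."""
    stop = set(STOPWORDS_BY_NAME[name])
    return [token for token in tokens if token not in stop]


tokens = "the cat sat above the dog".split()
default_result = filter_tokens(tokens)           # same behaviour as before
plus_result = filter_tokens(tokens, "en_plus")   # also drops "above"
```

Because the two lists remove different tokens, swapping the default list changes what gets indexed, which may be why a retrieval test with fixed expected results started failing.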

xhluca commented 1 month ago

Btw, the new main has decoupled core tests from comparison tests. Feel free to add test_stopwords.py to the core tests with a simple test. Here's the template:

xhluca commented 1 month ago

import unittest
import numpy as np

import bm25s

class TestAddNameHere(unittest.TestCase):
    def setUp(self):
        # Create your corpus here
        self.corpus = [
            "a cat is a feline and likes to purr",
            "a dog is the human's best friend and loves to play",
            "a bird is a beautiful animal that can fly",
            "a fish is a creature that lives in water and swims",
        ]

    def test_add_here(self):
        corpus_tokens = bm25s.tokenize(self.corpus, stopwords="en")
        # continue here

if __name__ == '__main__':
    unittest.main()

xhluca commented 1 month ago

Thank you for taking this to the finish line! Merging this now.

xhluca / bm25s