Closed bm777 closed 1 month ago
Thanks! If the tests pass I will merge.
This is the error:
Finding newlines for mmindex: 0%| | 0.00/8.11M [00:00<?, ?B/s]
Finding newlines for mmindex: 100%|██████████| 8.11M/8.11M [00:00<00:00, 268MB/s]
0%| | 0/5183 [00:00<?, ?it/s]
100%|██████████| 5183/5183 [00:00<00:00, 180124.76it/s]
.
======================================================================
FAIL: test_retrieve (tests.quick.test_retrieve.TestBM25SLoadingSaving)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/bm25s/bm25s/tests/quick/test_retrieve.py", line 70, in test_retrieve
    self.assertTrue(np.array_equal(ground_truth, results), f"Expected {ground_truth}, got {results}")
AssertionError: False is not true : Expected [[0]
 [0]], got [[2]
 [2]]
----------------------------------------------------------------------
Ran 33 tests in 71.591s
FAILED (failures=1, skipped=2)
I am away, so I can't really debug this in the next few days. I can look into it when I'm back. In the meantime, feel free to run the tests locally to see what is breaking (GitHub Actions doesn't seem to show everything).
I'm debugging it...
Is it better to add the stopwords to the library or to add them locally? I would like to add Hindi and a few other Indian languages.
@aflip yes, you can.
tids = bm25s.tokenize(corpus, stopwords=["your", "stop", "words", "here"])
You can also customize the regex template and use a custom stemmer, which makes it flexible for other languages.
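To make the three knobs mentioned above concrete, here is a minimal standard-library sketch of what a stopword list, a token regex, and a stemmer each do during tokenization. It is not bm25s's implementation: `simple_tokenize` and the toy stemmer are hypothetical names used only for illustration, and the default pattern is just one common choice.

```python
import re


def simple_tokenize(texts, stopwords=(), token_pattern=r"(?u)\b\w\w+\b", stemmer=None):
    """Hypothetical helper: lowercase, split with a regex, drop stopwords, then stem."""
    stop = set(stopwords)
    pattern = re.compile(token_pattern)
    tokenized = []
    for text in texts:
        # Keep only regex matches that are not in the stopword set.
        tokens = [t for t in pattern.findall(text.lower()) if t not in stop]
        if stemmer is not None:
            # The stemmer is any callable mapping one token to its stem.
            tokens = [stemmer(t) for t in tokens]
        tokenized.append(tokens)
    return tokenized


docs = ["The cat likes to purr", "Dogs love playing"]

# A toy "stemmer" that only strips trailing "s"; a real one would be language-aware.
tokens = simple_tokenize(docs, stopwords=["the", "to"], stemmer=lambda t: t.rstrip("s"))
print(tokens)
```

For a new language, the same three pieces would be swapped out together: a stopword list for that language, a pattern that matches its script, and a stemmer that understands its morphology.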
Any update on the test failure here?
Can you change STOPWORDS_EN to STOPWORDS_EN_PLUS? This will ensure that it is backward compatible. The tests should pass after that.
@xhluca Sorry for being absent. I was working on a side project that required attention. If you believe it will pass, then I will do it now.
It seems 522fbdc removed STOPWORDS_EN. We still need STOPWORDS_EN to be like the original (pre-PR), whereas STOPWORDS_EN_PLUS is what one can use if they want the enhanced stopwords you have added.
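A rough sketch of the backward-compatible layout being asked for, under stated assumptions: the word lists are invented and heavily shortened, the enhanced list is assumed to extend the original one, and `resolve_stopwords`, `_BY_LANGUAGE`, and the `"en_plus"` key are hypothetical names, not the library's actual API.

```python
# Hypothetical, shortened lists; the real ones are much longer.
STOPWORDS_EN = ("a", "an", "and", "the", "to")  # unchanged from before the PR
STOPWORDS_EN_PLUS = STOPWORDS_EN + ("is", "that", "can")  # opt-in enhanced list

# Hypothetical language registry; the keys are assumptions for illustration.
_BY_LANGUAGE = {
    "en": STOPWORDS_EN,
    "english": STOPWORDS_EN,
    "en_plus": STOPWORDS_EN_PLUS,
}


def resolve_stopwords(stopwords):
    """Turn a language code, an explicit list, or None into a list of stopwords."""
    if stopwords is None or stopwords is False:
        return []
    if isinstance(stopwords, str):
        try:
            return list(_BY_LANGUAGE[stopwords.lower()])
        except KeyError:
            raise ValueError(f"Unknown stopword language: {stopwords!r}") from None
    # Anything else is treated as a user-supplied collection of words.
    return list(stopwords)


default_words = resolve_stopwords("en")
enhanced_words = resolve_stopwords("en_plus")
```

Because `"en"` still maps to the untouched list, existing indexes and tests that rely on the old tokenization keep producing the same results, while the enhanced list is only used when requested explicitly.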
Btw, the new main has decoupled core tests from comparison tests. Feel free to add test_stopwords.py to core tests with a simple test; here's the template:
import unittest

import numpy as np

import bm25s


class TestAddNameHere(unittest.TestCase):
    def setUp(self):
        # Create your corpus here
        self.corpus = [
            "a cat is a feline and likes to purr",
            "a dog is the human's best friend and loves to play",
            "a bird is a beautiful animal that can fly",
            "a fish is a creature that lives in water and swims",
        ]

    def test_add_here(self):
        corpus_tokens = bm25s.tokenize(self.corpus, stopwords="en")
        # continue here


if __name__ == '__main__':
    unittest.main()
Thank you for taking this to the finish line! Merging this now.
Add multi-language stopword support
This pull request addresses issue #32 by implementing support for stopwords in multiple languages.
Changes made:
Implementation details:
Stopwords are now loaded based on the specified language
Testing:
If any change made here needs to be modified, feel free to comment below.
Closes #32