x-tabdeveloping / neofuzz

Blazing fast fuzzy text search for Python.
https://x-tabdeveloping.github.io/neofuzz/
MIT License

neofuzz indexing fails for list of 400K strings #11

Open SeanPedersen opened 1 week ago

SeanPedersen commented 1 week ago

Error message:

python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
zsh: killed     python neofuzztest.py

Code:

import random
import string
from neofuzz import char_ngram_process

def rand_str(length):
    characters = string.ascii_letters + string.digits
    return "".join(random.choice(characters) for _ in range(length))

names = [
    rand_str(8) + " " + rand_str(6) + " " + rand_str(4) + " " + str(i)
    for i in range(400_000)
]
print(len(names))

neofuzz_process = char_ngram_process()
neofuzz_process.index(names)

query = "test 3333"

pre_filter = neofuzz_process.extract(query, limit=2000, refine_levenshtein=True)
print(pre_filter[:10])

The blazing-fast speed of this library can only shine when working with large datasets.

x-tabdeveloping commented 1 week ago

Hmm, interesting... Thanks for taking the time to look into this. Can I get the full error log? I have a feeling this might have something to do with PyNNDescent.
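
A possible way to gather more detail for the log: `zsh: killed` usually means the process received SIGKILL, which Python cannot catch, so no traceback is written. A common cause is running out of memory, though that is not confirmed here. The sketch below records time and peak memory at increasing input sizes so the growth is visible before the process dies. The helper names `peak_rss_mb` and `profile_steps` are hypothetical and not part of neofuzz. It relies on the standard-library `resource` module, which is Unix-only.

```python
import resource
import sys
import time


def peak_rss_mb():
    """Return the peak resident memory of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def profile_steps(build_index, sizes, log=None):
    """Call build_index(n) for each n and log time and peak memory after each call.

    build_index is any callable taking the number of items to index
    (hypothetical hook; pass in whatever reproduces the problem).
    """
    log = log if log is not None else sys.stderr
    results = []
    for n in sizes:
        start = time.perf_counter()
        build_index(n)
        elapsed = time.perf_counter() - start
        peak = peak_rss_mb()
        results.append((n, elapsed, peak))
        # Flush after every step so the last line survives a hard kill.
        print(f"n={n} time={elapsed:.2f}s peak_rss={peak:.0f}MiB", file=log, flush=True)
    return results
```

With the script from the report, this could be called as `profile_steps(lambda n: char_ngram_process().index(names[:n]), [50_000, 100_000, 200_000, 400_000])`. The peak value never decreases within a process, so the last line printed before the kill shows roughly how much memory had been reached.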