zafercavdar / fasttext-langdetect

80x faster and 95% accurate language identification with Fasttext
MIT License
141 stars, 21 forks

Crash with numpy 2 #17

Open jlqibm opened 2 weeks ago

jlqibm commented 2 weeks ago

With numpy 1.26.4 and Python 3.11, things work fine. With numpy 2.1.3 and Python 3.11, I get a crash:

    Successfully installed numpy-2.1.3
    (hf) [jlquinn@cccxc520 fms-dgt-internal]$ python
    Python 3.11.0 (main, Mar 1 2023, 18:26:19) [GCC 11.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from ftlangdetect import detect
    >>> detect('hi there')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/dccstor/jlquinn01/miniforge3/envs/hf/lib/python3.11/site-packages/ftlangdetect/detect.py", line 45, in detect
        labels, scores = model.predict(text)
                         ^^^^^^^^^^^^^^^^^^^
      File "/dccstor/jlquinn01/miniforge3/envs/hf/lib/python3.11/site-packages/fasttext/FastText.py", line 239, in predict
        return labels, np.array(probs, copy=False)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ValueError: Unable to avoid copy while creating an array as requested.
    If using np.array(obj, copy=False) replace it with np.asarray(obj) to allow a copy when needed (no behavior change in NumPy 1.x).
    For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.
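The error message comes from the changed meaning of `copy=False` in NumPy 2. A minimal sketch that shows the difference without fastText installed; the `probs` tuple is a made-up stand-in for the probabilities that `predict` collects, and the values are illustrative only:

```python
import numpy as np

# Hypothetical stand-in for the probabilities gathered inside predict().
probs = (0.98, 0.02)

# A tuple can only become an ndarray by copying its contents.
# NumPy 1.x copies silently here; NumPy 2.x treats copy=False as
# "never copy" and raises ValueError instead.
try:
    np.array(probs, copy=False)
    outcome = "no error"
except ValueError:
    outcome = "ValueError"

# np.asarray copies only when needed and works on both major versions.
safe = np.asarray(probs)

print(np.__version__, outcome, safe)
```

On NumPy 1.x the `try` branch should succeed, and on NumPy 2.x it should report `ValueError`, while `np.asarray` returns the array in both cases.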

Roman-9182 commented 1 week ago

Hi, @jlqibm.

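A temporary workaround is to replace the model's `predict` with a copy that returns `np.asarray(probs)` instead of `np.array(probs, copy=False)`: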

    import numpy as np
    from ftlangdetect.detect import get_or_load_model

    def custom_predict(self, text, k=1, threshold=0.0, on_unicode_error="strict"):
        """
        Given a string, get a list of labels and a list of
        corresponding probabilities. k controls the number
        of returned labels. A choice of 5, will return the 5
        most probable labels. By default this returns only
        the most likely label and probability. threshold filters
        the returned labels by a threshold on probability. A
        choice of 0.5 will return labels with at least 0.5
        probability. k and threshold will be applied together to
        determine the returned labels.

        This function assumes to be given
        a single line of text. We split words on whitespace (space,
        newline, tab, vertical tab) and the control characters carriage
        return, formfeed and the null character.

        If the model is not supervised, this function will throw a ValueError.

        If given a list of strings, it will return a list of results as usually
        received for a single line of text.
        """

        def check(entry):
            if entry.find("\n") != -1:
                raise ValueError("predict processes one line at a time (remove '\\n')")
            entry += "\n"
            return entry

        if isinstance(text, list):
            text = [check(entry) for entry in text]
            all_labels, all_probs = self.f.multilinePredict(
                text, k, threshold, on_unicode_error
            )

            return all_labels, all_probs
        else:
            text = check(text)
            predictions = self.f.predict(text, k, threshold, on_unicode_error)
            if predictions:
                probs, labels = zip(*predictions)
            else:
                probs, labels = ([], ())

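            # np.asarray copies when needed; the np.array(probs, copy=False)
            # call in fastText's predict() raises ValueError under NumPy 2.
            return labels, np.asarray(probs)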

    def custom_detect(text: str, low_memory=False) -> dict[str, str | float]:
        model = get_or_load_model(low_memory)
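        # Patch predict on the model's class so the NumPy 2-safe version is used.
        # Note that this assignment is repeated on every call to custom_detect.
        model.__class__.predict = custom_predict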
        labels, scores = model.predict(text)
        label = labels[0].replace("__label__", '')
        score = min(float(scores[0]), 1.0)
        return {
            "lang": label,
            "score": score,
        }
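The workaround reassigns `predict` on the model's class every time it runs. A possible variation is to apply the patch only once. The sketch below illustrates that pattern under stated assumptions: `FakeModel`, `patched_predict`, `patch_once` and the `_predict_patched` flag are all hypothetical names, and `FakeModel` is only a stand-in so the example runs without fastText or a downloaded model.

```python
class FakeModel:
    """Hypothetical stand-in for the fastText model class (not the real API)."""

    def predict(self, text):
        # Mimics the failure reported in this issue.
        raise ValueError("Unable to avoid copy while creating an array as requested.")


def patched_predict(self, text):
    # Made-up result in the (labels, scores) shape the workaround expects.
    return ("__label__en",), [1.00001]


def patch_once(cls, replacement):
    """Replace cls.predict a single time; return True only if a patch was applied."""
    if getattr(cls, "_predict_patched", False):
        return False
    cls.predict = replacement
    cls._predict_patched = True
    return True


first = patch_once(FakeModel, patched_predict)   # applies the patch
second = patch_once(FakeModel, patched_predict)  # already patched, does nothing

labels, scores = FakeModel().predict("hi there")
lang = labels[0].replace("__label__", "")  # strip the fastText label prefix
score = min(float(scores[0]), 1.0)         # clamp scores slightly above 1.0
print({"lang": lang, "score": score})
```

With the real library, the same guard could wrap the `model.__class__.predict = custom_predict` assignment; that part is untested here.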