xigt / lgid

language identification of linguistic examples
MIT License

Match Multiple Languages on Same Line #3

Closed (rgeorgi closed 7 years ago)

rgeorgi commented 7 years ago

I thought this was a bug in my code, but it appears that when multiple language names occur on the same line, we only match the first one.

For instance, the snippet

doc_id=36 page=1 block_id=1-10 bbox=70.86,336.5,344.09,369.56 label=tbbbbbbbbbt 25 27
line=25 fonts=F0-10.0 bbox=70.86,359.54,344.09,369.56:> no inflection, word order is used to convey grammatical meanings
line=26 fonts=F0-10.0 bbox=70.86,348.02,331.74,358.04:> words one-syllable long, tones may be used to change meaning
line=27 fonts=F0-10.0 bbox=70.86,336.5,311.77,346.52 :> e.g., Chinese, Vietnamese, Samoan, Thai, Khmer, Tibetan

returns only chinese for line 27.

goodmami commented 7 years ago

Oops, I thought I tested for this, but it looks like you're right. In analyzers.py, we have:

        ...
        for line in block.lines:
            startline = line.lineno
            endline = line.lineno  # same for now
            match = lg_re.search(normalize_characters(line))
            if match is not None:
                ...

But the last two lines should be something like this:

            ...
            for match in lg_re.finditer(normalize_characters(line)):
                ...
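
To illustrate the difference, here is a minimal, self-contained sketch. The pattern below is only a hypothetical stand-in for the project's lg_re (the real one is built elsewhere and is not shown in this thread), and the call to normalize_characters is left out. It shows that search stops at the first hit, while finditer yields every non-overlapping match on the line:

```python
import re

# Hypothetical stand-in for lg_re: an alternation over a few language names.
# The real pattern in analyzers.py is assumed to be built from a language table.
LANGUAGE_NAMES = ["Chinese", "Vietnamese", "Samoan", "Thai", "Khmer", "Tibetan"]
lg_re = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in LANGUAGE_NAMES) + r")\b",
    re.IGNORECASE,
)

# The text of line 27 from the snippet in this issue.
line = "e.g., Chinese, Vietnamese, Samoan, Thai, Khmer, Tibetan"

# search() returns only the first match object (or None).
first_only = lg_re.search(line)

# finditer() yields one match object per non-overlapping occurrence.
all_matches = [m.group(0).lower() for m in lg_re.finditer(line)]

print(first_only.group(0))
print(all_matches)
```

Because finditer simply yields nothing when there is no match, the loop form also makes the "if match is not None" check unnecessary.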

rgeorgi commented 7 years ago

Improved Language Name Matching