xigt / lgid

language identification of linguistic examples
MIT License

Match wrapping language mentions #1

Closed goodmami closed 7 years ago

goodmami commented 7 years ago

It's possible that a language-mention could wrap at a column boundary. E.g.:

(Northern Frisian)

blah blah blah blah this example of Northern
Frisian:

Or

(Wangaaybuwan-Ngiyambaa)

blah blah blah blah Wangaaybuwan-
Ngiyambaa...

Or

(Algonquian)

blah blah blah blah Algon-
quian...

Note that a hyphen may be inserted, and sometimes that hyphen was originally there.

This could be handled by modifying the language_mentions() function in analyzers.py so that instead of doing a regex search on a single line, it's done on a sliding window of 2 or 3 lines. You could probably concatenate the lines in the window directly, removing the spaces and hyphens at the line breaks, and when constructing the regex change spaces and hyphens in language names to [-\s]* (e.g. Northern[-\s]*Frisian, Wangaaybuwan[-\s]*Ngiyambaa, etc.). I think this would capture all the above examples, but you'll have to do some bookkeeping to get accurate line/column/original-text info for the Mention object to be created.
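The windowing and bookkeeping described above could be sketched roughly as follows. This is not the lgid implementation: Span, join_window() and wrapped_mentions() are hypothetical names used only for illustration, and Span merely stands in for whatever fields the real Mention object needs. The sketch joins the lines of each window, dropping the whitespace and a single trailing hyphen at each break, while recording which (line, column) every kept character came from.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for lgid's Mention; the real class may differ.
Span = namedtuple('Span', 'start_line start_col end_line end_col name text')

def _squash(s):
    # Remove hyphens and whitespace so wrapped and unwrapped forms compare equal.
    return re.sub(r'[-\s]+', '', s)

def join_window(lines, first):
    """Concatenate lines, dropping whitespace and one hyphen at each break.

    Returns (text, positions), where positions[i] is the (line, column)
    in the original document of text[i].
    """
    chars, positions = [], []
    last = len(lines) - 1
    for offset, line in enumerate(lines):
        start, end = 0, len(line)
        if offset > 0:                       # drop leading indentation
            start = len(line) - len(line.lstrip())
        if offset < last:                    # drop trailing space and hyphen
            end = len(line.rstrip())
            if line[:end].endswith('-'):
                end -= 1
        for col in range(start, end):
            chars.append(line[col])
            positions.append((first + offset, col))
    return ''.join(chars), positions

def wrapped_mentions(lines, names, window=2):
    by_key = {_squash(n): n for n in names}
    alternatives = (
        r'[-\s]*'.join(map(re.escape, re.split(r'[-\s]+', n)))
        for n in sorted(names, key=len, reverse=True)   # prefer longer names
    )
    lg_re = re.compile(r'\b(?:{})\b'.format('|'.join(alternatives)))
    seen, found = set(), []
    for i in range(len(lines)):
        for size in range(1, window + 1):
            if i + size > len(lines):
                break
            text, pos = join_window(lines[i:i + size], i)
            for m in lg_re.finditer(text):
                (sl, sc), (el, ec) = pos[m.start()], pos[m.end() - 1]
                # Only report matches that start on the window's first line,
                # and skip ones already found with a smaller window.
                if sl != i or (sl, sc, el, ec) in seen:
                    continue
                seen.add((sl, sc, el, ec))
                if sl == el:
                    original = lines[sl][sc:ec + 1]
                else:
                    original = '\n'.join(
                        [lines[sl][sc:]] + lines[sl + 1:el] + [lines[el][:ec + 1]]
                    )
                name = by_key[_squash(m.group(0))]
                found.append(Span(sl, sc, el, ec + 1, name, original))
    return found
```

Each window size from 1 up to `window` is tried because a name at the very end of a line only has a word boundary after it when that line is the last one in the window. If one language name is a prefix of another, both matches are reported; filtering overlaps is left out of this sketch.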

rgeorgi commented 7 years ago

Improved Language Name Matching

elirnm commented 7 years ago

@goodmami Do you know how to dynamically replace spaces and hyphens in language names with [-\s]* when creating the regex? I'm trying to do it with re.sub(r"[-\s]+", r"[-\s]*", name) in the .format() call but the opening bracket is getting escaped when it goes into the regex.

goodmami commented 7 years ago

That is probably because of the re.escape() in the code:

     lg_re = re.compile(
         r'\b({})\b'.format(
             r'|'.join(re.escape(normcaps(name)) for name in lgtable)
         ),
         flags=re.U
     )

Maybe try this for each language name:

name = r'[-\s]*'.join(map(re.escape, re.split(r'[-\s]+', name)))
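A small self-contained check of the difference, using an example name from this issue. The variable names are illustrative only. The first pattern reproduces the symptom (the character class gets escaped along with the name), while the second escapes each piece separately and then joins the pieces with the unescaped separator pattern.

```python
import re

name = "Northern Frisian"

# Substitute first, escape afterwards: the inserted [-\s]* is escaped too,
# so the pattern only matches that literal text. A function is used as the
# replacement so re.sub() does not try to interpret the backslash itself.
substituted = re.sub(r'[-\s]+', lambda m: r'[-\s]*', name)
broken = re.escape(substituted)

# Split on separators, escape each piece, then join with the raw pattern.
fixed = r'[-\s]*'.join(map(re.escape, re.split(r'[-\s]+', name)))

broken_hit = re.search(broken, "Northern Frisian")
fixed_hits = [
    bool(re.search(fixed, s))
    for s in ("Northern Frisian", "Northern\nFrisian", "NorthernFrisian")
]
```

With this ordering, re.escape() still protects any regex metacharacters that occur inside a language name, but never sees the separator pattern.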

Putting it together:

     lg_re = re.compile(
         r'\b({})\b'.format(
             r'|'.join(
                 r'[-\s]*'.join(
                     map(re.escape, map(normcaps, re.split(r'[-\s]+', name)))
                 )
                 for name in lgtable
             )
         ),
         flags=re.U
     )

But if that's getting too hard to read, you can pull out subexpressions and precompute them or make functions.
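One possible way to pull the subexpressions out into functions is sketched below. name_regex() and build_lg_re() are hypothetical helper names, and since normcaps() is not shown in this thread, the sketch takes the normalizer as a `normalize` parameter that defaults to the identity function. In lgid one would presumably pass normcaps there.

```python
import re

_SEPARATORS = re.compile(r'[-\s]+')

def name_regex(name, normalize=lambda s: s):
    """Regex source for one language name, tolerant of wrapping."""
    parts = (normalize(part) for part in _SEPARATORS.split(name))
    # Escape each part on its own so the joining pattern stays unescaped.
    return r'[-\s]*'.join(re.escape(part) for part in parts)

def build_lg_re(names, normalize=lambda s: s):
    """Compile one alternation over all language names."""
    alternatives = '|'.join(name_regex(n, normalize) for n in names)
    return re.compile(r'\b({})\b'.format(alternatives), flags=re.U)

lg_re = build_lg_re(["Northern Frisian", "Wangaaybuwan-Ngiyambaa"])
match = lg_re.search("an example of Northern-Frisian data")
```

Here `match.group(1)` holds the text as it appeared in the input, which is what would feed the text field of a Mention.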

elirnm commented 7 years ago

The Freki docs seem to have whitespace applied to the beginning of every line, sometimes a massive amount of it. Should the text field in Mention objects include all this extra whitespace or should I strip it before I create the Mention?

goodmami commented 7 years ago

Hmm, good question. The point of the text field was mostly to capture things that we'd normalized away for the regex, such as diacritics, rather than recording full context. But it's perhaps too messy to try and recover the original intended language name (e.g. reattaching wrapped names with a space, a hyphen, or direct concatenation).

For now it's probably fine to just keep all that whitespace. I don't think we're currently using the text field, so if that changes we could revisit this decision.