Closed goodmami closed 7 years ago
@goodmami Do you know how to dynamically replace spaces and hyphens in language names with [-\s]* when creating the regex? I'm trying to do it with re.sub(r"[-\s]+", r"[-\s]*", name) in the .format() call, but the opening bracket is getting escaped when it goes into the regex.
That is probably because of the re.escape() in the code:

    lg_re = re.compile(
        r'\b({})\b'.format(
            r'|'.join(re.escape(normcaps(name)) for name in lgtable)
        ),
        flags=re.U
    )
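To see the effect in isolation, here is a minimal sketch (the sample name and variable names are only for illustration, not from the project): when the character class is inserted before re.escape() runs, its brackets are escaped along with everything else, so the pattern only matches the literal text "[-\s]*".

```python
import re

name = "Northern Frisian"

# Insert the character class first, then escape: the class itself gets escaped.
substituted = name.replace(" ", r"[-\s]*")
escaped = re.escape(substituted)

# The escaped pattern no longer matches the language name...
print(re.search(escaped, "Northern Frisian"))
# ...it only matches the literal characters that were inserted.
print(re.search(escaped, r"Northern[-\s]*Frisian"))
```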
Maybe try this for each language name:
name = r'[-\s]*'.join(map(re.escape, re.split(r'[-\s]+', name)))
Putting it together:
    lg_re = re.compile(
        r'\b({})\b'.format(
            r'|'.join(
                r'[-\s]*'.join(
                    map(re.escape, map(normcaps, re.split(r'[-\s]+', name)))
                ) for name in lgtable
            )
        ),
        flags=re.U
    )
But if that's getting too hard to read, you can pull out subexpressions and precompute them or make functions.
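For instance, the per-name pattern could live in a small helper. This is only a sketch: name_pattern() and build_lg_re() are hypothetical names, and the normalize argument stands in for the project's normcaps(), whose behavior is not shown here, so it defaults to doing nothing.

```python
import re

SEPARATORS = re.compile(r'[-\s]+')   # runs of spaces/hyphens inside a name
FLEXIBLE_SEP = r'[-\s]*'             # what each run becomes in the regex


def name_pattern(name, normalize=lambda s: s):
    """Escape each part of a language name and rejoin with a flexible separator."""
    parts = SEPARATORS.split(name)
    return FLEXIBLE_SEP.join(re.escape(normalize(p)) for p in parts if p)


def build_lg_re(names, normalize=lambda s: s):
    """Compile one alternation over all language names."""
    alternatives = '|'.join(name_pattern(n, normalize) for n in names)
    return re.compile(r'\b({})\b'.format(alternatives), flags=re.U)


lg_re = build_lg_re(['Northern Frisian', 'Wangaaybuwan-Ngiyambaa', 'Algonquian'])
```

With this, lg_re.search('the Northern-Frisian data') and lg_re.search('WangaaybuwanNgiyambaa') both find a match, because the separator may be a hyphen, whitespace, or nothing at all.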
The Freki docs seem to have whitespace applied to the beginning of every line, sometimes a massive amount of it. Should the text field in Mention objects include all this extra whitespace or should I strip it before I create the Mention?
Hmm, good question. The point of the text field was mostly to capture things that we'd normalized away for the regex, such as diacritics, rather than recording full context. But it's perhaps too messy to try and recover the original intended language name (e.g. reattaching wrapped names with a space, a hyphen, or direct concatenation).
For now it's probably fine to just keep all that whitespace. I don't think we're currently using the text field, so if that changes we could revisit this decision.
It's possible that a language-mention could wrap at a column boundary. E.g.:
(Northern Frisian)
Or
(Wangaaybuwan-Ngiyambaa)
Or
(Algonquian)
Note that a hyphen may be inserted, and sometimes that hyphen was originally there.
This could be handled by modifying the language_mentions() function in analyzers.py so that instead of doing a regex search on a single line, it's done on a sliding window of 2 or 3 lines. You could probably concatenate the lines in the window directly, removing spaces and hyphens, and when constructing the regex change spaces and hyphens in language names to [-\s]* (e.g. Northern[-\s]*Frisian, Wangaaybuwan[-\s]*Ngiyambaa, etc.). I think this would capture all the above examples, but you'll have to do some bookkeeping to get accurate line/column/original-text info for the Mention object to be created.
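A rough sketch of the sliding-window idea with the bookkeeping included. This is not the actual language_mentions() code: join_window(), find_mentions(), and original_text() are hypothetical names, and results are plain tuples rather than Mention objects. It also varies the proposal slightly: spaces and hyphens are only dropped at line breaks (a line-final hyphen is removed, other lines are joined with one space), so that \b still works inside a line.

```python
import re


def name_pattern(name):
    # Spaces/hyphens inside a name become an optional flexible separator.
    parts = re.split(r'[-\s]+', name)
    return r'[-\s]*'.join(re.escape(p) for p in parts if p)


def join_window(lines, first):
    """Join lines into one string.

    Returns (text, positions), where positions[i] is the (line, column)
    in the original document of text[i].
    """
    chars, positions = [], []
    for offset, line in enumerate(lines):
        line_no = first + offset
        body = line.rstrip()
        start = len(body) - len(body.lstrip())  # skip leading whitespace
        end = len(body)
        last = offset == len(lines) - 1
        hyphenated = not last and body.endswith('-')
        if hyphenated:
            end -= 1  # drop the line-final hyphen and join directly
        for col in range(start, end):
            chars.append(body[col])
            positions.append((line_no, col))
        if not last and not hyphenated and end > start:
            chars.append(' ')  # ordinary line break: join with one space
            positions.append((line_no, end))
    return ''.join(chars), positions


def find_mentions(lines, names, window=2):
    """Return (start, end, matched_text) tuples; end is inclusive."""
    lg_re = re.compile(
        r'\b({})\b'.format('|'.join(name_pattern(n) for n in names)),
        flags=re.U
    )
    mentions = []
    for i in range(len(lines)):
        text, positions = join_window(lines[i:i + window], i)
        for m in lg_re.finditer(text):
            start = positions[m.start()]
            end = positions[m.end() - 1]
            if start[0] != i:
                continue  # left for the window that starts on that line
            mentions.append((start, end, m.group(1)))
    return mentions


def original_text(lines, start, end):
    """Recover the untouched source text between two positions."""
    (l0, c0), (l1, c1) = start, end
    if l0 == l1:
        return lines[l0][c0:c1 + 1]
    return '\n'.join([lines[l0][c0:]] + lines[l0 + 1:l1] + [lines[l1][:c1 + 1]])
```

Only accepting matches that start on the first line of each window means every mention is reported once, by the window that begins on its starting line, as long as it spans no more than window lines.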