woodbri / address-standardizer

An address parser and standardizer in C++

Tokenizer: Improving word splitting process #23

Closed woodbri closed 8 years ago

woodbri commented 8 years ago

This is primarily an issue for German-language addresses. In German, "CHAUSSEE" is a street type and it may be joined directly to the street name; it is also commonly abbreviated to "CH". So we want to split words like "nameCHAUSSEE" or "nameCH" into two tokens, one for the name and one for the type. The problem comes from the short abbreviation: for example, "BACH" gets split into "BA", "CH". The current logic pre-splits all attached words before the string is passed to the parser to create tokens. We could change the logic to first check whether a word is in the lexicon and, if it is, not split it.
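To make the mis-split concrete, here is a minimal, hypothetical suffix splitter; it is not the project's actual tokenizer code, and the function name, suffix list, and example street name are assumptions for illustration only. It detaches the type correctly from "MUEHLENCHAUSSEE" and "MUEHLENCH", but wrongly splits "BACH" into "BA" + "CH":

```cpp
// Illustrative sketch only: naive splitter that detaches a known street-type
// suffix from the end of a word. Not the project's real tokenizer.
#include <iostream>
#include <string>
#include <utility>
#include <vector>

std::pair<std::string, std::string> split_attached_type(const std::string &word) {
    // Hypothetical list of attachable street types and abbreviations.
    static const std::vector<std::string> suffixes = {"CHAUSSEE", "STRASSE", "STR", "CH"};
    for (const auto &suf : suffixes) {
        if (word.size() > suf.size() &&
            word.compare(word.size() - suf.size(), suf.size(), suf) == 0) {
            return {word.substr(0, word.size() - suf.size()), suf};
        }
    }
    return {word, ""};   // nothing to detach
}

int main() {
    // "MUEHLENCHAUSSEE" and "MUEHLENCH" split as desired,
    // but "BACH" is mis-split into "BA" + "CH".
    for (const std::string w : {"MUEHLENCHAUSSEE", "MUEHLENCH", "BACH"}) {
        auto [name, type] = split_attached_type(w);
        std::cout << w << " -> '" << name << "' + '" << type << "'\n";
    }
}
```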

Also, while this is mostly needed for separating the TYPE from the street name in German, at the time the address string is parsed into tokens we are not able to differentiate which component of the address any given token belongs to until much later in the process. Another downside is that some city names have prefixes and/or suffixes that also match some of these splitting rules, which makes it harder to identify them appropriately.

Assuming we wanted to take this approach, the logic would be something like this:

while (the string has chars) {
   if (check if the head of the string matches a lexicon entry) {
     make it a token and push it on the stream
   }
   else {
      if (we need to split it) {
         split it
         push part1, emdash, part2 on token stream
      }
      else {
         push word on token stream
      }
   }
   push punctuation on token stream
}
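A minimal compilable sketch of that loop, simplified to work per word rather than on the raw character stream; the std::set lexicon, the suffix list, and the "-" stand-in for the dash token are assumptions for illustration, not the project's actual classes:

```cpp
// Lexicon-first tokenizing: keep a word intact if the lexicon knows it,
// otherwise fall back to the attached-type split. Illustrative sketch only.
#include <iostream>
#include <set>
#include <string>
#include <vector>

std::vector<std::string> tokenize(const std::string &word,
                                  const std::set<std::string> &lexicon) {
    std::vector<std::string> tokens;
    // If the word is already a lexicon entry, do not split it.
    if (lexicon.count(word)) {
        tokens.push_back(word);
        return tokens;
    }
    // Otherwise try to detach an attached street type (illustrative list).
    static const std::vector<std::string> suffixes = {"CHAUSSEE", "CH"};
    for (const auto &suf : suffixes) {
        if (word.size() > suf.size() &&
            word.compare(word.size() - suf.size(), suf.size(), suf) == 0) {
            tokens.push_back(word.substr(0, word.size() - suf.size()));
            tokens.push_back("-");   // stand-in for the dash token in the pseudocode
            tokens.push_back(suf);
            return tokens;
        }
    }
    tokens.push_back(word);
    return tokens;
}

int main() {
    // With "BACH" in the lexicon it survives unsplit; unknown attached words still split.
    std::set<std::string> lexicon = {"BACH"};
    for (const std::string w : {"BACH", "MUEHLENCH"}) {
        for (const auto &t : tokenize(w, lexicon)) std::cout << t << ' ';
        std::cout << '\n';
    }
}
```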

But here is the problem: every word that might get split but should not be split needs to be added to the lexicon. For example, every German word ending in "ch", and every word ending in any of the other suffix attachments, would need to be added if it is also a common word in addresses. This could make the lexicon HUGE and I'm not sure it is worthwhile. For example, if BACH always gets split into BA - CH, this should not impact the ability to geocode accurately.

The current workaround is that the code generates one set of tokens that are split and a second set that are not split, standardizes both, and picks the best-scoring match. Also, making sure that words are classified as both TYPE and WORD in the lexicon can solve some of the CITY word-splitting issues.
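For reference, a hedged sketch of that try-both approach; Candidate, score_candidate(), and pick_best() are hypothetical stand-ins, not the project's actual API or scoring:

```cpp
// Generate a split and an unsplit candidate token sequence, score each,
// and keep the better one. Illustrative sketch only.
#include <iostream>
#include <string>
#include <vector>

struct Candidate {
    std::vector<std::string> tokens;
    double score;
};

// Placeholder scorer: pretends longer token sequences match better.
// The real code would run the standardizer and use its match score.
double score_candidate(const std::vector<std::string> &tokens) {
    return static_cast<double>(tokens.size());
}

Candidate pick_best(const std::vector<std::string> &split_tokens,
                    const std::vector<std::string> &unsplit_tokens) {
    Candidate a{split_tokens, score_candidate(split_tokens)};
    Candidate b{unsplit_tokens, score_candidate(unsplit_tokens)};
    return (a.score >= b.score) ? a : b;
}

int main() {
    Candidate best = pick_best({"MUEHLEN", "CH"}, {"MUEHLENCH"});
    for (const auto &t : best.tokens) std::cout << t << ' ';
    std::cout << "(score " << best.score << ")\n";
}
```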

woodbri commented 8 years ago

closing with commit 6243c40..52b233d