woodbri / address-standardizer

An address parser and standardizer in C++

Portugal Grammar might need some work #30

Open woodbri opened 8 years ago


Here are some sample addresses (source: http://www.upu.int/fileadmin/documentsFiles/activities/addressingUnit/prtEn.pdf). The issue is how to handle sub-locality level 2, sub-locality, and locality.

Portugal has 18 districts which might be considered PROV but they do not explicitly come into play in the addresses. For Navteq data we can extract all the localities which would get tagged as CITY, and then potentially IGNORE the sub-localities.

The issue is not about recognizing them but about how to categorize them when we pass them to the geocoder. We currently only have CITY, PROV, and NATION. We could stuff the CITY (aka locality) into PROV and then put the sub-locality words into CITY.
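To make the two options concrete, here is a minimal C++ sketch of both mappings. The struct and function names (`GeocoderFields`, `PortugalPlace`, `mapIgnoringSubLocalities`, `mapLocalityAsProv`) are hypothetical and are not part of the address-standardizer API; only the CITY/PROV/NATION split comes from the discussion above.

```cpp
#include <string>
#include <vector>

// Hypothetical holder for the three fields the geocoder currently accepts.
struct GeocoderFields {
    std::string city;
    std::string prov;
    std::string nation;
};

// Hypothetical result of parsing the tail of a Portuguese address.
struct PortugalPlace {
    std::vector<std::string> subLocalities;  // in written order, may be empty
    std::string locality;
    std::string country;
};

// Option A: locality becomes CITY, sub-localities are ignored.
GeocoderFields mapIgnoringSubLocalities(const PortugalPlace& place) {
    return GeocoderFields{place.locality, "", place.country};
}

// Option B: locality is moved into PROV and the sub-locality words fill CITY.
// With no sub-localities CITY stays empty; the issue does not say what
// should happen in that case, so this sketch leaves it open.
GeocoderFields mapLocalityAsProv(const PortugalPlace& place) {
    std::string city;
    for (const std::string& sub : place.subLocalities) {
        if (!city.empty()) city += ' ';
        city += sub;
    }
    return GeocoderFields{city, place.locality, place.country};
}
```

For the "with sub-locality" sample below, option A would send only ALENQUER as CITY, while option B would send ALENQUER as PROV and the two sub-locality lines, joined by a space, as CITY.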

Getting more addresses for Portugal and comparing them with real street data should help clarify this.

## Home delivery (large towns):
MANUEL GASPAR                   addressee
LG DR ANTÓNIO VIANA 1 2 DTO     street + premises, floor, side
1250–096 LISBOA                 postcode + locality
PORTUGAL                        country

## with sub-locality:
MARIA SILVA ANDRADE             addressee
R PRINCIPAL VV ANDRADE          street + premises
QUINTA DA PROVENÇA              sub-locality level 2
CASAIS NOVOS                    sub-locality
2580-347 ALENQUER               postcode + locality
PORTUGAL                        country

## Home delivery (rural region):
DR. NUNO FIGUEIREDO             addressee
R. LEAL DA CAMARA 31 RL ESQ     street + premises, floor, side
ALGUEIRÃO                       sub-locality
2725–079 MEM MARTINS            postcode + locality
PORTUGAL                        country

## PO Box delivery:
PATRICIA MARTINS                addressee
APARTADO 42024                  PO Box
EC –D. LUÍS                     post office
1201–950 LISBOA                 postcode + locality
PORTUGAL                        country

## Delivery to private letter boxes:
ENG. MANUEL SOUSA               addressee
RUA DAS DESCOBERTAS             street
CCI 8318                        PO Box
PENTEADO                        sub-locality
2860–571 MOITA                  postcode + locality
PORTUGAL                        country
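The "postcode + locality" lines in these samples use both a plain hyphen (2580-347) and an en dash (1250–096) between the two digit groups. The following is a rough sketch, not the standardizer's own tokenizer, of splitting such a line in C++ while accepting either separator. `PostalLine` and `splitPostalLine` are hypothetical names, and normalizing the separator to a hyphen is an assumption of this sketch.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type: postcode normalized to "NNNN-NNN" plus locality text.
struct PostalLine {
    std::string postcode;
    std::string locality;
};

// Splits a line such as "2580-347 ALENQUER" into postcode and locality.
// Returns false if the line does not start with 4 digits, a hyphen or
// UTF-8 en dash, 3 digits, at least one space, and some locality text.
bool splitPostalLine(const std::string& line, PostalLine& out) {
    const std::string enDash = "\xE2\x80\x93";  // U+2013 encoded as UTF-8
    std::size_t pos = 0;
    std::string code;

    // Consume exactly `count` ASCII digits at `pos`, appending them to `code`.
    auto readDigits = [&](std::size_t count) {
        if (pos + count > line.size()) return false;
        for (std::size_t i = 0; i < count; ++i) {
            if (!std::isdigit(static_cast<unsigned char>(line[pos + i]))) return false;
        }
        code += line.substr(pos, count);
        pos += count;
        return true;
    };

    if (!readDigits(4)) return false;
    if (line.compare(pos, 1, "-") == 0) {
        pos += 1;
    } else if (line.compare(pos, enDash.size(), enDash) == 0) {
        pos += enDash.size();
    } else {
        return false;
    }
    code += '-';  // assumption: always emit a plain hyphen
    if (!readDigits(3)) return false;

    // Require at least one space before the locality, then trim trailing spaces.
    const std::size_t start = line.find_first_not_of(' ', pos);
    if (start == std::string::npos || start == pos) return false;
    const std::size_t end = line.find_last_not_of(' ');
    out.postcode = code;
    out.locality = line.substr(start, end - start + 1);
    return true;
}
```

The sketch does not skip leading whitespace and treats everything after the postcode as the locality, so any sub-locality lines above it would still need to be handled separately.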

The current grammar handles these by putting them into the extra field:

[macro]
@locality @postal @citywords @region @country
@locality @postal @citywords @country
@locality @postal @citywords @region
@locality @postal @citywords
@postal @citywords @region @country
@postal @citywords @country
@postal @citywords @region
@postal @citywords

[locality]
@word @locality
@word

[word]
WORD -> EXTRA -> 0.3

[citywords]
@city @citywords
@city