wmgeolab / geoBoundaries

geoBoundaries : A Political Administrative Boundaries Dataset (www.geoboundaries.org)

Faulty metadata character encodings #1397

Closed karimbahgat closed 3 years ago

karimbahgat commented 3 years ago

What is the expected behavior?

Special characters in the boundary metadata, such as accents, should display correctly in the metadata CSV files.

What is the actual behavior? The more detail the better.

Came across several CSV rows where the special characters in the metadata display as gibberish, particularly in the boundarySource-1 and boundarySource-2 fields.

Other notes on how to reproduce the issue?

It appears the issue originates at the very source, in the meta.txt file inside the sourceData zipfiles. E.g. download the CIV zipfile and extract the meta.txt. In Python3 do:

print(open('path/to/meta.txt', encoding='utf8').read())

If the file were correctly encoded in utf8, this would print the text correctly, but it doesn't. This means the file was saved in some character encoding other than utf8, or that the text was already garbled when it was copy-pasted.
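
To tell those two cases apart, something like the following might help. It is only a rough sketch: `decode_report` is a made-up helper name, and the double-encoding test assumes the garbling came from UTF-8 bytes being read as cp1252, which is a guess and not something confirmed for these files.

```python
def decode_report(raw: bytes) -> str:
    """Describe whether raw bytes are valid UTF-8 (hypothetical helper)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Case 1: the file was saved in some other encoding.
        return f"not valid UTF-8 (byte {raw[err.start]:#04x} at offset {err.start})"
    # Case 2: the bytes are valid UTF-8, but the text may already have been
    # garbled before saving. Assumption: UTF-8 bytes were misread as cp1252.
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "valid UTF-8"
    if repaired != text:
        return "valid UTF-8, but looks double-encoded"
    return "valid UTF-8"


if __name__ == "__main__":
    # Replace with the real path to an extracted meta.txt.
    with open("path/to/meta.txt", "rb") as f:
        print(decode_report(f.read()))
```

Reading the file in binary mode matters here, since opening it in text mode would hide which of the two cases applies.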

Any possible solutions?

[ ] I'll work on a PR for this!

leeberryman commented 3 years ago

@karimbahgat I also have seen similar encoding issues on Name field values, so any checks created for metadata encodings should be applied to attributes as well at submission. I particularly remember encountering this in RUS.

DanRunfola commented 3 years ago

Yes, this is in my CSV building scripts. Working on these today, so hopefully will have an update. All sorts of issues now - even some quote-encapsulation errors that break the entire thing :). Will try to reply back here with a resolution shortly.

DanRunfola commented 3 years ago

Bleh - ok, this error is turning out to be buried a little deeper in the code than I would have liked. Going to leave this issue open for now, with a plan to return. If anyone wants to jump on a PR for this please feel free; I'll post here before I pick it back up.

On the upside, I got the other CSV quoting errors fixed :)

karimbahgat commented 3 years ago

While I do think the above errors originate in the source meta.txt inside the contributed country zip files, I just submitted a PR (https://github.com/wmgeolab/geoBoundaryBot/pull/2) to make sure all inputs and outputs of the bot CSV script explicitly use the utf8 encoding. If left unspecified, the default is locale dependent, and in many cases that is a 'latin' encoding. This may take care of some yet-to-be-discovered errors, but probably not the files listed in this issue, which need to be changed at the source.
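
For reference, the general pattern looks roughly like this. It is a minimal sketch and not the actual code from the PR; `write_rows`, `read_rows`, and the sample row are made up for illustration.

```python
import csv
import locale


def write_rows(path, rows):
    # encoding is given explicitly so the result does not depend on the locale;
    # newline="" is what the csv module docs recommend for file objects.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)


def read_rows(path):
    with open(path, "r", encoding="utf-8", newline="") as f:
        return list(csv.reader(f))


if __name__ == "__main__":
    # What open() would fall back to on this machine without encoding=...
    print("locale default:", locale.getpreferredencoding(False))
    write_rows("example.csv", [["boundarySource-1"], ["Côte d'Ivoire"]])
    print(read_rows("example.csv"))
```

Passing the encoding on both the read and the write side means the output no longer changes depending on which machine runs the build.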

Out of curiosity, which script does the checks on zipfile submissions, if we want to add an encoding check?

DanRunfola commented 3 years ago

Running a test on the PR from https://github.com/wmgeolab/geoBoundaryBot/pulls now.

The script that runs when a PR is submitted and checks the metadata file itself is here: https://github.com/wmgeolab/geoBoundaryBot/blob/main/gbMetaCheck.py

If we wanted to check encodings within the attribute tables as well, then the file would be here: https://github.com/wmgeolab/geoBoundaryBot/blob/main/gbDataCheck.py
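
A check along these lines could be a starting point for the metadata side. This is a standalone sketch that does not use anything from gbMetaCheck.py or gbDataCheck.py; `find_non_utf8_members` is a hypothetical name, and it assumes the metadata ships as a .txt member inside the submitted zip.

```python
import zipfile


def find_non_utf8_members(zip_path, suffix=".txt"):
    """Return (member name, byte offset) for each text member that is not valid UTF-8."""
    bad = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(suffix):
                continue
            try:
                zf.read(name).decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the offset of the first undecodable byte
                bad.append((name, err.start))
    return bad
```

An empty list would mean every matching member decoded cleanly. Note that this only catches files saved in another encoding, not text that was already garbled and then saved as valid UTF-8.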

As usual, PRs very much welcome and appreciated :). I'm going to go look at the source data for this case now.

DanRunfola commented 3 years ago

meta.txt

Alright, fixed the encoding in the source here as well, just for reference. My guess is the source got saved as ISO-8859 automatically when it was saved on (probably) a Mac. We will need to put some clear guidelines on this for our contributions. Checks on the PRs would be very helpful as well, so I'm leaving this open until we get those resolved. In the interim we'll do manual fixes on the detected cases; I'm going to do an eyeball pass on the metadata build I'm running right now to catch those later today.
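
For anyone doing the same manual fix on other files, a rough sketch of the conversion follows. The ISO-8859-1 default is only the guess from above, not something verified per file, and `reencode_to_utf8` is a made-up helper name.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="iso-8859-1"):
    """Write a UTF-8 copy of src to dst; return True if a conversion was needed."""
    raw = Path(src).read_bytes()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: decode with the assumed source encoding, then re-encode.
        text = raw.decode(source_encoding)
        Path(dst).write_bytes(text.encode("utf-8"))
        return True
    # Already valid UTF-8: copy the bytes through unchanged.
    Path(dst).write_bytes(raw)
    return False
```

Since every byte sequence decodes under ISO-8859-1, a wrong guess will not raise an error, so the output is still worth eyeballing before committing it.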

DanRunfola commented 3 years ago

Note: I am not going to do a pass on the attribute tables of each shape to confirm this isn't an issue there; I'm sure we have some cases that are broken. If anyone runs into any, please flag with an issue.

DanRunfola commented 3 years ago

This has been fixed - a few lingering metadata issues, but the code is behaving as expected now.