zerebubuth / planet-dump-ng

Converts an OpenStreetMap database dump into planet files.
BSD 2-Clause "Simplified" License

Invalid xml data in planet osm #23

Open gartenkralle opened 3 years ago

gartenkralle commented 3 years ago

I'm not sure whether this is the right place to report this, but I found the following data in the planet-210524.osm file. The opening tag (way) doesn't match the closing tag (relation). Also, "chaer type" doesn't look valid.

 <way id="933805767" timestamp="2021-04-22T09:46:48Z" version="1" chaer type="way" ref="182400916" role="inner"/>
  <member type="way" ref="182400928" role="inner"/>
  <member type="way" ref="182400935" role="inner"/>
  <member type="way" ref="182400910" role="inner"/>
  <member type="way" ref="182400991" role="inner"/>
  <member type="way" ref="182401067" role="inner"/>
  <member type="way" ref="182400927" role="inner"/>
  <member type="way" ref="182400934" role="inner"/>
  <member type="way" ref="182400921" role="inner"/>
  <member type="way" ref="182400925" role="inner"/>
  <member type="way" ref="182400985" role="inner"/>
  <member type="way" ref="182400907" role="inner"/>
  <member type="way" ref="182401094" role="inner"/>
  <member type="way" ref="182400940" role="inner"/>
  <member type="way" ref="182401005" role="inner"/>
  <member type="way" ref="182401080" role="inner"/>
  <member type="way" ref="182401092" role="inner"/>
  <member type="way" ref="182400972" role="inner"/>
  <member type="way" ref="182400983" role="inner"/>
  <member type="way" ref="182401068" role="inner"/>
  <member type="way" ref="182400942" role="inner"/>
  <member type="way" ref="182401019" role="inner"/>
  <member type="way" ref="182400989" role="inner"/>
  <member type="way" ref="182401004" role="inner"/>
  <member type="way" ref="182401022" role="inner"/>
  <member type="way" ref="182401075" role="inner"/>
  <member type="way" ref="182401077" role="inner"/>
  <member type="way" ref="182401069" role="inner"/>
  <member type="way" ref="182401041" role="inner"/>
  <member type="way" ref="182400978" role="inner"/>
  <member type="way" ref="182401090" role="inner"/>
  <member type="way" ref="182401029" role="inner"/>
  <member type="way" ref="182401031" role="inner"/>
  <member type="way" ref="182401017" role="inner"/>
  <member type="way" ref="182400995" role="inner"/>
  <member type="way" ref="182401061" role="inner"/>
  <member type="way" ref="182400986" role="inner"/>
  <member type="way" ref="182401056" role="inner"/>
  <member type="way" ref="182400959" role="inner"/>
  <member type="way" ref="182401057" role="inner"/>
  <member type="way" ref="182401058" role="inner"/>
  <member type="way" ref="182401078" role="inner"/>
  <member type="way" ref="182401086" role="inner"/>
  <tag k="natural" v="grassland"/>
  <tag k="type" v="multipolygon"/>
 </relation>

This is not the only entry where the opening and closing tags don't match.

joto commented 3 years ago

Did you check whether the MD5 matches (see planet-210524.osm.bz2.md5)?
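For anyone without an `md5sum` command at hand, the check can be sketched in Python roughly as follows. This assumes the `.md5` file uses the usual `md5sum` layout (hex digest, whitespace, file name); the function names are only illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks so memory use stays flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's MD5 with the first field of an md5sum-style file.

    Assumption: the first whitespace-separated field is the hex digest.
    """
    with open(md5_path, "r", encoding="ascii") as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(data_path) == expected
```

Something like `matches_md5_file("planet-210524.osm.bz2", "planet-210524.osm.bz2.md5")` would then return `True` for an intact download.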

zerebubuth commented 3 years ago

I figured that if the file was corrupt, it would be very unlikely for bzip2 to output anything other than garbage. But playing around with it now, it does seem as if a corrupt bz2 file can decompress into something that isn't completely noise.

Unhelpfully, it seems that bzcat doesn't stop output when it senses a CRC error, but just outputs a warning to stderr and exits with a non-zero code after processing the rest of the file. So if you're not checking stderr or the exit code, it would be easy to think it had succeeded.
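A rough Python sketch of a wrapper that refuses to treat a decompression as successful unless the exit code is zero. The helper name `run_checked` is hypothetical, and it assumes the decompressor writes its data to stdout and its warnings to stderr, as described above.

```python
import subprocess


def run_checked(command, output_path):
    """Run a decompressor, writing its stdout to output_path.

    Raises RuntimeError on a non-zero exit code and includes whatever
    the command wrote to stderr, so a CRC warning cannot go unnoticed.
    """
    with open(output_path, "wb") as out:
        result = subprocess.run(command, stdout=out, stderr=subprocess.PIPE)
    if result.returncode != 0:
        message = result.stderr.decode("utf-8", errors="replace").strip()
        raise RuntimeError(
            f"{command[0]} exited with {result.returncode}: {message}"
        )
    return output_path
```

It could be called as `run_checked(["bzcat", "planet-210524.osm.bz2"], "planet-210524.osm")`. Note that the output file may still contain partial or corrupt data when the exception is raised, so it should be discarded in that case.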

I started testing the original file on the planet server, but it is taking a very, very long time. I'll update here when it's finished.

gartenkralle commented 3 years ago

joto wrote: "Did you check whether the MD5 matches (see planet-210524.osm.bz2.md5)?"

Yes, it matched.

zerebubuth commented 3 years ago

This is a bit weird - the planet file on the server looks completely fine. I grepped it for the way ID you mention, and the result is:

<way id="933805767" timestamp="2021-04-22T09:46:48Z" version="1" changeset="103400299" user="lipsigal" uid="438670">
  <nd ref="8654953875"/>
  <nd ref="8654953876"/>
  <nd ref="8654953877"/>
  ...

with no chaer type= or skipping into the relations section.

So if the file on the server is OK, and the MD5sum matches, and it matches your downloaded file too, does that mean that whatever problem is occurring must be during or after decompression? How are you decompressing? Using bzcat on the fly, or bunzip2, or something else?

gartenkralle commented 3 years ago

I used the 7-Zip file manager, version 19, on Windows 10 x64.

I will try another decompressor. Thanks for investigating so far.

gartenkralle commented 3 years ago

This time I tried decompressing with another tool (https://github.com/philr/bzip2-windows/releases), but got the same result.

Any more guesses?

joto commented 3 years ago

Looks to me like you (@gartenkralle) might have a hardware problem, such as faulty memory. I suggest running a memory tester.

zerebubuth commented 3 years ago

I think it's unlikely that a hardware fault would affect the decompression in exactly the same way with two different programs (with different memory layouts, etc...).

@gartenkralle are you decompressing the whole file? (In other words, do you have a file called planet-210524.osm which is not compressed?) Please could you tell me how big it is, and what the MD5sum of the decompressed file is?
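One way to get both numbers without writing the decompressed file to disk, as a rough Python sketch. The function name is hypothetical, and it relies on Python's standard `bz2` module, which should raise an exception on corrupt input rather than carrying on.

```python
import bz2
import hashlib


def decompressed_size_and_md5(bz2_path, chunk_size=1 << 20):
    """Stream-decompress a .bz2 file and return (byte count, hex MD5)."""
    digest = hashlib.md5()
    total = 0
    with bz2.open(bz2_path, "rb") as stream:
        # Read fixed-size chunks so the whole file is never held in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
            total += len(chunk)
    return total, digest.hexdigest()
```

Because this uses a different decompressor implementation from the command-line tools, a matching result would also be a useful cross-check.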

gartenkralle commented 3 years ago

I ran a two-cycle memory check. No faulty memory was found.

Yes, I decompressed the whole file. I'm decompressing it again and will then run MD5sum on it. I'll report the results in a few days...

gartenkralle commented 3 years ago

Size: 1.542.302.591.588 Bytes

MD5 now running...

gartenkralle commented 3 years ago

MD5 checksum: dfdff2778d0dfad6569ecc2b3613fbb4

zerebubuth commented 3 years ago

Here's what I got, for the same input file (our MD5s match for the .osm.bz2) - I guess the computer I was using was much slower!

MD5: 2cf5fcca63685b13440902f0f1fa24e6 Size: 1,542,302,591,588

We get the same size, but different MD5s. I think something might be going wrong because it's a 1.4 TiB file, and that might be pushing the limits of what the decompression software has been tested with (perhaps some subtle bugs when the file length / offset exceeds 40 bits?)

It might be worth trying some other software. I'm using bzip2, a block-sorting file compressor. Version 1.0.8, 13-Jul-2019 on Linux, so it might be worth trying to replicate that (either a virtual machine, or Windows Subsystem for Linux).

Alternatively, is it possible to do what you wanted without decompressing the whole file? If whatever is parsing the OSM file is capable of streaming (e.g: SAX or event parser) then you could bzcat planet.osm.bz2 | whatever and not need to uncompress the whole thing.
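As an illustration of the streaming approach, here is a minimal Python sketch that feeds the compressed file straight into a SAX parser. The `ElementCounter` handler is a made-up example that only counts elements; a real handler would do whatever processing is needed in its place.

```python
import bz2
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Count node, way and relation elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.counts = {"node": 0, "way": 0, "relation": 0}

    def startElement(self, name, attrs):
        # Called once per opening (or self-closing) tag.
        if name in self.counts:
            self.counts[name] += 1


def count_elements(bz2_path):
    """Parse a .osm.bz2 file without decompressing it to disk first."""
    handler = ElementCounter()
    with bz2.open(bz2_path, "rb") as stream:
        xml.sax.parse(stream, handler)
    return handler.counts
```

The parser only ever sees a small buffer at a time, so neither the 1.4 TiB of XML nor any large intermediate file is needed.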

Finally, if all those things won't work, then it might be worth rewriting your parser to use the PBF binary file. The data inside is exactly the same, but the PBF is about half the size of the XML and 10 or more times quicker to parse. @joto's excellent https://github.com/osmcode/libosmium is a well-tested and fast library for parsing PBFs, and there's a suite of utilities (https://github.com/osmcode/osmium-tool) for common tasks such as making geographic extracts and filtering by tags. (I think it builds on Windows, but I don't know enough about Windows to say for sure.)

gartenkralle commented 3 years ago

Thanks for all your tips. Even with bzip2 under Cygwin I got the wrong MD5 checksum. Maybe it's a very low-level bug or a file system bug. I'll now try decompressing on Linux and transferring the file to Windows. Otherwise I will go with the PBF.

mmd-osm commented 3 years ago

@gartenkralle: do you have any updates on this? Can this issue be closed now?

gartenkralle commented 3 years ago

Yes, the issue can be closed.

The tool which calculated the checksum after decompression was wrong. I also made a mistake in my parsing method: the XML file contains relations which have no members, and I had not considered that case. Additionally, I had not considered that UTF-8 characters have variable byte lengths. After fixing these, it worked fine.
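Both pitfalls are easy to reproduce. The following Python sketch is not the original parser; it assumes a hand-written reader that processes the file in byte chunks, and the helper names are hypothetical.

```python
import codecs
import xml.etree.ElementTree as ET


def decode_chunks(chunks):
    """Decode UTF-8 byte chunks, even when a character straddles two chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    text = "".join(decoder.decode(chunk) for chunk in chunks)
    # final=True raises if the input ends in the middle of a character.
    return text + decoder.decode(b"", final=True)


def member_refs(relation_xml):
    """Return the member refs of one <relation>; no members gives []."""
    relation = ET.fromstring(relation_xml)
    return [member.get("ref") for member in relation.findall("member")]


# "ü" takes two bytes in UTF-8, so byte length and character count differ.
name = "Zürich"
encoded = name.encode("utf-8")
print(len(name), len(encoded))

# The split falls in the middle of "ü"; decoding each chunk on its own
# would fail, while the incremental decoder carries the partial byte over.
print(decode_chunks([encoded[:2], encoded[2:]]))

# A relation with neither members nor tags is a single self-closing element.
print(member_refs('<relation id="1" version="1"/>'))
```

Using a real XML parser and an incremental decoder avoids both problems, since neither relies on byte offsets lining up with characters or on every relation having child elements.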