zdavatz / oddb2xml

oddb2xml, create xml files using refdata, swissmedic and bag xml files
http://www.ywesee.com/Oddb2xml/Index
GNU General Public License v3.0

Illegal characters in datasets #16

Closed: col-panic closed this issue 9 years ago

col-panic commented 9 years ago

It seems that there are some problems with character encoding in the created datasets.

We face entries like

11:15:52.795 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xC2\x92s gl...' for column 'DSCR' at row 1
11:15:59.071 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xC2\x928 af...' for column 'DSCR' at row 1
11:16:11.601 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xC2\x9636 M...' for column 'DSCR' at row 1
11:16:12.961 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xC2\x96 Jog...' for column 'DSCR' at row 1
11:16:29.075 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xC2\x89 100...' for column 'DSCR' at row 1
11:17:15.998 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xE2\x89\xA4 30...' for column 'LIMITATION_TXT' at row 1
11:17:16.009 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xE2\x89\xA4 30...' for column 'LIMITATION_TXT' at row 1
11:17:16.023 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xE2\x89\xA4 30...' for column 'LIMITATION_TXT' at row 1
11:17:16.033 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xE2\x89\xA4 30...' for column 'LIMITATION_TXT' at row 1
11:17:25.245 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xE2\x89\xA5 30...' for column 'LIMITATION_TXT' at row 1
11:17:25.256 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xE2\x89\xA5 30...' for column 'LIMITATION_TXT' at row 1
11:17:26.249 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xE2\x89\xA5 16...' for column 'LIMITATION_TXT' at row 1
11:17:26.277 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xE2\x89\xA5 16...' for column 'LIMITATION_TXT' at row 1
11:17:30.666 [main] WARN  java.lang.Throwable - java.sql.SQLException: Incorrect string value: '\xCE\xB1) ni...' for column 'LIMITATION_TXT' at row 1

on certain MySQL databases. I tracked it down to entries like

WARNING SCHAR SEMPER Cookie-O’s glutenfrei 150 g
WARNING SCHAR CER’8 after sting Roll-on 20 ml
WARNING SCHAR DermaSilk Set Body + Strumpfhöschen 24–36 Mon (98)
WARNING SCHAR Inkosport Activ Pro 80 Himbeer – Joghurt Ds 750g
WARNING SCHAR Ethacridin lactat 1‰ 100ml
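
The `\xC2\x92`, `\xC2\x96` and `\xC2\x89` sequences in the log can be reproduced from the names above. This is a minimal sketch, assuming (the thread does not say so) that the source stored ’, – and ‰ as single Windows-1252 bytes and that a converter read them as Latin-1 before writing UTF-8. The helper name `mis_transcode` is made up for illustration. The ≤, ≥ and α sequences (`\xE2\x89\xA4`, `\xE2\x89\xA5`, `\xCE\xB1`) are a different case: they are well-formed UTF-8, so those rejections may come from the column's character set rather than from the XML.

```python
# Assumption (not stated in the thread): the source stored these characters
# as single Windows-1252 bytes and a converter misread them as Latin-1.
def mis_transcode(text: str) -> bytes:
    """Encode as cp1252, misread the bytes as Latin-1, write out as UTF-8."""
    raw = text.encode("cp1252")
    return raw.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    for name in ["Cookie-O’s", "24–36 Mon", "1‰ 100ml"]:
        # The printed bytes contain the C2 92 / C2 96 / C2 89 pairs from the log.
        print(name, "->", mis_transcode(name))
```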

vi shows the data e.g. like this

3731928     <DSCRD>Ethacridin lactat 1<89> 100ml                        </DSCRD>
3731929     <DSCRF>Ethacridin lactat 1<89> 100ml                        </DSCRF>
3731930     <SORTD>ETHACRIDIN LACTAT 1<89> 100ML                        </SORTD>
3731931     <SORTF>ETHACRIDIN LACTAT 1<89> 100ML                        </SORTF>

where the <89> is a non-printable character.

Could you please ensure that only valid characters are used in the XML files?
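
A check for this kind of character could look roughly like the following. It is only a sketch: it flags code points in the C1 control range (U+0080 to U+009F), which is where the `<89>` shown by vi falls. The function name `find_c1_controls` and the sample string are illustrative and not part of oddb2xml.

```python
import re

# C1 control range U+0080..U+009F. These encode to valid UTF-8 bytes,
# but they are almost never intended in product descriptions.
C1_CONTROLS = re.compile("[\u0080-\u009f]")


def find_c1_controls(text):
    """Return (line, column, codepoint) for every C1 control character."""
    hits = []
    # Split on "\n" only: str.splitlines() would also split on U+0085,
    # which is itself one of the characters we want to report.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in C1_CONTROLS.finditer(line):
            codepoint = "U+%04X" % ord(match.group())
            hits.append((lineno, match.start() + 1, codepoint))
    return hits


if __name__ == "__main__":
    sample = "<DSCRD>Ethacridin lactat 1\u0089 100ml</DSCRD>"
    for hit in find_c1_controls(sample):
        print(hit)
```

Running it over the decoded text of a generated XML file before the database import would list each offending position.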

zdavatz commented 9 years ago

What charset is your computer set to? All characters display correctly on my Mac and on Linux.

zdavatz commented 9 years ago

/tmp/oddb2xml> file -i oddb_article.xml
oddb_article.xml: application/xml; charset=utf-8
/tmp/oddb2xml> file -i oddb_product.xml
oddb_product.xml: application/xml; charset=utf-8

col-panic commented 9 years ago

I am currently very busy, please give me some time for feedback!

zdavatz commented 9 years ago

ok, sure.

ngiger commented 9 years ago

I found the problem. I must convert each line from ISO-8859-9 (transfer.dat) to UTF-8 before extracting the name. Should be fixed soon.
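
The idea of decoding each line before extracting the name can be sketched as follows. This is not the oddb2xml code: the function `extract_name`, the field offset and width, and the sample record are hypothetical. The source encoding is a parameter. The thread names ISO-8859-9, while the demo passes `cp1252` purely as an assumption, because the bytes 0x89, 0x92 and 0x96 decode to ‰, ’ and – under that code page.

```python
def extract_name(raw_line: bytes, encoding: str, start: int, length: int) -> str:
    """Decode one fixed-width record first, then slice out the name field."""
    # Decoding before slicing keeps offsets aligned: in a single-byte
    # source encoding, one byte is exactly one character.
    text = raw_line.decode(encoding)
    return text[start:start + length].strip()


if __name__ == "__main__":
    # Hypothetical record layout: 4-character id followed by a 50-character name.
    name_field = "Ethacridin lactat 1‰ 100ml".encode("cp1252").ljust(50)
    record = b"0001" + name_field
    name = extract_name(record, "cp1252", start=4, length=50)
    print(name)                  # a regular Python (Unicode) string
    print(name.encode("utf-8"))  # the bytes that would be written to the XML
```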

zdavatz commented 9 years ago

OK, this will ship with version 2.0.6, by Thursday at the latest, in time for the event: http://hin.ch/anlass-mediupdate

col-panic commented 9 years ago

Thanks a lot @ngiger :+1:

zdavatz commented 9 years ago

The 2.0.6 gem is out. Marco, could you please test whether everything works for you now? Thanks for your feedback.

col-panic commented 9 years ago

Okay, that looks better now!

<DSCR>Ethacridin lactat 1‰ 100ml</DSCR>

I will still test the import :+1:

zdavatz commented 9 years ago

Thank you!

zdavatz commented 9 years ago

Version 2.0.8 is out. Please test.

col-panic commented 9 years ago

Looks good, no more problems after the update! Thanks!

zdavatz commented 9 years ago

Don't forget to always delete the /downloads folder completely before running the job, with "rm -r downloads".