wiseio / paratext

A library for reading text files over multiple cores.
Apache License 2.0

Unexpected content conversion for hex string data #47

Closed alexeygrigorev closed 7 years ago

alexeygrigorev commented 7 years ago

I'm reading the following CSV file:

uuid,document_id,timestamp,platform,geo_location,traffic_source
1fd5f051fba643,120,31905835,1,RS,2
8557aa9004be3b,120,32053104,1,VN>44,2
c351b277a358f0,120,54013023,1,KR>12,1
8205775c5387f9,120,44196592,1,IN>16,2
9cb0ccd8458371,120,65817371,1,US>CA>807,2
2aa611f32875c7,120,71495491,1,CA>ON,2
f55a6eaf2b34ab,120,73309199,1,BR>27,2
cc01b582c8cbff,120,50033577,1,CA>BC,2
6c802978b8dd4d,120,66590306,1,CA>ON,2

But paratext reads it as follows:

[screenshot: selection_167]

The uuid conversion is totally unexpected, and the issue persists even if I pass text_names=['uuid'].

alexeygrigorev commented 7 years ago

#31 seems to solve it

[screenshot: selection_168]

(although there's an issue with parsing the last column)

deads commented 7 years ago

Thank you for reporting your issue. Indeed, #31 solves the issue, but we are waiting for the PR issuer to remerge so we can run the tests on the PR before merging into master.

Most of the regression tests assume all data is double-quoted, because this is what I do for most of the data files I use in a production environment. paratext supports backslash-escape sequences, so in theory any arbitrary byte sequence can be represented.

If you have a very messy CSV file, you can use paratext.serial.write_frame, which writes the data out using a configurable backslash-escaping scheme (arbitrary 8-bit, printable ASCII, UTF-8, etc.). In fact, the regression tests generate arbitrary UTF-8 and byte data, save it in all possible formats, and read it back in. However, the key assumption for this to work is that all non-numeric data is backslash-escaped.
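
For a file like the one in this report, one possible workaround is to double-quote the text fields before loading, so the file matches the quoting assumption described above. The following is only a sketch using the Python standard library; the helper name quote_text_fields and the "plain integer or decimal" rule for what counts as numeric are assumptions made for illustration and are not part of paratext.

```python
import csv
import io
import re

# Assumption: only plain integers and decimals count as numeric.
# Everything else (hex-like ids, country codes, headers) is treated as text.
NUMERIC = re.compile(r"-?[0-9]+(\.[0-9]+)?")


def quote_text_fields(src, dst):
    """Copy CSV rows from src to dst, double-quoting every non-numeric field."""
    for row in csv.reader(src):
        out = []
        for field in row:
            if NUMERIC.fullmatch(field):
                out.append(field)  # leave numbers unquoted and unchanged
            else:
                # Embedded quotes are doubled (the csv-module convention);
                # the sample data contains none.
                out.append('"' + field.replace('"', '""') + '"')
        dst.write(",".join(out) + "\n")


sample = (
    "uuid,document_id,timestamp,platform,geo_location,traffic_source\n"
    "1fd5f051fba643,120,31905835,1,RS,2\n"
    "8557aa9004be3b,120,32053104,1,VN>44,2\n"
)

buf = io.StringIO()
quote_text_fields(io.StringIO(sample), buf)
result = buf.getvalue()
print(result)
```

With real files, the same function can be given two open file objects instead of the in-memory buffers. Whether the quoted output then loads as expected still depends on the paratext version in use.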

deads commented 7 years ago

This issue has been resolved in the latest master.