wireservice / csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.
https://csvkit.readthedocs.io
MIT License

csvlook, csvstat not usable due to "must be str, not bytes" error #346

Closed bklaas closed 9 years ago

bklaas commented 10 years ago

csvlook data_dict_ca1901a.csv
must be str, not bytes

I am using ArchLinux, Python 3.4.1, and csvkit installed via pip (compiled against Python 3.4.1). This feels like a Python 3.x issue.

Possible pointers for a fix:
http://stackoverflow.com/questions/5512811/builtins-typeerror-must-be-str-not-bytes
http://stackoverflow.com/questions/4980292/programming-python-for-absolute-beginners-chapter-7-storing-complex-data

gotero commented 10 years ago

I've gotten the same error on Python 2.6.6 and 2.7.6.

esparta commented 10 years ago

Can you provide a sample of your data? It's needed to reproduce your problem and maybe provide a solution/patch.

gotero commented 10 years ago

I was using the data from the tutorial, ne_1033_data.xlsx.

bklaas commented 10 years ago

I can reproduce the problem against these three lines of CSV:

[bklaas@bklaas csvkit_testing]$ cat test.csv
RecordType,Var,Col,Wid,Frm,Value,VarLabel,ValueLabel,VarLabelOrig,ValueLabelOrig,Freq,Sel,Notes,Svar,ValueSvar,VarLabelSvar,ValueLabelSvar,UnivSvar,NoRec,NonTab,Hide,Decim,String,CommP,CodeTy,DDoc1,DTag1,JDoc1,JTag1,DDoc2,DTag2,JDoc2,JTag2,AnchorForm,AnchorInst
C,RT,1,1,,,Record Type,,RECTYPE,1(1),,,,US2009A_0010,,,,All households and group quarters.,1,,,,,,,x,x,x,x,x,x,x,x,,
C,SERIALNO,2,7,,,Housing unit/GQ person serial number,,SERIAL,2(7),skip: 1382515,,,US2009A_0011,,,,All households and group quarters.,1,,,,,,,x,x,x,x,x,x,x,x,,
[bklaas@bklaas csvkit_testing]$ csvlook test.csv
must be str, not bytes
[bklaas@bklaas csvkit_testing]$

bklaas commented 10 years ago

...and to be clear, I can reproduce the issue with any CSV file.

[bklaas@bklaas csvkit_testing]$ cat test2.csv
Test,Test2,Test3
Foo,Bar,Foobar
[bklaas@bklaas csvkit_testing]$ csvlook test2.csv
must be str, not bytes
[bklaas@bklaas csvkit_testing]$

heyalexej commented 10 years ago

Same issue here after updating from 0.8.0 to 0.9.0 today. Appears on streaming data from an API and any csv file.

proj/tee git:(feature/social) ➜  cat xcy.csv | csvsort -c 7 | csvlook 
must be str, not bytes
proj/tee git:(feature/social) ➜  cat xcy.csv | csvlook  
must be str, not bytes
proj/tee git:(feature/social) ➜  cat xcy.csv | csvsort -r -c 7 | head        
URL,Pinterest,LinkedIn,Facebook like_count,StumbleUpon,Facebook share_count,Facebook total_count,GooglePlusOne,Delicious,Twitter,Facebook commentsbox_count,Facebook click_count,Diggs,Buzz,Facebook comment_count,Reddit
http://teespring.com/vettechsuperpower,545,2,38156,4,7822,54120,5,0,45,0,0,0,0,8142,0
http://teespring.com/veteranforfreedom2,0,0,12627,0,2784,15882,0,0,13,0,0,0,0,471,0
http://teespring.com/usmc-limitededition,4,0,10112,0,2162,13065,0,0,14,0,0,0,0,791,0
http://teespring.com/vettechmutts,309,0,8331,0,2338,12092,0,0,1,0,0,0,0,1423,0
http://teespring.com/vetsforfreedom,0,0,7045,0,1226,8500,0,0,2,0,0,0,0,229,0
http://teespring.com/valdez,3,0,4936,0,1031,7951,0,0,1,0,0,0,0,1984,0
http://teespring.com/upallnightcolts,1,0,4714,0,1211,6988,0,0,0,0,0,0,0,1063,0
http://teespring.com/veterand2,0,0,5549,0,1104,6930,0,0,1,0,0,0,0,277,0
http://teespring.com/valeriethingmeme,5,0,3581,0,1183,5641,0,0,2,0,0,0,0,877,0
proj/tee git:(feature/social) ➜  file xcy.csv                                 
xcy.csv: ASCII text, with CRLF line terminators
proj/tee git:(feature/social) ➜  stat xcy.csv                                 
  File: ‘xcy.csv’
  Size: 133892      Blocks: 264        IO Block: 4096   regular file
Device: fc09h/74513d    Inode: 279301      Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/     duck)   Gid: ( 1000/     duck)
Access: 2014-10-19 07:13:11.437361226 +0700
Modify: 2014-10-19 06:48:43.829318810 +0700
Change: 2014-10-19 06:48:43.829318810 +0700
 Birth: -

Rolling back to 0.8.0 for now.

Cutuchiqueno commented 10 years ago

I can confirm this with any CSV data (Arch Linux, Python 3.4.2 and csvkit 0.9).

teto commented 10 years ago

Confirmed too, with Python 3.4 on Ubuntu.

jjedMoriAnktah commented 10 years ago

The same error on Windows 8 and Python 2.7.8.

bijanhoule commented 9 years ago

I'm seeing the same thing (Python 3.4.2 / csvkit 0.9.0):

% csvlook thing.csv -v

Traceback (most recent call last):
  File "/home/bhoule/python/bin/csvlook", line 9, in <module>
    load_entry_point('csvkit==0.9.0', 'console_scripts', 'csvlook')()
  File "/u/bhoule/python/lib/python3.4/site-packages/csvkit/utilities/csvlook.py", line 78, in launch_new_instance
    utility.main()
  File "/u/bhoule/python/lib/python3.4/site-packages/csvkit/utilities/csvlook.py", line 61, in main
    write('%s\n' % divider)
  File "/u/bhoule/python/lib/python3.4/site-packages/csvkit/utilities/csvlook.py", line 59, in <lambda>
    write = lambda t: self.output_file.write(t.encode('utf-8'))
TypeError: must be str, not bytes
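
The traceback points at a write that encodes the text to bytes before handing it to the output stream. On Python 3 the standard output is normally a text stream, which only accepts str. A minimal sketch of that mismatch follows; io.StringIO stands in for the text-mode output, and the helper names write_encoded and write_text are made up for illustration, not taken from csvkit. The exact wording of the TypeError differs between stream types and Python versions, so the sketch only checks the exception type.

```python
import io


def write_encoded(stream, text):
    # Same shape as the failing lambda in the traceback:
    # encode to bytes first, then write.
    stream.write(text.encode('utf-8'))


def write_text(stream, text):
    # Writes the str unchanged, which is what a text stream expects.
    stream.write(text)


# Assumption: a text-mode stream, as sys.stdout usually is on Python 3.
text_stream = io.StringIO()

try:
    write_encoded(text_stream, '|---|\n')
    raised = False
except TypeError:
    # A text stream rejects bytes.
    raised = True

write_text(text_stream, '|---|\n')
result = text_stream.getvalue()
print(raised, repr(result))
```

On Python 2, str is already a byte string, so the same encode-then-write pattern would not fail in this way there.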

bklaas commented 9 years ago

Unfortunately I have tested this and it's not a complete fix.

With released csvkit:
[bklaas@bklaas csvkit]$ csvcut -c variable,label,rec usa_variables.csv | csvstat
must be str, not bytes

With git clone checkout that includes the fix:
(csvkit)[bklaas@bklaas csvkit]$ csvcut -c variable,label,rec usa_variables.csv | csvstat
'str' does not support the buffer interface

I can run csvcut in the released csvkit without the pipe to csvstat. In the github checkout, I can't run any csvkit commands at all without the "does not support the buffer interface" error.

I am on python 3.4.2.
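
The second message is the mirror image of the first: a str is being written to a stream that expects bytes. One possible way to handle both cases is to check the stream type before writing. The sketch below is only an illustration under that assumption; write_any is a hypothetical helper, not the fix that went into csvkit, and io.BytesIO / io.StringIO stand in for binary and text output streams.

```python
import io


def write_any(stream, text, encoding='utf-8'):
    """Hypothetical helper: write text to a text or a binary stream."""
    if isinstance(stream, io.TextIOBase):
        # Text streams take str directly.
        stream.write(text)
    else:
        # Binary streams need bytes, so encode first.
        stream.write(text.encode(encoding))


binary_stream = io.BytesIO()

try:
    # Writing str to a binary stream fails on Python 3.
    binary_stream.write('a,b\n')
    raised = False
except TypeError:
    raised = True

text_stream = io.StringIO()
write_any(text_stream, 'a,b\n')
write_any(binary_stream, 'a,b\n')
print(raised, repr(text_stream.getvalue()), repr(binary_stream.getvalue()))
```

The type check keeps the decision in one place, so callers do not need to know whether the output was opened in text or binary mode.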

bklaas commented 9 years ago

I set up a virtualenv for Python 2.7 and I don't see the issue using it, so the remaining problem appears to be Python 3.x-specific.

Same command as in my last comment:
[bklaas@bklaas csvkit]$ csvcut -c variable,label,rec usa_variables.csv | csvstat

  1. variable
    <type 'unicode'>
    Nulls: False
    Unique values: 1310
    Max length: 12
  2. label
    <type 'unicode'>
    Nulls: False
    Unique values: 1295
    5 most frequent values:
      Flag for Vacancy: 2
      Census year: 2
      Record type: 2
      Data set number: 2
      Internal version of race from the PUMS: 2
    Max length: 151
  3. rec
    <type 'unicode'>
    Nulls: False
    Values: H, P

Row count: 1310

onyxfish commented 9 years ago

Annnnnnd I broke the tests.

critmcdonald commented 9 years ago

I'm having this problem on Windows 8. I can't use csvlook or csvstat on any file without getting the "must be str, not bytes" error. I'm using Python 3.4.3.

I was able to use the csvsql command, which surprised me, but pleasantly so.