wesm / pydata-book

Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media

Problem with parsing Movie Lens data using code in book #59

Closed chrisrb10 closed 7 years ago

chrisrb10 commented 7 years ago

Hi,

I am working through the Ch02 material - and have a problem with the initial reading of the movie lens data. I am running the initial code as in the book:

import pandas as pd
import os
encoding = 'latin1'

upath = os.path.expanduser('pydata-book-master/ch02/movielens/users.dat')
rpath = os.path.expanduser('pydata-book-master/ch02/movielens/ratings.dat')
mpath = os.path.expanduser('pydata-book-master/ch02/movielens/movies.dat')

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
mnames = ['movie_id', 'title', 'genres']

(with paths amended to work for where I have the files)

but when I run the line:

users = pd.read_csv(upath, sep='::', header=None, names=unames, encoding=encoding)

I get the message:

/Users/Chris/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  if __name__ == '__main__':

I have tried switching to a Python2 kernel - but get an equivalent message.

What is the root of this issue? As far as I can tell, it is having a problem with the multi-character '::' being specified as the data separator, but I don't really understand how to correct this. How should I fix it to avoid similar issues with this and future code in the book?

Many thanks

wesm commented 7 years ago

@chrisrb10 this isn't actually a bug or a problem.

@jreback @jorisvandenbossche new users have asked me about this warning many times; is there some way we could make the default behavior for multichar separators a bit more friendly?

chrisrb10 commented 7 years ago

@wesm Thanks Wes. And excuse the question - I think I can now understand how to avoid it after some digging in the Pandas documentation. I agree though (particularly as a novice user) that the behaviour / warning message could be more friendly, particularly as users working with this piece of code are likely to be at the beginning of the learning journey.

Great book, btw, I am finding it hugely informative and rewarding.

Thanks

jorisvandenbossche commented 7 years ago

> is there some way we could make the default behavior for multichar separators a bit more friendly?

I think that would be to just not display the warning? (I don't see another way, apart from trying to improve the wording of the warning.) But then of course you lose the information that is included in the warning (and potentially get complaints that it is slow when using multi-char separators).

> I think I can now understand how to avoid it after some digging in the Pandas documentation.

@chrisrb10 Just to ask, as a user's experience is very valuable for seeing how we can improve things: what did you find in the end? The solution is actually written explicitly in the warning message: "you can avoid this warning by specifying engine='python'". So is it more that the warning scares people, who then think it is an error?

I agree that the warning message is rather difficult, though. It already talks about 'engines' and 'regular expressions', while you just wanted to read in a CSV file.
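To make the "regex separators" part of the warning a bit more concrete, here is a minimal sketch using only Python's standard `re` module. It is not pandas' actual implementation; it only illustrates what it means for a multi-character separator such as '::' to be interpreted as a regular expression. The sample line and the '||' separator are made up for illustration.

```python
import re

# Made-up sample line in the '::'-delimited layout discussed in this thread.
line = "1::F::1::10::48067"

# '::' contains no regex metacharacters, so as a pattern it matches literally.
fields = re.split("::", line)
print(fields)

# Hypothetical separator built from a regex metacharacter: '|' means
# alternation in a pattern, so it has to be escaped to match literally.
other = "a||b||c"
escaped_fields = re.split(re.escape("||"), other)
print(escaped_fields)
```

For '::' the regex interpretation and the literal interpretation give the same fields, which is why the book's code still parses the file correctly.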

chrisrb10 commented 7 years ago

@jorisvandenbossche after some more searching, I found: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

and understood what I needed to do. It wasn't immediately clear (for a novice) that engine='python' was an argument to the read_csv function. My initial reaction was that it was a setting that needed to be specified at a programme level, and I was trying the command higher up, after the import lines.

I think the warning may scare / unnerve people - and for me it wasn't immediately obvious that the code had still successfully run, and that this was just a warning. Perhaps more clarity in the titling (around 'FutureWarning:' perhaps?) to clarify the difference between a warning and a real error / bug?
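The difference between a warning and a real error can be sketched with the standard `warnings` module alone. This is only an illustration under assumed names: `read_with_warning` and `read_with_error` are hypothetical stand-ins, not pandas functions. The first emits a warning and still returns its result; the second raises an exception and returns nothing.

```python
import warnings


def read_with_warning():
    """Hypothetical stand-in for a call that warns but still does its job."""
    warnings.warn("falling back to a slower code path", UserWarning)
    return [1, 2, 3]


def read_with_error():
    """Hypothetical stand-in for a call that fails outright."""
    raise ValueError("could not parse input")


# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = read_with_warning()  # execution continues past the warning

warning_count = len(caught)

# An exception, by contrast, interrupts the call unless it is handled.
try:
    read_with_error()
    failed = False
except ValueError:
    failed = True

print(result, warning_count, failed)
```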

thanks again

mkimartinez commented 6 years ago

You just have to specify engine='python', as shown below:

users = pd.read_table('/datasets/movielens/users.dat', sep='::', engine='python', header=None, names=unames)
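For anyone who wants to try this without the data files, here is a small self-contained sketch. It assumes a reasonably recent pandas (one that provides `pd.errors.ParserWarning`) and uses two illustrative rows in the '::'-delimited layout instead of the real users.dat. It records any warnings raised during the read so you can check that passing engine='python' explicitly avoids the ParserWarning.

```python
import io
import warnings

import pandas as pd

# Two illustrative rows in the same '::'-delimited layout (not the real file).
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# Record warnings so we can check whether a ParserWarning was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    users = pd.read_csv(sample, sep='::', header=None, names=unames,
                        engine='python')

parser_warnings = [w for w in caught
                   if issubclass(w.category, pd.errors.ParserWarning)]

print(users)
print(len(parser_warnings))
```

The only change from the book's line is the extra engine='python' keyword argument inside the read_csv call; nothing needs to be set at the top of the script.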