wesm / pydata-book

Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media

What would everyone like to see in the 2nd edition? #37

Closed wesm closed 7 years ago

wesm commented 8 years ago

I've started working on the revised 2nd Edition of Python for Data Analysis. The agenda / table of contents is not set in stone, though!

Any comments on the existing content or requests for new content would be welcome here. I can't make any promises, but since I know how useful the book has been for many people the last 3.5 years, I would like to make sure the 2nd edition is just as useful (if not more so!) in the following 3.5 years (which will put us all the way to 2020, if you can believe it).

Thank you all in advance for the support.

vchollati commented 8 years ago

Adding content about dask, distributed, ibis would be awesome. :P

wesm commented 8 years ago

@vchollati with 4 years of history in the rearview mirror, I'm pretty hesitant to write about projects that are still under active development -- we did a major edit of the 1st edition to fix pandas API breakage, so my rule of thumb will be to include only code examples that I feel confident will still work 2 years from now.

Since the 1st edition got translated into at least 5 other languages (one or two more may be in the works), this stability is extra important as fixes in the primary English edition may take a lot longer to percolate to the translations.

little7dragon commented 8 years ago

Adding practical analysis theory and coverage of ibis would be even better.

atacca commented 8 years ago

A request! How about a bit on creating a simple file handler for web forms, with output to live graphs?

m-sostero commented 8 years ago

Minor point, but I assume the section on IPython will be complemented with a section on Jupyter.

kokes commented 8 years ago

Speaking as someone who has only skimmed the first edition (and is eagerly awaiting the second edition, to purchase in multiple copies as reference material at work):

Some high performance tips, both within pandas (at/iat vs loc/iloc, working with .values, generally how to squeeze the most out of pure pandas) and using other tools (dask, blaze, numba, cython, pypy, ...). I know these tips are scattered around and that there's a whole chunk in the advanced numpy chapter, but a whole chapter dedicated to speedy processing might be worthwhile.
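To make the at/iat and .values points concrete, a minimal sketch (the frame and names here are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1000), "b": np.random.randn(1000)})

# Scalar access: .at/.iat skip the overhead of the more general .loc/.iloc
label_val = df.at[10, "a"]   # label-based scalar lookup
pos_val = df.iat[10, 0]      # position-based scalar lookup

# Dropping to the underlying NumPy array bypasses pandas' indexing
# machinery entirely, which helps in tight loops
total = df["b"].values.sum()
```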

Regarding the comments above - the whole feather/ibis/arrow matter, while still under active development, would probably deserve a mention (not code, just a mention), so that readers know what to anticipate. And readers coming to the book years after publication can then look up the relevant code.

Thanks for your work on all this.

briandk commented 8 years ago

I teach a data science/computational modeling course for undergraduates. We assume no prior programming experience, and I'm thinking of using Python/pandas. So, this may not be the book for us and my requests may not work 😇

  1. pandas has some built-in plot methods. Will the book talk about/teach those at all, or will it be up to the reader to read the pandas docs on visualization?
  2. Will there be anything on interactive visualizations for the web (like the Bokeh package)? I could totally imagine, though, that that's either outside the scope of the book and/or a package that's not mature enough.

Lastly, thank you so much for making pandas!

chris1610 commented 8 years ago

Anyway, great stuff and I really look forward to the new book!

josepmv commented 8 years ago

I bought the first edition and I really enjoyed it: I read it linearly to learn pandas, and now I'm using it as a reference book. I like it because it's very well organized. In the next edition, I would add:

leondutoit commented 8 years ago

Some topics that come to mind:

fdelaunay commented 8 years ago

Hi Wes, I'd love to read a chapter about Reproducible Research. How to make an analysis reproducible, how to use IPython notebooks efficiently, how to collaborate well as a team. Thanks for asking ^^

cjburgoyne commented 8 years ago

I've been unpacking the recent Tom Augspurger posts and finding those very insightful. Would love to see strategies for unit testing scripts, especially with pipe/assign chains.

TomAugspurger commented 8 years ago

Excited that the 2nd Edition is happening :)

The layout of the first edition works well. Starting with the high-level overview of "Why Python" is great (and after these 4 years, you have even more evidence for why Python is a good choice). I really enjoyed the introductory examples to show the capabilities, before delving into sections of the APIs.

I hope you keep the section on NumPy, at least the parts on ufuncs and broadcasting. I wonder if xarray is worth a section as well.
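As a small sketch of the broadcasting idea that NumPy section covers: demeaning the rows of an array without an explicit loop.

```python
import numpy as np

arr = np.arange(12.0).reshape(3, 4)   # shape (3, 4)
row_means = arr.mean(axis=1)          # shape (3,)

# Broadcasting: adding a trailing length-1 axis lets the (3, 1) means
# stretch across the four columns of the (3, 4) array
demeaned = arr - row_means[:, np.newaxis]
```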

Plotting and visualization is a bit tricky. I think the pandas .plot API is pretty much settled aside from GroupBy.plot, which is discussed here. Seaborn deserves a mention, Bokeh as well probably. Perhaps even javascript based tools like d3, inside the notebook.

I'm curious to see what you do (if anything) with the Timeseries and Financial Applications sections. Pandas is great at timeseries, and the additions of TimeGroupers and the groupby-like resample, rolling, and expanding APIs only add to that. I'm guessing part of the reason they got prominence in the first edition was because of your background. I'm looking forward to what your experiences in big-data land have taught you about interacting with it from python / pandas.

The biggest omission these days is probably a section on interfacing with scikit-learn. They've done some great work over the last 4 years. Unfortunately, the dust hasn't settled on exactly what they do with DataFrames (here and linked issues in that thread), so I don't know what can be set in stone at this point. At the very least you can cover a bit about converting from pandas' extension types (mostly just Categoricals in this context) to NumPy arrays with get_dummies.
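A minimal sketch of that get_dummies hand-off (the toy frame is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [1.0, 2.5, 1.2],
    "size": pd.Categorical(["small", "large", "small"]),
})

# get_dummies expands the categorical into one indicator column per
# level, leaving a purely numeric frame that estimators can consume
X = pd.get_dummies(df, columns=["size"])
```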

References to Tidy Data are always popular :) so that might be worth mentioning. I've been meaning to make pd.melt more MultiIndex friendly, but haven't gotten around to it yet.
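A quick illustration of the Tidy Data reshaping that melt does (toy data, invented for illustration):

```python
import pandas as pd

wide = pd.DataFrame({
    "name": ["ann", "bob"],
    "2015": [1, 2],
    "2016": [3, 4],
})

# melt gathers the per-year columns into (name, year, count) rows:
# the "one row per observation" shape that Tidy Data argues for
tidy = pd.melt(wide, id_vars="name", var_name="year", value_name="count")
```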

The hardest concept I see when teaching pandas is effective use of Indexes. It's a hard concept to explain well. I don't have much to offer here, other than a hope that you attempt a better explanation than I can (not to say that the first edition didn't: it does emphasize the role of Indexes in slicing and reindexing/alignment).

Sorry about the wall of text, I hope some of it is useful :)

NewMountain commented 8 years ago

Hi Wes and thank you.

I would like to second the suggestion about Tom Augspurger's work. I read this post on NBViewer and it was like discovering pandas all over again. What I got out of the first Python for Data Analysis book was "here are a million different ways to solve some problems (which you may or may not have)". What I really wanted (and still want) is a single source laying out simple, user-friendly general strategies, idioms, and best practices. I've read several pandas books, and Tom's notebooks and blog posts are the first thing I have seen that feels like it answers that need.

Perhaps a closing chapter for efficient, ergonomic data work?

andportnoy commented 8 years ago

I second @kokes on this:

Some high performance tips, both within pandas (at/iat vs loc/iloc, working with .values, generally how to squeeze the most out of pure pandas) and using other tools (dask, blaze, numba, cython, pypy, ...). I know these tips are scattered around and that there's a whole chunk in the advanced numpy chapter, but a whole chapter dedicated to speedy processing might be worthwhile.

Also, please write about effective use of pivoting, stacking, unstacking, index-setting and other indexing-related operations.
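To illustrate the request, a small sketch of round-tripping between long and pivoted layouts with set_index/unstack/stack (toy data, invented for illustration):

```python
import pandas as pd

long_df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA"],
    "year": [2015, 2016, 2015, 2016],
    "pop": [8.4, 8.5, 3.9, 4.0],
})

# set_index + unstack pivots the long layout into a city-by-year table
table = long_df.set_index(["city", "year"])["pop"].unstack()

# stack is the inverse, back to one row per (city, year)
restacked = table.stack()
```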

Tagar commented 8 years ago

Examples of data wrangling / data munging / feature engineering using PySpark + pandas + etc. (I saw your reply to @vchollati on ibis, but I think Spark is more stable than it used to be).

naoyak commented 8 years ago

Sorry if this has been addressed, but Python 3?

justmarkham commented 8 years ago

The appendix of the 1st edition ("Python Language Essentials") is excellent, and I would propose keeping it in the 2nd edition. In fact, I tell my students who own the book to read the appendix first :)

KrOstir commented 8 years ago

This one is rather small, but probably many people are using Anaconda. I would therefore suggest adding Continuum Analytics' Anaconda in addition to Enthought Canopy.

DavidWright123 commented 8 years ago

Anaconda Distribution information would be helpful, as I prefer it to Enthought Canopy.

As I come from a statistics background, and I'm a newcomer to Pandas and Python (and to CS in general), it would be great to have an extra section on how Pandas works with SQLite or MySQL, and how people integrate Pandas into their workflow with databases and ETL processes.
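A minimal sketch of the pandas-plus-SQLite hand-off being asked about, using only the standard library's sqlite3 module and pandas.read_sql (the table and column names are invented):

```python
import sqlite3

import pandas as pd

# An in-memory database stands in for a real SQLite/MySQL instance
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("west", 250.0)])

# read_sql runs the query and hands the result set back as a DataFrame
df = pd.read_sql("SELECT region, amount FROM sales", con)
```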

While this may be (a bit) outside the scope of your book, I'm sure a lot of newcomers to Pandas would love some basic information or recommendations on how Pandas fits into a data analyst's typical workflow. If you can't fit it into the book, please point us in the right direction with an informative link or two.

-P.S. I'm currently on chapter 4 of your first edition, and I love it so far. So, thank you! And, please forgive me if you already include 'SQL-to-Pandas' workflow guidance later on in your 1st edition.

DavidWright123 commented 8 years ago

Wes, do you have a rough estimate of when the 2nd edition will be available for purchase?

wesm commented 8 years ago

@DavidWright123 I'll try to provide status updates over the rest of the year, but I believe it's going to be in the 1st quarter of 2017.

And it will definitely be Python 3.5-based (since Python 2.x will retire within the working lifetime of the book: http://pythonclock.org/) =)

Madfile commented 8 years ago

Programming with Python 3!

louridas commented 7 years ago

The 2012 Federal Election Commission Database will probably need updating.

Apart from the fact that the 2016 data are now available, I came across something strange while going through the existing example.

If I simply sum up the contributions to the two candidates at the end:

fec_mrbo.groupby('cand_nm').sum()

it appears that Mitt Romney raised more money than Barack Obama (679,994,900 vs. 558,359,100). These figures run contrary to expectations and to what has been reported in the media.

Trying to investigate, I found that a lot of Mitt Romney transactions are transfers from one Mitt Romney committee to another, so they are not net contributions.

Indeed, running:

transfers = fec[fec['memo_text'].str.contains("TRANSFER").fillna(False)]

mr_transfers = transfers[transfers['cand_nm'] == 'Romney, Mitt']
bo_transfers = transfers[transfers['cand_nm'] == 'Obama, Barack']
print("Romney -> Romney:", mr_transfers['contb_receipt_amt'].count(), mr_transfers['contb_receipt_amt'].sum())
print("Obama -> Obama:", bo_transfers['contb_receipt_amt'].count(), bo_transfers['contb_receipt_amt'].sum())

gets me:

Romney -> Romney: 644022 295380725.37
Obama -> Obama: 0 0

So if I do some cleaning:

fec = fec[~fec['memo_text'].str.contains("TRANSFER").fillna(False)]
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]
fec_mrbo.groupby('cand_nm').sum()

I get:

Obama, Barack    558359100
Romney, Mitt     384614200

which looks closer to what has been reported.

NB: I am not an American so I don't really know the campaign reporting rules. Just pointing out something that looks strange to me.

wesm commented 7 years ago

Thanks. I'll see if it's straightforward to update to the 2016 disclosure dataset.

vineethbabuR commented 7 years ago

Hi Wes,

Thanks for taking the effort to write a second edition.

Do you think it's possible for the code and the output to be provided as Jupyter notebooks?

Vineeth

dacoex commented 7 years ago

please expand the timeseries section:

guidorice commented 7 years ago

While I am very excited by the book content, working with ipython/jupyter can be very frustrating, especially with respect to visualizations. Please make the notebooks and examples work with anaconda.

btw, does anyone have any quickstart suggestions, or notes on how to use the notebooks with conda? The visualizations won't display.

I realize I may have to punt and try Enthought Python.

edit: nvm, got it working with anaconda, I think I was confused because I thought this was supposed to display a graph. But it does display it later with plot()

plt.figure(figsize=(10, 4))
Out[10]:
<matplotlib.figure.Figure at 0x10d58add0>

Here was my conda setup which worked:

conda create --name pfda python=2.7 numpy pandas scipy matplotlib chaco jupyter
source activate pfda
jupyter notebook ch02.ipynb

briandk commented 7 years ago

@guidorice: a few things.

  1. As far as I know, Anaconda comes standard with Jupyter Notebook.
  2. You should probably include code like this in your preamble, which will set up your notebook to display resolution-independent graphics inline. Feel free to omit the third section of import statements if you don't use those modules.
  3. In my experience, if you're using plt directly, you should conclude your plotting commands with plt.show() to display the plot. I believe pandas plot methods do so for you.
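Since the preamble snippet referred to in point 2 isn't reproduced above, here is a minimal sketch of the idea; the Agg backend line is only so the sketch also runs as a headless script, and in a notebook you would use the %matplotlib inline magic instead:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook, use %matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(np.arange(10))
plt.show()  # flushes the figure to screen when run as a plain script
```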
wesm commented 7 years ago

Try running %matplotlib inline in a notebook cell

pglezen commented 7 years ago

The hardest concept I see when teaching pandas is effective use of Indexes. It's a hard concept to explain well.

I also believe it is an underappreciated concept. It seems like an esoteric topic, but it's truly fundamental to getting anything non-trivial done. I like how the first edition treats Index objects as first-class citizens of the pandas world by giving them their own section at the same level as Series and DataFrame. I hope this won't diminish with all the new topics being considered.
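A tiny example of why the Index is so fundamental: arithmetic aligns by label, not by position (toy series, invented for illustration).

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=["a", "b"])
t = pd.Series([10.0, 20.0], index=["b", "c"])

# Arithmetic aligns on index labels, not positions: only "b" has a
# partner in both series, so "a" and "c" come back as NaN
result = s + t
```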

ssantic commented 7 years ago

Hey @wesm what I'd really, really recommend is that the code and plots in the book (at least the electronic formats) be in color, as opposed to just black and white. As much as I love the first edition (and the early release of the second so far), I've always found the fact that everything is just black and white really tiring to read - especially as opposed to most O'Reilly books.

TheGhostHuCodes commented 7 years ago

Hi @wesm, I'm reading along with the 2nd edition on SafariBooksOnline and have found some deprecation warnings, small wording issues, and typos. Should I open issues here or leave notes on the O'Reilly errata page?

wesm commented 7 years ago

The errata page is fine. Thanks!

krother commented 7 years ago

Hi Wes, thanks much for your great and useful book! I've done the German translation for O'Reilly and am using it all the time. The data examples are great, please keep it that way!

Yesterday I noticed that the Jupyter notebook for Ch10 contains a few deprecated function calls. Do you suggest I create a PR, or are they being overhauled anyway?

wesm commented 7 years ago

@krother yes, I'll be generating updated notebooks with refreshed code examples using up-to-date API calls. Vielen Dank für die Übersetzung! (Many thanks for the translation!)

valmunos commented 7 years ago

Hi Wes, one thing I always think is helpful when learning from a book is problem sets. I realize your page count is probably going to be pretty high, but I do think it would make it a better overall resource.

DavidWright123 commented 7 years ago

I agree, problem sets would be awesome and very helpful. As an idea, to cut down on pages you could include a page or two of problems at the end of most chapters (or at least the most applicable ones). And instead of putting lengthy explanations in an appendix, you could simply create a Jupyter notebook for those (heck, you could even finish many of the explanations after the book is released). I'm sure you know of options where digital keys let readers view or print such documents only if they bought the book.

Thanks Wes!

ssantic commented 7 years ago

@wesm The 2nd Edition is looking great so far. :) Do you have any idea when the new Early Release chapters (so, Chapter 10 onwards) will be available? Thanks!

wesm commented 7 years ago

There will hopefully be at least 3-4 more early release chapters coming out this month, with the rest of the first manuscript draft appearing not long thereafter.

wesm commented 7 years ago

Thanks all for the input! I hope you enjoy the 2nd edition when it ships in a few weeks.