wesm closed this issue 7 years ago
Adding content about dask, distributed, ibis would be awesome. :P
@vchollati with 4 years of history in the mirror, I'm pretty hesitant to write about projects that are still under active development -- we did a major edit of the 1st edition to fix pandas API breakage, so my rule of thumb will be having code examples that I feel confident will still work 2 years from now.
Since the 1st edition got translated into at least 5 other languages (one or two more may be in the works), this stability is extra important as fixes in the primary English edition may take a lot longer to percolate to the translations.
Adding practical analysis theory and ibis would be even better.
A request! How about a bit on creating a simple file-handler for web forms, and outputting into live graphing?
Minor point, but I assume the section on IPython will be complemented with a section on Jupyter.
Speaking as someone who only skimmed the first edition (eagerly awaiting the second edition to purchase in multiple copies as reference material at work).
Some high performance tips, both within pandas (at/iat vs loc/iloc, working with .values, generally how to squeeze the most out of pure pandas) and using other tools (dask, blaze, numba, cython, pypy, ...). I know these tips are scattered around and that there's a whole chunk in the advanced numpy chapter, but a whole chapter dedicated to speedy processing might be worthwhile.
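To make the at/iat vs loc/iloc point concrete, here is a minimal sketch (toy DataFrame, made-up column names) of the scalar fast paths and of dropping to the underlying NumPy array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": np.arange(1000.0)})

# .loc/.iloc accept labels, slices, and boolean masks; .at/.iat are
# scalar-only fast paths that skip that flexibility for speed.
x = df.loc[10, "a"]   # general label-based access
y = df.at[10, "a"]    # scalar-only fast path, cheaper in tight loops

# Dropping to the underlying NumPy array removes per-call pandas
# overhead when only raw numerics are needed.
total = df["b"].values.sum()
```

Both accessors return the same value; the difference only matters when you do millions of scalar lookups.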
Regarding the comments above - the whole feather/ibis/arrow matter, while still under active development, would probably deserve a mention (not code, just to mention it), so that readers know what to anticipate. And if reading years after the book was published, they can look up relevant code.
Thanks for your work on all this.
I teach a data science/computational modeling course for undergraduates. We assume no prior programming experience, and I'm thinking of using Python/pandas. So, this may not be the book for us and my requests may not work 😇
Lastly, thank you so much for making pandas!
I bought the first edition and I really enjoyed it: I read it in a linear way to learn pandas and now I'm using it as a reference book. I like it because it's very well organized. In the next edition, I would add some topics that come to mind:

- assign and pipe, and how to use them effectively
- .ix, .iloc, .loc etc. I still get tripped up from time to time here.
- query and when it should be used or not used

Anyway, great stuff and I really look forward to the new book!
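A quick sketch of the topics requested above, on a toy frame (the function name with_double is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}, index=["a", "b", "c"])

# .loc is label-based and .iloc is position-based; .ix, which guessed
# between the two, was deprecated and later removed.
assert df.loc["b", "x"] == df.iloc[1, 0] == 2

# assign adds derived columns without mutating df; pipe threads the
# frame through plain functions so a chain reads top to bottom.
def with_double(frame):
    return frame.assign(y2=frame["y"] * 2)

result = df.pipe(with_double).query("x >= 2")
```

query takes a boolean expression as a string, which keeps long filters readable inside a chain.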
Hi Wes, I'd love to read a chapter about Reproducible Research. How to make the analysis reproducible, how to use IPython notebooks efficiently, how to make a good team work. Thanks for asking ^^
I've been unpacking the recent Tom Augspurger posts and finding those very insightful. Would love to see strategies for unit testing scripts, especially with pipe/assign chains.
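One possible shape for that: write each pipe step as a plain function so it can be tested on a tiny fixture with pandas' own testing helpers (add_total and its columns are hypothetical):

```python
import pandas as pd
import pandas.testing as tm

# A hypothetical cleaning step written as a plain function, so it can
# be unit tested in isolation and reused inside a .pipe chain.
def add_total(df):
    return df.assign(total=df["price"] * df["qty"])

def test_add_total():
    raw = pd.DataFrame({"price": [2.0, 3.0], "qty": [1, 4]})
    expected = raw.assign(total=[2.0, 12.0])
    tm.assert_frame_equal(add_total(raw), expected)

test_add_total()
```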
Excited that the 2nd Edition is happening :)
The layout of the first edition works well. Starting with the high-level overview of "Why Python" is great (and after these 4 years, you have even more evidence for why Python is a good choice). I really enjoyed the introductory examples to show the capabilities, before delving into sections of the APIs.
I hope you keep the section on NumPy, at least the parts on ufuncs and broadcasting.
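Those two topics repay a careful treatment; a minimal illustration of both:

```python
import numpy as np

# Broadcasting: a (3, 1) column against a (4,) row yields a (3, 4)
# grid with no explicit loops or tiling.
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3, 4])
grid = col + row

# ufuncs such as np.add apply elementwise at C speed and come with
# reductions like .reduce and .accumulate for free.
row_sum = np.add.reduce(row)
```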
I wonder if xarray is worth a section as well.
Plotting and visualization is a bit tricky. I think the pandas .plot API is pretty much settled aside from GroupBy.plot, which is discussed here. Seaborn deserves a mention, Bokeh as well probably. Perhaps even JavaScript-based tools like d3, inside the notebook.
I'm curious to see what you do (if anything) with the Timeseries and Financial Applications sections. Pandas is great at timeseries, and the additions of TimeGroupers and the groupby-like resample, rolling, and expanding APIs only add to that. I'm guessing part of the reason they got prominence in the first edition was because of your background. I'm looking forward to what your experiences in big-data land have taught you about interacting with it from python / pandas.
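The three window APIs mentioned can be sketched on a toy daily series (values are made up):

```python
import numpy as np
import pandas as pd

# Toy daily series to illustrate the three window APIs.
idx = pd.date_range("2017-01-01", periods=10, freq="D")
s = pd.Series(np.arange(10.0), index=idx)

weekly = s.resample("W").sum()       # groupby-like downsampling
smooth = s.rolling(window=3).mean()  # fixed-width moving window
running = s.expanding().sum()        # everything up to each point
```

resample groups by calendar period, while rolling and expanding slide over observations, which is why they compose so naturally with groupby.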
The biggest omission these days is probably a section on interfacing with scikit-learn. They've done some great work over the last 4 years. Unfortunately, the dust hasn't settled on exactly what they do with DataFrames (here and linked issues in that thread), so I don't know what can be set in stone at this point. At the very least you can cover a bit about converting from pandas' extension types (mostly just Categoricals in this context) to NumPy arrays with get_dummies.
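The get_dummies conversion mentioned, on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"color": pd.Categorical(["red", "blue", "red"]),
                   "size": [1, 2, 3]})

# get_dummies expands the Categorical into one indicator column per
# category, leaving a purely numeric frame scikit-learn can consume.
numeric = pd.get_dummies(df, columns=["color"])
```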
References to Tidy Data are always popular :) so that might be worth mentioning. I've been meaning to make pd.melt more MultiIndex friendly, but haven't gotten around to it yet.
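The tidy-data reshape via pd.melt, in miniature (column names invented for the example):

```python
import pandas as pd

# Wide "messy" table: one row per subject, one column per year.
wide = pd.DataFrame({"subject": ["a", "b"], "2015": [1, 2], "2016": [3, 4]})

# melt reshapes it to tidy/long form: one observation per row.
tidy = pd.melt(wide, id_vars="subject", var_name="year", value_name="score")
```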
The hardest concept I see when teaching pandas is effective use of Indexes. It's a hard concept to explain well. I don't have much to offer here, other than a hope that you attempt a better explanation than I can (not to say that the first edition didn't: it does emphasize the role of Indexes in slicing and reindexing/alignment).
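The alignment behavior that makes Indexes so central can be shown in a few lines:

```python
import pandas as pd

# Arithmetic aligns on index labels, not positions: labels missing
# from either side yield NaN instead of silently misaligned values.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
total = a + b
```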
Sorry about the wall of text, I hope some of it is useful :)
Hi Wes and thank you.
I would like to second the suggestion about Tom Augspurger's work. I read this post on NBViewer and it was like discovering pandas all over again. What I got out of the first Python for Data Analysis book was "here are a million different ways to solve some problems (which you may or may not have)". What I really wanted (and still want) is a single source laying out simple, user-friendly general strategies, idioms and best practices. I've read several pandas books and Tom's notebooks and blog posts are the first thing I have seen that feels like it's answering that need.
Perhaps a closing chapter for efficient, ergonomic data work?
I second @kokes on this:
Some high performance tips, both within pandas (at/iat vs loc/iloc, working with .values, generally how to squeeze the most out of pure pandas) and using other tools (dask, blaze, numba, cython, pypy, ...). I know these tips are scattered around and that there's a whole chunk in the advanced numpy chapter, but a whole chapter dedicated to speedy processing might be worthwhile.
Also, please write about effective use of pivoting, stacking, unstacking, index-setting and other indexing-related operations.
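A compact sketch of the pivot/stack round trip on an invented city-by-year table:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "LA", "LA"],
                   "year": [2015, 2016, 2015, 2016],
                   "pop": [8.4, 8.5, 3.9, 4.0]})

# set_index + unstack pivots the long data into a city-by-year table;
# stack is its inverse.
table = df.set_index(["city", "year"])["pop"].unstack("year")
long_again = table.stack()
```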
Examples of data wrangling / data munging / feature engineering using PySpark + pandas + etc. (I saw your reply to @vchollati on ibis, but I think Spark is more stable than it used to be).
Sorry if this has been addressed, but Python 3?
The appendix of the 1st edition ("Python Language Essentials") is excellent, and I would propose keeping it in the 2nd edition. In fact, I tell my students who own the book to read the appendix first :)
This one is rather small, but probably many people are using Anaconda. I would therefore suggest to add Continuum Analytics Anaconda in addition to Enthought Canopy.
Anaconda Distribution information would be helpful, as I prefer it to Enthrought Canopy.
As I come from a statistics background, and I'm a newcomer to Pandas and Python (and to CS in general), it would be great to have an extra section on how Pandas works with SQLite or MySQL, and how people integrate Pandas into their workflow with databases and ETL processes.
While this may be (a bit) outside the scope of your book, I'm sure a lot of newcomers to Pandas would love some basic information or recommendations on how Pandas fits into a data analyst's typical workflow. If you can't fit it into the book, please point us in the right direction with an informative link or two.
P.S. I'm currently on chapter 4 of your first edition, and I love it so far. So, thank you! And, please forgive me if you already include 'SQL-to-Pandas' workflow guidance later on in your 1st edition.
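For what it's worth, the basic SQL-to-pandas handoff is short; a sketch using an in-memory SQLite database (table and column names invented for the example):

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a real warehouse table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# read_sql pulls a query result straight into a DataFrame.
df = pd.read_sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", con)
con.close()
```

From there the usual pattern is to let the database do the heavy aggregation and use pandas for the last-mile reshaping and analysis.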
Wes, do you have a rough estimate of when the 2nd edition will be available for purchase?
@DavidWright123 I'll try to provide status updates over the rest of the year, but I believe it's going to be in the 1st quarter of 2017.
And it will definitely be Python 3.5-based (since Python 2.x will retire within the working lifetime of the book: http://pythonclock.org/) =)
Programming with Python 3!
The 2012 Federal Election Commission Database will probably need updating.
Apart from the fact that the 2016 data are now available, I have come across something strange while going through the existing example.
If I simply sum up the contributions to the two candidates at the end:
fec_mrbo.groupby('cand_nm').sum()
it appears that Mitt Romney raised more money than Barack Obama (679,994,900 vs. 558,359,100). These figures run contrary to expectations and to what has been reported in the media.
Trying to investigate, I found that a lot of Mitt Romney transactions are transfers from one Mitt Romney committee to another, so they are not net contributions.
Indeed, running:
transfers = fec[fec['memo_text'].str.contains("TRANSFER").fillna(False)]
mr_transfers = transfers[transfers['cand_nm'] == 'Romney, Mitt']
bo_transfers = transfers[transfers['cand_nm'] == 'Obama, Barack']
print("Romney -> Romney:", mr_transfers['contb_receipt_amt'].count(), mr_transfers['contb_receipt_amt'].sum())
print("Obama -> Obama:", bo_transfers['contb_receipt_amt'].count(), bo_transfers['contb_receipt_amt'].sum())
gets me:
Romney -> Romney: 644022 295380725.37
Obama -> Obama: 0 0
So if I do some cleaning:
fec = fec[~fec['memo_text'].str.contains("TRANSFER").fillna(False)]
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]
fec_mrbo.groupby('cand_nm').sum()
I get:
Obama, Barack    558359100
Romney, Mitt     384614200
which looks closer to what has been reported.
NB: I am not an American so I don't really know the campaign reporting rules. Just pointing out something that looks strange to me.
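The filtering step described above can be reproduced on a toy frame (column names follow the FEC example; the amounts are made up). Here str.contains's na=False argument is used as an equivalent to the .fillna(False) idiom above:

```python
import pandas as pd

# Toy frame mimicking the FEC columns used above (amounts are made up).
fec = pd.DataFrame({
    "cand_nm": ["Romney, Mitt", "Romney, Mitt", "Obama, Barack"],
    "memo_text": ["TRANSFER TO X", None, None],
    "contb_receipt_amt": [100.0, 50.0, 75.0],
})

# Drop intra-campaign transfers; na=False treats a missing memo_text
# as "no match" so those rows are kept.
clean = fec[~fec["memo_text"].str.contains("TRANSFER", na=False)]
totals = clean.groupby("cand_nm")["contb_receipt_amt"].sum()
```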
Thanks. I'll see if it's straightforward to update to the 2016 disclosure dataset
Hi Wes,
Thanks for taking the effort to write a second edition.
Do you think it's possible for the code and the output to be provided as Jupyter notebooks?
Vineeth
Please expand the timeseries section.
While I am very excited by the book content, working with IPython/Jupyter can be very frustrating, especially with respect to visualizations. Please make the notebooks and examples work with Anaconda.
btw, does anyone have any quickstart suggestions, or notes on how to use the notebooks with conda? The visualizations won't display.
I realize I may have to punt and try Enthought Python.
edit: nvm, got it working with Anaconda. I think I was confused because I thought this was supposed to display a graph, but it does display it later with plot():
plt.figure(figsize=(10, 4))
Out[10]:
<matplotlib.figure.Figure at 0x10d58add0>
Here was my conda setup which worked:
conda create --name pfda python=2.7 numpy pandas scipy matplotlib chaco jupyter
source activate pfda
jupyter notebook ch02.ipynb
@guidorice: a few things. Use plt.show() to display the plot (I believe pandas plot methods do so for you), and try running %matplotlib inline in a notebook cell.
The hardest concept I see when teaching pandas is effective use of Indexes. It's a hard concept to explain well.
I also believe it is an underappreciated concept. It seems like an esoteric topic but it's truly fundamental to getting anything non-trivial done. I like how the first edition treats index objects as first-class citizens of the pandas community by giving them their own section at the same level as Series and DataFrame. I hope this won't diminish with all the new topics being considered.
Hey @wesm what I'd really, really recommend is that the code and plots in the book (at least the electronic formats) be in color, as opposed to just black and white. As much as I love the first edition (and the early release of the second so far), I've always found the fact that everything is just black and white really tiring to read - especially as opposed to most O'Reilly books.
Hi @wesm, I'm reading along with the 2nd edition on SafariBooksOnline and have found some deprecation warnings, small wording issues, and typos. Should I open issues here or leave notes on the O'Reilly errata page?
The errata page is fine. Thanks!
Hi Wes, thanks much for your great and useful book! I've done the German translation for O'Reilly and am using it all the time. The data examples are great, please keep it that way!
Yesterday I noticed that the Jupyter notebook for Ch10 contains a few deprecated function calls. Do you suggest I create a PR, or are they being overhauled anyway?
@krother yes, I'll be generating updated notebooks with refreshed code examples using up-to-date API calls. Vielen Dank für die Übersetzung! (Thank you very much for the translation!)
Hi Wes, one thing I always think is helpful when learning from a book is problem sets. I realize your page count is probably going to be pretty high, but I do think it would make it a better overall resource.
I agree, problem sets would be awesome and very helpful. As an idea, in order to cut down on pages you could include a page or two of problems at the end of most chapters (or at least the most applicable chapters). And, instead of having lengthy explanations in an appendix, you could simply create a Jupyter Notebook for those (heck, you could even finish many of the explanations after the book is already released). I'm sure you know of options where we could use digital keys to view or print such documents if and only if we bought your book.
Thanks Wes!
@wesm The 2nd Edition is looking great so far. :) Do you have any idea when the new Early Release chapters (so, Chapter 10 onwards) will be available? Thanks!
There will hopefully be at least 3-4 more early release chapters coming out this month, with the rest of the first manuscript draft appearing not long thereafter.
Thanks all for the input! I hope you enjoy the 2nd edition when it ships in a few weeks.
I've started working on the revised 2nd Edition of Python for Data Analysis. The agenda / table of contents is not set in stone, though!
Any comments on the existing content or requests for new content would be welcome here. I can't make any promises, but since I know how useful the book has been for many people the last 3.5 years, I would like to make sure the 2nd edition is just as useful (if not more so!) in the following 3.5 years (which will put us all the way to 2020, if you can believe it).
Thank you all in advance for the support.