Closed weilu closed 1 year ago
Thanks, @weilu. I'm listing my suggestions here; feel free to choose what to implement.
General comments:
Specific comments:
- [...] `array` library in section 2.1 or moving it to the optional content.
- [...] `.sort_values()`, you can include `.reset_index(drop=True)` in the result to show how to re-establish the index sequence if that's needed.
- [...] `make` is now the (named) index.
- `df[new_col] = 10`
- `df.loc[(df[col1]==1) & (df[col2]==0), col_to_replace] = new_value`
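To make the `.sort_values()` / `.reset_index(drop=True)` suggestion above concrete, here is a minimal sketch. The DataFrame and its values are made up for illustration (only the column name `make` appears in the comments above); it is not taken from the notebook.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"make": ["audi", "bmw", "fiat"], "price": [30, 40, 10]})

# Sorting keeps the original row labels, so the index is no longer in sequence.
sorted_df = df.sort_values("price")
print(sorted_df.index.tolist())

# reset_index(drop=True) re-establishes a 0..n-1 index and discards the old labels.
sorted_reset = df.sort_values("price").reset_index(drop=True)
print(sorted_reset.index.tolist())

# set_index turns the `make` column into the (named) index.
by_make = df.set_index("make")
print(by_make.index.name)
```

Without `drop=True`, `reset_index()` would keep the old labels as an extra column named `index`, which is usually not wanted after a sort.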
Also, I forgot to mention this: when I tried the notebook on my computer, the following popup kept appearing. In the end I used Colab to review it.
I guess it's because my Pandas installation was through conda instead of pip?
Additional TODOs for Databricks compatibility:
Addressing @luisesanmartin's comments:
General comments:
- [ ] I think Pandas should come first and NumPy later. Given the probable course audience, tabular representations of data will be easier to understand than higher-dimensional arrays. This also matters if the session takes more time than expected, in which case I'd say it's okay not to cover NumPy.
Valid concern regarding time, especially given our experience from the last run. I looked into doing this, but it requires a major rework of the session's narrative, as the pandas section contains many references to ndarray and NumPy. Instead of swapping the order, I further cut down the packages section and removed its exercise. I also plan not to execute every single cell of the NumPy section; instead I will talk through the simple ones and spend only a bit of time on universal functions, as they are often used in data wrangling in combination with pandas functions. If we still end up short on time, I will swap pandas and NumPy for the next run of the course.
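As a small illustration of why universal functions pair well with pandas, here is a hedged sketch. The column name `income` and the values are hypothetical and not from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"income": [100.0, 1000.0, 10000.0]})

# np.log10 is a universal function: it works element-wise, and when given
# a Series it returns a Series that keeps the original index.
df["log_income"] = np.log10(df["income"])

# Binary ufuncs such as np.maximum also apply element-wise;
# here every value below 500 is raised to 500.
df["income_floor"] = np.maximum(df["income"], 500.0)

print(df)
```

Because the result keeps the index, it can be assigned straight back as a new column without any explicit loop.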
The following suggestions are adopted:
Specific comments:
- [...] `array` library in section 2.1 or moving it to the optional content.
- [...] `.sort_values()`, you can include `.reset_index(drop=True)` in the result to show how to re-establish the index sequence if that's needed.
- [...] `make` is now the (named) index.
- Might be a bit of a stretch because you're already covering a lot, but based on the common data wrangling operations we see in DIME I'd suggest considering adding the following:
  - [ ] create a new column with the same value, as in `df[new_col] = 10`
  - [ ] replace a column value based on condition, as in `df.loc[(df[col1]==1) & (df[col2]==0), col_to_replace] = new_value`
  - [ ] append two dataframes
- [ ] Related to this ^, perhaps you could move some Pandas content to the optional materials for the sake of time. Top of my list would be crosstab, transposing, and converting data types
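For reference, the three suggested operations can be sketched as follows. The column names and values are placeholders invented for this example, and `pd.concat` is used for the append step on the assumption that the course targets a recent pandas (`DataFrame.append` was removed in pandas 2.0).

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"col1": [1, 1, 0], "col2": [0, 1, 0], "score": [5, 6, 7]})

# 1. New column with the same value in every row (the scalar is broadcast).
df["new_col"] = 10

# 2. Replace a value only in rows where both conditions hold.
#    Each condition needs its own parentheses because & binds tighter than ==.
df.loc[(df["col1"] == 1) & (df["col2"] == 0), "score"] = -1

# 3. Append two dataframes row-wise; ignore_index=True renumbers the rows.
other = pd.DataFrame({"col1": [0], "col2": [1], "score": [8], "new_col": [10]})
combined = pd.concat([df, other], ignore_index=True)

print(combined)
```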
I will cover only `pivot_table` and skip `crosstab` and `groupby` in this run of the course. If there's time, I will come back to those two. If there's even more time, I will cover the rest of the suggestions here. DataFrame concatenation is covered in the bonus material.
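To show how the three functions relate, here is a minimal sketch on a made-up table; the column names `region`, `year`, and `sales` are hypothetical and not from the notebook.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [1, 2, 3, 4],
})

# pivot_table: one row per region, one column per year, summed sales in the cells.
pivot = pd.pivot_table(df, index="region", columns="year", values="sales", aggfunc="sum")

# groupby: the same totals in long form, indexed by (region, year).
grouped = df.groupby(["region", "year"])["sales"].sum()

# crosstab: by default it counts how often each (region, year) pair occurs.
counts = pd.crosstab(df["region"], df["year"])

print(pivot)
print(grouped)
print(counts)
```

Since `pivot_table` and `groupby` produce the same aggregates in different shapes, covering only `pivot_table` first still leaves a natural bridge to the other two later.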
@luisesanmartin On the notebook validation error, I removed the pandas package upgrade/pinning cells. Try it and let me know if it's still an issue.
Further cut down numpy content and combined its two exercises into one. Bonus and solutions updated accordingly.
Notebooks here for ease of review: https://colab.research.google.com/github/worldbank/dec-python-course/blob/update-session-3/1-foundations/3-numpy-and-pandas/foundations-s3.ipynb
Solutions: https://colab.research.google.com/github/worldbank/dec-python-course/blob/update-session-3/1-foundations/3-numpy-and-pandas/solutions-s3.ipynb
Bonus: https://colab.research.google.com/github/worldbank/dec-python-course/blob/update-session-3/1-foundations/3-numpy-and-pandas/foundations-s3-bonus.ipynb
Session Timing
- Start: 9:30
- Databricks access & Content Overview [10 min]
- Python libraries [10 min]
  - Instruction: 10 minutes
  - Time check: 9:50am
- NumPy [30 min]
  - Instruction: 15 minutes
  - 2.4. Exercises: vector dot product: 15 minutes
    - Wei: 2 minutes intro
    - Breakout rooms: 13 minutes (10 minutes for try-out, 3 minutes for demo)
  - Time check: 10:30am
- Pandas [100 min]
  - Instruction: 30 minutes (actual: 11:20)
  - 3.2.5. Exercises: read and explore excel data: 30 min
    - Breakout rooms: 30 minutes (20 minutes for try-out, 10 minutes for demo)
  - Time check: 10:50am (actual: 11:30)
  - Instruction: 40 minutes
  - Time check: 11:30am (actual: 12:00)
  - 3.3.6. Exercises: improve merging with index difference and outer join: homework
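As context for the merging homework, here is a hedged sketch of how an outer join behaves when the two indexes differ. The frames, the index name `country`, and the columns `gdp` and `pop` are invented for illustration and are not the exercise data or its solution.

```python
import pandas as pd

# Two toy frames whose indexes only partly overlap (invented data).
left = pd.DataFrame({"gdp": [1.0, 2.0]}, index=pd.Index(["A", "B"], name="country"))
right = pd.DataFrame({"pop": [10, 30]}, index=pd.Index(["B", "C"], name="country"))

# The default inner join silently drops rows whose index is missing on either side.
inner = left.merge(right, left_index=True, right_index=True)

# An outer join keeps every index value; indicator=True adds a `_merge`
# column showing which side each row came from.
outer = left.merge(right, left_index=True, right_index=True, how="outer", indicator=True)

print(inner)
print(outer)
```

Rows present on only one side get `NaN` in the other side's columns, so the `_merge` column is a quick way to spot index mismatches before cleaning them up.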