Update session 3

weilu commented 1 year ago

Further cut down numpy content and combined its two exercises into one. Bonus and solutions updated accordingly.

Notebooks here for ease of review: https://colab.research.google.com/github/worldbank/dec-python-course/blob/update-session-3/1-foundations/3-numpy-and-pandas/foundations-s3.ipynb

Solutions: https://colab.research.google.com/github/worldbank/dec-python-course/blob/update-session-3/1-foundations/3-numpy-and-pandas/solutions-s3.ipynb

Bonus: https://colab.research.google.com/github/worldbank/dec-python-course/blob/update-session-3/1-foundations/3-numpy-and-pandas/foundations-s3-bonus.ipynb

Session Timing

Start: 9:30

- Databricks access & Content Overview [10 min]
- Python libraries [10 min]
  - Instruction: 10 minutes
  - Time check: 9:50am
- NumPy [30 min]
  - Instruction: 15 minutes
  - 2.4. Exercises: vector dot product: 15 minutes
    - Wei: 2 minutes intro
    - Breakout rooms: 13 minutes (10 minutes for try-out, 3 minutes for demo)
  - Time check: 10:30am
- Pandas [100 min]
  - Instruction: 30 minutes (actual: 11:20)
  - 3.2.5. Exercises: read and explore Excel data: 30 minutes
    - Breakout rooms: 30 minutes (20 minutes for try-out, 10 minutes for demo)
  - Time check: 10:50am (actual: 11:30)
  - Instruction: 40 minutes
  - Time check: 11:30am (actual: 12:00)
  - 3.3.6. Exercises: improve merging with index difference and outer join: homework

luisesanmartin commented 1 year ago

Thanks, @weilu. I'm listing my suggestions here; feel free to choose what to implement or not.

General comments:

I think Pandas should come first and NumPy later. Thinking about the probable course audience, tabular representations of data will be easier to understand than higher-dimensional arrays. This can also be important in case the session takes more time than expected, in which case I'd say it's okay not to cover NumPy
I suggest deleting the outputs of cells from the "vanilla" version of the notebooks so participants will discover it for themselves and it will be clearer for them which cells they haven't run yet

Specific comments:

Consider dropping the mention of the array library in section 2.1 or moving it to the optional content
In section 3.2.4, mention that almost all of the outputs of those Pandas methods are also Pandas objects (Series or DFs), so that other operations can also be applied on them
In the example of .sort_values(), you can include .reset_index(drop=True) in the result to show how to re-establish the index sequence if that's needed
Pivot table: I suggest clarifying that, in the resulting df, the make column is now the (named) index
Might be a bit of a stretch because you're already covering a lot, but based on the common data wrangling operations we see in DIME I'd suggest considering adding the following:
- create a new column with the same value, as in df[new_col] = 10
- replace a column value based on condition, as in df.loc[(df[col1]==1) & (df[col2]==0), col_to_replace] = new_value
- append two dataframes
Related to this ^, perhaps you could move some Pandas contents to the optional materials for the sake of time. Top of my list would be crosstab, transposing, and converting data types
In additional resources, "Bash Commands for Data Scientists by Giorgos Myrianthous" has a broken link
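For reference, the three wrangling operations suggested above could look roughly like the sketch below. The dataframe and its column names (col1, col2, score, new_col) are placeholders invented for illustration and are not taken from the course notebook. Appending is shown with pd.concat, since DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical data; column names are placeholders, not from the notebook.
df = pd.DataFrame({"col1": [1, 1, 0], "col2": [0, 1, 0], "score": [5, 6, 7]})

# 1. Create a new column holding the same value in every row.
df["new_col"] = 10

# 2. Replace a column value only where a condition holds.
#    Each comparison needs its own parentheses because & binds tighter than ==.
mask = (df["col1"] == 1) & (df["col2"] == 0)
df.loc[mask, "score"] = -1

# 3. Append one dataframe to another row-wise.
#    ignore_index=True renumbers the rows of the result from 0.
other = pd.DataFrame({"col1": [0], "col2": [1], "score": [8], "new_col": [10]})
combined = pd.concat([df, other], ignore_index=True)

print(combined)
```

Using .loc with a boolean mask assigns in place on the original dataframe, which avoids the chained-assignment pattern df[mask]["score"] = ... that may silently modify a copy.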

luisesanmartin commented 1 year ago

Also, I forgot to mention this: when I tried the notebook on my computer, the following popup kept appearing. In the end I used Colab to review it.

I guess it's because my Pandas installation was through conda instead of pip?

weilu commented 1 year ago

Additional TODOs for Databricks compatibility:

[x] Check for & fix any broken images in Databricks
[x] Fix LaTeX usage in Databricks
[x] Update library section to avoid compatibility issues
[x] Remove explicit reference to Colab

weilu commented 1 year ago

Addressing @luisesanmartin's comments:

General comments:

[ ] I think Pandas should come first and NumPy later. Thinking about the probable course audience, tabular representations of data will be easier to understand than higher-dimensional arrays. This can also be important in case the session takes more time than expected, in which case I'd say it's okay not to cover NumPy

Valid concern regarding time, especially given our experience from the last run. I looked into doing this, but it would require a major rework of the session's narrative, as the pandas section contains many references to ndarray and NumPy. Instead of swapping the order, I further cut down the packages section and removed its exercise. I also plan not to execute every cell of the NumPy section, but to talk through the simple ones and spend a bit of time only on universal functions, as they are often used in data wrangling in combination with pandas functions. If we still end up short on time, I will swap pandas and NumPy for the next run of the course.
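To illustrate why universal functions are worth the time, here is a minimal sketch of NumPy ufuncs applied directly to pandas columns. The data and column names (area, change) are made up for illustration and do not come from the session notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are placeholders, not from the notebook.
df = pd.DataFrame({"area": [1.0, 4.0, 9.0], "change": [-2.0, 0.5, 3.0]})

# A unary ufunc applied to a Series runs element-wise and returns a Series
# that keeps the original index, so it can be assigned straight back.
df["side"] = np.sqrt(df["area"])

# A binary ufunc works the same way; here negative values are floored at 0.
df["change_floored"] = np.maximum(df["change"], 0)

print(df)
```

Because the result is still a pandas Series, it can be combined with other pandas methods without converting back and forth between ndarray and dataframe.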

The following suggestions are adopted:

[x] I suggest deleting the outputs of cells from the "vanilla" version of the notebooks so participants will discover it for themselves and it will be clearer for them which cells they haven't run yet

Specific comments:

[x] Consider dropping the mention of the array library in section 2.1 or moving it to the optional content
[x] In section 3.2.4, mention that almost all of the outputs of those Pandas methods are also Pandas objects (Series or DFs), so that other operations can also be applied on them
[x] In the example of .sort_values(), you can include .reset_index(drop=True) in the result to show how to re-establish the index sequence if that's needed
[x] Pivot table: I suggest clarifying that, in the resulting df, the make column is now the (named) index
[x] In additional resources, "Bash Commands for Data Scientists by Giorgos Myrianthous" has a broken link
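A rough sketch of the .sort_values() and pivot table points, assuming a small cars-style dataframe. Only the make column is mentioned in this thread; the price column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data; "price" and the values are invented for illustration.
cars = pd.DataFrame({
    "make": ["toyota", "audi", "toyota", "audi"],
    "price": [9000, 17000, 11000, 15000],
})

# sort_values returns a new DataFrame whose rows keep their original labels,
# so the index is no longer in sequence after sorting.
sorted_cars = cars.sort_values("price")
print(sorted_cars.index.tolist())

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
renumbered = sorted_cars.reset_index(drop=True)
print(renumbered.index.tolist())

# In a pivot table, the column passed as index= becomes the (named) index.
pivot = cars.pivot_table(values="price", index="make", aggfunc="mean")
print(pivot.index.name)

# reset_index() without drop=True turns that index back into a regular column.
flat = pivot.reset_index()
print(flat.columns.tolist())
```

Each step returns a pandas object, which is why the calls can be chained, as in cars.sort_values("price").reset_index(drop=True).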

Might be a bit of a stretch because you're already covering a lot, but based on the common data wrangling operations we see in DIME I'd suggest considering adding the following:

[ ] create a new column with the same value, as in df[new_col] = 10

[ ] replace a column value based on condition, as in df.loc[(df[col1]==1) & (df[col2]==0), col_to_replace] = new_value

[ ] append two dataframes

[ ] Related to this ^, perhaps you could move some Pandas contents to the optional materials for the sake of time. Top of my list would be crosstab, transposing, and converting data types

I will cover only pivot_table and skip crosstab and groupby in this run of the course. If there's time, I will come back to those two. If there's even more time, I will cover the rest of the suggestions here. Dataframe concatenation is covered in the bonus material.
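For context on how the three relate, here is a minimal sketch on made-up data. The fuel and price columns are hypothetical and not taken from the notebook. It assumes a simple mean aggregation, for which pivot_table and groupby give the same numbers, while crosstab counts combinations of two categorical columns.

```python
import pandas as pd

# Hypothetical data; "fuel", "price", and the values are invented.
cars = pd.DataFrame({
    "make": ["toyota", "audi", "toyota", "audi"],
    "fuel": ["gas", "diesel", "gas", "gas"],
    "price": [9000, 17000, 11000, 15000],
})

# Mean price per make, written as a pivot table...
by_pivot = cars.pivot_table(values="price", index="make", aggfunc="mean")

# ...and the same aggregation written with groupby.
by_group = cars.groupby("make")[["price"]].mean()

# crosstab tabulates how often each (make, fuel) pair occurs.
counts = pd.crosstab(cars["make"], cars["fuel"])

print(by_pivot)
print(by_group)
print(counts)
```

Covering pivot_table first therefore still exposes participants to the core idea of grouped aggregation that groupby generalizes.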

weilu commented 1 year ago

@luisesanmartin On the notebook validation error, I removed the pandas package upgrade/pinning cells. Try it and let me know if it's still an issue.

worldbank / dec-python-course

Update session 3 #38
