worldbank / dec-python-course


Update session 3 #38

Closed weilu closed 1 year ago

weilu commented 1 year ago

Further cut down numpy content and combined its two exercises into one. Bonus and solutions updated accordingly.

Notebooks here for ease of review: https://colab.research.google.com/github/worldbank/dec-python-course/blob/update-session-3/1-foundations/3-numpy-and-pandas/foundations-s3.ipynb

Solutions: https://colab.research.google.com/github/worldbank/dec-python-course/blob/update-session-3/1-foundations/3-numpy-and-pandas/solutions-s3.ipynb

Bonus: https://colab.research.google.com/github/worldbank/dec-python-course/blob/update-session-3/1-foundations/3-numpy-and-pandas/foundations-s3-bonus.ipynb

Session Timing

Start: 9:30

Databricks access & Content Overview [10 min]

Python libraries [10 min]
  • Instruction: 10 minutes
  • Time check: 9:50am

NumPy [30 min]
  • Instruction: 15 minutes
  • 2.4. Exercises: vector dot product: 15 minutes
    • Wei: 2 minutes intro
    • breakout rooms: 13 minutes (10 minutes for try-out, 3 minutes for demo)
  • Time check: 10:30am

Pandas [100 min]
  • Instruction: 30 minutes (actual: 11:20)
  • 3.2.5. Exercises: read and explore excel data: 30 min
    • breakout rooms: 30 minutes (20 minutes for try-out, 10 minutes for demo)
  • Time check: 10:50am (actual: 11:30)
  • Instruction: 40 minutes
  • Time check: 11:30am (actual: 12:00)
  • 3.3.6. Exercises: improve merging with index difference and outer join: homework

luisesanmartin commented 1 year ago

Thanks, @weilu. I'm listing my suggestions here; feel free to choose which ones to implement.

General comments:

Specific comments:

luisesanmartin commented 1 year ago

Also, I forgot to mention this: when I tried the notebook on my computer, the following popup kept appearing. In the end I used Colab to review it. [image]

I guess it's because my Pandas installation was through conda instead of pip?

weilu commented 1 year ago

Additional TODOs for databricks compatibility:

weilu commented 1 year ago

Addressing @luisesanmartin's comments:

General comments:

  • [ ] I think Pandas should come first and NumPy later. Thinking about the probable course audience, familiarity with tabular representations of data will be easier to understand than higher-dimensional arrays. This can also be important in case the session takes more time than expected, in which case I'd say it's okay not to cover NumPy

Valid concern regarding time, especially given our experience from the last run. I looked into doing this, but it would require a major rework of this session's narrative, as there are many references to ndarray and numpy in the pandas section. Instead of swapping the order, I further cut down the packages section and removed the exercise. I also plan not to execute every single cell of the numpy section, but instead to talk through the simple ones and spend only a bit of time on universal functions, as they are often used in data wrangling in combination with pandas functions. If we still end up short on time, I will swap pandas and numpy for the next run of the course.
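For readers following the thread, here is a minimal sketch of what "universal functions in combination with pandas" can look like. The DataFrame and its column names are hypothetical and are not taken from the course notebooks; only `np.log10` and `np.maximum` are used, both of which are NumPy universal functions.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only (not from the course notebooks).
df = pd.DataFrame({"country": ["A", "B", "C"],
                   "gdp": [100.0, 1000.0, 10000.0]})

# A unary ufunc applied to a Series works element-wise and
# returns a Series that keeps the original index.
df["log_gdp"] = np.log10(df["gdp"])

# A binary ufunc: element-wise maximum of each value and a scalar floor.
df["gdp_floor"] = np.maximum(df["gdp"], 500.0)

print(df)
```

Because a ufunc operates on whole columns at once, no explicit Python loop over rows is needed.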

The following suggestions are adopted:

Specific comments:

  • Might be a bit of a stretch because you're already covering a lot, but based on the common data wrangling operations we see in DIME I'd suggest considering adding the following:
    • [ ] create a new column with the same value, as in df[new_col] = 10
    • [ ] replace a column value based on condition, as in df.loc[(df[col1]==1) & (df[col2]==0), col_to_replace] = new_value
    • [ ] append two dataframes
  • [ ] Related to this ^, perhaps you could move some Pandas contents to the optional materials for the sake of time. Top of my list would be crosstab, transposing, and converting data types
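The three operations suggested above can be sketched as follows. This is an illustrative example under assumed column names and values (`col1`, `col2`, `value`, `new_col` are hypothetical), not code from the course material. Appending is shown with `pd.concat`, since `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"col1": [1, 1, 0],
                   "col2": [0, 1, 0],
                   "value": [5, 6, 7]})

# 1. Create a new column holding the same value in every row.
df["new_col"] = 10

# 2. Replace a column value where a condition holds.
#    Each comparison needs its own parentheses because & binds tighter than ==.
mask = (df["col1"] == 1) & (df["col2"] == 0)
df.loc[mask, "value"] = -1

# 3. Append the rows of a second dataframe with the same columns.
other = pd.DataFrame({"col1": [1], "col2": [1], "value": [8], "new_col": [10]})
combined = pd.concat([df, other], ignore_index=True)

print(combined)
```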

I will cover only pivot_table and skip crosstab and groupby in this run of the course. If there's time I will come back to those two. If there's even more time I will cover the rest of the suggestions here. Dataframe concatenation is covered in the bonus material.
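As a rough comparison of the three functions named in this reply, the sketch below runs each on the same small table. The data and column names are hypothetical and do not come from the session notebooks.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 20, 30, 40],
})

# pivot_table: one row per region, one column per year, aggregated sales.
table = pd.pivot_table(df, values="sales", index="region",
                       columns="year", aggfunc="sum")

# groupby: aggregate along a single key.
grouped = df.groupby("region")["sales"].sum()

# crosstab: by default counts how often each (region, year) pair occurs.
counts = pd.crosstab(df["region"], df["year"])

print(table)
print(grouped)
print(counts)
```

The pivot table covers the two-dimensional summary case directly, which may be one reason to prioritise it when time is short.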

weilu commented 1 year ago

@luisesanmartin On the notebook validation error, I removed the pandas package upgrade/pinning cells. Try it and let me know if it's still an issue.