worldbank / dime-stata-training

MIT License

General Review 03/2022 #32

Open bbdaniels opened 2 years ago

bbdaniels commented 2 years ago

General

[ ] Check spelling of names
[ ] Capitalize lecture titles (use sentence case for slide titles unless proper nouns)
[ ] Add more links to DRiP chapters whenever possible -- almost every slide could have such a link
[ ] Add more links to other resources, or at least a link to "DIME Analytics Resource Directory" or homepage on every slide footer?
[ ] Check size and quality of images and screenshots
[ ] Remove italics
[ ] Improve readability of bold/highlights
[ ] Add bold/highlights to all slides using terms defined for the first time in the same presentation
[ ] Check all formatting and spelling:
- [ ] Some terms are inconsistently used (data set / dataset, etc)
- [ ] First letter after colon : should be capitalized
- [ ] Minor text is inconsistently formatted

bbdaniels commented 2 years ago

Lecture 1

[ ] Kris’ name spelled wrong in title (umlaut is on the a, not the u) — check throughout
[ ] Highlighted font (yellow) is a bit light, can this be a darker color (orange/red/bold)?
[ ] Add disadvantages of Excel approach:
- Difficult to extend old instructions to new data
- Difficult to correct intermediate steps
- Difficult to trace someone else’s work
[ ] Add advantages of code approach:
- If written well, these instructions are easy to understand, edit, and re-implement
[ ] Dill in pasta??? Parsley, please!
[ ] On “skipping step” slide, add bullet: “If you mess up, you have to start cooking from the beginning; recipes are easy to edit”
[ ] “Recipes” slide: change “follow” to “understand, reuse, and adapt” in last bullet
[ ] “Ingredients” slide: add “(we often call this class the ‘unit of observation’ and refer to a row as an ‘observation’)”?
[ ] Third bullet, change “characteristics” to “a single characteristic (often called a ‘variable’)”?
[ ] Last bullet — rewrite “Each cell (often called a “data point”) contains the “value” of a “variable” for an “observation””?
[ ] “Data linkage table” slide: Last bullet, change to “In addition to the file name, give each data table a descriptive name…”
[ ] “Data Flowcharts” — image is too small; remove italics (change back to bold or highlight)

[ ] White Space (2) — use sub-bullets:

Stata does not distinguish between: 
- One empty space and many empty spaces
- One line break or many line breaks
It makes a big difference to the human eye! We would never share:
- A Word document, 
- An Excel sheet or 
- A PowerPoint presentation 
… without thinking about white space – there, we call it formatting

[ ] Can we improve the quality of the screenshots?
[ ] Write out “DIME Analytics Stata Style Guide”
[ ] “How to ask for help” — remove italics
[ ] Mention “comments” on final slide

bbdaniels commented 2 years ago

Lecture 2

[ ] First slide: Move data map image to here
[ ] Improve definition and discussion of master data sets. At least three functions:
- [ ] Enumerates all possible members for each unit of observation, with unique identifiers
- [ ] Describes how different units of observation can be connected in data
- [ ] Contains authoritative information on permanent characteristics and design variables for all units
[ ] “Semantics”: Remove slide without definitions unless this is used in the flow; if you want to keep this, update so it matches the slide with the definitions (not all terms are on both)
[ ] “Group indicators…” slide: First bullet is incorrect, you may have many final data tables for a given unit of observation! Different subsamples, time-series setups, wide/long versions, etc. may be needed.
[ ] Mention somewhere that it is easy to create data flowchart — PowerPoint for example is perfectly good and easy to edit/annotate; it won’t look as good as this professionally-created one but don’t worry about that!
[ ] More slides are needed on “master data sets” including at least one linking example
[ ] Include section title slides for all three components
[ ] “Final Data Dictionary” is odd at the end because that is not discussed in the slides — either add this or move to another presentation

bbdaniels commented 2 years ago

Lecture 3

[ ] Poor Kris!
[ ] “Mental Model”: Links to DRiP!
[ ] “Organizing data”: Bullet 2 is unclear — if there are multiple sources, each unit may have multiple files?
[ ] Add a note that ORIGINAL FILE NAMES SHOULD BE RETAINED AS RECEIVED in Raw
[ ] “Organizing code”: typo — CODE will be divided…
[ ] “Organizing code”: Again, there may often be more than one file, and one file may not correspond to a single unit of observation
[ ] “Organizing documentation”: Terminology — should this be “data dictionaries”; “codebooks”; and “data quality logs and corrections”? (If the last is a folder name, it should be shorter and not have spaces, perhaps “Corrections” or “QualityAssurance”?)
[ ] “Organizing outputs”: “…contains tables and graphs as exported from statistical software”
[ ] “Organizing data work” slides: something has gone wrong with the code font and bolding here, these should never be used together
[ ] “Organizing data work”: “saves data as .dta to the data/raw folder without changing the underlying data”; new first bullet point in cleaning: “Cleans the data (short definition)”, as well as improved descriptions of the other bullets
[ ] “Organizing data work”: typo (“Clan” —> Clean); add “constructs final data (short definition)” bullet point
[ ] “Referencing files in code”: Add section title slide before this; “call files” —> “import, export, and otherwise access or create non-code files from within code…”
[ ] “When you opened”: Can we address the backslashes right here up front? “Your computer will often use \, but you should always use / (reasons)”; second slide: change “Do you understand” to a breakdown — DRIVE; DIRECTORY; NAME; EXTENSION. Also, WHY DO WE USE QUOTES
[ ] Do we really want to use the Stata Project approach; and can we start phasing out flash drives? (nobody should be using them!)
[ ] Don’t forget to include , clear in use commands
[ ] Section title slide for Version Control; add definition to first slide (and DRiP links!)
[ ] “Naming conventions” slides are incomplete
[ ] “Git” should not be in code font?
[ ] A steep learning curve is good — it means you are learning quickly! Maybe we should say Git has a shallow learning curve — you can start quickly with basics but it takes a long time to master

bbdaniels commented 2 years ago

Lab 1

[ ] “Opening a data set”: “CTRL + D” — include appropriate Mac command; should people save the do-file here?
[ ] “Using a do-file”: do-file name should not have spaces or caps, use _ or -
[ ] “Browsing a data set”: clarify — “We will use column/variable and row/observation interchangeably…” also, put vocab in bold, not code font
[ ] “Exploring a data set”: use “Results Window” terminology here to stay consistent with diagram; introduce “console” separately (it refers to the whole thing, right?)
[ ] “Types of variables”: no italics; can we add more detail on value labels (ie, “there is a number stored but Stata can be told to display its meaning”)? Should we add slightly more information on display format in general (ie, what you see is not always what is really there, especially for certain data types)
[ ] “Review window”: Note that it may be in a tab along with the Variables window depending on user settings. Use caps for the names of windows on further references (i.e. “the Review window”)
[ ] “Review window”: “did not run” —> “did not FINISH running; it may have run partway”
[ ] “Help file usage” graph is too wide
[ ] “Exploring a data set” — definitely at least include a list of possible numeric types and string types, even with no more detail, as well as a note on display format
[ ] “Saving a do-file”: No caps or spaces!

bbdaniels commented 2 years ago

Lab 2

[ ] “Useful commands”: Note here to NEVER use edit and to never use the editor shortcut?
[ ] “Useful commands”: I would love to include table here but unfortunately the Stata 17 syntax and function are quite different; I would also suggest mean along with summarize as it is quite handy
[ ] “Useful commands”: note that describe and codebook can also take varlists; note missing, plot, and nolabel option for tabulate; I would add list [varlist] [if] [in] here
[ ] “Useful commands”: add link to visualization page on DIME Wiki. Don’t use graph pie - introduce simple options (ie graph bar, over(foreign) stack [asy]). Add graph box
[ ] “Describe”: typo (“milliseconds”); add note that it may also be measuring days or other units; “This can be different than the way other programs, such as Excel, record time; we’ll come back to this”
[ ] “Exploring numeric”: typo (35 unique values)
[ ] “Subsetting”: Use if missing(nr_participants)?
[ ] “Subsetting”: Use , discrete in histogram as we know this variable takes integer values (see the misleading gap between 6 and 7)
[ ] “Exploring categorical variables”: Note here that labelbook takes the LABEL name, not the VARIABLE name, and how to find/distinguish
[ ] “Exploring categorical variables”: include , missing in the two-way tabulate
[ ] “Exploring date variables”: time is not always counted in milliseconds
[ ] “Exploring date variables”: show how to use format() in histogram: histogram bid_submission_date , xlab(,format(%tdDD/NN/YY)) discrete; use international format (ie DMY, not US MDY); mention difference between DMY and MDY (ie, for date importing)
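
A possible sketch of the options suggested for Lab 2, written against Stata's built-in auto and sp500 datasets rather than the course data (so all variable names here are stand-ins):

```stata
sysuse auto, clear

describe make price rep78            // describe accepts a varlist
codebook rep78 foreign               // so does codebook
summarize price mpg
mean price mpg                       // means with standard errors

tabulate rep78, missing              // show missing values as their own row
tabulate rep78 foreign, missing      // two-way table, including missing
tabulate foreign, nolabel            // show stored numbers, not value labels

list make rep78 in 1/5               // list [varlist] [if] [in]
list make price if missing(rep78)    // subsetting with missing()

histogram rep78, discrete            // one bar per integer value
graph box price, over(foreign)       // box plot, with its explicit outlier rule

* Dates: the built-in sp500 data has a daily date variable called date
sysuse sp500, clear
histogram date, xlabel(, format(%tdDD/NN/YY))
```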

bbdaniels commented 2 years ago

Lab 3

[ ] Now Luiza needs an accent
[ ] “Importing data from Excel”: Maybe we can describe what the dataset is? Will we be using this dataset with other audiences?
[ ] Display of the import command is very weirdly line-wrapped. Check display of these and use screenshots instead if not possible to autoscale or prevent wrap?
[ ] “CSV”: Encoding should not be necessary in most cases, and it may not be Windows in many cases.
[ ] Both imports: Force lower case variable names? Note possible issues with overlong or non-unique column names; note formatting (colors, bold, numbers with stars or footnotes, etc) IS NOT DATA AND CANNOT BE READ
[ ] “isid”: explain the options explicitly and why you might use them; I would leave out the optional using here
[ ] I assume “Nadmetanje” is Croatian — should this get generalized?
[ ] “duplicates”: Emphasize that the tag number is an indicator for how many, not which; show egen x = group(var) to identify duplicate groups?
[ ] if: Mention missing handling?
[ ] Add demo on exporting to spreadsheet (export excel)? This is easy and super useful for many people
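
A rough sketch of how the isid, duplicates, if, and export points could be demonstrated. It uses the built-in auto dataset as a stand-in, and the output file name is hypothetical:

```stata
sysuse auto, clear

* isid exits with an error if the variables do not uniquely identify rows
isid make
capture noisily isid foreign         // fails: many cars share each origin
display _rc                          // nonzero return code after the failure

* duplicates tag counts HOW MANY other rows share the same values ...
duplicates tag foreign rep78, generate(dup_count)
* ... while egen group() shows WHICH duplicate group each row belongs to
egen dup_group = group(foreign rep78), missing

* Missing numeric values count as larger than any number in if conditions
count if rep78 > 3                       // includes rows with missing rep78
count if rep78 > 3 & !missing(rep78)     // excludes them

* Export to a spreadsheet, with variable names in the first row
export excel using "auto_export.xlsx", firstrow(variables) replace
```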

bbdaniels commented 2 years ago

Lab 4

[ ] Cleaning strings: Extra space not showing
[ ] Several typos in the presentation, re-read this one carefully
[ ] More explanation needed for string functions
[ ] “Encode”: bullet hierarchy is off
[ ] “Encode”: Also mention the alternative tab , gen to get the binary encodings if preferred
[ ] The two genders: “Lowest price” and “MEAT” … this might need to be explained for other contexts!
[ ] Value labels: I would NOT start with modify or even use that in an intro course. Instead, prefer to re-define the entire label explicitly lab def , replace
[ ] Re-emphasize that value labels are separate objects from variables; have their own names; and need to be attached one-by-one
[ ] Note “trivial”/“trivial procurement” label duplication explicitly
[ ] Do not change raw data with recode; generate a new variable and label it explicitly. recode var (X=Y “Label”) … , gen(var_clean)
[ ] Dates: Prefer DMY where possible, and note that US data may use MDY: always check!
[ ] Text to number: Typo (destring)
[ ] Outliers: graph box is useful as it has an explicit outlier rule
[ ] Labels: Should NOT be longer… (and ideally < 32 characters…)
[ ] Labels guidelines: 26 standard English characters; Title Case in Short Labels (NOT sentences or full questions); numbers and hyphens (if proper: “Lend-Lease 2 Status”); and unit symbols like those in “Efficiency ($USD/km)”
[ ] order: Often use sequential option — think about this when coming up with varnames
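
To illustrate the value label and recode suggestions, something like the following might work. It is only a sketch on the built-in auto dataset; the label names, category cutoffs, and new variable names are made up for the example, and strtrim() assumes Stata 14 or later.

```stata
sysuse auto, clear

* Strings: trim surrounding spaces and standardize case in a NEW variable
generate make_clean = lower(strtrim(make))
label variable make_clean "Make and Model (Cleaned)"

* Value labels are separate objects with their own names:
* define the whole label explicitly, then attach it to a variable
label define origin_lbl 0 "Domestic" 1 "Foreign", replace
label values foreign origin_lbl

* Do not overwrite the original: recode into a new, labeled variable
recode rep78 (1 2 = 1 "Poor") (3 = 2 "Average") (4 5 = 3 "Good"), ///
    generate(rep78_clean)
label variable rep78_clean "Repair Record (3 Categories)"

* Alternative to encode when binary indicators are preferred
tabulate foreign, generate(origin_)

* encode turns a string into a labeled numeric variable
encode make_clean, generate(make_code)
```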

bbdaniels commented 2 years ago

Lab 5

[ ] “What is data construction”: Repeat keep/drop analogy. “Creating or removing observations and variables (rows and columns) to match your analytical needs.” Include subsetting, stacking, deidentification?
[ ] “Why a separate task”:
- Data cleaning is objective; you have the most information possible to make the data reflect reality as best as possible
- Data cleaning requires private information and detailed knowledge of raw data
- Data construction is subjective; different people might decide to make different research decisions given the same clean data
- Data construction decisions must be clearly reviewable, and therefore self-contained
[ ] “What to plan ahead”:
- Which observations you will need for each analysis (original and derived)
- Which variables you will need for each analysis (original and derived)
- Which observations you will NOT need for each analysis
- Which variables you will NOT need for each analysis
[ ] “Create new numeric variables”: include ^
[ ] Construct exercise: unclear, also typo in money. Try “Create a variable that expresses transactions in hundreds of thousands of HRK, instead of in single HRK”; LABEL THIS VARIABLE!
[ ] datediff is nice but a bit advanced. I would start with simpler stuff, like creation of Booleans or additional categoricals. At least introduce egen first — and LABEL ALL NEW VARIABLES. Also, reverse the names for sorting: init_month and init_quart, etc instead (recall order , seq)
[ ] egen - move before data handling. Label variables.
[ ] “Best practices”: move to earlier and add more details on naming and labelling
[ ] “Aggregating”: Mention disaggregation (expand and weight)?
[ ] collapse: Screenshot of codebook doesn’t fit. Why not show count or codebook, compact before and after?
[ ] collapse: Show syntax such as collapse var (mean) var = var (min) var_min = var var2_min = var2 (count) var_count = var , by(catvar)
[ ] merge: Note that at least one data set MUST be uniquely identified (1:m, m:1, 1:1). Treat m:m as if it does not exist (merge m:m runs, but it almost never does what you expect); the expansion merge is joinby
[ ] merge: Note keepusing() option; note that having the same variable (name) in both data sets can cause problems
[ ] merge: Show how to tab _merge and then drop _merge (or at least rename and relabel it if you want to keep it)
[ ] Conclusion: Add iecodebook for subsetting variables; show potential Booleans for subsetting observations?
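
A sketch of how the construction, collapse, and merge points could fit together in one example. It assumes the built-in auto dataset, and every new variable name is illustrative only:

```stata
sysuse auto, clear

* Construct and LABEL new variables
generate price_1000 = price / 1000
label variable price_1000 "Price (Thousands of USD)"
generate byte expensive = price > 10000 if !missing(price)
label variable expensive "Price Above 10,000 USD"
egen price_origin_max = max(price), by(foreign)
label variable price_origin_max "Highest Price Within Origin"

* Aggregate to one row per origin and keep the result in a temporary file
preserve
collapse (mean) price_mean = price             ///
         (min)  price_min = price mpg_min = mpg ///
         (count) price_count = price, by(foreign)
tempfile origin_stats
save `origin_stats'
restore

* m:1 merge: the using data is uniquely identified by foreign
merge m:1 foreign using `origin_stats', keepusing(price_mean price_count)
tabulate _merge                      // check the match before moving on
drop _merge
```

Showing count or codebook, compact before and after the collapse would fit naturally between the preserve and restore lines.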

bbdaniels commented 2 years ago

Lab 6

[ ] Outputs: “Publication-quality tables and figures (.tex, .eps, .tif)”
[ ] Setting the stage: Add “It is common to have construction and analysis do-files open at the same time and to move back and forth between them. Note each file will need to run independently for this to work!”
[ ] “Useful commands”: Separate “interactive exploration” from “exploratory outputs” — interactives should not end up in do-files
[ ] tabstat - introduce idea of stored/accessible results using , save; return list; matlist? “You can build highly customized reports by saving and exporting matrices and results using built-in commands like putexcel and putdocx.”
[ ] list varlist: Note that the sort order should never matter and should be used to get desired results
[ ] graph bar: There are lots more possible values of stat!
[ ] Before final analysis — new brief section “creating, exporting, and saving results”. Show graph save, graph combine, graph export, and putdocx. What about tables here?
[ ] Include section title slides.
[ ] Include , clear in use commands
[ ] NOT THE PIE GRAPH (use graph bar, stack)
[ ] Needs summary and closing slides
[ ] Could be organized better with a better roadmap
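
For the stored results and the proposed “creating, exporting, and saving results” section, a minimal sketch might look like this. It uses the built-in auto dataset, the graph and file names are hypothetical, and PNG export may not be available in every Stata installation.

```stata
sysuse auto, clear

* Stored results: tabstat with save leaves matrices behind in r()
tabstat price mpg, statistics(mean sd) by(foreign) save
return list
matrix stats = r(StatTotal)          // statistics for the full sample
matlist stats

* Named graphs can be combined, saved, and exported
histogram price, name(g_hist, replace)
graph box price, over(foreign) name(g_box, replace)
graph combine g_hist g_box, name(g_both, replace)
graph save g_both "price_panels.gph", replace
graph export "price_panels.png", name(g_both) replace

* A stacked bar in place of the pie chart
graph bar (percent), over(rep78) over(foreign) asyvars stack
```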

bbdaniels commented 2 years ago

Lab 7 and 8 are very short -- these can probably be cut? I might instead split Lab 6 into two or three sessions, depending how many are scheduled -- something like:

Lab 6: Creating graphics

Create oneway graphics
- Histogram
- Bar chart
- Box plot
- Combine into panels
Create twoway graphics
- Create a scatter plot
- Create a lowess plot
- Create a scatter, lowess, and histogram together
- Basic cleanup options for visual styling
Export graphics
- Viewable PNG
- High-quality EPS

Lab 7: Creating non-graphical outputs

Storing and accessing results
Create automatic tables
- iebaltab
- regression export?
Create custom tables and reports
- collapse → outsheet
- putexcel/putdocx
- getting timestamps from c()
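
As a starting point for the custom tables part of Lab 7, a sketch along these lines could work. It assumes Stata 14 or later for the putexcel syntax, uses export delimited (which superseded outsheet) for the collapsed table, and all file names are hypothetical.

```stata
sysuse auto, clear

* Store summary statistics in a matrix
tabstat price mpg, statistics(mean sd) save
matrix stats = r(StatTotal)

* Write a timestamp from c() and the matrix to a spreadsheet
putexcel set "summary_stats.xlsx", replace
putexcel A1 = "Created `c(current_date)' `c(current_time)'"
putexcel A3 = matrix(stats), names

* collapse, then export the small table as a text file
preserve
collapse (mean) price_mean = price (count) price_count = price, by(foreign)
export delimited using "price_by_origin.csv", replace
restore
```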

Lab 8: Extensions and looking forward

Organizing exploratory and final analysis
Monitoring data collection — iefieldkit
Creating maps — spmap
Sampling and randomization — introduction
How to get help and DIME resources
Stata best practices and resources