Closed yuwtennis closed 2 years ago
Looked in depth.
Why is quality control happening in the first place? (p. 161)
In the book, the author analyzes the data using EDA (exploratory data analysis), the approach invented by Tukey.
First, build a parsimonious model to form a hypothesis by looking at the outline of the data. => Uses a violin plot.
Then, looking at the violin plot, the author noticed a skinny tail as the values get larger. => Acknowledged that outliers might exist and that a model change might be necessary.
Why is the author suspicious about the following? (p. 162)
I get back more than 1,000 rows. Are there really more than 1,000 unique values of DEP_DELAY? What's going on?
Just out of curiosity, I think...
Answer to: What is 370?
Connection between the CDF and the three-sigma rule?
In the book, the author treats departure delay as a discrete variable (a countable dataset).
P 17
Addition of the counts of the discrete occurrences is equivalent to integrating the continuous values.
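To make the quoted statement concrete, here is a minimal sketch of how adding up counts of discrete delay values gives an empirical CDF, the discrete counterpart of integrating a density. The delay counts and the `empirical_cdf` helper are made up for illustration and are not from the book.

```python
from itertools import accumulate

# Hypothetical data: departure delay in minutes -> number of flights.
# These numbers are invented for illustration only.
delay_counts = {-5: 10, 0: 40, 5: 30, 10: 15, 60: 5}


def empirical_cdf(counts):
    """Return (value, cumulative fraction) pairs from a value -> count mapping."""
    values = sorted(counts)
    total = sum(counts.values())
    # Running sum of counts plays the role of the integral of a density.
    running = accumulate(counts[v] for v in values)
    return [(v, c / total) for v, c in zip(values, running)]


cdf = empirical_cdf(delay_counts)
for value, fraction in cdf:
    print(f"P(DEP_DELAY <= {value}) = {fraction:.2f}")
```

The last cumulative fraction is always 1, just as a density integrates to 1 over its whole range.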
What makes this confusing is that on p. 165 the book then applies the three-sigma rule, which holds for normal distributions, to this dataset. This is based on an assumption.
P165
if our population size is large enough (Uncountable datasets - continuous values)
So the value 370 is the approximate expected frequency outside the 3-sigma range: under a normal distribution, about 1 observation in 370 falls more than 3 sigma from the mean.
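As a rough check of the 370 figure, the sketch below derives it from the standard normal CDF using only the Python standard library. It assumes the data are normally distributed, which is the same assumption the three-sigma rule rests on; the function names are mine, not the book's.

```python
import math


def fraction_within(k: float) -> float:
    """P(|Z| <= k) for a standard normal Z, i.e. CDF(k) - CDF(-k)."""
    return math.erf(k / math.sqrt(2))


def one_in_n_outside(k: float) -> float:
    """Expected number of observations per one that falls outside k sigma."""
    return 1.0 / (1.0 - fraction_within(k))


for k in (1, 2, 3):
    print(f"{k} sigma: within = {fraction_within(k):.4f}, "
          f"outside = 1 in {one_in_n_outside(k):.0f}")
```

For 3 sigma, about 99.73% of values fall inside the range, so roughly 1 in 370 falls outside it.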
Relates to https://github.com/yuwtennis/google-data-engineer/issues/16
Need to reflect on the questions below.
References
Classification
95th percentile
Binary classification (二項分類)
Books
Doing Data Science
Exploratory Data Analytics