yuwtennis / google-data-engineer

Repository of programs created while studying for the Google Data Engineer certification.

Reflect Ch05 Filtering Data on Occurrence Frequency #20

Closed yuwtennis closed 2 years ago

yuwtennis commented 2 years ago

Relates to https://github.com/yuwtennis/google-data-engineer/issues/16

Need to reflect on the questions below.

References

Classification

95th percentile

Binary classification

Books

yuwtennis commented 2 years ago

Looked in depth.

Why is quality control happening in the first place? (p. 161)

yuwtennis commented 2 years ago

Why is quality control happening in the first place? (p. 161)

In the book, the author analyzes the data using exploratory data analysis (EDA), the approach introduced by John Tukey.

First, build a parsimonious model and form a hypothesis by looking at the overall shape of the data. => Uses a violin plot.

Then, looking at the violin plot, the author noticed a skinny tail as the values get larger. => Acknowledged that outliers might exist and that a change of model might be necessary.

yuwtennis commented 2 years ago

Why is the author suspicious about the following? (p. 162)

I get back more than 1,000 rows. Are there really more than 1,000 unique values of DEP_DELAY? What's going on?
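To make the quoted question concrete, here is a minimal sketch of what "rows per unique value" means, assuming the result the author refers to has one row per distinct DEP_DELAY value. The sample delays and the names `dep_delays`, `rows_per_delay`, `num_unique` and `rare` are made up for illustration and are not from the book.

```python
from collections import Counter

# Hypothetical departure delays in minutes (made-up values, not the book's data).
dep_delays = [-5, -5, -2, 0, 0, 0, 3, 3, 17, 240]

# One entry per distinct delay value, holding how many flights had that delay.
rows_per_delay = Counter(dep_delays)

# Number of "rows" a group-by on the delay value would return.
num_unique = len(rows_per_delay)

# Delay values seen only once: these rarely occurring values are what make
# the number of distinct values grow.
rare = sorted(d for d, n in rows_per_delay.items() if n == 1)

print(num_unique)  # 6
print(rare)        # [-2, 17, 240]
```

With real data, a long tail of rarely occurring delay values could plausibly push the number of distinct values past 1,000.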

yuwtennis commented 2 years ago

Why is the author suspicious about the following? (p. 162)

Just out of curiosity, I think...

NEXT

Answer the question: what is 370?

yuwtennis commented 2 years ago

What is the connection between the CDF and the three-sigma rule?

yuwtennis commented 2 years ago

In the book, the author treats departure delay as a discrete variable (a countable dataset).

P 17

Addition of the counts of the discrete occurrences is equivalent to integrating the continuous values.
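The quoted idea can be sketched as an empirical CDF: summing the counts of discrete delay values up to a point plays the role that integrating a density plays for a continuous variable. The sample values and the names `delays`, `running` and `cdf` are assumptions for illustration only.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical departure delays in minutes (made-up values, not the book's data).
delays = [-3, -3, 0, 0, 0, 5, 5, 12, 12, 45]

counts = Counter(delays)
values = sorted(counts)

# Running sum of counts over the sorted values: the discrete analogue of
# integrating a probability density from minus infinity up to each value.
running = list(accumulate(counts[v] for v in values))
total = running[-1]

# Empirical CDF: fraction of flights with a delay less than or equal to each value.
cdf = {v: c / total for v, c in zip(values, running)}

print(cdf[0])   # 0.5
print(cdf[45])  # 1.0
```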

What is confusing is that on p. 165 the book then applies the three-sigma rule, which holds for normal distributions, to this dataset. This rests on the following assumption.

P165

if our population size is large enough (Uncountable datasets - continuous values)

yuwtennis commented 2 years ago

So the value 370 comes from the approximate expected frequency of values outside the 3-sigma range: under a normal distribution, about 1 in 370 observations falls outside it.

https://en.wikipedia.org/wiki/68–95–99.7_rule
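As a quick check of where 370 comes from, the fraction of a normal distribution lying more than k standard deviations from the mean can be computed with the complementary error function. This is a minimal sketch that assumes the data is normally distributed. The helper names `fraction_outside` and `one_in_n` are hypothetical.

```python
import math


def fraction_outside(k: float) -> float:
    """Fraction of a normal distribution more than k standard deviations from the mean."""
    # For a standard normal Z, P(|Z| <= k) = erf(k / sqrt(2)),
    # so the two-sided tail probability is the complement, erfc(k / sqrt(2)).
    return math.erfc(k / math.sqrt(2))


def one_in_n(k: float) -> float:
    """Expected number of observations per one falling outside k sigma."""
    return 1.0 / fraction_outside(k)


print(round(one_in_n(3)))  # 370
```

So roughly 0.27% of values fall outside 3 sigma, which is about 1 in 370.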