worldbank / dime-data-handbook

Development Research in Practice: The DIME Analytics Data Handbook. By Kristoffer Bjärkefur, Luíza Cardoso de Andrade, Benjamin Daniels, and Maria Jones
https://worldbank.github.io/dime-data-handbook/

[ch5] data table - tabular dataset is confusing with tidy def of dataset #496

Closed kbjarkefur closed 4 years ago

kbjarkefur commented 4 years ago

After the tidy session and the discussion we had afterwards, I went back to chapter 5 to make sure that we say data table instead of just table (which can be confused with a result table in chapter 6), and I thought of these definitions. These definitions only have to be applied strictly when we talk about tidy data.

So far I think we all agree, but what I take from these definitions is that a dataset does not have rows or columns. In a tidy workflow, only data tables have rows and columns.

I discussed this with @luizaandrade and she agreed. I started to rewrite ch5 for this, and it became such a big change that I turned it into a quick PR.

kbjarkefur commented 4 years ago

In c7f8786 I made data table the main concept instead of tabular dataset. Tabular data is listed as a synonym for data table. This avoids a confusion: in a tidy workflow a dataset is defined as a set of data tables, and defining a dataset as a set of tabular datasets sounds weird.

kbjarkefur commented 4 years ago

This strict definition, when we talk about a tidy workflow, means that we should avoid talking about variables in a dataset in that context. It is OK to say in other contexts that a dataset has a variable, but right where we say that a dataset has data tables that have columns that are variables, we should avoid saying that datasets have variables. They do, but only indirectly.

See examples here 88a1c49 and here 0b91e9a.
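
To make the hierarchy concrete, here is a minimal Python sketch of the terminology as I read it from this thread: a dataset is a set of data tables, and only data tables have rows and columns. The class names, the household survey, and the variable names are all hypothetical and chosen for illustration; they are not taken from the handbook.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataTable:
    """A data table: the only object here that has rows and columns."""
    name: str
    columns: List[str]                                 # each column is a variable
    rows: List[Tuple] = field(default_factory=list)    # each row is an observation


@dataclass
class Dataset:
    """A dataset: a set of data tables, with no rows or columns of its own."""
    name: str
    tables: Dict[str, DataTable] = field(default_factory=dict)

    def add_table(self, table: DataTable) -> None:
        self.tables[table.name] = table

    def variables(self) -> Dict[str, List[str]]:
        """Variables are reached only indirectly, through each table's columns."""
        return {name: table.columns for name, table in self.tables.items()}


# Hypothetical example: one survey dataset made of two data tables.
survey = Dataset(name="household_survey")
survey.add_table(DataTable(
    name="household",
    columns=["hh_id", "village"],
    rows=[(1, "A"), (2, "B")],
))
survey.add_table(DataTable(
    name="member",
    columns=["hh_id", "member_id", "age"],
    rows=[(1, 1, 34), (1, 2, 7), (2, 1, 51)],
))

print(survey.variables())
```

In this sketch the question "how many rows does the dataset have?" has no answer, while "how many rows does the member table have?" does, which is the distinction the strict definition is meant to enforce.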

kbjarkefur commented 4 years ago

The recommendations in b898d25 are new but in line with what we said after the tidy session. Pay special attention to this.