use proper Stata ouptut syntax and only format errors as errors

worldbank / repkit

A Stata package with tools related to computational reproducibility

https://worldbank.github.io/repkit/

6 stars 0 forks source link

use proper Stata ouptut syntax and only format errors as errors #53

Closed kbjarkefur closed 5 months ago

kbjarkefur commented 5 months ago

This makes this text display correctly regardless if screen size and settings.

@bbdaniels , do we have a reference for this as we claim this is what Stata says. I find this in the documentation, which is similar but not word by word:

merge m:m varlist ... specifies a many-to-many match merge. This is allowed for completeness, but it is difficult to imagine an example of when it would be useful. For an m:m merge, varlist does not uniquely identify the observations in either dataset. Matching is performed by combining observations with equal values of varlist; within matching values, the first observation in the master dataset is matched with the first matching observation in the using dataset; the second, with the second; and so on. If there is an unequal number of observations within a group, then the last observation of the shorter group is used repeatedly to

bbdaniels commented 5 months ago

Ah they must have updated the text. I took that from an older Statalist posting. https://www.statalist.org/forums/forum/general-stata-discussion/general/1551609-merge-m-m-vs-joinby

The reason m:m is not just a bad practice but should be flagged as a reproducibility issue is as the documentation warns. m:m specifically is a command that is known to depend on the sort order of data and produces garbage results whenever it is used. I am not aware of any other such built-in command that will not trigger either our existing sort flag or our existing randomization flag; and I don't believe m:m will trigger either of them because it does not actually rearrange or randomize the data in the process of destroying it.

kbjarkefur commented 5 months ago

How can it not show up in our flags if it creates garbage data? Either it is stable or reprun needs to be improved to catch this? I don't think the command should be opinionated in this way as it assumes we can imagine all possible cases where this command can be used.

If we want to keep the warning, I think we should just refer to the helpfile of merge and not say something that quickly can be outdated.

bbdaniels commented 5 months ago

It is always stable, it is just sort-dependent on two datasets and the results are meaningless. I agree with you that it should not be flagged as an error. However, we wanted to flag it in the rep checks, so I enabled the warning in this command. It could be moved to an option, but I think it should be the default given the frequency with which we are seeing m:m in submitted work. All results after m:m are necessarily invalid. I also agree with referring to the wording of the current helpfile.

bbdaniels commented 5 months ago

The more you read the funnier it gets...

bbdaniels commented 5 months ago

Anyway -- this is fixed in 40bb89199ef96963cb96887f3ee4d191f7465ba1 with a link to the documentation. Closing this one.