worldbank / iefieldkit

Stata commands designed for Impact Evaluations field work. These are tools that are used during/after a survey in the field for data quality monitoring.
MIT License
38 stars 18 forks source link

iecodebook: data signature `.txt` file is a binary #250

Open luizaandrade opened 1 year ago

luizaandrade commented 1 year ago

image

Maybe this is intentional, but in that case, it should not have a .txt file format.

bbdaniels commented 1 year ago

Thanks for flagging, but I don't think this is a bug -- this is the intended functionality and I am aware of this behavior. The datasignature file is created with a .txt file extension so it syncs to git under our standard .gitignore settings, and can then be used on another clone to verify data contents as intended. The contents of the file appear as below -- I do not know why this is read as a binary file (other than that it is declared as one in the header; it opens just fine in plaintext editor as shown). Tracking file level changes using git is not necessary here either so I did not pursue a workaround to make it function with git compare features.

If you agree, close issue; otherwise, my suggestion would be to produce it with the .dtasig extension (edit: or .sig.txt, although I don't know how "double" extensions work on other OS). However, users would then need to update all their .gitignore files to include that extension and achieve the intended command functionality. I thought that was an unnecessary extra step when deciding on this approach to the feature, since it has effectively no technical cost other than the ugly aesthetics of this.

Screen Shot
luizaandrade commented 1 year ago

Interesting. I see this on Notepad image

And this on VSC, where I can't open it image

But it pointing out there there are changes in a binary file on Stata is enough for the intended workflow, then we can close this. But perhaps it's useful to explain in the help file what is the purpose of the -sig.txt file?

bbdaniels commented 1 year ago

What would you want to add beyond the helpfile description here and the write-up in the SJ paper that were just accepted? We have just finished a very long review process for this feature set, so I am a bit confused why this is coming up now 😅

signature            This option requests that a datasignature be verified in the same destination folder as the
                            codebook and/or data are saved, and will return an error if a datasignature file is not found or
                            is different, guaranteeing data has not changed since the last reset of the signature.
kbjarkefur commented 1 year ago

One never knows when a bug appears. I do not think this is major bug (as in requiring a new release), but it should be addressed.

My 2 cent is this multi-staged rocket.

  1. Do not allow any file extension other than the .dtasig intended. So if someone try to use the name auto-sig.txt make the name auto-sig.txt.dtasig. File extensions does not change what a file is, only what it tells the computer it should be. And to many software .txt is raw text, but when the file is not raw text it does not want to interact with it as it could be a corrupted file that potentially could harm the computer. This way users can distinguish this binary file from raw text .txt files in their .gitignore in case they do not want any binary files in their repo.

  2. This is much more work, but would be the best thing. Use the r() values from datasignature to write a real raw text file with the same information. That is what iesave does.

But since iesave does that maybe it is ok to not do that here. And then users who wants to track this type of info use iesave for this feature and the people who is ok that this is not tracked in git can use iecodebook for this feature. But then we should not allow the file from iecodebook to seem like a raw text file.

kbjarkefur commented 1 year ago

(edit: or .sig.txt, although I don't know how "double" extensions work on other OS)

The file extension is the text after the last .. So file kristoffer.luiza.txt.ben has the file extension .ben. This does not change if the file is binary or raw text or anything. It just tells the computer it should be the file type .ben but it says nothing about what .ben is or should be.

But since this is just text that is a part of a file name then a language can make use of "double extensions". If you end your file names in Spark with .csv.gz then Spark expect it to be a .gz compressed file that includes a single .csv file. Such a file can be read on the fly in Spark without decompress it first. But this is just a naming convention so Spark knows what to expect, which then again is what all file extensions to any computer are.

bbdaniels commented 1 year ago

How does git treat files without an extension? I would rather do that if it works -- Stata does not seem to care about the extension here, but I don't think the datasignature confirm command will work if we don't use its exact output.

Best,

Benjamin Daniels (he/him) On Feb 7, 2023 at 03:56 -0500, Kristoffer Bjärkefur @.***>, wrote:

(edit: or .sig.txt, although I don't know how "double" extensions work on other OS) The file extension is the text after the last .. So file kristoffer.luiza.txt.ben has the file extension .ben. This does not change if the file is binary or raw text or anything. It just tells the computer it should be the file type .ben but it says nothing about what .ben is or should be. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were assigned.Message ID: @.***>

kbjarkefur commented 1 year ago

Git can handle all files. Git ignore is just a string match (a simple version of regex or something) on the file name. So to ignore any file with extension .txt you do *.txt, and to ignore a file without extension you do * but that is not great as it ignore any files with a name, i.e. all files.

No extension does not tell the computer anything about what the file is or should be. So that is almost as bad as the wrong extension in my mind.

What is your reason to not use .dtasig?

bbdaniels commented 1 year ago

My concern is that people will not have gitignore correctly configured if they use an allowlist approach like we recommend, since dtasig would need a new file extension. The purpose of this functionality is to sync this description of the data over git instead of the data itself. If that requires editing gitignore to work, then I worry the functionality will be lost for many users. That's all! I am not very concerned about what the computer thinks in this case, because the OS and user should never interact with this file -- it is only meant to be touched by Stata once synced.

Best,

Benjamin Daniels (he/him) On Feb 7, 2023 at 10:22 -0500, Kristoffer Bjärkefur @.***>, wrote:

Git can handle all files. Git ignore is just a string match (a simple version of regex or something) on the file name. So to ignore any file with extension .txt you do .txt, and to ignore a file without extension you do but that is not great as it ignore any files with a name. No extension does not tell the computer anything about what the file is or should be. So that is almost as bad as the wrong extension in my mind. What is your opposition to using .dtasig? — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were assigned.Message ID: @.***>

kbjarkefur commented 1 year ago

Anyone not using a gitignore file will not ignore it. Most templates do not use allowlists so it would not ignore it. We can update our template that use allowlists to allow it (will only apply to new repos tho).

I think that is the proper long term solution. And whatever we save from not doing the proper long term solution I do not justify relying on a file without file extension. That is bad for the computer, but a file without extension also looks like a corrupted file to a human.

You can deduct my preference, but you are lead author so it is up to you.

luizaandrade commented 1 year ago

I am all for updating the gitignore. We will always have to add new things to it anyway, it's not a static file.

If I see a .txt file, I expect git to track changes to it well and to be able to see its content. maybe we already had this discussion and I forgot about it and got confused again. but if the intended format for this file is .dtasig, I think we can stick to it.

Agreed that this does not warrant a new release right now, we can add it to the next one, whenever it comes. The help file is consistent with using .dtasig too, so it seems like a very small change. But if you feel strongly that .txt works best with the intended workflow for the command, then I we can leave it as is and just add a line on the helpfile saying that the data signature will be saved using a .txt format by defaults instead of the standard .dtasig to facilitate tracking using git.

bbdaniels commented 1 year ago

add a line on the helpfile saying that the data signature will be saved using a .txt format by defaults instead of the standard .dtasig to facilitate tracking using git.

I prefer this most, but as I mentioned I am also happy to, for example, use .sig.txt instead of -sig.txt as it is now to indicate that it is not a standard text file -- that that's just for the string matching in tracking. I have seen this style in other applications although it's rather uncommon...