Defining the task - Githubissues

znicholls commented 6 years ago

@MartinaSt let's try again haha (I'm glad we worked this out now, better late than never right?)

First question, if all the input4MIPs data citations are done, we can forget about that. If not, then I think we need to look at how to map the information in the input4MIPs files (as specified here) into your citation tool.

Second question. What does the tool actually need to do? It should:

- take a model output file, with filename as given by the CMIP6 specs
- take a template yaml file with information about who generated the data file, funders etc.
- combine the two to generate a json file which fits with your data citation system
  - in this step it uses information from the model file to fill out the rest of the fields required to generate the json from the yaml
- upload the json file to the data citation server via your API
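A minimal sketch of the "combine the two" step, assuming the CMIP6 filename layout `<variable_id>_<table_id>_<source_id>_<experiment_id>_<member_id>_<grid_label>[_<time_range>].nc`. The template keys (`creators`, `funders`), the output field names and the function names are hypothetical; the real json layout is whatever the data citation system expects. The template is written as a plain dict here to keep the example self-contained; in the tool it would be loaded from the yaml file.

```python
import json

# Field order of a CMIP6 filename, without the optional time range.
FILENAME_FIELDS = ("variable_id", "table_id", "source_id",
                   "experiment_id", "member_id", "grid_label")


def parse_cmip6_filename(filename):
    """Split a CMIP6 filename into its underscore-separated components."""
    stem = filename[:-len(".nc")] if filename.endswith(".nc") else filename
    parts = stem.split("_")
    if len(parts) not in (len(FILENAME_FIELDS), len(FILENAME_FIELDS) + 1):
        raise ValueError(f"unexpected number of fields in {filename!r}")
    info = dict(zip(FILENAME_FIELDS, parts))
    if len(parts) == len(FILENAME_FIELDS) + 1:
        info["time_range"] = parts[-1]
    return info


def build_citation(template, file_info):
    """Merge the hand-written template with values taken from the file."""
    citation = dict(template)  # shallow copy so the template is not modified
    citation["source_id"] = file_info["source_id"]
    citation["experiment_id"] = file_info["experiment_id"]
    return citation


# Hypothetical template content standing in for the parsed yaml file.
template = {"creators": ["A. Modeller"], "funders": ["Example Funder"]}
file_info = parse_cmip6_filename(
    "tas_Amon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc"
)
citation_json = json.dumps(build_citation(template, file_info),
                           indent=2, sort_keys=True)
```

The upload step is left out on purpose, since it depends on the details of the citation server's API.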

Third question. Does the tool need to look at the file metadata or should it all be contained in the filename? My current guess is it has to look at the model metadata.

Fourth question. Does the tool need to look at the directory structure or should it all be contained in the file metadata? My current understanding is that the directory structure contains information only from the file metadata, so looking at the directory structure is not necessary.

Fifth question. To aid users, we should have some sort of validation of the input yaml and json files which gives useful messages if there are errors?

Last (and most important) questions. If this gets built, who will use it? How much time will it save them and when does it need to be ready by? My impression is that all the groups submitting data to CMIP6 could use it. It would save each of them something on the order of hours, as they wouldn't have to worry about formatting the json correctly, entering everything by hand via the GUI or making sure they have citations for all their files. However, it won't save much more than that, as it doesn't handle adding people/institutions via the GUI and you still (obviously) have to manually enter who did what (i.e. which yaml template should be used for which files). As far as I can tell it needs to be ready by the end of August so that most groups (acknowledging some are already done) can use it to create data citations as their results come out.

Let's discuss this all in one big thread for now then I will split into smaller issues as appropriate.

MartinaSt commented 6 years ago

Hi @znicholls - Good that you haven't lost your humor!

First question. Yes, I would say input4MIPs is over and we should forget about it.

Second question (Functionality).

- yes
- yes
- You could use the absolute path of the file in addition to the file name or open the file and read the global attributes. I would guess that the first is quicker and the second safer.
- optional update
- a validation of the DRS subject against the CV would be great: https://github.com/WCRP-CMIP/CMIP6_CVs

Third question. Right, the file name misses two components of the DRS subject: <activity_id> and <institution_id>. Either add that from the absolute path or the global attributes of the netCDF file.
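A rough sketch of the two options, assuming the CMIP6 directory layout `<mip_era>/<activity_id>/<institution_id>/<source_id>/<experiment_id>/<member_id>/<table_id>/<variable_id>/<grid_label>/<version>/`. The function names are made up for illustration, and the second function takes the global attributes as an already-read mapping; actually opening the netCDF file (with iris, netCDF4 or similar) is not shown.

```python
from pathlib import PurePosixPath

# Directory levels of the CMIP6 data reference syntax, outermost first.
DIRECTORY_FIELDS = ("mip_era", "activity_id", "institution_id", "source_id",
                    "experiment_id", "member_id", "table_id", "variable_id",
                    "grid_label", "version")


def ids_from_path(path):
    """Return (activity_id, institution_id) read from a DRS-style path."""
    directories = PurePosixPath(path).parent.parts
    if "CMIP6" not in directories:
        raise ValueError(f"no 'CMIP6' directory found in {path!r}")
    start = directories.index("CMIP6")
    drs = directories[start:start + len(DIRECTORY_FIELDS)]
    if len(drs) < len(DIRECTORY_FIELDS):
        raise ValueError(f"path {path!r} is too short to follow the DRS")
    fields = dict(zip(DIRECTORY_FIELDS, drs))
    return fields["activity_id"], fields["institution_id"]


def ids_from_attributes(global_attributes):
    """Same information, taken from the file's global attributes instead."""
    return (global_attributes["activity_id"],
            global_attributes["institution_id"])


example_path = ("/data/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Amon/tas/gn/"
                "v20190308/tas_Amon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc")
activity_id, institution_id = ids_from_path(example_path)
```

The path route only works if the files really are stored under a DRS-compliant directory tree, which is the reason the attribute route is likely the safer of the two.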

Fourth question. (see answer of third question)

Fifth question. Yes, agreed. I guess (apart from the check against the registered CV mentioned in the second question's answer) that is something we have to do. To check whether the created json can be uploaded into the database, we need to check not only the json format but also the availability of the referenced persons and institutes. I have opened an issue at DKRZ but am not sure how successful I will be in persuading my colleague to implement this. I will let you know.
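As an illustration of the local part of such a check, here is a small sketch that collects readable error messages instead of stopping at the first problem. The required template fields and the function name are hypothetical, and `EXAMPLE_CV` is only a tiny stand-in for the lists in the CMIP6_CVs repository. Whether the referenced persons and institutes exist in the database cannot be checked this way; that part needs the server.

```python
# Tiny stand-in for the controlled vocabulary (the full lists are in CMIP6_CVs).
EXAMPLE_CV = {
    "activity_id": {"CMIP", "ScenarioMIP"},
    "institution_id": {"NCAR", "MOHC"},
}

# Hypothetical fields that a template must fill in by hand.
REQUIRED_TEMPLATE_FIELDS = ("creators", "funders")


def validate(template, drs_fields, cv=EXAMPLE_CV):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for field in REQUIRED_TEMPLATE_FIELDS:
        if not template.get(field):
            errors.append(f"template is missing a value for '{field}'")
    for field, allowed in cv.items():
        value = drs_fields.get(field)
        if value is None:
            errors.append(f"'{field}' could not be determined from the file")
        elif value not in allowed:
            errors.append(
                f"{field} '{value}' is not in the controlled vocabulary"
            )
    return errors


problems = validate(
    {"creators": ["A. Modeller"], "funders": []},
    {"activity_id": "CMIP", "institution_id": "NCARR"},
)
for problem in problems:
    print(problem)
```

Returning every problem at once means a user can fix a template in one pass rather than rerunning the tool after each error.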

Last question. Potentially, we have about 40 centers running 90 models. The number of data references (JSONs) for the coarse granularity is ca. 600. The number of data references for experiments (finer granularity) is much higher. The data references for these experiment data collections cannot be entered via the GUI. We hide those because the number of entries gets too high for a GUI application. It just gets confusing. Thus, every modeling center will need to use the API. Your tool could close the gap between exporting a json template for one detailed data reference entered via the GUI and inserting the altered json for a different data reference. When is your tool needed? Difficult question. The project runs for several years and it is possible to add/alter data references during those years. Therefore I think the tool is useful even for those who have completed the model runs or have already published some data in the ESGF.

znicholls commented 6 years ago

> You could use the absolute path of the file in addition to the file name or open the file and read the global attributes. I would guess that the first is quicker and the second safer.

I am dubious speed will be an issue (iris only opens metadata so is quick if all you want to do is look at attributes) so will err on the side of safety especially given that not all groups will necessarily have all the path stuff under control (or at least we didn't so I'm happy to work towards 'us' as an end user).

> To check if the created json can be uploaded into the database, we need not only to check the json format but also the availability of the referenced persons and institutes.

Ok I think there are a couple of workarounds for this even if your colleagues don't agree which will at least make a user's life easier.

> When is your tool needed?

Ok so it seems like it's worth writing properly then

znicholls commented 6 years ago

I've broken this out into #12 #13 #14 #15 #16 #17 so if there's specific things probably easier to comment there. #15 and #16 are most important to get started.

znicholls commented 5 years ago

Done thanks to discussions with @durack1

znicholls / CMIP6-json-data-citation-generator

Defining the task #13