znicholls / CMIP6-json-data-citation-generator

Simple scripts to automatically generate json data citations for CMIP6 data files
https://cmip6-json-data-citation-generator.readthedocs.io/en/latest/
BSD 2-Clause "Simplified" License

Correct format of json file #3

Closed znicholls closed 6 years ago

znicholls commented 6 years ago

@MartinaSt I just want to check that I've correctly understood the format of the JSON we want to produce. Can you double-check the format and my split of the fields into ignored, optional, compulsory, and compulsory but fixed (i.e. fields that must be present but whose content is always the same)?

# ignored
{
  "identifier": {
    "identifierType": "URL",
    "id": "http://cera-www.dkrz.de/WDCC/meta/CMIP6/CMIP6.VIACSAB.PCMDI.PCMDI-test-1-0"
  },
  "publisher": "Earth System Grid Federation",
  "publicationYear": "2017",
  "dates": [
    {
      "dateType": "Created",
      "date": "2017-05-03"
    }
  ],
  "language": "en",
  "resourceType": {
    "resourceTypeGeneral": "Dataset",
    "resourceType": "Digital"
  },
  "formats": [
    {
      "format": "application/x-netcdf"
    }
  ],
  "rightsList": [
    {
      "rightsURI": "http://creativecommons.org/licenses/by-sa/4.0/",
      "rights": "Creative Commons Attribution 4.0 International License (CC BY-SA 4.0)"
    }
  ],
  "descriptions": [
    {
      "descriptionType": "Abstract",
      "text": "Coupled Model Intercomparison Project Phase 6 (CMIP6) data sets. These data have been generated as part of the internationally-coordinated Coupled Model Intercomparison Project Phase 6 (CMIP6; see also GMD Special Issue: http://www.geosci-model-dev.net/special_issue590.html). The simulation data provides a basis for climate research designed to answer fundamental science questions, and the results will undoubtedly be relied on by authors of the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC-AR6).CMIP6 is a project coordinated by the Working Group on Coupled Modelling (WGCM) as part of the World Climate Research Programme (WCRP). Phase 6 builds on previous phases executed under the leadership of the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and relies on the Earth System Grid Federation (ESGF) and the Centre for Environmental Data Analysis (CEDA) along with numerous related activities for implementation. The original data is hosted and partially replicated at a federated collection of data nodes, and most of the data relied on by the IPCC is being archived for long-term preservation at the IPCC Data Distribution Centre (IPCC DDC) hosted by World Data Centre for Climate (WDCC) at DKRZ.The project includes simulations from about 90 global climate models and around 40 institutions and organizations worldwide. - Project website: https://pcmdi.llnl.gov/CMIP6  The Earth System Model PCMDI-test 1.0 (This entry is free text for users to contribute verbose information), released in 1989, includes the components:  atmos: Earth1.0-gettingHotter (360 x 180 longitude/latitude; 50 levels; top level 0.1 mb), land: Earth1.0, ocean: BlueMarble1.0-warming (360 x 180 longitude/latitude; 50 levels; top grid cell 0-10 m), seaIce: Declining1.0-warming (360 x 180 longitude/latitude). 
The model was run by the Program for Climate Model Diagnosis and Intercomparison, Lawrence Livermore National Laboratory, Livermore, CA 94550, USA (PCMDI) in nominal resolutions: atmos: 1x1 degree, land: 1x1 degree, ocean: 1x1 degree, seaIce: 1x1 degree."
    }
  ],
# compulsory
  "creators": [
    {
      "creatorName": "Taylor, Karl E.",
      "givenName": "Karl E.",
      "familyName": "Taylor",
      "email": "taylor13@llnl.gov",
      "nameIdentifier": {
        "schemeURI": "http://orcid.org/",
        "nameIdentifierScheme": "ORCID",
        "pid": "0000-0002-6491-2135"
      },
      "affiliation": "Lawrence Livermore National Laboratory"
    }
  ],
  "titles": [
    "PCMDI PCMDI-test1.0 model output prepared for CMIP6 VIACSAB"
  ],
# compulsory but fixed
  "subjects": [
    {
      "subject": "CMIP6.VIACSAB.PCMDI.PCMDI-test-1-0",
      "schemeURI": "http://github.com/WCRP-CMIP/CMIP6_CVs",
      "subjectScheme": "DRS"
    },
    {
      "subject": "climate"
    },
    {
      "subject": "CMIP6"
    }
  ],
# optional 
 "contributors": [
    {
      "contributorType": "ContactPerson",
      "contributorName": "Jungclaus, Johann",
      "givenName": "Johann",
      "familyName": "Jungclaus",
      "email": "johann.jungclaus@mpimet.mpg.de",
      "nameIdentifier": {
        "schemeURI": "http://orcid.org/",
        "nameIdentifierScheme": "ORCID",
        "pid": "0000-0002-3849-4339"
      },
      "affiliation": "Max-Planck-Institut fuer Meteorologie"
    },
    {
      "contributorType": "ResearchGroup",
      "contributorName": "Max-Planck-Institut fuer Meteorologie (MPI-M)"
    }
  ],
  "relatedIdentifiers": [
    {
      "relatedIdentifier": "10.5194/gmd-10-2247-2017",
      "relatedIdentifierType": "DOI",
      "relationType": "Cites"
    }
  ],
  "fundingReferences": [
    {
      "funderName": "Federal Ministry of Education and Research (BMBF)",
      "funderIdentifier": "http://doi.org/10.13039/501100002347",
      "funderIdentifierType": "Crossref Funder ID"
    }
  ]
}
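
The split proposed above can be written down as a small check on the top-level keys. This is only a sketch: the function name `check_top_level_fields` and the three constants are hypothetical, and the grouping mirrors the proposal in this comment rather than a confirmed specification of the citation service. "Compulsory but fixed" fields are treated as compulsory here.

```python
# Grouping of top-level keys as proposed in this comment (an assumption,
# not a confirmed specification of the citation service).
COMPULSORY = {"creators", "titles", "subjects"}
OPTIONAL = {"contributors", "relatedIdentifiers", "fundingReferences"}
IGNORED = {
    "identifier", "publisher", "publicationYear", "dates", "language",
    "resourceType", "formats", "rightsList", "descriptions",
}


def check_top_level_fields(citation):
    """Return (missing, unknown) top-level keys of a citation dict."""
    keys = set(citation)
    # Compulsory keys that are absent from the citation.
    missing = sorted(COMPULSORY - keys)
    # Keys that belong to none of the three groups.
    unknown = sorted(keys - COMPULSORY - OPTIONAL - IGNORED)
    return missing, unknown


example = {
    "creators": [{"creatorName": "Max-Planck-Institut fuer Meteorologie (MPI-M)"}],
    "titles": ["PCMDI PCMDI-test1.0 model output prepared for CMIP6 VIACSAB"],
    "language": "en",
}
print(check_top_level_fields(example))  # (['subjects'], [])
```

The check only looks at top-level keys; it says nothing about which keys inside each block are mandatory.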
MartinaSt commented 6 years ago

Yes, but within the creators, contributors and fundingReferences blocks not all information is mandatory. We used the DataCite definitions as a guide (http://doi.org/10.5438/0014) but made some changes.

minimal fundingReferences block:

"fundingReferences": [
    {
      "funderName": "Federal Ministry of Education and Research (BMBF)"
    }
]

minimal creator/contributor in case of a person:

"creators": [
    {
      "creatorName": "Taylor, Karl E.",
      "givenName": "Karl E.",
      "familyName": "Taylor",
      "email": "taylor13@llnl.gov",
      "affiliation": "Lawrence Livermore National Laboratory"
    }
],
"contributors": [
    {
      "contributorType": "ContactPerson",
      "contributorName": "Jungclaus, Johann",
      "givenName": "Johann",
      "familyName": "Jungclaus",
      "email": "johann.jungclaus@mpimet.mpg.de",
      "affiliation": "Max-Planck-Institut fuer Meteorologie"
    }
]

minimal creator/contributor in case of an institution:

"creators": [
    {
      "creatorName": "Max-Planck-Institut fuer Meteorologie (MPI-M)"
    }
],
"contributors": [
    {
      "contributorType": "ResearchGroup",
      "contributorName": "Max-Planck-Institut fuer Meteorologie (MPI-M)"
    }
]
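
The minimal person and institution entries above could be built with a few small helpers. This is a sketch only: the helper names `person_creator`, `institution_creator` and `as_contributor` are hypothetical, and the "Family, Given" form of the name is inferred from the examples in this thread rather than from a stated rule.

```python
def person_creator(given_name, family_name, email, affiliation):
    """Minimal creator entry for a person, using the keys listed above."""
    return {
        # Assumption: the combined name is "<familyName>, <givenName>",
        # as in the examples quoted in this thread.
        "creatorName": "{}, {}".format(family_name, given_name),
        "givenName": given_name,
        "familyName": family_name,
        "email": email,
        "affiliation": affiliation,
    }


def institution_creator(name):
    """Minimal creator entry for an institution: only the name."""
    return {"creatorName": name}


def as_contributor(creator, contributor_type):
    """Turn a creator entry into a contributor entry.

    Assumes the only differences are the extra "contributorType" key and
    "contributorName" replacing "creatorName", as in the examples above.
    """
    entry = {"contributorType": contributor_type}
    for key, value in creator.items():
        entry["contributorName" if key == "creatorName" else key] = value
    return entry


karl = person_creator(
    "Karl E.", "Taylor", "taylor13@llnl.gov",
    "Lawrence Livermore National Laboratory",
)
print(karl["creatorName"])  # Taylor, Karl E.

mpim = institution_creator("Max-Planck-Institut fuer Meteorologie (MPI-M)")
print(as_contributor(mpim, "ResearchGroup"))
```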
znicholls commented 6 years ago

Hey @MartinaSt, could you double-check the subjects field for me please? Should it be generated something like this?

"subjects":
  [
    {
      "subject":"<activity_id>.CMIP6.<target_MIP>.<institution-id>[.<source-id>]",
      "subjectScheme":"DRS"
    },
    {"subject":"climate"},
    {"subject":"CMIP6"},
    {"subject":"<custom-user-field>"}
]

Or is this field never used by the citation tool?

MartinaSt commented 6 years ago

Hi @znicholls. The DRS subject is used to connect the provided information to the right database entry, so it is very important! All other subjects are ignored, so I would delete the <custom-user-field> subject. Using the keys you find in the netCDF data header, the DRS subject is constructed as:

"subjects":
  [
    {
      "subject":"<mip_era>.<activity_id>.<institution_id>.<source_id>[.<experiment_id>]",
      "schemeURI": "http://github.com/WCRP-CMIP/CMIP6_CVs",
      "subjectScheme":"DRS"
    }
]
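
As a rough illustration, the DRS subject could be assembled from a mapping of netCDF global attributes like this. The function names `drs_subject` and `subjects_block` and the `include_experiment` flag are hypothetical; how the attributes are read from the file is left open, and the example values are taken from the IPSL entry mentioned in this thread.

```python
def drs_subject(attrs, include_experiment=False):
    """Build the DRS subject string from netCDF global attributes.

    `attrs` is any mapping of global attribute names to values, for
    example one collected from an open netCDF file.
    """
    keys = ["mip_era", "activity_id", "institution_id", "source_id"]
    if include_experiment:
        # The experiment_id is the optional trailing component.
        keys.append("experiment_id")
    missing = [k for k in keys if k not in attrs]
    if missing:
        raise KeyError("missing global attributes: {}".format(", ".join(missing)))
    return ".".join(str(attrs[k]) for k in keys)


def subjects_block(attrs, include_experiment=False):
    """Return the subjects list holding only the DRS subject."""
    return [
        {
            "subject": drs_subject(attrs, include_experiment),
            "schemeURI": "http://github.com/WCRP-CMIP/CMIP6_CVs",
            "subjectScheme": "DRS",
        }
    ]


attrs = {
    "mip_era": "CMIP6",
    "activity_id": "CMIP",
    "institution_id": "IPSL",
    "source_id": "IPSL-CM6A-LR",
}
print(drs_subject(attrs))  # CMIP6.CMIP.IPSL.IPSL-CM6A-LR
```

Raising on a missing attribute, rather than silently skipping it, keeps a malformed subject from being sent, since the subject is what links the citation to its database entry.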

Btw, the first DOI for CMIP6 data has been registered (data access is still restricted to infrastructure developers):
landing page: https://doi.org/10.22033/ESGF/CMIP6.1534
JSON: https://cera-www.dkrz.de/WDCC/ui/cerasearch/cerarest/exportcmip6?input=CMIP6.CMIP.IPSL.IPSL-CM6A-LR

znicholls commented 6 years ago

OK, that complicates things, mainly because it appears to me that different files follow different conventions. For example, not all files have the experiment id easily accessible in the filename (e.g. for the input4MIPs concentrations and emissions files it has to be inferred from the filename, based on the knowledge that it comes at the end of the source_id). It's also not always in the nc file: the input4MIPs files don't even use an experiment_id field.

For now I'll write this to target files that have an experiment_id field, assuming that all the input4MIPs stuff is now done, so that edge case isn't worth worrying about.

MartinaSt commented 6 years ago

Do you use the file names only? No opening the files to read the global attributes? And no consideration of the directory structure?

An example of a CMIP6 file name is: rlutcsaf_AERmon_CNRM-CM6-1_1pctCO2_r1i1p1f2_gr_185001-199912.nc

znicholls commented 6 years ago

Ok I think everything has now become much clearer. As an input provider, I was following the input forcing data specs. As you'll see, our filenames are different from the output file names.

input4MIPs name

<variable_id>_input4MIPs_<dataset_category>_<target_mip>_<source_id>_<grid_label>[_<time_range>].nc

Output name

<variable_id>_<table_id>_<source_id>_<experiment_id>_<member_id>_<grid_label>_<time slice>.nc

So it looks like I was solving a problem you didn't have (but I did as an input4MIPs provider). Haha oops!
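
To make the difference concrete, here is a minimal sketch that tells the two naming templates apart and splits a file name into its parts. The function name `parse_filename` and the `kind` key are hypothetical, and the sketch assumes that no component contains an underscore and that only input4MIPs names may omit the time range, as the templates above are written.

```python
import os

# Key order follows the two file name templates quoted above; the last
# component of output names ("time slice") is stored as "time_range" too.
OUTPUT_KEYS = ["variable_id", "table_id", "source_id", "experiment_id",
               "member_id", "grid_label", "time_range"]
INPUT4MIPS_KEYS = ["variable_id", "dataset_category", "target_mip",
                   "source_id", "grid_label", "time_range"]


def parse_filename(path):
    """Split a CMIP6-style file name into its underscore-separated parts."""
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext != ".nc":
        raise ValueError("expected a .nc file, got {!r}".format(path))
    parts = stem.split("_")
    # Forcing files carry the literal "input4MIPs" as second component.
    is_input = len(parts) > 1 and parts[1] == "input4MIPs"
    if is_input:
        del parts[1]  # drop the literal marker
        keys, min_parts = INPUT4MIPS_KEYS, len(INPUT4MIPS_KEYS) - 1
    else:
        keys, min_parts = OUTPUT_KEYS, len(OUTPUT_KEYS)
    if not min_parts <= len(parts) <= len(keys):
        raise ValueError("unexpected number of components in {!r}".format(path))
    info = dict(zip(keys, parts))
    info["kind"] = "input4MIPs" if is_input else "output"
    return info


info = parse_filename(
    "rlutcsaf_AERmon_CNRM-CM6-1_1pctCO2_r1i1p1f2_gr_185001-199912.nc"
)
print(info["kind"], info["source_id"], info["experiment_id"])
# output CNRM-CM6-1 1pctCO2
```

An input4MIPs name parsed this way has no experiment_id key at all, which is exactly the mismatch discussed in this thread.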

I'm going to close this issue and start a new one to try and get us on the same page.