Open GoogleCodeExporter opened 8 years ago
For all Canadensys datasets I have decided to populate datasetName and
datasetID as follows:
datasetName: "University of British Columbia Herbarium (UBC) - Vascular Plant
Collection"
1. We populate this term with the title of the dataset, as recorded in eml.xml
metadata. It's also the name that appears on IPT:
http://data.canadensys.net/ipt/resource.do?r=ubc-vascular-specimens
2. It follows the definition
(http://rs.tdwg.org/dwc/terms/index.htm#datasetName): "The name identifying the
data set from which the record was derived." Once you download the dataset and
merge it with other records or use it for something, datasetName does indeed
identify the dataset from which the record was derived: the original published
dataset on IPT.
3. Two collections already populated this term, to indicate the subcollection.
For UBC, that subcollection is its own dataset on IPT, so the name still
indicates the subcollection. For MT, "Marie-Victorin Herbarium (MT)" now
replaces "vascular plants" & "bryophytes". Those subcollections use the same
running number, database and dataset, so there is really no point in indicating
the subcollection.
4. Users can use datasetName to cite the dataset. It will be recommended as
such in the Canadensys norms.
datasetID: "http://data.canadensys.net/ipt/archive.do?r=ubc-vascular-specimens"
1. We populate this term with the URL of the dataset.
2. It follows the definition
(http://rs.tdwg.org/dwc/terms/index.htm#datasetID): "An identifier for the set
of data. May be a global unique identifier or an identifier specific to a
collection or institution." The URL is indeed an identifier for the dataset, and
it is even globally unique. You can click the URL and get more information, so it
is actionable. As for persistence: it will work until the Canadensys domain
disappears or the publisher decides to remove the dataset. Any link from the
GBIF registry or a data paper will also be broken when that happens, so we have
that problem anyway.
3. Right now, this URL is a better alternative than the non-clickable UUID
issued by GBIF. It can also be provided BEFORE the dataset is published (the
GBIF UUID is only issued AFTER publication). If at some point GBIF issues DOIs
for datasets, those can be used, but they will probably point to the same URL.
4. The URL in datasetID and the name in datasetName apply to the same unit (see
point 3 in the issue description). This is now consistent with
collectionCode/ID and institutionCode/ID.
5. Users can use the datasetID to cite the dataset. It will be recommended as
such in the Canadensys norms.
6. Currently, this link is not included in the eml.xml metadata, so the user
cannot find their way back to the original dataset. This is a bug:
http://code.google.com/p/gbif-providertoolkit/issues/detail?id=833. Even if it
is included in the eml.xml, it is still useful to have it for each record,
especially if those records are merged with other records by an aggregator.
7. I'm wondering if I should point to the a) human readable page:
http://data.canadensys.net/ipt/resource.do?r=ubc-vascular-specimens or b) the
archive itself:
http://data.canadensys.net/ipt/archive.do?r=ubc-vascular-specimens. a) is
easier for a human to consume when he/she clicks the link in a citation, but b)
can be used for datasets not published by IPT, can be consumed by a machine and
is already used by the GBRDS. Advice would be welcome.
What do you think? Does all of this make sense?
Should we update the AppleCore guidelines to use this recommendation?
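To make the proposal concrete, here is a minimal Python sketch of how both terms could be stamped on every record of a dataset. The function names (dataset_terms, annotate) and the catalogNumber value are made up for illustration; only the title and the URL pattern come from the description above, and the archive URL is used simply because it is the one shown in the datasetID example.

```python
IPT_BASE = "http://data.canadensys.net/ipt"


def dataset_terms(short_name, title):
    """Return the two dataset-level Darwin Core terms for one IPT resource."""
    return {
        # Title of the dataset, as recorded in the eml.xml metadata.
        "datasetName": title,
        # URL of the dataset (archive flavour, as in the example above).
        "datasetID": f"{IPT_BASE}/archive.do?r={short_name}",
    }


def annotate(records, short_name, title):
    """Return copies of the records with datasetName and datasetID added."""
    terms = dataset_terms(short_name, title)
    return [{**record, **terms} for record in records]


if __name__ == "__main__":
    records = [{"catalogNumber": "V000001"}]  # hypothetical record
    title = ("University of British Columbia Herbarium (UBC) - "
             "Vascular Plant Collection")
    for row in annotate(records, "ubc-vascular-specimens", title):
        print(row)
```

Because the terms travel with each record, they survive when an aggregator merges the records with data from other sources.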
Original comment by peter.de...@gmail.com
on 29 Mar 2012 at 2:49
To elaborate on point 7: our datasets will include the datasetID for all
records: it is the URL to the dataset and I will advise people to include it in
their citation. The question is, what flavour do I point to?
A. The resource: http://data.canadensys.net/ipt/resource.do?r=acad-specimens
+ human readable
+ best user experience for someone clicking on the link in the citation
+ crawl-able by a machine
+ includes links to other flavours (eml, archive, rtf)
+ used in data papers
- not immediately readable by a machine
- cannot be used by non-IPT datasets
- not used by the GBRDS
B. The archive: http://data.canadensys.net/ipt/archive.do?r=acad-specimens
+ readable by a machine (e.g. by http://tools.gbif.org/dwca-validator/)
+ includes everything in one package: data & metadata
+ can be used for non-IPT datasets
+ also used in data papers
+ used by the GBRDS
- user probably doesn't expect a file download when clicking a link in a
citation
- no easy way to consume metadata for a human (is XML)
- cannot be used for metadata only datasets
C. The metadata: http://data.canadensys.net/ipt/eml.do?r=acad-specimens
+ readable by a machine (is XML)
+ can be used for metadata only datasets
+ used by the GBRDS
- no easy way to consume metadata for a human (is XML)
- no link to other flavours (this is a serious issue!)
- cannot be used by non-IPT datasets, unless the publisher publishes that file
in addition to the archive
- not used in data papers
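The three flavours differ only in the IPT endpoint that serves the same resource short name. A small Python sketch, assuming the URL patterns quoted above hold for every resource on this IPT (the function name ipt_flavours is hypothetical):

```python
IPT_BASE = "http://data.canadensys.net/ipt"

# Endpoint per flavour, taken from the example URLs in options A, B and C.
ENDPOINTS = {
    "resource": "resource.do",  # A: human readable page
    "archive": "archive.do",    # B: data and metadata in one package
    "metadata": "eml.do",       # C: EML metadata (XML)
}


def ipt_flavours(short_name, base=IPT_BASE):
    """Return the URL of each flavour for one IPT resource short name."""
    return {
        flavour: f"{base}/{endpoint}?r={short_name}"
        for flavour, endpoint in ENDPOINTS.items()
    }


if __name__ == "__main__":
    for flavour, url in ipt_flavours("acad-specimens").items():
        print(flavour, url)
```

Whichever flavour ends up in datasetID, the other two can be derived from it for an IPT dataset, but not for a dataset published elsewhere.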
Original comment by peter.de...@gmail.com
on 29 Mar 2012 at 3:56
C. Apparently there is some sort of a link in the eml, in
packageId="http://data.canadensys.net/ipt/resource.do?id=acad-specimens/v2".
There is no direct link to the archive (and data) however.
Original comment by peter.de...@gmail.com
on 29 Mar 2012 at 4:02
Hey Peter, my personal vote would be A. As you point out, it's human readable
which for me is a huge plus for non-technical consumers, and it includes
additional links.
The packageId can be thought of as a versioned identifier for the metadata.
Original comment by kyle.br...@gmail.com
on 30 Mar 2012 at 12:48
I have been going back and forth between A and B, but now I'm leaning more
towards A. It's the URL I use all the time to link to the archive, because I
expect much more human than machine consumption. Also, with A I can track visits
with Google Analytics.
PS: The packageId doesn't work:
http://data.canadensys.net/ipt/resource.do?id=acad-specimens/v2 links to the
latest version of the last visited dataset, not Acadia.
Original comment by peter.de...@gmail.com
on 30 Mar 2012 at 12:59
[D.] A metadata system/protocol with a dataset URI (datasetID) that could do
content negotiation and lead a machine to a machine readable resource such as C
(http://data.canadensys.net/ipt/eml.do?r=acad-specimens) [or in the future an
RDF/XML serialization alternative...?] and a human to a human readable resource
such as A (http://data.canadensys.net/ipt/resource.do?r=acad-specimens) might
perhaps be worth contemplating...?
A content negotiation capacity might not need to be very complex. W3C has some
interesting guidelines for inspiration at:
http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/
Alternatively something like RDFa could be embedded into the human readable web
page at A to give a machine something to read...? [RDFa inside the web page at
A could perhaps also be a future feature]. So A and [D] might both work
fine, depending on the human readable response you aim at.
Another issue (at least to think about) is that the IPT might perhaps be
replaced by another data publishing technology for the same dataset in the
future - and that providing a more neutral URI as the datasetID might perhaps
be useful?
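A rough sketch of what such content negotiation could look like, in Python. It only parses the Accept header and picks between the existing A and C URLs; the function names, the sets of media types and the default to the human readable page are assumptions for illustration, not a description of anything the IPT does.

```python
IPT_BASE = "http://data.canadensys.net/ipt"

# Assumed media types; a real deployment might list others (e.g. RDF/XML).
HUMAN_TYPES = {"text/html", "application/xhtml+xml"}
MACHINE_TYPES = {"application/xml", "text/xml"}


def preferred_types(accept_header):
    """Return the media types of an Accept header, most preferred first."""
    ranked = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0  # default weight when no q parameter is given
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # q=0 means "not acceptable", so those entries are dropped.
        if media_type and quality > 0:
            ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def negotiate(short_name, accept_header):
    """Pick the metadata (C) or resource (A) URL for a dataset."""
    for media_type in preferred_types(accept_header):
        if media_type in MACHINE_TYPES:
            return f"{IPT_BASE}/eml.do?r={short_name}"
        if media_type in HUMAN_TYPES:
            return f"{IPT_BASE}/resource.do?r={short_name}"
    # Nothing recognised (including */*): fall back to the human page.
    return f"{IPT_BASE}/resource.do?r={short_name}"


if __name__ == "__main__":
    print(negotiate("acad-specimens", "text/html,application/xml;q=0.9"))
    print(negotiate("acad-specimens", "application/xml"))
```

In a live service the chosen URL would be returned as an HTTP redirect from the neutral dataset URI.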
Original comment by dag.endresen
on 30 Mar 2012 at 1:17
Hi Dag.
I agree with D and the fact that ideally, I should use a neutral URI (such as a
DOI), but I am looking for a pragmatic solution I can use right now. In that
sense A is better than B, because I can always extend it with more
possibilities (like adding RDFa, as you suggest).
Original comment by peter.de...@gmail.com
on 30 Mar 2012 at 1:30
Hi Peter! Yes, a pragmatic ID to be used now does not need to be the datasetID
to be used forever. Things evolve and it is a fact that identifiers will also
evolve. The same dataset can be given a DOI in the future [or a
content-negotiation-capable URI]. Even this solution could be replaced by a
more evolved technology further ahead...
So you are probably correct that a pragmatic solution based on what works now
is the best solution. Then A with the aim to add some RDFa further ahead sounds
good.
Original comment by dag.endresen
on 30 Mar 2012 at 1:53
Thanks to a suggestion from Tim Robertson, we have chosen a more
technology-agnostic URL:
http://dataset.canadensys.net/acad-specimens, which redirects to option A:
http://data.canadensys.net/ipt/resource.do?r=acad-specimens
That way, we have nicer and shorter URLs in the wild and an added layer to
redirect if we plan to host our datasets somewhere else or if we want to add
specific redirection (human, machine, etc.). In that sense, these URLs are not
that different from DOIs. DOIs are still better because you pay a service to
keep them working indefinitely. We'll try our best to keep ours working as long
as possible.
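The added layer can be as thin as a lookup from the short URL to the IPT page. A minimal Python sketch, assuming the path of the short URL is exactly the IPT resource short name (the function name redirect_target is hypothetical, and a real setup would more likely be a redirect rule in the web server):

```python
from urllib.parse import quote, urlsplit

STABLE_HOST = "dataset.canadensys.net"
IPT_BASE = "http://data.canadensys.net/ipt"


def redirect_target(dataset_url):
    """Map a short dataset URL to the IPT resource page (option A)."""
    parts = urlsplit(dataset_url)
    if parts.hostname != STABLE_HOST:
        raise ValueError(f"not a {STABLE_HOST} URL: {dataset_url}")
    short_name = parts.path.strip("/")
    if not short_name or "/" in short_name:
        raise ValueError(f"no single dataset short name in: {dataset_url}")
    # Only this line changes if the datasets move to another host or tool.
    return f"{IPT_BASE}/resource.do?r={quote(short_name)}"


if __name__ == "__main__":
    print(redirect_target("http://dataset.canadensys.net/acad-specimens"))
```

The URLs out in the wild stay the same; only the target behind them has to be updated.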
Original comment by peter.de...@gmail.com
on 30 Mar 2012 at 6:55
Hi Peter, I think that the redirection from a more technology neutral URI is a
very good solution.
Original comment by dag.endresen
on 31 Mar 2012 at 4:49
Canadensys now has DOIs for its IPT datasets, e.g.
http://dx.doi.org/10.5886/g7j6gct1, assigned via DataCite.
Original comment by davidpsh...@gmail.com
on 13 Dec 2012 at 4:50
It is nice to see that you have come to a solution for this for Canadensys. I'm
curious, though, what you think the identifier actually identifies. To me it
seems that it does not identify a data set, but rather the access point for
the latest version of the data published from a particular source. The data sets one
can find at that location are mutable over time, and as such the identification
of them is a tricky business. I would be careful to be very explicit when
telling consumers what the DOI really refers to because of this.
Original comment by gtuco.bt...@gmail.com
on 13 Dec 2012 at 12:44
Original issue reported on code.google.com by
peter.de...@gmail.com
on 12 Nov 2011 at 8:19