Closed adam3smith closed 2 years ago
I don't think we should separate the number from the arXiv prefix, since that's the canonical form of the id. Zotero can always parse out known prefixes like "arXiv:" and do different things with them as appropriate when passing to citeproc or export translators.
Though I guess the argument is that it's the id that's prefixed by arXiv, not the number? But we've said that in a Preprint item type there'd be an "Archive ID" field that was mapped to number. It seems more complicated and error-prone to separate different parts of the ID.
I think the best approach would be to put "arXiv" in publisher and just the "2202.09008" in number.
APA does want the number to be cited as well as the URL/DOI, so just the number would be best for that. I think that an alternative delimiter/omitting the categories is fine for most citations. For journals that commonly cite arXiv papers, the style could use the ":" delimiter for all servers, and that would also not be too bad.
Hopefully, citation practice will shift to just citing the DOI.
If we want, we could store the full string in Extra. We have discussed adding a flexible ID system to CSL. When that happens, we can revisit storing the parts of the arXiv ID in a structured format.
@dstillman The canonical form of the id is just the number. You can't, for example, enter this URL:
https://ArXiV.org/arxiv:2202.09008
The arxiv: prefix is analogous to the DOI: prefix convention. It's a display convention, not part of the actual id.
I mean, arXiv literally says the canonical form includes the prefix, in the page I linked to above. It's not just a convention.
The canonical form of identifiers from January 2015 (1501) is arXiv:YYMM.NNNNN, with 5-digits for the sequence number within the month.
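To make the quoted form concrete, here is a minimal sketch of recognizing a prefixed id of the shape arXiv:YYMM.NNNNN. The function name, the optional version suffix, and the case-insensitive prefix match are assumptions made for illustration; this is not how Zotero or arXiv actually implement it, and it deliberately ignores older identifier schemes.

```python
import re

# Matches the post-January-2015 form quoted above: "arXiv:YYMM.NNNNN",
# optionally followed by a version suffix such as "v2".
# Case-insensitive prefix matching is an assumption of this sketch.
ARXIV_ID = re.compile(
    r"^arxiv:(?P<number>\d{4}\.\d{5})(?P<version>v\d+)?$",
    re.IGNORECASE,
)


def parse_arxiv_id(raw):
    """Return (number, version) for a prefixed arXiv id, or None if it does not match."""
    match = ARXIV_ID.match(raw.strip())
    if match is None:
        return None
    return match.group("number"), match.group("version")
```

With this sketch, a bare "2202.09008" is not recognized, which illustrates the argument that the prefix is what makes the id identifiable out of context.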
I don't have a strong preference other than not including the subject categories.
No part of the arXiv system actually respects that form other than the standard. All of the APIs, and arXiv's own supplied BibTeX data, treat the numeric part as the full ID.
It's a minor messiness to include "arXiv" two times in the citation, so we could retain the arxiv: in the id if that's preferred.
I'm inclined to agree that we'll get into trouble if we separate the prefix from the ID. But that still leaves two problems:
Jirasek, Fabian, Robert Bamler, and Stephan Mandt. 2022. “Hybridizing Physical and Data-Driven Prediction Methods for Physicochemical Properties.” arXiv ArXiv:2202.08804 [Physics, Stat], February. http://arxiv.org/abs/2202.08804.
looks very weird and
No part of the arXiv system actually respects that form other than the standard. All of the APIs, and arXiv's own supplied BibTeX data, treat the numeric part as the full ID.
What they do in their APIs isn't relevant — that's just a technical decision. The point of the canonical form is to allow the id to be identifiable out of context. Zotero and other tools should be able to see an identifier and know what it is — which, incidentally, is exactly what Add Item by Identifier and ZoteroBib do with arXiv ids based on the prefix.
I agree that the subject categories should not be included in the same field, since those aren't part of the id.
Zotero can pass whatever makes sense for CSL. If we know that styles are going to get Publisher regardless, we can recognize and strip the id prefix. We can even normalize publisher and number so that "arXiv" is provided as the publisher even if it's not in the metadata, as long as the id has the prefix.
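The normalization described here could look roughly like the following sketch. The helper name normalize_for_csl and the plain-string field values are hypothetical; this is not Zotero's actual code path for passing data to citeproc.

```python
ARXIV_PREFIX = "arxiv:"


def normalize_for_csl(publisher, number):
    """Hypothetical helper: strip a leading "arXiv:" from number and,
    if the prefix was present and publisher is empty, fill in "arXiv"."""
    if number.lower().startswith(ARXIV_PREFIX):
        number = number[len(ARXIV_PREFIX):].strip()
        if not publisher:
            publisher = "arXiv"
    return publisher, number
```

Items from other preprint servers pass through unchanged, since the prefix check is the only trigger.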
Not sure about the categories.
Let's drop the categories and include arXiv in the publisher and the prefix. It's not that odd if "arXiv" appears twice, and it's what APA, for example, would expect.
OK. Last question: ID including version number or not? arXiv recommends citing a specific version, but we've also gotten pushback on including it, I believe.
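Whichever way the version question is decided, the suffix is easy to separate, so either form can be derived from the other. A minimal sketch, assuming the version is always a trailing "v" followed by digits (the helper name split_version is hypothetical):

```python
import re


def split_version(arxiv_id):
    """Split a trailing version suffix ("v" + digits) off an arXiv id.

    Returns (base_id, version), where version is an int or None.
    """
    match = re.fullmatch(r"(?P<base>.+?)(?:v(?P<version>\d+))?", arxiv_id.strip())
    if match is None:
        # Only reachable for an empty string.
        return arxiv_id, None
    version = match.group("version")
    return match.group("base"), int(version) if version is not None else None
```

The "v" inside the "arXiv:" prefix is not mistaken for a version because it is not followed by a digit at the end of the string.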
Zotero can pass whatever makes sense for CSL. If we know that styles are going to get Publisher regardless, we can recognize and strip the id prefix. We can even normalize publisher and number so that "arXiv" is provided as the publisher even if it's not in the metadata, as long as the id has the prefix.
This is inconsistent enough across citation styles that I'm leery of messing with the data passed to citeproc. I'd err on the side of including "arXiv" in both the publisher and id fields if that's the direction we want to go.
CCing Emiliano bc he'll likely hear about this as the changed format for arXiv is reflected in BibTeX
I don't think I was going to hear about this, so I appreciate the ping. Reading through the discussion above it seems like we're converging on a solution? I'd love to get an RDF sample of the solution.
Initial draft here: https://github.com/zotero/translators/pull/2790
Why not include the category information ([Physics, Stat]) in the Series field?
That would lead to those labels showing up in various citation styles (e.g., Chicago), which generally shouldn’t happen.
@adam3smith @bwiernik et al.: should this be closed? Do we have other major preprint translators that need the same treatment?
None that I'm aware of, closing
Thanks!
arXiv just added DOIs for preprints published in 2022, and I figured I'd take that opportunity to finally move the translator to import preprints (since importing it as journal articles is inconsistent with import from other preprint servers and has undesirable side effects).
The basics are clear: set the item type to Document and put
type: article
into Extra. The problem is the detail of the arXiv identifier. It's most commonly cited as something like this:
arXiv:2202.09008 [cs, stat]
which we currently put into the publication field as a whole. One approach would be to use
publisher: arXiv
number: 2202.09008 [cs, stat]
The problem is that, to the extent that other preprint servers use IDs, those wouldn't be cited with a ":" delimiter. E.g. NIH has
as an example. So using the above, we either get non-standard output for arXiv or for all other preprint servers that use numbers (i.e. identifiers) in citations (though that's not that many; most rely on DOIs exclusively). We could put
arXiv:2202.09008 [cs, stat]
into number, which would look good citation-wise (at least where that's used) and leave publisher blank, but from a metadata point of view that's unsatisfactory. We could put the whole string into publisher, which also seems wrong, about equally so.
cc @bwiernik @karnesky @retorquere (CCing Emiliano bc he'll likely hear about this as the changed format for arXiv is reflected in BibTeX)
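For comparing these options, it may help to see the combined string taken apart. The following is only a sketch: the function name split_citation_string is hypothetical, and it assumes the categories always appear as a single bracketed, comma-separated list after the id.

```python
import re


def split_citation_string(text):
    """Split a string like "arXiv:2202.09008 [cs, stat]" into (id, categories).

    Returns None if the text does not have the assumed shape.
    """
    match = re.fullmatch(
        r"\s*(?P<id>[^\s\[]+)\s*(?:\[(?P<cats>[^\]]*)\])?\s*",
        text,
    )
    if match is None:
        return None
    cats = match.group("cats")
    categories = [c.strip() for c in cats.split(",") if c.strip()] if cats else []
    return match.group("id"), categories
```

Once the two parts are separate, each candidate mapping (id in number, "arXiv" in publisher, categories kept or dropped) can be tried against real citation styles.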
FWIW, here's how arXiv itself displays the metadata in BibTeX:
Though over at NASA ADS that same piece is