zotero / translators

Zotero Translators
http://www.zotero.org/support/dev/translators

Moving arXiv to preprint #2788

Closed adam3smith closed 2 years ago

adam3smith commented 2 years ago

arXiv just added DOIs for preprints published in 2022, and I figured I'd take that opportunity to finally move the translator to importing items as preprints (since importing them as journal articles is inconsistent with import from other preprint servers and has undesirable side effects).

The basics are clear: set the item type to Document and put type: article into Extra. The problem is the detail of the arXiv identifier. It's most commonly cited as something like arXiv:2202.09008 [cs, stat], which we currently put into the publication field as a whole.

One approach would be to use

publisher: arXiv
number: 2202.09008 [cs, stat]

The problem is that, to the extent that other preprint servers use IDs, those wouldn't be cited with a ":" delimiter. E.g. NIH has

BioRxiv 069187 [Preprint].

as an example. So using the above, we either get non-standard output for arXiv or for all other preprint servers that use numbers (i.e. identifiers) in citations (though that's not that many; most rely on DOIs exclusively). We could put arXiv:2202.09008 [cs, stat] into number, which would look good citation-wise (at least where that's used), and leave publisher blank, but from a metadata point of view that's unsatisfactory. We could put the whole string into publisher, which seems about equally wrong.
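
To make the trade-off concrete, here is a minimal Python sketch of the three candidate mappings described above. The keys mirror the publisher and number fields under discussion; the `render` helper is hypothetical and only approximates how a style might join the two values. It is not how citeproc formats them.

```python
# Hypothetical illustration of the three storage options discussed above.
CANDIDATES = {
    "split": {"publisher": "arXiv", "number": "2202.09008 [cs, stat]"},
    "number_only": {"publisher": "", "number": "arXiv:2202.09008 [cs, stat]"},
    "publisher_only": {"publisher": "arXiv:2202.09008 [cs, stat]", "number": ""},
}


def render(fields, delimiter=" "):
    """Join the non-empty publisher and number values with a delimiter."""
    parts = [fields["publisher"], fields["number"]]
    return delimiter.join(part for part in parts if part)


if __name__ == "__main__":
    for name, fields in CANDIDATES.items():
        print(f"{name}: {render(fields)}")
    # The same "split" data with a ":" delimiter, as an arXiv-centric style might do.
    print(render(CANDIDATES["split"], delimiter=":"))
```

With the split mapping, whichever delimiter a style picks applies to every preprint server, which is the tension described above: a space suits the BioRxiv example but not the usual arXiv form, and a colon does the reverse.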

cc @bwiernik @karnesky @retorquere (CCing Emiliano bc he'll likely hear about this, as the changed format for arXiv is reflected in BibTeX)

FWIW, here's how arXiv itself displays the metadata in BibTeX:

@misc{xu2022variance,
      title={On Variance Estimation of Random Forests}, 
      author={Tianning Xu and Ruoqing Zhu and Xiaofeng Shao},
      year={2022},
      eprint={2202.09008},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}

Though over at NASA ADS that same piece is

@ARTICLE{2022arXiv220209008X,
       author = {{Xu}, Tianning and {Zhu}, Ruoqing and {Shao}, Xiaofeng},
        title = "{On Variance Estimation of Random Forests}",
      journal = {arXiv e-prints},
     keywords = {Statistics - Machine Learning, Computer Science - Machine Learning, Statistics - Computation, Statistics - Methodology},
         year = 2022,
        month = feb,
          eid = {arXiv:2202.09008},
        pages = {arXiv:2202.09008},
archivePrefix = {arXiv},
       eprint = {2202.09008},
 primaryClass = {stat.ML},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2022arXiv220209008X},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

dstillman commented 2 years ago

I don't think we should separate the number from the "arXiv:" prefix, since the prefixed form is the canonical form of the id. Zotero can always parse out known prefixes like "arXiv:" and do different things with them as appropriate when passing to citeproc or export translators.

dstillman commented 2 years ago

Though I guess the argument is that it's the id that's prefixed by arXiv, not the number? But we've said that in a Preprint item type there'd be an "Archive ID" field that was mapped to number. It seems more complicated and error-prone to separate different parts of the ID.

bwiernik commented 2 years ago

I think the best approach would be to put "arXiv" in publisher and just the "2202.09008" in number. APA does want the number to be cited as well as the URL/DOI, so just the number would be best for that. I think that using an alternative delimiter and omitting the categories are fine for most citations. For journals that commonly cite arXiv papers, the style could use the : delimiter for all servers, and that would also not be too bad.

Hopefully, citation practice will shift to just citing the DOI.

If we want, we could store the full string in Extra. We have discussed adding a flexible ID system to CSL. When that happens, we can revisit storing the parts of the arXiv ID in a structured format.

bwiernik commented 2 years ago

@dstillman The canonical form of the id is just the number. You can't, for example, enter this URL:

https://ArXiV.org/arxiv:2202.09008

The arXiv: prefix is analogous to the DOI: prefix convention. It's a display convention, not part of the actual id.

dstillman commented 2 years ago

I mean, arXiv literally says the canonical form includes the prefix, in the page I linked to above. It's not just a convention.

> The canonical form of identifiers from January 2015 (1501) is arXiv:YYMM.NNNNN, with 5-digits for the sequence number within the month.
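
As an illustration of the quoted form, a short Python sketch that recognizes arXiv:YYMM.NNNNN and separates the number from an optional version suffix. The `vN` suffix and the decision to ignore identifiers from before 2015 are assumptions of this sketch, and `parse_arxiv_id` is a hypothetical helper, not part of any Zotero or arXiv API.

```python
import re

# Assumed pattern for the post-2015 form quoted above (arXiv:YYMM.NNNNN),
# with an optional version suffix such as "v2". Older identifier schemes
# are deliberately not handled here.
ARXIV_ID = re.compile(r"^arXiv:(?P<number>\d{4}\.\d{5})(?P<version>v\d+)?$")


def parse_arxiv_id(text):
    """Return (number, version) for a prefixed arXiv id, or None if it does not match."""
    match = ARXIV_ID.match(text.strip())
    if match is None:
        return None
    return match.group("number"), match.group("version")


if __name__ == "__main__":
    print(parse_arxiv_id("arXiv:2202.09008"))
    print(parse_arxiv_id("arXiv:2202.09008v2"))
    print(parse_arxiv_id("2202.09008"))
```

Because the pattern requires the prefix, a bare 2202.09008 is not recognized, which reflects the argument that the prefix is what makes the id identifiable out of context.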

bwiernik commented 2 years ago

I don't have a strong preference other than not including the subject categories.

No part of the arXiv system actually respects that form other than the standard. All of the APIs, and arXiv's own supplied BibTeX data, treat the numeric part as the full ID.

It's a minor messiness to include "arXiv" two times in the citation, so we could retain the arXiv: prefix in the id if that's preferred.

adam3smith commented 2 years ago

I'm inclined to agree that we'll get into trouble if we separate the prefix from the ID. But that still leaves two problems:

  1. What to do with the publisher field because a citation that's something like

Jirasek, Fabian, Robert Bamler, and Stephan Mandt. 2022. “Hybridizing Physical and Data-Driven Prediction Methods for Physicochemical Properties.” arXiv ArXiv:2202.08804 [Physics, Stat], February. http://arxiv.org/abs/2202.08804.

looks very weird and

  2. What to do with the category information ([Physics, Stat]) that arXiv recommends including in citations.

dstillman commented 2 years ago

> No part of the arXiv system actually respects that form other than the standard. All of the APIs, and arXiv's own supplied BibTeX data, treat the numeric part as the full ID.

What they do in their APIs isn't relevant — that's just a technical decision. The point of the canonical form is to allow the id to be identifiable out of context. Zotero and other tools should be able to see an identifier and know what it is — which, incidentally, is exactly what Add Item by Identifier and ZoteroBib do with arXiv ids based on the prefix.

I agree that the subject categories should not be included in the same field, since those aren't part of the id.
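
A minimal Python sketch of keeping the categories out of the id field, assuming the cited form is an identifier optionally followed by a bracketed, comma-separated list as in arXiv:2202.09008 [cs, stat]. The `split_categories` helper is hypothetical.

```python
import re

# Assumed layout: an identifier, optionally followed by a bracketed,
# comma-separated category list, e.g. "arXiv:2202.09008 [cs, stat]".
CITED_FORM = re.compile(r"^(?P<id>[^\s\[]+)(?:\s*\[(?P<cats>[^\]]*)\])?$")


def split_categories(text):
    """Return (identifier, categories); categories is empty when none are present."""
    cleaned = text.strip()
    match = CITED_FORM.match(cleaned)
    if match is None:
        # Unrecognized layout: hand the string back unchanged.
        return cleaned, []
    cats = match.group("cats") or ""
    categories = [cat.strip() for cat in cats.split(",") if cat.strip()]
    return match.group("id"), categories


if __name__ == "__main__":
    print(split_categories("arXiv:2202.09008 [cs, stat]"))
    print(split_categories("arXiv:2202.09008"))
```

Where the categories would then be stored (Extra, or nowhere) is a separate question that this sketch does not answer.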

dstillman commented 2 years ago

Zotero can pass whatever makes sense for CSL. If we know that styles are going to get Publisher regardless, we can recognize and strip the id prefix. We can even normalize publisher and number so that "arXiv" is provided as the publisher even if it's not in the metadata, as long as the id has the prefix.
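
The normalization proposed in this comment could look roughly like the Python sketch below. It only illustrates the idea and does not describe what Zotero or the translator actually does; `to_csl_fields` and the plain-dict item are hypothetical.

```python
ARXIV_PREFIX = "arxiv:"


def to_csl_fields(item):
    """Return publisher/number values to hand to a style; the stored item is not modified."""
    publisher = item.get("publisher", "")
    number = item.get("number", "")
    # Compare case-insensitively, since both "arXiv:" and "ArXiv:" appear in practice.
    if number.lower().startswith(ARXIV_PREFIX):
        # Strip the prefix so "arXiv" is not printed twice next to the publisher.
        number = number[len(ARXIV_PREFIX):]
        # Supply the publisher when the metadata lacks one.
        publisher = publisher or "arXiv"
    return {"publisher": publisher, "number": number}


if __name__ == "__main__":
    print(to_csl_fields({"number": "arXiv:2202.09008"}))
    print(to_csl_fields({"publisher": "bioRxiv", "number": "069187"}))
```

Items without the prefix pass through unchanged, so other preprint servers would be unaffected under this assumption.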

Not sure about the categories.

bwiernik commented 2 years ago

Let's drop the categories, put arXiv in publisher, and keep the prefix in the id. It's not that odd if "arXiv" appears twice, and it's what APA, for example, would expect.

adam3smith commented 2 years ago

OK. Last question: ID including version number or not? arXiv recommends citing a specific version, but we've also gotten pushback on including it, I believe.

bwiernik commented 2 years ago

> Zotero can pass whatever makes sense for CSL. If we know that styles are going to get Publisher regardless, we can recognize and strip the id prefix. We can even normalize publisher and number so that "arXiv" is provided as the publisher even if it's not in the metadata, as long as the id has the prefix.

This is inconsistent enough across citation styles that I'm leery of messing with the data passed to citeproc. I'd err on the side of including "arXiv" in both the publisher and id fields if that's the direction we want to go.

retorquere commented 2 years ago

> CCing Emiliano bc he'll likely hear about this, as the changed format for arXiv is reflected in BibTeX

I don't think I was going to hear about this, so I appreciate the ping. Reading through the discussion above it seems like we're converging on a solution? I'd love to get an RDF sample of the solution.

adam3smith commented 2 years ago

Initial draft here: https://github.com/zotero/translators/pull/2790

iago-pssjd commented 2 years ago

Why not include the category information ([Physics, Stat]) in the Series field?

bwiernik commented 2 years ago

That would lead to those labels showing up in various citation styles (e.g., Chicago), which generally shouldn't happen.

AbeJellinek commented 2 years ago

@adam3smith @bwiernik et al.: should this be closed? Do we have other major preprint translators that need the same treatment?

adam3smith commented 2 years ago

None that I'm aware of. Closing.

AbeJellinek commented 2 years ago

Thanks!