Open melsophos opened 10 years ago
We get the categories from arXiv's OAI XML, e.g. for your item: http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai%3AarXiv.org%3A0912.5286&metadataPrefix=oai_dc (under setSpec). I'm not attached to the current publication field, but I'd much prefer not to scrape anything off the rather unstructured arXiv item page. I'd probably be happy to implement any recommendations you have based on the XML data.
Thank you for your answer.
So I took a look at the XML (and some other examples, like 1311.1190) and it seems that the main category is given by the first dc:subject, the others being sorted alphabetically. The problem is that the category is given by its whole name and not its abbreviation…
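A minimal sketch of that observation in Python (chosen only for illustration; the translator itself is JavaScript). The XML below is a hand-written sample that imitates the shape of an oai_dc record, not a copy of the real response. The subject names, the `NAME_TO_ABBREV` table and the function names are assumptions, and the "first dc:subject is the primary" rule is only the pattern observed above, not documented behaviour.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an oai_dc record (NOT a real arXiv response).
SAMPLE_OAI = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv.org:0912.5286</identifier>
        <setSpec>physics:astro-ph</setSpec>
        <setSpec>physics:gr-qc</setSpec>
        <setSpec>physics:hep-th</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:subject>General Relativity and Quantum Cosmology</dc:subject>
          <dc:subject>Astrophysics - Cosmology and Nongalactic Astrophysics</dc:subject>
          <dc:subject>High Energy Physics - Theory</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, partial lookup from full subject name to abbreviation.
# A real table would have to cover every arXiv category.
NAME_TO_ABBREV = {
    "General Relativity and Quantum Cosmology": "gr-qc",
    "High Energy Physics - Theory": "hep-th",
    "Astrophysics - Cosmology and Nongalactic Astrophysics": "astro-ph.CO",
}


def oai_categories(xml_text):
    """Return (setSpec values, dc:subject values) in document order."""
    root = ET.fromstring(xml_text)
    set_specs = [e.text for e in root.findall(".//oai:setSpec", NS)]
    subjects = [e.text for e in root.findall(".//dc:subject", NS)]
    return set_specs, subjects


def guess_primary(subjects):
    """Assume the first dc:subject is the primary category (observed, not documented)."""
    if not subjects:
        return None
    return NAME_TO_ABBREV.get(subjects[0])  # None if the name is not in the table


set_specs, subjects = oai_categories(SAMPLE_OAI)
print(set_specs)
print(guess_primary(subjects))
```

The lookup table is the weak point: because the record gives the whole name rather than the abbreviation, any name missing from the table leaves the primary class unknown.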
I think the way to go would be to switch from the OAI XML to the Atom XML from the arXiv API (see http://arxiv.org/help/api/index), e.g. http://export.arxiv.org/api/query?id_list=1107.4612 (use source view to see the XML). It has the primary category and cleaner data overall (it also has the abstract). It'll take me some time to get to that, but if you know some JavaScript, feel free to take a look at the translator yourself. You'd only have to modify the API calls ( https://github.com/zotero/translators/blob/master/arXiv.org.js#L85 and https://github.com/zotero/translators/blob/master/arXiv.org.js#L106 ) and the handling of the XML: https://github.com/zotero/translators/blob/master/arXiv.org.js#L115
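To show what the Atom feed offers, here is a rough Python sketch that reads the primary category from an entry (the translator would do the equivalent in JavaScript). The XML is a trimmed, hand-written sample in the shape of an API response, and `atom_categories` is a made-up helper name; only the two namespaces and the `arxiv:primary_category` / `category` elements are taken from the arXiv API.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written sample in the shape of an arXiv API (Atom) response.
SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <arxiv:primary_category term="gr-qc" scheme="http://arxiv.org/schemas/atom"/>
    <category term="gr-qc" scheme="http://arxiv.org/schemas/atom"/>
    <category term="astro-ph.CO" scheme="http://arxiv.org/schemas/atom"/>
    <category term="hep-th" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def atom_categories(xml_text):
    """Return (primary, secondary) categories of the first entry in the feed."""
    root = ET.fromstring(xml_text)
    entry = root.find("atom:entry", NS)
    if entry is None:
        return None, []
    primary_el = entry.find("arxiv:primary_category", NS)
    primary = primary_el.get("term") if primary_el is not None else None
    # The primary category is repeated among the <category> elements, so drop it.
    terms = [c.get("term") for c in entry.findall("atom:category", NS)]
    secondary = [t for t in terms if t != primary]
    return primary, secondary


primary, secondary = atom_categories(SAMPLE_ATOM)
print(primary, secondary)
```

Unlike the OAI record, the category terms here are already the abbreviations, so no name lookup is needed.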
You are right, this source is much better. I have never really touched JavaScript, but I will see if I can do something.
Don't spend too much time on it unless you're in a hurry. I'll do this eventually, but it's going to take a couple of months.
And here's a quick primer on translator coding: http://www.zotero.org/support/dev/translators/coding In particular, you'll want to use Scaffold, which makes testing this much, much faster.
I was wondering how the "publication" field for arXiv entries is filled. At first I thought that the first category listed was the main one, while the ones after it, prefixed with "physics:" (in the example below), were the secondary categories.
But take the following paper: http://arxiv.org/abs/0912.5286 The publication field is "arXiv: 0912.5286 [astro-ph, physics:gr-qc, physics:hep-th]". In contrast, the category list on arXiv is "gr-qc, astro-ph(.CO), hep-th", where gr-qc is the primary category.
I need to know precisely which category is the primary class, because I have a script that parses the bib file exported from Zotero in order to insert the various eprint fields (archivePrefix, eprint, primaryClass). (Note that I don't know whether it is easy to find, since at this point the categories are all collected at the same time.)
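For illustration, the kind of parsing involved might look like the Python sketch below. This is not the actual script: the regular expression and the names `parse_publication` and `eprint_fields` are assumptions based only on the publication string quoted above. It extracts the identifier and the category list, but the list order does not reveal the primary class, which is the problem described here.

```python
import re

# Assumed shape of the Zotero publication field: "arXiv: <id> [<cat>, <cat>, ...]".
PUBLICATION_RE = re.compile(r"^arXiv:\s*(?P<eprint>\S+)\s*\[(?P<cats>[^\]]*)\]$")


def parse_publication(field):
    """Return (eprint id, categories) or None if the field has another shape."""
    match = PUBLICATION_RE.match(field.strip())
    if match is None:
        return None
    # Strip a set prefix such as "physics:" from each category.
    cats = [c.strip().split(":")[-1] for c in match.group("cats").split(",") if c.strip()]
    return match.group("eprint"), cats


def eprint_fields(eprint, primary_class):
    """Build the eprint fields once the primary class is known from elsewhere."""
    return {"archivePrefix": "arXiv", "eprint": eprint, "primaryClass": primary_class}


eprint, cats = parse_publication("arXiv: 0912.5286 [astro-ph, physics:gr-qc, physics:hep-th]")
print(eprint, cats)  # cats[0] is astro-ph here, although the primary class is gr-qc
print(eprint_fields(eprint, "gr-qc"))  # primary class supplied by hand
```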