Better category support for arxiv translator

melsophos commented 10 years ago

I was wondering how the "publication" field for arXiv entries is filled. At first I thought that the first item listed was the main category, and that the following ones, prefixed with "physics:" (in the example below), were the secondary categories.

But take the following paper: http://arxiv.org/abs/0912.5286. The publication field is "arXiv: 0912.5286 [astro-ph, physics:gr-qc, physics:hep-th]". By contrast, arXiv's own category list is "gr-qc, astro-ph(.CO), hep-th", where gr-qc is the primary category.

I need to know precisely which category is the primary class, because I have a script that parses the bib file exported from Zotero in order to insert the various eprint fields (archivePrefix, eprint, primaryClass). (Note that I don't know whether the primary class is easy to determine, since at this point the categories are all collected at the same time.)

It would also be better to keep the subcategory extension, e.g. ".CO" after astro-ph here.
I am not sure it is a good idea to insert the "physics:" prefix, because only some categories fall under physics, judging by the arXiv main page.
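
To make the problem concrete, here is a minimal sketch of how a post-processing script might split the publication field quoted above. The function name `parse_publication` and the regular expression are hypothetical, not taken from my actual script or from the translator, and the sketch assumes the field always has the shape "arXiv: ID [cat, cat, ...]".

```python
import re

# Hypothetical pattern for a field such as
# "arXiv: 0912.5286 [astro-ph, physics:gr-qc, physics:hep-th]".
PUBLICATION_RE = re.compile(r"arXiv:\s*(?P<eprint>\S+)\s*\[(?P<cats>[^\]]*)\]")


def parse_publication(field):
    """Split the publication field into eprint fields, or return None."""
    match = PUBLICATION_RE.search(field)
    if match is None:
        return None
    # Drop any "physics:"-style prefix and keep only the category code.
    categories = [
        item.strip().split(":")[-1]
        for item in match.group("cats").split(",")
        if item.strip()
    ]
    return {
        "archivePrefix": "arXiv",
        "eprint": match.group("eprint"),
        "categories": categories,
    }


example = parse_publication(
    "arXiv: 0912.5286 [astro-ph, physics:gr-qc, physics:hep-th]"
)
print(example["categories"])
```

The list comes back in the order stored in the field, so for this paper the first element is astro-ph even though the primary category is gr-qc. That is why primaryClass cannot be filled reliably from this field alone.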

adam3smith commented 10 years ago

We get the categories from arXiv's XML, e.g. for your item: http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai%3AarXiv.org%3A0912.5286&metadataPrefix=oai_dc (under setSpec). I'm not attached to the current publication field, but I'd much prefer not to scrape anything off the rather unstructured arXiv item page. I'd probably be happy to implement any recommendations you have based on the XML data.

melsophos commented 10 years ago

Thank you for your answer.

So I took a look at the XML (and at some other examples, like 1311.1190), and it seems that the main category is given by the first dc:subject, with the others sorted alphabetically. The problem is that the category is given by its full name and not its abbreviation…

adam3smith commented 10 years ago

I think the way to go would be to switch to the Atom XML rather than the OAI one, e.g. http://arxiv.org/help/api/index and http://export.arxiv.org/api/query?id_list=1107.4612 (use source view to see the XML). It has the primary category and cleaner data overall (it also has the abstract). It'll be some time before I get to that, but if you know some JavaScript, feel free to take a look at the translator yourself. You'd only have to modify the API calls ( https://github.com/zotero/translators/blob/master/arXiv.org.js#L85 and https://github.com/zotero/translators/blob/master/arXiv.org.js#L106 ) and the handling of the XML: https://github.com/zotero/translators/blob/master/arXiv.org.js#L115
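
As a rough illustration of why the Atom feed is easier to work with, here is a small sketch (in Python rather than the translator's JavaScript) that reads the primary and secondary categories from an entry. The XML fragment is hand-written and trimmed to the relevant elements, using the categories mentioned earlier in this thread; it is not a captured API response. The function name `split_categories` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the arXiv API Atom feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written, trimmed fragment (assumption: a real response has many
# more elements and attributes per entry).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom"
                            term="gr-qc"/>
    <category term="gr-qc"/>
    <category term="astro-ph.CO"/>
    <category term="hep-th"/>
  </entry>
</feed>"""


def split_categories(xml_text):
    """Return (primary, secondaries) for the first entry in the feed."""
    root = ET.fromstring(xml_text)
    entry = root.find("atom:entry", NS)
    primary = entry.find("arxiv:primary_category", NS).get("term")
    terms = [cat.get("term") for cat in entry.findall("atom:category", NS)]
    # The primary category is repeated among the <category> elements.
    secondaries = [term for term in terms if term != primary]
    return primary, secondaries


primary, secondaries = split_categories(SAMPLE)
print(primary, secondaries)
```

The category terms here already carry the abbreviation and the subcategory extension (e.g. astro-ph.CO), so no mapping from full names is needed.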

melsophos commented 10 years ago

You are right, this source is much better. I have never really touched JavaScript, but I will try to see if I can do something.

adam3smith commented 10 years ago

Don't spend too much time on it unless you're in a hurry. I'll do this eventually, but it's going to take a couple of months.

adam3smith commented 10 years ago

And here's a quick primer on translator coding: http://www.zotero.org/support/dev/translators/coding. In particular, you'll want to use Scaffold, which makes testing this much, much faster.

zotero / translators

Better category support for arxiv translator #668