Support for outputting CSL JSON formatted metadata

dhimmel commented 6 years ago

It would be nice to have a bibformatter to export ISBN metadata to Citation Style Language (CSL) JSON. This would help us add support for ISBN citations in Manubot: see https://github.com/greenelab/manubot/issues/14.

I'm envisioning being able to do the following:

import isbnlib
isbn = '9780262035613'
metadata = isbnlib.meta(isbn, cache=None)
csl = isbnlib.registry.bibformatters['csl'](metadata)

csl would presumably be a dict or collections.OrderedDict. Alternatively, it could already be dumped as a JSON string (although I think that's less preferable).

CSL JSON is a way of storing bibliographic metadata that is a successor to formats like BibTeX. It's commonly used in scholarly publishing. The documentation isn't great, but here's a schema definition. Here's also some written documentation.

I'm happy to help as needed. In particular, I can help convert the output of isbnlib.meta to CSL JSON. Is there documentation of all the possible keys returned in the output of isbnlib.meta?

xlcnd commented 6 years ago

Thank you for your suggestion.

To answer your question:

isbnlib.meta returns a dictionary with keys ('ISBN-13', 'Title', 'Authors', 'Publisher', 'Year', 'Language') and values as strings (a list of strings for 'Authors').

These fields are common to all providers and are fixed in the library. Even then, 'Language' is NOT used with the builtin 'bibformatters', because for bibliographic citations 'Language' is the language in which the book is written, but that is NOT the meaning of 'Language' in ISBN registries (it is usually the main language of the publisher's country)!

I will take a look at this CSL format and see if it makes sense to install it in the core library as a new block in isbnlib/dev/_fmt.py (probably yes, if it is widely used) or as an add-in.

But please, feel free to have a go!

xlcnd commented 6 years ago

From a quick look at csl-json, it seems that to implement CSL formatting it is only necessary to create a new template in isbnlib/dev/_fmt.py like:

csl = r'''{"type":"book", "id":"$ISBN", "title":"$Title", "issued": {"raw": "$Year"}, "ISBN":"$ISBN", "publisher":"$Publisher", "author": [$AUTHORS]}'''

with post-processing for $AUTHORS:

elif name == 'csl':
    AUTHORS = ', '.join('{"literal": "$"}'.replace("$", a) for a in authors)

Is this a correct CSL-JSON data fragment, and is it enough?
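The template idea above can be tried outside the library. What follows is only a minimal sketch: it assumes the template is filled in with Python's string.Template, which may not match how isbnlib/dev/_fmt.py does it, and format_csl is a hypothetical helper name. The sample record reuses the keys listed earlier in this thread, with values typed by hand.

```python
import json
from string import Template

# Template text from the comment above; $AUTHORS is filled by post-processing.
CSL_TEMPLATE = Template(
    '{"type":"book", "id":"$ISBN", "title":"$Title", '
    '"issued": {"raw": "$Year"}, "ISBN":"$ISBN", '
    '"publisher":"$Publisher", "author": [$AUTHORS]}'
)


def format_csl(metadata):
    """Hypothetical helper: render a metadata dict through the template."""
    # One {"literal": ...} object per author, joined into a JSON array body.
    authors = ', '.join(
        '{"literal": "$"}'.replace("$", a) for a in metadata['Authors']
    )
    return CSL_TEMPLATE.substitute(
        ISBN=metadata['ISBN-13'],
        Title=metadata['Title'],
        Year=metadata['Year'],
        Publisher=metadata['Publisher'],
        AUTHORS=authors,
    )


# Hand-typed record shaped like the dictionary described in this thread.
sample = {
    'ISBN-13': '9780321534965',
    'Title': 'The Art Of Computer Programming',
    'Authors': ['Donald Ervin Knuth'],
    'Publisher': 'Addison-Wesley',
    'Year': '2008',
    'Language': '',  # not used by the template
}

csl_text = format_csl(sample)
record = json.loads(csl_text)  # parses for this well-behaved input
print(record['author'])
```

Parsing the result with json.loads is a cheap way to confirm that the fragment is at least well-formed JSON for a given record; it says nothing about whether a CSL processor will accept it.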

dhimmel commented 6 years ago

Agree with the general strategy. A few points / questions:

1. You probably want to set date_parts rather than raw for issued. It'd be like {'date_parts': [[year]]}. It's an odd format, but that's what most styles will implement.
2. I'm not a fan of hardcoding the JSON structure. Instead, I'd construct a dict/OrderedDict and then json.dumps it. For example, will the implementation above escape problematic characters in the fields?
3. Are all values guaranteed to be populated? If not, it's better to omit that key-value pair entirely, rather than have a blank value.
4. Setting URL may also be nice... is there a way to get a URL for an ISBN?
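The dict-based approach from these points might look roughly like the sketch below. It is an illustration, not isbnlib code: meta_to_csl is a hypothetical name, and the sample values are made up. One assumption to flag: the thread writes date_parts, but the CSL JSON schema spells the key date-parts (with a hyphen), so the sketch uses the hyphenated form.

```python
import json


def meta_to_csl(metadata):
    """Hypothetical helper: build a CSL-style dict from a metadata dict."""
    isbn = metadata.get('ISBN-13')
    item = {
        'type': 'book',
        'id': isbn,
        'ISBN': isbn,
        'title': metadata.get('Title'),
        'publisher': metadata.get('Publisher'),
    }
    # Skip blank author names instead of emitting empty objects.
    authors = [{'literal': name} for name in metadata.get('Authors', []) if name]
    if authors:
        item['author'] = authors
    year = str(metadata.get('Year') or '').strip()
    if year.isdigit():
        # The CSL JSON schema spells this key with a hyphen: "date-parts".
        item['issued'] = {'date-parts': [[int(year)]]}
    # Omit keys whose value is empty rather than emitting blanks.
    return {key: value for key, value in item.items() if value}


# Made-up record: quotes in the title and an empty publisher.
sample = {
    'ISBN-13': '9780262035613',
    'Title': 'A "Quoted" Title',
    'Authors': ['Ada Example'],
    'Publisher': '',
    'Year': '2008',
}

csl = meta_to_csl(sample)
print(json.dumps(csl, indent=2))
```

Because json.dumps does the serialization, quotes and other special characters in the fields are escaped automatically, and the empty publisher is dropped instead of appearing as a blank string.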

xlcnd commented 6 years ago

Here is my reply point-by-point:

1. Changing to date_parts is OK, especially if that is what most styles implement.
2. Hardcoding JSON (and other formats) was a choice made to simplify the production of small data fragments in some popular bibliographic formats. For most casual users that is all they need. However, the main goal wasn't formatting but providing metadata. You can always format the data from isbnlib.meta as you wish. Some cleaning is done on the data to avoid some obvious problematic cases, but no validation is attempted.
3. Some fields are mandatory, but no filtering is done to 'clean' the empty ones.
4. A URL is very hard to provide consistently: by default the metadata for each ISBN is obtained from several providers! Some don't provide a URL, and in some cases the URL is not deterministic: it depends on the region and the user (e.g. Google Books)!
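To make the escaping question concrete, here is a small comparison of plain template substitution against json.dumps. It is a sketch under stated assumptions: the one-field template is a cut-down stand-in, not isbnlib's actual template, it does not include whatever cleaning isbnlib applies before formatting, and the title is a made-up value.

```python
import json
from string import Template

# Cut-down stand-in for a hardcoded template; not isbnlib's actual code.
template = Template('{"title":"$Title"}')
title = 'The "Quoted" Title'  # made-up value containing double quotes

templated = template.substitute(Title=title)  # quotes inserted verbatim
dumped = json.dumps({'title': title})         # quotes escaped as \"


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(templated))
print(is_valid_json(dumped))
```

With an unescaped double quote in the title, the templated string no longer parses, while the json.dumps output does. Any cleaning step applied before substitution would need to cover cases like this one.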

But maybe it is better to implement this as a plug-in rather than in the core of isbnlib, because:

- formatting is not the main goal of the library,
- isbnlib already supports a general-purpose BibJSON format, and with some simple post-processing you can get CSL-JSON from it.

xlcnd commented 6 years ago

Anyway, I have already implemented a 'simple' version to support 'CSL-JSON'! It produces things like this:

{"type":"book",
        "id":"9780321534965",
     "title":"The Art Of Computer Programming",
    "author": [{"literal": "Donald Ervin Knuth"}],
    "issued": {"date_parts": [["2008"]]},
      "ISBN":"9780321534965",
 "publisher":"Addison-Wesley"}

Is this a valid CSL document? Is this useful?

dhimmel commented 6 years ago

@xlcnd that would be useful. If you open a PR, I'd be happy to review. The only issue that I see presently is that 2008 should not be quoted. It should be an int.

xlcnd commented 6 years ago

It's already in the dev branch. Year is now an int.

xlcnd / isbnlib

Support for outputting CSL JSON formatted metadata #48