xlcnd / isbnlib

python library to validate, clean, transform and get metadata of ISBN strings (for devs).
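For context on what "validate" means here: an ISBN-13 carries a check digit computed over alternating 1/3 weights. A minimal sketch (not isbnlib's actual implementation; the sample ISBN is the standard example from the ISBN specification):

```python
def is_valid_isbn13(isbn):
    """Check the ISBN-13 check digit: the weighted sum of all 13
    digits (weights alternating 1, 3) must be divisible by 10."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

print(is_valid_isbn13("978-0-306-40615-7"))  # True
print(is_valid_isbn13("9780306406150"))      # False (bad check digit)
```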

WorldCat's xISBN service to be retired #28

Closed SafeEval closed 7 years ago

SafeEval commented 8 years ago

Heads up, OCLC announced that the xID services (including xISBN) are going to be retired on 3/15/16. After that, the WorldCat integration in isbnlib will be broken.

http://www.oclc.org/developer/news/2015/change-to-xid-services.en.html

xlcnd commented 8 years ago

Thanks, meanwhile I will try to find an alternative.

xlcnd commented 8 years ago

Until now, isbnlib has relied on the xISBN service by default. With its retirement, the alternatives are more cumbersome... and some of the information is not available in the free tier! So, by default, more and more information will come from OpenLibrary and Google Books.

Expect a HUGE drop in data quality for non-US books.

xlcnd commented 8 years ago

The alternative to xISBN is the free tier of the OCLC web service. However, you will not get some fields of the core metadata, like Publisher, and even Year is not consistent...

For US books these fields are available from other sources (Google Books, OpenLibrary, ...), but for non-US books xISBN is the only consistent source available!

Please help me convince oclc.org to provide these data in the free tier of the OCLC service by sending an email to chirakot@oclc.org.

ghost commented 8 years ago

I just mailed Tony Chirakos.

xlcnd commented 8 years ago

Thanks.

vindarel commented 8 years ago

Hi there, I'm the developer of Abelujo, software for bookshops. I'm French, and because the results for non-US books from the various ISBN sources I know are pretty poor, and because I wanted to take the information directly from where it lives, I went my own way: I scrape specific library websites. The result is the bookshops library. We can search either by ISBN or by keywords.

It's been working great for me for more than a year. The search results are excellent (it's not as fast as a real web service, though searching one ISBN's metadata at a time is OK).

I see that some info is now missing from OCLC's free tier, and that in #44 you encourage local metadata providers as plugins. So maybe isbnlib can make good use of bookshops, or maybe not, given the specificities of each. Shall we talk more about it?

xlcnd commented 8 years ago

Hi,

Thanks for your interest in isbnlib.

I took a look at your (excellent) bookshops library and I think it would be nice if we could build some addins from it. However, I see it doesn't support Python 3 and has a lot of dependencies. These make it difficult to use bookshops as a direct dependency for possible addins! But a custom version of the scrapers could make very good metadata addins...

Just now I am very busy, but in October I will work on that.

Meanwhile, if you want to have a go, the idea is to use https://github.com/xlcnd/isbnlib/blob/dev/isbnlib/_wcat.py as a template and write a specific parser using your scraper code.

What do you think?

(ref #44)

vindarel commented 8 years ago

Hey, so I'm having a look at `_wcat.py`; it's a little file, and I wonder how all the features will fit in :p What are the features an isbnlib addon provides, btw?

Another thought: with OCLC, for instance, you request a service, so you only have to parse the XML result. In the case of bookshops, I have to parse HTML. Maybe that's where the difference in dependencies comes from? (I didn't find BeautifulSoup or lxml, the typical web-scraping libs, in isbnlib's requirements list, and they are not included by default in Python 3, AFAIK.)

I need a better look at the plugin system, but I'm thinking about a difficulty I had: all the information I need doesn't necessarily appear together on the search results page. When we fire an HTTP request to search by keywords, we get a list of results, but only summaries (example). In the case of the French library, "luckily" we can get everything, including the ISBN and the price (I looked for a website like this), but for the Spanish scraper the ISBN doesn't appear in the summary (example). I have to fire a second request to get the book's details page and extract the ISBN from there. If we search by ISBN, we may need a second request to get the year, the collection, or another field.
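The two-request flow described above can be sketched as follows. The pages, URLs, and CSS structure here are entirely made up for illustration (a real scraper, as in bookshops, would fetch over HTTP and parse with lxml or BeautifulSoup rather than regexes):

```python
import re

# Inlined stand-ins for the two pages: the search results page only
# carries a summary (title + link), the detail page carries the ISBN.
SEARCH_RESULTS_HTML = '<div class="result"><a href="/book/42">Cien años de soledad</a></div>'
DETAIL_PAGE_HTML = '<h1>Cien años de soledad</h1><span class="isbn">9788437604947</span>'

def fetch(url):
    # Stand-in for an HTTP GET.
    return DETAIL_PAGE_HTML if "/book/" in url else SEARCH_RESULTS_HTML

def search(keywords):
    # First request: only a summary is available (no ISBN on this page).
    page = fetch("/search?q=" + keywords)
    m = re.search(r'<a href="([^"]+)">([^<]+)</a>', page)
    return {"title": m.group(2), "details_url": m.group(1)}

def get_isbn(result):
    # Second request: the book's details page carries the ISBN.
    page = fetch(result["details_url"])
    return re.search(r'<span class="isbn">(\d{13})</span>', page).group(1)

result = search("cien años")
print(get_isbn(result))  # 9788437604947
```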

About Python 3: hell yes, I must and plan to make the lib Python 3 compatible. I should do that in the next couple of months too.

If we write an addon to scrape the same website as bookshops does, we would have to maintain two packages that do the same job. That may not be a big deal (the site hasn't changed in more than a year, and if it changes we may only have to update a CSS selector or an XPath expression), but still. Maybe the two libs could share the CSS/XPath? (I'm thinking out loud here.) Could bookshops be an optional dependency of isbnlib?

Lastly, web scraping may or may not raise legal issues and concerns among users. What do you think about that?

Hope that makes sense :)

(and thanks for the nice word :) )

xlcnd commented 8 years ago

Hi,

Let me start by explaining why some design decisions were taken for isbnlib.

Probably, by number, the main users of isbnlib are people (teachers, students, researchers, …) who use the package isbntools (or other more advanced bibliographic software) to manage small bibliographies.

To serve this target, isbnlib focused on three main features:

Later, to complement these, features to handle DOI references, book summaries and cover images were added too.

For the selection of metadata fields, the main criterion was to use fields of the Dublin Core for which the main data sources (xISBN, Google Books, Open Library) provided good data quality. Some fields, like Language and Price, were left out because they didn't conform to this criterion. This explains the canonical fields: ISBN, Title, Authors, Publisher and Year.

One other important goal was that the library would handle any reference independently of its language. So xISBN, which has metadata for books from all over the world, was selected as the default metadata provider. However, other providers were included (Google Books, Open Library and isbndb.com), and the possibility for users to add other sources was always there.

Now I made it easier to add extensions with addins:

Metadata addins need to implement a function query(isbn) that returns a dictionary with the canonical data (ISBN, Title, Authors, Publisher, Year). That's all that is needed!

The builtin addins follow the pattern: (1) search by isbn, (2) get the page (html, json, xml, …), (3) extract/parse the main data elements, (4) map and validate the data as canonical. However, this is not mandatory; all that is needed is the above-mentioned function.
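A minimal sketch of such an addin following that four-step pattern. The provider, its response format, and the sample record are hypothetical, and the response is inlined so the sketch is self-contained; a real addin would fetch it over HTTP, as `_wcat.py` does:

```python
import json

# Hypothetical provider response, inlined for the sketch (a real addin
# would GET this from the provider's search-by-isbn endpoint).
SAMPLE_RESPONSE = json.dumps({
    "title": "Far From the Madding Crowd",
    "authors": ["Thomas Hardy"],
    "publisher": "Penguin",
    "year": "2012",
})

def _get_page(isbn):
    # Steps (1)-(2): search by isbn and get the page (json here;
    # could just as well be html or xml).
    return SAMPLE_RESPONSE

def _parse(page):
    # Step (3): extract the main data elements.
    return json.loads(page)

def query(isbn):
    # Step (4): map and validate the data as the canonical dictionary
    # that every metadata addin must return.
    record = _parse(_get_page(isbn))
    return {
        "ISBN-13": isbn,
        "Title": record.get("title", "").strip(),
        "Authors": record.get("authors", []),
        "Publisher": record.get("publisher", ""),
        "Year": record.get("year", ""),
    }

print(query("9780141393384"))
```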

So, my proposal is to implement this function, with minimum requirements, for librairie-de-paris.fr as a start.

What do you think?

P.S. As far as I know, web scraping is 'fair use' in the EU (though many sites make it difficult or practically impossible to scrape the main info); in any case, an explicit warning must be added to the code. I don't know what the situation is in other parts of the world.

vindarel commented 8 years ago

So, my proposal is to implement this function with minimum requirements for librairie-de-paris.fr as a start.

That looks great to me !

ghost commented 7 years ago

I see on http://www.oclc.org/developer/news/2015/change-to-xid-services.en.html that: "The retirement of the xID product will be delayed while we review options for an alternative service that will deliver some of the most important functionality from xID." Certainly, my queries using their service still work.

xlcnd commented 7 years ago

THANKS!

I wasn't aware of that! Great news...