Pull license information from CrossRef

wpoa / OA-signalling

A project to coordinate implementing a system to signal whether references cited on Wikipedia are free to reuse

https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Open_Access/Signalling_OA-ness

GNU General Public License v3.0


Pull license information from CrossRef #12

Open Daniel-Mietchen opened 10 years ago

Daniel-Mietchen commented 10 years ago

They plan to have the info available in (Northern) spring 2014.

wrought commented 10 years ago

Any progress on this data service?

wrought commented 10 years ago

Ping.

Moving to Phase 1B milestone.

gbilder commented 10 years ago

API call is

http://api.crossref.org/works/{doi}

So, for example:

http://api.crossref.org/works/10.1155/2013/530651

See the license section of the resulting JSON.

See the API documentation for more details.

Report problems at the issue tracker
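
For readers who want to script this lookup, here is a minimal Python sketch. It assumes the record is wrapped in a top-level "message" object that carries the "license" list. The helper names (works_url, extract_licenses, fetch_licenses) are made up for illustration and are not part of any CrossRef client.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_ROOT = "http://api.crossref.org"  # base URL as given in this thread


def works_url(doi: str) -> str:
    """Build the /works/{doi} URL; the slashes inside the DOI are kept as-is."""
    return f"{API_ROOT}/works/{quote(doi, safe='/')}"


def extract_licenses(payload: dict) -> list:
    """Return the license entries of one work record, or [] if there are none.

    Assumption: the record sits under a top-level "message" key.
    """
    return payload.get("message", {}).get("license", [])


def fetch_licenses(doi: str, timeout: float = 10.0) -> list:
    """Fetch one work record over the network and return its license entries."""
    with urlopen(works_url(doi), timeout=timeout) as response:
        return extract_licenses(json.load(response))
```

Calling fetch_licenses("10.1155/2013/530651") performs the request shown above. An empty list would mean that no license metadata has been deposited for that DOI.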

Daniel-Mietchen commented 10 years ago

Thanks, Geoffrey.

wrought commented 10 years ago

Seems like Hindawi is fully compliant, others a long way off: http://participation.labs.crossref.org/features/tdm

Another example with license info http://api.crossref.org/works/10.1155/2014/945364

"license":[{"content-version":"vor","delay-in-days":0,"start":{"date-parts":[[2014,1,1]],"timestamp":1388534400000},"URL":"http:\/\/creativecommons.org\/licenses\/by\/3.0\/"}]

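To make the fields in that fragment easier to read, here is a small Python sketch that parses it and flattens each entry. The fragment is wrapped in braces so that it is valid JSON on its own. Reading "timestamp" as milliseconds since the Unix epoch is an assumption based on the value agreeing with the "date-parts" next to it. The function name summarize_license is hypothetical.

```python
import json
from datetime import datetime, timezone

# The fragment quoted above, wrapped in braces so it parses as a JSON object.
# A raw string keeps the \/ escapes exactly as the API returned them.
SAMPLE = r'''{"license":[{"content-version":"vor","delay-in-days":0,"start":{"date-parts":[[2014,1,1]],"timestamp":1388534400000},"URL":"http:\/\/creativecommons.org\/licenses\/by\/3.0\/"}]}'''


def summarize_license(entry: dict) -> dict:
    """Flatten one license entry into a small dict with plain values."""
    start_ms = entry["start"]["timestamp"]  # assumed: milliseconds since epoch
    start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    return {
        "url": entry["URL"],
        "version": entry.get("content-version"),
        "delay_days": entry.get("delay-in-days"),
        "start": start.date().isoformat(),
    }


summaries = [summarize_license(e) for e in json.loads(SAMPLE)["license"]]
print(summaries)
```

For this sample the start date comes out as 2014-01-01, which matches the "date-parts" value.
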
gbilder commented 10 years ago

Hindawi is always pretty agile in adopting new features as they control all their own tech. Other publishers will start submitting this info soon. This hockey-stick pattern is typical for CrossRef initiatives, as many publishers need to modify their third-party production systems before they can supply the data in bulk. I know most of the bigger publishers (Elsevier, Springer, PLOS, T&F) are working on this now. I am guessing that roughly half of CrossRef metadata will have this info in the next 6-9 months.

wrought commented 10 years ago

@gbilder Good to hear, thanks for the update!

I think we're all coming to this realistically too: nothing will happen overnight. Indeed, there are many parties that have to change systems and processes on their own respective schedules.

Curious to hear if you have any info on sources of scraped license data.

gbilder commented 10 years ago

Nothing specific, just that scraping is always fragile and error-prone. But I do it myself in the absence of an API or metadata, so I can hardly complain. I suppose I would generally advise that one try several passes: first through any existing APIs (e.g. CrossRef/DataCite), second through screen scraping, third through supervised screen scraping (allow a human to confirm), and lastly through manual updating. I expect that relatively quickly, the last three techniques will become fallback exceptions, at least for formal scholarly articles.
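
The ordered passes described here could be wired together roughly as follows. This is only a sketch of the fallback idea: the four resolver functions are hypothetical stubs with hard-coded results, and the DOI is a placeholder, so nothing below reflects real license data.

```python
from typing import Callable, Optional, Sequence, Tuple

# A resolver takes a DOI and returns a license URL, or None if it found nothing.
Resolver = Callable[[str], Optional[str]]


def resolve_license(
    doi: str, passes: Sequence[Tuple[str, Resolver]]
) -> Optional[Tuple[str, str]]:
    """Try each pass in order; return (pass_name, license_url) for the first hit."""
    for name, resolver in passes:
        try:
            url = resolver(doi)
        except Exception:
            # Scraping is fragile: treat a failing pass as a miss and move on.
            url = None
        if url:
            return name, url
    return None


# Hypothetical stand-ins for the four passes (illustration only).
def from_api(doi: str) -> Optional[str]:
    return None  # pretend the API has no license metadata for this DOI


def from_scrape(doi: str) -> Optional[str]:
    raise RuntimeError("page layout changed")  # pretend the scraper broke


def from_supervised_scrape(doi: str) -> Optional[str]:
    return "http://creativecommons.org/licenses/by/3.0/"  # pretend a human confirmed


def from_manual_table(doi: str) -> Optional[str]:
    return None


PASSES = [
    ("api", from_api),
    ("scrape", from_scrape),
    ("supervised-scrape", from_supervised_scrape),
    ("manual", from_manual_table),
]

result = resolve_license("10.1234/example", PASSES)  # placeholder DOI
print(result)
```

Returning the name of the pass along with the license URL keeps track of how trustworthy each answer is, which matters if scraped values are later compared against publisher-submitted ones.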

wrought commented 10 years ago

Ah, indeed, I was thinking that at the very least scraped data could be helpful for naive verification of publisher-submitted license data. What's the hit rate? What is the relative coverage?

It's possible others have aggregated some of this data already, would be interesting to see.

gbilder commented 10 years ago

You can see coverage fairly easily:

compare:

http://api.crossref.org/members/98

http://api.crossref.org/members/78

wrought commented 10 years ago

Ah, cool, that is handy. However, I meant the relative coverage between the license data that is publicly available and discoverable (via scraping) and the license data submitted by the publisher, rather than the coverage of submitted license data relative to the works those publishers have registered with DOIs.

Perhaps an exercise in futility, but it could give you a better idea of the range of articles for which there is currently no easily obtainable license information, and for which publisher-submissions would reveal new information, and the rate it changes over time.

Daniel-Mietchen commented 10 years ago

@gbilder The link http://participation.labs.crossref.org/features/tdm provided above to track progress on providing license information does not seem to be persistent. Any pointers on how to get an update?

gbilder commented 10 years ago

Hmm. We had a server die and are still migrating over links.

You can get the same data via the API like this:

http://api.crossref.org/members/78/works?filter=has-license:true,has-full-text:true&rows=0

We will fix the link ASAP.
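
Until the overview page is back, a per-publisher coverage figure can be approximated from two such count-only queries, one with the filter and one without. The Python sketch below assumes the hit count is reported as "total-results" inside "message". The function names are hypothetical, and the numbers in the example call are invented, not real counts.

```python
from urllib.parse import urlencode

API_ROOT = "http://api.crossref.org"  # base URL as given in this thread


def member_count_url(member_id: int, filters: str = "") -> str:
    """URL for a count-only (rows=0) query over one member's works."""
    params = {"filter": filters, "rows": 0} if filters else {"rows": 0}
    # Keep ':' and ',' unescaped so the URL looks like the one quoted above.
    return f"{API_ROOT}/members/{member_id}/works?{urlencode(params, safe=':,')}"


def total_results(payload: dict) -> int:
    """Read the hit count; assumes it is reported as message.total-results."""
    return int(payload["message"]["total-results"])


def coverage(with_license: dict, all_works: dict) -> float:
    """Share of a member's works matching the filter (0.0 if none are registered)."""
    total = total_results(all_works)
    return total_results(with_license) / total if total else 0.0


print(member_count_url(78, "has-license:true,has-full-text:true"))
# Invented counts, only to show the arithmetic:
print(coverage({"message": {"total-results": 25}},
               {"message": {"total-results": 100}}))
```

Running the two URLs for each member ID of interest and feeding the parsed responses to coverage would give a rough cross-publisher overview.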

Daniel-Mietchen commented 10 years ago

@gbilder any news on this? The API call is nice but not very handy to get an overview of the progress across publishers.

Daniel-Mietchen commented 10 years ago

Just had a chat with @gbilder who pointed me to https://github.com/CrossRef/rest-api-doc/blob/master/rest_api_tour.md , which explains the API in a very digestible fashion.

For instance, http://api.crossref.org/licenses provides an overview of licenses used, whereas the number of articles available under http://creativecommons.org/licenses/by/3.0/ can be gauged from http://api.crossref.org/works?rows=0&filter=license.url:http://creativecommons.org/licenses/by/3.0/ , and http://api.crossref.org/works?rows=100&filter=license.url:http://creativecommons.org/licenses/by/3.0/ provides the first 100 of these.
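
A short Python sketch for building these two queries and pulling DOIs out of a result page. It assumes that list responses keep their records under "message" and "items", each with a "DOI" key. The names works_by_license_url and dois are hypothetical.

```python
from urllib.parse import urlencode

API_ROOT = "http://api.crossref.org"  # base URL as given in this thread
CC_BY_3 = "http://creativecommons.org/licenses/by/3.0/"


def works_by_license_url(license_url: str, rows: int) -> str:
    """URL for works carrying the given license URL; rows=0 returns only a count."""
    params = {"rows": rows, "filter": f"license.url:{license_url}"}
    # Keep ':' and '/' unescaped so the result matches the URLs quoted above.
    return f"{API_ROOT}/works?{urlencode(params, safe=':/')}"


def dois(payload: dict) -> list:
    """Collect DOIs from a list response; assumes records live in message.items."""
    items = payload.get("message", {}).get("items", [])
    return [item["DOI"] for item in items if "DOI" in item]


count_url = works_by_license_url(CC_BY_3, rows=0)
first_100_url = works_by_license_url(CC_BY_3, rows=100)
print(count_url)
print(first_100_url)
```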

To check performance, see http://search.crossref.org/help/status .

pinging @wrought @notconfusing