wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

License stats for JATS-Con paper #101

Closed Daniel-Mietchen closed 10 years ago

Daniel-Mietchen commented 10 years ago

It would be good to have a figure that represents license mismatches. Just don't know what best to plot there.

Should be done in conjunction with MIME type stats, as per https://github.com/erlehmann/open-access-media-importer/issues/102

This would mean looping

./oa-get download-metadata pmc_pmcid && \
    ./oa-cache find-media pmc_pmcid && \
    ./oa-get update-mimetypes pmc_pmcid

over a certain period, e.g. six months.

Pinging https://github.com/erlehmann/open-access-media-importer/issues/97 .

erlehmann commented 10 years ago

What type of plot do you have in mind?

erlehmann commented 10 years ago

Also, what type of mismatch detection do you have in mind? After all, we are working with lookup tables right now.

Daniel-Mietchen commented 10 years ago

The thing is, I don't really know.

Perhaps concentrate on the licenses interpreted as CC BY and then check the overall number of license text strings or URIs that we have in our collection for that, and what percentage of the CC BY articles indexed in a given period can be explained by them. If that's too complicated, then do it for the publishers whose DOI prefixes are whitelisted.
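
One way to read the coverage idea above is as a simple counting exercise. The following is only a minimal sketch under assumptions: the function name `lookup_coverage`, the sample license strings and the contents of the `known` set are hypothetical and are not taken from the OAMI code base or from real PMC data.

```python
from collections import Counter

def lookup_coverage(license_strings, known_strings):
    """Return (percentage matched, Counter of unmatched strings).

    license_strings: one license text or URI per indexed article
                     (hypothetical input, e.g. all articles from one period).
    known_strings:   the strings the lookup table interprets as CC BY.
    """
    counts = Counter(license_strings)
    total = sum(counts.values())
    matched = sum(n for s, n in counts.items() if s in known_strings)
    # Keep the unmatched strings so the most frequent gaps can be inspected.
    unmatched = Counter({s: n for s, n in counts.items()
                         if s not in known_strings})
    percentage = 100.0 * matched / total if total else 0.0
    return percentage, unmatched

# Made-up sample data, for illustration only.
BY_3 = "http://creativecommons.org/licenses/by/3.0/"
BY_2_UK = "http://creativecommons.org/licenses/by/2.0/uk/"
FREE_TEXT = "This is an open access article."

sample = [BY_3] * 6 + [BY_2_UK] * 3 + [FREE_TEXT]
known = {BY_3}

percentage, unmatched = lookup_coverage(sample, known)
print(percentage)                # 60.0
print(unmatched.most_common(1))  # the most frequent unexplained string
```

The percentage could be plotted per period, and the most frequent unmatched strings would show which entries are missing from the lookup table.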

I am very open to other suggestions...

erlehmann commented 10 years ago

I can plot simple licensing data if you want, like in http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/pmc-licenses.png. Licensing detection by publisher could also be interesting http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/pmc-sample-licensing-by-publisher.png.

Daniel-Mietchen commented 10 years ago

Let's take both. What were the commands used to generate them?

http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/pmc-licenses.png is fine as is, I think.

http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/pmc-sample-licensing-by-publisher.png needs a few touch-ups:

It would be nice to have a combination of the two plots too, along the lines of "recognized by OAMI as compatible with reuse on Wikimedia Commons".

erlehmann commented 10 years ago

I currently have no idea how they were generated. Let me look into the plot helper.

erlehmann commented 10 years ago

I don't think I can regenerate pmc-sample-licensing-by-publisher.png at short notice, since I cannot find the database containing the metadata for >300000 materials.

erlehmann commented 10 years ago

I may be able to do it using the database we have for a one-week period. Don't know how long I'll be awake, but let me look at it.

erlehmann commented 10 years ago

I now know how those pictures were generated. However, there is some weird error in the “oa-cache stats” functionality: it does not recognize “http://creativecommons.org/licenses/by/2.0/uk/” as a free license even when I add it to config.free_license_urls. Instead, it applies the status change to the license above it. Since it is the second-most-used license in the first week of May, this almost certainly means we did not pay attention to localized versions of licensing URLs being used.
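
A possible way to handle localized license URLs is to strip the jurisdiction part before consulting the list of free licenses. This is only a sketch, not the current OAMI behaviour: the helper names `strip_jurisdiction` and `is_free` are hypothetical, `free_license_urls` is assumed to be a plain collection of URL strings, and treating a ported license like its unported counterpart is itself an assumption that would need checking.

```python
import re

# Creative Commons license URL with an optional jurisdiction segment,
# e.g. http://creativecommons.org/licenses/by/2.0/uk/
CC_URL = re.compile(
    r"^https?://creativecommons\.org/licenses/"
    r"(?P<code>[a-z-]+)/(?P<version>\d\.\d)"
    r"(?:/(?P<jurisdiction>[a-z]+))?/?$"
)

def strip_jurisdiction(url):
    """Return the URL without its jurisdiction segment.

    URLs that do not look like Creative Commons license URLs
    are returned unchanged.
    """
    match = CC_URL.match(url.strip())
    if match is None:
        return url
    return "http://creativecommons.org/licenses/%s/%s/" % (
        match.group("code"), match.group("version"))

def is_free(url, free_license_urls):
    """Check the URL itself first, then its jurisdiction-free form."""
    return (url in free_license_urls
            or strip_jurisdiction(url) in free_license_urls)

# Stand-in for config.free_license_urls; the entries are assumptions.
free_license_urls = ["http://creativecommons.org/licenses/by/2.0/"]

print(is_free("http://creativecommons.org/licenses/by/2.0/uk/",
              free_license_urls))
print(is_free("http://creativecommons.org/licenses/by-nc/3.0/",
              free_license_urls))
```

With this approach the lookup table would only need one entry per license and version, not one per jurisdiction.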

I'm sorry, but I have not slept enough due to an emergency last night and need to take care of my headache now.

erlehmann commented 10 years ago

Missing publisher names added in commit 9eff58ff7cd4d126eaea96407359fe28c0de4b39.

erlehmann commented 10 years ago

Recreated both images for the first week of May:

% ./oa-cache stats pmc_pmcid | ./plot-helper
Counting supplementary materials … 6306 supplementary materials found.
100% |#########################################################################|
No prefix found in row 405 of doi_pref.tsv:
    „Brill Academic Publishers (Logos International Publishing Education“.
No prefix found in row 831 of doi_pref.tsv:
    „Institute for Computer Sciences, Social Informatics and“.
No prefix found in row 855 of doi_pref.tsv:
    „Institute of Organic Chemistry & Biochemistry, Academy of Sciences of the“.
No prefix found in row 862 of doi_pref.tsv:
    „Institute of Systematics and Evolution of Animals, Polish Academy of“.
No prefix found in row 873 of doi_pref.tsv:
    „Instituto Nacional de Investigacion y Tecnologia Agraria y Alimentaria“.
No prefix found in row 891 of doi_pref.tsv:
    „International Association of Chinese Professionals in Global Positioning“.
No prefix found in row 1316 of doi_pref.tsv:
    „Korean Society of Hematology; Korean Society of Blood and Marrow“.
Wrote figure to “licenses.png”.
Wrote figure to “mimetypes-licensing-by-publisher.png”.

erlehmann commented 10 years ago

I used the GIMP to edit the color of the bar for “http://creativecommons.org/licenses/by/2.0/uk” as it was orange instead of blue: http://daten.dieweltistgarnichtso.net/tmp/pics/licenses.png

This seems useless (to me) due to the small sample size: http://daten.dieweltistgarnichtso.net/tmp/pics/mimetypes-licensing-by-publisher.png

Klortho commented 10 years ago

Is there anything here we can use? Daniel? Never mind, I see that you linked here from the paper.

Daniel-Mietchen commented 10 years ago

Paper submitted. Closing here, with the option of opening a new ticket for the talk.