Daniel-Mietchen closed 10 years ago
What type of plot do you have in mind?
Also, what type of mismatch detection do you have in mind? After all, we are working with lookup tables right now.
The thing is, I don't really know.
Perhaps concentrate on the licenses interpreted as CC BY and then check the overall number of license text strings or URIs that we have in our collection for that, and what percentage of the CC BY articles indexed in a given period can be explained by them. If that's too complicated, then do it for the publishers whose DOI prefixes are whitelisted.
I am very open to other suggestions...
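One way to read the CC BY suggestion above is as a per-string breakdown: count the distinct license strings or URIs that were interpreted as CC BY, then compute what share of the CC BY articles each one accounts for. The following is only a minimal sketch under that reading. The function name, the (license_text, interpreted_license) record layout and the "cc-by" label are hypothetical and not taken from OAMI.

```python
from collections import Counter


def cc_by_breakdown(articles):
    """Summarise which license strings explain the CC BY articles.

    `articles` is an iterable of (license_text, interpreted_license)
    pairs. This layout and the "cc-by" label are assumptions made for
    illustration, not OAMI's actual data model.

    Returns (number_of_distinct_strings, {license_text: percent}).
    """
    counts = Counter(
        text for text, interpreted in articles if interpreted == "cc-by"
    )
    total = sum(counts.values())
    if total == 0:
        return 0, {}
    # Most frequent strings first, each as a percentage of all CC BY articles.
    shares = {text: 100.0 * n / total for text, n in counts.most_common()}
    return len(counts), shares
```

Fed with the articles indexed in a given period, the resulting percentages could go straight into a bar plot. Restricting the input to whitelisted DOI prefixes would be a filter applied before calling the function.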
I can plot simple licensing data if you want, like in http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/pmc-licenses.png. Licensing detection by publisher could also be interesting http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/pmc-sample-licensing-by-publisher.png.
Let's take both. What were the commands used to generate them?
http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/pmc-licenses.png is fine as is, I think.
http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/pmc-sample-licensing-by-publisher.png needs a few brushes:
Would be nice to have a combination of the two plots too, along the lines of "recognized by OAMI as compatible with reuse on Wikimedia Commons".
I currently have no idea how they were generated. Let me look into the plot helper.
I don't think I can regenerate pmc-sample-licensing-by-publisher.png on short notice, since I cannot find the database containing the metadata for >300000 materials.
I may be able to do it over the period database we have for one week. Don't know how long I'll be awake, but let me look at it.
I know now how those pictures were generated. However, there is some weird error in the “oa-cache stats” functionality: it does not recognize “http://creativecommons.org/licenses/by/2.0/uk/” as a free license even when I introduce it into config.free_license_urls. Instead, it applies the status change to the license above it. Since it is the second-most-often-used license in the first week of May, this almost certainly means we did not pay attention to localized versions of licensing URLs being used.
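For the localized license URLs mentioned above, one option would be to match on the license terms and ignore the jurisdiction suffix, instead of listing every localized URL. This is only a sketch of that idea and not how “oa-cache stats” or config.free_license_urls work. The regular expression, the FREE_TERMS set and the is_free_license name are assumptions made for illustration.

```python
import re

# Matches Creative Commons license URLs with an optional jurisdiction
# part, e.g. .../licenses/by/2.0/ or .../licenses/by/2.0/uk/
CC_URL = re.compile(
    r"^https?://creativecommons\.org/licenses/"
    r"(?P<terms>[a-z-]+)/(?P<version>\d+\.\d+)"
    r"(?:/(?P<jurisdiction>[a-z]+))?/?$"
)

# Assumption: only attribution and share-alike terms count as free here.
FREE_TERMS = {"by", "by-sa"}


def is_free_license(url):
    """Return True if `url` is a CC license URL with free terms,
    regardless of jurisdiction suffix or trailing slash."""
    match = CC_URL.match(url.strip().lower())
    return bool(match) and match.group("terms") in FREE_TERMS
```

With this, “http://creativecommons.org/licenses/by/2.0/uk/” and the same URL without the trailing slash are both treated like the unported CC BY 2.0 URL.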
I'm sorry, but I have not slept enough due to an emergency last night and need to take care of my headache now.
Missing publisher names added in commit 9eff58ff7cd4d126eaea96407359fe28c0de4b39.
Recreated both images for the first week of May:
% ./oa-cache stats pmc_pmcid | ./plot-helper
Counting supplementary materials … 6306 supplementary materials found.
100% |#########################################################################|
No prefix found in row 405 of doi_pref.tsv: „Brill Academic Publishers (Logos International Publishing Education“.
No prefix found in row 831 of doi_pref.tsv: „Institute for Computer Sciences, Social Informatics and“.
No prefix found in row 855 of doi_pref.tsv: „Institute of Organic Chemistry & Biochemistry, Academy of Sciences of the“.
No prefix found in row 862 of doi_pref.tsv: „Institute of Systematics and Evolution of Animals, Polish Academy of“.
No prefix found in row 873 of doi_pref.tsv: „Instituto Nacional de Investigacion y Tecnologia Agraria y Alimentaria“.
No prefix found in row 891 of doi_pref.tsv: „International Association of Chinese Professionals in Global Positioning“.
No prefix found in row 1316 of doi_pref.tsv: „Korean Society of Hematology; Korean Society of Blood and Marrow“.
Wrote figure to “licenses.png”.
Wrote figure to “mimetypes-licensing-by-publisher.png”.
I used the GIMP to edit the color of the bar for “http://creativecommons.org/licenses/by/2.0/uk” as it was orange instead of blue: http://daten.dieweltistgarnichtso.net/tmp/pics/licenses.png
This seems useless (to me) due to the small sample size: http://daten.dieweltistgarnichtso.net/tmp/pics/mimetypes-licensing-by-publisher.png
Is there anything here we can use? Daniel? Never mind, I see that you linked here from the paper.
Paper submitted. Closing here, with the option of opening a new ticket for the talk.
It would be good to have a figure that represents license mismatches. Just don't know what best to plot there.
Should be done in conjunction with MIME type stats, as per https://github.com/erlehmann/open-access-media-importer/issues/102
This would mean looping over a certain period, e.g. six months.
Pinging https://github.com/erlehmann/open-access-media-importer/issues/97 .