Daniel-Mietchen commented 9 years ago

We are taking part in the Mozilla Open Science sprint (overview) and welcome contributions to any of the software projects here at WikiProject Open Access, in particular to the YouTube exporter (#82) and the Open Access signalling project.

If you are interested in getting involved, please leave a note here, and we will take things from there.

Daniel-Mietchen commented 9 years ago

Here is what I plan to do: get an overview of all the <license> statements within the Open subset of the articles on PubMed Central.

I will update this comment as I move forward.

Day 1

Download the files that contain the XML
- took about half an hour
- total size: 15 GB
unpack: tar -zxvf *.tar.gz
- took over an hour
- total size: 68 GB (after deleting the original gz files)
explored the files in various ways while looking for the use of elements like <permissions>, <inline-formula>, <disp-formula>, <fig>, <ref>, <subj-group>, <kwd-group>
search for license statements:
- grep -ohPR --include="*.nxml" "<license(.*)</license>" .
- with removal of duplicates: grep -ohPR --include="*.nxml" "<license(.*)</license>" | awk '!x[$0]++' > license-statements.txt
- running grep -oHPR --include="*.nxml" "<license(.*)</license>" > license-statements.txt overnight
- I am aware that xmlgrep would be better suited to this, but it's not available on that machine
- noticed a strange way to abbreviate Creative Commons licenses and notified publisher
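
A note on the unpack step above: when several archives match *.tar.gz, the shell expands the glob, and tar treats only the first name as the archive and the remaining names as members to extract. A minimal sketch of a loop that handles one archive per tar call, assuming GNU tar; unpack_all is a made-up helper name, not something used in this thread:

```shell
# Unpack every .tar.gz archive found in a directory, one tar call per archive.
# unpack_all is a hypothetical helper name used only for illustration.
unpack_all() {
    dir="${1:-.}"
    for archive in "$dir"/*.tar.gz; do
        # If nothing matched, the pattern stays literal; skip it.
        [ -e "$archive" ] || continue
        # -C extracts into the same directory that holds the archive.
        tar -xzf "$archive" -C "$dir"
    done
}
```

Calling unpack_all . in the download directory would then unpack each archive in turn; dropping -v also keeps the terminal from scrolling through every file name.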

Day 2

the grep resulted in a license-statements.txt of over 370MB, with license statements from over 700k nxml files (I am not sure why some of the ca. 800k files are missing)
removed the file names and deduplicated the license statements: cut -d ":" -f 2- license-statements.txt | awk '!x[$0]++' | sort > license-statements-without-filenames.txt
- results in a 17MB file with 4062 license statements that differ in their XML character sequence.
- needs cleanup
running grep -oHPR --include="*.nxml" "<license(.*?)</license>" > license-statements.txt overnight
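
One possible reason for the gap between the 700k and 800k files: grep matches line by line, so a <license> element whose opening and closing tags sit on different lines is never captured, and some files may have no <license> element at all. A rough sketch for telling these cases apart, assuming GNU grep and find; count_license_coverage is a hypothetical helper name:

```shell
# Count how many .nxml files fall into each case.
# count_license_coverage is a hypothetical helper name used only for illustration.
count_license_coverage() {
    dir="${1:-.}"
    # All .nxml files below the directory.
    total=$(find "$dir" -type f -name '*.nxml' | wc -l)
    # Files that contain an opening <license tag anywhere.
    with_open=$(grep -rl --include='*.nxml' '<license' "$dir" | wc -l)
    # Files where a complete element sits on a single line,
    # which is what the line-based grep in the notes above can see.
    one_line=$(grep -rlP --include='*.nxml' '<license(.*?)</license>' "$dir" | wc -l)
    # $(( )) strips any padding that wc may add.
    echo "total=$((total)) with_license_tag=$((with_open)) single_line_match=$((one_line))"
}
```

If with_license_tag comes out larger than single_line_match, the difference is the number of files whose license element spans several lines; total minus with_license_tag is the number of files with no license element.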

Klortho commented 9 years ago

grep -ohPR --include="*.nxml" "<license(.*)</license>" .

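If an article has more than one <license> element on the same line, this captures everything from the first opening tag to the last closing tag, including whatever lies between the two elements. Use the non-greedy matcher instead: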

grep -ohPR --include="*.nxml" "<license(.*?)</license>" .
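
To see the difference on a small case, here is a sketch that runs both patterns over a made-up line containing two <license> elements (the sample line is invented for illustration; GNU grep with -P is assumed):

```shell
# A made-up single line with two <license> elements.
line='<license>A</license><p>text</p><license>B</license>'

# Greedy: .* runs to the last </license> on the line, so one match spans both elements.
greedy=$(printf '%s\n' "$line" | grep -oP '<license(.*)</license>')

# Non-greedy: .*? stops at the first </license>, so each element is matched separately.
lazy=$(printf '%s\n' "$line" | grep -oP '<license(.*?)</license>')

printf 'greedy:\n%s\n' "$greedy"
printf 'non-greedy:\n%s\n' "$lazy"
```

The greedy run prints the whole line as one match, while the non-greedy run prints <license>A</license> and <license>B</license> on separate lines.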

Daniel-Mietchen commented 9 years ago

Cool, thanks!