Open Daniel-Mietchen opened 9 years ago
Here is what I plan to do: get an overview of all the <license> statements within the Open subset of the articles on PubMed Central.
I will update this comment as I move forward.
for f in *.tar.gz; do tar -zxvf "$f"; done
<permissions>, <inline-formula>, <disp-formula>, <fig>, <ref>, <subj-group>, <kwd-group>
grep -ohPR --include="*.nxml" "<license(.*)</license>" .
grep -ohPR --include="*.nxml" "<license(.*)</license>" | awk '!x[$0]++' > license-statements.txt
grep -oHPR --include="*.nxml" "<license(.*)</license>" > license-statements.txt
over night
xmlgrep would be more suited to this, but it's not available on that machine
license-statements.txt of over 370MB, with license statements from over 700k nxml files (not sure why not from all ca. 800k files)
cut -d ":" -f 2- license-statements.txt | awk '!x[$0]++' | sort > license-statements-without-filenames.txt
removes the file names and deduplicates license statements
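To illustrate what the cut/awk/sort step does, here is a minimal sketch on a made-up three-line sample in the "file:match" format that grep -H produces. The file names and license texts are hypothetical, not taken from PMC. Note that cut splits on the first colon, so this assumes the file paths themselves contain no colon.

```shell
# Hypothetical sample in the "file:match" format produced by grep -H.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
./a/one.nxml:<license><p>CC BY: see http://example.org</p></license>
./b/two.nxml:<license><p>CC BY: see http://example.org</p></license>
./c/three.nxml:<license><p>All rights reserved</p></license>
EOF

# Drop the file name (everything up to the first colon),
# keep the first occurrence of each statement, then sort.
unique=$(cut -d ":" -f 2- "$tmp" | awk '!x[$0]++' | sort)
printf '%s\n' "$unique"

# For comparison: sort -u deduplicates and sorts in one step.
unique_alt=$(cut -d ":" -f 2- "$tmp" | sort -u)

rm -f "$tmp"
```

On this sample both variants leave two distinct statements; colons inside the license text survive because -f 2- keeps every field after the first.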
grep -oHPR --include="*.nxml" "<license(.*?)</license>" > license-statements.txt
over night
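On the open question of why only about 700k of the ca. 800k files yielded matches: one possible explanation (an assumption, not verified against the corpus) is that grep works line by line, so a <license> element that spans several lines is never matched, and some files may have no <license> element at all. A rough sketch for telling these cases apart, assuming GNU grep with -P support and using a made-up miniature corpus:

```shell
# Hypothetical miniature corpus (not real PMC files).
dir=$(mktemp -d)
printf '<license><p>one line</p></license>\n' > "$dir/single.nxml"
printf '<license>\n<p>two lines</p>\n</license>\n' > "$dir/multi.nxml"
printf '<permissions/>\n' > "$dir/none.nxml"

# Files matched by the line-based pattern.
matched=$(grep -lPR --include="*.nxml" "<license(.*?)</license>" "$dir" | wc -l | tr -d ' ')

# Files that contain no <license tag at all (-L lists files without a match).
without=$(grep -LR --include="*.nxml" "<license" "$dir" | wc -l | tr -d ' ')

# Files matched when each file is read as one record (-z)
# and "." is allowed to match newlines ((?s)).
multiline=$(grep -lPzR --include="*.nxml" "(?s)<license.*?</license>" "$dir" | wc -l | tr -d ' ')

echo "line-based: $matched, no <license>: $without, multi-line aware: $multiline"
rm -rf "$dir"
```

Running the -L variant on the real directory would list the files to inspect by hand.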
grep -ohPR --include="*.nxml" "<license(.*)</license>" .
If an article has more than one <license> element on the same line, this captures everything from the first <license to the last </license>, including whatever sits between the two elements. Use the non-greedy matcher instead:
grep -ohPR --include="*.nxml" "<license(.*?)</license>" .
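The difference between the two patterns can be checked on a small sample. This is a minimal sketch, assuming GNU grep with -P support; the sample string is made up for illustration and is not real PMC content.

```shell
# Hypothetical one-line sample with two <license> elements.
sample='<permissions><license license-type="a">A</license><license license-type="b">B</license></permissions>'

# Greedy: a single match running from the first <license to the last </license>.
greedy_count=$(printf '%s\n' "$sample" | grep -oP '<license(.*)</license>' | wc -l | tr -d ' ')

# Non-greedy: one match per <license> element.
lazy_count=$(printf '%s\n' "$sample" | grep -oP '<license(.*?)</license>' | wc -l | tr -d ' ')

echo "greedy: $greedy_count, non-greedy: $lazy_count"
# prints "greedy: 1, non-greedy: 2"
```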
Cool, thanks!
We are taking part in the Mozilla Open Science sprint (overview) and welcome contributions to any of the software projects here at WikiProject Open Access, in particular to the YouTube exporter (#82) and the Open Access signalling project.
If you are interested in getting involved, please leave a note here, and we will take things from there.