Daniel-Mietchen closed this issue 11 years ago
I think "media type" is the correct term, rather than "mime type". I'm changing it everywhere in the paper.
The success of any attempt is limited by our ability to compare this data. Our requests are currently rate-limited to one every 3 seconds. This means that for 1000 supplementary materials (like in the linked figure), the lower bound for the time needed is 3000 seconds (50 minutes); the actual time is likely significantly higher, as this does not include the response time. For 10000 supplementary materials the lower bound is 30000 seconds (over 8 hours), and for 100000 supplementary materials it is 300000 seconds (nearly 3.5 days).
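For reference, these lower bounds can be reproduced with a short calculation. This is only a sketch: it assumes exactly one request per supplementary material and the 3-second delay mentioned above, and the name lower_bound_seconds is made up for illustration.

```python
# Lower bound on crawl time under a fixed delay between requests.
# Assumption: one request per supplementary material, response time ignored.
SECONDS_PER_REQUEST = 3  # the rate limit discussed in this thread


def lower_bound_seconds(n_materials, delay=SECONDS_PER_REQUEST):
    """Return the minimum number of seconds needed for n_materials requests."""
    return n_materials * delay


for n in (1000, 10000, 100000):
    seconds = lower_bound_seconds(n)
    print("%6d materials: %6d s = %.1f h = %.2f d"
          % (n, seconds, seconds / 3600.0, seconds / 86400.0))
```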
I am currently trying to find out how many materials there are in the time range from 2013-03-01 until (but not including) 2013-09-01.
Publications for a single day:
$ ./oa-pmc-ids --from 2013-03-01 --until 2013-03-02 | wc -w
989
Publications for a single week:
$ ./oa-pmc-ids --from 2013-03-01 --until 2013-03-08 | wc -w
5649
Publications for a single month:
$ ./oa-pmc-ids --from 2013-03-01 --until 2013-03-31 | wc -w
139692
Daniel, do you know how many supplementary materials there usually are per paper? Depending on the answer, I would propose to make a new figure with either a day's or a week's data.
I would estimate that there is about one supplementary file per recent paper on average. I'd go for a week, which should take just a few hours to process.
Creating the list of PMC IDs for the first week of March:
% ./oa-pmc-ids --from 2013-03-01 --until 2013-03-08 --verbose > pmc-ids-from-2013-03-01-until-2013-03-08
Confirming number of PMC IDs:
% wc -w <pmc-ids-from-2013-03-01-until-2013-03-08
5649
PMC IDs for the first week of March, for reference: http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/pmc-ids-from-2013-03-01-until-2013-03-08
Creating database on host files.mi.ur.de (user erlehmann).
$ nohup sh -c 'cat pmc-ids-from-2013-03-01-until-2013-03-08 | ./oa-get download-metadata pmc_pmcid 2>oa-get-download-metadata.log'
What does the 2 do there as a third argument to oa-get?
“2>” redirects the standard error stream (stderr) to the log file, as the file descriptor of stderr is “2”. http://en.wikibooks.org/wiki/Bourne_Shell_Scripting/Files_and_streams#Redirecting_standard_error_.28and_other_streams.29
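To illustrate the same idea outside the shell, here is a small Python sketch that starts a child process and sends only its stderr to a file. The child command and the file name example.log are made up for this example and have nothing to do with oa-get itself.

```python
import os
import subprocess
import sys
import tempfile

# A child process that writes one line to stdout and one line to stderr.
child = [sys.executable, "-c",
         "import sys; sys.stdout.write('result\\n'); sys.stderr.write('warning\\n')"]

log_path = os.path.join(tempfile.mkdtemp(), "example.log")  # hypothetical log file
with open(log_path, "wb") as log:
    # stderr=log plays the role of "2>example.log" in the shell:
    # file descriptor 2 of the child points at the log file.
    captured_stdout = subprocess.check_output(child, stderr=log)

with open(log_path, "rb") as log:
    captured_stderr = log.read()

print(captured_stdout)  # what the child wrote to stdout
print(captured_stderr)  # what the child wrote to stderr (read back from the log)
```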
Thanks for the pointer. Should have known that one, and probably did some years ago...
Finding supplementary materials on host files.mi.ur.de (user erlehmann).
$ nohup sh -c './oa-cache find-media pmc_pmcid 2> oa-cache-find-media.log'
Caveat: Python 2.x falls back to ASCII if it cannot determine the encoding of a stream, which is the case when stderr goes to a file. The command will therefore crash as soon as it writes a non-ASCII character. It only works after overriding the encoding used for stdin, stdout and stderr:
$ export PYTHONIOENCODING=utf-8
Source: http://docs.python.org/2/using/cmdline.html#envvar-PYTHONIOENCODING
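The underlying failure can be reproduced without oa-cache. The following is a minimal sketch using in-memory streams; the sample text and the helper write_with_encoding are chosen for illustration only.

```python
# -*- coding: utf-8 -*-
import io

# Sample text containing a non-ASCII character (a horizontal ellipsis).
text = u"Counting supplementary materials \u2026"


def write_with_encoding(encoding):
    """Write the sample text to an in-memory stream using the given encoding."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


try:
    write_with_encoding("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # An ASCII stream cannot represent the ellipsis, so writing fails.
    ascii_ok = False

utf8_bytes = write_with_encoding("utf-8")  # works: UTF-8 can encode the ellipsis
print(ascii_ok)
print(utf8_bytes)
```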
Checking media types on host files.mi.ur.de (user erlehmann). I commented out the following lines of oa-get so no wrong license assessment on our part could distort the results:
free_materials = [ material for material in materials \
    if material.article.license_url in config.free_license_urls ]
materials = free_materials
# Checking MIME types of non-free
# supplementary materials costs time.
I then overrode the encoding used for stdin, stdout and stderr and started the process.
$ export PYTHONIOENCODING=utf-8
$ nohup sh -c './oa-get update-mimetypes pmc_pmcid 2> oa-get-update-mimetypes.log'
The process probably will not take too long:
379 of 6306 6% |## | ETA: 02:23:25
The log shows quite a few hits:
DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s013.m4v, source claimed text/plain but is video/mp4.
DOI 10.1021/ja3120955, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585461/bin/ja3120955_si_001.pdf, source claimed / but is application/pdf.
I fully agree with the proposal to ignore MS Office files:
DOI 10.1371/journal.pcbi.1002920, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3581797/bin/pcbi.1002920.s001.xls, source claimed application/vnd.ms-excel but is application/msword.
As of 57843219c150c83bcac817d8c1206212b11c45d0, “oa-cache stats” treats MIME type correctness of documents detected to be MS Office documents (“application/msword”) as unknown.
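Roughly, the rule amounts to the following. This is a sketch of the idea only, not the code from that commit; the function name classify and the three labels are assumptions for illustration.

```python
def classify(claimed, detected):
    """Compare the media type claimed in the XML with the detected one."""
    # Anything detected as application/msword may be any MS Office format,
    # so its correctness is treated as unknown rather than as a mismatch.
    if detected == "application/msword":
        return "unknown"
    return "correct" if claimed == detected else "mismatch"


print(classify("application/vnd.ms-excel", "application/msword"))
print(classify("text/plain", "video/mp4"))
print(classify("application/pdf", "application/pdf"))
```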
Dumping statistics:
% ./oa-cache stats pmc_pmcid > oa-stats
Output:
Counting supplementary materials … 6306 supplementary materials found. 100% |#########################################################################|
% ./plot-helper < oa-stats
generates two files relevant to this bug:
1. A CSV table with mismatches tallied up (top: XML; left: actual media type): http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/mediatypes-misreported.csv
2. A plot: http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/mediatypes-misreported-by-publisher.png
“unknown” files are all files detected as application/msword. It actually surprised me that there are so many of them.
The numbers for MS Office documents do not surprise me - there are indeed loads of them.
The figure is OK for the paper, but I think we need a larger sample (6 months or so) in order to enrich the talk.
As for the CSV (I uploaded a copy to GDocs), it contains a few cases that are interesting for OAMI and for which it would be good to have the PMCIDs:
There are also about 20 files claimed to be some audio or video format while identified as another audio or video format. Does the converter detect that or does it need the info from the media type check?
It might be worth visualizing this as a heat map, with separate color schemes for audio, video, and the rest.
Since the majority of the mismatches seem to be cases labeled as text/plain and identified as application/xml, is there an independent way to check whether our assignment is correct?
Nils, why do you think your requests are rate-limited to one every 3 seconds? I have inquired about this a few times at PMC, and no one seems to know anything about this.
Stderr log file from the media type check: http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/oa-get-update-mimetypes.log
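The mismatch lines in that log follow a fixed pattern, so they can be tallied with a few lines of Python. This is a sketch, assuming every relevant line has the form "DOI …, URL, source claimed X but is Y."; the regular expression and the tally function are my own and are not part of oa-cache or plot-helper.

```python
import re
from collections import Counter

# Pattern for one mismatch line as it appears in the log excerpts in this thread.
LINE = re.compile(
    r"DOI (?P<doi>\S+), (?P<url>\S+), "
    r"source claimed (?P<claimed>\S+) but is (?P<actual>\S+)\.$"
)


def tally(lines):
    """Count (claimed, actual) media type pairs; other lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line.strip())
        if match:
            counts[(match.group("claimed"), match.group("actual"))] += 1
    return counts


sample = [
    "DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s013.m4v, source claimed text/plain but is video/mp4.",
    "DOI 10.1371/journal.pgen.1003342, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3591300/bin/pgen.1003342.s007.m4v, source claimed text/plain but is video/mp4.",
    "DOI 10.1021/ja3120955, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585461/bin/ja3120955_si_001.pdf, source claimed / but is application/pdf.",
]
counts = tally(sample)
for (claimed, actual), n in sorted(counts.items()):
    print("%s -> %s: %d" % (claimed, actual, n))
```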
Many files labeled as text/plain and identified as application/xml seem to be Chemical Markup Language (CML) files: http://en.wikipedia.org/wiki/Chemical_Markup_Language
You can check that those actually are XML files by looking at the beginning of the files. They start with:
<?xml version="1.0" encoding="UTF-8"?>
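A check along these lines could look as follows. It is a minimal sketch with a made-up function name; note that it only recognises files that begin with an XML declaration, and an XML file without one would not be detected this way.

```python
def looks_like_xml(head):
    """Return True if the leading bytes of a file start with an XML declaration."""
    # Skip an optional UTF-8 byte order mark before the declaration.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    return head.startswith(b"<?xml")


print(looks_like_xml(b'<?xml version="1.0" encoding="UTF-8"?>\n<cml/>'))
print(looks_like_xml(b"just some plain text"))
```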
Regarding the video/x-flv files:
DOI 10.4103/0970-0358.105988, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3580371/bin/IJPS-45-581-v001.flv, source claimed application/postscript but is video/x-flv.
DOI 10.4103/0970-0358.105988, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3580371/bin/IJPS-45-581-v002.flv, source claimed application/postscript but is video/x-flv.
DOI 10.4103/0970-0358.105988, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3580371/bin/IJPS-45-581-v003.flv, source claimed application/postscript but is video/x-flv.
DOI 10.4103/0970-0358.105988, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3580371/bin/IJPS-45-581-v004.flv, source claimed application/postscript but is video/x-flv.
DOI 10.4103/0970-0358.105988, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3580371/bin/IJPS-45-581-v005.flv, source claimed application/postscript but is video/x-flv.
Regarding files reported as text/plain actually being video/mp4:
DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s013.m4v, source claimed text/plain but is video/mp4.
DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s014.m4v, source claimed text/plain but is video/mp4.
DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s015.m4v, source claimed text/plain but is video/mp4.
DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s016.m4v, source claimed text/plain but is video/mp4.
DOI 10.1371/journal.pgen.1003342, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3591300/bin/pgen.1003342.s007.m4v, source claimed text/plain but is video/mp4.
Regarding files reported as text/plain actually being application/ogg:
DOI 10.1371/journal.pcbi.1002908, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3572992/bin/pcbi.1002908.s001.ogg, source claimed text/plain but is application/ogg.
DOI 10.1371/journal.pcbi.1002908, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3572992/bin/pcbi.1002908.s002.ogg, source claimed text/plain but is application/ogg.
Regarding the file reported as audio/x-realaudio which actually was application/x-rar:
DOI 10.1371/journal.pcbi.1002933, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3591266/bin/pcbi.1002933.s003.rar, source claimed audio/x-realaudio but is application/x-rar.
Klortho, http://www.ncbi.nlm.nih.gov/pmc/tools/oai/ says:
If you are using a script that makes more than 100 requests of any kind, please run it outside of the PMC system's peak hours. Do not make more than one request every 3 seconds, even at off-peak times. Peak hours are Monday to Friday, 5:00 AM to 9:00 PM, U.S. Eastern time.
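A client can respect the "one request every 3 seconds" rule with a small throttle. The following is only a sketch and not the OAMI's actual rate limiting; the class name Throttle is made up, and the clock and sleep functions are injectable so the example can run against a simulated clock without real waiting.

```python
import time


class Throttle(object):
    """Make consecutive wait() calls return at least min_interval seconds apart."""

    def __init__(self, min_interval=3.0, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time at which the previous wait() returned

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


# Simulated usage with a fake clock, so the example finishes instantly.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(3.0, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # a real client would issue one request after each wait()
print(slept)
```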
I had forgotten about that. I don't think it's enforced anywhere. As long as you are not "abusing" our system, I don't think you have to worry. I'll check on getting the wording on that changed to be less onerous.
Nils, can you please provide a sentence or two on why the media type detection does not work for MS Office files?
Daniel, most file types can be inferred from a few bytes at the start of the file, colloquially called “magic numbers”. For example, if a file starts with “<!DOCTYPE html”, you can infer that it is an HTML file. http://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_files http://en.wikipedia.org/wiki/File_format#Magic_number http://en.wikipedia.org/wiki/List_of_file_signatures
Microsoft Office documents all begin with “D0 CF 11 E0” and are structured in a very complicated way. To determine what kind of file you have, you have to parse the whole file and reverse engineer the MS Office format. http://social.msdn.microsoft.com/Forums/en-US/343d09e3-5fdf-4b4a-9fa6-8ccb37a35930/developing-a-tool-to-recognise-ms-office-file-types-doc-xls-mdb-ppt-#d06878c3-951d-4ba5-8aae-ce8411a00f62
For efficiency reasons, the Open Access Media Importer only looks at the first 12 bytes of a file to find out what format a file has. For audio and video formats, this is enough. More accurate MS Office document detection would necessitate a full download of all files just to infer the media type – and be of questionable value, since we are not interested in MS Office files anyway.
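As a rough illustration of how detection from the first 12 bytes can work, here is a sketch with a handful of well-known signatures. It is not the OAMI's actual detection code; the table SIGNATURES and the function sniff are made up for this example, and mapping every “ftyp” file to video/mp4 is a simplification, since other ISO media formats share that marker.

```python
# (offset, magic bytes, media type) for a few well-known file signatures.
SIGNATURES = [
    (0, b"\xd0\xcf\x11\xe0", "application/msword"),  # container shared by .doc, .xls, .ppt
    (0, b"%PDF", "application/pdf"),
    (0, b"OggS", "application/ogg"),
    (0, b"FLV", "video/x-flv"),
    (0, b"Rar!", "application/x-rar"),
    (4, b"ftyp", "video/mp4"),  # marker sits after a 4-byte length field
]


def sniff(head):
    """Guess a media type from the first 12 bytes of a file, or return None."""
    head = head[:12]
    for offset, magic, media_type in SIGNATURES:
        if head[offset:offset + len(magic)] == magic:
            return media_type
    return None


print(sniff(b"\x00\x00\x00\x20ftypM4V "))              # an .m4v style header
print(sniff(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1rest"))  # any legacy MS Office file
print(sniff(b"hello world"))                            # no known signature
```

The MS Office entry shows the limitation discussed above: the signature identifies the container, not whether the file is a Word, Excel or PowerPoint document.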
Nils, we're changing that text to read, "please make sure you don't do concurrent requests"
Klortho, I think this is great news. Can you assure me that there is no automatic blocking mechanism if API access happens too often in a given interval? As soon as you do, I will remove the rate limiting of the OAMI.
Yes, as long as you limit yourself to OA materials. If you get blocked, let me know.
So, for the paper, we should use this, right?
Referenced in this comment above: https://github.com/erlehmann/open-access-media-importer/issues/102#issuecomment-24862881
Yes, that is the right graphic.
Rate limiting removed as of 67abe95dc18657ae6dbcd6df89bdadba3bc2ebd8.
It would be good to have a more representative version of https://commons.wikimedia.org/wiki/File:MIME_types_of_a_random_sample_of_supplementary_materials_from_the_Open_Access_subset_in_PubMed_Central_as_of_October_23,_2012.png .
Or perhaps several versions - one for all the DOI prefixes in the whitelist, and one for everything indexed in PMC over a given period, e.g. the six months from March to August 2013 (if we can do a longer period or even the whole OA subset, even better)
Due to the problems around correctly identifying MS Office file types, I would say we ignore those cases where something is identified as MS Word but has another MS Office file ending (e.g. ppt).
Should be done in conjunction with license stats, as per https://github.com/erlehmann/open-access-media-importer/issues/101 .
./oa-get download-metadata pmc_pmcid && \
./oa-cache find-media pmc_pmcid && \
./oa-get update-mimetypes pmc_pmcid && \
Pinging https://github.com/erlehmann/open-access-media-importer/issues/97 .