wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

MIME type stats figure for JATS-Con paper #102

Closed Daniel-Mietchen closed 11 years ago

Daniel-Mietchen commented 11 years ago

It would be good to have a more representative version of https://commons.wikimedia.org/wiki/File:MIME_types_of_a_random_sample_of_supplementary_materials_from_the_Open_Access_subset_in_PubMed_Central_as_of_October_23,_2012.png .

Or perhaps several versions - one for all the DOI prefixes in the whitelist, and one for everything indexed in PMC over a given period, e.g. the six months from March to August 2013 (if we can do a longer period or even the whole OA subset, even better)

Due to the problems around correctly identifying MS Office file types, I would say we ignore those cases where something is identified as MS Word but has another MS Office file extension (e.g. ppt).

Should be done in conjunction with license stats, as per https://github.com/erlehmann/open-access-media-importer/issues/101 .

./oa-get download-metadata pmc_pmcid && \
./oa-cache find-media pmc_pmcid && \
./oa-get update-mimetypes pmc_pmcid && \

Pinging https://github.com/erlehmann/open-access-media-importer/issues/97 .

Klortho commented 11 years ago

I think "media type" is the correct term, rather than "mime type". I'm changing it everywhere in the paper.

erlehmann commented 11 years ago

The success of any attempt is limited by our ability to compare this data. Our requests are currently rate-limited to one every 3 seconds. This means that for 1000 supplementary materials (as in the linked figure), the lower bound for the time needed is 3000 seconds (50 minutes); the actual time is likely to be significantly higher, as this does not include the response time. For 10000 supplementary materials the lower bound is 30000 seconds (over 8 hours); for 100000 supplementary materials it is 300000 seconds (nearly 3.5 days).
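The lower bounds quoted above are simply the number of requests multiplied by the 3-second interval. A minimal sketch of that arithmetic, assuming one request per supplementary material and ignoring response time (the function name is made up for illustration and is not part of the importer):

```python
def lower_bound_seconds(materials, seconds_per_request=3):
    """Minimum wall-clock time when requests are spaced by a fixed interval.

    Assumes one request per supplementary material; response time is ignored.
    """
    return materials * seconds_per_request


for count in (1000, 10000, 100000):
    seconds = lower_bound_seconds(count)
    print("%6d materials: %6d s = %.1f h = %.2f d"
          % (count, seconds, seconds / 3600.0, seconds / 86400.0))
```

The same function can be applied to any sample size under discussion, for example a week's worth of PMC IDs, as long as the one-request-per-material assumption holds.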

erlehmann commented 11 years ago

I am currently trying to find out how many materials there are in the time range from 2013-03-01 until (but not including) 2013-09-01.

erlehmann commented 11 years ago

Publications for a single day:

$ ./oa-pmc-ids --from 2013-03-01 --until 2013-03-02 | wc -w
989

Publications for a single week:

$ ./oa-pmc-ids --from 2013-03-01 --until 2013-03-08 | wc -w
5649

Publications for a single month:

$ ./oa-pmc-ids --from 2013-03-01 --until 2013-03-31 | wc -w
139692

erlehmann commented 11 years ago

Daniel, do you know how many supplementary materials there usually are per paper? Depending on the answer, I would propose to make a new figure with either a day's or a week's data.

Daniel-Mietchen commented 11 years ago

I would estimate that there is about one supplementary file per recent paper on average. I'd go for a week, which should take just a few hours to process.

erlehmann commented 11 years ago

Creating the list of PMC IDs for the first week of March:

% ./oa-pmc-ids --from 2013-03-01 --until 2013-03-08 --verbose > pmc-ids-from-2013-03-01-until-2013-03-08

erlehmann commented 11 years ago

Confirming number of PMC IDs:

% wc -w <pmc-ids-from-2013-03-01-until-2013-03-08
5649

erlehmann commented 11 years ago

PMC IDs for the first week of March, for reference: http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/pmc-ids-from-2013-03-01-until-2013-03-08

erlehmann commented 11 years ago

Creating database on host files.mi.ur.de (user erlehmann).

$ nohup sh -c 'cat pmc-ids-from-2013-03-01-until-2013-03-08 | ./oa-get download-metadata pmc_pmcid 2>oa-get-download-metadata.log'

Daniel-Mietchen commented 11 years ago

What does the 2 do there as a third argument to oa-get?

erlehmann commented 11 years ago

“2>” redirects the standard error stream (stderr) to the log file, as the file descriptor of stderr is “2”. http://en.wikibooks.org/wiki/Bourne_Shell_Scripting/Files_and_streams#Redirecting_standard_error_.28and_other_streams.29
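For readers more at home in Python than in the shell, the same redirection can be sketched with the standard `subprocess` module. This is only an illustration in Python 3: the child command and the log file name are stand-ins, not anything the importer runs.

```python
import os
import subprocess
import sys
import tempfile

# A stand-in child process that writes one line to stdout and one to stderr.
child = [sys.executable, "-c",
         "import sys; print('to stdout'); sys.stderr.write('to stderr\\n')"]

log_path = os.path.join(tempfile.mkdtemp(), "example.log")
with open(log_path, "wb") as log:
    # stderr=log plays the role of the shell's "2>example.log":
    # file descriptor 2 of the child is connected to the log file.
    result = subprocess.run(child, stdout=subprocess.PIPE, stderr=log)

with open(log_path) as log:
    logged = log.read()

print("captured stdout:", result.stdout.decode().strip())
print("log file holds: ", logged.strip())
```

Only the stderr line ends up in the log file; stdout is unaffected by the redirection.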

Daniel-Mietchen commented 11 years ago

Thanks for the pointer. Should have known that one, and probably did some years ago...

erlehmann commented 11 years ago

Finding supplementary materials on host files.mi.ur.de (user erlehmann).

$ nohup sh -c './oa-cache find-media pmc_pmcid 2> oa-cache-find-media.log'

Caveat: Python 2.x falls back to ASCII if it cannot determine the encoding of a stream, which is the case when stderr goes to a file. The command will therefore crash as soon as it writes a non-ASCII character. It only works after overriding the encoding used for stdin, stdout and stderr:

$ export PYTHONIOENCODING=utf-8

Source: http://docs.python.org/2/using/cmdline.html#envvar-PYTHONIOENCODING
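The failure mode can be reproduced without the importer. The sketch below (Python 3, with a made-up helper name) writes a string containing a non-ASCII character through an in-memory text stream, once with ASCII and once with UTF-8. It only imitates what happens to a redirected stderr; it is not the importer's code.

```python
import io


def write_to_stream(text, encoding):
    """Write text through a text stream that uses the given encoding."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


# Any non-ASCII character will do; here it is a horizontal ellipsis.
message = u"Counting supplementary materials \u2026"

try:
    write_to_stream(message, "ascii")
except UnicodeEncodeError as error:
    print("ascii stream failed:", error.reason)

print("utf-8 stream wrote", len(write_to_stream(message, "utf-8")), "bytes")
```

Setting `PYTHONIOENCODING=utf-8` has the effect of choosing the second case for the standard streams.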

erlehmann commented 11 years ago

Checking media types on host files.mi.ur.de (user erlehmann). I commented out the following lines of oa-get so no wrong license assessment on our part could distort the results:

free_materials = [
    material for material in materials \
        if material.article.license_url in config.free_license_urls
    ]
materials = free_materials  # Checking MIME types of non-free            
                                   # supplementary materials costs time.  

I then overrode the encoding used for stdin, stdout and stderr and started the process.

$ export PYTHONIOENCODING=utf-8
$ nohup sh -c './oa-get update-mimetypes pmc_pmcid 2> oa-get-update-mimetypes.log'

erlehmann commented 11 years ago

The process probably will not take too long:

379 of 6306   6% |##                                           | ETA:  02:23:25

The log shows quite a few hits:

DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s013.m4v, source claimed text/plain but is video/mp4.
DOI 10.1021/ja3120955, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585461/bin/ja3120955_si_001.pdf, source claimed / but is application/pdf.

I fully agree with the proposal to ignore MS Office files:

DOI 10.1371/journal.pcbi.1002920, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3581797/bin/pcbi.1002920.s001.xls, source claimed application/vnd.ms-excel but is application/msword.

erlehmann commented 11 years ago

As of 57843219c150c83bcac817d8c1206212b11c45d0, “oa-cache stats” treats MIME type correctness of documents detected to be MS Office documents (“application/msword”) as unknown.
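The rule described here can be sketched in a few lines. This is a hypothetical restatement of the behaviour, not the code from that commit; the function name and the three labels are assumptions made for illustration.

```python
def media_type_correctness(claimed, detected):
    """Classify a claimed media type against the detected one.

    Anything detected as application/msword is treated as "unknown",
    because detection cannot tell MS Office formats apart.
    """
    if detected == "application/msword":
        return "unknown"
    return "correct" if claimed == detected else "incorrect"


print(media_type_correctness("application/vnd.ms-excel", "application/msword"))
print(media_type_correctness("text/plain", "video/mp4"))
print(media_type_correctness("application/pdf", "application/pdf"))
```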

erlehmann commented 11 years ago

Dumping statistics:

% ./oa-cache stats pmc_pmcid > oa-stats

Output:

Counting supplementary materials … 6306 supplementary materials found.
100% |#########################################################################|

erlehmann commented 11 years ago

% ./plot-helper < oa-stats

generates two files relevant to this bug:

1. A CSV table with mismatches tallied up (top: XML; left: actual media type): http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/mediatypes-misreported.csv
2. A plot: http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/mediatypes-misreported-by-publisher.png

“Unknown” files are all files detected as application/msword. It actually surprised me that there are so many of them.

Daniel-Mietchen commented 11 years ago

The numbers for MS Office documents do not surprise me - there are indeed loads of them.

The figure is OK for the paper, but I think we need a larger sample (6 months or so) in order to enrich the talk.

As for the CSV (I uploaded a copy to GDocs), it contains a few cases that are interesting for OAMI and for which it would be good to have the PMCIDs :

There are also about 20 files claimed to be some audio or video format while identified as another audio or video format. Does the converter detect that or does it need the info from the media type check?

It might be worth visualizing this as a heat map, with separate color schemes for audio, video, and the rest.

Since the majority of the mismatches seem to be cases labeled as text/plain and identified as application/xml, is there an independent way to check whether our assignment is correct?

Klortho commented 11 years ago

Nils, why do you think your requests are rate-limited to one every 3 seconds? I have inquired about this a few times at PMC, and no one seems to know anything about this.

erlehmann commented 11 years ago

Stderr log file from the media type check: http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/oa-get-update-mimetypes.log
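Log lines of the form "DOI …, URL, source claimed X but is Y." lend themselves to a quick tally of (claimed, actual) pairs. The following is a minimal sketch that assumes every mismatch line follows exactly that pattern; the regular expression and function name are illustrative and this is not how `plot-helper` works.

```python
import re
from collections import Counter

# Matches lines such as:
# DOI <doi>, <url>, source claimed <type> but is <type>.
LINE = re.compile(
    r"^DOI (?P<doi>\S+), (?P<url>\S+), "
    r"source claimed (?P<claimed>\S+) but is (?P<actual>\S+)\.$"
)


def tally_mismatches(lines):
    """Count (claimed, actual) media type pairs in mismatch log lines."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            counts[(match.group("claimed"), match.group("actual"))] += 1
    return counts


sample = [
    "DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s013.m4v, source claimed text/plain but is video/mp4.",
    "DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s014.m4v, source claimed text/plain but is video/mp4.",
    "DOI 10.1371/journal.pcbi.1002933, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3591266/bin/pcbi.1002933.s003.rar, source claimed audio/x-realaudio but is application/x-rar.",
]

for (claimed, actual), count in tally_mismatches(sample).most_common():
    print("%3d  claimed %s, actual %s" % (count, claimed, actual))
```

Lines that do not match the pattern are skipped silently, so unrelated log output does not disturb the count.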

Many files labeled as text/plain and identified as application/xml seem to be Chemical Markup Language (CML) files: http://en.wikipedia.org/wiki/Chemical_Markup_Language

You can check that those actually are XML files by looking at the beginning of the files. They start with:

<?xml version="1.0" encoding="UTF-8"?>

erlehmann commented 11 years ago

Regarding the video/x-flv files:

DOI 10.4103/0970-0358.105988, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3580371/bin/IJPS-45-581-v001.flv, source claimed application/postscript but is video/x-flv.
DOI 10.4103/0970-0358.105988, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3580371/bin/IJPS-45-581-v002.flv, source claimed application/postscript but is video/x-flv.
DOI 10.4103/0970-0358.105988, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3580371/bin/IJPS-45-581-v003.flv, source claimed application/postscript but is video/x-flv.
DOI 10.4103/0970-0358.105988, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3580371/bin/IJPS-45-581-v004.flv, source claimed application/postscript but is video/x-flv.
DOI 10.4103/0970-0358.105988, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3580371/bin/IJPS-45-581-v005.flv, source claimed application/postscript but is video/x-flv.

erlehmann commented 11 years ago

Regarding files reported as text/plain actually being video/mp4:

DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s013.m4v, source claimed text/plain but is video/mp4.
DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s014.m4v, source claimed text/plain but is video/mp4.
DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s015.m4v, source claimed text/plain but is video/mp4.
DOI 10.1371/journal.pcbi.1002915, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3585391/bin/pcbi.1002915.s016.m4v, source claimed text/plain but is video/mp4.
DOI 10.1371/journal.pgen.1003342, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3591300/bin/pgen.1003342.s007.m4v, source claimed text/plain but is video/mp4.

erlehmann commented 11 years ago

Regarding files reported as text/plain actually being application/ogg:

DOI 10.1371/journal.pcbi.1002908, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3572992/bin/pcbi.1002908.s001.ogg, source claimed text/plain but is application/ogg.
DOI 10.1371/journal.pcbi.1002908, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3572992/bin/pcbi.1002908.s002.ogg, source claimed text/plain but is application/ogg.

erlehmann commented 11 years ago

Regarding the file reported as audio/x-realaudio which actually was application/x-rar:

DOI 10.1371/journal.pcbi.1002933, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3591266/bin/pcbi.1002933.s003.rar, source claimed audio/x-realaudio but is application/x-rar.

erlehmann commented 11 years ago

Klortho, http://www.ncbi.nlm.nih.gov/pmc/tools/oai/ says:

   If you are using a script that makes more than 100 requests of any kind,
   please run it outside of the PMC system's peak hours. Do not make more
   than one request every 3 seconds, even at off-peak times. Peak hours are
   Monday to Friday, 5:00 AM to 9:00 PM, U.S. Eastern time.

Klortho commented 11 years ago

I had forgotten about that. I don't think it's enforced anywhere. As long as you are not "abusing" our system, I don't think you have to worry. I'll check on getting the wording on that changed to be less onerous.

Daniel-Mietchen commented 11 years ago

Nils, can you please provide a sentence or two on why the media type detection does not work for MS Office files?

erlehmann commented 11 years ago

Daniel, most file types can be inferred from a few bytes, colloquially called “magic numbers”. For example, if a file starts with “<!DOCTYPE html”, you can infer that it is an HTML file. http://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_files http://en.wikipedia.org/wiki/File_format#Magic_number http://en.wikipedia.org/wiki/List_of_file_signatures

Microsoft Office documents all begin with the bytes “D0 CF 11 E0” and are structured in a very complicated way. To determine what kind of file you have, you have to parse the whole file and reverse engineer the MS Office format. http://social.msdn.microsoft.com/Forums/en-US/343d09e3-5fdf-4b4a-9fa6-8ccb37a35930/developing-a-tool-to-recognise-ms-office-file-types-doc-xls-mdb-ppt-#d06878c3-951d-4ba5-8aae-ce8411a00f62

For efficiency reasons, the Open Access Media Importer only looks at the first 12 bytes of a file to find out what format a file has. For audio and video formats, this is enough. More accurate MS Office document detection would necessitate a full download of all files just to infer the media type – and be of questionable value, since we are not interested in MS Office files anyway.
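To make the idea concrete, here is a minimal sketch of prefix-based detection on the first 12 bytes of a file. It is not the importer's actual detection code: the function name and the signature table are assumptions for illustration, and the table only covers a handful of well-known signatures relevant to this thread.

```python
# Signature of the container format shared by legacy MS Office documents.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# (prefix, media type) pairs; illustrative and far from complete.
PREFIX_SIGNATURES = [
    (b"%PDF", "application/pdf"),
    (b"OggS", "application/ogg"),
    (b"FLV", "video/x-flv"),
    (b"Rar!", "application/x-rar"),
    (b"<?xml", "application/xml"),
    # Word, Excel and PowerPoint files share this prefix, so the first
    # bytes alone cannot tell them apart.
    (OLE2_MAGIC, "application/msword"),
]


def sniff_media_type(head):
    """Guess a media type from the first 12 bytes of a file."""
    head = head[:12]
    for magic, media_type in PREFIX_SIGNATURES:
        if head.startswith(magic):
            return media_type
    # ISO base media files (MP4 and relatives) carry "ftyp" at offset 4.
    if head[4:8] == b"ftyp":
        return "video/mp4"
    return "application/octet-stream"


print(sniff_media_type(b"OggS\x00\x02" + b"\x00" * 6))
print(sniff_media_type(OLE2_MAGIC + b"\x00" * 4))
```

The second call shows the limitation discussed above: any legacy MS Office file yields the same answer regardless of whether it is a Word, Excel or PowerPoint document.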

Klortho commented 11 years ago

Nils, we're changing that text to read, "please make sure you don't do concurrent requests"

erlehmann commented 11 years ago

Klortho, I think this is great news. Can you assure me that there is no automatic blocking mechanism if API access happens too often in a given interval? As soon as you do, I will remove the rate limiting of the OAMI.

Klortho commented 11 years ago

Yes, as long as you limit yourself to OA materials. If you get blocked, let me know.

So, for the paper, we should use this, right?

Referenced in this comment above: https://github.com/erlehmann/open-access-media-importer/issues/102#issuecomment-24862881

erlehmann commented 11 years ago

Yes, that is the right graphic.

erlehmann commented 11 years ago

Rate limiting removed as of 67abe95dc18657ae6dbcd6df89bdadba3bc2ebd8.