wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

YouTube exporter #82

Open Daniel-Mietchen opened 11 years ago

Daniel-Mietchen commented 11 years ago

This issue serves to bundle work on a pipeline to export videos from PMC to YouTube.

wrought commented 11 years ago

It seems YouTube accepts the .ogg format and automatically turns URLs in the description into links, so it should be easy to link back to WMC: http://www.youtube.com/watch?v=JP4hd_PVFSE

wrought commented 11 years ago

So, a first pass at specs for this might be:

  1. Connect to Google/YouTube with OAuth2.
  2. Upload the video and write its metadata through the YouTube API v3: https://developers.google.com/youtube/v3/
  3. Benchmark and throttle if necessary to avoid hitting API limits, which might look something like "200 video uploads, 7000 write operations, and 200,000 read operations that each retrieve three resource parts" per day, totalling approximately 5,000,000 "units" over the API.
  4. Potentially talk to YouTube/Google about lifting the throttle if it becomes an impediment.
  5. Create a manual procedure to fix an outdated, failed, or otherwise incorrect upload.

It seems to me that there are two important components if this is to be integrated with the current service:

  1. Upload to YouTube only after the video has been deposited in Commons, so the proper URL can be used to link back; provide the DOI and a direct-download link as well.
  2. We need to be able to update the YouTube and Commons media independently of one another. Need to investigate how the application is currently used in case of error, update, etc.
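
A minimal sketch of the metadata half of spec step 2, assuming the upload goes through the `videos.insert` method of the YouTube Data API v3. It only builds the request body; the OAuth2 handshake and the file transfer are left out. The function name `build_insert_body` is hypothetical, and category ID "28" (Science & Technology) is an assumption that should be checked against the API's category list.

```python
def build_insert_body(title, description, tags=(), cc_licensed=True):
    """Build the `body` argument for a videos.insert call (part="snippet,status")."""
    return {
        "snippet": {
            # YouTube limits titles to 100 characters, so truncate defensively.
            "title": title[:100],
            # Links back to Commons, the DOI, etc. go in the description.
            "description": description,
            "tags": list(tags),
            # Assumption: "28" is the Science & Technology category.
            "categoryId": "28",
        },
        "status": {
            "privacyStatus": "public",
            # The API distinguishes the standard license from Creative Commons.
            "license": "creativeCommon" if cc_licensed else "youtube",
        },
    }
```

Because the Commons URL is part of the description, this body can only be built once the Commons upload has succeeded, which matches component 1 above.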

Daniel-Mietchen commented 11 years ago

This looks good to me so far.

Some background on what I have in mind with this YouTube exporter:

  1. outreach to the YouTube community (and potentially that of other video sharing sites) about (a) Wikimedia Commons, (b) research and (c) Open Access;
  2. checking the OAMI workflows and readying them for routine operation (keeping track of articles that have suitable materials, and of what has been uploaded when and where) and possibly further plugins;
  3. for those videos that failed to convert through GStreamer, there is a good chance that YouTube does have a way to ingest them, and we could then import the WebM from there into Commons;
  4. outreach to scholarly authors and editors about the benefits (and pitfalls, if any) of reuse-friendly licenses through more comprehensive inclusion of reuse in altmetrics;
  5. testing whether YouTube's "related" material can be of any use in improving categorization of the videos on Commons;
  6. testing the technical and community aspects of sharing media from Commons with other sites and their respective communities (e.g. Flickr, sound archives);
  7. testing the legal ground of such multi-layer reuse in a commercial context (e.g. https://twitter.com/EvoMRI/status/350790898092752896 ).

wrought commented 11 years ago

After reading through more of the tool it seems like it would make the most sense to add:

  1. A few fields in model.py, in particular an uploaded_to_youtube flag.
  2. An upload_media_to_youtube action in oa_put, akin to upload_media.
  3. The requisite helper functions for YouTube, akin to mediawiki.py.
  4. Possibly a second plot to view stats on successful uploads to YouTube.
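
A rough sketch of how items 1 and 2 could fit together. The dataclass is only a stand-in for whatever model.py really defines (its ORM and existing field names are not reproduced here), and the `upload` callable stands in for the YouTube helper from item 3; everything except the `uploaded_to_youtube` and `upload_media_to_youtube` names is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class SupplementaryMaterial:
    """Stand-in for the real model class; field names are assumptions."""
    url: str
    title: str
    uploaded: bool = False             # assumed flag: already on Commons
    uploaded_to_youtube: bool = False  # the new flag proposed above
    youtube_id: Optional[str] = None


def upload_media_to_youtube(
    materials: Iterable[SupplementaryMaterial],
    upload: Callable[[SupplementaryMaterial], str],
) -> List[SupplementaryMaterial]:
    """Upload everything that is on Commons but not yet on YouTube."""
    done = []
    for material in materials:
        # Skip files not yet on Commons, and files already exported.
        if not material.uploaded or material.uploaded_to_youtube:
            continue
        material.youtube_id = upload(material)
        material.uploaded_to_youtube = True
        done.append(material)
    return done
```

Keeping the flag separate from the Commons one means either side can be retried or corrected without touching the other.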

// Matt


Daniel-Mietchen commented 11 years ago

Yes, this makes good sense for a start, but what I have in mind is more complicated.

  1. I see no easy way to tell precisely which articles on PMC have (1) already been checked by the two OAMI crawlers (cf. https://github.com/erlehmann/open-access-media-importer/issues/85 and https://github.com/erlehmann/open-access-media-importer/issues/83 ) or (2) caused problems with conversion or upload (cf. https://github.com/erlehmann/open-access-media-importer/issues/22 and https://github.com/erlehmann/open-access-media-importer/issues/21 and https://github.com/erlehmann/open-access-media-importer/issues?labels=GStreamer&page=1&state=open ). Perhaps it's worth checking and fixing the workflows here first before we expand to YouTube.
  2. YouTube does not accept sound-only files, nor videos below or above certain sizes.
  3. The option to upload under CC licenses is only available to trusted users, and we do not even have an account yet (I would prefer a separate one, perhaps named "WikiProject Open Access" or so).
  4. I would like the YouTube entry to link back to the file and paper on (1) the journal's website, (2) Wikimedia Commons (if both conversion and upload worked) and (3) PMC, and to the original license (which is not always CC BY 3.0, nor even CC BY; cf. https://twitter.com/invisiblecomma/status/345101287580385280 ).
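
One way the link-backs from item 4 could be assembled into a YouTube description, as a sketch only. The function name, parameter names and line layout are all assumptions; the license is passed in per file rather than hard-coded, since it is not always CC BY.

```python
def build_description(title, license_name, license_url,
                      journal_url, pmc_url, commons_url=None, doi=None):
    """Assemble a plain-text description; YouTube turns the URLs into links."""
    lines = [title, ""]
    if doi:
        lines.append("DOI: https://doi.org/" + doi)
    lines.append("Article: " + journal_url)
    lines.append("PubMed Central: " + pmc_url)
    # The Commons link is optional: conversion or upload may have failed there.
    if commons_url:
        lines.append("Wikimedia Commons: " + commons_url)
    lines.append("License: %s (%s)" % (license_name, license_url))
    return "\n".join(lines)
```

Making `commons_url` optional covers the case where a file reaches YouTube even though the Commons conversion or upload did not work.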

wrought commented 11 years ago

From my point of view, while there are some bugs and edge cases, there are no technical "blockers" for developing the feature to deposit these videos on YouTube as well as on a MediaWiki instance. There are some issues with identifying materials and whether they have been uploaded, but I think that is an existing bug that requires its own development. I would rather not go too far down that path now, but I would be interested in helping later on. For the time being, I think it is reasonable and within scope technically to extend to YouTube as I mentioned above.

As for audio-only files, the error handling should catch them, so no worries. The same goes for video size limits: rejected files should be logged as errors and kept in a queue of backlogged uploads.
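
A small sketch of what that error handling could look like, assuming each candidate is a dict with `name`, `mimetype` and `size` keys (hypothetical structure). YouTube's actual size limits are not stated in this thread, so they are left as parameters.

```python
import logging
from collections import deque

log = logging.getLogger("youtube_export")


def triage(candidates, min_bytes, max_bytes):
    """Split candidates into uploadable files and a backlog of rejects."""
    ready = []
    backlog = deque()  # (candidate, reason) pairs kept for later review
    for candidate in candidates:
        if not candidate["mimetype"].startswith("video/"):
            reason = "not a video"
        elif not (min_bytes <= candidate["size"] <= max_bytes):
            reason = "size out of range"
        else:
            ready.append(candidate)
            continue
        log.error("skipping %s: %s", candidate["name"], reason)
        backlog.append((candidate, reason))
    return ready, backlog
```

Note that the MIME type check is only a first filter; an Ogg container labelled as video can still turn out to hold audio only.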

Daniel-Mietchen commented 11 years ago

Agreed.

There are a number of design decisions we made early on that make life difficult with the OAMI now that we are handling thousands of files from hundreds of journals. Some of these decisions would have been different if we had known about the problems in the XML (see also http://chrismaloney.org/notes/OAMI%20JatsCon%20Submission,%202013 - accepted by now).

So I think the point for this weekend is to get a demo for a PMC-to-YouTube workflow going (possibly via Commons), with a few files (say, on the order of 100), then fine-tune that workflow over the coming weeks, with the goal of having the channel in full operation by OA week.

wrought commented 11 years ago

Some work put in so far on this branch: https://github.com/wrought/open-access-media-importer/tree/youtube

Will see about throttling. The tool currently does this with a single sleep() call, so I can do the same ;)
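
For illustration, a sleep()-based throttle could look roughly like this. The helper names are hypothetical, and the interval is derived from the "200 video uploads" per day figure quoted earlier in this thread, which may not match the quota that actually applies.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_between_uploads(uploads_per_day):
    """Spread a daily upload budget evenly over 24 hours."""
    return SECONDS_PER_DAY / float(uploads_per_day)


def throttled(items, interval, sleep=time.sleep):
    """Yield items one at a time, sleeping `interval` seconds between them."""
    for index, item in enumerate(items):
        if index:  # no need to wait before the first item
            sleep(interval)
        yield item


# With a budget of 200 uploads per day, that is one upload every 432 seconds.
INTERVAL = seconds_between_uploads(200)
```

The `sleep` parameter exists only so the loop can be tested without actually waiting.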

Thought about metadata: all of it will be posted in the YouTube description, which automatically converts links. Are there any cases where supplemental files are given their own DOI? If so, we don't seem to be accessing that information currently. If it's a fringe case, it's not worth it anyhow.

Daniel-Mietchen commented 11 years ago

The PLOS ONE YouTube channel has 1.2M views from ca. 100 files http://www.youtube.com/user/channelplosone/videos?view=0&sort=p&flow=grid . 1.1M of these are of one video: http://www.youtube.com/watch?v=g1y7ASI3ZkQ

The Pensoft YouTube channel has 500k views from ca. 20 videos: http://www.youtube.com/user/PensoftPublishers/videos?flow=grid&view=0&sort=p

BMC: 100k views on ca. 200 videos, mostly about open access http://www.youtube.com/user/BioMedCentral/videos?sort=p&view=0&flow=grid

Daniel-Mietchen commented 11 years ago

There is currently no tool to expose how often a video or audio file embedded in a Wikipedia article has actually been played.

Daniel-Mietchen commented 11 years ago

Some other channels related to science:

http://www.youtube.com/user/edyong209/videos?sort=p&view=0&flow=grid
http://www.youtube.com/user/RoyalSociety/videos?sort=p&view=0&flow=grid
http://www.youtube.com/user/zfaulkes/videos?sort=p&view=0&flow=grid
http://www.youtube.com/user/SmithsonianScience?sort=p&view=0&flow=grid
http://www.youtube.com/user/CERNTV?sort=p&view=0&flow=grid
http://www.youtube.com/user/NASAexplorer?sort=p&view=0&flow=grid

PLOS videos: http://www.youtube.com/results?search_sort=video_view_count&search_query=plos&search_type=videos

Daniel-Mietchen commented 11 years ago

A snapshot of the files uploaded by the bot to Commons that get most views via Wikipedia: http://www.webcitation.org/6HvxubbRD , calculated via http://tools.wmflabs.org/glamtools/glamorous.php?doit=1&category=Uploaded+with+Open+Access+Media+Importer&use_globalusage=1&ns0=1&show_details=1

wrought commented 11 years ago

Test uploads should show up at http://youtube.com/wikiprojectoatest and the live channel will be at http://youtube.com/wikiprojectoa