Open snipem opened 9 years ago
Here is one: http://www.mmafighting.com/2014/2/2/5370376/ufc-169-post-fight-show
This page contains both YouTube and Ooyala videos, but youtube-dl detects the YouTube video first, so the Ooyala video is not downloaded at all.
In the file youtube-dl/youtube_dl/extractor/generic.py I removed some of the return statements in the method _real_extract. It was then able to extract more videos from different video services. But then I ran into this error:
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'https://tifrib.com/said-rageah/']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.04.26
[debug] Git HEAD: e8bfe2a
[debug] Python version 2.7.12 - Linux-4.4.0-72-generic-x86_64-with-Ubuntu-16.04-xenial
[debug] exe versions: ffmpeg 2.8.11-0ubuntu0.16.04.1, ffprobe 2.8.11-0ubuntu0.16.04.1
[debug] Proxy map: {}
--- self._real_extract
--- Called _real_extract for embeded URLs
--- https://tifrib.com/said-rageah/
[generic] said-rageah: Requesting header
WARNING: Falling back on generic information extractor.
[generic] said-rageah: Downloading webpage
[generic] said-rageah: Extracting information
--- Look for embedded YouTube player
--- Found embedded Youtube video
[u'https://videopress.com/embed/4BajuZCH', u'https://videopress.com/embed/X1is4uyi', u'https://videopress.com/embed/aJlE15aE', u'https://videopress.com/embed/SV3AWSeV']
ERROR: Unsupported URL: https://tifrib.com/said-rageah/
Traceback (most recent call last):
File "youtube_dl/extractor/generic.py", line 1916, in _real_extract
doc = compat_etree_fromstring(webpage.encode('utf-8'))
File "youtube_dl/compat.py", line 2526, in compat_etree_fromstring
doc = _XML(text, parser=etree.XMLParser(target=_TreeBuilder(element_factory=_element_factory)))
File "youtube_dl/compat.py", line 2515, in _XML
parser.feed(text)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1653, in feed
self._raiseerror(v)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1517, in _raiseerror
raise err
ParseError: not well-formed (invalid token): line 42, column 344
Traceback (most recent call last):
File "youtube_dl/YoutubeDL.py", line 760, in extract_info
ie_result = ie.extract(url)
File "youtube_dl/extractor/common.py", line 430, in extract
ie_result = self._real_extract(url)
File "youtube_dl/extractor/generic.py", line 2786, in _real_extract
raise UnsupportedError(url)
I think it's because I removed too many return statements, so youtube-dl fell through to a default extractor that did not recognize anything. So I don't think it will be hard for me to find a solution to this.
I am posting this here because I would like your feedback on the strategy I have chosen to resolve this issue.
@yan12125 I tried your URL (http://www.mmafighting.com/2014/2/2/5370376/ufc-169-post-fight-show). I was only able to download one of the videos (the one from Ooyala). I am not sure why yet.
@yan12125 I just looked at the log on my terminal... It seems it found the Youtube video but it's not downloading it for some reason.
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'http://www.mmafighting.com/2014/2/2/5370376/ufc-169-post-fight-show']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.04.26
[debug] Git HEAD: e8bfe2a
[debug] Python version 2.7.12 - Linux-4.4.0-72-generic-x86_64-with-Ubuntu-16.04-xenial
[debug] exe versions: ffmpeg 2.8.11-0ubuntu0.16.04.1, ffprobe 2.8.11-0ubuntu0.16.04.1
[debug] Proxy map: {}
--- self._real_extract
--- Called _real_extract for embeded URLs
--- http://www.mmafighting.com/2014/2/2/5370376/ufc-169-post-fight-show
[generic] ufc-169-post-fight-show: Requesting header
WARNING: Falling back on generic information extractor.
[generic] ufc-169-post-fight-show: Downloading webpage
[generic] ufc-169-post-fight-show: Extracting information
--- Look for embedded YouTube player
--- Found embedded Youtube video
[]
--- self._real_extract
[Ooyala] 5mdXVoazrZPFMEwA751Q-TJ5NH0KAz2j: Downloading JSON metadata
[Ooyala] 5mdXVoazrZPFMEwA751Q-TJ5NH0KAz2j: Downloading JSON metadata
[Ooyala] 5mdXVoazrZPFMEwA751Q-TJ5NH0KAz2j: Downloading m3u8 information
[debug] Invoking downloader on u'http://player.ooyala.com/player/all/5mdXVoazrZPFMEwA751Q-TJ5NH0KAz2j_4000.m3u8'
[download] UFC 169 post-fight show-5mdXVoazrZPFMEwA751Q-TJ5NH0KAz2j.mp4 has already been downloaded
[download] 100% of 386.05MiB
[debug] ffmpeg command line: ffprobe -show_streams 'file:UFC 169 post-fight show-5mdXVoazrZPFMEwA751Q-TJ5NH0KAz2j.mp4'
[ffmpeg] Fixing malformated aac bitstream in "UFC 169 post-fight show-5mdXVoazrZPFMEwA751Q-TJ5NH0KAz2j.mp4"
[debug] ffmpeg command line: ffmpeg -y -i 'file:UFC 169 post-fight show-5mdXVoazrZPFMEwA751Q-TJ5NH0KAz2j.mp4' -c copy -f mp4 -bsf:a aac_adtstoasc 'file:UFC 169 post-fight show-5mdXVoazrZPFMEwA751Q-TJ5NH0KAz2j.temp.mp4'
Removing return statements is not enough. A generic approach is needed to combine different URLs from different extractors in generic.py.
"combine different URLs from different extractors in generic.py" - How? I am willing to do it but I am unsure of what you mean.
For example, pages with Brightcove videos yield a playlist:
    return {
        '_type': 'playlist',
        'title': video_title,
        'id': video_id,
        'entries': entries,
    }
And Wistia videos give a transparent URL:
    return {
        '_type': 'url_transparent',
        'url': embed_url,
        'ie_key': 'Wistia',
        'uploader': video_uploader,
    }
The overall result can be a playlist of them (I'm not sure whether this approach can handle all possible cases or not):
    return {
        '_type': 'playlist',
        'entries': [{
            '_type': 'playlist',
            'title': video_title,
            'id': video_id,
            'entries': entries,
        }, {
            '_type': 'url_transparent',
            'url': embed_url,
            'ie_key': 'Wistia',
            'uploader': video_uploader,
        }]
    }
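A minimal sketch of how that combining step might look, in case it helps the discussion. It is not the actual generic.py code: find_youtube, find_videopress, DETECTORS and collect_embeds are hypothetical names, the regexes are deliberately simplistic, and the 'VideoPress' ie_key is assumed. The idea is that each per-service check returns a list of results instead of returning from _real_extract on the first hit, and the results are wrapped in a playlist only when more than one is found.

```python
import re

# Hypothetical stand-ins for the per-service checks in generic.py.
# The real extractor uses its own helpers and much more robust patterns.
def find_youtube(webpage):
    urls = re.findall(
        r'<iframe[^>]+src="(https?://www\.youtube\.com/embed/[^"]+)"', webpage)
    return [{'_type': 'url', 'url': u, 'ie_key': 'Youtube'} for u in urls]


def find_videopress(webpage):
    urls = re.findall(
        r'<iframe[^>]+src="(https?://videopress\.com/embed/[^"]+)"', webpage)
    # 'VideoPress' as ie_key is an assumption made for this sketch.
    return [{'_type': 'url', 'url': u, 'ie_key': 'VideoPress'} for u in urls]


# Order only affects the order of entries, not which ones are kept.
DETECTORS = [find_youtube, find_videopress]


def collect_embeds(webpage, page_id, page_title):
    """Run every detector and merge whatever they find into one result."""
    entries = []
    for detect in DETECTORS:
        # Keep going instead of returning on the first match.
        entries.extend(detect(webpage))
    if not entries:
        return None  # caller can then raise UnsupportedError as before
    if len(entries) == 1:
        return entries[0]  # a single embed needs no playlist wrapper
    return {
        '_type': 'playlist',
        'id': page_id,
        'title': page_title,
        'entries': entries,
    }
```

Entries that are themselves playlists (like the Brightcove case above) could be appended to the same list unchanged, since a playlist entry may be another playlist.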
Let me try it out. It might not be perfect but over time we can correct the code.
Treat a URL as a playlist if more than one video URL is found. This should apply to every URL that is handled by the generic video extractor.