Open powerfrontier opened 2 years ago
From the UK, the streaming formats are all geo-blocked (403) and fail quickly, giving just
...
[info] Available formats for 6643979:
format code  extension  resolution  note
Alta         mp4        unknown
HQ           mp4        unknown
HD_READY     mp4        unknown
HD_FULL      mp4        unknown     (best)
However, the problem URL from #30148 does give similar behaviour. On investigation, the MPD data (https://rtve-hlsvod.secure.footprint.net/resources/TE_GL45/mp4/7/4/1643791217247.mp4/video.mpd?idasset=5092259) here is 1GB, so it could easily take minutes to download, or hours on a slow connection. I expect that is what you are seeing. Why is it 1GB? Because it's actually an entire MP4 file and not a manifest at all. The extractor looks at the .mpd in the final part of the video URL path and decides that it must be a DASH manifest. The site sends Content-type: text/xml, so that's no use.
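To illustrate why the URL is misclassified, here is a minimal Python 3 sketch of extension guessing from the last path segment. The helper name guess_ext_from_url is hypothetical and this is not youtube-dl's actual implementation; it only shows that the .mp4 directory component earlier in the path is never considered.

```python
import posixpath
from urllib.parse import urlparse


def guess_ext_from_url(url):
    """Return the extension of the last path segment, ignoring the query string."""
    path = urlparse(url).path
    last_segment = posixpath.basename(path)
    return posixpath.splitext(last_segment)[1].lstrip('.')


url = ('https://rtve-hlsvod.secure.footprint.net/resources/TE_GL45/mp4/7/4/'
       '1643791217247.mp4/video.mpd?idasset=5092259')
ext = guess_ext_from_url(url)
print(ext)
```

Only video.mpd is inspected, so a check based on the URL alone cannot tell this resource apart from a real manifest.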
As the MP4 data will fail XML parsing (see your log) there's no point spending time on it. We can make a small hack to detect an incorrectly typed URL by checking the beginning of the returned data:
--- old/youtube_dl/extractor/rtve.py
+++ new/youtube_dl/extractor/rtve.py
@@ -9,6 +9,7 @@ import sys
 from .common import InfoExtractor
 from ..compat import (
     compat_b64decode,
+    compat_kwargs,
     compat_struct_unpack,
 )
 from ..utils import (
@@ -118,6 +119,20 @@ class RTVEALaCartaIE(InfoExtractor):
                 yield quality.decode(), url
             encrypted_data.read(4)  # CRC

+    def _webpage_read_content(self, urlh, url_or_request, video_id, *args, **kwargs):
+
+        content_test = getattr(self, '__content_test', None)
+        if callable(content_test):
+            content = urlh.read(512)
+            if not content_test(content):
+                raise ExtractorError('Unexpected content', cause=content_test, video_id=video_id)
+
+            prefix = kwargs.pop('prefix', None)
+            kwargs['prefix'] = content if prefix is None else prefix + content
+            kwargs = compat_kwargs(kwargs)
+
+        return super(RTVEALaCartaIE, self)._webpage_read_content(urlh, url_or_request, video_id, *args, **kwargs)
+
     def _extract_png_formats(self, video_id):
         png = self._download_webpage(
             'http://www.rtve.es/ztnr/movil/thumbnail/%s/videos/%s.png' % (self._manager, video_id),
@@ -131,8 +146,20 @@ class RTVEALaCartaIE(InfoExtractor):
                     video_url, video_id, 'mp4', 'm3u8_native',
                     m3u8_id='hls', fatal=False))
             elif ext == 'mpd':
-                formats.extend(self._extract_mpd_formats(
-                    video_url, video_id, 'dash', fatal=False))
+                try:
+                    setattr(self, '__content_test', lambda x: b'ftypiso' not in x[:20])
+                    formats.extend(self._extract_mpd_formats(
+                        video_url, video_id, 'dash', fatal=False))
+                except ExtractorError as e:
+                    if e.cause is not getattr(self, '__content_test', None):
+                        raise
+                    formats.append({
+                        'format_id': quality,
+                        'quality': q(quality),
+                        'url': video_url,
+                    })
+                finally:
+                    setattr(self, '__content_test', None)
             else:
                 formats.append({
                     'format_id': quality,
I'd be very happy to receive offers to redesign the downloading methods of InfoExtractor, rather than this ad hoc hack.
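Isolated from youtube-dl, the content test used by the hack is tiny. The following is a minimal sketch of that one check; the function name and both sample payloads are hypothetical, made up for illustration, and the check is deliberately the same narrow one as in the patch (an ftyp box with an iso* brand in the first 20 bytes), not a general MP4 detector.

```python
def looks_like_iso_mp4(head):
    """Same test as the patch: 'ftyp' followed by an 'iso*' brand near the start."""
    return b'ftypiso' in head[:20]


# Hypothetical sample payloads, made up for this sketch.
# An MP4 file starts with a 4-byte box size, then the box type 'ftyp',
# then the major brand (here 'isom').
fake_mp4_head = b'\x00\x00\x00\x20ftypisom\x00\x00\x02\x00isomiso2'
fake_mpd_head = b'<?xml version="1.0"?><MPD></MPD>'

mp4_detected = looks_like_iso_mp4(fake_mp4_head)
mpd_detected = looks_like_iso_mp4(fake_mpd_head)
print(mp4_detected, mpd_detected)
```

Because only the first few bytes are needed, the decision can be made before the remaining 1GB is read.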
Then (although it's not clear from the transcript, there were no hangs in this run):
$ python -m youtube_dl -v -F 'https://www.rtve.es/alacarta/videos/la-caza/caza-monteperdido-capitulo-1-deshielo/5092259/'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://www.rtve.es/alacarta/videos/la-caza/caza-monteperdido-capitulo-1-deshielo/5092259/']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: a03b9775d
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[rtve.es:alacarta] Fetching manager info
[rtve.es:alacarta] 5092259: Downloading JSON metadata
[rtve.es:alacarta] 5092259: Downloading url information
[rtve.es:alacarta] 5092259: Downloading m3u8 information
[rtve.es:alacarta] 5092259: Downloading MPD manifest
[rtve.es:alacarta] 5092259: Downloading m3u8 information
[rtve.es:alacarta] 5092259: Downloading MPD manifest
[rtve.es:alacarta] 5092259: Downloading m3u8 information
[rtve.es:alacarta] 5092259: Downloading MPD manifest
[rtve.es:alacarta] 5092259: Downloading m3u8 information
[rtve.es:alacarta] 5092259: Downloading MPD manifest
[info] Available formats for 5092259:
format code  extension  resolution note
hls-1211     mp4        640x360    1211k , avc1.100.41@1013k, 25.0fps, mp4a.40.2@128k
hls-2113     mp4        1024x576   2113k , avc1.100.41@1864k, 25.0fps, mp4a.40.2@128k
hls-3132     mp4        1280x720   3132k , avc1.100.41@2761k, 25.0fps, mp4a.40.2@192k
hls-4990     mp4        1920x1080  4990k , avc1.100.41@4514k, 25.0fps, mp4a.40.2@192k
Alta-0       mpd        unknown
Alta-1       mp4        unknown
HQ-0         mpd        unknown
HQ-1         mp4        unknown
HD_READY-0   mpd        unknown
HD_READY-1   mp4        unknown
HD_FULL-0    mpd        unknown
HD_FULL-1    mp4        unknown    (best)
$
Thanks @dirkf for the quick response and the patch. I'm a noob and I don't know how to apply that patch on my side, but I will do some research on it when I have more spare time tomorrow. Is there any possibility of getting that patch into the master branch with a pull request?
Applying a patch can be easy-ish or tricky depending on what type of yt-dl installation you have.
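As a first step for a pip-style installation, it helps to know where the youtube_dl package actually lives, since the paths in the diffs are relative to the directory that contains it. A minimal sketch, assuming Python 3 and an importable package; the helper name package_location is hypothetical, and this does not apply to single-file binary or zipped installs.

```python
import importlib.util
import os


def package_location(name):
    """Return the directory of an importable top-level package, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


location = package_location('youtube_dl')
print(location or 'youtube_dl is not importable from this Python')
```

If a directory is printed, extractor/rtve.py and extractor/common.py inside it are the files the diffs refer to.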
For a PR I'd like to have a proper code solution. The hack was limited to what could be done in just the extractor. There should be a hook for this sort of thing.
This seems a plausible approach:
--- old/youtube_dl/extractor/common.py
+++ new/youtube_dl/extractor/common.py
@@ -654,6 +654,16 @@
                 self._downloader.report_warning(errmsg)
                 return False

+    def _webpage_check_content(self, urlh, url_or_request, video_id, note=None, errnote=None, fatal=True, encoding=None):
+        """
+        Check the content is as expected: if not, raise ExtractorError, or, if not fatal, may return False
+
+        Return the content read from urlh if as expected
+        """
+
+        # by default, expect any content
+        return None
+
     def _download_webpage_handle(self, url_or_request, video_id, note=None, errnote=None, fatal=True, encoding=None, data=None, headers={}, query={}, expected_status=None):
         """
         Return a tuple (page content as string, URL handle).
@@ -668,7 +678,13 @@
         if urlh is False:
             assert not fatal
             return False
-        content = self._webpage_read_content(urlh, url_or_request, video_id, note, errnote, fatal, encoding=encoding)
+
+        first_bytes = self._webpage_check_content(urlh, url_or_request, video_id, note, errnote, fatal, encoding)
+        if first_bytes is False:
+            assert not fatal
+            return False
+
+        content = self._webpage_read_content(urlh, url_or_request, video_id, note, errnote, fatal, encoding=encoding, prefix=first_bytes)
         return (content, urlh)

     @staticmethod
Now we can just do this in the extractor:
--- old/youtube_dl/extractor/rtve.py
+++ new/youtube_dl/extractor/rtve.py
@@ -118,6 +118,16 @@
                 yield quality.decode(), url
             encrypted_data.read(4)  # CRC

+    def _webpage_check_content(self, urlh, url_or_request, video_id, *args, **kwargs):
+
+        content = urlh.read(20)
+        content_type = urlh.headers.get('Content-Type', '')
+
+        if '/xml' in content_type and b'ftypiso' in content:
+            # ignore fatal to distinguish this case
+            raise ExtractorError('Unexpected content for ' + content_type, cause=content, video_id=video_id)
+        return content
+
     def _extract_png_formats(self, video_id):
         png = self._download_webpage(
             'http://www.rtve.es/ztnr/movil/thumbnail/%s/videos/%s.png' % (self._manager, video_id),
@@ -130,15 +140,20 @@
                 formats.extend(self._extract_m3u8_formats(
                     video_url, video_id, 'mp4', 'm3u8_native',
                     m3u8_id='hls', fatal=False))
-            elif ext == 'mpd':
-                formats.extend(self._extract_mpd_formats(
-                    video_url, video_id, 'dash', fatal=False))
-            else:
-                formats.append({
-                    'format_id': quality,
-                    'quality': q(quality),
-                    'url': video_url,
-                })
+                continue
+            if ext == 'mpd':
+                try:
+                    formats.extend(self._extract_mpd_formats(
+                        video_url, video_id, 'dash', fatal=False))
+                    continue
+                except ExtractorError as e:
+                    if not e.msg.startswith('Unexpected content '):
+                        raise
+            formats.append({
+                'format_id': quality,
+                'quality': q(quality),
+                'url': video_url,
+            })
         self._sort_formats(formats)
         return formats
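The shape of this design can be tried outside youtube-dl. The following is a toy sketch, not the real InfoExtractor API: every class, function, URL and payload in it is hypothetical. It shows the three moving parts of the two diffs above: a base-class hook that accepts anything by default, a subclass override that sniffs the first bytes and hands them back as a prefix, and a caller that falls back to a direct-URL format when the content is rejected.

```python
import io


class UnexpectedContent(Exception):
    """Raised by a hook when the body is not what the caller expected."""


class Downloader(object):
    """Stand-in for the base class: run the hook first, then read the rest."""

    def check_content(self, stream):
        # Default hook: accept anything and consume nothing.
        return None

    def download(self, stream):
        first_bytes = self.check_content(stream)
        rest = stream.read()
        # Bytes consumed by the hook must be put back in front of the body.
        return rest if first_bytes is None else first_bytes + rest


class ManifestDownloader(Downloader):
    """Stand-in for the extractor: reject MP4 data served as a manifest."""

    def check_content(self, stream):
        head = stream.read(20)
        if b'ftypiso' in head:
            raise UnexpectedContent('Unexpected content')
        return head


def formats_for(url, stream):
    """Parse as a manifest if possible, else fall back to a direct format."""
    try:
        return {'manifest': ManifestDownloader().download(stream)}
    except UnexpectedContent:
        return {'url': url}


# Hypothetical sample payloads and URLs, made up for this sketch.
mp4_bytes = b'\x00\x00\x00\x20ftypisom' + b'\x00' * 32
mpd_bytes = b'<?xml version="1.0"?><MPD></MPD>'

as_manifest = formats_for('https://example.com/video.mpd', io.BytesIO(mpd_bytes))
as_direct = formats_for('https://example.com/file.mp4/video.mpd', io.BytesIO(mp4_bytes))
print(as_manifest, as_direct)
```

Returning the sniffed bytes from the hook keeps the stream usable without seeking, which matters for HTTP responses that cannot be rewound.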
Description
It gets stuck on the "Downloading MPD manifest" line for about 2-3 minutes several times, as you can see in the log, and finally the video gets downloaded. I don't know whether those waiting times are expected behavior or not. If they are not, that is the issue: not a broken site, but at least a malfunction. I also tried the latest version (youtube-dl 2022.07.12.810) from the more up-to-date builds at https://github.com/ytdl-patched/youtube-dl, with the same result.