rtve.es:infantil and multiple waiting times

powerfrontier commented 2 years ago

Checklist

[x] I'm reporting a broken site support
[x] I've verified that I'm running youtube-dl version 2021.12.17
[x] I've checked that all provided URLs are alive and playable in a browser
[x] I've checked that all URLs and arguments with special characters are properly quoted or escaped
[x] I've searched the bugtracker for similar issues including closed ones

Verbose log

$ youtube-dl -v https://www.rtve.es/infantil/serie/pinocho/video/dos-pinochos/6643979/
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', 'https://www.rtve.es/infantil/serie/pinocho/video/dos-pinochos/6643979/']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.10.5 (CPython) - Linux-5.15.53-1-MANJARO-x86_64-with-glibc2.35
[debug] exe versions: ffmpeg 5.0.1, ffprobe 5.0.1, rtmpdump 2.4
[debug] Proxy map: {}
[rtve.es:infantil] Fetching manager info
[rtve.es:infantil] 6643979: Downloading JSON metadata
[rtve.es:infantil] 6643979: Downloading url information
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading MPD manifest
[rtve.es:infantil] 6643979: Downloading MPD manifest
WARNING: [rtve.es:infantil] 6643979: Failed to parse XML not well-formed (invalid token): line 1, column 0
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading MPD manifest
[rtve.es:infantil] 6643979: Downloading MPD manifest
WARNING: [rtve.es:infantil] 6643979: Failed to parse XML not well-formed (invalid token): line 1, column 0
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading MPD manifest
[rtve.es:infantil] 6643979: Downloading MPD manifest
WARNING: [rtve.es:infantil] 6643979: Failed to parse XML not well-formed (invalid token): line 1, column 0
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading m3u8 information
[rtve.es:infantil] 6643979: Downloading MPD manifest
[rtve.es:infantil] 6643979: Downloading MPD manifest
WARNING: [rtve.es:infantil] 6643979: Failed to parse XML not well-formed (invalid token): line 1, column 0
[debug] Default format spec: bestvideo+bestaudio/best
[debug] Invoking downloader on 'https://lote5-vod-hls-geoblockurl.akamaized.net/resources/TE_GLEAD/mp4/7/6/1657130979867.mp4/video.mpd?idasset=6643979'
[dashsegments] Total fragments: 696
[download] Destination: Los dos Pinochos-6643979.fdash-video=3959000.mp4
[download] 100% of 327.95MiB in 01:12
[debug] Invoking downloader on 'https://lote5-vod-hls-geoblockurl.akamaized.net/resources/TE_GLEAD/mp4/7/6/1657130979867.mp4/video.mpd?idasset=6643979'
[dashsegments] Total fragments: 349
[download] Destination: Los dos Pinochos-6643979.fdash-audio=193202-1.m4a
[download] 100% of 16.19MiB in 00:19
[ffmpeg] Merging formats into "Los dos Pinochos-6643979.mp4"
[debug] ffmpeg command line: ffmpeg -y -loglevel repeat+info -i 'file:Los dos Pinochos-6643979.fdash-video=3959000.mp4' -i 'file:Los dos Pinochos-6643979.fdash-audio=193202-1.m4a' -c copy -map 0:v:0 -map 1:a:0 'file:Los dos Pinochos-6643979.temp.mp4'
Deleting original file Los dos Pinochos-6643979.fdash-video=3959000.mp4 (pass -k to keep)
Deleting original file Los dos Pinochos-6643979.fdash-audio=193202-1.m4a (pass -k to keep)

Description

It gets stuck on the "Downloading MPD manifest" line for about 2-3 minutes, several times, as you can see in the log, and finally the video gets downloaded. I don't know whether those waiting times are expected behavior. If they are not, then that is the issue: not a broken site, but a malfunction at least. I also tried the latest version (youtube-dl 2022.07.12.810) from the more up-to-date builds at https://github.com/ytdl-patched/youtube-dl, with the same result.

dirkf commented 2 years ago

From the UK, the streaming formats are all geo-blocked (403) and fail quickly, giving just

...
[info] Available formats for 6643979:
format code  extension  resolution note
Alta         mp4        unknown    
HQ           mp4        unknown    
HD_READY     mp4        unknown    
HD_FULL      mp4        unknown    (best)

However the problem URL from #30148 does give similar behaviour. On investigation the MPD data (https://rtve-hlsvod.secure.footprint.net/resources/TE_GL45/mp4/7/4/1643791217247.mp4/video.mpd?idasset=5092259) here is 1GB so it could easily take minutes to download, or hours on a slow connection. I expect that is what you are seeing. Why is it 1GB? Because it's actually an entire MP4 file and not a manifest at all. The extractor looks at the .mpd in the final part of the video URL path and decides that it must be a DASH manifest. The site sends Content-type: text/xml so that's no use.

As the MP4 data will fail XML parsing (see your log) there's no point spending time on it. We can make a small hack to detect an incorrectly typed URL by checking the beginning of the returned data:

--- old/youtube_dl/extractor/rtve.py
+++ new/youtube_dl/extractor/rtve.py
@@ -9,6 +9,7 @@ import sys
 from .common import InfoExtractor
 from ..compat import (
     compat_b64decode,
+    compat_kwargs,
     compat_struct_unpack,
 )
 from ..utils import (
@@ -118,6 +119,20 @@ class RTVEALaCartaIE(InfoExtractor):
                 yield quality.decode(), url
             encrypted_data.read(4)  # CRC

+    def _webpage_read_content(self, urlh, url_or_request, video_id, *args, **kwargs):
+
+        content_test = getattr(self, '__content_test', None)
+        if callable(content_test):
+            content = urlh.read(512)
+            if not content_test(content):
+                raise ExtractorError('Unexpected content', cause=content_test, video_id=video_id)
+
+            prefix = kwargs.pop('prefix', None)  # the caller may not pass prefix at all
+            kwargs['prefix'] = content if prefix is None else prefix + content
+            kwargs = compat_kwargs(kwargs)
+
+        return super(RTVEALaCartaIE, self)._webpage_read_content(urlh, url_or_request, video_id, *args, **kwargs)
+
     def _extract_png_formats(self, video_id):
         png = self._download_webpage(
             'http://www.rtve.es/ztnr/movil/thumbnail/%s/videos/%s.png' % (self._manager, video_id),
@@ -131,8 +146,20 @@ class RTVEALaCartaIE(InfoExtractor):
                     video_url, video_id, 'mp4', 'm3u8_native',
                     m3u8_id='hls', fatal=False))
             elif ext == 'mpd':
-                formats.extend(self._extract_mpd_formats(
-                    video_url, video_id, 'dash', fatal=False))
+                try:
+                    setattr(self, '__content_test', lambda x: b'ftypiso' not in x[:20])
+                    formats.extend(self._extract_mpd_formats(
+                        video_url, video_id, 'dash', fatal=False))
+                except ExtractorError as e:
+                    if e.cause is not getattr(self, '__content_test', None):
+                        raise
+                    formats.append({
+                        'format_id': quality,
+                        'quality': q(quality),
+                        'url': video_url,
+                    })
+                finally:
+                    setattr(self, '__content_test', None)
             else:
                 formats.append({
                     'format_id': quality,

I'd be very happy to receive offers to redesign the downloading methods of InfoExtractor, rather than this ad hoc hack.
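
The content check used in the patch above can be tried in isolation. This is only a minimal sketch, not youtube-dl code: the function names `looks_like_mp4` and `looks_like_xml` are hypothetical, and it assumes the MP4 data opens with an `ftyp` box (4-byte size, then the box type), which is a slightly stricter test than the `b'ftypiso' in x[:20]` check in the patch.

```python
def looks_like_mp4(head):
    """Return True if `head` (the first bytes of a response) looks like ISO BMFF/MP4 data.

    Assumption: the file opens with an 'ftyp' box, i.e. a 4-byte
    big-endian size followed by the 4-byte box type b'ftyp'.
    """
    return len(head) >= 8 and head[4:8] == b'ftyp'


def looks_like_xml(head):
    """Return True if `head` plausibly starts an XML document such as an MPD manifest."""
    # Tolerate a UTF-8 byte order mark and leading whitespace before the first '<'.
    stripped = head.lstrip(b'\xef\xbb\xbf \t\r\n')
    return stripped.startswith(b'<')


# Hypothetical sample data: the start of an MP4 file and the start of a manifest.
mp4_head = b'\x00\x00\x00\x18ftypiso5\x00\x00\x00\x01'
mpd_head = b'<?xml version="1.0" encoding="utf-8"?><MPD>'

for name, head in (('mp4', mp4_head), ('mpd', mpd_head)):
    print(name, looks_like_mp4(head), looks_like_xml(head))
```

Only the first few bytes are needed, so a 1GB response can be rejected without reading the rest of it.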

Then (although it's not clear from the transcript, there were no hangs in this run):

$ python -m youtube_dl -v -F 'https://www.rtve.es/alacarta/videos/la-caza/caza-monteperdido-capitulo-1-deshielo/5092259/' 
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://www.rtve.es/alacarta/videos/la-caza/caza-monteperdido-capitulo-1-deshielo/5092259/']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: a03b9775d
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[rtve.es:alacarta] Fetching manager info
[rtve.es:alacarta] 5092259: Downloading JSON metadata
[rtve.es:alacarta] 5092259: Downloading url information
[rtve.es:alacarta] 5092259: Downloading m3u8 information
[rtve.es:alacarta] 5092259: Downloading MPD manifest
[rtve.es:alacarta] 5092259: Downloading m3u8 information
[rtve.es:alacarta] 5092259: Downloading MPD manifest
[rtve.es:alacarta] 5092259: Downloading m3u8 information
[rtve.es:alacarta] 5092259: Downloading MPD manifest
[rtve.es:alacarta] 5092259: Downloading m3u8 information
[rtve.es:alacarta] 5092259: Downloading MPD manifest
[info] Available formats for 5092259:
format code  extension  resolution note
hls-1211     mp4        640x360    1211k , avc1.100.41@1013k, 25.0fps, mp4a.40.2@128k
hls-2113     mp4        1024x576   2113k , avc1.100.41@1864k, 25.0fps, mp4a.40.2@128k
hls-3132     mp4        1280x720   3132k , avc1.100.41@2761k, 25.0fps, mp4a.40.2@192k
hls-4990     mp4        1920x1080  4990k , avc1.100.41@4514k, 25.0fps, mp4a.40.2@192k
Alta-0       mpd        unknown    
Alta-1       mp4        unknown    
HQ-0         mpd        unknown    
HQ-1         mp4        unknown    
HD_READY-0   mpd        unknown    
HD_READY-1   mp4        unknown    
HD_FULL-0    mpd        unknown    
HD_FULL-1    mp4        unknown    (best)
$

powerfrontier commented 2 years ago

Thanks @dirkf for the quick response and the patch. I'm a noob and I don't know how to apply that patch on my side, but I will do some research on it when I have more spare time tomorrow. Is there any possibility of getting that patch into the master branch with a pull request?

dirkf commented 2 years ago

Applying a patch can be easy-ish or tricky depending on what type of yt-dl installation you have.

For a PR I'd like to have a proper code solution. The hack was limited to what could be done in just the extractor. There should be a hook for this sort of thing.

This seems a plausible approach:

--- old/youtube_dl/extractor/common.py
+++ new/youtube_dl/extractor/common.py
@@ -654,6 +654,16 @@
                 self._downloader.report_warning(errmsg)
                 return False

+    def _webpage_check_content(self, urlh, url_or_request, video_id, note=None, errnote=None, fatal=True, encoding=None):
+        """
+        Check the content is as expected: if not, raise ExtractorError, or, if not fatal, may return False
+
+        Return the content read from urlh if as expected
+        """
+
+        # by default, expect any content
+        return None
+
     def _download_webpage_handle(self, url_or_request, video_id, note=None, errnote=None, fatal=True, encoding=None, data=None, headers={}, query={}, expected_status=None):
         """
         Return a tuple (page content as string, URL handle).
@@ -668,7 +678,13 @@
         if urlh is False:
             assert not fatal
             return False
-        content = self._webpage_read_content(urlh, url_or_request, video_id, note, errnote, fatal, encoding=encoding)
+
+        first_bytes = self._webpage_check_content(urlh, url_or_request, video_id, note, errnote, fatal, encoding)
+        if first_bytes is False:
+            assert not fatal
+            return False
+
+        content = self._webpage_read_content(urlh, url_or_request, video_id, note, errnote, fatal, encoding=encoding, prefix=first_bytes)
         return (content, urlh)

     @staticmethod

Now we can just do this in the extractor:

--- old/youtube_dl/extractor/rtve.py
+++ new/youtube_dl/extractor/rtve.py
@@ -118,6 +118,16 @@
                 yield quality.decode(), url
             encrypted_data.read(4)  # CRC

+    def _webpage_check_content(self, urlh, url_or_request, video_id, *args, **kwargs):
+
+        content = urlh.read(20)
+        content_type = urlh.headers.get('Content-Type', '')
+
+        if '/xml' in content_type and b'ftypiso' in content:
+            # ignore fatal to distinguish this case
+            raise ExtractorError('Unexpected content for ' + content_type, cause=content, video_id=video_id)
+        return content
+
     def _extract_png_formats(self, video_id):
         png = self._download_webpage(
             'http://www.rtve.es/ztnr/movil/thumbnail/%s/videos/%s.png' % (self._manager, video_id),
@@ -130,15 +140,20 @@
                 formats.extend(self._extract_m3u8_formats(
                     video_url, video_id, 'mp4', 'm3u8_native',
                     m3u8_id='hls', fatal=False))
-            elif ext == 'mpd':
-                formats.extend(self._extract_mpd_formats(
-                    video_url, video_id, 'dash', fatal=False))
-            else:
-                formats.append({
-                    'format_id': quality,
-                    'quality': q(quality),
-                    'url': video_url,
-                })
+                continue
+            if ext == 'mpd':
+                try:
+                    formats.extend(self._extract_mpd_formats(
+                        video_url, video_id, 'dash', fatal=False))
+                    continue
+                except ExtractorError as e:
+                    if not e.msg.startswith('Unexpected content '):
+                        raise
+            formats.append({
+                'format_id': quality,
+                'quality': q(quality),
+                'url': video_url,
+            })
         self._sort_formats(formats)
         return formats
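
The hook idea in the two patches above can be sketched outside youtube-dl. This is an illustration under stated assumptions, not the proposed implementation: `Downloader`, `ManifestDownloader`, `check_content` and `UnexpectedContent` are hypothetical names, and an `io.BytesIO` object stands in for the URL handle. The base class accepts any content by default, a subclass may read and inspect the first bytes, and whatever the hook consumed is passed back as a prefix so that no data is lost.

```python
import io


class UnexpectedContent(Exception):
    """Raised when the first bytes of a response are not what was expected."""


class Downloader(object):
    """Stand-in for an extractor base class (hypothetical, not InfoExtractor)."""

    def check_content(self, handle):
        # Default hook: expect any content and consume nothing.
        return None

    def download(self, handle):
        prefix = self.check_content(handle)
        # Re-attach the bytes the hook has already read from the handle.
        return (prefix or b'') + handle.read()


class ManifestDownloader(Downloader):
    """Subclass that refuses media data served where a manifest was expected."""

    def check_content(self, handle):
        head = handle.read(20)
        if b'ftyp' in head:
            raise UnexpectedContent('media file served in place of a manifest')
        return head


# Hypothetical sample responses.
xml_body = b'<?xml version="1.0"?><MPD></MPD>'
mp4_body = b'\x00\x00\x00\x20ftypisom' + b'\x00' * 64

manifest = ManifestDownloader().download(io.BytesIO(xml_body))
try:
    ManifestDownloader().download(io.BytesIO(mp4_body))
    rejected = False
except UnexpectedContent:
    rejected = True
```

With this shape the caller can catch the specific exception and fall back to treating the URL as a plain media URL, which is what the extractor change above does for the `mpd` case.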
