ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

srgssr new audio url #31474

Open orangerkater opened 1 year ago

orangerkater commented 1 year ago

Verbose log

$ youtube-dl -v "https://www.srf.ch/audio/maloney/frohe-weihnachten?id=12304744"
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', 'https://www.srf.ch/audio/maloney/frohe-weihnachten?id=12304744']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.8.10 (CPython) - Linux-5.15.0-56-generic-x86_64-with-glibc2.29
[debug] exe versions: ffmpeg 4.2.7, ffprobe 4.2.7
[debug] Proxy map: {}
[generic] frohe-weihnachten?id=12304744: Requesting header
WARNING: Falling back on generic information extractor.
[generic] frohe-weihnachten?id=12304744: Downloading webpage
[generic] frohe-weihnachten?id=12304744: Extracting information
ERROR: Unsupported URL: https://www.srf.ch/audio/maloney/frohe-weihnachten?id=12304744
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/youtube_dl/YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/youtube_dl/YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
  File "/usr/local/lib/python3.8/dist-packages/youtube_dl/extractor/common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "/usr/local/lib/python3.8/dist-packages/youtube_dl/extractor/generic.py", line 3489, in _real_extract
    raise UnsupportedError(url)
youtube_dl.utils.UnsupportedError: Unsupported URL: https://www.srf.ch/audio/maloney/frohe-weihnachten?id=12304744

Description

SRG has changed its URL scheme. URLs now use /audio/ instead of /play/radio/, and the IDs are much shorter. New example URL: https://www.srf.ch/audio/maloney/frohe-weihnachten?id=12304744

Old URLs still seem to work, but opening one in a browser redirects you to the new form. Example: http://www.rtr.ch/play/radio/actualitad/audio/saira-tujetsch-tuttina-cuntinuar-cun-sedrun-muster-turissem?id=63cb0778-27f8-49af-9284-8c7a8c6d15fc becomes https://www.rtr.ch/audio/actualitad/saira-tujetsch-tuttina-cuntinuar-cun-sedrun-muster-turissem?partId=10728785

Funnily enough, in this example SRG uses partId instead of id.
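
To make the difference between the two URL shapes concrete, here is a rough sketch that tells them apart and pulls out the identifier. It is not youtube-dl code: `classify_srg_url`, `OLD_PATH` and `NEW_PATH` are hypothetical names, and accepting /video/ in the new scheme is an assumption, since only /audio/ URLs have been observed here.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical patterns, not taken from youtube-dl.
# Old scheme: /play/(tv|radio)/<show>/(video|audio)/<slug>
OLD_PATH = re.compile(r'^/play/(?:tv|radio)/[^/]+/(?P<type>video|audio)/')
# New scheme: /(video|audio)/<show>/<slug>  (/video/ is an assumption)
NEW_PATH = re.compile(r'^/(?P<type>video|audio)/')


def classify_srg_url(url):
    """Return (scheme, media_type, media_id), or None if unrecognised."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    # The identifier has been seen under both `id` and `partId`.
    ids = query.get('id') or query.get('partId')
    if not ids:
        return None
    for scheme, pattern in (('old', OLD_PATH), ('new', NEW_PATH)):
        match = pattern.match(parts.path)
        if match:
            return scheme, match.group('type'), ids[0]
    return None
```

Called with the three URLs above, this reports the first as new-style with a numeric id, the rtr.ch /play/radio/ URL as old-style with a UUID, and the redirected rtr.ch URL as new-style with the numeric partId.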

orangerkater commented 1 year ago

Maybe @goggle could help out?

dirkf commented 1 year ago

The UUID from the old format is found in the data-assetid attribute of the player <div> which has class js-media:

data-assetid="urn:srf:audio:2458a159-dcc5-44c5-9730-b5043f2d3f95"
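
As a standalone illustration of that lookup, the following sketch uses only the standard library to find the js-media element and take the last field of its URN. `AssetIdParser` and `uuid_from_page` are hypothetical names, and any HTML passed to it beyond the data-assetid attribute shown above is an assumption about the page markup.

```python
from html.parser import HTMLParser


class AssetIdParser(HTMLParser):
    """Record data-assetid from the first js-media element that carries one."""

    def __init__(self):
        super().__init__()
        self.asset_urn = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Treat class as a whitespace-separated list, as browsers do.
        classes = (attrs.get('class') or '').split()
        if self.asset_urn is None and 'js-media' in classes:
            self.asset_urn = attrs.get('data-assetid')


def uuid_from_page(html):
    """Return the UUID at the end of the URN, or None if nothing was found."""
    parser = AssetIdParser()
    parser.feed(html)
    urn = parser.asset_urn or ''
    # The URN looks like urn:<bu>:<type>:<uuid>; keep the last field.
    return urn.rsplit(':', 1)[-1] or None
```

Unlike a regex on the raw tag, this also copes with a class attribute that lists several classes, at the cost of parsing the whole page.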

So, a possible patch:

--- old/youtube_dl/extractor/srgssr.py
+++ new/youtube_dl/extractor/srgssr.py
@@ -5,6 +5,7 @@

 from .common import InfoExtractor
 from ..utils import (
+    extract_attributes,
     ExtractorError,
     float_or_none,
     int_or_none,
@@ -161,12 +162,18 @@
     _VALID_URL = r'''(?x)
                     https?://
                         (?:(?:www|play)\.)?
-                        (?P<bu>srf|rts|rsi|rtr|swissinfo)\.ch/play/(?:tv|radio)/
-                        (?:
-                            [^/]+/(?P<type>video|audio)/[^?]+|
-                            popup(?P<type_2>video|audio)player
-                        )
-                        \?.*?\b(?:id=|urn=urn:[^:]+:video:)(?P<id>[0-9a-f\-]{36}|\d+)
+                        (?P<bu>srf|rts|rsi|rtr|swissinfo)\.ch/
+                            (?:
+                                play/(?:tv|radio)/
+                                (?:
+                                    [^/]+/(?P<type>video|audio)/[^?]+|
+                                    popup(?P<type_2>video|audio)player
+                                )|
+                                (?:
+                                    (?P<type_3>video|audio)(?:/[^/]+)+/?
+                                )
+                            )
+                        \?.*?\b(?:(?:partId|id)=|urn=urn:[^:]+:video:)(?P<id>[0-9a-f\-]{36}|\d+)
                     '''

     _TESTS = [{
@@ -247,6 +254,12 @@
     def _real_extract(self, url):
         mobj = re.match(self._VALID_URL, url)
         bu = mobj.group('bu')
-        media_type = mobj.group('type') or mobj.group('type_2')
+        media_type = mobj.group('type') or mobj.group('type_2') or mobj.group('type_3')
         media_id = mobj.group('id')
+        if mobj.group('type_3') and len(media_id) < 36:
+            webpage = self._download_webpage(url, media_id)
+            player = self._search_regex(r'''(<div\b[^>]+\bclass\s*=\s*('|")js-media\2[^>]*>)''', webpage, 'Media URN') or ''
+            player = extract_attributes(player)
+            urn = player.get('data-assetid') or ''
+            media_id = urn.rsplit(':', 1)[-1]
         return self.url_result('srgssr:%s:%s:%s' % (bu[:3], media_type, media_id), 'SRGSSR')

orangerkater commented 1 year ago

This works like a charm. What an amazing response time! Thank you so much!

orangerkater commented 1 year ago

How will this patch find its way into upstream? Should I create a pull request?

dirkf commented 1 year ago

Please do.