yt-dlp / yt-dlp

A feature-rich command-line audio/video downloader

Fix radio-canada.ca support (and maybe OHdio?) #6678

Open anarcat opened 1 year ago

anarcat commented 1 year ago

Region

Canada

Example URLs

Provide a description that is worded well enough to be understood

Radio-Canada used to provide actual RSS feeds for their shows, but switched to a rather idiotic OHdio thing that is both an app and a website. Confusingly, some content is still available outside of that site.

The above URLs give an example of videos that are available outside of OHdio, but presumably some work could be done to support OHdio-only sites as well. I have also provided an example of an audio track that I've actually been able to download myself, after much head banging and hair-splitting. That is, of course, after a friend told me "oh you should just write a yt-dlp extractor" and I found out (much too late) about this documentation and the plugin stuff.

In any case, what I did is this rather nasty shell script:

#!/bin/sh

# ridiculous scraper for radio-canada audio shows
#
# many limitations:
# - hardcodes URL
# - probably brittle
# - untested on other shows, let alone videos and especially not OHdio
# - doesn't add proper tags to the audio file (artist, album, etc)
# - should be a yt-dlp extractor instead, see https://github.com/yt-dlp/yt-dlp/issues/6678

set -e

#set -x

BaseURL='https://ici.radio-canada.ca/jeunesse/scolaire/emissions/5615/lagent-jean/contenu/audios'

curl -s "$BaseURL" | sed -n '/medianet-content/{s/.*href="/https:\/\/ici.radio-canada.ca/;s/".*//;p}' | while IFS= read -r EpUrl; do

    #EpUrl='https://ici.radio-canada.ca/jeunesse/scolaire/emissions/5615/lagent-jean/episodes/456610/neo-agence-cheffe-sauver-monde-ennemie/7228/audios'
    # look for a <script .*type=consolevideo+json>, it's our magic JSON blob
    #IdMedia=$(curl -s "$EpUrl" | grep -Po '"IdMedia":"\d+"' | grep -Po '\d+')
    JSON_BLOB=$(curl -s "$EpUrl" | sed -n '/<script[^>]*application[^>]*>/,/<\/script>/p' | sed -e '/<\/\?script/d')

    IdMedia=$(
        jq -r .Media.IdMedia <<EOF
$JSON_BLOB
EOF
           )
    Title=$(
        jq -r .Title <<EOF
$JSON_BLOB
EOF
         )
    m3u_url=$(curl -s 'https://services.radio-canada.ca/media/validation/v2/?appCode=medianet&connectionType=hd&deviceType=ipad&idMedia='"$IdMedia"'&multibitrate=true&output=json&tech=hls' | jq -r .url)
    yt-dlp -o "$Title".mp4 "$m3u_url"
done

Basically the trick is that you can enumerate the audio tracks from the "playlist" URL above. From there you need to find and load a small JSON blob that gives you the unique ID for an HLS stream, which yt-dlp is then quite happy to slurp down. The script doesn't write author/title/genre metadata into the mp4 and doesn't pull the covers, but that would probably be easier to implement in yt-dlp than going any further with that horrible script.
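
For comparison, the same three steps (list the episodes, read the JSON blob, build the validation request) can be sketched in Python, which is closer to what an extractor would need to do. This is only a rough sketch and not actual yt-dlp extractor code: the function names are made up for illustration, the parsing simply mirrors the sed/jq pipeline above, and the network fetches are left out.

```python
import json
import re
from urllib.parse import urlencode

SITE = 'https://ici.radio-canada.ca'
VALIDATION_ENDPOINT = 'https://services.radio-canada.ca/media/validation/v2/'


def episode_urls(listing_html):
    """Return absolute episode URLs from lines that mention medianet-content."""
    urls = []
    for line in listing_html.splitlines():
        if 'medianet-content' not in line:
            continue
        # Like the greedy sed expression, keep the last href on the line.
        hrefs = re.findall(r'href="([^"]*)"', line)
        if hrefs:
            urls.append(SITE + hrefs[-1])
    return urls


def media_info(episode_html):
    """Return (IdMedia, Title) from the first JSON <script> blob with a Media key."""
    pattern = r'<script[^>]*application[^>]*>(.*?)</script>'
    for match in re.finditer(pattern, episode_html, re.DOTALL):
        try:
            blob = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # not JSON, try the next <script> block
        if isinstance(blob, dict) and 'Media' in blob:
            return blob['Media']['IdMedia'], blob['Title']
    raise ValueError('no usable JSON <script> block found')


def validation_url(id_media):
    """Build the same validation request that the shell script sends with curl."""
    query = urlencode({
        'appCode': 'medianet',
        'connectionType': 'hd',
        'deviceType': 'ipad',
        'idMedia': id_media,
        'multibitrate': 'true',
        'output': 'json',
        'tech': 'hls',
    })
    return f'{VALIDATION_ENDPOINT}?{query}'
```

As in the script, fetching `validation_url(id_media)` should return JSON whose `url` field is the HLS playlist to hand over to yt-dlp.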

Provide verbose output that clearly demonstrates the problem

Complete Verbose Output

anarcat@angela:tmp$ yt-dlp -vU 'https://ici.radio-canada.ca/jeunesse/scolaire/emissions/5615/lagent-jean/contenu/audios'
[debug] Command-line config: ['-vU', 'https://ici.radio-canada.ca/jeunesse/scolaire/emissions/5615/lagent-jean/contenu/audios']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.03.04 [392389b7d] (debian*)
[debug] Python 3.11.2 (CPython x86_64 64bit) - Linux-6.1.0-6-amd64-x86_64-with-glibc2.36 (OpenSSL 3.0.8 7 Feb 2023, glibc 2.36)
[debug] exe versions: ffmpeg 5.1.2-3 (setts), ffprobe 5.1.2-3, rtmpdump 2.4
[debug] Optional libraries: Cryptodome-3.11.0, brotli-1.0.9, certifi-2022.09.24, mutagen-1.46.0, pyxattr-0.8.0, secretstorage-3.3.3, sqlite3-2.6.0, websockets-10.4
[debug] Proxy map: {}
[debug] Loaded 1786 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.03.04, Current version: stable@2023.03.04
yt-dlp is up to date (stable@2023.03.04)
[generic] Extracting URL: https://ici.radio-canada.ca/jeunesse/scolaire/emissions/5615/lagent-jean/contenu/audios
[generic] audios: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] audios: Extracting information
[debug] Looking for embeds
ERROR: Unsupported URL: https://ici.radio-canada.ca/jeunesse/scolaire/emissions/5615/lagent-jean/contenu/audios
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/yt_dlp/YoutubeDL.py", line 1518, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/yt_dlp/YoutubeDL.py", line 1594, in __extract_info
    ie_result = ie.extract(url)
                ^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/yt_dlp/extractor/common.py", line 694, in extract
    ie_result = self._real_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/yt_dlp/extractor/generic.py", line 2510, in _real_extract
    raise UnsupportedError(url)
yt_dlp.utils.UnsupportedError: Unsupported URL: https://ici.radio-canada.ca/jeunesse/scolaire/emissions/5615/lagent-jean/contenu/audios

bashonly commented 1 day ago

see also example URL in #11710

bl4ckb0ne commented 1 day ago

#11710 seems a bit different. I tried to run the script but it fails early on: there's no medianet-content in the page content.

dirkf commented 1 day ago

It's there but in the hydration JSON (which has nothing useful for l'agent Jean) and here:

...<div data-mediainfo='{"appCode":"medianet","mediaId":"8802204"}'>...
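
A minimal sketch of pulling that ID out of the markup, assuming the attribute is single-quoted and holds plain (possibly HTML-escaped) JSON exactly as in the snippet above. The function name is hypothetical and this is not how any existing extractor does it.

```python
import html
import json
import re


def media_ids_from_attributes(page_html):
    """Collect Medianet IDs from data-mediainfo='...' attributes in a page."""
    ids = []
    for raw in re.findall(r"data-mediainfo='([^']*)'", page_html):
        try:
            # The attribute value may contain entities such as &quot;
            info = json.loads(html.unescape(raw))
        except json.JSONDecodeError:
            continue  # skip attributes that are not valid JSON
        if (isinstance(info, dict)
                and info.get('appCode') == 'medianet'
                and 'mediaId' in info):
            ids.append(info['mediaId'])
    return ids
```

Each ID returned this way would presumably play the same role as IdMedia in the validation request from the original script.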

dirkf commented 7 hours ago

The OHdio pages are basically similar to the previous ones, but the hydration JSON assigned to _rcState_ has to be analyzed; actual A/V pages have a Medianet ID stashed as mediaId. There isn't an obvious way to distinguish playlists. Matching xxx/nnn in the URL, where xxx is one of livres-audio, balados, episodes or emissions and the path finishes with the hyphenated program slug (like enquete-de-crime-une-histoire-presque-vraie), seems to find the obvious ones. The playlist item URLs (etc.) can then be found in an items list in the hydration JSON.
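
The heuristic could be sketched roughly as below. Everything here is an assumption made for illustration: the helper names are hypothetical, the slug pattern (lowercase letters and digits separated by hyphens) is a guess, and since the real layout of the hydration JSON isn't shown, the mediaId lookup just walks whatever parsed structure it is given.

```python
import re
from urllib.parse import urlsplit

# xxx/nnn/slug at the very end of the path, with xxx from the list above.
PLAYLIST_RE = re.compile(
    r'/(?P<kind>livres-audio|balados|episodes|emissions)'
    r'/(?P<id>\d+)'
    r'/(?P<slug>[a-z0-9]+(?:-[a-z0-9]+)*)/?$'
)


def match_playlist(url):
    """Return kind/id/slug if the URL looks like a playlist page, else None."""
    match = PLAYLIST_RE.search(urlsplit(url).path)
    return match.groupdict() if match else None


def find_media_ids(node):
    """Recursively collect every mediaId value from parsed hydration JSON."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == 'mediaId' and isinstance(value, (str, int)):
                found.append(str(value))
            else:
                found.extend(find_media_ids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_media_ids(item))
    return found
```

Note that the l'agent Jean listing URL from the original report (ending in /contenu/audios) does not match this pattern, which fits the remark that the pattern only finds the obvious playlists.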