ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

DRTV fails downloading video JSON #32581

Open jakobmn opened 1 year ago

jakobmn commented 1 year ago

Checklist

Verbose log

PS C:\temp\dr\youtube-dl> python.exe .\youtubedl__main__.py --verbose https://www.dr.dk/drtv/se/hammerslag-foerstegangskoebet407460
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', 'https://www.dr.dk/drtv/se/hammerslag-foerstegangskoebet_407460']
[debug] Encodings: locale cp1252, fs utf-8, out utf-8, pref cp1252
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 00ef748cc
[debug] Python 3.11.6 (CPython AMD64 64bit) - Windows-10-10.0.22621-SP0 - OpenSSL 3.0.11 19 Sep 2023
[debug] exe versions: ffmpeg 6.0-essentials_build-www.gyan.dev, ffprobe 6.0-essentials_build-www.gyan.dev
[debug] Proxy map: {}
[drtv] hammerslag-foerstegangskoebet407460: Downloading webpage https://www.dr.dk/drtv/se/hammerslag-foerstegangskoebet_407460
[drtv] 00952316100: Downloading video JSON https://www.dr.dk/mu-online/api/1.4/programcard?expanded=true&productionnumber=00952316100
ERROR: Unable to download JSON metadata: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
  File "C:\temp\dr\youtube-dl\youtube_dl\extractor\common.py", line 666, in _request_webpage
    return self._downloader.urlopen(url_or_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\temp\dr\youtube-dl\youtube_dl\YoutubeDL.py", line 2461, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1776.0_x64__qbz5n2kfra8p0\Lib\urllib\request.py", line 525, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1776.0_x64__qbz5n2kfra8p0\Lib\urllib\request.py", line 634, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1776.0_x64__qbz5n2kfra8p0\Lib\urllib\request.py", line 563, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1776.0_x64__qbz5n2kfra8p0\Lib\urllib\request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1776.0_x64__qbz5n2kfra8p0\Lib\urllib\request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)

PS C:\temp\dr\youtube-dl>

Description

Tested with both 2021.12.17 and the latest master branch. It worked last week. It would seem that https://www.dr.dk/mu-online/api/1.4/programcard has been deprecated? I can't find any reference to it when using the web page. All links I have tried on dr.dk now fail with the same error.

dirkf commented 1 year ago

The page has a "hydration" JSON block assigned to window.__data containing programme metadata and preview media links. To get the actual media links, the site makes this request (Copy as cURL from Mozilla webtools):

curl 'https://production.dr-massive.com/api/account/items/407460/videos?delivery=stream&device=web_browser&ff=idp%2Cldp%2Crpt&lang=da&resolution=HD-1080&sub=Anonymous' -H 'User-Agent: Mozilla/5.0 (X11; Linux i686; rv:91.0) Gecko/20100101 Firefox/91.0 SeaMonkey/2.53.17.1' -H 'Accept: application/json' -H 'Accept-Language: en-GB,en;q=0.5' --compressed -H 'X-Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzUxMiJ9.eyJzaWQiOiJiNmYxYjFjNS1mNTM0LTRhZmYtOWYwZS1hODUxOTdiN2JmMmQiLCJzcHIiOiJOb25lIiwiYXVkIjoiaHR0cDovL2lzbC5kci1tYXNzaXZlLmNvbS9JU0wvQXBpL1YxL0RhdGFzZXJ2aWNlIiwic3ViIjoiQ2F0YWxvZyIsImV4cCI6MTY5NjU4NTY3OSwidXNlckFjY291bnRJZCI6ImNhNTg5YzFmZGU4MTRlOWE4MDIyMjhlZGZkNzI0MDlkIiwidXNlclByb2ZpbGVJZCI6ImNhNTg5YzFmZGU4MTRlOWE4MDIyMjhlZGZkNzI0MDlkIiwiZW1haWwiOiJjYTU4OWMxZmRlODE0ZTlhODAyMjI4ZWRmZDcyNDA5ZEBleGFtcGxlLmNvbSIsImRldmljZSI6IndlYl9icm93c2VyIiwidmFsaWRVbnRpbCI6MTY5OTEzNDQ3OSwiaWF0IjoxNjk2NTQyNDc5LCJpc09wdGVkT3V0Ijp0cnVlLCJpc0NvdW50cnlWZXJpZmllZCI6ZmFsc2UsImdlb0xvY2F0aW9uIjoiYWJyb2FkIiwiaXNEZXZpY2VBYnJvYWQiOnRydWUsImlzRmFsbGJhY2tUb2tlbiI6ZmFsc2UsInN1YnNjcmlwdGlvbiI6IkFub255bW91cyIsImNvbnNlbnQiOltdLCJzZXNzaW9uU3RhdGUiOm51bGx9.RxOGorDcjT69xFoG2cDgjiLSgLJZD0v83ItuVnvlWAWTJM_wFcc5sxH1EDLjLcmcfdy2aWOov0vlh-ePBi5WPV8vanB0aMdC23_NqrujdSKLWSIpNuKCyQ5Gx1e3Vz5edaWpDIWAsGBU3dLjo7ItWfL8rTIPskQ4mrIVfpPNzLK-R6OnSneMl7bqItdKwm4wuMYypqnp8YhI4kbzxiFAP3Y2mmCvnHeN9QiCXkYYyPp_Vy40mfAsGRCRA7se8wDuQlvarE4om863VCpaS5mi-mcK_qisele8eeZ8jWy4g6A1hNEbErQXn7xMoQCptlSc8fOZdItOML3W0rpiJG5Hfg' -H 'origin: https://www.dr.dk' -H 'Referer: https://www.dr.dk/drtv/se/hammerslag_-foerstegangskoebet_407460' -H 'DNT: 1' -H 'Connection: keep-alive'

To reproduce this we have to work out how to get the bearer token (required), among other things.
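
As a rough illustration of the first step described above, the "hydration" block can be pulled out of the page source with the standard library. This is only a sketch: the function name `extract_hydration_data` and the sample page fragment are made up for the example, and the real page may wrap the assignment differently.

```python
import json
import re


def extract_hydration_data(webpage):
    """Return the object assigned to window.__data in the page, or None."""
    # Locate the start of the assignment. The object may contain nested
    # braces, so a regex alone cannot reliably find where it ends.
    match = re.search(r'window\.__data\s*=\s*', webpage)
    if match is None:
        return None
    try:
        # raw_decode() parses exactly one JSON value starting at the given
        # index and ignores whatever follows it (";", "</script>", ...).
        data, _end = json.JSONDecoder().raw_decode(webpage, match.end())
    except ValueError:
        return None
    return data


# Hypothetical page fragment, for illustration only.
sample_page = (
    '<script>window.__data = {"item": {"id": "407460", '
    '"title": "Example"}};</script>'
)
hydration = extract_hydration_data(sample_page)
print(hydration['item']['id'])
```

Using `raw_decode()` rather than a greedy or non-greedy regex avoids having to guess where the JSON object ends.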

viktor-enzell commented 1 year ago

@dirkf svtplay-dl has support for DRTV and they seem to be able to extract the bearer token. Perhaps a similar solution can be used here?

https://github.com/spaam/svtplay-dl/blob/master/lib/svtplay_dl/service/dr.py

PeterBehrend commented 1 year ago

Hi. Since I am not an expert on JSON, I can only comment in general terms; sorry for that. I can confirm that yt-dlp is no longer working on www.dr.dk/drtv. I had been using an older version (2023.03.04) until yesterday, because newer versions did not work. Today this old version also failed (as does any later version).

I tried svtplay-dl today. It can download older content from dr.dk/drtv but not the latest. For instance: success with the "Beck" series, failure with "Das Boot" (the very latest content).

Again, I apologize for these general observations; it's the best I can do. I love yt-dlp and anyone who makes it work with dr.dk/drtv. I've been using it for years. /regards

dirkf commented 1 year ago

Actual example URLs would help, so that we can determine if the old page structures still apply.

Thanks @viktor-enzell for the hint. I guess this is the relevant fragment:

        deviceid = uuid.uuid1()
        res = self.http.request(
            "post",
            "https://isl.dr-massive.com/api/authorization/anonymous-sso?device=web_browser&ff=idp%2Cldp&lang=da",
            json={"deviceId": str(deviceid), "scopes": ["Catalog"], "optout": True},
        )
        token = res.json()[0]["value"]

As svtplay-dl is MIT-licensed, we can't copy this code directly, but it should be possible to embody the procedure shown.
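
For readers who want to try the procedure outside either project, here is a minimal standard-library sketch of the two requests involved. It only builds the request objects and parses a token response; nothing is sent. The helper names (`build_token_request`, `parse_token`, `build_videos_request`) are invented for this example, the `Content-Type` header is an assumption, and the `X-Authorization` header name is taken from the browser request quoted earlier in the thread.

```python
import json
import uuid
from urllib.request import Request

TOKEN_URL = ('https://isl.dr-massive.com/api/authorization/anonymous-sso'
             '?device=web_browser&ff=idp%2Cldp&lang=da')
VIDEOS_URL = ('https://production.dr-massive.com/api/account/items/{0}/videos'
              '?delivery=stream&device=web_browser&ff=idp%2Cldp%2Crpt&lang=da'
              '&resolution=HD-1080&sub=Anonymous')


def build_token_request(device_id=None):
    """Build the POST request that asks for an anonymous token."""
    device_id = device_id or uuid.uuid1()
    body = json.dumps({
        'deviceId': str(device_id),
        'scopes': ['Catalog'],
        'optout': True,
    }).encode('utf-8')
    # Assumption: the server expects a JSON content type for this body.
    return Request(TOKEN_URL, data=body,
                   headers={'Content-Type': 'application/json'})


def parse_token(response_body):
    """Take the token from the first element of the JSON response."""
    return json.loads(response_body.decode('utf-8'))[0]['value']


def build_videos_request(item_id, token):
    """Build the GET request for the stream data of one item."""
    return Request(VIDEOS_URL.format(item_id),
                   headers={'X-Authorization': 'Bearer {0}'.format(token)})


# Illustration only: a made-up response body instead of a real network call.
token = parse_token(b'[{"value": "example-token"}]')
token_request = build_token_request()
videos_request = build_videos_request('407460', token)
print(token_request.get_method(), videos_request.full_url)
```

Each request object could then be passed to `urllib.request.urlopen()`; inside an extractor the equivalent calls would go through the project's own download helpers instead.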

PeterBehrend commented 1 year ago

Right. Two links that definitely do not work with yt-dlp:

https://www.dr.dk/drtv/se/das-boot_-en-ny-chance_393842
https://www.dr.dk/drtv/se/beck_-paa-tynd-is_412058

dirkf commented 1 year ago

Pages that work OK (if any) are also of interest!

Also bear in mind that this tracker is for yt-dl rather than yt-dlp, although in this case both used the same code with the API host that now returns 404. The yt-dlp extractor knows about series/season pages (it uses the new API host for those). When the current issue is fixed (here or there) we can unify the extractor versions. There is an open PR here. I have a fix branch that will need to be updated.

thomasdn commented 1 year ago

If anyone needs to be able to route traffic through Denmark in order to test this, please get in touch.

lyngklip commented 11 months ago

Using the hint provided by @viktor-enzell, I have made local modifications to the extractor drtv.py such that my yt-dlp can now access the links provided by @PeterBehrend. All parts of the original drtv.py that use the DR API "mu-online" seem to have been rendered useless by recent changes at DR. My modifications incur potentially undesirable dependencies on the modules requests and uuid. I use _extract_m3u8_formats_and_subtitles to provide formats and subs. It's little more than a proof of concept, and I'm not sure if it's worthy of a pull request as is.

dirkf commented 11 months ago

Thanks. If you're willing to release your code under Unlicense and to put it up in a GH repo, or even easier, a Gist, I can probably adapt your tactics to yt-dl conventions: certain things would be hard, like having to use HTTP/2, say.

This version of uuid is fine to use. requests has to be worked around.

dirkf commented 11 months ago

The requests stuff could be s/t like this. Note that yt-dl code shouldn't have f'{strings}' and uses Ellipsis for ..., though these are fine in yt-dlp. Both projects quote strings 'singly'.

-        token = requests.post("https://isl.dr-massive.com/api/authorization/anonymous-sso?device=web_browser&ff=idp%2Cldp&lang=da", json={"deviceId": str(deviceId), "scopes": ["Catalog"], "optout": True}).json()[0]["value"] # , headers={"Content-Type": "application/json"}
+        token = self._download_json('https://isl.dr-massive.com/api/authorization/anonymous-sso?device=web_browser&ff=idp%2Cldp&lang=da', None, data=json.dumps({'deviceId': compat_str(deviceId), 'scopes': ['Catalog'], 'optout': True}).encode('utf-8'))[0]['value']
-        data = requests.get(f"https://production.dr-massive.com/api/account/items/{itemId}/videos?delivery=stream&device=web_browser&ff=idp%2Cldp%2Crpt&lang=da&resolution=HD-1080&sub=Anonymous", headers={"authorization": f"Bearer {token}"}).json()
+        data = self._download_json('https://production.dr-massive.com/api/account/items/{0}/videos?delivery=stream&device=web_browser&ff=idp%2Cldp%2Crpt&lang=da&resolution=HD-1080&sub=Anonymous'.format(itemId), itemId, headers={'authorization': 'Bearer {0}'.format(token)})

almx commented 11 months ago

With my changes to drtv.py, yt-dlp insisted on downloading a "SpokenSubtitles" version of "Das Boot" season 2 episode 8. That's because I just used format[0]. For that reason I've updated the gist to iterate over formats more like drtv.py did before. I suppose you would figure that out anyway. The gist obviously only contains the parts of the full drtv.py file that are changed, i.e. the _real_extract() method of the DRTVIE class. Other classes and tests and what not should stay.

On a further note I have to use --downloader ffmpeg with "Das Boot" due to an unrelated yt-dlp issue.

Regarding subtitles, it would be great if you could ensure it grabs all subtitles with the --all-subs option. Here's an example of a video that has two subtitle tracks even though it's technically a Danish show. It begins with an Italian woman talking a bit; the rest is Danish.

https://www.dr.dk/drtv/se/spise-med-price_-pasta-selv_397445

It has these two subs:

https://drod22r.akamaized.net/all/clear/none/ec/64c74410ef918632984110ec/00212301010/subtitles/Foreign-21269858-227d5e6d-8d67-45ca-b672-c174e7ae86d4.vtt
https://drod22r.akamaized.net/all/clear/none/ec/64c74410ef918632984110ec/00212301010/subtitles/Foreign_HardOfHearing-21269858-227d5e6d-8d67-45ca-b672-c174e7ae86d4.vtt

The first one only displays foreign language, the second one has all subtitles combined. They have these JSON tags:

1st: 'language': 'ForeignLanguageSubtitles'
2nd: 'language': 'CombinedLanguageSubtitles'
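
One way to keep both tracks available to --all-subs is to key the subtitles dictionary on the tag itself, so that neither track overwrites the other. The sketch below assumes each track arrives as a dict with a 'language' key (as quoted above) and a 'link' key holding the URL; the 'link' key and the function name `build_subtitles` are hypothetical. The output shape, a dict mapping a tag to a list of {'url', 'ext'} dicts, is the one yt-dl and yt-dlp use for subtitles.

```python
import os.path
from urllib.parse import urlparse


def build_subtitles(tracks):
    """Group subtitle tracks by their 'language' tag.

    Assumption: each track is a dict with 'language' and 'link' keys.
    """
    subtitles = {}
    for track in tracks:
        url = track.get('link')
        tag = track.get('language')
        if not url or not tag:
            # Skip incomplete entries rather than guessing a tag.
            continue
        # Take the extension from the URL path, falling back to 'vtt'.
        ext = os.path.splitext(urlparse(url).path)[1].lstrip('.') or 'vtt'
        subtitles.setdefault(tag, []).append({'url': url, 'ext': ext})
    return subtitles


BASE = ('https://drod22r.akamaized.net/all/clear/none/ec/'
        '64c74410ef918632984110ec/00212301010/subtitles/')
tracks = [
    {'language': 'ForeignLanguageSubtitles',
     'link': BASE + 'Foreign-21269858-227d5e6d-8d67-45ca-b672-c174e7ae86d4.vtt'},
    {'language': 'CombinedLanguageSubtitles',
     'link': BASE + 'Foreign_HardOfHearing-21269858-227d5e6d-8d67-45ca-b672-c174e7ae86d4.vtt'},
]
subs = build_subtitles(tracks)
print(sorted(subs))
```

A real extractor would probably map these tags onto language codes; that mapping is left out here because the thread does not specify it.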

I wish I could contribute more, but I'm bad at Python. This is some of the data I've dug up while debugging.

jkiddo commented 11 months ago

@lyngklip would you be willing to share more of your findings?

almx commented 11 months ago

For info I am close to finalizing my changes to yt-dlp in issue https://github.com/yt-dlp/yt-dlp/issues/8298

Just need to commit the changes and have it reviewed

almx commented 11 months ago

DRTV has now been fixed and thoroughly tested in the yt-dlp project - https://github.com/yt-dlp/yt-dlp/issues/8298

dirkf commented 11 months ago

Just a few comments:

This last issue is tricky and probably affects many sites. The metadata sent by DRTV has a broadcast time in ISO 8601 format with no explicit time-zone. The programme page may include a default TZ in ld+json or hydration JSON (like Europe/Copenhagen) or we could assume one (but does DRTV broadcast in Greenland?). But Python (<3.9) and the yt-dl utility routines don't give us an easy way to apply a TZ specified by name, even if we transform it into a TZ string like CET-1CEST,M3.5.0/2,M10.5.0/3 which contains all the information needed to determine the UTC offset for a date-time string with no explicit time-zone (apart from the ambiguous hour when CEST is rolled back).
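
To make the TZ string concrete, here is a small sketch of the rule it encodes (DST from the last Sunday of March at 02:00 to the last Sunday of October at 03:00, local time), written with only `datetime` and `calendar` so that it does not rely on named time-zone support. The function names are invented for this example, it assumes the current EU rule applies to the date in question, and it treats the ambiguous fall-back hour as CEST.

```python
import calendar
import datetime


def last_sunday(year, month):
    """Date of the last Sunday of a 31-day month (March or October)."""
    last_day = datetime.date(year, month, 31)
    # weekday(): Monday == 0 ... Sunday == 6
    return last_day - datetime.timedelta(days=(last_day.weekday() + 1) % 7)


def copenhagen_utc_offset(local_dt):
    """UTC offset in hours for a naive Europe/Copenhagen wall-clock time."""
    dst_start = datetime.datetime.combine(
        last_sunday(local_dt.year, 3), datetime.time(2, 0))
    dst_end = datetime.datetime.combine(
        last_sunday(local_dt.year, 10), datetime.time(3, 0))
    # Times in the repeated hour before dst_end are taken to be CEST.
    return 2 if dst_start <= local_dt < dst_end else 1


def local_iso_to_timestamp(iso_string):
    """Convert 'YYYY-MM-DDTHH:MM:SS' local time to a UTC Unix timestamp."""
    local_dt = datetime.datetime.strptime(iso_string, '%Y-%m-%dT%H:%M:%S')
    utc_dt = local_dt - datetime.timedelta(
        hours=copenhagen_utc_offset(local_dt))
    return calendar.timegm(utc_dt.timetuple())


print(local_iso_to_timestamp('2023-10-06T12:00:00'))
```

Because the result no longer depends on the machine's local time zone, a test that compares against a fixed timestamp should give the same answer wherever it runs.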

almx commented 11 months ago

@dirkf Are you replying to my changes to the yt-dlp project? Just making sure.

I only spent a short time looking at radio and live streams (what is "news"?), but gave up. I'm sure it's possible, but I felt that downloading the regular shows was more important. Maybe radio/live streams can be a new issue in itself if anyone is up for it.

I haven't dug much into the timestamps, but is that so important?

dirkf commented 11 months ago

Timestamps that vary according to location cause tests to break depending on where the tests are run, which can't be controlled when GitHub runs the tests in VMs, or if devs elsewhere try to run them. Ofc we can just change the test value to int but then the test doesn't show that the extraction is getting the expected value (because it may not be). However I've run up a little routine that I hope will mostly solve the problem: "mostly" because code to process dates and times is bound to be wrong somehow; also because there's no unambiguous answer to the question "what is the UTC offset of this date time string from CET/CEST (or any TZ with DST) with no explicit offset?" when the time is within the falling-back hour going from 03:00 CEST to 02:00 CET.

The old extractor had some test URLs with radio and nyheder ("news", isn't it? PR #10536). The answer to the live pages seems to be in the hydration JSON (json_data), where there are thumbnails and manifest URLs.

I've made a yt-dl version of the extractor including the time munging where I'll also try to update the live extraction (may need DK testers). It'll be in a PR soon, but needs some enhancements to be pulled from yt-dlp core. Many thanks for resolving the new site structure and the playlist handling.

Simonfront commented 10 months ago

The newest version of yt-dlp still had this error, so I went as far as updating to the nightly build. With that, I instead got certificate errors when accessing dr.dk URLs.

It's possible to bypass this certificate error by adding the --no-check-certificates option when downloading from dr.dk (although probably generally unadvisable).

KarmusDK commented 1 week ago

404 Not Found is back with the latest stable versions.

C:\>yt-dlp --embed-subs --all-subs https://www.dr.dk/drtv/se/we-are-lady-parts_-spil-noget_460662
[drtv] Downloading anonymous token
[drtv] Extracting URL: https://www.dr.dk/drtv/se/we-are-lady-parts_-spil-noget_460662
[drtv] we-are-lady-parts_-spil-noget_460662: Downloading webpage
[drtv] 00022120210: Downloading stream data
ERROR: [drtv] we-are-lady-parts_-spil-noget_460662: Unable to download JSON metadata: HTTP Error 404: Not Found (caused by <HTTPError 404: Not Found>)

dirkf commented 1 week ago

Useful to know, but for a yt-dlp maintainer to action that, please post there. A new issue referencing yt-dlp/yt-dlp#8298 would probably be best. However, I see that a pending PR was just merged that should fix the issue and would be present in the next nightly, if not today's. The equivalent change here is giving 403 for the media, as expected since I'm not testing from DK.