Open github-com-bqm opened 2 years ago
Yes, and that's how [NZZ] appears in the log.
Obviously the extractor needs to be rewritten. The site seems to be serving React-ish pages where the additional page content (hydration) is in a JS expression like this:
<script> window.__NZZ__ = (function (arg1, ..., argN) { /* assignments */; return {/* possibly useful data but including variable references */ };}(binding1, ..., bindingN)); </script>
Here N is large, in the hundreds. You might expect the function body to generate a template and the arguments to supply the values that populate it for a specific page, but there seems to be no logical distinction between data set in the function body and data passed in the arguments.
So there isn't a JSON constant that can be decoded directly. The jsinterp module isn't able to execute the expression, so we'll have to hack the JS about to get the targets.
Also, the videos are being served via cdn.jwplatform.com, so a possibility is to pass just the IDs to JWPlatformIE, thus avoiding extracting metadata from the page.
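A minimal sketch of that hand-off, assuming the NZZ extractor has already found the JWP media id. `jwplatform_result` is a hypothetical helper, not existing youtube-dl code; the dict it builds mirrors what `InfoExtractor.url_result()` returns as far as I understand it, and the `jwplatform:` URL form is the one used further down this thread.

```python
def jwplatform_result(media_id, video_id=None):
    """Build a 'url' result that defers extraction to the JWPlatform extractor.

    Hypothetical helper: inside a real extractor this would presumably be
    self.url_result('jwplatform:' + media_id, ie='JWPlatform', video_id=...).
    """
    return {
        '_type': 'url',
        'url': 'jwplatform:%s' % media_id,
        'ie_key': 'JWPlatform',
        # Fall back to the media id when the page id is unknown
        'id': video_id or media_id,
    }


result = jwplatform_result('ndKNFKmf', video_id='1639635')
print(result['url'])
```

Deferring like this keeps the NZZ extractor small: it only has to find the id, and formats and metadata come from the JWPlatform side.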
Before the actual video starts, a commercial is shown, which can be downloaded with the src url in the video tag (e.g. this here: https://crcdn09.adnxs-simple.com/creative/p/3927/2021/12/4/30280269/c720773f-fdd5-43dd-ac03-d2ec2d49a40e_1280_720_1700k.mp4)
After the commercial finishes, the video url is replaced with this:
blob:https://www.nzz.ch/568533c4-674c-4521-897e-059cd0561c19
Trying to use this directly with youtube-dl, I first get a message that youtube-dl doesn't support blob URLs. If I try without blob:, I get this error:
youtube-dl --verbose "https://www.nzz.ch/568533c4-674c-4521-897e-059cd0561c19"
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'--restrict-filenames', u'--verbose', u'https://www.nzz.ch/568533c4-674c-4521-897e-059cd0561c19']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 2.7.16 (CPython) - Darwin-20.6.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
[generic] 568533c4-674c-4521-897e-059cd0561c19: Requesting header
WARNING: Could not send HEAD request to https://www.nzz.ch/568533c4-674c-4521-897e-059cd0561c19: HTTP Error 404: Not Found
[generic] 568533c4-674c-4521-897e-059cd0561c19: Downloading webpage
ERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by HTTPError()); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type youtube-dl -U to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 634, in _request_webpage
    return self._downloader.urlopen(url_or_request)
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 2288, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
In the problem page, the video of interest can be downloaded as jwplatform:ndKNFKmf
$ youtube-dl -F 'jwplatform:ndKNFKmf'
[JWPlatform] ndKNFKmf: Downloading JSON metadata
[JWPlatform] ndKNFKmf: Downloading m3u8 information
[info] Available formats for ndKNFKmf:
format code extension resolution note
0 m4a audio only
120 mp4 audio only 120k , mp4a.40.2
310 mp4 320x180 310k , avc1.77.30, mp4a.40.2
440 mp4 480x270 440k , avc1.77.30, mp4a.40.2
570 mp4 720x406 570k , avc1.77.30, mp4a.40.2
1280 mp4 1280x720 1280k , avc1.77.30, mp4a.40.2
3180 mp4 1920x1080 3180k , avc1.77.30, mp4a.40.2
7 mp4 320x180 304602k
8 mp4 480x270 438633k
9 mp4 720x406 565446k
10 mp4 1280x720 1274784k
11 mp4 1920x1080 3171488k (best)
$
Plainly there's a x1000 bug in the last 5 formats: their bitrates look like bit/s values labelled as kbit/s.
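A sketch of the scaling that would fix this, assuming the direct MP4 formats carry their bitrate in bit/s. `to_kbps` is a hypothetical helper for illustration, not the code in the JWPlatform extractor; the input values are the ones shown for formats 7 to 11 above.

```python
def to_kbps(bitrate_bps):
    """Scale a bitrate in bit/s to whole kbit/s; None passes through."""
    if bitrate_bps is None:
        return None
    return int(bitrate_bps) // 1000


# Values shown for formats 7-11 in the listing above
reported = [304602, 438633, 565446, 1274784, 3171488]
scaled = [to_kbps(b) for b in reported]
print(scaled)
```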
With a new test version of the extractor:
$ python -m youtube_dl -v -F 'https://www.nzz.ch/wissenschaft/migraene-chronische-kopfschmerzen-bis-zum-verlust-vom-lebenswille-ld.1639635?jwsource=cl'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://www.nzz.ch/wissenschaft/migraene-chronische-kopfschmerzen-bis-zum-verlust-vom-lebenswille-ld.1639635?jwsource=cl']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: bd7d796ef
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[NZZ] 1639635: Downloading webpage
[JWPlatform] ndKNFKmf: Downloading JSON metadata
[JWPlatform] 1639635: Downloading m3u8 information
[info] Available formats for 1639635:
format code extension resolution note
0 m4a audio only
hls-120 mp4 audio only 120k , mp4a.40.2
180p mp4 320x180 304k , 63.00MiB
hls-310 mp4 320x180 310k , avc1.77.30, mp4a.40.2
270p mp4 480x270 438k , 90.72MiB
hls-440 mp4 480x270 440k , avc1.77.30, mp4a.40.2
406p mp4 720x406 565k , 116.95MiB
hls-570 mp4 720x406 570k , avc1.77.30, mp4a.40.2
720p mp4 1280x720 1274k , 263.66MiB
hls-1280 mp4 1280x720 1280k , avc1.77.30, mp4a.40.2
1080p mp4 1920x1080 3171k , 655.95MiB
hls-3180 mp4 1920x1080 3180k , avc1.77.30, mp4a.40.2 (best)
$
But this touches a lot of files, so it needs a PR rather than just a patch, and the changes to the other files need checking to be sure they don't break anything else.
Thanks a lot, I can live with the workaround for the moment: search for "mediaid" (including the double quotes) in the source code of the page containing the video. The string that follows it is the ID needed to download the video from jwplatform as described earlier.
Out of interest: How did you extract the string?
If JS is disabled, as when yt-dl gets the page, the mediaid isn't populated. There are two tells:
A script tag of type application/ld+json, pretty-printed and redacted below, has most of the info for a single video; in particular the JWP id can be extracted from the data-hid attribute of the script tag:
{
  '@context': 'http://schema.org',
  '@type': 'VideoObject',
  '@id': 'https://www.nzz.ch/wissenschaft/migraene-chronische-kopfschmerzen-bis-zum-verlust-vom-lebenswille-ld.1639635#ndKNFKmf',
  'description': 'Ein NZZ Format über chronische Schmerzen, teure Hoffnungen und ein Hirn im Vollgas-Modus. ',
  'contentUrl': 'https://cdn.jwplayer.com/videos/ndKNFKmf-aSRX9V0s.mp4',
  'width': 1920,
  'height': 1080,
  'inLanguage': 'de-CH',
  'name': 'NZZ Format | Migräne: Folterkammer im Kopf',
  'thumbnailUrl': [
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=320',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=480',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=640',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=720',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=1280',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=1920',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.mp4?width=320',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.mp4?width=640'
  ],
  'uploadDate': '2021-08-03T07:15:00.000Z',
  'duration': 'PT0H28M55S'
}</script>
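A rough sketch of reading that block without running any JS. The `webpage` string is a cut-down stand-in for the real page, and all three function names are hypothetical. It takes the id from the fragment of `@id` rather than from `data-hid`, because the exact shape of that attribute isn't shown here.

```python
import json
import re

# Stand-in for the downloaded page; only the fields used below are included.
# The real <script> tag carries further attributes (e.g. data-hid).
webpage = '''<html><head>
<script type="application/ld+json">{
  "@context": "http://schema.org",
  "@type": "VideoObject",
  "@id": "https://www.nzz.ch/wissenschaft/migraene-chronische-kopfschmerzen-bis-zum-verlust-vom-lebenswille-ld.1639635#ndKNFKmf",
  "contentUrl": "https://cdn.jwplayer.com/videos/ndKNFKmf-aSRX9V0s.mp4",
  "duration": "PT0H28M55S"
}</script>
</head></html>'''


def find_video_objects(html):
    """Yield every JSON-LD object of type VideoObject found in the page."""
    pattern = r'(?s)<script\b[^>]*\btype="application/ld\+json"[^>]*>(.*?)</script>'
    for raw in re.findall(pattern, html):
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip malformed blocks
        if isinstance(data, dict) and data.get('@type') == 'VideoObject':
            yield data


def jwp_id_from_ld(data):
    """Take the JWP id from the fragment of '@id' (the part after '#')."""
    return data.get('@id', '').partition('#')[2] or None


def iso8601_duration_seconds(value):
    """Parse the PT..H..M..S subset of ISO 8601 durations into seconds."""
    m = re.match(r'^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$', value or '')
    if not m:
        return None
    hours, minutes, seconds = (int(g or 0) for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds


video = next(find_video_objects(webpage))
media_id = jwp_id_from_ld(video)
duration = iso8601_duration_seconds(video['duration'])
print(media_id, duration)
```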
The hydration data (the return {...} bit of the hydration expression), pretty-printed and redacted below, contains a nested content object containing a playlist with a link element that includes the JWP id, as well as a sources element that lists other formats:
{
  content: {
    title: dc,
    description: dd,
    kind: 'Single Item',
    playlist: [
      {
        title: dc,
        mediaid: b$,
        link: 'https://cdn.jwplayer.com/previews/ndKNFKmf',
        image: de,
        images: [
          {
            src: 'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=320',
            width: ca,
            type: i
          },
          ...
        ],
        duration: 1735,
        pubdate: 1627974900,
        description: dd,
        tags: 'NZZ Video,Wissenschaft,forschung,medizin,gesundheit,nzz format,video,krankheit,neurowissenschaft,Medikamente,Volkskrankheit,Hirn,Migräne,Kopfschmerzen,Migräniker,Chronische Schmerzen,Botox,Lebenswille,NZZ Format in voller Länge',
        sources: [
          {
            file: 'https://cdn.jwplayer.com/manifests/ndKNFKmf.m3u8',
            type: 'application/vnd.apple.mpegurl'
          },
          {
            file: 'https://cdn.jwplayer.com/videos/ndKNFKmf-aSRX9V0s.mp4',
            type: ac,
            height: 1080,
            width: di,
            label: '1080p',
            bitrate: 3171488,
            filesize: 687816570,
            framerate: aG
          },
          ...
        ],
        tracks: [{
          file: 'https://cdn.jwplayer.com/strips/ndKNFKmf-120.vtt',
          kind: 'thumbnails'
        }],
        variations: {},
        playerId: dj
      }
    ],
    feed_instance_id: '4b1078bc-1396-484e-8fa7-aa6ddf72e32c',
    playerId: dj,
    intl: {...}
  }
}
Note, e.g., mediaid has the value b$, which is one of a vast number of parameters passed to the enclosing function (its value "ndKNFKmf" was at line 319 column 91069 of the page in the list of parameter values when I fetched it). When you let the JS hydrate the page, the value of mediaid gets populated, as you found.
But in either case, the formats and metadata in the JSON can be ignored because they can be extracted from the JWPlatform API once the id has been extracted, as in the posted log.
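Since the link and sources entries contain the id as a literal, a plain regex over the unhydrated page may be enough. This is a sketch under two assumptions: that JWP media ids are 8 alphanumeric characters, and that the cdn.jwplayer.com previews/manifests URLs stay in the page as literals. `find_jwp_ids` is a hypothetical name.

```python
import re

# Fragment of the hydration expression as served (JS not executed).
hydration = """
playlist: [{
  title: dc,
  mediaid: b$,
  link: 'https://cdn.jwplayer.com/previews/ndKNFKmf',
  sources: [{file: 'https://cdn.jwplayer.com/manifests/ndKNFKmf.m3u8'}]
}]
"""

# Assumption: JWP media ids are 8 alphanumeric characters.
JWP_ID_RE = r'cdn\.jwplayer\.com/(?:previews|manifests)/([0-9A-Za-z]{8})\b'


def find_jwp_ids(text):
    """Return the distinct JWP ids referenced in text, in order of appearance."""
    seen = []
    for media_id in re.findall(JWP_ID_RE, text):
        if media_id not in seen:
            seen.append(media_id)
    return seen


ids = find_jwp_ids(hydration)
urls = ['jwplatform:' + i for i in ids]
print(urls)
```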
Of course, that works only as long as NZZ uses JWP for its media. The provider id "jw" is another of the numerous parameters used to populate the playlist object, but I didn't bother to try checking it.
Checklist
Verbose log
Description
No error is displayed, but nevertheless no video is downloaded. I am guessing that this should work however, since NZZ is listed as one of the supported extractors:
youtube-dl --list-extractors | grep -i nzz
NZZ