NZZ doesn't work anymore

github-com-bqm commented 2 years ago

Checklist

[x] I'm reporting a broken site support
[x] I've verified that I'm running youtube-dl version 2021.12.17
[x] I've checked that all provided URLs are alive and playable in a browser
[x] I've checked that all URLs and arguments with special characters are properly quoted or escaped
[x] I've searched the bugtracker for similar issues including closed ones

Verbose log

youtube-dl --verbose "https://www.nzz.ch/wissenschaft/migraene-chronische-kopfschmerzen-bis-zum-verlust-vom-lebenswille-ld.1639635?jwsource=cl"
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'--restrict-filenames', u'--verbose', u'https://www.nzz.ch/wissenschaft/migraene-chronische-kopfschmerzen-bis-zum-verlust-vom-lebenswille-ld.1639635?jwsource=cl']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 2.7.16 (CPython) - Darwin-20.6.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
[NZZ] 1639635: Downloading webpage
[download] Downloading playlist: 1639635
[NZZ] playlist 1639635: Collected 0 video ids (downloading 0 of them)
[download] Finished downloading playlist: 1639635

Description

No error is displayed, but nevertheless no video is downloaded. I am guessing that this should work however, since NZZ is listed as one of the supported extractors:

youtube-dl --list-extractors | grep -i nzz
NZZ

dirkf commented 2 years ago

Yes, and that's how [NZZ] appears in the log.

Obviously the extractor needs to be rewritten. The site seems to be serving React-ish pages where the additional page content (hydration) is in a JS expression like this

<script> window.__NZZ__ = (function (arg1, ..., argN) { /* assignments */; return { /* possibly useful data but including variable references */ }; }(binding1, ..., bindingN)); </script>

Here N is large, in the hundreds. You might expect that the function body would generate some template and the arguments would supply the values to populate that template for a specific page, but there seems to be no logical distinction between data set in the function body and in the arguments.

So there isn't a JSON constant that can be decoded directly. The jsinterp module isn't able to execute the expression, so we'll have to hack the JS about to get the targets.

Also, the videos are being served via cdn.jwplatform.com, so a possibility is to pass just the IDs to JWPlatformIE, thus avoiding extracting metadata from the page.
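
A minimal sketch of the "pass just the IDs" idea, not the extractor code itself. The helper name jwplatform_urls and the regular expression are assumptions for illustration; the pattern is based only on the JW Player CDN URL shapes that appear in this thread, and it assumes media IDs are 8 alphanumeric characters.

```python
import re

# Assumption: JW Player media IDs are 8 alphanumeric characters and show up
# in CDN URLs such as .../previews/<id> or .../videos/<id>-<suffix>.mp4
_JWP_ID_RE = re.compile(
    r'https?://cdn\.jwp(?:layer|latform)\.com/'
    r'(?:previews|manifests|videos|v2/media)/([a-zA-Z0-9]{8})\b')


def jwplatform_urls(webpage):
    """Return de-duplicated 'jwplatform:<id>' URLs in order of first appearance."""
    seen = []
    for media_id in _JWP_ID_RE.findall(webpage):
        if media_id not in seen:
            seen.append(media_id)
    return ['jwplatform:' + media_id for media_id in seen]
```

Inside an extractor, each resulting URL could presumably be handed on with self.url_result(url, ie='JWPlatform'), leaving formats and metadata to JWPlatformIE.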

github-com-bqm commented 2 years ago

Before the actual video starts, a commercial is shown, which can be downloaded with the src url in the video tag (e.g. this here: https://crcdn09.adnxs-simple.com/creative/p/3927/2021/12/4/30280269/c720773f-fdd5-43dd-ac03-d2ec2d49a40e_1280_720_1700k.mp4)

After the commercial finishes, the video url is replaced with this:

blob:https://www.nzz.ch/568533c4-674c-4521-897e-059cd0561c19

Trying to use this directly with youtube-dl, I first get a message that youtube-dl doesn't support blob URLs. If I try without blob:, then I get this error:

youtube-dl --verbose "https://www.nzz.ch/568533c4-674c-4521-897e-059cd0561c19" 
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'--restrict-filenames', u'--verbose', u'https://www.nzz.ch/568533c4-674c-4521-897e-059cd0561c19']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 2.7.16 (CPython) - Darwin-20.6.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
[generic] 568533c4-674c-4521-897e-059cd0561c19: Requesting header
WARNING: Could not send HEAD request to https://www.nzz.ch/568533c4-674c-4521-897e-059cd0561c19: HTTP Error 404: Not Found
[generic] 568533c4-674c-4521-897e-059cd0561c19: Downloading webpage
ERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by HTTPError()); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 634, in _request_webpage
    return self._downloader.urlopen(url_or_request)
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 2288, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)

dirkf commented 2 years ago

In the problem page, the video of interest can be downloaded as jwplatform:ndKNFKmf

$ youtube-dl -F 'jwplatform:ndKNFKmf'
[JWPlatform] ndKNFKmf: Downloading JSON metadata
[JWPlatform] ndKNFKmf: Downloading m3u8 information
[info] Available formats for ndKNFKmf:
format code  extension  resolution note
0            m4a        audio only 
120          mp4        audio only  120k , mp4a.40.2
310          mp4        320x180     310k , avc1.77.30, mp4a.40.2
440          mp4        480x270     440k , avc1.77.30, mp4a.40.2
570          mp4        720x406     570k , avc1.77.30, mp4a.40.2
1280         mp4        1280x720   1280k , avc1.77.30, mp4a.40.2
3180         mp4        1920x1080  3180k , avc1.77.30, mp4a.40.2
7            mp4        320x180    304602k 
8            mp4        480x270    438633k 
9            mp4        720x406    565446k 
10           mp4        1280x720   1274784k 
11           mp4        1920x1080  3171488k  (best)
$

Plainly there's a x1000 bug in the last 5 formats.
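
The listing suggests that the direct MP4 formats carry a bitrate in bit/s (for example 3171488) that is being reported as if it were already in kbit/s. A minimal sketch of the conversion, assuming that is the cause; kbps_from_bps is a hypothetical helper, not youtube-dl code:

```python
def kbps_from_bps(bitrate):
    """Convert a bitrate in bit/s to kbit/s; return None for missing values."""
    if bitrate is None:
        return None
    return float(bitrate) / 1000
```

youtube-dl's own float_or_none utility with scale=1000 would presumably serve the same purpose inside the extractor.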

With a new test version of the extractor:

$ python -m youtube_dl -v -F 'https://www.nzz.ch/wissenschaft/migraene-chronische-kopfschmerzen-bis-zum-verlust-vom-lebenswille-ld.1639635?jwsource=cl'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://www.nzz.ch/wissenschaft/migraene-chronische-kopfschmerzen-bis-zum-verlust-vom-lebenswille-ld.1639635?jwsource=cl']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: bd7d796ef
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[NZZ] 1639635: Downloading webpage
[JWPlatform] ndKNFKmf: Downloading JSON metadata
[JWPlatform] 1639635: Downloading m3u8 information
[info] Available formats for 1639635:
format code  extension  resolution note
0            m4a        audio only 
hls-120      mp4        audio only  120k , mp4a.40.2
180p         mp4        320x180     304k , 63.00MiB
hls-310      mp4        320x180     310k , avc1.77.30, mp4a.40.2
270p         mp4        480x270     438k , 90.72MiB
hls-440      mp4        480x270     440k , avc1.77.30, mp4a.40.2
406p         mp4        720x406     565k , 116.95MiB
hls-570      mp4        720x406     570k , avc1.77.30, mp4a.40.2
720p         mp4        1280x720   1274k , 263.66MiB
hls-1280     mp4        1280x720   1280k , avc1.77.30, mp4a.40.2
1080p        mp4        1920x1080  3171k , 655.95MiB
hls-3180     mp4        1920x1080  3180k , avc1.77.30, mp4a.40.2 (best)
$

But this touches a whole lot of files, so it needs a PR rather than just a patch, and I also need to be sure that the changes to other files aren't breaking anything else.

github-com-bqm commented 2 years ago

Thanks a lot, I can live with the workaround for the moment: search for "mediaid" (including the double quotes) in the source code of the page containing the video. The string that follows it is the ID needed to download the video from jwplatform as described earlier.

Out of interest: How did you extract the string?

dirkf commented 2 years ago

If JS is disabled, as when yt-dl gets the page, the mediaid isn't populated. There are two tells:

this JSON object of type application/ld+json, pretty-printed and redacted below, has most of the info for a single video; in particular the JWP id can be extracted from the data-hid attribute of the script tag:

{
  '@context': 'http://schema.org',
  '@type': 'VideoObject',
  '@id': 'https://www.nzz.ch/wissenschaft/migraene-chronische-kopfschmerzen-bis-zum-verlust-vom-lebenswille-ld.1639635#ndKNFKmf',
  'description': 'Ein NZZ Format über chronische Schmerzen, teure Hoffnungen und ein Hirn im Vollgas-Modus. ',
  'contentUrl': 'https://cdn.jwplayer.com/videos/ndKNFKmf-aSRX9V0s.mp4',
  'width': 1920,
  'height': 1080,
  'inLanguage': 'de-CH',
  'name': 'NZZ Format | Migräne: Folterkammer im Kopf',
  'thumbnailUrl': [
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=320',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=480',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=640',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=720',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=1280',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=1920',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.mp4?width=320',
    'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.mp4?width=640'
  ],
  'uploadDate': '2021-08-03T07:15:00.000Z',
  'duration': 'PT0H28M55S'
}</script>

some JS (in the return {...} bit of the hydration expression), pretty-printed and redacted below, contains a nested content object containing a playlist with a link element that includes the JWP id, as well as a sources element that lists other formats:

{
  content: {
    title: dc,
    description: dd,
    kind: 'Single Item',
    playlist: [
      {
        title: dc,
        mediaid: b$,
        link: 'https://cdn.jwplayer.com/previews/ndKNFKmf',
        image: de,
        images: [
          {
            src: 'https://cdn.jwplayer.com/v2/media/ndKNFKmf/poster.jpg?width=320',
            width: ca,
            type: i
          },
         ...
        ],
        duration: 1735,
        pubdate: 1627974900,
        description: dd,
        tags: 'NZZ Video,Wissenschaft,forschung,medizin,gesundheit,nzz format,video,krankheit,neurowissenschaft,Medikamente,Volkskrankheit,Hirn,Migräne,Kopfschmerzen,Migräniker,Chronische Schmerzen,Botox,Lebenswille,NZZ Format in voller Länge',
        sources: [
          {
            file: 'https://cdn.jwplayer.com/manifests/ndKNFKmf.m3u8',
            type: 'application/vnd.apple.mpegurl'
          },
          {
            file: 'https://cdn.jwplayer.com/videos/ndKNFKmf-aSRX9V0s.mp4',
            type: ac,
            height: 1080,
            width: di,
            label: '1080p',
            bitrate: 3171488,
            filesize: 687816570,
            framerate: aG
          },
          ...
        ],
        tracks: [{
            file: 'https://cdn.jwplayer.com/strips/ndKNFKmf-120.vtt',
            kind: 'thumbnails'
        }],
        variations: {},
        playerId: dj
      }
    ],
    feed_instance_id: '4b1078bc-1396-484e-8fa7-aa6ddf72e32c',
    playerId: dj,
    intl: {...}
  }
}
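
For the first of the two tells, here is a sketch of how the JWP id might be recovered once the JSON-LD text has been decoded. The description above takes the id from the script tag's data-hid attribute; this example instead assumes, based only on the sample shown, that the id is also the fragment of '@id' and the first part of the contentUrl file name. jwp_id_from_json_ld is a hypothetical name.

```python
import json
import re


def jwp_id_from_json_ld(json_text):
    """Return the JWP media id from a VideoObject JSON-LD string, or None."""
    data = json.loads(json_text)
    if data.get('@type') != 'VideoObject':
        return None
    # Assumption: '@id' ends in '#<media id>', as in the sample above.
    fragment = data.get('@id', '').partition('#')[2]
    if re.match(r'^[a-zA-Z0-9]{8}$', fragment):
        return fragment
    # Fallback assumption: contentUrl is .../videos/<media id>-<suffix>.mp4
    mobj = re.search(r'/videos/([a-zA-Z0-9]{8})-', data.get('contentUrl', ''))
    return mobj.group(1) if mobj else None
```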

Note that, e.g., mediaid has the value b$, which is one of a vast number of parameters passed to the enclosing function (its value "ndKNFKmf" was at line 319 column 91069 of the page, in the list of parameter values, when I fetched it). When you let the JS hydrate the page, the value of mediaid gets populated, as you found.
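
To illustrate how a name like b$ maps back to its value without running the JS, here is a rough sketch that pairs the function's parameter names with the argument list. It is not the approach used in the test extractor. All function names are hypothetical, and it assumes the script has the (function (params) {...}(args)) shape shown earlier and contains no regex literals, template strings or comments that would confuse the bracket and quote tracking.

```python
import re

_OPEN, _CLOSE = '([{', ')]}'


def _scan(text, start=0):
    """Yield (index, char, depth) for each char outside string literals."""
    depth, quote, i = 0, None, start
    while i < len(text):
        ch = text[i]
        if quote:
            if ch == '\\':
                i += 1          # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in '"\'':
            quote = ch
        else:
            if ch in _CLOSE:
                depth -= 1
            yield i, ch, depth  # a bracket pair shares the same depth
            if ch in _OPEN:
                depth += 1
        i += 1


def matching_close(text, start):
    """Return the index of the bracket that closes the one at text[start]."""
    for i, ch, depth in _scan(text, start):
        if ch in _CLOSE and depth == 0:
            return i
    raise ValueError('unbalanced brackets')


def split_top_level(text):
    """Split on commas that are outside quotes and brackets."""
    parts, start = [], 0
    for i, ch, depth in _scan(text):
        if ch == ',' and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts


def hydration_bindings(script):
    """Map each parameter name to the raw JS source of its argument."""
    mobj = re.search(r'function\s*\(([^)]*)\)\s*\{', script)
    if not mobj:
        return {}
    names = [n.strip() for n in mobj.group(1).split(',')]
    body_end = matching_close(script, mobj.end() - 1)   # closing '}' of the body
    args_start = script.index('(', body_end)            # '(' opening the arguments
    args_end = matching_close(script, args_start)
    values = split_top_level(script[args_start + 1:args_end])
    return dict(zip(names, values))
```

The values are raw JS source, so a string argument keeps its quotes: hydration_bindings(script)['b$'] would give '"ndKNFKmf"' for a page like the one discussed here.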

But in either case, the formats and metadata in the JSON can be ignored because they can be extracted from the JWPlatform API once the id has been extracted, as in the posted log.

Of course, that works only as long as NZZ uses JWP for its media. The provider id "jw" is another of the numerous parameters used to populate the playlist object but I didn't bother to try checking it.
