ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

Unable to download JSON metadata on Raywenderlich.com #24027

Open Stunner opened 4 years ago

Stunner commented 4 years ago

Verbose log

$ youtube-dl --verbose https://www.raywenderlich.com/4743-beginning-rxswift
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'--verbose', u'https://www.raywenderlich.com/4743-beginning-rxswift']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2020.01.24
[debug] Python version 2.7.16 (CPython) - Darwin-19.2.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg 4.2.2, ffprobe 4.2.2, rtmpdump 2.4
[debug] Proxy map: {}
[RayWenderlichCourse] 4743-beginning-rxswift: Downloading webpage
[download] Downloading playlist: Beginning RxSwift
[RayWenderlichCourse] playlist Beginning RxSwift: Collected 39 video ids (downloading 39 of them)
[download] Downloading video 1 of 39
[RayWenderlich] 4743-beginning-rxswift/1: Downloading webpage
[RayWenderlich] 4743-beginning-rxswift/1: Downloading JSON metadata
[vimeo] 266135871: Downloading webpage
[vimeo] 266135871: Extracting information
[vimeo] 266135871: Downloading JSON metadata
WARNING: Unable to download JSON metadata: HTTP Error 404: Not Found
[vimeo] 266135871: Downloading akfire_interconnect_quic m3u8 information
[vimeo] 266135871: Downloading fastly_skyfire m3u8 information
[vimeo] 266135871: Downloading akfire_interconnect_quic MPD information
[vimeo] 266135871: Downloading akfire_interconnect_quic MPD information
[vimeo] 266135871: Downloading fastly_skyfire MPD information
[vimeo] 266135871: Downloading fastly_skyfire MPD information
[debug] Default format spec: bestvideo+bestaudio/best
[download] Introduction-266135871.mp4 has already been downloaded and merged
[download] Downloading video 2 of 39
[RayWenderlich] 4743-beginning-rxswift/2: Downloading webpage
[RayWenderlich] 4743-beginning-rxswift/2: Downloading JSON metadata
[vimeo] 266136175: Downloading webpage
[vimeo] 266136175: Extracting information
[vimeo] 266136175: Downloading JSON metadata
WARNING: Unable to download JSON metadata: HTTP Error 404: Not Found
[vimeo] 266136175: Downloading akfire_interconnect_quic m3u8 information
[vimeo] 266136175: Downloading fastly_skyfire m3u8 information
[vimeo] 266136175: Downloading akfire_interconnect_quic MPD information
[vimeo] 266136175: Downloading akfire_interconnect_quic MPD information
[vimeo] 266136175: Downloading fastly_skyfire MPD information
[vimeo] 266136175: Downloading fastly_skyfire MPD information
[debug] Default format spec: bestvideo+bestaudio/best
[download] Hello RxSwift-266136175.mp4 has already been downloaded and merged
[download] Downloading video 3 of 39
[RayWenderlich] 4743-beginning-rxswift/3: Downloading webpage
[RayWenderlich] 4743-beginning-rxswift/3: Downloading JSON metadata
ERROR: Unable to download JSON metadata: HTTP Error 403: Forbidden (caused by HTTPError()); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 627, in _request_webpage
    return self._downloader.urlopen(url_or_request)
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 2237, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)

$

Description

Downloading this course requires account credentials, but I cannot provide them: the account is shared and I am not its primary owner.

anhdle14 commented 4 years ago

After debugging, I found several issues with the current raywenderlich.py:

  1. Currently, RWL (short for Ray Wenderlich) uses stateful cookie sessions, and the USER_TOKEN is exposed in an inline script on the webpage:
    <script>
//<![CDATA[

      window.CAROLUS_ENV = {
        KERCHING_BASE_URL: "https://store.raywenderlich.com/",
        BETAMAX_BASE_URL: "https://videos.raywenderlich.com/api/v1",
        GUARDPOST_BASE_URL: "https://accounts.raywenderlich.com/v2",
        CONTENT_PERMISSIONS_REQUIRED_COOKIE_DOMAIN: ".raywenderlich.com",
        USER_TOKEN: "*"
      };
//]]>
</script>
  2. The 403 JSON error comes from L106 and L116. The correct way to get the JSON is:
GET /api/v1/videos/3712.json
Accept: application/json, text/javascript, */*; q=0.01
Authorization: Token $USER_TOKEN
Origin: https://www.raywenderlich.com
Referer: https://www.raywenderlich.com/
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.2 Safari/605.1.15
X-Requested-With: XMLHttpRequest
X-CSRF-Token: *

You can work around this by passing the USER_TOKEN in through a parameter (for example --video-password) and reading that value in raywenderlich.py.
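
As a rough illustration of the request described above, here is a minimal standalone sketch in Python 3 using only the standard library. It is not the extractor's actual code. The helper names `extract_user_token` and `build_video_request` are hypothetical, and it assumes that the `/api/v1/videos/<id>.json` path lives under the `BETAMAX_BASE_URL` shown in the page script. The `X-CSRF-Token` and `User-Agent` headers are left out because their values are session-specific.

```python
import re
import urllib.request

# Assumption: the JSON endpoint is served from BETAMAX_BASE_URL,
# as listed in the window.CAROLUS_ENV snippet on the page.
BETAMAX_BASE_URL = "https://videos.raywenderlich.com/api/v1"


def extract_user_token(webpage):
    """Return the USER_TOKEN embedded in the page script, or None if absent."""
    match = re.search(r'USER_TOKEN:\s*"([^"]+)"', webpage)
    return match.group(1) if match else None


def build_video_request(video_id, user_token):
    """Build (but do not send) the authenticated request for a video's JSON."""
    url = "%s/videos/%s.json" % (BETAMAX_BASE_URL, video_id)
    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Authorization": "Token %s" % user_token,
        "Origin": "https://www.raywenderlich.com",
        "Referer": "https://www.raywenderlich.com/",
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(url, headers=headers)
```

The token could come either from the logged-in page via `extract_user_token` or from a user-supplied option, as suggested above. The resulting request would then be opened with a cookie-aware opener that carries the logged-in session.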

  3. The current implementation uses the thumbnailUrl in the HTML meta tags to get the lessonId. Apparently there is no way to obtain that ID (3712 in this example) other than from the thumbnail, and some videos have no thumbnail.
<meta property="og:image" content="https://files.betamax.raywenderlich.com/attachments/videos/3712/f0a9b08b-3919-4b5a-aad7-40676ce0fa1f.png">
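
To make the thumbnail dependency concrete, the following sketch shows one way the ID could be pulled out of the `og:image` tag, returning `None` when the tag is missing so that the caller can report a clear error instead of failing later. The function name `extract_lesson_id` is hypothetical, and the regular expression assumes the attribute order and URL layout shown in the tag above.

```python
import re


def extract_lesson_id(webpage):
    """Return the numeric ID from the og:image thumbnail URL, or None."""
    match = re.search(
        r'<meta[^>]+property="og:image"[^>]+'
        r'content="[^"]*/attachments/videos/(\d+)/',
        webpage)
    return match.group(1) if match else None
```

A page without a thumbnail yields `None`, which is exactly the case the current approach cannot handle.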