ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

Receiving 403 Forbidden error when video is accessible in browser #26919

Open kristophercrawford opened 4 years ago

kristophercrawford commented 4 years ago

Question

I am trying to download videos from documentarymania.com, but some of them fail to download even though they play normally in a browser. I have tried changing the user agent used by youtube-dl, and I also set up an EC2 instance and tried from there to rule out my IP address being filtered. Verbose output is below:

(doc_download) root@lxc10:~/doc_download# youtube-dl --verbose --print-traffic --dump-page https://www.documentarymania.com/player.php?title=The+Body+vs+Coronavirus
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '--print-traffic', '--dump-page', 'https://www.documentarymania.com/player.php?title=The+Body+vs+Coronavirus']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2020.09.20
[debug] Python version 3.7.3 (CPython) - Linux-5.4.65-1-pve-x86_64-with-debian-10.6
[debug] exe versions: none
[debug] Proxy map: {}
[generic] player: Requesting header
send: b'HEAD /player.php?title=The+Body+vs+Coronavirus HTTP/1.1\r\nHost: www.documentarymania.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.20 Safari/537.36\r\nAccept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: en-us,en;q=0.5\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Sun, 18 Oct 2020 00:51:19 GMT
header: Content-Type: text/html; charset=UTF-8
header: Connection: close
header: Vary: Accept-Encoding
header: X-Powered-By: PHP/5.6.40
header: Expires: Thu, 19 Nov 1981 08:52:00 GMT
header: Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
header: Pragma: no-cache
header: Set-Cookie: PHPSESSID=8or0c0us88neoj1fr22tprltk5; path=/
header: Strict-Transport-Security: max-age=31536000
header: Content-Encoding: gzip
WARNING: Falling back on generic information extractor.
[generic] player: Downloading webpage
send: b'GET /player.php?title=The+Body+vs+Coronavirus HTTP/1.1\r\nHost: www.documentarymania.com\r\nCookie: PHPSESSID=8or0c0us88neoj1fr22tprltk5\r\nAccept-Encoding: *\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.20 Safari/537.36\r\nAccept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Language: en-us,en;q=0.5\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Sun, 18 Oct 2020 00:51:19 GMT
header: Content-Type: text/html; charset=UTF-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Vary: Accept-Encoding
header: X-Powered-By: PHP/5.6.40
header: Expires: Thu, 19 Nov 1981 08:52:00 GMT
header: Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
header: Pragma: no-cache
header: Strict-Transport-Security: max-age=31536000
[generic] Dumping request to https://www.documentarymania.com/player.php?title=The+Body+vs+Coronavirus

[generic] player: Extracting information
[debug] Default format spec: best/bestvideo+bestaudio
[debug] Invoking downloader on 'https://www.documentarymania.com/Videos/'
send: b'GET /Videos/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.documentarymania.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.20 Safari/537.36\r\nAccept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Language: en-us,en;q=0.5\r\nCookie: PHPSESSID=8or0c0us88neoj1fr22tprltk5\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Server: nginx
header: Date: Sun, 18 Oct 2020 00:51:20 GMT
header: Content-Type: text/html; charset=iso-8859-1
header: Content-Length: 209
header: Connection: close
header: Vary: Accept-Encoding
header: Strict-Transport-Security: max-age=31536000
ERROR: unable to download video data: HTTP Error 403: Forbidden
Traceback (most recent call last):
  File "/root/doc_download/lib/python3.7/site-packages/youtube_dl/YoutubeDL.py", line 1926, in process_info
    success = dl(filename, info_dict)
  File "/root/doc_download/lib/python3.7/site-packages/youtube_dl/YoutubeDL.py", line 1865, in dl
    return fd.download(name, info)
  File "/root/doc_download/lib/python3.7/site-packages/youtube_dl/downloader/common.py", line 366, in download
    return self.real_download(filename, info_dict)
  File "/root/doc_download/lib/python3.7/site-packages/youtube_dl/downloader/http.py", line 348, in real_download
    establish_connection()
  File "/root/doc_download/lib/python3.7/site-packages/youtube_dl/downloader/http.py", line 114, in establish_connection
    raise err
  File "/root/doc_download/lib/python3.7/site-packages/youtube_dl/downloader/http.py", line 110, in establish_connection
    ctx.data = self.ydl.urlopen(request)
  File "/root/doc_download/lib/python3.7/site-packages/youtube_dl/YoutubeDL.py", line 2238, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

(doc_download) root@lxc10:~/doc_download#
october262 commented 4 years ago

First, play this video: https://www.documentarymania.com/player.php?title=The+Body+vs+Coronavirus. Skip almost to the end and let it finish. Second, open the developer tools (F12), select the Network tab, and then the Media filter. You should see an mp4 file.

Copy the first URL, open a new tab, paste in the URL, and press Enter. The video should play. Right-click the video and select "Save Video As". The suggested file name should be videoHD.php; just append .mp4 so that it reads videoHD.php.mp4, and you should have the video downloaded.

This works best in the Firefox and Brave web browsers; I haven't tested other browsers yet.
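For anyone who would rather script the last step than use "Save Video As", here is a minimal Python sketch using only the standard library. It assumes you have already copied the media URL from the Network tab; `MEDIA_URL` below is a hypothetical placeholder, and sending a `Referer` header is an untested guess at what the server might check, not something confirmed in this thread.

```python
import shutil
import urllib.request

# Hypothetical placeholder: replace with the URL copied from the Network tab.
MEDIA_URL = "https://example.com/videoHD.php"
# The player page from this issue, sent as the Referer (assumption, untested).
PAGE_URL = "https://www.documentarymania.com/player.php?title=The+Body+vs+Coronavirus"


def build_request(media_url, referer):
    """Build a GET request with browser-like headers for the media file."""
    return urllib.request.Request(
        media_url,
        headers={
            "User-Agent": "Mozilla/5.0",
            "Referer": referer,  # assumption: the server may check this header
        },
    )


def save_video(media_url, referer, filename):
    """Stream the response body to disk without holding it all in memory."""
    request = build_request(media_url, referer)
    with urllib.request.urlopen(request) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)


# Usage (performs a real network request, so it is left commented out):
# save_video(MEDIA_URL, PAGE_URL, "videoHD.php.mp4")
```

If the server ties the media URL to a session cookie or an expiring token, this sketch will not be enough and the browser steps above remain the fallback.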

pez-public commented 4 years ago

The generic info extractor is obtaining the wrong URL. You can see this in the verbose output, where it says: [debug] Invoking downloader on 'https://www.documentarymania.com/Videos/'

It looks like it's happening here.

I believe this line is pulling the following content from the webpage:

    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "VideoObject",
      "name": "The Body vs Coronavirus",
      "description": "How can we cope with the tricky coronavirus now rampant worldwide? As the pandemic tightens its grip on the world, there are important unanswered questions about this novel virus: Why does this infection spread so rapidly from people with no symptoms? Why do some people become critical while others don't? Will a definitive treatment be found? The underlying key to these questions lie in our immune system. Immune cells are microscopic warriors, combating viruses and another pathogens. <br> Through the high-tech 'eyes' of next-generation microscopes and computer-generated imagery, we will see how our immune defense corps combat against microbes and what mechanism is expected to help develop treatment. ",
      "thumbnailUrl": "https://www.documentarymania.com/iconos/The.Body.Vs.Coronavirus.jpg",
      "uploadDate": "2020-10-08 09:53:44Z",
      "duration": "PT51M30S",
      "contentUrl": "https://www.documentarymania.com/Videos/",
      "embedUrl": "https://www.documentarymania.com/player.php?title=The Body vs Coronavirus",
      "interactionCount": "5429"
    }
    </script>

You can see that "contentUrl" is "https://www.documentarymania.com/Videos/", and it looks like this value is used here, before being merged here.
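To illustrate how a "contentUrl" like this can end up as the download target, here is a minimal standalone sketch that pulls `contentUrl` out of JSON-LD blocks using only the standard library. It is not youtube-dl's actual implementation; `find_content_urls` and the regular expression are hypothetical names written for this example, and the sample HTML is a shortened copy of the block quoted above.

```python
import json
import re

# Matches the body of each <script type="application/ld+json"> element.
JSON_LD_RE = re.compile(
    r'<script[^>]+type=(["\'])application/ld\+json\1[^>]*>(?P<body>.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def find_content_urls(html):
    """Return the contentUrl of every VideoObject found in JSON-LD blocks."""
    urls = []
    for match in JSON_LD_RE.finditer(html):
        try:
            data = json.loads(match.group("body"))
        except ValueError:
            continue  # skip JSON-LD blocks that fail to parse
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "VideoObject":
                url = item.get("contentUrl")
                if url:
                    urls.append(url)
    return urls


# Shortened version of the JSON-LD block from the page in this issue.
sample = (
    '<script type="application/ld+json"> { "@context": "http://schema.org", '
    '"@type": "VideoObject", "name": "The Body vs Coronavirus", '
    '"contentUrl": "https://www.documentarymania.com/Videos/" } </script>'
)

print(find_content_urls(sample))
# prints ['https://www.documentarymania.com/Videos/']
```

The extracted value is a directory path rather than a media file, which matches the 403 seen when the downloader requests it.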

Tested using the following:

    import youtube_dl
    from youtube_dl.extractor import generic

    url = "https://www.documentarymania.com/player.php?title=The+Body+vs+Coronavirus"
    gie = generic.GenericIE()
    gie.set_downloader(youtube_dl.YoutubeDL())
    this = gie._real_extract(url)
    print(this)

I'm no authority on schemas, but per https://schema.org/VideoObject (the @context value that's referenced in the webpage), @type VideoObject is like a subclass of @type MediaObject, and the "contentUrl" of a MediaObject is supposed to be "Actual bytes of the media object, for example the image file or video file," so I would wager that this website is not following the schema.
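As a rough illustration of that point, a consumer of JSON-LD could sanity-check "contentUrl" before trusting it. The sketch below is only a heuristic of my own, not anything youtube-dl does: `is_plausible_content_url` is a hypothetical helper that rejects URLs whose path is empty or ends in a slash, since those look like directories rather than "actual bytes" of a media file.

```python
from urllib.parse import urlparse


def is_plausible_content_url(url):
    """Heuristic: a contentUrl should point at a file, not a directory."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    path = parsed.path
    # An empty path or a trailing slash suggests a directory listing or
    # site root, which cannot be the media file itself.
    return bool(path) and not path.endswith("/")


print(is_plausible_content_url("https://www.documentarymania.com/Videos/"))
# prints False
print(is_plausible_content_url("https://example.com/media/video.mp4"))
# prints True
```

The check deliberately ignores file extensions, because a media file can be served from a path such as videoHD.php.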