ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

[Tumblr] some links returning Unable to download webpage: HTTP Error 403: Forbidden #29585

Open someziggyman opened 3 years ago

someziggyman commented 3 years ago

Verbose log

youtube-dl -v -F https://everythingfox.tumblr.com/post/656964996113301504/i-am-fierce-via
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '-F', 'https://everythingfox.tumblr.com/post/656964996113301504/i-am-fierce-via']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.06.06
[debug] Git HEAD: 7d37d0970
[debug] Python version 3.9.6 (CPython) - macOS-11.4-arm64-arm-64bit
[debug] exe versions: none
[debug] Proxy map: {}
[Tumblr] 656964996113301504: Downloading webpage
[Tumblr] 656964996113301504: Downloading iframe page
ERROR: Unable to download webpage: HTTP Error 403: Forbidden (caused by <HTTPError 403: 'Forbidden'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
  File "/opt/homebrew/Cellar/youtube-dl/2021.6.6/libexec/lib/python3.9/site-packages/youtube_dl/extractor/common.py", line 634, in _request_webpage
    return self._downloader.urlopen(url_or_request)
  File "/opt/homebrew/Cellar/youtube-dl/2021.6.6/libexec/lib/python3.9/site-packages/youtube_dl/YoutubeDL.py", line 2288, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/opt/homebrew/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 523, in open
    response = meth(req, response)
  File "/opt/homebrew/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 632, in http_response
    response = self.parent.error(
  File "/opt/homebrew/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 561, in error
    return self._call_chain(*args)
  File "/opt/homebrew/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/opt/homebrew/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 641, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)

Description

Test link: https://everythingfox.tumblr.com/post/656964996113301504/i-am-fierce-via

The same post in a slightly different format: https://everythingfox.tumblr.com/post/656964996113301504/embed

However, these links work, even though the structure seems to be the same (subdomain, post, ID, video name): https://dumbasscats.tumblr.com/post/638777506589229056/a-true-captain-goes-down-with-his-ship-via-reddit and https://cuteanimalshare.tumblr.com/post/656841552268869632/who-doesnt-like-ginger-cats

dirkf commented 3 years ago

The failing URL https://everythingfox.tumblr.com/post/656964996113301504/i-am-fierce-via needs the Referer header to be added when fetching the iframe URL (the value being the URL of the original page).
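
To illustrate the Referer point, here is a minimal sketch using only the standard library. The helper name `build_iframe_request` and the iframe URL are hypothetical (the real iframe URL comes from the post's HTML); inside an extractor the same idea would presumably go through the `headers` argument of `_download_webpage` instead.

```python
import urllib.request


def build_iframe_request(iframe_url, page_url):
    """Build a request for the iframe that names the original page as Referer."""
    return urllib.request.Request(iframe_url, headers={'Referer': page_url})


page_url = 'https://everythingfox.tumblr.com/post/656964996113301504/i-am-fierce-via'
# Hypothetical placeholder: the actual iframe URL must be read from the page.
iframe_url = 'https://example.invalid/iframe/656964996113301504'

req = build_iframe_request(iframe_url, page_url)
# The request is only constructed here, not sent.
print(req.get_header('Referer'))
```

Passing `req` to an opener would then send the header along with the fetch; whether that alone clears the 403 for every affected post is not something this sketch verifies.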

Also, the page has 10 video iframes, but the extractor only finds the first (top) one. The extractor should default to selecting the first video unless a playlist is requested, but, because --yes-playlist isn't distinguishable from failing to say --no-playlist, there is no way for yt-dl to do that.
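
The playlist-flag problem can be shown with a small sketch. It assumes, as a simplification, that both flags write to one boolean destination that defaults to `False`; the function names are hypothetical and the real option definitions in yt-dl may differ in detail.

```python
from optparse import OptionParser


def make_parser():
    parser = OptionParser()
    # Assumption: both flags share a single boolean destination, default False.
    parser.add_option('--no-playlist', action='store_true',
                      dest='noplaylist', default=False)
    parser.add_option('--yes-playlist', action='store_false',
                      dest='noplaylist', default=False)
    return parser


def noplaylist_for(argv):
    """Return the parsed value of the shared destination for the given arguments."""
    opts, _ = make_parser().parse_args(argv)
    return opts.noplaylist


print(noplaylist_for([]))                  # False: no flag given
print(noplaylist_for(['--yes-playlist']))  # False: same value as no flag
print(noplaylist_for(['--no-playlist']))   # True
```

With this arrangement, code that reads the parsed value cannot tell an explicit `--yes-playlist` from the absence of either flag, which is the limitation described above.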

dirkf commented 2 years ago

Most extractors follow the browser's access paths, so that we know the extracted item corresponds to the resource indicated by the extracted URL.

When using an API that isn't directly invoked in the browser access path, we need to understand what metadata is available, in case the webpage needs to be searched for missing fields, and to what extent the API is supported/documented.

In this case, just pulling the yt-dlp fixes looks like a simple solution and would avoid duplicate code.

dirkf commented 2 years ago

If the site/app has a function like that, I'd count it as a documented API. But such deep link URLs can be handled by adding an extractor or by extending an existing URL pattern. The default approach I described follows from yt-dl pre-dating the smartphone app era.

Of course, yt-dl has its own custom links, such as using just the YT ID, or ytsearchall:..., or kaltura:partner:id.
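
As a rough sketch of how such custom links can be recognised by a URL pattern: the regular expressions below are simplified assumptions written for illustration, not the patterns the actual extractors use, and the `classify` helper and sample values are hypothetical.

```python
import re

# Simplified, illustrative patterns; the real extractors' patterns are more elaborate.
CUSTOM_LINK_PATTERNS = {
    'ytsearchall': re.compile(r'^ytsearchall:(?P<query>.+)$'),
    'kaltura': re.compile(r'^kaltura:(?P<partner_id>\d+):(?P<id>[0-9a-z_]+)$'),
}


def classify(link):
    """Return (scheme name, captured fields) for a custom link, or (None, {})."""
    for name, pattern in CUSTOM_LINK_PATTERNS.items():
        match = pattern.match(link)
        if match:
            return name, match.groupdict()
    return None, {}


# Sample inputs are made up for the example.
print(classify('kaltura:1234:0_abcd'))
print(classify('ytsearchall:foxes'))
print(classify('https://example.invalid/watch'))
```

A deep link scheme from a site's app could be handled the same way, by adding one more pattern, which matches the suggestion above of extending an existing URL pattern.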