ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

Facebook: unable to access the actual title of the video #14156

Open jollino opened 7 years ago

jollino commented 7 years ago


Full verbose output (youtube-dl -v <your command line>):

[debug] System config: []
[debug] User config: ['--retries', '99']
[debug] Custom config: []
[debug] Command-line args: ['--verbose', 'https://www.facebook.com/cclarinascita/videos/vl.1382797855382682/415442948635545/']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.09.02
[debug] Python version 3.6.2 - Darwin-16.7.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg 3.3.3, ffprobe 3.3.3, rtmpdump 2.4
[debug] Proxy map: {}
[facebook] 415442948635545: Downloading webpage
[debug] Default format spec: bestvideo+bestaudio/best
[debug] Invoking downloader on 'https://video-mxp1-1.xx.fbcdn.net/v/t42.1790-2/11163850_415443371968836_1361671976_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InNkIn0%3D&oh=030b09815fc1303f203f9541d321ff85&oe=59B40912'
[download] Promo della prima stagione di Camera Café  #CameraCafé #LaRinascita-415442948635545.mp4 has already been downloaded
[download] 100% of 808.89KiB
<end of log>

Description of your issue, suggested solution and other information

It looks like youtube-dl is unable to get the title of a video, and always uses the description as a title. It's actually somewhat rare for videos to have an actual title, but it would be useful for youtube-dl to use it when it is available.

For reference I'm using https://www.facebook.com/cclarinascita/videos/vl.1382797855382682/415442948635545/ but it seems to happen consistently.

The problem, I believe, is that the actual title is only shown if the video is opened from a video list page, not if the URL is accessed directly, so it's just hard to spot. Most videos, moreover, don't even have one; the first part of the description is effectively used as the title.

Compare, for instance, opening the aforementioned video URL directly versus opening it from https://www.facebook.com/cclarinascita/videos/ (1st video of the 3rd playlist from the top, named "Camera Café - I stagione - Ep. 0-49").

At least on my end, the direct url just opens a simple video page showing no title, whereas opening it from the list shows a different interface with a bigger video on the left, and a right sidebar with the title in black, the description, and more videos from the same playlist ("Up next").

I ran youtube-dl with the --write-info-json option, and it really looks like the title is not found at all:

{
   "id":"415442948635545",
   "title":"Promo della prima stagione di Camera Caf\u00e9  #CameraCaf\u00e9 #LaRinascita",
   "formats":[
      {
         "format_id":"progressive_sd_src",
         "url":"https://video-mxp1-1.xx.fbcdn.net/v/t42.1790-2/11163850_415443371968836_1361671976_n.mp4?efg=eyJybHIiOjM5OSwicmxhIjo1MTIsInZlbmNvZGVfdGFnIjoic2QifQ%3D%3D&rl=399&vabr=222&oh=030b09815fc1303f203f9541d321ff85&oe=59B40912",
         "preference":-10,
         "ext":"mp4",
         "format":"progressive_sd_src - unknown",
         "protocol":"https",
         "http_headers":{
            "User-Agent":"Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20150101 Firefox/47.0 (Chrome)",
            "Accept-Charset":"ISO-8859-1,utf-8;q=0.7,*;q=0.7",
            "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Encoding":"gzip, deflate",
            "Accept-Language":"en-us,en;q=0.5"
         }
      },
      {
         "format_id":"progressive_sd_src_no_ratelimit",
         "url":"https://video-mxp1-1.xx.fbcdn.net/v/t42.1790-2/11163850_415443371968836_1361671976_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InNkIn0%3D&oh=030b09815fc1303f203f9541d321ff85&oe=59B40912",
         "preference":-10,
         "ext":"mp4",
         "format":"progressive_sd_src_no_ratelimit - unknown",
         "protocol":"https",
         "http_headers":{
            "User-Agent":"Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20150101 Firefox/47.0 (Chrome)",
            "Accept-Charset":"ISO-8859-1,utf-8;q=0.7,*;q=0.7",
            "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Encoding":"gzip, deflate",
            "Accept-Language":"en-us,en;q=0.5"
         }
      }
   ],
   "uploader":"Camera Caf\u00e9 - La Rinascita",
   "timestamp":1429881130,
   "extractor":"facebook",
   "webpage_url":"https://www.facebook.com/cclarinascita/videos/vl.1382797855382682/415442948635545/?type=1",
   "webpage_url_basename":"415442948635545",
   "extractor_key":"Facebook",
   "playlist":null,
   "playlist_index":null,
   "display_id":"415442948635545",
   "upload_date":"20150424",
   "format_id":"progressive_sd_src_no_ratelimit",
   "url":"https://video-mxp1-1.xx.fbcdn.net/v/t42.1790-2/11163850_415443371968836_1361671976_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InNkIn0%3D&oh=030b09815fc1303f203f9541d321ff85&oe=59B40912",
   "preference":-10,
   "ext":"mp4",
   "format":"progressive_sd_src_no_ratelimit - unknown",
   "protocol":"https",
   "http_headers":{
      "User-Agent":"Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20150101 Firefox/47.0 (Chrome)",
      "Accept-Charset":"ISO-8859-1,utf-8;q=0.7,*;q=0.7",
      "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "Accept-Encoding":"gzip, deflate",
      "Accept-Language":"en-us,en;q=0.5"
   },
   "fulltitle":"Promo della prima stagione di Camera Caf\u00e9  #CameraCaf\u00e9 #LaRinascita",
   "_filename":"Promo della prima stagione di Camera Caf\u00e9  #CameraCaf\u00e9 #LaRinascita-415442948635545.mp4"
}
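For anyone who wants to repeat this check without reading the whole dump, here is a minimal sketch that pulls only the title-related fields out of an info JSON document. The helper name `title_fields` is made up for illustration, and `sample` is a trimmed-down stand-in for the file written by --write-info-json, not the real file.

```python
import json


def title_fields(info_json_text):
    """Return the title-related fields of a youtube-dl info JSON document.

    Hypothetical helper for illustration; keys that are absent come back as None.
    """
    info = json.loads(info_json_text)
    return {key: info.get(key) for key in ("title", "fulltitle", "description")}


# Trimmed-down stand-in for the dump above (raw string keeps the \u escapes
# as JSON escapes, which json.loads then decodes).
sample = r'''{
  "id": "415442948635545",
  "title": "Promo della prima stagione di Camera Caf\u00e9  #CameraCaf\u00e9 #LaRinascita",
  "fulltitle": "Promo della prima stagione di Camera Caf\u00e9  #CameraCaf\u00e9 #LaRinascita"
}'''

fields = title_fields(sample)
for key, value in fields.items():
    print(key, "=", value)
```

In the dump above, "title" and "fulltitle" carry the same description-derived text, and there is no separate "description" key at all.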

I dug a bit into the HTML of https://www.facebook.com/cclarinascita/videos/vl.1382797855382682/415442948635545 (which, again, does not show the title if it's accessed directly) and discovered that the title is actually injected into the page, but it's only found inside two commented-out links to the video itself, in two different parts of the page:

<a class="_2za_" href="https://www.facebook.com/cclarinascita/videos"><span class="_50f7">Camera Café - I stagione - Ep. 0 - Sketch Promo</span></a>

and

<a data-onclick="[[&quot;TahoeController&quot;,&quot;openFromVideoLinkHelper&quot;,&#123;&quot;__elem&quot;:1&#125;,&quot;unknown&quot;]]" class="async_saving _400z _2-40 _5pcq" href="/cclarinascita/videos/415442948635545/" aria-label="Camera Caf&#xe9; - I stagione - Ep. 0 - Sketch Promo" data-video-channel-id="387437888106301:415442948635545" data-channel-caller="channel_view_from_unknown" ajaxify="#" rel="async" target="">

I'm not sure how reliable this would be given that it's commented out (would a parser even be able to access it, by default?), but I suppose it's the only silver lining here.
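On the parser question: a regular expression applied to the raw page source does not care whether the markup sits inside an HTML comment, so the commented-out link is still reachable that way. Below is a rough sketch under that assumption, using only the standard library. The pattern, the helper name `title_from_aria_label`, and the comment wrapper in `sample` are illustrative guesses based on the second snippet above, and would break whenever Facebook changes its markup.

```python
import html
import re

# Hypothetical pattern: it relies on href preceding aria-label inside the
# same <a> tag, as in the snippet quoted above.
ARIA_LABEL_RE = re.compile(
    r'<a\b[^>]*\bhref="/[^"/]+/videos/(?P<id>\d+)/"'
    r'[^>]*\baria-label="(?P<title>[^"]+)"')


def title_from_aria_label(webpage, video_id):
    """Return the aria-label of the link pointing at video_id, or None."""
    for match in ARIA_LABEL_RE.finditer(webpage):
        if match.group('id') == video_id:
            # aria-label is HTML-escaped in the source (e.g. &#xe9;).
            return html.unescape(match.group('title'))
    return None


# Illustrative sample: the surrounding comment wrapper is an assumption.
sample = (
    '<code><!-- <a data-onclick="..." class="async_saving _400z _2-40 _5pcq" '
    'href="/cclarinascita/videos/415442948635545/" '
    'aria-label="Camera Caf&#xe9; - I stagione - Ep. 0 - Sketch Promo" '
    'rel="async" target=""> --></code>')

print(title_from_aria_label(sample, '415442948635545'))
# prints: Camera Café - I stagione - Ep. 0 - Sketch Promo
```

Matching on the video id in the href keeps the sketch from picking up the aria-label of some other video linked from the same page.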

Also, it appears that changing the ?type= parameter in the query string has absolutely no effect; the same goes for removing the playlist part of the URL (the intermediate /vl.1382797855382682/ in this case).

barsnick commented 5 years ago

This has been annoying me for quite some time as well. The Facebook users (i.e. content providers) give their videos actual titles as captions, and youtube-dl extracts the text "below".

Another random example: https://www.facebook.com/markberubemusic/videos/vb.350276931674911/315046592680474/?type=2&theater

The caption says:

This week in Switzerland...

13.12.18 - Basel - Kaserne / 15.12.18 - Bern - Dachstock * opening for Sophie Hunger

youtube-dl extracts:

13.12.18 - Basel - Kaserne / 15.12.18 - Bern - Dachstock * opening for Sophie...

Whereas I would take the title to be:

This week in Switzerland...

I can give tons of other examples.

I found this "proper" title in Facebook's current HTML code within this construct: <title id="pageTitle">... | Facebook</title>

I'll post a pull request adding this regex to youtube-dl's title extraction. The resulting titles are much closer to what I expect. It does change almost every title (including those in all the tests). It also changes group videos' titles from "[group name] has 481 members" to "[group name] Public Group". I don't see that side effect as too bad, though.

If you don't want to wait for the pull request, here's the regex:

r'(?s)<title id="pageTitle"[^>]*>([^<]*)(?: \| Facebook)</title>'

or the code:

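        # Fall back to the page's <title> element, minus the " | Facebook"
        # suffix, when no title has been found so far.
        if not video_title:
            video_title = self._html_search_regex(
                r'(?s)<title id="pageTitle"[^>]*>([^<]*)(?: \| Facebook)</title>',
                webpage, 'title', default=None)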
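To try the regex outside youtube-dl, here is a small standalone sketch using only the standard library. The helper name `title_from_page_title` and both sample pages are made up for illustration. `html.unescape` plus `strip()` only roughly approximates the cleanup that `_html_search_regex` applies to its result.

```python
import html
import re

# Same pattern as above.
PAGE_TITLE_RE = r'(?s)<title id="pageTitle"[^>]*>([^<]*)(?: \| Facebook)</title>'


def title_from_page_title(webpage):
    """Return the page title without the ' | Facebook' suffix, or None."""
    match = re.search(PAGE_TITLE_RE, webpage)
    if match is None:
        return None
    return html.unescape(match.group(1)).strip()


# Hypothetical pages, not real Facebook responses.
with_suffix = ('<head><title id="pageTitle">'
               'This week in Switzerland... | Facebook</title></head>')
without_suffix = '<head><title id="pageTitle">Facebook</title></head>'

print(title_from_page_title(with_suffix))     # prints: This week in Switzerland...
print(title_from_page_title(without_suffix))  # prints: None
```

Note that the " | Facebook" suffix is required by the pattern, so a page whose title lacks it yields no match; in the extractor that simply leaves `video_title` unset because of `default=None`.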