Cannot correctly resolve `bilibili.com` video URLs contained in a festival

szdytom commented 1 year ago

Checklist

[x] I'm reporting a broken site support
[x] I've verified that I'm running youtube-dl version 2021.12.17
[x] I've checked that all provided URLs are alive and playable in a browser
[x] I've checked that all URLs and arguments with special characters are properly quoted or escaped
[x] I've searched the bugtracker for similar issues including closed ones

Verbose log

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['https://www.bilibili.com/video/BV1dZ4y1Y7bt', '-v']
[debug] Encodings: locale cp936, fs mbcs, out cp936, pref cp936
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.4.4 (CPython) - Windows-10-10.0.19041
[debug] exe versions: none
[debug] Proxy map: {}
[BiliBili] 1dZ4y1Y7bt: Downloading webpage
[BiliBili] 1dZ4y1Y7bt: Downloading video info page
ERROR: Unable to extract title; please report this issue on https://yt-dl.org/bug . Make sure you are using
 the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and
 include its complete output.
Traceback (most recent call last):
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\Youtube
DL.py", line 815, in wrapper
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\Youtube
DL.py", line 836, in __extract_info
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\extract
or\common.py", line 534, in extract
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\extract
or\bilibili.py", line 213, in _real_extract
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\extract
or\common.py", line 1021, in _html_search_regex
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\extract
or\common.py", line 1012, in _search_regex
youtube_dl.utils.RegexNotFoundError: Unable to extract title; please report this issue on https://yt-dl.org
/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-
dl with the --verbose flag and include its complete output.

Description

youtube-dl cannot correctly resolve bilibili.com video URLs that are contained in a festival, for example:

https://www.bilibili.com/festival/lty10th?bvid=BV1dZ4y1Y7bt

while a normal video URL (not contained in a festival) should look like

https://www.bilibili.com/video/BVxxxxxxxx

but using https://www.bilibili.com/video/BV1dZ4y1Y7bt still does not work, because it automatically redirects back to the festival URL.

Video links on bilibili.com that are contained in a festival cannot be parsed correctly.

dirkf commented 1 year ago

The _VALID_URL can be updated to match URLs like https://www.bilibili.com/festival/lty10th?bvid=BV1dZ4y1Y7bt. Is this the only such format (i.e. .../festival/slug?bvid=...), or should other top-level path components and/or more path components be matched?
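
As an illustration only, a pattern covering both URL shapes might look like the sketch below. The regular expression and the helper name `extract_bvid` are hypothetical and are not the extractor's actual `_VALID_URL`; the sketch assumes the festival form has a single slug and always carries the ID in a `bvid` query parameter, and it keeps the `BV` prefix in the returned ID.

```python
import re

# Hypothetical pattern for illustration; NOT youtube-dl's actual _VALID_URL.
# Assumption: festival URLs look like .../festival/<slug>?bvid=<bvid>.
VALID_URL = r'''(?x)
    https?://(?:www\.)?bilibili\.com/
    (?:
        video/(?P<id>BV[0-9A-Za-z]+)|
        festival/(?P<slug>[^/?\#]+)\?(?:[^\#]*&)?bvid=(?P<id_fest>BV[0-9A-Za-z]+)
    )
'''


def extract_bvid(url):
    """Return the BV id from a plain video URL or a festival URL, else None."""
    m = re.match(VALID_URL, url)
    if not m:
        return None
    # Exactly one of the two alternatives can have matched.
    return m.group('id') or m.group('id_fest')


print(extract_bvid('https://www.bilibili.com/festival/lty10th?bvid=BV1dZ4y1Y7bt'))
print(extract_bvid('https://www.bilibili.com/video/BV1dZ4y1Y7bt'))
```

Whether other top-level path components need to be matched is exactly the open question here, so the alternation would have to grow if more formats turn up.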

The error occurs because the title extraction fails. The problem page contains <title>洛天依十周年官方演唱会</title>. If that should be the fallback title, that's fine, but I'm not familiar with the content. Then we get:

$ python3.9 -m youtube_dl -v -F 'https://www.bilibili.com/festival/lty10th?bvid=BV1dZ4y1Y7bt'
[debug] System config: ['--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '-F', 'https://www.bilibili.com/festival/lty10th?bvid=BV1dZ4y1Y7bt']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: a5464aca1
[debug] Python version 3.9.16 (CPython) - Linux-4.4.0-210-generic-i686-with-glibc2.23
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[BiliBili] 1dZ4y1Y7bt: Downloading webpage
[BiliBili] 1dZ4y1Y7bt: Downloading video info page
WARNING: unable to extract description; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
WARNING: unable to extract og:image; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
[info] Available formats for 1dZ4y1Y7bt:
format code  extension  resolution note
0            flv        unknown    3.53GiB
$

li6in9muyou commented 1 year ago

The URL format .../festival/<slug>?bvid=<bvid> is used only on rare occasions.
What's in the <title> tag should not be the fallback title; that is the title of the "festival". The requested video is one of many videos published in this "festival".

dirkf commented 1 year ago

What should be the title of the test video https://www.bilibili.com/festival/lty10th?bvid=BV1dZ4y1Y7bt?

If there isn't an obvious candidate, the title could be `f'{festival_title}: {video_id}'` or similar.

li6in9muyou commented 1 year ago

The element can be located with `.video-toobar_title`, whose innerText is `【洛天依原创曲】光与影的对白【2022官方生贺曲】`. This is very different from other video pages.

dirkf commented 1 year ago

That's fine. There are other fields not being extracted but I don't think they should cause warnings. Obviously, suggestions for alternative sources in the page are welcome.

$ python3.9 -m youtube_dl --get-title 'https://www.bilibili.com/festival/lty10th?bvid=BV1dZ4y1Y7bt'
WARNING: unable to extract description; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
WARNING: unable to extract og:image; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
【洛天依原创曲】光与影的对白【2022官方生贺曲】
$

Are the `【】` part of the title or should they be stripped?

szdytom commented 1 year ago

No, they shouldn't be stripped; the `【】` is part of the title.

P.S. The video description can be read with `document.querySelector('.video-desc').innerHTML`.
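
The title logic discussed in this thread could be sketched roughly as follows. This is a standalone illustration using only the standard library, not the extractor's code: the function name `pick_title`, the sample HTML, and the assumption that the title element contains plain text with no nested markup are all hypothetical. The class name is the one quoted earlier in the thread, and the fallback mirrors the suggested `f'{festival_title}: {video_id}'`.

```python
import html
import re


def pick_title(webpage, video_id, title_class='video-toobar_title'):
    """Pick a per-video title, falling back to '<festival title>: <video id>'."""
    # Prefer the per-video title element (assumed to hold plain text only).
    m = re.search(
        r'<(?P<tag>\w+)[^>]*\bclass="[^"]*\b%s\b[^"]*"[^>]*>(?P<title>[^<]+)</(?P=tag)>'
        % re.escape(title_class), webpage)
    if m:
        # The 【】 brackets are part of the title, so nothing is stripped
        # beyond surrounding whitespace.
        return html.unescape(m.group('title')).strip()
    # Otherwise combine the festival's <title> with the video ID.
    m = re.search(r'<title>([^<]+)</title>', webpage)
    if m:
        return '%s: %s' % (html.unescape(m.group(1)).strip(), video_id)
    return video_id


# Made-up sample page, reduced to the two elements mentioned in the thread.
sample = ('<html><head><title>洛天依十周年官方演唱会</title></head><body>'
          '<p class="video-toobar_title">【洛天依原创曲】光与影的对白【2022官方生贺曲】</p>'
          '</body></html>')
print(pick_title(sample, 'BV1dZ4y1Y7bt'))
```

A regex over raw HTML is fragile compared with a real parser; it is used here only to keep the sketch dependency-free.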

ytdl-org / youtube-dl