yt-dlp / yt-dlp

A feature-rich command-line audio/video downloader

[Twitch] Unable to download chat replay (--sub-langs rechat) #5747

Open JakubSkowron opened 1 year ago

JakubSkowron commented 1 year ago

Region

Luxembourg/Europe

Provide a description that is worded well enough to be understood

Cannot download live chat replay (comments) from any recorded live stream on Twitch.

Example video:

$ yt-dlp -vU --skip-download --write-subs --sub-langs rechat https://www.twitch.tv/videos/1670416229

We get:

yt_dlp.utils.DownloadError: Unable to download video subtitles for 'rechat': HTTP Error 410: Gone

It is probably a problem with the client_id no longer being accepted by the Twitch API, because in the output we see:

[debug] Invoking http downloader on "https://api.twitch.tv/v5/videos/1670416229/comments?client_id=kimne78kx3ncx6brgo4mv6wki5h1ko"

When we open the URL in a normal web browser, we get this message:

This api.twitch.tv page can’t be found
It may have been moved or deleted.
HTTP ERROR 410

I am using rechat, because it is on the list of available subtitles:

$ yt-dlp -vU --skip-download --list-subs https://www.twitch.tv/videos/1670416229
[debug] Command-line config: ['-vU', '--skip-download', '--list-subs', 'https://www.twitch.tv/videos/1670416229']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version 2022.11.11 [8b64402] (linux_exe)
[debug] Python 3.10.6 (CPython x86_64 64bit) - Linux-5.15.0-56-generic-x86_64-with-glibc2.35 (OpenSSL 3.0.7 1 Nov 2022, glibc 2.35)
[debug] exe versions: ffmpeg 4.4.2 (setts), ffprobe 4.4.2, rtmpdump 2.4
[debug] Optional libraries: Cryptodome-3.15.0, brotli-1.0.9, certifi-2022.09.24, mutagen-1.46.0, sqlite3-2.6.0, websockets-10.4
[debug] Proxy map: {}
[debug] Loaded 1723 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: 2022.11.11, Current version: 2022.11.11
yt-dlp is up to date (2022.11.11)
[debug] [twitch:vod] Extracting URL: https://www.twitch.tv/videos/1670416229
[twitch:vod] 1670416229: Downloading stream metadata GraphQL
[twitch:vod] 1670416229: Downloading video access token GraphQL
[twitch:vod] 1670416229: Downloading m3u8 information
[twitch:vod] 1670416229: Downloading storyboard metadata JSON
WARNING: [twitch:vod] Unable to download JSON metadata: HTTP Error 403: Forbidden
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, filesize, fs_approx, tbr, vbr, abr, asr, proto, vext, aext, hasaud, source, id
[info] Available subtitles for v1670416229:
Language Formats
rechat   json

Provide verbose output that clearly demonstrates the problem

Complete Verbose Output

[debug] Command-line config: ['-vU', '--skip-download', '--write-subs', '--sub-langs', 'rechat', 'https://www.twitch.tv/videos/1670416229']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version 2022.11.11 [8b64402] (linux_exe)
[debug] Python 3.10.6 (CPython x86_64 64bit) - Linux-5.15.0-56-generic-x86_64-with-glibc2.35 (OpenSSL 3.0.7 1 Nov 2022, glibc 2.35)
[debug] exe versions: ffmpeg 4.4.2 (setts), ffprobe 4.4.2, rtmpdump 2.4
[debug] Optional libraries: Cryptodome-3.15.0, brotli-1.0.9, certifi-2022.09.24, mutagen-1.46.0, sqlite3-2.6.0, websockets-10.4
[debug] Proxy map: {}
[debug] Loaded 1723 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: 2022.11.11, Current version: 2022.11.11
yt-dlp is up to date (2022.11.11)
[debug] [twitch:vod] Extracting URL: https://www.twitch.tv/videos/1670416229
[twitch:vod] 1670416229: Downloading stream metadata GraphQL
[twitch:vod] 1670416229: Downloading video access token GraphQL
[twitch:vod] 1670416229: Downloading m3u8 information
[twitch:vod] 1670416229: Downloading storyboard metadata JSON
WARNING: [twitch:vod] Unable to download JSON metadata: HTTP Error 403: Forbidden
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, filesize, fs_approx, tbr, vbr, abr, asr, proto, vext, aext, hasaud, source, id
[info] v1670416229: Downloading subtitles: rechat
[debug] Default format spec: bestvideo*+bestaudio/best
[info] v1670416229: Downloading 1 format(s): 1080p60
[info] Writing video subtitles to: GlobiHorror - Une soirée d'horreur pour se faire peur avant Noël ! #Horreur #Peur #Halloween [v1670416229].rechat.json
[debug] Invoking http downloader on "https://api.twitch.tv/v5/videos/1670416229/comments?client_id=kimne78kx3ncx6brgo4mv6wki5h1ko"
ERROR: Unable to download video subtitles for 'rechat': HTTP Error 410: Gone
Traceback (most recent call last):
  File "yt_dlp/YoutubeDL.py", line 3950, in _write_subtitles
  File "yt_dlp/YoutubeDL.py", line 2924, in dl
  File "yt_dlp/downloader/common.py", line 446, in download
  File "yt_dlp/downloader/http.py", line 371, in real_download
  File "yt_dlp/downloader/http.py", line 129, in establish_connection
  File "yt_dlp/YoutubeDL.py", line 3692, in urlopen
  File "urllib/request.py", line 525, in open
  File "urllib/request.py", line 634, in http_response
  File "urllib/request.py", line 563, in error
  File "urllib/request.py", line 496, in _call_chain
  File "urllib/request.py", line 643, in http_error_default
urllib.error.HTTPError: HTTP Error 410: Gone

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "yt_dlp/YoutubeDL.py", line 1485, in wrapper
  File "yt_dlp/YoutubeDL.py", line 1582, in __extract_info
  File "yt_dlp/YoutubeDL.py", line 1641, in process_ie_result
  File "yt_dlp/YoutubeDL.py", line 2737, in process_video_result
  File "yt_dlp/YoutubeDL.py", line 2983, in process_info
  File "yt_dlp/YoutubeDL.py", line 3958, in _write_subtitles
yt_dlp.utils.DownloadError: Unable to download video subtitles for 'rechat': HTTP Error 410: Gone
bashonly commented 1 year ago

It looks like twitch uses graphql to fetch rechat now.

For devs: here's an example POST request body (pretty-printed):

[
  {
    "operationName": "VideoCommentsByOffsetOrCursor",
    "variables": {
      "videoID": "1670416229",
      "contentOffsetSeconds": 0
    },
    "extensions": {
      "persistedQuery": {
        "version": 1,
        "sha256Hash": "b70a3591ff0f4e0313d126c6a1502d79a1c02baebb288227c582044aa76adf6a"
      }
    }
  }
]
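
For illustration only, here is a minimal Python sketch of sending that persisted query; the endpoint, Client-ID and hash are the ones that appear in this thread, but the exact response shape isn't guaranteed and this is not how the yt-dlp extractor is structured:

    # Sketch only: fetch one page of chat comments for a VOD via Twitch's GQL
    # persisted query. CLIENT_ID is the public web client ID seen in the old
    # v5 URL above; treat everything here as an assumption to verify.
    import json
    import urllib.request

    GQL_URL = 'https://gql.twitch.tv/gql'
    CLIENT_ID = 'kimne78kx3ncx6brgo4mv6wki5h1ko'

    def fetch_comments_page(video_id, offset_seconds=0):
        body = [{
            'operationName': 'VideoCommentsByOffsetOrCursor',
            'variables': {'videoID': video_id, 'contentOffsetSeconds': offset_seconds},
            'extensions': {'persistedQuery': {
                'version': 1,
                'sha256Hash': 'b70a3591ff0f4e0313d126c6a1502d79a1c02baebb288227c582044aa76adf6a',
            }},
        }]
        req = urllib.request.Request(
            GQL_URL, data=json.dumps(body).encode(),
            headers={'Client-ID': CLIENT_ID, 'Content-Type': 'text/plain;charset=UTF-8'})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # print(fetch_comments_page('1670416229'))
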
pukkandan commented 1 year ago

hm.. that means our http downloader won't be able to handle it

mpeter50 commented 1 year ago

@JakubSkowron I must ask this: are you sure the way you used to download live chat always downloaded the full chat history?

JakubSkowron commented 1 year ago

@mpeter50 I mostly downloaded ones where there was not much going on in chat, short streams, but for one stream I saw that only messages up to 360 seconds were downloaded ("content_offset_seconds").

mpeter50 commented 1 year ago

Yes, this is because the built-in way of downloading chat messages only ever downloaded the first page of the chat history. I think there should be an ID-like thing at the end of the file, at least when the chat history continues on further pages (a pagination cursor; see the sketch at the end of this comment).

#1551 was able to download all* chat messages until recently, but I'll try to fix it shortly. No guarantees, though.

Once I fix it, you can use it by cloning my repository and checking out the twitchvod-livechat branch.

*AFAIK moderated messages (including moderator chatbot deletions), and messages recently sent by users who have been temporarily banned (even if only for a few seconds/minutes), are not available, as these messages are all deleted.
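
The "ID-like thing" above is a pagination cursor. A rough sketch of how a full download could chain pages on top of the single-request example earlier in this thread; the field names (edges, cursor, pageInfo.hasNextPage) are assumptions about the GQL response and need to be checked against a real reply:

    # Rough sketch of cursor-based paging over the chat history. The response
    # field names used here ('data'/'video'/'comments'/'edges'/'cursor'/
    # 'pageInfo'/'hasNextPage') are assumptions, not confirmed by this thread.
    def fetch_all_comments(video_id, fetch_page):
        """fetch_page(video_id, cursor) -> parsed JSON for one page of comments."""
        comments, cursor = [], None
        while True:
            data = fetch_page(video_id, cursor)
            conn = data[0]['data']['video']['comments']
            comments.extend(edge['node'] for edge in conn.get('edges', []))
            if not conn.get('pageInfo', {}).get('hasNextPage'):
                break
            cursor = conn['edges'][-1]['cursor']  # resume after the last comment on this page
        return comments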

mpeter50 commented 1 year ago

@JakubSkowron I've uploaded the fix for Twitch chat extraction.

For now it won't be merged into the normal version of yt-dlp, because that would require heavy structural changes; development is tracked in #1551. If you want to try it out, or use it in the meantime, you can clone my fork, switch to the twitchvod-livechat branch, and run yt-dlp from there. You will also need to pass --sub-langs all,live_chat to the yt-dlp command. Keep in mind that (AFAIK) VODs and their chat are only available for 2 months after being published, and I wouldn't count on a merge within that time, even starting from now.

edvordo commented 1 year ago

For anyone stumbling into this issue, mpeter50's fork worked for me, but I had to change line 553 from

chat_history.extend(traverse_obj(comments_obj, ('edges', slice, 'node')))

to

chat_history.extend(traverse_obj(comments_obj, ('edges', ..., 'node')))
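
For reference, the ... (Ellipsis) element in a traverse_obj path branches over every element at that level, which is what makes the fixed path collect all edge nodes. A minimal, self-contained illustration, assuming a current yt-dlp checkout:

    # Minimal illustration of traverse_obj with Ellipsis, mirroring the
    # ('edges', ..., 'node') path used in the fix above.
    from yt_dlp.utils import traverse_obj

    comments_obj = {'edges': [{'node': {'id': 'a'}}, {'node': {'id': 'b'}}]}
    nodes = traverse_obj(comments_obj, ('edges', ..., 'node'))
    print(nodes)  # [{'id': 'a'}, {'id': 'b'}]
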
mpeter50 commented 1 year ago

Oh, I thought I had already pushed that change. But maybe I'm remembering something else that was similar.

Be sure to check that you did not only get the first page (~50 entries) of comments. A common indicator is files that are too small for long streams of popular streamers. Right now there are problems with anti-bot protection at Twitch which, for some reason, only prevent downloading chat pages after the first one.

If you find that you did not get all comments, there is a workaround; let me know and I'll try to summarize it. When I get back to this, I want to fix it properly.

edvordo commented 1 year ago

Seems like I got all of the almost 300 messages from the VOD I downloaded.

I do plan on downloading more, so just in case I stumble into it, could you describe the workaround anyway, please?

mpeter50 commented 1 year ago

Capture data

Open a new tab in your web browser, open the developer tools and its network tab for that tab, and then load Twitch. The order is important, because this way the devtools won't miss traffic from the beginning.


Extract identifiers

You will need 2 or 3 things from here:

Device ID: this is stored in the unique_id cookie and is set by the first request (the document itself). Although yt-dlp can extract it from your browser if you give it the --cookies-from-browser firefox argument, (for now) it won't know how it should be used: it would attach it to every request as a cookie, but we need something else.

Last I checked, this expires after a year, counted not from when the page is loaded but from when the cookie was generated. (A small sketch of pulling it out of an exported cookies.txt follows at the end of this section.)


Client-Integrity: you can find this either in the token field of the JSON response of the latest https://gql.twitch.tv/integrity request, or in later requests that send this token in the Client-Integrity header.

Last I checked, this lasts a day, so you may have to re-obtain it regularly.

[Screenshots: where the token appears when it is obtained, and where it is used in later requests.]
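
As an aside, if you export your cookies to a Netscape-format cookies.txt, the Device ID (the unique_id cookie mentioned above) can be read out with a few lines of Python; the file name here is just an assumption:

    # Illustration only: read the Twitch unique_id cookie (the Device ID) from a
    # Netscape-format cookies.txt export. 'cookies.txt' is an assumed file name.
    from http.cookiejar import MozillaCookieJar

    jar = MozillaCookieJar('cookies.txt')
    jar.load(ignore_discard=True, ignore_expires=True)
    device_id = next(
        (c.value for c in jar if c.name == 'unique_id' and c.domain.endswith('twitch.tv')),
        None)
    print(device_id)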

Use identifiers

Open the yt_dlp/extractor/twitch.py file from the repository, and find the yt_dlp.extractor.twitch.TwitchBaseIE._download_base_gql function.

headers is a dictionary that specifies which HTTP headers yt-dlp should use when making certain GQL requests. Insert into it the Device ID you obtained under the X-Device-Id key, and the Client-Integrity token under the Client-Integrity key. It should look like this:

    def _download_base_gql(self, video_id, ops, note, fatal=True):
        headers = {
            'Content-Type': 'text/plain;charset=UTF-8',
            'Client-ID': self._CLIENT_ID,
        }
        gql_auth = self._get_cookies('https://gql.twitch.tv').get('auth-token')
        if gql_auth:
            headers['Authorization'] = 'OAuth ' + gql_auth.value
        # else:
            # headers['Authorization'] = 'undefined'

        headers["X-Device-Id"] = "your Device ID goes here"
        headers["Client-Integrity"] = "your Client-Integrity token goes here"

        return self._download_json(
            'https://gql.twitch.tv/gql', video_id, note,
            data=json.dumps(ops).encode(),
            headers=headers, fatal=fatal)

There shouldn't be any other changes needed to this function. Please don't copy and paste the above snippet verbatim, as I don't know whether the function has changed since I last pulled; take it only as an illustration.

When you are done, try running yt-dlp for a VOD where you previously couldn't obtain all messages. Be sure to actually run yt-dlp from the source code you just edited, not the installed version, and run it with a temporary output directory, so as not to contaminate your usual one with bad files, or even overwrite good ones. If it looks good, you should be OK to run it with your usual output directory too. If you use archive files (--download-archive), don't forget to remove the IDs of the failed VODs from them, so that yt-dlp does not skip those VODs when retrying.
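
One way to "run it from the edited source with a temporary output directory" is to use the checkout as a library from the repository root. This is only a sketch; the option names below are standard YoutubeDL options and the output path is an arbitrary example:

    # Sketch: exercise the edited checkout as a library, writing results to a
    # throwaway directory. Run this from the repository root so the patched
    # extractor is imported instead of an installed yt-dlp.
    import yt_dlp

    opts = {
        'skip_download': True,
        'writesubtitles': True,
        'subtitleslangs': ['rechat'],               # or 'live_chat', depending on the branch/commit
        'paths': {'home': '/tmp/twitch-chat-test'}, # assumed throwaway output directory
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download(['https://www.twitch.tv/videos/1670416229'])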

pukkandan commented 1 year ago

We should be able to easily make our extractor read this when the user passes --cookies-from-browser. Fixing it when cookies are not passed may be harder.

mpeter50 commented 1 year ago

A month or so ago I started throwing together utility functions to generate all of the necessary IDs from scratch, but I got tangled up in it, and more urgent things came up.

As a half measure I tried to add extractor-specific options for the IDs so that the user can provide them without messing with the code, but it lowercased the case-sensitive values and so it didn't work. Then I think I saw somewhere that it is possible to get extractor options with the case preserved, but I don't remember where. Maybe it was on the matrix/discord yt-dlp coding help channel, but that is a literal information black hole..

bashonly commented 1 year ago

extractor options with the cases being preserved

_configuration_arg() has a casesense param
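
For illustration, a sketch of what that could look like inside the extractor: values passed via --extractor-args can be read with their case preserved by passing casesense=True. The option names device_id and client_integrity below are made up for this example, not existing options, and this is not the actual extractor code:

    # Hypothetical sketch (not actual yt-dlp code): reading the IDs from
    # --extractor-args inside TwitchBaseIE._download_base_gql(), keeping the
    # token values' original case with casesense=True.
    def _download_base_gql(self, video_id, ops, note, fatal=True):
        headers = {
            'Content-Type': 'text/plain;charset=UTF-8',
            'Client-ID': self._CLIENT_ID,
        }
        device_id = self._configuration_arg('device_id', [None], casesense=True)[0]
        integrity = self._configuration_arg('client_integrity', [None], casesense=True)[0]
        if device_id:
            headers['X-Device-Id'] = device_id
        if integrity:
            headers['Client-Integrity'] = integrity
        return self._download_json(
            'https://gql.twitch.tv/gql', video_id, note,
            data=json.dumps(ops).encode(), headers=headers, fatal=fatal)

A user could then pass something like --extractor-args "twitch:device_id=...;client_integrity=..." (again, hypothetical option names).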

donnaken15 commented 1 year ago

any update on this? it's still happening for me

I tried mpeter's method and got nothing

edvordo commented 1 year ago

Got nothing from where? I've been using that patch for months now, reliably. Granted, having it in core would be nice :)

mpeter50 commented 1 year ago

@donnaken15 did you enable downloading the live chat? You need to add live_chat to the list of downloaded subtitle languages, even if you use all, like this: --sub-langs all,live_chat

If you did that already, did you get any error messages? Pasting the full output of a run with the --verbose arg could be helpful, but if you do that, please upload it as a gist and link the gist here instead of pasting the text directly. That way this issue discussion stays much more readable, I think.

If by my method you mean my July 10 comment, that's not enough in itself. You have to use the branch of #1551 (which is the twitchvod-livechat branch of my fork: https://github.com/mpeter50/yt-dlp). In its current state, you can set the 3 kinds of IDs with extractor arguments, or, if you use a config file, with that too.

If you want to use it on sub-only channels, you also have to make yt-dlp use the Authorization header, so that it "logs in" with your account. If you ask it to use the cookies from your browser, it will do that automatically. However, in this case make sure that you obtain the 3 IDs from the same browser tab where your account is logged in; that is, don't use a container tab for that on Firefox. That's because the 3 IDs are tied to your account and only work together with your account's Authorization header.

mpeter50 commented 1 year ago

@edvordo maybe you already do, but please always check whether all messages were obtained correctly.

I have run into it multiple times that, for one reason or another, not all messages were downloaded, only the first page, and sometimes I only realized that after the VODs were gone from Twitch. The fact that you have to supply fresh IDs every day makes this very fragile. Recently I haven't been able to work on auto-generating the IDs, but thinking about it, maybe even just aborting the procedure when I receive that error would be a better idea than doing nothing..

donnaken15 commented 1 year ago

@donnaken15 did you enable downloading the live chat? You need to add live_chat to the list of downloaded subtitle languages, even if you use all, like this: --sub-langs all,live_chat

the only "subs" that come up is "rechat" just tried this yt-dlp --sub-langs all,live_chat --write-sub --skip-download https://www.twitch.tv/videos/... --verbose https://gist.github.com/donnaken15/31af64223eccfed160c287eddf13c782

patched code:

    def _download_base_gql(self, video_id, ops, note, fatal=True):
        headers = {
            'Content-Type': 'text/plain;charset=UTF-8',
            'Client-ID': self._CLIENT_ID,
        }
        gql_auth = self._get_cookies('https://gql.twitch.tv').get('auth-token')
        if gql_auth:
            headers['Authorization'] = 'OAuth ' + gql_auth.value
        headers["Device-ID"] = "obtained from browser"
        headers["Client-Integrity"] = "obtained from browser"
        return self._download_json(
            'https://gql.twitch.tv/gql', video_id, note,
            data=json.dumps(ops).encode(),
            headers=headers, fatal=fatal)

I realized I accidentally didn't put "X-" in front of Device-ID, but it still fails.

mpeter50 commented 1 year ago

Oh, yes, I see. In one of the last commits I renamed live_chat to rechat, as other yt-dlp code expects that instead. You should now use rechat.

But that is not the root of the problem. The root cause is that I moved the chat download into a function that runs later, but, as it turns out, the data obtained by that function was not taken into account by the code that is supposed to save the chat and other subtitle-like data.

With that in mind it is surprising that it works for edvordo, but maybe they are just using an earlier commit.

I have uploaded a fix to #1551. Could you please check whether it works for you now? I'll soon upload further changes as well, so it would be helpful if you also told me which commit you tried.

edvordo commented 1 year ago

Well, technically, I'm not using your fork. I've cloned the main yt-dlp repo, applied the changes as they were two months ago to the relevant files, and have been running from source, semi-regularly pulling changes from main (no conflicts yet).

Also, yes, I've just checked all 46 downloaded chat histories; all of them have all messages from the given streams from the past 2 months.

donnaken15 commented 1 year ago

I have uploaded a fix to #1551. Could you please check if it works for you now? Soon I'll also upload further changes, so it would be helpful if you would also tell which commit did you try.

I tried the branch you linked in the PR and it worked, without my identity cookies

pato-pan commented 1 month ago

Could a feature request be made as a temporary workaround? Simply still download the video even when --sub-langs rechat or --embed-subs is specified.