JakubSkowron opened 1 year ago
It looks like Twitch uses GraphQL to fetch rechat now.
For devs: here's an example POST request body (pretty-printed):
[
{
"operationName": "VideoCommentsByOffsetOrCursor",
"variables": {
"videoID": "1670416229",
"contentOffsetSeconds": 0
},
"extensions": {
"persistedQuery": {
"version": 1,
"sha256Hash": "b70a3591ff0f4e0313d126c6a1502d79a1c02baebb288227c582044aa76adf6a"
}
}
}
]
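For illustration, here is a small sketch of how such a persisted-query payload could be built programmatically. The endpoint, operation name, hash, and `videoID` are taken from the request above; the `cursor` variable for pagination is an assumption based on the operation name "VideoCommentsByOffsetOrCursor".

```python
import json

# Sketch of building the persisted-query payload shown above.
# The sha256Hash identifies a query stored server-side, so no GraphQL
# query text is sent; only the variables change between requests.
def build_rechat_request(video_id, offset_seconds=0, cursor=None):
    variables = {'videoID': video_id}
    if cursor is not None:
        # Assumption: pagination uses a "cursor" variable, mirroring the
        # "OffsetOrCursor" part of the operation name.
        variables['cursor'] = cursor
    else:
        variables['contentOffsetSeconds'] = offset_seconds
    return json.dumps([{
        'operationName': 'VideoCommentsByOffsetOrCursor',
        'variables': variables,
        'extensions': {'persistedQuery': {
            'version': 1,
            'sha256Hash': 'b70a3591ff0f4e0313d126c6a1502d79a1c02baebb288227c582044aa76adf6a',
        }},
    }])

body = build_rechat_request('1670416229')
print(body)
```

The request body is a JSON array, so several operations can be batched into one POST.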
Hm.. that means our HTTP downloader won't be able to handle it.
@JakubSkowron I must ask this: are you sure the way you used to download live chat always downloaded the full chat history?
@mpeter50 I downloaded mostly ones where there was not much going on in chat, short streams, but for one stream I've seen that messages only up to 360 seconds ("content_offset_seconds") were downloaded.
Yes, this is because the built-in way of downloading chat messages always only downloaded the first page of the chat history. I think there should be an ID-like thing at the end of the file, at least when the chat history would continue on further pages.
When I fix it, one will be able to use it regularly after cloning my repository and checking out the twitchvod-livechat branch.
*AFAIK moderated messages (including moderator-chatbot deletions), and messages recently sent by those who have been temporarily banned (even if only for a few seconds/minutes), are not available, as these messages are all deleted.
@JakubSkowron I've uploaded the fix for Twitch chat extraction.
For now it won't be merged into the normal version of yt-dlp, because heavy structural changes would have to be made for that, but development is tracked in #1551.
If you want to try it out, or use it in the meantime, you can clone my fork, switch to the twitchvod-livechat branch, and run yt-dlp from there. You will also need to add the --sub-langs all,live_chat argument to the yt-dlp command.
Keep in mind that (AFAIK) all VODs and their chat are only available for 2 months from publication. I wouldn't count on a merge within that amount of time, even counting from now.
For anyone stumbling into this issue, the fork of mpeter50 worked for me, but I had to change line 553 from
chat_history.extend(traverse_obj(comments_obj, ('edges', slice, 'node')))
to
chat_history.extend(traverse_obj(comments_obj, ('edges', ..., 'node')))
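For anyone wondering what that fix does: in yt-dlp's traverse_obj, `...` branches over every element of a list, so the fixed line collects the "node" of each edge on the comments page. A plain-Python equivalent of the traversal (with a made-up example object) looks like this:

```python
# Plain-Python equivalent of the fixed traversal
# traverse_obj(comments_obj, ('edges', ..., 'node')):
# for each item under "edges", take its "node".
# The comments_obj below is a made-up example of the page shape.
comments_obj = {
    'edges': [
        {'cursor': 'a', 'node': {'id': '1', 'message': 'hi'}},
        {'cursor': 'b', 'node': {'id': '2', 'message': 'hello'}},
    ],
}
nodes = [edge['node'] for edge in comments_obj.get('edges', [])]
print(nodes)  # one dict per comment on the page
```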
Oh, I thought I pushed that change already. But maybe I remember something else that was similar.
Be sure to check that you did not get only the first page (~50 items) of comments. A common indicator is files that are too small for long streams of popular streamers. Right now there are problems with anti-bot protection at Twitch, which for some reason only prevents downloading chat pages after the first one.
If you find that you did not get all comments, there is a workaround. Let me know and I'll try to summarize it. When I get back to this, I want to fix it normally.
Seems like I got all of the almost 300 messages from the VOD I downloaded.
I do plan on downloading more, so just in case I stumble into it, could you describe the workaround anyway, please?
Open a new tab in your web browser, open the developer tools and its network tab for it, and then load Twitch. The order is important because this way the devtools won't miss traffic from the beginning.
You will need 2 or 3 things from here:
Device ID: this is stored in the unique_id cookie and is set in the first request (which is the document itself).
Although yt-dlp can extract this from your browser if you give it the --cookies-from-browser firefox argument, (for now) it won't know how it should be used: it would attach it to every request as a cookie, but we need something else.
Last I checked this expires in a year, not counting from loading the page, but instead from when it was generated.
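If you'd rather grab it from a copied Cookie header than click through the devtools cookie view, pulling out the unique_id value is trivial. The header value below is made up; only the unique_id name comes from the thread.

```python
# Illustration only: extract the unique_id (Device ID) value from a
# Cookie request header copied out of the devtools network tab.
# The raw header below is a made-up example.
raw = 'unique_id=abcdef0123456789; server_session_id=xyz'
pairs = dict(part.strip().split('=', 1) for part in raw.split(';'))
device_id = pairs['unique_id']
print(device_id)
```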
Client-Integrity: you can find this either in the token field of the JSON response of the latest https://gql.twitch.tv/integrity request, or in requests which use this token in the Client-Integrity header.
Last I checked this lasts a day, so you may have to re-obtain it regularly.
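Picking the token out of the /integrity response is just a JSON lookup. The response body below is a made-up example; that the token field is named "token" comes from the comment above, while the exact token format and the "expiration" field are assumptions.

```python
import json

# Illustration: extracting the Client-Integrity token from the JSON body
# of a https://gql.twitch.tv/integrity response copied from devtools.
# The body below is a made-up example of the shape described above.
response_body = '{"token": "v4.public.eyJ-example-token", "expiration": 1700000000000}'
client_integrity = json.loads(response_body)['token']
print(client_integrity)
```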
Open the yt_dlp/extractor/twitch.py file from the repository, and find the yt_dlp.extractor.twitch.TwitchBaseIE._download_base_gql function.
The headers variable is a dictionary/map containing the HTTP headers yt-dlp should use when making certain GQL requests.
Insert into it the Device ID you obtained under the X-Device-Id key, and the Client-Integrity token under the Client-Integrity key.
It should look like this:
def _download_base_gql(self, video_id, ops, note, fatal=True):
    headers = {
        'Content-Type': 'text/plain;charset=UTF-8',
        'Client-ID': self._CLIENT_ID,
    }
    gql_auth = self._get_cookies('https://gql.twitch.tv').get('auth-token')
    if gql_auth:
        headers['Authorization'] = 'OAuth ' + gql_auth.value
    # else:
    #     headers['Authorization'] = 'undefined'
    headers['X-Device-Id'] = 'your Device ID goes here'
    headers['Client-Integrity'] = 'your Client-Integrity token goes here'
    return self._download_json(
        'https://gql.twitch.tv/gql', video_id, note,
        data=json.dumps(ops).encode(),
        headers=headers, fatal=fatal)
There shouldn't be any more changes needed to this function. Please don't copy and paste the above snippet, as I don't know whether this function has changed since I last pulled; take it only as an illustration.
When you are done, try running yt-dlp for a VOD where you previously couldn't obtain all messages.
Be sure to actually run yt-dlp from the source code you just edited, not the installed version, and run it with a temporary output directory, so as to not contaminate your usual one with bad files, maybe even overwriting good ones.
If it looks good, you should be OK to run it with the usual output directory too. If you use archive files (--download-archive), don't forget to remove the IDs of the failed VODs from them, so that yt-dlp does not skip them when retrying.
We should be able to easily make our extractor read this when the user passes --cookies-from-browser. Fixing it when cookies are not passed may be harder.
A month or so ago I started throwing together utility functions to generate all of the necessary IDs from scratch, but I got tangled up in it and also more urgent things have appeared.
As a half measure I tried to add extractor-specific options for the IDs so that the user can provide them without messing with the code, but it lowercased the case-sensitive keys and so it didn't work. Then I think I saw somewhere that it is possible to get extractor options with the cases being preserved, but I don't remember where. Maybe it was on the matrix/discord yt-dlp coding help channel, but that is a literal information black hole..
extractor options with the cases being preserved
_configuration_arg() has a casesense param
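To see why that param matters here, a simplified model of the behavior being discussed (this is a sketch, not yt-dlp's actual implementation; the option name client_integrity is made up): without case-sensitive lookup the returned values get lowercased, which corrupts a case-sensitive token.

```python
# Simplified model of a case-insensitive-by-default option lookup:
# without casesense, returned values are lowercased, which breaks
# case-sensitive values like a Client-Integrity token.
# (Sketch only; yt-dlp's real _configuration_arg differs in details.)
def configuration_arg(extractor_args, key, casesense=False):
    val = extractor_args.get(key)
    if val is None:
        return []
    return list(val) if casesense else [x.lower() for x in val]

args = {'client_integrity': ['v4.PUBLIC.AbCdEf']}
print(configuration_arg(args, 'client_integrity'))                   # lowercased, token broken
print(configuration_arg(args, 'client_integrity', casesense=True))   # case preserved
```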
any update on this? it's still happening for me
I tried mpeter's method and got nothing
Got nothing from where? I've been using that patch for months now, reliably. Granted, having it in core would be nice :)
@donnaken15 did you enable downloading the live chat? You need to add live_chat to the list of downloaded subtitle languages, even if you use all, like this: --sub-langs all,live_chat
If you did that already, did you get any error messages?
If you paste the full output of a run with the --verbose arg, that could be helpful, but if you do, please upload it as a gist and link the gist here instead of pasting the text directly. That way this issue discussion stays much more readable, I think.
If by my method you mean my July 10 comment, that's not enough in itself. You have to use the branch of #1551 (which is the twitchvod-livechat branch of my fork: https://github.com/mpeter50/yt-dlp). With the current state of it, you can set the 3 kinds of IDs with extractor arguments, or if you use a config file, with that too.
If you want to use it on sub-only channels, you also have to make yt-dlp use the Authorization header, so that it "logs in" with your account. If you ask it to use the cookies from your browser, it will do that automatically.
However, in this case make sure that you obtain the 3 IDs from the same browser tab where your account is logged in. That is, don't use a container tab for that on Firefox. That's because the 3 IDs are tied to your account, and only work with the Authorization header of your account.
@edvordo maybe you already do, but please always check whether all messages were obtained correctly.
I have run into it multiple times that, for one reason or another, not all messages were downloaded, only the first page, and sometimes I only realized that after the VODs were gone from Twitch. The fact that you have to supply fresh IDs every day makes this very fragile. Recently I wasn't able to work on auto-generating the IDs, but thinking about it, maybe even just aborting the procedure when I receive that error would be better than doing nothing..
@donnaken15 did you enable downloading the live chat? You need to add live_chat to the list of downloaded subtitle languages, even if you use all, like this: --sub-langs all,live_chat
the only "subs" that come up is "rechat"
just tried this
yt-dlp --sub-langs all,live_chat --write-sub --skip-download https://www.twitch.tv/videos/... --verbose
https://gist.github.com/donnaken15/31af64223eccfed160c287eddf13c782
patched code:
def _download_base_gql(self, video_id, ops, note, fatal=True):
    headers = {
        'Content-Type': 'text/plain;charset=UTF-8',
        'Client-ID': self._CLIENT_ID,
    }
    gql_auth = self._get_cookies('https://gql.twitch.tv').get('auth-token')
    if gql_auth:
        headers['Authorization'] = 'OAuth ' + gql_auth.value
    headers["Device-ID"] = "obtained from browser"
    headers["Client-Integrity"] = "obtained from browser"
    return self._download_json(
        'https://gql.twitch.tv/gql', video_id, note,
        data=json.dumps(ops).encode(),
        headers=headers, fatal=fatal)
realized I accidentally didn't put "X-" in front of Device-ID, but it still fails
Oh, yes, I see. In one of the last commits I have renamed live_chat to rechat, as other yt-dlp code expects that instead. You should actually use rechat now.
But that is not the root of the problem. It is that I moved downloading chat to a function that should run later, but as it turns out, data obtained by that function was not taken into account by the code that is supposed to save the chat and other subtitle-like data.
With that in mind it is surprising that it works for edvordo, but maybe they are just using an earlier commit.
I have uploaded a fix to #1551. Could you please check if it works for you now? Soon I'll also upload further changes, so it would be helpful if you would also tell which commit did you try.
Well, technically, I'm not using your fork. I've cloned the main yt-dlp repo, applied the changes as they were two months ago to the relevant files, and have been using it from source, semi-regularly pulling changes from main (no conflicts yet).
Also, yes, I've just checked all 46 downloaded chat histories, and all of them have all messages from the given streams from the past 2 months.
I have uploaded a fix to #1551. Could you please check if it works for you now? Soon I'll also upload further changes, so it would be helpful if you would also tell which commit did you try.
I tried the branch you linked in the PR and it worked, without my identity cookies
Could a feature request be made as a temporary workaround? Simply still download the video even though --sub-langs rechat or --embed-subs is specified.
Region
Luxembourg/Europe
Provide a description that is worded well enough to be understood
Cannot download live chat replay (comments) from any recorded live stream on Twitch.
Example video:
$ yt-dlp -vU --skip-download --write-subs --sub-langs rechat https://www.twitch.tv/videos/1670416229
We get:
yt_dlp.utils.DownloadError: Unable to download video subtitles for 'rechat': HTTP Error 410: Gone
It is probably a problem with client_id not being allowed by the Twitch API any more, because in the output we see:
[debug] Invoking http downloader on "https://api.twitch.tv/v5/videos/1670416229/comments?client_id=kimne78kx3ncx6brgo4mv6wki5h1ko"
When we go to the URL with a normal web browser, we get Message:
I am using rechat, because it is on the list of available subtitles: