Basic information
Describe the bug
If a chat message in a VOD contains a non-ASCII character (for example, any 2-byte UTF-8 symbol), the emotes[].name field of the message JSON produced by the library is parsed incorrectly.
Command/Code used
chat_downloader --start_time 05:58:28 --end_time 05:58:30 --output test.jsonl --testing 'https://www.twitch.tv/videos/2184933543'
Output (-v):
(I've patched the library with temporary debugging prints to show the raw GQL content passed to the message mapper, chat_downloader.sites.twitch.TwitchChatDownloader._parse_message_info().)
Actual content of test.jsonl (prettified)
Expected content of test.jsonl (prettified)
The name field of the emote should be filled:
Additional context/information
Twitch GQL gives the start and end of each emote code inside the chat text as byte positions, so for messages with non-ASCII characters the locations should be applied to the byte form (the UTF-8 encoding) of the Python string rather than to the string itself.
The fix is straightforward: instead of
https://github.com/xenova/chat-downloader/blob/94ed3fe9dd2af8f193ea5b25adc7509a8cbb0e63/chat_downloader/sites/twitch.py#L258
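To illustrate the byte-offset idea, here is a minimal sketch, not the library's actual code. It assumes each emote location arrives as a pair of byte offsets into the UTF-8 encoding of the message and that the end offset is inclusive; the function name and its arguments are hypothetical.

```python
def emote_name_from_bytes(text: str, begin: int, end: int) -> str:
    """Return the emote code at byte offsets [begin, end] of the UTF-8 text.

    Assumption: `end` is inclusive and both offsets fall on character
    boundaries, otherwise decode() raises UnicodeDecodeError.
    """
    raw = text.encode("utf-8")  # offsets index bytes, not characters
    return raw[begin:end + 1].decode("utf-8")


message = "привет Kappa"  # 6 Cyrillic letters = 12 bytes, then a space

# 'Kappa' occupies bytes 13..17 of the encoded message.
print(emote_name_from_bytes(message, 13, 17))  # Kappa

# Slicing the str with the same offsets counts characters instead of bytes;
# the message is only 12 characters long, so the result is empty.
print(repr(message[13:18]))  # ''
```

For pure-ASCII messages both approaches give the same result, which is probably why the problem only shows up when a non-ASCII character precedes the emote.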