ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

YouTube chat replay support #25874

Open Xalaxis opened 4 years ago

Xalaxis commented 4 years ago

Description

YouTube now has "chat replay" for recorded livestreams, in the same style as Twitch, whose chat replay youtube-dl already supports extracting as a "subtitle". It would be beneficial for youtube-dl to support the same extraction for YouTube because, as on Twitch, chat can form a very important part of a livestream. As far as I can see, youtube-dl has no existing support for this, nor any similar option.

There is a Python library at https://github.com/taizan-hokuto/pytchat which may be useful for the implementation of this. Amongst other formats, it supports output as JSON, which could simply be passed back as the output for a new "subtitle" - the same style as the Twitch chat replay.

Use case example: The archiving of a YouTube channel, including all metadata. At the moment the chat replay would not be saved, meaning there is no context for content in the video which may refer to it.

Xalaxis commented 4 years ago

@dstftw I read https://github.com/ytdl-org/youtube-dl#is-the-description-of-the-issue-itself-sufficient before writing the issue, and I believe the description meets those requirements. Please can you let me know what you would like me to amend?

EDIT: I have now made a few amendments, which might be what you are looking for.

dstftw commented 4 years ago

Provide concrete examples with concrete URLs. There are no telepathists here.

Xalaxis commented 4 years ago

Example: The YouTube video https://www.youtube.com/watch?v=h4M5iFLKWqU has a chat replay associated with it.

Using youtube-dl --write-sub https://www.youtube.com/watch?v=h4M5iFLKWqU I would like the subtitles to be written to <name of output>.chatreplay.json, with a structure that includes all of the available attributes. A list of those that pytchat has extracted, and that youtube-dl should therefore be able to use, is available here, including:

Xalaxis commented 4 years ago

Data appears to be provided in JSON format from the https://www.youtube.com/live_chat_replay/get_live_chat_replay endpoint.
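
A minimal sketch of how a request URL for this endpoint could be assembled. It assumes the query parameters `continuation`, `playerOffsetMs`, `hidden` and `pbj` that the proof-of-concept script later in this thread sends; the endpoint is undocumented and may change. `build_replay_url` is a hypothetical helper name, not part of youtube-dl.

```python
from urllib.parse import urlencode

REPLAY_ENDPOINT = 'https://www.youtube.com/live_chat_replay/get_live_chat_replay'


def build_replay_url(continuation_id, offset_ms):
    """Hypothetical helper: build the URL for one page of chat replay data."""
    query = urlencode({
        'continuation': continuation_id,   # opaque token taken from the previous page
        'playerOffsetMs': int(offset_ms),  # position in the video, in milliseconds
        'hidden': 'false',
        'pbj': 1,                          # assumed to request a JSON response
    })
    return '{}?{}'.format(REPLAY_ENDPOINT, query)


# Placeholder token; a real one has to be read from the watch page.
print(build_replay_url('EXAMPLE_CONTINUATION', 60000))
```

Using `urlencode` rather than string concatenation keeps the token correctly escaped if it contains characters such as `%` or `=`.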

JomSpoons commented 4 years ago

I would also really like the ability to download chat replays. Whenever I do YouTube streams I tend not to put any sort of chat on-screen because it takes up too much room, so I'd like to be able to download the replays in some form. Whether it be a simple text file or some sort of subtitle track like Xalaxis mentioned, I just want a way to preserve the chat replays to my streams.

siikamiika commented 4 years ago

I'm also interested in this, so I made a simple proof-of-concept script in Python that iterates over all regular messages in a video and prints them to stdout. You can test it with:

./script.py "<video_id>" > output # many rows of JSON objects

#!/usr/bin/env python3

import requests
import re
import json
import sys

session = requests.Session()  # one shared session reuses connections and cookies

def requests_get(url):
    return session.get(
        url,
        headers={
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0',
            'Accept-Encoding': 'gzip, deflate',
        },
    )

def debug(message, details=None):
    print(message, details, file=sys.stderr)

def parse_yt_initial_data(data):
    # Raw bytes literal so that \[ and \s reach the regex engine without invalid-escape warnings
    raw_json = re.search(rb'window\["ytInitialData"\]\s*=\s*(.*);', data).group(1)
    return json.loads(raw_json)

def get_continuation_id_initial(video_id):
    response = requests_get('https://www.youtube.com/watch?v={}'.format(video_id))
    data = parse_yt_initial_data(response.content)
    return data['contents']['twoColumnWatchNextResults']['conversationBar']['liveChatRenderer']['continuations'][0]['reloadContinuationData']['continuation']

def get_continuation_data_initial(continuation_id):
    response = requests_get('https://www.youtube.com/live_chat_replay?continuation={}'.format(continuation_id))
    return parse_yt_initial_data(response.content)

def get_continuation_data_next(continuation_id, offset):
    response = requests_get(
        'https://www.youtube.com/live_chat_replay/get_live_chat_replay'
        + '?continuation={}'.format(continuation_id)
        + '&playerOffsetMs={}'.format(offset)
        + '&hidden=false'
        + '&pbj=1'
    )
    return response.json()['response']

def iter_actions(video_id):
    continuation_id = get_continuation_id_initial(video_id)
    first = True
    offset = None
    while continuation_id is not None:
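        if first:
            data = get_continuation_data_initial(continuation_id)
        elif offset is None:
            # The previous page had no replay items, so there is no offset to resume from
            debug('No video offset found, exiting')
            break
        else:
            # Resume 5 seconds before the offset of the last message seen
            data = get_continuation_data_next(continuation_id, int(offset) - 5000)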
        first = False
        continuation_id = None

        live_chat_continuation = data['continuationContents']['liveChatContinuation']
        offset = None
        if 'actions' not in live_chat_continuation:
            # TODO either out of comments or no comments right now
            debug('Actions not found, exiting', live_chat_continuation)
            break
        for action in live_chat_continuation['actions']:
            if 'replayChatItemAction' in action:
                replay_chat_item_action = action['replayChatItemAction']
                offset = replay_chat_item_action['videoOffsetTimeMsec']
                for sub_action in replay_chat_item_action['actions']:
                    if 'addChatItemAction' in sub_action:
                        add_chat = sub_action['addChatItemAction']['item']
                        if 'liveChatTextMessageRenderer' in add_chat:
                            # {
                            #     'message': {'runs': [
                            #         {'text': '???'},
                            #         {'emoji': {'emojiId': '???', 'shortcuts': [':???:'], 'searchTerms': ['???'], 'image': {'thumbnails': [{'url': 'https://???.ggpht.com/???', 'width': 24, 'height': 24}, {'url': 'https://???.ggpht.com/???', 'width': 48, 'height': 48}], 'accessibility': {'accessibilityData': {'label': ':???:'}}}, 'isCustomEmoji': True}},
                            #         {'text': '???'}
                            #     ]},
                            #     'authorName': {'simpleText': '????'},
                            #     'authorPhoto': {'thumbnails': [{'url': 'https://???.ggpht.com/???/photo.jpg', 'width': 32, 'height': 32}, {'url': 'https://???.ggpht.com/???/photo.jpg', 'width': 64, 'height': 64}]},
                            #     'contextMenuEndpoint': {???},
                            #     'id': '???',
                            #     'timestampUsec': '1595943102558354',
                            #     'authorBadges': [{'liveChatAuthorBadgeRenderer': {'customThumbnail': {'thumbnails': [{'url': 'https://???.ggpht.com/???'}, {'url': 'https://???.ggpht.com/???'}]}, 'tooltip': '???', 'accessibility': {'accessibilityData': {'label': '???'}}}}],
                            #     'authorExternalChannelId': '???',
                            #     'contextMenuAccessibility': {???},
                            #     'timestampText': {'simpleText': '28.42'}
                            # }
                            yield {'liveChatTextMessageRenderer': add_chat['liveChatTextMessageRenderer']}
                        elif 'liveChatPaidMessageRenderer' in add_chat:
                            # {
                            #     'id': '???',
                            #     'timestampUsec': '1595941482934178',
                            #     'authorName': {'simpleText': '???'},
                            #     'authorPhoto': {'thumbnails': [{'url': 'https://???.ggpht.com/???/photo.jpg', 'width': 32, 'height': 32}, {'url': 'https://???.ggpht.com/???/photo.jpg', 'width': 64, 'height': 64}]},
                            #     'purchaseAmountText': {'simpleText': '200\xa0¥'},
                            #     'message': {'runs': [
                            #         {'text': '???'},
                            #         {'emoji': {'emojiId': '???', 'shortcuts': [':???:'], 'searchTerms': ['???'], 'image': {'thumbnails': [{'url': 'https://???.ggpht.com/???', 'width': 24, 'height': 24}, {'url': 'https://???.ggpht.com/???', 'width': 48, 'height': 48}], 'accessibility': {'accessibilityData': {'label': ':???:'}}}, 'isCustomEmoji': True}},
                            #         {'text': '???'}
                            #     ]},
                            #     'headerBackgroundColor': 4278237396,
                            #     'headerTextColor': 4278190080,
                            #     'bodyBackgroundColor': 4278248959,
                            #     'bodyTextColor': 4278190080,
                            #     'authorExternalChannelId': '???',
                            #     'authorNameTextColor': 3003121664,
                            #     'contextMenuEndpoint': {???},
                            #     'timestampColor': 2147483648,
                            #     'contextMenuAccessibility': {???},
                            #     'timestampText': {'simpleText': '1.58'}
                            # }
                            yield {'liveChatPaidMessageRenderer': add_chat['liveChatPaidMessageRenderer']}
                        elif 'liveChatMembershipItemRenderer' in add_chat:
                            # {
                            #     'id': '???',
                            #     'timestampUsec': '1595941068503043',
                            #     'timestampText': {'simpleText': '-4:50'},
                            #     'authorExternalChannelId': '???',
                            #     'headerSubtext': {'runs': [{'text': '???'}]},
                            #     'authorName': {'simpleText': '????'},
                            #     'authorPhoto': {'thumbnails': [{'url': 'https://???.ggpht.com/???/photo.jpg', 'width': 32, 'height': 32}, {'url': 'https://???.ggpht.com/???/photo.jpg', 'width': 64, 'height': 64}]},
                            #     'authorBadges': [{'liveChatAuthorBadgeRenderer': {'customThumbnail': {'thumbnails': [{'url': 'https://???.ggpht.com/???'}, {'url': 'https://???.ggpht.com/???'}]}, 'tooltip': '???', 'accessibility': {'accessibilityData': {'label': '???'}}}}],
                            #     'contextMenuEndpoint': {???},
                            #     'contextMenuAccessibility': {???}
                            # }
                            yield {'liveChatMembershipItemRenderer': add_chat['liveChatMembershipItemRenderer']}
                        # irrelevant
                        elif 'liveChatViewerEngagementMessageRenderer' in add_chat:
                            pass
                        elif 'liveChatPlaceholderItemRenderer' in add_chat:
                            pass
                        else:
                            debug('Unrecognized action item', add_chat)
                    # tickers out of scope for now
                    elif 'addLiveChatTickerItemAction' in sub_action:
                        pass
                    else:
                        debug('Unrecognized sub_action', sub_action)
            else:
                debug('Unrecognized action', action)

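        # The replay continuation may be absent (presumably at the end of the replay);
        # a missing key then ends the loop instead of raising KeyError
        continuations = live_chat_continuation.get('continuations') or [{}]
        continuation_id = continuations[0].get('liveChatReplayContinuationData', {}).get('continuation')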

for action in iter_actions(sys.argv[1]):
    print(json.dumps(action, ensure_ascii=False))

edit: updated code to handle superchat and membership messages
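
For readers who only want a plain-text log, here is a minimal sketch of post-processing the script's output. It assumes each row has the shape shown in the comments of the script above, and that `timestampUsec` is microseconds since the Unix epoch. `runs_to_text` and `format_line` are hypothetical helper names, not part of the script or of youtube-dl.

```python
import json
from datetime import datetime, timezone


def runs_to_text(message):
    """Join a message's 'runs' into one string; an emoji becomes its first shortcut."""
    parts = []
    for run in message.get('runs', []):
        if 'text' in run:
            parts.append(run['text'])
        elif 'emoji' in run:
            shortcuts = run['emoji'].get('shortcuts') or ['']
            parts.append(shortcuts[0])
    return ''.join(parts)


def format_line(row):
    """Format one row as '[offset] UTC time author: text', or None for other row types."""
    renderer = row.get('liveChatTextMessageRenderer')
    if renderer is None:
        return None
    usec = int(renderer['timestampUsec'])  # assumed: microseconds since the Unix epoch
    sent = datetime.fromtimestamp(usec // 10**6, tz=timezone.utc)
    return '[{}] {} {}: {}'.format(
        renderer['timestampText']['simpleText'],
        sent.strftime('%Y-%m-%d %H:%M:%S'),
        renderer['authorName']['simpleText'],
        runs_to_text(renderer.get('message', {})),
    )


# Hypothetical sample row; real rows carry more fields.
sample = json.loads(
    '{"liveChatTextMessageRenderer": {'
    '"message": {"runs": [{"text": "hello "}, {"emoji": {"shortcuts": [":wave:"]}}]},'
    '"authorName": {"simpleText": "viewer"},'
    '"timestampUsec": "1595943102558354",'
    '"timestampText": {"simpleText": "28:42"}}}'
)
print(format_line(sample))
```

To process a whole file, call `format_line(json.loads(line))` for each line of `output` and skip the `None` results.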

JomSpoons commented 4 years ago

Thank you so much for this, it works and it's a huge help to me. I really hope we can have something similar implemented in youtube-dl soon.

siikamiika commented 4 years ago

There's a PR open if anyone wants to test, and I made a converter that generates niconico-style rolling chat in the ASS/SSA subtitle format to be used offline: https://github.com/siikamiika/scripts/tree/master/danmaku
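
To illustrate the general idea of rolling chat in ASS/SSA, here is a minimal sketch that turns one message into a single `Dialogue` event scrolling from right to left. It is not the linked converter's implementation. The function names, the 1920-pixel width, the row height, the 8-second duration and the estimated text width are all assumptions chosen for the example. A complete file would also need the `[Script Info]`, `[V4+ Styles]` and `[Events]` sections, with a style named `Default`.

```python
def ass_time(ms):
    """Format milliseconds as an ASS timestamp H:MM:SS.cc (centiseconds)."""
    total_cs = max(0, int(ms)) // 10
    hours, rem = divmod(total_cs, 360000)
    minutes, rem = divmod(rem, 6000)
    seconds, centis = divmod(rem, 100)
    return '{}:{:02d}:{:02d}.{:02d}'.format(hours, minutes, seconds, centis)


def rolling_dialogue(offset_ms, text, row=0, width=1920, row_height=50,
                     duration_ms=8000, est_text_width=600):
    """Build one ASS Dialogue event that scrolls text from right to left."""
    # Braces open override blocks in ASS and events are single lines, so neutralise both.
    safe = text.replace('\n', ' ').replace('{', '(').replace('}', ')')
    y = row * row_height
    # \an7 anchors the text at its top-left corner; \move slides that anchor from
    # the right edge to a point left of the screen over the event's duration.
    tags = '{{\\an7\\move({},{},{},{})}}'.format(width, y, -est_text_width, y)
    return 'Dialogue: 0,{},{},Default,,0,0,0,,{}{}'.format(
        ass_time(offset_ms), ass_time(offset_ms + duration_ms), tags, safe)


# Offset in milliseconds, as in the videoOffsetTimeMsec field read by the script above.
print(rolling_dialogue(28420, 'hello chat', row=2))
```

A real converter would also have to pick rows so that overlapping messages do not collide and measure text width instead of using a fixed estimate.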