ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense
132.74k stars 10.07k forks

Subtitle does not work for Linkedin. #21879

Open zizoumgs opened 5 years ago

zizoumgs commented 5 years ago

Checklist

Verbose log

[debug] System config: [] [debug] User config: [] [debug] Custom config: [] [debug] Command-line args: ['-u', 'PRIVATE', '-p', 'PRIVATE', '-U', 'https://www.linkedin.com/learning/electronics-foundations-basic-circuits', '--no-check-certificate', '--ffmpeg-location', 'C:\Users\xxxx\Downloads\ffmpeg-20171022-72c3d9a-win64-static\ffmpeg-20171022-72c3d9a-win64-static\bin', '--write-sub', '--list-subs', '-r', '500k', '-v'] [debug] Encodings: locale cp1256, fs mbcs, out cp720, pref cp1256 [debug] youtube-dl version 2019.07.16 [debug] Python version 3.4.4 (CPython) - Windows-10-10.0.17134 [debug] exe versions: ffmpeg N-88042-g72c3d9ae45, ffprobe N-88042-g72c3d9ae45 [debug] Proxy map: {} youtube-dl is up-to-date (2019.07.16)

Description

I cannot download subtitles from LinkedIn; it says the video "has no subtitles". In reality, the video does have subtitles. This happens in every LinkedIn course. As additional information: Lynda has now become LinkedIn Learning, and I could still download subtitles from Lynda a couple of months ago. I have an active subscription and can view and download the videos, but not the subtitles.

sekmo commented 5 years ago

Any news on this?

ICEknigh7 commented 3 years ago

Can confirm that this still doesn't work (and that thumbnails can't be downloaded, either).

codekoriko commented 3 years ago

Can confirm that this still doesn't work (and that thumbnails can't be downloaded, either).

Yep, just re-checked it. They now use their own JS-based captioning engine. The captions can be found in the HTML, under a <code> tag. What I ended up doing is parsing the captions with Beautiful Soup and building my .SRT file from them.

nellepn commented 3 years ago

@psychonaute If you have time, please clarify how to get .SRT Thank you.

codekoriko commented 3 years ago

@nellepn This is from code I wrote a while back (~6 months ago), but it still works at the time of this writing. You'll need the auth cookie in Netscape format, the same one you use to run youtube-dl. Alternatively, that's the default export format of this Chrome extension: EditThisCookie.

import requests
from bs4 import BeautifulSoup
from http.cookiejar import MozillaCookieJar
import datetime
import json
import re

def get_srt_file(vid_url, sub_filename, cookie_file):
    jar = MozillaCookieJar(cookie_file)
    jar.load()
    page = requests.get(vid_url, cookies=jar)
    soup = BeautifulSoup(page.content, 'html.parser')
    res = soup.find_all("code", text=re.compile(r'transcriptStartAt'))
    if res:
        data = json.loads(res[0].text)
    else:
        return

    transcript = [t for t in data['included'] if t.get('lines')]
    transcript_ord = sorted(transcript[0]['lines'], key=lambda k: k['transcriptStartAt'])

    with open(sub_filename, 'w', encoding='utf-8') as srt_file:
        for i, entry in enumerate(transcript_ord):
            timing = [ str(datetime.timedelta(milliseconds=entry['transcriptStartAt'])) ]
            # no "end time" in their transcript, determining one from next "start time"
            if i+1 < len(transcript_ord):
                end_time = str(datetime.timedelta(milliseconds=transcript_ord[i+1]['transcriptStartAt']))
            else:
                # if last one, end at +5s
                end_time = str(datetime.timedelta(milliseconds=transcript_ord[i]['transcriptStartAt']+5000))
            timing.append(end_time)
            # truncate microseconds down to milliseconds
            for j, t in enumerate(timing):
                try:
                    if len(t.split('.')[1]) > 3:
                        timing[j] = t[:-3]
                except IndexError:
                    timing[j] = t + ".000"
            # SRT timestamps use a comma before the milliseconds
            timing = [t.replace('.', ',') for t in timing]

            srt_file.write(f"{i+1}\n")
            srt_file.write(f"{timing[0]} --> {timing[1]}\n")
            srt_file.write(f"{entry['caption'].strip()}\n\n")

get_srt_file(
    "https://www.linkedin.com/learning/course-path/path-to-video",
    "mysub.srt",
    r"C:\path\to\my\cookie.txt",  # raw string: backslashes in Windows paths must not be escapes
)

If all this still sounds too obscure, you can first follow this course: https://www.linkedin.com/learning/web-scraping-with-python 🚀
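Side note: the microsecond-truncation step in the script above can be sidestepped by formatting SRT timestamps directly from the millisecond values. A minimal sketch, assuming millisecond inputs like LinkedIn's transcriptStartAt field (the helper name ms_to_srt_time is mine, not from the script):

```python
def ms_to_srt_time(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp: HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

# e.g. ms_to_srt_time(83456) -> "00:01:23,456"
```

This also zero-pads the hour field and uses the comma separator, both of which strict SRT parsers expect.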

nellepn commented 3 years ago

@psychonaute Will do my best :) Thanks for pointing in the right direction. Cheers!

nellepn commented 3 years ago

@psychonaute So... THIS IS IT. For now, I will miss lynda.com, where a single command got me the whole course with all the SRT files on my hard drive. The current situation with LinkedIn is that I get all the videos offline, but the SRT files are missing. Here is what I use. I left the "--all-subs" parameter in, but currently it downloads nothing. It used to work on lynda.com, which you can still verify today with the free lessons there if you have no account, or with entire courses if you do.

(I use GNU Linux) youtube-dl -o "/home/john/course/%(chapter)s/%(autonumber)02d. %(title)s.mp4" -f "progressive-540p" --cookies "cookies.txt" --all-subs https://www.linkedin.com/learning/web-scraping-with-python/

To see which formats are available: youtube-dl -F --cookies "cookies.txt" https://www.linkedin.com/learning/web-scraping-with-python/how-to-learn-to-stop-worrying-and-love-the-bot/

For cookies.txt I use the Firefox extension cookies.txt.

zakna commented 3 years ago

Hello all, I'm having the same problem:

[linkedin:learning:course] Downloading JSON metadata [download] Downloading playlist: Advanced Selenium: Support Classes [linkedin:learning:course] playlist Advanced Selenium: Support Classes: Collected 20 video ids (downloading 20 of them) [download] Downloading video 1 of 20 [linkedin:learning] the-best-kept-secret-in-webdriver: Downloading 360p JSON metadata [linkedin:learning] the-best-kept-secret-in-webdriver: Downloading 540p JSON metadata [linkedin:learning] the-best-kept-secret-in-webdriver: Downloading 720p JSON metadata [linkedin:learning] the-best-kept-secret-in-webdriver: Downloading m3u8 information 2241373 has no subtitles