Open zizoumgs opened 5 years ago
any news from this side?
Can confirm that this still doesn't work (and that thumbnails can't be downloaded, either).
Yep, just re-checked it. They now use their own JS-based captioning engine. The captions can be found in the HTML source, inside a <code> tag. What I ended up doing is parsing the captions with Beautiful Soup and building my .SRT file from them.
@psychonaute If you have time, please clarify how to get the .SRT. Thank you.
@nellepn This is from code I wrote a while back (~6 months ago) but it's still working at the time of this writing. You'll need the auth cookie, the same one you use to run youtube-dl, exported in Netscape format. Alternatively, that's the default export format of this Chrome extension: EditThisCookie
import requests
from bs4 import BeautifulSoup
from http.cookiejar import MozillaCookieJar
import datetime
import json
import re


def get_srt_file(vid_url, sub_filename, cookie_file):
    # load the Netscape-format cookie file exported from the browser
    jar = MozillaCookieJar(cookie_file)
    jar.load()

    page = requests.get(vid_url, cookies=jar)
    soup = BeautifulSoup(page.content, 'html.parser')

    # the transcript is embedded as JSON inside a <code> tag
    res = soup.find_all("code", text=re.compile(r'transcriptStartAt'))
    if res:
        data = json.loads(res[0].text)
    else:
        return

    transcript = [t for t in data['included'] if t.get('lines')]
    transcript_ord = sorted(transcript[0]['lines'], key=lambda k: k['transcriptStartAt'])

    with open(sub_filename, 'w', encoding='utf-8') as srt_file:
        for i, entry in enumerate(transcript_ord):
            timing = [str(datetime.timedelta(milliseconds=entry['transcriptStartAt']))]
            # no "end time" in their transcript, determining one from the next "start time"
            if i + 1 < len(transcript_ord):
                end_time = str(datetime.timedelta(milliseconds=transcript_ord[i + 1]['transcriptStartAt']))
            else:
                # if last one, end at +5s
                end_time = str(datetime.timedelta(milliseconds=transcript_ord[i]['transcriptStartAt'] + 5000))
            timing.append(end_time)

            # truncate microseconds to milliseconds, or pad when missing
            for j, t in enumerate(timing):
                try:
                    if len(t.split('.')[1]) > 3:
                        timing[j] = t[:-3]
                except IndexError:
                    timing[j] = t + ".000"

            srt_file.write(f"{i+1}\n")
            srt_file.write(f"{timing[0]} --> {timing[1]}\n")
            srt_file.write(f"{entry['caption'].strip()}\n\n")


get_srt_file(
    "https://www.linkedin.com/learning/course-path/path-to-video",
    "mysub.srt",
    r"C:\path\to\my\cookie.txt",  # raw string, otherwise "\t" is read as a tab
)
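One caveat with the snippet above: str(datetime.timedelta(...)) produces timestamps like 0:00:01.500000, while the SRT format expects zero-padded hours and a comma as the millisecond separator (00:00:01,500). If players reject the output, a small helper can format the millisecond offsets directly; this is my own sketch (srt_timestamp is not from the code above):

```python
import datetime


def srt_timestamp(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    td = datetime.timedelta(milliseconds=ms)
    total_seconds = int(td.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    millis = td.microseconds // 1000
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


print(srt_timestamp(1500))     # 00:00:01,500
print(srt_timestamp(3661250))  # 01:01:01,250
```

Using this in place of the str()/truncation dance would replace both the timedelta conversion and the microsecond-trimming loop.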
If all this still sounds too obscure, you can first follow this course: https://www.linkedin.com/learning/web-scraping-with-python 🚀
@psychonaute Will do my best :) Thanks for pointing in the right direction. Cheers!
@psychonaute So... THIS IS IT. For now, I will miss lynda.com, where I could get a whole course with all its .srt files onto my hard drive with a single command. The current situation with LinkedIn is that I get all the videos offline, but the .srt files are missing. Here is what I use. I left the "--all-subs" parameter in, although currently nothing is downloaded with it. It used to work on lynda.com, which can still be verified today with the free lessons there (or with entire courses, for those who have an account).
(I use GNU Linux)
youtube-dl -o "/home/john/course/%(chapter)s/%(autonumber)02d. %(title)s.mp4" -f "progressive-540p" --cookies "cookies.txt" --all-subs https://www.linkedin.com/learning/web-scraping-with-python/
To see which formats are available:
youtube-dl -F --cookies "cookies.txt" https://www.linkedin.com/learning/web-scraping-with-python/how-to-learn-to-stop-worrying-and-love-the-bot/
For cookies.txt I use the Firefox extension cookies.txt.
Hello all, I'm having the same problem:
[linkedin:learning:course] Downloading JSON metadata
[download] Downloading playlist: Advanced Selenium: Support Classes
[linkedin:learning:course] playlist Advanced Selenium: Support Classes: Collected 20 video ids (downloading 20 of them)
[download] Downloading video 1 of 20
[linkedin:learning] the-best-kept-secret-in-webdriver: Downloading 360p JSON metadata
[linkedin:learning] the-best-kept-secret-in-webdriver: Downloading 540p JSON metadata
[linkedin:learning] the-best-kept-secret-in-webdriver: Downloading 720p JSON metadata
[linkedin:learning] the-best-kept-secret-in-webdriver: Downloading m3u8 information
2241373 has no subtitles
Checklist
Verbose log
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-u', 'PRIVATE', '-p', 'PRIVATE', '-U', 'https://www.linkedin.com/learning/electronics-foundations-basic-circuits', '--no-check-certificate', '--ffmpeg-location', 'C:\Users\xxxx\Downloads\ffmpeg-20171022-72c3d9a-win64-static\ffmpeg-20171022-72c3d9a-win64-static\bin', '--write-sub', '--list-subs', '-r', '500k', '-v']
[debug] Encodings: locale cp1256, fs mbcs, out cp720, pref cp1256
[debug] youtube-dl version 2019.07.16
[debug] Python version 3.4.4 (CPython) - Windows-10-10.0.17134
[debug] exe versions: ffmpeg N-88042-g72c3d9ae45, ffprobe N-88042-g72c3d9ae45
[debug] Proxy map: {}
youtube-dl is up-to-date (2019.07.16)
Description
I cannot download subtitles from LinkedIn; it says the video "has no subtitles". In reality, the video does have subtitles. This happens with every LinkedIn course. As additional information: Lynda has now become LinkedIn, and I could still download subtitles from Lynda a couple of months ago. I have an active subscription and I can view and download the videos, but not the subtitles.