ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense
132.25k stars 10.03k forks source link

[LinkedIn] ERROR: An extractor error has occurred. #31270

Open tkbi opened 2 years ago

tkbi commented 2 years ago

Checklist

Verbose log

youtube site : = https://www.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur?autoSkip=true&autoplay=true&resume=false&u=xxxxxxxx
ERROR: An extractor error has occurred. (caused by KeyError(u'JSESSIONID',)); please report this issue on https://yt-dl.org/bug . 
Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. 
Be sure to call youtube-dl with the --verbose flag and include its complete output.

Description

WRITE DESCRIPTION HERE

dirkf commented 2 years ago

Be sure to call youtube-dl with the --verbose flag and include its complete output.

Do it?

Vangelis66 commented 2 years ago

... The URL in OP should work without the query parameters and without providing login credentials; from what I gathered, it must be a free sample:

https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur

(... it plays fine inside my browser 😄 ); -F-ing this in youtube-dl yields:

youtube-dl -v -F "https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur" => 

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '-F', 'https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur']
[debug] Encodings: locale cp1253, fs mbcs, out cp737, pref cp1253
[debug] youtube-dl version 2022.10.02.810
[debug] Python version 3.4.4 (CPython) - Windows-Vista-6.0.6003-SP2
[debug] exe versions: ffmpeg 5.0, ffprobe 5.0, phantomjs 2.1.1, rtmpdump 2.4
[debug] Proxy map: {}
[generic] was-ist-softwarearchitektur: Requesting header
WARNING: Falling back on generic information extractor.
[generic] was-ist-softwarearchitektur: Downloading webpage
[generic] was-ist-softwarearchitektur: Extracting information
[info] Available formats for was-ist-softwarearchitektur:
format code  extension      resolution note
0            unknown_video  unknown

Sadly, what is actually being downloaded is no media file, but a HTML one, the page's source code:

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', 'https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur']
[debug] Encodings: locale cp1253, fs mbcs, out cp737, pref cp1253
[debug] youtube-dl version 2022.10.02.810
[debug] Python version 3.4.4 (CPython) - Windows-Vista-6.0.6003-SP2
[debug] exe versions: ffmpeg 5.0, ffprobe 5.0, phantomjs 2.1.1, rtmpdump 2.4
[debug] Proxy map: {}
[generic] was-ist-softwarearchitektur: Requesting header
WARNING: Falling back on generic information extractor.
[generic] was-ist-softwarearchitektur: Downloading webpage
[generic] was-ist-softwarearchitektur: Extracting information
[debug] Default format spec: bestvideo+bestaudio/best
[debug] Invoking downloader on 'https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur'
[download] Destination: Was ist Softwarearchitektur - Einf?hrung in die Softwarearchitektur 1 - Grundlagen, Begriffe und ausgew?hlte Tools-was-ist-softwarearchitektur.unknown_video
[download] 100% of 118.05KiB in 00:01

Inside the Page Source,

  <div class="share-native-video">
    <video class="share-native-video__node video-js" data-sources=
    "[{&quot;src&quot;:&quot;https://dms.licdn.com/playlist/C4D0DAQGPfzHW4lxbTw/learning-original-video-vbr-540/0/1598782239071?e=1665342000&amp;v=beta&amp;t=dwvCYQxhSyDwwUmJAmm62GV9oPTuK-mQZ3DczI2u7_8#.mp4&quot;}]"
    data-poster-url=
    "https://media-exp1.licdn.com/dms/image/C4E0DAQEIBfhUmf3mlw/learning-public-crop_675_1200/0/1567118425117?e=1665342000&amp;v=beta&amp;t=XO_op2UrLN64_KQbrSj9lOk7ZACr6p7GblyOuweD6mg"
    data-captions-url=
    "https://www.linkedin.com/ambry/?x-li-ambry-ep=AQIMn_ZSVB7fCQAAAYOaDcda-TEnQXD-2_tND91gB_iJ-8yCizcDJTGyq02N63-Q_ApFqSlVY2hn7z3zvzkmPud1C9eKTbQKpe-rU3AB6tiBK57Z4gjR05o7TXuEhZS9vF6zj7WA5667DYvm-J0qPwCyffsBnYOOQuZaQPFlrkGDGKHVtcx1mp451qGQMZaRvPuCLOCjHgkk05cpKbMFOeN7u2Fc7roDLg-R3sNIIPuH-onbSpuRVJ1mAXyeq53bNRcch5sWUvBrfUHXXhtegbt8ae-1usMuwy8NwJCAfExXAlM4r-yp_JIDE1IbfihQUckTrrWxRDrXMF1WDnIh6QdfiSqWD6X6EbgTiOQQSSCjP-qDft_jY2X4_epMn-3u4HcDYnpD87pO_v2-UlRFaufikXU1Q3QkdOY5A6oaKP5gYichUMAiXIR7USfwnVXXfqaRqdEL0SqMjX3jCO2RvtlEMx1jmLVzVHhdqB8qdNXoL0Qz09sFSwrR46M1"
    data-digitalmedia-asset-urn="urn:li:lyndaVideo:(urn:li:lyndaCourse:761019,2815083)"
    data-tracking-id="DJvYbPLNTWK7posByDamzw=="></video>
  </div>

contains direct links to the media file itself,

https://dms.licdn.com/playlist/C4D0DAQGPfzHW4lxbTw/learning-original-video-vbr-540/0/1598782239071?e=1665342000&v=beta&t=dwvCYQxhSyDwwUmJAmm62GV9oPTuK-mQZ3DczI2u7_8#.mp4

(tokenised with a defined lifespan) and to WebVTT subs:

https://www.linkedin.com/ambry/?x-li-ambry-ep=AQIMn_ZSVB7fCQAAAYOaDcda-TEnQXD-2_tND91gB_iJ-8yCizcDJTGyq02N63-Q_ApFqSlVY2hn7z3zvzkmPud1C9eKTbQKpe-rU3AB6tiBK57Z4gjR05o7TXuEhZS9vF6zj7WA5667DYvm-J0qPwCyffsBnYOOQuZaQPFlrkGDGKHVtcx1mp451qGQMZaRvPuCLOCjHgkk05cpKbMFOeN7u2Fc7roDLg-R3sNIIPuH-onbSpuRVJ1mAXyeq53bNRcch5sWUvBrfUHXXhtegbt8ae-1usMuwy8NwJCAfExXAlM4r-yp_JIDE1IbfihQUckTrrWxRDrXMF1WDnIh6QdfiSqWD6X6EbgTiOQQSSCjP-qDft_jY2X4_epMn-3u4HcDYnpD87pO_v2-UlRFaufikXU1Q3QkdOY5A6oaKP5gYichUMAiXIR7USfwnVXXfqaRqdEL0SqMjX3jCO2RvtlEMx1jmLVzVHhdqB8qdNXoL0Qz09sFSwrR46M1

I suppose getting this to work with "paid for content" will require additional steps... FWIW, yt-dlp fails right away with:

ERROR: Unsupported URL: https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur

(w/ and w/o -ies generic,default 😉 )

dirkf commented 2 years ago

There is a LinkedIn extractor, but only for www.linkedin.com, and it doesn't understand this page structure, which has good ld+json block apart from its invalid contentURL.

After giving it a good talking-to:

$ python -m youtube_dl -j 'https://de.linkedin.com/learning/einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools/was-ist-softwarearchitektur' | jq '.'
{
  "display_id": "was-ist-softwarearchitektur",
  "extractor": "linkedin:learning",
  "protocol": "https",
  "description": "Grundlagenwissen für Softwarearchitekten",
  "upload_date": "20190605",
  "timestamp": 1559692800,
  "formats": [
    {
      "protocol": "https",
      "format": "vbr-540 - 540p",
      "url": "https://dms.licdn.com/playlist/C4D0DAQGPfzHW4lxbTw/learning-original-video-vbr-540/0/1598782239071?e=1665367200&v=beta&t=acTi1Goh8ZA7q9v2RAqERacJAikDZVcNmDaNfz8NpGs#.mp4",
      "http_headers": ...,
      "height": 540,
      "ext": "mp4",
      "format_id": "vbr-540"
    }
  ],
  "episode_id": "was-ist-softwarearchitektur",
  "series_id": "einfuhrung-in-die-softwarearchitektur-1-grundlagen-begriffe-und-ausgewahlte-tools",
  "_filename": "Was ist Softwarearchitektur - Einführung in die Softwarearchitektur 1 - Grundlagen, Begriffe und ausgewählte Tools-C4D0DAQGPfzHW4lxbTw.mp4",
  "uploader": "Hendrik Lösch",
  "duration": 312,
  "format_id": "vbr-540",
  "height": 540,
  "http_headers": ...,
  "id": "C4D0DAQGPfzHW4lxbTw",
  "subtitles": {
    "de": [
      {
        "url": "https://www.linkedin.com/ambry/?x-li-ambry-ep=AQIg1xl8j15WUAAAAYObefkbEndlIBeOIr8KICAJLJfFJxEGcNImqIc5sryMAQFVT5UIdIYmp4sS2d7uzpT_2Pn6XApik8l-7zhiIwKm9rKaiGCi-XfRdpKnA9e_vfZNd4012ocdN-6wLmPE7sSeF_AJIw8QoGja6KR-cNgrdyYMRjbQPrQJymTtoL4BP7z_JY4eM8IQMvrjlMoDDRGRlRr7Rq4lbXwP1iPGBfZb7KbDKu3ft1hbWHBuz3tMgYwtYnmAzvOJT6LIxhvNdZQTPc4g92B6OF3pz7xlaxfCVS8mGCKhs7ZNdkDw6juQDDIEedxwUrq4-xCTrTI3sNbfa8lIiTISHgfBuxsinODnef1mPHBFDxiJk16p2fw9SA3iUg1pqavE8Uj_JIP4DWHxtN43WK7R_onPH8KAgnvNUr5PVZBi3nzaqJbFSC9uSRDO2VDNkEcCqD15wA8oVJaAbsw4T87XteGZL8x-_klxt37qvzY_vlnfEZDgx033",
        "ext": "vtt"
      }
    ]
  },
  "view_count": 5709,
  "playlist": null,
  "thumbnails": [
    {
      "url": "https://media-exp1.licdn.com/dms/image/C4E0DAQEIBfhUmf3mlw/learning-public-crop_675_1200/0/1567118425117?e=1665367200&v=beta&t=VXMXYnLrAzWBVEQj73zJXVZc5_KcpSMaAmrNi87NL_Y",
      "id": "0"
    }
  ],
  "title": "Was ist Softwarearchitektur? - Einführung in die Softwarearchitektur 1: Grundlagen, Begriffe und ausgewählte Tools",
  "url": "https://dms.licdn.com/playlist/C4D0DAQGPfzHW4lxbTw/learning-original-video-vbr-540/0/1598782239071?e=1665367200&v=beta&t=acTi1Goh8ZA7q9v2RAqERacJAikDZVcNmDaNfz8NpGs#.mp4",
  "extractor_key": "LinkedInLearning",
  "format": "vbr-540 - 540p",
  ...
}
$
dirkf commented 2 years ago

This is the patch I used:

--- old/youtube_dl/extractor/linkedin.py
+++ new/youtube_dl/extractor/linkedin.py
@@ -5,9 +5,14 @@ import re

 from .common import InfoExtractor
 from ..utils import (
+    extract_attributes,
     ExtractorError,
     float_or_none,
     int_or_none,
+    ISO639Utils,
+    merge_dicts,
+    try_get,
+    url_or_none,
     urlencode_postdata,
     urljoin,
 )
@@ -17,7 +22,7 @@ class LinkedInLearningBaseIE(InfoExtractor):
     _NETRC_MACHINE = 'linkedin'
     _LOGIN_URL = 'https://www.linkedin.com/uas/login?trk=learning'

-    def _call_api(self, course_slug, fields, video_slug=None, resolution=None):
+    def _call_api(self, host, course_slug, fields, video_slug=None, resolution=None):
         query = {
             'courseSlug': course_slug,
             'fields': fields,
@@ -27,10 +32,10 @@ class LinkedInLearningBaseIE(InfoExtractor):
         if video_slug:
             query.update({
                 'videoSlug': video_slug,
-                'resolution': '_%s' % resolution,
+                'resolution': '_%s' % (resolution, ),
             })
-            sub = ' %dp' % resolution
-        api_url = 'https://www.linkedin.com/learning-api/detailedCourses'
+            sub = ' %dp' % (resolution, )
+        api_url = 'https://%s.linkedin.com/learning-api/detailedCourses' % (host, )
         return self._download_json(
             api_url, video_slug, 'Downloading%s JSON metadata' % sub, headers={
                 'Csrf-Token': self._get_cookies(api_url)['JSESSIONID'].value,
@@ -73,7 +78,7 @@ class LinkedInLearningBaseIE(InfoExtractor):

 class LinkedInLearningIE(LinkedInLearningBaseIE):
     IE_NAME = 'linkedin:learning'
-    _VALID_URL = r'https?://(?:www\.)?linkedin\.com/learning/(?P<course_slug>[^/]+)/(?P<id>[^/?#]+)'
+    _VALID_URL = r'https?://(?P<host>www|[a-z]{2})?(?(host)\.)linkedin\.com/learning/(?P<course_slug>[^/]+)/(?P<id>[^/?#]+)'
     _TEST = {
         'url': 'https://www.linkedin.com/learning/programming-foundations-fundamentals/welcome?autoplay=true',
         'md5': 'a1d74422ff0d5e66a792deb996693167',
@@ -86,15 +91,53 @@ class LinkedInLearningIE(LinkedInLearningBaseIE):
         },
     }

+    def _extract_free(self, url, host, course_slug, video_slug):
+        webpage = self._download_webpage(url, video_slug)
+        info = self._search_json_ld(webpage, video_slug, expected_type='VideoObject', default={})
+        title = info['title']
+        info.pop('url', None)
+        native_video = self._search_regex(
+            r'''(<video\b[^>]+\bclass\s*=\s*(["'])(?:(?:(?!\2).)+?\s)?share-native-video__node video-js\2[^>]*>)''',
+            webpage, 'native video', default='')
+        native_video = extract_attributes(native_video)
+        sources = self._parse_json(native_video.get('data-sources', '[]'), video_slug)
+        formats = []
+        video_id = video_slug
+        for src in sources:
+            src = url_or_none(try_get(src, lambda x: x['src']))
+            if not src:
+                continue
+            format_id = self._search_regex(r'-([avt]br-\d+)/', src, 'format id', default=None)
+            video_id = self._search_regex(r'/playlist/([^/]+)/', src, 'video id', default=video_id)
+            ext = self._search_regex(r'#\.(\w+)$', src, 'ext', default=None)
+            formats.append({
+                'url': src,
+                'format_id': format_id,
+                'ext': ext,
+                'height': int_or_none(format_id.split('-')[-1]),
+            })
+        self._sort_formats(formats)
+        sttl_url = native_video.get('data-captions-url') if ISO639Utils.short2long(host) else None
+        return merge_dicts({
+            'id': video_id,
+            'display_id': video_slug,
+            'episode_id': video_slug,
+            'series_id': course_slug,
+            'formats': formats,
+            'subtitles': sttl_url and {host: [{'url': sttl_url, 'ext': 'vtt', }]},
+            }, info)
+
     def _real_extract(self, url):
-        course_slug, video_slug = re.match(self._VALID_URL, url).groups()
+        host, course_slug, video_slug = re.match(self._VALID_URL, url).groups()

         video_data = None
         formats = []
         for width, height in ((640, 360), (960, 540), (1280, 720)):
-            video_data = self._call_api(
-                course_slug, 'selectedVideo', video_slug, height)['selectedVideo']
-
+            try:
+                video_data = self._call_api(
+                    host or 'www', course_slug, 'selectedVideo', video_slug, height)['selectedVideo']
+            except (ExtractorError, KeyError):
+                return self._extract_free(url, host, course_slug, video_slug)
             video_url_data = video_data.get('url') or {}
             progressive_url = video_url_data.get('progressiveUrl')
             if progressive_url:
@@ -155,7 +198,7 @@ class LinkedInLearningCourseIE(LinkedInLearningBaseIE):

     def _real_extract(self, url):
         course_slug = self._match_id(url)
-        course_data = self._call_api(course_slug, 'chapters,description,title')
+        course_data = self._call_api('www', course_slug, 'chapters,description,title')

         entries = []
         for chapter_number, chapter in enumerate(course_data.get('chapters', []), 1):