ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

[Udemy] Unable to extract course id #30719

Open pashakiz opened 2 years ago

pashakiz commented 2 years ago

Verbose log

youtube-dl --cookies c:\udemy_cookies.txt -o 'E:/Udemy/%(playlist)s/%(chapter_number)s - %(chapter)s/%(title)s.%(ext)s' https://www.udemy.com/typescript-full/ --verbose
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--cookies', 'c:\\udemy_cookies.txt', '-o', 'E:/Udemy/%(playlist)s/%(chapter_number)s - %(chapter)s/%(title)s.%(ext)s', 'https://www.udemy.com/typescript-full/', '--verbose']
[debug] Encodings: locale cp1251, fs utf-8, out utf-8, pref cp1251
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.9.6 (CPython) - Windows-10-10.0.19044-SP0
[debug] exe versions: none
[debug] Proxy map: {}
[udemy:course] typescript-full: Downloading webpage
ERROR: Unable to extract course id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "c:\python39\lib\site-packages\youtube_dl\YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "c:\python39\lib\site-packages\youtube_dl\YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
  File "c:\python39\lib\site-packages\youtube_dl\extractor\common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "c:\python39\lib\site-packages\youtube_dl\extractor\udemy.py", line 442, in _real_extract
    course_id, title = self._extract_course_info(webpage, course_path)
  File "c:\python39\lib\site-packages\youtube_dl\extractor\udemy.py", line 78, in _extract_course_info
    course_id = course.get('id') or self._search_regex(
  File "c:\python39\lib\site-packages\youtube_dl\extractor\common.py", line 1012, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract course id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Description

I am trying to download a paid course from udemy.com. When I try to use login/password, I get this: ERROR: Unable to download webpage: HTTP Error 403: Forbidden

So I tried the --cookies option and got a different error (above). I exported udemy_cookies.txt from the browser with this Chrome extension: https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid

dirkf commented 2 years ago

Plainly neither of the targets sought for course id is in the web page:

 r'data-course-id=["\'](\d+)'
 r'"courseId"\s*:\s*(\d+)'

If you can get the plain web page HTML using curl or wget with your cookies, we might be able to see what's wrong.

pashakiz commented 2 years ago

I'm not good with CLI curl... I tried this on my Windows 10: curl -b udemy_cookies.txt https://www.udemy.com/typescript-full/ and got: Cannot send content body with given predicate type

pashakiz commented 2 years ago

But I looked at the page in the browser and found this (data-clp-course-id)

<body id="udemy" class="
    ud-app-loader ud-component--course-landing-page-udlite
  udemy " data-clp-course-id="4412496" data-module-id="course-landing-page/udlite" data-module-args="...">

and this (courseId)

<div class="clp-component-render"><div class="clp-component-render"><div class="ud-component--course-landing-page-udlite--purchase-body-container" data-component-props="{&quot;componentProps&quot;:{&quot;purchaseSection&quot;:{&quot;is_course_paid&quot;:true,&quot;has_subscription_offerings&quot;:false,&quot;subscription&quot;:null,&quot;style_full_lifetime_access&quot;:&quot;full-lifetime-access&quot;,&quot;style_money_back_guarantee&quot;:&quot;money-back-guarantee&quot;},&quot;purchaseInfo&quot;:{&quot;isValidStudent&quot;:false,&quot;purchaseDate&quot;:null},&quot;moneyBackGuarantee&quot;:{&quot;is_enabled&quot;:true},&quot;addToCart&quot;:{&quot;buyables&quot;:[{&quot;buyable_object_type&quot;:&quot;course&quot;,&quot;id&quot;:4412496,&quot;buyableContext&quot;:{&quot;contentLocaleId&quot;:null}}],&quot;onAddRedirectUrl&quot;:&quot;/cart/added/course/4412496/&quot;,&quot;addedButtonBsStyle&quot;:&quot;primary&quot;,&quot;is_enabled&quot;:true}},&quot;courseId&quot;:[4412496],&quot;courseObject&quot;:{&quot;id&quot;:4412496,&quot;is_private&quot;:false}}"><div data-unique-id="450" style="display:none"></div><div><div class="purchase-section-container-skeleton--price--3Xcfk purchase-section-container-skeleton--skeleton--1UsRE skeleton--skeleton--1jc5m"><div class="text-skeleton--text-skeleton--7BlWc skeleton--skeleton--1jc5m"><p><span class="text-skeleton--line--3Pla- block--block--1b0nE"></span><span class="text-skeleton--line--3Pla- block--block--1b0nE"></span></p><div class="skeleton--shine--2nD_V"></div></div><div class="skeleton--shine--2nD_V"></div></div><div class="purchase-section-container-skeleton--cta--jnShg purchase-section-container-skeleton--skeleton--1UsRE skeleton--skeleton--1jc5m"><span class="block--block--1b0nE"></span><div class="skeleton--shine--2nD_V"></div></div><div class="purchase-section-container-skeleton--money-back--3lqS1 purchase-section-container-skeleton--skeleton--1UsRE skeleton--skeleton--1jc5m"><span 
class="block--block--1b0nE"></span><div class="skeleton--shine--2nD_V"></div></div></div></div></div></div>
</div>

Will it help?

dirkf commented 2 years ago

curl -b udemy_cookies.txt https://www.udemy.com/typescript-full/

Try curl -c udemy_cookies.txt "https://www.udemy.com/typescript-full/".

But your observation may be enough. This patch would find the course ID, though obviously there may be other changes if the course ID is being sent differently:

--- old/youtube-dl/youtube_dl/extractor/udemy.py
+++ new/youtube-dl/youtube_dl/extractor/udemy.py
@@ -77,8 +77,8 @@
             video_id, fatal=False) or {}
         course_id = course.get('id') or self._search_regex(
             [
-                r'data-course-id=["\'](\d+)',
-                r'&quot;courseId&quot;\s*:\s*(\d+)'
+                r'data-(?:clp-)?course-id\s*=\s*["\'](\d+)',
+                r'&quot;courseId&quot;\s*:\s*\[?(\d+)'
             ], webpage, 'course id')
         return course_id, course.get('title')
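
A minimal sketch to check the patch against the markup quoted earlier in this thread. The patterns are copied from the diff; the two page fragments are trimmed versions of the snippets posted above, and `first_match` is a hypothetical, simplified stand-in for the extractor's `_search_regex`, not the real helper.

```python
import re

# Patterns before and after the patch, copied from the diff above.
OLD_PATTERNS = [
    r'data-course-id=["\'](\d+)',
    r'&quot;courseId&quot;\s*:\s*(\d+)',
]
NEW_PATTERNS = [
    r'data-(?:clp-)?course-id\s*=\s*["\'](\d+)',
    r'&quot;courseId&quot;\s*:\s*\[?(\d+)',
]

# Trimmed fragments of the page markup quoted earlier in this thread.
BODY_TAG = '<body id="udemy" data-clp-course-id="4412496" data-module-id="course-landing-page/udlite">'
PROPS = '&quot;courseId&quot;:[4412496],&quot;courseObject&quot;:{&quot;id&quot;:4412496}'


def first_match(patterns, text):
    """Return group 1 of the first pattern that matches, else None.

    Hypothetical helper: a simplified stand-in for _search_regex.
    """
    for pattern in patterns:
        m = re.search(pattern, text)
        if m:
            return m.group(1)
    return None


if __name__ == "__main__":
    for label, patterns in (("old", OLD_PATTERNS), ("new", NEW_PATTERNS)):
        print(label, first_match(patterns, BODY_TAG), first_match(patterns, PROPS))
```

The old patterns miss both fragments because the attribute is now `data-clp-course-id` and the `courseId` value is wrapped in a list; the optional `(?:clp-)?` and `\[?` cover those two cases only.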

pashakiz commented 2 years ago

Great! It works now with this patch! Thank you!

harslannet commented 2 years ago

How do I apply this patch?

pashakiz commented 2 years ago

I found this file here:

c:\python39\lib\site-packages\youtube_dl\extractor\udemy.py

See above: this path is in the verbose log, so you can find the path on your machine from your own verbose log.

Then I fixed the lines manually, following the advice above.

I.e., replace these two lines:

r'data-course-id=["\'](\d+)',
r'&quot;courseId&quot;\s*:\s*(\d+)'

with these:

r'data-(?:clp-)?course-id\s*=\s*["\'](\d+)',
r'&quot;courseId&quot;\s*:\s*\[?(\d+)'
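
The manual edit above could also be scripted. This is only a sketch under assumptions: `patch_source` and `patch_file` are hypothetical helper names, and the old lines must appear in the installed udemy.py exactly as quoted above, otherwise the script stops without changing anything.

```python
from pathlib import Path

# (old, new) pairs taken from the two replacement lines above.
REPLACEMENTS = [
    (r"""r'data-course-id=["\'](\d+)'""",
     r"""r'data-(?:clp-)?course-id\s*=\s*["\'](\d+)'"""),
    (r"""r'&quot;courseId&quot;\s*:\s*(\d+)'""",
     r"""r'&quot;courseId&quot;\s*:\s*\[?(\d+)'"""),
]


def patch_source(source: str) -> str:
    """Apply both replacements; raise if an old line is not found."""
    for old, new in REPLACEMENTS:
        if old not in source:
            raise ValueError("line not found (already patched or different version?): " + old)
        source = source.replace(old, new)
    return source


def patch_file(path: str) -> None:
    """Patch the extractor file in place, keeping a .bak copy of the original."""
    target = Path(path)
    original = target.read_text(encoding="utf-8")
    patched = patch_source(original)  # raises before anything is written
    target.with_name(target.name + ".bak").write_text(original, encoding="utf-8")
    target.write_text(patched, encoding="utf-8")
```

Usage would be `patch_file(r"c:\python39\lib\site-packages\youtube_dl\extractor\udemy.py")`, with the path taken from your own verbose log.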

EightChickens commented 2 years ago

Hello, I couldn't find this udemy.py. The verbose log says c:\Users\dst\ ... but this "dst" does not exist. I searched my entire C: drive (with "dir udemy.py /S") and found nothing. I opened the cookie txt and couldn't find anything with "data-" ... How do I get a proper cookie?

dirkf commented 2 years ago

You probably have the Windows self-extracting executable, which is not so easy to patch. Install Python and use pip to install youtube-dl: the extractor source file will then be accessible.

EightChickens commented 2 years ago

Even if I got the other version, how could it extract the course id if I couldn't even find "data-" in the cookie.txt? I used the browser extension mentioned in the original post. Brave browser.

harslannet commented 2 years ago

Thank you @pashakiz and @dirkf. I was able to get it working as a result of your answers.

dirkf commented 2 years ago

Even if I got the other version, how could it extract the course id if I couldn't even find "data-" in the cookie.txt?

You need the two changes, apparently:

EightChickens commented 2 years ago

Hi dirkf, I couldn't find "course" anywhere in the cookie either. The txt file has 46 rows and some of them do have access_token and ud_last_auth_information, but no "course" or "data-". There is "muxData" however. I'm using the extension mentioned in the original post.

dirkf commented 2 years ago

The course ID is in the web page; it has nothing to do with the cookies, which are only needed to be able to fetch the page at all. You don't need to care what's in the cookie file, except that it was extracted from a currently logged-in browser session and hasn't subsequently expired.
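
To illustrate the split between cookies and page content, here is a rough sketch using only the Python standard library. The function names are hypothetical, the regex is the first pattern from the patch above, and there is no guarantee that Udemy serves the page to a plain urllib request (the original post reports a 403 when logging in directly).

```python
import re
import urllib.request
from http.cookiejar import MozillaCookieJar


def load_cookies(path):
    """Load a Netscape-format cookies.txt, as exported by the browser extension."""
    jar = MozillaCookieJar(path)
    # Keep session cookies, but drop cookies that have already expired.
    jar.load(ignore_discard=True, ignore_expires=False)
    return jar


def fetch_page(url, jar):
    """Fetch a page with the loaded cookies (performs a network request)."""
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    with opener.open(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def find_course_id(webpage):
    """Look for the course ID in the page HTML, never in the cookie file."""
    m = re.search(r'data-(?:clp-)?course-id\s*=\s*["\'](\d+)', webpage)
    return m.group(1) if m else None


if __name__ == "__main__":
    jar = load_cookies("udemy_cookies.txt")
    html = fetch_page("https://www.udemy.com/typescript-full/", jar)
    print(find_course_id(html))
```

The cookie jar only authenticates the request; `find_course_id` never looks at it, which is why searching cookies.txt for "data-" or "course" finds nothing.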

light42 commented 2 years ago

Don't forget to clear your Udemy cookies before exporting them (to avoid a previous or business session tampering with the current cookies). Also, if the download process suddenly halts, a VPN can sometimes help.

Well, you might run into other problems down the road, and this method won't work on a business account, but if you're lucky it's possible to download all the videos.

dirkf commented 2 years ago

Also see why a dedicated tool for Udemy was abandoned.