ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

Support for https://n.news.naver.com/mnews/article (+ and willing to contribute to this) #31349

Open jinhere opened 1 year ago

jinhere commented 1 year ago

Example URLs

Description

C:\Users\jinhere>youtube-dl -v -F https://n.news.naver.com/mnews/article/052/0001813167?sid=291
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '-F', 'https://n.news.naver.com/mnews/article/052/0001813167?sid=291']
[debug] Encodings: locale cp949, fs utf-8, out utf-8, pref cp949
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.10.4 (CPython) - Windows-10-10.0.19044-SP0
[debug] exe versions: none
[debug] Proxy map: {}
[generic] 0001813167?sid=291: Requesting header
WARNING: Falling back on generic information extractor.
[generic] 0001813167?sid=291: Downloading webpage
[generic] 0001813167?sid=291: Extracting information
ERROR: Unsupported URL: https://n.news.naver.com/mnews/article/052/0001813167?sid=291
Traceback (most recent call last):
  File "C:\Users\jinhere\AppData\Local\Programs\Python\Python310\lib\site-packages\youtube_dl\YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "C:\Users\jinhere\AppData\Local\Programs\Python\Python310\lib\site-packages\youtube_dl\YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
  File "C:\Users\jinhere\AppData\Local\Programs\Python\Python310\lib\site-packages\youtube_dl\extractor\common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "C:\Users\jinhere\AppData\Local\Programs\Python\Python310\lib\site-packages\youtube_dl\extractor\generic.py", line 3489, in _real_extract
    raise UnsupportedError(url)
youtube_dl.utils.UnsupportedError: Unsupported URL: https://n.news.naver.com/mnews/article/052/0001813167?sid=291

Hello, the website I want supported is Naver News, which has a video and the article text on a single page. I ran the command above and it reports an unsupported URL. I thought this could be solved by adding a new extractor, but I found that there is already an extractor, naver.py, that extracts video from Naver TV (the video player looks similar). So my question is: should I write a new extractor or add code to naver.py? This may be an obvious question, but this is my first time reading and contributing to a big project, so I wanted to ask before starting.

dirkf commented 1 year ago

Thanks for your offer. I'd definitely add it to the same module. Depending on how similar the page structure and URL patterns are, it may be possible to modify the NaverIE class; otherwise, add a new extractor class, perhaps derived from NaverIE or NaverBaseIE.

The existing extractor handles pages whose URLs contain a path component like /v/{numeric_video_id}. It then uses "secret knowledge" to look that ID up through an API URL that returns JSON metadata for the ID.
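To make the difference between the two URL shapes concrete, here is a minimal sketch using only the standard `re` module. Both patterns are illustrative assumptions written for this thread, not the `_VALID_URL` values actually used in naver.py, and the group names `office_id` and `article_id` are hypothetical labels for the two numeric path components of the news URL.

```python
import re

# Hypothetical pattern for the URL shape the existing extractor handles:
# a path of the form /v/{numeric_video_id}.
TV_URL_RE = re.compile(r'https?://[^/]+/v/(?P<id>\d+)')

# Hypothetical pattern for the news article URL reported in this issue.
NEWS_URL_RE = re.compile(
    r'https?://n\.news\.naver\.com/mnews/article/'
    r'(?P<office_id>\d+)/(?P<article_id>\d+)')


def classify(url):
    """Return a (kind, groups) tuple, or None if neither pattern matches."""
    m = NEWS_URL_RE.match(url)
    if m:
        return 'news', m.groupdict()
    m = TV_URL_RE.match(url)
    if m:
        return 'tv', m.groupdict()
    return None


print(classify('https://n.news.naver.com/mnews/article/052/0001813167?sid=291'))
```

Because the news URL carries no /v/{id} component, a pattern of the first kind cannot match it, which is one reason a separate extractor class with its own URL pattern may turn out to be the cleaner option.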

The first problem page has this interesting chunk:

        <div class="_VOD_PLAYER_WRAP"
             data-video-index="0"
             data-video-id="0021E0A5FD56A21AF775A26D7F39FD5EFBC1"
             data-inkey="V1284101883929cd8aa8a81e760f310062bc742c134593abb80254509b668f9fc60b381e760f310062bc7"
             data-cover-image-url="https://mimgnews.pstatic.net/image/052/2022/11/13/cover_cover_202211132227471923_t_20221113223902322.jpg"
             data-cover-image-thumbnail-url="https://mimgnews.pstatic.net/image/052/2022/11/13/cover_cover_202211132227471923_t_20221113223902322.jpg?type&#x3D;w647"
             data-nvp-playable="true">
        </div>

This data-video-id doesn't seem to be the same sort of ID at all. You'll probably have to track the web traffic using your browser's developer tools to see how the data-video-id gets transformed into the media URL.
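As a starting point for that investigation, the sketch below pulls the data-* attributes out of the player wrapper using only the standard library's `html.parser`. The names `VodPlayerParser` and `find_vod_players` are hypothetical, the sample is an abbreviated copy of the chunk quoted above, and inside a real extractor one would probably use youtube-dl's own HTML helpers instead of a hand-written parser. It deliberately stops at parsing: how data-video-id and data-inkey are turned into a media URL is still unknown here.

```python
from html.parser import HTMLParser

# Abbreviated copy of the chunk quoted above (three of the attributes).
SAMPLE = '''
<div class="_VOD_PLAYER_WRAP"
     data-video-id="0021E0A5FD56A21AF775A26D7F39FD5EFBC1"
     data-inkey="V1284101883929cd8aa8a81e760f310062bc742c134593abb80254509b668f9fc60b381e760f310062bc7"
     data-cover-image-thumbnail-url="https://mimgnews.pstatic.net/image/052/2022/11/13/cover_cover_202211132227471923_t_20221113223902322.jpg?type&#x3D;w647">
</div>
'''


class VodPlayerParser(HTMLParser):
    """Collect the data-* attributes of elements with class _VOD_PLAYER_WRAP."""

    def __init__(self):
        super().__init__()
        self.players = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if '_VOD_PLAYER_WRAP' not in classes:
            return
        # Strip the "data-" prefix; HTMLParser has already unescaped
        # character references such as &#x3D; in the attribute values.
        self.players.append({
            name[len('data-'):]: value
            for name, value in attrs.items()
            if name.startswith('data-')
        })


def find_vod_players(webpage):
    """Return one dict of data-* attributes per player wrapper in the page."""
    parser = VodPlayerParser()
    parser.feed(webpage)
    parser.close()
    return parser.players


for player in find_vod_players(SAMPLE):
    print(player['video-id'], player['inkey'])
```

With those two values in hand, the requests seen in the browser's network tab can be searched for either string to find the call that consumes them.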