bump... just came across another example:
URL: https://www.cnn.com/2018/05/21/entertainment/jada-pinkett-hair-loss/index.html
yt-dlp.py -v "https://www.cnn.com/2018/05/21/entertainment/jada-pinkett-hair-loss/index.html"
[debug] Command-line config: ['-v', 'https://www.cnn.com/2018/05/21/entertainment/jada-pinkett-hair-loss/index.html']
[debug] Encodings: locale cp1252, fs utf-8, out utf-8, err utf-8, pref cp1252
[debug] yt-dlp version 2022.03.08.1 [c0c2c57d3] (zip)
[debug] Python version 3.6.6 (CPython 64bit) - Windows-10-10.0.14393-SP0
[debug] exe versions: ffmpeg 4.3.2-2021-02-02-full_build-www.gyan.dev, ffprobe 4.3.2-2021-02-02-full_build-www.gyan.dev, rtmpdump 2.4
[debug] Optional libraries: sqlite
[debug] Proxy map: {}
[debug] [CNNArticle] Extracting URL: https://www.cnn.com/2018/05/21/entertainment/jada-pinkett-hair-loss/index.html
[CNNArticle] index.html: Downloading webpage
ERROR: [CNNArticle] Unable to extract cnn url; please report this issue on https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using yt-dlp -U
  File "C:\Transmogrifier\yt-dlp.py\yt_dlp\extractor\common.py", line 617, in extract
    ie_result = self._real_extract(url)
  File "C:\Transmogrifier\yt-dlp.py\yt_dlp\extractor\cnn.py", line 145, in _real_extract
    cnn_url = self._html_search_regex(r"video:\s*'([^']+)'", webpage, 'cnn url')
  File "C:\Transmogrifier\yt-dlp.py\yt_dlp\extractor\common.py", line 1201, in _html_search_regex
    res = self._search_regex(pattern, string, name, default, fatal, flags, group)
  File "C:\Transmogrifier\yt-dlp.py\yt_dlp\extractor\common.py", line 1192, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
Thanks Ringo
Hello, I have an extractor that can download the OP's URL, but it doesn't download the URL used in the test case or this URL. I always get an error for this kind of URL (in the browser).
In dev tools, it says CORS Missing Allow Origin
for this kind of manifest URL. Using the media URL from the JSON gives a 403 error. I'd appreciate it if someone could confirm whether this URL or http://www.cnn.com/2014/12/21/politics/obama-north-koreas-hack-not-war-but-cyber-vandalism/ works.
Here's my branch if anyone wants to test: https://github.com/HobbyistDev/yt-dlp/tree/cnn-article-fix
and this URL. I always get an error for this kind of URL (in the browser)
It plays in the browser for me.
The webpage JS makes an API call to https://fave.api.cnn.io/v1/video?id=world/2022/02/04/parrot-steals-gopro-new-zealand-lon-orig-na.cnn&customer=cnn&edition=domestic&env=prod, which contains the metadata and format info. The ID (world/2022/02/04/parrot-steals-gopro-new-zealand-lon-orig-na.cnn) can be found in the JSON-LD block.
Same M.O. for the webpage from #6167: an API call to https://fave.api.cnn.io/v1/video?id=health/2021/06/18/how-to-improve-your-breathing-lbb-orig.cnn&customer=cnn&edition=domestic&env=prod, with the ID found in the JSON-LD.
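For anyone who wants to poke at that endpoint outside the extractor, here is a rough sketch of the flow described above. It is not the extractor code: the JSON-LD selector and the exact shape of the ID are my assumptions, and the files/fileUri layout is taken from the patch further down this thread.

# Rough sketch only. Assumptions: the video ID (e.g.
# "world/2022/02/04/parrot-steals-gopro-new-zealand-lon-orig-na.cnn")
# appears inside a <script type="application/ld+json"> block, and the
# fave API response lists media URLs under files/fileUri.
import json
import re
import urllib.request

FAVE_API = ('https://fave.api.cnn.io/v1/video'
            '?id=%s&customer=cnn&edition=domestic&env=prod')


def fave_metadata(article_url):
    req = urllib.request.Request(article_url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        webpage = resp.read().decode('utf-8', 'replace')
    # Grab the first JSON-LD block and look for an ID ending in ".cnn"
    ld_block = re.search(
        r'<script[^>]+type=(["\'])application/ld\+json\1[^>]*>(.*?)</script>',
        webpage, re.DOTALL)
    if not ld_block:
        raise ValueError('no JSON-LD block found')
    video_id = re.search(r'[\w][\w/-]*\.cnn\b', ld_block.group(2))
    if not video_id:
        raise ValueError('no CNN video ID found in JSON-LD')
    with urllib.request.urlopen(FAVE_API % video_id.group(0)) as resp:
        return json.load(resp)


# meta = fave_metadata('https://www.cnn.com/travel/article/parrot-steals-gopro-scli-intl')
# print([f.get('fileUri') for f in meta.get('files', [])])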
The CNN extractors seem to be generally broken (yt-dl is essentially the same, but lacks CNNIndonesiaIE).
CNNIE is based on TurnerBaseIE, but it produces manifest URLs that mostly give 400 Bad Request, with the resulting media links giving 403. Presumably the JSON API above should be used instead.
Applying TurnerBaseIE._extract_cvp_info() to the results from the JSON API also throws up 400 errors, resulting from _extract_f4m_formats() and _extract_akamai_formats(). Are these methods now obsolete, with the demise of Flash?
The _extract_timestamp() method was disabled, presumably because the base class method failed to scale the epoch value from ms to s.
It isn't obvious why the extractor needs TurnerBaseIE (and so AdobePassIE), but perhaps this helps NA cable TV subscribers in some way.
Taking most of those into account gives something like this (tested in yt-dl but making the canonical changes):
--- old/yt_dlp/extractor/cnn.py
+++ new/yt_dlp/extractor/cnn.py
@@ -1,6 +1,22 @@
+import re
+
from .common import InfoExtractor
from .turner import TurnerBaseIE
-from ..utils import merge_dicts, try_call, url_basename
+from ..utils import (
+ determine_ext,
+ extract_attributes,
+ ExtractorError,
+ float_or_none,
+ HEADRequest,
+ int_or_none,
+ parse_duration,
+ strip_or_none,
+ traverse_obj,
+ update_url_query,
+ url_basename,
+ url_or_none,
+)
+from ..compat import compat_urlparse
class CNNIE(TurnerBaseIE):
@@ -18,7 +34,7 @@
'duration': 135,
'upload_date': '20130609',
},
- 'expected_warnings': ['Failed to download m3u8 information'],
+ 'skip': 'Redirect to home: content expired?',
}, {
'url': 'http://edition.cnn.com/video/?/video/us/2013/08/21/sot-student-gives-epic-speech.georgia-institute-of-technology&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+rss%2Fcnn_topstories+%28RSS%3A+Top+Stories%29',
'md5': 'b5cc60c60a3477d185af8f19a2a26f4e',
@@ -29,18 +45,18 @@
'description': "A Georgia Tech student welcomes the incoming freshmen with an epic speech backed by music from \"2001: A Space Odyssey.\"",
'upload_date': '20130821',
},
- 'expected_warnings': ['Failed to download m3u8 information'],
+ 'skip': 'Redirect to home: content expired?',
}, {
'url': 'http://www.cnn.com/video/data/2.0/video/living/2014/12/22/growing-america-nashville-salemtown-board-episode-1.hln.html',
- 'md5': 'f14d02ebd264df951feb2400e2c25a1b',
+ 'md5': '5aa0b7170c275d9772e1cfa5c29a6f5f',
'info_dict': {
'id': 'living/2014/12/22/growing-america-nashville-salemtown-board-episode-1.hln',
'ext': 'mp4',
'title': 'Nashville Ep. 1: Hand crafted skateboards',
'description': 'md5:e7223a503315c9f150acac52e76de086',
+ 'timestamp': 1419271243,
'upload_date': '20141222',
},
- 'expected_warnings': ['Failed to download m3u8 information'],
}, {
'url': 'http://money.cnn.com/video/news/2016/08/19/netflix-stunning-stats.cnnmoney/index.html',
'md5': '52a515dc1b0f001cd82e4ceda32be9d1',
@@ -55,6 +71,7 @@
# m3u8 download
'skip_download': True,
},
+ 'skip': 'Redirect to home: content expired?',
}, {
'url': 'http://cnn.com/video/?/video/politics/2015/03/27/pkg-arizona-senator-church-attendance-mandatory.ktvk',
'only_matching': True,
@@ -70,6 +87,7 @@
# http://edition.cnn.com/.element/apps/cvp/3.0/cfg/spider/cnn/expansion/config.xml
'edition': {
'data_src': 'http://edition.cnn.com/video/data/3.0/video/%s/index.xml',
+ 'json_src': 'https://fave.api.cnn.io/v1/video?id=%s&customer=cnn&edition=domestic&env=prod',
'media_src': 'http://pmd.cdn.turner.com/cnn/big',
},
# http://money.cnn.com/.element/apps/cvp2/cfg/config.xml
@@ -79,17 +97,169 @@
},
}
- def _extract_timestamp(self, video_data):
- # TODO: fix timestamp extraction
- return None
+ def _extract_f4m_formats(self, *args, **kwargs):
+ return []
+
+ def _extract_akamai_formats(self, *args, **kwargs):
+ return []
+
+ def _extract_cvp_info(self, url, video_id, path_data={}, ap_data={}, fatal=False):
+
+ video_data = self._download_json(url, video_id)
+ urls = set()
+ formats = []
+ thumbnails = []
+ subtitles = {}
+ VIDEO_NAME_RE = r'(?P<width>[0-9]+)x(?P<height>[0-9]+)(?:_(?P<bitrate>[0-9]+))?'
+ # Possible formats locations: files/fileUri, groupFiles/.../files/fileUri
+ # and maybe others
+ for video_url in traverse_obj(
+ video_data,
+ ((('groupFiles', Ellipsis), None), 'files', Ellipsis, 'fileUri'),
+ expected_type=lambda x: x if '/' in x else None):
+ video_id = video_data.get('id') or video_id
+ if video_url.startswith('/mp4:protected/'):
+ continue
+ # TODO Correct extraction for these files
+ # protected_path_data = path_data.get('protected')
+ # if not protected_path_data or not rtmp_src:
+ # continue
+ # protected_path = self._search_regex(
+ # r'/mp4:(.+)\.[a-z0-9]', video_url, 'secure path')
+ # auth = self._download_webpage(
+ # protected_path_data['tokenizer_src'], query={
+ # 'path': protected_path,
+ # 'videoId': content_id,
+ # 'aifp': aifp,
+ # })
+ # token = xpath_text(auth, 'token')
+ # if not token:
+ # continue
+ # video_url = rtmp_src + video_url + '?' + token
+ elif video_url.startswith('/secure/'):
+ secure_path_data = path_data.get('secure')
+ if not secure_path_data:
+ continue
+ video_url = self._add_akamai_spe_token(
+ secure_path_data['tokenizer_src'],
+ secure_path_data['media_src'] + video_url,
+ video_id, ap_data)
+ v_url = url_or_none(video_url)
+ if (not v_url) and video_url.startswith('/'):
+ video_url = traverse_obj(
+ path_data, ('default', 'media_src'),
+ expected_type=lambda x: url_or_none('%s%s' % (x, video_url)))
+ if (not video_url) or video_url in urls:
+ continue
+ urls.add(video_url)
+ ext = determine_ext(video_url)
+ if ext in ('scc', 'srt', 'vtt'):
+ subtitles.setdefault('en', []).append({
+ 'ext': ext,
+ 'url': video_url,
+ })
+ elif ext == 'png':
+ thumbnails.append({
+ # 'id': format_id,
+ 'url': video_url,
+ })
+ elif ext == 'smil':
+ formats.extend(self._extract_smil_formats(
+ video_url, video_id, fatal=False))
+ elif re.match(r'https?://[^/]+\.akamaihd\.net/[iz]/', video_url):
+ formats.extend(self._extract_akamai_formats(
+ video_url, video_id, {
+ 'hds': path_data.get('f4m', {}).get('host'),
+ # nba.cdn.turner.com, ht.cdn.turner.com, ht2.cdn.turner.com
+ # ht3.cdn.turner.com, i.cdn.turner.com, s.cdn.turner.com
+ # ssl.cdn.turner.com
+ 'http': traverse_obj(
+ path_data, ('default', 'media_src'),
+ expected_type=lambda x: compat_urlparse.urlparse(x).host),
+ }))
+ elif ext == 'm3u8':
+ m3u8_formats = self._extract_m3u8_formats(
+ video_url, video_id, 'mp4',
+ m3u8_id='hls', entry_protocol='m3u8_native',
+ fatal=False)
+ if '/secure/' in video_url and '?hdnea=' in video_url:
+ for f in m3u8_formats:
+ f['_seekable'] = False
+ formats.extend(m3u8_formats)
+ elif ext == 'f4m':
+ formats.extend(self._extract_f4m_formats(
+ update_url_query(video_url, {'hdcore': '3.7.0'}),
+ video_id, f4m_id='hds', fatal=False))
+ else:
+ f = {
+ 'url': video_url,
+ 'ext': ext,
+ }
+ mobj = re.search(VIDEO_NAME_RE, video_url)
+ if mobj:
+ f.update({
+ 'format_id': mobj.group(),
+ 'width': int(mobj.group('width')),
+ 'height': int(mobj.group('height')),
+ 'tbr': int_or_none(mobj.group('bitrate')),
+ })
+ formats.append(f)
+
+ sttl_fmts = {
+ 'scc': 'scc',
+ 'webvtt': 'vtt',
+ 'smptett': 'tt',
+ }
+ for source in traverse_obj(
+ video_data,
+ ('closedCaptions', 'types',
+ lambda _, v: v['format'] in sttl_fmts)):
+ track = traverse_obj(source, 'track', expected_type=dict)
+ track_url = url_or_none(track.get('url'))
+ if not track_url or track_url.endswith('/big'):
+ continue
+ lang = traverse_obj(track, 'lang', 'label') or 'en'
+ subtitles.setdefault(lang, []).append({
+ 'url': track_url,
+ 'ext': sttl_fmts.get(source.get('format'))
+ })
+
+ thumbnails.extend(
+ traverse_obj(video_data, ('images', Ellipsis, {
+ 'id': 'name',
+ 'url': 'uri',
+ 'width': 'imageWidth',
+ 'height': 'imageHeight',
+ })))
+
+ return {
+ 'id': video_id,
+ 'title': video_data['headline'],
+ 'formats': formats,
+ 'subtitles': subtitles,
+ 'thumbnails': thumbnails,
+ 'description': strip_or_none(video_data.get('description')),
+ 'duration': parse_duration(video_data.get('length')) or int_or_none(video_data.get('trt')),
+ 'timestamp': traverse_obj(video_data, ('dateCreated', 'uts'), expected_type=lambda x: float_or_none(x, 1000)),
+ }
def _real_extract(self, url):
- sub_domain, path, page_title = self._match_valid_url(url).groups()
+ sub_domain, path, page_title = re.match(self._VALID_URL, url).groups()
if sub_domain not in ('money', 'edition'):
sub_domain = 'edition'
+
+ urlh = self._request_webpage(
+ HEADRequest(url), page_title, fatal=False, expected_status=404)
+ if not urlh:
+ raise ExtractorError('URL inaccessible')
+ elif urlh.getcode() == 404:
+ raise ExtractorError('URL not found (404)')
+ elif compat_urlparse.urlparse(urlh.geturl()).path in ('/business/' if sub_domain == 'money' else '/videos/'):
+ raise ExtractorError('Redirect to home page: content expired?')
+
config = self._CONFIG[sub_domain]
return self._extract_cvp_info(
- config['data_src'] % path, page_title, {
+ config['json_src'] % path, page_title, {
'default': {
'media_src': config['media_src'],
},
@@ -103,12 +273,13 @@
_VALID_URL = r'https?://[^\.]+\.blogs\.cnn\.com/.+'
_TEST = {
'url': 'http://reliablesources.blogs.cnn.com/2014/02/09/criminalizing-journalism/',
- 'md5': '3e56f97b0b6ffb4b79f4ea0749551084',
+ 'md5': '8738943bc67afb23f02d8a116a13370f',
'info_dict': {
'id': 'bestoftv/2014/02/09/criminalizing-journalism.cnn',
'ext': 'mp4',
'title': 'Criminalizing journalism?',
'description': 'Glenn Greenwald responds to comments made this week on Capitol Hill that journalists could be criminal accessories.',
+ 'timestamp': 1391965438,
'upload_date': '20140209',
},
'expected_warnings': ['Failed to download m3u8 information'],
@@ -125,19 +296,28 @@
_VALID_URL = r'https?://(?:(?:edition|www)\.)?cnn\.com/(?!videos?/)'
_TEST = {
'url': 'http://www.cnn.com/2014/12/21/politics/obama-north-koreas-hack-not-war-but-cyber-vandalism/',
- 'md5': '689034c2a3d9c6dc4aa72d65a81efd01',
+ 'md5': 'ad618091beda9eb5afc80bb62c8cdc3a',
'info_dict': {
'id': 'bestoftv/2014/12/21/ip-north-korea-obama.cnn',
'ext': 'mp4',
'title': 'Obama: Cyberattack not an act of war',
'description': 'md5:0a802a40d2376f60e6b04c8d5bcebc4b',
+ 'timestamp': 1419171098,
'upload_date': '20141221',
},
- 'expected_warnings': ['Failed to download m3u8 information'],
+ 'params': {
+ 'user_agent': 'Mozilla/5.0',
+ },
'add_ie': ['CNN'],
}
def _real_extract(self, url):
- webpage = self._download_webpage(url, url_basename(url))
- cnn_url = self._html_search_regex(r"video:\s*'([^']+)'", webpage, 'cnn url')
- return self.url_result('http://cnn.com/video/?/video/' + cnn_url, CNNIE.ie_key())
+ video_id = re.sub(r'/index\.html$', '', compat_urlparse.urlparse(url).path)
+ webpage = self._download_webpage(url, video_id)
+ cnn_url = self._html_search_regex(r"video:\s*'([^']+)'", webpage, 'cnn url', default=None)
+ if not cnn_url:
+ video_resource = self._search_regex(
+ r'''(<div\s+(?:[\w-]+\s*=\s*"[^"]*"\s+)*?data-featured-video\s*=\s*('|")true\2(?:\s+[\w-]+\s*=\s*"[^"]*")*\s*>)''',
+ webpage, 'featured video resource')
+ cnn_url = extract_attributes(video_resource)['data-video-id']
+ return self.url_result('http://cnn.com/videos/' + cnn_url, CNNIE.ie_key())
Another example URL and log can be provided if needed.
How do we get the supported sites list updated? It's been over 1 year.
In my case it falls back to the generic extractor and ends with "unsupported site".
Am I doing something wrong as to why it isn't finding the CNN extractor?
I was using --ap-mso and cookies in my config
https://www.cnn.com/videos/title-980446
@RutD0g that is unrelated to this issue (CNN Article videos). Open a new issue. I don't think that type of CNN URL has ever been supported
How do we get the supported sites list updated? It's been over 1 year.
You wait for volunteer developers to decide to work on site support in their own unpaid free time
CNN is a funny one, but you have to find the API call it makes for the manifest, and the address will be in there. If you search for fave.api in the network tab you'll find it.
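As an illustration of that workflow (not authoritative; the response layout is assumed from the files/fileUri structure used in the patch above): once you have copied the fave.api request URL out of the network tab, you can pull a manifest URL from its JSON and hand it straight to yt-dlp's Python API.

# Illustration only: download via a manifest URL taken from the fave API.
# The "files"/"fileUri" layout is an assumption based on this thread.
import json
import urllib.request

from yt_dlp import YoutubeDL

# Example fave.api URL (the #6167 video quoted earlier in the thread)
api_url = ('https://fave.api.cnn.io/v1/video'
           '?id=health/2021/06/18/how-to-improve-your-breathing-lbb-orig.cnn'
           '&customer=cnn&edition=domestic&env=prod')

with urllib.request.urlopen(api_url) as resp:
    data = json.load(resp)

manifests = [f.get('fileUri') for f in data.get('files', [])
             if isinstance(f, dict) and str(f.get('fileUri', '')).endswith('.m3u8')]

if manifests:
    with YoutubeDL({'outtmpl': '%(id)s.%(ext)s'}) as ydl:
        ydl.download([manifests[0]])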
@RutD0g : How do we get the supported sites list updated? It's been over 1 year.
@bashonly : You wait for volunteer developers to decide to work on site support in their own unpaid free time
Maybe the suggestion was to remove CNN from the supported sites list until it's fixed?
The patch seems to work fine, but with many warnings. Some adjustments are needed to how requests are made (if I understand correctly). Here's a test link: https://edition.cnn.com/2024/03/27/opinions/gaza-israel-resigning-state-department-sheline/index.html
Here's a log:
[CNNArticle] Extracting URL: https://edition.cnn.com/2024/03/27/opinions/gaza-israel-resigning-state-department-sheline/index.html
[CNNArticle] /2024/03/27/opinions/gaza-israel-resigning-state-department-sheline: Downloading webpage
[CNN] Extracting URL: http://cnn.com/videos/tv/2024/03/28/amanpour-sheline-gaza-protest.cnn
[CNN] amanpour-sheline-gaza-protest: Downloading webpage
ERROR: Passing a urllib.request.Request to _create_request() is deprecated. Use yt_dlp.networking.common.Request instead.; please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
File "
ERROR: Response.getcode() is deprecated, use Response.status; please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
ERROR: Response.geturl() is deprecated, use Response.url; please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
[CNN] amanpour-sheline-gaza-protest: Downloading JSON metadata
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, size, br, asr, proto, vext, aext, hasaud, source, id
[info] Available formats for tv/2024/03/28/amanpour-sheline-gaza-protest.cnn:
ID             EXT RESOLUTION │   FILESIZE   TBR PROTO │ VCODEC  ACODEC
────────────────────────────────────────────────────────────────────────
1920x1080_8000 mp4 1920x1080  │ ≈672.34MiB 8000k https │ unknown unknown
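For what it's worth, those deprecation errors point at their own fix: the HEAD-request block in the patch would need the newer Response attributes. A minimal sketch of just that block, assuming the rest of _real_extract() stays as in the patch above:

# Not standalone: only the HEAD-request portion of CNNIE._real_extract()
# from the patch above, adjusted per the deprecation messages
# (Response.getcode() -> Response.status, Response.geturl() -> Response.url).
# The "Passing a urllib.request.Request" error suggests HEADRequest should
# also come from yt-dlp's newer networking framework rather than utils,
# though I haven't verified the exact import path.
urlh = self._request_webpage(
    HEADRequest(url), page_title, fatal=False, expected_status=404)
if not urlh:
    raise ExtractorError('URL inaccessible')
elif urlh.status == 404:
    raise ExtractorError('URL not found (404)')
elif compat_urlparse.urlparse(urlh.url).path in (
        '/business/' if sub_domain == 'money' else '/videos/'):
    raise ExtractorError('Redirect to home page: content expired?')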
Checklist
Region
No response
Description
Can't download videos from CNN anymore. e.g.: https://www.cnn.com/travel/article/parrot-steals-gopro-scli-intl
Verbose log