yt-dlp / yt-dlp

A feature-rich command-line audio/video downloader

Broken plugin: extractor/senategov.py - New URL Structure #4140

Closed frisch1 closed 2 years ago

frisch1 commented 2 years ago

Region

United States

Description

The U.S. Senate changed the structure of its Akamai URLs, including the code lookup. ISVP delivery is still in place, but the code lookup and the URL structure are now different. The error log is below. The change is confirmed on all Senate.gov websites as of the first week of June.

The new structure still calls the ISVP URL from the iframe. For example, take https://www.epw.senate.gov/public/index.cfm/2022/6/toxic-substances-control-act-amendments-implementation

The ISVP file is: https://www.senate.gov/isvp?type=live&comm=epw&filename=epw062222&auto_play=false
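
As a rough illustration of the two values that matter in that ISVP URL, the query string can be split with the standard library. The helper name `parse_isvp_url` is hypothetical and not part of yt-dlp; this is a sketch that assumes both `comm` and `filename` are always present.

```python
from urllib.parse import parse_qs, urlparse


def parse_isvp_url(isvp_url):
    """Return (comm, filename) taken from an ISVP iframe URL's query string."""
    query = parse_qs(urlparse(isvp_url).query)
    # parse_qs maps each key to a list of values; take the first of each.
    return query['comm'][0], query['filename'][0]


comm, filename = parse_isvp_url(
    'https://www.senate.gov/isvp?type=live&comm=epw&filename=epw062222&auto_play=false')
print(comm, filename)
```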

The filename is still used, but the comm lookup (epw) is now different, per the mapping on that page (https://www.senate.gov/isvp?type=live&comm=epw&filename=epw062222&auto_play=false). See below these notes for the new mapping, which now has a label and uses a new, seven-digit ID.

From the comm (epw), the filename (epw062222) and the new lookup, the new HLS stream is constructed as follows: https://www-senate-gov-media-srs.akamaized.net/hls/live/2036783/epw/epw062222/master.m3u8

or

https://www-senate-gov-media-srs.akamaized.net/hls/live/SEVEN_DIGIT_ID/COMM/FILENAME/master.m3u8
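
The construction above can be sketched as a small helper. This is only an illustration: the names `HLS_TEMPLATE` and `build_primary_hls_url` are hypothetical, and the host and path layout are taken from the single example in this report, so other committees may behave differently.

```python
# Template copied from the pattern in this report; SEVEN_DIGIT_ID, COMM and
# FILENAME are replaced by named placeholders.
HLS_TEMPLATE = (
    'https://www-senate-gov-media-srs.akamaized.net'
    '/hls/live/{stream_id}/{comm}/{filename}/master.m3u8'
)


def build_primary_hls_url(stream_id, comm, filename):
    """Fill the primary HLS template with the seven-digit ID, comm and filename."""
    return HLS_TEMPLATE.format(stream_id=stream_id, comm=comm, filename=filename)


print(build_primary_hls_url('2036783', 'epw', 'epw062222'))
```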

The new mapping table is below. As with previous implementations, there appears to be a backup/fallback structure: https://www-senate-gov-msl3archive.akamaized.net/environment/epw062222/master.m3u8. I have been unable to confirm this, as the backup doesn't appear to deliver a stream. The primary works.

That structure is showing up on older URLs. For example, this page from 2021, https://www.epw.senate.gov/public/index.cfm/hearings?ID=DE00984F-4EC0-4CD9-BA77-A781E4F94DA1, calls the following HLS stream: https://www-senate-gov-msl3archive.akamaized.net/environment/epw031721_1/master.m3u8.
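
The archive-style URLs above appear to follow a pattern of label plus filename. A minimal sketch, assuming the path segment (`environment` for epw) is the label from the last column of the mapping table; the names `ARCHIVE_TEMPLATE` and `build_archive_hls_url` are hypothetical, and the pattern is inferred from the two epw examples only.

```python
# Inferred from the two epw examples in this report; not confirmed for other
# committees.
ARCHIVE_TEMPLATE = (
    'https://www-senate-gov-msl3archive.akamaized.net'
    '/{label}/{filename}/master.m3u8'
)


def build_archive_hls_url(label, filename):
    """Fill the archive template with the committee label and the filename."""
    return ARCHIVE_TEMPLATE.format(label=label, filename=filename)


print(build_archive_hls_url('environment', 'epw062222'))
print(build_archive_hls_url('environment', 'epw031721_1'))
```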

They have not finished updating all Senate committee pages, but it appears they are doing so (the 2021 page above definitely was not in that structure in 2021, per our database). That said, I don't know if it makes sense to keep the old method around as a fallback just in case. It definitely is not used on any new videos, but I don't know the process for updating all the old pages or how long it will take.

Based on the URL, I am guessing this applies to Akamai's msl3... MSL3.x

["ag", "76440", "https://ag-f.akamaihd.net", "2036803", "agriculture"],
["aging", "76442", "https://aging-f.akamaihd.net", "2036801", "aging"],
["approps", "76441", "https://approps-f.akamaihd.net", "2036802", "appropriations"],
["armed", "76445", "https://armed-f.akamaihd.net", "2036800", "armedservices"],
["banking", "76446", "https://banking-f.akamaihd.net", "2036799", "banking"],
["budget", "76447", "https://budget-f.akamaihd.net", "2036798", "budget"],
["cecc", "76486", "https://srs-f.akamaihd.net", "2036782", "srs_cecc"],
["commerce", "80177", "https://commerce1-f.akamaihd.net", "2036779", "commerce"],
["csce", "75229", "https://srs-f.akamaihd.net", "2036777", "srs_srs"],
["dpc", "76590", "https://dpc-f.akamaihd.net", "dpc"],
["energy", "76448", "https://energy-f.akamaihd.net", "2036797", "energy"],
["epw", "76478", "https://epw-f.akamaihd.net", "2036783", "environment"],
["ethics", "76449", "https://ethics-f.akamaihd.net", "2036796", "ethics"],
["finance", "76450", "https://finance-f.akamaihd.net", "2036795", "finance_finance"],
["foreign", "76451", "https://foreign-f.akamaihd.net", "2036794", "foreignrelations"],
["govtaff", "76453", "https://govtaff-f.akamaihd.net", "2036792", "hsgac"],
["help", "76452", "https://help-f.akamaihd.net", "2036793", "help"],
["indian", "76455", "https://indian-f.akamaihd.net", "2036791", "indianaffairs"],
["intel", "76456", "https://intel-f.akamaihd.net", "2036790", "intelligence"],
["intlnarc", "76457", "https://intlnarc-f.akamaihd.net", "internationalnarcoticscaucus"],
["jccic", "85180", "https://jccic-f.akamaihd.net", "2036778", "jccic"],
["jec", "76458", "https://jec-f.akamaihd.net", "2036789", "jointeconomic"],
["judiciary", "76459", "https://judiciary-f.akamaihd.net", "2036788", "judiciary"],
["rpc", "76591", "https://rpc-f.akamaihd.net", "rpc"],
["rules", "76460", "https://rules-f.akamaihd.net", "2036787", "rules"],
["saa", "76489", "https://srs-f.akamaihd.net", "2036780", "srs_saa"],
["smbiz", "76461", "https://smbiz-f.akamaihd.net", "2036786", "smallbusiness"],
["srs", "75229", "https://srs-f.akamaihd.net", "2031966", "srs_srs"],
["uscc", "76487", "https://srs-f.akamaihd.net", "2036781", "srs_uscc"],
["vetaff", "76462", "https://vetaff-f.akamaihd.net", "2036785", "veteransaffairs"],
["arch", "", "https://ussenate-f.akamaihd.net/"],
["uscp", "", "", "2043686", ""],
["cio", "", "", "2043686", ""]

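Not every row in the mapping above has the same length: some (dpc, intlnarc, rpc) have no seven-digit ID, and arch has neither an ID nor a label. A minimal sketch of turning rows into a lookup that tolerates this is shown here. The `parse_row` helper and the decision to treat any seven-digit field after the domain as the stream ID are assumptions for illustration, not how the Senate page or yt-dlp handles it.

```python
import re

# Three rows copied from the mapping above, one of each shape.
ROWS = [
    ["epw", "76478", "https://epw-f.akamaihd.net", "2036783", "environment"],
    ["dpc", "76590", "https://dpc-f.akamaihd.net", "dpc"],
    ["arch", "", "https://ussenate-f.akamaihd.net/"],
]


def parse_row(row):
    """Return comm, seven-digit stream ID (or None) and label (or None)."""
    extras = row[3:]  # fields after the legacy stream number and domain
    # Assumption: a field made of exactly seven digits is the new stream ID.
    stream_id = next((f for f in extras if re.fullmatch(r'\d{7}', f)), None)
    # Any other non-empty trailing field is treated as the label.
    label = next((f for f in extras if f and f != stream_id), None)
    return {'comm': row[0], 'stream_id': stream_id, 'label': label}


lookup = {row[0]: parse_row(row) for row in ROWS}
print(lookup['epw'])
```

With a lookup like this, a missing `stream_id` signals that the primary HLS URL cannot be built for that committee.
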
Verbose log

$ yt-dlp -vU https://www.epw.senate.gov/public/index.cfm/2022/6/toxic-substances-control-act-amendments-implementation
[debug] Command-line config: ['-vU', 'https://www.epw.senate.gov/public/index.cfm/2022/6/toxic-substances-control-act-amendments-implementation']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version 2022.06.22.1 [a86e01e]
[debug] Python version 3.8.0 (CPython 64bit) - Linux-5.4.0-1080-aws-x86_64-with-glibc2.27
[debug] Checking exe version: ffmpeg -bsfs
[debug] Checking exe version: ffprobe -bsfs
[debug] exe versions: ffmpeg 4.3.2-0york0, ffprobe 4.3.2-0york0, phantomjs ., rtmpdump 2.4
[debug] Optional libraries: Cryptodome-3.14.1, brotli-1.0.9, certifi-2018.01.18, mutagen-1.45.1, sqlite3-2.6.0, websockets-10.3
[debug] Proxy map: {}
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: 2022.06.22.1, Current version: 2022.06.22.1
yt-dlp is up to date (2022.06.22.1)
[debug] [generic] Extracting URL: https://www.epw.senate.gov/public/index.cfm/2022/6/toxic-substances-control-act-amendments-implementation
[generic] toxic-substances-control-act-amendments-implementation: Requesting header
WARNING: [generic] Falling back on generic information extractor.
[generic] toxic-substances-control-act-amendments-implementation: Downloading webpage
[generic] toxic-substances-control-act-amendments-implementation: Extracting information
[debug] Looking for video embeds
[debug] [SenateISVP] Extracting URL: https://www.senate.gov/isvp?type=live&comm=epw&filename=epw062222&auto_play=false
[SenateISVP] epw062222: Downloading webpage
[SenateISVP] epw062222: Downloading f4m manifest
ERROR: [SenateISVP] Unable to download f4m manifest: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "/home/ubuntu/.local/lib/python3.8/site-packages/yt_dlp/extractor/common.py", line 647, in extract
    ie_result = self._real_extract(url)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/yt_dlp/extractor/senategov.py", line 131, in _real_extract
    for entry in self._extract_f4m_formats(f4m_url, video_id, f4m_id='f4m'):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/yt_dlp/extractor/common.py", line 1984, in _extract_f4m_formats
    res = self._download_xml_handle(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/yt_dlp/extractor/common.py", line 965, in download_handle
    res = self._download_webpage_handle(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/yt_dlp/extractor/common.py", line 833, in _download_webpage_handle
    urlh = self._request_webpage(url_or_request, video_id, note, errnote, fatal, data=data, headers=headers, query=query, expected_status=expected_status)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/yt_dlp/extractor/common.py", line 790, in _request_webpage
    raise ExtractorError(errmsg, cause=err)

  File "/home/ubuntu/.local/lib/python3.8/site-packages/yt_dlp/extractor/common.py", line 772, in _request_webpage
    return self._downloader.urlopen(self._create_request(url_or_request, data, headers, query))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/yt_dlp/YoutubeDL.py", line 3594, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

pukkandan commented 2 years ago

The given URL doesn't exist anymore. Do you have another example?