yt-dlp / yt-dlp

A feature-rich command-line audio/video downloader

yt-dlp fails to parse MPD manifest: KeyError('sourceURL') #8269

Open flashdagger opened 9 months ago

flashdagger commented 9 months ago

Checklist

Provide a description that is worded well enough to be understood

Disclaimer: I checked all the boxes to advance in the process.

Dear developers and maintainers,

I have no idea whether the MPD file conforms to the standard. Downloading it with ffmpeg also fails, but that may be due to missing header attributes. Please decide for yourselves whether the MPD parsing needs to be changed, or tell me if this particular format is too anomalous.

Best regards, Marcel

Provide verbose output that clearly demonstrates the problem

Complete Verbose Output

[debug] Command-line config: ['https://video.brighteon.com/file/BTBucket-Prod/dash/b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a.mpd', '--no-config', '-v']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8 (No ANSI), error utf-8 (No ANSI), screen utf-8 (No ANSI)
[debug] yt-dlp version stable@2023.09.24 [088add956] (pip)
[debug] Python 3.10.6 (CPython x86_64 64bit) - Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31 (OpenSSL 1.1.1f  31 Mar 2020, glibc 2.31)
[debug] exe versions: ffmpeg 4.2.7, ffprobe 4.2.7
[debug] Optional libraries: Cryptodome-3.19.0, brotli-1.1.0, certifi-2023.07.22, mutagen-1.47.0, sqlite3-3.31.1, websockets-11.0.3
[debug] Proxy map: {}
[debug] Extractor Plugins: Auf1IE, Auf1RadioIE, BrighteonIE, BrighteonRadioIE, BrighteonTvIE, PmWissenIE, PmWissenSearchIE, ServusSearchIE, ServusTVIE
[debug] Plugin directories: ['python3.10/site-packages/yt_dlp_plugins']
[debug] Loaded 1895 extractors
[generic] Extracting URL: https://video.brighteon.com/file/BTBucket-Prod/dash/b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a.mpd
[generic] b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a: Extracting information
ERROR: An extractor error has occurred. (caused by KeyError('sourceURL')); please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "python3.10/site-packages/yt_dlp/extractor/common.py", line 715, in extract
    ie_result = self._real_extract(url)
  File "python3.10/site-packages/yt_dlp/extractor/generic.py", line 2535, in _real_extract
    info_dict['formats'], info_dict['subtitles'] = self._parse_mpd_formats_and_subtitles(
  File "python3.10/site-packages/yt_dlp/extractor/common.py", line 2734, in _parse_mpd_formats_and_subtitles
    representation_ms_info = extract_multisegment_info(representation, adaption_set_ms_info)
  File "python3.10/site-packages/yt_dlp/extractor/common.py", line 2618, in extract_multisegment_info
    extract_Initialization(segment_list)
  File "python3.10/site-packages/yt_dlp/extractor/common.py", line 2613, in extract_Initialization
    ms_info['initialization_url'] = initialization.attrib['sourceURL']
KeyError: 'sourceURL'

emarsden commented 9 months ago

This MPD is using an Initialization element that does not include a sourceURL attribute. It only includes a range attribute that refers to a higher-level BaseURL. yt-dlp is assuming that sourceURL is always present.

BTW, dash-mpd-cli downloads this content fine.
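
To make the failure mode concrete, here is a minimal sketch using only the Python standard library. It is not yt-dlp's code: the `Initialization` fragment is a hypothetical one modelled on the description above, and `initialization_url` is an illustrative helper name. It shows that indexing `attrib` raises the same `KeyError` seen in the log, while `.get()` allows a fallback to the enclosing BaseURL.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical element: only a range attribute, no sourceURL.
init = ET.fromstring('<Initialization range="0-862"/>')

# Indexing attrib directly fails when the attribute is absent.
try:
    init.attrib['sourceURL']
    error = None
except KeyError as exc:
    error = repr(exc)
print(error)


def initialization_url(initialization, base_url):
    """Return the URL of the initialization data (illustrative helper).

    Assumption: when sourceURL is missing, the bytes named by the range
    attribute live in the resource identified by the enclosing BaseURL.
    """
    source_url = initialization.get('sourceURL')  # None if absent
    return urljoin(base_url, source_url) if source_url else base_url


print(initialization_url(init, 'https://example.com/dash/video.mp4'),
      init.get('range'))
```

With a fallback like this, the caller would still need to request only the byte range given in `range` rather than the whole file.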

bashonly commented 9 months ago

This has the same underlying cause as #5288. That site has apparently changed, though, so #5288 may no longer be useful for tracking the MPD sourceURL problem. Keeping this one open.

dirkf commented 9 months ago

In https://github.com/ytdl-org/youtube-dl/issues/32595#issuecomment-1761209532, I back-ported yt-dlp's _parse_mpd_formats_and_subtitles() and modified it to address this issue.

The old code instantiated a BaseURL at the representation level by merging BaseURLs up the XML hierarchy and finally adding default URL components from the mpd_base_url, but didn't use any default for media URL attributes.

My approach was to pull out the BaseURL processing so that as the hierarchy is descended whatever BaseURL has been constructed so far can be passed, if it isn't a partial path, with key base_url in the parent info, and then used as a default for any missing media URLs.
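
The idea in the two paragraphs above can be sketched roughly as follows. This is not the back-ported code, only a standard-library illustration: the manifest, the helper names `child_info` and `collect_init_urls`, and the example URL are all hypothetical. As a simplification, the descent starts from the absolute manifest URL, so the base carried under `base_url` is never a partial path.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

NS = {'mpd': 'urn:mpeg:dash:schema:mpd:2011'}

# Hypothetical manifest: BaseURLs at two levels; representation 0 has an
# Initialization without sourceURL, representation 1 has one with it.
SAMPLE = '''<MPD xmlns="urn:mpeg:dash:schema:mpd:2011">
  <Period>
    <BaseURL>media/</BaseURL>
    <AdaptationSet>
      <Representation id="0">
        <BaseURL>video_300k.mp4</BaseURL>
        <SegmentList><Initialization range="0-862"/></SegmentList>
      </Representation>
      <Representation id="1">
        <SegmentList><Initialization sourceURL="init_600k.mp4"/></SegmentList>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>'''


def child_info(parent_info, element):
    """Extend the base URL built so far with this element's BaseURL, if any."""
    text = element.findtext('mpd:BaseURL', default='', namespaces=NS).strip()
    base = urljoin(parent_info['base_url'], text) if text else parent_info['base_url']
    return {'base_url': base}


def collect_init_urls(mpd_xml, mpd_url):
    """Map each Representation id to its initialization URL."""
    root = ET.fromstring(mpd_xml)
    urls = {}
    mpd_info = child_info({'base_url': mpd_url}, root)
    for period in root.findall('mpd:Period', NS):
        period_info = child_info(mpd_info, period)
        for aset in period.findall('mpd:AdaptationSet', NS):
            aset_info = child_info(period_info, aset)
            for rep in aset.findall('mpd:Representation', NS):
                rep_info = child_info(aset_info, rep)
                init = rep.find('mpd:SegmentList/mpd:Initialization', NS)
                if init is None:
                    continue
                source = init.get('sourceURL')
                # A missing media URL defaults to the base built so far.
                urls[rep.get('id')] = (
                    urljoin(rep_info['base_url'], source) if source
                    else rep_info['base_url'])
    return urls


print(collect_init_urls(SAMPLE, 'https://example.com/dash/manifest.mpd'))
```

Passing the partially built base down the hierarchy means every level can supply a default, instead of only resolving BaseURL once at the representation level.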

There may be better ways. This sort of DASH format may even be invalid. But this is what happens with OP's link:

$ python -m youtube_dl -v -F 'https://video.brighteon.com/file/BTBucket-Prod/dash/b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a.mpd'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://video.brighteon.com/file/BTBucket-Prod/dash/b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a.mpd']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 66ab0814c
[debug] Python 2.7.18 (CPython i686 32bit) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial - OpenSSL 1.1.1w  11 Sep 2023 - glibc 2.15
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[generic] b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a: Requesting header
WARNING: Falling back on generic information extractor.
[generic] b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a: Downloading webpage
[generic] b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a: Extracting information
[info] Available formats for b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a:
format code  extension  resolution note
1            m4a        audio only [eng] DASH audio    0k , m4a_dash container, mp4a.40.2 (44100Hz)
0            mp4        480x270    [eng] DASH video  300k , mp4_dash container, avc1.640015, video only
2            mp4        960x540    [eng] DASH video  600k , mp4_dash container, avc1.64001f, video only (best)
$

crowetic commented 8 months ago

Has this been added to the code? If I build from source, are your changes included?

dirkf commented 8 months ago

Not even in a PR at yt-dl yet, let alone here.

crowetic commented 8 months ago

damn, okay... please let me know if it does become a thing...

(or if you happen to have a linux version I can test with your changes?)

thank you!

miscellaneous01 commented 6 months ago

I'm a bit of a noob here, so first of all: is #8959 closed? You said to submit a ticket, and I did. Was it revoked? If so, should I close my account and stop sending these?

bashonly commented 6 months ago

@miscellaneous01 it was a duplicate of this issue. There is no need for 2 reports to track 1 bug

miscellaneous01 commented 6 months ago

I couldn't find a way to search for tickets. GitHub doesn't supply a yt-dlp range search, right?

bashonly commented 6 months ago

@miscellaneous01 GitHub has a search function but it's not very good. Don't worry about it, this happens all the time

dirkf commented 5 months ago

In the duplicate issues #8655, #8959, #9012, the problem URLs from brighteon.com appear to generate a 2-item playlist, where the first item has A-V and matching video-only formats, plus an audio-only format, and the second is just mp3. Is that expected?

flashdagger commented 5 months ago

The brighteon plugin usually presents something like this:

ID         EXT RESOLUTION FPS │   FILESIZE   TBR PROTO │ VCODEC        VBR ACODEC      ABR ASR MORE INFO
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
dash-audio m4a audio only     │ ~ 39.77MiB   96k https │ audio only        mp4a.40.5   96k 48k DASH audio, m4a_dash
audio      mp3 audio only     │ ≈ 79.55MiB  192k https │ audio only        mp4a.40.2  192k 48k
dash-270p  mp4 480x270        │ ~145.01MiB  350k https │ avc1.4d401f  350k video only          DASH video, mp4_dash
hls-270p   mp4 480x270     15 │ ~103.01MiB  249k m3u8  │ avc1.4d401f       mp4a.40.5
dash-540p  mp4 960x540        │ ~621.46MiB 1500k https │ avc1.640028 1500k video only          DASH video, mp4_dash
hls-540p   mp4 960x540     30 │ ~279.05MiB  674k m3u8  │ avc1.4d401f       mp4a.40.5

The DASH streams (1 audio + 2 or 3 video) come from the MPD manifest. The HLS streams come from the m3u8 playlist, and the mp3 audio is a separate file.
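
As a rough illustration of that breakdown, the format IDs from the listing can be bucketed by origin. This is only a sketch: the IDs are hand-copied from the table above, and the `dash-`/`hls-` prefix convention and the helper names are assumptions based on that listing alone.

```python
from collections import defaultdict

# Format IDs copied from the listing above.
FORMAT_IDS = ['dash-audio', 'audio', 'dash-270p', 'hls-270p',
              'dash-540p', 'hls-540p']


def source_of(format_id):
    """Guess where a format came from, based on its ID prefix (assumption)."""
    if format_id.startswith('dash-'):
        return 'MPD manifest'
    if format_id.startswith('hls-'):
        return 'm3u8 playlist'
    return 'separate file'


def group_by_source(format_ids):
    """Group format IDs by their assumed origin, keeping listing order."""
    groups = defaultdict(list)
    for format_id in format_ids:
        groups[source_of(format_id)].append(format_id)
    return dict(groups)


for source, ids in group_by_source(FORMAT_IDS).items():
    print(source, ids)
```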

fireattack commented 5 months ago

I don't think that plugin did anything special about this issue; it just skips problematic MPDs.

As for the DASH formats it does return, maybe it just extracted them from other, non-problematic MPDs?

Edit: wait, that's your plugin! Then I have no idea what you meant.

dirkf commented 5 months ago

Then I suppose that the extraction as a 2-item playlist is an artefact of the upstream generic extractor.

flashdagger commented 5 months ago

I just described which formats should be expected when the MPD is finally parsed. My plugin does not try to solve the issue, as MPD-parsing is a yt-dlp core functionality.

@dirkf: If you use the generic extractor, you also get all formats. The only difference is that the mp3 is an additional playlist item, but that's how the extractor works, I suppose...

dirkf commented 5 months ago

See https://github.com/ytdl-org/youtube-dl/pull/32710:

$ python -m youtube_dl -vF 'https://video.brighteon.com/file/BTBucket-Prod/dash/b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a.mpd'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-vF', u'https://video.brighteon.com/file/BTBucket-Prod/dash/b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a.mpd']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 630da9eb7
[debug] Python 2.7.18 (CPython i686 32bit) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial - OpenSSL 1.1.1w  11 Sep 2023 - glibc 2.15
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[generic] b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a: Requesting header
WARNING: Falling back on generic information extractor.
[generic] b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a: Downloading webpage
[generic] b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a: Extracting information
[info] Available formats for b00477ec-6e1b-4ab8-a42b-43b6cdf18c0a:
format code  extension  resolution note
1            m4a        audio only [eng] DASH audio    0k , m4a_dash container, mp4a.40.2 (44100Hz)
0            mp4        480x270    [eng] DASH video  300k , mp4_dash container, avc1.640015, video only
2            mp4        960x540    [eng] DASH video  600k , mp4_dash container, avc1.64001f, video only (best)
$

introspectionism commented 3 weeks ago

[username@host downloads]$ yt-dlp https://www.brighteon.com/9f2be9d4-1600-4002-a836-f3605746d3cc
[generic] Extracting URL: https://www.brighteon.com/9f2be9d4-1600-4002-a836-f3605746d3cc
[generic] 9f2be9d4-1600-4002-a836-f3605746d3cc: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] 9f2be9d4-1600-4002-a836-f3605746d3cc: Extracting information
[html5] 9f2be9d4-1600-4002-a836-f3605746d3cc: Downloading m3u8 information
[html5] 9f2be9d4-1600-4002-a836-f3605746d3cc: Downloading MPD manifest
ERROR: An extractor error has occurred. (caused by KeyError('sourceURL')); please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
[username@host downloads]$ yt-dlp -U
Latest version: stable@2024.05.27 from yt-dlp/yt-dlp
yt-dlp is up to date (stable@2024.05.27 from yt-dlp/yt-dlp)