ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

C-SPAN videos start early and end early #13030

Closed johnhawkinson closed 3 years ago

johnhawkinson commented 7 years ago

Downloading https://www.c-span.org/video/?427577-1/sally-yates-james-clapper-testify-russian-interference-2016-election&live after-the-fact (not live), I get 21 segments, but the first segment starts too soon (not a big problem) and the final segment ends too soon (a real issue).

Per the clock in the upper-right corner of the video, Russian Interference in 2016 Election part 21-476924_21.mp4 ends at 5:24 pm ET. Reviewing it in the in-browser flash player, the video ends at 5:43 pm ET, so approx 19-20 minutes later.

pb3:Downloads jhawk$ ythls --write-pages -v 'https://www.c-span.org/video/?427577-1/sally-yates-james-clapper-testify-russian-interference-2016-election&live'
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'--no-part', u'--hls-use-mpegts', u'--write-pages', u'-v', u'https://www.c-span.org/video/?427577-1/sally-yates-james-clapper-testify-russian-interference-2016-election&live']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.05.07
[debug] Python version 2.7.10 - Darwin-14.5.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg git-2017-02-28-7f62368, ffprobe git-2017-02-28-7f62368, rtmpdump 2.4
[debug] Proxy map: {}
[CSpan] 427577: Downloading webpage
[CSpan] Saving request to 427577_https_-_www.c-span.org_video_427577-1_sally-yates-james-clapper-testify-russian-interference-2016-election_live.dump
[CSpan] 476924: Downloading JSON metadata
[CSpan] Saving request to 476924_http_-_www.c-span.org_assets_player_ajax-player.phpos=android_html5=program_id=476924.dump
[CSpan] 476924: Downloading XML
[CSpan] Saving request to 476924_http_-_www.c-span.org_common_services_flashXml.phpprogramid=476924.dump
[download] Downloading playlist: Russian Interference in 2016 Election
[CSpan] playlist Russian Interference in 2016 Election: Collected 21 video ids (downloading 21 of them)
[download] Downloading video 1 of 21
[debug] Invoking downloader on u'https://media.c-spanvideo.org/dynamic/2017/05/08/20170508135928003_hd/20170508135928003_hd.MP4-M20.mp4?Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9tZWRpYS5jLXNwYW52aWRlby5vcmcvZHluYW1pYy8yMDE3LzA1LzA4LzIwMTcwNTA4MTM1OTI4MDAzX2hkLzIwMTcwNTA4MTM1OTI4MDAzX2hkLk1QNC1NMjAubXA0IiwiQ29uZGl0aW9uIjp7IkRhdGVMZXNzVGhhbiI6eyJBV1M6RXBvY2hUaW1lIjoxNDk0MjkzNjI5fSwiSVBBZGRyZXNzIjp7IkFXUzpTb3VyY2VJcCI6IjQ1LjQ3LjI3LjIyMyJ9fX1dfQ__&Signature=QABCrvxeB1BSD5UfrkViaS2OvXELVIYZn0Fo~9EBX6pEXoc~sm8X9o4ltecVcDqUBYp8jXivoPNLXrt1kAEwnVsK5t7-so8dXyzSjx385hZde0QUuixFIUaTdH8aWt3Ps24BUshjxPBPyWMP9Uevv7DD7QuwV53qViGHY1t8wYI_&Key-Pair-Id=APKAIHKVWBEAXX562G7Q'
[download] Destination: Russian Interference in 2016 Election part 1-476924_1.mp4
[download] 100% of 283.81MiB in 01:52
...
[download] Downloading video 21 of 21
[debug] Invoking downloader on u'https://media.c-spanvideo.org/dynamic/5minute/20170508171958003_hd/20170508171958003_hd.MP4-M20.mp4?Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9tZWRpYS5jLXNwYW52aWRlby5vcmcvZHluYW1pYy81bWludXRlLzIwMTcwNTA4MTcxOTU4MDAzX2hkLzIwMTcwNTA4MTcxOTU4MDAzX2hkLk1QNC1NMjAubXA0IiwiQ29uZGl0aW9uIjp7IkRhdGVMZXNzVGhhbiI6eyJBV1M6RXBvY2hUaW1lIjoxNDk0MjkzNjI5fSwiSVBBZGRyZXNzIjp7IkFXUzpTb3VyY2VJcCI6IjQ1LjQ3LjI3LjIyMyJ9fX1dfQ__&Signature=b1GOgVz5t~tw~l-0pbcsCw1AGHeOGjQDtDSs3w5cUIoABWj4YnqBUWAGZYYD840vqVOR5Qca2Av4mdCyY9VOqKU79o5j7SYGPlHky63ofRQ22o8uziYz5De6Z0lPc5BIsIwaKWpSvy89e7D8GbOVOH0Qe5kmL5BsgfLuGMrWxYE_&Key-Pair-Id=APKAIHKVWBEAXX562G7Q'
[download] Destination: Russian Interference in 2016 Election part 21-476924_21.mp4
[download] 100% of 85.23MiB in 00:21
[download] Finished downloading playlist: Russian Interference in 2016 Election

In the time since I ran that (and then watched the video at realtime speed), it seems to have been repacked into 7 pieces instead of 21. It also has lost the LIVE tag and the timestamp in the upper-right corner, so it's tougher to tell at what point it cuts off. But it definitely does not reach the end. (update: see below)

Note that this is the same content that raised #13028 about livestream downloading, but the issues are not really related.

johnhawkinson commented 7 years ago

Update:

it seems to have been repacked into 7 pieces instead of 21. It also has lost the LIVE tag and the timestamp in the upper-right corner, so it's tougher to tell at what point it cuts off. But it definitely does not reach the end.

I was misled. It does indeed now reach the end of the hearing (5:43 pm ET at 44:06 in the 1 hour video file), after which point it switches to commentary and replays edited excerpts from the hearing in the remaining 15 minutes.
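For anyone checking cut-off points the same way, the mapping between the on-screen clock and the position in a file is simple arithmetic. The sketch below is only an illustration: the helper names are made up, the date is taken from the media URL in the log above, and it assumes the on-screen clock runs continuously with no edits inside the file.

```python
from datetime import datetime, timedelta


def wall_clock_at(offset, anchor_clock, anchor_offset):
    """On-screen clock time at a given offset into the file."""
    return anchor_clock + (offset - anchor_offset)


def offset_at(clock, anchor_clock, anchor_offset):
    """Offset into the file at which a given on-screen clock time appears."""
    return anchor_offset + (clock - anchor_clock)


# One known pairing: the clock reads 5:43 pm ET at 44:06 into the file.
anchor_clock = datetime(2017, 5, 8, 17, 43)
anchor_offset = timedelta(minutes=44, seconds=6)

# Clock time at the very start of the file.
file_start = wall_clock_at(timedelta(0), anchor_clock, anchor_offset)

# Where 5:24 pm ET (the point at which the 21-part download stopped) would fall.
early_end = offset_at(datetime(2017, 5, 8, 17, 24), anchor_clock, anchor_offset)

print(file_start.time(), early_end)
```

Since the on-screen clock only shows minutes, results are good to about a minute at best.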

johnhawkinson commented 7 years ago

There's also something troubling about the way cspan.py produces playlists instead of single videos, as well as its implementation. That is, it does something unexpected, but it also does that unexpected thing incorrectly.


The HTML player for C-SPAN pages presents a single continuous video. Youtube-dl should therefore present the same number of videos to the user, but instead it presents a series of broken-up videos of various lengths in a playlist. Also, sometimes these videos overlap in time (e.g. 5-minute length videos that overlap by 5 seconds, or 1-hour videos that overlap by 5 minutes), meaning they are not directly amenable to concatenation, and they do not produce a seamless playback experience.
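To make the overlap problem concrete, here is a minimal sketch of the bookkeeping needed before parts could be joined cleanly. It assumes each part's start offset on a common timeline is known, which is hypothetical: nothing in this thread shows that the extractor exposes such offsets. The `Segment` type, `trims_for_concat`, and the sample numbers are all made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    name: str
    start: float     # seconds from the start of the programme (hypothetical metadata)
    duration: float  # seconds


def trims_for_concat(segments: List[Segment]) -> List[Tuple[str, float, float]]:
    """Return (name, skip, keep) per segment so the kept pieces abut without overlap.

    skip: seconds to drop from the head of the segment
    keep: seconds to keep after skipping (0 if the segment is fully covered already)
    """
    plan = []
    covered_until = 0.0
    for seg in sorted(segments, key=lambda s: s.start):
        end = seg.start + seg.duration
        skip = max(0.0, covered_until - seg.start)
        keep = max(0.0, seg.duration - skip)
        plan.append((seg.name, skip, keep))
        covered_until = max(covered_until, end)
    return plan


# Three 5-minute parts, each overlapping the previous one by 5 seconds.
parts = [
    Segment("part1", 0, 300),
    Segment("part2", 295, 300),
    Segment("part3", 590, 300),
]
plan = trims_for_concat(parts)
for name, skip, keep in plan:
    print(name, skip, keep)
```

A part that is wholly contained in earlier material (like a 1-hour clip sitting inside a full-length one) comes out with `keep` equal to 0, i.e. it can be dropped entirely.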

It's also a bit challenging to figure out what to download if you want a limited time-window of a long set of videos. I guess youtube-dl -j can be used along with careful parsing of the JSON output to figure out how long each video is and to try to figure out the timing (or maybe -J), but that's not straightforward.
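A rough sketch of what that parsing might look like, working from `youtube-dl -J URL > playlist.json`. It relies only on the `entries[].id` and `entries[].duration` fields, and it assumes the parts play back to back with no overlap and that the reported durations are accurate, neither of which is guaranteed for C-SPAN. The function names and the sample IDs are invented for the example.

```python
import json
from typing import Dict, List, Tuple


def entry_spans(playlist: Dict) -> List[Tuple[str, float, float]]:
    """(id, start, end) per entry, assuming back-to-back parts with no overlap."""
    spans = []
    t = 0.0
    for entry in playlist.get("entries", []):
        dur = float(entry.get("duration") or 0)  # a missing duration counts as 0
        spans.append((entry["id"], t, t + dur))
        t += dur
    return spans


def parts_in_window(spans, begin: float, end: float) -> List[str]:
    """IDs of the parts that intersect the half-open window [begin, end)."""
    return [pid for pid, s, e in spans if s < end and e > begin]


# Stand-in for the contents of playlist.json; only the two fields used here.
sample = json.loads(
    '{"entries": ['
    '{"id": "a_1", "duration": 300},'
    '{"id": "a_2", "duration": 300},'
    '{"id": "a_3", "duration": 120}]}'
)
spans = entry_spans(sample)
wanted = parts_in_window(spans, 290, 310)
print(wanted)
```

With real output one would replace `sample` with `json.load(open("playlist.json"))`. The result is only as trustworthy as the reported durations.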

The videos in prior comments above were recent and they seemed to be changing.


An older example that seems less likely to change is https://www.c-span.org/video/?424989-1/deputy-associate-attorneys-general-testify-confirmation-hearing from March 2017.

There youtube-dl claims there is a 6-part playlist, but the first entry appears to be a 3hr34min long clip of the entirety of the hearing. The next is a 1-hour long clip that overlaps with the start of the hearing.

So presumably something has gone wrong in cspan.py.

I'm not sure how to succinctly summarize this, and also the JSON output seems to disagree with what I actually get with respect to durations. That is:

pb3:Downloads jhawk$ youtube-dl -J https://www.c-span.org/video/?424989-1/deputy-associate-attorneys-general-testify-confirmation-hearing > 13.js
pb3:Downloads jhawk$ < 13.js2 jq -c '.entries[] | [ .id,.duration]' 
["472626_1",12868]
["472626_2",736]
["472626_3",3600]
["472626_4",3600]
["472626_5",3600]
["472626_6",1337]

Which seems to suggest the first entry should be 12868 (3hr34min+some seconds) which seems right. But then the next is 736 when really it should be 3600.

But here's what I actually get:

pb3:Downloads jhawk$ for i in Deputy\ Attorney\ General\ and\ Associate\ Attorney\ General\ Nominations\ part\ *; do echo $i; ffprobe "$i" 2>&1 | grep Duration; done
Deputy Attorney General and Associate Attorney General Nominations part 1-472626_1.mp4
  Duration: 03:34:27.54, start: 0.000000, bitrate: 625 kb/s
Deputy Attorney General and Associate Attorney General Nominations part 2-472626_2.mp4
  Duration: 01:00:08.58, start: 0.000000, bitrate: 776 kb/s
Deputy Attorney General and Associate Attorney General Nominations part 3-472626_3.mp4
  Duration: 01:00:08.51, start: 0.000000, bitrate: 674 kb/s
Deputy Attorney General and Associate Attorney General Nominations part 4-472626_4.mp4
  Duration: 01:00:08.48, start: 0.000000, bitrate: 636 kb/s
Deputy Attorney General and Associate Attorney General Nominations part 5-472626_5.mp4
  Duration: 01:00:08.74, start: 0.000000, bitrate: 635 kb/s
Deputy Attorney General and Associate Attorney General Nominations part 6-472626_6.mp4
  Duration: 01:00:08.54, start: 0.000000, bitrate: 738 kb/s

not sure what to make of that.
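One way to line the two sets of numbers up is to convert the ffprobe `Duration:` fields to seconds and compare them with the durations from the JSON. The sketch below uses the values pasted above; the 15-second tolerance is an arbitrary choice for the example, and the helper name is made up.

```python
import re

DURATION_RE = re.compile(r"Duration:\s*(\d+):(\d{2}):(\d{2}(?:\.\d+)?)")


def ffprobe_seconds(line: str) -> float:
    """Convert an ffprobe 'Duration: HH:MM:SS.ss' field to seconds."""
    m = DURATION_RE.search(line)
    if m is None:
        raise ValueError("no Duration field in %r" % line)
    hours, minutes, seconds = m.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Durations reported in the JSON metadata (seconds).
reported = {
    "472626_1": 12868, "472626_2": 736, "472626_3": 3600,
    "472626_4": 3600, "472626_5": 3600, "472626_6": 1337,
}
# Duration fields ffprobe printed for the downloaded files.
probed = {
    "472626_1": "Duration: 03:34:27.54", "472626_2": "Duration: 01:00:08.58",
    "472626_3": "Duration: 01:00:08.51", "472626_4": "Duration: 01:00:08.48",
    "472626_5": "Duration: 01:00:08.74", "472626_6": "Duration: 01:00:08.54",
}

TOLERANCE = 15  # seconds; arbitrary threshold for "close enough"
mismatched = [
    pid for pid in reported
    if abs(ffprobe_seconds(probed[pid]) - reported[pid]) > TOLERANCE
]
print(mismatched)
```

On these numbers, parts 2 and 6 are the ones where the metadata (736 and 1337 seconds) disagrees with the roughly one-hour files; the other four agree to within about 9 seconds.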


Oh, because it's not voluminous like the actual output, maybe -F is instructive:

pb3:Downloads jhawk$ youtube-dl -vF https://www.c-span.org/video/?424989-1/deputy-associate-attorneys-general-testify-confirmation-hearing
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-vF', u'https://www.c-span.org/video/?424989-1/deputy-associate-attorneys-general-testify-confirmation-hearing']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.05.07
[debug] Python version 2.7.10 - Darwin-14.5.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg git-2017-02-28-7f62368, ffprobe git-2017-02-28-7f62368, rtmpdump 2.4
[debug] Proxy map: {}
[CSpan] 424989: Downloading webpage
[CSpan] 472626: Downloading JSON metadata
[CSpan] 472626: Downloading XML
[download] Downloading playlist: Deputy Attorney General and Associate Attorney General Nominations
[CSpan] playlist Deputy Attorney General and Associate Attorney General Nominations: Collected 6 video ids (downloading 6 of them)
[download] Downloading video 1 of 6
[info] Available formats for 472626_1:
format code  extension  resolution note
224-234p     mp4        234p        224k 
628-360p     mp4        360p        628k 
4000-576p    mp4        576p       4000k  (best)
[download] Downloading video 2 of 6
[info] Available formats for 472626_2:
format code  extension  resolution note
224-234p     mp4        234p        224k 
628-360p     mp4        360p        628k 
4000-576p    mp4        576p       4000k  (best)
[download] Downloading video 3 of 6
[info] Available formats for 472626_3:
format code  extension  resolution note
224-234p     mp4        234p        224k 
628-360p     mp4        360p        628k 
4000-576p    mp4        576p       4000k  (best)
[download] Downloading video 4 of 6
[info] Available formats for 472626_4:
format code  extension  resolution note
224-234p     mp4        234p        224k 
628-360p     mp4        360p        628k 
4000-576p    mp4        576p       4000k  (best)
[download] Downloading video 5 of 6
[info] Available formats for 472626_5:
format code  extension  resolution note
224-234p     mp4        234p        224k 
628-360p     mp4        360p        628k 
4000-576p    mp4        576p       4000k  (best)
[download] Downloading video 6 of 6
[info] Available formats for 472626_6:
format code  extension  resolution note
224-234p     mp4        234p        224k 
628-360p     mp4        360p        628k 
4000-576p    mp4        576p       4000k  (best)
[download] Finished downloading playlist: Deputy Attorney General and Associate Attorney General Nominations
pb3:Downloads jhawk$ 

remitamine commented 7 years ago

it's a known issue #10662.

johnhawkinson commented 7 years ago

I read 10662 to be about starting early and ending late, giving you a superset of the problem video.

But here we describe starting late ending early (missing content), as well as major duplication of content (indeed, wholesale duplication of 3+ hours of content), as well as minor duplication (seconds on clips that last minutes), so it's a larger problem.

Hrmm.

remitamine commented 7 years ago

But here we describe starting late (missing content)

i'm not sure about that, youtube-dl downloads all the parts served by c-span.

as well as major duplication of content (indeed, wholesale duplication of 3+ hours of content), as well as minor duplication (seconds on clips that last minutes)

this is the same problem as in the referenced issue: the real content is part of what is downloaded (the overlapping happens because the video is split into multiple parts). i did explain in https://github.com/rg3/youtube-dl/issues/10662#issuecomment-250704716 why it's happening and how to fix it; however, the part that needs to be enabled depends on #8851.

johnhawkinson commented 7 years ago

i'm not sure about that, youtube-dl downloads all the parts served by c-span.

Sorry, this is confusing and I just made it worse. I should have said ending early (not starting late); I've corrected the above comment. The ending early was what I originally described here:

Downloading https://www.c-span.org/video/?427577-1/sally-yates-james-clapper-testify-russian-interference-2016-election&live after-the-fact (not live), I get 21 segments, but the first segment starts too soon (not a big problem) and the final segment ends too soon (a real issue).

But unfortunately CSPAN has re-encoded this segment so there is no longer a viable test case. (It seems a bit ridiculous that there is a Brightcove, then a 21-segment encoding, then a 7-segment encoding, and maybe then a whole-program encoding. But it is what it is.)

I'm pretty sure that the problem of ending early was not visible in the HTML player, although what I saw may have been skewed by the time it took me to watch the content, so I may have seen early ending during the 21-segment encoding but not during the 7-segment. Bah, what a mess.

bonacker commented 7 years ago

During the airing of a live show and in the hour or two after a C-Span video has aired for the first time, YT-dl downloads the show broken up into 5 min. segments. Shortly after that, the number of segments decreases to 2-4 for a one hour program. Eventually, perhaps 6-12 hours later, the whole video is contained in one or two segments.

It has been like this for a long time and is not a problem. I have successfully joined a dozen or more 5 minute segments using a tiny little free app, mp4joiner, but usually I wait for C-Span to do the joining. One show I usually watch, the Sunday Q & A program, usually downloads as two one hour segments. One of them may begin a couple of minutes before the Q & A show, while the other may include some of the show that follows the Q & A.
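For those who would rather script the joining, a minimal sketch follows. It builds the list file that ffmpeg's concat demuxer reads, ordering the parts numerically by the "part N-" number that youtube-dl puts in the filenames. The function names and sample filenames are made up. Note that a stream-copy join like this does not remove any overlap between parts.

```python
import re
from typing import List


def part_number(name: str) -> int:
    """Extract N from '... part N-<id>_N.mp4'; unknown names sort first."""
    m = re.search(r"part (\d+)-", name)
    return int(m.group(1)) if m else 0


def concat_list(filenames: List[str]) -> str:
    """Text of a list file for ffmpeg's concat demuxer, in part order."""
    lines = []
    for name in sorted(filenames, key=part_number):
        # Inside single quotes, a literal quote is written as '\''
        escaped = name.replace("'", "'\\''")
        lines.append("file '%s'" % escaped)
    return "\n".join(lines) + "\n"


files = ["Show part 10-1_10.mp4", "Show part 2-1_2.mp4", "Show part 1-1_1.mp4"]
listing = concat_list(files)
print(listing, end="")
```

Saving that text as `parts.txt` and running `ffmpeg -f concat -safe 0 -i parts.txt -c copy joined.mp4` should then join the files without re-encoding.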

The point is that it is easy to view an entire program even if a file contains more than just the show you want to view or less. You have to be a little flexible. I am grateful that C-Span and YT-DL make it possible to download all C-Span shows, albeit in a somewhat unorthodox way. In past years, it was not always possible to download C-Span shows.

johnhawkinson commented 7 years ago

I'm not really clear what @bonacker's intent is, but:

You have to be a little flexible.

Well, no. We should note all the limitations and then file pull requests to implement changes to correct them. Do not accept mediocrity, but strive to make the software as good as we can!

Separately:

bonacker commented 7 years ago

My points were more philosophical than technical, but here is one semi-technical point I made:

I pointed out that based on my extensive downloading of C-Span programs via YT-DL, I know that those five minute files are a short-lived phenomenon that only exist for an hour or so after a show first airs. If one has a little patience to wait a couple of hours or, more commonly, if one is downloading an archived show from a day, a week or three years ago, one will never encounter one of those five minute files.

"Do not accept mediocrity, but strive to make the software as good as we can!"

I am a huge fan of YT-DL and am enormously grateful for all the coders who maintain functioning extractors, but YT-Dl is a kluge, and taking that fact into account, IMO, there is nothing mediocre about the user experience of downloading C-Span programs via YT-DL. I get to see all the C-Span videos that I want to see in their entirety and spend an insignificant amount of extra time in the process due to those tiny glitches that so disturb the above poster.

“I'm not really clear what @bonacker's intent is”

While it is technically a bug that YT-DL does not exactly 100% mimic the behavior of the player in which C-Span wants its viewers to stream its videos, in my view there are better ways to spend one's time, including developing extractors for new sites or fixing broken ones than expending time and effort to fix an extractor that is reliably delivering 99% of its mission. Coding perfection is a nice goal but less important than overall user experience, especially in a kluge program that surprises by working at all, given the lack of “cooperation” of its “partners,” i.e. YT, C-Span etc. I realize others, like the poster to whom I am replying, disagree.