RTS video channels have subtitles but ytdl does not support them

ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites

http://ytdl-org.github.io/youtube-dl/

The Unlicense


RTS video channels have subtitles but ytdl does not support them #21438

Open boulderob opened 5 years ago

boulderob commented 5 years ago

Checklist

[x] I'm reporting a site feature request
[x] I've verified that I'm running youtube-dl version 2019.06.08
[x] I've searched the bugtracker for similar site feature requests including closed ones

Description

the rts.ch website has numerous video show channels and most of them have support for subtitles. for example if you play either of these two videos (from two different program channels on rts) in the browser you can see that you can enable and disable subtitles on the fly and that they are not hard coded / embedded in the video itself.

https://www.rts.ch/play/tv/passe-moi-les-jumelles/video/teddy-des-papillons-dans-les-yeux--bernard-sauveur-de-greniers?id=9524098

https://www.rts.ch/play/tv/temps-present/video/50-ans-les-romands-dans-loeil-de-temps-present-45-il-etait-une-fois-les-migrants-italiens?id=10421935

the problem is that when i try to use youtube-dl to get the subtitle files it can't find them. is there a way to update the code to locate and retrieve the subtitle files b/c they are indeed there?

i'm using the latest version

 youtube-dl --version
2019.06.08

here's the output from the command to list the subs for one of the videos above:

youtube-dl --list-subs https://www.rts.ch/play/tv/passe-moi-les-jumelles/video/teddy-des-papillons-dans-les-yeux--bernard-sauveur-de-greniers?id=9524098
[SRGSSR] 9524098: Downloading JSON metadata
[SRGSSR] 9524098: Downloading HTTP-HLS-HD token
[SRGSSR] 9524098: Downloading m3u8 information
[SRGSSR] 9524098: Downloading HTTP-HLS-SD token
[SRGSSR] 9524098: Downloading m3u8 information
[SRGSSR] 9524098: Downloading HTTP-HDS-HD token
[SRGSSR] 9524098: Downloading f4m manifest
WARNING: Unable to download f4m manifest: HTTP Error 404: Not Found
[SRGSSR] 9524098: Downloading HTTP-HDS-SD token
[SRGSSR] 9524098: Downloading f4m manifest
WARNING: Unable to download f4m manifest: HTTP Error 404: Not Found
9524098 has no subtitles

subtitle support has been lacking for rts.ch for at least 2 years. i'm just finally submitting a report

goggle commented 5 years ago

There has been effort to improve the SRGSSR extractor (e.g. #14725 and #18956), but the pull requests got ignored... Not much I can do...

boulderob commented 5 years ago

hey @goggle, the only reason i can think of why they wouldn't pull your fix is due to the fact you may have not maintained the backwards compatibility with the original extractor per your own notes in #14725?

is your #14725 fix still currently working for rts subtitles? i'm thinking of just cloning the repo locally and pulling your fix to run a separate copy for rts subtitles.

it will also allow me to learn more about the extractors so i can fix some issues with subtitles present on other sites that ytdl doesn't see but are there on playback. not being familiar with streaming video, it appears the key is to

1) find any subtitle / video "metadata" via actual page embeds or generated via js / REST api and returned during the load of site video files. (ie get the metadata)
2) then you have to just "extract" the subtitle info from the data structures...
3) in a meaningful format so that ytdl can reuse it to actually download the subtitles

ie. ytdl can be used to both display subs (--list-subs) as in step 2 above and actually do the download (--write-sub, etc) as in step 3 meaning you have to identify the underlying subtitle file so that you can later download it.. that is to say they are two different atomic ops within ytdl.

does that sound correct? for me the difficult part is finding the subtitle metadata file / delivery mechanism since i'm not a streaming expert. once that part is discovered (step one above), extracting the data should be pretty straightforward (i think?). is it fair to say that many many sites use their own proprietary methods to provide this subtitle video metadata or is it fairly consistent?

i'm trying to figure out the best way to approach this going forward b/c there are other sites where ytdl fails to find the subs and i'd just as soon fix them myself vs waiting for someone else to do it, or if someone else fixes them, waiting for the fixes to be pulled back into main

thanks for any assistance you can provide

goggle commented 5 years ago

Hey @boulderob

Feel free to use the code from #14725. I haven't tested it recently, but I guess it should still work. Extracting subtitles is a fairly easy task:

1) Find the source of the subtitles. In case of SRG SSR, the subtitle files are listed in these .jsons, together with the other metadata.
2) Tell youtube-dl about these subtitle files.

That's it. youtube-dl does the rest for you. I recommend having a look at how other information extractors handle subtitle extraction.
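As a rough illustration of the second step: youtube-dl extractors report subtitles in the returned info dict under a `subtitles` key, which maps a language code to a list of entries with `url` and `ext`. The sketch below builds that mapping from a list of subtitle entries. The input field names (`locale`, `url`, `format`) are assumptions for illustration, not confirmed SRG SSR fields, and `build_subtitles` is a hypothetical helper, not part of youtube-dl.

```python
def build_subtitles(entries):
    """Group subtitle entries by language in the shape youtube-dl expects.

    `entries` is a list of dicts. The keys 'locale', 'url' and 'format'
    are assumed names for illustration, not confirmed API fields.
    """
    subtitles = {}
    for entry in entries or []:
        url = entry.get('url')
        if not url:
            continue  # skip entries that carry no downloadable file
        lang = entry.get('locale') or 'und'  # 'und' = undetermined language
        ext = (entry.get('format') or 'vtt').lower()
        subtitles.setdefault(lang, []).append({'url': url, 'ext': ext})
    return subtitles


# Placeholder data; example.invalid is not a real subtitle host.
example = [{'locale': 'fr', 'url': 'https://example.invalid/subs.vtt', 'format': 'VTT'}]
print(build_subtitles(example))
```

An extractor would then put the result into the dict it returns, next to `id`, `title` and `formats`. Returning an empty dict when nothing is found keeps the video extraction itself working.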

To improve the SRG SSR extractor, IMO the following steps are needed:

1) Upgrade to the integrationlayer 2.0 API. This has been done in #14725.
2) Make sure that this bug (#14717) gets fixed (has been done in #14725 as well).
3) Add subtitle support.
4) Don't delete the current code, but use it as a fallback instead.

boulderob commented 5 years ago

hey @goggle thx for the quick reply.

what i mean is how do you specifically go about finding the json file names used in the regex that define the subtitles? unless i missed stg, i didn't see them embedded in the rts web pages and i couldn't find them by inspecting the network output in the firefox dev tools last time i tried to view the videos to troubleshoot this on my own. you already identified them and supplied the regexs, so it's easy to just point to them and say "they're the json files". but identifying what those urls are in the first place is the hard part. once you know what they are everything becomes easier :)

obviously the original developer who created the original extractor for RTS wasn't smart enough to find those subtitle json files either which is why they didn't include them in the extractor. you were. so my question is how do you go about determining where the subtitle info (json or other metadata file) is in the first place? that way i can leverage this work to other sites that lack the same functionality when i need to. this part about how you discover what the subtitle metadata file actually is is the missing link / mystery.

also, once you know what the json url is, you obviously just pass it to the stock ytdl _download_json method which then apparently does some heavy lifting. but searching for _download_json in the code base gives me 20 pages of search results mostly all extractor calls so i can't see what it actually does ;) where is _download_json defined? my guess is it's fairly generic and just returns a data structure that's up to the extractor to inspect what's there and figure out which info is the subtitle info yes?

thx

boulderob commented 5 years ago

cont'd : for example on a supported german tv site i'm running into the same problem. streamed videos have subs but ytdl reports none.

if you go to the actual extractor for ard (the german site):

https://github.com/ytdl-org/youtube-dl/blob/master/youtube_dl/extractor/ard.py

it tries to look for subtitles based on a var _subtitleUrl which to the best of my knowledge never gets set anywhere in the code? ;) so it's not doing anything and i'm not sure it would work even if it was set. so how do you determine what the subtitle metadata file is to even process in the first place?

here's an example video link

# youtube-dl --list-subs https://www.daserste.de/unterhaltung/krimi/tatort/videos/borowski-und-das-dunkle-netz-102.html
[ARD] borowski-und-das-dunkle-netz: Downloading XML
102 has no subtitles

goggle commented 5 years ago

Regarding the youtube-dl helper functions, that you should use whenever possible:

The helper methods that are related to the information extractor itself can be found (and are somewhat documented if you're lucky) in youtube_dl/extractor/common.py.
More general helper functions can be found in youtube_dl/utils.py.
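Conceptually, a helper like `_download_json` boils down to "fetch the URL, decode the body, parse it as JSON, and hand the resulting dict or list back to the extractor". The following is a simplified standalone stand-in using only the standard library. It is not youtube-dl's actual implementation, and `fetch_json` and `parse_json_body` are hypothetical names.

```python
import json
import urllib.request


def parse_json_body(raw, encoding='utf-8'):
    """Decode a raw HTTP body (bytes) and parse it as JSON."""
    return json.loads(raw.decode(encoding))


def fetch_json(url, timeout=10):
    """Download `url` and return the parsed JSON (a dict or a list)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_json_body(response.read())


# Offline demonstration of the parsing half only (no network access):
print(parse_json_body(b'{"chapterList": []}'))
```

What comes back is a plain Python data structure. Deciding which keys hold the subtitle information is left to the extractor.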

To figure out how things exactly work, you need to try to get some understanding of the SRG SSR API. You are on the right track when you use the Firefox or Chrome developer tools in the browser!

Let's have a look at this video: https://www.rts.ch/play/tv/couleurs-dete/video/couleurs-dete?id=10550033

Note that the most important thing in that URL is the video id, id=10550033. Using the dev tools, you can see that the browser loads the file https://il.srgssr.ch/integrationlayer/2.0/mediaComposition/byUrn/urn:rts:video:10550033.json?onlyChapters=true&vector=portalplay. The only variable component in this URL is the video id, so you should be able to replace it with any other valid video id and get the corresponding metadata.

When you take a closer look at this JSON file, you see that the subtitles are included: chapterList->0->subtitleList->0. The same scheme should hold for other videos. But beware that some metadata files may have no subtitleList entry! In that case your code should still work and not break the video extraction.
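A minimal sketch of that lookup, assuming the URL scheme shown above stays stable: build the metadata URL from the video id, then walk chapterList and subtitleList while tolerating missing keys. `metadata_url` and `subtitle_entries` are hypothetical helper names, and the structure of each subtitleList entry is left untouched because its fields are not described in this thread.

```python
METADATA_URL = (
    'https://il.srgssr.ch/integrationlayer/2.0/mediaComposition/byUrn/'
    'urn:rts:video:{video_id}.json?onlyChapters=true&vector=portalplay'
)


def metadata_url(video_id):
    """Return the mediaComposition URL for an RTS video id."""
    return METADATA_URL.format(video_id=video_id)


def subtitle_entries(media_composition):
    """Collect the subtitleList entries of every chapter.

    Missing or empty 'chapterList' / 'subtitleList' keys yield an empty
    list instead of raising, so video extraction can carry on.
    """
    found = []
    for chapter in media_composition.get('chapterList') or []:
        found.extend(chapter.get('subtitleList') or [])
    return found


print(metadata_url('10550033'))
print(subtitle_entries({'chapterList': [{}]}))  # chapter without subtitles
```

The `or []` fallbacks cover both a missing key and a key that is present but null.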

boulderob commented 5 years ago

ah.. super thx. i can see the json file. more importantly knowing this, i don't necessarily even need ytdl to access the vtt file but can view the vtt file directly in the browser or manually download it once i know the path. awesome!

part of the problem was that i was actually looking for the information after i clicked the video to actually stream thinking that the subtitle info wouldn't be available in the browser dev network console until i did so. in fact, the link to the metadata file is available on the initial page load.

ok so on the initial page load, tons of resources are downloaded. unless of course the metadata file is embedded and easily available on the initial page (which today probably is seldom the case), should i then just assume it's getting loaded via json / xhr, so that i can decrease the amount of resources i have to check in the dev console (grep on json / xhr requests) on different sites to find the subtitle metadata file or are a multitude of schemes possible?

i guess what i'm asking is how much digging you normally have to do to find the metadata info in the dev console and whether it's trial and error clicking on things or you can narrow the search down quicker via some means.. perhaps even a grep search on the resources in the dev console

update: since the video id aka vid is important, i was definitely able to narrow down the dev console resources by grepping on the vid which dramatically decreases what i have to inspect in the dev console. i guess after that it's going to be trial and error clicking on those links though until you pull up the metadata you're looking for correct?
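The same narrowing-down can be scripted if the network panel is exported as a HAR file, which both Firefox and Chrome dev tools can do. The sketch below keeps only the requests whose URL contains the video id and whose response looks like JSON. It relies on the standard HAR layout (`log.entries[].request.url` and `response.content.mimeType`); the file name `page.har` and the helper names are assumptions.

```python
import json


def load_har(path):
    """Read a HAR export (a JSON document) from disk."""
    with open(path, encoding='utf-8') as handle:
        return json.load(handle)


def matching_requests(har, video_id, mime_hint='json'):
    """Return URLs of HAR entries that mention `video_id` and look like JSON."""
    urls = []
    for entry in har.get('log', {}).get('entries', []):
        url = entry.get('request', {}).get('url', '')
        mime = entry.get('response', {}).get('content', {}).get('mimeType', '')
        if video_id in url and mime_hint in mime:
            urls.append(url)
    return urls


# Tiny inline HAR-like structure standing in for load_har('page.har'):
sample = {'log': {'entries': [
    {'request': {'url': 'https://example.invalid/meta/10550033.json'},
     'response': {'content': {'mimeType': 'application/json'}}},
    {'request': {'url': 'https://example.invalid/logo.png'},
     'response': {'content': {'mimeType': 'image/png'}}},
]}}
print(matching_requests(sample, '10550033'))
```

This only shortens the list of candidates. Deciding which of the remaining responses is the metadata file still takes a look at each one.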

one final js question for you. since the json metadata is already retrieved via xhr on initial page load, the json data must already exist and be loaded into an internal js data structure. is there any way to inspect that data structure via the dev console (ie inspect the live data on the current page in browser memory) vs just reloading the json file in a separate browser tab to see what's in it? if you have a link showing how to best do this that's fine (preferably firefox vs chrome dev tools).

regardless. thx. you helped me have a much better understanding of how to go about this. i'm going to see if i can decipher the ard stream now

goggle commented 5 years ago

It really depends on the site. SRG SSR has a great API, where all the needed information can be extracted from only one JSON file. This is not always the case. You are right, it's mostly investigating how things work by using the browser's development tools + trial and error... For SRG SSR videos, you really only need to have the video id. With that you can download the metadata json and have all the information you need!

> one final js question for you. since the json metadata is already retrieved via xhr on initial page load, the json data must already exist and be loaded into an internal js data structure. is there any way to inspect that data structure via the dev console (ie inspect the live data on the current page in browser memory) vs just reloading the json file in a separate browser tab to see what's in it? if you have a link showing how to best do this that's fine (preferably firefox vs chrome dev tools).

Yes, you can look at this json directly in the dev tools. Click on the Network tab (you may need to reload the site) and select the JSON request, which opens further information about it. Now you can click on the Response tab. This shows you the content of that JSON.

boulderob commented 5 years ago

ok thx for the tip on inspecting via the response tab. it's almost the same thing as simply retrieving the json file itself in a tab. i need to get much better with exploring the live js stack as it exists in the browser memory / dev console so i can understand fully what the js is doing in real time. i've searched in the past but didn't find much good info on the web.

fyi, i was indeed able to find the vtt file for the ard stream as well! it does indeed look like this will be highly variable from one site to another and is likely to always require a little clicking and exploring as you suggest but my guess is you get a feel for it as this only took me a few minutes to find what i was looking for. i think they may have already tried to support this in ard but maybe the underlying url regex changed and so it isn't picking it up anymore.

based upon your relative lack of success pushing your rts changes back to public master, i'm not sure trying to fork and push is the way i want to go. i might just clone public master and pull to keep it up to date. then just make my own changes locally and merge with master. if i want to pull your forked / downstream rts fix on a onetime basis into this local repo, what's the best way to do that?

thx as usual for the help

goggle commented 5 years ago

I'm not a git expert, but you can try the following: Clone the official youtube-dl repository:

git clone "https://github.com/ytdl-org/youtube-dl.git"

Change into that repository and add my fork as a remote:

cd youtube-dl/
git remote add goggle https://github.com/goggle/youtube-dl.git

Pull the srf branch from my repo:

git pull goggle srf

Note that this will lead to conflicts (since the SRG SSR information extractor has received some updates in the meantime). You need to resolve these conflicts manually by editing youtube_dl/extractor/srgssr.py.

Alternatively, you can directly use my forked repository:

git clone "https://github.com/goggle/youtube-dl.git"
cd youtube-dl/
git checkout srf

Note that by doing it this way you get a very old youtube-dl version... Both approaches should work on Linux and Mac, I have no idea about Windows...

boulderob commented 5 years ago

i was pretty much planning on something similar to the first method with a merge and then i'll just continue with the upstream repo pulls to keep it up to date. if i just pull yours i'm going to be out of date with the upstream.

last thing for you is whether you use / recommend a python debugger in standalone mode or combined with an ide / editor for any of this

goggle commented 5 years ago

I've never really used a Python debugger or sophisticated IDE to work with Python... I just use an editor (vim or VS Code). To test if things work when developing on youtube-dl, I simply try it out:

python -m youtube_dl -F URL

I usually use the caveman's method (adding some print statements to the code) to see if things work as expected...

But I can recommend using flake8 to check the code style and pylint to check the code quality.

boulderob commented 5 years ago

ok great. @goggle thx for the assistance today. i really appreciate it. i can finally start to work with certain subtitles that i haven't been able to for quite some time. cheers!

boulderob commented 5 years ago

@goggle i've been pretty successful getting the subtitles with your help but i now have an instance where ytdl is actually not finding the RTS video itself for download on a page. i've narrowed down the link to just the video itself that was embedded in a parent page but ytdl still can't recognize the link. i'm trying to find a way to get the final streamed link and then just manually download it outside of ytdl with a linux utility of some kind perhaps ffmpeg or stg else. would you have any insights on this particular link here as to how to best go about doing that:

https://player.rts.ch/p/rts/embed?urn=urn:rts:video:10468814

as i said, ytdl doesn't recognize the url so it can't process it for video dl so i'm wondering how to do it manually from the cmd line. ie nothing i see in the firefox dev console is leading me to anything (useful video links) that i can download using curl, etc or perhaps i'm using the wrong utility for a stream.

thx

goggle commented 5 years ago

The subtitles of this video seem to be hard-coded into the video, so there is no way to extract the subtitles from it. To download that video, you can still use youtube-dl by using the urn:

youtube-dl srgssr:rts:video:10468814

boulderob commented 5 years ago

i know the subs are hard-coded. i don't want to extract them i just want to download the video itself but ytdl wasn't doing that based on the links i was providing it. what you recommended does work but you did not supply an actual web url to ytdl.

for instance, this was the original rts site url i had

https://pages.rts.ch/docs/10446655-harry-dean-stanton---partly-fiction.html

ytdl doesn't find any formats with that link so it can't download it. so i narrowed it down to:

https://player.rts.ch/p/rts/embed?urn=urn:rts:video:10468814

thinking ytdl might be able to handle this but it couldn't do that either.

basically you just took the vid id and plugged it into a known format ie srgssr:rts:video:<vid id> which i guess means i can do that for any rts vid id going forward with ytdl but it's not a url :) i presume if i look at the ytdl code i will see what the url is transformed into from the string you give it. so going forward i can always do what you did for any rts video once i know the id.
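That transformation can be done mechanically for the embed url, since the urn is already sitting in its query string. A small sketch with the standard library follows. `embed_url_to_srgssr` is a hypothetical helper name, and the `srgssr:` form is simply the one used in the command earlier in this thread.

```python
from urllib.parse import urlparse, parse_qs


def embed_url_to_srgssr(url):
    """Turn '...embed?urn=urn:rts:video:ID' into 'srgssr:rts:video:ID'."""
    query = parse_qs(urlparse(url).query)
    urn = (query.get('urn') or [''])[0]
    prefix = 'urn:'
    if not urn.startswith(prefix):
        raise ValueError('no urn parameter found in %r' % url)
    return 'srgssr:' + urn[len(prefix):]


print(embed_url_to_srgssr(
    'https://player.rts.ch/p/rts/embed?urn=urn:rts:video:10468814'))
```

The printed string can be passed to youtube-dl in place of a url, as in the `youtube-dl srgssr:rts:video:10468814` command above.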

BUT.. for grins and other sites potentially not handled by ytdl.. which js / xhr is actually creating the full video url in the firefox dev console for my example above? the full url has to exist there for the ytdl code to find the actual video for download.

AND... ytdl is just a wrapper for something else (probably ffmpeg???) to handle the download and reassembly of vid stream packets .. so if i know how to obtain the actual vid download url from the step above, can't i just skip ytdl all together and run that command directly to get the video file? if so what is that command?

i guess where i'm at is that sometimes a one off download is going to make more sense then modifying or adding a new feature to the ytdl code for 1 or 2 videos. it's kind of like knowing how to fish directly when i want vs always having to go to the ytdl fish market to get my video if that makes sense.

thx

boulderob commented 5 years ago

ok i can see that it definitely is using ffmpeg and i can even see the link being used for this file when i run ytdl from the command line :) so the key is finding in the firefox dev console the vid url / file (for a given resolution) for any streamed video and just plugging it into the right incantation of ffmpeg on the command line
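One commonly used incantation for this is `ffmpeg -i <playlist url> -c copy <output file>`, which copies the audio and video streams into a new container without re-encoding. The sketch below only assembles that command. The playlist url is a placeholder, the helper names are hypothetical, and whether a given stream url still works outside the browser (for example because of expiring tokens) is an open assumption.

```python
import subprocess


def ffmpeg_copy_command(stream_url, output_path):
    """Build an ffmpeg invocation that copies streams without re-encoding."""
    return ['ffmpeg', '-i', stream_url, '-c', 'copy', output_path]


def download_stream(stream_url, output_path):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(ffmpeg_copy_command(stream_url, output_path), check=True)


# Placeholder url; substitute the playlist link found in the dev console.
print(' '.join(ffmpeg_copy_command(
    'https://example.invalid/master.m3u8', 'video.mp4')))
```

Passing the arguments as a list avoids shell quoting problems with the `&` and `?` characters that stream urls usually contain.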

goggle commented 5 years ago

The URLs that you mention are simply not supported by the SRG SSR extractor in youtube-dl at the moment. They can be added easily by editing the _VALID_URL regex in srgssr.py.
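To give an idea of what such a pattern could look like, here is a hypothetical regex for the embed URL, written with named groups in the style youtube-dl extractors use. It is not the actual `_VALID_URL` from srgssr.py, and the accepted host and id shapes are assumptions based only on the URLs in this thread.

```python
import re

# Hypothetical pattern for player.<bu>.ch embed links; not taken from srgssr.py.
EMBED_URL_RE = re.compile(
    r'''
    https?://player\.(?P<bu>[a-z]+)\.ch/p/[^/]+/embed
    \?(?:.*?&)?urn=urn:(?P=bu):(?P<type>video|audio):(?P<id>[0-9a-f\-]+)
    ''',
    re.VERBOSE,
)

match = EMBED_URL_RE.match(
    'https://player.rts.ch/p/rts/embed?urn=urn:rts:video:10468814')
if match:
    print(match.group('bu'), match.group('type'), match.group('id'))
```

The `(?P=bu)` backreference requires the business unit in the urn to match the one in the host name, which keeps the pattern from accepting mismatched links.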

I cannot answer your question about other sites. It's really different everywhere, that's also the reason why so many information extractors exist in youtube-dl.

Yes, youtube-dl uses ffmpeg. It is mostly used for post-processing, e.g. it assembles the video chunks after being downloaded by youtube-dl.

boulderob commented 5 years ago

hey @goggle, i think we're repeating the same thing here. obviously someone else figured out the rts "master url" in the past for downloading vids and all anyone needs to do now is supply new regex's to srgssr.py which extracts the vid id to plug into that master download path. all i'm saying is that even if someone else has already figured out the master url for me and plugged it into ytdl to leverage off of, if i can't figure out how to get that rts master path on my own using browser dev tools, i'm missing the real skill set i need to fully work with streaming video and deconstruct the js in the browser for any site. since i already know what that final master path is supposed to be here b/c it's already defined for me, it's probably easier to start and master the detective work with rts first, since i have the end pieces and know what i'm looking for now, before i move on to other sites. i hope that makes sense.

goggle commented 5 years ago

@boulderob Did you have any success in what you were trying to achieve?

boulderob commented 4 years ago

hey @goggle, it's been awhile and i'm sorry to say i didn't get around to doing this in terms of making any necessary changes for a pull request or anything. i have been able to write some basic scripts that allow me to pull subs when i need them which is quicker than trying to figure out the requirements for a full ytdl fix.

two things though

1) based upon the short message i saw in your original pull request from the project owner, i think the reason your original pull request got denied is because you didn't merge your branch with upstream master. if you performed this merge, i think the project maintainer would be able to accept your PR.

2) i have a new question that i've been wracking my brain on related to .ch video subs. it's at srf though. check out this link here:

https://www.srf.ch/play/tv/donna-leon/video/donna-leon---tierische-profite?id=6e5879c7-7292-46c0-b070-020c3a07cef1

i can find the vtt subs for this video. it's via 2 degrees of separation though because they are segmented and stream from an m3u8 file! based on research it looks like the subtitles for videos on this site all behave the same way.

with some ffmpeg foo, i can remux the segments from the m3u8 file into a single vtt file that looks legit. the problem is that unlike every other downloaded video where the mp4 file and the subtitle vtt file are separate but vlc is able to recognize the vtt file if it's named the same, vlc is recognizing my muxed vtt file but it never displays any actual subtitles in vlc mac! all my other downloaded mp4s work great with separate vtt files though so it's only these muxed vtt files that don't work!

i've spent a lot of time on this and am getting nowhere. i am not finding any other muxed vtt / subtitle files in the codebase of ytdl. they appear to be mostly single media files that you just download and everything works. note that if i use ffmpeg to burn the vtt muxed file into the mp4 the subtitles work! however, i don't want to do that and if i get around to making this work and want to check in a fix, i have to have a solution that works natively with ytdl.

my guess is it has to do with the muxed format. if you're busy no worries. i just thought you might have some insights on this. thx
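For comparison with the ffmpeg remux, here is a plain-text way to stitch segmented subtitles together, assuming the playlist is a standard HLS media playlist and each segment is a small WebVTT file with its own header. The helper names are hypothetical, the sketch has not been tested against srf.ch, and it ignores any X-TIMESTAMP-MAP offsets, so cue times may still need shifting.

```python
def segment_uris(playlist_text):
    """Return the segment URIs listed in an HLS media playlist."""
    uris = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):  # '#' lines are tags/comments
            uris.append(line)
    return uris


def merge_vtt_segments(segments):
    """Join WebVTT segment texts under a single WEBVTT header."""
    cues = []
    for text in segments:
        for block in text.replace('\r\n', '\n').split('\n\n'):
            block = block.strip().lstrip('\ufeff')
            if not block or block.startswith('WEBVTT'):
                continue  # drop per-segment header (and X-TIMESTAMP-MAP)
            if block not in cues:  # cues repeated across segments
                cues.append(block)
    return 'WEBVTT\n\n' + '\n\n'.join(cues) + '\n'


playlist = '#EXTM3U\n#EXTINF:10.0,\nseg1.vtt\n#EXTINF:10.0,\nseg2.vtt\n#EXT-X-ENDLIST\n'
seg1 = ('WEBVTT\nX-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000\n\n'
        '00:00:01.000 --> 00:00:03.000\nBonjour\n')
seg2 = ('WEBVTT\n\n00:00:01.000 --> 00:00:03.000\nBonjour\n\n'
        '00:00:11.000 --> 00:00:12.000\nSalut\n')
print(segment_uris(playlist))
print(merge_vtt_segments([seg1, seg2]))
```

The result is an ordinary standalone vtt file with one header and no duplicated cues, which makes it easy to diff against the ffmpeg output when hunting for whatever the player dislikes.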

jbrea commented 4 years ago

@boulderob Can you share your basic script to download subs? If possible I would like to download the subs from the daily news, like https://www.rts.ch/play/tv/popupvideoplayer?id=11287161

edcol commented 3 years ago

@boulderob: I have the same ask, could you please share the method of downloading the subs from RTS.CH, Example URL: https://il.srgssr.ch/integrationlayer/2.0/mediaComposition/byUrn/urn:rts:video:11939634.json?onlyChapters=true&vector=portalplay

thanks in advance EC