Open kristophercrawford opened 4 years ago
First, play this video: https://www.documentarymania.com/player.php?title=The+Body+vs+Coronavirus, skip almost to the end, and let it finish. Second, open the developer tools (F12), select the Network tab, and filter by Media. You should see an mp4 file.
Copy the first URL, open a new tab, paste in the URL, and hit Enter. The video should play. Right-click on the video and select "Save Video As". The suggested file name should be videoHD.php; just add .mp4, like this: videoHD.php.mp4, and you should have the video downloaded.
This works best in the Firefox and Brave web browsers; I haven't tested other browsers yet.
The generic info extractor is obtaining the incorrect url. You can see in the verbose output where it says: [debug] Invoking downloader on 'https://www.documentarymania.com/Videos/'
It looks like it's happening here.
I believe this line is pulling the following content from the webpage:
<script type="application/ld+json"> { "@context": "http://schema.org", "@type": "VideoObject", "name": "The Body vs Coronavirus", "description": "How can we cope with the tricky coronavirus now rampant worldwide? As the pandemic tightens its grip on the world, there are important unanswered questions about this novel virus: Why does this infection spread so rapidly from people with no symptoms? Why do some people become critical while others don't? Will a definitive treatment be found? The underlying key to these questions lie in our immune system. Immune cells are microscopic warriors, combating viruses and another pathogens. <br> Through the high-tech 'eyes' of next-generation microscopes and computer-generated imagery, we will see how our immune defense corps combat against microbes and what mechanism is expected to help develop treatment. ", "thumbnailUrl": "https://www.documentarymania.com/iconos/The.Body.Vs.Coronavirus.jpg", "uploadDate": "2020-10-08 09:53:44Z", "duration": "PT51M30S", "contentUrl": "https://www.documentarymania.com/Videos/", "embedUrl": "https://www.documentarymania.com/player.php?title=The Body vs Coronavirus", "interactionCount": "5429" } </script>
You can see that "contentUrl" is "https://www.documentarymania.com/Videos/", and it looks like this value is used here, before being merged here.
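As a rough illustration of how that value ends up being picked, here is a minimal standalone sketch that pulls "contentUrl" out of a JSON-LD block. It is not youtube-dl's actual code: the regex, the helper name find_content_url, and the shortened HTML sample are assumptions made for this example.

```python
import json
import re

# Shortened copy of the JSON-LD block quoted above (description and other fields omitted).
HTML = (
    '<script type="application/ld+json"> { "@context": "http://schema.org", '
    '"@type": "VideoObject", "name": "The Body vs Coronavirus", '
    '"contentUrl": "https://www.documentarymania.com/Videos/", '
    '"embedUrl": "https://www.documentarymania.com/player.php?title=The Body vs Coronavirus" } '
    '</script>'
)

# Hypothetical pattern for locating JSON-LD script tags; an assumption for this sketch.
JSON_LD_RE = re.compile(
    r'(?is)<script[^>]+type=["\']application/ld\+json["\'][^>]*>(?P<json>.+?)</script>'
)


def find_content_url(html):
    """Return the contentUrl of the first VideoObject JSON-LD block, or None."""
    for match in JSON_LD_RE.finditer(html):
        try:
            data = json.loads(match.group('json'))
        except ValueError:
            # Skip blocks that are not valid JSON.
            continue
        if isinstance(data, dict) and data.get('@type') == 'VideoObject':
            return data.get('contentUrl')
    return None


print(find_content_url(HTML))
```

With the sample above, the function returns the directory URL from the page rather than a media file, which matches the URL shown in the debug line.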
Tested using the following:
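import youtube_dl
from youtube_dl.extractor import generic

url = "https://www.documentarymania.com/player.php?title=The+Body+vs+Coronavirus"
# Instantiate the generic extractor and attach a downloader so it can fetch the page
gie = generic.GenericIE()
gie.set_downloader(youtube_dl.YoutubeDL())
# Call the extractor directly to inspect the info dict it builds
info = gie._real_extract(url)
print(info)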
I'm no authority on schemas, but per https://schema.org/VideoObject (the @context value that's referenced in the webpage), @type VideoObject is like a subclass of @type MediaObject, and the "contentUrl" of a MediaObject is supposed to be "Actual bytes of the media object, for example the image file or video file," so I would wager that this website is not following the schema.
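One way to act on that reading would be a sanity check that rejects a "contentUrl" that names a directory instead of a file. The sketch below is only a heuristic under that assumption; looks_like_media_file is a hypothetical helper, not something youtube-dl provides, and the example.com URL is made up for illustration.

```python
import posixpath
from urllib.parse import urlparse


def looks_like_media_file(content_url):
    """Heuristic: True if the URL path names a file (has an extension), not a directory."""
    path = urlparse(content_url).path
    # An empty path or a trailing slash points at a directory listing, not a file.
    if not path or path.endswith('/'):
        return False
    _, ext = posixpath.splitext(path)
    return bool(ext)


# The value served by the page is a directory, so the check rejects it.
print(looks_like_media_file("https://www.documentarymania.com/Videos/"))
# A made-up URL that names a file passes the check.
print(looks_like_media_file("https://example.com/media/clip.mp4"))
```

The check only looks at the shape of the URL path, so it cannot confirm that the target actually serves video bytes; it would only filter out obviously unusable values such as the one on this page.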
Question
I am attempting to download videos from the website documentarymania.com, and some of them fail to download. I can access these videos in a browser normally. I have tried changing the user agent used by youtube-dl, and I also set up an EC2 instance and tried from there to rule out my IP address being filtered. Verbose output is below: