xaicron / p5-www-youtube-download

YouTube video download interface.
http://blog.livedoor.jp/xaicron/

403 forbidden with ver0.58 #36

Open supersquirrel opened 9 years ago

supersquirrel commented 9 years ago

Good day and thank you for your great work!

I have the same problem as reported in bug 24: https://github.com/xaicron/p5-www-youtube-download/issues/24

When trying to download some videos, for example kDMsVApzv9w, I get: !! kDMsVApzv9w download failed: 403 Forbidden at E:\test.pl line 9.

In Firefox the video plays fine.

WWW::YouTube::Download Installed: 0.58

perl --version: This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x86-multi-thread-64int

test.pl:

use strict;
use warnings;
use WWW::YouTube::Download;

my $video_id = 'kDMsVApzv9w';
my $client   = WWW::YouTube::Download->new;
$client->download($video_id);

(first time reporting a bug, hope it's ok this way...)

oalders commented 9 years ago

@supersquirrel thanks for this. Which version of WWW::YouTube::Download do you have installed?

supersquirrel commented 9 years ago

I use version 0.58 (the latest), installed with (on the command line) cpan WWW::YouTube::Download.

oalders commented 9 years ago

Not sure what the issue is, but here's a script that will provide a fair amount of debugging output:

use strict;
use warnings;

use LWP::ConsoleLogger::Easy qw( debug_ua );
use WWW::Mechanize;
use WWW::YouTube::Download;

my $ua = WWW::Mechanize->new;
debug_ua( $ua );

my $video_id = 'kDMsVApzv9w';
my $client   = WWW::YouTube::Download->new( ua => $ua );
$client->download( $video_id );

supersquirrel commented 9 years ago

I ran your script, here is the result (quite huge): https://gist.github.com/supersquirrel/da7b6bc7079fa4873d9b I removed my (almost static) IP address and replaced it with "REMOVED_IP".

edit 15.07.15: Sorry, I just managed to delete the gist while trying something. :-/ I restored it here: https://gist.github.com/supersquirrel/79237dd370ca69f3f837

supersquirrel commented 9 years ago

Sorry to pressure you, any news here? Sadly the issue is still there...

oalders commented 9 years ago

Not feeling any pressure. :) I just got maint on this module a while back to apply some existing patches and release them, so I don't actually know the internals all that well. If someone has a fix for this, I'm happy to apply and release it.

supersquirrel commented 9 years ago

Thank you for your answer. I just wanted to try to do some debugging (with my very limited Perl knowledge), but I have a new problem: the logging no longer works. Can you confirm this issue, or is it my PC that is broken?

log.pl:

use strict;
use warnings;
use LWP::ConsoleLogger::Easy qw( debug_ua );
use WWW::Mechanize;
use WWW::YouTube::Download;
my $ua = WWW::Mechanize->new;
debug_ua( $ua );
my $video_id='D36JUfE1oYk'; #KITTEN MEETS HEDGEHOG 
my $client   = WWW::YouTube::Download->new( ua => $ua );
$client->download( $video_id );

command line: perl log.pl>log_err 2>&1

Expected result: the script succeeds, the video is downloaded and a logfile is created.

Actual result: the script fails, no video is downloaded and the logfile contains an error message: "failed to extract JSON data at log.pl line 10."

What is going on? This script worked a few weeks ago! Can you confirm this issue?

EDIT: If I write a script without the logging stuff

use strict;
use warnings;
use WWW::YouTube::Download;
my $video_id='D36JUfE1oYk'; #KITTEN MEETS HEDGEHOG 
my $client = WWW::YouTube::Download->new;
$client->download($video_id);

it works fine! The error only occurs when the logging/debugging stuff is enabled.

oalders commented 9 years ago

Looks like the ConsoleLogger is messing with the response somehow. Good catch. That is something I can look into. I'll update the ticket once I've had a chance to see what's going on.

supersquirrel commented 9 years ago

I did some debugging. Because the logger doesn't work, I added some Data::Dumper calls to LWP::UserAgent.pm and WWW::YouTube::Download.pm to see what is going on. Ugly, but it works... It seems that the problem is due to some missing parameters. At some point, both Firefox with an open YouTube video page (where the video plays fine) and the download script (which fails to download the video) make a GET request to something like https://r5---sn-n4g-cvqe.googlevideo.com/videoplayback (the exact host name can differ). Comparing the parameters, some are missing from the request made by Perl. They are listed below, with the values for the video I used ("Gary Moore / Midnight Blues" https://www.youtube.com/watch?v=kDMsVApzv9w ):

c=web
clen=4790736
cpn=("random" stuff like SDz_cHjyYAgB50qZ or dPgcnBFFDBaYWuaU, changes at each reload)
cver=as3
gir=yes
keepalive=yes
range=0-241663

I suppose that if we add these params to the download script it should work again (maybe not all params are mandatory). I tried this by hacking Download.pm (very, very ugly, but my Perl is too limited to make it "nice and clean"), but I always got 403 Forbidden. I think the problem is the cpn value, which is different for each request (but maybe I'm wrong and the problem is elsewhere!).
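As an illustration of that idea, here is a minimal sketch of appending the extra parameters to a videoplayback URL using only core Perl. The helper names `uri_escape_simple` and `add_query_params` are hypothetical and not part of WWW::YouTube::Download, and where such a call would belong inside Download.pm is not shown. The parameter values are the ones quoted above for this one video; cpn is left out because its value changes on each reload.

```perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
# Good enough for the ASCII values used here; not a general-purpose encoder.
sub uri_escape_simple {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $text;
}

# Append key/value pairs (passed as a flat list so the order is kept)
# to a URL, using '?' or '&' depending on whether a query already exists.
sub add_query_params {
    my ( $url, @pairs ) = @_;
    my @parts;
    while ( my ( $key, $value ) = splice @pairs, 0, 2 ) {
        push @parts, uri_escape_simple($key) . '=' . uri_escape_simple($value);
    }
    return $url unless @parts;
    my $separator = $url =~ /\?/ ? '&' : '?';
    return $url . $separator . join( '&', @parts );
}

# Values taken from the list above (kDMsVApzv9w).
my $url = add_query_params(
    'https://r5---sn-n4g-cvqe.googlevideo.com/videoplayback',
    c         => 'web',
    clen      => 4790736,
    cver      => 'as3',
    gir       => 'yes',
    keepalive => 'yes',
    range     => '0-241663',
);
print "$url\n";
```

This only shows how the request URL could be assembled; it does not by itself get past the 403.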

I downloaded the YouTube page with wget (so no JS is executed) and looked at it: clen, gir and keepalive seem to be somewhere in the code, so with some regex it should be possible to get them. range can be found too, but with another value; to be investigated. c and cver are not there but seem to be constant.

The problem is the cpn value (and maybe the range value: it seems constant between reloads, but it's probably not the same for each video. I didn't look at this). In the raw HTML (downloaded by wget) there are some "[CPN]" strings that look like placeholders for the actual value. The value must be computed by Firefox from some other value, either by JS or by Flash. If I switch to HTML5 mode (so no more Flash), the cpn parameter is still in the request made by Firefox.

I looked in the scripts: "cpn" can be found several times in html5player-new.js. Sadly this file is almost 1 MB of obfuscated JS, so there is no way to understand it. In addition, my JS knowledge is almost zero, so I'm stuck here.

TODO:
1) Find out how the cpn value is computed.
2) Same thing for the range value.
3) Modify Download.pm so that all 7 parameters are added to the request, then test; it should work then.
4) (Optional) Check whether all 7 parameters are needed; maybe some can be removed.

I did what I was able to do, but as I said, now I'm stuck. Some help would be nice...

A lot of text in poor English; I hope you can understand it. If not, just ask for further explanations.

oalders commented 9 years ago

@supersquirrel thanks for doing the research on this. I'm sorry that I'm not able to help more right now. Have you thought about just scraping those values from a page rather than figuring out how they are computed? Something like https://metacpan.org/pod/Web::Scraper makes it pretty easy to extract text out of HTML.

supersquirrel commented 9 years ago

> Have you thought about just scraping those values from a page rather than figuring out how they are computed?

Sadly it seems that it's not as easy as this.

5 of the 7 parameters can be found in the code of the YT page, so with some regex magic (or a specialised Perl module) we should be able to get them.

The cpn value (and maybe the range value, I didn't look at this one) is NOT in the page code, at least not in plain text. (There is A LOT of code; it's difficult to get through all this stuff.)
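A rough sketch of that regex approach, assuming the parameters appear in the raw page as name=value pairs separated by `&` or by the JSON escape `\u0026` (as seen in ytplayer.config). The function name `find_param` and the sample string are made up for illustration; the real page layout may differ and would need checking against an actual wget download.

```perl
use strict;
use warnings;

# Return the first value of $name found in $html, or undef if absent.
# Assumption: parameters look like name=value and are delimited by
# '?', '&', ',', '"' or the literal six characters \u0026.
sub find_param {
    my ( $html, $name ) = @_;
    my ($value) = $html =~ /(?:\A|[?&,"]|\\u0026)\Q$name\E=([^&\\",]*)/;
    return $value;
}

# Hypothetical fragment in the style described in this thread.
# Single quotes keep \u0026 as literal text, as it is in the raw HTML.
my $sample = 'url=https://example.invalid/videoplayback?id=1'
    . '\u0026clen=4790736\u0026gir=yes\u0026keepalive=yes\u0026cpn=[CPN]';

for my $name (qw( clen gir keepalive )) {
    my $value = find_param( $sample, $name );
    printf "%s=%s\n", $name, defined $value ? $value : '(not found)';
}
```

Requiring a delimiter in front of the name keeps a search for `len` from matching inside `clen`.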

(By the way, if you click "show source" in Firefox it will reload the page so the cpn-value changes again. Use Firebug --> HTML-Tab --> --> Edit.)

If you download the video page with wget (so no JS is executed, you really get the "raw" thing) and look at ytplayer.config, you will see something like "\u0026cpn=[CPN]" several times, so I suppose cpn is computed by JS on the client side (browser) only. What I don't understand is why I can't find something like cpn= with Firebug in the processed code (= the code of the page as modified by all the JS running in the browser). Maybe [CPN] is replaced by the real value, then the entire thing is stored somewhere in JS and deleted from the source code? I fear we have to reverse engineer the JS to extract the algorithm, to be able to compute the needed value in Perl. :-(

(By the way, I think these \u<4 digit number> escapes are new in the code; maybe that's why the logger module is broken?)

I spent quite a few hours trying to debug/analyze/understand/... html5player.js, but it's really way too much (50,000 lines once passed through a beautifier, >2500 functions) and all variable and function names are obfuscated. Why this JS file and not another one? Because I suppose the algorithm is in there somewhere: in the code you can find the word "cpn" several times. There is also the (probable) meaning of this abbreviation: clientPlaybackNonce.
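If the placeholder really is filled in client-side, the substitution itself would be simple. A minimal sketch, assuming the escaped URL is taken verbatim from ytplayer.config and that a cpn value is already available from somewhere; `fill_cpn` and the example URL are hypothetical:

```perl
use strict;
use warnings;

# Turn the JSON-escaped ampersands (\u0026) back into '&' and replace
# every [CPN] placeholder with the given nonce.
sub fill_cpn {
    my ( $escaped_url, $cpn ) = @_;
    ( my $url = $escaped_url ) =~ s/\\u0026/&/g;
    $url =~ s/\[CPN\]/$cpn/g;
    return $url;
}

# Made-up URL; the nonce is one of the sample values quoted in this thread.
my $raw = 'https://example.invalid/videoplayback?id=1\u0026cpn=[CPN]';
print fill_cpn( $raw, 'SDz_cHjyYAgB50qZ' ), "\n";
```

Whether the server accepts a cpn produced this way is exactly the open question here.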

There is also a "upn" (userPlaybackNonce???), it's NOT the same thing (and iirc the upn is directly in the code).

I tried to hook(?) into the individual JS functions to get a log of which function is called when and with which arguments, because this way it should be possible to find the functions (only a few, I suppose) that create this nasty cpn value. With a lot of trial and error I was able to get some very short logs, but nothing useful (details on request; I just played around almost without JS knowledge).

edit: Is there a way to use HTTP instead of HTTPS on Youtube? Would be really useful.

oalders commented 9 years ago

Without having looked at it, I wonder if cpn is a red herring. For example, this project is quite popular: https://github.com/rg3/youtube-dl I've grepped it for cpn and nothing useful appears. It might be worth checking into how other projects have solved this.

I haven't looked into the ConsoleLogger issues yet -- thanks for the tip. I have seen this same problem with Pithub as well, so I should really compare the two cases.

As far as youtube over HTTP goes, I'm not aware of that as a possibility. Looks like HTTP is just a 301 to HTTPS:

lafs-MacBook-Pro:~ olaf$ curl -I http://www.youtube.com
HTTP/1.1 301 Moved Permanently
Date: Tue, 08 Sep 2015 01:11:57 GMT
Server: gwiseguy/2.0
X-XSS-Protection: 1; mode=block; report=https://www.google.com/appserve/security-bugs/log/youtube
Location: https://www.youtube.com/
Content-Length: 0
P3P: CP="This is not a P3P policy! See http://support.google.com/accounts/answer/151657?hl=en for more info."
Content-Type: text/html; charset=utf-8
Cache-Control: no-cache
Expires: Tue, 27 Apr 1971 19:44:06 EST
X-Content-Type-Options: nosniff
Set-Cookie: VISITOR_INFO1_LIVE=N4dkSArj284; path=/; domain=.youtube.com; expires=Sun, 08-May-2016 13:04:57 GMT; httponly
Set-Cookie: YSC=pI_a5BmQkgo; path=/; domain=.youtube.com; httponly

supersquirrel commented 9 years ago

> For example, this project is quite popular: https://github.com/rg3/youtube-dl I've grepped it for cpn and nothing useful appears. It might be worth checking into how other projects have solved this.

Looks like they use the swf file: there is a swfinterp.py that is loaded by extractor/youtube.py. I don't know Python; I may look at this later. edit: I actually already looked at the swf file. It is possible to extract a lot of (apparently not, or not strongly, obfuscated) source code from it. With some website I managed to get 39,000 lines or 1.68 MB.

I have found some very interesting JavaScript related to cpn, but I'm stuck because my tools are not working. The problem is HTTPS. For testing I need to inject code into the YouTube page before Firefox can run the JS. For this I use a little proxy program called Proxomitron. Even though it's quite old, it works great for HTTP. It can do the same thing for HTTPS (as a man-in-the-middle, so the certificates will be invalid), but it seems that there is no way to make Firefox accept the self-signed certificate. This is really annoying. According to a bug report it may work with older versions of FF. I will try this and leave another message here if I have new information about the actual YT/script problem.

supersquirrel commented 9 years ago

I was finally able to set up Fiddler2 so that it works, at least for YouTube...

Regarding cpn your thought that it's a red herring might be true. Basically it's 16 calls to window.crypto.getRandomValues that are then translated/encoded to a string. No other values or parameters are used. WTF? If you want the source code look at html5player.js and search for "window.crypto.getRandomValues". In my case the function is called Gv() but this may vary.
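For reference, a value of the same shape can be produced in Perl with a few lines. This is only a sketch under assumptions: the 64-character URL-safe alphabet is a guess based on the sample values quoted earlier (SDz_cHjyYAgB50qZ, dPgcnBFFDBaYWuaU), the exact encoding used by html5player.js is not confirmed, and `make_cpn` is a hypothetical name.

```perl
use strict;
use warnings;

# Assumed alphabet: the URL-safe base64 characters. The sample cpn values
# in this thread only use characters from this set, but the real mapping
# in the player code has not been verified.
my @ALPHABET = ( 'A' .. 'Z', 'a' .. 'z', 0 .. 9, '-', '_' );

# Build a 16-character nonce from 16 independent random picks.
# rand() is NOT cryptographically secure; it merely stands in for
# window.crypto.getRandomValues for the purpose of illustration.
sub make_cpn {
    return join '', map { $ALPHABET[ int( rand( scalar @ALPHABET ) ) ] } 1 .. 16;
}

print make_cpn(), "\n";
```

Since no other input goes into the value, the server presumably cannot validate it beyond its format, which fits the idea that cpn is not what triggers the 403.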

I modified the download script so that the GET videoplayback request has parameters identical to what I see in Firebug (for a Flash video): still 403.

The difference is that in Firefox there are some other calls (GET, maybe POST) before the video is requested. Maybe that's the problem? Or some cookie that is missing because of the missing calls? Or something else? I don't know. I might continue trying to make this work, but it's becoming quite depressing: a lot of work and still no results...

Serkan-devel commented 8 years ago

I have the same problem with the latest version 0.58