yatish27 / linkedin-scraper

Scrapes the public profile of the linkedin page
MIT License

Does linked scraper follow links? #49

Closed bvanbreukelen closed 8 years ago

bvanbreukelen commented 9 years ago

I'm using linkedin-scraper in a web setup. After 4 'scrapes' I ran into a 999 (blocked by LinkedIn). I made a workaround: I first fetch the profile with wget, store it, and serve it from my own server. After this I scrape http://localhost/profile.html with linkedin-scraper. This works like a charm... however, after a number of 'scrapes' I run into a 999 block again. I'm not a Ruby programmer and have a hard time following the code. My question is simple: is linkedin-scraper following links in the profile, and therefore still accessing linkedin.com? And if so, does it really need to? I noticed that opening the stored profile on my server as a web page loads the LinkedIn page again (there is a redirect in the file).

Hope you can shed some light on this.

yatish27 commented 9 years ago

Hi. Yes, to get the company details it follows the link on the profile, which will hit linkedin.com.
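
One way to see which links in a locally stored profile still point at linkedin.com is to scan the saved HTML before scraping it. The following is a minimal Ruby sketch, not code from the gem: the helper name `linkedin_links` and the sample markup are made up for illustration, and a regex scan is only a rough stand-in for a real HTML parser.

```ruby
require 'uri'

# Return the unique absolute links (href/src attributes) in an HTML string
# whose host is linkedin.com or one of its subdomains.
# Hypothetical helper for illustration; not part of linkedin-scraper.
def linkedin_links(html)
  html.scan(/(?:href|src)\s*=\s*["']([^"']+)["']/i).flatten.select do |link|
    host = begin
      URI.parse(link).host
    rescue URI::InvalidURIError
      nil
    end
    host && (host == 'linkedin.com' || host.end_with?('.linkedin.com'))
  end.uniq
end

# Made-up sample standing in for a profile saved with wget.
sample = <<~HTML
  <a href="https://www.linkedin.com/company/example">Example Co</a>
  <a href="http://localhost/other.html">local copy</a>
  <img src="/images/photo.jpg">
HTML

puts linkedin_links(sample)
```

Relative links and links to other hosts are ignored, so whatever this prints is what a link-following scraper could still request from LinkedIn.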

bvanbreukelen commented 9 years ago

Ah. I'm just wondering what information I would miss. When I 'look' at a profile, it seems to me like everything is there. With wget it is easy to present yourself as a 'normal' browser visiting LinkedIn: I could wget a page but not scrape it with the scraper. So for me the 'workaround' seemed to work. Would it be possible to build an option (-nofollow) that parses only the more limited data? Just wondering if that would still be usable.

yatish27 commented 9 years ago

You would miss the company detail information

bvanbreukelen commented 9 years ago

OK, I could maybe work around that. Could you point me to where I might change this in the code to disable that additional detail, just to try this out? I might be able to work it out myself, but it would be nice to have a general idea of where to find it. By the way, I like the scraper!! I would also love to be able to 'scrape' my contacts/network information :)

yatish27 commented 9 years ago

When calling the function to get the data, don't call the company-related methods on the profile object.
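
The advice above implies that company details are only fetched when a company-related method is actually called. A minimal Ruby sketch of that lazy pattern follows. It is an assumption about the general technique, not the gem's real code: the class `LazyProfile`, its constructor arguments, and the stubbed fetcher are all hypothetical.

```ruby
# Hypothetical stand-in for a scraped profile; NOT the gem's real class.
class LazyProfile
  attr_reader :name, :requests

  def initialize(name, company_urls, fetcher)
    @name = name
    @company_urls = company_urls
    @fetcher = fetcher # callable taking a URL and returning company details
    @requests = 0      # counts how many company pages were fetched
  end

  # Company details are fetched on the first call only, then memoized.
  def companies
    @companies ||= @company_urls.map do |url|
      @requests += 1
      @fetcher.call(url)
    end
  end
end

# Stub that returns canned data instead of touching the network.
fake_fetcher = ->(url) { { url: url, details: 'stubbed' } }
profile = LazyProfile.new('Jane Doe', ['https://www.linkedin.com/company/example'], fake_fetcher)

puts profile.name      # reading basic fields triggers no company request
puts profile.requests
profile.companies      # the first call fetches one page per company URL
puts profile.requests
```

With this shape, a caller who only reads the basic fields never triggers the extra requests.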

bvanbreukelen commented 9 years ago

Thank you for the quick response, I will give this a try. I guess this means I need to learn Ruby a bit better, as I'm currently using the binary version from the command line.

petabreads commented 9 years ago

Hi @bvanbreukelen, would you mind sharing your solution using wget? I was also working through the 999 issue last night... not being a dev, I ended up setting up a VPN from my Linode back to a Cisco firewall at home. Works pretty well, but it's not my preferred solution. I also tried tor and torsocks/torify, but 70% of those addresses are also blocked.

Thanks and good luck

bvanbreukelen commented 9 years ago

Hi @petabreads,

Sure. I use the bin version of the scraper from the command line, by the way. I wrote a PHP script (called via an AJAX call from my website) that first does a wget and then the scrape. I set up a web-accessible folder on my server so the scraper can connect to that instead of the LinkedIn site. I've put that part of the PHP here. Be aware that, to minimise downloads etc., there are a ton of checks in my script. In the end it works; however, the scraper still accesses LinkedIn and I get blocked every once in a while, which seems to last only a few hours. I use wget because you can set a browser profile (and fake a browser); you can even be a Googlebot :)

(php snippets, just copy paste this in an editor to see the code a bit better)

/** WGET THIS to our server **/
// echo shell_exec("wget -P $fileBase -U Googlebot/2.1 --no-check-certificate $scrapeUrl 2>&1");
/** the line above is one option: impersonate Googlebot. I use the second line: be Mozilla Firefox **/

echo shell_exec("wget -P $fileBase --header='Accept: text/html' -U 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0' --no-check-certificate " . $scrapeUrl . " 2>&1");

/** What is the name of the wget file; make it a .html file **/
shell_exec('mv ' . $fileBase . $split . ' ' . $nameOfFile);

/** $fileBase is the (absolute) folder where the file downloaded from LinkedIn by wget is located **/
$scrapeProfile = shell_exec("/usr/bin/ruby " . $fileBase . "linkedin-scraper " . $serverBase . $fileNameOnly . " 2>&1");
print "\n" . $scrapeProfile . "\n"; /** this prints the JSON to the screen; it's my debug **/

$jsonProfile = json_decode($scrapeProfile); /** this decodes the JSON into a PHP object (pass true as the second argument to get an array) **/
print_r($jsonProfile); /** this prints the result as debug **/

Hope this code is still readable.
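
The snippet above interpolates `$scrapeUrl` straight into a shell string; if the URL arrives from an AJAX call, it is safer to escape it first (in PHP, `escapeshellarg` does this). The same idea as a minimal Ruby sketch: the helper name `wget_args`, the profile URL, and the `/tmp/profiles/` folder are placeholders, while the wget flags are the ones used in the PHP above.

```ruby
require 'shellwords'

# Build the wget invocation as an argument list so it can be handed to
# system(*args) without passing through a shell.
# Hypothetical helper for illustration only.
def wget_args(url, dest_dir, user_agent)
  ['wget', '-P', dest_dir, '--header=Accept: text/html',
   '-U', user_agent, '--no-check-certificate', url]
end

args = wget_args(
  'https://www.linkedin.com/in/example', # placeholder profile URL
  '/tmp/profiles/',                      # placeholder download folder
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0'
)

# Shellwords.join escapes each argument, which is handy for logging
# the exact command that would run.
puts Shellwords.join(args)

# system(*args) would execute it; left commented out so the sketch
# has no side effects.
```

Passing the arguments as a list means a URL containing spaces, quotes, or `;` cannot break out of the command.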

petabreads commented 9 years ago

@bvanbreukelen Thanks for the info! Still trying to understand it :D (like I said, not a developer).

Just had an idea... would it be possible to log in to LinkedIn with a browser, navigate to a profile, download the page source (is this the same as what wget returns?), and then use that in the process you outlined above? The idea is to scrape a profile while actually authenticated to LinkedIn, thus getting the private details. Many of the profiles I'm looking at have restricted public profiles.

Maybe that's impossible?

bvanbreukelen commented 9 years ago

Hi.

Indeed the page source is what wget returns. Your other idea sounds great. I'm not the developer of the scraper and I think the scraper now only does public page profiles. Would be nice to develop a version that could do the private ones.

Cheers Bas

petabreads commented 9 years ago

@bvanbreukelen I might give it a try this evening.... Thanks again

yatish27 commented 9 years ago

I could work on scraping private profiles via an authenticated login, but there is a high chance LinkedIn will ban the account soon.