Closed. bvanbreukelen closed this issue 8 years ago.
Hi, yes. To get the company details the scraper follows the company link on the profile, which will hit linkedin.com.
On Fri, Jul 17, 2015 at 11:30 AM, bvanbreukelen notifications@github.com wrote:
I'm using linkedin-scraper in a web configuration. After 4 'scrapes' I ran into a 999 (blocked by LinkedIn). As a workaround I first WGET the profile, store it, and serve it from my own server; after that I point linkedin-scraper at http://localhost/profile.html. This works like a charm... however, after a number of 'scrapes' I run into a 999 block again. I'm not a Ruby programmer and have a hard time following the code. My question is simple: is linkedin-scraper following links in the profile, and therefore still accessing linkedin.com? And if so, does it really need to? I noticed that opening the stored profile on my server as a webpage loads the LinkedIn page again (there is a redirect in the file).
Hope you can shed some light on this.
Yatish Mehta
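For reference, a minimal sketch of the localhost workaround described above, using the gem's Ruby API. The Linkedin::Profile.get_profile call is taken from the gem's README; the localhost URL and file name are placeholders for this particular setup, not anything the gem prescribes.

require 'linkedin_scraper'   # gem 'linkedin-scraper'

# Assumption: profile.html was fetched with wget beforehand and is served
# by a local web server at http://localhost/.
profile = Linkedin::Profile.get_profile('http://localhost/profile.html')

puts profile.name
puts profile.title

As noted in the reply above, company details are followed via links on the profile, so parts of the profile object can still cause requests to linkedin.com even when the page itself came from localhost.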
Ah, I'm just wondering what information I would miss. When I 'look' at a profile, it seems to me like everything is there. Using wget it is easy to present yourself as a 'normal' browser visiting LinkedIn; I could wget pages even when I could not scrape them with the scraper, so for me the 'workaround' seemed to work. Would it be possible to build/implement an option (-nofollow) and parse the more limited data? Just wondering if that would still be usable?
You would miss the company detail information
Yatish Mehta
Ok, I could maybe work around that. Could you point me to where I might change this in the code to disable that additional detail, just to try this out? I might be able to work it out myself, but it would be nice to have a general idea of where to find this. By the way, I like the scraper!! Would also love to be able to 'scrape' my contacts/network information :)
When calling the function to get the data, don't call company-related methods on the profile object.
Yatish Mehta
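A minimal sketch of that suggestion. The attribute names are assumed from the gem's README (first_name, title, skills, and so on) and may differ between versions; the point is simply to avoid the company-related accessors such as current_companies and past_companies, which per the comment above are what reach back out to linkedin.com.

require 'linkedin_scraper'

profile = Linkedin::Profile.get_profile('http://localhost/profile.html')

# Profile-level data only; no company lookups.
puts profile.first_name
puts profile.last_name
puts profile.title
puts profile.location
puts profile.skills.inspect

# Deliberately not calling profile.current_companies or profile.past_companies,
# since those are the company-related methods that hit linkedin.com.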
Thank you for the quick response; I will give this a try. This will require me to learn Ruby a bit better, I guess, as I'm currently using the binary version from the command line.
Hi @bvanbreukelen, would you mind sharing your solution using WGET? I was also working through the 999 issue last night... not being a dev, I ended up setting up a VPN from my Linode back to a Cisco firewall at home. It works pretty well, but it is not my preferred solution. I also tried Tor and torsocks/torify, but 70% of those addresses are blocked as well.
Thanks and good luck
Hi @petabreads,
Sure.. I use the bin version of the scraper from the command line, by the way. I wrote a PHP script (called via an AJAX call from my website) that first does a wget and then the scrape. I set up a folder on my server that is web-accessible, so the scraper can connect to that instead of the LinkedIn site. I've put that part of the PHP below. Be aware that, to minimise downloads etc., there are a ton of checks in my script. In the end it works, however the scraper still accesses LinkedIn and I get blocked every once in a while, though that seems to last only a few hours. I use wget because you can set a browser profile (and fake a browser); you can even be a Googlebot :)
(php snippets, just copy paste this in an editor to see the code a bit better)
/* WGET this to our server */
// One option: impersonate a Googlebot. I use the second line instead: be Mozilla Firefox.
// echo shell_exec("wget -P $fileBase -U Googlebot/2.1 --no-check-certificate $scrapeUrl 2>&1");
echo shell_exec("wget -P $fileBase --header='Accept: text/html' -U 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0' --no-check-certificate " . $scrapeUrl . " 2>&1");

/* Take the name of the wget-downloaded file and make it a .html file */
shell_exec('mv ' . $fileBase . $split . ' ' . $nameOfFile);

/* $fileBase is the (absolute) folder where the wget-downloaded LinkedIn file is located */
$scrapeProfile = shell_exec("/usr/bin/ruby " . $fileBase . "linkedin-scraper " . $serverBase . $fileNameOnly . " 2>&1");

print "\n" . $scrapeProfile . "\n"; /* this prints the JSON to the screen, it's my debug */

$jsonProfile = json_decode($scrapeProfile, true); /* this makes the JSON a PHP array */
print_r($jsonProfile); /* this prints the array as debug */
Hope this code is still readable.
@bvanbreukelen Thanks for the info! Still trying to understand it :D (like I said, not a developer).
Just had an idea... would it be possible to log in to LinkedIn with a browser, navigate to a profile, and download the page source (this is the same as what wget returns?), then use that in the process you outlined above? The idea being to scrape a profile while actually authenticated to LinkedIn, thus getting the private details. Many of the profiles I'm looking at have restricted public profiles.
Maybe that's impossible?
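A rough sketch of how that could be automated, instead of saving the page source by hand: copy the LinkedIn session cookie from a logged-in browser, fetch the profile with that cookie, save the HTML into the web-served folder, then run the scraper against the local copy exactly as before. Everything here is an assumption: the cookie name (li_at), the paths, and whether the authenticated page still parses cleanly with the scraper are not verified.

require 'net/http'
require 'uri'
require 'linkedin_scraper'

# Assumptions: 'li_at' is the LinkedIn session cookie (paste its value from a
# logged-in browser session), and /var/www/html/ is served at http://localhost/.
uri   = URI('https://www.linkedin.com/in/some-profile')
li_at = 'PASTE_COOKIE_VALUE_HERE'

req = Net::HTTP::Get.new(uri)
req['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0'
req['Cookie']     = "li_at=#{li_at}"

res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
File.write('/var/www/html/some-profile.html', res.body)

# Scrape the locally served, authenticated copy instead of linkedin.com.
profile = Linkedin::Profile.get_profile('http://localhost/some-profile.html')
puts profile.name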
Hi.
Indeed, the page source is what wget returns. Your other idea sounds great. I'm not the developer of the scraper, and I think it currently only handles public profile pages. It would be nice to develop a version that could do the private ones.
Cheers Bas
@bvanbreukelen I might give it a try this evening.... Thanks again
I could work on private/authenticated-login profile scraping, but there is a high chance LinkedIn will ban the account soon.