Closed nhasemann closed 7 years ago
Which page did it crawl to return those 5 links? What links did it find?
On Mon, Nov 7, 2016 at 12:30 PM, nhasemann notifications@github.com wrote:
Hi everyone,
I edited the BasicCrawler.java file and the number of outgoing links is not correct. For most of the URLs it reports that only 5 URLs were found.
I changed nothing in these two lines:
Set<WebURL> links = htmlParseData.getOutgoingUrls();
logger.debug("Number of outgoing links: {}", links.size());
Do you have an idea where the problem is?
For this page it found the following links:
13:31:27.152 [Crawler 1] DEBUG edu.uci.ics.crawler4j.crawler.WebCrawler - Element of links: https://fls-eu.amazon.de/1/oc-csi/1/OP/requestId=XQDDSKEG7EZRWMPGVEKA&js=0
13:31:27.152 [Crawler 1] DEBUG edu.uci.ics.crawler4j.crawler.WebCrawler - Element of links: https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css
13:31:27.152 [Crawler 1] DEBUG edu.uci.ics.crawler4j.crawler.WebCrawler - Element of links: https://images-na.ssl-images-amazon.com/captcha/lqbiackd/Captcha_etxmbdetag.jpg
13:31:27.152 [Crawler 1] DEBUG edu.uci.ics.crawler4j.crawler.WebCrawler - Element of links: http://www.amazon.de/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=3312401
13:31:27.152 [Crawler 1] DEBUG edu.uci.ics.crawler4j.crawler.WebCrawler - Element of links: http://www.amazon.de/gp/help/customer/display.html/ref=footer_cou/275-2496043-9483305?ie=UTF8&nodeId=505048
Today I started the program again and made the following observation:
After 12 minutes the program finds only 5 links for every URL. Has amazon.com banned me from the server?
If I had more time, I would have checked out the latest version and tried it myself.
As it is, I hope someone here will be able to check that page. In the meantime, could you check whether you made any changes to the parser class where it parses the offline page? Maybe this one: https://github.com/yasserg/crawler4j/blob/master/src/main/java/edu/uci/ics/crawler4j/parser/Parser.java
On Tue, Nov 8, 2016 at 11:59 AM, nhasemann notifications@github.com wrote:
Today I started the program again and made the following observation:
After 12 minutes the program finds only 5 links for every URL. Has amazon.com banned me from the server?
Hi,
this crawler respects robots.txt. If you take a look at https://amazon.com/robots.txt, you will find that a lot of URLs are not allowed to be crawled.
If you really want to crawl this page, a workaround is to adapt this class https://github.com/yasserg/crawler4j/blob/master/src/main/java/edu/uci/ics/crawler4j/robotstxt/RobotstxtServer.java to ignore robots.txt. Then you should be able to crawl it.
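To illustrate, here is a minimal, self-contained sketch of what robots.txt filtering boils down to. The Disallow rules below are a hypothetical subset, not Amazon's actual file; the point is that blocked paths are dropped before they ever reach your visit() method, which can leave very few outgoing links per page.

```java
import java.util.List;

public class RobotsCheck {
    // Hypothetical subset of Disallow rules; the real list lives at
    // https://amazon.com/robots.txt and is much longer.
    static final List<String> DISALLOWED =
            List.of("/gp/cart", "/exec/obidos/account-access-login", "/wishlist/");

    // Simplified robots.txt semantics: a path is blocked if it starts
    // with any Disallow prefix.
    static boolean isAllowed(String path) {
        for (String rule : DISALLOWED) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("/gp/cart/view.html")); // prints false
        System.out.println(isAllowed("/dp/B000000000"));     // prints true
    }
}
```

crawler4j's RobotstxtServer applies the same idea with the real rules fetched from the site; the workaround above amounts to adapting that check so it always allows the URL.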
//CC: @yasserg Maybe you can close this issue?
It seems to me that a lot of links are generated or appended to the page via JavaScript or AJAX. Unfortunately, crawler4j is not capable of this yet. Look at Selenium or CasperJS/PhantomJS. If you google, there are ways to use them from Java (e.g. with Geb).
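As a minimal, self-contained sketch of why script-generated links are invisible to a static parser (a naive regex extractor and a made-up page, not crawler4j's actual parser): a link that a script builds at runtime only exists after a browser executes the JavaScript, so it never appears in the raw markup that crawler4j sees.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticVsDynamic {
    // Naive href extractor over raw HTML (illustration only).
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up page: one static link, one link built by a script at runtime.
        String html = "<html><body><a href=\"/static\">s</a>"
                + "<script>var a=document.createElement('a');"
                + "a.href='/dyn'+'amic';document.body.appendChild(a);</script>"
                + "</body></html>";
        System.out.println(extractLinks(html)); // prints [/static]
    }
}
```

A headless browser (e.g. Selenium WebDriver or PhantomJS) executes the script first and would also see the dynamically added link.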
Hi everyone,
I edited the BasicCrawler.java file and the number of outgoing links is not correct. For most of the URLs it reports that only 5 URLs were found.
I changed nothing in these two lines:
Set<WebURL> links = htmlParseData.getOutgoingUrls();
logger.debug("Number of outgoing links: {}", links.size());
Do you have an idea where the problem is?