yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Missing URLs #169

Closed nhasemann closed 7 years ago

nhasemann commented 7 years ago

Hi everyone,

I edited the BasicCrawler.java file, and the number of outgoing links is not correct: for most of the URLs it reports that it found only 5 links.

I didn't change anything in these two lines:

Set<WebURL> links = htmlParseData.getOutgoingUrls();

logger.debug("Number of outgoing links: {}", links.size());

Do you have an idea where the problem is?
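
For reference, these two lines sit in the stock BasicCrawler's visit() override; a minimal sketch of the surrounding code (not my full file):

    import java.util.Set;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class BasicCrawler extends WebCrawler {

        @Override
        public void visit(Page page) {
            // Outgoing links are only available for parsed HTML pages.
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                Set<WebURL> links = htmlParseData.getOutgoingUrls();
                // logger is the protected field inherited from WebCrawler.
                logger.debug("Number of outgoing links: {}", links.size());
            }
        }
    }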

Chaiavi commented 7 years ago

Which page did it crawl to return those 5 links? What links did it find?


nhasemann commented 7 years ago

For this page

https://www.amazon.de/Es-Roman-Stephen-King/dp/345343577X/ref=sr_1_2?s=books&ie=UTF8&qid=1478511545&sr=1-2

it found the following links:

13:31:27.152 [Crawler 1] DEBUG edu.uci.ics.crawler4j.crawler.WebCrawler - Element of links: https://fls-eu.amazon.de/1/oc-csi/1/OP/requestId=XQDDSKEG7EZRWMPGVEKA&js=0
13:31:27.152 [Crawler 1] DEBUG edu.uci.ics.crawler4j.crawler.WebCrawler - Element of links: https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css
13:31:27.152 [Crawler 1] DEBUG edu.uci.ics.crawler4j.crawler.WebCrawler - Element of links: https://images-na.ssl-images-amazon.com/captcha/lqbiackd/Captcha_etxmbdetag.jpg
13:31:27.152 [Crawler 1] DEBUG edu.uci.ics.crawler4j.crawler.WebCrawler - Element of links: http://www.amazon.de/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=3312401
13:31:27.152 [Crawler 1] DEBUG edu.uci.ics.crawler4j.crawler.WebCrawler - Element of links: http://www.amazon.de/gp/help/customer/display.html/ref=footer_cou/275-2496043-9483305?ie=UTF8&nodeId=505048

nhasemann commented 7 years ago

Today I started the program again and made the following observation:

After 12 minutes, the program finds only 5 links for every URL. Has amazon.com banned me from its server?

Chaiavi commented 7 years ago
  1. I am sure Amazon has good "anti-crawling" tactics so their servers won't be flooded. To work around this, we have a built-in "politeness" mechanism - set it to fetch only once every 60 seconds or so and see if that makes things better (see the sketch after this list).
  2. After the crawler retrieves a page, it parses it OFFLINE, so it should find all the links in that page; it doesn't make any sense that it found only 5 links.
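
The politeness delay is set through CrawlConfig; a minimal sketch, assuming the standard crawler4j setup (the storage folder path is just a placeholder):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;

    CrawlConfig config = new CrawlConfig();
    // Placeholder path for crawler4j's intermediate data.
    config.setCrawlStorageFolder("/tmp/crawler4j");
    // Wait 60 seconds (in milliseconds) between consecutive requests.
    config.setPolitenessDelay(60000);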

If I had more time, I would have checked out the latest version and tried it myself.

As it is, I hope someone here will be able to check that page. In the meantime, could you check whether you made any changes to the parser class that parses the downloaded page? Maybe this one: https://github.com/yasserg/crawler4j/blob/master/src/main/java/edu/uci/ics/crawler4j/parser/Parser.java


rzo1 commented 7 years ago

Hi,

this crawler respects robots.txt. If you take a look at https://amazon.com/robots.txt, you will find that a lot of URLs are not allowed to be crawled.

If you really want to crawl this page, a workaround is to adapt the class https://github.com/yasserg/crawler4j/blob/master/src/main/java/edu/uci/ics/crawler4j/robotstxt/RobotstxtServer.java to ignore robots.txt. Then you should be able to crawl it.
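
For illustration, a sketch of a controller setup that skips robots.txt checks without editing that class, assuming the RobotstxtConfig.setEnabled switch available in recent crawler4j versions:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp/crawler4j");  // placeholder path
    PageFetcher pageFetcher = new PageFetcher(config);

    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    // Assumption: with the config disabled, RobotstxtServer allows every URL.
    robotstxtConfig.setEnabled(false);
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);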

//CC: @yasserg Maybe you can close this issue?

s17t commented 7 years ago

It seems to me that a lot of the links are generated or appended to the page via JavaScript or AJAX. Unfortunately, crawler4j is not capable of handling this yet. Look at Selenium or CasperJS/PhantomJS; if you google, there are ways to use them from Java (e.g. with Geb).
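
For example, a minimal Selenium sketch, assuming the Selenium Java bindings and a locally installed Firefox/geckodriver (the URL is just a shortened form of the page from this thread), which lets the browser execute the JavaScript before counting anchors:

    import java.util.List;

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.firefox.FirefoxDriver;

    public class RenderedLinkCounter {
        public static void main(String[] args) {
            // Assumption: Firefox and its driver are installed locally;
            // a headless driver (e.g. PhantomJS) could be swapped in.
            WebDriver driver = new FirefoxDriver();
            try {
                driver.get("https://www.amazon.de/dp/345343577X");
                // Collect anchors after the page's JavaScript has run.
                List<WebElement> anchors = driver.findElements(By.tagName("a"));
                System.out.println("Number of rendered links: " + anchors.size());
            } finally {
                driver.quit();
            }
        }
    }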