xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

WebCrawler.shouldVisit doesn't get relative URLs #170

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use http://vimeo.com/search?q=lectures as the seed URL.

What is the expected output? What do you see instead?
Links of the form <a href="/7160598"> are not followed. They should be 
followed too. 

What version of the product are you using?
3.3

Please provide any additional information below.
Relative links should be resolved against the base URL. Links such as /terms 
and /help on the same page are followed correctly. 
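
To illustrate the resolution I would expect, here is a minimal sketch using only java.net.URI from the JDK. It is not crawler4j's own code; the class name RelativeLinkDemo and the helper resolve are hypothetical names made up for this example.

```java
import java.net.URI;

// Hypothetical demo class; not part of crawler4j.
class RelativeLinkDemo {

    // Resolve an href found on a page against that page's URL.
    static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        // URI.resolve applies the standard relative-reference rules:
        // an href starting with "/" replaces the whole path of the base.
        return base.resolve(href).toString();
    }

    public static void main(String[] args) {
        String seed = "http://vimeo.com/search?q=lectures";
        System.out.println(resolve(seed, "/7160598")); // prints http://vimeo.com/7160598
        System.out.println(resolve(seed, "/terms"));   // prints http://vimeo.com/terms
    }
}
```

With this kind of resolution, /7160598 and /terms are handled the same way, so both would be expected to reach shouldVisit as absolute URLs.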

Original issue reported on code.google.com by sanjay.d...@gmail.com on 19 Aug 2012 at 6:23

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:25

GoogleCodeExporter commented 9 years ago
Good question.

But first, I seem to be getting a 403 response?

Original comment by avrah...@gmail.com on 26 Aug 2014 at 9:44

GoogleCodeExporter commented 9 years ago
OK, it seems that Vimeo blocks any user agent it does not recognize.

To crawl it, identify the crawler with a common browser user agent string 
(Firefox, for example).
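
As a rough sketch of what sending a browser-like User-Agent header looks like, the snippet below uses only the JDK (java.net), not crawler4j. The class name UserAgentDemo, the helper open, and the Firefox string are illustrative assumptions. In crawler4j itself the equivalent setting should be on CrawlConfig (setUserAgentString, assuming your version has it).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical demo class; not part of crawler4j.
class UserAgentDemo {

    // Example browser-style user agent string (an assumption for illustration).
    static final String BROWSER_UA =
            "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0";

    // Prepare a connection that will send the given User-Agent header.
    // openConnection() does not contact the server; that happens on connect().
    static HttpURLConnection open(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = open("http://vimeo.com/search?q=lectures", BROWSER_UA);
        // Show the header that would be sent with the request.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```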

In any case, the crawler does pick up those links.

This one is already fixed.

Original comment by avrah...@gmail.com on 26 Aug 2014 at 9:58