WebCrawler.shouldVisit doesn't get relative URLs

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Use http://vimeo.com/search?q=lectures as the seed URL.

What is the expected output? What do you see instead?
Links of the form <a href="/7160598"> are not followed. They should be 
followed as well.

What version of the product are you using?
3.3

Please provide any additional information below.
Relative links should be resolved against the page's base URL. Links such as 
/terms and /help on the same page are followed correctly.
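
For reference, a minimal sketch of what resolving a relative href against the base URL looks like, using only the JDK's `java.net.URI`. This is not crawler4j's own code; the class name `RelativeLinkDemo` and the helper `resolve` are made up for illustration.

```java
import java.net.URI;

public class RelativeLinkDemo {
    // Resolve an href found on a page against that page's URL,
    // following the standard relative-reference rules (RFC 3986).
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://vimeo.com/search?q=lectures";
        // Root-relative links such as /7160598 and /terms resolve against the host.
        System.out.println(resolve(page, "/7160598"));
        System.out.println(resolve(page, "/terms"));
        // An already absolute link is returned unchanged.
        System.out.println(resolve(page, "http://vimeo.com/help"));
    }
}
```

Under this scheme /7160598 and /terms are handled identically, so a crawler that follows one would be expected to follow the other.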

Original issue reported on code.google.com by sanjay.d...@gmail.com on 19 Aug 2012 at 6:23

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:25

Changed state: Accepted
Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Good question.

But first, it seems that I get a 403 response?

Original comment by avrah...@gmail.com on 26 Aug 2014 at 9:44

GoogleCodeExporter commented 9 years ago

OK, it seems that Vimeo blocks all user agents it does not recognize.

To crawl it, you should identify with a recognized browser user agent 
string (Firefox, for example).
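
As an illustration of sending a browser-style User-Agent header, here is a sketch using only the JDK's `java.net.http` API (Java 11+). It is not crawler4j code: the class name `UserAgentDemo`, the helper `buildRequest`, and the exact user agent string are assumptions for this example. In crawler4j itself the user agent is set on the crawl configuration (I believe via `CrawlConfig.setUserAgentString`, but check this against the version you use).

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class UserAgentDemo {
    // Example Firefox-style User-Agent value; an assumption for illustration only.
    static final String BROWSER_UA =
            "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0";

    // Build a GET request that carries the given User-Agent header.
    // The request is only constructed here, not sent.
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://vimeo.com/search?q=lectures", BROWSER_UA);
        // Print the header that would be sent with the request.
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```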

Anyway, it does get those links...

This one is already fixed.

Original comment by avrah...@gmail.com on 26 Aug 2014 at 9:58

Changed state: Invalid

xrma / crawler4j

WebCrawler.shouldVisit doesn't get relative URLs #170