xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

some relative path is ignored #179

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Seed the crawler with http://www.mindfirelabs.com/forum/viewforum.php?f=28
2. Restrict the crawl depth to 1.
3. Debug to see which URLs are returned by getOutgoingUrls.

What is the expected output? What do you see instead?
Only these URLs appear in the getOutgoingUrls list:
http://www.mindfirelabs.com/forum/feed.php, 
http://www.mindfirelabs.com/forum/feed.php?mode=forums, 
http://www.mindfirelabs.com/forum/feed.php?mode=topics, 
http://www.mindfirelabs.com/forum/style.php?id=4&lang=en_us, 
http://www.mindfirelabs.com/forum/index.php, 
http://www.mindfirelabs.com/forum/styles/art_deluxe/imageset/site_logo.png, 
http://www.mindfirelabs.com/forum/faq.php, 
http://www.mindfirelabs.com/forum/index.php, 
http://www.mindfirelabs.com/forum/index.php, 
http://www.phpbb.com/, 
http://www.artodia.com/]

The missing URLs are /viewforum.php?f=5 and all variations of it. These are the 
actual topics I am interested in.

What version of the product are you using?
3.3

Please provide any additional information below.
When I debug startElement in HtmlContentHandler, it is never hit for the URLs 
above. If you view the page source, other relative URLs in the same 
./<pagename> format are picked up, but the ones mentioned above are missing.

Original issue reported on code.google.com by amit.mal...@gmail.com on 16 Nov 2012 at 1:20
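
For reference, a relative link of the form ./<pagename> resolves against the page it was found on. The sketch below uses only the JDK's java.net.URI, not crawler4j's own URL handling, and the class and method names (RelativeLinkDemo, resolve) are made up for illustration. It only shows what absolute URL such a link should turn into; it does not reproduce how HtmlContentHandler processes links.

```java
import java.net.URI;

public class RelativeLinkDemo {
    // Resolve an href found in a page against the URL of that page.
    // Hypothetical helper: uses the JDK's URI.resolve, not crawler4j code.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // The seed URL from the report, used as the base for resolution.
        String page = "http://www.mindfirelabs.com/forum/viewforum.php?f=28";

        // One of the links reported as missing, and one that was found.
        System.out.println(resolve(page, "./viewforum.php?f=5"));
        System.out.println(resolve(page, "./faq.php"));
    }
}
```

Both links resolve to URLs under /forum/, so the relative form alone does not explain why one would be kept and the other dropped.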

GoogleCodeExporter commented 9 years ago
The issue is due to the way robots.txt was set up on the site.

Original comment by amit.mal...@gmail.com on 20 Nov 2012 at 9:47
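
To illustrate how a robots.txt rule can hide some links while leaving others visible, here is a minimal sketch of a Disallow check. Everything in it is an assumption: the thread does not show the site's actual robots.txt, so the rule and the example.com URLs are invented, and the class RobotsRuleDemo is not part of crawler4j. It handles only plain path prefixes in the "User-agent: *" group and ignores Allow lines, wildcards, and agent-specific groups.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsRuleDemo {
    // Collect Disallow path prefixes that belong to the "User-agent: *" group.
    // Simplified: no Allow lines, no wildcards, no agent-specific groups.
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("user-agent:")) {
                String agent = line.substring("user-agent:".length()).trim();
                inWildcardGroup = agent.equals("*");
            } else if (inWildcardGroup && lower.startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) {
                    prefixes.add(path);
                }
            }
        }
        return prefixes;
    }

    // A URL is allowed when its path starts with none of the disallowed prefixes.
    static boolean isAllowed(String url, List<String> prefixes) {
        String path = URI.create(url).getRawPath();
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical robots.txt content, not taken from the site in the report.
        String robotsTxt = "User-agent: *\nDisallow: /forum/viewforum.php\n";
        List<String> rules = disallowedPrefixes(robotsTxt);

        System.out.println(isAllowed("http://example.com/forum/viewforum.php?f=5", rules));
        System.out.println(isAllowed("http://example.com/forum/faq.php", rules));
    }
}
```

Under a rule like this, a crawler that honors robots.txt would skip every viewforum.php link regardless of its query string, while pages such as faq.php would still be followed.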

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:29