xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Crawler does not follow URLs like http://example.com/../../some.html #101

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
16:05:12,039 INFO  ~ Failed: HTTP/1.1 403 Forbidden, while fetching 
http://www.example.com/../../photos.html

It looks like the crawler does not resolve paths correctly and tries to access 
the directory "..", causing the server to respond with either 404 or 403.

The crawler should access the path http://www.example.com/photos.html in the 
above example.

If you paste the URL into a browser, the browser converts it to 
http://www.example.com/photos.html. There are a lot of websites with such 
malformed URLs.

Original issue reported on code.google.com by tahs...@trademango.com on 5 Jan 2012 at 12:10
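
The behaviour requested above corresponds to the dot-segment removal described in RFC 3986, section 5.2.4. The following is a minimal sketch of that idea, not crawler4j's actual fix: the class name `DotSegmentSketch` and both method names are hypothetical, and the code assumes an absolute http(s) URL. It is written by hand because `java.net.URI.normalize()` keeps leading ".." segments that have nothing before them to cancel.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration only; not part of crawler4j.
class DotSegmentSketch {

    // Drops "." and ".." segments from an absolute path.
    // A ".." at the root is discarded instead of climbing above "/".
    static String removeDotSegments(String path) {
        Deque<String> kept = new ArrayDeque<>();
        String[] segments = path.split("/", -1);
        // segments[0] is the empty string before the leading "/".
        for (int i = 1; i < segments.length; i++) {
            String s = segments[i];
            boolean last = (i == segments.length - 1);
            if (s.equals(".")) {
                if (last) kept.addLast("");      // keep the trailing slash
            } else if (s.equals("..")) {
                if (!kept.isEmpty()) kept.removeLast();
                if (last) kept.addLast("");      // keep the trailing slash
            } else {
                kept.addLast(s);
            }
        }
        return "/" + String.join("/", kept);
    }

    // Rebuilds the URL from its raw parts so existing percent-encoding is untouched.
    static String normalize(String url) {
        URI u = URI.create(url);
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme()).append("://").append(u.getRawAuthority());
        sb.append(removeDotSegments(u.getRawPath()));
        if (u.getRawQuery() != null) sb.append('?').append(u.getRawQuery());
        if (u.getRawFragment() != null) sb.append('#').append(u.getRawFragment());
        return sb.toString();
    }
}
```

Under these assumptions, `DotSegmentSketch.normalize("http://www.example.com/../../photos.html")` returns `http://www.example.com/photos.html`, which matches what a browser does with the same address.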

GoogleCodeExporter commented 9 years ago
Thanks for reporting. Will publish the fix soon.

-Yasser

Original comment by ganjisaffar@gmail.com on 5 Jan 2012 at 1:45

GoogleCodeExporter commented 9 years ago
The fix is committed to the source repository: 
http://code.google.com/p/crawler4j/source/detail?r=dbc9d3cb0d1efde4431f68b8417ee2ed5d551a43

It will be included in the next release.

-Yasser

Original comment by ganjisaffar@gmail.com on 6 Jan 2012 at 5:53

GoogleCodeExporter commented 9 years ago
Some issues occurred with URLCanonicalizer:
1. URLCanonicalizer appended an equals sign "=" to my URL, producing 
http://somedomain.com/uploads/1/0/2/5/10259653/6199347.jpg?1325154037= when my 
URL was http://somedomain.com/uploads/1/0/2/5/10259653/6199347.jpg?1325154037.
2. When a redirect happens, the fetchHeader method in PageFetcher converts the 
whole redirected URL to lowercase.

Original comment by mansur.u...@gmail.com on 8 Jan 2012 at 4:43
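
Both points above can be illustrated with a small sketch. It is not the URLCanonicalizer or PageFetcher code: the class `QueryAndCaseSketch` and its methods are hypothetical. It assumes an http(s) URL with a plain host name (user info and fragments are ignored) and that a canonicalizer wants to sort query parameters. Each parameter is re-emitted exactly as written, so a bare key gains no "=", and only the scheme and host are lowercased, since paths and queries are case-sensitive.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of crawler4j.
class QueryAndCaseSketch {

    // Sorts the raw "&"-separated parameters lexicographically and joins them
    // unchanged, so "1325154037" stays "1325154037" rather than "1325154037=".
    static String canonicalQuery(String rawQuery) {
        if (rawQuery == null || rawQuery.isEmpty()) return "";
        List<String> params = new ArrayList<>();
        for (String p : rawQuery.split("&")) {
            if (!p.isEmpty()) params.add(p);   // skip empty pieces such as "a&&b"
        }
        Collections.sort(params);
        return String.join("&", params);
    }

    // Lowercases only the scheme and host; path and query keep their case.
    static String canonicalize(String url) {
        URI u = URI.create(url);
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append("://");
        sb.append(u.getHost().toLowerCase(Locale.ROOT));
        if (u.getPort() != -1) sb.append(':').append(u.getPort());
        sb.append(u.getRawPath());
        if (u.getRawQuery() != null) sb.append('?').append(canonicalQuery(u.getRawQuery()));
        return sb.toString();
    }
}
```

With this sketch, `canonicalize("HTTP://Example.com/Photos.html?b=2&a=1")` yields `http://example.com/Photos.html?a=1&b=2`, and the image URL from point 1 comes back with its query string unchanged.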

GoogleCodeExporter commented 9 years ago
[deleted comment]

GoogleCodeExporter commented 9 years ago
This is not fixed.

Original comment by alexnosp...@gmail.com on 2 Feb 2013 at 11:05

GoogleCodeExporter commented 9 years ago
Add a simple System.out.println(page.getWebURL().getURL()); to see which URL 
crawler4j is visiting. So bad.

Original comment by alexnosp...@gmail.com on 2 Feb 2013 at 11:06

GoogleCodeExporter commented 9 years ago
I get java.lang.StringIndexOutOfBoundsException: String index out of range: -8 
when running CrawlNow. It seems crawler4j is completely unusable for me now.

Original comment by alexnosp...@gmail.com on 3 Feb 2013 at 1:02

GoogleCodeExporter commented 9 years ago
This issue was closed by revision 183d98a269db.

Original comment by ganjisaffar@gmail.com on 3 Mar 2013 at 7:08