Open GoogleCodeExporter opened 9 years ago
Thank you George.
Can you give me a specific URL you were trying to crawl where the links you
expected to be crawled were ignored?
Original comment by avrah...@gmail.com
on 19 Aug 2014 at 7:57
Yes, of course. For example:
https://formulier.denhaag.nl/Tripleforms/formulierenoverzicht/DefaultEnvironment.aspx
It should yield ~336 crawled pages, but the links are now ignored.
Note: it is an HTTPS page, and I have used the fix shown in
https://code.google.com/p/crawler4j/issues/detail?id=174
Original comment by jorgehor...@gmail.com
on 19 Aug 2014 at 8:05
I will check it, but as I have not yet approved and/or merged the code from
issue 174, we might get different results.
Anyway, this is a good use case and I will check it out.
This issue might take longer as I have several other issues on my plate which
are planned to be fixed before this issue.
Original comment by avrah...@gmail.com
on 19 Aug 2014 at 8:07
Do not worry; for now, as a workaround, I can store these pages myself.
Original comment by jorgehor...@gmail.com
on 19 Aug 2014 at 8:13
After some digging I have concluded the following:
1. This site needs the fix for issue 174.
2. After implementing the fix suggested in issue 174, still only a few pages
get crawled, because the crawler thinks it has already crawled those pages!
This is a weird bug.
It seems that the pageFetcher's httpClient gets the wrong statusCode (302
instead of 200) from this call:
HttpResponse response = httpClient.execute(get);
All of that code needs serious upgrading, as many methods there are deprecated.
I have opened a new issue for that: issue 302.
I hope solving that issue will help solve this bug.
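To check the 302-versus-200 observation independently of crawler4j, a minimal sketch like the following could be used. It fetches a URL without following redirects and reports the raw status code and the resolved Location target. This is an assumption-laden illustration, not crawler4j code: the class name `RedirectProbe` and its methods are hypothetical, and it uses the JDK's `HttpURLConnection` rather than the Apache HttpClient that the pageFetcher uses.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical diagnostic helper; not part of crawler4j.
public class RedirectProbe {

    /** True for the status codes that redirect via a Location header. */
    public static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Resolves a (possibly relative) Location header against the requested URL. */
    public static String resolveLocation(String requestedUrl, String location) {
        return URI.create(requestedUrl).resolve(location).toString();
    }

    /** Fetches the URL without following redirects and describes the raw response. */
    public static String probe(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        // Report the first response as-is instead of silently following it.
        conn.setInstanceFollowRedirects(false);
        conn.setRequestMethod("GET");
        try {
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            if (isRedirect(status) && location != null) {
                return status + " -> " + resolveLocation(url, location);
            }
            return String.valueOf(status);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the URLs to inspect as command-line arguments.
        for (String url : args) {
            System.out.println(url + " : " + probe(url));
        }
    }
}
```

Running it against the URL from this issue would show whether the server itself answers with a redirect, and where that redirect points, before any crawler logic is involved.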
Original comment by avrah...@gmail.com
on 2 Sep 2014 at 12:38
Thanks for the feedback. I will be waiting for this fix.
Original comment by jorgehor...@gmail.com
on 2 Sep 2014 at 1:25
I have fixed issue 174 using a different approach.
But still this problem remains.
I have asked for help with issue 302. I hope someone will help us there, and
maybe that will fix this issue.
Original comment by avrah...@gmail.com
on 15 Sep 2014 at 2:35
Other issues have been fixed and this one still remains.
A careful debugging session is in order.
Original comment by avrah...@gmail.com
on 17 Sep 2014 at 2:15
I see this issue is still present in crawler4j-4.1. An example where I get 302
instead of 200 is http://www.kmobgyn.com, but I've found a bunch more.
Original comment by and...@olariu.org
on 30 Mar 2015 at 2:35
Original issue reported on code.google.com by
jorgehor...@gmail.com
on 19 Aug 2014 at 7:54