yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

shouldVisit is confused #161

Open seanlei opened 7 years ago

seanlei commented 7 years ago

Hello. I added some seed URLs that respond with a 301 status code, then checked the code in the WebCrawler class. It calls the shouldVisit method to check the target URL of the 301 redirect. I thought shouldVisit was only meant to check outgoing URLs, but it is also used to check the redirect target of a seed URL. This means that if a seed URL returns a 301 redirect and the target does not pass shouldVisit, all outgoing URLs of the redirect target are lost.
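To make the problem concrete, here is a minimal sketch of the decision I would have expected: a redirect coming from a seed URL is always followed, and the normal filter applies only to ordinary outgoing links. It uses plain Java and does not depend on crawler4j. The class `RedirectAwareFilter`, its methods, and the prefix-based filter are all hypothetical names chosen for illustration, not part of the library.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, NOT part of crawler4j: it only models the decision.
final class RedirectAwareFilter {
    private final Set<String> seedUrls = new HashSet<>();
    private final String allowedPrefix;

    RedirectAwareFilter(String allowedPrefix) {
        this.allowedPrefix = allowedPrefix;
    }

    // Remember every seed so redirects that start from it can be recognized.
    void addSeed(String url) {
        seedUrls.add(url);
    }

    // referringUrl: the page that produced url, either by linking to it
    // or by redirecting to it (referrerIsRedirect == true).
    boolean shouldVisit(String referringUrl, String url, boolean referrerIsRedirect) {
        // A redirect from a seed is always followed, so the links behind it are not lost.
        if (referrerIsRedirect && seedUrls.contains(referringUrl)) {
            return true;
        }
        // Ordinary outgoing links go through the normal filter (assumed here: a URL prefix).
        return url.startsWith(allowedPrefix);
    }
}
```

In a real crawler the same check would presumably sit at the top of an overridden `shouldVisit(Page referringPage, WebURL url)`, if the referring page exposes whether it was a redirect in your version of the library. I have not verified that against every release.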

soq2000 commented 7 years ago

Hi all, I also have an observation about shouldVisit. Why is the callback function visit called fewer times than shouldVisit returns true? I have already increased MaxPagesToFetch, but the crawler still stops without visiting all the pages that need visiting (shouldVisit = true). Or am I misunderstanding something? Many thanks.
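One way to narrow this down is to record which URLs were accepted and which were actually visited, then look at the difference. As far as I understand, a URL accepted by shouldVisit is only scheduled, and visit runs later, once the page has been fetched and parsed successfully, so a gap between the two counts is possible. The sketch below is plain Java and does not depend on crawler4j. `CrawlTally` and its methods are hypothetical names, and it assumes you call `recordAccepted` where your shouldVisit returns true and `recordVisited` inside visit.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical diagnostic helper, NOT part of crawler4j.
final class CrawlTally {
    // Concurrent sets, because several crawler threads may report at the same time.
    private final Set<String> accepted = ConcurrentHashMap.newKeySet();
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    // Call when shouldVisit returns true for this URL.
    void recordAccepted(String url) {
        accepted.add(url);
    }

    // Call from visit for the page's URL.
    void recordVisited(String url) {
        visited.add(url);
    }

    // URLs that were accepted but never reached visit, sorted for readable output.
    Set<String> acceptedButNotVisited() {
        Set<String> missing = new TreeSet<>(accepted);
        missing.removeAll(visited);
        return missing;
    }
}
```

Because the sets hold unique URLs, this also shows whether the raw count of shouldVisit calls is inflated by the same URL being offered more than once.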