Closed pacmandv closed 7 years ago
I have the same issue/question. Hope the creator will give some pointers
Hello,
The second parameter of the constructor
$crawler = new \Arachnid\Crawler($url, $linkDepth);
indicates how many levels of links the crawler follows,
so if $linkDepth == 3
it will crawl the front page itself, the links found on it (level 2), and the links found on those child pages (level 3),
similar to this test:
https://github.com/zrashwani/arachnid/blob/master/tests/src/CrawlerTest.php#L295
Do you have a case where $linkDepth is not working properly, so I can investigate?
Hello @zrashwani,
I have checked different depths, and it seems the crawler doesn't go as deep as I need. I tried a path like
https://www.site.com/products/Category/SubCategory/ProductName.html
but the script only reached https://www.site.com/products/Category/SubCategory
Live example: http://www.6pm.com/
With depth 3 it gets the deepest, but it still does not visit all links.
In my case:
depth = 1; count of visited links = 197
depth = 2; count of visited links = 212
depth = 3; count of visited links = 213
depth = 4; count of visited links = 213
depth = 5; count of visited links = 213
depth = 6; count of visited links = 213
Thx.
hello @pacmandv thank you for the info, I will investigate shortly
Hello, I changed the crawler to use breadth-first search instead of depth-first, so I think the problem is fixed now. I verified it by crawling http://toastytech.com/ and got the following results:
--- count links by depth ---
"level#1" => 9
"level#2" => 175
"level#3" => 679
"level#4" => 1080
"level#5" => 708
@pacmandv please confirm whether that works on your site
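For anyone reading along, here is a minimal sketch of what a depth-limited breadth-first crawl looks like. It is not the library's actual code: it walks a hypothetical in-memory link map (`$site`) instead of fetching pages, the function name `crawlBreadthFirst` is made up for illustration, and it assumes the numbering from the earlier reply, where the front page itself counts as level 1.

```php
<?php
declare(strict_types=1);

/**
 * Breadth-first traversal of a link graph, limited to $maxDepth levels.
 * Level 1 is the start page, level 2 the pages it links to, and so on.
 *
 * @param array<string, string[]> $links hypothetical map: page => outgoing links
 * @return array<string, int> page => level at which it was first reached
 */
function crawlBreadthFirst(array $links, string $start, int $maxDepth): array
{
    if ($maxDepth < 1) {
        return [];
    }

    $levels = [$start => 1];
    $queue = new SplQueue();
    $queue->enqueue($start);

    while (!$queue->isEmpty()) {
        $page = $queue->dequeue();
        $level = $levels[$page];

        if ($level >= $maxDepth) {
            continue; // links on this page would exceed the depth limit
        }

        foreach ($links[$page] ?? [] as $child) {
            // Record each page once, at the shallowest level that reaches it.
            if (!isset($levels[$child])) {
                $levels[$child] = $level + 1;
                $queue->enqueue($child);
            }
        }
    }

    return $levels;
}

// Hypothetical site structure, modelled on the path from the report above.
$site = [
    '/' => ['/products', '/about'],
    '/products' => ['/products/Category', '/'],
    '/products/Category' => ['/products/Category/SubCategory'],
    '/products/Category/SubCategory' => ['/products/Category/SubCategory/ProductName.html'],
];

$levels = crawlBreadthFirst($site, '/', 3);
$countsByLevel = array_count_values($levels); // level => number of pages
print_r($countsByLevel);
```

Because the queue hands pages out in order of distance from the start, every page is recorded at its shallowest level, so raising the depth limit in this sketch can only add pages, never drop them.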
Hello @zrashwani,
I checked it; yes, it works well now.
Thx.
Thanks for the confirmation, I will close the issue then.
Hi.
Thanks for the script.
I can't find how to scan a site deeper. I mean there is a front page like https://example.com, and on that page there are links to other pages, which in turn contain links to further pages. In the code below, the crawler visits only the pages linked from the front page, but not the links inside those pages.
E.g. the front page has a link to https://example.com/links, and that page contains a few links, but the script doesn't visit them.
It's possible to modify the code above, but if there is an out-of-the-box solution, that would be better.
Thx