zrashwani / arachnid

Crawl all unique internal links found on a given website, and extract SEO related information - supports javascript based sites
MIT License

Crawl the whole site, including pages linked from other pages. #21

Closed pacmandv closed 7 years ago

pacmandv commented 7 years ago

Hi.

Thanks for the script.

I can't find how to scan the site deeper. For example, there is a front page like https://example.com, and it links to other pages that in turn contain links to further pages. With the code below, the crawler only visits the links found on the front page, not the links inside those pages.

E.g. the front page links to https://example.com/links, and that page contains a few links, but the script doesn't visit them.

<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;

set_time_limit(6000);

$linkDepth = 500;
// Initiate crawl

$crawler = new \Arachnid\Crawler("https://example.com", $linkDepth);
$crawler->traverse();

// Get link data
$links = $crawler->getLinks();

It's possible to modify the code above, but if an out-of-the-box solution exists, that would be better.

Thx

mkantautas commented 7 years ago

I have the same issue/question. I hope the creator will give some pointers.

zrashwani commented 7 years ago

Hello. The second parameter of the constructor, $crawler = new \Arachnid\Crawler($url, $linkDepth);, sets how many levels of links the crawler visits. So if $linkDepth == 3, it crawls the front page itself, the child links found on it (level 2), and the links found inside those child pages (level 3), similar to this test: https://github.com/zrashwani/arachnid/blob/master/tests/src/CrawlerTest.php#L295

Do you have a case where $linkDepth is not working properly, so I can investigate?

pacmandv commented 7 years ago

Hello @zrashwani,

I have checked different depths, and it seems the crawler doesn't go as deep as I need. I tried a path like

https://www.site.com/products/Category/SubCategory/ProductName.html

The script only reached https://www.site.com/products/Category/SubCategory

Live example: http://www.6pm.com/

With depth 3 it goes deepest, but it still does not visit all links.

In my case:

depth = 1; count of visited links = 197
depth = 2; count of visited links = 212
depth = 3; count of visited links = 213
depth = 4; count of visited links = 213
depth = 5; count of visited links = 213
depth = 6; count of visited links = 213

Thx.

zrashwani commented 7 years ago

Hello @pacmandv, thank you for the info. I will investigate shortly.

zrashwani commented 7 years ago

Hello, I changed the crawler to use breadth-first search instead of depth-first. I think the problem is fixed now. I verified by crawling http://toastytech.com/ and got the following results:

--- count links by depth ---
"level#1" => 9
"level#2" => 175
"level#3" => 679
"level#4" => 1080
"level#5" => 708

@pacmandv, please confirm whether this works on your site.
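
For readers following along, here is a minimal sketch of a breadth-first, depth-limited crawl, written against a hypothetical in-memory link graph instead of real HTTP requests. It is not the library's actual implementation; the function names crawlBreadthFirst and countByDepth, the sample $graph, and the convention that the start page is depth 0 are all assumptions made for illustration, and the library's own level numbering may differ. The idea it shows is general: breadth-first search discovers every page at its shortest distance from the start page. A depth-first walk that shares one visited set can instead mark a page as visited at the depth limit, so that page's links are never followed even when a shorter path to it exists.

```php
<?php
declare(strict_types=1);

/**
 * Breadth-first, depth-limited traversal over an in-memory link graph.
 * $graph maps a page URL to the URLs it links to (hypothetical data that
 * stands in for fetching and parsing real pages).
 * Returns [url => depth at which the URL was first discovered];
 * the start page is depth 0 (an assumed convention).
 */
function crawlBreadthFirst(array $graph, string $start, int $maxDepth): array
{
    $depthOf = [$start => 0];
    $queue = new SplQueue();
    $queue->enqueue($start);

    while (!$queue->isEmpty()) {
        $url = $queue->dequeue();
        $depth = $depthOf[$url];
        if ($depth >= $maxDepth) {
            continue; // do not follow links beyond the depth limit
        }
        foreach ($graph[$url] ?? [] as $child) {
            if (!isset($depthOf[$child])) {
                // First discovery is always via a shortest path in BFS.
                $depthOf[$child] = $depth + 1;
                $queue->enqueue($child);
            }
        }
    }

    return $depthOf;
}

/** Counts how many URLs were discovered at each depth, ordered by depth. */
function countByDepth(array $depthOf): array
{
    $counts = array_count_values($depthOf);
    ksort($counts);
    return $counts;
}

// Hypothetical site: '/links/a' is reachable both from '/links' and '/about'.
$graph = [
    '/'        => ['/links', '/about'],
    '/links'   => ['/links/a', '/links/b', '/'],
    '/about'   => ['/links/a'],
    '/links/a' => ['/links/a/deep'],
];

print_r(countByDepth(crawlBreadthFirst($graph, '/', 2)));
```

With a limit of 2, the sketch stops before '/links/a/deep'; raising the limit to 3 picks it up. Comparing the per-depth counts for increasing limits, as done in this thread, is a quick way to check whether deeper pages are actually being reached.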

pacmandv commented 7 years ago

Hello @zrashwani,

I checked it, and yes, it works well now.

Thx.

zrashwani commented 7 years ago

Thanks for the confirmation, I will close the issue then.