zrashwani / arachnid

Crawl all unique internal links found on a given website, and extract SEO related information - supports javascript based sites
MIT License

filterLinks does not work. #29

Closed dearste closed 6 years ago

dearste commented 7 years ago

Hi, filterLinks does not work in this example:

$url = "http://uk.louisvuitton.com/eng-gb/men/men-s-bags/fashion-shows/_/N-54s1t";
$crawler = new Crawler($url, 2);
$links = $crawler
    ->filterLinks(function ($link) {
        return (bool) preg_match('/\/eng-gb\/products\/(.*)/', $link);
    })
    ->traverse()
    ->getLinks();

What is wrong?

dearste commented 7 years ago

To fix this, I modified this line in the traverseSingle method:

if ($filterLinks !== null && $filterLinks($url) === false && isset($this->links[$hash])) {

Is this right?

zrashwani commented 7 years ago

This happens because your filterLinks() callback returns false for your base URL, so the crawler never crawls link level 1 in the first place. If you change your condition to something like this:

return (bool) preg_match('/\/eng-gb\/products\/(.*)/', $link)
    || $link == "http://uk.louisvuitton.com/eng-gb/men/men-s-bags/fashion-shows/_/N-54s1t";

or

return (bool) preg_match('/\/eng-gb\/products\/(.*)/', $link)
    || (bool) preg_match('/\/eng-gb\/men\/(.*)/', $link);

you will get the required result, as shown below:

array:68 [
  0 => "http://uk.louisvuitton.com/eng-gb/men/men-s-bags/fashion-shows/_/N-54s1t"
  1 => "/eng-gb/products/keepall-45-bandouliere-monogram-eclipse-014386"
  2 => "/eng-gb/products/keepall-55-bandouliere-monogram-eclipse-014387"
  3 => "/eng-gb/products/pochette-voyage-mm-monogram-eclipse-014395"
  4 => "/eng-gb/products/christopher-pm-epi-nvprod520155v"
  5 => "/eng-gb/products/christopher-pm-epi-nvprod520156v"
  6 => "/eng-gb/products/danube-pm-epi-nvprod520153v"
  7 => "/eng-gb/products/danube-pm-epi-nvprod520154v"
  8 => "/eng-gb/products/east-side-tote-mm-taurillon-nvprod520157v"
  9 => "/eng-gb/products/east-side-duffle-bag-taurillon-nvprod520158v"
  10 => "/eng-gb/products/danube-pm-taurillon-nvprod520161v"
....
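
To see why the original callback blocks the crawl, it may help to run both filters against the base URL and one product link outside the crawler. This is only a minimal standalone sketch: it does not call Arachnid, and the variable names ($baseUrl, $productLink, $original, $revised) are made up for illustration. It assumes, as described above, that the crawler applies the callback to the base URL as well.

```php
<?php
// Hypothetical standalone check; it does not use Arachnid itself.
$baseUrl = "http://uk.louisvuitton.com/eng-gb/men/men-s-bags/fashion-shows/_/N-54s1t";
$productLink = "/eng-gb/products/keepall-45-bandouliere-monogram-eclipse-014386";

// Filter from the original report: accepts product links only.
$original = function ($link) {
    return (bool) preg_match('/\/eng-gb\/products\/(.*)/', $link);
};

// Revised filter: also accepts pages under /eng-gb/men/, which covers the base URL.
$revised = function ($link) {
    return (bool) preg_match('/\/eng-gb\/products\/(.*)/', $link)
        || (bool) preg_match('/\/eng-gb\/men\/(.*)/', $link);
};

// Print the decision each filter makes for both links.
foreach (['original' => $original, 'revised' => $revised] as $name => $filter) {
    printf(
        "%s: base=%s product=%s\n",
        $name,
        var_export($filter($baseUrl), true),
        var_export($filter($productLink), true)
    );
}
```

The original filter rejects the base URL while accepting the product link, which matches the explanation above: the starting page is filtered out before any of its links can be collected. The revised filter accepts both.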
zrashwani commented 6 years ago

Closing for now due to no response. Feel free to leave a comment if you still face the problem.