zrashwani / arachnid

Crawl all unique internal links found on a given website, and extract SEO related information - supports javascript based sites
MIT License

Sites like LinkedIn should probably be excluded #28

Closed mkantautas closed 6 years ago

mkantautas commented 7 years ago

Large social sites that require a login to be accessed should be excluded from the results, so that only genuinely broken links are reported. Currently, sites like LinkedIn produce false positives.

zrashwani commented 6 years ago

Hello @neorganic, sorry for the late reply. I think the best approach is to prevent the scraper from traversing LinkedIn links by using the filterLinks method, similar to the following:


```php
$links = $client->filterLinks(function ($link) {
        // keep only links that do not point to LinkedIn
        return strpos($link, "www.linkedin.com") === false;
    })
    ->traverse()
    ->getLinks();
```
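
If more than one login-walled site needs to be skipped, the same idea could be generalized into a reusable callback. The following is only a sketch: makeHostExclusionFilter is a hypothetical helper, not part of Arachnid, and the host list (including facebook.com) is an assumed example. It compares the parsed host rather than searching the whole URL, so links without a scheme or host (for example relative links) are always kept.

```php
<?php
// Hypothetical helper (not part of Arachnid): returns a callback that keeps
// a link only when its host is not on the given blocklist.
function makeHostExclusionFilter(array $blockedHosts): callable
{
    $blocked = array_map('strtolower', $blockedHosts);

    return function ($link) use ($blocked): bool {
        $host = parse_url((string) $link, PHP_URL_HOST);
        if (!is_string($host)) {
            // Relative or unparsable links have no host; keep them.
            return true;
        }
        $host = strtolower($host);
        foreach ($blocked as $domain) {
            $suffix = '.' . $domain;
            // Reject the domain itself and any of its subdomains.
            if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
                return false;
            }
        }
        return true;
    };
}

// Assumed example list of login-walled sites.
$keepLink = makeHostExclusionFilter(['linkedin.com', 'facebook.com']);
```

Assuming filterLinks accepts any callable in the same way it accepts the closure above, $keepLink could be passed to it in place of the inline function.
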
zrashwani commented 6 years ago

Closing for now; let me know if you still face the issue.