zrashwani / arachnid

Crawl all unique internal links found on a given website, and extract SEO related information - supports javascript based sites
MIT License

Absolute links and the actual URLs are rendered incorrectly in some cases. #24

Open mkantautas opened 7 years ago

mkantautas commented 7 years ago

For example, crawling the page http://toastytech.com/evil/ with $linkDepth = 2; gives a lot of incorrect URLs. You may say that this webpage is very old and that no one writes relative URLs like "../yourUrlPath" anymore, but I think this should still be fixed :)

"/evil/../links/index.html" => array:14 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://toastytech.com/evil/../links/index.html"
    "external_link" => false
    "visited" => true
    "frequency" => 1
    "source_link" => "http://toastytech.com/evil/"
    "depth" => 1
    "status_code" => 200
    "title" => "Nathan's Links"
    "meta_keywords" => ""
    "meta_description" => ""
    "h1_count" => 1
    "h1_contents" => array:1 [ …1]

zrashwani commented 7 years ago

Hello @neorganic, thank you for pointing out this issue.

This is now fixed in this commit; the output for the link you mentioned is now as follows:

    [/index.html] => Array
        (
            [original_urls] => Array
                (
                    [../index.html] => ../index.html
                )

            [links_text] => Array
                (
                    [] => 
                    [Back to my Toasty Technology Page] => Back to my Toasty Technology Page
                )

            [absolute_url] => http://toastytech.com/index.html
            [external_link] => 
            [visited] => 1
            [frequency] => 3
            [source_link] => http://toastytech.com/evil/
            [depth] => 1
            [status_code] => 200
            [title] => Nathan's Toasty Technology Page
            [meta_keywords] => 
            [meta_description] => 
            [h1_count] => 0
            [h1_contents] => Array
                (
                )

        )

Let me know if there are any other cases.

mkantautas commented 7 years ago

  "http://meyrovich.com/2016/10/06/vegan-pesto/" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=facebook" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=twitter" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=tumblr" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=pinterest" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=linkedin" => array:8 [▶]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=reddit" => array:8 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://meyrovich.com/2016/10/06/vegan-pesto/?share=reddit"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "http://meyrovich.com/category/food/"
    "depth" => 2
  ]
  "http://meyrovich.com/2016/10/06/vegan-pesto/?share=google-plus-1" => array:8 [▶]
  "http://meyrovich.com/author/admin/page/2/" => array:8 [▶]
  "https://www.facebook.com/policies/cookies/" => array:8 [▶]
  "https://www.facebook.com/recover/initiate?lwv=100" => array:8 [▶]
  "/reg/" => array:8 [▶]
  "/home" => array:8 [▶]
  "https://twitter.com/signup?context=webintent&follow=meyrovich_" => array:8 [▶]
  "/account/begin_password_reset" => array:8 [▶]
  "//support.twitter.com/groups/31-twitter-basics/topics/104-welcome-to-twitter-support/articles/215585-twitter-101-how-should-i-get-started-using-twitter" => array:8 [▶]
  "https://www.tumblr.com/login?redirect_to=https%3A%2F%2Fwww.tumblr.com%2Fwidgets%2Fshare%2Ftool%3FshareSource%3Dlegacy%26canonicalUrl%3D%26url%3Dhttp%253A%252F%252Fmeyrovich.com%252F2016%252F12%252F24%252Ffarmdrop-christmas-feast%252F%26title%3DFarmdrop%2BChristmas%2BFeast%26_format%3Dhtml%26sequence%3Dpreview" => array:8 [▶]
  "https://www.pinterest.com/_/_/about/cookie-policy/" => array:8 [▶]
  "/_/_/about/terms-service/" => array:8 [▶]
  "/_/_/about/privacy/plain.html" => array:8 [▶]
  "/_/_/about/" => array:8 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://meyrovich.com/_/_/about/"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "http://meyrovich.com/2016/12/24/farmdrop-christmas-feast/?share=pinterest"
    "depth" => 2
  ]
  "/_/_/blog/" => array:8 [▶]
  "/_/_/business/" => array:8 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://meyrovich.com/_/_/business/"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "http://meyrovich.com/2016/12/24/farmdrop-christmas-feast/?share=pinterest"
    "depth" => 2
  ]
  "/_/_/about/privacy/" => array:8 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://meyrovich.com/_/_/about/privacy/"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "http://meyrovich.com/2016/12/24/farmdrop-christmas-feast/?share=pinterest"
    "depth" => 2
  ]

Hey, yeah, so the base URL is http://meyrovich.com, and I mainly see two problems here: the links with ///..., and the fact that the crawler follows the ?share links. I think these, and probably other URLs that contain ?, should be excluded from the list?

zrashwani commented 5 years ago

Hello @neorganic, a major rewrite of the library has now been done, and the dots issue should be fixed: PSR-7, specifically the \GuzzleHttp\Psr7\UriResolver::removeDotSegments method, is now used to normalize URLs and remove dot segments from them.
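
To illustrate the kind of normalization involved, here is a minimal standalone sketch of dot-segment removal in the spirit of RFC 3986, section 5.2.4. It is not the library's or Guzzle's code: the function name `normalizeDotSegments` is hypothetical, and it only handles the path part of a URL.

```php
<?php
// Hypothetical helper: collapses "." and ".." segments in a URL path.
function normalizeDotSegments(string $path): string
{
    if ($path === '' || $path === '/') {
        return $path;
    }

    $segments = explode('/', $path);
    $kept = [];
    foreach ($segments as $segment) {
        if ($segment === '..') {
            array_pop($kept);          // step back one level
        } elseif ($segment !== '.') {
            $kept[] = $segment;        // ordinary segment, keep it
        }
    }

    $result = implode('/', $kept);
    $last = end($segments);

    if ($path[0] === '/' && ($result === '' || $result[0] !== '/')) {
        $result = '/' . $result;       // keep absolute paths absolute
    } elseif ($result !== '' && ($last === '.' || $last === '..')) {
        $result .= '/';                // a trailing dot segment refers to a directory
    }

    return $result;
}

echo normalizeDotSegments('/evil/../links/index.html'), PHP_EOL; // /links/index.html
echo normalizeDotSegments('/evil/../index.html'), PHP_EOL;       // /index.html
```

The two calls use the paths reported at the top of this issue.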

Also, to exclude links matching specific patterns from the list, you can use the filterLinks method, as mentioned in the readme file. I think that approach is better, as some sites depend on query strings to serve different documents.
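
As a sketch of that approach, the predicate below rejects any URL carrying a `share` query parameter, like the `?share=...` links reported above. It assumes, as an unverified reading of the readme, that `filterLinks` accepts a callable which receives each link and returns true to keep it. The predicate itself is plain PHP and is run here against a hard-coded list rather than a live crawl.

```php
<?php
// Hypothetical predicate: keep a link unless its query string has a "share" parameter.
$keepLink = static function ($link): bool {
    $query = parse_url((string) $link, PHP_URL_QUERY);
    if (!is_string($query)) {
        return true; // no query string (or unparsable URL): nothing to exclude
    }
    parse_str($query, $params);
    return !array_key_exists('share', $params);
};

// Sample URLs taken from the output earlier in this thread.
$urls = [
    'http://meyrovich.com/2016/10/06/vegan-pesto/',
    'http://meyrovich.com/2016/10/06/vegan-pesto/?share=facebook',
    'http://meyrovich.com/2016/10/06/vegan-pesto/?share=reddit',
    'http://meyrovich.com/author/admin/page/2/',
];

$kept = array_values(array_filter($urls, $keepLink));
print_r($kept);

// Assumed usage with the crawler (not verified against the current API):
// $links = $crawler->filterLinks($keepLink)->traverse()->getLinks();
```

Matching on one named parameter rather than on any `?` keeps pages that legitimately depend on query strings in the crawl.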

Also, @neorganic, can you check the new version of the library and confirm whether the issues are fixed in it?

LunarDevelopment commented 5 years ago

Love the library, bravo!

Some interesting results from the latest version of this library, results then code below (I recognise the site being scanned isn't a perfectly marked up site and may be out of scope for the project):

<li itemprop=name><a href=/operational-support-agreement.pdf title="Siteshield" itemprop=url>Siteshield</a></li>

Outputs: Broken: /operational-support-agreement.pdf from https://www.espressoweb.co.uk/

Other instances where it seems the full url has been interpreted okay, but the protocol has not been assumed correctly:

<h2 class=h3><a class="color-grey" href="social-marketing" title="Social Media Marketing from Espresso Web">Engaging Social</a></h2>

Outputs: Broken: http://www.espressoweb.co.uk/social-marketing from http://www.espressoweb.co.uk AND Broken: http://www.espressoweb.co.uk/social-marketing from http://www.espressoweb.co.uk/social-marketing

<?php
require '../vendor/autoload.php';

$time_start = microtime(true);

$url = 'https://www.espressoweb.co.uk/';
$linkDepth = 10;
// Initiate crawl
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->traverse();

// Get link data
$links = $crawler->getLinks();

$collection = new  \Arachnid\LinksCollection($links);

//getting broken links
$brokenLinks = $collection->getBrokenLinks();

foreach($brokenLinks as $i =>  $link) {
    echo "++++++++++++++++++++++++++++++". PHP_EOL ;
    echo "Broken: " . $i . " from " . $link->getParentUrl() . PHP_EOL ;
//    echo json_encode($links) . PHP_EOL ;
}

$time_end = microtime(true);
$execution_time = ($time_end - $time_start) / 60;
$execution_time = number_format((float)$execution_time, 2);

echo "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++" . PHP_EOL;
echo "Complete!" . PHP_EOL;
echo 'Total Execution Time: ' . $execution_time . ' Mins' . PHP_EOL;
echo "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++" . PHP_EOL;

die;

zrashwani commented 5 years ago

Hello @LunarDevelopment, thank you for reporting this issue. Your sample script drew my attention to other issues as well: there was a problem in the LinksCollection method getBrokenLinks. It checked for 2xx status codes, but it classified links as broken when the status code was a 3xx redirect, which is wrong.

I have pushed a commit that should handle some cases, and I will test it over the next few days to make sure all such cases are fixed.
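
The distinction described here can be sketched as a small classifier. This illustrates the intended rule only and is not the code in getBrokenLinks; the function name `isBrokenStatus` and the choice to treat a missing status as broken are assumptions.

```php
<?php
// Hypothetical helper: decide whether an HTTP status code marks a link as broken.
function isBrokenStatus(?int $statusCode): bool
{
    if ($statusCode === null) {
        return true; // assumption: no response at all counts as broken
    }
    // Anything below 400, including 2xx success and 3xx redirects, is treated
    // as reachable; 4xx and 5xx responses are treated as broken.
    return $statusCode >= 400;
}

foreach ([200, 301, 302, 404, 500] as $code) {
    echo $code, ' => ', isBrokenStatus($code) ? 'broken' : 'ok', PHP_EOL;
}
```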

LunarDevelopment commented 5 years ago

Happy to help, looking forward to your next release!

zrashwani commented 5 years ago

@LunarDevelopment, the issues you were facing should be fixed in the recent release 2.0.1. Please update to the latest release and try again.

LunarDevelopment commented 5 years ago

Thank you, I'll give it a go.