Open mkantautas opened 7 years ago
Hello @neorganic, thank you for pointing out this issue.
This is now fixed in this commit; the output for the link you mentioned is now as follows:
[/index.html] => Array
(
[original_urls] => Array
(
[../index.html] => ../index.html
)
[links_text] => Array
(
[] =>
[Back to my Toasty Technology Page] => Back to my Toasty Technology Page
)
[absolute_url] => http://toastytech.com/index.html
[external_link] =>
[visited] => 1
[frequency] => 3
[source_link] => http://toastytech.com/evil/
[depth] => 1
[status_code] => 200
[title] => Nathan's Toasty Technology Page
[meta_keywords] =>
[meta_description] =>
[h1_count] => 0
[h1_contents] => Array
(
)
)
Let me know if there are any other cases.
"http://meyrovich.com/2016/10/06/vegan-pesto/" => array:8 [▶]
"http://meyrovich.com/2016/10/06/vegan-pesto/?share=facebook" => array:8 [▶]
"http://meyrovich.com/2016/10/06/vegan-pesto/?share=twitter" => array:8 [▶]
"http://meyrovich.com/2016/10/06/vegan-pesto/?share=tumblr" => array:8 [▶]
"http://meyrovich.com/2016/10/06/vegan-pesto/?share=pinterest" => array:8 [▶]
"http://meyrovich.com/2016/10/06/vegan-pesto/?share=linkedin" => array:8 [▶]
"http://meyrovich.com/2016/10/06/vegan-pesto/?share=reddit" => array:8 [▼
"original_urls" => array:1 [ …1]
"links_text" => array:1 [ …1]
"absolute_url" => "http://meyrovich.com/2016/10/06/vegan-pesto/?share=reddit"
"external_link" => false
"visited" => false
"frequency" => 1
"source_link" => "http://meyrovich.com/category/food/"
"depth" => 2
]
"http://meyrovich.com/2016/10/06/vegan-pesto/?share=google-plus-1" => array:8 [▶]
"http://meyrovich.com/author/admin/page/2/" => array:8 [▶]
"https://www.facebook.com/policies/cookies/" => array:8 [▶]
"https://www.facebook.com/recover/initiate?lwv=100" => array:8 [▶]
"/reg/" => array:8 [▶]
"/home" => array:8 [▶]
"https://twitter.com/signup?context=webintent&follow=meyrovich_" => array:8 [▶]
"/account/begin_password_reset" => array:8 [▶]
"//support.twitter.com/groups/31-twitter-basics/topics/104-welcome-to-twitter-support/articles/215585-twitter-101-how-should-i-get-started-using-twitter" => array:8 [▶]
"https://www.tumblr.com/login?redirect_to=https%3A%2F%2Fwww.tumblr.com%2Fwidgets%2Fshare%2Ftool%3FshareSource%3Dlegacy%26canonicalUrl%3D%26url%3Dhttp%253A%252F%252Fmeyrovich.com%252F2016%252F12%252F24%252Ffarmdrop-christmas-feast%252F%26title%3DFarmdrop%2BChristmas%2BFeast%26_format%3Dhtml%26sequence%3Dpreview" => array:8 [▶]
"https://www.pinterest.com/_/_/about/cookie-policy/" => array:8 [▶]
"/_/_/about/terms-service/" => array:8 [▶]
"/_/_/about/privacy/plain.html" => array:8 [▶]
"/_/_/about/" => array:8 [▼
"original_urls" => array:1 [ …1]
"links_text" => array:1 [ …1]
"absolute_url" => "http://meyrovich.com/_/_/about/"
"external_link" => false
"visited" => false
"frequency" => 1
"source_link" => "http://meyrovich.com/2016/12/24/farmdrop-christmas-feast/?share=pinterest"
"depth" => 2
]
"/_/_/blog/" => array:8 [▶]
"/_/_/business/" => array:8 [▼
"original_urls" => array:1 [ …1]
"links_text" => array:1 [ …1]
"absolute_url" => "http://meyrovich.com/_/_/business/"
"external_link" => false
"visited" => false
"frequency" => 1
"source_link" => "http://meyrovich.com/2016/12/24/farmdrop-christmas-feast/?share=pinterest"
"depth" => 2
]
"/_/_/about/privacy/" => array:8 [▼
"original_urls" => array:1 [ …1]
"links_text" => array:1 [ …1]
"absolute_url" => "http://meyrovich.com/_/_/about/privacy/"
"external_link" => false
"visited" => false
"frequency" => 1
"source_link" => "http://meyrovich.com/2016/12/24/farmdrop-christmas-feast/?share=pinterest"
"depth" => 2
]
Hey, yeah, so the base URL is http://meyrovich.com, and I mainly see two problems here: the links with ///..., and the fact that it follows the ?share links. I think these, and probably other URLs that contain a ?, should be excluded from the list?
Hello @neorganic,
A major rewrite of the library has now been done, and the dots issue should be fixed: PSR-7, specifically the \GuzzleHttp\Psr7\UriResolver::removeDotSegments method, is now used to normalize URLs and remove dot segments from them.
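To illustrate what dot-segment removal does to a path such as the "../index.html" case from this issue, here is a minimal standalone sketch in the spirit of RFC 3986 section 5.2.4. The function name removeDotSegmentsSketch is hypothetical and this is not the library's or Guzzle's code; it only shows the general idea on the path part of a URL.

```php
<?php
// Standalone sketch of dot-segment removal (RFC 3986, section 5.2.4).
// removeDotSegmentsSketch is a hypothetical name used for illustration only.
function removeDotSegmentsSketch(string $path): string
{
    if ($path === '' || $path === '/') {
        return $path;
    }

    $segments = explode('/', $path);
    $output = [];
    foreach ($segments as $segment) {
        if ($segment === '..') {
            // ".." cancels the previously kept segment, if there is one.
            array_pop($output);
        } elseif ($segment !== '.') {
            // "." is dropped; everything else is kept as-is.
            $output[] = $segment;
        }
    }

    $result = implode('/', $output);

    // An absolute path must stay absolute even if ".." consumed the root marker.
    if ($path[0] === '/' && ($result === '' || $result[0] !== '/')) {
        $result = '/' . $result;
    }

    // A path ending in "." or ".." refers to a directory, so keep a trailing slash.
    $last = end($segments);
    if ($result !== '/' && ($last === '.' || $last === '..')) {
        $result .= '/';
    }

    return $result;
}

// The path of http://toastytech.com/evil/../index.html
echo removeDotSegmentsSketch('/evil/../index.html') . PHP_EOL; // /index.html
```

Only the path is handled here; joining the result back onto the scheme and host is left out of the sketch.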
Also, to exclude links matching specific patterns from the list, you can use the filterLinks method, as mentioned in the readme file. I think that approach is better, as some sites depend on query strings to serve different documents.
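As a rough sketch of the kind of predicate that could exclude the ?share links reported above: the closure below returns false for any URL whose query string has a share parameter. It assumes the filterLinks callback receives the link as a string and returns a boolean, which should be checked against the readme; the variable names are hypothetical, and the demo applies the closure with array_filter so that it runs without the library.

```php
<?php
// Hypothetical predicate: keep a link unless its query string has a "share" parameter.
$keepLink = function (string $link): bool {
    $query = parse_url($link, PHP_URL_QUERY);
    if ($query === null || $query === false) {
        // No query string (or a URL that parse_url cannot handle): keep the link.
        return true;
    }
    parse_str($query, $params);
    return !array_key_exists('share', $params);
};

// Sample URLs taken from the dump earlier in this thread.
$links = [
    'http://meyrovich.com/2016/10/06/vegan-pesto/',
    'http://meyrovich.com/2016/10/06/vegan-pesto/?share=reddit',
    'http://meyrovich.com/author/admin/page/2/',
];

// Standalone demonstration; with the library, the closure would presumably be
// passed to filterLinks instead (an assumption, see the readme).
$kept = array_values(array_filter($links, $keepLink));
print_r($kept);
```

Matching on one named parameter rather than on every URL containing a ? leaves alone the sites that rely on query strings to serve different documents.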
Also, @neorganic, can you check the new version of the library and confirm whether the issues are fixed in it?
Love the library, bravo!
Some interesting results from the latest version of this library, results then code below (I recognise the site being scanned isn't a perfectly marked up site and may be out of scope for the project):
<li itemprop=name><a href=/operational-support-agreement.pdf title="Siteshield" itemprop=url>Siteshield</a></li>
Outputs: Broken: /operational-support-agreement.pdf from https://www.espressoweb.co.uk/
Other instances where it seems the full url has been interpreted okay, but the protocol has not been assumed correctly:
<h2 class=h3><a class="color-grey" href="social-marketing" title="Social Media Marketing from Espresso Web">Engaging Social</a></h2>
Outputs: Broken: http://www.espressoweb.co.uk/social-marketing from http://www.espressoweb.co.uk AND Broken: http://www.espressoweb.co.uk/social-marketing from http://www.espressoweb.co.uk/social-marketing
<?php
require '../vendor/autoload.php';
$time_start = microtime(true);
$url = 'https://www.espressoweb.co.uk/';
$linkDepth = 10;
// Initiate crawl
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->traverse();
// Get link data
$links = $crawler->getLinks();
$collection = new \Arachnid\LinksCollection($links);
//getting broken links
$brokenLinks = $collection->getBrokenLinks();
foreach ($brokenLinks as $i => $link) {
    echo "++++++++++++++++++++++++++++++" . PHP_EOL;
    echo "Broken: " . $i . " from " . $link->getParentUrl() . PHP_EOL;
    // echo json_encode($links) . PHP_EOL;
}
$time_end = microtime(true);
$execution_time = ($time_end - $time_start) / 60;
$execution_time = number_format((float)$execution_time, 2);
echo "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++" . PHP_EOL;
echo "Complete!" . PHP_EOL;
echo 'Total Execution Time: ' . $execution_time . ' Mins' . PHP_EOL;
echo "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++" . PHP_EOL;
die;
Hello @LunarDevelopment,
Thank you for reporting this issue; your sample script has drawn my attention to other issues as well.
There was a problem in the getBrokenLinks method of LinksCollection: it was checking for 2xx status codes, so it classified a link as broken if its status code was a 3xx redirect, which is wrong.
I have pushed a commit that should handle some cases, and I will test it over the next few days to make sure all such cases are fixed.
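The classification described above can be sketched as a small standalone check. The function name isBrokenStatusSketch is hypothetical and this is not the library's implementation; treating every code outside the 2xx and 3xx ranges as broken is an assumption made for the sketch.

```php
<?php
// Hypothetical helper: decide whether an HTTP status code marks a link as broken.
// Assumption: only 2xx (success) and 3xx (redirect) count as "not broken".
function isBrokenStatusSketch(int $statusCode): bool
{
    return $statusCode < 200 || $statusCode >= 400;
}

foreach ([200, 301, 302, 404, 500] as $code) {
    echo $code . ' => ' . (isBrokenStatusSketch($code) ? 'broken' : 'ok') . PHP_EOL;
}
```

With this check a 301 or 302 is reported as ok, so a redirecting link would no longer show up in the broken list.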
Happy to help, looking forward to your next release!
@LunarDevelopment the issues you were facing should be fixed in the recent release 2.0.1; please update to the latest release and try it there.
Thank you, I'll give it a go.
For example, the page http://toastytech.com/evil/ with $linkDepth = 2; gives a lot of incorrect URLs. You may say that this web page is very old and no one writes relative URLs like "../yourUrlPath" anymore, but I think this should still be fixed :)