Open mkantautas opened 7 years ago
Hello @neorganic, regarding the first point: detecting the source of a broken link is a very useful feature, and I will proceed with implementing it. However, I didn't get your second point: "is there a way to force all these keys on all of the crawled urls, not just a fraction of them as it is currently?" Do you mean the same as issue #21?
Hello @neorganic, source_page is now added to the link information, so you can see from which page a given URL was crawled.
Example:
<?php
// $crawler is assumed to be an already configured crawler instance
$links = $crawler->traverse()
    ->getLinks();
$collection = new LinksCollection($links);
// Get the broken links
$brokenLinks = $collection->getBrokenLinks();
The result set will contain entries like the following:
"http://www.lukewest.co.uk" => array:3 [
"source_page" => "http://zrashwani.com/simple-web-spider-php-goutte/"
"link" => "http://www.lukewest.co.uk"
"status_code" => "404"
]
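To illustrate how the new source_page key might be used, here is a minimal sketch that groups broken links by the page they were found on. It assumes the broken links are available as a plain array keyed by URL, as in the dump above; groupBrokenLinksBySource is a hypothetical helper and not part of the package.

```php
<?php
// Hypothetical helper (not part of the package): group broken links
// by the page on which they were found.
function groupBrokenLinksBySource(array $brokenLinks): array
{
    $report = [];
    foreach ($brokenLinks as $url => $info) {
        // 'source_page' is the key shown in the sample output;
        // fall back to a placeholder when it is missing.
        $source = $info['source_page'] ?? '(unknown source)';
        $report[$source][] = [
            'link' => $info['link'] ?? $url,
            'status_code' => (string) ($info['status_code'] ?? ''),
        ];
    }
    return $report;
}

// Sample data copied from the result set above.
$brokenLinks = [
    'http://www.lukewest.co.uk' => [
        'source_page' => 'http://zrashwani.com/simple-web-spider-php-goutte/',
        'link' => 'http://www.lukewest.co.uk',
        'status_code' => '404',
    ],
];

$report = groupBrokenLinksBySource($brokenLinks);
foreach ($report as $page => $links) {
    echo $page, ' has ', count($links), " broken link(s)\n";
}
```

Grouping by source page makes it easier to see which pages need editing, rather than only which target URLs are dead.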
For the first part - Great work! For the second part - the "error_message" key is not always shown when, e.g., the link returns a 404.
One more thing that, as I understand it, the current package lacks is checking links that point outside the website's host name (just to see whether they are broken).
A bit off topic, but could you suggest a page with various URLs (good and bad) that would be safe for testing your package?
A bit more info about the second part:
array:9 [▼
"http://www.linkedin.com/company/boxgard" => array:9 [▶]
"https://de.linkedin.com/in/danielbromberg" => array:9 [▶]
"https://www.linkedin.com/pub/viktor-jondal/b/2b1/517" => array:9 [▶]
"https://www.linkedin.com/pub/christian-johannesson/49/235/99/en" => array:9 [▶]
"https://se.linkedin.com/in/alinanmorariu" => array:9 [▶]
"https://es.linkedin.com/pub/bernat-torras-font/45/81a/437" => array:9 [▶]
"http://www.bfn.se/fragor/fragor-arkivering.aspx" => array:9 [▶]
"http://thespringfieldproject.se/investments-ny/" => array:9 [▶]
"/product/valj-forvaring/" => array:2 [▼
"status_code" => 404
"depth" => 3
]
]
Why does the last broken link have only 2 keys in its array? Where are all the other keys, like:
"original_urls" =>
"links_text" =>
"absolute_url" =>
"external_link" =>
"visited" =>
"frequency" =>
"error_code" =>
"error_message" =>
Hello @neorganic, can you please send me the base URL to test the issue you mentioned above?
Hello @zrashwani, the base URL is https://boxgard.com
This is done now; the output for the broken page is as follows:
"http://www.linkedin.com/company/boxgard" => array:10 [▶]
"https://de.linkedin.com/in/danielbromberg" => array:10 [▶]
"https://www.linkedin.com/pub/viktor-jondal/b/2b1/517" => array:10 [▶]
"https://www.linkedin.com/pub/christian-johannesson/49/235/99/en" => array:10 [▶]
"https://se.linkedin.com/in/alinanmorariu" => array:10 [▶]
"https://es.linkedin.com/pub/bernat-torras-font/45/81a/437" => array:10 [▶]
"http://www.bfn.se/fragor/fragor-arkivering.aspx" => array:10 [▶]
"https://boxgard.com/product/valj-forvaring/" => array:10 [▶]
"http://thespringfieldproject.se/investments-ny/" => array:10 [▶]
"//" => array:11 [▶]
"//blog" => array:11 [▶]
"/product/valj-forvaring/" => array:10 [▼
"original_urls" => array:1 [▶]
"links_text" => array:1 [▶]
"absolute_url" => "https://boxgard.com/product/valj-forvaring/"
"external_link" => false
"visited" => false
"frequency" => 1
"source_link" => "https://boxgard.com/media"
"depth" => 3
"status_code" => 404
"error_message" => 404
]
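For readers who want a uniform key set regardless of the package version, the same effect could be approximated on the consumer side. The sketch below is only an illustration: LINK_DEFAULTS and withDefaultKeys are hypothetical names, the key list is copied from the dumps in this thread, and the default values are assumptions.

```php
<?php
// Hypothetical defaults: key names come from the dumps in this thread,
// the default values are assumptions.
const LINK_DEFAULTS = [
    'original_urls' => [],
    'links_text' => [],
    'absolute_url' => null,
    'external_link' => null,
    'visited' => false,
    'frequency' => 0,
    'source_link' => null,
    'depth' => null,
    'status_code' => null,
    'error_code' => null,
    'error_message' => null,
];

// Hypothetical helper: make sure every link entry carries every key.
function withDefaultKeys(array $links): array
{
    $normalized = [];
    foreach ($links as $url => $info) {
        // Treat non-array entries as empty so array_replace() always
        // receives arrays.
        $info = is_array($info) ? $info : [];
        $normalized[$url] = array_replace(LINK_DEFAULTS, $info);
    }
    return $normalized;
}

// The two-key entry reported earlier in the thread.
$links = [
    '/product/valj-forvaring/' => ['status_code' => 404, 'depth' => 3],
];
$normalized = withDefaultKeys($links);
print_r(array_keys($normalized['/product/valj-forvaring/']));
```

Values already present on an entry win over the defaults, so existing data is never overwritten.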
Perfect, but there is one new issue (or maybe I didn't notice it before), a false positive:
array:14 [▼
"http://www.linkedin.com/company/boxgard" => array:10 [▶]
"https://de.linkedin.com/in/danielbromberg" => array:10 [▶]
"https://www.linkedin.com/pub/viktor-jondal/b/2b1/517" => array:10 [▶]
"https://www.linkedin.com/pub/christian-johannesson/49/235/99/en" => array:10 [▶]
"https://se.linkedin.com/in/alinanmorariu" => array:10 [▶]
"https://es.linkedin.com/pub/bernat-torras-font/45/81a/437" => array:10 [▶]
"http://www.bfn.se/fragor/fragor-arkivering.aspx" => array:10 [▶]
"https://boxgard.com/product/valj-forvaring/" => array:10 [▶]
"http://thespringfieldproject.se/investments-ny/" => array:10 [▶]
"//" => array:11 [▼
"original_urls" => array:2 [▼
"https://boxgard.com//#steps" => "https://boxgard.com//#steps"
"https://boxgard.com//#pricing" => "https://boxgard.com//#pricing"
]
"links_text" => array:3 [▼
"SÅ FUNKAR DET" => "SÅ FUNKAR DET"
"VAD KOSTAR DET?" => "VAD KOSTAR DET?"
"PRISER" => "PRISER"
]
"absolute_url" => "https://boxgard.com//#pricing"
"external_link" => false
"visited" => false
"frequency" => 4
"source_link" => "https://boxgard.com/product/store/"
"depth" => 3
"status_code" => "404"
"error_code" => 0
"error_message" => "array_replace(): Argument #2 is not an array"
]
"//blog" => array:11 [▼
"original_urls" => array:1 [▶]
"links_text" => array:1 [▶]
"absolute_url" => "https://boxgard.com//blog"
"external_link" => false
"visited" => false
"frequency" => 1
"source_link" => "https://boxgard.com/product/store/"
"depth" => 3
"status_code" => "404"
"error_code" => 0
"error_message" => "cURL error 6: Could not resolve host: blog (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)"
]
"/product/valj-forvaring/" => array:10 [▶]
"http://boxgard.com/)" => array:10 [▶]
"http://erikshjalpen.se/secondhand/vill-du-skanka/" => array:10 [▶]
]
If you try to access https://boxgard.com//#pricing or https://boxgard.com//blog, you do not get a 404 error. Of course, these URLs are not completely valid (they contain a double "//"), but they do work (probably all modern browsers compensate for this slight error). I think this false positive should be fixed!
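One possible way to avoid this kind of false positive is to collapse repeated slashes in the path of an absolute URL before requesting it. The sketch below is an assumption about a workaround, not the package's actual fix; collapsePathSlashes is a hypothetical helper. Note also that a bare "//blog" reference is a scheme-relative URL under RFC 3986, which may explain the "Could not resolve host: blog" error, so the sketch deliberately leaves such references untouched.

```php
<?php
// Hypothetical helper: collapse repeated slashes in the path of an
// absolute URL. User info (user:pass@) is not preserved in this sketch.
function collapsePathSlashes(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        // Leave relative, scheme-relative, or unparsable URLs untouched.
        return $url;
    }

    // Replace any run of two or more slashes in the path with one.
    $path = preg_replace('#/{2,}#', '/', $parts['path'] ?? '/');

    $result = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= $path;
    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}

echo collapsePathSlashes('https://boxgard.com//blog'), "\n";
echo collapsePathSlashes('https://boxgard.com//#pricing'), "\n";
```

Whether a server treats "//" and "/" as the same path is site-specific, so this normalization is a heuristic rather than a guaranteed equivalence.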
P.S. Great job so far, your crawler has huge potential. In the near future I am going to contribute to this repo even more with pull requests.
So let's say I am crawling a website http://website.com and it has a broken link http://website.com/dir/subdir/red located in http://website.com/dir/subdir. Is there a way to include, along with all the other data, a key "source" => "http://website.com/dir/subdir"?
Also, is there a way to force all these keys on all of the crawled urls, not just a fraction of them as it is currently?