zrashwani / arachnid

Crawl all unique internal links found on a given website, and extract SEO-related information - supports JavaScript-based sites
MIT License
253 stars 60 forks

How to find out from which URL a given URL was crawled? #22

Open mkantautas opened 7 years ago

mkantautas commented 7 years ago

So let's say I am crawling a website http://website.com and it has a broken link http://website.com/dir/subdir/red located on http://website.com/dir/subdir. Is there a way to include, along with all the other data, a key "source" => "http://website.com/dir/subdir"?

Also, is there a way to force all of these keys on all of the crawled URLs, not just a fraction of them as is currently the case?

"original_urls" => 
    "links_text" =>
    "absolute_url" => 
    "external_link" => 
    "visited" => 
    "frequency" => 
    "depth" => 
    "status_code" => 
    "error_code" => 
    "error_message" =>
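One way to guarantee that every crawled URL exposes the full key set would be to merge each link's info over a set of defaults. This is a hypothetical helper for illustration (`normalizeLinkInfo` and the default values below are not part of Arachnid):

```php
<?php
// Hypothetical helper (not part of Arachnid): fill in any missing keys
// with defaults so every crawled URL exposes the same set of keys.
function normalizeLinkInfo(array $info): array
{
    $defaults = [
        'original_urls' => [],
        'links_text'    => [],
        'absolute_url'  => null,
        'external_link' => null,
        'visited'       => false,
        'frequency'     => 0,
        'depth'         => null,
        'status_code'   => null,
        'error_code'    => null,
        'error_message' => null,
    ];

    // array_replace keeps every default key and overwrites
    // those that are actually present in $info.
    return array_replace($defaults, $info);
}
```

With this, an entry that only carries `status_code` and `depth` would still come back with all ten keys.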
zrashwani commented 7 years ago

Hello @neorganic, regarding the first point: detecting the source of a broken link is a very useful feature, and I will proceed with implementing it. However, I didn't get your second point: "is there a way to force all these keys on all of the crawled urls, not just a fraction of them as it is currently?" Do you mean the same as issue #21?

zrashwani commented 7 years ago

Hello @neorganic, source_page is now added to the link information, so you can see from which page a certain URL was crawled, e.g.:

<?php
$links = $crawler->traverse()
                 ->getLinks();
$collection = new LinksCollection($links);

//getting broken links
$brokenLinks = $collection->getBrokenLinks();

the result set will contain results like the following:

"http://www.lukewest.co.uk" => array:3 [
      "source_page" => "http://zrashwani.com/simple-web-spider-php-goutte/"
      "link" => "http://www.lukewest.co.uk"
      "status_code" => "404"
    ]
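Building on the result set above, the new `source_page` key makes it easy to group broken links by the page they appeared on. A minimal sketch (the `groupBySourcePage` helper is hypothetical, not part of the library):

```php
<?php
// Sketch: group broken links by the page they were found on.
// Assumes each entry carries the 'source_page' key shown above,
// as returned by getBrokenLinks().
function groupBySourcePage(array $brokenLinks): array
{
    $bySource = [];
    foreach ($brokenLinks as $url => $info) {
        // Collect every broken URL under the page that linked to it.
        $bySource[$info['source_page']][] = $url;
    }
    return $bySource;
}
```

This gives a per-page view of broken links, which is usually what you need when fixing them.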
mkantautas commented 7 years ago

For the first part: great work! For the second part: the "error_message" key is not always shown when, for example, the link is a 404.

One more thing I understand the current package lacks is crawling links that point outside the host name of the website (just to check whether they are broken).

A bit off topic, but could you suggest a page with various URLs (good/bad) that would be safe for testing your package?

mkantautas commented 7 years ago

A bit more info about the second part:

array:9 [▼
  "http://www.linkedin.com/company/boxgard" => array:9 [▶]
  "https://de.linkedin.com/in/danielbromberg" => array:9 [▶]
  "https://www.linkedin.com/pub/viktor-jondal/b/2b1/517" => array:9 [▶]
  "https://www.linkedin.com/pub/christian-johannesson/49/235/99/en" => array:9 [▶]
  "https://se.linkedin.com/in/alinanmorariu" => array:9 [▶]
  "https://es.linkedin.com/pub/bernat-torras-font/45/81a/437" => array:9 [▶]
  "http://www.bfn.se/fragor/fragor-arkivering.aspx" => array:9 [▶]
  "http://thespringfieldproject.se/investments-ny/" => array:9 [▶]
  "/product/valj-forvaring/" => array:2 [▼
    "status_code" => 404
    "depth" => 3
  ]
]

Why does the last broken link have only 2 keys in its array? Where are all the other keys, like:

"original_urls" => 
    "links_text" =>
    "absolute_url" => 
    "external_link" => 
    "visited" => 
    "frequency" =>  
    "error_code" => 
    "error_message" =>
zrashwani commented 7 years ago

Hello @neorganic, can you please send me the base URL so I can test the issue you mentioned above?

mkantautas commented 7 years ago

Hello @zrashwani , The base URL is https://boxgard.com

zrashwani commented 7 years ago

This is done now; the output for the broken page is as follows:

    "http://www.linkedin.com/company/boxgard" => array:10 [▶]
    "https://de.linkedin.com/in/danielbromberg" => array:10 [▶]
    "https://www.linkedin.com/pub/viktor-jondal/b/2b1/517" => array:10 [▶]
    "https://www.linkedin.com/pub/christian-johannesson/49/235/99/en" => array:10 [▶]
    "https://se.linkedin.com/in/alinanmorariu" => array:10 [▶]
    "https://es.linkedin.com/pub/bernat-torras-font/45/81a/437" => array:10 [▶]
    "http://www.bfn.se/fragor/fragor-arkivering.aspx" => array:10 [▶]
    "https://boxgard.com/product/valj-forvaring/" => array:10 [▶]
    "http://thespringfieldproject.se/investments-ny/" => array:10 [▶]
    "//" => array:11 [▶]
    "//blog" => array:11 [▶]
    "/product/valj-forvaring/" => array:10 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:1 [▶]
      "absolute_url" => "https://boxgard.com/product/valj-forvaring/"
      "external_link" => false
      "visited" => false
      "frequency" => 1
      "source_link" => "https://boxgard.com/media"
      "depth" => 3
      "status_code" => 404
      "error_message" => 404
    ]
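Since each broken entry now carries `absolute_url`, `source_link`, and `status_code` (as in the dump above), a one-line report per broken link can be produced. The `describeBrokenLink` helper below is an illustrative sketch, not part of the package:

```php
<?php
// Sketch: render a human-readable report line from the link info above.
// Assumes the keys 'absolute_url', 'source_link', and 'status_code'
// are present, as in the dump shown.
function describeBrokenLink(array $info): string
{
    return sprintf(
        '%s (linked from %s) returned %s',
        $info['absolute_url'],
        $info['source_link'],
        $info['status_code']
    );
}
```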
mkantautas commented 7 years ago

Perfect, but there is one new issue (or maybe I didn't notice it before): a false positive:

array:14 [▼
  "http://www.linkedin.com/company/boxgard" => array:10 [▶]
  "https://de.linkedin.com/in/danielbromberg" => array:10 [▶]
  "https://www.linkedin.com/pub/viktor-jondal/b/2b1/517" => array:10 [▶]
  "https://www.linkedin.com/pub/christian-johannesson/49/235/99/en" => array:10 [▶]
  "https://se.linkedin.com/in/alinanmorariu" => array:10 [▶]
  "https://es.linkedin.com/pub/bernat-torras-font/45/81a/437" => array:10 [▶]
  "http://www.bfn.se/fragor/fragor-arkivering.aspx" => array:10 [▶]
  "https://boxgard.com/product/valj-forvaring/" => array:10 [▶]
  "http://thespringfieldproject.se/investments-ny/" => array:10 [▶]
  "//" => array:11 [▼
    "original_urls" => array:2 [▼
      "https://boxgard.com//#steps" => "https://boxgard.com//#steps"
      "https://boxgard.com//#pricing" => "https://boxgard.com//#pricing"
    ]
    "links_text" => array:3 [▼
      "SÅ FUNKAR DET" => "SÅ FUNKAR DET"
      "VAD KOSTAR DET?" => "VAD KOSTAR DET?"
      "PRISER" => "PRISER"
    ]
    "absolute_url" => "https://boxgard.com//#pricing"
    "external_link" => false
    "visited" => false
    "frequency" => 4
    "source_link" => "https://boxgard.com/product/store/"
    "depth" => 3
    "status_code" => "404"
    "error_code" => 0
    "error_message" => "array_replace(): Argument #2 is not an array"
  ]
  "//blog" => array:11 [▼
    "original_urls" => array:1 [▶]
    "links_text" => array:1 [▶]
    "absolute_url" => "https://boxgard.com//blog"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "https://boxgard.com/product/store/"
    "depth" => 3
    "status_code" => "404"
    "error_code" => 0
    "error_message" => "cURL error 6: Could not resolve host: blog (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)"
  ]
  "/product/valj-forvaring/" => array:10 [▶]
  "http://boxgard.com/)" => array:10 [▶]
  "http://erikshjalpen.se/secondhand/vill-du-skanka/" => array:10 [▶]
]

If you try to access https://boxgard.com//#pricing or https://boxgard.com//blog, you will not get a 404 error. Of course the URL is not completely valid (having the double "//"), but such links do work (all modern browsers probably compensate for this slight error). I think this false-positive case should be fixed!
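Note that a bare "//blog" link is technically a protocol-relative URL (host "blog"), which matches the "Could not resolve host: blog" cURL error in the dump above. One way a crawler could avoid the false positive for doubled slashes is to normalize the URL path before requesting it. This is a sketch of such a normalization, not the library's actual behavior:

```php
<?php
// Sketch: collapse duplicate slashes in a URL's path before requesting it,
// while leaving the "://" after the scheme untouched.
function collapseDuplicateSlashes(string $url): string
{
    // The negative lookbehind (?<!:) prevents matching the "//" in "https://".
    return preg_replace('#(?<!:)//+#', '/', $url);
}
```

So `https://boxgard.com//blog` would be requested as `https://boxgard.com/blog`, which resolves correctly.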

P.S. Great job so far; your crawler has huge potential. In the near future I am going to contribute to this repo even more with pull requests.