zrashwani / arachnid

Crawl all unique internal links found on a given website and extract SEO-related information - supports javascript-based sites
MIT License

Timeout configuration for Goutte client #11

Closed: ollietreend closed this issue 7 years ago

ollietreend commented 8 years ago

It's not currently possible to configure a timeout for the Guzzle client which is used to make HTTP requests when spidering a site. Without an explicit value, Guzzle defaults to a timeout of 0 – i.e. it will wait indefinitely until a response is received. (Which arguably isn't a sensible default anyway.)

I'm trying to spider a site which contains a link to a dead server. Requests to that URL never time out, so the spider process gets stuck on it and never proceeds.
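To illustrate the difference (a minimal standalone Guzzle sketch, independent of Arachnid; the URL is made up):

<?php
// Standalone illustration - not Arachnid code.
// With no 'timeout' option, this request will wait indefinitely:
$client = new \GuzzleHttp\Client();
$client->get('http://dead-server.example.com/');

// With timeouts configured, the same request throws an exception
// after 5 seconds instead of hanging forever:
$client = new \GuzzleHttp\Client(array(
    'timeout' => 5,          // total request timeout, in seconds
    'connect_timeout' => 5,  // TCP connect timeout, in seconds
));
$client->get('http://dead-server.example.com/');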

The timeout is configured when constructing a new Guzzle client, which is currently done in Arachnid\Crawler::getScrapClient():

protected function getScrapClient()
{
    $client = new GoutteClient();
    $client->followRedirects();

    $guzzleClient = new \GuzzleHttp\Client(array(
        'curl' => array(
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_SSL_VERIFYPEER => false,
        ),
    ));
    $client->setClient($guzzleClient);

    return $client;
}

It would be really helpful if a timeout were configured here. To do that, all we need to do is change the configuration array that is passed to the Guzzle client constructor:

$guzzleClient = new \GuzzleHttp\Client(array(
    'curl' => array(
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_SSL_VERIFYPEER => false,
    ),
    'timeout' => 30,
    'connect_timeout' => 30,
));

I think a sensible default would be a 30-second timeout, but it would be great to make that configurable. That could either be an additional parameter on the constructor, or alternatively an object property which can be changed.
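For instance, a hypothetical version of getScrapClient() reading from an object property (the property name and the defaults here are illustrative, not an actual implementation):

<?php
// Hypothetical sketch - $this->crawlerOptions and the 30-second defaults
// are illustrative, not the package's actual API.
protected function getScrapClient()
{
    $client = new GoutteClient();
    $client->followRedirects();

    // merge caller-supplied options over sensible defaults
    $options = array_merge(array(
        'timeout' => 30,
        'connect_timeout' => 30,
    ), $this->crawlerOptions);

    $guzzleClient = new \GuzzleHttp\Client($options);
    $client->setClient($guzzleClient);

    return $client;
}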

In fact – it might make sense to allow passing arbitrary options through to the Guzzle constructor configuration. Perhaps again by means of a class property or constructor parameter through which we can pass an array of configuration options. This could be useful for configuring other client options, for example HTTP authentication:

$crawler = new Crawler($url, 3, array(
    'timeout' => 5,
    'connect_timeout' => 5,
    'auth' => array('username', 'password'),
));

Thoughts? I'd be happy to put together a PR for this, provided we can agree on how it should be configured (class constructor, public property, static property, getter/setter, etc.).

zrashwani commented 8 years ago

Hello @ollietreend, sorry for the late reply.

I think your suggestion is really useful, and I'd prefer it implemented either as an optional parameter passed to the constructor or as a setter method, something like setCrawlerOptions(array $options).
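A minimal sketch of that setter idea (illustrative only, following the signature suggested above):

<?php
// Illustrative sketch of the suggested setter - not a final implementation.
class Crawler
{
    /** @var array options forwarded to the Guzzle client constructor */
    protected $crawlerOptions = array();

    public function setCrawlerOptions(array $options)
    {
        $this->crawlerOptions = $options;
    }
}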

Please feel free to open a PR for this.

zrashwani commented 7 years ago

Hello @ollietreend, the ability to pass options to the Guzzle client has now been added. As you suggested, this can be done either by passing options as the third parameter to the constructor or by using the setCrawlerOptions method:

<?php
// third parameter is the options used to configure the guzzle client
$crawler = new \Arachnid\Crawler('http://github.com', 2,
                         ['auth' => array('username', 'password')]);

// or using the separate method `setCrawlerOptions`
$options = array(
    'curl' => array(
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_SSL_VERIFYPEER => false,
    ),
    'timeout' => 30,
    'connect_timeout' => 30,
);

$crawler->setCrawlerOptions($options);
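For completeness, an end-to-end usage sketch (traverse() and getLinks() are the crawl entry points shown in the project README):

<?php
// end-to-end sketch; traverse() and getLinks() follow the README examples
$options = array(
    'timeout' => 30,
    'connect_timeout' => 30,
);

$crawler = new \Arachnid\Crawler('http://example.com', 2);
$crawler->setCrawlerOptions($options);

$crawler->traverse();
$links = $crawler->getLinks();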

LunarDevelopment commented 5 years ago

Does the config now look like this, since there's no setCrawlerOptions method?


// passing the options as the third constructor parameter
$options = array(
    'curl' => array(
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_SSL_VERIFYPEER => false,
    ),
    'timeout' => 30,
    'connect_timeout' => 30,
);

$crawler = new Crawler($url, $linkDepth, $options);

LunarDevelopment commented 5 years ago

Never mind, I re-read the README instructions:

<?php
// third parameter is the options used to configure the guzzle client
$crawler = new \Arachnid\Crawler('http://github.com', 2,
                         ['auth' => array('username', 'password')]);

// or by creating a scrap client via the CrawlingFactory
// and setting it on the crawler explicitly
$options = array(
    'curl' => array(
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_SSL_VERIFYPEER => false,
    ),
    'timeout' => 30,
    'connect_timeout' => 30,
);

$scrapperClient = \Arachnid\Adapters\CrawlingFactory::create(\Arachnid\Adapters\CrawlingFactory::TYPE_GOUTTE, $options);
$crawler->setScrapClient($scrapperClient);

zrashwani commented 5 years ago

I have made a major change to the package in order to allow two types of scraper: one is the usual Goutte client, and the other is a headless browser mode, based on Symfony Panther, which supports javascript rendering. That's why setting crawler options now differs according to each adapter.
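For the headless mode, the setup presumably mirrors the Goutte example above (this assumes the factory exposes a TYPE_HEADLESS_BROWSER constant alongside TYPE_GOUTTE; check the current README for the exact name):

<?php
// sketch, assuming a TYPE_HEADLESS_BROWSER factory constant alongside
// TYPE_GOUTTE - verify against the current README
$options = array(
    'timeout' => 30,
    'connect_timeout' => 30,
);

$scrapperClient = \Arachnid\Adapters\CrawlingFactory::create(
    \Arachnid\Adapters\CrawlingFactory::TYPE_HEADLESS_BROWSER,
    $options
);

$crawler = new \Arachnid\Crawler('http://example.com', 2);
$crawler->setScrapClient($scrapperClient);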