webignition / robots-txt-parser

A parser for robots.txt files
MIT License

How to test if Url is Allow/Disallow #2

Closed LeMoussel closed 7 years ago

LeMoussel commented 8 years ago

For example, with this robots.txt content:

User-agent: *
Disallow: *deny_all/$

User-agent: Googlebot
Disallow: *deny_googlebot/$

How can I test whether http://mytestsite.com/deny_all/ and http://mytestsite.com/deny_googlebot/ are allowed or disallowed for all user agents ("*") or for the Googlebot user agent ("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")?

webignition commented 7 years ago

Hi @LeMoussel

This library is for parsing the contents of a robots.txt file into a model (an instance of webignition\RobotsTxt\File\File), which can then be examined programmatically as required.

The model exists at a lower level of abstraction than the context of your question and as such the model can't directly do what you want. The model has no understanding of the different types of directives (allow, disallow and so on), nor does it understand the values of the directives (*deny_all/$, *deny_googlebot/$).

If you're willing to examine the raw directive names and raw directive values, you can certainly iterate over the set of directives for a given user agent and check whether any of them match the conditions you're interested in.

Here is a (somewhat convoluted) example from a unit test I just created:

public function testFoo()
{
    $source = <<<'EOD'
User-agent: *
Disallow: *deny_all/$

User-agent: Googlebot
Disallow: *deny_googlebot/$
EOD;

    $parser = new \webignition\RobotsTxt\File\Parser();
    $parser->setSource($source);

    $robotsTxtFile = $parser->getFile();

    $areAllUserAgentsDisallowedDenyAllPath = false;

    $directivesForAllAgents = $robotsTxtFile->getDirectivesFor('*')->get();
    foreach ($directivesForAllAgents as $directiveForAllUserAgents) {
        /** @var \webignition\RobotsTxt\Directive\Directive $directiveForAllUserAgents */
        // Compare the raw field name and the raw value; the model does not interpret either.
        $isDisallowDirective = $directiveForAllUserAgents->getField() === 'disallow';
        $isDenyAllPath = (string)$directiveForAllUserAgents->getValue() === '*deny_all/$';

        if ($isDisallowDirective && $isDenyAllPath) {
            $areAllUserAgentsDisallowedDenyAllPath = true;
        }
    }

    $this->assertTrue($areAllUserAgentsDisallowedDenyAllPath);
}
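
The test above only compares the raw directive value as a literal string. To decide whether a concrete URL path is covered by such a value, the wildcard syntax has to be interpreted separately. Below is a minimal sketch of one way to do that, assuming the commonly used semantics where `*` matches any sequence of characters and a trailing `$` anchors the pattern to the end of the path. The function name `pathMatchesPattern` is hypothetical and is not part of this library.

```php
<?php

/**
 * Hypothetical helper (not part of webignition/robots-txt-parser).
 *
 * Returns true if $path is matched by the robots.txt path pattern $pattern,
 * assuming "*" matches any sequence of characters and a trailing "$"
 * anchors the pattern to the end of the path.
 */
function pathMatchesPattern(string $pattern, string $path): bool
{
    if ($pattern === '') {
        // Assumption: an empty directive value matches nothing.
        return false;
    }

    // A trailing "$" means the pattern must match through the end of the path.
    $anchoredAtEnd = substr($pattern, -1) === '$';
    if ($anchoredAtEnd) {
        $pattern = substr($pattern, 0, -1);
    }

    // Escape regex metacharacters, then turn each escaped "*" into ".*".
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));

    // Patterns are always matched from the start of the path.
    $regex = '#^' . $regex . ($anchoredAtEnd ? '\z' : '') . '#';

    return preg_match($regex, $path) === 1;
}

// Usage: extract the path from the URL, then test it against a directive value.
$path = parse_url('http://mytestsite.com/deny_all/', PHP_URL_PATH); // "/deny_all/"

var_dump(pathMatchesPattern('*deny_all/$', $path));             // bool(true)
var_dump(pathMatchesPattern('*deny_all/$', '/deny_all/page'));  // bool(false)
```

Combined with the loop above, the literal comparison against '*deny_all/$' could be replaced by a call to such a helper for each disallow directive. Choosing which user-agent group applies to a full user-agent string such as Googlebot's is a separate step that this sketch does not cover.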

webignition commented 7 years ago

Might have a solution to this soon, reopening ...

webignition commented 7 years ago

Resolved now in robots-txt-file, which supersedes this package.