website-scraper / website-scraper-puppeteer

Plugin for website-scraper which returns html for dynamic websites using puppeteer
MIT License

Cookies are becoming invalid on subsequent requests #115

Open rafaelfndev opened 3 months ago

rafaelfndev commented 3 months ago

Describe the bug

I'm getting the cookies through Puppeteer and saving them in a cookies.json file. After that, I load the cookies and send them in string format to the scraper (please let me know if there is another way to do this).

I'm using the scraper recursively. On the first request, the page loads normally, the cookie works, and the logged-in panel is shown. On the second request, the cookie no longer works, and the page says that I am not logged in.

Note: I'm trying to clone a WordPress site where the logged-in area also uses WooCommerce (I believe this is irrelevant to the scraper, but it's just an observation).

Expected behavior

The cookie should work for all requests, but it only works for the first one.

Configuration

My code

import scrape from 'website-scraper';
import PuppeteerPlugin from 'website-scraper-puppeteer';
import puppeteer from 'puppeteer';
import fs from 'fs';

const pup = {
    headless: false,
    slowMo: 50,
    args: ['--no-sandbox', '--disable-setuid-sandbox', '--start-maximized'],
    defaultViewport: null,
    executablePath: "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
};

function formatCookies(cookies) {
    return cookies.map(cookie => `${cookie.name}=${cookie.value}`).join('; ');
}

(async () => {

    const browser = await puppeteer.launch(pup);
    const page = await browser.newPage();
    await page.goto('https://example.com/login/');

    console.log('Press enter after login');

    process.stdin.resume();
    await new Promise(resolve => process.stdin.once('data', resolve));

    const getCookies = await page.cookies();
    fs.writeFileSync('cookies.json', JSON.stringify(getCookies, null, 2));

    console.log('Cookies saved!');

    await browser.close();

    const cookies = JSON.parse(fs.readFileSync('cookies.json'));

    const cookieString = formatCookies(cookies);

    const options = {
        urls: [
            'https://example.com/admin/',
        ],
        directory: `./site`,
        plugins: [
            new PuppeteerPlugin({
                launchOptions: pup,
            })
        ],
        request: {
            headers: {
                Cookie: cookieString,
            }
        },
        recursive: true,
        urlFilter: function(url) {
            return url.indexOf('https://example.com') === 0;
        },
    };

    await scrape(options);

    console.log('Clone finished!');
})();

Steps to reproduce

  1. Change the URL to the correct website
  2. Run the script in the terminal: "node index.js"
  3. Navigate to the login page, log in manually, return to the terminal and press Enter
  4. The clone starts with the cookies set and the session logged in

s0ph1e commented 3 months ago

Hi @rafaelfndev 👋

Thanks for reporting the issue.

As for setting cookies, everything looks correct to me.

The initial cookies (specified in request.headers) should be passed to each request. You can check which headers are used for each request by implementing a plugin with a beforeRequest action and logging the headers there. Please let me know if the cookie is missing for the second request; in that case it sounds like a bug.
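
For reference, here is a minimal sketch of such a logging plugin, based on the beforeRequest signature from the website-scraper plugin docs; the plugin name and log format are just placeholders:

const logHeadersPlugin = {
    apply(registerAction) {
        // Print the headers of every outgoing request so you can check
        // whether the Cookie header is still present after the first request
        registerAction('beforeRequest', async ({ requestOptions }) => {
            console.log('Request headers:', requestOptions.headers);
            // Return the options unchanged so the request proceeds normally
            return { requestOptions };
        });
    }
};

// Register it next to PuppeteerPlugin:
// plugins: [new PuppeteerPlugin({ launchOptions: pup }), logHeadersPlugin]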

It's also possible that the website sets new cookies for each response, which should be passed in the next request. This is not implemented in the module.

rafaelfndev commented 3 months ago

@s0ph1e, thanks for the help.

I found the problem (or part of it). When I pass the plugins key with the options to configure Puppeteer, the cookie becomes invalid on the next request. This happens whether I set headless: true or headless: false, or define any other Puppeteer options. If I don't register Puppeteer as a plugin, the cookies work.

I also tried setting headless: false and forcing the cookie with beforeRequest, like this:

const options = {
    urls: [
        'https://example.com/admin/',
    ],
    directory: `./website`,
    request: {
        headers: {
            Cookie: cookieString,
        }
    },
    recursive: true,
    urlFilter: function(url) {
        return url.indexOf('https://example.com') === 0;
    },
    requestConcurrency: 1,
    plugins: [
        new PuppeteerPlugin({launchOptions: {headless: false}}),
        {
            apply(registerAction) {
                registerAction('beforeRequest', async ({ requestOptions }) => {
                    requestOptions.headers = requestOptions.headers || {};
                    requestOptions.headers['Cookie'] = cookieString;
                    return { requestOptions };
                });
            }
        }
    ]
};

await scrape(options);

But that didn't work either. I believe the problem is in the request with Puppeteer.

My final code that worked is the following (without Puppeteer):

const options = {
    urls: [
        'https://example.com/admin/',
    ],
    directory: `./website`,
    request: {
        headers: {
            Cookie: cookieString,
        }
    },
    recursive: true,
    urlFilter: function(url) {
        return url.indexOf('https://example.com') === 0;
    },
    requestConcurrency: 1,
};

await scrape(options);

This way, cookies remain valid in subsequent requests.

The problem is certainly with Puppeteer; maybe I need to set the cookies for it too.
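
For reference, a rough sketch of how the saved cookie objects could be re-applied to a Puppeteer page directly via Puppeteer's page.setCookie (shown as a standalone example, not something wired into the website-scraper-puppeteer plugin):

// Inside an async function, after cookies.json has been saved as above
const cookies = JSON.parse(fs.readFileSync('cookies.json'));

const browser = await puppeteer.launch(pup);
const page = await browser.newPage();

// Re-apply the saved cookie objects (name, value, domain, path, ...)
await page.setCookie(...cookies);

// The page should now carry the logged-in session
await page.goto('https://example.com/admin/');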

aivus commented 3 months ago

I've moved the issue to the puppeteer repo.

> force sending the cookie with beforeRequest, like this:

The existing code has the same logic and should handle this. We need to debug why it doesn't work.

s0ph1e commented 3 months ago

The issue may occur because the website detects requests from Puppeteer (there are many techniques for that, e.g. as described here).

Unfortunately, I don't have time to dig deep into the issue, so please don't expect a fix soon unless someone wants to contribute.

If your page doesn't have JS-rendered content and everything works fine without the Puppeteer plugin, I recommend not using it.