rafaelfndev opened this issue 3 months ago (status: Open)
Hi @rafaelfndev 👋
Thanks for reporting the issue.
As for setting cookies, everything looks correct to me. The initial cookies (specified in `request.headers`) should be passed to each request.
You can see which headers are used for each request by implementing a plugin with a `beforeRequest` action and logging the headers there.
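A minimal logging plugin of that shape could look like the sketch below. It assumes the plugin interface described above: an object whose `apply()` method receives `registerAction`, the same shape used in the configuration later in this thread.

```javascript
// Sketch of a header-logging plugin for website-scraper.
// Assumes the plugin shape: an object with apply(registerAction).
const logHeadersPlugin = {
  apply(registerAction) {
    registerAction('beforeRequest', async ({ requestOptions }) => {
      // Log the headers that will be sent with this request
      console.log('beforeRequest headers:', requestOptions.headers);
      // Return the (unmodified) options so the request proceeds normally
      return { requestOptions };
    });
  },
};
```

You would pass `logHeadersPlugin` in the `plugins` array of the scraper options, next to the Puppeteer plugin, to see what each request actually sends.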
If the cookie is missing from the second request, it sounds like a bug; please let me know.
It's also possible that the website sets new cookies for each response, which should be passed in the next request. This is not implemented in the module.
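If you do need that behavior, the header merging itself is straightforward; what is an assumption here is where you would read the `set-cookie` headers from, since the module's real response shape may differ. A sketch of the merge step:

```javascript
// Sketch: merge Set-Cookie response headers into an existing Cookie
// header string, keeping the most recent value for each cookie name.
// Where the setCookieHeaders array comes from is an assumption; the
// module does not currently expose this (as noted above).
function mergeSetCookies(cookieString, setCookieHeaders) {
  const jar = new Map();
  // Parse the current Cookie header string ("a=1; b=2")
  for (const pair of cookieString.split(';')) {
    const [name, ...rest] = pair.trim().split('=');
    if (name) jar.set(name, rest.join('='));
  }
  // Apply each Set-Cookie header; only the name=value part matters here,
  // attributes like Path or Expires are dropped in this sketch.
  for (const header of setCookieHeaders || []) {
    const [name, ...rest] = header.split(';')[0].trim().split('=');
    if (name) jar.set(name, rest.join('='));
  }
  return [...jar.entries()].map(([n, v]) => `${n}=${v}`).join('; ');
}
```

The merged string would then be set as the `Cookie` header in the next `beforeRequest` call.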
@s0ph1e, thanks for the help.
I found the problem (or part of it). When I pass the `plugins` key with options to configure Puppeteer, the cookie becomes invalid on the next request. This happens whether I set `headless: true` or `headless: false`, or define any options for Puppeteer. If I don't set up Puppeteer as a plugin, the cookies work.
I also tried setting `headless: false` and force-sending the cookie with `beforeRequest`, like this:
```js
const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');

const options = {
  urls: ['https://example.com/admin/'],
  directory: './website',
  request: {
    headers: {
      Cookie: cookieString,
    },
  },
  recursive: true,
  urlFilter: function (url) {
    return url.indexOf('https://example.com') === 0;
  },
  requestConcurrency: 1,
  plugins: [
    new PuppeteerPlugin({ launchOptions: { headless: false } }),
    {
      // Force the Cookie header on every request
      apply(registerAction) {
        registerAction('beforeRequest', async ({ requestOptions }) => {
          requestOptions.headers = requestOptions.headers || {};
          requestOptions.headers['Cookie'] = cookieString;
          return { requestOptions };
        });
      },
    },
  ],
};

await scrape(options);
```
But that didn't work either. I believe the problem is in the request with Puppeteer.
My final code, which worked, is as follows (without Puppeteer):
```js
const scrape = require('website-scraper');

const options = {
  urls: ['https://example.com/admin/'],
  directory: './website',
  request: {
    headers: {
      Cookie: cookieString,
    },
  },
  recursive: true,
  urlFilter: function (url) {
    return url.indexOf('https://example.com') === 0;
  },
  requestConcurrency: 1,
};

await scrape(options);
```
This way, cookies remain valid in subsequent requests.
The problem is certainly with Puppeteer; maybe I need to set cookies for it too.
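One way to try that would be to seed Puppeteer's browser with the same cookies via its `page.setCookie()` API before navigation. The conversion below is a sketch: the `domain` argument is an assumption for the example, and where you would get access to the `page` object depends on the plugin, which doesn't document a hook for this.

```javascript
// Sketch: convert a Cookie header string ("a=1; b=2") into the cookie
// objects that Puppeteer's page.setCookie() expects.
// The domain parameter is an assumption supplied by the caller.
function cookieStringToPuppeteer(cookieString, domain) {
  return cookieString.split(';').map((pair) => {
    const [name, ...rest] = pair.trim().split('=');
    return { name, value: rest.join('='), domain };
  });
}

// Usage (not run here; requires a Puppeteer page object):
// await page.setCookie(...cookieStringToPuppeteer(cookieString, 'example.com'));
```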
I've moved the issue to the puppeteer repo.
> force-sending the cookie with `beforeRequest`, like this:
The existing code has the same logic and should handle this. I need to debug why it doesn't work.
The issue may occur because the website detects requests coming from Puppeteer (there are many approaches for that, e.g. as described here).
Unfortunately, I don't have time to dig deep into the issue, so please don't expect a fix soon unless someone wants to contribute.
If your page doesn't have js-rendered content and everything works fine without the puppeteer plugin, I recommend avoiding using it.
**Describe the bug**
I'm getting the cookie through Puppeteer and saving it in a cookies.json file. After that, I load the cookies and send them in string format to the scraper (please, if there is another way to do this, let me know).
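For reference, one common way to turn the array returned by Puppeteer's `page.cookies()` (which is what a cookies.json file like this typically contains) into a `Cookie` header string is a sketch like the following; the `name`/`value` field names match what `page.cookies()` returns, while the file name is just the one mentioned above.

```javascript
// Sketch: build a Cookie request-header string from the cookie array
// returned by Puppeteer's page.cookies() (entries have name/value fields).
function cookiesToHeader(cookies) {
  return cookies.map((c) => `${c.name}=${c.value}`).join('; ');
}

// Example usage with the saved cookies.json file:
// const cookieString = cookiesToHeader(
//   JSON.parse(require('fs').readFileSync('cookies.json', 'utf8'))
// );
```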
I'm using the scraper recursively. On the first request, the page loads normally, the cookie works, and the logged-in panel loads. On the second request, the cookie no longer works, and the page says that I am not logged in.
Note: I'm trying to clone a WordPress site where the logged-in area also uses WooCommerce (I believe this is irrelevant to the scraper, but it's just an observation).
**Expected behavior**
The cookie should work for all requests, but it only works for the first one.
**Configuration**
**My code**
**Steps to reproduce**