Open Minyar2004 opened 6 years ago
The reason might lie in helper.js:

```js
static generateKey(options) {
  const json = JSON.stringify(pick(options, PICKED_OPTION_FIELDS), Helper.jsonStableReplacer);
  return Helper.hash(json).substring(0, MAX_KEY_LENGTH);
}
```
Uniqueness is assessed from a hash generated on the result of JSON.stringify(), but that method doesn't guarantee a consistent key order.
I'm looking for opinions. See https://github.com/substack/json-stable-stringify
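To illustrate the key-order problem: JSON.stringify serializes object keys in insertion order, so two logically identical option objects can hash to different keys. A minimal sketch of a key-sorting stringify (in the spirit of the json-stable-stringify library linked above; `stableStringify` is a hypothetical name, not the library's API):

```javascript
// Two objects with identical contents but different insertion order
// serialize differently, so their hashes would differ too.
const a = JSON.stringify({ url: 'https://example.com', depth: 1 });
const b = JSON.stringify({ depth: 1, url: 'https://example.com' });
console.log(a === b); // false — key order follows insertion order

// Minimal stable stringify: recursively sort keys before serializing
function stableStringify(value) {
  if (value === null || typeof value !== 'object') return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  const keys = Object.keys(value).sort();
  return `{${keys.map(k => `${JSON.stringify(k)}:${stableStringify(value[k])}`).join(',')}}`;
}

console.log(
  stableStringify({ url: 'https://example.com', depth: 1 }) ===
  stableStringify({ depth: 1, url: 'https://example.com' })
); // true — same key regardless of insertion order
```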
Same as #299. @yujiosaka should look into this.
In headless mode it keeps reporting 302.
I found two reasons:

1. `maxConcurrency > 1`: the same page is requested in parallel threads.
2. `skipRequestedRedirect: true`
Is anyone considering creating a PR?
Just posting here hoping this helps someone. It is true that the crawler fetches duplicate URLs when concurrency > 1. Here is what I did:

1. First, create a sqlite database.
2. In the RequestStarted event, insert the current URL.
3. In the preRequest function (you can pass this function along with the options object), check whether there is a record of the current URL. If there is, the URL has been crawled or is still being crawled, so return false and the URL is skipped.
4. In the RequestRetried and RequestFailed events, delete the URL, so the crawler can try it again.
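A minimal sketch of the dedupe logic in the steps above. For brevity an in-memory Set stands in for the sqlite table (in the original workaround each function would run one SQL statement instead), and the function names here are hypothetical wrappers around the events the comment mentions:

```javascript
// Set of URLs that are crawled or in flight (stand-in for the sqlite table)
const seen = new Set();

// Step 2 — RequestStarted: record the URL as crawled/in-flight
function onRequestStarted(url) { seen.add(url); }

// Step 3 — preRequest: return false to skip URLs we have already seen
function preRequest(url) { return !seen.has(url); }

// Step 4 — RequestRetried / RequestFailed: forget the URL so it can be retried
function onRequestFailed(url) { seen.delete(url); }

// Simulate two concurrent workers picking up the same URL:
console.log(preRequest('https://example.com/')); // true — not seen yet
onRequestStarted('https://example.com/');
console.log(preRequest('https://example.com/')); // false — duplicate skipped
onRequestFailed('https://example.com/');
console.log(preRequest('https://example.com/')); // true — retry allowed
```

The retry cleanup in step 4 matters: without it, a failed URL would stay in the table and preRequest would skip every retry attempt.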
What is the current behavior?
Duplicated URLs are not skipped. The same URL is crawled twice.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
Crawled URLs should be skipped even if they come from the queue.
Please tell us about your environment: