Open tomnielsen opened 6 years ago
@tomnielsen You are 100% right. BaseCache grew strange over time. I should work on this.
Might I make a secondary suggestion as part of this refactor?
Always have a durable "already processed urls" cache (referred to below as the pre-filter-list).
Consider using LevelDB which chrome uses internally.
Overall, the process would go something like this:

1. Each candidate url is passed to `preRequest()` to see if it should be crawled (and to override the priority). All rejects are added to the pre-filter-list. The urls that are not filtered out are passed to the next step.
2. Dequeuing out of a shared or network-based Priority Queue atomically will be a challenge. But the first operation of this step is to sanity-peek the next url and check it against the pre-filter-list. After this sanity check, the approved url is immediately inserted into the pre-filter-list before being removed from the Priority Queue.
Note: You could use this pre-filter-list to get atomic operations in the Priority Queue.
Assume the Priority Queue and pre-filter-list internally use a NoSQL key/value store. When a process goes to pull an element out of the Priority Queue, the Priority Queue logic first sanity-checks that the url is not already in the pre-filter-list. It then immediately inserts the url into the pre-filter-list along with a random value used to verify it grabbed the lock.
The Priority Queue then flushes that write and waits some random number of milliseconds (e.g. if there's already a delay for polite, not-overly-aggressive crawling). It then re-reads the pre-filter-list key/value and verifies the random claim number is what it expects. If it matches, it's safe to assume it got the atomic claim on that url.
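That claim-and-verify handshake can be sketched against a plain in-memory `Map` standing in for the real key/value store (the store, the `tryClaim` name, and the delay are all illustrative, not the project's actual API):

```js
// In-memory stand-in for the NoSQL key/value store backing the pre-filter-list.
const preFilterList = new Map();

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Try to atomically claim a url peeked from the Priority Queue.
// Returns true only if our random claim token survives the re-read.
async function tryClaim(url) {
  if (preFilterList.has(url)) return false;          // sanity check: already processed/claimed
  const token = Math.random().toString(36).slice(2); // random claim number
  preFilterList.set(url, token);                     // write (and flush) the claim
  await sleep(Math.floor(Math.random() * 20));       // wait some random milliseconds
  return preFilterList.get(url) === token;           // verify our claim is still stored
}
```

With a shared store, two workers racing on the same url write different tokens; after the delay only one re-reads its own token back and wins the claim. (A single in-process `Map` can only demonstrate the shape of the protocol, not real contention.)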
A bonus of using a pre-filter-list is that the Priority Queue doesn't have to worry about duplicate url inserts anymore. Assuming a key/value store, this means the keys can be used to sort by priority instead of to detect duplicates. For example, a key could be priority-number + timestamp + rnd() (e.g. `"00000008_2018-04-04T22:34:33.22_893422"`), and the value would just be the url. Sorting the queue then becomes a string sort on the keys, which is the default behavior for most NoSQL DBs.
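A key builder along those lines might look like this (the zero-padding width and `_` separator are arbitrary choices here, not anything the project defines):

```js
// Build a sortable queue key: zero-padded priority + ISO timestamp + random suffix.
// A plain string sort on these keys orders entries by priority, then FIFO by time.
function makeQueueKey(priority, now = new Date()) {
  const pad = String(priority).padStart(8, '0');       // "00000008"
  const ts = now.toISOString();                        // "2018-04-04T22:34:33.220Z"
  const rnd = String(Math.floor(Math.random() * 1e6)); // tie-breaker, e.g. "893422"
  return `${pad}_${ts}_${rnd}`;
}
// The value stored under this key would just be the url.
```

Note that whether the store iterates keys ascending or descending decides whether the priority number needs to be inverted before padding, so that "higher priority" urls come out first.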
Let me know if any of this is confusingly worded.
**Background**

LOVE this project! I tried to write my own BaseCache instance to use LevelDB and have some general feedback.
**What is the current behavior?**

The difference between `get(key)`, `set(key, value)`, `enqueue(key, value, priority)`, and `dequeue(key)` is very confusing. `key` seems to be more of a keyspace? Since the underlying Chrome browser has its own cache, the term is overloaded. E.g., does the `clearCache` setting clear Chrome's cache or the persistent priority queue?

**What is the expected behavior?**

I expected the API to look more like a standard Priority Queue API (similar to the `PriorityQueue` class used internally in the code), but that's not the BaseCache API. Here's what the current API looks like (for reference).
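Roughly, the abstract interface being discussed has this shape (paraphrased from memory; see `cache/base.js` in the project for the real definitions):

```js
// Rough shape of the current BaseCache interface under discussion
// (a paraphrase for reference, not the exact source).
class BaseCache {
  init() {}                         // open/connect the backing store
  clear() {}                        // wipe the cache
  close() {}                        // release the connection
  get(key) {}                       // read a value -- but what is `key` here?
  set(key, value) {}                // write a value
  enqueue(key, value, priority) {}  // push onto the persistent queue
  dequeue(key) {}                   // pop from the persistent queue
  size(key) {}                      // queue length
  remove(key) {}                    // delete an entry
}
```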
Maybe I'm missing something, but why does the outside caller need to know anything about what key the queue is using?
How are `get()`/`set()` supposed to be used outside the class, compared to `enqueue()` and `dequeue()`? I kind of expected the API for persisting a queue to look more like this:
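Something in this direction, where the caller never sees storage keys and enqueue is batched (class and method names here are just a suggestion, with an in-memory array standing in for a durable store):

```js
// A queue-shaped persistence API: no storage keys leak to the caller.
class PersistentPriorityQueue {
  constructor() { this._items = []; } // in-memory stand-in for a durable store

  // Batch insert: one round-trip for all links found on a page.
  enqueue(entries /* [{ value, priority }, ...] */) {
    this._items.push(...entries);
    this._items.sort((a, b) => b.priority - a.priority); // highest priority first
  }

  dequeue() {
    const item = this._items.shift();
    return item ? item.value : undefined;
  }

  size() { return this._items.length; }
}
```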
Notice that `enqueue()` takes an array like `enqueue([{value: value, priority: priority}, ...])`, since batch queuing might be supported by the underlying mechanism and could significantly improve performance. Higher-level code queues all links found on a page in a tight loop; it can and should be done in batch. From the existing code:
```js
each(urls, url => void this._push(extend({}, options, { url }), depth));
```
This loops over potentially hundreds of links found on a page. As it is now, each call reads, modifies, and writes a single hotspot `key`. For a shared network queue, this has really horrible performance implications.

**What is the motivation / use case for changing the behavior?**

Performance and readability.
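With a batch-capable enqueue, that per-link loop could collapse into a single store operation, sketched here against a hypothetical `_pushBatch` method (the real code has no such method, and `Object.assign` stands in for lodash's `extend`):

```js
// Hypothetical batch version of the per-url push loop.
// Instead of one read/modify/write of the hot key per link,
// build the batch in memory and hand it to the cache in one call.
function pushAllLinks(crawler, urls, options, depth) {
  const entries = urls.map(url => ({
    value: Object.assign({}, options, { url }),
    priority: options.priority || 0,
  }));
  return crawler._pushBatch(entries, depth); // single round-trip to the store
}
```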