yujiosaka / headless-chrome-crawler

Distributed crawler powered by Headless Chrome

Suggestion: BaseCache api is confusing and not efficient. #186

Open tomnielsen opened 6 years ago

tomnielsen commented 6 years ago

Background: LOVE this project! I tried to write my own BaseCache implementation to use LevelDB and have some general feedback.

What is the current behavior? The difference between get(key), set(key, value), enqueue(key, value, priority), and dequeue(key) is very confusing. key seems to be more of a keyspace?

Since the underlying Chrome browser has its own cache, the term is overloaded. E.g. does the clearCache setting clear Chrome's cache or the persistent priority queue?

What is the expected behavior? I expected the API to be more like standard Priority Queue APIs (similar to the PriorityQueue class used internally in the code), but that's not what the BaseCache API is.

Here's what the current API looks like (for reference).

class BaseCache {
    init();
    close();

    clear();

    get(key);
    set(key, value);

    enqueue(key, value, priority);
    dequeue(key);

    size(key);
    remove(key);
}

Maybe I'm missing something, but why does the outside caller need to know anything about what key the queue is using?

How is get() / set() supposed to be used outside the class compared to enqueue() and dequeue()?

I kind of expected the API for a persistent queue to look more like this:

class BasePersistentPriorityQueue {
    init(namespace);
    close();

    clear();

    enqueue([{value, priority}]);
    dequeue();

    peek();
    queue_length();
}

Notice that enqueue() takes an array like enqueue([{value: value, priority: priority}, ...]), since batch queuing might be supported by the underlying mechanism and can significantly improve performance.
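For concreteness, here's a minimal sketch of what a batch-friendly implementation could look like on top of LevelDB. It assumes the level npm package (a recent version, for db.clear()); the class name, key encoding, and omitted methods are illustrative, not part of the crawler:

const level = require('level');

class BasePersistentPriorityQueue {
    async init(namespace) {
        // One LevelDB directory per queue namespace.
        this._db = level(`./queue-${namespace}`);
    }

    async close() {
        await this._db.close();
    }

    async clear() {
        await this._db.clear();
    }

    // Batch insert: the whole [{value, priority}] array lands in a single
    // atomic LevelDB batch write instead of N separate writes.
    async enqueue(items) {
        const ops = items.map(({value, priority}) => ({
            type: 'put',
            // Sortable key: padded priority + timestamp + random suffix.
            // NOTE: lexicographic order dequeues the lowest number first;
            // invert (e.g. 99999999 - priority) if higher should win.
            key: `${String(priority).padStart(8, '0')}_${Date.now()}_${Math.random()}`,
            value: JSON.stringify(value),
        }));
        return this._db.batch(ops);
    }

    // Pop the lexicographically-first entry; peek() and queue_length()
    // would follow the same read-stream pattern.
    async dequeue() {
        return new Promise((resolve, reject) => {
            let entry = null;
            this._db.createReadStream({limit: 1})
                .on('data', data => { entry = data; })
                .on('error', reject)
                .on('end', () => {
                    if (!entry) return resolve(null);
                    this._db.del(entry.key)
                        .then(() => resolve(JSON.parse(entry.value)))
                        .catch(reject);
                });
        });
    }
}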

Higher-level code queues all links found on the page in a tight loop; this can/should be done in batch. From the existing code: each(urls, url => void this._push(extend({}, options, { url }), depth));

This loops over potentially hundreds of links found on a page. As it is now, each call reads/modifies/writes a single hotspot key. For a shared network queue, this has really horrible performance implications.
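A hypothetical batched variant of that loop would collect the whole page's links first and hand them to the queue in one call (pushAll is made up; options and depth mirror the shapes in the existing code, and enqueue is the batch API sketched above):

function pushAll(queue, urls, options, depth) {
    // Build every request object up front...
    const items = urls.map(url => ({
        value: Object.assign({}, options, {url, depth}),
        priority: options.priority || 0,
    }));
    // ...then issue one batched write instead of one write per url.
    return queue.enqueue(items);
}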

What is the motivation / use case for changing the behavior? Performance and readability.


yujiosaka commented 6 years ago

@tomnielsen You are 100% right. BaseCache grew strange over time. I should work on this.

tomnielsen commented 6 years ago

Might I make a secondary suggestion as part of this refactor?

Always have a durable "already processed urls" cache (referred to below as the pre-filter-list). Consider using LevelDB, which Chrome uses internally.
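For illustration, the pre-filter-list itself could be a thin LevelDB wrapper (same level package assumption as before; key presence marks a url as seen, and the stored value doubles as the claim token used below):

const level = require('level');

class PreFilterList {
    constructor(path = './processed-urls') {
        this._db = level(path);
    }

    // Resolves to the stored claim token, or undefined for unseen urls.
    async get(url) {
        try {
            return await this._db.get(url);
        } catch (err) {
            if (err.notFound) return undefined;
            throw err;
        }
    }

    put(url, token) {
        return this._db.put(url, token);
    }
}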

Overall process would go something like this:

Dequeuing out of a shared or network-based Priority Queue atomically will be a challenge, but the first operation of this step is to sanity-peek the next url and check it against the pre-filter-list. After this sanity check, the approved url is immediately inserted into the pre-filter-list before being removed from the Priority Queue.

Note: You could use this pre-filter-list to get atomic operations in the Priority Queue.

Assume the Priority Queue and pre-filter-list internally use a NoSQL key/value store.

When a process goes to pull an element out of the Priority Queue, the Priority Queue logic first sanity checks that the url is not already in the pre-filter-list. Then it immediately inserts the url into that pre-filter-list along with a random value used to verify it grabbed the lock.

The Priority Queue then flushes the pre-filter-list and waits some random number of milliseconds (e.g. if there's a delay to avoid overly aggressive crawling). It then re-reads the pre-filter-list key/value and verifies the random claim number is what it expects. If it matches, it's safe to assume it got the atomic claim on that url.
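Putting those steps together, here's a sketch of the claim protocol against the PreFilterList above (claimUrl and the delay values are illustrative; this assumes the store applies concurrent writes in some consistent last-writer-wins order):

const crypto = require('crypto');

async function claimUrl(preFilter, url, delayMs = 100) {
    // 1. Sanity check: skip urls already processed or claimed.
    if (await preFilter.get(url) !== undefined) return false;

    // 2. Write a random claim token into the pre-filter-list.
    const token = crypto.randomBytes(16).toString('hex');
    await preFilter.put(url, token);

    // 3. Wait a random interval (doubles as a politeness delay).
    await new Promise(resolve => setTimeout(resolve, delayMs + Math.random() * delayMs));

    // 4. Re-read and verify our token survived; if another worker
    //    overwrote it in the meantime, it won the race and we back off.
    return (await preFilter.get(url)) === token;
}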

A bonus of using the pre-filter-list is that the Priority Queue doesn't have to worry about duplicate url inserts anymore. Assuming a key/value store, this means the keys can be used to sort by priority instead of detecting duplicates. For example, a key could be priority-number + timestamp + rnd() (e.g. "00000008_2018-04-04T22:34:33.22_893422") and the value would just be the url. Sorting the queue then becomes a string sort on the keys, which is the default behavior for most NoSQL DBs.
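A sketch of that key scheme (padding width, timestamp format, and suffix range are illustrative; as in the earlier sketch, plain lexicographic order puts the lowest priority number first, so invert the number if higher priorities should dequeue first):

function makeQueueKey(priority) {
    const pad = String(priority).padStart(8, '0');
    const ts = new Date().toISOString();
    const rnd = Math.floor(Math.random() * 1e6);
    return `${pad}_${ts}_${rnd}`;
}

// e.g. makeQueueKey(8) -> "00000008_2018-04-04T22:34:33.220Z_893422"
// A default string sort on such keys is exactly priority order.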

Let me know if any of this is confusingly worded.