yujiosaka / headless-chrome-crawler

Distributed crawler powered by Headless Chrome

Suggestion: BaseCache api is confusing and not efficient. #186

Open tomnielsen opened 6 years ago

tomnielsen commented 6 years ago

Background: LOVE this project! I tried to write my own BaseCache implementation to use LevelDB and have some general feedback.

What is the current behavior? The difference between get(key), set(key, value), enqueue(key, value, priority), and dequeue(key) is very confusing. key seems to be more of a keyspace?

Since the underlying Chrome browser has its own cache, the term is overloaded. E.g. does the clearCache setting clear Chrome's cache or the persistent priority queue?

What is the expected behavior? I expected the API to be more like standard Priority Queue APIs (similar to the PriorityQueue class used internally in the code), but that's not what the BaseCache API is.

Here's what the current API looks like (for reference).

class BaseCache {
    init();
    close();

    clear();

    get(key);
    set(key, value);

    enqueue(key, value, priority);
    dequeue(key);

    size(key);
    remove(key);
}

Maybe I'm missing something, but why does the outside caller need to know anything about what key the queue is using?

How is get() / set() supposed to be used outside the class compared to enqueue() and dequeue()?

I kind of expected the API for a persistent queue to look more like this:

class BasePersistentPriorityQueue {
    init(namespace);
    close();

    clear();

    enqueue([{value, priority}]);
    dequeue();

    peek();
    queue_length();
}

Notice that enqueue() takes an array like enqueue([{value: value, priority: priority}, ...]), since batch queuing might be supported by the underlying mechanism and can significantly improve performance.
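For concreteness, here's a minimal sketch of what a batch-friendly implementation could look like on top of LevelDB. It assumes the level npm package (a recent version, for db.clear()); the class name, key encoding, and omitted methods are illustrative, not part of the crawler:

const level = require('level');

class BasePersistentPriorityQueue {
    async init(namespace) {
        // One LevelDB directory per queue namespace.
        this._db = level(`./queue-${namespace}`);
    }

    async close() {
        await this._db.close();
    }

    async clear() {
        await this._db.clear();
    }

    // Batch insert: the whole [{value, priority}] array lands in a single
    // atomic LevelDB batch write instead of N separate writes.
    async enqueue(items) {
        const ops = items.map(({value, priority}) => ({
            type: 'put',
            // Sortable key: padded priority + timestamp + random suffix.
            // NOTE: lexicographic order dequeues the lowest number first;
            // invert (e.g. 99999999 - priority) if higher should win.
            key: `${String(priority).padStart(8, '0')}_${Date.now()}_${Math.random()}`,
            value: JSON.stringify(value),
        }));
        return this._db.batch(ops);
    }

    // Pop the lexicographically-first entry; peek() and queue_length()
    // would follow the same read-stream pattern.
    async dequeue() {
        return new Promise((resolve, reject) => {
            let entry = null;
            this._db.createReadStream({limit: 1})
                .on('data', data => { entry = data; })
                .on('error', reject)
                .on('end', () => {
                    if (!entry) return resolve(null);
                    this._db.del(entry.key)
                        .then(() => resolve(JSON.parse(entry.value)))
                        .catch(reject);
                });
        });
    }
}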

Higher-level code queues all links found on the page in a tight loop; this can/should be done in batch. From the existing code: each(urls, url => void this._push(extend({}, options, { url }), depth));

This loops over potentially hundreds of links found on a page. As it is now, each call reads/modifies/writes a single hotspot key. For a shared network queue, this has really horrible performance implications.
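A hypothetical batched variant of that loop would collect the whole page's links first and hand them to the queue in one call (pushAll is made up; options and depth mirror the shapes in the existing code, and enqueue is the batch API sketched above):

function pushAll(queue, urls, options, depth) {
    // Build every request object up front...
    const items = urls.map(url => ({
        value: Object.assign({}, options, {url, depth}),
        priority: options.priority || 0,
    }));
    // ...then issue one batched write instead of one write per url.
    return queue.enqueue(items);
}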

What is the motivation / use case for changing the behavior? Performance and readability.


yujiosaka commented 6 years ago

@tomnielsen You are 100% right. BaseCache grew strange over time. I should work on this.

tomnielsen commented 6 years ago

Might I make a secondary suggestion as part of this refactor?

Always have a durable "already processed urls" cache (referred to below as the pre-filter-list). Consider using LevelDB, which Chrome uses internally.
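For illustration, the pre-filter-list itself could be a thin LevelDB wrapper (same level package assumption as before; key presence marks a url as seen, and the stored value doubles as the claim token used below):

const level = require('level');

class PreFilterList {
    constructor(path = './processed-urls') {
        this._db = level(path);
    }

    // Resolves to the stored claim token, or undefined for unseen urls.
    async get(url) {
        try {
            return await this._db.get(url);
        } catch (err) {
            if (err.notFound) return undefined;
            throw err;
        }
    }

    put(url, token) {
        return this._db.put(url, token);
    }
}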

Overall process would go something like this:

Dequeuing out of a shared or network-based Priority Queue atomically will be a challenge, but the first operation of this step is to sanity-peek the next url and check it against the pre-filter-list. After this sanity check, the approved url is immediately inserted into the pre-filter-list before being removed from the Priority Queue.

Note: You could use this pre-filter-list to get atomic operations in the Priority Queue.

Assume the Priority Queue and pre-filter-list internally use a NoSQL key/value store.

When a process goes to pull an element out of the Priority Queue, the Priority Queue logic first sanity checks that the url is not already in the pre-filter-list. Then it immediately inserts the url into that pre-filter-list along with a random value used to verify it grabbed the lock.

The Priority Queue then flushes the pre-filter-list and waits some random number of milliseconds (e.g. if there's a delay to avoid overly aggressive crawling). It then re-reads the pre-filter-list key/value and verifies the random claim number is what it expects. If it matches, it's safe to assume it got the atomic claim on that url.
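Putting those steps together, here's a sketch of the claim protocol against the PreFilterList above (claimUrl and the delay values are illustrative; this assumes the store applies concurrent writes in some consistent last-writer-wins order):

const crypto = require('crypto');

async function claimUrl(preFilter, url, delayMs = 100) {
    // 1. Sanity check: skip urls already processed or claimed.
    if (await preFilter.get(url) !== undefined) return false;

    // 2. Write a random claim token into the pre-filter-list.
    const token = crypto.randomBytes(16).toString('hex');
    await preFilter.put(url, token);

    // 3. Wait a random interval (doubles as a politeness delay).
    await new Promise(resolve => setTimeout(resolve, delayMs + Math.random() * delayMs));

    // 4. Re-read and verify our token survived; if another worker
    //    overwrote it in the meantime, it won the race and we back off.
    return (await preFilter.get(url)) === token;
}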

A bonus of using the pre-filter-list is that the Priority Queue doesn't have to worry about duplicate url inserts anymore. Assuming a key/value store, this means the keys can be used to sort by priority instead of detecting duplicates. For example, a key could be priority-number + timestamp + rnd() (e.g. "00000008_2018-04-04T22:34:33.22_893422") and the value would just be the url. Sorting the queue then becomes a string sort on the keys, which is the default behavior for most NoSQL DBs.
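A sketch of that key scheme (padding width, timestamp format, and suffix range are illustrative; as in the earlier sketch, plain lexicographic order puts the lowest priority number first, so invert the number if higher priorities should dequeue first):

function makeQueueKey(priority) {
    const pad = String(priority).padStart(8, '0');
    const ts = new Date().toISOString();
    const rnd = Math.floor(Math.random() * 1e6);
    return `${pad}_${ts}_${rnd}`;
}

// e.g. makeQueueKey(8) -> "00000008_2018-04-04T22:34:33.220Z_893422"
// A default string sort on such keys is exactly priority order.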

Let me know if any of this is confusingly worded.