webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Suggestion: make it easy to integrate adblocker #119

Open phiresky opened 2 years ago

phiresky commented 2 years ago

Adding an ad blocker seems to make crawling much easier. On the two sites I've tested, the number of requests without an ad blocker is an order of magnitude higher than with one, and on one of the two sites the crawler can't even tell when the page has finished loading because the page keeps loading more ads.

I guess it's possible to do this manually by creating a profile ourselves but it's pretty cumbersome (doesn't work with the interactive profile creation tool either).

I'm thinking of something like this in crawl-config.yaml:

browser-extensions:
   - "cjpalhdlnbpafiamejdnhcphjbkeiagm" # uBlock Origin

phiresky commented 2 years ago

Using --profile=... with a profile that I created manually by installing the extension doesn't seem to work either; I'm not sure why. What worked for me was running

docker run -e CHROME_FLAGS="--disable-extensions-except=/crawls/ublock --load-extension=/crawls/ublock"

and putting the extracted extension in /crawls/ublock
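
The workaround above could be wrapped in a small helper, sketched here under some assumptions. The function names `chrome_flags_for_extensions` and `print_crawl_command` are hypothetical, and the image name, volume mount, and crawl arguments in the printed command are assumptions that do not come from this thread. The sketch only prints the command (a dry run) and relies on both Chromium flags accepting a comma-separated list of paths.

```shell
#!/usr/bin/env bash

# Build the Chromium flags that load one or more unpacked extensions.
# Each argument is an extension directory as seen inside the container.
chrome_flags_for_extensions() {
  local joined
  # Join all arguments with commas; both flags take a comma-separated list.
  joined=$(IFS=,; echo "$*")
  echo "--disable-extensions-except=${joined} --load-extension=${joined}"
}

# Dry run: print a docker command instead of executing it.
# ASSUMPTION: image name, volume mount, and crawl arguments are illustrative.
print_crawl_command() {
  local flags
  flags=$(chrome_flags_for_extensions "$@")
  echo "docker run -v \$PWD/crawls:/crawls -e CHROME_FLAGS=\"${flags}\" webrecorder/browsertrix-crawler crawl --url https://example.com/"
}

print_crawl_command /crawls/ublock
```

Keeping the flag construction in one function means the same path list feeds both flags, so the allow-list and the load list cannot drift apart.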

ikreymer commented 2 years ago

Yes, I think it's a good idea, but we should probably figure out a way to have it installed by default, w/o requiring a custom profile, perhaps via --load-extension? Do you have time to work on a PR by any chance? It would be greatly appreciated! Maybe a more high-level flag, like --enable-ad-block, might be fine to start.

phiresky commented 2 years ago

Do you have an idea why loading a profile that has an extension installed doesn't work? Maybe I did something wrong, but without explicitly specifying the extension with disable-extensions-except+load-extension it seemed to ignore it. Probably should figure that out before being able to implement extension loading in code... What I did:

chromium --user-data-dir tmpdir
# install the extension manually from the store
cd tmpdir && tar cf ublock-profile.tar *
docker run ....... --profile /.../ublock-profile.tar
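
One way to narrow this down is to check whether the extension's files made it into the profile archive at all. The sketch below is not a confirmed diagnosis. The helper name `profile_has_extension` is hypothetical, and it assumes that the archive was created from inside the user data directory (as in the steps above) and that store-installed extensions live under Default/Extensions/<id>/ in the profile.

```shell
#!/usr/bin/env bash

# Succeed if the profile tarball lists files for the given extension ID.
# ASSUMPTION: entries look like Default/Extensions/<id>/... inside the archive.
profile_has_extension() {
  local tarball="$1" ext_id="$2"
  # List the archive contents and look for the extension's directory.
  tar tf "$tarball" | grep "Default/Extensions/${ext_id}/" >/dev/null
}

# Example usage with the file name and extension ID mentioned in this thread.
if [ -f ublock-profile.tar ]; then
  if profile_has_extension ublock-profile.tar cjpalhdlnbpafiamejdnhcphjbkeiagm; then
    echo "extension files found in profile"
  else
    echo "extension files missing from profile"
  fi
fi
```

If the files are present in the archive but the crawler still ignores them, the problem is more likely in how the browser is launched than in how the profile was packed.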

I scraped this URL by the way to test whether or not uBlock was installed: https://blockads.fivefilters.org/

I'll create a PR that at least adds documentation for the environment flag(s).

despens commented 2 years ago

There is indeed some odd behavior when installing a browser extension during manual profile creation: the browser never shows the extension as installed:

https://user-images.githubusercontent.com/571494/161566516-0ecd7a43-c536-4f91-b610-5adcc9945e73.mp4

The containerized Chromium doesn't seem to have any extensions enabled, at least according to chrome://extensions/:

[image: screenshot of chrome://extensions/]

rennr commented 1 year ago

> Do you have an idea why loading a profile that has an extension installed doesn't work? Maybe I did something wrong, but without explicitly specifying the extension with disable-extensions-except+load-extension it seemed to ignore it.

Did you ever figure out why this was?

Having extensions enabled on a profile, and having the crawler actually use them, would be pretty significant, especially for ad blocking. I appreciate having the DNS adblock list integrated, but it misses a lot of content that something like uBlock would be able to block.