webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

userAgent option #90

Closed kauka-1 closed 2 years ago

kauka-1 commented 3 years ago

Hello, it seems that the userAgent option doesn't work in my installation (latest image).

The crawl doesn't even start; the message is "Error: Unable to launch browser, error message: Could not find expected browser (chrome) locally." When the option is removed, Chrome is suddenly found OK.

I have tried both simple and more complex strings as user agents, for example:

--userAgent "https://www.kansalliskirjasto.fi/en/legal-deposit-office"
--userAgent https://www.kansalliskirjasto.fi/en/legal-deposit-office
--userAgent "something"
--userAgent something

I also tried placing the option at different positions on the command line.

Regards,

rmfkdehd commented 3 years ago

The userAgent option doesn't seem to work in older versions either: 0.4.4, 0.4.3.

emmadickson commented 3 years ago

hmmm @kauka-1 @rmfkdehd I'm having trouble recreating this error. What computer are you using, and what command exactly are you running?

When I run the following on an M1 Mac:

docker pull webrecorder/browsertrix-crawler 
docker-compose run crawler crawl --url https://www.mixedconnections.us/missed --generateWACZ  --saveState partial --screenshot --userAgent fake

The job completes normally.

rmfkdehd commented 3 years ago

@emmadickson My system is an AMD Ryzen 1600X running Ubuntu 20.04. I have now tried with this command:

sudo docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://www.forexfactory.com/thread/1057466-fractal-geometry --generateWACZ --saveState partial --screenshot --userAgent Chrome

I'm still getting the error. (I don't use docker-compose; I just run it with docker and prepend sudo to avoid permission errors.)

Below is the log.

sudo docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://www.forexfactory.com/thread/1057466-fractal-geometry --generateWACZ --saveState partial --screenshot --userAgent Chrome
Crawl failed
Error: Unable to launch browser, error message: Could not find expected browser (chrome) locally. Run `npm install` to download the correct Chromium revision (856583).
    at Cluster.init (/app/node_modules/puppeteer-cluster/dist/Cluster.js:110:19)
    at async Function.launch (/app/node_modules/puppeteer-cluster/dist/Cluster.js:69:9)
    at async Crawler.crawl (/app/crawler.js:354:20)
    at async Crawler.run (/app/crawler.js:250:7)

rmfkdehd commented 3 years ago

@emmadickson I also tried using docker-compose. Command sequence:

1. unzip the git repository
2. mkdir last
3. cd last
4. sudo docker-compose build
5. sudo docker-compose run crawler crawl --url https://www.forexfactory.com/thread/1057466-fractal-geometry --generateWACZ --saveState partial --screenshot --userAgent fake

Below is the error log

sudo docker-compose run crawler crawl --url https://www.forexfactory.com/thread/1057466-fractal-geometry --generateWACZ --saveState partial --screenshot --userAgent Chrome
Creating network "composee_default" with the default driver
Creating composee_crawler_run ... done
Crawl failed
Error: Unable to launch browser, error message: Could not find expected browser (chrome) locally. Run `npm install` to download the correct Chromium revision (856583).
    at Cluster.init (/app/node_modules/puppeteer-cluster/dist/Cluster.js:110:19)
    at async Function.launch (/app/node_modules/puppeteer-cluster/dist/Cluster.js:69:9)
    at async Crawler.crawl (/app/crawler.js:354:20)
    at async Crawler.run (/app/crawler.js:250:7)
ERROR: 1
kauka-1 commented 3 years ago

This is macOS Catalina on an Intel processor, and I ran a command like this:

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
  --profile /crawls/profiles/facebook-profile.tar.gz \
  --seedFile /crawls/tokio-facebook-seeds.txt \
  --scopeType prefix --depth 1 \
  --statsFilename "tokio-log" --crawlId 2021_tokion_olympialaiset2020 \
  --behaviours autoscroll,autoplay,autofetch,siteSpecific \
  --workers 4 --limit 1000 --text --collection 2021_tokion_olympialaiset2020-hops1

So that command worked, but it stopped working once I added the userAgent option.

karenhanson commented 2 years ago

I'm also seeing the same error (on a Mac / Big Sur), so I looked at crawler.js to see if I could figure it out. Not sure if what I found is helpful:

Whatever I enter for userAgent, it gives the "Unable to launch browser" error, even if I replicate the default value here and pass that as the parameter: https://github.com/webrecorder/browsertrix-crawler/blob/affa45a7d4fc2f06e966631ae98e2d77d9f390c1/crawler.js#L129

Also, if I add a userAgentSuffix instead, it doesn't error, but it doesn't add the suffix either. The User-Agent in my WARC requests still looks like this, with no suffix (I'm including the two Sec-Ch-Ua lines in case they're relevant):

User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36
Sec-Ch-Ua: " Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"
Sec-Ch-Ua-Mobile: ?0

I modified crawler.js to see if I could hardcode the line referenced above to set this.userAgent and add a suffix that way, but the WARC still didn't show the modified value. I then added logging to print out the value of this.headers in isHTML(). It showed that the hardcoded value is still intact at that point and includes the suffix: https://github.com/webrecorder/browsertrix-crawler/blob/e160382f4d3beaf3a67e4c5fc4aa7748527a351f/crawler.js#L711-L715

So something seems to be overriding the User-Agent between isHTML() and the headers being written to the WARC file, but I'm not sure what.

I tested this in 0.5.0 Beta 4. Hope this helps!
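For reference, the suffix behavior being tested above can be sketched in a few lines. This is a hypothetical illustration of what a userAgentSuffix is expected to do (append the suffix to the browser's default User-Agent string), not the crawler's actual implementation; `applyUASuffix` is a made-up name:

```javascript
// Hypothetical sketch: how a --userAgentSuffix is expected to behave.
// The suffix is appended to the browser's default User-Agent string.
const defaultUA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36";

function applyUASuffix(baseUA, suffix) {
  // No suffix given: the default UA passes through unchanged.
  return suffix ? `${baseUA} ${suffix}` : baseUA;
}

console.log(applyUASuffix(defaultUA, "https://example.org/crawl-info"));
```

The bug report above is that the WARC records show only `defaultUA`, as if the suffix branch were never reached.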

ikreymer commented 2 years ago

Sorry, yeah, this was a typo where the configureUA() function exited early before setting this.browserExe = getBrowserExe();, so browserExe was left unset whenever userAgent was set, causing the error. I've moved that assignment to a separate function so it always gets set. Fixed in 0.5.0-beta.7.
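For anyone curious, the failure mode described in the fix can be sketched as follows: an early return in a configure step leaves a field unset that the browser launch needs later. All names here are hypothetical stand-ins, not the actual crawler.js code:

```javascript
// Hypothetical sketch of the bug: an early return skips a required assignment.
function getBrowserExe() {
  // Stand-in for locating the Chromium binary on disk.
  return "/usr/bin/chromium";
}

class Crawler {
  // Buggy version: returns early when a userAgent is supplied,
  // so this.browserExe is never assigned.
  configureUABuggy(userAgent) {
    if (userAgent) {
      this.userAgent = userAgent;
      return; // early return: this.browserExe stays undefined
    }
    this.userAgent = "default UA";
    this.browserExe = getBrowserExe();
  }

  // Fixed version: the browser executable is resolved unconditionally.
  configureUAFixed(userAgent) {
    this.userAgent = userAgent || "default UA";
    this.browserExe = getBrowserExe(); // always set now
  }
}

const buggy = new Crawler();
buggy.configureUABuggy("Chrome");
console.log(buggy.browserExe); // undefined -> "Could not find expected browser"

const fixed = new Crawler();
fixed.configureUAFixed("Chrome");
console.log(fixed.browserExe); // "/usr/bin/chromium"
```

This matches the symptom in the thread: crawls without --userAgent found Chrome fine, and any --userAgent value at all triggered the launch error.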