kauka-1 closed this issue 2 years ago
The userAgent option doesn't seem to work for older versions either (0.4.4, 0.4.3).
hmmm @kauka-1 @rmfkdehd I'm having trouble recreating this error. What computer are you using and what command exactly are you running?
When I run the following on an M1 Mac:
docker pull webrecorder/browsertrix-crawler
docker-compose run crawler crawl --url https://www.mixedconnections.us/missed --generateWACZ --saveState partial --screenshot --userAgent fake
The job completes normally.
@emmadickson
My system is an AMD Ryzen 1600X running Ubuntu 20.04.
I've now tried this command (I don't use docker-compose; I just run it with plain docker and prepend sudo to avoid permission errors), and I'm still getting the error:
sudo docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://www.forexfactory.com/thread/1057466-fractal-geometry --generateWACZ --saveState partial --screenshot --userAgent Chrome
Below is the log.
sudo docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://www.forexfactory.com/thread/1057466-fractal-geometry --generateWACZ --saveState partial --screenshot --userAgent Chrome
Crawl failed
Error: Unable to launch browser, error message: Could not find expected browser (chrome) locally. Run `npm install` to download the correct Chromium revision (856583).
    at Cluster.init (/app/node_modules/puppeteer-cluster/dist/Cluster.js:110:19)
    at async Function.launch (/app/node_modules/puppeteer-cluster/dist/Cluster.js:69:9)
    at async Crawler.crawl (/app/crawler.js:354:20)
    at async Crawler.run (/app/crawler.js:250:7)
@emmadickson I tried using docker-compose. Here is the command sequence:
1. unzip the git repository
2. mkdir last
3. cd last
4. sudo docker-compose build
5. sudo docker-compose run crawler crawl --url https://www.forexfactory.com/thread/1057466-fractal-geometry --generateWACZ --saveState partial --screenshot --userAgent fake
Below is the error log:
sudo docker-compose run crawler crawl --url https://www.forexfactory.com/thread/1057466-fractal-geometry --generateWACZ --saveState partial --screenshot --userAgent Chrome
Creating network "composee_default" with the default driver
Creating composee_crawler_run ... done
Crawl failed
Error: Unable to launch browser, error message: Could not find expected browser (chrome) locally. Run `npm install` to download the correct Chromium revision (856583).
at Cluster.init (/app/node_modules/puppeteer-cluster/dist/Cluster.js:110:19)
at async Function.launch (/app/node_modules/puppeteer-cluster/dist/Cluster.js:69:9)
at async Crawler.crawl (/app/crawler.js:354:20)
at async Crawler.run (/app/crawler.js:250:7)
ERROR: 1
This is macOS Catalina with an Intel processor, and I used a command like this:
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
  --profile /crawls/profiles/facebook-profile.tar.gz \
  --seedFile /crawls/tokio-facebook-seeds.txt \
  --scopeType prefix --depth 1 \
  --statsFilename "tokio-log" --crawlId 2021_tokion_olympialaiset2020 \
  --behaviours autoscroll,autoplay,autofetch,siteSpecific \
  --workers 4 --limit 1000 --text --collection 2021_tokion_olympialaiset2020-hops1
That command worked, but it stopped working once I added the userAgent option.
I'm also seeing the same error (on a Mac running Big Sur), so I looked at crawler.js to see if I could figure it out. Not sure if what I found is helpful:
Whatever I enter for userAgent, it gives the "Unable to launch browser" error, even if I replicate the default value here and make that the parameter:
https://github.com/webrecorder/browsertrix-crawler/blob/affa45a7d4fc2f06e966631ae98e2d77d9f390c1/crawler.js#L129
Also, if I add a userAgentSuffix, it doesn't error, but it doesn't add the suffix either. The User-Agent in the WARC requests still looks like this, with no suffix (I'm including the two Sec-Ch-Ua lines, not sure if they are relevant):
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36
Sec-Ch-Ua: " Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"
Sec-Ch-Ua-Mobile: ?0
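(Side note: a rough way to double-check which User-Agent actually got recorded is to scan the gzipped WARC for User-Agent header lines. This is just a minimal sketch using Node built-ins, nothing browsertrix-specific; the file path is whatever .warc.gz your crawl produced.)

```js
// check-ua.mjs – minimal sketch: print every User-Agent header line found in a gzipped WARC.
// Run with: node check-ua.mjs path/to/file.warc.gz  (ESM / Node >= 14.8 for top-level await)
import fs from "fs";
import zlib from "zlib";
import readline from "readline";

const input = fs.createReadStream(process.argv[2]).pipe(zlib.createGunzip());
const rl = readline.createInterface({ input, crlfDelay: Infinity });

for await (const line of rl) {
  if (/^user-agent:/i.test(line)) {
    console.log(line.trim());
  }
}
```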
I modified crawler.js to see if I could hardcode the line referenced above to set this.userAgent and add a suffix that way, but the WARC still didn't show the modified value. I then added logging to print out the value of this.headers in isHTML(). It showed that the hardcoded value was still what it should be at that point and included the suffix:
https://github.com/webrecorder/browsertrix-crawler/blob/e160382f4d3beaf3a67e4c5fc4aa7748527a351f/crawler.js#L711-L715
So something seems to be overriding the User-Agent between isHTML and it being written to the WARC file, but I'm not sure what.
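(For comparison, this is roughly how a custom user agent has to be applied in plain puppeteer: it must be set on the page before the first navigation, or the browser's default UA is what goes out on the wire. This is a generic puppeteer sketch only, not browsertrix-crawler's actual code, and the UA string is a placeholder.)

```js
// Generic puppeteer sketch (not crawler.js): apply the UA override before navigating.
import puppeteer from "puppeteer";

const customUA = "Mozilla/5.0 ... my-crawl-suffix"; // placeholder value

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.setUserAgent(customUA);             // must happen before page.goto()
await page.goto("https://example.com/");

console.log(await page.evaluate(() => navigator.userAgent)); // should echo customUA

await browser.close();
```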
I tested this in 0.5.0 Beta 4. Hope this helps!
Sorry, yeah, this was a typo where the configureUA() function exited early before setting this.browserExe = getBrowserExe(), so it was left unset when userAgent was set, causing the error.
Moved it into a separate function now so it always gets set.
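In other words, the failure pattern looked roughly like this. This is an illustrative sketch only, not the real crawler.js: only configureUA(), getBrowserExe() and this.browserExe come from the description above; the Crawler class, params and the default UA string are made up for the example.

```js
// bug-sketch.js – illustrative only, not the real crawler.js.
function getBrowserExe() {
  // stand-in for however the crawler locates the Chrome/Chromium binary
  return process.env.CHROME_BIN || "/usr/bin/google-chrome";
}

class Crawler {
  constructor(params) {
    this.params = params;
  }

  // Buggy shape: when --userAgent is given we return early, so this.browserExe
  // is never assigned and the launch later fails with
  // "Could not find expected browser (chrome) locally".
  configureUA() {
    if (this.params.userAgent) {
      this.userAgent = this.params.userAgent;
      return;
    }
    this.browserExe = getBrowserExe();
    this.userAgent = "default UA string";
  }

  // Fixed shape: resolve the executable path unconditionally, in its own step.
  setupBrowserExe() {
    this.browserExe = getBrowserExe();
  }
}

const crawler = new Crawler({ userAgent: "Chrome" });
crawler.configureUA();
console.log(crawler.browserExe); // undefined – the bug
crawler.setupBrowserExe();
console.log(crawler.browserExe); // now always set
```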
Fixed in 0.5.0-beta.7
Hello, it seems that the userAgent option doesn't work in my installation (latest image).
The crawl doesn't even start; the message is "Error: Unable to launch browser, error message: Could not find expected browser (chrome) locally." When the option is removed, Chrome is suddenly found OK.
I have tried both simple and more complex strings as user agents, for example:
--userAgent "https://www.kansalliskirjasto.fi/en/legal-deposit-office"
--userAgent https://www.kansalliskirjasto.fi/en/legal-deposit-office
--userAgent "something"
--userAgent something
I also tried placing the option in different positions on the command line.
Regards,