webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

using profile #84

Closed robert-1043 closed 3 years ago

robert-1043 commented 3 years ago

I'm having trouble crawling certain sites where browsertrix doesn't seem to get past the cookie consent form. So I've created an interactive profile that contains the cookies set after accepting the cookie consent form. (Works fine!)

The problem, however, is using the profile. When I run

sudo docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --profile /profiles/profile.tar.gz --url https://test.com/ --generateWACZ --collection test-with-profile

the error is "No such file or directory" (the same happens when using /home/../profiles/profile.tar.gz).

When I try to update the profile with

sudo docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/profiles --filename /profiles/newProfile.tar.gz -it webrecorder/browsertrix-crawler create-login-profile --interactive --url "https://test.com/ --profile /profiles/profile.tar.gz"

the error is "unknown flag: --filename".

Is this an issue or rather a typo?

Full error message when trying to use the profile:

tar (child): /home/testbtrix/profiles/profile.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
node:child_process:903
    throw err;
    ^

Error: Command failed: tar xvfz /home/testbtrix/profiles/profile.tar.gz
tar (child): /home/testbtrix/profiles/profile.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now

ikreymer commented 3 years ago

> The issue is however using the profile. With code sudo docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --profile /profiles/profile.tar.gz --url https://test.com/ --generateWACZ --collection test-with-profile the error is "No such file or directory" (also when using /home/../profiles/profile.tar.gz)

The profiles directory also needs to be mounted as a volume in the Docker container; otherwise /profiles does not exist inside it. If the profiles are under the crawls directory that is already mounted at /crawls in the example, then it should be:

sudo docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --profile /crawls/profiles/profile.tar.gz --url https://test.com/ --generateWACZ --collection test-with-profile

Or, if it's somewhere else, e.g. in a separate profiles directory:

sudo docker run -v $PWD/crawls:/crawls/ -v $PWD/profiles:/profiles/ -it webrecorder/browsertrix-crawler crawl --profile /profiles/profile.tar.gz --url https://test.com/ --generateWACZ --collection test-with-profile

I realize this is a bit confusing, will see if there's a way to make this clearer!
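To make the host-to-container path mapping concrete, here is a minimal sketch of a helper that translates a host path into the path the container would see for a single `-v HOST_DIR:CONTAINER_DIR` mount. The `container_path` function is hypothetical and not part of browsertrix-crawler; it only does string manipulation and does not call Docker. The example paths reuse /home/testbtrix from the error log above.

```shell
#!/bin/sh
# Hypothetical helper (not part of browsertrix-crawler): given one
# -v HOST_DIR:CONTAINER_DIR mapping, print the path under which a host
# file would be visible inside the container.
# Note: plain sh has no local variables, so these names are global.
container_path() {
  host_dir=${1%/}        # strip one trailing slash, if any
  container_dir=${2%/}
  host_file=$3
  case "$host_file" in
    "$host_dir"/*)
      # The file lives under the mounted directory: swap the prefix.
      printf '%s/%s\n' "$container_dir" "${host_file#"$host_dir"/}"
      ;;
    *)
      echo "error: $host_file is not under mounted directory $host_dir" >&2
      return 1
      ;;
  esac
}

# Separate profiles directory (-v $PWD/profiles:/profiles/):
container_path /home/testbtrix/profiles /profiles/ \
  /home/testbtrix/profiles/profile.tar.gz
# prints /profiles/profile.tar.gz

# Profiles kept under the crawls directory (-v $PWD/crawls:/crawls/):
container_path /home/testbtrix/crawls /crawls/ \
  /home/testbtrix/crawls/profiles/profile.tar.gz
# prints /crawls/profiles/profile.tar.gz
```

If the function reports that the file is not under the mounted directory, the value passed to --profile cannot exist inside the container, which matches the "No such file or directory" error above.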

> sudo docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/profiles --filename /profiles/newProfile.tar.gz -it webrecorder/browsertrix-crawler create-login-profile --interactive --url "https://test.com/ --profile /profiles/profile.tar.gz" the error is "unknown flag: --filename"
>
> Is this an issue or rather a typo?

Yes, this is indeed a typo! The --filename flag should come after create-login-profile, so that it is passed to the crawler rather than to docker run, e.g.:

sudo docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/profiles -it webrecorder/browsertrix-crawler create-login-profile --interactive --url "https://test.com/" --profile /profiles/profile.tar.gz --filename /profiles/newProfile.tar.gz

Will update in the README!
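The general shape is `docker run [OPTIONS] IMAGE [COMMAND] [ARG...]`: anything before the image name is parsed by Docker itself, and anything after it is handed to the container. As a rough sketch, a small check like the one below could catch a misplaced flag before the command is run. The `flag_before_image` function is a hypothetical helper for illustration only; it inspects an argument list and never invokes Docker.

```shell
#!/bin/sh
# Hypothetical lint (not part of Docker or browsertrix-crawler):
# succeed (exit 0) if FLAG appears in the argument list before IMAGE,
# i.e. where `docker run` would try to parse it as one of its own options.
flag_before_image() {
  image=$1
  flag=$2
  shift 2
  for arg in "$@"; do
    if [ "$arg" = "$image" ]; then
      return 1   # reached the image name first: flag is not misplaced
    fi
    if [ "$arg" = "$flag" ]; then
      return 0   # flag seen before the image name: misplaced
    fi
  done
  return 1       # neither found in a problematic order
}

# The argument order from the original report:
if flag_before_image webrecorder/browsertrix-crawler --filename \
    -p 9222:9222 -p 9223:9223 -v "$PWD/profiles:/profiles" \
    --filename /profiles/newProfile.tar.gz -it \
    webrecorder/browsertrix-crawler create-login-profile --interactive; then
  echo "misplaced: --filename comes before the image name"
fi
# prints: misplaced: --filename comes before the image name
```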

robert-1043 commented 3 years ago

Thanks for clearing this up. I used the command from the README to create an interactive profile, and its volume mapping is slightly different from the one in the command for the standard profile, so the profile folder ended up elsewhere. The interactive command maps $PWD/profiles to /output/:

docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/output/ -it webrecorder/browsertrix-crawler create-login-profile --interactive --url "https://test.com"

while the standard one maps $PWD/crawls/profiles to /output/:

docker run -v $PWD/crawls/profiles:/output/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://twitter.com/login"

ikreymer commented 3 years ago

Thanks, I think they should be consistent now, using /crawls/profiles everywhere!