Gotchas - Githubissues

I'm using browsertrix to scrape a soon-to-be offline service at my university, and I wanted to share some gotchas I encountered. (I'll update this list when I encounter new issues.)

Preserve session cookies

Chrome doesn't clear session cookies if you enable "Continue where you left off" in its settings. Select this option in Chrome's settings when you run browsertrix profile create! Many sites set login cookies that expire at the end of the session, so enabling this option keeps you logged in for your scrape.

Clear cache after logging in

Once you've finished logging in during browsertrix profile create, go into Chrome's settings again and clear all cached files (but not cookies). If you don't, there's a good chance that when you begin your scrape, crucial assets will be served from the browser cache rather than fetched over the network, and thus will not be preserved.

If your scrapes look broken, there's a good chance that this is the reason why.

⚠ Carefully Control Outlinks ⚠

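Are you scraping a webapp that encodes destructive actions with anchor tags (for example, a plain link that deletes a record or logs you out)? A crawler that follows every link will trigger those actions. If so, don't use crawl_type: all-links without first making these changes: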

In webrecorder/autobrowser, comment out these lines.
In webrecorder/behaviors, remove all calls to lib.collectOutlinksFromDoc. Replace them with calls to lib.addOutLinks and pass in arrays containing only the links you want to scrape.
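
As a rough illustration of the second change, here is a minimal sketch of building an array that contains only the links you want to scrape. The helper name selectSafeLinks and the deny patterns are hypothetical, not part of webrecorder/behaviors; the patterns assume destructive actions live under paths like /logout or /delete, so adjust them to whatever your webapp actually uses.

```javascript
// Hypothetical helper (not part of webrecorder/behaviors): returns only the
// links that look safe to crawl. The deny patterns are illustrative
// assumptions; replace them with the destructive URLs of your own webapp.
const DENY_PATTERNS = [
  /\/logout\b/i,
  /\/delete\b/i,
  /[?&]action=(delete|remove)\b/i,
];

function selectSafeLinks(hrefs, baseUrl, denyPatterns = DENY_PATTERNS) {
  const base = new URL(baseUrl);
  const safe = new Set(); // a Set drops duplicate URLs
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, base); // resolve relative hrefs against the page URL
    } catch (err) {
      continue; // skip hrefs that cannot be parsed
    }
    if (url.origin !== base.origin) continue; // stay on the site being scraped
    url.hash = ''; // fragments point at the same page
    const candidate = url.href;
    if (denyPatterns.some((re) => re.test(candidate))) continue;
    safe.add(candidate);
  }
  return [...safe];
}
```

Inside a behavior, the hrefs could be gathered with something like Array.from(document.querySelectorAll('a[href]'), (a) => a.getAttribute('href')), and the returned array is what you would hand to lib.addOutLinks. I haven't verified the exact signature of lib.addOutLinks, so check it in the behaviors source before wiring this in. A deny list is the quick option; an allow list of known-safe URL patterns is stricter if you can enumerate them.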

webrecorder / browsertrix-old

Gotchas #38
