webrecorder / browsertrix-old

Browsertrix: Containerized High-Fidelity Browser-Based Automated Crawling + Behavior System
Apache License 2.0

Gotchas #38

Open jswrenn opened 4 years ago

jswrenn commented 4 years ago

I'm using browsertrix to scrape a soon-to-be offline service at my university, and I wanted to share some gotchas I encountered. (I'll update this list when I encounter new issues.)

Preserve session cookies

Chrome doesn't clear session cookies if you enable "Continue where you left off" in its settings. Select this option in Chrome's settings when you run browsertrix profile create! Many sites have their login cookies expire at the end of the session. With this option enabled, you'll stay logged in for your scrape.
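To check whether a site's login cookie is a session cookie in the first place, you can look at its Set-Cookie header: a cookie with neither Expires nor Max-Age lasts only for the browser session. A minimal sketch using Python's standard library; the helper name is_session_cookie and the example header values are made up for illustration:

```python
from http.cookies import SimpleCookie


def is_session_cookie(set_cookie_header: str) -> bool:
    """Return True if the cookie(s) in the header carry neither Expires nor Max-Age.

    is_session_cookie is a hypothetical helper, not part of browsertrix.
    """
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    # An unparseable or empty header yields no morsels; treat that as "not a session cookie".
    return bool(cookie) and all(
        not morsel["expires"] and not morsel["max-age"]
        for morsel in cookie.values()
    )


if __name__ == "__main__":
    # Example header values (assumed, not taken from any real site).
    print(is_session_cookie("sessionid=abc123; Path=/; HttpOnly"))
    print(is_session_cookie("remember=1; Max-Age=3600; Path=/"))
```

If the login cookie turns out to be a session cookie, the "Continue where you left off" setting is what keeps it alive between creating the profile and running the crawl.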

Clear cache after logging in

Once you've finished logging in during browsertrix profile create, go into Chrome's settings again and clear out all cached files (but not cookies). If you don't do this, there's a good chance that when you begin your scrape, crucial assets will simply be loaded from cache, and thus will not be preserved.

If your scrapes look broken, there's a good chance that this is the reason why.

⚠ Carefully Control Outlinks ⚠

Are you scraping a webapp that encodes destructive actions with anchor tags? If so, don't use crawl_type: all-links without first making these changes:
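The underlying risk is that a crawler following every anchor will also issue GET requests to links such as a logout or delete URL. As a generic illustration only (this is not browsertrix's API, and not the specific changes referred to above), outlinks can be screened before they are queued. The URL patterns and the helper names is_safe_outlink and filter_outlinks are assumptions for the sake of the sketch:

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns for links that trigger destructive actions;
# adjust these to the webapp being crawled.
DESTRUCTIVE = re.compile(r"(logout|signout|delete|remove|destroy)", re.IGNORECASE)


def is_safe_outlink(url: str) -> bool:
    """Return True if neither the path nor the query string matches a destructive pattern."""
    parts = urlsplit(url)
    return DESTRUCTIVE.search(parts.path + "?" + parts.query) is None


def filter_outlinks(urls):
    """Keep only the outlinks that look safe to follow, preserving order."""
    return [u for u in urls if is_safe_outlink(u)]


if __name__ == "__main__":
    # Example URLs are made up for illustration.
    links = [
        "https://example.edu/courses/1",
        "https://example.edu/logout",
        "https://example.edu/posts/7?action=delete",
    ]
    print(filter_outlinks(links))
```

A pattern list like this is deliberately conservative: it may skip harmless pages whose URLs happen to contain one of the words, which is usually the better trade-off when the alternative is logging yourself out or deleting data mid-crawl.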

nvanderperren commented 3 years ago

Hi @jswrenn, thank you for your gotchas. Very interesting. Did you crawl social media profiles with browsertrix? I'm trying to use it for Twitter and Facebook, but it's not executing the behaviours.

Can you maybe explain why the changes for 'carefully control outlinks' are necessary?