jswrenn opened this issue 4 years ago
Hi @jswrenn, thank you for these gotchas; very interesting. Did you crawl social media profiles with browsertrix? I tried to use it for Twitter and Facebook, but it doesn't execute the behaviors.
Can you explain why the changes under "Carefully Control Outlinks" are necessary?
I'm using browsertrix to scrape a soon-to-be-offline service at my university, and I wanted to share some gotchas I encountered. (I'll update this list as I run into new issues.)
## Preserve session cookies

Chrome doesn't clear session cookies if you enable "Continue where you left off" in its settings. Select this option in Chrome's settings when you run `browsertrix profile create`! Many sites' login cookies expire at the end of the session; with this option selected, you'll stay logged in for your scrape.

## Clear cache after logging in
After you're finished logging in during `browsertrix profile create`, go into Chrome's settings again and clear all cached files (but *not* cookies). If you don't, there's a good chance that when you begin your scrape, crucial assets will simply be loaded from cache and thus will not be preserved. If your scrapes look broken, this is likely the reason why.
## ⚠ Carefully Control Outlinks ⚠
Are you scraping a webapp that encodes destructive actions as anchor tags? If so, don't use `crawl_type: all-links` without first making these changes: find the calls to `lib.collectOutlinksFromDoc` in your behavior scripts, replace them with calls to `lib.addOutLinks`, and pass in arrays containing only the links you want to scrape.
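Concretely, the replacement might look something like the sketch below. The `lib.collectOutlinksFromDoc` and `lib.addOutLinks` calls are the ones named above; the `safeOutlinks` helper and the particular destructive-path patterns are hypothetical examples I made up for illustration, not part of browsertrix.

```javascript
// Keep only links that are safe to crawl: same-origin, and not
// pointing at destructive-looking endpoints. Adjust the pattern to
// match whatever actions your webapp encodes as links.
function safeOutlinks(hrefs, origin) {
  const destructive = /\/(delete|remove|logout|destroy)\b/;
  return hrefs.filter(
    (href) => href.startsWith(origin) && !destructive.test(href)
  );
}

// In the behavior script, instead of:
//   lib.collectOutlinksFromDoc();
// do something like:
//   const links = Array.from(document.querySelectorAll('a[href]'),
//                            (a) => a.href);
//   lib.addOutLinks(safeOutlinks(links, location.origin));
```

The filtering itself is ordinary JavaScript; only the last two commented lines assume the browsertrix behavior `lib` API.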