webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Crawl button with javascript navigation #665

Open hamzamac opened 3 months ago

hamzamac commented 3 months ago

Hi, we are trying to crawl a site that uses

tw4l commented 3 months ago

Hi @hamzamac, would you be able to share the URL of the site you're trying to capture so I can take a look?

hamzamac commented 3 months ago

Hi @tw4l, thank you for responding. The site is actually a SharePoint site with MFA. We managed to crawl it by creating a profile, but the links to folders appear to be spans. [image]

When we click the button on replayweb.page, it shows the error below. [image] (The URL it is pointing to is a public CDN URL, which is accessible.) Do we need to include all the URIs for the JavaScript files in the seeds?

tw4l commented 3 months ago

Hm, you shouldn't need to include the URIs for scripts - if the script is on the page, the crawler will discover it. This looks to me like it's more likely to be a bug in our replay engine than a missing script. It's hard to tell further without being able to reproduce it ourselves - would you be able to share a copy of the WACZ by email?
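
One way to check whether a given script was actually captured, before suspecting replay, is to look at the index inside the WACZ. The following is a minimal sketch, not part of browsertrix-crawler: it assumes the WACZ is a ZIP file whose CDXJ index sits under `indexes/` (plain or gzipped) and that each index line ends in a JSON object with a `url` field. The function name `captured_urls` and the file name `crawl.wacz` are hypothetical.

```python
import gzip
import json
import zipfile


def captured_urls(wacz_file):
    """Return the set of URLs listed in the CDXJ index of a WACZ file.

    Assumes index files live under indexes/ and end in .cdx, .cdxj,
    or .cdx.gz. wacz_file may be a path or a file-like object.
    """
    urls = set()
    with zipfile.ZipFile(wacz_file) as zf:
        for name in zf.namelist():
            if not name.startswith("indexes/"):
                continue
            if not name.endswith((".cdx", ".cdxj", ".cdx.gz")):
                continue
            data = zf.read(name)
            if name.endswith(".gz"):
                data = gzip.decompress(data)
            for line in data.decode("utf-8").splitlines():
                # Assumed line layout: "<sort key> <timestamp> <JSON object>"
                brace = line.find("{")
                if brace == -1:
                    continue
                try:
                    record = json.loads(line[brace:])
                except json.JSONDecodeError:
                    continue
                url = record.get("url")
                if url:
                    urls.add(url)
    return urls
```

For example, `"https://cdn.example.com/script.js" in captured_urls("crawl.wacz")` (with a placeholder URL) would indicate whether that script appears in the index. If it does, the script was captured and the error is more likely a replay issue.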

hamzamac commented 3 months ago

Hi @tw4l, sure I will send the WACZ to the email on your profile.

hamzamac commented 2 months ago

Hi @tw4l, can you please confirm whether you have received the WACZ file? Thanks.