webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

SharePoint / Microsoft 2FA Authentication #140

Open CPJPRINCE opened 2 years ago

CPJPRINCE commented 2 years ago

Hi, I've been looking to run some crawls of my organisation's SharePoint/intranet site, but I'm having some issues getting through Microsoft 2FA authentication.

Using --interactive successfully creates a profile of the login process, but calling on this profile to run a crawl results in a timeout error. Is this something that's not yet possible?

ikreymer commented 2 years ago

It's very hard to debug, unfortunately, as I don't have a SharePoint instance to test with. Which version have you been testing with? With 0.6.0, we are now ensuring that all cookies are persistent, so that may help. You can use the screencasting to see if you're actually logged in or not.

robert-1043 commented 2 years ago

I've run a test with 0.6.0 on Windows with WSL2, with 2FA using the MS Authenticator app on a mobile device. Profile creation and the crawl run OK on a SharePoint Online site.

Replay, however, mixes up portions of pages that are personalised (each element on SharePoint is checked against user rights: no rights > no access > not shown on the page or in the document library). This is most visible in filtered content overviews on pages that should show different content but show the same content (highlighted content web parts with KQL filtering). Replay also isn't showing my name or profile picture.

I have archived portions of an older SharePoint (server version) from around August 2020. That didn't work with Heritrix/Linux, so I used the Archiveweb/Conifer plugin instead. Those pages do contain my name as the 'logged in user' from the time of archiving.

ikreymer commented 2 years ago

@robert-1043 thanks for running the test! The profile contains cookies as well as local/sessionStorage, but only cookies are actually archived (and I believe only HttpOnly cookies).

My guess is that the pages try to load data from localStorage, which is not being preserved. It is possible to preserve it, but of course we need to assess the risk when it comes to archiving private data, and perhaps make it opt-in (if that is indeed the issue).

If you're able to determine whether that is the issue, we could find a way to make it work. Or, if there is a small WARC/WACZ that you're able to share, we could take a look.

robert-1043 commented 2 years ago

I'll transfer the WACZ files (3x browsertrix, 1x archiveweb).

On replay, the archiveweb WACZ 'downloads' a 'resources' file; the browsertrix WACZ doesn't.

The replay problem mentioned above seems to be related to the rather new SharePoint feature 'collapsible sections'. What is collapsed on page load seems to be loaded only when expanded. On replay, the collapsed content is absent (e.g. images) or mixed up (1 element expanded on page load + 3 collapsed > all show the 1 expanded element; when all are expanded on another archived page > all OK).

CPJPRINCE commented 1 year ago

To report back on this: the update fixed my original issue and I was able to use the --interactive functionality just fine. I had to tinker with the settings to prevent hitting the throttle, and with a couple of other things, but had no problems otherwise.

On playback, the content of the pages seems good, but there was one issue. Some pages load for 5-10 seconds, then attempt a redirect to 'login.microsoftonline', which returns a URL not found error. I think this was coming from a 'TokenFactoryIFrame'. Possibly it could have been prevented by setting a blockRule, but I wasn't able to test that.
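
For anyone wanting to try the block rule idea, a candidate URL pattern can be sanity-checked offline before it goes into a crawl config. The sketch below is only an illustration: it assumes the rule's URL pattern is treated as a regular expression, uses Python's `re` as a stand-in for the crawler's own matching, and the pattern, the `would_block` helper and the sample URLs are all hypothetical.

```python
import re

# Hypothetical candidate pattern for a block rule (an assumption, not taken
# from a tested config). Adjust it to the URLs actually seen in your crawl.
BLOCK_PATTERN = re.compile(r"login\.microsoftonline")


def would_block(url: str) -> bool:
    """Return True if the candidate pattern matches anywhere in the URL."""
    return BLOCK_PATTERN.search(url) is not None


# Made-up sample URLs, for illustration only.
SAMPLE_URLS = [
    "https://login.microsoftonline.com/common/oauth2/authorize",
    "https://example.sharepoint.com/sites/intranet/SitePages/Home.aspx",
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(f"{url} -> blocked: {would_block(url)}")
```

Keeping the pattern narrow matters here: a pattern that also matched the SharePoint pages themselves would stop the content from being captured at all.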

One more thing: I wasn't able to crawl through a SharePoint list, as I had hoped. This is probably down to the way SharePoint creates hyperlinks from its functions, and I'm guessing pywb only looks at 'href' links.
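
To illustrate the guess above, here is a minimal sketch of why a link extractor that only reads 'href' attributes would miss script-driven list items. It is not the crawler's actual link extraction; the `HrefCollector` class, the `crawlable_links` helper and the sample markup are hypothetical and only loosely modelled on what a list page might contain.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag, and nothing else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up markup: one ordinary link and two items opened by script.
SAMPLE_HTML = """
<a href="/sites/intranet/SitePages/Home.aspx">Home</a>
<a href="javascript:void(0)" onclick="openItem(42)">List item 42</a>
<div role="link" data-item-id="43">List item 43</div>
"""


def crawlable_links(html: str) -> list:
    """Return href values a crawler could queue, skipping javascript: URLs."""
    parser = HrefCollector()
    parser.feed(html)
    return [h for h in parser.links if not h.lower().startswith("javascript:")]


if __name__ == "__main__":
    print(crawlable_links(SAMPLE_HTML))
```

Only the first link is returned. The two list items have no navigable 'href', so an extractor like this never queues them; reaching them would need something that clicks the elements or reads the list data another way.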