webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://webrecorder.net/browsertrix

how to help the crawler find links on specific sites #693

Open tuehlarsen opened 1 year ago

tuehlarsen commented 1 year ago

It would be useful to be able to figure out which elements are considered links or clickable, to make it easier to debug when a resource is not indexed.

Perhaps there could be some kind of overlay view, where you could load in some metadata that visualizes which elements the indexer considers clickable. That would make it easier to see what's going on.
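For illustration, such a debug overlay could be as simple as the sketch below, run in the page itself. The selector list is only a guess at what a crawler might treat as clickable; Browsertrix's actual heuristics may differ.

```js
// Minimal debug-overlay sketch (plain page JavaScript, not a Browsertrix API).
// The selector list is a guess at "clickable", not the crawler's real heuristic.
const CLICKABLE = 'a[href], button, [onclick], [role="button"], input[type="submit"]';

for (const el of document.querySelectorAll(CLICKABLE)) {
  const rect = el.getBoundingClientRect();
  // Draw a red outline box over each candidate element.
  const box = document.createElement("div");
  box.style.cssText =
    `position: absolute; left: ${rect.left + window.scrollX}px; ` +
    `top: ${rect.top + window.scrollY}px; width: ${rect.width}px; ` +
    `height: ${rect.height}px; outline: 2px solid red; ` +
    `pointer-events: none; z-index: 999999;`;
  document.body.appendChild(box);
}
```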

It would also be good to be able to guide the indexer with a list of CSS selectors, to help it find elements to call click on (a sketch of what that could look like follows below). E.g. see the Google search top menu "More" in crawl ID manual-20230307134720-b23cfddf-bfa, or the SoundCloud podcast in crawl ID manual-20230310092711-7c02d217-c4b - none of these links are found by the crawler.
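A minimal sketch of the requested feature: take a user-supplied selector list (say, from crawl config) and click every match. This is not an existing Browsertrix option, and the example selectors are hypothetical.

```js
// Sketch: click every element matching a user-supplied selector list.
// "selectors" would come from crawl configuration; this option does not
// exist in Browsertrix today, it just illustrates the requested feature.
async function clickSelectors(selectors, delayMs = 500) {
  for (const sel of selectors) {
    for (const el of document.querySelectorAll(sel)) {
      el.scrollIntoView({ block: "center" });
      el.click();
      // Give any JS click handlers or navigation time to fire.
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// E.g. for the Google "More" menu mentioned above (selectors are guesses):
clickSelectors(['[aria-label="More"]', "nav a"]);
```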

It's not clear whether the indexer can handle JS-triggered downloads, or how it handles downloads in general. Sometimes a site triggers a download of a resource like this: the client app calls out to an API, stores the response from the API in memory, and then uses a browser API to save the content to disk. It would be great if the indexer could capture those files too. See e.g. the statstidende.dk/publications PDF download in crawl ID manual-20230304090628-8b0d5f9f-97b.

An example of a library that might be used for this is https://github.com/eligrey/FileSaver.js/
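For context, the in-page pattern being described usually looks roughly like the sketch below: the file content only ever exists as an in-memory Blob, so the crawler never sees a plain navigation request for it. The API URL and filename here are placeholders.

```js
import { saveAs } from "file-saver"; // https://github.com/eligrey/FileSaver.js/

// Typical JS-triggered download: fetch the document from an API, hold the
// response in memory as a Blob, then hand it to the browser's save-to-disk
// machinery via FileSaver. URL and filename are placeholders.
async function downloadPdf(apiUrl, filename) {
  const resp = await fetch(apiUrl);
  const blob = await resp.blob(); // response body kept in memory
  saveAs(blob, filename);         // triggers the browser "save file" flow
}

downloadPdf("/api/publications/1234/pdf", "publication-1234.pdf");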

tuehlarsen commented 1 year ago

Johan Flensmark has made a behavior that clicks download on all PDFs on statstidende.dk/publications (one of the examples above); see https://github.com/bitknox/browsertrix-behaviors/blob/feature/statstidende.dk/src/site/statstidende.js. Note: it works best if the browser is set not to ask for the download location every time.
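For readers unfamiliar with browsertrix-behaviors, a site behavior is roughly shaped like the sketch below. The class shape (name getter, isMatch(), async-generator run()) is recalled from that repo and may not match its current API exactly, and the download-button selector is a placeholder guess, not the one used in the linked statstidende.js.

```js
// Rough sketch of a site behavior in the style of browsertrix-behaviors.
// The isMatch()/run() shape is assumed from memory of that repo and may
// differ from its current API; the button selector is a placeholder.
export class StatstidendePdfBehavior {
  static get name() {
    return "Statstidende";
  }

  static isMatch() {
    return window.location.href.startsWith("https://statstidende.dk/publications");
  }

  async* run() {
    // Click every PDF download button on the listing page, pausing between
    // clicks so each download can start before the next one is triggered.
    for (const btn of document.querySelectorAll("button.download-pdf")) {
      btn.click();
      yield "clicked a PDF download button";
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
  }
}
```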