webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://webrecorder.net/browsertrix

how to help the crawler find links on specific sites #693

Open tuehlarsen opened 1 year ago

tuehlarsen commented 1 year ago

It would be useful to be able to figure out which elements are considered links or clickable, to make it easier to debug when a resource is not indexed.

Perhaps there could be some kind of overlay view, where you could load in some metadata that visualizes which elements the indexer considers clickable. That would make it easier to see what's going on.
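For illustration, such a debug overlay could be as simple as the sketch below, run in the page itself. The selector list is only a guess at what a crawler might treat as clickable; Browsertrix's actual heuristics may differ.

```js
// Minimal debug-overlay sketch (plain page JavaScript, not a Browsertrix API).
// The selector list is a guess at "clickable", not the crawler's real heuristic.
const CLICKABLE = 'a[href], button, [onclick], [role="button"], input[type="submit"]';

for (const el of document.querySelectorAll(CLICKABLE)) {
  const rect = el.getBoundingClientRect();
  // Draw a red outline box over each candidate element.
  const box = document.createElement("div");
  box.style.cssText =
    `position: absolute; left: ${rect.left + window.scrollX}px; ` +
    `top: ${rect.top + window.scrollY}px; width: ${rect.width}px; ` +
    `height: ${rect.height}px; outline: 2px solid red; ` +
    `pointer-events: none; z-index: 999999;`;
  document.body.appendChild(box);
}
```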

It would also be good to be able to guide the indexer with a list of CSS selectors, to help it find elements to call click on (a sketch of what that could look like follows below). E.g. see the Google search top menu "More" in crawl ID manual-20230307134720-b23cfddf-bfa, or the SoundCloud podcast in crawl ID manual-20230310092711-7c02d217-c4b - none of these links are found by the crawler.
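A minimal sketch of the requested feature: take a user-supplied selector list (say, from crawl config) and click every match. This is not an existing Browsertrix option, and the example selectors are hypothetical.

```js
// Sketch: click every element matching a user-supplied selector list.
// "selectors" would come from crawl configuration; this option does not
// exist in Browsertrix today, it just illustrates the requested feature.
async function clickSelectors(selectors, delayMs = 500) {
  for (const sel of selectors) {
    for (const el of document.querySelectorAll(sel)) {
      el.scrollIntoView({ block: "center" });
      el.click();
      // Give any JS click handlers or navigation time to fire.
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// E.g. for the Google "More" menu mentioned above (selectors are guesses):
clickSelectors(['[aria-label="More"]', "nav a"]);
```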

It's not clear whether the indexer can handle JS-triggered downloads, or how it handles downloads in general. Sometimes a site triggers a download of a resource like this: the client app calls out to an API, stores the response from the API in memory, and then uses a browser API to save the content to disk. It would be great if the indexer could capture those files too. See e.g. the statstidende.dk/publications PDF download in crawl ID manual-20230304090628-8b0d5f9f-97b.

An example of a library that might be used for this is https://github.com/eligrey/FileSaver.js/
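For context, the in-page pattern being described usually looks roughly like the sketch below: the file content only ever exists as an in-memory Blob, so the crawler never sees a plain navigation request for it. The API URL and filename here are placeholders.

```js
import { saveAs } from "file-saver"; // https://github.com/eligrey/FileSaver.js/

// Typical JS-triggered download: fetch the document from an API, hold the
// response in memory as a Blob, then hand it to the browser's save-to-disk
// machinery via FileSaver. URL and filename are placeholders.
async function downloadPdf(apiUrl, filename) {
  const resp = await fetch(apiUrl);
  const blob = await resp.blob(); // response body kept in memory
  saveAs(blob, filename);         // triggers the browser "save file" flow
}

downloadPdf("/api/publications/1234/pdf", "publication-1234.pdf");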

tuehlarsen commented 1 year ago

Johan Flensmark has made a behavior that clicks download on all PDFs on statstidende.dk/publications (one of the examples above); see https://github.com/bitknox/browsertrix-behaviors/blob/feature/statstidende.dk/src/site/statstidende.js. Note: it works best if the browser is set not to ask for the download location every time.
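For readers unfamiliar with browsertrix-behaviors, a site behavior is roughly shaped like the sketch below. The class shape (name getter, isMatch(), async-generator run()) is recalled from that repo and may not match its current API exactly, and the download-button selector is a placeholder guess, not the one used in the linked statstidende.js.

```js
// Rough sketch of a site behavior in the style of browsertrix-behaviors.
// The isMatch()/run() shape is assumed from memory of that repo and may
// differ from its current API; the button selector is a placeholder.
export class StatstidendePdfBehavior {
  static get name() {
    return "Statstidende";
  }

  static isMatch() {
    return window.location.href.startsWith("https://statstidende.dk/publications");
  }

  async* run() {
    // Click every PDF download button on the listing page, pausing between
    // clicks so each download can start before the next one is triggered.
    for (const btn of document.querySelectorAll("button.download-pdf")) {
      btn.click();
      yield "clicked a PDF download button";
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
  }
}
```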