Open tuehlarsen opened 1 year ago
Johan Flensmark has written a behavior that clicks download on every PDF on statstidende.dk/publications (one of the examples above). See https://github.com/bitknox/browsertrix-behaviors/blob/feature/statstidende.dk/src/site/statstidende.js Note: it works best if the browser is set not to ask for a download location every time.
It would be useful to be able to see which elements the indexer considers links or clickable elements, to make it easier to debug cases where a resource is not indexed.
Perhaps some kind of overlay view, where you could load some metadata that visualizes which elements the indexer considers clickable, would make it easier to see what's going on.
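To make the overlay idea concrete, here is a minimal sketch of what such a debug helper could look like. The selector list and the function names (`findClickable`, `outlineClickable`) are my own assumptions for illustration; I don't know which heuristics the indexer actually uses.

```javascript
// Hypothetical default selectors; the indexer's real heuristics may differ.
const DEFAULT_CLICKABLE_SELECTORS = [
  "a[href]",
  "button",
  "[role='button']",
  "[onclick]",
];

// Return every element under `root` that matches any of the selectors.
function findClickable(root, selectors = DEFAULT_CLICKABLE_SELECTORS) {
  // querySelectorAll accepts a comma-separated selector list.
  return Array.from(root.querySelectorAll(selectors.join(",")));
}

// Draw an outline around each matching element and return how many were marked.
function outlineClickable(root, selectors = DEFAULT_CLICKABLE_SELECTORS) {
  const elements = findClickable(root, selectors);
  for (const el of elements) {
    el.style.outline = "2px solid magenta";
  }
  return elements.length;
}
```

In a browser console this could be run as `outlineClickable(document)` on a replayed page to compare the outlined elements against what was actually captured.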
It would also be good to be able to guide the indexer with a list of CSS selectors to help it find elements to click. For example, see the Google Search top menu "More" in crawl ID manual-20230307134720-b23cfddf-bfa, or the SoundCloud podcast in crawl ID manual-20230310092711-7c02d217-c4b. None of these links are found by the crawler.
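As a rough sketch of what I mean by a selector list: a per-site rule table plus a helper that clicks everything the rules match. The rule format, the example URL prefix, the selectors, and the names `selectorsFor` and `clickAll` are all hypothetical placeholders, not an existing crawler option.

```javascript
// Hypothetical rule table: URL prefix -> extra selectors the crawler should click.
// The prefix and selectors below are placeholders, not real site markup.
const EXTRA_CLICK_RULES = [
  { urlPrefix: "https://example.org/search", selectors: [".more-menu", ".tab-link"] },
];

// Collect the selectors of every rule whose prefix matches the page URL.
function selectorsFor(url, rules) {
  return rules
    .filter((rule) => url.startsWith(rule.urlPrefix))
    .flatMap((rule) => rule.selectors);
}

// Click every element matching each selector; return the number of clicks.
function clickAll(root, selectors) {
  let clicked = 0;
  for (const selector of selectors) {
    for (const el of root.querySelectorAll(selector)) {
      el.click();
      clicked += 1;
    }
  }
  return clicked;
}
```

Inside a behavior this would be something like `clickAll(document, selectorsFor(location.href, EXTRA_CLICK_RULES))`, with the rule table supplied by the user per crawl.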
It's not clear whether the indexer can handle JS-triggered downloads, or how it handles downloads in general. Sometimes a site triggers a download as follows: the client app calls an API, stores the response in memory, and then uses a browser API to save the content to disk. It would be great if the indexer could capture those files too. See e.g. the statstidende.dk/publications PDF download in crawl ID manual-20230304090628-8b0d5f9f-97b, and an example of a library that might be used: https://github.com/eligrey/FileSaver.js/
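For illustration, in-memory downloads of this kind usually go through `URL.createObjectURL(blob)` before the save is triggered. One way a behavior could notice them is to wrap that call. This is only a sketch under that assumption: `recordBlobUrls` and its callback are hypothetical names, and I don't know whether the indexer has a hook like this or whether the site in question uses this exact path.

```javascript
// Wrap `createObjectURL` on the given URL-like object so that every blob
// turned into an object URL is reported to `onBlob(url, blob)`.
// Returns a function that restores the original method.
function recordBlobUrls(urlApi, onBlob) {
  const original = urlApi.createObjectURL;
  urlApi.createObjectURL = function (blob) {
    const url = original.call(urlApi, blob);
    onBlob(url, blob); // e.g. queue the blob for capture
    return url;
  };
  return function restore() {
    urlApi.createObjectURL = original;
  };
}
```

In a page context this might be installed as `recordBlobUrls(URL, (url, blob) => console.log(url, blob.type, blob.size))` before the download button is clicked. Downloads that never create an object URL would not be seen by this approach.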