webrecorder / browsertrix-behaviors

Automated behaviors that run in browser to interact with complex sites automatically. Used by ArchiveWeb.page and Browsertrix Crawler.
GNU Affero General Public License v3.0

Crawler never fetches `<img>` source image in `src` when `srcset` is activated #79

Closed benoit74 closed 3 weeks ago

benoit74 commented 2 months ago

By default (when the autofetch behavior is enabled, if I'm not mistaken), the crawler automatically fetches the images listed in the `srcset` of `<img>` tags, so that all resolutions are available in the WARC.

However, this does not seem to account for the case where the browser itself selects a `srcset` candidate: the `src` of the image is then never fetched, and the image breaks under some conditions (for example, when the WARC is replayed at a lower device pixel ratio).

Sample website: https://enciclopedia.banrepcultural.org/index.php?title=Delcy_Morelos_Sandoval

Sample WARC: crawl-enciclopedia-banrep-onepage-20240930.warc.gz (in this WARC, images are displayed only at a DPR of 1.5 or above; with DPR 1, all images are broken)

HTML source code causing the issue:

<img src="images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg" decoding="async" width="300" height="341" srcset="images/b/b8/Avatar-mujer.jpg 1.5x">

Since I crawled with --mobileDevice "Pixel 2", images/b/b8/Avatar-mujer.jpg has automatically been fetched by the browser, but the autoFetch behavior seems to never have fetched images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg.
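To illustrate why the browser alone only requests one of the two files, here is a rough TypeScript sketch of density-based candidate selection. It is a simplified model, not what any browser actually runs: the HTML spec leaves the final choice implementation-defined. The `Candidate` type and `pickCandidate` function are hypothetical names, and the DPR of 2 in the usage note is just an example of a high-density device.

```typescript
// Simplified model of density-based image candidate selection.
// Assumption: real browsers choose in an implementation-defined way;
// this only shows the general idea.
interface Candidate {
  url: string;
  density: number;
}

// Candidates for the <img> above: the single srcset entry (1.5x) plus
// the implicit 1x candidate contributed by the src attribute.
const candidates: Candidate[] = [
  { url: "images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg", density: 1 },
  { url: "images/b/b8/Avatar-mujer.jpg", density: 1.5 },
];

// Pick the lowest-density candidate that still covers the device pixel
// ratio, falling back to the highest density available.
function pickCandidate(list: Candidate[], dpr: number): Candidate {
  const sorted = list.slice().sort((a, b) => a.density - b.density);
  let best: Candidate | undefined;
  for (const c of sorted) {
    best = c;
    if (c.density >= dpr) {
      break;
    }
  }
  if (!best) {
    throw new Error("no candidates");
  }
  return best;
}
```

Under this model, `pickCandidate(candidates, 2)` returns the full-size `images/b/b8/Avatar-mujer.jpg`, while `pickCandidate(candidates, 1)` returns the `300px` thumbnail from `src`. A crawl at high DPR therefore never requests the thumbnail unless a behavior fetches it explicitly.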

Full crawl command:

docker run -v $PWD/output:/output --name crawlme --rm webrecorder/browsertrix-crawler:1.3.0 crawl --url "https://enciclopedia.banrepcultural.org/index.php?title=Delcy_Morelos_Sandoval" --cwd /output --combineWARC --depth 0 --mobileDevice "Pixel 2"

Note: I'm not sure this website's HTML is 100% valid per the spec. In general I see that the img src is repeated in srcset as well, but I didn't find anything in the spec about this (is this just good practice, to avoid situations like this one, or a requirement?).

benoit74 commented 1 month ago

@ikreymer sorry, this is not a wombat issue at all; I don't know what I was thinking when I opened it here. Can you move this to webrecorder/browsertrix-crawler and fix the title, which is wrong?

Or should I reopen this?

ikreymer commented 3 weeks ago

This is actually a behaviors issue, transferred again! Will add a fix to also grab the src URL if srcset exists.

ikreymer commented 3 weeks ago

This is now fixed, and 300px-Avatar-mujer.jpg is being fetched with the 0.6.5 release of browsertrix-behaviors. According to MDN, the src image is treated as another candidate when used together with srcset: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img#src
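For readers who want to see the idea in code, here is a minimal TypeScript sketch of collecting every URL to fetch for an `<img>`, with `src` included alongside the `srcset` candidates. This is not the actual browsertrix-behaviors code: `collectImageUrls` is a hypothetical helper, and the comma split is deliberately naive (it assumes candidate URLs contain no commas).

```typescript
// Hypothetical helper, not the actual browsertrix-behaviors implementation.
// Returns the absolute URLs of src and of every srcset candidate.
function collectImageUrls(
  src: string | null,
  srcset: string | null,
  baseUrl: string,
): string[] {
  const urls = new Set<string>();

  // Treat src as one more candidate, even when srcset is present.
  if (src) {
    urls.add(new URL(src, baseUrl).href);
  }

  if (srcset) {
    // Naive parse: assumes URLs contain no commas. Each entry looks like
    // "<url> <descriptor>", e.g. "images/b/b8/Avatar-mujer.jpg 1.5x".
    for (const entry of srcset.split(",")) {
      const url = entry.trim().split(/\s+/)[0];
      if (url) {
        urls.add(new URL(url, baseUrl).href);
      }
    }
  }

  return Array.from(urls);
}

// The <img> from this issue, resolved against the sample page URL.
const pageUrl =
  "https://enciclopedia.banrepcultural.org/index.php?title=Delcy_Morelos_Sandoval";
const toFetch = collectImageUrls(
  "images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg",
  "images/b/b8/Avatar-mujer.jpg 1.5x",
  pageUrl,
);
```

Here `toFetch` contains both the thumbnail from `src` and the full-size image from `srcset`, so the capture no longer depends on which one the browser happened to select during the crawl.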

benoit74 commented 3 weeks ago

Thanks!