Closed benoit74 closed 3 weeks ago
@ikreymer sorry, this is not at all a wombat issue, I don't know what happened in my mind when opening this issue. Can you move this to webrecorder/browsertrix-crawler and fix the title which is wrong?
Or should I reopen this?
This is actually a behaviors issue, transferred again! Will add a fix to also grab the src URL if srcset exists.
This is now fixed, and 300px-Avatar-mujer.jpg is being fetched with the 0.6.5 release of browsertrix-behaviors. According to MDN, the src image is treated as another candidate when used together with srcset: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img#src
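For illustration, the idea of grabbing the src URL alongside the srcset candidates could be sketched roughly as below. This is not the browsertrix-behaviors code: `parseSrcsetUrls` and `collectImageUrls` are hypothetical helper names, the file names are made up, and the srcset parsing is a simplification of the HTML parsing algorithm (it ignores parentheses inside descriptors).

```typescript
// Hypothetical helpers for illustration only; not the browsertrix-behaviors implementation.

// Extract the URL of each candidate in a srcset string.
// Simplified: a URL is a run of non-whitespace characters, and any
// descriptors after it are skipped up to the next comma.
function parseSrcsetUrls(srcset: string): string[] {
  const urls: string[] = [];
  let pos = 0;
  while (pos < srcset.length) {
    // Skip whitespace and commas separating candidates.
    while (pos < srcset.length && /[\s,]/.test(srcset.charAt(pos))) pos++;
    // Read the URL token.
    const start = pos;
    while (pos < srcset.length && !/\s/.test(srcset.charAt(pos))) pos++;
    let url = srcset.slice(start, pos);
    if (/,+$/.test(url)) {
      // A trailing comma ends the candidate: there are no descriptors.
      url = url.replace(/,+$/, "");
    } else {
      // Skip descriptors (e.g. "2x" or "480w") up to the next comma.
      while (pos < srcset.length && srcset.charAt(pos) !== ",") pos++;
    }
    if (url.length > 0) urls.push(url);
  }
  return urls;
}

// Collect every URL worth fetching for one image: all srcset
// candidates plus the src URL, without duplicates.
function collectImageUrls(src: string | null, srcset: string | null): string[] {
  const urls: string[] = [];
  const add = (url: string): void => {
    if (url.length > 0 && urls.indexOf(url) === -1) urls.push(url);
  };
  if (srcset) {
    for (const url of parseSrcsetUrls(srcset)) add(url);
  }
  if (src) add(src); // src is one more candidate, so fetch it too
  return urls;
}

// Usage with made-up file names:
console.log(collectImageUrls("small.jpg", "medium.jpg 1.5x, large.jpg 2x"));
```

In a browser, the two arguments would typically come from `img.getAttribute("src")` and `img.getAttribute("srcset")`, with each result resolved against the page URL before fetching.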
Thanks!
By default (when the autofetch behavior is activated, if I'm not mistaken), the crawler automatically fetches images from the srcset of <img> tags so that all resolutions are available in the WARC.
However, this does not seem to take into account the situation where the srcset condition is activated, so it is the src of the image which is never fetched (and the image breaks under some conditions).
Sample website: https://enciclopedia.banrepcultural.org/index.php?title=Delcy_Morelos_Sandoval
Sample WARC: crawl-enciclopedia-banrep-onepage-20240930.warc.gz (this WARC has images displayed only at DPR 1.5 or above; with DPR 1 all images are broken)
HTML source code causing the issue:
Since I crawled with --mobileDevice "Pixel 2", images/b/b8/Avatar-mujer.jpg has automatically been fetched by the browser, but the autoFetch behavior seems to never have fetched images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg.
Full crawl command:
Note: I'm not sure this website's HTML is 100% valid per the spec. In general I see that the img src is repeated in srcset as well, but I didn't find any spec text about this (is it just a good practice, to avoid situations like this one, or a spec requirement?).
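To make the DPR dependence concrete, here is a minimal sketch of how a browser might pick between src and density-based srcset candidates. It assumes a simplified model: src is added as a 1x candidate when srcset has no 1x entry, and the lowest density that satisfies the device pixel ratio wins. Real browsers are free to choose differently, and `pickCandidate`, the `Candidate` type, and the file names are hypothetical.

```typescript
// Simplified, hypothetical model of density-based image selection.
interface Candidate {
  url: string;
  density: number; // pixel density descriptor, e.g. 1.5 for "1.5x"
}

function pickCandidate(src: string, candidates: Candidate[], dpr: number): string {
  const all = candidates.slice();
  // src acts as the 1x candidate when srcset does not provide one.
  if (!all.some((c) => c.density === 1)) {
    all.push({ url: src, density: 1 });
  }
  all.sort((a, b) => a.density - b.density);
  // Pick the lowest density that is enough for this screen;
  // fall back to the highest density available.
  let fallback = src;
  for (const c of all) {
    if (c.density >= dpr) return c.url;
    fallback = c.url;
  }
  return fallback;
}

// Made-up file names: a thumbnail in src, larger images in srcset.
const srcset: Candidate[] = [
  { url: "medium.jpg", density: 1.5 },
  { url: "large.jpg", density: 2 },
];
console.log(pickCandidate("small.jpg", srcset, 1)); // replay on a DPR 1 screen
console.log(pickCandidate("small.jpg", srcset, 2.625)); // crawl on a high-DPR device
```

Under this model, a crawl on a high-DPR device only ever requests a srcset candidate, while replay at DPR 1 asks for the src URL, which is why src has to be captured even when it is not repeated in srcset.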