Difference in replayed warc with pywb and browsertrix

adityaraj-28 commented 1 year ago

I am aware that browsertrix uses pywb in the background. I tried the website https://www.kugou.com/ and noticed some notable elements missing with browsertrix. I started pywb using docker run -e INIT_COLLECTION=myarc -p 8080:8080 -v ./pywb-data/:/webarchive webrecorder/pywb wayback --record --live -a and opened http://localhost:8080/myarc/record/https://www.kugou.com/ in Google Chrome. I waited for the page to load, scrolled, and waited for the images to load after scrolling. Here's a screenshot of the replayed WARC with ReplayWeb.page:

With browsertrix, I started the crawl using docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url "https://www.kugou.com/" --generateWARC --combineWARC true --collection kuguo1 --scopeType page-spa --waitUntil networkidle0 --timeout 100 --behaviorTimeout 100 --netIdleWait 30

When replaying the combined WARC with pywb, I got the following replay:

Is there a configuration error in how I am running browsertrix? Can anyone please help me figure out what the issue is?

lakshya16240 commented 1 year ago

I am able to reproduce this issue. I even tried using --waitUntil load,networkidle0, as was suggested in one of the previous issues. Another similar website on which the issue can be replicated is https://www.cctv.com/index.shtml. Screenshots are attached.

Actual Website

Using Pywb recorder

Using browsertrix and replaying using ReplayWeb.page

edsu commented 1 year ago

I see the same behavior as well. Did you happen to notice a message like this in the log output?

{"logLevel":"info","timestamp":"2023-06-20T15:03:50.679Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://www.kugou.com/","workerid":0}}

I did notice it when I ran your docker command, and I think this might be similar to #321 perhaps? It appears autoscrolling is not working on the page?
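If it helps anyone checking their own crawl, here is a minimal sketch for scanning the crawler's JSON-lines log output for that message. It assumes each log entry is one JSON object per line with the same fields as the entry quoted above (context, details.msg, details.page); the function name and the marker constant are made up for illustration and are not part of browsertrix-crawler.

```python
import json
import sys

# Assumption: the substring below matches the behavior message quoted above.
SKIP_MARKER = "Skipping autoscroll"


def find_skipped_autoscroll(log_lines):
    """Return the page URLs whose behavior log reports a skipped autoscroll."""
    pages = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore any non-JSON output mixed into the log
        if not isinstance(entry, dict):
            continue
        if entry.get("context") != "behaviorScript":
            continue
        details = entry.get("details")
        if not isinstance(details, dict):
            continue
        if SKIP_MARKER in str(details.get("msg", "")):
            pages.append(details.get("page"))
    return pages


if __name__ == "__main__":
    # Reads log lines from stdin and prints each affected page URL.
    for page in find_skipped_autoscroll(sys.stdin):
        print(page)
```

Saved under a hypothetical name such as check_autoscroll.py, it could be fed the crawler output on stdin, for example by piping the docker run command into python check_autoscroll.py.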

adityaraj-28 commented 1 year ago

Yes, I did get this. Since autoscrolling is not happening and the images load lazily after scrolling, the crawler isn't able to record those assets. Interesting.

lakshya16240 commented 1 year ago

Absolutely. I see this log as well. It seems weird though, since some of the images appearing towards the end of the page have been loaded.
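To see exactly which images made it into each capture, one option is to list the image responses in both WARCs and diff them. This is a rough sketch assuming the third-party warcio package is installed (pip install warcio); the two file names are placeholders for the pywb recording and the combined browsertrix WARC, and the helper names are made up for illustration.

```python
def image_urls(warc_path):
    """Collect target URIs of HTTP responses whose Content-Type is image/*."""
    # Imported here so the rest of the sketch works without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    urls = set()
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if not content_type.lower().startswith("image/"):
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            if uri:
                urls.add(uri)
    return urls


def missing_from(reference, other):
    """Return the URLs present in `reference` but absent from `other`, sorted."""
    return sorted(set(reference) - set(other))


if __name__ == "__main__":
    # Placeholder paths: point these at the actual WARC files from each tool.
    pywb_images = image_urls("pywb.warc.gz")
    crawler_images = image_urls("browsertrix.warc.gz")
    for url in missing_from(pywb_images, crawler_images):
        print(url)
```

The printed URLs are the images that the manual pywb recording captured but the crawl did not, which should line up with the lazily loaded ones if the skipped autoscroll is the cause.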

lakshya16240 commented 1 year ago

As rightly pointed out by @adityaraj-28, most of the lazy-loaded images are not being crawled properly. However, there are a few instances in which lazy-loaded images were captured, as can be seen in the image below.

lakshya16240 commented 1 year ago

I tried the steps mentioned in behaviours for the above website. The page autoscrolls fine using these steps, but somehow fails to autoscroll during a browsertrix crawl.

webrecorder / browsertrix-crawler

Difference in replayed warc with pywb and browsertrix #333