webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Difference in replayed warc with pywb and browsertrix #333

Open adityaraj-28 opened 1 year ago

adityaraj-28 commented 1 year ago

I am aware that browsertrix uses pywb in the background. I tried the website https://www.kugou.com/ and noticed that some notable elements were missing with browsertrix.

I started pywb using:

docker run -e INIT_COLLECTION=myarc -p 8080:8080 -v ./pywb-data/:/webarchive webrecorder/pywb wayback --record --live -a

I then opened http://localhost:8080/myarc/record/https://www.kugou.com/ in Google Chrome, waited for the page to load, scrolled, and waited for the images to load after scrolling. Here's a screenshot of the resulting WARC replayed with replayweb:

Screenshot 2023-06-20 at 7 39 37 PM

With browsertrix, I started the crawl using:

docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url "https://www.kugou.com/" --generateWARC --combineWARC true --collection kuguo1 --scopeType page-spa --waitUntil networkidle0 --timeout 100 --behaviorTimeout 100 --netIdleWait 30

When replaying the combined WARC with pywb, I got the following replay:

Screenshot 2023-06-20 at 7 58 01 PM

Is there a configuration error in how I am running browsertrix? Can anyone please help me figure out what the issue is?

lakshya16240 commented 1 year ago

I am able to reproduce this issue. I even tried using --waitUntil load,networkidle0, as was suggested in one of the previous issues. Another similar website on which the issue can be replicated is https://www.cctv.com/index.shtml. Please find the screenshots attached.

Actual Website

Screenshot 2023-06-20 at 8 14 54 PM

Using Pywb recorder

Screenshot 2023-06-20 at 8 16 12 PM

Using browsertrix and replaying using replay.web

Screenshot 2023-06-20 at 8 17 03 PM

edsu commented 1 year ago

I see the same behavior as well. Did you happen to notice a message like this in the log output?

{"logLevel":"info","timestamp":"2023-06-20T15:03:50.679Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://www.kugou.com/","workerid":0}}

I did notice it when I ran your docker command, and I think this might be similar to #321 perhaps? It appears autoscrolling is not working on the page?
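
For longer crawls it may be easier to scan the log programmatically than to read it by eye. The following is a minimal sketch, assuming the crawler log is available as one JSON object per line with the same fields as the entry quoted above (`context`, `details.msg`, `details.page`). The function name `find_skipped_autoscroll` is hypothetical and not part of browsertrix-crawler.

```python
import json

# Substring taken from the behavior log message quoted above.
SKIP_MARKER = "Skipping autoscroll"


def find_skipped_autoscroll(log_lines):
    """Return the page URLs whose behavior log reports a skipped autoscroll."""
    pages = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore any non-JSON output mixed into the log
        if not isinstance(entry, dict):
            continue
        if entry.get("context") != "behaviorScript":
            continue
        details = entry.get("details")
        if not isinstance(details, dict):
            continue
        if SKIP_MARKER in str(details.get("msg", "")):
            pages.append(details.get("page"))
    return pages


# The log entry quoted above, used here as sample input.
sample = [
    '{"logLevel":"info","timestamp":"2023-06-20T15:03:50.679Z",'
    '"context":"behaviorScript","message":"Behavior log",'
    '"details":{"state":{"segments":1},'
    '"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events",'
    '"page":"https://www.kugou.com/","workerid":0}}'
]

print(find_skipped_autoscroll(sample))
```

Any page this reports is a candidate for missing lazy-loaded assets, since the scroll that would trigger them never ran.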

adityaraj-28 commented 1 year ago

Yes, I did get this. Since autoscrolling is not happening and the images load lazily on scroll, the crawler isn't able to record those assets. Interesting.
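
One way to confirm which assets were missed is to compare the image URLs captured in the two recordings. The following is a rough sketch, assuming both captures have been indexed to CDXJ (one line per record: a URL key, a timestamp, then a JSON object containing `url` and `mime` fields, as in pywb's index format). The helper names and the example.com sample lines are hypothetical and do not come from the actual crawls.

```python
import json


def image_urls(cdxj_lines):
    """Collect the URLs of image records from an iterable of CDXJ lines."""
    urls = set()
    for line in cdxj_lines:
        # CDXJ layout assumed here: "<urlkey> <timestamp> <json>"
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue
        try:
            fields = json.loads(parts[2])
        except json.JSONDecodeError:
            continue
        if not isinstance(fields, dict):
            continue
        url = fields.get("url")
        if url and str(fields.get("mime", "")).startswith("image/"):
            urls.add(url)
    return urls


def missing_images(reference_lines, crawl_lines):
    """Image URLs present in the reference capture but absent from the crawl."""
    return sorted(image_urls(reference_lines) - image_urls(crawl_lines))


# Hypothetical sample data standing in for the pywb and browsertrix indexes.
reference = [
    'com,example)/ 20230620150000 {"url": "https://example.com/", "mime": "text/html"}',
    'com,example)/img/a.jpg 20230620150001 {"url": "https://example.com/img/a.jpg", "mime": "image/jpeg"}',
    'com,example)/img/b.jpg 20230620150002 {"url": "https://example.com/img/b.jpg", "mime": "image/jpeg"}',
]
crawl = reference[:2]

print(missing_images(reference, crawl))
```

If the URLs that show up as missing are mostly below-the-fold images, that would be consistent with the skipped autoscroll being the cause.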

lakshya16240 commented 1 year ago

Absolutely. I see this log as well. It seems weird though, since some of the images appearing towards the end of the page's scroll have been loaded.

Screenshot 2023-06-20 at 9 04 00 PM

lakshya16240 commented 1 year ago

As rightly pointed out by @adityaraj-28, most of the lazy-loaded images are not being crawled properly. But there are indeed a few instances in which the images that did load are lazy-loaded ones, as can be seen in the image below.

Screenshot 2023-06-20 at 9 09 08 PM

lakshya16240 commented 1 year ago

I tried the steps mentioned in behaviours for the above website. The page seems to autoscroll fine using these steps, but somehow fails to autoscroll during a browsertrix crawl.