Open adityaraj-28 opened 1 year ago
I am able to reproduce this issue. I even tried using --waitUntil load,networkidle0, as was suggested in one of the previous issues. Another similar website on which the issue can be replicated is https://www.cctv.com/index.shtml. Please find the screenshots attached.
Actual Website
Using Pywb recorder
Using browsertrix and replaying using replay.web
I see the same behavior as well. Did you happen to notice a message like this in the log output?
{"logLevel":"info","timestamp":"2023-06-20T15:03:50.679Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://www.kugou.com/","workerid":0}}
I did notice it when I ran your docker command, and I think this might be similar to #321 perhaps? It appears autoscrolling is not working on the page?
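For anyone who wants to check their own crawl output for this message, here is a minimal Python sketch. It assumes the crawler log is available as one JSON object per line, in the same shape as the entry quoted above. The function name `find_skipped_autoscroll` and the `SKIP_MARKER` constant are hypothetical helpers for illustration, not part of browsertrix-crawler.

```python
import json

# Assumption: the substring below identifies the autoscroll skip message,
# based only on the single log entry quoted in this thread.
SKIP_MARKER = "Skipping autoscroll"


def find_skipped_autoscroll(log_lines):
    """Return page URLs whose behavior log reports that autoscroll was skipped."""
    pages = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore any non-JSON output mixed into the log
        if not isinstance(entry, dict):
            continue
        if entry.get("context") != "behaviorScript":
            continue
        details = entry.get("details")
        if not isinstance(details, dict):
            continue
        if SKIP_MARKER in str(details.get("msg", "")):
            pages.append(details.get("page"))
    return pages


# The log entry quoted above, used as sample input.
sample = '{"logLevel":"info","timestamp":"2023-06-20T15:03:50.679Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://www.kugou.com/","workerid":0}}'

print(find_skipped_autoscroll([sample]))
```

In practice you would pass the lines of a saved log file instead of `sample`; where that file lives depends on how the crawl was run, so the path is left to the reader.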
Yes, I did get this. Since autoscrolling is not happening and the images load lazily on scroll, the crawler isn't able to record those assets. Interesting.
Absolutely. I see this log as well. It seems odd, though, since some of the images that appear towards the end of the page have been loaded.
As rightly pointed out by @adityaraj-28, most of the lazy-loaded images are not being crawled properly. However, there are indeed a few instances in which the images that did load are lazy loaded, as can be seen in the image below.
I tried the steps mentioned in behaviours for the above website. The page seems to autoscroll fine using these steps, but somehow fails to autoscroll during a browsertrix crawl.
I am aware that browsertrix uses pywb in the background. I tried the website https://www.kugou.com/ and noticed some notable elements missing with browsertrix.
I started pywb using
docker run -e INIT_COLLECTION=myarc -p 8080:8080 -v ./pywb-data/:/webarchive webrecorder/pywb wayback --record --live -a
and opened the following URL in the Google Chrome browser:
http://localhost:8080/myarc/record/https://www.kugou.com/
I waited for the page to load, scrolled, and waited for the images to load after scrolling. Here's a screenshot of the replayed WARC with replayweb.
With browsertrix, I started the crawl using
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url "https://www.kugou.com/" --generateWARC --combineWARC true --collection kuguo1 --scopeType page-spa --waitUntil networkidle0 --timeout 100 --behaviorTimeout 100 --netIdleWait 30
When replaying the combined WARC with pywb, I got the following replay.
Is there a configuration error in how I am running browsertrix? Can anyone please help me figure out what the issue is?