webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://webrecorder.net/browsertrix

[Bug]: In replay, the same articles/images just repeat when clicking SE FLERE (view more) #1384

Open Klindten opened 12 months ago

Klindten commented 12 months ago

Browsertrix Cloud Version

v1.8.0-beta.3-6789299

What did you expect to happen? What happened instead?

I expected to be able to load more content by clicking SE FLERE on crawls of https://www.tvsyd.dk/; instead, the same articles/images just repeat over and over.

Step-by-step reproduction instructions

  1. Navigate to: https://beta.browsertrix.cloud/orgs/netarkivet-det-kgl-bibliotek/items/crawl/sched-7440fc0e-444-28331940?workflowId=7440fc0e-4447-4374-9524-4eca7fc23c7d#replay
  2. Scroll Down all the way possible
  3. Click on SE FLERE

[screenshot]

  4. The same images and articles just repeat...
  5. Scroll down
  6. Click on SE FLERE; the same images and articles just repeat...

[screenshot]

Additional details

No response

Shrinks99 commented 12 months ago

This doesn't appear to be a bug. The page loads 5 more articles from the server when the user clicks the SE FLERE button. As we are looking at an archive, the site can't contact the server to get the next 5 articles so it "loads" the last 5 that were accessed. One might argue that the page shouldn't do anything if it can't find the server, but in this instance that's likely on the website's creators to sort out and not us?
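
To illustrate the mechanism described above: replay tooling typically answers an uncaptured API request by fuzzy-matching it to the nearest captured one. The following is a minimal sketch of that idea, with a hypothetical articles endpoint and a simplified nearest-offset match; this is not Webrecorder's actual replay code:

```ts
// Illustrative only: a simplified take on how a replay system might
// fuzzy-match a "load more" request to the nearest captured response.
// The URLs and matching rule here are hypothetical assumptions.

const captured = new Map<string, string>([
  ["https://example.com/api/articles?offset=0",  "articles 1-5"],
  ["https://example.com/api/articles?offset=5",  "articles 6-10"],
  ["https://example.com/api/articles?offset=10", "articles 11-15"],
]);

function lookup(url: string): string {
  // Exact hit: this request was captured during the crawl.
  const exact = captured.get(url);
  if (exact) return exact;

  // Miss: the archive can't reach the live server, so fall back to the
  // captured request with the closest offset. For offset=15, offset=20,
  // and so on, that is always the last captured batch, which is why the
  // same five articles repeat on every click.
  const want = Number(new URL(url).searchParams.get("offset") ?? 0);
  let best = "";
  let bestDist = Infinity;
  for (const key of captured.keys()) {
    const have = Number(new URL(key).searchParams.get("offset") ?? 0);
    const dist = Math.abs(have - want);
    if (dist < bestDist) {
      bestDist = dist;
      best = key;
    }
  }
  return captured.get(best)!;
}

console.log(lookup("https://example.com/api/articles?offset=15")); // "articles 11-15"
console.log(lookup("https://example.com/api/articles?offset=20")); // "articles 11-15" again
```

With nothing newer captured, every offset past the last captured batch resolves to that same batch, which matches the looping described in the report.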

If you'd like more to be saved in the archive you can use ArchiveWeb.page to capture more manually, upload that file to Browsertrix, and merge both in a collection. I clicked it a few times to load more stories, then it replayed as expected and loaded what I had looked at. Once all the stories I had loaded were displayed it looped the last 5 as you mention.

If you'd like to use my crawl it is available via Dropbox Transfer at this link. See if it works when you upload it and combine them both into a collection! :)

Shrinks99 commented 12 months ago

Oop, looks like I may have been a little too quick on closing this!

From @ikreymer:

we actually have a behavior that clicks on 'load more' but not in every language!

Sounds like this actually is in scope and that's the issue!

ikreymer commented 12 months ago

The autoscroll behavior attempts to find links that say 'Load More', but only in English. Identifying this more broadly is a bit tricky, though, and may still require a custom behavior. It sounds like replay will work if the 'load more' button can be clicked during the crawl, so this may be more of a behavior/crawler issue.
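
A rough sketch of the matching logic such a custom behavior could use, written as plain page-context TypeScript rather than the actual browsertrix-behaviors API; the phrase list, helper names, and click loop are illustrative assumptions:

```ts
// Sketch of a language-aware "load more" clicker. The phrase list is a
// sample, not exhaustive; a real behavior would need broader coverage.
const LOAD_MORE_PHRASES = [
  "load more",   // English
  "se flere",    // Danish (tvsyd.dk)
  "vis mere",    // Danish variant
  "mehr laden",  // German
  "voir plus",   // French
];

function findLoadMoreButton(): HTMLElement | null {
  const candidates = document.querySelectorAll<HTMLElement>("a, button");
  for (const el of candidates) {
    const text = el.textContent?.trim().toLowerCase() ?? "";
    if (LOAD_MORE_PHRASES.some((p) => text === p || text.startsWith(p))) {
      return el;
    }
  }
  return null;
}

async function clickLoadMoreUntilDone(maxClicks = 25): Promise<void> {
  for (let i = 0; i < maxClicks; i++) {
    const btn = findLoadMoreButton();
    if (!btn) break;
    btn.scrollIntoView();
    btn.click();
    // Give the page time to fetch and render the next batch so the
    // crawler captures the underlying network requests.
    await new Promise((r) => setTimeout(r, 2000));
  }
}
```

Run during the crawl, each click triggers the real network request, so the additional batches get captured and replay correctly afterwards.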

Klindten commented 11 months ago

If you'd like to use my crawl it is available via Dropbox Transfer at this link. See if it works when you upload it and combine them both into a collection! :)

Interesting whether a fix like this will also work when going WARC files > index into OutbackCDX > replay in pywb or SolrWayback.

Hi Henry. I made my own ArchiveWeb.page crawl, made a collection combining the previous crawl and the repair crawl, and it fixed the problem. But we (and others) will most likely not QA everything, so we would normally miss something like this. The real solution is to get the crawler to actually click the SE FLERE link... in the future it would be great to have a way to add custom behaviours, maybe even in the GUI :-)

Shrinks99 commented 11 months ago

But we (and others) will most likely not QA everything, so we would normally miss something like this.

It's sometimes difficult to determine which behaviors should be in scope for defaults. This one seems to be, however!

In the future we would like to allow users to define custom behavior scripts in the workflow creator. It's not currently scoped, but this year we've made some good strides towards that goal!

Klindten commented 11 months ago

Great. I tried to use ArchiveWeb.page extra crawls to fix crawls on our local installation and it didn't repair it the same way as on Cloud. But it might also be a fault in the way I'm doing things :-). One takeaway could be to describe some of these cases, or similar ones, in the documentation. People using this will have a pretty varied technical background, so it would be great to have documentation that e.g. curators will understand.

Shrinks99 commented 11 months ago

There's a brief note about combining items from multiple sources on the collections docs page, but we're always interested in better documentation!

If I'm understanding correctly, you're saying you combined your crawl and your uploaded item from ArchiveWeb.page within a collection, downloaded that collection, and then ingested it into your system and it didn't replay the same way?

Klindten commented 11 months ago

Yes. But there could be some differences; basically I just crawled from the front page and did a lot of "SE FLERE" clicks as add-ons in ArchiveWeb.page. It seems to work for some of the tvsyd.dk URLs and less for others (it could be that those articles were not crawled).

I also noticed that to find any content for my seeded crawl of www.tvsyd.dk on the local installation, I had to tick the "Show non-seed Pages" box (and also click the PAGES section a few times before the box appeared...). I thought the first URL was included. [screenshot]

In the cloud installation I see the page URL fine: [screenshot]