Klindten opened this issue 12 months ago
This doesn't appear to be a bug. The page loads 5 more articles from the server when the user clicks the SE FLERE button. As we are looking at an archive, the site can't contact the server to get the next 5 articles, so it "loads" the last 5 that were accessed. One might argue that the page shouldn't do anything if it can't find the server, but in this instance that's likely something for the website's creators to sort out, not us?
If you'd like more to be saved in the archive, you can use ArchiveWeb.page to capture more manually, upload that file to Browsertrix, and merge both in a collection. I clicked it a few times to load more stories, then it replayed as expected and loaded what I had looked at. Once all the stories I had loaded were displayed, it looped the last 5 as you mention.
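To make the looping concrete, here is a minimal Python simulation, not Browsertrix or replay code; the page size of 5 and the names `capture`/`replay_click` are just illustrative. During capture, each click records one batch of articles; during replay, a click past the last recorded batch can only be served the final recorded response again, which is why the same 5 articles repeat.

```python
def capture(articles, page_size, clicks):
    """Record the responses fetched during capture: one batch per click."""
    recorded = []
    for i in range(clicks):
        start = i * page_size
        recorded.append(articles[start:start + page_size])
    return recorded

def replay_click(recorded, click_index):
    """Serve the recorded batch if one exists, else repeat the last one."""
    if click_index < len(recorded):
        return recorded[click_index]
    return recorded[-1]  # nothing newer was captured, so the same batch again

articles = [f"article-{n}" for n in range(1, 31)]
recorded = capture(articles, page_size=5, clicks=2)
first = replay_click(recorded, 0)   # first captured batch
looped = replay_click(recorded, 5)  # beyond capture: last batch repeats
```

Capturing more clicks (as with the ArchiveWeb.page repair crawl above) simply extends `recorded`, which is why merging the two crawls fixes the loop up to however many clicks were captured.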
If you'd like to use my crawl it is available via Dropbox Transfer at this link. See if it works when you upload it and combine them both into a collection! :)
Oop, looks like I may have been a little too quick on closing this!
From @ikreymer:
we actually have a behavior that clicks on 'load more' but not in every language!
Sounds like this actually is in scope and that's the issue!
The autoscroll behavior attempts to find links that say 'Load More', however, only in English. This is a bit tricky to identify more broadly, though; it may still require a custom behavior. Sounds like replay will work if the 'load more' button can be clicked. This may be more of a behavior/crawler issue.
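A fix along these lines could start from a small multilingual phrase list. The sketch below is hypothetical: the phrase set and the `looks_like_load_more` helper are illustrative, not Browsertrix's actual behavior code.

```python
# Hypothetical multilingual 'load more' detection; the phrase list is
# illustrative and would need to be much broader in practice.
LOAD_MORE_PHRASES = {
    "load more",   # English
    "show more",   # English
    "se flere",    # Danish (the tvsyd.dk button)
    "vis mere",    # Danish/Norwegian
    "mehr laden",  # German
    "voir plus",   # French
}

def looks_like_load_more(text: str) -> bool:
    """Return True if a button/link label matches a known 'load more' phrase."""
    return text.strip().lower() in LOAD_MORE_PHRASES
```

A crawler behavior could run such a check over the visible text of buttons and links after each scroll step and click any match, which is roughly what the English-only autoscroll behavior does today.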
Interesting whether a fix like this will also work when using WARC files, indexed into Outback CDX and replayed in PyWb or SolrWayback.
Hi Henry. I made my own ArchiveWeb.page crawl, made a compilation with the previous crawl and the repair crawl, and it fixed the problem. But we/others will most likely not QA everything, so we would normally miss something like this. The real solution is to get the crawler to actually click the SE FLERE link... In the future it would be great to have a way to add custom behaviours - maybe even in the GUI :-)
But we/others will most likely not QA everything, so we would normally miss something like this.
It's sometimes difficult to determine which behaviors should be in scope as defaults. This one seems to be, however!
In the future we would like to allow users to define custom behavior scripts in the workflow creator. It's not currently scoped, but this year we've made some good strides towards that goal!
Great. I tried to use ArchiveWeb.page extra crawls to fix crawls on our local installation, and it didn't repair things the same way as on Cloud. But it might also be a fault in the way I'm doing things :-). One takeaway could be to describe some of these cases, or similar ones, in the documentation. People using this will have a pretty varied technical background, so it would be great to have documentation that e.g. curators will understand.
There's a brief note about combining items from multiple sources on the collections docs page, but always interested in better documentation!
If I'm understanding correctly, you're saying you combined your crawl and your uploaded item from ArchiveWeb.page within a collection, downloaded that collection, and then ingested it into your system and it didn't replay the same way?
Yes. But there could be some differences - basically I just crawled from the front page and did a lot of "SE FLERE" clicks as add-ons in ArchiveWeb.page. It seems to work for some of the tvsyd.dk URLs and less for others (it could be that those articles were not crawled).
I also noticed that, to find any content for my seeded crawl of www.tvsyd.dk on the local installation, I had to tick the "Show non-seed Pages" box (and also click the PAGES section a few times to get the box...). I thought the first URL was included.
In the cloud installation, I see the page URL fine:
Browsertrix Cloud Version
v1.8.0-beta.3-6789299
What did you expect to happen? What happened instead?
I expected to be able to extend content by clicking SE FLERE on crawls of https://www.tvsyd.dk/; instead, articles/images just repeat over and over.
Step-by-step reproduction instructions
Additional details
No response