Issues raised in IIPC WAC Workshop

webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

https://browsertrix.com

GNU Affero General Public License v3.0

Issues raised in IIPC WAC Workshop #923

Open anjackson opened 1 year ago

anjackson commented 1 year ago

These are some notes I made on issues raised during the workshop at the IIPC conference:

1. The title field is too short: 50 chars is not enough.
2. The "URLs prefix" scope is wrong/misleading, because seed URLs like http://site.com/subsite ended up crawling other site sections, e.g. http://site.com/another. I assume the scope is dropping everything after the last /.
3. Please add an "Export YAML" option.
4. Please expose embeds/transclusions logs and a post-crawl log (e.g. captured pages for sign-off of decommissioning).
5. Please add Heritrix-style crawl reports.
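
To illustrate the guess in the "URLs prefix" item above, here is a minimal sketch of the two behaviours. It is not Browsertrix's actual scoping code; the function names are hypothetical and the "drop everything after the last /" rule is only the assumption stated in that item.

```python
from urllib.parse import urlsplit


def prefix_dropping_last_segment(seed: str) -> str:
    """Assumed (not confirmed) behaviour: keep the path only up to its last '/'."""
    parts = urlsplit(seed)
    path = parts.path
    # For "/subsite" the last '/' is at index 0, so only "/" survives.
    directory = path[: path.rfind("/") + 1] if "/" in path else "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def prefix_keeping_seed(seed: str) -> str:
    """Alternative: use the full seed URL (scheme, host and path) as the prefix."""
    parts = urlsplit(seed)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


def in_scope(url: str, prefix: str) -> bool:
    """A URL is in scope when it starts with the prefix."""
    return url.startswith(prefix)


seed = "http://site.com/subsite"
other_section = "http://site.com/another"

# Dropping the last segment widens the scope to the whole site.
print(in_scope(other_section, prefix_dropping_last_segment(seed)))
# Keeping the seed path excludes the other section.
print(in_scope(other_section, prefix_keeping_seed(seed)))
```

Under the assumed rule, the prefix for http://site.com/subsite collapses to http://site.com/, which would explain why http://site.com/another was crawled.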

Shrinks99 commented 1 year ago

Hi Andy, thanks for noting these! I believe point 2 is due to be addressed in our 1.6 release (coming extremely soon... minutes away!) and was hopefully fixed in #913 :)

Some expansion on the following points would be helpful for us!

1. Do you remember why the title field was too short? What kind of names are being assigned that lead to 50 characters not being enough?
2. What kind of YAML data do people want exported?

anjackson commented 1 year ago

On 1, I'll ask colleagues @HelenaByrne and @nicolabingham to comment, but IIRC it's because we often have a lot of crawl targets and sometimes they are named things like 'The Institute for Wiggly Wagglies: Second quarterly report on herbaceous statistics from the extemporaneous emissions of the Coalition of International Cheesemongers (2023) [REDACTED]' 😄

On 2, I think it was a reference to the YAML that gets sent to Browsertrix Crawler, so you could experiment in Cloud and then run it directly if you wanted.

nicolabingham commented 1 year ago

Hi @Shrinks99, our users are library staff who create targeted crawls at differing levels of granularity: in some cases at the website level, but in other cases at the level of a sub-section of a website or even at the article or document level. We'd like to allow them the ability to be flexible in how they construct titles, and sometimes that means being very descriptive. In our current archiving software, we don't impose a character limit (or if we do, we haven't hit it yet).

Example: "BBC News: Brexit: Hammond says PM's demands 'wreck' chance of new deal"
Example: "Amrik Kandola (@KandolaAmrik) on Twitter (Change UK politician)"
Example: "Conservative Home: Nigel Evans: If the Prime Minister can’t deliver a clean Brexit, she must make way for a successor who will"

I hope this helps

Shrinks99 commented 1 year ago

Thank you both for the details!

I've spun out the request for YAML in #924.

As for character limits, the current limit exists to encourage shorter names so that they display more consistently throughout the app. I agree that handling this with such a small limit isn't ideal, and it's not the direction I think we'll be going long term for these names.