Open anjackson opened 1 year ago
Hi Andy, thanks for noting these! I believe point 2 is due to be addressed in our 1.6 release (coming extremely soon... minutes away!) and was hopefully fixed in #913 :)
Some expansion on the following points would be helpful for us!
On 1, I'll ask colleagues @HelenaByrne and @nicolabingham to comment, but IIRC it's because we often have a lot of crawl targets and sometimes they are named things like 'The Institute for Wiggly Wagglies: Second quarterly report on herbaceous statistics from the extemporaneous emissions of the Coalition of International Cheesemongers (2023) [REDACTED]' 😄
On 2, I think it was a reference to the YAML that gets sent to Browsertrix Crawler, so you could experiment in Cloud and then run it directly if you wanted.
Hi @Shrinks99, our users are library staff who create targeted crawls at differing levels of granularity: in some cases at the website level, but in other cases at the level of a sub-section of a website, or even at the article or document level. We'd like to give them the flexibility to construct titles as they see fit, and sometimes that means being very descriptive. In our current archiving software, we don't impose a character limit (or if we do, we haven't hit it yet).
Example: "BBC News: Brexit: Hammond says PM's demands 'wreck' chance of new deal"
Example: "Amrik Kandola (@KandolaAmrik) on Twitter (Change UK politician)"
Example: "Conservative Home: Nigel Evans: If the Prime Minister can’t deliver a clean Brexit, she must make way for a successor who will"
I hope this helps
Thank you both for the details!
I've spun out the request for YAML in #924.
As for character limits, the limit currently exists to encourage shorter names so they display more consistently throughout the app. I agree that enforcing this with such a small limit isn't ideal, and it's not the direction I think we'll be going long term for these names.
These are some notes I made on issues raised during the workshop at the IIPC conference:
`http://site.com/subsite` ended up crawling other site sections, e.g. `http://site.com/another`. I assume the scope is dropping everything after the last `/`.
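To make the assumption above concrete, here is a minimal Python sketch of a prefix scope that drops everything after the last `/` of the seed URL. It is only an illustration of the suspected behaviour, not the actual Browsertrix implementation; the helper names `prefix_scope` and `in_scope` are hypothetical.

```python
from urllib.parse import urlsplit


def prefix_scope(seed: str) -> str:
    """Hypothetical helper: derive a scope prefix by dropping
    everything after the last '/' in the seed URL's path."""
    parts = urlsplit(seed)
    path = parts.path or "/"
    # Keep the path up to and including its last slash.
    directory = path[: path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc}{directory}"


def in_scope(url: str, seed: str) -> bool:
    """Hypothetical helper: a URL is in scope if it starts with the prefix."""
    return url.startswith(prefix_scope(seed))


# Without a trailing slash, 'subsite' is treated like a file name and dropped,
# so the prefix becomes the whole site.
print(prefix_scope("http://site.com/subsite"))
print(in_scope("http://site.com/another", "http://site.com/subsite"))

# With a trailing slash, the prefix stays limited to the sub-section.
print(prefix_scope("http://site.com/subsite/"))
print(in_scope("http://site.com/another", "http://site.com/subsite/"))
```

If the scope really is computed this way, adding a trailing slash to the seed (`http://site.com/subsite/`) would keep the crawl inside the sub-section.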