webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://webrecorder.net/browsertrix
GNU Affero General Public License v3.0

feat: Improve UX of choosing new workflow crawl type #2067

Closed SuaYoo closed 2 months ago

SuaYoo commented 2 months ago

Resolves https://github.com/webrecorder/browsertrix/issues/2066

Changes

Manual testing

  1. Log in as crawler
  2. Go to Crawling
  3. Click "New Workflow"
  4. Click "URL List". Verify new workflow form is shown as URL list
  5. Go back and verify for "Seeded Crawl"
  6. Go back to crawl workflows list. Click "New Workflow" > "Help Me Decide". Verify job type dialog is shown

Screenshots

| Page | Image/video |
| --- | --- |
| Crawl Workflows | Screenshot 2024-09-04 at 3 16 25 PM |
| New Crawl Workflow dialog | Screenshot 2024-09-04 at 3 16 30 PM |
| New Crawl Workflow dialog | Screenshot 2024-09-04 at 3 20 33 PM |

Follow-ups

Per Discord conversation we may want to revisit the term "Seeded Crawl" to better indicate that a single site is being crawled.

DaleLore commented 2 months ago

Truly appreciate these screenshots 🙏

In the section for URL List: 'Choose this option if you already know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out.'

The phrasing about knowing the URL of every page might imply that you need to be aware of all of the pages on the site. If I understand this correctly, we mean that you can just have a list of specific URLs that you want to crawl.

Would the following convey the same meaning without emphasizing the need to know every page? "Choose this option if you have the URLs of the pages you want crawled and don't need to include any additional pages beyond one hop out."

And I know we talked about it before, but just want to re-confirm for choose seeded crawl: "you're archiving a subset of a website, like everything under your website.com/your-username."

You'd be able to crawl https://www.instagram.com/crunchyroll/ without crawling all of Instagram, but only with the scope set to Pages under the Same Directory, right? It would grab all of Instagram if the URL scope was Pages under This Domain.

ikreymer commented 2 months ago

> Truly appreciate these screenshots 🙏
>
> In the section for URL List: 'Choose this option if you already know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out.'
>
> The phrasing about knowing the URL of every page might imply that you need to be aware of all of the pages on the site. If I understand this correctly, we mean that you can just have a list of specific URLs that you want to crawl.
>
> Would the following convey the same meaning without emphasizing the need to know every page? "Choose this option if you have the URLs of the pages you want crawled and don't need to include any additional pages beyond one hop out."

So just removing the word 'known' mostly? Yeah, I guess that could work..

> And I know we talked about it before, but just want to re-confirm for choose seeded crawl: "you're archiving a subset of a website, like everything under your website.com/your-username."

> You'd be able to crawl https://www.instagram.com/crunchyroll/ without crawling all of Instagram, but only with the scope set to Pages under the Same Directory, right? It would grab all of Instagram if the URL scope was Pages under This Domain.

Yes. The more I think about it, I think 'Site Crawl' can work well here also...
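For reference, the scope distinction confirmed above corresponds to the seed's scope setting in the underlying browsertrix-crawler configuration. A minimal sketch (field names follow the crawler's YAML seed config; treat the exact values as an assumption to verify against the crawler docs):

```yaml
# Sketch: same seed URL, two different scopes.
seeds:
  # "Pages under the Same Directory": crawl stays within the
  # /crunchyroll/ path prefix, so the rest of instagram.com
  # is out of scope.
  - url: https://www.instagram.com/crunchyroll/
    scopeType: prefix

  # "Pages under This Domain": any URL on the same domain is
  # in scope, which is what would pull in all of Instagram.
  # - url: https://www.instagram.com/crunchyroll/
  #   scopeType: domain
```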

SuaYoo commented 2 months ago

@ikreymer @DaleLore @Shrinks99 Updated with the following:

New workflow dropdown

Screenshot 2024-09-09 at 12 33 17 PM

New workflow dialog

Screenshot 2024-09-09 at 12 37 28 PM

Crawl URLs -> Page URLs

Screenshot 2024-09-09 at 12 35 36 PM