webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://webrecorder.net/browsertrix
GNU Affero General Public License v3.0

feat: Improve UX of choosing new workflow crawl type #2067

Closed SuaYoo closed 2 months ago

SuaYoo commented 2 months ago

Resolves https://github.com/webrecorder/browsertrix/issues/2066

Changes

Manual testing

  1. Log in as crawler
  2. Go to Crawling
  3. Click "New Workflow"
  4. Click "URL List". Verify new workflow form is shown as URL list
  5. Go back and verify for "Seeded Crawl"
  6. Go back to crawl workflows list. Click "New Workflow" > "Help Me Decide". Verify job type dialog is shown

Screenshots

| Page | Image/video |
| --- | --- |
| Crawl Workflows | Screenshot 2024-09-04 at 3 16 25 PM |
| New Crawl Workflow dialog | Screenshot 2024-09-04 at 3 16 30 PM |
| New Crawl Workflow dialog | Screenshot 2024-09-04 at 3 20 33 PM |

Follow-ups

Per Discord conversation we may want to revisit the term "Seeded Crawl" to better indicate that a single site is being crawled.

DaleLore commented 2 months ago

Truly appreciate these screenshots 🙏

In the section for URL List: 'Choose this option if you already know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out.'

The phrasing about knowing the URL of every page might imply that you need to be aware of all of the pages on the site. If I understand this correctly, we mean that you can just have a list of specific URLs that you want to crawl.

Would the following convey the same meaning without emphasizing the need to know every page? "Choose this option if you have the URLs of the pages you want crawled and don't need to include any additional pages beyond one hop out."

And I know we talked about it before, but just want to re-confirm for choose seeded crawl: "you're archiving a subset of a website, like everything under your website.com/your-username."

You'd be able to crawl https://www.instagram.com/crunchyroll/ without crawling all of Instagram, but only with the scope set to Pages under the Same Directory, right? It would grab all of Instagram if the URL scope was Pages under This Domain.

ikreymer commented 2 months ago

> Truly appreciate these screenshots 🙏
>
> In the section for URL List: 'Choose this option if you already know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out.'
>
> The phrasing about knowing the URL of every page might imply that you need to be aware of all of the pages on the site. If I understand this correctly, we mean that you can just have a list of specific URLs that you want to crawl.
>
> Would the following convey the same meaning without emphasizing the need to know every page? "Choose this option if you have the URLs of the pages you want crawled and don't need to include any additional pages beyond one hop out."

So just removing the word 'known' mostly? Yeah, I guess that could work..

> And I know we talked about it before, but just want to re-confirm for choose seeded crawl: "you're archiving a subset of a website, like everything under your website.com/your-username."

> You'd be able to crawl https://www.instagram.com/crunchyroll/ without crawling all of Instagram, but only with the scope set to Pages under the Same Directory, right? It would grab all of Instagram if the URL scope was Pages under This Domain.

Yes. The more I think about it, I think 'Site Crawl' can work well here also...
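For reference, the scope distinction confirmed above corresponds to the seed's scope setting in the underlying browsertrix-crawler configuration. A minimal sketch (field names follow the crawler's YAML seed config; treat the exact values as an assumption to verify against the crawler docs):

```yaml
# Sketch: same seed URL, two different scopes.
seeds:
  # "Pages under the Same Directory": crawl stays within the
  # /crunchyroll/ path prefix, so the rest of instagram.com
  # is out of scope.
  - url: https://www.instagram.com/crunchyroll/
    scopeType: prefix

  # "Pages under This Domain": any URL on the same domain is
  # in scope, which is what would pull in all of Instagram.
  # - url: https://www.instagram.com/crunchyroll/
  #   scopeType: domain
```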

SuaYoo commented 2 months ago

@ikreymer @DaleLore @Shrinks99 Updated with the following:

New workflow dropdown

Screenshot 2024-09-09 at 12 33 17 PM

New workflow dialog

Screenshot 2024-09-09 at 12 37 28 PM

Crawl URLs -> Page URLs

Screenshot 2024-09-09 at 12 35 36 PM