Closed SuaYoo closed 2 months ago
Truly appreciate these screenshots 🙏
In the section for URL List: 'Choose this option if you already know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out.'
The phrasing about knowing the URL of every page might imply that you need to be aware of all the pages on the site. If I understand this correctly, we mean that you can just have a list of specific URLs that you want to crawl.
Would the following convey the same meaning without emphasizing the need to know every page? "Choose this option if you have the URLs of the pages you want crawled and don't need to include any additional pages beyond one hop out."
And I know we talked about it before, but I just want to re-confirm the guidance for when to choose Seeded Crawl: "you're archiving a subset of a website, like everything under your website.com/your-username."
You'd be able to crawl https://www.instagram.com/crunchyroll/ without crawling all of Instagram, but only with the URL scope set to Pages under the Same Directory, right? It would grab all of Instagram if the URL scope was Pages under This Domain.
Truly appreciate these screenshots 🙏
In the section for URL List: 'Choose this option if you already know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out.'
The phrasing about knowing the URL of every page might imply that you need to be aware of all the pages on the site. If I understand this correctly, we mean that you can just have a list of specific URLs that you want to crawl.
Would the following convey the same meaning without emphasizing the need to know every page? "Choose this option if you have the URLs of the pages you want crawled and don't need to include any additional pages beyond one hop out."
So just removing the word 'known' mostly? Yeah, I guess that could work.
And I know we talked about it before, but I just want to re-confirm the guidance for when to choose Seeded Crawl: "you're archiving a subset of a website, like everything under your website.com/your-username."
You'd be able to crawl https://www.instagram.com/crunchyroll/ without crawling all of Instagram, but only with the URL scope set to Pages under the Same Directory, right? It would grab all of Instagram if the URL scope was Pages under This Domain.
Yes. The more I think about it, I think 'Site Crawl' can work well here also...
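For anyone following along, here is a rough sketch of the difference between the two scopes discussed above. This is only an illustration, not how Browsertrix actually implements scoping: the function names `in_directory_scope` and `in_domain_scope` are made up, and the real scope rules may also account for scheme, subdomains, or other details.

```python
from urllib.parse import urlsplit


def in_directory_scope(seed: str, url: str) -> bool:
    """Hypothetical 'Pages under the Same Directory' check.

    Assumes a URL is in scope when it is on the same host as the seed
    and its path starts with the seed's directory.
    """
    s, u = urlsplit(seed), urlsplit(url)
    # Keep everything in the seed path up to and including the last slash.
    seed_dir = s.path[: s.path.rfind("/") + 1]
    return s.hostname == u.hostname and u.path.startswith(seed_dir)


def in_domain_scope(seed: str, url: str) -> bool:
    """Hypothetical 'Pages under This Domain' check: same host only."""
    return urlsplit(seed).hostname == urlsplit(url).hostname


seed = "https://www.instagram.com/crunchyroll/"
profile_page = "https://www.instagram.com/crunchyroll/tagged/"
other_page = "https://www.instagram.com/explore/"

print(in_directory_scope(seed, profile_page))  # under the seed's directory
print(in_directory_scope(seed, other_page))    # outside the seed's directory
print(in_domain_scope(seed, other_page))       # same domain, so in scope
```

Under these assumptions, the directory scope keeps the crawl under `/crunchyroll/`, while the domain scope would follow any link on the same host.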
@ikreymer @DaleLore @Shrinks99 Updated with the following:
Resolves https://github.com/webrecorder/browsertrix/issues/2066
Changes
Manual testing
Screenshots
Follow-ups
Per Discord conversation we may want to revisit the term "Seeded Crawl" to better indicate that a single site is being crawled.