Create Crawl Configuration/Template/Definition UI

ikreymer commented 2 years ago

This screen will produce a JSON that is then passed to the crawl config creation API endpoint.

The format is a top-level dictionary with Browsertrix Cloud-specific options, plus a config dictionary, which corresponds to the Browsertrix Crawler config.

The format is:

{
  "schedule": "",
  "runNow": false,
  "colls": [],
  "crawlTimeout": 0,
  "parallel": 1,
  "config": {...}
}

The key properties to include are:

[x] run now, a checkbox to start a crawl instantly.
[x] schedule, a way to specify a schedule in cron-style format (but it can be simpler, e.g. a day and time, plus an option like daily, weekly, monthly, etc.)
[x] Time Limit in seconds (mostly will be helpful for testing, though not strictly required)
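
As a rough illustration of the simpler schedule option described above, the sketch below maps a frequency choice plus a time of day onto a five-field cron string. The function name, its parameters, and the use of 0 = Sunday for the weekday are assumptions made for this example; they are not part of the API.

```python
def simple_schedule_to_cron(frequency: str, hour: int, minute: int = 0,
                            day_of_week: int = 0, day_of_month: int = 1) -> str:
    """Translate a simple UI schedule choice into a cron expression.

    Hypothetical helper: cron fields are minute, hour, day-of-month,
    month, day-of-week (0 = Sunday).
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    if frequency == "daily":
        return f"{minute} {hour} * * *"
    if frequency == "weekly":
        if not 0 <= day_of_week <= 6:
            raise ValueError("day_of_week must be 0-6")
        return f"{minute} {hour} * * {day_of_week}"
    if frequency == "monthly":
        if not 1 <= day_of_month <= 31:
            raise ValueError("day_of_month must be 1-31")
        return f"{minute} {hour} {day_of_month} * *"
    raise ValueError(f"unsupported frequency: {frequency!r}")


# Every Monday at 09:30
weekly = simple_schedule_to_cron("weekly", hour=9, minute=30, day_of_week=1)
```

The resulting string would go into the "schedule" property, with an empty value meaning no schedule.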

The actual crawl configuration, the config property, is what gets passed to browsertrix-crawler. The UI for it can be either:

[x] Advanced view where json can be pasted
[x] Simplified view that includes a subset of properties, maybe starting with:
seed list, containing:
- [x] URL
- [x] Scope Type (page, page-spa, prefix, host, any)
other properties:
- [x] limit, total number of pages to crawl

For the seed list, the input might be:

[x] Text area with one URL per line + scope type, which then get added to the list. This would be to support pasting in a bunch of URLs with a specified scope.
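
A minimal sketch of how the text area contents might be turned into seed entries, assuming each seed is an object with "url" and "scopeType" keys (the "url" key and the helper name are assumptions for this example):

```python
from typing import Dict, List

# Scope types listed in this issue
SCOPE_TYPES = {"page", "page-spa", "prefix", "host", "any"}


def parse_seed_textarea(text: str, scope_type: str) -> List[Dict[str, str]]:
    """Split pasted text into one seed per non-empty line, all sharing one scope."""
    if scope_type not in SCOPE_TYPES:
        raise ValueError(f"unknown scope type: {scope_type!r}")
    seeds = []
    for line in text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines between pasted URLs
        seeds.append({"url": url, "scopeType": scope_type})
    return seeds


pasted = "https://webrecorder.net/\n\n  https://example.com/docs/  \n"
seeds = parse_seed_textarea(pasted, "prefix")
```

Each call appends a batch of URLs with a single scope, so pasting several batches with different scopes would just mean calling it once per batch and concatenating the results.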

The supported properties in the 'simplified view' will likely continue to evolve, but the advanced view will also be available for pasting a custom config.

SuaYoo commented 2 years ago

Thoughts on adding a name field to easily label and differentiate configs?

ikreymer commented 2 years ago

> Thoughts on adding a name field to easily label and differentiate configs?

Yes, definitely a good idea. Just free-form text that can be searched by, right?

SuaYoo commented 2 years ago

@ikreymer initial pass at mockup: https://app.mockplus.com/run/rp/TgN5xi_FPJvV/2DkJibt0vw?cps=expand&rps=expand&nav=1&ha=1&la=1&fc=0&out=0&rt=1

q's:

what are sane defaults for all these fields? I'll update the mockup to reflect default values.
can you go into more detail as to how scope type and page limit should apply to the URLs? Should the term seed be user facing? or should the UI display seeds as something like "URL group"/"Page groups"?
should the JSON configuration represent the whole config/template, or just the nested .config object? IMO the former is more user-friendly if we expect users to download JSON configs and then upload or copy-paste them in the future

SuaYoo commented 2 years ago

per discussion on call:

change default to "run instantly" and schedule to "weekly", @ikreymer to document other defaults
seed URL is user-facing term (TODO link glossary), group seed configs in UI
render whole JSON object

SuaYoo commented 2 years ago

@ikreymer minor issue with the POST crawlconfigs endpoint: leaving out the trailing / throws a Method Not Allowed error on the server
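
Until the server accepts both forms, a client could normalize the endpoint URL before sending the POST. This is only a sketch of a possible client-side workaround; the helper name and the example host and path are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url: str) -> str:
    """Return the URL with a trailing slash on its path, keeping query and fragment."""
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Hypothetical endpoint URL, for illustration only
endpoint = with_trailing_slash("https://btrix.example/api/crawlconfigs")
```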

ikreymer commented 2 years ago

Simplified the config for now: just send a URL list for seeds, with everything else moved to the outer scope, e.g.:

{
  "schedule": null,
  "runNow": true,
  "config": {
    "seeds": [
      "https://webrecorder.net/"
    ],
    "scopeType": "prefix"
  },
  "name": "Example Name",
  "colls": [],
  "crawlTimeout": "0",
  "parallel": 1
}
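
The simplified shape above could be assembled roughly as follows. This is a sketch under assumptions: the function name and its defaults are invented for the example, and crawlTimeout is emitted as the number 0, as in the first format in this issue, whereas the example above sends the string "0".

```python
import json
from typing import Any, Dict, List, Optional


def build_simplified_payload(name: str, seed_urls: List[str],
                             scope_type: str = "prefix",
                             run_now: bool = True,
                             schedule: Optional[str] = None) -> Dict[str, Any]:
    """Build the simplified crawl config: plain URL seeds, one shared scopeType."""
    return {
        "schedule": schedule,  # None serializes to JSON null
        "runNow": run_now,
        "config": {
            "seeds": list(seed_urls),
            "scopeType": scope_type,
        },
        "name": name,
        "colls": [],
        "crawlTimeout": 0,
        "parallel": 1,
    }


payload = build_simplified_payload("Example Name", ["https://webrecorder.net/"])
body = json.dumps(payload, indent=2)  # JSON text to send to the endpoint
```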

ikreymer commented 2 years ago

All done! Crawl scaling and tags to be separate issues.

webrecorder / browsertrix

Create Crawl Configuration/Template/Definition UI #74