webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://webrecorder.net/browsertrix
GNU Affero General Public License v3.0

Create Crawl Configuration/Template/Definition UI #74

Closed ikreymer closed 2 years ago

ikreymer commented 2 years ago

This screen will produce a JSON object that is then passed to the crawl config creation API endpoint.

The format is a top-level dictionary with Browsertrix Cloud-specific options plus a nested config dictionary, which corresponds to the Browsertrix Crawler config.

The format is:

{
  "schedule": "",
  "runNow": false,
  "colls": [],
  "crawlTimeout": 0,
  "parallel": 1,
  "config": {...}
}
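As a rough illustration of the format above, this Python sketch nests a crawler config inside the top-level dictionary and serializes it. The names `OUTER_PLACEHOLDERS` and `wrap_config` are hypothetical, and the placeholder values are simply copied from the example above; they are not confirmed defaults.

```python
import copy
import json

# Placeholder values copied from the format above; not confirmed defaults.
OUTER_PLACEHOLDERS = {
    "schedule": "",
    "runNow": False,
    "colls": [],
    "crawlTimeout": 0,
    "parallel": 1,
}


def wrap_config(config: dict, **overrides) -> dict:
    """Return the top-level dictionary with `config` nested under "config"."""
    unknown = set(overrides) - set(OUTER_PLACEHOLDERS)
    if unknown:
        raise KeyError(f"unexpected top-level options: {sorted(unknown)}")
    # Deep copy so the shared list in the placeholders is never mutated.
    payload = copy.deepcopy(OUTER_PLACEHOLDERS)
    payload.update(overrides)
    payload["config"] = config
    return payload


payload = wrap_config({"seeds": ["https://webrecorder.net/"]}, runNow=True)
print(json.dumps(payload, indent=2))
```

Rejecting unknown top-level keys is a design choice of this sketch, meant to keep crawler options from leaking into the outer dictionary by accident.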

The key properties to include are:

The actual crawl configuration, the config property, is what is passed to browsertrix-crawler and can be either a:

For the seed list, the input might be:

The supported properties in the 'simplified view' will likely continue to evolve, but there should also be an advanced view for pasting a custom config.

SuaYoo commented 2 years ago

Thoughts on adding a name field to easily label and differentiate configs?

ikreymer commented 2 years ago

> Thoughts on adding a name field to easily label and differentiate configs?

Yes, definitely a good idea. Just free-form text that is searchable, right?

SuaYoo commented 2 years ago

@ikreymer initial pass at mockup: https://app.mockplus.com/run/rp/TgN5xi_FPJvV/2DkJibt0vw?cps=expand&rps=expand&nav=1&ha=1&la=1&fc=0&out=0&rt=1

q's:

  1. what are sane defaults for all these fields? I'll update the mockup to reflect default values.
  2. can you go into more detail as to how scope type and page limit should apply to the URLs? Should the term seed be user facing? or should the UI display seeds as something like "URL group"/"Page groups"?
  3. should the JSON configuration represent the whole config/template, or just the nested .config object? IMO the former is more user-friendly if we expect users to download JSON configs and then upload or copy-paste them in the future

SuaYoo commented 2 years ago

per discussion on call:

  1. change default to "run instantly" and schedule to "weekly", @ikreymer to document other defaults
  2. seed URL is user-facing term (TODO link glossary), group seed configs in UI
  3. render whole JSON object

SuaYoo commented 2 years ago

@ikreymer minor issue with the POST crawlconfigs endpoint: leaving out the trailing / throws a Method Not Allowed error on the server
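Until the server accepts both forms, a client could normalize the URL before posting. This is a minimal sketch using only the standard library; the helper name `with_trailing_slash` and the example host and path prefix are hypothetical, and only the `crawlconfigs` segment comes from this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url: str) -> str:
    """Return url with a trailing slash on its path, leaving query and fragment intact."""
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit(parts._replace(path=path))


# Hypothetical host and prefix; only "crawlconfigs" appears in the thread.
print(with_trailing_slash("https://example.com/api/crawlconfigs"))
```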

ikreymer commented 2 years ago

Simplified the config for now: just send a URL list for seeds, with everything else moved to the outer scope, e.g.:

{
  "schedule": null,
  "runNow": true,
  "config": {
    "seeds": [
      "https://webrecorder.net/"
    ],
    "scopeType": "prefix"
  },
  "name": "Example Name",
  "colls": [],
  "crawlTimeout": "0",
  "parallel": 1
}
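A form handler could assemble that simplified payload from a name and a list of seed URLs roughly as follows. The function name `simplified_crawl_config` is hypothetical, and requiring at least one seed is an assumption of this sketch rather than documented behavior; the remaining values mirror the example above.

```python
import json


def simplified_crawl_config(name: str, seed_urls: list[str]) -> dict:
    """Build the simplified payload: a plain URL list for seeds, the rest at top level."""
    # Drop blank entries and surrounding whitespace from the user's input.
    seeds = [url.strip() for url in seed_urls if url.strip()]
    if not seeds:
        # Assumption of this sketch: an empty seed list is treated as an error.
        raise ValueError("at least one seed URL is required")
    return {
        "schedule": None,  # serialized as JSON null
        "runNow": True,
        "config": {"seeds": seeds, "scopeType": "prefix"},
        "name": name,
        "colls": [],
        "crawlTimeout": "0",  # kept as a string, as in the example above
        "parallel": 1,
    }


body = json.dumps(simplified_crawl_config("Example Name", ["https://webrecorder.net/"]))
print(body)
```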

ikreymer commented 2 years ago

All done! Crawl scaling and tags to be separate issues.