webrecorder / browsertrix-old

Browsertrix: Containerized High-Fidelity Browser-Based Automated Crawling + Behavior System
Apache License 2.0

Better domain extraction for same-domain crawls and browser overrides serialization fixes #25

Closed N0taN3rd closed 5 years ago

N0taN3rd commented 5 years ago

Use tldextract to create the domain rule. urlcanon's domain-based match rules do not work well with "www": the host returned by urllib.parse.urlsplit counts "www" as a part of the domain, so the rule is built for "www.example.com" rather than "example.com".
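To illustrate the difference described above, here is a minimal sketch rather than the code from this PR. The helper names `host_from_urlsplit` and `registered_domain` are hypothetical, and how the result is passed on to the domain match rule is not shown because the PR text does not spell it out.

```python
from urllib.parse import urlsplit


def host_from_urlsplit(url: str) -> str:
    """Return the hostname as urlsplit sees it, subdomains such as "www" included."""
    return urlsplit(url).hostname or ""


def registered_domain(url: str) -> str:
    """Return the registrable domain (e.g. "example.com") using tldextract."""
    # Imported lazily so the stdlib helper above still works when the
    # third-party tldextract package is not installed.
    import tldextract

    parts = tldextract.extract(url)
    # Hosts without a public suffix (e.g. "localhost") have an empty suffix.
    return f"{parts.domain}.{parts.suffix}" if parts.suffix else parts.domain


if __name__ == "__main__":
    url = "https://www.example.com/page"
    print(host_from_urlsplit(url))  # www.example.com
    print(registered_domain(url))   # example.com
```

Building the rule from the registrable domain means both "example.com" and "www.example.com" should fall under the same same-domain crawl scope. Note that tldextract may try to download the public suffix list on first use and falls back to its bundled snapshot when that is not possible.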

Ensure no serialization errors occur when browser overrides are used. Fixed a README typo. Bumped the FastAPI version to the latest release.