webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

Feature request: Limit number of versions per URL during indexing #824

Open VascoRatoFCCN opened 1 year ago

VascoRatoFCCN commented 1 year ago

Some web pages are crawler traps that generate an abnormal number of mementos for the same URL. For example, in some of our crawls the Google ads script (https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js) had over one million mementos, because so many websites use it.

We created a workaround with a filtering script, but other users may run into the same problem when their crawls hit such traps.
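
For illustration, a filter of this kind could look roughly like the sketch below. It is not the script mentioned above and not part of pywb. It assumes a sorted CDXJ index in which each line starts with the URL key followed by a space, and the function name `limit_versions` and the parameter `max_per_url` are made up for this example.

```python
from itertools import groupby
from typing import Iterable, Iterator


def limit_versions(lines: Iterable[str], max_per_url: int) -> Iterator[str]:
    """Yield at most `max_per_url` index lines for each URL key.

    Assumes the input is already sorted, so that all lines sharing a
    URL key (the text before the first space) are adjacent.
    """
    def url_key(line: str) -> str:
        return line.split(" ", 1)[0]

    for _, group in groupby(lines, key=url_key):
        for count, line in enumerate(group):
            if count >= max_per_url:
                # Drop the remaining captures of this URL; groupby
                # skips ahead to the next key on its own.
                break
            yield line
```

Because the input is sorted by URL key and then by timestamp, this keeps the earliest captures of each URL and drops the rest. Keeping the most recent ones, or a sample spread over time, would need a different selection rule.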