webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

Getting [Errno 2] No such file or directory after deleting a WARC and reindexing. #834

Open YousufSSyed opened 1 year ago

YousufSSyed commented 1 year ago

I'm not too familiar with how pywb works, but if a WARC is deleted (whether intentionally or by accident), I want pywb to recognize that it's no longer there and not try to show it. When I record a page, delete the WARC for it, run wb-manager reindex {collection}, and then go to .../{collection}/{url}, I get this:

{'args': {'coll': 'archive', 'type': 'replay', 'metadata': {}}, 'error': '{"message": "rec-20230401040109730649-file.warc.gz: [Errno 2] No such file or directory: \'/Users/yousuf/.pyenv/versions/3.9.16/lib/python3.9/collections/archive/archive/rec-20230401040109730649-file.warc.gz\'",

Environment

tw4l commented 1 year ago

Thanks @YousufSSyed - looks like you hit a bug here, and agreed that pywb should handle a missing WARC file more gracefully. In the case you describe above, are you also deleting corresponding CDX/CDXJ index entries for the deleted WARC, or just the WARC itself? The latter might explain why pywb expects there to be an archive file to load from.
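To check whether the index still points at the deleted WARC, something like the following could help. It is only a rough sketch, not part of pywb: `missing_warc_entries` is a hypothetical helper, and it assumes the index is a CDXJ file where each line is `<urlkey> <timestamp> <json>` and the JSON block carries a `filename` field.

```python
import json
import os


def missing_warc_entries(index_path, archive_dir):
    """Return sorted WARC filenames referenced by a CDXJ index but absent from archive_dir.

    Hypothetical helper, not a pywb API. Assumes each index line is
    "<urlkey> <timestamp> <json>" with a "filename" key in the JSON block.
    """
    missing = set()
    with open(index_path, encoding="utf-8") as index:
        for line in index:
            # Split off the JSON block; the first two fields are urlkey and timestamp.
            parts = line.strip().split(" ", 2)
            if len(parts) < 3:
                continue  # blank or malformed line
            try:
                fields = json.loads(parts[2])
            except json.JSONDecodeError:
                continue  # not a CDXJ entry we can parse
            if not isinstance(fields, dict):
                continue
            filename = fields.get("filename")
            if filename and not os.path.isfile(os.path.join(archive_dir, filename)):
                missing.add(filename)
    return sorted(missing)
```

Assuming the default collection layout, you would call it as `missing_warc_entries("collections/archive/indexes/index.cdxj", "collections/archive/archive")`. Any filename it returns is still referenced by the index even though the file is gone, which would match the error above.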

YousufSSyed commented 1 year ago

No, I haven't been deleting index files. However, I did run the wb-manager reindex command whenever I hit the errors.

Is there any testing or reproducing you’d like me to do?

tw4l commented 1 year ago

I'm going to reproduce on my end tomorrow and will let you know!