webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

Old data & replay issue #695

Open mw0000 opened 2 years ago

mw0000 commented 2 years ago

Expected behavior

hi,

I'm trying to set up an archive with an old (1996-2001) archive data collection (from IA), but I get errors like this:

{'args': {'coll': 'my-web-archive', 'type': 'replay', 'metadata': {}}, 'error': '{"message": "pl-2001-EXTRACTION-20200922232618-00110-00119-ARC_arc/pl-2001-EXTRACTION-20200922232618-00110-ARC.arc.gz: [Errno 2] No such file or directory: \'/usr/lib/python3.8/collections/my-web-archive/archive/pl-2001-EXTRACTION-20200922232618-00110-00119-ARC_arc/pl-2001-EXTRACTION-20200922232618-00110-ARC.arc.gz\'", "errors": {"WARCPathLoader": "pl-2001-EXTRACTION-20200922232618-00110-00119-ARC_arc/pl-2001-EXTRACTION-20200922232618-00110-ARC.arc.gz: [Errno 2] No such file or directory: \'/usr/lib/python3.8/collections/my-web-archive/archive/pl-2001-EXTRACTION-20200922232618-00110-00119-ARC_arc/pl-2001-EXTRACTION-20200922232618-00110-ARC.arc.gz\'"}}'}

The same error occurs for every URL I request. CDX indexing also produces errors:

mw@webarch:~$ wb-manager cdx-convert collections/my-web-archive/indexes/
Convert 38 index files? (y/n)y
Converting collections/my-web-archive/indexes/pl-2001-EXTRACTION-20200922232618-00040-00049-ARC_arc.utf8.cdx -> collections/my-web-archive/indexes/pl-2001-EXTRACTION-20200922232618-00040-00049-ARC_arc.utf8.cdxj
Error: Invalid Url: http://www.amd.pl:8021349/21349d.html

With the original CDX files I can see the search results, but when I try to view the replay copy, I get an error.
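The `Invalid Url` error above involves a port number (8021349) that is larger than 65535, which may be what the converter rejects. As a rough sketch under that assumption, the following scans index lines for URLs whose port cannot be parsed. The helper names `port_problem` and `find_bad_urls` are made up for illustration, and the scan simply looks at every whitespace-separated token that starts with `http://` or `https://` rather than relying on a particular CDX field layout.

```python
from urllib.parse import urlsplit


def port_problem(url):
    """Return a short reason if the URL's port cannot be parsed, else None."""
    try:
        # Accessing .port raises ValueError for non-numeric or
        # out-of-range (> 65535) ports.
        urlsplit(url).port
    except ValueError as exc:
        return str(exc)
    return None


def find_bad_urls(lines):
    """Return (line_number, url, reason) for every URL token with a bad port.

    Assumption: index fields are whitespace-separated and the original URL
    appears as a token starting with http:// or https://.
    """
    hits = []
    for line_number, line in enumerate(lines, start=1):
        for token in line.split():
            if token.startswith(("http://", "https://")):
                reason = port_problem(token)
                if reason is not None:
                    hits.append((line_number, token, reason))
    return hits
```

To check a whole index file, something like `find_bad_urls(open(path, encoding="utf-8", errors="replace"))` should work; the reported lines could then be inspected or filtered out before running `wb-manager cdx-convert` again.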

I also archived some current pages and built a test archive from the new WARC files, and everything works, so the pywb setup should be OK. Should I prepare the original files in some way?

mijho commented 2 years ago

Hi @mw0000,

Do you have a little more information on the paths where you have set up the collections on the filesystem? Judging from your output above, it looks like PyWB is configured to look in /usr/lib/python3.8/collections/... whilst your collection is in your home directory, ~/collections/...
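One quick way to confirm this is to check where the file named in the error actually exists. The sketch below is only an illustration: `find_archive_file` is a hypothetical helper, and the candidate directories you pass in are whatever you suspect on your own system.

```python
from pathlib import Path


def find_archive_file(relative_name, candidate_roots):
    """Return the first existing path for relative_name, or None.

    relative_name is the file name exactly as it appears in the index or
    error message (it may include a subdirectory component).
    candidate_roots are the directories to try; "~" is expanded.
    """
    for root in candidate_roots:
        candidate = Path(root).expanduser() / relative_name
        if candidate.is_file():
            return candidate
    return None
```

For example, passing the name from the error message together with `/usr/lib/python3.8/collections/my-web-archive/archive` and `~/collections/my-web-archive/archive` as roots would show which location, if any, holds the file. Note that the name in your error includes a directory prefix, so the file has to sit in that subdirectory of the archive folder to be found.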

If you want to convert the ARC files to WARCs, it's a simple process: they can be converted with warcio.

pip install warcio
warcio recompress <source.arc.gz> <destination.warc.gz>
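With many ARC files, the command above can be run in a loop. Here is a minimal sketch that walks a source directory and calls `warcio recompress` for each `.arc.gz` file, keeping the same subdirectory layout in the destination. The function names and the `dry_run` switch are my own additions for illustration, and it assumes the `warcio` command is on your PATH.

```python
import subprocess
from pathlib import Path


def plan_conversions(source_dir, dest_dir):
    """Return (source, destination) pairs for every *.arc.gz under source_dir."""
    source_dir = Path(source_dir)
    dest_dir = Path(dest_dir)
    plan = []
    for src in sorted(source_dir.rglob("*.arc.gz")):
        # Swap the .arc.gz suffix for .warc.gz, keep the relative subdirectory.
        new_name = src.name[: -len(".arc.gz")] + ".warc.gz"
        relative = src.relative_to(source_dir).with_name(new_name)
        plan.append((src, dest_dir / relative))
    return plan


def convert_all(source_dir, dest_dir, dry_run=True):
    """Build, and unless dry_run is set, execute the recompress commands."""
    commands = []
    for src, dst in plan_conversions(source_dir, dest_dir):
        cmd = ["warcio", "recompress", str(src), str(dst)]
        commands.append(cmd)
        if dry_run:
            print(" ".join(cmd))
        else:
            dst.parent.mkdir(parents=True, exist_ok=True)
            # check=True stops at the first file that fails to convert.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `convert_all("collections/my-web-archive/archive", "converted")` only prints the commands; pass `dry_run=False` once the list looks right. The destination name `converted` is just an example.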