webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

support converting from cdx format created by wget (was: wb-manager cdx-convert mangles CDX indexes created by Wget) #699

Open puigru opened 2 years ago

puigru commented 2 years ago

Describe the bug

Running wb-manager cdx-convert on CDX indexes created by Wget 1.20.3 produces broken CDXJ indexes.

Steps to reproduce the bug

  1. Use Wget to create a WARC and its CDX: e.g. wget --mirror --warc-file=test --warc-cdx http://motherfuckingwebsite.com/
  2. Run wb-manager cdx-convert .
  3. Look at the created CDXJ file

Expected behavior

The CDXJ file would not be mangled

Environment

Additional context

Example CDX file created by Wget:

 CDX a b a m s k r M V g u
http://motherfuckingwebsite.com/ 20220301145028 http://motherfuckingwebsite.com/ text/html 200 XVS5WBOSP2LHWRTCV7WNNFGYLB5PRJI2 - - 848 motherfuckingwebsite.warc.gz <urn:uuid:f3c7553a-f88c-4eae-81e4-9bc714fe65ef>

CDXJ file created by wb-manager:

com,motherfuckingwebsite)/ 20220301145028 {"url":"http://motherfuckingwebsite.com/","mime":"text/html","status":"200","digest":"XVS5WBOSP2LHWRTCV7WNNFGYLB5PRJI2","length":"848","offset":"motherfuckingwebsite.warc.gz","filename":"<urn:uuid:f3c7553a-f88c-4eae-81e4-9bc714fe65ef>"}

Notice how the fields have been shifted: "length" contains the compressed arc file offset (V), "offset" contains the file name (g), and "filename" contains the record URN (u). See: https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/
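For illustration, here is a minimal Python sketch of a conversion that reads the field order from the CDX header line instead of assuming fixed positions. This is not pywb's code. The names `FIELD_KEYS`, `surt`, `parse_header` and `convert_line` are hypothetical, the letter meanings are taken from the legend in the spec linked above, and the SURT key is a rough approximation without real URL canonicalization.

```python
import json
from urllib.parse import urlsplit

# Hypothetical mapping from CDX legend letters to CDXJ keys.
# Letter meanings follow the CDX legend: a = original url, m = mime type,
# s = response code, k = checksum, V = compressed arc file offset, g = file name.
FIELD_KEYS = {
    "a": "url",
    "m": "mime",
    "s": "status",
    "k": "digest",
    "V": "offset",
    "g": "filename",
}


def surt(url):
    """Rough SURT-style key: reversed host labels plus path (no canonicalization)."""
    parts = urlsplit(url)
    host = ",".join(reversed(parts.hostname.lower().split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{host}){path}"


def parse_header(line):
    """Return the list of field letters from a header such as ' CDX a b a m s k r M V g u'."""
    tokens = line.split()
    if not tokens or tokens[0] != "CDX":
        raise ValueError("not a CDX header line")
    return tokens[1:]


def convert_line(legend, line):
    """Convert one CDX line to a CDXJ line, placing each value by its legend letter."""
    values = line.rstrip("\n").split(" ")
    if len(values) != len(legend):
        raise ValueError("field count does not match the header")
    row = {}
    timestamp = None
    for letter, value in zip(legend, values):
        if letter == "b":  # b = date
            timestamp = value
        elif letter in FIELD_KEYS and value != "-":
            row[FIELD_KEYS[letter]] = value
    if timestamp is None or "url" not in row:
        raise ValueError("line has no date or url field")
    return f"{surt(row['url'])} {timestamp} {json.dumps(row, separators=(',', ':'))}"


if __name__ == "__main__":
    legend = parse_header(" CDX a b a m s k r M V g u")
    print(convert_line(
        legend,
        "http://motherfuckingwebsite.com/ 20220301145028 "
        "http://motherfuckingwebsite.com/ text/html 200 "
        "XVS5WBOSP2LHWRTCV7WNNFGYLB5PRJI2 - - 848 motherfuckingwebsite.warc.gz "
        "<urn:uuid:f3c7553a-f88c-4eae-81e4-9bc714fe65ef>",
    ))
```

With the Wget header above, "offset" receives 848 and "filename" receives motherfuckingwebsite.warc.gz. No "length" is emitted, because the Wget header has no record size field (S).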

ikreymer commented 2 years ago

Yes, unfortunately, the Wget CDX format has never been supported. cdx-convert was designed to support the CDX 2015 format you've referenced, as well as older formats, but it does not currently support the Wget format. I'll revise this as a request to support that format.