warc format - Githubissues

Natkeeran commented 6 years ago

I downloaded a website from the Internet Archive using wayback-machine-downloader, then created a WARC using warcit with the following command: warcit --fixed-dt 20100212221453 http://domainname.com /dirpath.

It did create a WARC file. I would like to index it into Solr using webarchive-discovery. When trying to do so, I get the following error:

2018-08-16 18:22:08 WARN  WARCIndexer:414 - Invalid status line: null@28005
2018-08-16 18:22:08 WARN  WARCIndexer:414 - Invalid status line: null@40193
2018-08-16 18:22:08 WARN  WARCIndexer:414 - Invalid status line: null@79054

I could not load it into AUT either.

Example warc is attached. Can WARCIT be used to convert snapshots downloaded from Internet Archive into WARC format? (Unfortunately, Internet Archive does not provide a way to download WARCs).

esports.com.warc.gz

ikreymer commented 6 years ago

Hm, it seems that AUT does not support resource records; I can let them know. It's probably also possible to generate fake response records, although that's less ideal.

But for your use case, you can also use webrecorder.io directly and enter a Wayback Machine URL. Webrecorder will detect that it's a Wayback Machine URL and should do the right thing with it. You'll then be able to download a WARC directly as well.
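For background: in the WARC format, the block of a resource record holds the raw payload, while the block of a response record starts with the full HTTP response, status line included. A tool that expects a status line in every record may therefore log warnings like the ones above when it meets resource records. The following is a minimal sketch, using only the Python standard library, of how one might check which record types a WARC contains. The function names (iter_warc_records, summarize_warc, summarize_warc_bytes, summarize_warc_file) are hypothetical, and the parser is deliberately simplified: it assumes every record carries a Content-Length header and does no validation.

```python
import gzip
import io
from collections import Counter


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed binary stream.

    Simplified parser: assumes well-formed records with a Content-Length header.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # blank line ends the WARC header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers.get("content-length", "0")))
        yield headers, block


def summarize_warc(stream):
    """Count records per WARC-Type, and how many blocks start with an HTTP status line."""
    types = Counter()
    http_blocks = Counter()
    for headers, block in iter_warc_records(stream):
        rec_type = headers.get("warc-type", "unknown")
        types[rec_type] += 1
        if block.startswith(b"HTTP/"):
            http_blocks[rec_type] += 1
    return types, http_blocks


def summarize_warc_bytes(data):
    """Convenience wrapper for in-memory, uncompressed WARC bytes."""
    return summarize_warc(io.BytesIO(data))


def summarize_warc_file(path):
    """Summarize a .warc or .warc.gz file (gzip handles per-record members)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return summarize_warc(stream)
```

Calling summarize_warc_file("esports.com.warc.gz") on the attached file would show whether it consists of resource records; if so, none of those blocks would be expected to begin with "HTTP/".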

Natkeeran commented 6 years ago

@ikreymer

Thank you for looking into this issue.

Can you please provide some additional background on resource record support? Is this related to how they are implementing/using the WARC standard?

I tried providing this URL to webrecorder.io: http://web.archive.org/web/20071016060747/http://eelamsports.com:80/. It seems to download just the home page. I need the full snapshot/site to be downloaded.

webrecorder / warcit

warc format #11