I want to create a .wacz from somewhat irregular collections of HTML/CSS/PDF files. To do so, I've decided to first shove these documents into a .warc using warcit, and then run wacz create on that.
Here's an illustration of a collection's contents:
Notably, there's no index.html in the collection. And there probably never was, since I'm not sure if these files were ever hosted at a live URL.
Not much of a problem, since there are always one or more *_Index.html files. I can pass these along to wacz create --pages, because they serve as nice entrypoints:
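A pages.jsonl of this kind could be generated roughly as follows. This is a minimal sketch under assumptions, not the actual file from the collection: the helper name `build_pages_jsonl`, the sample filenames, and the way titles are derived are all hypothetical; the `http://example.archive/` prefix is taken from the error message further down; and the header line reflects my reading of the WACZ pages format (`json-pages-1.0`).

```python
import json
from urllib.parse import quote


def build_pages_jsonl(filenames, prefix="http://example.archive/"):
    """Return pages.jsonl text with one entry per *_Index.html file."""
    # Header line; the exact fields are an assumption based on the WACZ pages format.
    lines = [json.dumps({"format": "json-pages-1.0", "id": "pages", "title": "All Pages"})]
    for name in sorted(filenames):
        if not name.endswith("_Index.html"):
            continue  # only the entrypoint pages are listed
        lines.append(json.dumps({
            "url": prefix + quote(name),      # percent-encode unusual characters
            "title": name[: -len(".html")],   # hypothetical title: filename without extension
        }))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    print(build_pages_jsonl(["A_Index.html", "A_001.html", "style.css"]), end="")
```

Whether `wacz create --pages` also expects a `ts` field per entry is not something this sketch settles; it only emits `url` and `title`.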
Interestingly, however, if I then run wacz create -f mywarc.warc.gz --pages pages.jsonl, I'm greeted with the following error:
Traceback (most recent call last):
  File "/usr/bin/wacz", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.12/site-packages/wacz/main.py", line 123, in main
    value = cmd.func(cmd)
            ^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/wacz/main.py", line 264, in create_wacz
    wacz_indexer.process_all()
  File "/usr/lib/python3.12/site-packages/wacz/waczindexer.py", line 112, in process_all
    raise ValueError(
ValueError: ts None not found in index with http://example.archive/
This error goes away if I manually create an index.html file.
I could easily do so (say, by renaming one of the existing *_Index.html files to index.html), but that brings a few downsides. I would prefer to simply provide pages that serve as entrypoints into the .wacz file by supplying them with --pages pages.jsonl. That way, everything also looks nice and clean on ReplayWeb.page.
Would it be possible to remove the assumption that an index.html file should always exist?
FTR, if I generate a stub index.html and remove it from the .wacz (i.e. edit pages.jsonl, recalculate checksums, etc.) after running wacz create, the resulting .wacz file will display on ReplayWeb.page just fine.
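Two pieces of that manual workaround can be sketched in Python. This is only an illustration under assumptions: the helper names `drop_page` and `sha256_digest` are hypothetical, the stub URL is a guess based on the prefix in the error above, and the `sha256:<hex>` digest form is what I understand datapackage.json to use. Repacking the archive and updating the other files in it is not covered.

```python
import hashlib
import json


def drop_page(pages_jsonl: str, url: str) -> str:
    """Remove every page entry whose "url" equals `url`; the header line is kept."""
    kept = []
    for line in pages_jsonl.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if json.loads(line).get("url") == url:
            continue  # this is the stub entry to drop
        kept.append(line)
    return "\n".join(kept) + "\n"


def sha256_digest(data: bytes) -> str:
    """Digest string in "sha256:<hex>" form (assumed datapackage.json style)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    # Hypothetical pages.jsonl content, for illustration only.
    pages = (
        '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}\n'
        '{"url": "http://example.archive/index.html"}\n'
        '{"url": "http://example.archive/A_Index.html"}\n'
    )
    cleaned = drop_page(pages, "http://example.archive/index.html")
    print(cleaned, end="")
    print(sha256_digest(cleaned.encode("utf-8")))
```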