webrecorder / py-wacz

MIT License

py-wacz fails without an `index.html` file #44

Open rien333 opened 3 weeks ago

rien333 commented 3 weeks ago

I want to create a .wacz from somewhat irregular collections of HTML/CSS/PDF files. To do so, I've decided to first shove these documents into a .warc using warcit, and then run wacz create on that.

Here's an illustration of a collection's contents:

$ ls BP00009/
t_BP00009_Index.html
t_BP00009.html
r_BP00009_Index.html
r_BP00009.html
b_BP00009_0000Bijlagenbijdereg.html
BP00009.css
b_BP00009_0002.pdf
i_BP00009_402022.png

Notably, there's no index.html in the collection. And there probably never was, since I'm not sure if these files were ever hosted at a live URL.

Not much of a problem, since there are always one or more *_Index.html files. I can pass these along to wacz create --pages, because they serve as nice entrypoints:

{"format":"json-pages-1.0","id":"pages","title":"Seed Pages","hasText":false}
{"id":"38567fe1-...","url":"http://example.archive/r_BP00009_Index.html", ...,"mime":"text/html", ...}
{"id":"858f7661-...","url":"http://example.archive/t_BP00009_Index.html", ...,"mime":"text/html", ...}

Interestingly, however, if I then run wacz create -f mywarc.warc.gz --pages pages.jsonl, I'm greeted with the following error:

Traceback (most recent call last):
  File "/usr/bin/wacz", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.12/site-packages/wacz/main.py", line 123, in main
    value = cmd.func(cmd)
            ^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/wacz/main.py", line 264, in create_wacz
    wacz_indexer.process_all()
  File "/usr/lib/python3.12/site-packages/wacz/waczindexer.py", line 112, in process_all
    raise ValueError(
ValueError: ts None not found in index with http://example.archive/

This error goes away if I manually create an index.html file.

I could easily do so (say, by renaming one of the existing *_Index.html files to index.html), but that brings a few downsides. I would prefer to provide the pages that serve as entrypoints into the .wacz file via --pages pages.jsonl. That way, everything also looks nice and clean on ReplayWeb.page.

Would it be possible to remove the assumption that an index.html file should always exist?

rien333 commented 3 weeks ago

For the record: if I generate a stub index.html, run wacz create, and then remove the stub from the resulting .wacz (i.e. edit pages.jsonl, recalculate checksums, etc.), the .wacz displays on ReplayWeb.page just fine.
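
For anyone who wants to try the same workaround, here is a rough sketch of the stub-generation step only, run on the collection directory before the WARC is built. The helper name `write_stub_index` is hypothetical, and the later removal of the stub from the .wacz (editing pages.jsonl, recalculating checksums) is not covered.

```python
from html import escape
from pathlib import Path


def write_stub_index(collection_dir):
    """Create a stub index.html linking to each *_Index.html file.

    Hypothetical helper; an existing index.html is left untouched.
    """
    root = Path(collection_dir)
    target = root / "index.html"
    if target.exists():
        return target

    # One list item per real entrypoint, in a stable order.
    links = "\n".join(
        f'<li><a href="{escape(p.name)}">{escape(p.name)}</a></li>'
        for p in sorted(root.glob("*_Index.html"))
    )
    target.write_text(
        f"<!DOCTYPE html>\n<html><body><ul>\n{links}\n</ul></body></html>\n",
        encoding="utf-8",
    )
    return target
```

Linking to the real entrypoints rather than renaming one of them keeps the original filenames intact in the archive.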