webrecorder / warcit

Convert Directories, Files and ZIP Files to Web Archives (WARC)
https://pypi.python.org/pypi/warcit
Apache License 2.0

Slash (/) missing in warcit output #26

Open DHKaplan opened 11 months ago

DHKaplan commented 11 months ago

I needed a small WARC file for testing, so I took a regular wget download, picked a few files that link to each other, and used warcit to create the WARC file. When I opened it in ReplayWeb.page, no pages were visible. I then opened the WARC file in an ASCII editor and found that the "/" was not being inserted after the domain name. Please see https://forum.webrecorder.net/t/warcit-not-putting-a-before-the-file-name/413 for more information.

despens commented 11 months ago

Hey @DHKaplan, you need to enter the exact URL prefix you want when running warcit. For instance:

warcit http://www.wticalumni.com/ my-local-folder

The prefix could be anything, for instance something like:

warcit 'http://mydomain.com/query?q=' my-local-folder

Because the tool is this flexible about the prefix, you have to give the exact URL prefix you want, including any trailing characters.
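To illustrate why the exact prefix matters, here is a minimal sketch. It assumes the prefix and the file's path relative to the folder are joined by plain string concatenation. `target_uri` and `page.html` are hypothetical names used only for illustration; this is not warcit's actual code.

```python
def target_uri(prefix: str, relative_path: str) -> str:
    """Hypothetical helper: build a target URI from a prefix and a file path.

    Assumption: the two parts are concatenated as-is, with no separator
    added, so the prefix must already end the way you want it to.
    """
    return prefix + relative_path


# Prefix ending in "/" gives an ordinary page URL.
print(target_uri("http://www.wticalumni.com/", "page.html"))
# -> http://www.wticalumni.com/page.html

# A query-style prefix is kept verbatim as well.
print(target_uri("http://mydomain.com/query?q=", "page.html"))
# -> http://mydomain.com/query?q=page.html

# Without the trailing "/", host name and file name run together.
print(target_uri("https://www.wticalumni.com", "page.html"))
# -> https://www.wticalumni.compage.html
```

Under that assumption, nothing can safely add a "/" on your behalf, because a prefix like the query-style one above would be broken by it.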

DHKaplan commented 11 months ago

@despens The folder that contains my HTML is www.wticalumni.com and the command I am using is:

warcit https://www.wticalumni.com ./www.wticalumni.com/

I get no pages found. When I open the .gz file in an ASCII editor, I see:

WARC/1.0
WARC-Date: 2023-05-10T18:36:00Z
WARC-Source-URI: file://./www.wticalumni.com/events.htm
WARC-Creation-Date: 2023-08-26T17:00:45Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:e1377996-9417-4ddb-8af8-19dc44972209>
WARC-Target-URI: https://www.wticalumni.comevents.htm
WARC-Payload-Digest: sha1:AP4CVEJE4OHSPK24OURQRPDOHKP2LWOA
WARC-Block-Digest: sha1:AP4CVEJE4OHSPK24OURQRPDOHKP2LWOA
Content-Type: text/html
Content-Length: 10002

Note that the Source-URI line is WARC-Source-URI: file://./www.wticalumni.com/events.htm, while the Target-URI line is WARC-Target-URI: https://www.wticalumni.comevents.htm. There is no slash before the file name in the Target-URI.

I really appreciate your reply, but I can't see what I am doing wrong.

despens commented 11 months ago

Hi @DHKaplan, you just need to include the trailing / character in the URL prefix in the command:

warcit https://www.wticalumni.com/ ./www.wticalumni.com/
                                 ^
                                 |
                             important
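After rerunning the command, the WARC-Target-URI headers can be checked without opening the file in an editor. The following is a rough sketch using only the Python standard library. The function names and the file name `my-archive.warc.gz` are hypothetical. It assumes a gzip-compressed WARC and uses a simple heuristic: a target URI with no path and no query is flagged as suspicious, which would also flag a legitimate bare-domain URI.

```python
import gzip
from urllib.parse import urlsplit


def target_uris(lines):
    """Yield the value of every WARC-Target-URI header in an iterable of bytes lines."""
    for raw in lines:
        # Simple line scan; a payload line that happens to start with this
        # header name would be picked up too (unlikely, but possible).
        if raw.startswith(b"WARC-Target-URI:"):
            yield raw.split(b":", 1)[1].strip().decode("utf-8", "replace")


def missing_path(uri):
    """Return True when the URI has neither a path nor a query string."""
    parts = urlsplit(uri)
    return parts.path == "" and parts.query == ""


def check_warc(path):
    """Return the target URIs in a gzipped WARC that look like host+file fused together."""
    with gzip.open(path, "rb") as fh:
        return [uri for uri in target_uris(fh) if missing_path(uri)]


# Example call (hypothetical file name):
# print(check_warc("my-archive.warc.gz"))
```

For the record quoted earlier in this thread, `https://www.wticalumni.comevents.htm` has an empty path, so it would be reported, while `https://www.wticalumni.com/events.htm` would not.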