webrecorder / webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Apache License 2.0

Webrecorder Player and warcit: no pages to browse #96

Closed (nvanderperren closed 4 years ago)

nvanderperren commented 4 years ago

I used warcit to create a web archive of a website that was mirrored with HTTrack years ago. When I load it into Webrecorder Player, I get this page and no pages are shown.

[Screenshot 2020-04-17 at 15 14 36]

An extract of the WARC-file:

WARC/1.0
WARC-Type: warcinfo
WARC-Record-ID: <urn:uuid:e9b35b70-a4f7-4397-8711-1be6fec30cce>
WARC-Filename: www.koophandeltongeren.be.warc.gz
WARC-Date: 2020-04-17T13:25:13Z
Content-Type: application/warc-fields
Content-Length: 129

software: warcit 0.3.1
format: WARC File Format 1.0
cmdline: warcit www.koophandeltongeren.be/ ././www.koophandeltongeren.be/

WARC/1.0
WARC-Date: 2015-01-01T11:07:45Z
WARC-Source-URI: file://././www.koophandeltongeren.be/index.html
WARC-Creation-Date: 2020-04-17T13:25:13Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:00928f4d-b6c5-45c2-ab51-2cba2da63250>
WARC-Target-URI: www.koophandeltongeren.be/index.html
WARC-Payload-Digest: sha1:PVVOCULAPFRGDKD7Z2UMTU3VHJJKFSOB
WARC-Block-Digest: sha1:PVVOCULAPFRGDKD7Z2UMTU3VHJJKFSOB
Content-Type: text/html
Content-Length: 2567

<HTML>

<!-- Mirrored from www.koophandeltongeren.be/ by HTTrack Website Copier/3.x [XR&CO'2014], Thu, 13 Oct 2016 08:20:56 GMT -->
<HEAD><TITLE>index</TITLE>
<META name=description content="rechtbank van koophandel Antwerpen, afdeling Tongeren">
<META name=keywords content="juridat,justitie,fgov,rechtbanken,koophandel,ondernemingen,rechters,handelsrechters,juristen advocaten,griffie,faillissementen,curatoren,falingen,procesverloop,vennootschappen,bvba,nv,vzw">
<META name=author content=W.Bours>
<META content="text/html; charset=ISO-8859-1" http-equiv=content-type>
<META content=0 http-equiv=Expires>
<META name="generator" content="One.com WebCreator">
<STYLE type=text/css>
<!--

table { background-color:#000000; position: relative; }  

.titel { font-family: Vivaldi, Arial, Sans-serif; font-size:40px; color: #FFFFFF; font-style:italic; letter-spacing:3px; vertical-align: middle; text-align: center; }

BODY, TD, P {font-size: 12pt; font-family: times new roman, times, serif}
H1 {font-size: 26pt; font-family: arial, helvetica, sans-serif; font-weight: normal}
H2 {font-size: 18pt; font-family: arial, helvetica, sans-serif; font-weight: normal}
H3 {font-size: 16pt; font-weight: normal}
H4 {font-size: 14pt; font-weight: normal}
A {text-decoration: none}
A:hover {text-decoration: underline}
-->
</STYLE>
</HEAD>
<BODY style="BACKGROUND-COLOR: #000000">
<TABLE width=950 align=center>
<TBODY>
<TR>
<TD>
<DIV class=titel>Rechtbank van koophandel Antwerpen,<BR>afdeling Tongeren.</DIV></TD></TR><BR></TBODY></TABLE>
<TABLE width=950 align=center>
<TBODY>
<TR>
<TD>
<DIV style="TEXT-ALIGN: center"><BR></DIV>
<DIV style="WIDTH: 460px; MARGIN-LEFT: 290px; VISIBILITY: visible"><EMBED style="HEIGHT: 280px; WIDTH: 368px" name="Fancy Trans. 2" type=application/x-shockwave-flash align=middle height=350 width=460 src=http://flash.picturetrail.com/pflicks/3/spflick.swf  quality="high" FlashVars="ql=2&amp;src1=http://pic100.picturetrail.com:80/VOL1126/13486475/flicks/1/8785918" bgcolor="#000000" allowScriptAccess="sameDomain"> </EMBED>
<P style="HEIGHT: 24px; WIDTH: 460px; MARGIN-TOP: 10px; whitespace: no-wrap"></A></P></DIV>
<STYLE type=text/css>
.result9307671 {
    text-align: left;
  border:0;
  }
</STYLE>
<P style="TEXT-ALIGN: center"><A href="html/menu.html"><IMG onmouseover="this.src = 'afbeeldingen/button9307671.png'" onmouseout="this.src = 'afbeeldingen/button9307671.png'" class=result9307671 src="afbeeldingen/button9307671.png" width=150 height=70></A></P></TD></TR></TBODY></TABLE><BR></BODY>

WARC/1.0
WARC-Type: revisit
WARC-Record-ID: <urn:uuid:681041cd-f202-44f0-867e-335d96fc4ee8>
WARC-Target-URI: www.koophandeltongeren.be/
WARC-Date: 2015-01-01T11:07:45Z
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Refers-To-Target-URI: www.koophandeltongeren.be/index.html
WARC-Refers-To-Date: 2015-01-01T11:07:45Z
WARC-Payload-Digest: sha1:PVVOCULAPFRGDKD7Z2UMTU3VHJJKFSOB
WARC-Creation-Date: 2020-04-17T13:25:13Z
WARC-Source-URI: file://././www.koophandeltongeren.be/index.html
WARC-Block-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
Content-Type: application/http; msgtype=response
Content-Length: 0

I've also tried it with a download of a Facebook page, which gave the same result.

ikreymer commented 4 years ago

Thanks for including the detailed info from the WARC. I think the main issue is that the resulting URLs in the WARC are invalid because they are missing the scheme (http:// or https://).

I've added an issue to automatically detect/fix this in warcit. For now, you can rerun warcit with the scheme included in the URL prefix, for example:

warcit https://www.koophandeltongeren.be/ ././www.koophandeltongeren.be/

instead of

warcit www.koophandeltongeren.be/ ././www.koophandeltongeren.be/

I think that might fix it.
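
To check an existing WARC for the same problem, the header text can be scanned for target URIs that lack a scheme. This is only a minimal sketch using the Python standard library: the helper names `has_http_scheme` and `schemeless_targets` are hypothetical and not part of warcit or Webrecorder Player, and it assumes the WARC has already been decompressed to text (for example with `gunzip`).

```python
import re
from typing import List

# Matches the value of every WARC-Target-URI header line.
# Hypothetical helper for illustration; not part of warcit.
TARGET_URI = re.compile(r"^WARC-Target-URI:\s*(\S+)\s*$", re.MULTILINE)


def has_http_scheme(uri: str) -> bool:
    """Return True if the URI starts with http:// or https://."""
    # Some WARC writers wrap the URI in angle brackets; strip them first.
    return uri.strip("<>").lower().startswith(("http://", "https://"))


def schemeless_targets(warc_text: str) -> List[str]:
    """List the WARC-Target-URI values that are missing a scheme."""
    return [u for u in TARGET_URI.findall(warc_text) if not has_http_scheme(u)]


# Two abbreviated record headers: one as produced in this issue,
# one as produced when warcit is given the https:// prefix.
sample = """WARC/1.0
WARC-Type: resource
WARC-Target-URI: www.koophandeltongeren.be/index.html

WARC/1.0
WARC-Type: resource
WARC-Target-URI: https://www.koophandeltongeren.be/index.html
"""

print(schemeless_targets(sample))
```

For the sample above, only the first record's URI is reported, since the second one carries the https:// scheme.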

nvanderperren commented 4 years ago

oh! 🤦 that solved the problem indeed! Thanks for your clear answer!