webrecorder / warcio.js

JS Streaming WARC IO optimized for Browser and Node
MIT License

Capturing a URL - unable to replay in ReplayWeb.page #76

Closed ktorn closed 2 weeks ago

ktorn commented 2 weeks ago

Hi,

I'm trying to test the basic scenario of capturing a URL, and save it as a WARC file.

The code below mostly works: the file is created (see gist), but when I try to open the WARC file in ReplayWeb.page it says "Archived Page Not Found".

Any ideas how to troubleshoot this?

Also, a quick follow-up question: is there a way to automatically fetch all the assets used by the page (CSS, JS, etc.)? My end goal is to fully capture a code-based artwork (like this one), including network requests initiated by the piece.

My code:

const { WARCRecord, WARCSerializer } = require('warcio');
const fs = require('fs');

async function captureWebPage(url, warcFilename) {

    const { default: fetch } = await import('node-fetch');

    const warcStream = fs.createWriteStream(warcFilename);

    const response = await fetch(url);

    const warcInfo = await WARCRecord.createWARCInfo(
        {
            filename: warcFilename,
            warcVersion: 'WARC/1.1',
        },
        {
            software: 'warcio.js in node',
            description: `WARC file containing capture of ${url}`,
        }
    );

    const serializedWARCInfo = await WARCSerializer.serialize(warcInfo);
    warcStream.write(serializedWARCInfo);

    const warcRecord = await WARCRecord.create(
        {
            url,
            date: new Date().toISOString(),
            type: 'response',
            warcVersion: 'WARC/1.1',
            httpHeaders: response.headers.raw(),
        },
        response.body
    );

    const serializedRecord = await WARCSerializer.serialize(warcRecord);
    warcStream.write(serializedRecord);

    warcStream.end();

    console.log(`Captured WARC for ${url} and saved to ${warcFilename}`);
}

captureWebPage('https://arkivo.art', 'arkivo.warc');
ikreymer commented 2 weeks ago

You are missing the trailing slash, so it's not a full URL. If you do:

captureWebPage('https://arkivo.art/', 'arkivo.warc');

It will work. (Note that copying from the gist will not work because it has the \r characters stripped out.)

Also, a quick follow-up question: is there a way to automatically fetch all the assets used by the page (CSS, JS, etc.)? My end goal is to fully capture a code-based artwork (like this one), including network requests initiated by the piece.

This library is designed to be a low-level tool for writing WARC files. To do what you want, you need to capture through the browser. We have several tools that do this.

For interactive artworks, the extension would be your best option, since you can interact with the browser and have all of the network traffic be captured. For more help / discussion, check out our forum at https://forum.webrecorder.net/

ktorn commented 2 weeks ago

It worked, thanks!

I need to do this programmatically, from a Node app, but I will investigate Browsertrix, thanks!

ikreymer commented 2 weeks ago

Also added a fix in #77 that will add a trailing slash if it's missing. Closing this as the original issue is answered.