webrecorder / warcio.js

JS Streaming WARC IO optimized for Browser and Node
MIT License

Capturing a URL - unable to replay in ReplayWeb.page #76

Closed ktorn closed 2 weeks ago

ktorn commented 2 weeks ago

Hi,

I'm trying to test the basic scenario of capturing a URL, and save it as a WARC file.

The code below mostly works: the file is created (see gist), but when I try to open the WARC file in ReplayWeb.page it says "Archived Page Not Found".

Any ideas how to troubleshoot this?

Also, a quick follow-up question: is there a way to automatically fetch all the assets used by the page (CSS, JS, etc.)? My end goal is to fully capture a code-based artwork (like this one), including network requests initiated by the piece.

My code:

const { WARCRecord, WARCSerializer } = require('warcio');
const fs = require('fs');

async function captureWebPage(url, warcFilename) {

    const { default: fetch } = await import('node-fetch');

    const warcStream = fs.createWriteStream(warcFilename);

    const response = await fetch(url);

    const warcInfo = await WARCRecord.createWARCInfo(
        {
            filename: warcFilename,
            warcVersion: 'WARC/1.1',
        },
        {
            software: 'warcio.js in node',
            description: `WARC file containing capture of ${url}`,
        }
    );

    const serializedWARCInfo = await WARCSerializer.serialize(warcInfo);
    warcStream.write(serializedWARCInfo);

    const warcRecord = await WARCRecord.create(
        {
            url,
            date: new Date().toISOString(),
            type: 'response',
            warcVersion: 'WARC/1.1',
            httpHeaders: response.headers.raw(),
        },
        response.body
    );

    const serializedRecord = await WARCSerializer.serialize(warcRecord);
    warcStream.write(serializedRecord);

    warcStream.end();

    console.log(`Captured WARC for ${url} and saved to ${warcFilename}`);
}

captureWebPage('https://arkivo.art', 'arkivo.warc');
ikreymer commented 2 weeks ago

You are missing the trailing slash, so it's not a full URL. If you do:

captureWebPage('https://arkivo.art/', 'arkivo.warc');

It will work. (Note that copying from the gist will not work because it has the \r characters stripped out.)

Also, a quick follow-up question: is there a way to automatically fetch all the assets used by the page (CSS, JS, etc.)? My end goal is to fully capture a code-based artwork (like this one), including network requests initiated by the piece.

This library is designed to be a low-level tool for writing WARC files. To do what you want, you need to capture through the browser. We have several tools that do this.

For interactive artworks, the extension would be your best option, since you can interact with the browser and have all of the network traffic be captured. For more help / discussion, check out our forum at https://forum.webrecorder.net/

ktorn commented 2 weeks ago

It worked, thanks!

I need to do this programmatically, from a Node app, but I will investigate Browsertrix, thanks!

ikreymer commented 2 weeks ago

Also added a fix in #77 that will add a trailing slash if it's missing. Closing this as the original issue is answered.