webrecorder / warcio.js

JS Streaming WARC IO optimized for Browser and Node
MIT License

New WARCSerializer: serialize records with buffering to temp file without reading fully into memory #59

Closed ikreymer closed 11 months ago

ikreymer commented 11 months ago

To compute the digests, the current WARCSerializer must read the entire WARC payload into memory, which is less than ideal for records with large payloads.

This PR replaces the WARCSerializer with a new version that uses hash-wasm for cross-platform incremental digest computation. Since computing the digests requires reading the data twice (once to compute the digests, which go in the record headers, and once to write the payload itself), it also includes an external buffer with a write() method and a readAll() async iterator that reads the data back.
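The two-pass flow described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `MemBuffer` and `digestThenReplay` are hypothetical names, and Node's built-in `node:crypto` hash is swapped in for hash-wasm so the snippet runs without extra dependencies.

```javascript
import { createHash } from "node:crypto";

// Hypothetical in-memory buffer with the write()/readAll() shape described above.
class MemBuffer {
  constructor() {
    this.chunks = [];
  }

  // Store one chunk of payload data.
  write(chunk) {
    this.chunks.push(chunk);
  }

  // Replay the stored chunks in the order they were written.
  async *readAll() {
    yield* this.chunks;
  }
}

// First pass: consume the source once, updating the hash incrementally
// and handing every chunk to the buffer.
// Second pass: the caller iterates `payload` to read the data back,
// by which point the digest is already known.
async function digestThenReplay(source, buffer = new MemBuffer()) {
  const hash = createHash("sha256"); // assumption: stand-in for hash-wasm

  for await (const chunk of source) {
    hash.update(chunk);
    buffer.write(chunk);
  }

  const digest = "sha256:" + hash.digest("hex");
  return { digest, payload: buffer.readAll() };
}
```

The point of the split is that the source stream is consumed only once, while the buffer decides where the bytes live until the second pass.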

The default implementation provides an in-memory SerializerInMemBuffer, which still buffers data in memory. The BaseSerializerBuffer base class can be extended to provide custom buffering as well.

In Node, warcio/node provides a version of WARCSerializer that buffers data to temp files on disk.
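As a rough idea of what a disk-backed buffer could look like, here is a minimal sketch. It is not the implementation in warcio/node: `SpillBuffer`, its `purge()` cleanup helper, and the temp file naming are all hypothetical, and it uses synchronous file writes purely to keep the example short.

```javascript
import { appendFileSync, createReadStream, mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical buffer: keeps chunks in memory up to maxMemSize bytes,
// then appends every later chunk to a temp file so order is preserved.
class SpillBuffer {
  constructor(maxMemSize = 16384) {
    this.maxMemSize = maxMemSize;
    this.memChunks = [];
    this.memSize = 0;
    this.tmpPath = null;
  }

  write(chunk) {
    // Stay in memory only while nothing has spilled and the chunk still fits.
    if (!this.tmpPath && this.memSize + chunk.length <= this.maxMemSize) {
      this.memChunks.push(chunk);
      this.memSize += chunk.length;
      return;
    }
    if (!this.tmpPath) {
      const dir = mkdtempSync(join(tmpdir(), "warc-buf-"));
      this.tmpPath = join(dir, "payload.tmp");
    }
    appendFileSync(this.tmpPath, chunk);
  }

  // Replay the in-memory chunks first, then stream the spilled data from disk.
  async *readAll() {
    yield* this.memChunks;
    if (this.tmpPath) {
      yield* createReadStream(this.tmpPath);
    }
  }

  // Drop buffered data and delete the temp directory, if one was created.
  purge() {
    this.memChunks = [];
    this.memSize = 0;
    if (this.tmpPath) {
      rmSync(dirname(this.tmpPath), { recursive: true, force: true });
      this.tmpPath = null;
    }
  }
}
```

With a threshold like this, small payloads never touch the disk, and only large ones pay the cost of a temp file.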

This version supports the following usage (added to the README) for streaming large files in Node and writing them to WARC without buffering them fully in memory:

import fs from "node:fs";
import { pipeline } from "node:stream/promises";
import { Readable } from "node:stream";

import { WARCRecord } from "warcio";
import { WARCSerializer } from "warcio/node";

async function fetchAndWrite(url, warcOutputStream) {
  const resp = await fetch(url);

  const record = await WARCRecord.create({type: "response", url}, resp.body);

  // set the max data per WARC payload that can be buffered in memory to 16K;
  // payloads larger than that are automatically buffered to a temporary file
  const serializer = new WARCSerializer(record, {gzip: true, maxMemSize: 16384});

  await pipeline(Readable.from(serializer), warcOutputStream, {end: false});
}

async function main() {
  const outputFile = fs.createWriteStream("test.warc.gz");

  await fetchAndWrite("https://example.com/some/large/file1.bin", outputFile);

  await fetchAndWrite("https://example.com/another/large/file2", outputFile);

  outputFile.close();
}

main();
ikreymer commented 11 months ago

Some questions:

ikreymer commented 11 months ago

It doesn't seem like the methods/signatures have changed really from the old WARCSerializer to the new WARCSerializer for the simple use case of writing WARC files while buffering in memory. If that is the case, maybe better to just remove/replace it altogether rather than deprecating it under a new name?

Yes, that's a good point. I don't really see a reason to keep the old version: it would still require migration, and if there are issues, one might as well use an older version of warcio.js.