Closed ikreymer closed 11 months ago
Some questions:
Does `import { WARCSerializer } from "warcio/node"` make sense? (It's the same serializer with temp file support. I tried to make it as simple as possible to use, but also explicit, without doing compile-time replacement.) Open to other suggestions.
It doesn't seem like the methods/signatures have really changed from the old WARCSerializer to the new WARCSerializer for the simple use case of writing WARC files while buffering in memory. If that is the case, maybe it's better to just remove/replace it altogether rather than deprecating it under a new name?
Yes, that's a good point. I don't really see a reason to keep the old version: it still requires migration, and if there are issues, one might as well use an older version of warcio.js.
To compute the digests, the current WARCSerializer must read the entire WARC payload into memory, which is less than ideal for records with large payloads.
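As a rough illustration of the difference, the sketch below hashes a payload one chunk at a time, so only the current chunk has to be in memory, and compares that with hashing a fully buffered copy. It is not code from this PR: it uses Node's built-in `crypto` module rather than the library the PR adopts, and `incrementalDigest`, `wholeDigest`, and `payload` are hypothetical names made up for the example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper, not part of warcio.js: feeds the hash one chunk
// at a time, so the full payload never has to be held in memory.
async function incrementalDigest(
  chunks: AsyncIterable<Uint8Array>,
  algorithm = "sha256",
): Promise<string> {
  const hash = createHash(algorithm);
  for await (const chunk of chunks) {
    hash.update(chunk);
  }
  return hash.digest("hex");
}

// Baseline for comparison: needs the entire payload as a single buffer.
function wholeDigest(data: Uint8Array, algorithm = "sha256"): string {
  return createHash(algorithm).update(data).digest("hex");
}

// Stand-in for a large streamed payload (assumption for the example).
async function* payload(): AsyncGenerator<Uint8Array> {
  const encoder = new TextEncoder();
  yield encoder.encode("hello ");
  yield encoder.encode("world");
}
```

Both approaches produce the same digest for the same bytes; only the peak memory use differs.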
This PR replaces the WARCSerializer with a new version that uses hash-wasm for cross-platform incremental digest computation. Since computing the digest requires reading the data twice (once to compute the digest, once to write the record), it also includes an external buffer that provides a write() method and a readAll() async iterator, which reads the data back.
The default implementation provides an in-memory `SerializerInMemBuffer()`, which still buffers data in memory. A BaseSerializerBuffer base class can be extended to provide custom buffering functionality as well.
In Node, a version of WARCSerializer from `warcio/node` provides a serializer which buffers data to temp files on disk.
This version supports the following usage (added to README) for streaming large files in Node and writing them to WARC, without having to buffer fully in memory. The