webrecorder / warcio.js

JS Streaming WARC IO optimized for Browser and Node
MIT License

Puppeteer Example #21

Closed jlarmstrongiv closed 3 years ago

jlarmstrongiv commented 3 years ago

Would love an example for generating a warc via puppeteer 😄

jlarmstrongiv commented 3 years ago

It appears capture methods are explicitly off the roadmap. It is certainly possible to create a custom writer that follows the example in the README to work with puppeteer.
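As a rough, dependency-free sketch of what such a custom writer has to produce, the function below builds one uncompressed WARC `response` record from values that Puppeteer exposes on an `HTTPResponse` (`response.url()`, `response.status()`, `response.statusText()`, `response.headers()`, and `await response.buffer()`). The name `buildResponseRecord` and its parameter shape are hypothetical, it does not use the warcio.js API, and it leaves out digests and gzip. It also assumes the body has already been decoded, so it drops `content-encoding`, `transfer-encoding`, and the original `content-length`, then recomputes the length.

```javascript
// Hypothetical helper: builds one uncompressed WARC/1.1 "response" record.
// It hand-rolls the format for illustration and is NOT the warcio.js API.
const CRLF = '\r\n';
const encoder = new TextEncoder();

// Concatenate several Uint8Arrays into one.
function concatBytes(chunks) {
  const total = chunks.reduce((sum, c) => sum + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}

function buildResponseRecord({ url, status, statusText, headers, body, date, recordId }) {
  // Assumption: body is already decoded, so transfer-level headers no longer apply.
  const skip = new Set(['content-encoding', 'transfer-encoding', 'content-length']);
  const httpLines = [`HTTP/1.1 ${status} ${statusText}`];
  for (const [name, value] of Object.entries(headers)) {
    if (skip.has(name.toLowerCase())) continue;
    // Defensive: a value containing newlines would corrupt the header block.
    for (const v of String(value).split('\n')) httpLines.push(`${name}: ${v}`);
  }
  httpLines.push(`content-length: ${body.length}`);

  // The WARC block is the HTTP status line and headers, a blank line, then the body.
  const block = concatBytes([encoder.encode(httpLines.join(CRLF) + CRLF + CRLF), body]);

  const warcLines = [
    'WARC/1.1',
    'WARC-Type: response',
    `WARC-Target-URI: ${url}`,
    `WARC-Date: ${date}`,
    `WARC-Record-ID: <${recordId}>`,
    'Content-Type: application/http; msgtype=response',
    // Content-Length counts bytes of the block, not characters.
    `Content-Length: ${block.length}`,
  ];

  // A record ends with two CRLFs after the block.
  return concatBytes([
    encoder.encode(warcLines.join(CRLF) + CRLF + CRLF),
    block,
    encoder.encode(CRLF + CRLF),
  ]);
}

// Stand-in values; with Puppeteer these would come from an HTTPResponse.
const sample = buildResponseRecord({
  url: 'http://example.com/',
  status: 200,
  statusText: 'OK',
  headers: { 'content-type': 'text/plain; charset=utf-8', 'content-encoding': 'gzip' },
  body: encoder.encode('héllo'),
  date: '2000-01-01T00:00:00Z',
  recordId: 'urn:uuid:00000000-0000-0000-0000-000000000000',
});
```

The returned bytes could be appended to a `.warc` file as they are. A real writer would more likely hand the same values to the record-creation call shown in the README instead of formatting headers by hand.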

There are two main ways to read/write WARC files in Node.js: warcio.js and node-warc.

Between these two options, there will be differences in the metadata. For example, warcio.js automatically includes the content (payload) and block hashes.

However, neither of them entirely passes WARC validation tests (from warcat and warctools). Both have content-length mismatches, though warcio.js fails more tests.
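A content-length mismatch of this kind can be checked without a full validator. The following is a minimal sketch, assuming a single uncompressed record held in memory; `checkContentLength` and `indexOfBytes` are hypothetical helpers, not part of warcio.js, node-warc, warcat, or warctools. It compares the declared `Content-Length` with the number of bytes between the header terminator and the two CRLFs that close the record.

```javascript
// Hypothetical helpers: verify the WARC-level Content-Length of one
// uncompressed record. Not part of any library discussed in this thread.
function indexOfBytes(haystack, needle, from = 0) {
  outer: for (let i = from; i <= haystack.length - needle.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return i;
  }
  return -1;
}

function checkContentLength(record) {
  const CRLFCRLF = new Uint8Array([13, 10, 13, 10]);
  const headerEnd = indexOfBytes(record, CRLFCRLF);
  if (headerEnd === -1) throw new Error('no WARC header terminator found');

  // Only the header section is decoded as text; the block stays as bytes.
  const headerText = new TextDecoder().decode(record.subarray(0, headerEnd));
  const match = /^Content-Length:\s*(\d+)\s*$/im.exec(headerText);
  if (!match) throw new Error('no Content-Length header');
  const declared = Number(match[1]);

  // Bytes after the header terminator, minus the record's closing CRLF CRLF.
  // Assumption: the record really does end with those four bytes.
  const actual = record.length - (headerEnd + 4) - 4;
  return { declared, actual, ok: declared === actual };
}

// 'héllo' is 5 characters but 6 bytes in UTF-8.
const good = new TextEncoder().encode(
  'WARC/1.1\r\nWARC-Type: resource\r\nContent-Length: 6\r\n\r\nhéllo\r\n\r\n'
);
const bad = new TextEncoder().encode(
  'WARC/1.1\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhéllo\r\n\r\n'
);

console.log(checkContentLength(good));
console.log(checkContentLength(bad));
```

The `bad` record counts characters instead of bytes, which is one plausible way such a mismatch could arise; it is not a diagnosis of either library.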

Subtle differences also mean that the Unarchiver app can only uncompress files from node-warc. However, other tools such as ReplayWeb.page can open files from both libraries.

ikreymer commented 3 years ago

> However, neither of them entirely passes WARC validation tests (from warcat and warctools). Both have content-length mismatches, though warcio.js fails more tests.
>
> Subtle differences also mean that the Unarchiver app can only uncompress files from node-warc. However, other tools such as ReplayWeb.page can open files from both libraries.

That's definitely something that should be fixed ASAP. If there are incorrect lengths, can you provide an example WARC and open a new issue?

Yes, direct support for Puppeteer is out of scope for this library, as it is supposed to be fairly low-level and also work in the browser, but I can see about adding an example.

jlarmstrongiv commented 3 years ago

> Yes, direct support for Puppeteer is out of scope for this library, as it is supposed to be fairly low-level and also work in the browser, but I can see about adding an example.

No worries, I’ll go ahead and close this issue. Thanks!