webrecorder / warcio.js

JS Streaming WARC IO optimized for Browser and Node
MIT License
30 stars 6 forks source link

Off-by-one error with `CDXIndexer.indexRecord()` when called inside a `WARCParser` iterator? #52

Closed matteocargnelutti closed 1 year ago

matteocargnelutti commented 1 year ago

As discussed w/ @ikreymer

I have noticed a strange issue when working with CDXIndexer.indexRecord() inside a WARCParser iterator: it seems that indexRecord() is off-by-one. More precisely, it will always be one record behind, and yield a CDX entry for the previous record.

This only occurs when using indexRecord() in that context.

const cdx = []
const stream = createReadStream(filename)
const parser = new WARCParser(stream)
const indexer = new CDXIndexer()

for await (const record of parser) {
  if (!indexer.filterRecord(record)) {
    continue
  }

  await record.readFully() // As discussed: necessary for indexRecord to report lengths and offsets properly

  // Off-by-one. First call will return `null`.
  const entry = indexer.indexRecord(record, parser, basename(filename))

  if (entry) {
    cdx.push(indexer.serializeCDXJ(entry))
  }
}

For comparison, the following will not be affected by this issue:

const cdx = []
const stream = createReadStream(filename)
const indexer = new CDXIndexer()

for await (const record of indexer.iterIndex([{ reader: stream, filename: basename(filename) }])) {
  cdx.push(indexer.serializeCDXJ(record))
}

Thanks in advance,

ikreymer commented 1 year ago

This is actually by design: in order to correctly index POST requests, the indexer needs to cache the previous record, and attempt to match request/response pairs. The indexRecord method therefore keeps track of the lastRecord, and processes both the request and response together. We could add an option to turn that off, but that will result in incorrect indexing when encountering POST requests, so I'm hesitant to add that option. The solution is to call the indexRecord(null) at the end, as is done here: https://github.com/webrecorder/warcio.js/blob/main/src/lib/indexer.ts#L189

But definitely we can make this more clear - maybe make indexRecord should be internal. It seems like the main reason for custom iterator is to have access to the WARC record itself, which perhaps can just be an added option?

Can also add comments to better explain this behavior.

ikreymer commented 1 year ago

A possible better option, now in the linked PR, is to add a new CDXRecordIndexer, which includes the WARC record via record and optional reqRecord (if a corresponding WARC request record exists) to the iteration, ex:

    const indexer = new CDXRecordIndexer();
    ...
    for await (const {cdx, record, reqRecord} of indexer.iterIndex(files)) {
      ...
    }

This should eliminate the need to try to call indexRecord() manually

matteocargnelutti commented 1 year ago

CDXAndRecordIndexer works great. Thanks @ikreymer !