Closed matteocargnelutti closed 1 year ago
This is actually by design: in order to correctly index POST requests, the indexer needs to cache the previous record, and attempt to match request/response pairs. The indexRecord method therefore keeps track of the lastRecord, and processes both the request and response together. We could add an option to turn that off, but that will result in incorrect indexing when encountering POST requests, so I'm hesitant to add that option. The solution is to call the indexRecord(null) at the end, as is done here: https://github.com/webrecorder/warcio.js/blob/main/src/lib/indexer.ts#L189
But definitely we can make this more clear - maybe make indexRecord should be internal. It seems like the main reason for custom iterator is to have access to the WARC record itself, which perhaps can just be an added option?
Can also add comments to better explain this behavior.
A possible better option, now in the linked PR, is to add a new CDXRecordIndexer
, which includes the WARC record via record
and optional reqRecord
(if a corresponding WARC request record exists) to the iteration, ex:
const indexer = new CDXRecordIndexer();
...
for await (const {cdx, record, reqRecord} of indexer.iterIndex(files)) {
...
}
This should eliminate the need to try to call indexRecord() manually
CDXAndRecordIndexer
works great. Thanks @ikreymer !
As discussed w/ @ikreymer
I have noticed a strange issue when working with
CDXIndexer.indexRecord()
inside aWARCParser
iterator: it seems thatindexRecord()
is off-by-one. More precisely, it will always be one record behind, and yield a CDX entry for the previous record.This only occurs when using
indexRecord()
in that context.For comparison, the following will not be affected by this issue:
Thanks in advance,