Desired Feature
Arquivo.pt needs more verbose logs to be able to find errors.
How to implement
Some of the fields are already in the CDXJ file; however, some errors are not saved in CDXJ. It would be useful to add a logging system with a verbose mode.
We suggest using a format similar to Heritrix, with the following fields:
Field 1. Timestamp. The timestamp in ISO8601 format, to millisecond resolution. The time is the instant of logging.
Field 2. Fetch Status Code. Usually this is the HTTP response code but it can also be a negative number if URI processing was unexpectedly terminated.
Field 3. Document Size. The size of the downloaded document in bytes. For HTTP, this is the size of content only. The size excludes the HTTP response headers. For DNS, the size field is the total size for the DNS response.
Field 4. Downloaded URI. The URI of the document downloaded.
Field 5. Referrer. The URI that immediately preceded the downloaded URI. The referrer will be empty for seed URIs.
Field 6. Mime Type. The downloaded document mime type.
Field 7. SHA1 Digest. The SHA1 digest of the content only (headers are not digested).
Field 8. WARC Filename. The name of the WARC/ARC file to which the crawled content is written. This value will only be written if the logExtraInfo property of the loggerModule bean is set to true. This logged information will be written in JSON format.
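To make the proposal concrete, here is a minimal sketch of how one log line with the eight fields above could be assembled. It is only an illustration: the names `CrawlLogEntry`, `format_entry` and `log_extra_info`, the space separator, the `-` placeholder for empty fields, and the `warcFilename` JSON key are all assumptions, not existing crawler code.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CrawlLogEntry:
    """Hypothetical container for fields 2 to 8 of one log line."""
    status: int                      # Field 2: HTTP code, or negative on abnormal termination
    size: int                        # Field 3: content size in bytes, headers excluded
    uri: str                         # Field 4: downloaded URI
    referrer: Optional[str]          # Field 5: empty (None) for seed URIs
    mime_type: Optional[str]         # Field 6
    digest: Optional[str]            # Field 7: digest of the content only
    warc_filename: Optional[str] = None  # Field 8: only logged in verbose mode


def format_entry(entry: CrawlLogEntry, when: datetime, log_extra_info: bool = False) -> str:
    """Build one space-separated log line; `when` must be timezone-aware."""
    # Field 1: ISO 8601 timestamp in UTC with millisecond resolution.
    ts = when.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    ts = ts.replace("+00:00", "Z")
    fields = [
        ts,
        str(entry.status),
        str(entry.size),
        entry.uri,
        entry.referrer or "-",       # assumed placeholder for empty values
        entry.mime_type or "-",
        entry.digest or "-",
    ]
    if log_extra_info and entry.warc_filename:
        # Compact JSON (no spaces) so the line can still be split on spaces.
        extra = {"warcFilename": entry.warc_filename}
        fields.append(json.dumps(extra, separators=(",", ":")))
    return " ".join(fields)
```

The timestamp is passed in by the caller so that it reflects the instant of logging and so the function stays easy to test.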
This should be possible in the new 1.0.0 system, since the WARC writing is now in the crawler. Will probably use SHA256 instead of SHA1, but generally everything else should be there, I think.
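For the digest field, a rough sketch of hashing only the content with SHA256 might look like the following. The function name `payload_digest` and the `sha256:` prefix are assumptions for illustration; it also assumes the raw HTTP response is available as bytes and does not undo any transfer or content encoding.

```python
import hashlib


def payload_digest(raw_response: bytes) -> str:
    """Return a SHA256 digest of the response body, excluding HTTP headers."""
    # The first blank line (CRLF CRLF) terminates the HTTP header block.
    _headers, separator, body = raw_response.partition(b"\r\n\r\n")
    if not separator:
        # No header terminator found: assume the input is already bare content.
        body = raw_response
    return "sha256:" + hashlib.sha256(body).hexdigest()
```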