webrecorder / warcit

Convert Directories, Files and ZIP Files to Web Archives (WARC)
https://pypi.python.org/pypi/warcit
Apache License 2.0

Revisit records don't seem to adhere to the --fixed-dt timestamp #31

Open Shrinks99 opened 1 year ago

Shrinks99 commented 1 year ago

All other records in the created WARC file seem to adhere to the --fixed-dt flag when the user sets it. Revisit records, which warcit creates automatically based on the directory structure, are the only ones that seem to exhibit this issue.

This is possibly because revisit records use a different method of deriving warc_date than other records do. See https://github.com/webrecorder/warcit/blob/d94ecd791c43a27b186dba81d5c118c23f1647c9/warcit/warcit.py#L547-L554 vs https://github.com/webrecorder/warcit/blob/d94ecd791c43a27b186dba81d5c118c23f1647c9/warcit/warcit.py#L495-L501
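
To make the suspected cause concrete, here is a minimal sketch of how two code paths could end up disagreeing on the date. This is not warcit's actual code: the function names (`response_date`, `revisit_date_suspected`, `revisit_date_expected`) and the `fixed_dt` / `now` parameters are hypothetical. It also assumes the revisit path falls back to the time of WARC creation, which matches the observed behavior but has not been confirmed against the source.

```python
from datetime import datetime, timezone


def to_warc_date(dt):
    """Format a timezone-aware datetime as a WARC-Date style string (UTC)."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def response_date(fixed_dt, now):
    # Hypothetical "regular record" path: the user-supplied override wins.
    return to_warc_date(fixed_dt if fixed_dt is not None else now)


def revisit_date_suspected(fixed_dt, now):
    # Hypothetical "revisit record" path as suspected in this issue:
    # the override is never consulted, so the creation time leaks through.
    return to_warc_date(now)


def revisit_date_expected(fixed_dt, now):
    # Expected behavior: derive the date the same way as other records.
    return response_date(fixed_dt, now)


# Illustrative values only: a fixed date chosen by the user, and the
# moment the WARC happens to be written.
fixed = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
created = datetime(2023, 11, 2, 23, 2, 23, tzinfo=timezone.utc)

print(response_date(fixed, created))
print(revisit_date_suspected(fixed, created))
print(revisit_date_expected(fixed, created))
```

If this is what is happening, routing the revisit path through the same date-selection logic as the other records would presumably make the timestamps consistent.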

Screenshot

This issue only appears to affect revisit records as shown below.

Screenshot 2023-11-02 at 11 02 23 PM

The timestamp in the current URL shows the date the WARC was created instead of the --fixed-dt date. The HTML file displays the correct date, that is, the time at which these website files would have been seen (according to the user of warcit).

Screenshot 2023-11-02 at 11 07 58 PM