Closed (pa5cal closed this 4 years ago)
Yeah, I woke up to an inbox full of error messages! I wonder if there's something that has a tendency to cause this around midnight UTC (CPU or disk I/O pressure from other cron jobs maybe?) or whether it's just a coincidence.
Thanks for reporting! It should be running again now.
Thx for fixing.
What did the error message say?
ERROR: Bad file descriptor @ fptr_finalize_flush - /store/planet/replication/changesets/state.yaml.tmp
That's the first error message from last night. It varies, but it's usually something like that. Subsequent error messages said:
ERROR: undefined method `[]' for nil:NilClass
Which means that the state.yaml file was empty. So somehow there was an issue writing out the temporary state file on one run, which left the state file empty. When I logged into the machine, the temporary state file was not empty. However, the only place in the code which modifies the state file copies it from the temporary state file.
So I have no idea how it gets into this state - and on a fairly regular basis!
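For reference, here is a minimal sketch of the write-to-temp-then-replace pattern described above, with an explicit fsync before the rename and a check for an empty state file on read. This is not the actual replication code: `write_state_atomically` and `read_state` are hypothetical names, and the `.tmp` suffix is only assumed from the path in the error message.

```ruby
require "yaml"

# Hypothetical helper, not the real replication code.
# Writes the state to "<path>.tmp", forces it to disk, then renames it
# over the real file, so readers see either the old or the new content.
def write_state_atomically(path, state)
  tmp = "#{path}.tmp"
  File.open(tmp, "w") do |f|
    f.write(YAML.dump(state))
    f.flush  # push Ruby's buffer to the OS
    f.fsync  # ask the OS to push it to disk
  end
  File.rename(tmp, path) # atomic on POSIX when both are on one filesystem
end

# Hypothetical helper: fail loudly on an empty or malformed state file
# instead of hitting "undefined method `[]' for nil" further down.
def read_state(path)
  state = YAML.load_file(path)
  raise "state file #{path} is empty or malformed" unless state.is_a?(Hash)
  state
end
```

Using rename rather than copy means the state file itself is never opened for writing, so a failed write can only ever damage the temporary file. Whether that would help here is a guess, since the cause is still unknown.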
I had thought it was concurrent modifications, but unless I wrote the big flock around the whole program totally wrong, there should only be one copy of this code running at any one time. Perhaps there's some file writing stuff that's only being run at GC time, despite all the writes being wrapped in file blocks?
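To make the "big flock" idea concrete, this is roughly what I mean, as a sketch only. `with_exclusive_lock` and the lock file path are made-up names for illustration, and the real program may well do it differently.

```ruby
# Hypothetical wrapper: run the block only if no other process (or other
# open handle) currently holds an exclusive lock on lock_path.
def with_exclusive_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |lock|
    # With LOCK_NB, flock returns false instead of blocking when the
    # lock is already held elsewhere.
    return :already_running unless lock.flock(File::LOCK_EX | File::LOCK_NB)
    yield
  end
  # The lock is released when the block form of File.open closes the file.
end
```

A run would then look like `with_exclusive_lock("/path/to/run.lock") { replicate }`, where `replicate` stands in for the whole program body. If a second copy starts while the first is still running, it gets `:already_running` back and can exit without touching the state file.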
The changeset replication stopped again yesterday evening. The latest state file was created at:
last_run: 2020-09-07 23:42:01.730936000 +00:00
sequence: 4095906
See: https://planet.openstreetmap.org/replication/changesets/