ytsapras / robonet_site

Django RoboNet operational database.
GNU General Public License v2.0

DB dashboard displays all entries in the error report twice #18

Open rachel3834 opened 7 years ago

rachel3834 commented 7 years ago

The dashboard displays repeated entries for all of our codes, doubling the length of the table.

The dashboard now adapts the table on demand depending on the number of entries in the errors.txt file. So this error is happening because the errors.txt file somehow got duplicate entries.

I've looked through the code, and everything I can find that reports to this file uses the correct functions in get_errors.py. I've written unit tests and runtime tests for this code and it works fine, without duplicating the entries.

I've corrected the duplicated errors.txt file on Einstein and this has resolved the problem in the dashboard, so that is also working fine.

So the question is: how did errors.txt get duplicated in the first place?

rachel3834 commented 7 years ago

I've tested artemis_subscriber, obs_control and romerea_db_backup, and none of them duplicate the entries in errors.txt when I run them in my local test environment. The other scripts that report to the file are run_tap_rea (which uses the same get_errors function as my scripts and so should be fine) and reception (which, as far as I can tell, isn't yet writing an update to errors.txt...right, Etienne?)

So I don't know what caused this, unless it was an accidental manual edit. I've corrected the file now and we should monitor it to see if it recurs.

ytsapras commented 7 years ago

The errors.txt file in /data/romerea/data/logs/2017/ has two lines for artemis_subscriber, so I suspect your scripts update both of these lines. Since they are redundant, I recommend removing the last one.

rachel3834 commented 7 years ago

This is a weird error. The subscriber uses the same get_errors.update_err function to update errors.txt throughout the code. It does have code to output errors at different stages of the software, but the update_err function is designed to update any existing entry referring to the code.

Unit testing this code shows that it works, so it seems unlikely that the output to the file is the cause of this problem.

The function should only add a line with the name of the code if there is no such line in the previous version of errors.txt, yet we know that there was one. So either there's a filesystem problem and the code occasionally does an incomplete read of the old errors.txt when the update_err function is called, or perhaps two codes are trying to write to the file simultaneously. The latter would be a rare error, but it is possible and consistent with what we've seen.

I realise the function was designed to avoid this problem by writing to a new file and renaming it, but it looks like this isn't quite enough in all circumstances.
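
For reference, a minimal sketch of the read-modify-rename pattern described above (the file names, line format and function signature are assumptions for illustration, not the actual get_errors.py code):

```python
import os

ERRORS_FILE = 'errors.txt'      # path assumed for illustration
TMP_FILE = 'errors_new.txt'     # fixed temp name, shared by every caller

def update_err(code_name, status):
    """Update (or add) the line for code_name, then rename the new file into place."""
    entries = {}
    if os.path.exists(ERRORS_FILE):
        with open(ERRORS_FILE) as f:
            for line in f:
                if line.strip():
                    name, _, rest = line.partition(' ')
                    entries[name] = rest.strip()

    entries[code_name] = status            # update in place, or append if missing

    with open(TMP_FILE, 'w') as f:         # every caller writes the SAME temp file
        for name, rest in entries.items():
            f.write('{} {}\n'.format(name, rest))

    os.rename(TMP_FILE, ERRORS_FILE)       # atomic, but only protects readers:
                                           # two writers can still interleave between
                                           # the read above and this rename
```

The rename keeps readers from ever seeing a half-written file, but it does not serialise the writers: two cron jobs calling this at the same time read the same old contents and write to the same errors_new.txt, which is consistent with the kind of occasional corruption described here.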

rachel3834 commented 7 years ago

I checked for filesystem problems and I don't think we're experiencing any right now, so I don't think that's it.

rachel3834 commented 7 years ago

Also I can't see any code crashes in the logs or issues that would produce this outcome.

The only guaranteed way to avoid clashing I/O processes is to lock the errors.txt file while a function is writing to it, so that nothing else can. This might be doable, but would add complexity to all our cron tasks.
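
A sketch of what that lock-out could look like, using a POSIX advisory lock on a sidecar file (the paths and helper name are illustrative, not existing code in this repo):

```python
import fcntl
import os

ERRORS_FILE = 'errors.txt'            # path assumed for illustration
LOCK_FILE = ERRORS_FILE + '.lock'     # hypothetical sidecar lock file

def locked_update(apply_update):
    """Hold an exclusive lock while rewriting errors.txt.

    apply_update(old_lines) returns the new list of lines.  Advisory locks
    only exclude processes that also take the lock, so every writer
    (cron task or otherwise) would have to go through this wrapper.
    """
    with open(LOCK_FILE, 'w') as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)       # blocks until other writers finish
        try:
            old_lines = []
            if os.path.exists(ERRORS_FILE):
                with open(ERRORS_FILE) as f:
                    old_lines = f.readlines()
            tmp_path = ERRORS_FILE + '.tmp'
            with open(tmp_path, 'w') as f:
                f.writelines(apply_update(old_lines))
            os.rename(tmp_path, ERRORS_FILE)   # readers still never see a partial file
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

This serialises the read-modify-write as well as the final rename, which is exactly the added complexity mentioned above: every script that touches errors.txt would have to adopt it.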

The alternative would be creating an error table in the DB and reading the info from there - DBs are designed to handle this kind of issue. But I remember we talked about that and there was a good reason why it didn't work. Yianni, what do you think?

ytsapras commented 7 years ago

I can create an Error model in the DB if you prefer to do it this way but I am not convinced it is necessary. An alternative would be to generate temp files with random unique ids which can be used for this purpose. We can try that first. I'll make the necessary changes on Monday and commit to github.
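
If the DB route were taken, a minimal sketch of what such a model and its update call could look like (model and field names are illustrative only, not an existing model in robonet_site):

```python
# Hypothetical error-report model: one row per reporting script.
from django.db import models


class ErrorReport(models.Model):
    code_name = models.CharField(max_length=100, unique=True)
    status = models.CharField(max_length=200)
    last_updated = models.DateTimeField(auto_now=True)

    def __str__(self):
        return '{}: {}'.format(self.code_name, self.status)
```

A reporting script would then call something like `ErrorReport.objects.update_or_create(code_name='artemis_subscriber', defaults={'status': 'OK'})`, and the unique constraint plus the database's own locking prevent duplicated rows under concurrent updates.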

ytsapras commented 7 years ago

I have now updated get_errors.py to use a unique temporary file name instead of errors_new.txt. The change has been committed on github, so the next time we restart the container it should be picked up.
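
For reference, a sketch of the unique-temp-file pattern described (paths and the helper name are illustrative; this is not the actual commit):

```python
import os
import tempfile

ERRORS_FILE = 'errors.txt'   # path assumed for illustration

def write_errors(lines):
    """Write new contents to a uniquely named temp file, then rename it into place.

    mkstemp gives each process its own file, so concurrent writers can no
    longer clobber a shared errors_new.txt; the final rename is atomic as
    long as the temp file is on the same filesystem as errors.txt.
    """
    target_dir = os.path.dirname(os.path.abspath(ERRORS_FILE))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix='errors_', suffix='.txt')
    with os.fdopen(fd, 'w') as f:
        f.writelines(lines)
    os.rename(tmp_path, ERRORS_FILE)
```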

rachel3834 commented 7 years ago

OK thanks Yianni