Hi
I think it's OK to disable interruptingcow inside mithermometer, as it uses its own internal timeout. I think every worker should be able to clean up and interrupt long operations on its own. The timeout in workers_manager is just a last resort to keep the gateway as a whole stable and running; that's why the timeout there is much bigger than in the workers. That was my initial design idea.
Could that deadlock also be tackled by recreating the object somehow? Or maybe btlewrap still has some design flaw around those locks? I'm not a Python expert, so it's hard for me to see if there is some wrong cleanup handling, etc.
Anyway, to sum up: I'm OK with removing the interruptingcow timeout from the mithermometer worker, but I will keep the long timeout higher up in the execution chain so that a weird condition in any worker cannot freeze the whole gateway.
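To illustrate the layering I have in mind, a minimal sketch (hypothetical names such as poll_sensor, not the actual gateway code):

```python
from interruptingcow import timeout

WORKER_TIMEOUT = 10    # the worker relies on the Bluetooth library's own timeout
MANAGER_TIMEOUT = 35   # last-resort limit enforced around each command by workers_manager

def poll_sensor():
    # hypothetical worker code: the Bluetooth library is expected to give up
    # on its own after WORKER_TIMEOUT seconds, no interruptingcow needed here
    ...

# the workers_manager-level guard; interruptingcow raises RuntimeError by default
with timeout(MANAGER_TIMEOUT, exception=RuntimeError):
    poll_sensor()
```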
Thanks for looking into this. I agree that we should not remove the timeout within workers_manager.py.
Could that deadlock also be tackled by recreating the object somehow?
Theoretically we could recreate the object. But I'm not sure about any side effects this might have on a lower level. That lock is probably there for a reason, perhaps only one client is allowed to access the Bluetooth device at a given moment.
One idea is to recreate the object in those instances where the 35 second time limit is reached. Before the recreation we could explicitly delete the old object, which causes the cleanup routines to be executed. I'll try that and monitor the behavior. I'll report back once I have more information.
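A self-contained sketch of that idea (the Poller class below only stands in for mithermometer's MiThermometerPoller and btlewrap's cleanup; it is not the real code):

```python
class Poller:
    def __init__(self, mac):
        self.mac = mac

    def __del__(self):
        # stands in for btlewrap's cleanup routines (disconnect, lock release)
        print("cleaning up poller for", self.mac)

poller = Poller("AA:BB:CC:DD:EE:FF")

# once the 35 second limit is reached: drop the old object first so that its
# cleanup runs, then build a fresh one with fresh internal state
del poller
poller = Poller("AA:BB:CC:DD:EE:FF")
```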
Or maybe btlewrap still has some design flaw around those locks? I'm not a Python expert, so it's hard for me to see if there is some wrong cleanup handling, etc.
btlewrap has the correct cleanup mechanisms implemented; they're just not reached. So far, I have been unable to debug the exact reason why the thread cannot be interrupted.
Yes, only one client can access the Bluetooth device on Linux at a given moment. In the case of this gateway, that's handled by the gateway itself: everything runs on the same thread using an internal queue, so workers don't need to take care of locking the Bluetooth device themselves.
As for the workers affected by this, from what I understand it's only miflora and mithermometer, as they use btlewrap. I mainly use the thermostat worker, which uses bluepy for Bluetooth handling, and there are no locks there. From what I've read, interruptingcow uses SIGALRM to interrupt the flow; maybe that signal (or other signals as well?) needs to be caught inside btlewrap, which could then run its cleanup procedure?
So far, in over two weeks, I have been unable to reproduce a deadlock by just removing the timeout in mithermometer.py (and miflora.py for that matter, as it also relies on btlewrap). The 35 second timeout in workers_manager.py is still in place.
Those 35 seconds seem to be enough time on my setup, but I also only use a single device. I'll move my 5 miflora sensors over from home-assistant to bt-mqtt-gateway and do some more testing.
As for the workers affected by this, from what I understand it's only miflora and mithermometer, as they use btlewrap. I mainly use the thermostat worker, which uses bluepy for Bluetooth handling, and there are no locks there.
I can confirm that only the workers that use btlewrap seem to be affected by this. The thermostat workers for my five thermostats are running fine.
From what I've read, interruptingcow uses SIGALRM to interrupt the flow; maybe that signal (or other signals as well?) needs to be caught inside btlewrap, which could then run its cleanup procedure?
Might be. I'll look into btlewrap and check on signal handling. By the way: I also tested pnpnpn/timeout-decorator which allows for a different technique. Without any luck though.
Found the likely cause of those deadlocks: The underlying POSIX implementation of threads on musl libc doesn't support signal interrupts for lock.acquire(). I can indeed reproduce this behavior within my Docker container with Python 3.7.2 running on Alpine Linux.
Still, running into the deadlock is only a secondary issue that occurs after reaching the timeout and aborting the execution without proper cleanup (and without releasing the lock). I'll do some more investigation.
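A minimal sketch of the kind of reproduction I used (my own approximation, not the exact test): a stale, already-held lock plus a SIGALRM-based timeout, which is what interruptingcow uses under the hood:

```python
import signal
import threading

lock = threading.Lock()
lock.acquire()                     # simulates the lock left behind by an aborted operation

def on_alarm(signum, frame):
    raise RuntimeError("timeout")  # roughly what interruptingcow's handler does

signal.signal(signal.SIGALRM, on_alarm)
signal.alarm(5)

# On glibc the second acquire is interrupted after ~5 seconds and the
# RuntimeError propagates; on musl (e.g. Alpine) the signal cannot interrupt
# the blocking acquire, so the process hangs here indefinitely.
lock.acquire()
```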
Those 35 seconds seem to be enough time on my setup, but I also only use a single device. I'll move my 5 miflora sensors over from home-assistant to bt-mqtt-gateway and do some more testing.
Been running 3 MiFlora sensors for weeks now with no issues.
@bbbenji What is your Python version and Linux distribution (i.e. glibc vs. musl)?
Linux DietPi 4.14.66+ #1 SMP PREEMPT Thu Aug 23 05:59:33 UTC 2018 armv7l GNU/Linux
Python 2.7.13
GLIBC 2.24-11+deb9u3
Just as an update: I'm currently looking into recreating the MiThermometerPoller object to enforce using a new locking object.
I managed to find the actual culprit in btlewrap. When a connect operation to a Bluetooth device is interrupted by any exception other than the internal BluetoothBackendException, a previously acquired lock is not released. This is generally not a problem, because that is normally the only exception type that may occur. It is not the case, though, if we use interruptingcow, which by default raises a RuntimeError when the timeout hits.
The uninterruptible deadlock I mentioned is only a consequence of the skipped release of the lock object. As the lock was not released the following connect operation deadlocks at lock.acquire().
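To make the failure mode concrete, here is a simplified sketch of the pattern as I understand it (the names mirror btlewrap, but this is not the actual library code):

```python
import threading

class BluetoothBackendException(Exception):
    """Stand-in for btlewrap's internal exception type."""

class _BackendConnection:
    _lock = threading.Lock()  # class-level lock, shared by every instance

    def connect(self, mac):
        self._lock.acquire()  # on the happy path the lock is released later, on disconnect
        try:
            self._backend_connect(mac)
        except BluetoothBackendException:
            self._lock.release()  # cleanup only covers this exception type ...
            raise
        # ... so a RuntimeError raised by interruptingcow's timeout skips the
        # release and the next connect() deadlocks in _lock.acquire()

    def _backend_connect(self, mac):
        raise RuntimeError("simulated interruptingcow timeout")
```

Releasing the lock for any exception, not just BluetoothBackendException, avoids leaving it behind.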
So how do we fix this? As the lock object is declared at class level, we cannot just recreate the object. We have a few options. A quick workaround is to manually call _BackendConnection._lock.release() from the outside; this is obviously dirty, as we tinker with the internal state of an external library. The cleaner option is to fix the lock handling in btlewrap itself.
I forked btlewrap and implemented the needed changes. I also forced pip to use my fork of btlewrap. So far it works without any issues.
Thanks for the great detective work on this. Maybe let's wait a week to see if the upstream changes get merged? If not, let's use your fork until things get fixed upstream.
@ChristianKuehnel already merged my PR :smile:. Including it here is kind of a hassle though. This project doesn't depend on btlewrap directly, but through miflora and mithermometer. I'm pessimistic about pip's ability to override the version of a transitive dependency.
Currently I do something very ugly:
pip install --no-deps https://github.com/ChristianKuehnel/btlewrap/archive/master.zip mithermometer
pip install bluepy pyserial
pip install -r requirements.txt
So the right way would be to create a release of btlewrap, bump that dependency in miflora and mithermometer, create releases of those as well, and finally include these new releases within bt-mqtt-gateway. :sweat:
Maybe it will work if we specify a forced btlewrap version in the worker's REQUIREMENTS ( https://github.com/zewelor/bt-mqtt-gateway/blob/master/workers/mithermometer.py#L5 )?
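For example, assuming the worker declares its pip dependencies in a module-level REQUIREMENTS list as in the linked line (the exact pin is just illustrative):

```python
# workers/mithermometer.py (illustrative sketch, not the current file contents)
REQUIREMENTS = ["btlewrap==0.0.4", "mithermometer"]
```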
I just created a new release of btlewrap v0.0.4 (incl. pypi) and requested a new release from miflora, as I do not have the permissions to do so: https://github.com/open-homeautomation/miflora/issues/117
Thank you @ChristianKuehnel. I requested the same for mithermometer: hobbypunk90/mithermometer#2
@zewelor If we end up with different versions of dependencies, pip will complain:
mithermometer 0.1.2 has requirement btlewrap==0.0.3, but you'll have btlewrap 0.0.4 which is incompatible.
miflora 0.4 has requirement btlewrap==0.0.2, but you'll have btlewrap 0.0.4 which is incompatible.
The pip installation will succeed though ...
The use of the timeout mechanism in mithermometer.py and workers_manager.py can lead to a deadlock when a BTLE operation fails to complete within 8 or 35 seconds respectively: When the timeout fires, a currently ongoing operation is not cleaned up properly and leaves behind a lock. The next BTLE operation uses the same locking object and therefore lands in a deadlock. Furthermore, the timeout mechanism of interruptingcow is unable to kill the deadlocked operation. As a result, the whole worker hangs. This might be related to #14.
You can see the behavior in the attached logfile. I added a couple of custom log statements. In line 6 the backend fails to connect to the device within 8 seconds and kills the operation, resulting in the exception that runs until line 63. Note that there are no messages about releasing the lock.
The next operation then tries to acquire the same lock in line 69. This is the deadlock. As mentioned, interruptingcow is also unable to interrupt the deadlocked acquire method. Note that for my testing I manually set an upper limit on acquiring the lock, as seen on line 102. After the 100 seconds had passed, the timeout of interruptingcow fired immediately.
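For illustration, a sketch of that experiment (my own reconstruction, not the actual btlewrap code): a stale, already-held lock is acquired again inside an interruptingcow timeout, this time with an explicit upper limit:

```python
import threading
from interruptingcow import timeout

lock = threading.Lock()
lock.acquire()                 # the stale lock left behind by the aborted operation

try:
    with timeout(8, exception=RuntimeError):
        # A plain lock.acquire() would deadlock here forever and the 8 second
        # timeout could never interrupt it. With an explicit upper limit,
        # acquire() gives up after 100 seconds and the pending RuntimeError
        # from the expired timeout fires immediately afterwards.
        lock.acquire(timeout=100)
except RuntimeError:
    print("timeout fired only after acquire() returned")
```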
I'm unsure how to tackle this. In the case of mithermometer.py just removing the use of interruptingcow seems to do the trick. If a device is unresponsive, an internal timeout of 10 seconds still applies. However, removing the timeout in workers_manager.py doesn't seem to be a good idea.
Ultimately, the best approach would be to determine why resource-cleanup fails when a timeout fires. I'm guessing the thread hangs somewhere internally in lock.acquire() and is therefore not able to reach any cleanup code.
Let me know if you need any more information.