xorbit / LiFePO4wered-Pi

Access library, command line tool and daemon for the LiFePO4wered/Pi module
GNU General Public License v2.0
132 stars 31 forks source link

shut down for unknown reason #48

Closed badzz closed 3 years ago

badzz commented 3 years ago

Hi, I got a restart of my raspberry and I am struggling to understand why. looking at the code, it seems to be PI_RUNNING was at 0. here is the (only) log I found Dec 2 04:06:51 paris LiFePO4wered[374]: Signal from LiFePO4wered module to shut down Dec 2 04:06:52 paris LiFePO4wered[374]: Signaling LiFePO4wered module that system is shutting down Dec 2 04:06:52 paris LiFePO4wered[374]: Triggering system shutdown Can you shed some light on how it could happen ? Thanks

xorbit commented 3 years ago

Hello Matthias, There are various reasons the LiFePO4wered/Pi+ may clear the PI_RUNNING flag to trigger shutdown, such as button press, AUTO_SHDN_TIMER expiration or the battery power getting too low.

badzz commented 3 years ago

I found the doc since indeed. However nothing has changed for months and the battery is reported to be 3.5v and it is running fine at the moment. There was no power cut either. Don't know what happened. Thanks anyway

On Wed, Dec 2, 2020, 18:29 Patrick Van Oosterwijck notifications@github.com wrote:

Hello Matthias, There are various reasons the LiFePO4wered/Pi+ may clear the PI_RUNNING flag to trigger shutdown, such as button press, AUTO_SHDN_TIMER expiration or the battery power getting too low.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/xorbit/LiFePO4wered-Pi/issues/48#issuecomment-737379997, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABR5UYLMDOTHWKOLRLPPQ43SSZ2RLANCNFSM4UKJW53A .

pefribeiro commented 3 years ago

Hi,

Just wanted to say that I've recently updated my RPi (including adding another I2C device, and updating the LiFePO4wered software) and I've also just observed the same behaviour for no apparent reason:

Timestamp Level Program Message Priority
2021-01-02 17:50:34 6 LIFEPO4WERED-DAEMON Signal from LiFePO4wered module to shut down info
2021-01-02 17:50:34 6 LIFEPO4WERED-DAEMON Signal from LiFePO4wered module to shut down info
2021-01-02 17:50:35 6 LIFEPO4WERED-DAEMON Triggering system shutdown info
2021-01-02 17:50:35 6 LIFEPO4WERED-DAEMON Triggering system shutdown info

I've noticed that upon the first boot, PI_RUNNING is set to 1, but after some time (not sure how long yet!) PI_RUNNING now seems to be 135. [Edit: In case that helps, I've manually set it to 1 by issuing lifepo4wered-cli set PI_RUNNING 1, and after around 1-2 hours the value is back at 135.] The user manual and the control code, seem to indicate this should be either 0 or 1. Is this perhaps an indication of an issue? Full output of lifepo4wered-cli get below:

I2C_REG_VER = 4
I2C_ADDRESS = 67
LED_STATE = 1
TOUCH_STATE = 0
TOUCH_CAP_CYCLES = 20
TOUCH_THRESHOLD = 12
TOUCH_HYSTERESIS = 2
DCO_RSEL = 14
DCO_DCOMOD = 206
VIN = 5178
VBAT = 3592
VOUT = 4985
VBAT_MIN = 2854
VBAT_SHDN = 2952
VBAT_BOOT = 3152
VOUT_MAX = 3889
VIN_THRESHOLD = 4498
VOFFSET_ADC = 54
AUTO_BOOT = 0
WAKE_TIME = 0
SHDN_DELAY = 65
AUTO_SHDN_TIME = 65535
PI_RUNNING = 135
CFG_WRITE = 0

This system had otherwise been working for years without any random shutdowns. Any suggestions would be welcome.

Thanks

pefribeiro commented 3 years ago

Hi,

Just to note, I'm using the LiFePo4 for RPi3 with a RPi2 B+ model, using the latest Raspbian (Buster).

Made some progress with moving the new I2C sensor across to a new pair of GPIOs to try and mitigate any interference. Unfortunately, upon running lifepo4wered-cli get indefinitely with a 2 second delay between issuing commands, it eventually came to the following run:

pi@host:~ $ while true; do lifepo4wered-cli get; sleep 2; done
...
(elided)
...
TOUCH_THRESHOLD = 12
TOUCH_HYSTERESIS = 2
DCO_RSEL = -2
DCO_DCOMOD = -2
VIN = -2
VBAT = -2
VOUT = -2
VBAT_MIN = -2
VBAT_SHDN = -2
VBAT_BOOT = -2
VOUT_MAX = -2
VIN_THRESHOLD = -2
VOFFSET_ADC = -2
AUTO_BOOT = -2
WAKE_TIME = -2
SHDN_DELAY = -2
AUTO_SHDN_TIME = -2
PI_RUNNING = -2
CFG_WRITE = -2
I2C_REG_VER = -2
I2C_REG_VER = -2
I2C_REG_VER = -2
^C
pi@host:~ $ lifepo4wered-cli get
I2C_REG_VER = -2
pi@host:~ $ lifepo4wered-cli get
I2C_REG_VER = -2

After this, the battery controller becomes nearly unresponsive. So if I shutdown the Pi, it is the impossible to power-up the system by touching the button on the LiFePO4 controller. Indeed, the only alternative at this point is to remove power, and the battery, to return the system to a stable status.

If I try the above loop with a 0.1 delay (while true; do lifepo4wered-cli get; sleep 0.1; done), then I quickly get the LiFePo4 controller to die after some seconds.

Not sure what is at fault here.

Thanks

xorbit commented 3 years ago

Hi @pefribeiro,

On your first issue (PI_RUNNING at 135):

Very weird. So both times, the value ended up at 135? I'm trying to figure out if something is incrementing the value, or writing this wrong value to the register. If something is incrementing, it could eventually overflow the byte to 0 and trigger a shutdown. If something is writing 135, I wonder what it could be. While the unit accepts any non-zero value, none of my software ever intentionally writes any value other than 1.

Second issue:

I tried to reproduce your issue by doing the while true; do lifepo4wered-cli get; sleep 0.1; done loop, but it's still running just fine after 2 hours of running it. So I can't help wonder if it's something specific to your setup. You mention "moving the new I2C sensor across to a new pair of GPIOs", I don't know the details about what you're doing. What I do know is that the I2C implementation on the Raspberry Pi is quit poor (for instance, it doesn't support clock stretching), so I know you have to be careful to avoid collisions if you have multiple things going at the same time. You say you have the sensor on different pins now but I wonder if there could be issues with collisions at the driver level, I don't know how this is implemented and how you move I2C to different GPIO pins.

pefribeiro commented 3 years ago

Hi @xorbit,

Thanks for reaching out, and apologies for hijacking the thread.

On your first issue (PI_RUNNING at 135):

Very weird. So both times, the value ended up at 135? I'm trying to figure out if something is incrementing the value, or writing this wrong value to the register. If something is incrementing, it could eventually overflow the byte to 0 and trigger a shutdown.

Yes, it is weird. I've just recently re-deployed the system and after some time the value of PI_RUNNING is 135 again.

If something is writing 135, I wonder what it could be. While the unit accepts any non-zero value, none of my software ever intentionally writes any value other than 1.

I thought so too. I checked through your repository and couldn't find any obvious place where it would be written with a wrong value.

Second issue:

I tried to reproduce your issue by doing the while true; do lifepo4wered-cli get; sleep 0.1; done loop, but it's still running just fine after 2 hours of running it. So I can't help wonder if it's something specific to your setup.

Perhaps, but I'm not yet sure what exactly it could be. Up until recently the only devices plugged into the RPi were: the battery controller on the GPIOs, and a Realtek RTL2832U device on a USB that receives 433Mhz sensor data using rtl_433. Very rarely, and rather sporadically (maybe once or twice a year) the controller displayed similar behaviour, ie. RPi powered down and the LiFePo4 was unable to boot, requiring me to remove the battery. This didn't bother me much as the system had otherwise uptimes of months.

You mention "moving the new I2C sensor across to a new pair of GPIOs", I don't know the details about what you're doing.

Recently the SDcard died, so I took the opportunity to install a new I2C sensor, and a clean Raspbian image (Buster). Previously I had been running Raspbian (Jessie?) for a really long time, along with an older version of your software controller (not sure what commit exactly now!). I note that the new sensor uses different addresses from those of the LiFePo4wered.

In the interest of trying to eliminate any possible interference from the new sensor on the I2C bus, I moved the SLA/SCL lines to other GPIOs, adding pull-up resistors. I used the dtoverlay=i2c-gpio facility to create a new software-defined I2C bus. So far so good, but I haven't yet dared to attempt running the stress loop again.

What I do know is that the I2C implementation on the Raspberry Pi is quit poor (for instance, it doesn't support clock stretching), so I know you have to be careful to avoid collisions if you have multiple things going at the same time. You say you have the sensor on different pins now but I wonder if there could be issues with collisions at the driver level, I don't know how this is implemented and how you move I2C to different GPIO pins.

Like you say, I've found similar comments online regarding the I2C implementation of the RPi.

I also don't know much about the driver implementation, and as to whether the software i2c-gpio is any better (supposedly it can do clock stretching?). I had wondered whether there could be any racing conditions going on with multiple programs writing to the same I2C device, irrespective of the bus? According to an answer on Stackoverlow it hints that this should be safe as long as "no other processes talk to the same I2C device".

Could the problem be that running the lifepo4wered-cli command (even just to perform a get) while the daemon runs in the background somehow leads to a racing condition? I am just speculating here as I know very little about your controller and the ioctl for I2C on Linux/RPi.

I have recently scripted the output of lifepo4wered-cli get to log voltages via SNMP extensions, which are queried by a logging server at regular intervals. Somehow it feels like this may have made the problem worse, but it could be unrelated. I may also start recording other values, like that of PI_RUNNING, if they turns out to be interesting to watch over time.

Food for thought. Thanks

xorbit commented 3 years ago

@pefribeiro In my while true; do lifepo4wered-cli get; sleep 0.1; done test I also had the daemon running in the background. Now I would not recommend that anyone would run a test to hammer the LiFePO4wered/Pi+in that manner, nothing really changes much faster than a second and I'm using a pretty anemic micro, but I had no issues during the test time and you said on yours, that speed pretty much immediately caused a problem.

The LiFePO4wered daemon and CLI code are pretty well behaved in that the software only opens the device when reading/writing and closes it again right away so other services can have access to the bus. There is also retries, so for instance in case the daemon is accessing the device when you run the CLI, the CLI access will fail but retry again later when the daemon is done and usually the user doesn't see any effect of the collision. Error codes are only propagated to the user if all retries fail.

pefribeiro commented 3 years ago

Hi @xorbit,

Thanks for your comments. I agree that it looks like the code/design shouldn't be causing such issues.

Since my last update, I've managed to run the system without interruptions for 8 days straight, without noticing any daemon initiated shutdown or system freeze. However, as usual, after a few hours the PI_RUNNING flag had been set to 135. 8 days later, I have also noticed that other registers seem to have been disrupted:

pi@host:~ $ lifepo4wered-cli read
I2C_REG_VER = 4
I2C_ADDRESS = 67
LED_STATE = 1
TOUCH_STATE = 0
TOUCH_CAP_CYCLES = 20
TOUCH_THRESHOLD = 135
TOUCH_HYSTERESIS = 255
DCO_RSEL = 14
DCO_DCOMOD = 206
VIN = 5216
VBAT = 3578
VOUT = 4968
VBAT_MIN = 2854
VBAT_SHDN = 2952
VBAT_BOOT = 3152
VOUT_MAX = 41328
VIN_THRESHOLD = 4829
VOFFSET_ADC = 54
AUTO_BOOT = 255
WAKE_TIME = 65415
SHDN_DELAY = 65
AUTO_SHDN_TIME = 65415
PI_RUNNING = 135
CFG_WRITE = 0

I am unable to explain (or observe) how the registers manage to get such weird values. Is there anything else I should try?

xorbit commented 3 years ago

It's almost as if some other process is writing registers. Or you have a unit with a problem, I guess that can't be ruled out. There isn't anything else loading the I2C bus is there? Some HATs add extra pull ups or level shifters that cause problems for this chip.

pefribeiro commented 3 years ago

Hi @xorbit,

It's almost as if some other process is writing registers.

Not that I am aware of.

Or you have a unit with a problem, I guess that can't be ruled out.

In the absence of other obvious issues, I have started suspecting as well. I suppose it would be hard to check this.

There isn't anything else loading the I2C bus is there? Some HATs add extra pull ups or level shifters that cause problems for this chip.

I've moved the other I2C device sensor to another set of pins (as I described above), so currently this is the only device on the (true) I2C bus that is connected to the battery controller. As far as I recall, I have also observed the issue with the registers even without the other device plugged-in.

I wonder if perhaps the issue had already surfaced for a while, but since I usually didn't check the registers regularly I wouldn't know.

pefribeiro commented 3 years ago

Just to add that one of these days the register PI_RUNNING finally changed to 0. I had changed the code not to trigger the actual shutdown in order to prevent it from powering off, so it is still running despite the value.

One interesting thing to note is that the register CFG_WRITE is recording a value of 255, when according to the manual it should always be read as zero. I am starting to think there is indeed a fault? Full output below at the moment:

I2C_REG_VER = 4
I2C_ADDRESS = 67
LED_STATE = 135
TOUCH_STATE = 0
TOUCH_CAP_CYCLES = 135
TOUCH_THRESHOLD = 255
TOUCH_HYSTERESIS = 135
DCO_RSEL = 255
DCO_DCOMOD = 206
VIN = 3969
VBAT = 2957
VOUT = 4290
VBAT_MIN = 4197813
VBAT_SHDN = 3749
VBAT_BOOT = 3152
VOUT_MAX = 41328
VIN_THRESHOLD = 4829
VOFFSET_ADC = 4197813
AUTO_BOOT = 255
WAKE_TIME = 65415
SHDN_DELAY = 255
AUTO_SHDN_TIME = 65415
PI_RUNNING = 0
CFG_WRITE = 255

System is operational, but the daemon is off for the time being.

pefribeiro commented 3 years ago

18 days of uptime now, and the LiFePO4wered has disappeared from the I2C bus:

pi@host:~ $ lifepo4wered-cli read
I2C_REG_VER = -2
pi@host:~ $ sudo i2cdetect -y 1
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:          -- -- -- -- -- -- -- -- -- -- -- -- -- 
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
20: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
40: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
50: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
60: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
70: -- -- -- -- -- -- -- -- 

Is this evidence of a faulty unit at this point? Power is still on somehow.

Not sure if it matters, but dmesg now has a few of these messages from today and yesterday: i2c-bcm2835 3f804000.i2c: i2c transfer timed out

xorbit commented 3 years ago

This convinces me more and more that the unit is working correctly internally, but I2C communication is failing for some reason. The MSP430 on the LiFePO4wered/Pi+ doesn't have very strong drivers and I have determined that it normally works with just the I2C pull-ups on the Pi, but if there are any extra pull-ups on the I2C lines, the output low level may not be enough to be read correctly by the Pi.

If you don't have any extra pull-ups on the I2C lines that could cause a logic level problem, it might be interesting as an experiment to add a pull-down of say 22-27K on the SDA line and see if that makes a difference.