xorbit / LiFePO4wered-Pi

Access library, command line tool and daemon for the LiFePO4wered/Pi module
GNU General Public License v2.0

RTC reset and unit not powering on #68

Open · Q-ten opened this issue 4 months ago

Q-ten commented 4 months ago

I've recently deployed 12 units and have begun to notice some odd behaviour.

The units are given a schedule to wake and shut down in order to capture images at certain times. The unit relies on the LiFePO4wered to wake on time. It works really well... except when it doesn't.

While investigating why a unit had not woken when it was meant to, I powered it on with the button and found in the syslog that the LiFePO4wered had restored the system time to 1/1/1970. Very shortly after, the system got the real time from the internet. The device continued to work normally after this manual start. I confirmed that it sets the RTC on shutdown.

So following a normal software-triggered shutdown, the unit just didn't seem to wake up. Perhaps because a 0 had been written to the RTC at shutdown? Or possibly the unit did wake at the designated time, but something caused it to shut down before the RPi could write anything to the syslog?
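To at least catch this in the field, I'm planning a boot-time check along these lines. This is only a sketch: it assumes the access library's read_lifepo4wered() call from lifepo4wered-data.h, that RTC_TIME reads back as seconds since the Unix epoch (as the CLI shows it), and an arbitrary cutoff date of my choosing.

```c
/* Sketch: detect at boot whether the LiFePO4wered RTC was reset.
 * Assumptions: read_lifepo4wered() from lifepo4wered-data.h is
 * available (link with -llifepo4wered) and RTC_TIME reads back as
 * seconds since the Unix epoch. DEPLOY_EPOCH is an arbitrary cutoff. */
#include <stdio.h>
#include <stdint.h>
#include "lifepo4wered-data.h"

#define DEPLOY_EPOCH 1700000000  /* ~Nov 2023; any earlier time is bogus */

int main(void)
{
  int32_t rtc = read_lifepo4wered(RTC_TIME);
  if (rtc < DEPLOY_EPOCH) {
    /* The RTC is earlier than the deploy date, so the micro must
     * have reset (PUC) since the time was last written at shutdown. */
    printf("RTC_TIME = %d: micro reset suspected\n", (int)rtc);
    return 1;
  }
  printf("RTC_TIME = %d: looks sane\n", (int)rtc);
  return 0;
}
```

I'd run it early in boot (e.g. from a systemd unit) so a reset leaves a trace in the journal even if the unit later dies again.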

I've confirmed this behaviour on at least two units.

Another issue I've encountered is that occasionally a unit just won't wake up. The button won't even wake it. The charge light is on if powered, but I can't wake the unit without going through the battery removal procedure.

I've wondered whether both issues could be caused by the same underlying problem. I've seen that other issues here have encountered register corruption; that might explain how RTC_TIME could go to 0. But what if VBAT_MIN were corrupted to a high value? Would that prevent the unit powering on at all? It's a 2-byte register, and with the scaling, an overflowed -1 value (0xFFFF) would correspond to 3.99V.
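If corruption like that is in play, a periodic sanity check could at least detect and repair it. A sketch, assuming the access library's read_lifepo4wered()/write_lifepo4wered() calls and that VBAT_MIN reads back in millivolts the way the CLI shows it; the 2850mV default and the plausibility window are my own guesses:

```c
/* Sketch: detect and repair an implausible VBAT_MIN.
 * Assumptions: lifepo4wered-data.h exposes read_lifepo4wered() and
 * write_lifepo4wered(), and VBAT_MIN is reported in mV like the CLI.
 * The window and the 2850 mV default are guesses for illustration. */
#include <stdio.h>
#include <stdint.h>
#include "lifepo4wered-data.h"

#define VBAT_MIN_DEFAULT_MV 2850
#define VBAT_MIN_LO_MV      2500  /* anything outside this window */
#define VBAT_MIN_HI_MV      3100  /* is treated as corruption     */

int main(void)
{
  int32_t v = read_lifepo4wered(VBAT_MIN);
  if (v < VBAT_MIN_LO_MV || v > VBAT_MIN_HI_MV) {
    /* e.g. a corrupted 0xFFFF raw register scaling to ~3.99 V
     * would stop the unit from ever booting on battery */
    printf("VBAT_MIN = %d mV looks corrupted, restoring default\n", (int)v);
    write_lifepo4wered(VBAT_MIN, VBAT_MIN_DEFAULT_MV);
  }
  return 0;
}
```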

The units are the latest revision as of Feb 2024 (I think). They were ordered via Tindie. They have the USB-C connector. The units have no battery holder. Two LiFePO4 18650 batteries in parallel are connected to the board via a short lead.

It might be possible that a significant current is drawn for a short period as it boots. VOUT is powering a wifi dongle. I suppose there is a chance that the voltage might drop below VBAT_SHDN or VBAT_MIN. But I don't think that would be sufficient to explain these issues.

Q-ten commented 4 months ago

I also recall a situation where a unit would not power on from a fully charged battery, but on a previous revision of the hardware (purchased from Mouser in early 2023, with micro USB VIN). It required battery removal, but then continued to work as normal.

Q-ten commented 4 months ago

I have a bit more information about the sequence of events around the RTC reset.

- 4:10pm: Shut down unit via software. VBAT is 3.123V. Set RTC wake time to 1am. Appears to shut down normally.
- 1am: No wakeup recorded in syslog.
- 9:46am: RTC time reset to 0 (inferred below).
- 7:15pm: Woken up with button. Syslog reports system time restored from the LiFePO4wered: 34103 seconds. 34103s is roughly 9.5 hours, so counting back from 7:15pm, the counter must have started from 0 at about 9:46am.

The unit was attached to a solar panel. It got light quite a bit earlier than that; it would have been charging from about 7am. AUTO_BOOT is set to 2, meaning it should boot after an improper shutdown once VBAT exceeds 3.2V (slightly elevated from the default).

Why did the unit not wake up at 1am and why did the RTC time reset to 0 at 9:46am? It seems like these would be related but I can't see how.

Then there are those times when the unit won't wake even with the button. (By "not wake" I mean no green light at all, not even fast flashing.)

Q-ten commented 3 months ago

I've been looking into possible explanations for the behaviour I'm seeing and came across the errata document for the micro (see https://www.ti.com/lit/er/slaz168o/slaz168o.pdf):

  1. Does anyone know whether the BCL12 workaround is implemented?

  2. Could BCL13 (slow VCC ramp) plausibly cause the system to die when powered by solar? If the battery drains overnight (for some reason) and then dawn breaks, perhaps VIN slowly creeps up?

I've wondered about the ENABLE signal from the micro. If the micro enters a fail state, perhaps it isn't pulling down the ENABLE pin? With the RPi and other peripherals attached (say to VOUT), these might draw current and drain the batteries, which could then lead to brownout and subsequent VCC creep at daybreak.

  3. Then there's CPU45. If the micro is running at 12MHz, the minimum safe operating voltage in the datasheet is 2.7V, but CPU45 suggests increasing this value by 0.2V as a workaround. Could there be CPU register corruption at voltages as high as 2.9V? Would it be safe to adjust the DCO_RSEL and DCO_DCOMOD registers to dial the frequency back below the 4.15MHz limit?
xorbit commented 3 months ago

Hello, sorry for the slow response. And thanks for reporting and digging into this.

  1. You could be on to something with the BCL12 erratum. In older iterations of the firmware (/Pi and /Pi3), I only set RSEL and DCOMOD once on boot, which would not trigger this erratum according to:

> Note that the 3-step clock startup sequence consisting of clearing DCOCTL, loading the BCSCTL1 target value, and finally loading the DCOCTL target value as suggested in the "TLV Structure" chapter of the MSP430x2xx Family User's Guide is not affected by BCL12 if (and only if) it is executed after a device reset (PUC) prior to any other modifications being made to BCSCTL1 since in this case RSEL still is at its default value of 7.

But this was changed on the /Pi+, because I'm now reducing the clock to 8 MHz before putting the micro to sleep and then back up to 12 MHz when waking IF the supply (battery) voltage turns out to be sufficient. This change was done to ensure proper operation down to 2.2V. [The 12 MHz is mostly needed for I2C communication and no communication is needed when the device remains in low voltage condition.]

That said, according to the DCO Frequency table in the datasheet, 12 MHz is in the RSEL=14 or RSEL=15 (center DCO) range and 8 MHz is in the RSEL=13 (center DCO) range. RSEL=12 only goes to 7.3 MHz. Most likely the system only ever switches between RSEL=13 and RSEL=14, which should not be affected by the BCL12 erratum. Still, I should add the recommended code just in case some chip happens to be down at RSEL=12 for the 8 MHz calibration. But I don't think it's the cause of your issue. Can you tell me what `lifepo4wered-cli get` shows for your device's DCO_RSEL value?
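If you want to grab it programmatically across all your units, the equivalent through the access library is just a few lines; a sketch, assuming you link against the library (-llifepo4wered) and use its read_lifepo4wered() interface:

```c
/* Sketch: dump the DCO calibration registers the firmware is using.
 * Equivalent to `lifepo4wered-cli get DCO_RSEL` and `get DCO_DCOMOD`;
 * assumes read_lifepo4wered() from lifepo4wered-data.h. */
#include <stdio.h>
#include "lifepo4wered-data.h"

int main(void)
{
  printf("DCO_RSEL   = %d\n", (int)read_lifepo4wered(DCO_RSEL));
  printf("DCO_DCOMOD = %d\n", (int)read_lifepo4wered(DCO_DCOMOD));
  return 0;
}
```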

  2. As for BCL13, if the battery really discharges below 2.2 V, I think we have worse issues. When the battery voltage is below VBAT_MIN (2.85V by default), the device just wakes to maintain the RTC and do ADC conversions at a slow rate, while being in deep sleep most of the time. Current draw will only be about 4 uA, and the ENABLE of the power supplies is turned off every wake cycle while in this state.

  3. CPU45 is an interesting one. I find it all a bit vague ("under certain conditions"?) but to me it sounds like "sorry we lied in the datasheet but we're putting this here so you can't sue us lol". Because of the lousy I2C peripheral that requires way too much software support, I need to run at 12 MHz to support 100 kHz I2C communication. If you reduce the Pi's bus speed, you could try to reduce DCO_RSEL and DCO_DCOMOD, but I wouldn't go down to 4 MHz because of the thing I do where I set the clock to the 8 MHz cal value before going to sleep (the firmware doesn't give I2C access to that; the existing DCO_RSEL and DCO_DCOMOD are relics from older iterations that used chips without TI cal values, to be honest, but I kept them for backward compatibility). I would probably try setting it to ~8 MHz values if you were to experiment with that; otherwise you increase your risk of running into the BCL12 erratum. The easiest way to check the DCO frequency is the PWLED pads: the LED PWM is DCO clock / 65536 (around 183 Hz at 12 MHz DCO, 122 Hz at 8 MHz DCO).

While all the above are interesting possibilities to explore, I would suggest taking a step back and focusing on the overall system. I'm not saying the LiFePO4wered/Pi+ is perfect by any means; there are definitely things that could be better, I have had to make improvements over the years, and there's always the possibility of hitting a batch of microcontrollers that exposes a flaw such as BCL12 that never popped up before. But I have been selling it for almost 6 years now, many thousands have been deployed by customers, many of them in solar power systems, and this is the first time such a severe issue has been reported. The fact that you are using the external battery option definitely adds a wrinkle of uncertainty, but I know others have used this option and not seen this issue.

It is hard to beat the proximity of the on-board battery: connected through 2 oz copper pours, it provides a rock-solid power supply, and the decoupling caps etc. were designed around this. If you use an external battery, it's extremely important to use short, thick-gauge (I would say 12 AWG or better) wiring. I admit one of the flaws of the /Pi+ is that the BAT+/- holes are way too small to accommodate such wiring; I usually recommend soldering thick wire to their surface instead of going through the holes. Another possibility is to add a bulk electrolytic cap to the BAT+/- pads in addition to the wiring. I hope to improve this use case in a future product. But your problem description above definitely makes me suspect power brownout when the load is switched.

Q-ten commented 3 months ago

Thanks for your very helpful response!

  1. BCL12. Given the post-boot frequency changes are from 8MHz to 12MHz, which map to RSEL 13 and 14, I agree this wouldn't explain it. It's possible for 12MHz to be in the RSEL 15 range and potentially cause problems, but on the 3 units where I confirmed the issue, the RSEL, DCO, MOD values were:

| RSEL | DCO | MOD |
|------|-----|-----|
| 14   | 4   | 3   |
| 14   | 3   | 29  |
| 14   | 3   | 15  |

  2. BCL13. I agree this is a long shot. TBH I don't have a consistent hypothesis here. I had an idea that if the micro got into a state where it locked up or browned out and didn't properly restart, maybe ENABLE could be left floating, leading to other peripherals draining the batteries... but then even if the solar input slowly crept up, the micro would already be in a bad state, so the slow ramp wouldn't be what caused the problem.

  3. CPU45. Yep, it sure looks like they got the datasheet specs wrong! They say the CPU register contents could be corrupted, which is game over if it happens. It seems to me that if it did happen, it would surely cause a reset in short order due to the security key violation condition. That wouldn't necessarily cause a lockup, but I'm certain I've seen a PUC because of the RTC_TIME reset I came across. (I discovered the RTC_TIME had been reset 9.5 hours earlier, when no-one was near it and it was connected to battery. After a manual start with the button, the syslog showed that the time was restored from the LiFePO4wered to 1970 + 9.5 hrs.)

I think CPU45 is a plausible explanation for the reset, as the 3V3 level might dip below 2.9V: I have peripherals that draw considerable current. I'm thinking the 4G dongle might draw quite a lot as it powers up its radio to connect to a tower; it's connected to VOUT. Maybe I'll connect the dongle through a MOSFET and have the RPi turn it on only if VBAT is above a reasonable threshold. Even if it still causes a reset, at least I'll have a log.
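The gating I have in mind would look something like this sketch. Assumptions: the access library's read_lifepo4wered() returns VBAT in mV (as the CLI shows it), libgpiod v1 is installed, and the GPIO line number and threshold are placeholders I made up:

```c
/* Sketch: only enable the 4G dongle's MOSFET when VBAT is healthy.
 * Assumptions: read_lifepo4wered() from lifepo4wered-data.h returns
 * VBAT in mV, and libgpiod v1 is available (link with -lgpiod).
 * DONGLE_GPIO and VBAT_OK_MV are made-up values for illustration. */
#include <stdio.h>
#include <stdint.h>
#include <gpiod.h>
#include "lifepo4wered-data.h"

#define DONGLE_GPIO 17    /* hypothetical MOSFET gate line */
#define VBAT_OK_MV  3100  /* hypothetical "battery healthy" level */

int main(void)
{
  int32_t vbat_mv = read_lifepo4wered(VBAT);
  int on = vbat_mv >= VBAT_OK_MV;

  struct gpiod_chip *chip = gpiod_chip_open_by_name("gpiochip0");
  if (!chip)
    return 1;
  struct gpiod_line *line = gpiod_chip_get_line(chip, DONGLE_GPIO);
  /* Requesting the line as output drives it to the initial value. */
  if (!line || gpiod_line_request_output(line, "dongle-switch", on) < 0) {
    gpiod_chip_close(chip);
    return 1;
  }
  printf("VBAT = %d mV, dongle %s\n", (int)vbat_mv, on ? "on" : "kept off");
  gpiod_chip_close(chip);
  return 0;
}
```

In practice I'd keep the process alive (or latch the gate in hardware) so the state persists after the line is released; this just shows the decision logic.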

Another thought that occurred to me around my random reset: perhaps a bunch of interrupts came in and caused a stack overflow? Without knowing the internals I don't know how plausible this is. Maybe if there's an interrupt on a floating input, or a voltage read without hysteresis? I'm really clutching at straws here. :)

While I can't easily change the current units in the field, I'll definitely look at adding a bulk capacitor across the battery terminals (and, for my current application, putting my 4G dongle behind a MOSFET controlled by the RPi). And I agree that having an inbuilt battery right there is excellent from the perspective of keeping the micro running, but for my application/users it's really handy (and safer) to be able to disconnect the battery.

I now realise I could have recovered from the reset if I had flashed AUTO_BOOT to AUTO_BOOT_VBAT or AUTO_BOOT_VIN. It was set to AUTO_BOOT_VBAT_SMART in RAM, but the reset would have destroyed that, and the SMART setting would probably mean it wouldn't boot anyway (not sure). Also, checking the AUTO_BOOT value at start would tell me if it had reset; see the sketch below.
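Concretely, the startup check could look like this; a sketch, assuming AUTO_BOOT reads back through the access library and that the value I set in RAM (2, AUTO_BOOT_VBAT_SMART) differs from whatever is in flash on my units:

```c
/* Sketch: infer that the micro has reset by checking whether the
 * RAM-only AUTO_BOOT setting survived. Assumes we wrote 2
 * (AUTO_BOOT_VBAT_SMART) to RAM at deploy time and that the flash
 * default on these units is a different value. */
#include <stdio.h>
#include <stdint.h>
#include "lifepo4wered-data.h"

#define EXPECTED_AUTO_BOOT 2  /* AUTO_BOOT_VBAT_SMART, set in RAM */

int main(void)
{
  int32_t ab = read_lifepo4wered(AUTO_BOOT);
  if (ab != EXPECTED_AUTO_BOOT) {
    printf("AUTO_BOOT = %d (expected %d): micro has reset since deploy\n",
           (int)ab, EXPECTED_AUTO_BOOT);
    write_lifepo4wered(AUTO_BOOT, EXPECTED_AUTO_BOOT);  /* re-apply */
    return 1;
  }
  return 0;
}
```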

Q-ten commented 3 months ago

Having gone through all that, I think we've ruled out these errata as causing a lockup. But something is causing lockups. It's something I see very occasionally on deployed units (all with solar, connected to battery, some with the built-in battery), but it's actually quite common when inserting/connecting the battery.

If a human is sliding in a battery, making and breaking contact a few times while the capacitors fill... maybe the supply voltage rises slowly enough to trigger BCL13?

Q-ten commented 3 months ago

I see that test point D is connected to _RST. But it can be configured as an interrupt. Would it be possible to use that test point to reset the device if it gets locked?

xorbit commented 3 months ago

Just a couple more thoughts.

Q-ten commented 3 months ago

Thanks for these ideas.

That's a good thought on the first point, but no, it really did seem to reset approx 9.5 hrs earlier, when no-one was near it and it was connected to batteries. When the LiFePO4wered restores the time, the syslog records the timestamp in seconds; it was around 34000 when I powered it on manually. Another time, when there was a reset due to disconnecting/reconnecting batteries, I checked the same message and it was about 23 seconds, which makes sense: the batteries had just been plugged in before I powered it on.

Yes, I'm considering alternative topologies for the 4G dongle. I actually already have a boost module connected to VBSW and another RPi-controlled MOSFET to control a flash, so it's not out of the question.

I'm seeing the frozen state quite a bit now. I've literally just had another call about two units that won't power on.

I'm considering attaching an external RTC to periodically reset the micro with the D pad. I'd really rather get to the bottom of it though.

Q-ten commented 3 months ago

I have some more information.

I've tested a couple of units (rev 8) that were returned and discovered that they leak current when off. One unit was leaking about 6mA and the other about 3mA. On the unit leaking 6mA, VBSW was also not working: no VBAT on VBSW when the unit turned on. (Maybe that's suggestive?) Other than that, both units appeared to work: they responded to the power button and would charge, with the charge LED turning on, and I observed reverse current flow into the battery when charging. I checked with two battery voltages, 3.25V and 2.98V. The leakage current was lower with the 2.98V battery, almost in proportion to the voltage reduction.

To test the current leak, I disconnected the LiFePO4wered from the rest of my circuitry and used alligator clips to connect the multimeter, battery, and LiFePO4wered.

I've also been able to test a third unit (from a previous batch using rev 7) that had locked up, and found that it was drawing 111uA. After not being used for a few months, it wouldn't turn on and required the batteries to be removed to get it going again. After that, it worked as normal. 0.1mA is far less than the 3-6mA above, but over months (rather than hours or days) it would lead to the same result.
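Rough numbers, assuming ~3000mAh of usable capacity for my two parallel 18650s (my assumption):

$$
\frac{3000\ \text{mAh}}{6\ \text{mA}} \approx 500\ \text{h} \approx 3\ \text{weeks}, \qquad
\frac{3000\ \text{mAh}}{0.111\ \text{mA}} \approx 27000\ \text{h} \approx 3\ \text{years}
$$

So a full pack survives years at 0.1mA, but a pack already down to a few hundred mAh would go flat in a few months, which matches what I saw.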

On these units I also tested the resistance between the battery terminals with no battery connected: megaohms.

For a unit that had low charge, it's possible that the ~5mA drain would take the battery voltage way down. I left one unit on and came back the next day to find the battery at 1.5V. It obviously would have turned itself off at the VBAT_SHDN voltage, but then continued to decline. It only took 10mAh or so of charging to get it back to recognisable levels close to 3V, so it's plausible that the mA-level drain took it that low overnight. And if the battery voltage goes that low, it's very likely to reset the micro and potentially lock it up (say, due to slowly increasing voltage from solar input, as per BCL13).

I've gone over the schematic and can't see where this current would be leaking. Today I received two more units from Mouser (rev 8), tested them straight from the packaging, and they both drew ~3.5uA (bang on the figure in the LiFePO4wered user guide). In future, I'll test the drain of units when I receive them, and again before they get deployed to the field.

I was able to probe the voltage on the EN pin of U4 because it's big enough. It's fine: 0V when off, and VBAT - 0.01V when powered.

3mA (the current I observed on the rev 8 unit where VBSW was working) is right where you'd expect the micro to sit when it's active (at 12MHz), according to the datasheet. Could something be preventing it from going to sleep? But I don't see how that would explain the other two units at 6mA and 0.1mA.

I'm tempted to remove U4 on the unit where VBSW is not working to see if it has an effect on the leakage current. It would be very interesting if that happens to drop it to 3mA as well.

Q-ten commented 3 months ago

Well, this is interesting. I removed U4 on the unit where VBSW wasn't working... and it fixed the current leak.

I tested immediately before, and the current leak was 6.04mA on a 3.2V battery. Immediately after (same battery, same setup) there was no current leak; it was giving me the ~4uA it should.

xorbit commented 3 months ago

Thanks for digging into this and reporting back, very interesting result! It seems like we're zeroing in on the culprit. And indeed, if that part is drawing mA of current that the micro can't turn off, the battery will over-discharge and the slow power ramp erratum might cause the micro to become unresponsive. So it seems the VBSW load switch on that board was bad from the start or got damaged in the application. Very useful information. Now I recall from your previous comments that, unlike most of my customers, you actually use VBSW. Is it possible that long wiring or highly reactive loads produced switching transients that damaged U4? Another, more concerning possibility is that I somehow ended up with counterfeit parts for U4... 🤔

Q-ten commented 3 months ago

I don't think I was asking much of U4. It would have been drawing about 500mA from the battery, but it does see an inductor: it's connected to a boost converter module that then powers a flash through a low-side MOSFET controlled by the RPi. The flash is a ring of LEDs. The boost converter is one of those cheap blue modules based on the MT3608. The flash is only on for short periods, usually a few seconds.

The previous batch of products had the LED boost module connected to VOUT, but this time round I thought that powering it directly from VBSW would be more efficient than using the boosted 5V from VOUT, so the new batch was connected to VBSW. (Turns out I was wrong; it's actually a bit more efficient the first way I'd done it.) My point is that I've had two batches, and the first didn't touch VBSW at all. Yet one of those units, I know, experienced a lockup and is leaking 0.1mA.

Or rather, I should say, "was". I removed its U4 chip and the 0.1mA leak disappeared. I then did the same with the board with the 3mA leak, and its leak disappeared as well. These are all the problem units I have access to at this time. The two brand-new units (rev 8 from Mouser) don't have the leak.

So I think this confirms that the culprit has been the U4 chip. I hadn't considered counterfeit chips, but I think that might be the most plausible explanation. I have pored over the datasheet for that part and it looks pretty robust, certainly for what I was using it for. And given the 0.1mA leak was on a unit that never had anything connected to VBSW, I don't think electrical stress from my application can explain it. 0.1mA isn't much, but it's still way off the specced value for that part.

Now I'm also going to see if I can replicate the lockup issue by depleting the battery and putting the unit outside to catch the dawn.