zxdavb / ramses_cc

HA integration for CH/DHW and HVAC systems that use the RAMSES II RF protocol
GNU General Public License v3.0
80 stars 17 forks source link

Strange behaviour by OTB appliance (e.g. Remeha heatpump) #53

Open henkvdt opened 2 years ago

henkvdt commented 2 years ago

Description of setup:

Evohome controller, connected to Remeha Elga Ace heatpump with R8810A1018 (Opentherm module) Firmware version: WiFi module: 02.00.17.00, Application module: 02.00.19.33

ramses_cc ver: 0.21.0

Home Assistant config:

ramses_cc:
  serial_port: 
    port_name: /dev/serial/by-id/usb-FTDI_FT232R_USB_UART_A50285BI-if00-port0
    baudrate: 57600  #  default: 115200
  packet_log: 
    file_name: packet.log
    rotate_backups: 30

Nanocul from this supplier: https://schlauhaus.biz/en/product/nanocul-868/ Firmware: 0.7.1

Problem

After setup of the nanocul and software, all the sensors and entities became visible in Home Assistant. However, the heatpump started showing strange behaviour. Even though there is a heat demand from the Evohome to the heatpump, and output/return watertemp are below set point, every 6-10 minutes the heatpump stops heating for about 2 minutes. After this, the heatpump turns back on. The display on the heatpump doesn't show anything abnormal or error codes.

As soon as I disabled the ramses_cc plugin in Home Assistant, the heatpump started operating normally again. This can be seen in the energy diagram as attached. There was constant heatdemand most of the time between 11.35 and 13.50, but the heatpump was switching on and off. The peak around 13.40 was caused by setting the Evohome to 25 degrees to force a very high setpoint, but here the heatpump also stopped way before the setpoint was reached.

I disabled the plugin at 13.50. The Evohome send heat demand between 13.50 - 14.15 and 14.40 - 15.08. Between these timestamps, there is clear correlation between heat demand and power usage. Between 14.56 and 15.08 there was heat demand, but the heatpump reached it's setpoint. Therefore only the circulation pump was active which causes the lower power usage.

elga_energy_usage

Outdoor temperature is between 10-15 degrees celcius, so defreeze cycles are not relevant.

It seems like the nanocul is interfering the messages between the Evohome and heatpump (connected through R8810A1018) Could this be caused by active probing?

zxdavb commented 2 years ago

For clarity: HA -> ramses_cc -> ramses_rf -> evofw3 -> nanoCUL -> RF -> controller and/or OTB.

Evohome controller, connected to Remeha Elga Ace heatpump with R8810A1018 (Opentherm module)

There has been no testing with OTBs (R8810A, R8820A) and BDRs (BDR91T) connected to heatpumps.

It seems like the nanocul is interfering the messages between the Evohome and heatpump

However, ramses_rf, even in monitor mode, shouldn't cause these sort of issues.

For now, please run ramses_rf in read-only mode. In the configuration.yaml file in HA:

ramses_cc:
  ramses_rf:
    disable_sending: true
    enable_eavesdrop: true

If you do the above, ramses_rf will not send any packets, and thus cannot interfere with any conversation between the controller and the OTB.

Please PM me the packet logs created during the time you describe, above.

zxdavb commented 2 years ago

In response to your PM:

When restarting in receive only mode, these parameters are not visible anymore in HA: fault present and oem code. These parameters were visible when starting in normal mode.

The above is expected behaviour - there is no way to get this data, except to ask the OTB (R8810A). There are other (albeit more esoteric) examples of such data (e.g. number of CH pump starts).

The data you can see is obtained by eavesdropping packets exchanged from between the controller and the OTB.

zxdavb commented 2 years ago

2022-10-21 09:12:22.680 WARNING (MainThread) [ramses_rf.dispatcher] RP --- 01:060555 18:071692 --:------ 000C 018 0408001107C204080010831304080013F768 < CorruptStateError(Inconsistent schema: 04:259944 (TRV): None cant change parent (01:060555_00 (RAD)_00 to 01:060555_04 (RAD)_04)(try restarting the client library))

You are getting this error as your system is mis-configured. The TRV 04:259944 is configured as an actuator in two zones:

09:12:22.403 || CTL:060555 |  18:071692 | RP | zone_devices     | 0008 || {'zone_type': '08', 'zone_idx': '00', 'device_role': 'rad_actuator', 'devices': ['04:259944']}
09:12:22.677 || CTL:060555 |  18:071692 | RP | zone_devices     | 0408 || {'zone_type': '08', 'zone_idx': '04', 'device_role': 'rad_actuator', 'devices': ['04:067522', '04:033555', '04:259944']}

The device itself beleives it to be in zone 00:

09:12:23.576 || TRV:259944 |  01:060555 |  I | setpoint         |  00  || {'zone_idx': '00', 'setpoint': 18.0}

The only way to correct this is to destroy & re-create the mis-configured zone. I am not sure of the value of doing so as, AFAIK, this is not going to be causing you any issues.

zxdavb commented 2 years ago

At 9.25 I restarted in receive only mode and the normal usage/power pattern became visible again.

While I am still analysing the packet log, you can try this:

ramses_cc:
  ramses_rf:
    disable_sending: false
  block_list:
    10:066416: {}

This will prevent ramses_rf picking up any packet sent to/from the OTB and would be interesting to know if you get 'normal usage/power pattern' with it, or not.

Once you've determined that, you may want to put it back to read-only mode until we learn more?

Also, I strongly recommend you use a white_list:

zxdavb commented 2 years ago

The packet logs you have provided:

henkvdt commented 2 years ago

I've tested the following configurations:

ramses_cc:
  ramses_rf:
    disable_sending: true
    enable_eavesdrop: true

Result: issue does not occur anymore

ramses_cc:
  ramses_rf:
    disable_sending: false
  block_list:
    10:066416: {}

Result: issue does not occur anymore

ramses_cc:
  ramses_rf:
    xxx_xxxxx:
      xxxxxxx:
        " 2401 001 ": " 7FFE 001 "

Result: issue does occur again.

I assume the last config filters out a specific TX message? Would it make sense to change the filter to other commands and test in order to pinpoint which message is causing this issue?

zxdavb commented 2 years ago

I assume the last config filters out a specific TX message? Would it make sense to change the filter to other commands and test in order to pinpoint which message is causing this issue?

Correct. If we assume it is only the one command class (e.g. 2401) that is causing the issue, then - as your symptoms is so reproducible - you should be able to isolate which command class by walking through them all.

However, please check first that no 2401 packets are in the packet log, but there are 7FFE packets. (i.e. that the filter worked as expected).

Then the obvious first step would be to try this:

outbound:
  " 18:.* 10:.* ": " 18:000730 10:000000 "  # keep a valid packet for logging

... which is more targeted than read_only (my expectation is you won't have the issue).

Then these:

outbound:
  "18:.* 10:.* 3220 (...) ": "18:000730 10:000000 3220 \\1 "  # keep a valid packet for logging

For all of these, one at a time:

            "1081": "max_ch_setpoint",
            "10A0": "dhw_params",
            "1290": "outdoor_temp",
            "1300": "ch_pressure",
            "1FD4": "opentherm_sync",
            "22D9": "boiler_setpoint",
            "3150": "heat_demand",
            "3200": "boiler_output",
            "3210": "boiler_return",
            "3220": "opentherm_msg",
            "3EF0": "actuator_state"
zxdavb commented 2 years ago

@henkvdt

I don't know where you are with this, but you can try release 0.22.x, whihc has been refactored.

In particular, you can use the use_native_ot feature to prevent OpenTherm messages being sent:

ramses_cc:
  ramses_rf:
    use_native_ot: prefer  # or avoid

... start with prefer, then try avoid if you still have an issue.

henkvdt commented 2 years ago

I have been testing for the past week and can confirm with certainty that this issue is caused by (and only by) message type: 18:. 10:. 3220

Meanwhile I've acquired a diagnostics tool from the manufacturer to readout detailed status information and found out that when these type of messages are not filtered out, the water circulation pump is set to 6 liter/minute instead of 17 liter/minute (this value of 17 is default value configured in the settings menu of the unit). Due to this low circulation speed, the heat generated from the unit does not leave the heatexchange unit fast enough which makes the output temperature overshoot the setpoint within a few minutes. The unit then goes into pause mode and starts again a few minutes later. This is the saw teeth pattern you can see in the energy consumption.

zxdavb commented 2 years ago

I have been testing for the past week and can confirm with certainty that this issue is caused by (and only by) message type: 18:. 10:. 3220

Incredible work.

For now, ignore my advice above. Try avoid, and if you still have a problem, try never:

ramses_cc:
  ramses_rf:
    use_native_ot: avoid  # or never

The thing is: I am convinced - from my understanding of RAMSES & of OpenTherm - that ramses_rf cannot be causing this behaviour (although all evidence is that it is).

Here is the code used to make OT packets:

    def get_opentherm_data(cls, otb_id: _DeviceIdT, msg_id: Union[int, str], **kwargs):
        """Constructor to get (Read-Data) opentherm msg value (c.f. parser_3220)."""

        msg_id = msg_id if isinstance(msg_id, int) else int(msg_id, 16)
        payload = f"0080{msg_id:02X}0000" if parity(msg_id) else f"0000{msg_id:02X}0000"
        return cls.from_attrs(RQ, otb_id, Code._3220, payload, **kwargs)

The 80/00 above is the payload header - it is the parity bit, like so:

PMMM0000

What this tells is that all such commands use the msg-type of Read-Data (MMM is: 0b000) and never Write-Data (MMM would be 0b001) (as far as I see it).

So, I'm stumped.

What would be great would be if you could use the same technique identify which OT message is causing the issue, but there are a lot of them. I had a look, and something stands out, could you try:

"18:.* 10:.* 3220 (... ....01)": "18:000730 10:000000 3220 \\1"

Otherwise, just start with avoid.

I would love to see some packet logs.

henkvdt commented 2 years ago

I'll test the new filter ("18:.* 10:.* 3220 (... ....01)": "18:000730 10:000000 3220 \\1") for the coming days to see what happens. I've send the packet logs for the past days to your email.

henkvdt commented 2 years ago

Tested new filter:

Time 08.04:35

Time 08:36:52

Problem does occur.

Power pattern:

afbeelding

Logging data from heatpump diagnostics tool (08:30 - 08:36, similar for complete periode with same filter)

Drops in orange line: set point output temperature Drops in green line: internal setpoint. Red and blue lines: actual input and output temperature.

Because of the interruptions, the water temperature hardly increases.

afbeelding

Logging data from heatpump diagnostics tool (09:10 - 09:18), similar for complete periode with same filter)

Normal pattern where the water temperature steadily increases.

afbeelding

At this time, I starting to doubt my previous conclusion that the the software influences the pump speed which overshoots the setpoint. From the graph above it seems the setpoints are changed. However it does conform there is clear relation between packet filter and heatpump behavior. It might also be cause by changed other variables such as the outside temperature which is a lot lower at this time.

Will send the packet logs and heatpump logging for offline viewing by email.

zxdavb commented 2 years ago

Try this filter:

  outbound:
    "18:.* 10:.* 3220 (... ....0E)": "18:000730 10:000000 3220 \\1"
henkvdt commented 2 years ago

Issue occurs with this filter too, although the pattern from the diagnostics tool and power usage seems a little bit different than before. Don't know if this is significant. Maybe it's caused the a combination of messages and the last filter is a part of this?

afbeelding

afbeelding

zxdavb commented 2 years ago

Issue occurs with this filter too, although the pattern from the diagnostics tool and power usage seems a little bit different

Seems markedly different.

I'd say try them both:

  outbound:
    "18:.* 10:.* 3220 (... ....01)": "18:000730 10:000000 3220 \\1"
    "18:.* 10:.* 3220 (... ....0E)": "18:000730 10:000000 3220 \\1"

These two are my only hope.

We've established it's a 3220 (OT) packet. The above two are the only two where they're sent Data-Read, but the protocol says they should only be Data-Write.

henkvdt commented 2 years ago

Results with these filters (01 & 0E) was same same (issue did occur) so I tried all of the 3220 possible data_id values and am sure my issue is triggered by message type 00 (status):

" 18:.* 10:.* 3220 (... ....00)": "18:000730 10:000000 3220 \\1 "

From the little knowledge that I've gained in the past weeks of the opentherm and ramses protocols I understand this doesn't make any sense at all....? However, it is easy to reproduce.

After disabling the filter, the following messages were sent and the issue occurred:

16:31:10.631 || HGI:XXXXXX | OTB:XXXXXX | RQ | opentherm_msg    |  00  || {'msg_id': 0, 'msg_type': 'Read-Data', 'msg_name': 'status_flags', 'description': 'Status'}
16:31:10.702 || OTB:XXXXXX | HGI:XXXXXX | RP | opentherm_msg    |  00  || {'msg_id': 0, 'msg_type': 'Read-Ack', 'msg_name': None, 'value': [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0], 'description': 'Status'}
16:32:10.784 || HGI:XXXXXX | OTB:XXXXXX | RQ | opentherm_msg    |  00  || {'msg_id': 0, 'msg_type': 'Read-Data', 'msg_name': 'status_flags', 'description': 'Status'}
16:32:11.109 || OTB:XXXXXX | HGI:XXXXXX | RP | opentherm_msg    |  00  || {'msg_id': 0, 'msg_type': 'Read-Ack', 'msg_name': None, 'value': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0], 'description': 'Status'}

If its true that it doesn't make sense, could it be caused by a, let's say, not 100% perfect implementation of the protocol at the heatpump level?

Is there anything left to test?

zxdavb commented 2 years ago

so I tried all of the 3220 possible data_id values and am sure my issue is triggered by message type 00 (status):

" 18:.* 10:.* 3220 (... ....00)": "18:000730 10:000000 3220 \\1 "

Well, that is not what I expected!

From the little knowledge that I've gained in the past weeks of the opentherm and ramses protocols I understand this doesn't make any sense at all....? However, it is easy to reproduce.

Absolutely correct: it doesn't make sense (I re-read the protocol just now to check).

OT data_id 0 (0x00) is definitely read-only. Even if it wasn't, you can see the payload has Read-Data and not Write-Data, so - like you - I also wonder if:

it (could) be caused by a, let's say, not 100% perfect implementation of the protocol at the heatpump

FWIW, I have seen something similar before, which is why I send a 3220|01|Read-Data when it shouldn't be supported (but is).

~(Currently) there is no way to stop these particular OT msgs with the use_native_ot: avoid feature (or even use_native_ot: never), so you'll have to keep the above packet filter in.~

Thanks for your work.

[EDIT] From 0.31.7, use_native_ot: never will strictly stop all 3220 packets sent by ramses_rf.

zxdavb commented 2 years ago

I'm tempted to remove 3220|00 even for use_native_ot: never, but you'd miss out on all the other OT data_ids.

As a separate issue: it is probably worth scanning your HP to see what it might respond to - ramses_rf is only designed to scan OTBs (Honeywell OT gateways fro CH/DHW systems).

Do you want a hand figuring out how to do that?

The data_id values only go from 00 (0) to 7F (127), and you also want to look at the FSBs and the TSPs.

henkvdt commented 2 years ago

Do you want a hand figuring out how to do that?

Yes please ! :)