wirepas / gateway

Wirepas' Linux gateway reference implementation, built on top of Wirepas' Dual-MCU API and Wirepas' gateway-to-backend API
Apache License 2.0

Running into GatewayResultCode.GW_RES_SINK_OUT_OF_MEMORY when sending a lot of requests #255

Open vespakoen opened 1 year ago

vespakoen commented 1 year ago

Describe the bug

I am trying to "spam" MQTT with a lot of messages to send into the network, to see what the limits are. I am running mosquitto, wp-sink-service, wp-transport-service and wp-dbus-service on a Raspberry Pi 400, with a NINA B3 USB Dongle.

I am using the latest wirepas_mqtt_library for Python with something like:

    # GatewayResultCode is exposed by the wirepas_mesh_messaging package
    # (a dependency of wirepas_mqtt_library); client is the
    # WirepasNetworkInterface instance, and parameters/destination come
    # from the FastAPI request (see the self-contained sketch below).
    res = client.send_message(
        gw_id=parameters.gw_id,
        sink_id=parameters.sink_id,
        dest=destination,
        src_ep=parameters.src_ep,
        dst_ep=parameters.dst_ep,
        payload=bytes.fromhex("8A")
    )
    if res != GatewayResultCode.GW_RES_OK:
        print("Sending data synchronously failed: res=%s" % res)
    else:
        print("Sending data synchronously succeeded")

I have a script that spams a FastAPI endpoint which calls the code above:

    #!/usr/bin/env python

    import requests
    from concurrent import futures
    from itertools import repeat

    # 500 copies of the same request body for the /execute endpoint
    wpparameters_list = repeat({
        "gw_id": "my_wp_gateway",
        "sink_id": "sink0",
        "dest": 1,
        "src_ep": 1,
        "dst_ep": 1
    }, 500)

    def make_request(wpparameters):
        # timeout is in seconds; verify=False only matters for https URLs
        r = requests.post('http://localhost:3000/execute', timeout=4000,
                          verify=False, json=wpparameters)
        print(r.status_code)

    # 10 worker threads hammering the endpoint concurrently
    with futures.ThreadPoolExecutor(max_workers=10) as executor:
        res = executor.map(make_request, wpparameters_list)

The res printed by the backend is:

Sending data synchronously failed: res=GatewayResultCode.GW_RES_SINK_OUT_OF_MEMORY

In the Raspberry Pi logs I find a bunch of:

wp-sink-service-1       | 2023-06-23 17:30:25,100 | [ERROR] wpc:Cannot send data. Dualmcu error code: 4
wp-sink-service-1       | 2023-06-23 17:30:25,100 | [ERROR] Data:Cannot send data: 21
wp-transport-service-1  | 2023-06-23 17:30:25,102 | [ERROR] wirepas_gateway@sink_manager.py:86:Cannot send message err=21

I have been looking through the c-mesh-api code but cannot find where the out-of-memory error is returned. I am also wondering whether I should put a queue between my backend and MQTT, so that messages do not arrive faster than the sink can handle them (a sketch of that idea follows below). Basically I am looking for a way to make the sink send out as many messages as possible ;)
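
Something like this is what I have in mind (the names are made up; client is the WirepasNetworkInterface from the snippet above):

    import queue
    import threading

    from wirepas_mesh_messaging import GatewayResultCode

    send_queue = queue.Queue(maxsize=100)  # put() blocks once 100 sends are pending

    def sender_worker():
        # Drain the queue one message at a time so the sink is never flooded.
        while True:
            params = send_queue.get()
            res = client.send_message(**params)
            if res != GatewayResultCode.GW_RES_OK:
                print("Sending failed: res=%s" % res)
            send_queue.task_done()

    threading.Thread(target=sender_worker, daemon=True).start()

    # The FastAPI endpoint would call this instead of client.send_message():
    def enqueue_send(params):
        send_queue.put(params)  # blocks when the queue is full (backpressure)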

To Reproduce

Steps to reproduce the behavior:

  1. Start gateway services
  2. Send a whole lot of messages
  3. See error

Expected behavior

No errors? ;)

Platform details:

vespakoen commented 1 year ago

After some more testing I figured out that it runs alright for a while, and then at one point it starts giving me GW_RES_SINK_OUT_OF_MEMORY, even when sending the requests slowly. So it seems that when I send a lot of requests at the same time, the USB stick / sink service(?) gets into a weird state and starts returning those errors.

Looks more like a memory leak now?

vespakoen commented 1 year ago

Recompiled sink service with DEBUG output in data.c, which didn't bring many new insights (log below).

wp-sink-service-1       | 2023-06-23 21:24:25,814 | [DEBUG] Data:Message to send on EP 1 from EP 1 to 0xffffffff size = 1
wp-sink-service-1       | 2023-06-23 21:24:25,857 | [ERROR] wpc:Cannot send data. Dualmcu error code: 4
wp-sink-service-1       | 2023-06-23 21:24:25,857 | [ERROR] Data:Cannot send data: 21
wp-transport-service-1  | 2023-06-23 21:24:25,859 | [ERROR] wirepas_gateway@sink_manager.py:85:Cannot send message err=21
wp-transport-service-1  | 2023-06-23 21:24:25,942 | [DEBUG] wirepas_gateway@transport_service.py:509:Downlink traffic: sink0 | 16371271045505284465
wp-sink-service-1       | 2023-06-23 21:24:25,945 | [DEBUG] Data:Message to send on EP 1 from EP 1 to 0xffffffff size = 1
wp-sink-service-1       | 2023-06-23 21:24:25,958 | [ERROR] wpc:Cannot send data. Dualmcu error code: 4
wp-sink-service-1       | 2023-06-23 21:24:25,958 | [ERROR] Data:Cannot send data: 21

So I tried to run the c-mesh-api example with some more logging, to rule out D-Bus, and I got something more interesting:

2023-06-24 01:04:35,128 | [DEBUG] SERIAL:Custom bitrate set: 115200
2023-06-24 01:04:35,131 | [INFO] wpc_int:WPC initialized
2023-06-24 01:04:36,120 | [ERROR] wpc_int:Didn't receive answer to the request 0x06 error is: -2
2023-06-24 01:04:36,120 | [INFO] wpc_int:========= confirm_size: -2
2023-06-24 01:04:36,899 | [WARNING] SLIP:Too small packet received (size=3)
2023-06-24 01:04:37,883 | [ERROR] SLIP:Wrong CRC 0x2e3e (computed) vs 0x0d51 (received)
2023-06-24 01:04:37,883 | [ERROR] wpc_int:CRC error in confirm for request 0x04
2023-06-24 01:04:37,883 | [INFO] wpc_int:========= confirm_size: -5
2023-06-24 01:04:37,883 | [ERROR] wpc_int:No poll answer
2023-06-24 01:04:38,623 | [INFO] Main:Stack is stopped
2023-06-24 01:04:38,890 | [ERROR] wpc_int:Didn't receive answer to the request 0x04 error is: -2
2023-06-24 01:04:38,890 | [INFO] wpc_int:========= confirm_size: -2
2023-06-24 01:04:38,890 | [ERROR] wpc_int:No poll answer
2023-06-24 01:04:39,880 | [ERROR] wpc_int:Didn't receive answer to the request 0x0d error is: -2
2023-06-24 01:04:39,880 | [INFO] wpc_int:========= confirm_size: -2
2023-06-24 01:04:39,880 | [ERROR] Main:Cannot set node address
2023-06-24 01:04:39,880 | [WARNING] linux_plat:Exiting dispatch thread
2023-06-24 01:04:40,870 | [ERROR] wpc_int:Didn't receive answer to the request 0x04 error is: -2
2023-06-24 01:04:40,870 | [INFO] wpc_int:========= confirm_size: -2
2023-06-24 01:04:40,870 | [ERROR] wpc_int:No poll answer
2023-06-24 01:04:40,870 | [WARNING] linux_plat:Exiting polling thread

I guess the answer to the request is not being received, so it keeps retrying the send?

GwendalRaoul commented 1 year ago

Hello @vespakoen,

Why do you think it is a bug? A Wirepas network has limitations in terms of downlink messages.

Downlink packets are queued in the sink before being sent to the network, and once those buffers are full, the sink answers that its memory is full (GW_RES_SINK_OUT_OF_MEMORY). Depending on your network topology, it may take time to flush them.
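
In practice you can treat GW_RES_SINK_OUT_OF_MEMORY as backpressure and retry with a delay, so the sink has time to flush its buffers. A minimal sketch (illustrative, not part of the gateway code; timings are arbitrary):

    import time

    from wirepas_mesh_messaging import GatewayResultCode

    def send_with_backoff(wni, retry_delay_s=0.5, max_tries=20, **params):
        # Retry while the sink reports that its downlink buffers are full.
        for _ in range(max_tries):
            res = wni.send_message(**params)
            if res != GatewayResultCode.GW_RES_SINK_OUT_OF_MEMORY:
                return res  # GW_RES_OK or an unrelated error
            time.sleep(retry_delay_s)  # give the sink time to flush
        return res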

Best regards, Gwendal

vespakoen commented 1 year ago

Hi Gwendal,

Thanks for looking at this.

Sorry for marking this as a bug; it is probably just a misunderstanding on my part ;)

I am trying to figure out:

  1. where this queue is, and whether its size can be increased
  2. whether there is a way to check how full the queue is, so I can slow down the "spamming & losing messages"

Maybe what I am looking for is a blocking way to use the c-mesh-api, so I can make it process as many messages as possible without buffering anything at all?

All the best,