net: "queue: Use k_poll if enabled" commit regressed BSD Sockets performance #2422

Open nashif opened 7 years ago

nashif commented 7 years ago

Reported by Paul Sokolovsky:

With a sample "http_server" written in MicroPython, with BSD Sockets as the underlying implementation, I get:

With "git checkout 84db641de66463550fb41f083dbdf0b02fc78284"

$ ab -n200 http://192.0.2.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests

Server Software:        
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        27 bytes

Concurrency Level:      1
Time taken for tests:   52.057 seconds
Complete requests:      200
Failed requests:        190
   (Connect: 0, Receive: 0, Length: 190, Exceptions: 0)
Total transferred:      9090 bytes
HTML transferred:       5690 bytes
Requests per second:    3.84 [#/sec] (mean)
Time per request:       260.283 [ms] (mean)
Time per request:       260.283 [ms] (mean, across all concurrent requests)
Transfer rate:          0.17 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        6  249 431.0      7    1029
Processing:     8   11   2.6     11      29
Waiting:        8   10   2.1     10      27
Total:         15  260 430.9     18    1040

Percentage of the requests served within a certain time (ms)
  50%     18
  66%     21
  75%     36
  80%   1022
  90%   1025
  95%   1026
  98%   1036
  99%   1037
 100%   1040 (longest request)

With "git checkout 84db641de66463550fb41f083dbdf0b02fc78284^" (note ^ at the end)

$ ab -n200 http://192.0.2.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests

Server Software:        
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        27 bytes

Concurrency Level:      1
Time taken for tests:   3.712 seconds
Complete requests:      200
Failed requests:        190
   (Connect: 0, Receive: 0, Length: 190, Exceptions: 0)
Total transferred:      9090 bytes
HTML transferred:       5690 bytes
Requests per second:    53.88 [#/sec] (mean)
Time per request:       18.561 [ms] (mean)
Time per request:       18.561 [ms] (mean, across all concurrent requests)
Transfer rate:          2.39 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        6    7   1.8      7      23
Processing:     8   11   2.8     11      38
Waiting:        7   10   2.4      9      35
Total:         15   19   3.3     18      44

Percentage of the requests served within a certain time (ms)
  50%     18
  66%     19
  75%     19
  80%     19
  90%     21
  95%     23
  98%     31
  99%     34
 100%     44 (longest request)

The sockets implementation relies heavily on FIFOs (and uses k_poll on a FIFO too). Per the above, this is a clear case of "some FIFO operation takes +1 s with the k_poll-based k_fifo_get vs. the previous wait_q-based implementation".

https://github.com/zephyrproject-rtos/zephyr/commit/84db641de66463550fb41f083dbdf0b02fc78284
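
For context, below is a condensed sketch of the k_poll-based path that the commit above switches k_queue_get()/k_fifo_get() to when CONFIG_POLL is enabled (paraphrased from a reading of the kernel sources of that era, not verbatim): instead of pending on the queue's wait_q, the caller polls for "data available" and then dequeues.

/* Condensed sketch, not verbatim kernel code: the k_poll-based
 * k_queue_poll() helper that k_queue_get()/k_fifo_get() fall back to
 * when CONFIG_POLL is enabled, per the commit linked above.
 */
static void *k_queue_poll(struct k_queue *queue, s32_t timeout)
{
	struct k_poll_event event;
	unsigned int key;
	void *data;
	int err;

	k_poll_event_init(&event, K_POLL_TYPE_FIFO_DATA_AVAILABLE,
			  K_POLL_MODE_NOTIFY_ONLY, queue);
	event.state = K_POLL_STATE_NOT_READY;

	err = k_poll(&event, 1, timeout);	/* block until data or timeout */
	if (err) {
		return NULL;			/* timeout: no data */
	}

	key = irq_lock();
	data = sys_slist_get(&queue->data_q);	/* dequeue the ready item */
	irq_unlock(key);

	return data;
}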

(Imported from Jira ZEP-2593)

nashif commented 7 years ago

by Paul Sokolovsky:

Luiz Augusto von Dentz : FYI

I'm working on a native Zephyr testcase.

nashif commented 7 years ago

by Luiz Augusto von Dentz:

So what exactly is this, some sort of benchmark? Anyway, the k_poll version should be getting similar numbers, so perhaps something else is causing the problem; 1 second is awfully long for k_fifo_get.

nashif commented 7 years ago

by Paul Sokolovsky:

Luiz Augusto von Dentz :

So what exactly is this some sort of benchmark?

It uses ab, ApacheBench, which is indeed a well-known and accessible HTTP benchmarking tool. But I use it not as a benchmark, rather as an HTTP request generator for a kind of smoke/load testing: if the server can sustain 1000 requests, it moves on to the next stage of testing with 10000 requests, and if it can sustain that, then perhaps it works (based on the anecdote about production testing of Android phones, where a build that survives 24 hours of the "monkey" tool gets the "ship it!" command).

So, as promised, I prepared native Zephyr example: https://github.com/zephyrproject-rtos/zephyr/pull/1378

The way to test it:

  1. Build that sample, run in qemu, run:
$ ab -n100 http://192.0.2.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient).....done

Server Software:        
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        19 bytes

Concurrency Level:      1
Time taken for tests:   134.229 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      7800 bytes
HTML transferred:       1900 bytes
Requests per second:    0.74 [#/sec] (mean)
Time per request:       1342.293 [ms] (mean)
Time per request:       1342.293 [ms] (mean, across all concurrent requests)
Transfer rate:          0.06 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        5  683 477.0   1013    1032
Processing:     8  659 1254.8     10    3139
Waiting:        7  657 1254.9      9    3138
Total:         14 1342 1495.2   1024    4154

Percentage of the requests served within a certain time (ms)
  50%   1024
  66%   1026
  75%   1027
  80%   4095
  90%   4097
  95%   4097
  98%   4124
  99%   4154
 100%   4154 (longest request)
  2. With git checkout 84db641de66463550fb41f083dbdf0b02fc78284, I get the same picture.

  3. With git checkout 84db641de66463550fb41f083dbdf0b02fc78284^ (note the ^), I get:

$ ab -n10000 http://192.0.2.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:        
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        19 bytes

Concurrency Level:      1
Time taken for tests:   173.989 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      780000 bytes
HTML transferred:       190000 bytes
Requests per second:    57.47 [#/sec] (mean)
Time per request:       17.399 [ms] (mean)
Time per request:       17.399 [ms] (mean, across all concurrent requests)
Transfer rate:          4.38 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        5    7   1.6      7      26
Processing:     7   10   1.8     10      29
Waiting:        6    9   1.6      8      27
Total:         13   17   2.8     17      45

Percentage of the requests served within a certain time (ms)
  50%     17
  66%     18
  75%     18
  80%     19
  90%     21
  95%     23
  98%     25
  99%     28
 100%     45 (longest request)
nashif commented 7 years ago

by Paul Sokolovsky:

Anyway, the k_poll version should be getting similar numbers, so perhaps something else is causing the problem; 1 second is awfully long for k_fifo_get.

Perhaps something else is causing it, like a race condition or a bug in the scheduler. But https://github.com/zephyrproject-rtos/zephyr/commit/84db641de66463550fb41f083dbdf0b02fc78284 is exactly what triggered this regression, so I'd appreciate help with investigating it (I, for one, have no idea how to approach it).

nashif commented 7 years ago

by Paul Sokolovsky:

More info: with the following patch:

--- a/kernel/queue.c
+++ b/kernel/queue.c
@@ -205,7 +205,9 @@ static void *k_queue_poll(struct k_queue *queue, s32_t timeout)

        event.state = K_POLL_STATE_NOT_READY;

+printk("p in %p\n", queue);
        err = k_poll(&event, 1, timeout);
+printk("p out\n");
        if (err) {
                return NULL;
        }

Slowdowns happen much less frequently: none for 100 requests, and for 1000 it's:

Percentage of the requests served within a certain time (ms)
  50%     16
  66%     17
  75%     18
  80%     18
  90%     20
  95%     21
  98%     23
  99%     24
 100%   1039 (longest request)
nashif commented 7 years ago

by Paul Sokolovsky:

Actually, if only the printk("p out\n"); line is active, that's enough to alleviate the slowdown.

nashif commented 7 years ago

by Paul Sokolovsky:

Further update: all of the above happens on qemu_x86. I tested on real hardware, frdm_k64f, and can't reproduce it there:

Complete requests:      10000
Failed requests:        0
Total transferred:      780000 bytes
HTML transferred:       190000 bytes
Requests per second:    155.79 [#/sec] (mean)
Time per request:       6.419 [ms] (mean)
Time per request:       6.419 [ms] (mean, across all concurrent requests)
Transfer rate:          11.87 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.1      1       4
Processing:     2    6   0.2      6      13
Waiting:        2    6   0.2      6      13
Total:          4    6   0.1      6      14

Percentage of the requests served within a certain time (ms)
  50%      6
  66%      6
  75%      6
  80%      6
  90%      6
  95%      6
  98%      7
  99%      7
 100%     14 (longest request)

This is the worst situation of all, of course: when you can't trust your development tool (the emulator). However, there are still other points of difference: the architecture and the interface speed (SLIP vs. Ethernet). Will try on qemu_cortex_m3 next.

nashif commented 7 years ago

by Paul Sokolovsky:

Will try on qemu_cortex_m3 next.

That almost helped. The first run with -n10000 went perfectly. On repeating it, the test was aborted with "connection reset by peer", which is surely a different issue. Scaling down to -n1000, I still regularly get:

  50%     17
  66%     20
  75%     22
  80%     23
  90%     27
  95%     32
  98%     38
  99%     43
 100%   1033 (longest request)

In other words, an unmodified qemu_cortex_m3 behaves like qemu_x86 with the printk added after the k_poll() call (above).

nashif commented 7 years ago

by Andy Ross:

So... the hypothesis is that something in k_poll() (or the k_queue use of k_poll) is failing to wake up synchronously and we're hitting 1000 ms timeouts, but only on some systems? Have we verified that the "slowdowns" are indeed a timeout condition? The printk interaction seems weird; on QEMU that's just a bunch of synchronous register writes, so no changes to interrupt delivery order should be happening. Can you replace the printk() with a k_busy_wait() and have that fix it too?

Do we know which queue is involved? Have we ruled out delays on insert from some other source (seems unlikely I guess).

nashif commented 7 years ago

by Andy Ross:

I spent some time staring at the queue and poll implementations and can't see a race. But here's one shot in the dark:

Ben's code has some subtle locking at the entry to k_poll() that allows the initialization of the (potentially large) poll array to be done without the IRQ lock held for the whole process. I can't see anything wrong with it, but the fact that the printk() above "fixes" the issue implies that the race might be somewhere in this vicinity: a "wakeup event" happens "very soon, maybe already" relative to the k_poll() call, and simply delaying it a little causes poll to hit the latter code path instead of the former.

If you have the rig ready, can you try this experiment and see if it affects anything? It wouldn't be a fix (it simply keeps the lock the whole time and thus is subject to latency problems with big poll arrays), but if it works it helps to localize the problem:

diff --git a/kernel/poll.c b/kernel/poll.c
index eb014ea..d15bbd7 100644
--- a/kernel/poll.c
+++ b/kernel/poll.c
@@ -199,7 +199,6 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)

        key = irq_lock();
        set_polling_state(_current);
-       irq_unlock(key);

        struct _poller poller = { .thread = _current };

@@ -207,7 +206,6 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
        for (int ii = 0; ii < num_events; ii++) {
                u32_t state;

-               key = irq_lock();
                if (is_condition_met(&events[ii], &state)) {
                        set_event_ready(&events[ii], state);
                        clear_polling_state(_current);
@@ -219,11 +217,8 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
                                __ASSERT(0, "unexpected return code\n");
                        }
                }
-               irq_unlock(key);
        }

-       key = irq_lock();
-
        /*
         * If we're not polling anymore, it means that at least one event
         * condition is met, either when looping through the events here or
nashif commented 7 years ago

by Paul Sokolovsky:

Andy Ross: Thanks much for looking into this! Let me start from the end, by trying the patch you gave. Somehow it didn't apply for me as-is, so I redid it manually, catching another instance of irq_lock() to remove; see my exact patch below.

So, unfortunately, it didn't resolve the issue:

Complete requests:      853
Percentage of the requests served within a certain time (ms)
  50%     31
  66%     34
  75%     38
  80%     42
  90%   1025
  95%   1046
  98%   1055
  99%   1058
 100%   4146 (longest request)

The patch:

diff --git a/kernel/poll.c b/kernel/poll.c
index eb014ea51..50add842f 100644
--- a/kernel/poll.c
+++ b/kernel/poll.c
@@ -199,7 +199,7 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)

    key = irq_lock();
    set_polling_state(_current);
-   irq_unlock(key);
+// irq_unlock(key);

    struct _poller poller = { .thread = _current };

@@ -207,7 +207,7 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
    for (int ii = 0; ii < num_events; ii++) {
        u32_t state;

-       key = irq_lock();
+//     key = irq_lock();
        if (is_condition_met(&events[ii], &state)) {
            set_event_ready(&events[ii], state);
            clear_polling_state(_current);
@@ -219,10 +219,10 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
                __ASSERT(0, "unexpected return code\n");
            }
        }
-       irq_unlock(key);
+//     irq_unlock(key);
    }

-   key = irq_lock();
+// key = irq_lock();

    /*
     * If we're not polling anymore, it means that at least one event
@@ -259,7 +259,7 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
     * added to the list of events that occurred, the user has to check the
     * return code first, which invalidates the whole list of event states.
     */
-   key = irq_lock();
+// key = irq_lock();
    clear_event_registrations(events, last_registered, key);
    irq_unlock(key);
nashif commented 7 years ago

by Paul Sokolovsky:

So... the hypothesis is that something in k_poll() (or the k_queue use of k_poll) is failing to wake up synchronously and we're hitting 1000ms timeouts, but only on some systems?

Well, it can be even worse than that:

  1. Until it is reproduced by someone else, it can be an artifact of my particular system setup. (I mention this to cover my ass, and to avoid a situation like the one with the UART busy-wait API, when I made too much noise about imaginary issues with kernel scheduling.)
  2. It could still be an issue with the QEMU emulation. There are known issues with how the UART is emulated, so why couldn't there be more corner cases? Testing on more hardware would help, but apart from frdm_k64f/Ethernet (which is "too fast"), I have only 96b_carbon/BLE, which is "too slow" and may have its own issues. (It would be cool if we could connect a real board over serial/SLIP.)

Otherwise, yes: the hypothesis is that with slow enough packet speeds (as with SLIP), there's some race or something similar leading to an "extra sleep".

nashif commented 7 years ago

by Paul Sokolovsky:

Have we verified that the "slow downs" are indeed a timeout condition?

Well, that was exactly my question: where does this delay of 1 s (1000 ms), or even more, come from? One such figure I know of is CONFIG_NET_TCP_ACK_TIMEOUT. I changed it to 500, but no: the delays are still integer multiples of 1000 ms (1 s, 2 s, 3 s, 4 s seen). Grepping through .config, I don't see other "1000" figures; I'd need to grep through the codebase for hardcoded values.

nashif commented 7 years ago

by Paul Sokolovsky:

Can you replace the printk() with a k_busy_wait() and fix it too?

Even before that, I tried a k_yield() there, as seen (with some worry on my part) in some of Jukka Rissanen's networking patches. It didn't help visibly.

With k_busy_wait(500) after k_poll() and 1000 requests it's:

  50%     32
  66%     35
  75%     37
  80%     39
  90%     44
  95%     49
  98%     55
  99%     67
 100%   1052 (longest request)

With k_busy_wait(1000) after k_poll() and 1000 requests it's:

  50%     36
  66%     40
  75%     42
  80%     44
  90%     51
  95%     63
  98%   1050
  99%   1057
 100%   1091 (longest request)

With k_busy_wait(2000) after k_poll() and 1000 requests it's:

  50%     37
  66%     41
  75%     44
  80%     46
  90%     55
  95%     66
  98%     78
  99%     89
 100%    150 (longest request)

So, k_busy_wait() definitely "helps", but it doesn't resolve the issue, and the behavior isn't even a monotonic function of the delay used (of course, there's enough variability/nondeterminism in the first place).
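
(For reference, the k_busy_wait() goes right after the k_poll() call in k_queue_poll(), in the same spot as the earlier printk; roughly like this, with the value varied as noted above:)

--- a/kernel/queue.c
+++ b/kernel/queue.c
@@ -205,7 +205,8 @@ static void *k_queue_poll(struct k_queue *queue, s32_t timeout)

        event.state = K_POLL_STATE_NOT_READY;

        err = k_poll(&event, 1, timeout);
+       k_busy_wait(500); /* argument is in microseconds; 500, 1000, 2000 tried above */
        if (err) {
                return NULL;
        }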

nashif commented 7 years ago

by Paul Sokolovsky:

Do we know which queue is involved?

The exact one, no, but (at least) the following queues are involved:

RX packet queue

Accept socket queue (new connection contexts go there)

Per-socket RX packet queue

I modified the code to print k_uptime_get_32() at some points in the socket code, and here's what I see:

38690 zsock_accepted_cb 0x004053b8
38690 Connection from 192.0.2.2
38710 Connection from 192.0.2.2 closed

38740 zsock_accepted_cb 0x004053b8
38750 Connection from 192.0.2.2
38760 Connection from 192.0.2.2 closed

39810 zsock_accepted_cb 0x004053b8
39810 Connection from 192.0.2.2
39830 Connection from 192.0.2.2 closed

40860 zsock_accepted_cb 0x004053b8
40870 Connection from 192.0.2.2
40870 Connection from 192.0.2.2 closed

So, the first two connections got served without a delay, then there was 1 s between the following two. But once zsock_accepted_cb() is called (which does a k_fifo_put), there is no big delay until that connection is served; the delay happens in between accepting connections and serving them. Will patch in more printk calls...
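
(The instrumentation is nothing fancy, just printk of the uptime at interesting points, along the lines of this sketch with a hypothetical helper, not the exact patch:)

#include <kernel.h>
#include <misc/printk.h>

/* Hypothetical helper: tag an event with the kernel uptime in
 * milliseconds, matching the log format above.
 */
static inline void trace_event(const char *what, void *obj)
{
	printk("%u %s %p\n", k_uptime_get_32(), what, obj);
}

/* e.g. trace_event("zsock_accepted_cb", ctx); at the top of the callback */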

Have we ruled out delays on insert from some other source (seems unlikely I guess).

What do you mean / have in mind? The processing should be as described above: a driver puts a packet in the RX queue, from which it's taken by the RX thread and routed to either the accept or the recv callback, both of which put it into other queues (which are actually union'ed, as their lifetimes don't overlap), from which the application code pulls it via calls to accept() or recv().
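
To illustrate the hand-off pattern in code (a minimal self-contained sketch with hypothetical names, not the actual net stack sources): one side does k_fifo_put(), the other blocks in k_fifo_get(), which with CONFIG_POLL enabled goes through the k_poll-based path this issue is about.

#include <kernel.h>
#include <misc/printk.h>

struct rx_item {
	void *fifo_reserved;	/* first word is reserved for use by the k_fifo */
	int payload;
};

K_FIFO_DEFINE(rx_q);

/* Producer side, e.g. what the driver/RX thread does with a packet. */
void producer_side(void)
{
	static struct rx_item item = { .payload = 42 };

	k_fifo_put(&rx_q, &item);
}

/* Consumer side, e.g. what accept()/recv() do while waiting for data. */
void consumer_side(void)
{
	struct rx_item *item = k_fifo_get(&rx_q, K_FOREVER);

	printk("dequeued payload %d\n", item->payload);
}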

nashif commented 7 years ago

by Paul Sokolovsky:

Luiz Augusto von Dentz on IRC suggested:

Ok, so I enabled in the config:

CONFIG_NET_BUF_LOG=y
CONFIG_SYS_LOG_NET_BUF_LEVEL=3

CONFIG_NET_BUF_WARN_ALLOC_INTERVAL itself defaults to 1. Nothing is being logged. I verified that with CONFIG_SYS_LOG_NET_BUF_LEVEL=4 I get debug spam.