zenomt / rtmfp-cpp

Secure Real-Time Media Flow Protocol (RTMFP) Library in C++
https://www.rfc-editor.org/rfc/rfc7016.html
MIT License

RTMFP costs so much cpu #20

Open doanquocnam opened 3 months ago

doanquocnam commented 3 months ago

Run tcserver -B 0.0.0.0:1935 -b 0.0.0.0:1936 and run a stress test against the server.

I ran tcserver with RTMFP's encryption effectively disabled by selecting OpenSSL's null cipher, EVP_enc_null():

    if( (not EVP_CipherInit_ex(m_ctx, EVP_enc_null(), NULL, (const unsigned char *)key, NULL, encrypt ? 1 : 0))
        or (not EVP_CIPHER_CTX_set_padding(m_ctx, 0)) )
        return false;

and:

    m_ciphertextBuf = (uint8_t *)malloc(ENCRYPT_BUF_LENGTH + ENCRYPT_BUF_MARGIN);
    m_plaintextBuf = m_ciphertextBuf + ENCRYPT_BUF_MARGIN;

I've tested this, and it was OK for both the tcconn client and tcserver, confirming that the crypto doesn't cost any CPU.

I have spent 3 days reviewing the code trying to fix this problem, but I haven't found a solution yet. Please help me fix this!

I think there is something wrong in RTMFP that makes it cost so much CPU.

P1: 500 RTMFP clients [screenshot]

P2: 1000 RTMP clients [screenshot]

zenomt commented 3 months ago

my profiling has shown that a significant amount of CPU time is spent in the OS kernel receiving and (especially) sending UDP packets, on both MacOS and Linux. RTMFP is responsible for assembling or parsing each packet that goes over the network, instead of that being delegated to the OS. each send or receive of one packet requires a system call to the OS (a context switch), which is expensive, and operating system implementations of those system calls for sending or receiving UDP packets also use a lot of CPU resources, according to my profiling.

there is no portable solution to this problem. Linux has a system call that can send or receive multiple UDP packets at once, which can reduce the overhead. i haven't investigated using this facility, however. this would be implemented in the Platform Adapter, not RTMFP itself.

there are probably other areas of the RTMFP library that could be optimized. i haven't run the profiler when there are hundreds of connections all going at once to see if there's a bottleneck for that case.

no matter what, it will always take more CPU resources to run RTMFP than to do RTMP or RTWebSocket or anything else based on TCP. this implementation is single-threaded, so it is expected that it will not perform perfectly past saturation.

TL;DR: this is not a bug, this is a Fact of Life™, unfortunately.

zenomt commented 3 months ago

the Linux-specific system calls that can send and receive multiple UDP packets at once are sendmmsg(2) and recvmmsg(2). you can of course write your own Linux Platform Adapter that uses these system calls to send or receive multiple packets with one system call, which can improve the performance by amortizing the per-packet system call overhead.
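
for illustration only (this is not code from this library), a batched send with sendmmsg(2) could look roughly like this sketch; the packet buffers, destination list, and missing error handling are placeholder assumptions:

    // Linux-specific sketch: send pkts[i] to dests[i] for all i with a single
    // system call. Needs _GNU_SOURCE (defined by default by g++ on Linux).
    // Assumes pkts.size() == dests.size().
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    int sendBatch(int fd, const std::vector<std::vector<uint8_t>> &pkts, std::vector<sockaddr_in> &dests)
    {
        if(pkts.empty())
            return 0;
        std::vector<mmsghdr> msgs(pkts.size());
        std::vector<iovec> iovs(pkts.size());
        memset(msgs.data(), 0, msgs.size() * sizeof(mmsghdr));
        for(size_t i = 0; i < pkts.size(); i++)
        {
            iovs[i].iov_base = const_cast<uint8_t *>(pkts[i].data());
            iovs[i].iov_len = pkts[i].size();
            msgs[i].msg_hdr.msg_iov = &iovs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
            msgs[i].msg_hdr.msg_name = &dests[i];
            msgs[i].msg_hdr.msg_namelen = sizeof(sockaddr_in);
        }
        // returns the number of packets handed to the kernel, or -1 on error
        return sendmmsg(fd, msgs.data(), (unsigned int)msgs.size(), 0);
    }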

doanquocnam commented 3 months ago

I ran a test pushing 10 new connections per second for each protocol (RTMFP or RTMP) until the CPU was saturated, to find the limit. I am writing a live streaming system for millions of clients, so I need an overload test to find a solution. I chose UDP because it allows peer-to-peer connections. But on my server with 40 cores at 2.3 GHz (running 40 tcserver processes), when the number of RTMFP connections reaches 12,000 clients (300 connections per tcserver process), CPU usage hits 100% and playback starts stuttering.

I've just read this article: https://medium.com/@pranav_maniar/how-i-improved-udp-servers-processing-performance-from-900-msg-sec-to-9000-msg-sec-1295e87ccf75


Maybe the problem is the network send/receive buffer sizes. I will increase those buffer sizes and test again. Maybe I also need to write another socket class to read/write network messages!
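
For reference, a rough sketch (not code from this project) of how the socket buffers could be enlarged with setsockopt(); the 4 MB value is just an example, and Linux clamps the value to the net.core.rmem_max / wmem_max sysctls unless those are raised too:

    #include <sys/socket.h>

    // try to enlarge a UDP socket's kernel send/receive buffers
    bool enlargeSocketBuffers(int fd, int bytes)
    {
        return (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) == 0)
           and (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) == 0);
    }

    // example: enlargeSocketBuffers(fd, 4 * 1024 * 1024); // 4 MB, arbitrary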

doanquocnam commented 3 months ago

I wrote a simple UDP echo server to benchmark and stress-test UDP transport. My simple UDP echo server does recv/sendto for 500 clients, each client transferring 3 MB/s, so together the clients make the server receive/send 300 MB/s (150 MB/s receiving + 150 MB/s sending), and the server's CPU usage reaches 55-60%. With tcserver, the total bandwidth for 500 connected clients is only 50 MB/s, and tcserver only calls sendmsg, because it receives just one publishing channel from RTMP ffmpeg. But its CPU usage reaches > 90%.
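
For context, the kind of UDP echo loop described here might look roughly like this sketch (the port and buffer size are arbitrary, and error handling is omitted):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <cstdint>

    int main()
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        sockaddr_in addr {};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(5000);        // arbitrary test port
        addr.sin_addr.s_addr = INADDR_ANY;
        bind(fd, (sockaddr *)&addr, sizeof(addr));

        uint8_t buf[2048];
        for(;;)
        {
            sockaddr_in peer {};
            socklen_t peerLen = sizeof(peer);
            ssize_t rv = recvfrom(fd, buf, sizeof(buf), 0, (sockaddr *)&peer, &peerLen);
            if(rv > 0)
                sendto(fd, buf, rv, 0, (sockaddr *)&peer, peerLen); // echo back to sender
        }
    }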

I still believe that tcserver has some problem while sending packets over UDP.

This is my simple UDP stress test (udpstresstest) with 500 clients sending and receiving at the same time to the UDP echo server: [screenshot]

This is tcserver with 500 clients, sending only: [screenshot]

Please review!!!

doanquocnam commented 3 months ago

I've reduced the number of sendmsg system calls by using sendmmsg, reduced the number of fragments from SendFrag, reduced the calls to TCMessage::message in relayStreamMessage, etc., and now tcserver's CPU usage has decreased by 10-12% for 600 RTMFP clients. But I want more reduction; maybe 20% would be acceptable. I think there are too many calls to SendFrag; I think this should be called only once for each relayStreamMessage!!

zenomt commented 3 months ago

I think there are too many calls to SendFrag; I think this should be called only once for each relayStreamMessage!!

i'm not sure exactly what you're saying here. a SendFrag tracks one fragment of a user message on a flow. if the message is too big to fit in one packet, it needs to be broken up into multiple fragments. each fragment needs its own sequence number and other bookkeeping to track if it's been sent and acknowledged, since any fragment could be lost in flight.

the maximum packet size is fairly conservative to fit in IPv6 networks even if there's a tunnel or a VPN on the path. you can adjust the maximum packet size to get more data in each packet and reduce the number of fragments needed for big messages by changing MAX_SESSION_PACKET_LENGTH in include/rtmfp/param.hpp. note that making it too big (especially approaching or exceeding the path MTU between peers) can increase the likelihood of packet loss. i recommend not exceeding 1400 bytes for MAX_SESSION_PACKET_LENGTH.
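
as a rough back-of-the-envelope illustration (usable payload sizes are assumed here, ignoring exact per-packet header overhead):

    10,000-byte message, ~1,200 usable bytes per packet: ceil(10000/1200) = 9 fragments
    10,000-byte message, ~1,400 usable bytes per packet: ceil(10000/1400) = 8 fragments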

doanquocnam commented 3 months ago

I mean that SendFrag and TCMessage::message() operate on the same data. If you send to 1000 clients, you construct/destruct the result of TCMessage::message() 1000 times, and SendFrag's m_data(data, data + len) also runs 1000 times per fragment, each with a memcpy to construct/destruct a Bytes vector.

If you have one TCMessage of 10,000 bytes, SendFrag will build the same 10 x 1,000-byte fragments 1,000 times over, because there is only one splitting algorithm; I mean that SendFrag only ever splits that one message, the same way each time. At this point I haven't improved this part yet. My idea is to build the fragments when calling sendmsg (sendmmsg) and split the message there, to reduce the number of system calls.

doanquocnam commented 3 months ago

Maybe I was wrong about SendFrag; what I mean is that m_send_queue is the same. In the onMessage loop (Video, Data, Audio):

    for(auto it = stream.m_subscribers.begin(); it != stream.m_subscribers.end(); it++)
    {
        (*it)->m_owner->relayStreamMessage(*it, messageType, timestamp, actualPayload, limit - actualPayload);
        // auto fragment = share_ref(this, SendFrag(payload, len, ....));
        // m_send_queue.append(fragment);
    }

the payload buffer will be split the same way every time, because it has the same data, the same length, and the same fragmenting algorithm. So at this point, m_send_queue only needs to be built once in this loop.

zenomt commented 3 months ago

there needs to be a separate TCMessage for each recipient because their timestamp origins could be different. looking at the profiler, i do see an optimization i can make in TCMessage::message() to make it a little faster.

however, looking at the profiler on my Macintosh (which is old and much slower than modern ones), neither TCMessage::message() nor the allocation of SendFrags is really the bottleneck. i ran tcserver for 30 seconds with 150 tcconn clients playing the same 1 Mbps stream. that got the process up to 72% CPU on my old slow Mac (which means it used about 21 seconds of CPU). the profiler says that tcserver's Client::relayStreamMessage() used 3.07 seconds of CPU time, of which 1.19 seconds (5.5%) were attributed to SendFrag's constructor (and almost all of that in the constructor of the vector holding the fragment bytes), and 1.14 seconds (5.3%) in TCMessage::message(). even if both of these could be optimized to take 0 time, that would only speed things up about 12%.
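
spelling out that estimate:

    time attributable to SendFrag construction + TCMessage::message(): 1.19 s + 1.14 s = 2.33 s
    speedup if both took zero time: 21 / (21 - 2.33) ≈ 1.12, i.e. about 12%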

zenomt commented 3 months ago

again on my Macintosh (same run as above), most of the CPU time (11.4 seconds out of 21 seconds, 54%) is spent on the network sending side (in Session::onInterfaceWritable() and its descendants). of that, 6.35 seconds (30%) is spent in the kernel for sendmsg(), and 1.9 seconds (9%) is spent in encryption (AES-CBC and HMAC-SHA256). another 2.1 seconds (10%) in that loop is spent in SendFlow::assembleData(), of which 1.3s (6%) is spent rescheduling the timeout alarm. that's a place i should look to optimize, because that's too much housekeeping overhead happening there.

unfortunately, the dominating factors on the sending side are the time spent in sendmsg() and encryption.

doanquocnam commented 3 months ago

again on my Macintosh (same run as above), most of the CPU time (11.4 seconds out of 21 seconds, 54%) is spent on the network sending side (in Session::onInterfaceWritable() and its descendants). of that, 6.35 seconds (30%) is spent in the kernel for sendmsg(), and 1.9 seconds (9%) is spent in encryption (AES-CBC and HMAC-SHA256). another 2.1 seconds (10%) in that loop is spent in SendFlow::assembleData(), of which 1.3s (6%) is spent rescheduling the timeout alarm. that's a place i should look to optimize, because that's too much housekeeping overhead happening there.

unfortunately, the dominating factors on the sending side are the time spent in sendmsg() and encryption.

Don't worry! Most servers will run on Linux (CentOS, RHEL, ...), so it can be improved by using sendmmsg to send multiple messages at the same time. After testing my algorithm for sending with sendmmsg, I will show you how it works!

zenomt commented 3 months ago

https://github.com/zenomt/rtmfp-cpp/commit/08cf5eebba61f409dc805dc80a079997512351a0 is the optimization for Session::rescheduleTimeoutAlarm() (called from SendFlow::assembleData()) that takes that 1.3 seconds spent rescheduling the timeout alarm down to about 20 ms. :)

zenomt commented 3 months ago

i tried the optimization i had in mind for TCMessage::message(), but it didn't make a difference. most of the time is spent allocating a new byte vector, not in copying the data itself (when i changed the implementation to allocate a vector of the right size to begin with and then copy the data into it, it took the same amount of time, and the profiler showed negligible time spent in memcpy()). because it made no difference, i didn't merge that change (the code is clearer the way it is).

zenomt commented 3 months ago

https://github.com/zenomt/rtmfp-cpp/commit/08cf5eebba61f409dc805dc80a079997512351a0 is the optimization for Session::rescheduleTimeoutAlarm() (called from SendFlow::assembleData()) that takes that 1.3 seconds spent rescheduling the timeout alarm down to about 20 ms. :)

i originally missed a case with that optimization where the timeout could be late if the ERTO got shorter. that's fixed in https://github.com/zenomt/rtmfp-cpp/commit/63ee4f3f98a6e1372af7e25453d388463bf592fd . this fix won't make the method use more CPU time in practice, but it is more correct to handle an edge case.

doanquocnam commented 3 months ago

I've just tested again, and average CPU usage has decreased by about 10%, but sometimes there is a burst of overhead (for a short time) where CPU usage gets very high!

ROBERT-MCDOWELL commented 3 months ago

I don't know if this link can help: the MSG_ZEROCOPY option is available since Linux kernel 5.x for TCP/UDP and 6.x for TCP/UDP/VSOCK https://www.kernel.org/doc/html/v6.9/networking/msg_zerocopy.html

zenomt commented 3 months ago

MSG_ZEROCOPY would not be appropriate for two reasons (from the page you linked):

furthermore, the amount of data being copied is actually pretty small, and data copying is most likely not the bottleneck. memory bandwidth in modern computers is 10-100+ GB/second, and 1000x 1 Mbps (128 KB/sec) streams (128 MB/sec total) ends up being between 0.1 and 1% of the memory bandwidth.

ROBERT-MCDOWELL commented 3 months ago

thanks for your clear explanation, I learn every day ;)

zenomt commented 3 months ago

but sometimes there is a burst of overhead (for a short time) where CPU usage gets very high!

i have also observed this behavior (including just now in a profiling run), streaming 1 Mbps Big Buck Bunny and seeing single-core CPU use go between 61% and 88%. i looked at process samples during the lower periods and the higher periods, and the percentages and distributions across the various things that take a lot of CPU were the same. there didn't appear to be some additional special work that was happening during the higher CPU periods; rather, it seemed that it was just doing all the same things in the same proportions, but "more".

i suspect this is being caused by bit rate variation of the video stream combined with everyone receiving the same stream at the same time. remember that a 1 Mbps stream isn't going to be exactly 1 Mbps all the time. key frames (IDR in H.264 terminology) are typically much larger than the non-IDR predicted frames, and encoders can sometimes redistribute bandwidth across a large window so there's enough for high-complexity pictures while keeping the "average" lower.

i suggest taking a look at the video file that you're streaming, and seeing if some portions of it are actually higher bitrate than you think, and if those higher bitrate portions coincide with the higher CPU usage periods.

zenomt commented 3 months ago

i suggest taking a look at the video file that you're streaming, and seeing if some portions of it are actually higher bitrate than you think

i looked at a minute or so of my 1 Mbps version of Big Buck Bunny (which i encoded with FFmpeg some time ago from the 2 Mbps original). i see a wide variation of the bitrate over that period of time. using 4-second buckets:

time    bps
4   1477406
8   1285480
12  859512
16  906650
20  1004232
24  1402228
28  1256106
32  1040420
36  816812
40  1183386
44  1290554
48  1127064
52  1676418
56  1338680
60  859614
64  892300

notice the bit rate in those 4 second periods fluctuates between about 800kbps and about 1600kbps (a factor of 2). i generated that by capturing the output of tcconn and feeding it into a simple awk script:

 awk 'BEGIN { t0 = 0; acc = 0; bucket = 4; printf "time\tbps\n" } /^stream on.*ts:.*len:/ {t = $4 / 1000; if(t - t0 >= bucket) {printf "%d\t%d\n", t, acc * 8 / (t - t0); acc = 0; t0 += bucket}; acc += $6}' /tmp/tcconn.out

doanquocnam commented 2 months ago

Good news! I've just improved sending by replacing sendmsg with sendmmsg. Now CPU usage has gone down enough that I can connect 50% more RTMFP clients (tcconn). Of course this includes your two Session updates above.

Previously, my server with 40 tcserver processes (40 cores, 2.3 GHz) could handle only 8,000 tcconn clients. Now I can run a test with 12,000-13,000 tcconn clients playing.

At first, I tried increasing MAX_SESSION_PACKET_LENGTH to 8000 to reduce the number of sendmsg calls, and CPU usage dropped from 90% to 50% (on my desktop, 500 tcconns). Of course that was a local test over the loopback interface only, to prove that the many sendmsg calls cause the CPU overhead. OK, so on to improving this issue. After that, I restored MAX_SESSION_PACKET_LENGTH to 1200.

The way it works is:

And now the CPU usage looks good. This solution also removes the issue with large encoded IDR frames. It's more stable now; I don't see the burst overhead any more! I've also tried writing this solution on MacOS, and it also works there, using the sendmsg_x system call.

@zenomt But I think it would be better if you could implement this improvement. You could try sendmsg_x (sendmmsg) as shown here: https://gist.github.com/Maximus-/487c70508e161680d550ccb343781859

zenomt commented 2 months ago

this should be handled entirely in a Platform Adapter. Session (or anything in the core protocol implementation) doesn't need to be modified to accomplish this.

notice in PosixPlatformAdapter::notifyWhenInterfaceWritable() already it will attempt several write opportunities for each select() (or epoll() on Linux) wakeup. what should happen instead is the Platform Adapter should, when it wakes up, initialize a buffer for multiple packets, and then PosixPlatformAdapter::writePacket() should append to this buffer (up to a number of times, or until the onwritable callback doesn't have any more work), and then those multiple packets can be sent at once with sendmmsg().

i might make this optimization in the future, but it is not a priority for me right now. you can of course make your own Platform Adapter that does this, and then the only patch needed to tcserver is to use your platform adapter instead of the stock one.

this optimization to use sendmmsg() isn't as simple as it seems, because error handling gets much more complicated. you now need to keep track of if fewer packets than requested were sent, what the error was on the send that failed, figure out if it should be retried (like without its msg_control or something), and whether to try to send the other packets now or queue them up for later.
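
for illustration only (the names below are invented and are not this library's actual Platform Adapter API), the accumulate-then-flush idea could be sketched roughly like this, with the error-handling complications described above deliberately glossed over:

    #include <sys/socket.h>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct PendingPacket {
        std::vector<uint8_t> bytes;  // one fully assembled (and encrypted) RTMFP packet
        sockaddr_storage dest;       // its destination address
        socklen_t destLen;
    };

    class BatchingWriter {
    public:
        explicit BatchingWriter(int fd) : m_fd(fd) {}

        // the adapter's per-packet write hook would append here instead of calling sendmsg()
        void queuePacket(const uint8_t *bytes, size_t len, const sockaddr *dest, socklen_t destLen)
        {
            PendingPacket p;
            p.bytes.assign(bytes, bytes + len);
            memcpy(&p.dest, dest, destLen);
            p.destLen = destLen;
            m_pending.push_back(std::move(p));
        }

        // called once per writable wakeup, after the onwritable callbacks have no more work
        int flush()
        {
            if(m_pending.empty())
                return 0;
            std::vector<mmsghdr> msgs(m_pending.size());
            std::vector<iovec> iovs(m_pending.size());
            memset(msgs.data(), 0, msgs.size() * sizeof(mmsghdr));
            for(size_t i = 0; i < m_pending.size(); i++)
            {
                iovs[i].iov_base = m_pending[i].bytes.data();
                iovs[i].iov_len = m_pending[i].bytes.size();
                msgs[i].msg_hdr.msg_iov = &iovs[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
                msgs[i].msg_hdr.msg_name = &m_pending[i].dest;
                msgs[i].msg_hdr.msg_namelen = m_pending[i].destLen;
            }
            int rv = sendmmsg(m_fd, msgs.data(), (unsigned int)msgs.size(), 0); // one system call for the whole batch
            m_pending.clear(); // real code must check rv and deal with any unsent tail
            return rv;
        }

    private:
        int m_fd;
        std::vector<PendingPacket> m_pending;
    };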

doanquocnam commented 2 months ago

notice in PosixPlatformAdapter::notifyWhenInterfaceWritable() already it will attempt several write opportunities for each select() (or epoll() on Linux) wakeup. what should happen instead is the Platform Adapter should, when it wakes up, initialize a buffer for multiple packets, and then PosixPlatformAdapter::writePacket() should append to this buffer (up to a number of times, or until the onwritable callback doesn't have any more work), and then those multiple packets can be sent at once with sendmmsg().

Thanks for the discussion! I will write my own Platform Adapter that way. It can send packets to N addresses at once instead of sending one packet to one address per call as it does now. Also, reading the sendmmsg documentation, the maximum vlen is 1024, so this system call can send to up to 1024 addresses at once. That means CPU usage can be reduced even more! I hope so!

Many thanks!!