Open ghost opened 6 years ago
This issue https://github.com/nodejs/node/issues/8871#issuecomment-250915913 is the problem in ws and yes it will probably go away if sync calls are used but it's still valid as the async usage is what is actually recommended.
As written in the comment a different allocator can work around the issue and in fact, there is no leak on macOS.
I don't know if it's viable to use the Node.js zlib sync methods. It will certainly require extensive testing and a big production environment to validate the results.
I'm not going to argue with you either. I've tried in the past, it's useless, there is no constructive discussion. Thank you for the suggestion. If anyone want to experiment with sync zlib calls, cool. Please do that! Run benchmarks, use it in a heavy-loaded production environment, analyse the event loop lag, etc. Then eventually open a PR.
Again, experimentation with this in a production environment would be greatly appreciated. Just back it up with numbers.
Where? BitMEX? In that case why isn't @STRML proposing a patch? They proposed multiple patches in the past to work around the issue. Or are they no longer interested now since they are using uws?
Open a PR, convince maintainers that it is the right thing to do (with numbers), PR merged, discussion ends, all users of ws benefit from this. How awesome is that.
Hey @lpinca - sorry for not updating you further. I'm happy to contribute a sync patch and put it through our benchmarks for completeness' sake. Even with concurrency=1 we still saw fragmentation-related blowups pretty much daily in production. I have not yet tested if sync will fix it but it rings true. Unfortunately this is a difficult bug to reproduce in test.
Cool, looking forward to that. My biggest fear is that using sync calls will fix the memory fragmentation issue but introduce other issues like blocking the event loop for too long. This is what I meant above with "will probably go away".
Yeah it depends on how you've clustered out your app. If you're doing one app per hyperthread it's already pretty close to counterproductive to thread out zlib, and the fragmentation bug takes it over the edge.
blocking with CPU work is always happening to some degree and inevitable.
Well this is why many computationally intensive tasks are delegated to the libuv thread pool no?
Well they can be. Zlib is a bit of an outlier because the standard task delegated to the libuv pool is IO, not CPU. If you assume that any process can actually spawn 1-5 CPU-intensive processes this completely breaks assumptions when clustering. Ideally you want to spawn one cluster worker per hyperthread otherwise.
And here we are again with that arrogant attitude. The point is that it's not ws fault. Any Node.js app/library that uses the asynchronous zlib deflate methods is potentially affected by that memory fragmentation issue.
Is it viable to use the sync methods without incurring in other issues? I don't know. Very happy if this is the case as the code will actually be also a lot easier to maintain.
Issue should stay open, and I already said thank you for your suggestion, not sure what else should I do haha. Please explain me exactly what is ws doing wrong because I'm shortsighted. Is using a recommended and public API ws fault? Is it ws fault if that API leads to memory fragmentation?
So, https://github.com/websockets/ws/pull/1204 does not fix the issue completely, right?
@luislobo according to the above comments, yes that's right.
After some testing I saw the async version of gzip takes up to 10 times more than the sync version according to the content size.
100KB: sync version takes 5ms and async version 20ms 2KB: sync version takes 0.3ms and async version 3ms As size increase the overhead becomes more acceptable.
As websocket usually is used for small chunk of json data I think regardless of the memory leak it's better to use sync deflate for small data considering under 1ms operations are neglectable.
I suggest adding another threshold option for sync compression (syncThreshold maybe) with default value of 10KB or more. this way anyone who wants to go totally sync can increase this value.
Sorry for not having read it properly a minute ago. :+1: for the threshold option.
There should be two thresholds, one to enable compression and another to do it async... Sometimes, gzip compressed data gets bigger than the original one for small data chunks, and also there's the CPU usage for the conpression.
El sáb., 25 de agosto de 2018 0:46, Alex Hultman notifications@github.com escribió:
As websocket usually is used for small chunk of json data I think regardless of the memory leak it's better to use sync deflate for small data considering under 1ms operations are neglectable.
Bingo.
At these small messages (most users send less than 500 bytes a message) the threading, eventing, copying and synchronization overhead it means to "make it async" becomes a bigger cost than the actual compression itself. Making it "async" only fools oneself as it accomplishes nothing of value.
I would never put compression as "async" because it simply is logically impossible to do so, but I also don't care enough to argue anymore. Setting a threshold would solve it enough to be usable.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/websockets/ws/issues/1369#issuecomment-415901743, or mute the thread https://github.com/notifications/unsubscribe-auth/AAgfviJ5nirilZGMU4_CHPXS9_BNVwMyks5uUIJjgaJpZM4TgILB .
I'm assuming you toss the compressed variant if it turns out bigger than the original.....
I'll assume that for each length of input, gzip achives maximum efficiency if and only if all input bytes are the same. In this case, and using values from 0…127, the break-even is at 23 bytes.
Inputs shorter than 23 bytes will result in larger outputs. For these, we shouldn't waste any CPU cycles on gzip and the length comparison.
Compression becomes efficient starting at 24 bytes of input. For these we should compare lengths. Maybe even use a threshold a little bit higher, assuming that most messages won't be the same 7-bit character repeated. I see how people might want to tweak this threshold based on their message characteristics.
Example: If I know that most of my messages will contain at least {"nick":"","msg":""}
, I'd want to not spend extra CPU time for inputs up to 42 bytes long.
Node.js does not have an official API to use inflate/deflate streams synchronously.
We can't use https://nodejs.org/api/zlib.html#zlib_zlib_deflaterawsync_buffer_options as that does not work as intended for fragmented messages of if context takeover is enabled. We need the ability to reuse the same https://nodejs.org/api/zlib.html#zlib_class_zlib_deflateraw instance.
In order to use a deflate stream synchronously we need to copy this internal function https://github.com/nodejs/node/blob/5d210d4ac938a16d132a1433857927c5c58a8954/lib/zlib.js#L459-L542 but ofc that can change at any time.
As said in https://github.com/websockets/ws/issues/1369#issuecomment-383808666 experimentation (backed by numbers) is welcomed.
Yes, but where is the code that produced those results? How many deflate instances were used (we have one per socket)? How long is the event loop blocked when doing that for say 50k clients? I would like to see this kind of stuff.
2KB: sync version takes 0.3ms and async version 3ms
@rahbari can you please share your code?
I did a dummy and irrelevant quick test and this is what I get:
I think it would be nicer to run the benchmark some few hundered times in the same nodejs process for warmup, only then collect measurements. v8 might be able to optimize functions, once it knows we want to use them a lot.
I'll assume that for each length of input, gzip achives maximum efficiency if and only if all input bytes are the same. In this case, and using values from 0…127, the break-even is at 23 bytes. Inputs shorter than 23 bytes will result in larger outputs. For these, we shouldn't waste any CPU cycles on gzip and the length comparison.
Not sure about the 23 bytes number, but yes, that's exactly what I was talking about :-)
Node.js does not have an official API to use inflate/deflate streams synchronously.
Streams not, but there's for buffers.
@piranna please read the full comment. That creates a new instance for each buffer, see https://github.com/nodejs/node/blob/5d210d4ac938a16d132a1433857927c5c58a8954/lib/zlib.js#L692-L706. We can't use that for fragmented messages or if context takeover is enabled.
But does compression being done at fragment level? Isn't it message level?
It is done at the frame/fragment level, yes.
The option to enable compression based on the message byte size is available since version 2.0.0, see https://github.com/websockets/ws/blob/1e78aba9b9b29407517ec7ba1da97427c1678e5c/doc/ws.md#new-websocketserveroptions-callback
threshold
{Number} Payloads smaller than this will not be compressed. Defaults to 1024 bytes.
I noticed that there is Zlib.prototype._processChunk()
. The API is "private" and undocumented but it seems to work and it is used in user land so it will probably not go away anytime soon.
const deflate = new zlib.DeflateRaw();
const data = deflate._processChunk(chunk, zlib.Z_SYNC_FLUSH);
I made the following code to reproduce the problem (still happening on alpine linux or debian buster with ws@7.2.0 and node 12 or 13).
const WebSocket = require('ws');
const wss = new WebSocket.Server({
port: 8080,
perMessageDeflate: {
zlibDeflateOptions: {
memLevel: 3,
},
serverMaxWindowBits: 10,
threshold: 16,
},
});
wss.on('connection', (ws) => {
ws.on('message', (message) => {
ws.send(message);
});
ws.on('close', () => {
console.log(`close, ${wss.clients.size} listeners left`);
});
});
const message = Buffer.alloc(256);
function sendMessage(ws, j) {
if (j < 100) {
setTimeout(() => {
ws.send(message, sendMessage.bind(null, ws, j + 1));
}, 16);
} else {
ws.close();
}
}
for (let i = 0; i < 1000; ++i) {
const ws = new WebSocket('ws://localhost:8080/');
ws.on('open', () => {
sendMessage(ws, 0);
});
}
I tried to make the zlib compression sync, but I only managed to break the unit tests. However from an external point of view of the ws
project, I feel like the permessage-deflate.js
file is already very dependant on the zlib.js
NodeJS implementation. So I think it wouldn't be a huge issue to go further into custom code that will break if the NodeJS zlib.js
changes, because it will already break with the current code.
By the way, the memory leak doesn't happen on Windows 10. I don't feel good about deploying windows servers to fix a memory leak though.
Would it help to deflate/inflate buffers instead of the stream? That way we could use zlib.deflateSync. ("Added in: v0.11.12")
The difficulty is to support "context takeover".
The RFC defines it like this :
The term "LZ77 sliding window" used in this section means the buffer used by the DEFLATE algorithm to store recently processed input. The DEFLATE compression algorithm searches the buffer for a match with the following input.
The term "use context takeover" used in this section means that the same LZ77 sliding window used by the endpoint to build frames of the previous sent message is reused to build frames of the next message to be sent.
In my opinion, context takeover is what makes websocket compression very cool. You can compress based on the previous messages, which is very convenient when you have many similar messages which I think is very common with websockets.
So, ws
should keep the sliding window between messages and the easiest way to do it was to code something on top of zlib streams.
@fungiboletus I can't reproduce on Ubuntu 18.04 using your example and Node.js 13.1.0:
Memory usage stabilizes at ~100 MiB.
Why not send and receive compressed buffer in userland?
https://github.com/nodejs/node/pull/34048
This PR is an experiment which hopefully mitigates #8871. Moves zlib initialization to the point of execution on worker thread pool. This allows to postpone the corresponding memory allocation (the computation) to a later point in time which may be critical in situations when many zlib streams are created and operated simultaneously.
Greetings,
I am experimenting with synchronous zlib compression in ws
.
This implementation is built on top of Zlib.prototype._processChunk()
but because its also released as a standalone library, i added it to ws
via a peer dependency to avoid tampering with the existing solution as much as possible. The code is still a bit rough and may be missing some error handling but here are some early test results made with a modified version of the /bench/speed.js
test:
(windows 10 build 19041, node.js 14.8.0, intel i5 7300HQ 2.5ghz)
> node speedzlib.js
Generating 100 MiB of test data...
Testing ws on [::]:8181
10000 roundtrips of 64 B binary data: 2.9s 437.94 KiB/s
5000 roundtrips of 16 KiB binary data: 2.8s 55.45 MiB/s
1000 roundtrips of 128 KiB binary data: 13.5s 18.48 MiB/s
100 roundtrips of 1 MiB binary data: 9.6s 20.87 MiB/s
1 roundtrips of 100 MiB binary data: 9.4s 21.23 MiB/s
10000 roundtrips of 64 B text data: 3.6s 352.02 KiB/s
5000 roundtrips of 16 KiB text data: 2.6s 59.87 MiB/s
1000 roundtrips of 128 KiB text data: 13.2s 18.95 MiB/s
100 roundtrips of 1 MiB text data: 9.8s 20.34 MiB/s
1 roundtrips of 100 MiB text data: 9.5s 20.96 MiB/s
> npm install fast-zlib
> node speedzlib.js
Generating 100 MiB of test data...
Testing ws on [::]:8181
10000 roundtrips of 64 B binary data: 1.1s 1.11 MiB/s
5000 roundtrips of 16 KiB binary data: 1.5s 101.33 MiB/s
1000 roundtrips of 128 KiB binary data: 10.1s 24.77 MiB/s
100 roundtrips of 1 MiB binary data: 7.3s 27.53 MiB/s
1 roundtrips of 100 MiB binary data: 8s 24.93 MiB/s
10000 roundtrips of 64 B text data: 1.2s 1.05 MiB/s
5000 roundtrips of 16 KiB text data: 1.6s 96.19 MiB/s
1000 roundtrips of 128 KiB text data: 9.7s 25.65 MiB/s
100 roundtrips of 1 MiB text data: 7.7s 26.09 MiB/s
1 roundtrips of 100 MiB text data: 8.2s 24.29 MiB/s
Synchronous zlib appears to provide performance gains of 20% to 40%+ depending on packet size. If anyone wants to experiment with it and test for memory issues, the code is available in my fork of ws
.
Cheers!
fast-zlib
is bit hacky but that is not a problem, we can propose an API to use inflate/deflate streams synchronously in Node.js core.
The problem as I see it is doing this synchronously when there are thousands of clients. Yes, it will be slower but doing it asynchronously does not block the event loop.
The benchmark above uses only a single client and reflects the results of running this benchmark https://github.com/websockets/ws/issues/1369#issuecomment-415950185. There is no need to use ws
or transmit data over the network to benchmark this.
One more thing, it doesn't make sense to compress small messages. It is actually counterproductive. This is why ws
does not compress messages whose size is < 1024 bytes by default. The threshold can be customized of course.
Indeed, i intentionally disabled the threshold to test zlib in all situations, to avoid having mixed compressed and non-compressed results
The problem as I see it is doing this synchronously when there are thousands of clients. Yes, it will be slower but doing it asynchronously does not block the event loop.
What about a flag to left user decide to use them sync or async?
It's an option but it complicates everything :(.
The user can already compress the message synchronously before sending it in a WebSocket message. It's not the same but close enough. This also allows to optimize a broadcast a lot because it allows to create only one frame and send the same frame to all clients. See discussion on https://github.com/websockets/ws/issues/617.
99% of all WS apps always send messages smaller than 1KB, so this rule is essentially disabling compression altogether.
What is this based on? Probably true for stock trading apps but 99% of all WS apps seems a bit too broad. The default value can be changed but:
I've seen and worked on WS apps used to transfer huge files, used to stream audio/video data, etc. Those apps used messages way bigger than 1 KB (64 KB / 16 KB).
Yes some use it like that. Strange. But it doesn't invalidate the observation that the vast majority of use cases don't. And it doesn't matter, really, all that matters is that small message sending is a huge use case with WS. So having an optimized/stable execution path that works well for small messages is important. That's all.
I agree, we need an optimized path for applications that are basically signalling or chat, because that's how I will primarily use it. (Of course we shall still support large media transfer as well.)
@mk-pmb to stay on topic, from my experience compressing very small messages (regardless of how compression is done, sync or async) does not make an app faster but slower.
If you are not happy with ws
performance when dealing with very small messages you are in the wrong repository because the bottleneck is not ws
but Node.js {net,tls}.Socket
, so you should bring this up to Node.js core repository.
compressing very small messages […] does not make an app faster but slower.
Thanks for reminding. I should have expressed myself more clearly. I meant that even in scenarios where network traffic is the bottle neck, where a few bytes saved make a difference, and it's thus acceptable to invest some more CPU cycles, our compression should still try to be economical in how many extra CPU cycles users would need to invest.
You're right that most of my chat-like apps won't usually run in such transmission-restricted circumstance though.
The discussion about the performances is interesting, but I want to mention that the memory leak is still there with ws@7.3.1 and node 14.9.0. The current implementation may be faster in some uses cases and slower in some others, but it does leak memory.
@fungiboletus I was not able to reproduce the memory fragmentation issue using your code above, see https://github.com/websockets/ws/issues/1369#issuecomment-553831310.
Also it should be better now as https://github.com/nodejs/node/pull/34048 mitigated the original issue.
I've read and followed the zlib "memory fragmentation" issue down to where people blaming the Linux kernel and glibc and whatnot.
My two cents:
The issue is with websockets/ws.
Doing CPU-bound tasks "async" is nothing more than a delusion - there is no such thing as an async CPU-bound task. If it takes 1 second of CPU-time, it takes 1 second of CPU time no matter when you perform it. Passing a CPU bound task to some thread pool doesn't improve anything in fact it only adds more overhead:
If you cannot keep up with what you buffer up, you're obviously going to blow the process up from too much memory usage since you produce more than you consume.
You have two solutions:
OR if you really want to try and use multithreading here (which goes against how Node.js is built and deployed):