Zlib compression needs to be sync - Githubissues

websockets / ws

Simple to use, blazing fast and thoroughly tested WebSocket client and server for Node.js

MIT License

21.32k stars 2.3k forks source link

Zlib compression needs to be sync #1369

Open ghost opened 6 years ago

ghost commented 6 years ago

I've read and followed the zlib "memory fragmentation" issue down to where people blaming the Linux kernel and glibc and whatnot.

My two cents:

There is nothing wrong with the Linux kernel
There is nothing wrong with glibc
There is nothing wrong with Node.js

The issue is with websockets/ws.

Doing CPU-bound tasks "async" is nothing more than a delusion - there is no such thing as an async CPU-bound task. If it takes 1 second of CPU-time, it takes 1 second of CPU time no matter when you perform it. Passing a CPU bound task to some thread pool doesn't improve anything in fact it only adds more overhead:

Threading overhead
You need to copy the data one extra time (major overhead!)
You're basically just procrastinating what you need to do right now.

If you cannot keep up with what you buffer up, you're obviously going to blow the process up from too much memory usage since you produce more than you consume.

You have two solutions:

Replace all zlib calls with sync calls and all problems go away and you will lower latencies and actually do work you need to do and not just fool yourself you're doing things "in the background".

OR if you really want to try and use multithreading here (which goes against how Node.js is built and deployed):

Only buffer up NUMBER_OF_LOGICAL_CPU_CORES with the async version and after this do sync calls until the buffer contains less than NUMBER_OF_LOGICAL_CPU_CORES tasks. Aka - you NEED to block sometime because CPU work cannot be "async" like networking. This solution is more complex and goes against what Node.js is but you get to choose.

lpinca commented 6 years ago

This issue https://github.com/nodejs/node/issues/8871#issuecomment-250915913 is the problem in ws and yes it will probably go away if sync calls are used but it's still valid as the async usage is what is actually recommended.

As written in the comment a different allocator can work around the issue and in fact, there is no leak on macOS.

I don't know if it's viable to use the Node.js zlib sync methods. It will certainly require extensive testing and a big production environment to validate the results.

lpinca commented 6 years ago

I'm not going to argue with you either. I've tried in the past, it's useless, there is no constructive discussion. Thank you for the suggestion. If anyone want to experiment with sync zlib calls, cool. Please do that! Run benchmarks, use it in a heavy-loaded production environment, analyse the event loop lag, etc. Then eventually open a PR.

lpinca commented 6 years ago

Again, experimentation with this in a production environment would be greatly appreciated. Just back it up with numbers.

lpinca commented 6 years ago

Where? BitMEX? In that case why isn't @STRML proposing a patch? They proposed multiple patches in the past to work around the issue. Or are they no longer interested now since they are using uws?

Open a PR, convince maintainers that it is the right thing to do (with numbers), PR merged, discussion ends, all users of ws benefit from this. How awesome is that.

STRML commented 6 years ago

Hey @lpinca - sorry for not updating you further. I'm happy to contribute a sync patch and put it through our benchmarks for completeness' sake. Even with concurrency=1 we still saw fragmentation-related blowups pretty much daily in production. I have not yet tested if sync will fix it but it rings true. Unfortunately this is a difficult bug to reproduce in test.

lpinca commented 6 years ago

Cool, looking forward to that. My biggest fear is that using sync calls will fix the memory fragmentation issue but introduce other issues like blocking the event loop for too long. This is what I meant above with "will probably go away".

STRML commented 6 years ago

Yeah it depends on how you've clustered out your app. If you're doing one app per hyperthread it's already pretty close to counterproductive to thread out zlib, and the fragmentation bug takes it over the edge.

lpinca commented 6 years ago

blocking with CPU work is always happening to some degree and inevitable.

Well this is why many computationally intensive tasks are delegated to the libuv thread pool no?

STRML commented 6 years ago

Well they can be. Zlib is a bit of an outlier because the standard task delegated to the libuv pool is IO, not CPU. If you assume that any process can actually spawn 1-5 CPU-intensive processes this completely breaks assumptions when clustering. Ideally you want to spawn one cluster worker per hyperthread otherwise.

lpinca commented 6 years ago

And here we are again with that arrogant attitude. The point is that it's not ws fault. Any Node.js app/library that uses the asynchronous zlib deflate methods is potentially affected by that memory fragmentation issue.

Is it viable to use the sync methods without incurring in other issues? I don't know. Very happy if this is the case as the code will actually be also a lot easier to maintain.

lpinca commented 6 years ago

Issue should stay open, and I already said thank you for your suggestion, not sure what else should I do haha. Please explain me exactly what is ws doing wrong because I'm shortsighted. Is using a recommended and public API ws fault? Is it ws fault if that API leads to memory fragmentation?

luislobo commented 6 years ago

So, https://github.com/websockets/ws/pull/1204 does not fix the issue completely, right?

lpinca commented 6 years ago

@luislobo according to the above comments, yes that's right.

rahbari commented 5 years ago

After some testing I saw the async version of gzip takes up to 10 times more than the sync version according to the content size.

100KB: sync version takes 5ms and async version 20ms 2KB: sync version takes 0.3ms and async version 3ms As size increase the overhead becomes more acceptable.

As websocket usually is used for small chunk of json data I think regardless of the memory leak it's better to use sync deflate for small data considering under 1ms operations are neglectable.

I suggest adding another threshold option for sync compression (syncThreshold maybe) with default value of 10KB or more. this way anyone who wants to go totally sync can increase this value.

mk-pmb commented 5 years ago

Sorry for not having read it properly a minute ago. :+1: for the threshold option.

piranna commented 5 years ago

There should be two thresholds, one to enable compression and another to do it async... Sometimes, gzip compressed data gets bigger than the original one for small data chunks, and also there's the CPU usage for the conpression.

El sáb., 25 de agosto de 2018 0:46, Alex Hultman notifications@github.com escribió:

As websocket usually is used for small chunk of json data I think regardless of the memory leak it's better to use sync deflate for small data considering under 1ms operations are neglectable.

Bingo.

At these small messages (most users send less than 500 bytes a message) the threading, eventing, copying and synchronization overhead it means to "make it async" becomes a bigger cost than the actual compression itself. Making it "async" only fools oneself as it accomplishes nothing of value.

I would never put compression as "async" because it simply is logically impossible to do so, but I also don't care enough to argue anymore. Setting a threshold would solve it enough to be usable.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/websockets/ws/issues/1369#issuecomment-415901743, or mute the thread https://github.com/notifications/unsubscribe-auth/AAgfviJ5nirilZGMU4_CHPXS9_BNVwMyks5uUIJjgaJpZM4TgILB .

ghost commented 5 years ago

I'm assuming you toss the compressed variant if it turns out bigger than the original.....

mk-pmb commented 5 years ago

I'll assume that for each length of input, gzip achives maximum efficiency if and only if all input bytes are the same. In this case, and using values from 0…127, the break-even is at 23 bytes. Inputs shorter than 23 bytes will result in larger outputs. For these, we shouldn't waste any CPU cycles on gzip and the length comparison. Compression becomes efficient starting at 24 bytes of input. For these we should compare lengths. Maybe even use a threshold a little bit higher, assuming that most messages won't be the same 7-bit character repeated. I see how people might want to tweak this threshold based on their message characteristics. Example: If I know that most of my messages will contain at least {"nick":"","msg":""}, I'd want to not spend extra CPU time for inputs up to 42 bytes long.

lpinca commented 5 years ago

Node.js does not have an official API to use inflate/deflate streams synchronously.

We can't use https://nodejs.org/api/zlib.html#zlib_zlib_deflaterawsync_buffer_options as that does not work as intended for fragmented messages of if context takeover is enabled. We need the ability to reuse the same https://nodejs.org/api/zlib.html#zlib_class_zlib_deflateraw instance.

In order to use a deflate stream synchronously we need to copy this internal function https://github.com/nodejs/node/blob/5d210d4ac938a16d132a1433857927c5c58a8954/lib/zlib.js#L459-L542 but ofc that can change at any time.

As said in https://github.com/websockets/ws/issues/1369#issuecomment-383808666 experimentation (backed by numbers) is welcomed.

lpinca commented 5 years ago

Yes, but where is the code that produced those results? How many deflate instances were used (we have one per socket)? How long is the event loop blocked when doing that for say 50k clients? I would like to see this kind of stuff.

lpinca commented 5 years ago

2KB: sync version takes 0.3ms and async version 3ms

@rahbari can you please share your code?

I did a dummy and irrelevant quick test and this is what I get:

Code

```js 'use strict'; const crypto = require('crypto'); const zlib = require('zlib'); const buf = crypto.randomBytes(2048); console.time('deflate sync'); for (let i = 0; i < 1000; i++) { zlib.deflateRawSync(buf); } console.timeEnd('deflate sync'); const deflate = zlib.createDeflateRaw(); deflate.resume(); let runs = 0; function runAsync() { deflate.write(buf); deflate.flush(() => { if (++runs === 1000) { console.timeEnd('deflate async'); return; } runAsync(); }); } console.time('deflate async'); runAsync(); ```

Results

``` $ for i in $(seq 1 10); do node index.js; echo ""; done deflate sync: 148.891ms deflate async: 169.522ms deflate sync: 149.968ms deflate async: 164.356ms deflate sync: 150.329ms deflate async: 164.349ms deflate sync: 144.671ms deflate async: 162.748ms deflate sync: 163.011ms deflate async: 164.543ms deflate sync: 147.325ms deflate async: 163.505ms deflate sync: 172.986ms deflate async: 169.949ms deflate sync: 148.711ms deflate async: 167.773ms deflate sync: 146.679ms deflate async: 168.316ms deflate sync: 153.743ms deflate async: 166.653ms ```

mk-pmb commented 5 years ago

I think it would be nicer to run the benchmark some few hundered times in the same nodejs process for warmup, only then collect measurements. v8 might be able to optimize functions, once it knows we want to use them a lot.

piranna commented 5 years ago

I'll assume that for each length of input, gzip achives maximum efficiency if and only if all input bytes are the same. In this case, and using values from 0…127, the break-even is at 23 bytes. Inputs shorter than 23 bytes will result in larger outputs. For these, we shouldn't waste any CPU cycles on gzip and the length comparison.

Not sure about the 23 bytes number, but yes, that's exactly what I was talking about :-)

piranna commented 5 years ago

Node.js does not have an official API to use inflate/deflate streams synchronously.

Streams not, but there's for buffers.

lpinca commented 5 years ago

@piranna please read the full comment. That creates a new instance for each buffer, see https://github.com/nodejs/node/blob/5d210d4ac938a16d132a1433857927c5c58a8954/lib/zlib.js#L692-L706. We can't use that for fragmented messages or if context takeover is enabled.

piranna commented 5 years ago

But does compression being done at fragment level? Isn't it message level?

lpinca commented 5 years ago

It is done at the frame/fragment level, yes.

lpinca commented 5 years ago

The option to enable compression based on the message byte size is available since version 2.0.0, see https://github.com/websockets/ws/blob/1e78aba9b9b29407517ec7ba1da97427c1678e5c/doc/ws.md#new-websocketserveroptions-callback

threshold {Number} Payloads smaller than this will not be compressed. Defaults to 1024 bytes.

lpinca commented 5 years ago

I noticed that there is Zlib.prototype._processChunk(). The API is "private" and undocumented but it seems to work and it is used in user land so it will probably not go away anytime soon.

const deflate = new zlib.DeflateRaw();
const data = deflate._processChunk(chunk, zlib.Z_SYNC_FLUSH);

fungiboletus commented 4 years ago

I made the following code to reproduce the problem (still happening on alpine linux or debian buster with ws@7.2.0 and node 12 or 13).

const WebSocket = require('ws');

const wss = new WebSocket.Server({
  port: 8080,
  perMessageDeflate: {
    zlibDeflateOptions: {
      memLevel: 3,
    },
    serverMaxWindowBits: 10,
    threshold: 16,
  },
});

wss.on('connection', (ws) => {
  ws.on('message', (message) => {
    ws.send(message);
  });
  ws.on('close', () => {
    console.log(`close, ${wss.clients.size} listeners left`);
  });
});

const message = Buffer.alloc(256);

function sendMessage(ws, j) {
  if (j < 100) {
    setTimeout(() => {
      ws.send(message, sendMessage.bind(null, ws, j + 1));
    }, 16);
  } else {
    ws.close();
  }
}

for (let i = 0; i < 1000; ++i) {
  const ws = new WebSocket('ws://localhost:8080/');
  ws.on('open', () => {
    sendMessage(ws, 0);
  });
}

I tried to make the zlib compression sync, but I only managed to break the unit tests. However from an external point of view of the ws project, I feel like the permessage-deflate.js file is already very dependant on the zlib.js NodeJS implementation. So I think it wouldn't be a huge issue to go further into custom code that will break if the NodeJS zlib.js changes, because it will already break with the current code.

By the way, the memory leak doesn't happen on Windows 10. I don't feel good about deploying windows servers to fix a memory leak though.

mk-pmb commented 4 years ago

Would it help to deflate/inflate buffers instead of the stream? That way we could use zlib.deflateSync. ("Added in: v0.11.12")

fungiboletus commented 4 years ago

The difficulty is to support "context takeover".

The RFC defines it like this :

The term "LZ77 sliding window" used in this section means the buffer used by the DEFLATE algorithm to store recently processed input. The DEFLATE compression algorithm searches the buffer for a match with the following input.

The term "use context takeover" used in this section means that the same LZ77 sliding window used by the endpoint to build frames of the previous sent message is reused to build frames of the next message to be sent.

In my opinion, context takeover is what makes websocket compression very cool. You can compress based on the previous messages, which is very convenient when you have many similar messages which I think is very common with websockets.

So, ws should keep the sliding window between messages and the easiest way to do it was to code something on top of zlib streams.

lpinca commented 4 years ago

@fungiboletus I can't reproduce on Ubuntu 18.04 using your example and Node.js 13.1.0:

server.js

```js 'use strict'; const WebSocket = require('ws'); const message = Buffer.alloc(256); const wss = new WebSocket.Server({ port: 8080, perMessageDeflate: { zlibDeflateOptions: { memLevel: 3 }, serverMaxWindowBits: 10, threshold: 16 } }); function onClose() { if (wss.clients.size === 0) { console.log('connections: 0'); } } function broadcast() { console.log('Broadcasting message'); for (const ws of wss.clients) { ws.send(message); ws.close(); } } wss.on('connection', (ws) => { if (wss.clients.size === 1000) { console.log('connections: 1000'); broadcast(); } ws.on('close', onClose); }); wss.on('listening', function() { console.log('Listening on *:8080'); }); setInterval(function() { console.log(process.memoryUsage()); }, 10000); ```

clients.js

```js 'use strict'; const WebSocket = require('ws'); const total = 1000; let connected = 0; function connect() { const ws = new WebSocket('ws://localhost:8080/'); ws.on('open', function() { if (++connected < total) connect(); ws.on('close', function() { if (--connected === 0) setTimeout(connect, 5000); }); }); } connect(); ```

Memory usage stabilizes at ~100 MiB.

LongTengDao commented 4 years ago

Why not send and receive compressed buffer in userland?

chanced commented 3 years ago

https://github.com/nodejs/node/pull/34048

This PR is an experiment which hopefully mitigates #8871. Moves zlib initialization to the point of execution on worker thread pool. This allows to postpone the corresponding memory allocation (the computation) to a later point in time which may be critical in situations when many zlib streams are created and operated simultaneously.

timotejroiko commented 3 years ago

Greetings,

I am experimenting with synchronous zlib compression in ws. This implementation is built on top of Zlib.prototype._processChunk() but because its also released as a standalone library, i added it to ws via a peer dependency to avoid tampering with the existing solution as much as possible. The code is still a bit rough and may be missing some error handling but here are some early test results made with a modified version of the /bench/speed.js test:

(windows 10 build 19041, node.js 14.8.0, intel i5 7300HQ 2.5ghz)

> node speedzlib.js
Generating 100 MiB of test data...
Testing ws on [::]:8181
10000 roundtrips of 64 B binary data:   2.9s    437.94 KiB/s
5000 roundtrips of 16 KiB binary data:  2.8s    55.45 MiB/s
1000 roundtrips of 128 KiB binary data: 13.5s   18.48 MiB/s
100 roundtrips of 1 MiB binary data:    9.6s    20.87 MiB/s
1 roundtrips of 100 MiB binary data:    9.4s    21.23 MiB/s
10000 roundtrips of 64 B text data:     3.6s    352.02 KiB/s
5000 roundtrips of 16 KiB text data:    2.6s    59.87 MiB/s
1000 roundtrips of 128 KiB text data:   13.2s   18.95 MiB/s
100 roundtrips of 1 MiB text data:      9.8s    20.34 MiB/s
1 roundtrips of 100 MiB text data:      9.5s    20.96 MiB/s

> npm install fast-zlib
> node speedzlib.js
Generating 100 MiB of test data...
Testing ws on [::]:8181
10000 roundtrips of 64 B binary data:   1.1s    1.11 MiB/s
5000 roundtrips of 16 KiB binary data:  1.5s    101.33 MiB/s
1000 roundtrips of 128 KiB binary data: 10.1s   24.77 MiB/s
100 roundtrips of 1 MiB binary data:    7.3s    27.53 MiB/s
1 roundtrips of 100 MiB binary data:    8s      24.93 MiB/s
10000 roundtrips of 64 B text data:     1.2s    1.05 MiB/s
5000 roundtrips of 16 KiB text data:    1.6s    96.19 MiB/s
1000 roundtrips of 128 KiB text data:   9.7s    25.65 MiB/s
100 roundtrips of 1 MiB text data:      7.7s    26.09 MiB/s
1 roundtrips of 100 MiB text data:      8.2s    24.29 MiB/s

Synchronous zlib appears to provide performance gains of 20% to 40%+ depending on packet size. If anyone wants to experiment with it and test for memory issues, the code is available in my fork of ws.

Cheers!

lpinca commented 3 years ago

fast-zlib is bit hacky but that is not a problem, we can propose an API to use inflate/deflate streams synchronously in Node.js core.

The problem as I see it is doing this synchronously when there are thousands of clients. Yes, it will be slower but doing it asynchronously does not block the event loop.

The benchmark above uses only a single client and reflects the results of running this benchmark https://github.com/websockets/ws/issues/1369#issuecomment-415950185. There is no need to use ws or transmit data over the network to benchmark this.

One more thing, it doesn't make sense to compress small messages. It is actually counterproductive. This is why ws does not compress messages whose size is < 1024 bytes by default. The threshold can be customized of course.

timotejroiko commented 3 years ago

Indeed, i intentionally disabled the threshold to test zlib in all situations, to avoid having mixed compressed and non-compressed results

piranna commented 3 years ago

The problem as I see it is doing this synchronously when there are thousands of clients. Yes, it will be slower but doing it asynchronously does not block the event loop.

What about a flag to left user decide to use them sync or async?

lpinca commented 3 years ago

It's an option but it complicates everything :(.

The user can already compress the message synchronously before sending it in a WebSocket message. It's not the same but close enough. This also allows to optimize a broadcast a lot because it allows to create only one frame and send the same frame to all clients. See discussion on https://github.com/websockets/ws/issues/617.

lpinca commented 3 years ago

99% of all WS apps always send messages smaller than 1KB, so this rule is essentially disabling compression altogether.

What is this based on? Probably true for stock trading apps but 99% of all WS apps seems a bit too broad. The default value can be changed but:

Compression is already disabled by default. I think there is not much need for it when using an efficient binary serialization/deserialization format.
Users can customize the threshold to fit their needs.

lpinca commented 3 years ago

I've seen and worked on WS apps used to transfer huge files, used to stream audio/video data, etc. Those apps used messages way bigger than 1 KB (64 KB / 16 KB).

ghost commented 3 years ago

Yes some use it like that. Strange. But it doesn't invalidate the observation that the vast majority of use cases don't. And it doesn't matter, really, all that matters is that small message sending is a huge use case with WS. So having an optimized/stable execution path that works well for small messages is important. That's all.

mk-pmb commented 3 years ago

I agree, we need an optimized path for applications that are basically signalling or chat, because that's how I will primarily use it. (Of course we shall still support large media transfer as well.)

lpinca commented 3 years ago

@mk-pmb to stay on topic, from my experience compressing very small messages (regardless of how compression is done, sync or async) does not make an app faster but slower.

If you are not happy with ws performance when dealing with very small messages you are in the wrong repository because the bottleneck is not ws but Node.js {net,tls}.Socket, so you should bring this up to Node.js core repository.

mk-pmb commented 3 years ago

compressing very small messages […] does not make an app faster but slower.

Thanks for reminding. I should have expressed myself more clearly. I meant that even in scenarios where network traffic is the bottle neck, where a few bytes saved make a difference, and it's thus acceptable to invest some more CPU cycles, our compression should still try to be economical in how many extra CPU cycles users would need to invest.

You're right that most of my chat-like apps won't usually run in such transmission-restricted circumstance though.

fungiboletus commented 3 years ago

The discussion about the performances is interesting, but I want to mention that the memory leak is still there with ws@7.3.1 and node 14.9.0. The current implementation may be faster in some uses cases and slower in some others, but it does leak memory.

lpinca commented 3 years ago

@fungiboletus I was not able to reproduce the memory fragmentation issue using your code above, see https://github.com/websockets/ws/issues/1369#issuecomment-553831310.

Also it should be better now as https://github.com/nodejs/node/pull/34048 mitigated the original issue.