zeromq / zbroker

Elastic pipes
Mozilla Public License 2.0

zyre issues on larger clusters #84

Closed willkelly closed 10 years ago

willkelly commented 10 years ago

We recently pushed zbroker to production (!)

This would be a joyous event, except it doesn't seem to be working in the new environment. There are a few factors here -- we've got more network interfaces than in staging, and we're running on more hosts. The current code seems to work fine in staging but not at all in prod -- we're seeing high error rates even on very simple tests.

Here's an example:

Error in test "scripts/simple_noop.yml"

Host "reader" (10.4.48.5)

Test script

open pipe1 read
close pipe1 read

Script Log

Set prefix to 1
Set broker to 1-0
Opening descriptor "1-0|1-pipe1"

Broker Log

zbroker service/0.0.1
Copyright (c) 2014 the Contributors
This Software is provided under the MPLv2 License on an "as is" basis,
without warranty of any kind, either expressed, implied, or statutory.

14-05-19 15:22:38 I: starting zpipes broker using config in '/tmp/tmpAPWSAT/zbroker.cfg'
14-05-19 15:22:38 I: joining cluster as 058F95
14-05-19 15:22:38 N: starting zpipes_server service
14-05-19 15:22:38 I: ZPIPES server appeared at 591D95
14-05-19 15:22:38 I: ZPIPES server appeared at 6FC4CD
14-05-19 15:22:38 I: ZPIPES server appeared at 5A49F1
14-05-19 15:22:38 I: ZPIPES server appeared at 45AE72
14-05-19 15:22:38 I: ZPIPES server appeared at C1D236
14-05-19 15:22:38 I: ZPIPES server appeared at B924BB
14-05-19 15:22:38 I: ZPIPES server appeared at F6C0F8
14-05-19 15:22:38 I: ZPIPES server appeared at B096FF
14-05-19 15:22:38 I: ZPIPES server appeared at AB7003
14-05-19 15:22:38 I: ZPIPES server appeared at 21D409
14-05-19 15:22:38 I: ZPIPES server appeared at 0A9109
14-05-19 15:22:38 I: ZPIPES server appeared at AB26E0
14-05-19 15:22:38 I: ZPIPES server appeared at 48976B
14-05-19 15:22:38 I: ZPIPES server appeared at 31C08E
14-05-19 15:22:38 I: ZPIPES server appeared at 6C305B
14-05-19 15:22:38 I: ZPIPES server appeared at 87172A
14-05-19 15:22:38 I: ZPIPES server appeared at 4A5E10
14-05-19 15:22:38 I: ZPIPES server appeared at 4F0384
14-05-19 15:22:38 I: ZPIPES server appeared at 5C7378
14-05-19 15:22:38 I: ZPIPES server appeared at 692316
14-05-19 15:22:38 W: [058F95BDFBB11F15FE8894458B49FF16] lost messages from 591D950229B66F7A0B6FBA786DAD6915
zbroker: zyre_node.c:531: zyre_node_recv_peer: Assertion `0' failed.

Host "writer" (10.4.48.7)

Test script

open pipe1 write
close pipe1 write

Script Log

Set prefix to 1
Set broker to 1-1
Opening descriptor "1-1|>1-pipe1"

Broker Log

zbroker service/0.0.1
Copyright (c) 2014 the Contributors
This Software is provided under the MPLv2 License on an "as is" basis,
without warranty of any kind, either expressed, implied, or statutory.

14-05-19 15:22:20 I: starting zpipes broker using config in '/tmp/tmpSmkuka/zbroker.cfg'
14-05-19 15:22:20 I: joining cluster as CF4084
14-05-19 15:22:20 N: starting zpipes_server service
14-05-19 15:22:20 I: ZPIPES server appeared at AB26E0
14-05-19 15:22:20 I: ZPIPES server appeared at 45AE72
14-05-19 15:22:20 I: ZPIPES server appeared at 591D95
14-05-19 15:22:20 I: ZPIPES server appeared at 21D409
14-05-19 15:22:20 I: ZPIPES server appeared at 6C305B
14-05-19 15:22:20 I: ZPIPES server appeared at F6C0F8
14-05-19 15:22:20 I: ZPIPES server appeared at C1D236
14-05-19 15:22:20 I: ZPIPES server appeared at 5C7378
14-05-19 15:22:20 I: ZPIPES server appeared at 0A9109
14-05-19 15:22:20 I: ZPIPES server appeared at 692316
14-05-19 15:22:20 I: ZPIPES server appeared at 87172A
14-05-19 15:22:20 I: ZPIPES server appeared at B924BB
14-05-19 15:22:20 I: ZPIPES server appeared at 6FC4CD
14-05-19 15:22:20 I: ZPIPES server appeared at 31C08E
14-05-19 15:22:20 I: ZPIPES server appeared at 5A49F1
14-05-19 15:22:20 I: ZPIPES server appeared at 4A5E10
14-05-19 15:22:20 I: ZPIPES server appeared at 48976B
14-05-19 15:22:20 I: ZPIPES server appeared at AB7003
14-05-19 15:22:20 I: ZPIPES server appeared at 4F0384
14-05-19 15:22:20 I: ZPIPES server appeared at B096FF
14-05-19 15:22:20 W: [CF4084321D73DA1D672A8A9E089ECEE8] lost messages from AB26E074CBE465B16F6BE0F3B78EE31A
zbroker: zyre_node.c:531: zyre_node_recv_peer: Assertion `0' failed.

hintjens commented 10 years ago

What's the kind of zpipes traffic going across the cluster? I'll add a little tracing to zyre.

(I've got a fairly large refactor of Zyre and zbroker in the works, using the new actor model in CZMQ. Hope this doesn't destabilize things too much...)

rpedde commented 10 years ago

Very little traffic: only test traffic consisting of very small transactions (open a pipe, write a byte, read a byte, close the pipe), with only 10 or so running concurrently. For the most part, the cluster was idle.

hintjens commented 10 years ago

OK, so it's not caused by high-water marks or the like, but rather by some interconnection failure. Let's start by switching off automatic interface detection and configuring everything explicitly, so we can bring up the Zyre cluster gradually in the production environment and isolate any problems as they hit.
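
On hosts with several network interfaces, one way to prepare for an explicitly configured setup is to decide, per host, which interface sits on the cluster network, and to refuse to guess when the answer is ambiguous. The following is only a minimal sketch in Python: the `10.4.48.0/24` network is an assumption inferred from the two host addresses in the report (the real prefix length is not stated in this thread), and the interface names and the non-cluster addresses are hypothetical.

```python
import ipaddress

# Assumption: cluster network inferred from the hosts 10.4.48.5 and 10.4.48.7
# in the report; the actual prefix length is not given in this thread.
CLUSTER_NET = ipaddress.ip_network("10.4.48.0/24")


def pick_cluster_interface(interfaces, network=CLUSTER_NET):
    """Return the name of the one interface whose address lies in `network`.

    `interfaces` maps interface name -> IPv4 address string.
    Raises ValueError if zero or several interfaces match, so an
    ambiguous host is reported instead of being silently autodetected.
    """
    matches = [
        name
        for name, addr in interfaces.items()
        if ipaddress.ip_address(addr) in network
    ]
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one interface on %s, found %r" % (network, matches)
        )
    return matches[0]


# Hypothetical interface table for the "reader" host (10.4.48.5).
reader_interfaces = {
    "lo": "127.0.0.1",
    "eth0": "192.168.1.20",
    "eth1": "10.4.48.5",
}

print(pick_cluster_interface(reader_interfaces))  # prints "eth1"
```

The resulting name is what would then be handed to Zyre, for example via `zyre_set_interface()` in the Zyre C API or the `ZSYS_INTERFACE` environment variable read by CZMQ in current releases. How zbroker exposes this setting in `zbroker.cfg` is not shown in this thread.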

hintjens commented 10 years ago

Is this resolved by setting the interface explicitly? If so, can we close it?

rpedde commented 10 years ago

Probably so. Will re-open under a new issue if it happens again.