wechaty / puppet-service

Wechaty Puppet Provider for providing/consuming the Wechaty Puppet Service
https://paka.dev/npm/wechaty-puppet-service
Apache License 2.0

GRPC Error: 14 UNAVAILABLE: TCP Write failed #31

Open · huan opened this issue 4 years ago

huan commented 4 years ago

When the puppet hostie calls hundreds of gRPC methods in parallel against the puppet service, there seems to be a very high chance of hitting the following error:

11:42:37 VERB Message ready()
11:42:37 VERB Puppet messagePayload(275128767010511814)
11:42:37 SILL Puppet messagePayloadCache(275128767010511814) cache MISS
11:42:37 VERB PuppetHostie messageRawPayload(275128767010511814)
11:42:38 VERB PuppetHostie startGrpcStream() eventStream.on(error) Error: 14 UNAVAILABLE: TCP Write failed
11:42:38 SILL Puppet constructor() this.on(reset) payload: "{"data":"startGrpcStream() eventStream.on(error) Error: 14 UNAVAILABLE: TCP Write failed"}"
11:42:38 SILL Puppet constructor() resetThrottleQueue.subscribe() reason: "startGrpcStream() eventStream.on(error) Error: 14 UNAVAILABLE: TCP Write failed"
11:42:38 VERB Puppet reset(startGrpcStream() eventStream.on(error) Error: 14 UNAVAILABLE: TCP Write failed)
11:42:38 VERB PuppetHostie stop()
11:42:38 SILL StateSwitch <PuppetHostie> off() is false
11:42:38 SILL StateSwitch <PuppetHostie> off() is false
11:42:38 VERB StateSwitch <PuppetHostie> off(pending) <- (false)

Links

huan commented 4 years ago

Update: it seems this problem is related to the container: it only happens when the hostie server runs inside Docker. (The wechaty-puppet-hostie was always running inside a Linux box in the tests above.)

su-chang commented 4 years ago

Here are some error logs related to this issue.

Project route: wechaty@0.39.14 -> hostie -> donut@0.3.11 -> Windows server (wechaty@0.39.14)

10:56:27 VERB PuppetHostie startGrpcStream()
10:56:27 VERB PuppetHostie onGrpcStreamEvent({type:EVENT_TYPE_LOGIN(25), payload:"{"contactId":"wxid_rr9ej1o8xv9h21"}"})
10:56:27 VERB Contact load(wxid_rr9ej1o8xv9h21) init pool
10:56:27 SILL Accessory #0<ContactSelf> constructor()
10:56:27 SILL Contact constructor(wxid_rr9ej1o8xv9h21)
10:56:27 SILL Contact ready() @ Puppet#0<PuppetHostie>(5dbe8221fc191f13bc072908) with id="wxid_rr9ej1o8xv9h21"
10:56:27 SILL Puppet contactPayload(wxid_rr9ej1o8xv9h21) cache MISS
10:56:27 VERB PuppetHostie contactRawPayload(wxid_rr9ej1o8xv9h21)
10:56:28 SILL StateSwitch <PuppetHostie> on() is pending
10:56:28 VERB StateSwitch <PuppetHostie> on(true) <- (pending)
10:56:28 VERB Puppet recover$() switchOn$ fired
10:56:28 VERB Puppet recover$() heartbeat: undefined
10:56:28 VERB Puppet recover$() switchOn$ fired
10:56:28 VERB Puppet recover$() heartbeat: undefined
10:56:28 VERB Wechaty on(heartbeat, function) registered
10:56:28 VERB Wechaty addListenerFunction(heartbeat)
10:56:28 SILL StateSwitch <Wechaty> on() is pending
10:56:28 VERB StateSwitch <Wechaty> on(true) <- (pending)
10:56:43 VERB Puppet recover$() heartbeatDing()
10:56:43 SILL PuppetHostie ding(recover$() CPR #0)
10:56:43 VERB PuppetHostie constructor() recover$().subscribe() next(0)
10:56:46 VERB PuppetHostie startGrpcStream() eventStream.on(error) Error: 14 UNAVAILABLE: TCP Read failed
10:56:46 SILL Puppet constructor() this.on(reset) payload: "{"data":"startGrpcStream() eventStream.on(error) Error: 14 UNAVAILABLE: TCP Read failed"}"
10:56:46 SILL Puppet constructor() resetThrottleQueue.subscribe() reason: "startGrpcStream() eventStream.on(error) Error: 14 UNAVAILABLE: TCP Read failed"
10:56:46 VERB Puppet reset(startGrpcStream() eventStream.on(error) Error: 14 UNAVAILABLE: TCP Read failed)
10:56:46 VERB PuppetHostie stop()
10:56:46 SILL StateSwitch <PuppetHostie> off() is false
10:56:46 SILL StateSwitch <PuppetHostie> off() is false
10:56:46 VERB StateSwitch <PuppetHostie> off(pending) <- (false)
10:56:46 VERB Puppet selfId()
10:56:46 SILL Contact ready() @ Puppet#0<PuppetHostie>(5dbe8221fc191f13bc072908) with id="wxid_rr9ej1o8xv9h21"
10:56:46 SILL Puppet contactPayload(wxid_rr9ej1o8xv9h21) cache MISS
10:56:46 VERB PuppetHostie contactRawPayload(wxid_rr9ej1o8xv9h21)
10:56:46 VERB PuppetHostie stopGrpcStream()
10:56:46 VERB Contact ready() this.puppet.contactPayload(wxid_rr9ej1o8xv9h21) exception: 14 UNAVAILABLE: TCP Read failed
10:56:46 ERR Config ###########################
10:56:46 ERR Config unhandledRejection: Error: 14 UNAVAILABLE: TCP Read failed [object Promise]
10:56:46 ERR Config ###########################
10:56:46 ERR Config process.on(unhandledRejection) promise.catch(14 UNAVAILABLE: TCP Read failed)
Config { Error: 14 UNAVAILABLE: TCP Read failed
    at Object.exports.createStatusError (/Users/guxiaobei/develop/juzi/xiaoju-bot/node_modules/grpc/src/common.js:91:15)
    at Object.onReceiveStatus (/Users/guxiaobei/develop/juzi/xiaoju-bot/node_modules/grpc/src/client_interceptors.js:1209:28)
    at InterceptingListener._callNext (/Users/guxiaobei/develop/juzi/xiaoju-bot/node_modules/grpc/src/client_interceptors.js:568:42)
    at InterceptingListener.onReceiveStatus (/Users/guxiaobei/develop/juzi/xiaoju-bot/node_modules/grpc/src/client_interceptors.js:618:8)
    at callback (/Users/guxiaobei/develop/juzi/xiaoju-bot/node_modules/grpc/src/client_interceptors.js:847:24)
  code: 14,
  metadata: Metadata { _internal_repr: {}, flags: 0 },
  details: 'TCP Read failed' }
10:56:46 ERR PuppetHostie ding() rejection: Error: 14 UNAVAILABLE: TCP Read failed
(node:13299) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 1)
10:56:46 SILL Puppet contactPayload(wxid_rr9ej1o8xv9h21) cache SET
10:56:46 VERB ContactSelf name()
10:56:46 INFO logoutListener BOT LOGOUT: 小北 logout
su-chang commented 4 years ago

Log

[0609/113010.421:ERROR:gl_surface_egl.cc(770)] ANGLE Platform initialization failed.
[260:283:0609/043011:ERROR:network_change_notifier_win.cc(156)] WSALookupServiceBegin failed with: 8
[260:283:0609/043011:ERROR:xwalk_net_log.cc(65)] Could not open file f:\xw_log\xw_net_log.txt for net logging
[260:283:0609/043011:ERROR:xwalk_platform_notification_service.cc(143)] NOT IMPLEMENTED
[260:283:0609/043011:ERROR:xwalk_browser_context.cc(88)] Failed to read preference, error num: 0
11:30:14 SILL Io start() setInterval() ws.ping()
11:30:18 VERB PuppetServiceImpl stop()
11:30:18 VERB EventStreamManager stop()
11:30:18 VERB EventStreamManager this.onStreamingCallEnd() this.eventStream.on(finish) fired
11:30:18 VERB EventStreamManager connectPuppetEventToStreamingCall() unconnect() 13 callbacks
11:30:18 VERB PuppetDonut stop()
11:30:18 SILL StateSwitch <PuppetDonut> off() is false
11:30:18 SILL StateSwitch <PuppetDonut> off() is false
11:30:18 VERB StateSwitch <PuppetDonut> off(pending) <- (false)
11:30:18 VERB PuppetDonut stopBridge()
11:30:18 SILL StateSwitch <PuppetDonut> on() is false
11:30:18 VERB Bridge stop()
11:30:18 VERB CacheManager release()
11:30:18 VERB CacheManager releaseCache()
11:30:18 SILL CacheManager releaseCache() closing caches ...
11:30:18 SILL CacheManager releaseCache() cache closed.
11:30:18 SILL StateSwitch <PuppetDonut> off() is pending
11:30:18 VERB StateSwitch <PuppetDonut> off(true) <- (pending)
11:30:18 SILL Io initWebSocket() ws.on(message): {"name":"jsonrpc","payload":{"id":-9007199254739864,"jsonrpc":"2.0","method":"getHostieGrpcPort"}}
11:30:18 INFO Io on(jsonrpc): [object Object]
11:30:18 VERB Io on(jsonrpc) send({"id":-9007199254739864,"jsonrpc":"2.0","result":8828})
11:30:18 SILL Io send({"name":"jsonrpc","payload":{"id":-9007199254739864,"jsonrpc":"2.0","result":8828}})
11:30:19 SILL Io initWebSocket() ws.on(message): {"name":"sys","payload":"roger"}
11:30:19 VERB PuppetServiceImpl start()
11:30:19 VERB PuppetDonut start()
11:30:19 SILL StateSwitch <PuppetDonut> on() is false
11:30:19 SILL StateSwitch <PuppetDonut> on() is false
11:30:19 VERB StateSwitch <PuppetDonut> on(pending) <- (false)
11:30:19 VERB PuppetDonut startBridge()
11:30:19 SILL StateSwitch <PuppetDonut> off() is false
11:30:19 VERB Bridge initWeChatSocket()
11:30:19 VERB BridgeCallback getConnectCallback()
11:30:19 VERB BridgeCallback getCloseCallback()
11:30:19 VERB BridgeCallback getEventCallback()
11:30:19 VERB Donut(ffi-adapter) InitWeChatSocket(�(>,��>,��>)
11:30:19 VERB ffi-adapter Backend.lib(0: "Z:\juzi\node_modules\@juzibot\wechaty-puppet-donut\dlls\WxLoader.dll")
11:30:19 VERB PuppetServiceImpl event()
11:30:19 VERB EventStreamManager start(stream)
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() for Puppet#0<PuppetDonut>(donut-friday)
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(friendship) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(login) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(logout) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(message) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(room-invite) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(room-join) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(room-leave) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(room-topic) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(scan) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(dong) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(error) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(heartbeat) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(ready) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager connectPuppetEventToStreamingCall() this.puppet.on(reset) (listenerCount:0) registering...
11:30:19 VERB EventStreamManager onStreamingCallEnd(callback)
11:30:19 VERB EventStreamManager start() puppet is logged in, emit a login event for downstream
11:30:19 VERB Puppet selfId()
11:30:19 VERB EventStreamManager grpcEmit(EVENT_TYPE_LOGIN[25], {"contactId":"wxid_4wkls7tu62ua12"})
11:30:19 VERB EventStreamManager grpcEmit(EVENT_TYPE_SCAN[22], {"qrcode":"http://weixin.qq.com/x/A_bFDtou4uVXoWc2wp8o","status":2})
11:30:19 VERB PuppetDonut onLogin(wxid_4wkls7tu62ua12)

Key Info

  1. PuppetServiceImpl stop()
  2. EventStreamManager this.onStreamingCallEnd() this.eventStream.on(finish) fired
  3. PuppetServiceImpl start()
  4. EventStreamManager grpcEmit(EVENT_TYPE_SCAN[22], {"qrcode":"http://weixin.qq.com/x/A_bFDtou4uVXoWc2wp8o","status":2})
  5. PuppetDonut onLogin(wxid_4wkls7tu62ua12)

Every PuppetServiceImpl stop() causes a scan event to be emitted with the same QR code link that was used for the first login.

WeChat is already logged in, but the old QR code and a login event are still emitted.

huan commented 4 years ago

It seems there is one issue we can fix right now: if we do not have a valid QR code, we should not emit the scan event, because the code is invalid.
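A minimal sketch of that guard (the helper below is an assumption shaped like the grpcEmit calls in the log above, not the actual EventStreamManager code):

// Only forward a scan event when a QR code value actually exists;
// an empty or stale code is dropped instead of being re-emitted downstream.
function emitScanIfValid (
  grpcEmit: (type: 'EVENT_TYPE_SCAN', payloadJson: string) => void,
  qrcode: string | undefined,
  status: number,
): void {
  if (!qrcode) {
    return
  }
  grpcEmit('EVENT_TYPE_SCAN', JSON.stringify({ qrcode, status }))
}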

windmemory commented 4 years ago

Shall we add some checks around the reset inside the error listener? We should not reset the Wechaty instance unless we know the error means something is really wrong and a reset is needed to recover from it; otherwise, just printing the error to the console should be okay.
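For example, a sketch of such a check (the error shape and helper here are assumptions, not the actual Puppet API); only errors we cannot classify as transient would still trigger a reset:

// gRPC status 14 (UNAVAILABLE) usually means a transport hiccup that the
// stream may survive; anything else falls through to the existing reset path.
interface GrpcLikeError {
  code?: number
  details?: string
}

const TRANSIENT_GRPC_CODES = new Set([ 14 ]) // UNAVAILABLE

function onPuppetError (
  error: GrpcLikeError,
  reset: (reason: string) => void,
): void {
  if (typeof error.code === 'number' && TRANSIENT_GRPC_CODES.has(error.code)) {
    console.error('transient gRPC error, logging instead of resetting:', error.details)
    return
  }
  reset(`unrecoverable error: ${error.details ?? 'unknown'}`)
}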

huan commented 4 years ago

I agree with you that it would be better to do more checking when we get the error event, and to try to recover the system without doing any reset.

However, the reset was designed to recover the Wechaty/Puppet from unknown error states, because it's very hard for us to cover all error conditions.

Please let me know if you have any way to recover the system from unknown errors, thanks!

windmemory commented 4 years ago

I would assume most errors we have encountered recently do not require a system recovery, since the system is not actually down. So it would be better to log to the console as the default response to errors instead of resetting.

windmemory commented 4 years ago

Maybe we need to refer to those comments and add retry logic for our gRPC connection.
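A rough sketch of what that could look like (startGrpcStream here is a hypothetical stand-in for the real reconnect entry point), retrying with exponential backoff instead of resetting the whole puppet:

// Re-establish the event stream a few times before giving up.
async function startGrpcStreamWithRetry (
  startGrpcStream: () => Promise<void>,
  maxRetries = 5,
): Promise<void> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      await startGrpcStream()
      return
    } catch (e) {
      const backoffMs = 1000 * 2 ** attempt
      console.error(`gRPC stream failed (attempt ${attempt + 1}), retrying in ${backoffMs}ms`, e)
      await new Promise<void>(resolve => setTimeout(resolve, backoffMs))
    }
  }
  throw new Error(`gRPC stream could not be re-established after ${maxRetries} attempts`)
}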

huan commented 4 years ago

Thanks for the link. I'd like to accept a PR that adds retry logic to the GRPC service.

However, I feel there are other causes of the Write failed error. As I posted above in April, in my experience this issue happens a lot when we are working with containers, but not so much on native Windows.

windmemory commented 4 years ago

However, I feel there are other causes of the Write failed error. As I posted above in April, in my experience this issue happens a lot when we are working with containers, but not so much on native Windows.

I agree. There must be some weird network issue when we run inside containers. To resolve it, I think we have two options:

  1. Add retry to recover from the disconnection
  2. Figure out the network issue and fix it

I would prefer to do both, but since I don't have any insights related to the second one, I will do the first one first :P

windmemory commented 4 years ago

I finally got code that reproduces it:

import { Contact, Wechaty, Room } from 'wechaty';

const bot = new Wechaty({ puppet: 'wechaty-puppet-hostie', puppetOptions: {
  token: 'a-very-secret-token-that-no-one-knows',
}});

const syncContact = async (contacts: Contact[]) => {
  await Promise.all(contacts.map(c => c.sync()));
}

const syncRoom = async (rooms: Room[]) => {
  await Promise.all(rooms.map(async r => {
    await r.sync();
    const members = await r.memberAll();
    await Promise.all(members.map(m => m.sync()));
  }))
}

bot.on('login', async user => {
  console.log(`Login: ${user}`);
  setInterval(async () => {
    const contacts = await bot.Contact.findAll();
    const rooms = await bot.Room.findAll();
    void syncContact(contacts);
    void syncRoom(rooms);
  }, 5000);
}).start();

This will trigger the TCP Write failed error at some point during the sync calls.

windmemory commented 4 years ago

I found that the lower-level error happens at the TCP layer:

read: error={"created":"@1594209127.281811000","description":"EOF","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":106}
... // A lot of write binary data log
write complete on 0x10271cf00: error="No Error"
TCP 0x10271cf00 shutdown why={"created":"@1594209127.283200000","description":"Delayed close due to in-progress write","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":593,"referenced_errors":[{"created":"@1594209127.281902000","description":"Endpoint read failed","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2559,"grpc_status":14,"occurred_during_write":0,"referenced_errors":[{"created":"@1594209127.281811000","description":"EOF","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":106}]}]}
I0708 19:52:07.339226000 4344593856 tcp_client_custom.cc:151]          CLIENT_CONNECT: 0x102a2c940 ipv4:68.79.51.17:8803: asynchronously connecting

To catch this error, we could use

GRPC_VERBOSITY=info GRPC_TRACE=tcp ts-node test.ts

to run the test code above. Then you will probably see the error above.

windmemory commented 4 years ago

From this blog post

4. Summary: if the server side has not started, the client receives an RST when it tries to connect.

After the connection is established, if the reading end or the writing end closes the connection, there are two cases:

If the reading end closes the connection and the writing end keeps writing, the first write receives an RST, and further writes fail with "broken pipe". If the writing end closes the connection and the reading end keeps reading, the reads fail with EOF.

So the problem is probably that the server side closes the connection first. I will try to dig deeper into the service-side logs.

windmemory commented 4 years ago

Here is the log from the hostie server when the connection dropped:

tcp_custom.cc:218] write complete on 029FCC20: error={"created":"@1594210168.896000000","description":"TCP Write failed","file":"d:\a\grpc-node\grpc-node\packages\grpc-native-core\deps\grpc\src\core\lib\iomgr\tcp_uv.cc","file_line":72,"grpc_status":14,"os_error":"message too long"}

There are multiple TCP Write failed errors following this line. So I think the root cause of this bug is the "message too long" error reported at the OS level.

windmemory commented 4 years ago

The TCP Write failed errors with os_error set to "message too long" somehow point to one keyword: MTU.

Then I found some articles related to Docker MTU issues: https://mlohr.com/docker-mtu/ https://medium.com/@sylwit/how-we-spent-a-full-day-figuring-out-a-mtu-issue-with-docker-4d81fdfe2caf

When we use Docker to start donut, there are multiple network layers involved. I think some layer within the stack might have a different MTU, which then causes the error above. Will dig deeper.
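For reference, one way to check for such a mismatch (this is an assumption about the setup, not a verified fix): compare the MTU of the host interface with the Docker bridge, and lower the Docker daemon MTU if the host value is smaller.

# Values below are only examples; interface names differ per host.
ip link show eth0     # e.g. mtu 1450 on some cloud/VPN hosts
ip link show docker0  # the Docker bridge defaults to mtu 1500

# If docker0 advertises a larger MTU than the host interface, set
# { "mtu": 1450 } in /etc/docker/daemon.json and restart the daemon:
sudo systemctl restart docker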

windmemory commented 4 years ago

Linked Stack Overflow question

Thanks to murgatroid99's comments, another keyword, EMSGSIZE, appears. Below is the original answer:

The string "message too long" is the error message for the error code UV_EMSGSIZE, which corresponds to the Linux error code EMSGSIZE. That appears to generally mean that gRPC is trying to write a buffer that is too large to the socket, but I'm not sure what exactly could trigger that.

Searching with the keyword EMSGSIZE, more clues point to MTU again. I also found another mechanism TCP uses to deal with the MTU: Path MTU Discovery. I suspect the MTU used by gRPC is not aligned with the one TCP automatically discovered, which results in an MTU inconsistency and thus causes the error we've seen.

Will investigate more.

windmemory commented 4 years ago

@huan Any progress on your side?

huan commented 4 years ago

Unfortunately, I have not started yet because I have to finish some other tasks before I can enjoy debugging this MONSTER.

Thank you very much for your patience. I will let you know when I get to it soon!

windmemory commented 4 years ago

Got it!

huan commented 4 years ago

Maybe we should have a look at ulimit for the Docker container:

See:

The userns_mode option might also be worth a try for tuning, with a reading of https://docs.docker.com/network/host/
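A hedged example of trying both knobs together (the flags are standard docker run options, but whether they actually help with this issue is untested, and the image name is a placeholder):

# Raise the open-file limit and share the host network stack (Linux only),
# so the container inherits the host's MTU and routing.
docker run \
  --ulimit nofile=65536:65536 \
  --network host \
  your-puppet-service-image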

windmemory commented 4 years ago

Maybe we should have a look at ulimit for the Docker container:

See:

The userns_mode option might also be worth a try for tuning, with a reading of https://docs.docker.com/network/host/

Have you tested these configurations? What were the results? Did they solve the problem in this issue?

huan commented 3 years ago

Update:

We started using @grpc/grpc-js instead of grpc in Wechaty v0.57, so the behavior in this issue might be different now.

Let's keep an eye on this issue once we have upgraded our puppet service server to Wechaty v0.57 or above.

MauritsR commented 2 years ago

@huan In an application I'm working on, I'm seeing the same gRPC error. It uses grpc as well; can you confirm whether this behavior magically disappeared after migrating to @grpc/grpc-js?

huan commented 2 years ago

Hi @MauritsR ,

We have not seen this issue since we switched to @grpc/grpc-js, so I think you can try it and see whether the pure-JS implementation fixes the issue on your side.