xmidt-org / webpa-common

The collection of small common packages for the webpa project.
Apache License 2.0

SCYTALE: fanout operation canceled or timed out #1045

Closed krshna-dtdl closed 4 months ago

krshna-dtdl commented 5 months ago

What could be the possible reason for this error?

logger.Error("fanout operation canceled or timed out", zap.Int("statusCode", http.StatusGatewayTimeout), zap.Any("url", original.URL), zap.Error(fanoutCtx.Err()))

We are getting this intermittently at SCYTALE and are not sure of the reason.

https://github.com/xmidt-org/webpa-common/blob/cdd32a26087537693accfe44520e6d7995391b86/xhttp/fanout/handler.go#L328

denopink commented 5 months ago

That error means scytale's fanout request to a set of talarias was either canceled by scytale or timed out because a subset of talarias took too long to respond.

krshna-dtdl commented 5 months ago

So is it talaria to scytale delay or Router to talaria delay?

Also, how can we overcome this problem? @denopink

denopink commented 5 months ago

So is it talaria to scytale delay or Router to talaria delay?

Talaria's response to the request from scytales is taking too long.

Also, how can we overcome this problem? @denopink

As you already know, there isn't a magic solution to this because it depends on your infra and how your xmidt cluster is set up.

Assuming your infra and xmidt cluster are set up correctly, one easy place to start may be increasing your talaria compute (VM size or total talaria count), in case your current talaria(s) are overwhelmed by the load from connected devices and incoming requests from scytale(s) and other services.

closing this ticket for now because xmidt cluster tuning is out of this ticket's scope.

feel free to open up another ticket if you encounter any new issues relating to this.

Sachin4403 commented 5 months ago

Hello @denopink, we don't see any resource spike in talaria or scytale, and the message is being received by talaria. As a next step, we are checking where the response is being delayed, since that delay is what leads to this error.

denopink commented 5 months ago

@Sachin4403 continuing the conversation from #1046

recap:

I suggested the following:

A timeout means talaria's response to the request from scytales is taking too long.

Have you confirmed talaria is receiving the request from scytale? If not, try running your talaria server with a DEBUG logging level.

If you're seeing logs indicating talaria is receiving the requests from scytale, then feel free to post the logs here.

But, if you don't see any logs indicating talaria is receiving the requests from scytale, then something in your stack is causing the timeout before it reaches talaria. At that point, I would suggest reaching out to your infra team/provider or ops running the testing environment for help figuring out the issue on your end.

You mentioned you were going to post some logs:

@denopink Yes the request went to talaria

At talaria we are getting: Could not process device request

context canceled

https://github.com/xmidt-org/talaria/blob/c122b182c6f9a27783908d1d48f106469982b5ce/WRPHandler.go#L77

Will post more logs soon

I just read your latest comment.

I would also recommend checking the logs of your device (whatever is connected to talaria) to see whether the device is taking too long to respond to talaria.

Feel free to post your talaria and test device logs here; I may be able to give you better insight using those logs.

denopink commented 4 months ago

@Sachin4403 is this still an issue? Otherwise, I'm going to close this. 🙂

Sachin4403 commented 4 months ago

Hello @denopink

Feel free to close this. We can see that the WRP requests are reaching the device, so as a next step we are checking on the device side for the root cause.

denopink commented 4 months ago

Sounds good, best of luck.