Closed krshna-dtdl closed 4 months ago
That error indicates that the operation to talk to a set of talarias was either canceled by scytale, or a subset of talarias took too long to respond, leading to a timeout.
So is it talaria to scytale delay or Router to talaria delay?
Also, how can we overcome this problem? @denopink
So is it talaria to scytale delay or Router to talaria delay?
Talaria's response to the request from scytales is taking too long.
Also, how can we overcome this problem? @denopink
As you already know, there isn't a magic solution to this because it depends on your infra and how your xmidt cluster is set up.
Assuming your infra and xmidt cluster are set up correctly, one easy place to start may be increasing your talaria compute (VM size or total talaria count), in case your current talaria(s) are overwhelmed by the load from connected devices and incoming requests from scytale(s) and other services.
Closing this ticket for now because xmidt cluster tuning is out of this ticket's scope.
Feel free to open another ticket if you encounter any new issues relating to this.
Hello @denopink, we don't see any spike in resource usage in talaria or scytale, and the message is being received by talaria. As a next step, we are checking where the response is taking long enough to cause this error.
@Sachin4403 continuing the conversation from #1046
recap:
I suggested the following:
A timeout means talaria's response to the request from scytales is taking too long.
Have you confirmed talaria is receiving the request from scytale? If not, try running your talaria server with a DEBUG logging level.
If you're seeing logs indicating talaria is receiving the requests from scytale, then feel free to post the logs here.
But, if you don't see any logs indicating talaria is receiving the requests from scytale, then something in your stack is causing the timeout before it reaches talaria. At that point, I would suggest reaching out to your infra team/provider or ops running the testing environment for help figuring out the issue on your end.
You mentioned you were going to post some logs:
@denopink Yes, the request went to talaria
At talaria we are getting: Could not process device request
context canceled
https://github.com/xmidt-org/talaria/blob/c122b182c6f9a27783908d1d48f106469982b5ce/WRPHandler.go#L77
Will post more logs soon
I just read your latest comment.
I would also recommend checking the logs of your device (what's connected to talaria) to see whether the device is taking too long to respond to talaria.
Feel free to post your talaria and test device logs here; I may be able to give you better insight using those logs.
@Sachin4403 is this still an issue? Otherwise, I'm going to close this. 🙂
Hello @denopink
Feel free to close this, as we see the WRP requests are reaching the device. As a next step, we are checking on the device side what the root cause could be.
Sounds good, best of luck.
What could be the possible reason for this error?
logger.Error("fanout operation canceled or timed out", zap.Int("statusCode", http.StatusGatewayTimeout), zap.Any("url", original.URL), zap.Error(fanoutCtx.Err()))
We are getting this intermittently at scytale and are not sure of the reason.
https://github.com/xmidt-org/webpa-common/blob/cdd32a26087537693accfe44520e6d7995391b86/xhttp/fanout/handler.go#L328