wearable-learning-cloud-platform / wlcp-issues

Game Server / Game Player Connectivity Issues #215

Closed mmicciolo closed 2 years ago

mmicciolo commented 2 years ago

This issue concerns a bug where, once a game instance is started, no players can connect to that instance and all of the players get stuck at "connecting..."

No state or transition data is received.

From @sai-educ in meeting minutes.

image

As you can see, all devices are stuck at "connecting..." and no states or transitions are received for the user to interact with.

mmicciolo commented 2 years ago

After many hours today, I think I have finally figured out the true cause of the stuck at "connecting..." issue.

Originally I thought it was due to the system running on some very old versions of Spring Boot. We ruled that out by upgrading production (master branch) to the latest version and then testing. The system stability and speed seemed to improve a bit, but the "connecting..." issues were still present.

I also considered stability issues with AWS, but the connecting issue has been so persistent that this is definitely ruled out.

Another thing I suspected was that packets were being dropped on receipt by the wlcp-gateway, so the connecting packets weren't making it to the server. This has been ruled out. Today I was able to reproduce the issue locally and on the cloud, and the connections and communication are working perfectly. This includes the STOMP connect-to-server protocol as well as our own connect-to-game-instance protocol.

The actual issue occurs when the game instance is first started: it makes a call to the wlcp-transpiler microservice to get a transpiled version of the game the user is requesting to start.

That call seems to be causing the issue. Either a "time out" exception is thrown on the call, or an exception happens in the transpiler and it returns a malformed string or no data at all. Since there is no error handling around this, the error has gone unnoticed. Because the string returned from the transpiler is malformed or null, when a user connects and the game server starts up a new PlayerVM, it's feeding a null or malformed string to the JavaScript engine, and essentially no game is being started in the PlayerVM's JavaScript engine.

This is why the game player gets stuck at "connecting...": it's expecting the PlayerVM to send its first packet, but it never does, since the game is never even running in the JavaScript engine.

After many tries I have been able to reproduce it locally and in the cloud. Now I need to figure out why these "between microservice" Feign client calls are failing / timing out.
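The missing error handling described above could look roughly like the sketch below. It is only an illustration: `TranspilerGuard`, `fetchScript`, and the `Supplier` standing in for the transpiler call are hypothetical names, not code from the WLCP repositories. It also only catches a thrown exception, a null, or a blank result; a malformed but non-empty script would need a real validity check.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from the WLCP codebase.
class TranspilerGuard {

    /**
     * Calls the transpiler and returns the script only if it is usable.
     * A thrown exception (for example a timeout), a null, or a blank
     * result all come back as Optional.empty().
     */
    static Optional<String> fetchScript(Supplier<String> transpilerCall) {
        try {
            String script = transpilerCall.get();
            if (script == null || script.isBlank()) {
                return Optional.empty();
            }
            return Optional.of(script);
        } catch (RuntimeException e) {
            // Report the failure instead of letting it pass unnoticed.
            System.err.println("Transpiler call failed: " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<String> ok = fetchScript(() -> "var game = {};");
        Optional<String> timedOut = fetchScript(() -> {
            throw new IllegalStateException("connect timed out");
        });
        System.out.println("usable script: " + ok.isPresent());
        System.out.println("after timeout: " + timedOut.isPresent());
    }
}
```

With a check like this, the game server could refuse to start the instance when the result is empty, rather than handing nothing to the PlayerVM.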

Another thing I have discovered is that, by default, the Spring Boot STOMP message broker does not guarantee in-order delivery. So when users connect and get the games running, I think it's possible that packets sometimes arrive out of order and cause unexpected behavior.
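One general way for a receiver to tolerate out-of-order delivery is to buffer packets by sequence number and release them only in order. The sketch below assumes each packet carries a sequence number starting at 0, which this thread does not say the WLCP protocol does; `ReorderBuffer` is a hypothetical name, and this is not a fix that was applied here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: assumes each packet carries a sequence number
// starting at 0, which the original thread does not state.
class ReorderBuffer {
    private final Map<Long, String> pending = new HashMap<>();
    private long nextExpected = 0;

    /** Accepts one packet and returns every packet that is now deliverable in order. */
    List<String> accept(long sequence, String payload) {
        List<String> ready = new ArrayList<>();
        if (sequence < nextExpected) {
            return ready; // duplicate or stale packet, ignore it
        }
        pending.put(sequence, payload);
        // Release the contiguous run starting at the next expected number.
        while (pending.containsKey(nextExpected)) {
            ready.add(pending.remove(nextExpected));
            nextExpected++;
        }
        return ready;
    }

    public static void main(String[] args) {
        ReorderBuffer buffer = new ReorderBuffer();
        System.out.println(buffer.accept(1, "transition")); // held back: prints []
        System.out.println(buffer.accept(0, "state"));      // prints [state, transition]
    }
}
```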

More to follow shortly...

mmicciolo commented 2 years ago

So the true issue is the Feign clients. These Feign clients allow microservices to make API calls to each other. For example, wlcp-gameserver has 2 Feign clients:

wlcp-transpiler has 1 Feign client:

The default connection-timeout for these clients is 2000 ms, so if a client cannot make a connection within 2 seconds it times out. When the server is under load, that value is far too small or, at best, barely enough.

The cherry on top is that there is a bug when you want to change this connection-timeout value. In the code there is a check to see whether you are setting both the connection-timeout value and the read-timeout value. So in order to change the connection-timeout value, the read-timeout value has to be set too, or the Feign client won't pick up the config change.

The default read-timeout is 60 seconds... so I am going to set both to 60 seconds so it takes the config change.
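The both-or-nothing behavior described above can be modeled in a few lines. This is a hypothetical model, not the Feign or Spring Cloud source: `TimeoutConfigSketch` and `effectiveTimeouts` are made-up names, and the default values are simply the ones quoted in this thread.

```java
// Hypothetical model of the behavior described above; this is not the
// Spring Cloud OpenFeign source, and the names are made up.
class TimeoutConfigSketch {
    static final int DEFAULT_CONNECT_MS = 2_000;  // default quoted in this thread
    static final int DEFAULT_READ_MS = 60_000;    // default quoted in this thread

    /** Returns {connectMs, readMs} that would actually be in effect. */
    static int[] effectiveTimeouts(Integer connectMs, Integer readMs) {
        // The override is applied only when BOTH values are configured.
        if (connectMs != null && readMs != null) {
            return new int[] {connectMs, readMs};
        }
        return new int[] {DEFAULT_CONNECT_MS, DEFAULT_READ_MS};
    }

    public static void main(String[] args) {
        int[] onlyConnect = effectiveTimeouts(60_000, null);
        int[] both = effectiveTimeouts(60_000, 60_000);
        System.out.println("connect only -> " + onlyConnect[0] + " ms");
        System.out.println("both set     -> " + both[0] + " ms");
    }
}
```

Setting only the connect timeout leaves the 2000 ms default in effect in this model, which is why both values are set together.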

I have finished testing this locally and it has fixed the issue. I've also figured out a way to reproduce the stuck at "connecting..." issue on demand, so testing takes much less time.

See this Stack Overflow thread I found, where others have similar timeout issues. The defaults are too aggressive and there is a bug with how the config is read. https://stackoverflow.com/questions/38080283/how-to-solve-timeout-feignclient

Config change is as follows:

feign:
  client:
    config:
      default:
        connectTimeout: 60000
        readTimeout: 60000

mmicciolo commented 2 years ago

This bug has been fixed and temporarily deployed to production (master) : http://new.wearablelearning.org/

mmicciolo commented 2 years ago

This bug has been fixed and deployed to dev : http://dev.wearablelearning.org/

mmicciolo commented 2 years ago

Testing complete. This issue has not shown up again since it was fixed.