Smoking gun and a dead chicken.

I have been working nonstop on a lingering issue: during peak times, some players and GMs have not been able to load their worlds.

What made this a challenge to debug is that it was so random. People would show up in Discord looking for a solution. I tried my best to cycle through all the known issues, but in the end we couldn't solve it. Games were cancelled and the players' dreams of world domination were dashed.

Then a customer came into Discord with the same issue. I was fortunate in that this person was not in the middle of a game and was able to stick around and help me do some deep diving.

So here is the funny thing about running a production environment. To make sure things are healthy, and to help diagnose issues like this one, we set up monitoring tools. These tools collect a great many statistics and forward them to a collection point. Then we can use fancy graphs to look at the data. Here is what one graph would look like.


Pretty cool eh?  

While I had the customer keep trying to access their game, I restarted all of the network elements one at a time, trying to determine which one was the culprit. I literally restarted every piece of software I run, thinking it was an over-capacity issue and that a restart would help. Nope, not a damn thing I did worked. I was at the point of pulling the plug on all the servers and going for a beer. (It only briefly crossed my mind.) There was only one thing left that I hadn't tried: I shut off the monitoring software. What the heck, it couldn't hurt to try, right? Well, guess what: he was instantly able to access his game server. I almost fell out of my chair. Needless to say, it is still turned off. This does have its downside.

The only thing I can think of is that there was so much monitoring data flowing over the network that it was creating traffic jams. Based on the type of hardware I am running, this "shouldn't" be happening, but it seems like it was. More research to do.
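For anyone who wants to sanity-check a "traffic jam" theory like this on their own setup, here is a rough back-of-envelope sketch. Every number in it (host count, metrics per host, bytes per sample, reporting interval, link speed) is a made-up placeholder and not a measurement from my servers, and the function name is just something I picked for illustration.

```python
def monitoring_share(hosts, metrics_per_host, bytes_per_sample, interval_s, link_mbps):
    """Estimate the fraction of a network link used by monitoring traffic.

    All inputs are assumptions you supply yourself; nothing is measured here.
    """
    if interval_s <= 0 or link_mbps <= 0:
        raise ValueError("interval_s and link_mbps must be positive")
    # Total monitoring payload per reporting interval, converted to bits per second.
    bits_per_second = hosts * metrics_per_host * bytes_per_sample * 8 / interval_s
    # Link capacity in bits per second (1 Mbps = 1,000,000 bits per second).
    link_bits_per_second = link_mbps * 1_000_000
    return bits_per_second / link_bits_per_second


if __name__ == "__main__":
    # Hypothetical example values only.
    share = monitoring_share(
        hosts=20,
        metrics_per_host=2000,
        bytes_per_sample=200,
        interval_s=10,
        link_mbps=1000,
    )
    print(f"Monitoring uses roughly {share:.2%} of the link")
```

If the share comes out tiny, raw bandwidth probably isn't the whole story, which lines up with my "shouldn't" above.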

I have taken a proactive step and ordered new hardware with more resources. It is way overkill, but I owe it to my very patient customers to provide as much horsepower as I can to try to prevent a similar situation.

Again, I wish to apologize for the less-than-stellar performance of late, and here is hoping for better days to come.

Please, if you have any questions, drop by Discord and say hi.

May your games always be fun,
Pencils forever sharp,
and your rulebooks at the ready.

G Pa Dax (Brad)

P.S. This debugging process would account for your games being restarted unexpectedly. I am sorry; it couldn't be helped.

Comments

  1. Hmm... been there, done that. The problem with some/most monitoring software is that you have to account for their monitoring load when sizing CPU/Memory of a running task. (Don't ask how I know - production, in 4 regions, mis-behaving with logging turned on is, well, undesirable).
    But it is annoying that to monitor, one has to "over-capacitize" tasks... If I wasn't running you, it'd cost $... while running you it costs $$... well, you can't win everything.

