Posts

Sunshine and Rainbows

OK, well, it looks like the migration worked. It took just 90 minutes to move all the user accounts and update the user data. There were a few bugs, which I was able to hammer out quickly. It was a long road to get here, but we are here. Early reports are encouraging: we are seeing 3 to 4 times performance improvements. Again, I just want to thank you all for hanging in there with me on this journey and having faith that it would work eventually. May your games always be fun, your pencils forever sharp, and your rule books at the ready. G Pa Dax (Brad)

I surrender :(

I have worked feverishly over the last 6 weeks trying to make this new setup work. No matter what I try, I can't fix the poor performance some people are experiencing. I lie awake at night racking my brain for new things to try, and trust me, I have tried a great many things. I have to recognize that the new infrastructure I created just doesn't work well. The improvements I promised the users just didn't pan out; in fact, things are worse than what I upgraded from. So what now? I am going back to the previous cloud provider I was using. It is an infrastructure I know works, and I will try to use their environment more efficiently to gain the performance I was hoping for. I have been giving a lot of thought to how to do this with zero downtime, and I think I can make it happen with nothing but a "Disconnection" message in your game window. I will update the Discord announcements with all the details. I am hoping to do this before next weekend, but tha

Smoking gun and a dead chicken.

I have been working non-stop on a lingering issue: during peak times, some players/GMs are not able to load their worlds. What has made this a challenge to debug is that it was so random. People would show up in Discord looking for a solution. I tried my best to cycle through all the known issues, but in the end we couldn't solve it. Games were cancelled, and the players' dreams of world domination were dashed. Then a customer came into Discord with the same issue. I was fortunate in that this person was not in the middle of a game and was able to stick around and help me do some deep diving. So here is the funny thing about running a production environment. To make sure things are healthy, and to help diagnose issues like this, we set up monitoring tools. These tools collect a great deal of statistics and forward them to a collection point, and then we can use fancy graphs to look at the data. Here is what one graph would look like. Pretty cool, eh? While I had the customer

Stability Update

Hello, I wanted to update the membership on where we are with the transition to the new infrastructure. To say it was bumpy is an understatement, but I think we are on the other side. I have been running issue-free from Friday morning to Saturday night, which are our busiest times. I have been watching a lot of status graphs and error logs, and everything is finally running at normal levels. This is good news. It was starting to get disheartening, and it didn't look like there was a solution. Well, thanks to some Google-fu, I was able to source a different piece of software that was better suited to the task. The membership also found some bugs, and I was able to squash them. So, to recap some changes: Backup file download. I had wanted to use an HTTP download service to make it easier for people to download the backup zip file. This proved to be problematic for very large files; some users had backup files over 20 GB in size! To fix this, I decided to move the backup file into yo

Website Crash

January 8th: our database provider had an outage this evening. The cascading effect was that the website was not able to process user requests and then crashed. I believe this lasted no more than 10 minutes. I have taken steps to move our database requirements in-house, which should prevent further disruptions due to factors out of my control. I will be transitioning tomorrow, and there may be a momentary interruption in service. May your games always be fun, your pencils forever sharp, and your rule books at the ready. G Pa Dax (Brad)

Post Mortem or Rather Recap

That was a long two days, I have to admit. I worked on this migration for the last year and a bit, and I thought I had everything covered. I was hoping to have it all done in about an hour; as some of you noticed, it was more like 5 hours. Here are some things I encountered and where they stand as far as resolution.

1) There was an issue with an invalid SSL/TLS certificate. This is the thing that puts the lock symbol in your browser. I put up the wrong certificate and didn't realize it had expired. People reported it, and I was able to fix it right away. Resolution: Bought a new certificate and put it up.
2) People were able to start their game servers; however, everything was slow as molasses. Very frustrating, to say the least. Resolution: The file server needed tuning. It has lots of capacity; however, the network was choking off the connections. Changed some config values and voilà!
3) Random in-game "Server Disconnects" notices. Resolution: This one turned out to be tri

Major Infrastructure Upgrades!

Hello everyone, it has been a long time in the making, but we are finally ready to make the move to a new datacenter, where we will have much bigger hardware and faster networks. We will make the move on Jan 3, 2023 at 1600 UTC. Let's get right to the changes. First, I will talk about the hardware and network. Our application servers are now our own, meaning we won't share them with any other customers, as happens in a cloud environment. All CPU and RAM is ours to use exclusively. The application servers have a total of 128 CPU cores and 512 GB of RAM, and that is just our starting point. ;) Our external network to the internet will be a full 1 GbE, with an upgrade path to 10 GbE at the flip of a switch. The internal network has the same capacity. However, this is not a shared network like in a cloud; the full bandwidth is for the exclusive use of the games. All drives in the storage array are SSDs managed by a hardware RAID controller. The array has fault tolerance built in