MomoWeb Server Stability
Posted by admin | Filed under Server Updates

As many of our clients know, MomoWeb automatically publishes server status updates to Twitter. Those tweets are also published here under the category ‘Tweets’.
A quick scan of that category will tell you that our server has had its stability issues of late. No one knows them better than I do, as I am the one who gets the 2 am alert call from the server monitoring service and gets out of bed to deal with it. There have been extensive tests and tweaks to try to root out the cause of this instability. As this is a newly built server, only half a year old, it is frustrating to say the least.
A team of people has been poring over this server, both hardware and software. The software has been fully vetted by knowledgeable Unix techs, who have found nothing wrong. Hard drive settings have been tweaked and the drives appear to be running better than ever.
Hardware is a little trickier to check out fully, however. To do that they would need to take the server offline for several hours, and clearly that is not an acceptable idea. So hardware is handled item by item, in order of how likely each part is to have failed. This can be very time consuming and could mean any number of rounds of downtime as one part and then another is checked. That is another idea I am not a fan of.
The other night they took the server down to replace what was supposed to be the most likely culprit, the hard drive chassis. They were going to swap it out with a new one and pop the hard drives into that, a process that was to take 20 minutes of downtime at 2 am Pacific. When they did the swap, the motherboard on the server refused to boot with the new hard drive chassis. After an hour of downtime I told them that this was over: stop trying, put the old chassis back and get my server online.
Then came the crash last night. This may actually have been a good thing. The logs seemed to indicate: ‘The server crashed due to a kernel panic, which Kevin said was due to a heat issue (which would be resolved with a chassis swap/restore)’. I should state that we also know these crashes are hardware and not software, because the logs show the server running fabulously one minute, with low load and lots of free resources, and then gone in a second: sudden failure.
To Resolve This: We are going to swap out the whole server except for the hard drives. Everything will be new and identical in configuration to this server. The existing hard drives will be popped into the new chassis in the new server and booted up. Because we are staying with the existing hard drives, no data will need to be updated or will be lost, and the server and all its settings will be the same. Downtime will be only 20 minutes, late at night or early in the morning.
Before we start this process, the new server is being built and will spend a couple of days running beside the existing server with a duplicate OS install. It will be fully tested to make triple sure the hardware is all in tip-top shape. When testing confirms all hardware to be running perfectly, we will swap the existing hard drives into the new server and hopefully leave server instability behind us. We will be scheduling the work around important server dates for clients, so some days are off limits for this kind of work, and it will be in the very early hours of the morning when it does happen.
If this fails, the only thing that could be left is the hard drives themselves, and they have been tested and show no errors. They are new hard drives with built-in error monitoring, and the reports show nothing. It is very unlikely to be the hard drives.