INetU Managed Hosting

Beware the uptime braggarts.

October 28th, 2009 by Scott W.

Be careful about the distinction between uptime and high availability. One should be the goal of your server infrastructure. The other is just a geek bragging right.

Many system administrators brag about their systems having high uptimes. The uptime of a system is how long it has been running without a reboot. The longest uptime currently tracked by the Uptimes Project is a VMS cluster that has been up just shy of 12 years as of the date this post was published. Since we strive for 100% uptime, shouldn’t this be an impressive record to share with all of your friends as an exemplary form of sysadmin kung-fu? Actually, no. Never rebooting your systems fosters an “if it ain’t broke, don’t fix it” mentality that carries quite a few risks:

  • First, if a machine has not been rebooted, its kernel, OS, and application patches are probably not being kept up to date. On today’s Internet this is a very dangerous practice, because your systems may be vulnerable to known exploits. In addition, if you run into problems with your system, your vendor will require you to apply the recommended patches as a first course of action. It’s better to have these things taken care of beforehand than to add to your workload in the middle of a problem. We’ve seen out-of-date Windows servers need over 4 hours of patching, with multiple reboots.
  • Second, you’re potentially leaving rotten Easter eggs around. By rotten Easter eggs I mean changes to a system that don’t make it through a reboot: production services that should be running but are not set to start at boot, startup scripts that do not work properly, IP addresses that have been added, and speed/duplex issues that have been resolved since the machine came up. The worst-case scenario is that your server goes down (planned or unplanned) and after the reboot you have a server with poor network performance, missing IP addresses, and a database application that isn’t running. If the reboot was unplanned (a crash), this adds to the confusion when trying to bring the system back online.
  • Third, you’re missing out on file system integrity checks. Some operating systems (Red Hat Linux being one of them) are configured to perform a file system integrity check during a reboot once a certain amount of time has passed. This can of course be turned off, but finding out that a server needs to check two terabytes of data (today’s disk drives are that large) when you desperately want the system back online is a real pain. Interrupting a file system check is not exactly what you want to do either. Checking a two-terabyte file system on a SATA drive running Red Hat Enterprise Linux 5 can take over 2 hours!
  • Fourth, if you never schedule maintenance, you have probably never established acceptable maintenance windows with your users. Not scheduling maintenance is like running your car without ever getting it inspected, changing the oil, or flushing the coolant. That’s all fine until you blow a head gasket. Without established maintenance windows, your users will assume the system should be up “all the time.” Aligning users’ expectations with the needs of properly maintaining the systems can be difficult, because users generally don’t want to get pinned down on acceptable downtime; that way they can stick to their “it should always be up” guns. If you go for early Saturday or Sunday morning maintenance windows that are scheduled in advance, users will generally be OK.
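One way to hunt for rotten Easter eggs before a reboot finds them for you is to compare what is live on the machine against what is configured to come back at boot. The Python sketch below is only an illustration of that comparison: the function name, the category names, and the idea of feeding it two pre-collected inventories are all assumptions made for the example, and how you gather each inventory will depend on your operating system.

```python
from typing import Dict, Set


def find_reboot_drift(
    running: Dict[str, Set[str]],
    persistent: Dict[str, Set[str]],
) -> Dict[str, Dict[str, Set[str]]]:
    """Compare live state with boot-time configuration (hypothetical helper).

    Both arguments map a category name (for example "services") to the set
    of items in that category. Only categories that differ are reported.
    """
    report: Dict[str, Dict[str, Set[str]]] = {}
    for category in sorted(set(running) | set(persistent)):
        live = running.get(category, set())
        saved = persistent.get(category, set())
        lost = live - saved        # present now, gone after a reboot
        surprise = saved - live    # configured at boot, not present now
        if lost or surprise:
            report[category] = {
                "lost_on_reboot": lost,
                "appears_on_reboot": surprise,
            }
    return report
```

For instance, if the database was started by hand and never enabled at boot, passing the two inventories to `find_reboot_drift` would list it under `lost_on_reboot` for its category, which is exactly the surprise you want to find during a planned window rather than after a crash.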

On the other hand, if your environment really can’t handle any downtime, you can have both high availability and scheduled maintenance, but it requires multiple systems and therefore costs more. For example, in a standard two-tier architecture with a web tier and a database tier, you’d implement load balancing on the web tier and take the web servers out of rotation one at a time as you perform maintenance. A similar method can be used on the database tier, depending on how the database servers are configured (active-passive or active-active).
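The one-at-a-time approach on the web tier can be sketched in a few lines of Python. This is a simplified model, not a real load balancer integration: the server names are made up, and `maintain` is a hypothetical callback standing in for whatever patching and rebooting you actually do.

```python
from typing import Callable, List


def rolling_maintenance(
    pool: List[str],
    maintain: Callable[[str], None],
) -> List[List[str]]:
    """Maintain each server in turn while the rest keep serving traffic.

    Returns the list of servers that were still in service during each step,
    so the caller can confirm the tier never dropped to zero.
    """
    if len(pool) < 2:
        raise ValueError("need at least two servers to stay available")
    in_service = list(pool)
    history: List[List[str]] = []
    for server in pool:
        in_service.remove(server)         # take it out of rotation
        history.append(list(in_service))  # who is carrying the load right now
        maintain(server)                  # patch and reboot (hypothetical)
        in_service.append(server)         # put it back before touching the next
    return history
```

With a three-server pool, each step leaves two servers in rotation. The guard at the top is the cost argument in code form: with a single server there is nothing left to carry the load, so maintenance without downtime is not possible.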

Veteran system administrators understand that avoiding maintenance for the sake of high uptimes is a rookie perspective. As a guideline, schedule reboots once a month or once a quarter: not so rarely that rotten Easter eggs accumulate, and not so often that users get upset. If you are afraid to reboot a system, your worst fears could be realized when the server crashes at the worst possible time. Schedule the oil change and keep your systems running in optimal condition.
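If you want a nudge when a machine has drifted past that guideline, a small check like the following Python sketch could run from your monitoring. It assumes a Linux host, where the first field of `/proc/uptime` is the uptime in seconds; the function names and the 90-day default (roughly one quarter) are choices made for this example.

```python
SECONDS_PER_DAY = 86400


def parse_proc_uptime(text: str) -> float:
    """Return uptime in seconds from the contents of Linux's /proc/uptime.

    The file holds two numbers; the first is seconds since boot.
    """
    return float(text.split()[0])


def reboot_overdue(uptime_seconds: float, max_days: int = 90) -> bool:
    """True when the machine has been up longer than the agreed window."""
    return uptime_seconds > max_days * SECONDS_PER_DAY
```

On a Linux server you would read `/proc/uptime`, pass its contents to `parse_proc_uptime`, and alert when `reboot_overdue` returns true. Use `max_days=30` if you settle on monthly reboots.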


©1996-2010 INetU Inc, All rights reserved.