This is a very old article. It has been imported from older blogging software, and the formatting, images, etc may have been lost. Some links may be broken. Some of the information may no longer be correct. Opinions expressed in this article may no longer be held.
Most programmers, especially those who work on server software, will have been in a situation where they were reconfiguring, upgrading, modifying or otherwise replacing some piece of vital software on a physically remote server, and things didn’t go quite as expected.
Often, fixing it is a simple matter of logging into the server remotely (via, say, secure shell) and reversing the change. In some extreme cases, though, the problem is so severe that it can’t be fixed remotely: for example, you’ve accidentally halted the machine, so you can’t log into it and it needs a restart. In such a situation, you’ll need to physically go over to the server (or phone someone and have them do so) and fix the problem.
NASA’s Mars Global Surveyor (MGS) was launched in November 1996. Fast forward almost a decade, and a couple of bugs in a software update ended up driving one of the craft’s solar arrays further than it was built to go. This led to the craft reorienting itself and exposing one of its batteries to direct sunlight. The battery overheated, so the spacecraft stopped charging it. The second battery didn’t have enough remaining power to keep the craft running. Within 12 hours, the MGS was space debris, and NASA’s not going to be able to find anyone able to walk over to it and hit the “reboot” button.
Next time you’re considering deploying some server-side software without testing, remember this story.
The New Scientist: Software ‘fix’ responsible for loss of Mars probe
Wikipedia: Mars Global Surveyor — Loss of Contact
NASA White Paper