In many large organizations there is still widespread apathy toward change management and risk mitigation. The dominant attitude is "we do these types of changes hundreds of times, no big deal." Well, here is an object lesson in why you don’t want to think that way.
Here’s what happened: This morning (Pacific Time) we took a small fraction of Gmail’s servers offline to perform routine upgrades. This isn’t in itself a problem — we do this all the time, and Gmail’s web interface runs in many locations and just sends traffic to other locations when one is offline.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”.
Complete description via Official Gmail Blog: More on today’s Gmail issue.
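To make the failure mode in the excerpt concrete, here is a toy sketch of how a few overloaded routers stepping aside can push their traffic onto the rest. It is not Google's architecture: the function name `simulate_cascade`, the capacity numbers, and the assumption that load is split evenly across the routers still accepting traffic are all made up for illustration.

```python
def simulate_cascade(capacities, total_load):
    """Toy model: routers that are over capacity stop accepting traffic.

    capacities  -- hypothetical per-router capacity (requests/sec)
    total_load  -- total incoming traffic, split evenly across active routers
    Returns (routers still active, active count after each failure round).
    """
    active = list(capacities)
    rounds = []
    while active:
        per_router = total_load / len(active)
        # Routers whose capacity is below their share say "stop sending us traffic".
        survivors = [c for c in active if c >= per_router]
        if len(survivors) == len(active):
            break  # nobody is overloaded; the system is stable
        active = survivors
        rounds.append(len(active))
    return len(active), rounds


# Ten hypothetical routers: eight can handle 100 req/s, two only 90 req/s.
routers = [100] * 8 + [90] * 2

print(simulate_cascade(routers, 850))  # 85 req/s each: everything holds
print(simulate_cascade(routers, 920))  # 92 req/s each: the weak two drop, then the rest
```

In this model a modest underestimate of load (850 versus 920) is the difference between a stable system and a total outage, which is why a "routine" change still deserves a capacity check before it goes out.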