Tuesday, November 29, 2005
Bugs at Google
All software contains bugs, and Google's software is not necessarily so much better that it has none. Ron Garret shares his memories of a huge AdWords bug that hit advertisers.
The AdWords launch went fairly smoothly, and I spent most of the next two weeks just monitoring the system, fixing miscellaneous bugs, and answering emails from users. (Yes, I was front-line AdWords support for the first month or so.)
Original post by ex-Googler Ron Garret
I pulled up the biller window and saw that a whole bunch of credit card charges were being declined one after another. The reason was immediately obvious: the amounts being charged were outrageous, tens of thousands, hundreds of thousands, millions of dollars. Basically random numbers, most of which no doubt exceeded people's credit limits by orders of magnitude.
But a few didn't. Some charges, for hundreds or thousands of dollars, were getting through. Either way it was bad. For the charges that weren't getting through the biller was automatically shutting down the accounts, suspending all their ads, and sending out nasty emails telling people that their credit cards had been rejected.
I got a sick feeling in the pit of my stomach, killed the biller, and started trying to figure out what the fsck was going on. (For you non-programmers out there, that's a little geek insider joke. Fsck is a unix command. It's short for File System ChecK.)
It quickly became evident that the root cause of the problem was some database corruption. The ad servers which actually served up the ads would keep track of how many times a particular ad had been served and periodically dump those counts into a database. The biller would then come along and periodically collect all those counts, roll them up into an invoice, and bill the credit cards. The database was filled with entries containing essentially random numbers. No one had a clue how they got there.
Now, it's a complete no-brainer that when something like that happens you add some code to detect the problem if it ever happens again, especially when you don't know why the problem happened in the first place. But I didn't. It's probably the single biggest professional mistake I've ever made. In my defense I can only say that I was under a lot of stress (more than I even realized at the time), but that's no excuse. I dropped the ball. And it was just pure dumb luck that the consequences were not more severe. If the problem had waited a year to crop up instead of a couple of weeks, or if I hadn't just happened to be there watching the biller window (both times!) when the problem cropped up Google could have had a serious public relations problem on its hands. As it happened, only a few dozen people were affected and we were able to undo the damage fairly easily.
You can probably guess what happened next. Yep. One week later. Same problem. This time I added a sanity check to the billing code and kicked myself black and blue for not thinking to do it earlier. At least the cleanup went a little faster this time because by now I had a lot of practice in what to do.
And we still didn't know where the random numbers were coming from despite the fact that everyone on the ads team was trying to figure it out.
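For readers wondering what a sanity check in billing code might look like, here is a minimal Python sketch. It is not the code Ron Garret wrote, which the post does not show. The names (ImpressionCount, build_invoice_cents), the per-thousand pricing model, and both thresholds are assumptions made up for illustration. The idea is simply to refuse to produce a charge when a count or a total looks implausible, so that a human can look at it before any credit card is billed.

```python
from dataclasses import dataclass

# Hypothetical limits: the original post does not say what thresholds were used.
MAX_IMPRESSIONS_PER_ROW = 10_000_000
MAX_CHARGE_CENTS = 500_000  # 5,000.00 dollars


@dataclass
class ImpressionCount:
    """One row dumped by an ad server: how often an ad was served."""
    ad_id: str
    impressions: int


class SuspiciousBillingData(Exception):
    """Raised instead of billing when the data looks corrupted."""


def build_invoice_cents(counts, price_per_thousand_cents):
    """Roll impression counts up into a charge, in cents.

    Raises SuspiciousBillingData rather than returning an implausible amount.
    """
    total_impressions = 0
    for row in counts:
        # Check each row before it can contaminate the total.
        if not 0 <= row.impressions <= MAX_IMPRESSIONS_PER_ROW:
            raise SuspiciousBillingData(
                f"ad {row.ad_id}: implausible count {row.impressions}"
            )
        total_impressions += row.impressions

    charge_cents = total_impressions * price_per_thousand_cents // 1000

    # Check the rolled-up amount as well, in case many plausible rows add up
    # to an implausible invoice.
    if charge_cents > MAX_CHARGE_CENTS:
        raise SuspiciousBillingData(f"implausible charge {charge_cents} cents")
    return charge_cents
```

With a check like this, a corrupted row stops the biller with an error instead of turning into a declined charge, a suspended account, and a nasty email.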