Today was a rough day.
I got up late, checked email, answered a few personal emails that had to be answered, and headed to work. I arrived about 20 minutes late, but my boss was out of the office all day, so while I was late, there would be no repercussions…
I’m the guy responsible for all advertising sales and trafficking at my day job, and I was already ‘behind’ on the day’s tasks at this point. As I started checking my office email (on the PC, which takes about 10 minutes to fully boot up, vs. the Mac on my desk, which takes like 5 seconds to wake since I can leave it on all night, and even from fully powered down only takes like 2 minutes), I got a phone call from our tech team. One of our ad servers was crashing and restarting itself every five minutes, and had been since 9am. He wanted to know if we had changed anything, and if I knew what might be the problem.
We hadn’t and I had no idea…
We started talking about possible causes and the symptoms that he was seeing, and all I could think about was the clients whose ads we were supposed to be serving, as well as the visitors to our website who weren’t getting a good experience. We discussed possible causes like recursive DNS timeouts being too long, caches being too large, possible memory exceptions, buffer overruns, attacks, and more. I offered some possible solutions (just rebooting the machine is my favorite, and that had already been tried with no luck), but nothing pointed to an easy fix.
We’re using a third-party ad serving platform (think something from this list). It’s great from a trafficking and sales support perspective, and we usually have very few problems with it, but today wasn’t one of those days.
You see, part of the problem is that the ‘delivery engines’ which actually do the ad serving are basically just ISAPI filters that run inside/alongside IIS on a Win2K Server box (we have 3 or 4 of these machines running all the time), and these boxes all report to one central ‘admin’ box that acts like a central brain. The delivery engines can run on their own throughout the day without the admin box, unless a human operator (or trafficker) changes an ad campaign, adjusts scheduling, or enters a new campaign, in which case the changes are pushed to the delivery engines manually. If there aren’t any manual changes, the admin box just collects the day’s log files, recalculates the schedule for the next day, and pushes the new information to the delivery engines nightly.
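For the curious, the setup above boils down to something like this. This is a hypothetical sketch, not OAS’s actual code — the class names, campaign names, and numbers are all mine — but it shows the data flow: manual edits push immediately, and otherwise the admin box recalculates and pushes on its own at night.

```python
# Hypothetical sketch of the admin-box / delivery-engine model described
# above. Not OAS's actual code; names and numbers are made up.

class DeliveryEngine:
    """Serves ads from a local copy of the schedule; runs standalone all day."""
    def __init__(self, name):
        self.name = name
        self.schedule = {}

    def receive_push(self, schedule):
        # Replace the local schedule with whatever the admin box sends.
        self.schedule = dict(schedule)

class AdminBox:
    """The central 'brain': collects logs, recalculates, pushes to engines."""
    def __init__(self, engines):
        self.engines = engines
        self.schedule = {}

    def update_campaign(self, campaign, weight):
        # A trafficker edits a campaign...
        self.schedule[campaign] = weight
        self.push()  # ...and the change is pushed out manually, right away.

    def nightly_job(self, logs):
        # Recalculate tomorrow's schedule from today's log files, then push.
        total = sum(logs.values()) or 1
        self.schedule = {c: served / total for c, served in logs.items()}
        self.push()

    def push(self):
        for engine in self.engines:
            engine.receive_push(self.schedule)

engines = [DeliveryEngine(f"engine-{i}") for i in range(3)]
admin = AdminBox(engines)
admin.update_campaign("acme-banner", 0.5)
print(engines[0].schedule)  # every engine now has the manual change
```

The point of the design is that the engines never *need* the admin box mid-day, which is also why a daytime failure like ours is so strange.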
So, as I said, part of the problem is that we’ve chosen to run our ad serving platform under Windows. OAS on Windows is, in my opinion, just a port of the real ad serving platform, which generally runs on some form of *nix. Under a *nix OS, this ad serving platform is pretty much rock solid (it operates as a DSO for Apache, which I don’t have to explain is pretty much rock solid itself). Under IIS, well, we all know how good IIS is in general, plus you add in the fact that it runs as an ISAPI filter, which might as well be a big black box with 1,000 locks on it, and you get a great system when it works, and a shitty system when it doesn’t.
So, at 9am this morning we started experiencing problems where IIS would crash every 5 minutes or so, and there weren’t any good tell-tale signs that explained what the problems might be. No port scanning, no worm-like signatures, no large traffic spikes… nothing. Just a crappy IIS performance/non-performance thing.
Every 5 minutes it would crap out, and die… then restart itself. We tried isolating this one box, and the same problem started happening on the ‘new’ delivery engine that we put in its place, for no reason. So, we put the old box that was having problems back in place.
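We never wrote one, but the suspiciously regular five-minute rhythm is exactly the kind of thing a crude watchdog could have flagged for us automatically. Here’s a sketch of the idea — the interval, tolerance, and threshold are numbers I’ve made up to match our symptoms, not anything from OAS or our monitoring:

```python
# Crude sketch of a crash-loop detector: given a list of restart timestamps
# (seconds), flag a box that keeps restarting on a roughly fixed interval.
# The 300-second interval matches the "every 5 minutes" symptom above;
# tolerance and min_hits are arbitrary assumptions.

def is_crash_looping(restart_times, interval=300, tolerance=30, min_hits=3):
    """Return True if at least `min_hits` consecutive restarts occur
    roughly `interval` seconds apart (within `tolerance` seconds)."""
    hits = 0
    for earlier, later in zip(restart_times, restart_times[1:]):
        if abs((later - earlier) - interval) <= tolerance:
            hits += 1
            if hits + 1 >= min_hits:  # hits counts gaps; +1 gives restarts
                return True
        else:
            hits = 0  # irregular gap breaks the streak
    return False

# Restarts at 9:00, 9:05, 9:10, 9:15 (seconds since midnight):
print(is_crash_looping([32400, 32700, 33000, 33300]))  # True
```

It wouldn’t have fixed anything, but it would have paged somebody at 9:05 instead of us finding out by phone.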
I tried everything, turning off specific campaigns, turning on others, all to no avail.
At lunch I had a meeting off-site, so I was gone for about two hours, but I couldn’t help thinking about the clients that weren’t getting their ads delivered properly.
I got back to the office, and the problem still wasn’t fixed…
I helped troubleshoot the problem on and off with our tech team, finally asking them if they’d filed a trouble ticket with the OAS support team. They hadn’t, and it was already 2pm Central, so they finally filed a report then.
To the vendor’s credit, they were on it like wildfire. And I mean on it.
The problem was that one of their best Windows techs (remember, this thing is really designed for *nix) couldn’t find anything pointing to an identifiable problem. Nothing. Nothing that could be used to reproduce the problem, and thus find a way to fix it. Nothing. There were two identical boxes sitting side by side (relatively), and one was fucked up while the other was just fine. Both were serving the same loads with the exact same software running. This would seem to point to a hardware failure of some sort on the problem machine, except we tried a third identical machine, and it started to exhibit the same problem.
So, at 4:00 I called to check on the progress of the Real Media tech, and they were still totally stumped. My day had been shit so far, so I was thinking to myself that I was in for more shit, late into the night. I was at the mercy of someone else to fix problems that directly affected me, my clients, and, bottom line, my paycheck. (They decided to use Windows for the ad serving platform’s installation against my recommendation otherwise, and I know less about Windows servers than I know about Linux servers, which is still very little in the real world.)
At 5pm or a little after, I called to check on things. My lead tech guy said “Well, you won’t believe this, but at 5:00, everything just started working again.”
I laughed the loudest laugh I’ve laughed in a long time, then checked to see if it was April 1st, somehow… it wasn’t. I said “You’re shitting me” (sidebar: I’m usually very professional while on the clock, so the tech sort of took a couple of seconds to collect his thoughts).
After a brief pause, he says “No, John, I’m not shitting you. It’s working just fine.”
We then discussed what he thought might have happened, and the best we could come up with is that our co-lo provider has mice that must have started their first day on the job today, and clocked in at exactly 9am, taken no lunch, and clocked out at 5pm. Nothing else explains what happened today. Nothing.
To top off all the delivery problems I had today, I also had to deal with new clients, an old client that’s very tough to please, and two BACs (Big Ass Clients that we don’t want to piss off at all). And it’s the beginning of the month, which means I should have been checking all of the billing that was supposed to happen today (but didn’t, due to the OAS problems).
Tomorrow’s going to be a long day again, I’m sure… I just hope those mice don’t clock in again.