HellOps

Tales of being an operator in hell.

Let's set the scene. I'm taking over an existing, already set up operation. The topology, per-DC, is something like this:

  • Two redundant gateways, talking via heartbeat – if one dies, the other takes over. Both of them running raid1 mdadm ext4 everything.
  • Three database servers forming a Percona cluster. Data is on raid1 mdadm ext4, but the base OS is on regular ext4. /var is its own partition and is XFS.
  • A bunch of application servers running standalone ext4.
  • A “backup” server whose job is to talk to the other machines and take backups. Backups are stored locally on its own disk and on an external (USB) disk plugged into it at all times. This machine actually ran UEFI, and thus had a FAT32 partition. It also had an experimental btrfs partition on top of the standard standalone ext4. The external disk was also ext4.
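
For a sense of what that layout looks like in practice, here's a minimal provisioning sketch of the “raid1 mdadm ext4 everything” pattern on the gateways. The device names and mount point are hypothetical placeholders – the real machines were set up long before I arrived.

```python
#!/usr/bin/env python3
"""Provisioning sketch: mirror two partitions with mdadm, put ext4 on top.

/dev/sda2, /dev/sdb2, /dev/md0 and the mount point are hypothetical.
"""
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a two-disk RAID1 mirror out of the two partitions.
run(["mdadm", "--create", "/dev/md0",
     "--level=1", "--raid-devices=2",
     "/dev/sda2", "/dev/sdb2"])

# Plain ext4 directly on the md device – no LVM, no checksumming layer.
run(["mkfs.ext4", "-L", "root", "/dev/md0"])
run(["mount", "/dev/md0", "/mnt/root"])
```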

Note that this is a relatively low-IO use case (heavy on network and compute). The XFS /var partition on the DBs probably accounts for half of the rack's disk IO.

There are several racks like this in different enterprise colocation datacenters. All of them have shared ventilation, air conditioning, UPSes, backup generators, etc. As such, for cost saving, all of them are plugged into standard electrical outlets (no in-rack UPS – there's already one handled by the colo!).

One day, there's a huge storm rolling through. A quarter of the city is already dark, but neither the office nor any of the colocations is. Slowly, more and more of the city's infrastructure goes down (it ended up being closer to half by the end of things). Eventually, everything goes dark in the office. So do two of the colos. We decide the power is dead and just go home – it'll come back up. The next day, we check. One of the colocations came back up just fine. One of them, however, did not. So, grabbing my winter coat (it is very cold in the DC), I head there to see what's going on.

None of the application servers will boot. I boot into a rescue system and check – the root partition is dead. On all of them. e2fsck won't even try to repair anything. Okay, let's check the gateways and database servers. Ext4 partitions are dead. Including the raid1 ones. The errors are different across the copies. Well, what about the external backup disk? That one is just completely dead. Actually fried. It will not even spin up. It was working fine the day before! Some of the drives are outright fried too, mind you – above I'm talking about the ones that survived.
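
The triage itself is nothing fancy – from the rescue system it boils down to a loop of read-only fsck checks, so nothing gets modified before I know how bad things are. A minimal sketch of that loop, with hypothetical device names:

```python
#!/usr/bin/env python3
"""Rescue-system triage: read-only e2fsck over every suspect device."""
import subprocess

SUSPECTS = ["/dev/md0", "/dev/sda1", "/dev/sdb1"]  # hypothetical names

for dev in SUSPECTS:
    # -n answers "no" to every question, so the check never writes.
    # The exit code is a bitmask: 0 = clean, 4 = errors left uncorrected,
    # 8 = operational error (can't even make sense of the filesystem).
    rc = subprocess.run(["e2fsck", "-n", dev]).returncode
    verdict = "clean" if rc == 0 else f"damaged (exit {rc})"
    print(f"{dev}: {verdict}")
```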

I spend the week trying to manually recover data, since e2fsck refused. Things seem to be corrupted at random. For every file I recover, there's one I can't. Weirder still, some of the corrupted files are ones that should not have been seeing any writes at all! I was essentially flying blind (a lot of metadata blocks were also gone), so for every DB file I recovered, I also recovered something completely useless (like the local cat(1) binary).
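
For the curious, this kind of salvage pass can be scripted around debugfs in catastrophic mode, which skips the (largely destroyed) bitmaps and just tries to pull directory trees out of the unmounted filesystem. The device and paths below are hypothetical stand-ins, not my actual layout.

```python
#!/usr/bin/env python3
"""Salvage sketch: pull what's recoverable out of a broken ext4 with debugfs."""
import subprocess

DEVICE = "/dev/md1"                      # hypothetical data mirror
WANTED = ["/var/lib/mysql", "/etc"]      # trees worth trying to salvage
DEST = "/mnt/salvage"

for tree in WANTED:
    # -c opens the fs in catastrophic mode (don't trust the bitmaps);
    # rdump recursively copies a directory out of the unmounted fs.
    subprocess.run(
        ["debugfs", "-c", "-R", f"rdump {tree} {DEST}", DEVICE],
        check=False,  # failures are expected – keep whatever comes out
    )
```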

At this point, I get curious and ask that DC's administration what even happened. They say a lightning bolt hit the top of the building. Wait, so the surge blew past the UPS, into the servers, frying a bunch of things? How are the motherboards okay? Why didn't the power supplies surge-protect? I'll never have answers to these questions, though I do know that the PSUs were likely too old to have good protections in place, and the servers did not run ECC RAM (potentially explaining at least some of the corruption, though far from all of it).

This wasn't that huge of a deal. The databases were recovered from backup (albeit a slightly older one – more on this in a second). Everything else just got a clean install from scratch.

What really stood out for me, however, was what survived. The FAT32 EFI partition did! The XFS partitions on the database servers either survived intact or were recovered by an fsck. The experimental btrfs partition on the backup host (the source for the database recovery, a bit older because it wasn't in active use yet) had zero issues whatsoever. If it hadn't survived, an even older copy would have been available from another DC's backup server (they inter-sync).
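
One thing btrfs brings to the table here is that it checksums both data and metadata, so you can make it prove the backup is intact instead of hoping. A minimal vetting sketch – the mount point and device are hypothetical – along with the equivalent read-only check for XFS:

```python
#!/usr/bin/env python3
"""Vet the survivors: btrfs scrub for the backups, no-modify xfs_repair for /var."""
import subprocess

# A foreground scrub (-B) re-reads every block and fails if anything
# doesn't match its stored checksum.
scrub = subprocess.run(["btrfs", "scrub", "start", "-B", "/srv/backups"])
print("btrfs backups:", "intact" if scrub.returncode == 0 else "corrupted")

# xfs_repair -n is a no-modify check: report inconsistencies, change nothing.
xfs = subprocess.run(["xfs_repair", "-n", "/dev/sdc3"])
print("XFS /var:", "consistent" if xfs.returncode == 0 else "needs repair")
```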

That day I learned a couple of lessons:

  1. Use logical backups for the data that's important – full-system backups may make restoring a machine as a whole faster, but they actively get in the way in most other cases, while also making backups slower, more cumbersome, and thus less likely to happen often (see the sketch after this list).
  2. Ext4 will eat your data at the slightest provocation, in unpredictable ways.
  3. Lightning strikes will eat your data. Do not trust a shared UPS.
  4. Btrfs can survive acts of god (something that it has consistently done for me afterwards as well!). XFS is resistant to acts of god. FAT32 is too dumb to realize what is before it is an act of god, making it similarly resistant for all the wrong reasons.
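
To make lesson 1 concrete, here's a minimal sketch of a nightly logical dump for the Percona cluster – the host, credentials handling, and destination path are all hypothetical, and any MySQL-flavoured dump tool would do just as well. The point is that the result is a plain compressed SQL file that any compatible server can replay, no matter what filesystem or RAID layout died underneath the original.

```python
#!/usr/bin/env python3
"""Nightly logical dump sketch: stream mysqldump output into a gzip file."""
import gzip
import subprocess
from datetime import date

OUT = f"/srv/backups/db-{date.today():%Y%m%d}.sql.gz"  # hypothetical path

dump = subprocess.Popen(
    ["mysqldump", "--single-transaction", "--routines",
     "--all-databases", "-h", "db1.internal"],  # hypothetical host
    stdout=subprocess.PIPE,
)
with gzip.open(OUT, "wb") as out:
    # Stream in 1 MiB chunks instead of buffering the whole dump in memory.
    for chunk in iter(lambda: dump.stdout.read(1 << 20), b""):
        out.write(chunk)
if dump.wait() != 0:
    raise SystemExit("mysqldump failed – keep the previous dump instead")
```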