(Warning: Reasonably technical stuff ahead. Go and put your geek hat on. It’s OK, I’ll still be here when you get back. Most stuff is hyperlinked)
I don’t only run MCAU. I have (well, that’s now ‘Had’) a pile of other hosting that I was doing in a Brisbane Datacenter. These were all run on an HP c3000 Blade Chassis with a 16-disk, 32TB NAS Server (this was set up with a whole pile of redundancy, so that even if 8 disks failed, I still wouldn’t lose data). I was also backing that up to a separate group of 6 disks inside the Blade Chassis itself.
So this is all well and good. EXCEPT for the fact that the storage blade holding those backup disks had failed terminally a couple of months ago, and I had pulled it apart to fix it. So I didn’t have any backups (or at least none I could restore from before the storage blade died). This wasn’t that traumatic, as I was very confident in the storage server NOT dying.
This confidence is what led to my downfall, as it, of course, died. What’s even worse is that ACTUALLY, it didn’t die. It started running really, really slowly. Rather than doing what a sensible person would do, and immediately backing up whatever was on it to somewhere else, I thought it was just a glitch that needed to be fixed.
So I started poking at it. I discovered that the problem was actually a disk that was failing but had not actually failed. That disk was running slow, and slowing everything else down. As the system was trying to rebuild the storage onto the disk, I just thought ‘Ok. That’s /dev/sdk, I’ll just mark that disk as bad, and rebuild onto another one’. Unfortunately, I made a typo, and set /dev/sdi to failed. This broke THE ENTIRE ARRAY, immediately halting all the virtual machines.
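For the geek-hat crowd: a guard along these lines could catch that kind of typo before it reaches the array. This is only a minimal sketch, and it rests on assumptions the post doesn't confirm: that the array was Linux software RAID managed with mdadm, and that it lived at /dev/md0 (a made-up name). The function build_fail_command and the DeviceMismatch exception are hypothetical helpers invented for illustration. The script only builds and prints the command; it never runs it.

```python
import shlex


class DeviceMismatch(Exception):
    """Raised when the typed device is not the one diagnosed as failing."""


def build_fail_command(array, suspect, typed):
    """Build the mdadm command that marks a disk as failed.

    `suspect` is the disk identified from diagnostics (the slow one).
    `typed` is what the operator actually typed at the prompt.
    The command is only built when the two agree exactly.
    """
    if typed.strip() != suspect:
        raise DeviceMismatch(
            f"typed {typed!r} but the suspect disk is {suspect!r}; refusing"
        )
    # Assumption: an mdadm-managed array. Manage mode accepts --fail <device>.
    return ["mdadm", "--manage", array, "--fail", suspect]


if __name__ == "__main__":
    # /dev/md0 is a hypothetical array name; /dev/sdk is the slow disk from the post.
    cmd = build_fail_command("/dev/md0", "/dev/sdk", "/dev/sdk")
    # Print for review instead of executing, so a human gets one more look.
    print(shlex.join(cmd))
```

Typing /dev/sdi while the suspect is /dev/sdk raises DeviceMismatch instead of producing a command, which is the whole point: the disk name gets stated twice, once from the diagnosis and once by hand, and nothing happens unless they match.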
This wasn’t traumatic in itself. If I had just stopped there, I could have fixed it, but I panicked, and tried a shortcut to fix it. This made it SLIGHTLY worse: still recoverable, but with a lot more work. My fatal mistake was when someone ELSE went ‘Oh, I can fix that easily’. Yeah. He made it a LOT worse, and pretty much erased everything.
So. All my virtual machines, gone. Including the database server that was running logblock and permissions (as well as a pile of other things, too). All my hosting, all my virtual machines, gone.
I have some pretty damn annoyed customers out of this, to put it mildly.