Sorry for the downtime this past week everyone.
We're back up and running on some brand new server hardware! Will the site be faster? Will the site be better? Will this happen again? Who knows!!
I'll update this post later this week with a quick breakdown of what happened and how things were recovered. I'm sure someone out there would like to learn from my mistakes! If you guys have any specific questions, please let me know and I'll do my best to answer.
The TL;DR is: the NVMe drives that serve as our boot and database storage went read-only after ~4.5PB of data read/written to them. A better admin would have caught that way sooner. Whoops... No data should have been lost (minus some cached stuff, like last-read posts in the forums here) - I was able to recover everything from the previous server. EDIT: I did choose to drop the subscription notices. Those things are a real pain, so sorry for that. Hope you don't mind a fresh start!
What happened / wall of text
On July 1st around noon EST the site started to lock up. The dreaded blank white screen, or our 'Aww crap' page, was all anyone could see. It was the rare case where the joke that even the contact-us page is broken actually wasn't a joke.
Our server had a pair of 500GB NVMe drives in RAID1 as our boot/database/application drive. Basically, everything the site needs to run minus the actual images (those were on a pair of 4TB HDDs, also in RAID1) lived on these drives. After running non-stop since May 2019, these drives had accumulated over 4PB (4,000 TB) of reads/writes. NVMe drives have a rated lifetime before things start to break down, and we blew way past that with these drives. As a drive ages, sectors start to go bad; the firmware marks them as bad and remaps them to spare sectors. Eventually all the spares get exhausted, so there's nowhere left for data to go when the next sector fails. At that point the drive's firmware locks the drive to read-only. Thankfully, this protects the data on the drive and keeps it accessible.
What I'm not sure about is when each of these drives flipped into read-only mode. By the time I got in and looked at the health of the drives, both were already toast. The big fail takeaway here: I had no monitoring set up for the health of the drives. Had I seen their health declining, I could have swapped in fresh drives much sooner.
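For anyone wanting to avoid the same mistake, here's a rough sketch of the kind of check I should have had running. It assumes smartmontools is installed (NVMe drives report a "Percentage Used" endurance estimate via `smartctl`); the device names and the 80% threshold are just examples.

```shell
# check_wear: reads `smartctl -A` output on stdin and prints an alert
# once the NVMe "Percentage Used" endurance estimate passes a threshold.
check_wear() {
  awk -F: -v limit="$1" '
    /Percentage Used/ {
      gsub(/[ %]/, "", $2)               # strip padding and the % sign
      if ($2 + 0 >= limit)
        print "ALERT: drive at " $2 "% of rated endurance"
    }'
}

# Run it daily from cron against each drive (example device names):
#   smartctl -A /dev/nvme0 | check_wear 80
#   smartctl -A /dev/nvme1 | check_wear 80
```

Wire the alert into email or whatever pager you like; the point is just to hear about the wear before the firmware decides for you.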
After seeing this, I started copying off configs and other files I knew I'd need to make the restore quicker. I figured the server wouldn't survive a reboot, so I tried to grab all that I could first. Go figure: right after that reboot, I remembered a few other files I'd need, so I was a bit worried about whether they'd still be recoverable.
Thankfully, since the drives were read-only, I could still get the data off them. OVH (our server host) has a rescue mode you can boot the server into to manage the drives/data. Since the drives were locked read-only, I couldn't just mount them normally. What I ended up doing was using dd to dump the partitions to files on the still-working HDDs. Once that was done, I was able to mount those files and pull off all the data I needed. Everything database-related, config-related, and anything else that would make the new server easier to deploy got rsync'd off.
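For the curious, the flow looked roughly like this. Below is a sketch with a scratch file standing in for the real partition so the commands are safe to try anywhere; on the actual box the source was an NVMe partition (something like /dev/nvme0n1p2, from OVH's rescue mode) and the destination was a path on the healthy HDDs.

```shell
# Scratch file standing in for the read-only NVMe partition:
printf 'pretend this is a partition' > scratch_partition

# 1. Image the partition onto the working drives (dd happily reads
#    from a read-only source):
dd if=scratch_partition of=partition.img bs=4M conv=fsync 2>/dev/null

# 2. Verify the image matches the source before trusting it:
cmp scratch_partition partition.img && echo "image OK"

# 3. On the real image you'd then loop-mount it read-only and copy out
#    configs and the database (needs root):
#      mount -o loop,ro partition.img /mnt/old-root
#      rsync -a /mnt/old-root/etc/ /backup/etc/
```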
Now, this might make it sound like I had no backups. Not true. Daily backups of the database and uploads are taken and sent offsite. I was ready to push these backups to the new server, but since I was able to get the latest and greatest data, that was that!
That said, the backups I take are for a 'worst case' type of scenario. For the uploads, for example, only the full-resolution images are backed up; the three thumbnail sizes aren't. Since I could still get to the old server, I was able to pack the thumbnails into tarballs and transfer them over pretty quickly. Much faster to tar them up, rsync them to the new server, and untar them there than to send however many million small files over the network one at a time.
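The tar-then-transfer trick, sketched with stand-in paths (the real thumbnail directories and server names differ):

```shell
# Millions of tiny files transfer slowly one-by-one; a single archive
# streams at full speed. Stand-in directory for the thumbnails:
mkdir -p thumbs
for i in 1 2 3; do echo "thumb $i" > "thumbs/$i.jpg"; done

# Pack everything into one tarball:
tar -czf thumbs.tar.gz thumbs

# Ship one file instead of millions (run from the old server;
# hypothetical destination host/path):
#   rsync -av thumbs.tar.gz newserver:/var/www/
# Then unpack on the new server:
mkdir -p restore
tar -xzf thumbs.tar.gz -C restore
```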
Once the new server was here, all these files were copied over from the old server (still in rescue mode), keeping all the network traffic nice and fast inside the same datacenter. Had I had to pull my backups from offsite, the transfers alone would have taken at least two days, if not longer, and then I'd have had to recreate all the thumbnails. Would have been a good excuse to finally learn some Python, though.
And now we're here. Fresh hardware, more time on the clock. There were a few other tweaks I needed to make for the new hardware and the latest OS, but for the most part we should be good to rock for a while now. I was also able to get the 4TB drives upgraded to 6TB, so we have more room for growth again.
wall of text over
Anyway, thanks again as always for hanging around with us!