And we're back!

  • 35714

    Sorry for the downtime this past week everyone.

    We're back up and running on some brand new server hardware! Will the site be faster? Will the site be better? Will this happen again? Who knows!!

    I'll update this post later this week with a quick breakdown of what happened and how things were recovered. I'm sure someone out there would like to learn from my mistakes! If you guys have any specific questions, please let me know and I'll do my best to answer.

    The TL;DR is: the NVMe drives that served as our boot and database storage went read-only after ~4.5PB of data had been read from/written to them. A better admin would have caught that way sooner. Whoops... No data should have been lost (minus some cached stuff like last-read posts in the forums here) - I was able to recover everything from the previous server. EDIT: I did choose to drop the subscription notices. These things are a real pain, so sorry for that. Hope you don't mind a fresh start!

    What happened / wall of text

    On July 1st around noon EST the site started to lock up. Either the dreaded blank white screen, or our 'Aww crap' page was all anyone could see. Super rare case where the joke that even the contact us page is broken actually wasn't a joke.

    Our server had a pair of 500GB NVMe drives in RAID1 as our boot/database/application drive. Basically, everything the site needs to run, minus the actual images (those were on a pair of 4TB HDDs, also in RAID1), was on these drives. After running non-stop since May 2019, these drives had accumulated over 4PB (4000 TB) of reads/writes. NVMe drives usually have a rated lifetime before things start to break down, and we blew past that with these drives. As they age, sectors start to go bad; those get marked as bad and spare sectors are put into use. Eventually the spares are exhausted, so there is no place left for new data to go when the next sector goes bad. That is when the drive's firmware locks the drive to read-only. Thankfully, this protects the data on the drive and still allows it to be accessed.
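
    For anyone wanting to check the same numbers on their own drives: the NVMe SMART log reports reads and writes in "data units", where one unit is 1000 x 512 bytes. A minimal sketch of the conversion follows. The function name and the sample count are illustrative (the count is simply chosen to equal 4PB) and are not readings from the drives described here.

```python
# Each NVMe "data unit" in the SMART/health log is 1000 * 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512


def data_units_to_terabytes(data_units: int) -> float:
    """Convert an NVMe data-unit counter to decimal terabytes (10**12 bytes)."""
    return data_units * BYTES_PER_DATA_UNIT / 1e12


if __name__ == "__main__":
    # Illustrative value only: this many data units corresponds to 4 PB.
    sample_units = 7_812_500_000
    print(f"{data_units_to_terabytes(sample_units):.0f} TB")
```

    Comparing that figure against the endurance rating on the drive's spec sheet gives a rough idea of how much life is left.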

    What I'm not sure about is when either of these drives locked into read-only mode. By the time I got into things and was looking at the health of the drives, both were already toast. The big fail takeaway here is that I had no monitoring set up for the health of the drives. Had I seen the health of the drives declining, I could have swapped them out for fresh drives much sooner.
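
    A minimal sketch of the kind of check that could flag this early, assuming smartmontools 7.0 or newer is installed so that smartctl can emit JSON. This is not the monitoring used on the site. The function names, the 80% wear threshold and the device path are assumptions for illustration.

```python
import json
import subprocess


def nvme_health_warnings(report: dict, max_percentage_used: int = 80) -> list:
    """Return human-readable warnings from a parsed `smartctl --json` report."""
    log = report.get("nvme_smart_health_information_log", {})
    warnings = []

    # Non-zero means the controller has raised at least one critical flag
    # (for example low spare space or media placed in read-only mode).
    if log.get("critical_warning", 0) != 0:
        warnings.append(f"critical_warning is {log['critical_warning']}")

    # Spare capacity is reported as a percentage, with a vendor-set threshold.
    spare = log.get("available_spare")
    threshold = log.get("available_spare_threshold")
    if spare is not None and threshold is not None and spare <= threshold:
        warnings.append(f"available spare {spare}% is at or below {threshold}%")

    # The drive's own wear estimate; it can exceed 100 on worn drives.
    used = log.get("percentage_used")
    if used is not None and used >= max_percentage_used:
        warnings.append(f"drive reports {used}% of rated life used")

    return warnings


def read_smart_report(device: str) -> dict:
    """Run smartctl on a device and parse its JSON output (needs root)."""
    # smartctl's exit status is a bitmask, so a non-zero code is not treated
    # as a failure here; the JSON is parsed regardless.
    result = subprocess.run(
        ["smartctl", "--json", "--all", device],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Hypothetical device path; adjust for the machine being checked.
    for message in nvme_health_warnings(read_smart_report("/dev/nvme0")):
        print("WARNING:", message)
```

    Run from cron or a systemd timer and wired to an email or chat alert, something like this gives weeks or months of notice instead of none.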

    After seeing this, I started copying off configs and other files I knew I'd need to make the restore quicker. I knew the server would be dead after a reboot, so I tried to grab everything I could before then. Go figure, after that reboot, I remembered a few other files I'd need. So I was a bit worried about whether they'd be recoverable or not.

    Thankfully, since the drives were read-only, I could still get the data off them. OVH (our server host) has a rescue mode we can boot the servers into to manage the drives/data. Because the drives were read-only, I couldn't just mount them normally. What I ended up doing was using dd to dump the partitions to a file on the still-working HDDs. Once that was done, I was able to mount that file and get all the data I needed off it. I rsync'd off everything database related, config file related, and whatever else would make the new server easier to deploy.
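
    The recovery itself was done with dd. As an illustration of what that step amounts to, here is a rough Python stand-in that copies a source (a partition device, or any file) to an image file in fixed-size blocks. The function name, block size and paths are assumptions, not the commands that were actually run.

```python
def image_device(source_path: str, image_path: str,
                 block_size: int = 4 * 1024 * 1024) -> int:
    """Copy source_path to image_path block by block; return bytes copied."""
    copied = 0
    # The source is opened read-only, so a drive locked to read-only mode
    # can still be imaged as long as reads succeed.
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)
            if not chunk:  # an empty read means end of device/file
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied


if __name__ == "__main__":
    # Hypothetical paths: a partition on the failing drive and a target
    # file on a healthy disk with enough free space.
    total = image_device("/dev/nvme0n1p2", "/mnt/hdd/nvme0n1p2.img")
    print(f"copied {total} bytes")
```

    Unlike dd with conv=noerror, this sketch stops at the first read error, so it only suits drives that are read-only but otherwise still readable.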

    Now, this might make it sound like I had no backups. Not true. Daily backups of the database and uploads are taken and sent offsite. I was ready to push these backups to the new server, but since I was able to get the latest and greatest data, that was that!

    That said, the backups I take are for a 'worst case' type of scenario. For the uploads, for example, only the full-resolution images are backed up. The three thumbnail sizes aren't. Since I could still get to the old server, I was able to add the thumbnails to tarballs and transfer them over pretty quickly. It's much quicker to tar them up, rsync them over to the new server, and untar them there than to send however many million small files over the network.
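
    The same bundling idea, sketched with Python's standard tarfile module rather than the tar command that was used here. The function name and paths are hypothetical. The archive is left uncompressed on the assumption that thumbnails are already-compressed image files.

```python
import tarfile
from pathlib import Path


def bundle_directory(source_dir, archive_path) -> int:
    """Pack every file under source_dir into one tar; return the file count."""
    source = Path(source_dir)
    count = 0
    # Mode "w" writes a plain, uncompressed tar archive.
    with tarfile.open(archive_path, "w") as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir so the archive unpacks
                # cleanly into any target directory.
                archive.add(path, arcname=path.relative_to(source).as_posix())
                count += 1
    return count


if __name__ == "__main__":
    # Hypothetical locations; write the archive outside the source
    # directory so it does not try to include itself.
    n = bundle_directory("/srv/thumbs/small", "/mnt/hdd/thumbs-small.tar")
    print(f"bundled {n} files")
```

    One large file moves across the network as a single sequential stream, which avoids the per-file overhead that makes millions of tiny transfers slow.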

    Once the new server was here, all these files were copied over from the old server while it was still in rescue mode, keeping all the network traffic nice and fast (in the same datacenter). Had I had to pull my backups from offsite, the transfers alone would have taken at least two days, if not longer. Then I'd have had to recreate all the thumbnails. Would have been a good excuse to finally learn some Python though.

    And now we're here. Fresh hardware, more time on the clock. There were a few other tweaks I needed to make with the new hardware and latest OS, but for the most part, we should be good to rock for a while now. Was also able to get the 4TB drives upgraded to 6TB, so we have more room for growth again.

    wall of text over

    Anyway, thanks again as always for hanging around with us!

  • 35715

    Why are you sorry, man! We all love this site. Thank you for putting in your hard work!

  • 35716

    Welcome back, and thank you for your hard work in getting everything back in order! Appreciate it greatly!

  • 35717

    glad to have you back. thx for the hard work

  • 35718

    YAY!!! TY for all your hard work! We appreciate you <3

  • 35719

    Please don't be sorry, this is the best wallpaper site and you're doing the best already. Many many thanks for the hard work, Sir.

  • 35720

    Amazing work, well done and fantastic effort on simply keeping us updated!!

  • 35721

    Thanks for the work, I'd love to know how you solved this.

  • 35722

    Glad the site is back, wish you all the best ADM

  • 35723

    right on, but no worries. we're only humans that use man made pc hardware, things nuke themselves out of the blue. thanks for the work!

  • 35724

    Never in my 37 years have I been so happy that a website came back online. Thanks AksumkA - don't beat yourself up about whatever happened, nobody else is running a wallpaper site that isn't riddled with ads AND has rules in place to keep the place sane and orderly.

  • 35725

    Nothing to be sorry for! Glad it all ended with good news!

  • 35726

    I never thought I could feel withdrawal from a website

  • 35727

    AksumkA said:

    I'll update this post later this week with a quick breakdown of what happened and how things were recovered. I'm sure someone out there would like to learn from my mistakes! If you guys have any specific questions, please let me know and I'll do my best to answer.

    The TL;DR is: NVMe drives that serve as our boot and database storage went read only after ~4.5PB of data read/written to them.

    As an admin myself, I'd be very interested in the post-mortem. I've always learned at least a little something new from the ones I've read previously, and I'll bet it's no different here. Don't beat yourself up over this either, I know of multiple large corporate sites that have gone down because of similar issues, and not all of them have come back up anywhere near as smoothly. Take some credit for having good backups in place, and everything set up where you could restore it relatively quickly.

  • 35728

    Thank you for all of the hard work that you do. This site has been my only source of high quality wallpaper since it went up.

  • 35729

    It's good that you are back, I already feel the need to search for wallpaper on your website

  • 35730

    Awesome, glad you're back up and running. Thanks for all of the hard work, and thanks for the great site!

  • 35731

    I've been visiting this site since 2015 and this is the first time I've ever noticed something like this occurring, so I'd say, no worries! That is a wonderful track record. You have kept a brilliant and wonderful resource for us all going for so many years ─ it's a pleasure to be a part of this community and we're grateful you were able to get WH back online pretty quickly ultimately. Preciate'cha! ╰(´︶`)╯♡

  • 35732

    thanks for the update and hard work. long live wallhaven.

  • 35735

    welcome back and thank you for your effort!

  • 35736

    Don't be sorry! Thank you for your effort!

  • 35737

    Thank you for all you do, thank you for this site that looks great.