Sorry for the downtime this past week everyone.
We're back up and running on some brand new server hardware! Will the site be faster? Will the site be better? Will this happen again? Who knows!!
I'll update this post later this week with a quick breakdown of what happened and how things were recovered. I'm sure someone out there would like to learn from my mistakes! If you guys have any specific questions, please let me know and I'll do my best to answer.
The TL;DR is: the NVMe drives that serve as our boot and database storage went read-only after ~4.5PB of data read/written to them. A better admin would have caught that way sooner. Whoops... No data should have been lost (minus some cached stuff, like last-read posts in the forums here) - I was able to recover everything from the previous server. EDIT: I did choose to drop the subscription notices. Those things are a real pain, so sorry for that. Hope you don't mind a fresh start!
What happened / wall of text
On July 1st around noon EST the site started to lock up. The dreaded blank white screen, or our 'Aww crap' page, was all anyone could see. It was the rare case where the joke that even the contact-us page is broken actually wasn't a joke.
Our server had a pair of 500GB NVMe drives in RAID1 as our boot/database/application drive. Basically, everything the site needs to run minus the actual images (those were on a pair of 4TB HDDs, also in RAID1) lived on these drives. After running non-stop since May 2019, these drives had accumulated over 4PB (4,000 TB) of reads/writes. NVMe drives have a rated lifetime before things start to break down, and we blew way past that with these drives. As a drive ages, sectors start to go bad; the firmware marks them as bad and remaps them to spare sectors. Eventually all the spares get exhausted, so there's nowhere left for data to go when the next sector fails. At that point the drive's firmware locks the drive to read-only. Thankfully, this protects the data on the drive and keeps it accessible.
What I'm not sure about is when each of these drives flipped into read-only mode. By the time I got in and looked at the health of the drives, both were already toast. The big fail takeaway here: I had no monitoring set up for the health of the drives. Had I seen their health declining, I could have swapped in fresh drives much sooner.
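For anyone wanting to avoid the same mistake, here's a rough sketch of the kind of check I should have had running. It assumes smartmontools is installed (NVMe drives report a "Percentage Used" endurance estimate via `smartctl`); the device names and the 80% threshold are just examples.

```shell
# check_wear: reads `smartctl -A` output on stdin and prints an alert
# once the NVMe "Percentage Used" endurance estimate passes a threshold.
check_wear() {
  awk -F: -v limit="$1" '
    /Percentage Used/ {
      gsub(/[ %]/, "", $2)               # strip padding and the % sign
      if ($2 + 0 >= limit)
        print "ALERT: drive at " $2 "% of rated endurance"
    }'
}

# Run it daily from cron against each drive (example device names):
#   smartctl -A /dev/nvme0 | check_wear 80
#   smartctl -A /dev/nvme1 | check_wear 80
```

Wire the alert into email or whatever pager you like; the point is just to hear about the wear before the firmware decides for you.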
After seeing this, I started copying off configs and other files I knew I'd need to make the restore quicker. I figured the server wouldn't survive a reboot, so I tried to grab all that I could first. Go figure: right after that reboot, I remembered a few other files I'd need, so I was a bit worried about whether they'd still be recoverable.
Thankfully, since the drives were read-only, I could still get the data off them. OVH (our server host) has a rescue mode you can boot the server into to manage the drives/data. Since the drives were locked read-only, I couldn't just mount them normally. What I ended up doing was using dd to dump the partitions to files on the still-working HDDs. Once that was done, I was able to mount those files and pull off all the data I needed. Everything database-related, config-related, and anything else that would make the new server easier to deploy got rsync'd off.
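For the curious, the flow looked roughly like this. Below is a sketch with a scratch file standing in for the real partition so the commands are safe to try anywhere; on the actual box the source was an NVMe partition (something like /dev/nvme0n1p2, from OVH's rescue mode) and the destination was a path on the healthy HDDs.

```shell
# Scratch file standing in for the read-only NVMe partition:
printf 'pretend this is a partition' > scratch_partition

# 1. Image the partition onto the working drives (dd happily reads
#    from a read-only source):
dd if=scratch_partition of=partition.img bs=4M conv=fsync 2>/dev/null

# 2. Verify the image matches the source before trusting it:
cmp scratch_partition partition.img && echo "image OK"

# 3. On the real image you'd then loop-mount it read-only and copy out
#    configs and the database (needs root):
#      mount -o loop,ro partition.img /mnt/old-root
#      rsync -a /mnt/old-root/etc/ /backup/etc/
```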
Now, this might make it sound like I had no backups. Not true. Daily backups of the database and uploads are taken and sent offsite. I was ready to push these backups to the new server, but since I was able to get the latest and greatest data, that was that!
That said, the backups I take are for a 'worst case' type of scenario. For the uploads, for example, only the full-resolution images are backed up; the three thumbnail sizes aren't. Since I could still get to the old server, I was able to pack the thumbnails into tarballs and transfer them over pretty quickly. Much faster to tar them up, rsync them to the new server, and untar them there than to send however many million small files over the network one at a time.
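The tar-then-transfer trick, sketched with stand-in paths (the real thumbnail directories and server names differ):

```shell
# Millions of tiny files transfer slowly one-by-one; a single archive
# streams at full speed. Stand-in directory for the thumbnails:
mkdir -p thumbs
for i in 1 2 3; do echo "thumb $i" > "thumbs/$i.jpg"; done

# Pack everything into one tarball:
tar -czf thumbs.tar.gz thumbs

# Ship one file instead of millions (run from the old server;
# hypothetical destination host/path):
#   rsync -av thumbs.tar.gz newserver:/var/www/
# Then unpack on the new server:
mkdir -p restore
tar -xzf thumbs.tar.gz -C restore
```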
Once the new server was here, all these files were copied over from the old server (still in rescue mode), keeping all the network traffic nice and fast inside the same datacenter. Had I had to pull my backups from offsite, the transfers alone would have taken at least two days, if not longer, and then I'd have had to recreate all the thumbnails. Would have been a good excuse to finally learn some Python, though.
And now we're here. Fresh hardware, more time on the clock. There were a few other tweaks I needed to make for the new hardware and the latest OS, but for the most part we should be good to rock for a while now. I was also able to get the 4TB drives upgraded to 6TB, so we have more room for growth again.
wall of text over
Anyway, thanks again as always for hanging around with us!