And we're back!

Posts: 79 · Views: 2446
  • 35770

    Glad to have you back, and thanks for your work

  • 35771

    Good to see you back. This site is so good.

  • 35774

    AksumkA said:

    Sorry for the downtime this past week everyone.

    We're back up and running on some brand new server hardware! Will the site be faster? Will the site be better? Will this happen again? Who knows!!

    I'll update this post later this week with a quick breakdown of what happened and how things were recovered. I'm sure someone out there would like to learn from my mistakes! If you guys have any specific questions, please let me know and I'll do my best to answer.

    The TL;DR is: the NVMe drives that serve as our boot and database storage went read-only after ~4.5 PB of data had been read from/written to them. A better admin would have caught that way sooner. Whoops... No data should have been lost (minus some cached stuff like last read posts in the forums here) - I was able to recover everything from the previous server. EDIT: I did choose to drop the subscription notices. These things are a real pain, so sorry for that. Hope you don't mind a fresh start!

    Anyway, thanks again as always for hanging around with us!

    AksumkA, it's a pity that this happened, but as they say, "Shit happens..."))

    P.S. I'm glad to see you online again)

  • 35779

    I'm happy to see the site back online. Thank you, AksumkA!

  • 35780

    This website is fantastic. I'm very happy to see you overcome this hardship.

    Added 20 minutes after

    AksumkA said:

    Sorry for the downtime this past week everyone.

    We're back up and running on some brand new server hardware! Will the site be faster? Will the site be better? Will this happen again? Who knows!!

    I'll update this post later this week with a quick breakdown of what happened and how things were recovered. I'm sure someone out there would like to learn from my mistakes! If you guys have any specific questions, please let me know and I'll do my best to answer.

    The TL;DR is: the NVMe drives that serve as our boot and database storage went read-only after ~4.5 PB of data had been read from/written to them. A better admin would have caught that way sooner. Whoops... No data should have been lost (minus some cached stuff like last read posts in the forums here) - I was able to recover everything from the previous server. EDIT: I did choose to drop the subscription notices. These things are a real pain, so sorry for that. Hope you don't mind a fresh start!

    Anyway, thanks again as always for hanging around with us!

    Thank you very much for all your hard work.

  • 35784

    we are SO BACK moment ........................👌👌

  • 35788

    Great work and welcome back

  • 35790

    All we have is now, go big or go home!

  • 35809

    Hey bro, don't worry. Thank god you're back, that's enough :)

  • 35813

    Appreciate the hard work and post-mortem write-up. Can you share what you have implemented (or intend to) regarding the boot/data drives and monitoring their health, or other takeaways?

  • 35814

    thank you for all the hard work to get it back up and running!

  • 35826

    agcrouton said:

    Appreciate the hard work and post-mortem write-up. Can you share what you have implemented (or intend to) regarding the boot/data drives and monitoring their health, or other takeaways?

    I do have some basic monitoring set up with Munin, but the plugins for NVMe drives seem hit or miss, so I'll have to look into other options. As of now, we're still kinda running risky.

    One of the other things that burned me was not having good documentation on what config changes were made to the various services that run the site. Things like memory allocation, number of processes a service can spawn, etc. We had a few short downtimes after coming back thanks to that (like Elasticsearch's default config only allowing 1GB of memory, whoops). So that's another thing I'll be working on: putting together a document with all these notes.

    One other threat was the older versions of some things we're still running on. Making sure the latest distros still have repositories for the versions we still need to use is a whole thing. I blame that on the general lack of updating I've been able to do to the site's code as a whole. We're so far behind on the framework version that upgrading to whatever the latest is would be a whole huge project.

  • 35827

    Monitoring can be as simple as running a smartctl test and emailing the result.

    I always write a chroot script after a new server install and save it on a forge somewhere.

    love dd <3

    If you know why the drives went read-only, I'd like to hear it.

    Nice to see the site is back. Keep fighting!!!

  • 35828

    AksumkA said: (interesting stuff)

    Agree with HumanG33k about smartctl - https://www.smartmontools.org/wiki/NVMe_Support suggests it may show useful Spare and R/W count that could be used to track drive longevity.
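
    As a rough illustration of the smartctl idea, here is a minimal Python sketch that reads the NVMe health log from smartctl's JSON output and flags a drive that is wearing out. It assumes a smartmontools build with `--json` support and the field names that build reports under `nvme_smart_health_information_log`; the device path, the 80% wear cutoff, and the function names are placeholders made up for this example, not anything the site actually runs.

```python
import json
import subprocess

# NVMe reports data units in thousands of 512-byte blocks (512,000 bytes each).
BYTES_PER_DATA_UNIT = 512_000


def read_health_log(device: str) -> dict:
    """Run `smartctl --json -a DEVICE` and return the NVMe health log section."""
    # smartctl's exit status is a bitmask, so a non-zero code is not treated
    # as fatal here; we only rely on the JSON printed to stdout.
    proc = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True, text=True, check=False,
    )
    report = json.loads(proc.stdout)
    return report.get("nvme_smart_health_information_log", {})


def check_health(log: dict, max_percentage_used: int = 80) -> list:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    # Any set bit in critical_warning deserves attention
    # (bit 3, value 8, means the media has gone read-only).
    critical = log.get("critical_warning", 0)
    if critical != 0:
        warnings.append(f"critical_warning flags set: {critical}")
    spare = log.get("available_spare")
    threshold = log.get("available_spare_threshold")
    if spare is not None and threshold is not None and spare <= threshold:
        warnings.append(f"available spare {spare}% is at or below threshold {threshold}%")
    # max_percentage_used is an arbitrary cutoff chosen for this sketch.
    used = log.get("percentage_used")
    if used is not None and used >= max_percentage_used:
        warnings.append(f"rated endurance used: {used}%")
    return warnings


def terabytes(data_units: int) -> float:
    """Convert an NVMe data-unit counter to decimal terabytes."""
    return data_units * BYTES_PER_DATA_UNIT / 1e12


if __name__ == "__main__":
    log = read_health_log("/dev/nvme0")  # hypothetical device path
    for line in check_health(log) or ["no warnings"]:
        print(line)
    if "data_units_written" in log:
        print(f"written so far: {terabytes(log['data_units_written']):.1f} TB")
```

    Run from cron as root, the printed lines could be mailed out with whatever the box already uses for mail, which is about as much as the "smartctl test and email the result" approach needs.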

    Regarding the documentation and keeping current on packages/versions - always the hard part, but at least you know it and are trying to do something about it.

    Appreciate the info!

  • 35829

    THANK YOU!! we appreciate you and all you do for us fans <3 -D

  • 35832

    Man, what can I say! I really missed you.
