Recent Downtime / Post Mortem Analysis
Post-Mortem Analysis
Incident Title: Forum Down
Date & Time of Incident: 2025-02-23 02:01:00 (UTC) to 2025-02-23 10:43:00 (UTC)
Prepared by: Q

1. Summary
  • The hard drive became full, rejecting any write operation.
  • The Webserver stopped normal operation, yielding "502 Bad Gateway" errors.
  • The Forum was down.

2. Root Cause Analysis
  • Logging and backups filled the hard drive, which had been at its limit for some time.


3. Timeline
Time (UTC)/Event Description
  • 02:01 dxasmodeus reported the issue via an external communication tool.
  • 10:31 Q started working on the server.
  • 10:37 Q determined the cause: the full hard drive.
  • 10:43 Q made space on the hard drive and restarted the server.

4. Resolution & Actions Taken
  • Old backups were deleted.
  • Server was restarted.
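
The manual cleanup above could be automated. The following is only a minimal sketch, assuming the backups are plain files in a single directory and that keeping the newest few is acceptable; the function name `prune_backups` and the `keep` parameter are hypothetical and not part of the actual server setup.

```python
from pathlib import Path


def prune_backups(backup_dir: Path, keep: int = 3) -> list[Path]:
    """Delete all but the `keep` newest files in backup_dir.

    Returns the list of deleted paths. "Newest" is judged by
    modification time, which is an assumption about how the
    backups are written.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    files = [p for p in backup_dir.iterdir() if p.is_file()]
    # Sort newest first, so everything after index `keep` is stale.
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    stale = files[keep:]
    for path in stale:
        path.unlink()
    return stale
```

Run regularly (for example from a scheduled job), something like this would keep the backup directory from growing without bound.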

5. Immediate Preventative Measures
  • Make additional space on the webserver.
  • Add a command to the backend login that reports the remaining free space in megabytes.
  • Add a file `buffer-storage-blocker-for-emergencies.dat` to the root of the filesystem, so that in the future space can be freed even more quickly by deleting it.
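
For illustration, the free-space command and the emergency buffer file could look roughly like the sketch below. This is not the code actually deployed on the server; the function names `free_megabytes` and `create_ballast` and the buffer size are assumptions made for the example.

```python
import os
import shutil


def free_megabytes(path: str = "/") -> int:
    """Free space on the filesystem containing `path`, in whole MB (10^6 bytes)."""
    return shutil.disk_usage(path).free // 1_000_000


def create_ballast(path: str, size_mb: int) -> None:
    """Create a buffer file of `size_mb` MB that can be deleted in an emergency.

    Real bytes are written (rather than just setting the file length)
    so the file is not sparse and actually occupies disk space.
    Caveat: a filesystem with transparent compression may still store
    zero bytes compactly.
    """
    chunk = b"\0" * 1_000_000
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())


if __name__ == "__main__":
    # What a login-time message might print.
    print(f"Free space on /: {free_megabytes('/')} MB")
```

Deleting the buffer file immediately frees its size, which gives an admin room to work before the real cleanup starts.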

6. Further possible Mitigations:
  • Create an alerting capability that notifies the responsible admins, e.g. by email.
  • Mount variable-size directories (e.g. the uploads, the log files and/or the database) on a separate volume.
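
As a rough idea of what the alerting could look like, here is a minimal sketch using only the Python standard library. The 10% threshold, the addresses, and the SMTP host are placeholders, and the function names are hypothetical; nothing here reflects a decision already made by staff.

```python
import shutil
import smtplib
from email.message import EmailMessage
from typing import Optional


def percent_free(path: str = "/") -> float:
    """Percentage of free space on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def build_alert(path: str, free_pct: float, threshold_pct: float,
                sender: str, recipient: str) -> Optional[EmailMessage]:
    """Return an alert email if free space is below the threshold, else None."""
    if free_pct >= threshold_pct:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Low disk space on {path}: {free_pct:.1f}% free"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Free space on {path} is {free_pct:.1f}%, "
        f"below the alert threshold of {threshold_pct:.1f}%."
    )
    return msg


def send_alert(msg: EmailMessage, smtp_host: str = "localhost") -> None:
    """Send the alert through an SMTP server (host is an assumption)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

A scheduled job could call `build_alert("/", percent_free("/"), 10.0, ...)` every few minutes and pass any non-`None` result to `send_alert`.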

7. Lessons Learned
  • Communication among staff worked well.
  • The absence of monitoring and alerting meant that the nearly full hard drive was not noticed quickly.
  • There is a permanent lack of storage on the server.

8. Action Items
Task/Owner/Deadline/Status
  • Space on Webserver / Q / 2025-02-23 / Done
  • Add Command / Q / 2025-03-05 / Done
  • Add File / Q / 2025-02-23 / Done
  • Move Data to a separate Volume/Mount Point / swammy+Q / 2025-02-27 / Done
  • Complete Post-Mortem Analysis / Q / 2025-03-02 / Done
  • Consider further Actions & Mitigations / Futanari Staff / 2025-03-09 / Open


Messages In This Thread
Recent Downtime / Post Mortem Analysis - by Q - 23rd February 2025, 14:45
RE: Recent Downtime / Post Mortem Analysis - by Q - 25th February 2025, 18:45
RE: Recent Downtime / Post Mortem Analysis - by Q - 26th February 2025, 10:34
RE: Recent Downtime / Post Mortem Analysis - by Q - 27th February 2025, 16:28
RE: Recent Downtime / Post Mortem Analysis - by Q - 1st March 2025, 14:42
