Major outage: intermittent community access
Incident Report for Bettermode
Postmortem

On October 13, 2023, we experienced a session issue that affected user access to our community platforms. Our monitoring systems, which automatically update our status page, did not detect the problem at first because our servers and core services remained healthy. The issue was complex: it arose from a combination of updates made over the past year that together produced this unexpected behavior. It was triggered only by a specific sequence of events, which made it hard to catch during our usual testing.

After hearing from our users around 03:00 AM UTC, our engineering team acted swiftly to investigate and correct the issue. Solving it was not straightforward; it required carefully pinpointing the problem areas within our backend systems. Despite the challenging nature of the issue, our team fixed it within a few hours and restored normal service.

We understand the questions our customers may have about the late update on our status page. The delay happened because our automatic monitoring focuses on server health and common errors. In this case, the unusual nature of the issue meant it slipped past these checks.
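To illustrate the gap described above, here is a minimal sketch of a check that judges a community page the way a user would see it, rather than by server health alone. It is not Bettermode's actual monitoring: the names `ProbeResult`, `is_user_facing_failure`, and `should_alert`, along with the 20% threshold, are assumptions made for illustration. The two error markers are the messages users reported during the incident.

```python
from dataclasses import dataclass

# Markers taken from the error messages users reported during the incident.
ERROR_MARKERS = ("Community Not Found", "Internal Server Error")


@dataclass
class ProbeResult:
    """Outcome of one hypothetical probe request to a community page."""
    status_code: int
    body: str


def is_user_facing_failure(result: ProbeResult) -> bool:
    """Return True if the page would look broken to a user."""
    if result.status_code >= 400:
        return True
    # A page can return 200 and still render an error message.
    return any(marker in result.body for marker in ERROR_MARKERS)


def should_alert(results: list[ProbeResult], threshold: float = 0.2) -> bool:
    """Alert when the failing fraction of recent probes reaches the threshold.

    A fraction is used instead of an all-or-nothing rule because the
    failures in this incident were intermittent.
    """
    if not results:
        return False
    failures = sum(is_user_facing_failure(r) for r in results)
    return failures / len(results) >= threshold
```

A check of this kind could flag a problem even while servers report healthy, since it looks at what the page returns rather than whether the host is up.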

In light of this incident, we are taking steps to improve our systems. These include:

  1. Tweaking our automatic monitoring to catch a broader set of errors.
  2. Adding new tools to make updates more reliable.
  3. Improving our small-scale test releases to catch issues early before rolling out to everyone.
  4. Enhancing our emergency response process to speed up problem-solving.

We're dedicated to making our platforms reliable and secure. The lessons learned from this incident will help guide our future work and system enhancements. We deeply value your patience and understanding as we keep working to improve the dependability and experience of our platforms.

Posted Oct 17, 2023 - 01:56 UTC

Resolved
Starting on October 13 at 03:00 AM UTC, we experienced an issue affecting access to our communities. Users intermittently received "Community Not Found" or "Internal Server Error" messages when attempting to access their respective communities. Our team began investigating the issue at 03:30 AM UTC. A patch was deployed by 04:15 AM UTC on October 13, temporarily resolving the issue.

However, at 12:00 PM UTC on October 13, the issue re-emerged. Our team resumed its investigation at 12:30 PM UTC and successfully deployed a stable fix by 01:30 PM UTC. The root cause of the issue was identified and resolved. Access to communities has been restored and is now stable. We are closely monitoring the situation to ensure that the fix holds and to prevent any recurrence of this issue.

We appreciate your patience and understanding as we worked to resolve this matter quickly.
Posted Oct 13, 2023 - 13:30 UTC