As many of you experienced, we had an outage earlier today, February 4, from 8:35AM to 1:04PM UTC.
First off, we are truly sorry for the downtime. This certainly had its greatest impact on our friends across both the Atlantic and Pacific. We want to apologize for the hours of productivity you may have lost during your Tuesday mornings and afternoons. We also want to thank you for the overwhelming sense of understanding we’ve gotten from you as we responded to your emails and tweets, from “It happens” to a simple “Thanks for the help!”
What exactly happened?
At approximately 8:35AM UTC, we lost connectivity to one of our servers, which holds our caching database. The connection came back within a few minutes, but the DoneDone application itself was not able to reconnect. We restarted our applications to restore connectivity. In the end, it was a simple fix. While we were down, no data was lost and all issues sent via email were submitted properly.
While technical issues are inevitable, having them arise without any response for 4 hours is unacceptable. We did a bad job today.
What are we doing to avoid this type of outage again?
We’re going to do a few things to help prevent this situation from happening again. As with most problems in production, the things we can do range from quick short-term solutions to longer-term ones.
In the short term, we will have better alerts in place to notify our IT team when critical events like this happen, particularly during the overnight hours in Chicago. While we aren’t yet at the point of having a customer support team available 24/7, we can ensure that downtime like this doesn’t take hours to respond to.
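To make the idea of such an alert concrete, here is a minimal sketch in Python. It is an illustration only, not a description of DoneDone's actual monitoring: the `DowntimeAlerter` class, the `check` and `notify` callables, and the threshold of three consecutive failures are all hypothetical. The point is simply that a scheduled health check can page someone after a few failed polls instead of waiting for a person to notice.

```python
from typing import Callable


class DowntimeAlerter:
    """Notifies the on-call team after several consecutive failed health checks.

    Hypothetical sketch: `check` returns True when the application is healthy,
    and `notify` delivers a message (pager, SMS, email) to whoever is on call.
    """

    def __init__(
        self,
        check: Callable[[], bool],
        notify: Callable[[str], None],
        threshold: int = 3,
    ) -> None:
        self.check = check
        self.notify = notify
        self.threshold = threshold
        self.failures = 0
        self.alerted = False

    def poll(self) -> None:
        """Run one health check; call this on a fixed schedule, e.g. every minute."""
        try:
            healthy = self.check()
        except Exception:
            # A check that raises is treated the same as an unhealthy result.
            healthy = False

        if healthy:
            # Recovery resets the counter so a later outage alerts again.
            self.failures = 0
            self.alerted = False
            return

        self.failures += 1
        # Alert once per outage, only after `threshold` failures in a row.
        if self.failures >= self.threshold and not self.alerted:
            self.notify(f"Health check failed {self.failures} times in a row")
            self.alerted = True
```

Requiring several failures in a row avoids paging someone at 3AM for a single dropped request, while still keeping the response time to minutes rather than hours.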
Longer term, we are investigating a better fault-tolerance plan. Though we now know how to remedy this situation, we want to ensure that the application can handle it in a better way, one that doesn’t rely on manual intervention.
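One common way to remove that kind of manual step is to have the application drop a broken connection and reconnect on its own, waiting a little longer between each attempt. The Python sketch below shows the general pattern under stated assumptions. It is not DoneDone's code: the `ReconnectingClient` class, the `connect` factory, and the retry settings are hypothetical, and a real cache client would raise its own error types.

```python
import time
from typing import Any, Callable, Optional


class ReconnectingClient:
    """Wraps a connection so a lost link is re-established without a restart.

    Hypothetical sketch: `connect` is any function that opens a new connection
    and raises ConnectionError when the server is unreachable.
    """

    def __init__(
        self,
        connect: Callable[[], Any],
        max_attempts: int = 5,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._connect = connect
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep
        self._conn: Optional[Any] = None

    def call(self, operation: Callable[[Any], Any]) -> Any:
        """Run `operation(connection)`, reconnecting with exponential backoff."""
        last_error: Optional[Exception] = None
        for attempt in range(self._max_attempts):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return operation(self._conn)
            except ConnectionError as exc:
                last_error = exc
                # Discard the stale connection so the next attempt opens a new one.
                self._conn = None
                if attempt < self._max_attempts - 1:
                    # Wait base_delay, then 2x, 4x, ... before retrying.
                    self._sleep(self._base_delay * 2 ** attempt)
        raise ConnectionError(
            f"still unavailable after {self._max_attempts} attempts"
        ) from last_error
```

With a wrapper like this, a caller would write something like `client.call(lambda conn: conn.get("some-key"))`, where `get` stands in for whatever method the real cache client exposes. A brief network blip then costs a few retries instead of a restart.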
A sincere thank you
We pride ourselves on keeping DoneDone up and running smoothly. This is our first significant outage in over sixteen months, but we know we can do even better. Thank you again for taking this outage in stride with us. We live, we learn, we improve.