This incident is now closed. On December 27, 2016 at approximately 8pm EST, Zoey experienced a failure in one of the computing nodes in the Stack 2 computing cluster. The issue was corrected with a simple restart of services, and the outage lasted approximately twenty minutes. Over the following ten days, we continued to see one of the computing nodes in our Stack 2 cluster fail at random. Despite intensive troubleshooting and more than fifteen distinct attempted resolutions, we were initially unable to find a fix. Each of these outages was short, lasting less than twenty minutes, and affected fewer than 10% of Zoey customers at any given time. Through further troubleshooting we discovered that the cause was a bug in the underlying Linux kernel, and we worked with the appropriate third-party team to implement a patch. At this time Zoey believes this error has been corrected. Further information about the patch can be found at https://patchwork.ozlabs.org/patch/712373/
Please note that this is a very brief, technical summary of the overall issue.
Jan 9, 12:09 EST
We continue to see no interruptions in connectivity and no impact to underlying services. Although we believe that the issue has been mitigated, given the complicated nature of this problem we will leave this incident open until mid-day Monday, January 9, 2017 Eastern Time.
Jan 7, 15:40 EST
At this time we are turning on all underlying Zoey services that were disabled during troubleshooting. None of these services affect the transactional capabilities of your Store; they are internal tools that Zoey uses. We believe that the root cause of the issue has been identified and a fix has been put in place. Although we believe that the issue has been mitigated, given the complicated nature of this problem we will leave this incident open until mid-day Monday, January 9, 2017 Eastern Time. A complete write-up will be available shortly thereafter. Thank you again for your patience while we worked through this issue.
Jan 6, 15:16 EST
After beginning the migration of stores, we discovered another possible set of fixes for the problem we are experiencing. We have completed the work to implement these fixes and are now monitoring the stability of all computing nodes on Stack 2. If this final set of fixes does not correct the issue, we will begin migrating stores to the infrastructure that was set up to receive them. We will post an update this afternoon on our expected plan of action. Please note that these rolling outages have affected a small percentage of Zoey customers at a time and generally result in 15 minutes of downtime once a day. We completely understand that this is not acceptable; we only want to clarify the scope of the issue, as it is by no means a total outage or a sustained period of downtime. Our engineers have been monitoring the situation 24x7 and responding at all hours of the day and night whenever an issue has come up. Thank you for your continued patience.
Jan 6, 10:52 EST
The fixes that we have deployed have not provided the stability that we have been looking for. Therefore, the plan of action is to create new servers and to migrate all affected stores to them. The new servers were created last night, and we are beginning the process of transferring stores to them. We will have more information about this process once we are underway.
Jan 5, 11:26 EST
We are currently finishing the rollout of our fix to the remaining four computing nodes. We expect this work to take twenty to thirty minutes. We will post an update as soon as this work is done.
Jan 4, 18:48 EST
In progress -
Ongoing 20-minute rolling outages have been reported across Stack 2. Work continues to investigate and resolve the root cause of this issue.
Jan 4, 00:51 EST
In progress -
We are currently performing an additional step that is causing a brief outage of about twenty minutes for a limited subset of customers. Thank you for your patience.
Jan 3, 20:08 EST
Maintenance has been completed and Stores are now coming back online. We appreciate your patience. Please note that the maintenance window extends to 9pm.
Jan 3, 17:07 EST
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 3, 16:30 EST
Zoey customers may experience rolling outages from 5 p.m. to 9 p.m. EST today. This is the result of an ongoing error condition that causes connectivity to stores to fail. Our Engineering team first noticed this issue on December 28 and has been tracking down the root cause, which is difficult to isolate because the failure is random and occurs with no specific trigger. Our team plans to use this time to further investigate this issue.
Jan 3, 16:21 EST