Incident Reported: 4:45 AM EST, 8/12/2022
Incident Resolved: 2:00 PM EST, 8/12/2022
Symptoms of Problem
- No inbound or outbound calls
- Applications such as MaXUC, CommPortal, etc. were unavailable
Root Cause of Problem
- Overview: the uninterruptible power supply (UPS) batteries failed to properly manage power at the primary carrier’s data center
- Detailed description and aftereffects of the root cause
- The UPS failed to properly manage the supply of power from the external source to the servers, causing the servers to lose power and fail.
- The power issue was resolved relatively quickly (at 5:55 AM, roughly an hour after the incident was reported); however, when the servers came back online they were out of sync because of the improper shutdown, which led to continued call failures.
- Once this issue was identified, the CTS Cloud carrier partner reached out to Metaswitch (the hardware manufacturer, also a division of Microsoft) for help. In the end, Metaswitch/Microsoft had to apply a software fix for this failure.
- Finally, because the servers had been down for a few hours, they had to work through the millions of requests that were finally reaching them, leading to a lag of a few hours between the server fix and consistent end-user reliability.
- Incident reported: 4:45 AM EST, 8/12/2022
- Power issue solved: 5:55 AM EST, 8/12/2022
- Metaswitch/Microsoft software fix: 9:47 AM EST, 8/12/2022
- Working through backlog of requests: 9:47 AM EST – 2:00 PM EST, 8/12/2022
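The length of each phase follows directly from the timestamps above. The sketch below is only an illustration of that arithmetic; the dictionary keys and helper names are made up for this example and are not part of any CTS or carrier tooling.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all in the same time zone).
times = {
    "reported": datetime(2022, 8, 12, 4, 45),
    "power_restored": datetime(2022, 8, 12, 5, 55),
    "software_fix": datetime(2022, 8, 12, 9, 47),
    "resolved": datetime(2022, 8, 12, 14, 0),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    return int((times[end] - times[start]).total_seconds() // 60)


def as_hours_minutes(minutes: int) -> str:
    """Format a minute count as e.g. '1h 10m'."""
    return f"{minutes // 60}h {minutes % 60:02d}m"


# Hypothetical phase labels, chosen for this example only.
durations = {
    "power outage": minutes_between("reported", "power_restored"),
    "servers out of sync": minutes_between("power_restored", "software_fix"),
    "request backlog": minutes_between("software_fix", "resolved"),
    "total incident": minutes_between("reported", "resolved"),
}

for phase, minutes in durations.items():
    print(f"{phase}: {as_hours_minutes(minutes)}")
```

The three phases add up to the full incident window, from the first report at 4:45 AM to resolution at 2:00 PM.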
Given there was a chain reaction of issues at the carrier, there are several solutions being put in place to protect against this issue occurring again:
- Power: the power issue was repaired, and the carrier is implementing automatic alerts and notifications about the power system's status
- Metaswitch server “out of sync”
- Metaswitch/Microsoft has already applied a software fix for this specific issue
- Longer term, our partner is distributing this AZ data center to other locations around the U.S. (parts of it are already distributed, but some are still centralized in AZ)
- Reliance on carrier: CTS is now able to forward calls directly from an upstream carrier. Call forwarding already exists within the Metaswitch, but should the Metaswitch infrastructure be down, we can have an upstream carrier forward calls away from the Metaswitch directly to another number.
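The upstream call-forwarding fallback described in the last item can be pictured as a simple routing decision. This is a rough sketch under stated assumptions, not the carrier's actual configuration: the function, the field names, and the phone numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RoutingDecision:
    handled_by: str   # which layer routes the call
    destination: str  # number the call is delivered to


def route_inbound_call(
    dialed_number: str,
    metaswitch_up: bool,
    failover_numbers: Dict[str, str],
) -> RoutingDecision:
    """Pick a route for an inbound call (illustrative logic only)."""
    if metaswitch_up:
        # Normal case: the Metaswitch handles the call, including any
        # call forwarding configured inside it.
        return RoutingDecision("metaswitch", dialed_number)
    backup = failover_numbers.get(dialed_number)
    if backup is None:
        # No fallback number on file for this line.
        return RoutingDecision("unroutable", dialed_number)
    # Metaswitch is down: the upstream carrier forwards the call
    # directly to the alternate number, bypassing the Metaswitch.
    return RoutingDecision("upstream_carrier", backup)


# Hypothetical fallback table (fictional 555-01xx numbers).
failover = {"602-555-0100": "480-555-0123"}
print(route_inbound_call("602-555-0100", metaswitch_up=False, failover_numbers=failover))
```

The point of the design is that the fallback decision is made upstream of the Metaswitch, so it still works when the Metaswitch infrastructure itself is unavailable.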
On top of these specific solutions, CTS is developing an emergency response plan that will identify the top known issues and the solution for each should it occur (we have done much of this over the past few years, but the plan will formalize and systematize it). Furthermore, CTS is identifying other hosted solutions should our current partners continue to have these issues.