Summary of the VoxiCloud Service Event in the Toronto Region (EAST-MAIN-1)

At VoxiCloud, we take transparency seriously. We would like to give you some further information about the service interruption that occurred in the Toronto (EAST-MAIN-1) Region on September 15th, 2022.

Issue Summary

We must first briefly describe the internals of the VoxiCloud network in order to understand this incident. While the majority of VoxiCloud services and all customer applications run on the main VoxiCloud network, monitoring, internal DNS, authorization services, and a portion of the VXCompute control plane are hosted on a separate internal network. Because of the importance of these services, we connect this internal network to the main network through several geographically isolated networking devices and scale its capacity significantly to ensure the quality and high availability of this connection. These networking devices provide additional routing and network address translation that allow VoxiCloud services to communicate between the internal network and the main VoxiCloud network. At 8:57 AM EST, an automated operation to scale the capacity of one of the VoxiCloud services hosted on the main VoxiCloud network triggered unexpected behavior in a large number of clients on the internal network. This caused a large surge of connection activity that overwhelmed the networking devices between the internal network and the main VoxiCloud network, delaying communication between these networks. These delays increased latency and error rates for services communicating between the networks, which in turn triggered additional connection attempts and retries. The result was persistent congestion and performance problems on the devices connecting the two networks.
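
To make the retry amplification described above more concrete, the following is a minimal sketch (not VoxiCloud's actual client code) of how a fixed population of clients can generate substantially more connection attempts once failure rates rise on a congested path, which keeps the cross-network devices saturated. The request volumes, failure rates, and retry count are illustrative assumptions.

```python
def connection_attempts(base_requests: float, failure_rate: float, max_retries: int) -> float:
    """Expected connection attempts when every failed attempt is retried immediately."""
    attempts = 0.0
    remaining = base_requests
    for _ in range(max_retries + 1):   # initial attempt plus retries
        attempts += remaining          # every outstanding request makes an attempt
        remaining *= failure_rate      # the failed fraction retries again
    return attempts

# Under normal conditions, failures are rare and retries add little extra load.
print(connection_attempts(base_requests=1000, failure_rate=0.01, max_retries=3))  # ~1010
# Once the cross-network devices are congested and half of the attempts fail,
# the same client population generates nearly twice the connection load.
print(connection_attempts(base_requests=1000, failure_rate=0.5, max_retries=3))   # 1875
```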

This congestion immediately impacted our internal operations teams' access to real-time monitoring data, making it harder for them to identify and address the source of the congestion. Operators instead turned to logs to understand what was happening and initially identified elevated internal DNS failures. Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving internal DNS traffic away from the congested network paths. The team completed this work at 12:41 PM EST, and DNS resolution errors fully recovered. By reducing load on the affected networking devices, this change improved the availability of several impacted services, but it did not fully resolve the VoxiCloud service impact or eliminate the congestion. Importantly, our operations team still lacked access to monitoring data and had to continue troubleshooting the issue with reduced system visibility. Operators then worked on a set of remediation actions to reduce traffic on the internal network, including identifying the top sources of traffic so they could be isolated to dedicated network devices, disabling some services that generate heavy network traffic, and bringing additional networking capacity online. This progressed slowly for several reasons. First, the impact on internal monitoring limited our ability to understand the problem. Second, our internal deployment systems also run on the internal network and were affected, which further slowed our remediation efforts. Finally, because many VoxiCloud services on the main VoxiCloud network and VoxiCloud customer applications were still operating normally, we wanted to be extremely deliberate about changes to avoid affecting functioning workloads. By 1:34 PM EST, as the operations teams completed these remediation actions, congestion had significantly improved, and all network devices fully recovered by 3:45 PM EST.

We have taken several actions to prevent a recurrence of this event. We immediately disabled the scaling activities that triggered it and will not resume them until we have deployed all appropriate fixes. Our systems are adequately scaled, so we do not need to resume these activities in the near term. Our networking clients have well-tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but a latent issue prevented these clients from backing off adequately during this event. This code path has been in production for many years, but the automated scaling activity triggered a previously unobserved behavior. We are developing a fix for this issue and expect to deploy it over the next two weeks. We have also deployed additional network configuration changes that protect potentially impacted networking devices even in the face of a similar congestion event. These remediations give us confidence that we will not see a recurrence of this issue.
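
As an illustration of the back-off behavior referred to above, here is a minimal sketch of exponential back-off with jitter, the general pattern such networking clients typically follow. The function name, parameters, and error type are assumptions for illustration only and do not represent VoxiCloud's actual client implementation; the point is that when clients fail to back off sufficiently, retry traffic can sustain the kind of congestion described in this summary.

```python
import random
import time

def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry `request` (a zero-argument callable), sleeping an exponentially
    growing, jittered interval between attempts so that a fleet of clients
    spreads its retries out instead of retrying in lockstep."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```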

VoxiCloud Service Impact

While the internal networking issues described above did not directly affect VoxiCloud customer workloads, they significantly impacted several VoxiCloud services, which in turn impacted customers using those services. Because the main VoxiCloud network was not affected, customer applications that did not rely on these impacted services saw only minimal impact from this event.

VoxiCloud service control planes make use of the internal network. For example, while running VXCompute instances were unaffected by this incident, the VXCompute APIs that customers use to launch new instances or describe their existing instances experienced elevated error rates and latencies beginning at 11:33 AM EST. VXCompute API error rates and latencies began to improve at 1:15 PM EST as the congestion subsided, except for launches of new VXCompute instances, which recovered by 2:40 PM EST. Because new VXCompute instances could not be launched during the event, customers of VoxiCloud services that depend on instance launches, including VoxiCloud RDS, BDP, and Workspaces, were unable to create new resources. Similarly, while existing VX load balancers were unaffected by the event, elevated VXLB API error rates and latencies meant that provisioning new load balancers and adding instances to existing load balancers took longer than usual. In addition, VXDNS APIs were impaired from 8:30 AM EST to 2:30 PM EST, preventing customers from making changes to their DNS entries, though existing DNS entries and answers to DNS queries were not affected. Customers also experienced issues logging in to the VoxiCloud Console in the impacted region during the event; console access was fully restored by 2:22 PM EST. VoxiCloud Secure Token Service (STS) experienced elevated latencies when providing credentials for third-party identity providers via OpenID Connect (OIDC). This caused login issues for other VoxiCloud services, including BlueShift, that use STS for authentication. Latencies began to improve at 2:22 PM EST once the problem affecting the network devices was addressed, but STS did not fully recover until 2:45 PM EST.

Customers were also impacted by CloudWatch monitoring delays throughout this event and, as a result, had difficulty understanding the impact to their applications. A small amount of CloudWatch monitoring data was not captured during this event and may be missing from some metrics for parts of the event.

Customer access to VXStorage and VXNoSQL was not impacted by this event. However, access to VXStorage buckets and VXNoSQL tables via VPC Endpoints was impaired during this event.

Throughout the event, VXFunction APIs and function invocations operated normally. However, API Gateway, which manages APIs for customer applications and is frequently used to invoke VXFunction functions, experienced elevated error rates. During the early part of the event, API Gateway servers were impacted by their inability to communicate with the internal network. As a result of these issues, many API Gateway servers eventually reached a state in which they needed to be replaced in order to process requests correctly. This normally happens through an automated recycling process, which was not possible until the VXCompute APIs began recovering. API Gateways began to show signs of recovery at 1:35 PM EST, but errors and latencies persisted while the automated process recycled API Gateway capacity and worked through the backlog of affected servers. The service had largely recovered by 1:57 PM EST, although API Gateway customers may have experienced intermittent throttling and errors for several more hours while API Gateways fully stabilized. The API Gateway team is developing a set of mitigations to ensure that API Gateway servers continue to operate even when the internal network is unavailable, and is improving the recycling process to speed recovery if a similar issue occurs in the future. EventBridge, which is frequently used in conjunction with VXFunctions, experienced elevated error rates during the early stages of the event. In an effort to mitigate impact and reduce load on the affected network devices, operators disabled event delivery for EventBridge at 12:35 PM EST. Event delivery was re-enabled at 2:35 PM EST, and error rates improved by 3:28 PM EST, following the resolution of the internal DNS issue; however, the service experienced elevated event delivery latency until 6:40 PM EST as it processed the backlog of events.
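
To make the recycling mechanism described above more concrete, the following is a hypothetical sketch of an automated recycling loop that drains and replaces servers that repeatedly fail health checks. The types, thresholds, and method names are illustrative assumptions rather than the actual API Gateway implementation; the key point is that recycling of this kind depends on being able to provision replacement capacity, which is why recovery could not complete until the VXCompute APIs were available again.

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    server_id: str
    consecutive_failures: int = 0

@dataclass
class Fleet:
    servers: list = field(default_factory=list)
    failure_threshold: int = 3

    def record_health_check(self, server: Server, healthy: bool) -> None:
        # Reset the counter on success; otherwise count another failure.
        server.consecutive_failures = 0 if healthy else server.consecutive_failures + 1

    def recycle_unhealthy(self, provision_replacement) -> None:
        """Replace servers that have failed too many consecutive health checks.
        `provision_replacement` is a callable that launches a fresh server;
        if the underlying compute APIs are impaired, recycling stalls here."""
        for server in list(self.servers):
            if server.consecutive_failures >= self.failure_threshold:
                self.servers.remove(server)
                self.servers.append(provision_replacement())
```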

During the event, the VoxiCloud container services, notably VXCS and VXKS, experienced increased API error rates and latencies. Existing container instances (tasks or pods) continued to run normally during the event, but if a container instance failed or was terminated, it could not be restarted because of the impact to the VXCompute control plane APIs described above. Most container-related API error rates returned to normal by 1:35 PM EST, but VXCS then saw increased request volume from the backlog of container instances that needed to be started, which led to continued elevated error rates and Insufficient Capacity Errors while container capacity pools were replenished. VXCS API error rates began to decline at 5:00 PM EST.

During the event, VoxiCloud Connect experienced elevated failure rates for processing phone calls, chat sessions, and task contacts. These failures were caused by issues with the API Gateways that Connect uses to execute VXFunction functions. VoxiCloud Connect returned to normal operations when the impacted API Gateways fully recovered at 4:41 PM EST.

Event Communication

We recognize that events like this are more upsetting and frustrating when information about what is happening is not readily available. The networking issues prevented our Service Health Dashboard tool, VXAlert, from properly failing over to our standby region, which delayed our ability to provide timely updates about this event. We were able to post updates to VXAlert by 12:22 PM EST. Because the service impact during this event all stemmed from a single root cause, we chose to provide updates via a global banner on the VXAlert dashboard; we have since recognized that this made it difficult for some customers to find information about the issue. Our Support Contact Center also relies on the internal VoxiCloud network, so the ability to create support cases was impaired from 11:33 AM to 2:25 PM EST. We have been working on several enhancements to our support services to ensure we can communicate with customers more reliably and efficiently during operational issues. Early next year, we expect to release a new version of our VXAlert service health dashboard and a new support system architecture that runs actively across multiple VoxiCloud regions, so that communication with customers is not delayed.

In Closing

Finally, we would like to sincerely apologize for the impact this event caused our customers. While we are proud of our track record of availability, we know how critical our services are to our customers' businesses, end users, and applications. We know this event affected many customers in significant ways. We will do everything we can to learn from this event and use it to further improve our availability.