Amazon Web Services Inc. (AWS) has provided a detailed account of the recent service disruption that reportedly affected major websites and inconvenienced many customers in its US East Region.
The biggest impact came from a disruption to the DynamoDB database service, which affected websites such as Tinder, Airbnb, Netflix, and IMDb for six to eight hours on September 20.
AWS stated that a “brief network interruption” occurred in the early morning, an event that normally would have been handled seamlessly. This time, however, a metadata service that tells storage servers their membership (that is, which partitions holding the actual data are assigned to them) became overwhelmed, AWS said in its technical explanation. The loss of service resulted from a combination of several problems.
The DynamoDB outage also affected other services, including Simple Queue Service, EC2 Auto Scaling, and the CloudWatch monitoring service.
The company stated that it was taking steps to prevent such outages in the future.
“We have several actions that we will immediately take to prevent Sunday’s DynamoDB events from happening again. First, we have already significantly increased the metadata service’s capacity. We are also monitoring performance dimensions such as membership size to ensure that we have a complete understanding of their state and can plan for the appropriate capacity. We are reducing the rate at which storage nodes request membership data and increasing the time allowed to process queries. We are also segmenting DynamoDB so that it has multiple instances of the metadata service, each serving a portion of the storage server fleet. This will further contain the impact of software, performance/capacity, or infrastructure failures.”
The company also apologized to customers, noting that, even though DynamoDB has effectively enjoyed 100 percent uptime in the past three years, “we know how critical this service is to customers, both because many use it for mission-critical operations and because AWS services also rely on it. We consider availability the most important aspect of DynamoDB. We will learn as much as possible from this event and avoid another one in the future.”
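The segmentation step AWS describes, in which several metadata service instances each serve only part of the storage server fleet, can be illustrated with a minimal Python sketch. This is not AWS's implementation: the hash-based assignment, the function names, the server IDs, and the instance count are all hypothetical and chosen only to show how segmentation limits the share of the fleet that one failing instance can affect.

```python
import hashlib
from collections import Counter


def assign_instance(server_id: str, num_instances: int) -> int:
    """Map a storage server to one metadata-service instance.

    Hypothetical scheme: a stable hash of the server ID, so the same
    server always talks to the same instance.
    """
    digest = hashlib.sha256(server_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances


def affected_servers(fleet, num_instances, failed_instance):
    """Return the servers that depend on the failed metadata instance."""
    return [
        server
        for server in fleet
        if assign_instance(server, num_instances) == failed_instance
    ]


# Illustrative fleet of 1,000 storage servers (made-up IDs).
fleet = [f"storage-{i:04d}" for i in range(1000)]
NUM_INSTANCES = 4  # assumed number of metadata-service instances

# How many servers each metadata instance would serve.
load = Counter(assign_instance(server, NUM_INSTANCES) for server in fleet)

# With a single instance, an overload touches the whole fleet;
# with several, only the failed instance's share is affected.
impacted_single = affected_servers(fleet, 1, failed_instance=0)
impacted_segmented = affected_servers(fleet, NUM_INSTANCES, failed_instance=0)

print(f"single instance:  {len(impacted_single)} servers affected")
print(f"{NUM_INSTANCES} instances: {len(impacted_segmented)} servers affected")
```

In this toy model, an overloaded instance can only stall the servers mapped to it, which is the containment effect the company says it is aiming for.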
