12/26/2023

AWS outage

In this post: the AWS outage causes (internal issues, internal network congestion), the (full) resolution, the lessons from the AWS outage (for AWS and us), and - should you stop using/building with AWS? (No!)

On the morning of December 7, 2021, at 10:30 AM Eastern Time / 7:30 AM Pacific Time, things went wrong in Amazon’s “us-east-1” region: North Virginia.

Over the next three minutes - which is pretty much all of a sudden, from our external point of view - a number of AWS services in the region started having issues, including but not limited to EventBridge (what used to be called CloudWatch Events).

Now, to be clear, the issue was not a complete outage for all of these services. For example, if you already had an EC2 instance running when the problem started, it would likely keep running just fine throughout the entire event. However, what that running instance could do might well have been impacted. For example, an EC2 instance would have had trouble connecting through the no-longer-working VPC Endpoints to the still-working S3 and DynamoDB.

Furthermore, not only did the issue affect all availability zones in us-east-1, but it also broke a number of global services that happen to be homed in this region. This included AWS Account root logins, Single Sign-On (SSO), and the Security Token Service (STS).

AWS outage causes: Internal issues

Internal Network Congestion

On December 7, 2021, at 10:30 AM Eastern Time / 7:30 AM Pacific Time, an automated system in Amazon’s “us-east-1” region (North Virginia) tried to scale up an internal service running on AWS’s private internal network - the one they use to control and monitor all of their Amazon Web Services.

As AWS describes it, this “triggered an unexpected behavior from a large number of clients inside the internal network”.

Basically, AWS unintentionally triggered a Distributed Denial of Service (DDoS) attack on their own internal network. Yikes.

As an analogy, it was as if every single person who lives in a particular city got into their car and drove downtown at the same time. Instant gridlock - and not even any traffic cops around who could try to resolve the issue.

Now, we do know how we should avoid network congestion problems like this: we use exponential backoff and jitter (there’s a small sketch of the idea at the end of this post). Unfortunately, this requires each client to do the right thing, and, as AWS writes in their report, “a latent issue prevented these clients from adequately backing off during this event.”

So, the AWS folks were sort of flying blind, because their internal monitoring had been taken out by the flood. They looked at logs and figured that maybe it was DNS. It’s always DNS, right? (There’s even that haiku about it.)

Well, two hours after the problems started, they had managed to fully recover internal DNS resolution. So, quite surprisingly, it was not DNS this time.

AWS outage (full) resolution

For the next three hours after that, the AWS engineers worked frantically, trying everything. Or, as AWS puts it, “Operators continued working on a set of remediation actions to reduce congestion on the internal network including identifying the top sources of traffic to isolate to dedicated network devices, disabling some heavy network traffic services, and bringing additional networking capacity online.” And although this reportedly did help, it did not solve everything.
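As an aside on the exponential backoff and jitter mentioned above, here is a minimal sketch of the idea in Python. The function and parameter names are made up for illustration - this is not AWS’s internal client code (the report doesn’t show any), just the general pattern: double a delay ceiling after every failure, then sleep a random amount below that ceiling so clients that failed together don’t retry together.

```python
import random
import time

def call_with_backoff(operation, max_attempts=8, base_delay=0.1, max_delay=20.0):
    """Retry `operation` using exponential backoff with "full" jitter.

    Illustrative sketch only: names and defaults are invented, not AWS's.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries - give up and surface the error
            # Double the delay ceiling on each failure, capped at max_delay...
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # ...then sleep a random amount below that ceiling (the jitter),
            # so many clients retrying at once don't hit the server in lockstep.
            time.sleep(random.uniform(0, ceiling))
```

Without the jitter, every client that failed at the same moment would also retry at the same moment, so the retries arrive in synchronized waves - which is more or less the congestion pattern the AWS report describes.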