https://www.cnn.com/2025/10/25/tech/aws-outage-cause
The cloud provider AWS experienced a massive outage on October 20, 2025 that shut down or degraded many of the most popular services and products on the internet, such as Roblox and Snapchat. The outage was so large that many systems and products became unusable. The issue stemmed from a DNS problem in which multiple automated systems tried to update the same DNS entry at the same time, leaving the record empty. That empty record was then carried down to other services like EC2, which caused them to fail, and further down into other workflows and systems that relied on services like EC2. Those failures cascaded into the network load balancers, essentially snowballing into an enormous mess that wrecked many apps and services.

I personally clocked into work to find our error-handling system absolutely flooded with alarms and alerts. It got so bad that my boss told me to just ignore the errors for the day (one of my sprint tasks is to keep alerts down).

I chose this article because of its relevance to testing software and the probable use of git to find out when and where the error was introduced. A tool like git bisect could be used to troubleshoot where the issue entered the workflow, that is, the point at which the DNS record ended up assigned to two systems that could trigger at the same time (see the sketch below).

Furthermore, AWS in its public statement said, "We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further." This statement aligns with the values of Scrum, since it prioritizes openness and respect, which AWS shows its customers by openly documenting the issue that occurred while stressing communication with customers to reassure them during the recovery from the failure. AWS also follows the Agile Manifesto by responding to change over following a plan: its customers needed their services back immediately, so it prioritized getting those services back online rather than documenting the incident first or collaborating with customers on how their systems would be restored. The fact that AWS only released its documentation once the system was fully restored, in under 15 hours, shows a focus on delivering working software. In a way, AWS may also have benefited from focusing on individuals and interactions over processes and tools, since this error surfaced at multiple levels but started from a single point, so no one tool or process could immediately identify the root cause.
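To make the git bisect idea concrete, here is a minimal sketch of how a team might hunt down the commit that introduced a conflicting automation like the one described above. The tag name and check script here are hypothetical examples, not AWS's actual tooling; git bisect simply performs a binary search over the commit history between a known-good and a known-bad commit.

    git bisect start
    git bisect bad HEAD                    # current code exhibits the failure
    git bisect good v1.4.2                 # hypothetical tag of the last release known to be healthy
    git bisect run ./check_dns_record.sh   # hypothetical script that exits non-zero if the DNS record comes back empty
    git bisect reset                       # return to the original branch once the offending commit is identified

Because git bisect run reruns the check at every step, the offending commit can be found in roughly log2(N) test runs instead of inspecting every change by hand.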
From the blog CS@Worcester – Aidan's Cybersection by Aidan Novia and used with permission of the author. All other rights reserved by the author.
