Amazon's latest network outage has highlighted the dangers of single points of failure in cloud infrastructure. The issue was triggered by a software bug in Amazon Web Services' (AWS) DynamoDB DNS management system, which set off a series of failures that cascaded from system to system within the sprawling network.
The problem began with a race condition in the DNS Enactor, a DynamoDB component that continuously applies updated DNS plans to the service's domain records. One unusually delayed Enactor ended up racing another that was applying newer plans, leaving the system in an inconsistent state that prevented subsequent plan updates from being applied by any DNS Enactor. This ultimately knocked DynamoDB's regional endpoint offline.
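A simplified way to picture this failure mode is a classic check-then-act race: two workers read shared state, and the slower one acts on a stale read. The sketch below is illustrative only, with invented names (apply_plan, dns_table) standing in for AWS's internal systems, not the actual code involved:

```python
import threading
import time

# Illustrative check-then-act race between two "enactors" updating a shared
# DNS plan. All names here are invented for the sketch, not AWS's code.
dns_table = {"endpoint": {"plan_id": 0, "records": ["10.0.0.1"]}}

def apply_plan(enactor, plan_id, records, delay):
    current = dns_table["endpoint"]["plan_id"]   # check: read the active plan
    time.sleep(delay)                            # a slow enactor widens the race window
    if plan_id > current:                        # act: decision based on a possibly stale read
        dns_table["endpoint"] = {"plan_id": plan_id, "records": records}
        print(f"{enactor} applied plan {plan_id}")

# A delayed enactor holding an older plan races a fast enactor with a newer one.
slow = threading.Thread(target=apply_plan, args=("enactor-A", 7, ["10.0.0.7"], 0.5))
fast = threading.Thread(target=apply_plan, args=("enactor-B", 9, ["10.0.0.9"], 0.0))
slow.start(); fast.start(); slow.join(); fast.join()

# The newer plan 9 lands first, then the delayed enactor overwrites it with
# stale plan 7, because its check predates the other enactor's write.
print(dns_table["endpoint"])
```

The same pattern scales down badly in real systems: without coordination or versioned writes, whichever worker finishes last wins, regardless of whose data is newer.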
Systems that depended on DynamoDB through Amazon's US-East-1 regional endpoint began returning connection errors. Both customer traffic and internal AWS services were affected, including EC2 services in the same region.
Amazon engineers revealed that the damage persisted even after DynamoDB was restored, because a "significant backlog of network state propagations needed to be processed" for EC2 in the region. As a result, new instances launched successfully but lacked the network connectivity they needed until propagation caught up.
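For teams consuming EC2, one hedge against this particular failure mode is to treat "launched" and "reachable" as separate states. The sketch below (region, AMI, and instance sizing are placeholders) uses boto3's built-in status waiter to hold an instance out of service until EC2's own reachability checks pass:

```python
import boto3

# Hypothetical guard: don't assume a freshly launched instance has working
# networking; wait for EC2's status checks to report "ok" first.
ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",      # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]

# The built-in waiter polls describe_instance_status until both the system
# and instance reachability checks pass, or it gives up after the timeout.
waiter = ec2.get_waiter("instance_status_ok")
waiter.wait(InstanceIds=[instance_id],
            WaiterConfig={"Delay": 15, "MaxAttempts": 40})

print(f"{instance_id} passed reachability checks and is safe to put in service")
```

During an event like this one, the waiter would simply time out rather than hand traffic to an instance that cannot yet reach the network.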
The event has renewed attention on eliminating single points of failure in cloud design. Ookla, a network intelligence company, noted that regional concentration and a lack of routing flexibility can make it difficult for companies to mitigate the impact of such failures.
"This is not zero failure but contained failure," said Ookla. "The way forward is not to ignore or dismiss these failures but to contain them through multi-region designs, dependency diversity, and disciplined incident readiness."
In a bid to prevent similar failures in the future, Amazon has disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide while it works to fix the race condition and add protections to prevent the application of incorrect DNS plans.