Faulty Operation of Microsoft’s DDoS Defenses Amplified Impact of Azure Outage
Microsoft’s defense response to a distributed denial-of-service (DDoS) attack resulted in service outages that impacted customers worldwide. Learn more about the development and the importance of backup plans and associated best practices.
- An error with Microsoft’s cybersecurity defenses resulted in widespread outages of Azure and other services on July 30.
- The worldwide Azure outage occurred days after the global CrowdStrike outage, which affected approximately 8.5 million Windows machines.
Microsoft’s Azure services experienced a significant outage on July 30, disrupting operations for users worldwide. The company has now revealed that the incident occurred due to a Distributed Denial-of-Service (DDoS) attack, further exacerbated by a fault in the company’s DDoS protection software.
The outage affected many services, including Application Insights, Azure App Services, Azure Log Search Alerts, Azure IoT Central, Azure Policy, the Azure portal, and a subset of Microsoft 365 and Microsoft Purview services. The services were inaccessible for hours, affecting individual and business users.
External Attack and Internal Error
The outage resulted from both internal and external factors. One factor was a sophisticated DDoS attack, which attempted to flood Microsoft networks with a large traffic volume, making the targeted services unavailable. The unexpected increase in usage resulted in Azure Content Delivery Network (CDN) and Azure Front Door (AFD) components performing below acceptable thresholds, leading to intermittent errors, timeouts, and latency spikes.
This was followed by the activation of Microsoft’s internal DDoS protection software, which typically mitigates such attacks. Microsoft deploys multi-layered detection systems at regional data centers to detect attacks close to saturation points while maintaining global mitigation measures at edge nodes.
The company also uses special-purpose security devices for network address translation, firewall, IP filtering, and equal-cost multi-path (ECMP) routing. This network design provides multiple global paths to a service. In this case, however, the defenses malfunctioned. Instead of mitigating the attack, they amplified its impact, extending the duration and widening the reach of the outage.
According to Microsoft’s blog, the Azure DDoS Protection Standard, which safeguards systems from large-scale attacks, encountered an unexpected issue. The error caused resources to be overutilized, which worsened the attack’s effect.
Mitigation Measures and Best Practices
DDoS attacks have always been challenging to prevent, so correct implementation of security measures and proactive monitoring are critical to mitigating their impact. The incident highlights the need for organizations using cloud services to have backup plans and to make sure those plans are correctly implemented. Some major takeaways include:
Testing and validation
- Regular drills: Organizations should ensure that disaster recovery plans and security measures are not just theoretical but are tested and validated regularly in practical settings.
- Attack simulations: To identify potential weaknesses, users should regularly simulate DDoS attacks and other security breaches.
Layered security
- Layered defense: Implement multiple layers of safeguards to mitigate various attacks. This can include intrusion detection systems, firewalls, and DDoS protection services.
- Redundancy and failovers: Redundancy systems and automated failover capabilities are critical to maintaining service continuity during an outage.
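As a rough illustration of automated failover, the sketch below walks a priority-ordered list of endpoints and returns the first one that passes a health check. The function names (pick_endpoint, http_probe) and the example.com URLs are hypothetical, and a production setup would more likely rely on a managed load balancer or DNS-based failover than on hand-written code like this.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Hypothetical health check: True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and HTTP error statuses count as unhealthy.
        return False


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes its health check.

    Returns None when every endpoint fails, so the caller can alert or degrade.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises is treated the same as an unhealthy endpoint.
            continue
    return None


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    candidates = [
        "https://primary.example.com/health",
        "https://secondary.example.com/health",
    ]
    print(pick_endpoint(candidates))
```

Passing the health check in as a parameter keeps the selection logic testable without network access, which also makes it easier to rehearse failover behavior during drills.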
Vendor communication and SLAs
- Service level agreements (SLAs): Organizations should clearly define SLAs with cloud service providers so they understand what level of service and support they can expect.
- Reviews: Management should periodically review and update vendor agreements and security requirements.
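To make an SLA concrete, it can help to translate an uptime percentage into the downtime it actually permits. The short sketch below does that arithmetic. The percentages used are illustrative only and are not taken from any Microsoft agreement, and the helper name allowed_downtime_minutes is hypothetical.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, permitted by an uptime commitment.

    uptime_percent is the committed availability (for example 99.9),
    and period_days is the length of the measurement window.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Illustrative tiers, not actual contract terms.
    for tier in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(tier)
        print(f"{tier}% uptime over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% commitment over a 30-day month allows roughly 43 minutes of downtime, so an outage lasting several hours would far exceed that budget.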
Incident response plan
- Response plan: It is essential to develop a detailed incident response plan outlining the steps to be taken during security incidents.
- Employee training: Organizations should train employees on their roles and responsibilities to ensure a prompt and coordinated response to incidents.
The Microsoft outage highlights the importance of correctly implementing security measures in cloud services. Businesses should be proactive in ensuring their systems are resilient to disruptions. A comprehensive strategy should include regular testing and clear communication with service providers to protect against such incidents going forward.