Falcon Sensor Product Update Creates Outages for Microsoft Users Worldwide: An Analysis
A faulty content update pushed by CrowdStrike caused outages for Microsoft-powered systems worldwide. Learn more about the scope of the impact and measures to mitigate the effects of similar incidents going forward.
- The recent CrowdStrike update and Microsoft outage have significantly impacted many users and industries worldwide, monetarily and otherwise.
- This analysis will explore the reasons behind the widespread impact, the monetary estimates of losses across industries, and potential solutions to prevent such incidents.
The recent Microsoft outage caused by a faulty CrowdStrike update notably impacted individuals and businesses worldwide. The issue stemmed from a logic error triggered by an update to the CrowdStrike Falcon sensor configuration file. Here, we cover the factors behind the impact, estimates of potential losses, and likely solutions to mitigate such issues in the future.
Why Were So Many People Impacted?
The following are the possible reasons for such a massive impact on Microsoft systems.
- The global reach of Microsoft services: Microsoft services, including Office 365, Azure, and Teams, are used in the daily operations of numerous businesses worldwide. The reliance on such services for collaboration, communication, and cloud computing means that any disruption can have a massive impact.
- Dependence on cloud services: The global transition to cloud computing has resulted in the centralization of critical operations and data in the hands of a few service providers. If a market leader like Microsoft faces an outage, the ripple effect is felt globally.
- Interconnected systems: Integrating CrowdStrike’s cybersecurity measures with Microsoft’s infrastructure indicates a greater dependency on interlinked systems. A failure in one can propagate to the other, magnifying the impact.
- Security threats: According to a SecurityScorecard report, 62% of global external vulnerabilities are linked to third-party software, which increases the risk and impact of breaches. The CrowdStrike/Microsoft outage highlights the vulnerability of interconnected and third-party-reliant systems.
Potential Losses From the Outage
The outage affected multiple industries, resulting in significant losses:
- Healthcare: Hospitals and clinics often use cloud services to maintain and access communications, patient records, and operational records. Disruptions delayed critical medical procedures and access to patient data and even forced appointment cancellations, with losses to the sector estimated in the millions.
- Finance: Banks and financial services rely heavily on cloud services for daily transactions and data management. An outage could delay transactions, disrupt trading, and compromise data.
- Manufacturing: Factories and supply chains use cloud systems for planning, logistics, and operations. The incident is likely to have cost hundreds of millions globally.
- Retail: E-commerce platforms and retailers are major cloud services customers, with many using Microsoft systems for inventory management, sales, and customer service. The outage is expected to have affected vital systems, preventing essential operations and temporarily forcing businesses to opt for manual alternatives.
- Education: Schools and universities often rely on cloud-based platforms for remote learning and administration. Students and institutions are likely to have been temporarily inconvenienced during the outage.
- Cost of fixing broken machines: The effort required to restore systems, patch vulnerabilities, and enhance security is likely to be substantial, especially as there is no automated patch for the issue. This includes labor costs, software updates, and potential hardware replacements. The entire exercise is estimated to cost $1 billion to $2 billion worldwide.
Expert Opinions
Jake Moore, global security advisor at ESET
“These outages are increasing in volume due to the sheer increase in online users and traffic. After witnessing the blue screen of death (BSOD), many people are quick to suspect a cyberattack, but this can often add to the confusion. It highlights the importance of these services and the millions of people they serve.
Businesses must test their infrastructure and have multiple fail-safes in place. Regardless of the company’s size, this is typically called a cyber-resilience plan. However, as is often the case, it is impossible to simulate the size and magnitude of the issue in a safe environment without testing the actual network.
The inconvenience caused by the loss of access to services for thousands of people reminds us of our dependence on Big Tech, such as Microsoft, to run our daily lives and businesses. Upgrades and maintenance to systems and networks can unintentionally include minor errors, which can have wide-reaching consequences, as CrowdStrike’s customers have experienced today.
Another aspect of this incident relates to “diversity” in using large-scale IT infrastructure. This applies to critical systems like operating systems (OSes), cybersecurity products, and other globally deployed (scaled) applications. Where diversity is low, a single technical incident, not to mention a security issue, can lead to global-scale outages with subsequent knock-on effects.”
Mike Walters, president and co-founder of Action1
“This type of issue typically occurs due to inadequate testing scenarios, particularly across diverse desktop and server environments. It can also result from a need for proper sandboxing and rollback mechanisms for critical updates that involve kernel-level interactions. It may also be a kernel driver conflict with other software products. It looks like the update has had no or minimal impact on the N-2 version of the program.
The BSOD problem often indicates kernel-level conflicts or bugs. Such bugs are challenging to diagnose and fix because they operate at the deepest levels of the operating system, where detailed interactions with hardware occur.
The ability to run in Safe Mode or Recovery Environment is an important diagnostic step. Safe Mode disables non-essential drivers and services, allowing administrators to isolate the problem more effectively.
To avoid similar problems in the future, organizations should consider rolling out updates, especially those involving security software, in phases. Before full deployment, test updates in a sandbox environment or on a limited subset of machines representative of all operational configurations. Employ a level of system redundancy, especially in critical infrastructure, to isolate and manage fault domains.
The downtime of large organizations such as banks, airlines, and supermarkets can have a significant financial impact, affecting everything from stock prices to operating costs. For example, airline cancellations can result in millions of dollars in lost revenue and customer compensation.
Critical services going offline, such as emergency services, public transport, and media broadcasts, can disrupt societal functions. News broadcasters can go off the air, disrupting the vital flow of information in times of crisis. Emergency services can have their systems down, potentially costing the life of someone needing emergency assistance. Both CrowdStrike and the affected organizations can suffer reputational damage. Customers and partners losing confidence in the reliability of services can have a long-term impact on business.
This incident can be compared to the WannaCry ransomware attack (2017), which demonstrated the catastrophic effects of unpatched vulnerabilities and the importance of rapid patch management. Ironically, the current downtime was caused by software that fights threats like WannaCry.
The CrowdStrike BSOD issue highlights the complexity of maintaining cybersecurity in a globalized, interconnected IT landscape. While immediate remediation is critical, deeper reflection on patch management, incident response strategies, and stakeholder collaboration is essential to building more resilient systems in the future.”
Carlos Aguilar Melchor, chief scientist, cybersecurity at SandboxAQ
“It is essential to have visibility on your software supply chain, especially around critical practices such as cybersecurity, cryptography management, and, of course, testing and updates practices. With this historical outage and other recent software supply-chain catastrophic events, such as SolarWinds and Log4j, we cannot accept with blind trust software updates nor unquestioningly trust cybersecurity or cryptography practices. Every company should immediately implement observability in their software systems to monitor these high-impact platforms and prevent these catastrophes.”
J.J. Guy, CEO, Sevco Security
“This incident is Microsoft’s fault, not CrowdStrike’s fault. Yes, CrowdStrike pushed a kernel-level update that caused widespread blue screens. Yes, that should have been caught during QA, and I’m sure we will get an after-action report that details why release procedures didn’t catch it. But software bugs happen. They are unavoidable, even for top-tier shops like CrowdStrike.
This is a high-impact incident not because there was a blue screen but because it causes repeated blue screens on reboot and [appears as of right now] to require manual, command-line intervention on each box to remediate (and even more complicated if BitLocker is enabled). That is the result of poor resiliency in the Microsoft Windows operating system. Any software causing repeated failures on boot should not be automatically reloaded. We’ve got to stop crucifying CrowdStrike for one bug when the OS’s behavior is causing the repeated, systemic failures.”
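The principle Guy describes, that a component causing repeated boot failures should stop being reloaded automatically, can be illustrated with a small sketch. This is not how Windows or the Falcon sensor works; the `BootGuard` class, its threshold of three failures, and the component name in the usage note are all hypothetical and exist only to show the idea of counting consecutive failed boots and quarantining the offender.

```python
from dataclasses import dataclass, field


@dataclass
class BootGuard:
    """Hypothetical tracker that stops loading a component after repeated failed boots."""

    max_consecutive_failures: int = 3  # assumed threshold, chosen for illustration
    failures: dict = field(default_factory=dict)   # component -> consecutive failed boots
    quarantined: set = field(default_factory=set)  # components that will no longer be loaded

    def record_boot(self, component: str, succeeded: bool) -> None:
        """Record the outcome of one boot attempt attributed to a component."""
        if succeeded:
            # A clean boot resets the consecutive-failure counter.
            self.failures[component] = 0
            return
        count = self.failures.get(component, 0) + 1
        self.failures[component] = count
        if count >= self.max_consecutive_failures:
            # Too many failures in a row: stop reloading this component.
            self.quarantined.add(component)

    def should_load(self, component: str) -> bool:
        """Return False once a component has been quarantined."""
        return component not in self.quarantined
```

In this sketch, calling `record_boot("third_party_driver", succeeded=False)` three times in a row makes `should_load("third_party_driver")` return `False`, so the next boot would proceed without that component instead of crashing again.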
Mitigation Measures
- Redundancy and backups: Organizations must implement robust backup systems and maintain redundancy in cloud services to ensure that operations continue even if primary systems fail. Regular testing of these backups is vital to ensure effectiveness.
- Enhanced security protocols: To mitigate vulnerabilities, users should regularly update and patch their systems, including third-party software. Advanced threat detection and response systems can help to identify and neutralize threats before the damage becomes widespread.
- Third-party risk management: Third-party vendors should be regularly assessed and monitored for compliance with security practices. This can help identify and mitigate risks associated with external software and services.
- Zero-trust architecture: Zero-trust approaches are becoming increasingly important. In these approaches, every user and device is verified before being granted access to network resources. This can reduce the risk of breaches spreading through interconnected systems.
- Balancing agility and security: Agile development practices prioritize rapid deployment and frequent updates, which can lead to security oversights. Ensuring security is integrated into the Agile development process (DevSecOps) can help balance speed with safety.
- Cloud software dependencies: Heavy reliance on a small number of cloud providers can result in cascading vulnerabilities and outages. To reduce such dependency, it is vital to diversify service providers and implement multi-cloud strategies.
- Continuous integration and deployment: If not appropriately managed, continuous integration and deployment practices can introduce new vulnerabilities. Regular security audits and automated testing can help identify and patch issues before they reach production.
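The phased rollout that Mike Walters recommends above, combined with the automated checks mentioned in this list, can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a description of any vendor's deployment pipeline: the function `phased_rollout`, the callbacks `apply_update` and `is_healthy`, and the 1% failure threshold are all hypothetical. The update is applied ring by ring (for example, a small test group, then a pilot group, then the wider fleet), and the rollout halts as soon as a ring's failure rate exceeds the threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class RolloutResult:
    completed: bool                # True if every ring received the update
    updated_hosts: List[str]       # hosts updated before the rollout finished or halted
    halted_at_ring: Optional[int]  # index of the ring that failed the health gate, if any


def phased_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # assumed threshold, chosen for illustration
) -> RolloutResult:
    """Apply an update one ring at a time, stopping if a ring looks unhealthy."""
    updated: List[str] = []
    for index, ring in enumerate(rings):
        for host in ring:
            apply_update(host)
            updated.append(host)
        # Health gate: check the ring just updated before touching the next one.
        failures = sum(1 for host in ring if not is_healthy(host))
        if ring and failures / len(ring) > max_failure_rate:
            return RolloutResult(False, updated, index)
    return RolloutResult(True, updated, None)
```

With rings ordered from smallest to largest, a faulty update in this sketch would stop at the first ring and never reach the rest of the fleet, which limits the number of machines that need manual recovery.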
Takeaways
The Microsoft/CrowdStrike outage highlights vulnerabilities in an increasingly cloud-based, interlinked world. Understanding the reasons behind the incident, analyzing its impact, and implementing adequate security measures can help mitigate future risks. Balancing Agile development and cloud computing with security requirements will be needed to prevent such widespread disruptions in the future.