Expert Reveals How ‘Biggest IT Outage in History’ Actually Happened: ScienceAlert

The world as we know it is increasingly dependent on digital connectivity that, for the most part, operates silently and invisibly in the background. So how could a single software update take down half the internet?

The global IT outage on July 19 is a painful reminder of our vulnerability to technological failures.

The outage was caused by a single faulty software update from cybersecurity firm CrowdStrike and had disastrous consequences for airlines, media, banks and retailers around the world, particularly those using Microsoft Windows operating systems.

This incident, described as the “largest IT outage in history,” is a reminder of the vast web of IT interconnections that sustain our digital infrastructure – and the potentially far-reaching consequences if something goes wrong.

What started as airport delays has morphed into widespread flight cancellations. The failure of aviation systems is not only upsetting flight schedules but also affecting global supply chains that rely on air cargo, demonstrating the multifaceted nature of modern IT ecosystems.

Meanwhile, broadcasts on numerous television and radio stations were interrupted and operations at supermarkets and banks came to a standstill.

Preliminary analysis indicates that the chaos was caused by an update to CrowdStrike’s Falcon Sensor security software that was deployed to machines running Microsoft Windows.

Employees at companies using CrowdStrike were met with the “blue screen of death” (an error message screen indicating the system had crashed) when they tried to log in.

The outage not only exposed the hidden web of dependencies that sustain our digital society and economy, but also revealed the geopolitical dimensions of these dependencies.

Countries with strong ties to Microsoft and CrowdStrike were hit hardest, but companies in countries like China, with their relatively isolated and controlled IT infrastructures, appear to have been less affected.

Given the increasing geopolitical tensions of recent years, China and a growing number of other countries have actively developed their own cybersecurity measures and digital infrastructures, which may have mitigated the impact of this incident.

China’s focus on using indigenous technology and reducing dependence on foreign technology may also have contributed to the reduced impact on its systems.

The incident is a stark reminder that technological dependencies can create geopolitical vulnerabilities. Government agencies must increasingly consider not only the economic, but also the strategic and geopolitical implications of their IT alliances.

Recovery and implications

The way affected sectors deal with this crisis reflects both the strength and vulnerability of their own security and disaster recovery strategies.

The primary problem has been identified and reportedly fixed. The slow recovery process ahead will demonstrate the significant challenges of restoring service continuity within our complex, deeply interconnected digital ecosystems.

It is particularly surprising that, despite many lessons learned from the past, such as TSB’s 2018 IT migration disaster that affected millions of the UK bank’s customers, there has been no phased software rollout.

The lack of this step, a basic but crucial practice in IT management, exposed the vulnerability of systems that many considered robust.

It also raises serious questions about the resilience of Windows operating systems and the cybersecurity measures CrowdStrike takes to protect them.

Furthermore, the episode highlighted the strategic risks of relying on a single source of technology. This global outage demonstrated the importance of having diverse technology alliances to enhance national security and economic stability, while also raising concerns about the possibility that hostile states could exploit such vulnerabilities.

This incident will increase the urgency of international collaborations and policy interventions in the field of cybersecurity.

As services begin to stabilize and resume, this outage should be a wake-up call for IT professionals, business leaders, and policymakers.

The urgent need to reassess and even revise existing cybersecurity strategies and IT management practices is clear. Improving the resilience of the system to withstand large-scale disruptions must be a priority.

The global IT outage is a timely reminder and a pivotal moment for discussions about digital resilience and the future of technology management at the enterprise, infrastructure, and policy levels.

What about AI?

There’s one more thing we don’t know the answer to yet: If a single software glitch can bring down airlines, banks, retailers, media companies and more around the world, are our systems ready for AI?

Perhaps we should invest more in improving the reliability and methodology of software, rather than releasing chatbots too quickly. An unregulated AI industry is a recipe for disaster, especially in a world of growing geopolitical tensions.

While it is essential to embrace emerging technologies like AI and blockchain, we also need to get the basics right.

Cybersecurity operators must ensure that fundamental IT management and maintenance practices are strong, reliable, and can handle everything from a cyberattack to a simple software update.

The lessons learned from this incident will undoubtedly influence future IT infrastructure development and crisis management strategies.

Feng Li, Chair of Information Management, Associate Dean for Research and Innovation, Bayes Business School, City, University of London

This article is republished from The Conversation under a Creative Commons license. Read the original article.