Case Study: CrowdStrike
Rapid CrowdStrike Response
The 2024 CrowdStrike software error quickly made global headlines as it brought companies to a standstill and caused catastrophic losses. A quick response was key – here’s how Xcelerate technicians recognised and remedied the issue so our customers could get back up and running quickly.
01.
THE CHALLENGE
Direct losses to Fortune 500 companies alone are estimated to have exceeded $5 billion.
On July 19, 2024, a software update triggered a global IT outage. When the prominent American cybersecurity firm CrowdStrike released a faulty update to its Falcon Sensor software for Microsoft Windows, the consequences were devastating. Approximately 8.5 million Windows systems crashed, trapped in an endless “Blue Screen of Death” (BSOD) boot loop, effectively halting business operations worldwide.
The update caused severe disruption to banks, hospitals, airlines, and countless other organisations. Thousands of flights were grounded, and companies faced significant operational downtime, with the healthcare and banking sectors particularly affected. Across the globe, IT teams, both internal and external, grappled with the fallout from an unexpected and catastrophic software failure.
The financial impact of the outage was staggering. Direct losses to Fortune 500 companies alone are estimated to have exceeded $5 billion. CrowdStrike’s stock value plummeted by more than 20%. The company issued an apology and explained that the faulty update was deployed due to a bug in its cloud-based testing process. CrowdStrike outlined measures to prevent such incidents in the future, but the outage underscored the risks inherent in relying on software updates often taken for granted.
Windows 10 and 11 no longer include the F8 boot menu, which in earlier versions let users boot into Safe Mode and other recovery options. Instead, the latest versions of Windows use a “failed boot counter” that tracks unsuccessful boot attempts. The counter resets to zero each time a user successfully reaches the Windows login screen; after three successive failed boot attempts, the system presents a Boot Recovery menu on the next start.
However, CrowdStrike’s faulty update introduced a corrupt file that was only read after the logon screen appeared. Because Windows treated each boot as successful, the failed boot counter kept resetting, even though users could not log in before their machines rebooted within seconds. The result was an endless boot cycle, the infamous BSOD loop, with no opportunity for the system to trigger its recovery options.
Once the issue was identified, the solution required deleting the problematic file from the computer’s System32\drivers\CrowdStrike folder, using an external boot tool to reach the Windows 10 and 11 recovery options. However, implementing this fix manually would have taken hours, or even days, to access and reconfigure every affected machine, leading to extended downtime and significant disruption.
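For illustration only, the core of that manual fix was simply removing the faulty CrowdStrike channel file once the machine had been booted into a recovery environment. The sketch below assumes the widely reported file pattern (C-00000291*.sys), a known drive letter for the affected Windows volume, and an available Python runtime; these details are assumptions, not taken from this case study:

```python
from pathlib import Path

# Assumed location of the affected Windows installation as seen from the
# recovery environment; the drive letter varies from machine to machine.
WINDOWS_ROOT = Path(r"C:\Windows")

# Folder holding CrowdStrike's channel files and the widely reported
# pattern of the faulty file (an assumption, not stated in this case study).
CROWDSTRIKE_DIR = WINDOWS_ROOT / "System32" / "drivers" / "CrowdStrike"
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files() -> int:
    """Delete any matching channel files and return how many were removed."""
    removed = 0
    for channel_file in CROWDSTRIKE_DIR.glob(FAULTY_PATTERN):
        channel_file.unlink()
        removed += 1
    return removed


if __name__ == "__main__":
    count = remove_faulty_channel_files()
    print(f"Removed {count} faulty channel file(s) from {CROWDSTRIKE_DIR}")
```

Once the file is removed, the machine can be restarted and boots normally.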
02.
THE DIAGNOSIS
One of our finance customers, with thousands of machines rendered inoperable by the fault, was facing a catastrophic downtime period of more than 35 hours.
03.
OUR SOLUTION
To resolve the crisis quickly, we created a hybrid solution that enabled us to apply the fix to networked and non-networked machines.
We created an HTTP boot image that automatically ran a script across the private network to delete the offending file. Our method used HTTP boot/PXE to load WinPE (the Windows Preinstallation Environment): each machine booted into the preinstallation environment, the script automatically deleted the faulty CrowdStrike file, and the machine restarted.
While this ran across the networked machines, we could apply the same fix to non-networked machines manually, booting them from a USB key.
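As a rough sketch of what such an unattended remediation script has to do, whether launched from the network-booted WinPE image or from a USB key: under WinPE the installed operating system is an offline volume whose drive letter is not guaranteed, so the script first locates the CrowdStrike driver folder, then deletes the faulty file and reboots. The drive-scanning logic, file pattern, and the wpeutil reboot call below are illustrative assumptions, not Xcelerate’s actual tooling, and the snippet assumes a Python runtime is present in the boot image:

```python
import string
import subprocess
from pathlib import Path

# Widely reported pattern of the faulty channel file (an assumption).
FAULTY_PATTERN = "C-00000291*.sys"


def find_offline_crowdstrike_dirs() -> list[Path]:
    """Scan every drive letter for an offline Windows installation that
    contains a CrowdStrike driver folder; under WinPE the OS volume is
    often not mounted as C:."""
    hits = []
    for letter in string.ascii_uppercase:
        candidate = Path(f"{letter}:\\") / "Windows" / "System32" / "drivers" / "CrowdStrike"
        if candidate.is_dir():
            hits.append(candidate)
    return hits


def remediate() -> int:
    """Delete every matching channel file found and return the count."""
    removed = 0
    for folder in find_offline_crowdstrike_dirs():
        for channel_file in folder.glob(FAULTY_PATTERN):
            channel_file.unlink()
            removed += 1
    return removed


if __name__ == "__main__":
    if remediate():
        # wpeutil is WinPE's built-in command-line utility; 'reboot' restarts
        # the machine so it can boot into the now-repaired Windows installation.
        subprocess.run(["wpeutil", "reboot"], check=False)
```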
It meant a successful and quick recovery for all affected PCs. Instead of one engineer fixing a PC every 10 minutes, we were able to get all the machines back online and working within an hour.
It was speed and accuracy that minimised downtime and losses.
What our client said
“We were initially facing several days of downtime, with 8 staff members needing to be called in over the weekend to apply the fix to 2,800 machines. However, thanks to the automated network boot solution developed by Xcelerate, we were able to have all 2,800 machines fully operational within just 4 hours of the fix being published.”