Drawing lessons from CrowdStrike: Best practices for IT resilience
On July 19, 2024, a software update for CrowdStrike's Falcon sensor triggered an unprecedented worldwide crash of Windows computers. Countless systems used for essential business functions displayed a blue screen of death (BSOD), causing severe disruptions at airlines, banks, hospitals and many other organizations. Our team quickly developed an effective and efficient solution for baramundi customers. As a result, we were able to help two of our major customers get a combined 26,000 end devices up and running again the same day. Below are steps that any organization can take to improve IT resilience and speed recovery from major incidents in the future. With the right preventive measures, the impact of such incidents can be significantly reduced.
Short & sweet
- Use your PXE infrastructure to boot affected computers into WinPE so you can edit the system partition and implement needed fixes.
- Manage BitLocker Recovery Keys for secure rapid access to decrypt systems in the WinPE phase using tools such as the baramundi Defense Control Module.
- Carry out thorough tests before deployment to avoid negative effects on systems.
- Create, communicate and practice a recovery plan including preventive measures (e.g. backups, monitoring) and incident response actions in the event of an emergency.
Secure your ability to act
To act and recover as quickly as possible, we recommend preparing the following measures:
- If possible, use a PXE infrastructure to boot affected clients into the WinPE environment.
- Customize the WinPE boot image by following the steps included in the fix we created for the CrowdStrike incident.
- To enable a faster recovery, manage BitLocker Recovery Keys for secure rapid access using tools like baramundi Defense Control.
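As an illustration of the kind of fix that can run once a client has booted into WinPE and its system partition has been unlocked, the sketch below follows CrowdStrike's publicly documented workaround for the July 2024 incident: remove the channel files matching C-00000291*.sys from the CrowdStrike driver directory. It is not the baramundi fix itself. The function names and the dry-run flag are hypothetical, and Python is used only to show the logic; a stock WinPE image does not include Python, so a real fix would more likely be a batch or PowerShell script built into the boot image.

```python
from pathlib import Path

# File pattern taken from CrowdStrike's public workaround for the July 2024 incident.
FAULTY_PATTERN = "C-00000291*.sys"


def find_faulty_channel_files(system_root: Path) -> list[Path]:
    """Return files matching the faulty pattern on an offline Windows volume."""
    driver_dir = system_root / "Windows" / "System32" / "drivers" / "CrowdStrike"
    if not driver_dir.is_dir():
        # Volume not mounted, still locked, or the sensor is not installed.
        return []
    return sorted(driver_dir.glob(FAULTY_PATTERN))


def remediate(system_root: Path, dry_run: bool = True) -> list[Path]:
    """List the matching files; delete them only when dry_run is False."""
    matches = find_faulty_channel_files(system_root)
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches
```

Under WinPE the offline Windows volume is not always mounted as C:, which is why the root path is a parameter. Calling remediate(Path("C:/")) with the default dry_run=True first shows what would be deleted before anything is changed.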
Plan, communicate and test incident responses
To strengthen company resilience, we recommend that you prepare an incident response plan with detailed instructions on what to do in the event of a crisis. Test it regularly and revise it as needed to address potential weaknesses and ensure that everyone involved knows what to do.
In addition, educate employees about the importance of IT security and why certain policies and procedures exist. Explain how the right practices can prevent incidents and help each department respond effectively in a crisis. Practical steps include setting up alternative communication channels, printing emergency contact lists and keeping backup equipment on hand.
Regular security checks
Carry out regular security checks so you can identify and fix potential IT vulnerabilities before they become a problem. For example, penetration tests (pentests) conducted by a white-hat hacker can uncover previously undetected vulnerabilities. We also recommend carrying out regular vulnerability scans to examine endpoints for known security gaps using tools such as the baramundi Vulnerability Scanner.
Always have a Plan B
Despite all precautionary measures, a failure can occur at any point. In such cases, it is important to be able to react quickly. Make sure you have sufficient resources to respond to an incident. For example, your Plan B could include setting up a dedicated security incident response team that is trained to respond to data breaches. It also makes sense to have a well-documented disaster recovery plan that outlines steps to restore data and services after an incident. Remember, prevention is key to minimizing IT risks and ensuring smooth operations. The baramundi Management Suite integrates tools for efficiently configuring, testing and automating distribution of Windows, Microsoft and 3rd-party application updates.
Finally, even if you already have those and other prevention and recovery measures in place, the CrowdStrike outage should prompt you to revisit them. Seriously consider "what-if?" IT scenarios, responses and tools that can reinforce or expand on existing measures. For example, vulnerability assessments should also examine potential single points of failure, less secure connections with suppliers, and other factors. That will give you the opportunity to address them proactively and stay ahead of new and potentially more stringent cybersecurity and business continuity requirements from regulators and insurance underwriters.
Secure and reliable update management
Regular software updates help organizations of all sizes reduce security risks. However, updates can also cause problems ranging from unexpected incompatibilities to severe crashes like the one triggered by the CrowdStrike update. With a combination of preparation, planning and the right tools, you can minimize the impact of faulty updates and recover more quickly when problems occur.
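One common way to limit the impact of a faulty update is a staged (ring-based) rollout, where an update reaches a small test group first and is only promoted when few installs fail. The sketch below shows the idea in Python. It is an assumption-laden illustration, not the baramundi Management Suite API: the ring names, fleet shares, thresholds and function names are all hypothetical.

```python
import hashlib

# Hypothetical ring layout: (name, cumulative share of the fleet).
RINGS = [("canary", 0.02), ("pilot", 0.20), ("broad", 1.00)]


def ring_for(device_id: str) -> str:
    """Map a device to a ring deterministically, so it stays in the same ring."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    for name, upper_bound in RINGS:
        if fraction < upper_bound:
            return name
    return RINGS[-1][0]


def may_promote(installed: int, failed: int,
                min_installed: int = 50,
                max_failure_rate: float = 0.01) -> bool:
    """Allow the next ring only after enough installs with a low failure rate."""
    if installed < min_installed:
        return False
    return failed / installed <= max_failure_rate
```

Hashing the device ID keeps ring membership stable across updates without storing an assignment table. The promotion gate is deliberately simple; in practice the thresholds and the definition of a failed install would come from your own testing and monitoring.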