Understanding the CrowdStrike Outage of July 19, 2024, and How AI Improves IT Resiliency

By
Vickie J. Lin
July 26, 2024

Overview of the Incident

On July 19, 2024, a global IT outage was triggered by a faulty update from CrowdStrike, a leading cybersecurity firm. The update affected Windows systems, leading to widespread disruptions across multiple sectors, including airlines, healthcare, banking, and public services. The update caused system crashes, commonly referred to as the "Blue Screen of Death" (BSOD), due to a logic error in the update’s configuration file for the Falcon sensor version 7.11 and above (SC Media) (Blackpoint Cyber) (KVIA).

Technical Cause

The root cause of the outage was a configuration file update designed to target malicious activities. This update inadvertently triggered a logic error that resulted in system instability. The affected systems were those running the Falcon sensor for Windows, which downloaded the updated configuration during a specific timeframe. The logic error led to crashes, severely impacting operations globally (SC Media) (Blackpoint Cyber).

Incident Remediation - Steps Taken by CrowdStrike

CrowdStrike's response to the incident involved several key steps:

  1. Identification and Communication: CrowdStrike quickly identified the faulty update and communicated the issue to its customers, clarifying that it was not a cybersecurity attack but a technical error.
  2. Deployment of Fixes: Engineers deployed a fix, which required systems to download a reverted channel file. This fix, however, required manual intervention in many cases.
  3. Manual Remediation Steps: For systems that continued to crash, CrowdStrike provided detailed remediation steps, including booting into Safe Mode or the Windows Recovery Environment, navigating to the specified directory, and deleting the problematic file (Blackpoint Cyber) (Illini Tech Services).
  4. Support and Coordination: CrowdStrike worked closely with affected organizations to ensure the remediation steps were followed correctly and systems were restored to operational status as quickly as possible (KVIA) (Illini Tech Services).
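
The manual remediation in step 3 amounts to locating a file by name pattern in a known directory and deleting it. The sketch below illustrates only that logic; it is not CrowdStrike's tooling. The function names are hypothetical, and the directory and file pattern are deliberately left as parameters because they must come from the vendor's published guidance. It defaults to a dry run so nothing is deleted until the matches have been reviewed.

```python
from pathlib import Path


def find_matching_files(directory: Path, pattern: str) -> list[Path]:
    """Return the files in `directory` whose names match the glob `pattern`."""
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.glob(pattern) if p.is_file())


def remove_files(files: list[Path], dry_run: bool = True) -> list[Path]:
    """Delete each file unless `dry_run` is set; return the paths acted on."""
    handled = []
    for path in files:
        if not dry_run:
            path.unlink()
        handled.append(path)
    return handled


if __name__ == "__main__":
    # Placeholder values (assumptions): substitute the directory and the
    # file pattern given in the vendor's remediation guidance.
    target_dir = Path("path/from/vendor/guidance")
    matches = find_matching_files(target_dir, "pattern-from-guidance*")
    for path in remove_files(matches, dry_run=True):
        print(f"would delete: {path}")
```

In practice this step was carried out by hand from Safe Mode or the Windows Recovery Environment, where a scripting runtime is usually not available, so the sketch is better read as a description of the procedure than as a recovery tool.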

How Building Your Organization’s Overall IT Operations (ITOps) Maturity Could Prevent Unplanned Outages

ITOps, DevOps, and SRE teams often function in silos: they use different tools, have varied levels of expertise, and maintain separate incident-response workflows. The constant evolution of IT environments has sharply increased complexity. This, coupled with the siloed nature of these teams, results in information gaps, unrealistic expectations, and significant stress for operators, ultimately harming business outcomes for many organizations. At the same time, companies across all industries increasingly depend on technology to stay competitive, and this dependence brings additional challenges, complexity, and risks of incidents and service disruptions.

Enterprise IT professionals globally report that incidents and disruptions, like the CrowdStrike incident, increase every year. They attribute this to the lack of a unified view that maps service elements to their underlying infrastructure, and to a lack of business context for understanding the full impact of incidents.

An ecosystem of AI-powered ITOps tools can help reduce risks of unplanned outages and can potentially reduce the impact of cascading network failures. Below, we highlight some examples of key areas where AIOps tools can mitigate some of these challenges:

Proactive IT Change Management
  • Risk Assessments for IT Change Tickets: Before deploying updates, ITOps tools like Accrete’s Nebula ITSM (IT Service Management) platform perform detailed risk assessments and analysis to understand potential risks.
  • Advanced Knowledge Engines: Knowledge Engines discern patterns from both an organization’s historical data and CMDB, as well as from continuous data intake over time. Predictive models and simulations could identify and detect the likelihood of issues arising in the change management process, with Nebula ITSM’s models showing likelihood of failure far earlier in the process than other competing ITOps or observability tools in the market.
  • Approval Workflows: Implementing robust approval workflows ensures that any critical update undergoes thorough scrutiny and testing in a controlled environment before being pushed out for a full-scale deployment. This can drastically reduce the likelihood of an unplanned outage.
  • Automated Change Impact Analysis: The chat feature integrates with Nebula ITSM’s AI capabilities to provide automated change impact analysis. IT teams can query the chat for insights on how proposed changes might affect the system, based on historical data and predictive analytics. For example, before deploying a change, a team member can ask the chat, "What are the potential impacts of implementing this update on dependent configuration items?" Nebula ITSM’s chat feature responds with a detailed analysis, highlighting possible risks and suggesting mitigation strategies.

Nebula ITSM platform's AI chat agent feature in action.
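
To make the risk-assessment and approval-workflow ideas above concrete, here is a minimal sketch of how a change ticket might be scored and routed. It is not how Nebula ITSM works internally: the `ChangeTicket` fields, the weights, and the thresholds are all illustrative assumptions, and a production system would learn them from historical change data rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class ChangeTicket:
    ticket_id: str
    affected_ci_count: int    # configuration items touched by the change
    tested_in_staging: bool   # validated in a controlled environment?
    past_failure_rate: float  # share of similar past changes that failed (0..1)
    has_rollback_plan: bool


def risk_score(ticket: ChangeTicket) -> float:
    """Heuristic score in [0, 1]; the weights are illustrative assumptions."""
    blast_radius = min(ticket.affected_ci_count / 100, 1.0)
    score = 0.4 * ticket.past_failure_rate + 0.3 * blast_radius
    if not ticket.tested_in_staging:
        score += 0.2
    if not ticket.has_rollback_plan:
        score += 0.1
    return round(min(score, 1.0), 3)


def approval_route(ticket: ChangeTicket) -> str:
    """Map the score to an approval path; the thresholds are assumptions."""
    score = risk_score(ticket)
    if score >= 0.6:
        return "change-advisory-board"
    if score >= 0.3:
        return "peer-review"
    return "auto-approve"


if __name__ == "__main__":
    wide_untested = ChangeTicket("CHG-1", 500, False, 0.5, False)
    print(wide_untested.ticket_id, risk_score(wide_untested), approval_route(wide_untested))
```

The design point is that a change touching many configuration items without staging validation or a rollback plan is sent to the strictest review path, while small, well-tested changes are not slowed down.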
Incident Management and Remediation
  • Root Cause Analysis (RCA): Advanced analytics tools and correlation models within ITSM can quickly perform AI-powered root cause analysis, identifying the problematic update that caused the issue and any additional relevant change tickets, thus minimizing unplanned downtime and lost business productivity.
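
One simple building block behind this kind of root cause analysis is correlating an incident with the change tickets deployed shortly before it on the same configuration item. The sketch below shows only that time-window heuristic, not the platform's AI models. The `Change` and `Incident` record shapes, the four-hour window, and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    ticket_id: str
    ci: str                # configuration item the change was applied to
    deployed_at: datetime


@dataclass
class Incident:
    incident_id: str
    ci: str                # configuration item reporting the failure
    started_at: datetime


def candidate_causes(incident: Incident, changes: list[Change],
                     window: timedelta = timedelta(hours=4)) -> list[Change]:
    """Changes to the same CI deployed within `window` before the incident,
    most recent first."""
    hits = [
        c for c in changes
        if c.ci == incident.ci
        and timedelta(0) <= incident.started_at - c.deployed_at <= window
    ]
    return sorted(hits, key=lambda c: c.deployed_at, reverse=True)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    t0 = datetime(2024, 1, 1, 12, 0)
    changes = [
        Change("CHG-10", "endpoint-agent", t0 - timedelta(hours=1)),
        Change("CHG-11", "endpoint-agent", t0 - timedelta(days=2)),
        Change("CHG-12", "mail-gateway", t0 - timedelta(minutes=30)),
    ]
    incident = Incident("INC-1", "endpoint-agent", t0)
    print([c.ticket_id for c in candidate_causes(incident, changes)])
```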
Knowledge Management
  • Centralized Knowledge Base: A well-maintained knowledge base can provide IT teams with instant access to previous incident resolutions and best practices while breaking down data silos. This resource is invaluable for quickly addressing recurring issues. However, using different knowledge bases effectively is also key to maintaining the health of an organization’s ITSM processes. The Nebula ITSM platform provides a knowledge graph that gives additional visibility into relationships within an organization’s CMDB, ITSM knowledge bases, and IT ticketing system. Powered by Accrete’s proprietary knowledge engines, the knowledge graph is a visual representation of all the interconnected change management data in your organization. Leveraging our knowledge agents and the organization’s tacit knowledge, Accrete’s ITSM knowledge graph connects like data to show upstream and downstream relationships between Configuration Items. This allows an IT professional to understand the business impact of implementing a change ticket and taking a Configuration Item, or an IT asset, offline.
Knowledge graph view in Nebula ITSM showing the relationships between a Configuration Item (CI) and other CIs, Locations, and Knowledge Articles.
  • AI-Driven Insights: Accrete developed the Nebula ITSM tool with a vision to enable IT organizations to capture, retain, scale, and optimize their tacit knowledge. Tacit knowledge, as opposed to formalized, codified, or explicit knowledge, is knowledge that is difficult to express or extract; it is therefore harder to transfer to others by writing it down or verbalizing it. However, this form of experiential knowledge is ubiquitous in the IT world, as managers with years of experience in maintaining their organization’s IT infrastructure understand the idiosyncrasies and nuances of their configuration items. Unifying an organization’s tacit knowledge across different teams, business units, or silos allows businesses to better understand the overall impact of IT change tickets and software updates.
  • Data Enrichment and Validation: The advanced AI algorithms on an ITOps platform, like the Nebula ITSM platform, work to enrich the data they receive. This means that even if relevant knowledge isn’t explicitly designated in the underlying document corpus, the system can still make predictions and identify trends that would otherwise be difficult to spot. Cassini's AI capabilities can enhance knowledge management by continuously learning from incidents and updating with new insights and remediation strategies through real-time data intake and data enrichment.
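
The upstream and downstream relationships described for the knowledge graph can be illustrated with a small graph traversal. This is a sketch under stated assumptions, not Accrete's implementation: the CMDB is modeled as a plain dictionary mapping each CI to the CIs it depends on, and the CI names are invented. Given a CI that is about to be changed or taken offline, it lists everything that directly or transitively depends on it.

```python
from collections import deque


def downstream_impact(depends_on: dict[str, list[str]], ci: str) -> list[str]:
    """Return every CI that directly or transitively depends on `ci`,
    in breadth-first order."""
    # Invert the edges: for each CI, record which CIs depend on it.
    dependents: dict[str, list[str]] = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)

    seen = {ci}
    order: list[str] = []
    queue = deque([ci])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, []):
            if nxt not in seen:  # guards against cycles in the CMDB data
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    # Hypothetical CMDB extract: each CI mapped to the CIs it depends on.
    cmdb = {
        "web-portal": ["app-server"],
        "app-server": ["db-cluster"],
        "reporting": ["db-cluster"],
        "db-cluster": [],
    }
    print(downstream_impact(cmdb, "db-cluster"))
    # ['app-server', 'reporting', 'web-portal']
```

A change ticket against `db-cluster` in this example would therefore need to account for three dependent services, which is the kind of business-impact context the knowledge graph is meant to surface.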

By applying ITSM principles and using AIOps tools like Accrete’s Nebula ITSM for IT operations, organizations can work towards creating a more resilient IT environment. These strategies and tools can help keep disruptions from escalating and support business continuity even in the face of unexpected technical issues.
