Paying attention to the lessons of CrowdStrike.

A computer bursting into flames.
A small glitch can create big problems. Image from Midjourney V6

It has been two weeks since the software outage caused by CrowdStrike. I wanted to provide a hot take on the subject but decided to gather more information and develop a more informed opinion. The entire affair has already come and gone from our cultural memory, like the internet's annual fascination with Alabama Rush Week. Forgetting it is foolish because the outage underscores some crucial realities that the business and technology communities must acknowledge.

We have completed a quarter of the twenty-first century, and it is clear that BBC journalist James Burke was correct when he said we live in an interdependent and complicated world where technology and change dominate our lives. Technology requires such a level of specialization and training that many people treat the postmodern world around them like magic. That creates a false sense of security in which people who do not understand technology depend on it for everyday decisions. Arrogance soon replaces the false sense of security, leaving our lives vulnerable to the slightest technological glitch.

The CrowdStrike outage is a perfect illustration of the nature of our interconnectedness. A change to one file in a piece of security software triggered an error in Windows. Computers in numerous settings, including airlines, medical systems, and offices, hit a critical fault and froze at what people in the tech industry call "the blue screen of death." The only way to bring the systems back up was to restart each computer, remove the file, and restart again. It was a tedious and labor-intensive approach that took up the time of countless I.T. professionals worldwide. It grounded flights and showed everyone how tenuous our lives can be thanks to our dependence on technology.

In the aftermath of the outage, people blamed anything they could for the disruption. Some claimed that technology companies' diversity, equity, and inclusion efforts allowed unqualified engineers to create the software that caused the outage. Scott Hanselman from Microsoft pushed back firmly on this take, saying it was reductive and racist to make such accusations. In truth, efforts to include more people in the technology field have improved the quality and safety of many software systems. Many systems are so finely calibrated that a tiny error, magnified across tens of thousands of systems, can have significant consequences.

People also blamed the "move fast and break things" ethos of software development and Agile. My favorite was from TikTok personality Zigard Mednicks, who talked about how a more deliberate methodology might have prevented the cascading failure. I have plenty of respect for Ziguard, so I pay attention when he says something.

The reality is that we had what we in software development call a hand-off issue. CrowdStrike rigorously tested the software within its systems, and it passed all quality control measures. The checks did not consider each possible situation, so something was missed. The people at CrowdStrike had no way of knowing what minor errors in their systems would do to the downstream systems that used their software. Everyone just trusted that a later system would elegantly handle a failure or glitch. That is not what happened. So now CrowdStrike has created a process to test for this error and ensure it does not get into a production environment.
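For readers who want to see what "elegantly handling a failure" at a hand-off can look like, here is a minimal sketch in Python. It is not CrowdStrike's code or process. The JSON rule format, the `REQUIRED_FIELDS` list, and the functions `validate_rules` and `load_rules` are all hypothetical names made up for illustration. The idea is simply that the receiving system checks an update before trusting it and keeps its last known good rules when the update is malformed.

```python
import json

# Hypothetical rule format: every rule must carry these fields.
REQUIRED_FIELDS = ("name", "pattern", "action")


def validate_rules(raw_text):
    """Parse an update and return its rules, or raise ValueError if malformed."""
    try:
        rules = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"update is not valid JSON: {exc}") from exc
    if not isinstance(rules, list) or not rules:
        raise ValueError("update must be a non-empty list of rules")
    for index, rule in enumerate(rules):
        if not isinstance(rule, dict):
            raise ValueError(f"rule {index} is not an object")
        missing = [field for field in REQUIRED_FIELDS if field not in rule]
        if missing:
            raise ValueError(f"rule {index} is missing fields: {missing}")
    return rules


def load_rules(raw_text, last_known_good):
    """Return (rules, accepted); keep the previous rules if the update is bad."""
    try:
        return validate_rules(raw_text), True
    except ValueError:
        # The consumer degrades gracefully instead of crashing the host.
        return last_known_good, False
```

Calling `load_rules` with an update whose rule is missing a field returns the previous rules and `False`, so the caller can log the rejection and carry on rather than fail.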

Messy humans and the engineered systems they build have been failing each other this way for decades. The tolerance for mistakes is so tiny that a missing comma or a rounding error of a thousandth of a point can have unknown consequences. It is why technology is such an unforgiving field and why those of us who work in it must better understand systems, quality, and the implications of our work. The many people who depend on the services we provide trust us to build things that work and, when they do not, to cause minimal disruption.

The CrowdStrike outage shows that we lost some of that trust.

Until next time.


 

Edward J Wisniowski

Ed Wisniowski is a software development veteran. He specializes in improving organizational product ownership, helping developers become better artisans, and attempting to scale agile in organizations.
Sugar Grove, IL