Is Catastrophic Software Failure a Black Swan?

Thursday, 20 March 2025
By Ed Steinmueller and Gill Ringland

Black Swans are rare, unforeseen and hence unpredictable events, typically with extreme consequences. For instance, "financial industry models that simplify a complex reality are vulnerable to Black Swans that literally break the bank".

The potential risks of software failure are variously described as a “fly in the soup” or an “elephant in the room”. These metaphors convey the idea of unexpected risk events (flies) or of obvious and visible but neglected risks (elephants). Nassim Taleb, who popularised the term Black Swan, added the important element of magnitude to unexpected and neglected risks. Black Swan events pose a far greater, often catastrophic, risk than either flies or elephants.

Taleb observes that people and their organisations are often complacent about the existence of Black Swans. Even though Black Swan events appear inevitable in retrospect, the fact that they haven’t yet happened encourages a belief that they will not happen, and this causes shock when they do occur. To illustrate the added risks of complacency, Taleb contrasts the view of a well-cared-for turkey that is surprised by Thanksgiving (the sucker) with that of the butcher (the knower) who plans for the turkey’s demise.

Despite recent apparent increases in the number and scope of software system failures leading to service outages and data breaches, organisations that haven’t experienced such failures tend to expect that they never will. In Taleb’s analogy, they are the suckers.

Most organisations deliver their critical services directly or indirectly through digital systems. These services are seen as a utility – essential to the operation of the economy and society, and to the quality of life. But these digital systems contain software components over which the organisations have no control. The failure of one or more of these software components is inevitable and unpredictable. The question is one of risk magnitude: whether the failure will be a Black Swan event for a particular organisation or merely an emergency that has been anticipated and rehearsed.

Given the difficulties of predicting the nature and size of service outages or data breaches, and expectations of society that a utility will operate 24/7, the best means of avoiding Black Swan events is for organisations to focus on resilience. Resilience can be used to describe many aspects of organisations. Here we focus on the resilience of services to users – reducing the fallout to users from digital systems failure.

The first step in building a more resilient organisation is to increase the visibility of systemic risk. Some very basic managerial tools, such as RACI (a framework based on assigning Responsibility and Accountability, with Consultation and Informing of stakeholders), provide the means for ‘getting started’ in establishing responsibility for resilience and increased availability. Building responsibility and eventually consensus requires a common language about risk and resilience within organisations. This enables technical and non-technical people to jointly consider actions to mitigate risk and improve resilience. Encouraging ‘translators’, people who are able to bridge differences in outlook and language, is helpful for building mutual understanding of what is at stake.

Metrics are invaluable in moving from talk to action. A particularly useful metric is based on the concept of ‘important business services’, services that are vital to the financial and reputational survival of the organisation and whose interruption would cause intolerable harm. In most cases, a small interruption or one that affects a limited number of customers is tolerable. This allows the identification of a metric – impact tolerance – which is the threshold between tolerable and intolerable interruption (https://www.bankofengland.co.uk/prudential-regulation/publication/2021/march/operational-resilience-sop).
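To make the idea of a threshold concrete, the following minimal Python sketch checks a single interruption against an impact tolerance. It is an illustration only: the class names, the two limits (duration and customers affected) and the numbers are assumptions made for this example and are not taken from the Bank of England policy. It follows the wording above, treating an interruption as tolerable if it is either short or limited in reach; a stricter organisation might require both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactTolerance:
    """Hypothetical tolerance for one important business service."""
    max_outage_minutes: float
    max_customers_affected: int


@dataclass(frozen=True)
class Interruption:
    """One observed or rehearsed interruption of the service."""
    outage_minutes: float
    customers_affected: int


def is_tolerable(event: Interruption, tolerance: ImpactTolerance) -> bool:
    """Tolerable if the interruption is short OR affects few customers.

    This mirrors the article's wording; it is an assumption, not a
    regulatory definition.
    """
    short = event.outage_minutes <= tolerance.max_outage_minutes
    limited = event.customers_affected <= tolerance.max_customers_affected
    return short or limited


# Illustrative figures only.
payments = ImpactTolerance(max_outage_minutes=120, max_customers_affected=1000)
brief_but_wide = Interruption(outage_minutes=30, customers_affected=50_000)
long_and_wide = Interruption(outage_minutes=600, customers_affected=50_000)

print(is_tolerable(brief_but_wide, payments))
print(is_tolerable(long_and_wide, payments))
```

Writing the tolerance down as data, rather than leaving it implicit, gives technical and non-technical staff the same numbers to discuss.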

When setting impact tolerances, it is useful to plot the cost of reducing risks against the expected time before failure. Such a plot is only one of several cost-benefit analyses that can help to build a common language and enlist support for the investments needed to stay within agreed tolerances (cost-benefit analysis of risk reduction is extensively examined in Chauncey Starr et al. (1976)).

Another useful set of metrics comes from the NIS framework, which measures four aspects of the cost of failures in terms of impact on users. These are the cost of “lost user hours”, the cost of data breaches, the cost of damage to life or health, and the cost of significant financial impact on users. The NIS framework is used by the ICO to regulate RDSPs (Relevant Digital Service Providers), but is not yet widely used for costing the economic impact of service failures. For very large outages, the organisation needs to have a disaster response plan that incorporates alternative service provision (a ‘Plan B’) or well-structured rollback and restart procedures to restore system functionality.
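As a rough illustration of how the four cost categories above might be combined into a single figure for one incident, here is a minimal Python sketch. The class, the field names, the flat hourly rate and all the numbers are hypothetical; the NIS framework itself does not prescribe this calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureImpact:
    """Hypothetical record of one incident's cost to users."""
    lost_user_hours: float       # users affected x hours of lost service
    cost_per_user_hour: float    # assumed flat monetary value of one user hour
    data_breach_cost: float
    life_or_health_cost: float
    user_financial_loss: float

    def lost_hours_cost(self) -> float:
        """Monetary value of the lost user hours."""
        return self.lost_user_hours * self.cost_per_user_hour

    def total_cost(self) -> float:
        """Sum of the four cost categories."""
        return (
            self.lost_hours_cost()
            + self.data_breach_cost
            + self.life_or_health_cost
            + self.user_financial_loss
        )


# Illustrative figures only: 5,000 lost user hours valued at 20 per hour.
incident = FailureImpact(
    lost_user_hours=5_000,
    cost_per_user_hour=20.0,
    data_breach_cost=250_000.0,
    life_or_health_cost=0.0,
    user_financial_loss=40_000.0,
)

print(incident.lost_hours_cost())  # 100000.0
print(incident.total_cost())       # 390000.0
```

Keeping the four categories separate until the final sum makes it easier to see which kind of harm dominates a given incident.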

Resilience is more than a technical issue. It requires specific management actions and skills. A recent RoundTable found that these skills are often broadly dispersed within organisations and that gaps in knowledge and practice are only recognised after an outage. These gaps can be bridged by developing internal capability or by procuring external capabilities, but there is no magic bullet that will assure that the necessary skills are available.

However, it can help to build an organisational culture that provides ‘safe spaces’ for people to discuss service resilience and its value to the organisation. This results in more open discussion of failures and outages, and the recognition of early signs of instability or risk.

The practice, common in health and safety, of making it a positive obligation to call out issues has much value. Organisations need to establish a ‘What If?’ approach to planning for potential future scenarios to ensure they have adequate protections in place. One method is to conduct a ‘pre-mortem’ examination of an imagined large-scale failure as a way of managing organisational risks effectively.

Information Technology is a critical utility, and this increases the need for resilience everywhere. Improving resilience begins with awareness and visibility of risks, including risks of Black Swan catastrophic failures. Building an ‘all hands’ effort to communicate priorities regarding important business services and the organisation’s tolerance for outages is a first step. This initial planning can be turned into action by using metrics to gauge ordinary risk, and mitigation measures such as alternative service provision (a Plan B) or well-defined recovery procedures to deal with larger-scale failures. Building from visibility to action requires organisational learning and skills-building, and a common commitment to reducing the scale of impacts of failures when they occur.

We explore these issues and methods in more depth in Resilience of Services: reducing the impact of IT failures, published by The London Publishing Partnership.

Ed Steinmueller and Gill Ringland

March 2025
