Once again, a software upgrade gone wrong has brought core communications capabilities to their knees, this time with fatal consequences. On September 18, a network software update at Australian telecommunications provider Optus knocked out the triple-zero emergency calling system in Western Australia, South Australia, and the Northern Territory for around 13 hours. At least three deaths, including that of an eight-week-old baby, have been attributed to the outage. With over 600 calls failing to connect, many other less severe but still costly harms inevitably followed, not least the loss of confidence in the emergency calling system generally and in Optus in particular.
The triple-zero catastrophe comes nearly two years after another spectacular Optus upgrade failure. In that 2023 incident, a network upgrade triggered a Border Gateway Protocol routing problem that shut down the entire Optus network when the default tolerance thresholds set by Cisco Systems were exceeded. A failed network upgrade similarly crippled Canada’s Rogers network in July 2022. The 2023 Optus outage cost the company over $1.2 million, likely only a fraction of the cost to the economy from the loss of service; the share price fell by around 5 percent and the CEO was forced to resign.

The exact cause of the most recent Optus outage has yet to be confirmed. Given the severity of the consequences, the Australian Communications and Media Authority has launched a formal investigation, though informal reports point to a departure from established processes.
Such outages are not confined to telecommunications systems. The July 2024 CrowdStrike incident, a faulty software update that crashed Microsoft Windows systems worldwide and crippled air and ground transportation along with the finance, healthcare, media, and retail industries, was likewise a software upgrade gone wrong. Nearly every day, somewhere, a software upgrade fails, frustrating the core functions of essential and non-essential operations alike: in hospitals, schools, universities, government departments, stores, factories, and warehouses. Each of these outages may seem minor on its own, but collectively the lost profits, lost productivity, and other intangible consequences add up to a very large total.
The growing cost of upgrade outages derives from three interwoven sources. First, increasing digitization means that more of our daily activities are handled by applications entirely reliant on computational capacity. Second, as centrally managed, cloud-based data storage and application hosting replace local storage and processing on phones, servers, and personal computers, functions that once depended on a small number of locally managed steps now depend on many different and diverse links covering both the movement of data and its processing. The more links data must pass through between origin, processing, and destination, the more pieces of software handle it and, therefore, the more potential vulnerabilities it is exposed to. Third, the software processing the data is itself becoming more complex, as ever more intricate systems interact with one another to manage and control the relevant operations. Seamless coordination of these software-managed processes is essential to ongoing operational performance, yet an upgrade to any one piece of software can expose a previously unidentified incompatibility and break the chain, simply because there are more possible permutations and combinations of interacting functions than either humans or systems can reliably track.
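A rough back-of-the-envelope sketch makes the arithmetic behind this concrete. The figures below are assumptions chosen purely for illustration (a hypothetical 99.9 percent per-link reliability and independent failures), not measurements of any real network, but they show how quickly end-to-end reliability erodes and how fast the number of pairwise interactions to track grows as a chain lengthens.

# Illustrative only: assumed per-link reliability and independence,
# not data about any real network.

def chain_reliability(per_link_reliability: float, links: int) -> float:
    """Probability a request succeeds when every link in the chain must work."""
    return per_link_reliability ** links

def pairwise_interactions(components: int) -> int:
    """Number of distinct component pairs that could interact unexpectedly."""
    return components * (components - 1) // 2

for links in (3, 10, 30):
    print(f"{links:>2} links at 99.9% each -> "
          f"{chain_reliability(0.999, links):.2%} end-to-end, "
          f"{pairwise_interactions(links)} pairwise interactions to track")

On these assumptions, a three-link chain still succeeds about 99.7 percent of the time, but a 30-link chain falls to roughly 97 percent, and the number of pairwise interactions that must be tracked grows from 3 to 435.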
From a supply chain risk management perspective, these three forces mean that, slowly and inexorably, risks to the resilience of operational delivery of all kinds, not just telecommunications services, have increased with the evolution of cloud computing. Arguably, these chains are at their most vulnerable when software is updated at any point along them. Because such upgrades cannot be fully tested, since no test system mirrors the full scope of operations of these complex services and can provide reassurance that nothing will go wrong, it is inevitable that service outages from this source will both increase and impose their full costs in real time in the real world, rather than in the harmless test environments where they once surfaced.
While the University of Adelaide’s Mark Stewart has suggested that “there is a long standing worldwide trend for companies to inadequately resource the testing and disaster recovery associated with network planning associated with network upgrades,” it is already too late to rely on rigorous testing in a network-based cloud world. But there is still time to reconsider our approaches to disaster recovery for when network upgrades go wrong. The responsibility for this lies with every operator of every link along the chain, but who, in this context, will look out for the chain as a whole?