Resilience Of Services – Reducing IT Failures – Lessons From Grenfell Tower And Coal Mine Deaths In The US

Saturday, 26 October 2024
By Gill Ringland & Patricia Lustig


Our previous blogs on Digital Systems have looked at the threat to national resilience from software failures and the impact on productivity, and have discussed why many existing digital systems are not fit for purpose in terms of service delivery.

In this blog we ask – are there lessons we could learn from the Grenfell Tower disaster and from coal mine disasters in the US that would help improve the resilience of services based on IT?

Grenfell Tower

The Grenfell Tower disaster, in June 2017, led to the deaths of 72 people. A fire started in a fridge in a fourth-floor flat. It should have been contained in that flat for a reasonable time if the building had conformed with Building Regulations. The regulations covered both the design of the building, so that any fire would be localised, and the materials, which should have been fire-retardant.

The Financial Times front-page headline of 5th September 2024 was “Official failings and industry deceits led to Grenfell tragedy”. The article goes on to say “The report found that successive governments failed to properly regulate the construction industry”. The inside pages expand on this with “complacent or incompetent ministers and local government officials, and ‘unscrupulous’ suppliers of unsafe cladding materials.”

Were regulations in place which, if followed in the Grenfell Tower building, could have reduced the severity of the fire? Yes. Where was the regulatory structure to enforce these regulations? It was missing. The official report identifies that some regulatory bodies were subject to commercial pressures, others lacked powers to enforce, and the relationship between regulatory bodies was confused.

Where did the responsibility and accountability for fire safety at Grenfell Tower lie? This was clear – the responsibility was with the local authority. In the Grenfell Tower case, the report notes that “… the building control department failed to perform its statutory function of ensuring that the design of the refurbishment complied with the Building Regulations”. It is worth mentioning that local government bore the bulk of austerity-related spending cuts in England in the 2010s, with spending on buildings falling by 40–70%. The surveyor responsible for the refurbishment was overworked, inadequately trained and had a very limited understanding of the risks associated with the materials used, namely ACM (aluminium composite material) panels. The council of the local authority is responsible for strategy and overall budget, the officers for delivery. Where should the accountability lie?

Lessons for IT failures:

  • Regulations and standards were in place which, if followed, would have reduced the severity of the fire and the loss of life. Regulation and standards are essential but not sufficient. They need to be backed up by an effective, and adequately resourced, inspection and enforcement regime.
  • Building Regulations defined design standards that would have contained the fire. In IT systems, an architecture which localises failures (compartmentalisation) is a backstop to limit the extent of cascading failures.
  • In the local authority, staff were not adequately trained, nor were staffing levels sufficient to carry out the authority’s statutory duty. In IT systems, failures of staff at the front end are often due to trying to keep the show on the road with inadequate organisational support.
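
The compartmentalisation point above can be illustrated in code. What follows is only a minimal sketch under stated assumptions: the `Compartment` class, the component names `lookup` and `payments`, and the failure threshold are all hypothetical and not taken from any real system. The idea is that each component is called through a wrapper that catches its failures and, after repeated failures, seals it off, so that one faulty part degrades the service rather than bringing all of it down.

```python
from typing import Callable


class Compartment:
    """Wraps one component so that its failures stay local (hypothetical sketch)."""

    def __init__(self, name: str, max_failures: int = 3) -> None:
        self.name = name
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures seen so far

    @property
    def isolated(self) -> bool:
        # Once the failure budget is used up, the component is sealed off.
        return self.failures >= self.max_failures

    def call(self, func: Callable[[], str], fallback: str) -> str:
        """Run func; on error or isolation, return the fallback instead of raising."""
        if self.isolated:
            return fallback
        try:
            result = func()
        except Exception:
            self.failures += 1
            return fallback
        self.failures = 0  # a success resets the consecutive-failure count
        return result


def broken_lookup() -> str:
    # Stand-in for a component that is currently failing.
    raise RuntimeError("address lookup service is down")


def working_payment() -> str:
    # Stand-in for a healthy component.
    return "payment accepted"


lookup = Compartment("lookup", max_failures=2)
payments = Compartment("payments")

responses = []
for _ in range(3):
    responses.append(lookup.call(broken_lookup, fallback="lookup unavailable"))
    responses.append(payments.call(working_payment, fallback="payment unavailable"))
```

In this sketch the failing lookup is isolated after two errors and from then on answers with its fallback, while payments carry on unaffected. A real system would also need a way to re-test and re-admit an isolated component, which is left out here.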

Coal Mine Deaths In The USA


Michael Lewis has published the first of a fascinating set of articles on “heroes of the public sector”. The article tracks the career of Christopher Mark, a mining engineer.

By the time Chris turned his attention to longwall coal mine roofs, the industry was out of pocket $200 for every minute its mines were shut down by roof collapse — and a single roof fall could shut a mine for days. “The same roof fall that can kill miners can also cost a lot of money,” Chris said. And yet even though the coal mine industry had a huge financial incentive to figure out how to solve the problem, it hadn’t solved it.

By 1994, Chris had figured out how to rate the safety of any coal mine roof, on a scale of 1 to 100. He spent years sharing the methodology with engineers. By 2016, for the first time in history, no American miner was killed by falling roofs.

Chris’s paper “The Road to Zero: The Fifty-Year Effort to Eliminate Roof Fall Fatalities from U.S. Underground Coal Mines” shows in detail not just what happened, but why. About half of the deaths that were averted could be attributed to better technology and new knowledge – that is, to the kind of work he had done. The other half was due to changes in the culture of coal mining. And the greatest spur to that change had been the federal regulations that gave mine inspectors the power to enforce rules.

“Mark’s work has meant a change in coal mining culture,” said George Gardner, Pittsburgh Safety and Health Technology Center chief at MSHA. “The attitude used to be that mining was just hazardous and not much to do about it. Chris had to convince people that no – this is an engineering problem, and we can solve it.”

Lessons for IT failures

  • The coal mining sector in the US had the advantage of an expert who accumulated data over decades and as a result developed a theory on causes – and hence avoidance – of roof falls. There is no similar data collection or understanding of failure modes for IT-based systems.
  • Roof falls were costing the coal industry significant sums, but the industry had not solved the problem. The current attitude to IT failures is similarly that there is “not much to do about it”.
  • In the coal industry, once an engineering approach was proven, enforcement of regulations was still essential: change happened once federal regulations gave mine inspectors the power to enforce rules. In the UK, the Information Commissioner’s Office has the power to fine Registered Data Service Providers for data breaches, and to publicise the fines, but it is otherwise unusual for regulators to tackle IT systems failures.

Conclusions

There are useful parallels between IT systems and other systems with a longer history. In looking to reduce IT failures and their effect, key points would seem to be:

  • Principles of design and implementation can only follow accumulation of data and development of an understanding of the underlying principles. Currently there is no systematic data collection on IT failures, their causes and impact. The BCS has published a Policy Brief suggesting that governments should take a lead in publishing data on IT failures in their own systems, to set a framework for wider data sharing.
  • Regulation and standards need to be backed up by effective and visible enforcement. The regulation of Registered Data Service Providers, unusually for IT-related failures, includes publication of fines levied. Other sectors have a range of approaches and enforcement: agreement on some over-arching principles would allow sharing of data and could reduce IT failures.
  • IT systems have become so essential to service delivery that clear lines of accountability and responsibility are called for, even in sectors where the approach to IT failures has been that there is “not much to do about it”.

