David Norton, Executive Director, CISQ
Note: This blog first appeared on Dave's LinkedIn on October 3, 2019
We think of life as being measured in years, not inches, but on the night of 15th December 1967, for Charlene Wood, life or death would be a matter of inches.
Charlene was looking forward to getting home after 8 hours on her feet at the local hair salon, a long time given Charlene was pregnant with her first child. As Charlene approached the Silver Bridge over the Ohio River that cold Friday evening, the last thing on her mind was a design decision made 39 years earlier, a decade before she was even born.
Just after 5pm Charlene drove onto the bridge, following the steady stream of rush hour traffic, as she had done hundreds of times before - this time Charlene would not reach the other side.
Silver Bridge, “the Gateway to the South,” was a suspension bridge built in 1928. Unlike many suspension bridges that use cable to support the road deck, Silver Bridge used steel eyebar chains, like giant bicycle chain links. The use of eyebars was not uncommon; the Clifton Suspension Bridge in the UK was built using the technique as far back as 1864 – it was a safe design and cheaper than the cable method.
To keep the cost down, the design had gone for a low-redundancy, high-strength approach, using just two eyebar chains per side instead of the 4 or 6 used in other bridges. Using fewer eyebar chains was not thought to be a problem, as the steel they were fabricated from was twice the strength needed to actually carry the expected load, in 1928!
Charlene’s mind wandered with thoughts of the weekend and Christmas, but her attention was suddenly snapped back when she heard what sounded like a gunshot. Then the rear lights of the car in front of her started to move to the left as Charlene felt her own car move to the right. Within moments the bridge started to sway from side to side and the road deck started to shake violently – the sound of breaking and contorting steel its grim soundtrack – the bridge was collapsing.
High above the heads of Charlene and the other drivers, one of the giant bicycle-chain-like eyebars had failed – the ‘gunshot’. Eyebar 330, forged nearly 40 years earlier, had succumbed to years of poor maintenance, overloading, and a fundamental flaw in its fabrication. Residual stress in the crystalline structure of eyebar 330 left it brittle and prone to cracking – from the day of its fabrication it had been trying to pull itself apart – that Friday it succeeded.
The car lurched backwards as Charlene slammed it into reverse. She had moved less than 4 feet when the lights of the cars in front suddenly dropped away towards the river - like dominoes falling one by one into the night. Charlene braced herself, the sound of collapsing steel becoming deafening, then silence.
Charlene Wood and her unborn baby survived that night; 46 other people did not. Inches and quick thinking separated her car from the cold Ohio River 50 feet below. The incident was 40 years in the making and less than a minute in the doing.
Professor Henry Petroski, a civil engineer and expert in failure analysis, would later write “If ever a design was to blame for a failure, this was it.”
The bridge design made it nearly impossible to inspect, allowing the fatigue cracks to go unnoticed. And the low-redundancy design meant it was not possible to replace any of the load-bearing eyebars without a considerable amount of temporary support work that could take the same loads – basically another bridge.
After the collapse, the US government rushed through new legislation, with President Lyndon B. Johnson becoming personally involved. The first federal bridge inspection program in the form of the Federal-Aid Highway bill was passed. Crucially, bridge designers became more aware of the field of metallurgy, and the risk of residual fabrication stress and fatigue failure.
Arguably, the software industry has had its Silver Bridge moment, several in fact. There were the Therac-25 radiation therapy incidents of 1985–87, which led to patients receiving 100 times the intended dose, and at least 3 deaths. There was the 1994 Chinook helicopter crash that killed all 29 people on board due to problems with the FADEC (Full Authority Digital Engine Control) system. There is also a growing list of near misses (no proven fatalities) - the Toyota sudden unintended acceleration (SUA) problems, or the LA Air Traffic Control communication failure.
It’s not just safety-critical systems. Some years ago, a CIO of an agriculture agency commented to me that late payments of agriculture grants to farmers, due to system performance issues, had been linked to depression and even suicides. And outages in finance, retail, and travel have become a daily occurrence and have a huge impact on the economy.
Sticking with the bridge theme, we are struggling to keep on top of the “residual stress” in the systems, i.e., low structural integrity, poor maintainability, and general tech debt. And just like a bridge, these systems are under increasing load from the business and customers they support.
So, should government step in as it did after the Silver Bridge disaster, or should it let the industry put its own house in order? Governments around the world have been working with the IT industry to improve cyber security and reduce the risk to critical infrastructure. In the defence sector there are numerous initiatives related to cyber and software quality. Is this enough? I would argue no.
As it stands today, a bank can commission the development of a new payment system that could impact millions of people financially. That system could be delivered with poor quality and high levels of technical debt – “residual stress.” And if or when it fails, there may be a large fine, and the CEO may even be forced to resign, but will the rest of the industry change its behaviour? No.
It’s human nature: when we see something bad happen, we always think it won’t happen to us. When was the last time you broke the speed limit? We all have friends who have been stopped for it, yet we assume it will be them, not us.
The bottom line: self-regulation does not work when it comes to enterprise systems. If you think I am pessimistic, why are the frequency and severity of software quality-related incidents increasing?
Governments need to do more with the audit community and the relevant regulatory organizations to make sure systems are developed without high levels of technical debt. And not just key pieces of infrastructure or safety-critical systems – all key enterprise systems that impact citizens, shareholders, and the economy.
We have standards that relate to software quality that can reduce the ‘residual stress’ of enterprise systems. ISO 25010 SQuaRE (System and Software Quality Requirements and Evaluation) is one example, and much of it can be automated via CISQ Quality Standards. However, as these standards are not mandated, very few organizations use them.
It is time regulators and certification bodies in the financial, utilities, and healthcare sectors became more proactive and stopped the next “Silver Bridge” disaster in their sector.
And if you have read this far, can I ask you to take our survey below? Help us stop the next virtual Silver Bridge.