Here’s a question: when a bridge or building falls down, whose “fault” is it? Is it the architect who designed it in the first place, the builders and construction workers who did not build it to spec, the testers who did not test the worst-case scenario, or the people who maintain and operate it? And how might we use disasters from the world of civil engineering to learn better ways of architecting software systems?
Here are four well-known examples of architectural disasters (in order of increasing loss of life) from the world of civil engineering:
- The Millennium Bridge in London, a steel suspension bridge for pedestrians across the Thames. Construction began in 1998 and the bridge opened on 10 June 2000. Two days later, participants in a charity walk felt an unexpected swaying motion, and the bridge was closed for almost two years while modifications were made to eliminate the “wobble”. The cause was a positive feedback phenomenon known as Synchronous Lateral Excitation: the natural sway of people walking caused small sideways oscillations, which in turn caused people on the bridge to sway in step, increasing the amplitude of the oscillations and continually reinforcing the effect (a toy simulation of this feedback loop appears after this list). Engineers added dampers to the bridge to prevent the horizontal and vertical movement. No people or animals were injured in this incident.
- The Tacoma Narrows Bridge was opened to traffic on July 1, 1940 and collapsed four months later. At the time of its construction it was the third-longest suspension bridge in the world. From the time the deck was built it moved vertically in windy conditions, earning it the nickname Galloping Gertie. Several measures aimed at stopping the motion proved ineffective, and the bridge’s main span finally collapsed in 40 mph winds on November 7, 1940. No people were injured in this incident, but a dog was killed.
- On 28 January 2006, the roof of one of the buildings at the Katowice International Fair in Katowice, Poland collapsed under the weight of snow. There were 700 people in the hall at the time, 65 of whom were killed. A later inquiry found numerous design and construction flaws that contributed to the speed of the collapse.
- The twin towers of the World Trade Center (WTC) in downtown Manhattan collapsed on September 11, 2001, when al-Qaeda terrorists hijacked two commercial passenger jets and flew them into the skyscrapers. A government report that examined the collapse declared that the WTC design had been sound and attributed the collapses to “extraordinary factors beyond the control of the builders”. 2,752 people died, including all 157 passengers and crew aboard the two airplanes.
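Since the Millennium Bridge failure is essentially a runaway positive feedback loop, it translates neatly into a few lines of code. The sketch below (in Python) is a toy model whose constants are entirely made up for illustration; it is not a structural simulation, but it shows how a small disturbance grows when it reinforces itself, and how damping breaks the loop.

```python
# Toy model of Synchronous Lateral Excitation as a positive feedback loop.
# All constants are invented for illustration; this is not a structural model.

def simulate(steps: int, damping: float) -> float:
    sway = 0.1          # initial sideways sway (arbitrary units)
    synchronised = 0.0  # fraction of pedestrians walking in step
    for _ in range(steps):
        # The more the bridge sways, the more pedestrians fall into step...
        synchronised = min(1.0, synchronised + 0.2 * sway)
        # ...and the more synchronised the footfalls, the bigger the push,
        # less whatever the dampers absorb.
        sway *= (1.0 + synchronised) * (1.0 - damping)
    return sway

print(f"without dampers: final sway = {simulate(20, damping=0.0):.2f}")
print(f"with dampers:    final sway = {simulate(20, damping=0.5):.2f}")
```

The same runaway shape turns up in software: retry storms, cache stampedes and autoscaling oscillations are all positive feedback loops that need an equivalent of the bridge’s dampers (backoff, jitter, rate limits, circuit breakers).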
In at least one of these cases (the Katowice International Fair building) various people, including the designers, have been indicted for “directly endangering the lives of other people” and face up to 12 years in prison. Prosecutors have also charged the building’s operator with “gross negligence” for not removing the snow quickly enough.
So what can we learn from these natural and man-made disasters and apply to the world of software architecture? In each of these cases the constructions were based on well-known “patterns”: suspension bridges, trade halls and skyscrapers had all been built successfully before without collapsing. What was different in each case was that the non-functional characteristics were not taken into account. In the case of the bridges, oscillations caused by external factors (people and wind) were not adequately catered for. In the case of the trade hall in Katowice, the roof was not engineered to handle the additional weight of snow. Finally, in the case of the WTC, the impact of a modern passenger jet, fully laden with fuel, crashing into the building was simply not conceived of (although, interestingly, an “aircraft-impact analysis” involving a Boeing 707 hitting the building at 600 mph was actually done; it concluded that although there would be “a horrendous fire” and “a lot of people would be killed”, the building itself would not collapse). Here are some lessons I would draw from these incidents and how we might relate them to the field of software architecture:
- Architects need to take into account all non-functional requirements. Obviously this is easier said than done: who would have thought of an event as unexpected as a passenger jet crashing into a skyscraper? Actually, to their credit, the building’s architects did; what they lacked was the ability to properly model the effect of such an impact on the structures, especially the effect of the fires.
- For complex systems, architects should build models covering all significant aspects of the architecture. Tools appropriate to the task should be used, and the modelling should be done at the right “level” of detail. Prototyping should also be adopted as a means of testing new or risky technical challenges.
- Designers should faithfully implement the architectural blueprints, and the architect should remain on the project during the design and implementation phases to check that their blueprints are implemented as expected.
- Testing should be considered early, with thought given to how the non-functional characteristics can be tested. Real limits should be applied, taking into account the worst-case (but realistic) scenario; a minimal load-test sketch follows this list.
- Operations and maintenance staff should be involved from an early stage to make sure they are aware of the impact of unexpected events (for example, a complete loss of all systems because of an aircraft crashing on the data centre) and have operational procedures in place to address such events.
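To make the testing point above concrete, here is a minimal worst-case load-test sketch in Python. The endpoint, concurrency level, and latency threshold are hypothetical placeholders; in practice you would derive them from your own worst-case (but realistic) analysis, and would probably use a dedicated load-testing tool rather than a hand-rolled script.

```python
# Minimal sketch of testing a non-functional requirement such as
# "the service must answer within 500 ms at 100 concurrent requests".
# URL, concurrency, and threshold are hypothetical placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/health"  # hypothetical endpoint
CONCURRENCY = 100                     # worst-case simultaneous users
THRESHOLD_SECONDS = 0.5               # agreed response-time limit

def timed_request(_: int) -> float:
    """Issue one request and return its latency in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(URL, timeout=5) as response:
        response.read()
    return time.monotonic() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(timed_request, range(CONCURRENCY)))

p95 = latencies[int(0.95 * len(latencies)) - 1]
print(f"p95 latency: {p95:.3f}s "
      f"({'PASS' if p95 <= THRESHOLD_SECONDS else 'FAIL'})")
```

The value is less in the script than in the habit: the limit is written down as a number, exercised regularly, and a breach fails loudly instead of being discovered by users (or, as in Katowice, by the snow).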
As a final, and sobering, footnote, here’s a quote from The Challenges of Complex IT Projects, a report produced by the Royal Academy of Engineering and the British Computer Society:
> Compared with other branches of engineering in their formative years, not enough people (are known to) die from software errors. It is much easier to hide a failed software project than a collapsing bridge or an exploding chemical plant.
The implication of this statement would seem to be that only when software suffers major, and very public, failures will people really take note and start to address problems before they occur. There are plenty of lessons (and anti-patterns) in other industries that we can, and probably should, learn from before major software errors start causing loss of life.
You may also be interested in the follow-up to the above report, which describes some success stories and why they worked (just in case you thought it was all bad).