Complexity and Resilience
September 21, 2007
When I first founded Enterra Solutions, one of the emerging characteristics of the information age that intrigued me was its increasing complexity. It was clear to me that there was a growing complexity gap (i.e., the difference between the complexity organizations faced and the tools they had to confront it). I became interested in rule set automation because I knew that it could help deal with complexity in ways that human intervention couldn’t. The complexity of networks has been brought to light on numerous occasions, and it is almost always the unintended consequences of failure that make those networks vulnerable. That vulnerability is the focus of an article by John Schwartz [“Who Needs Hackers,” New York Times, 12 September 2007]. His article begins with the customs crisis at the Los Angeles airport in August 2007.
“Nothing was moving. International travelers flying into Los Angeles International Airport — more than 17,000 of them — were stuck on planes for hours one day in mid-August after computers for the United States Customs and Border Protection agency went down and stayed down for nine hours. Hackers? Nope. Though it was the kind of chaos that malevolent computer intruders always seem to be creating in the movies, the problem was traced to a malfunctioning network card on a desktop computer. The flawed card slowed the network and set off a domino effect as failures rippled through the customs network at the airport, officials said.”
Yossi Sheffi, in his seminal book The Resilient Enterprise, talked about the need for understanding and planning for disruptions in supply chains. Organizations whose primary commodity is information need to look at the supply chain for information in the same way that manufacturers look at their supply chain for product components. As Schwartz points out, this means more than simply protecting networks against hackers.
“Everybody knows hackers are the biggest threat to computer networks, except that it ain’t necessarily so. Yes, hackers are still out there, and not just teenagers: malicious insiders, political activists, mobsters and even government agents all routinely test public and private computer networks and occasionally disrupt services. But experts say that some of the most serious, even potentially devastating, problems with networks arise from sources with no malevolent component. Whether it’s the Los Angeles customs fiasco or the unpredictable network cascade that brought the global Skype telephone service down for two days in August, problems arising from flawed systems, increasingly complex networks and even technology headaches from corporate mergers can make computer systems less reliable. Meanwhile, society as a whole is growing ever more dependent on computers and computer networks, as automated controls become the norm for air traffic, pipelines, dams, the electrical grid and more.”
One of the reasons that I started looking at critical infrastructure for implementation of Enterprise Resilience Management™ was that I realized that its networks were complex, vulnerable, and critical to the U.S. economy. Anyone who flies in a modern airliner is relying on automated computer programs to keep them safe. The same kind of care that goes into programming fly-by-wire systems needs to be taken when programming any critical infrastructure system. The challenge is understanding the underlying complexity that is generated by massive networks.
“‘We don’t need hackers to break the systems because they’re falling apart by themselves,’ said Peter G. Neumann, an expert in computing risks and principal scientist at SRI International, a research institute in Menlo Park, Calif. Steven M. Bellovin, a professor of computer science at Columbia University, said: ‘Most of the problems we have day to day have nothing to do with malice. Things break. Complex systems break in complex ways.’ When the electrical grid went out in the summer of 2003 throughout the Eastern United States and Canada, ‘it wasn’t any one thing, it was a cascading set of things,’ Mr. Bellovin noted. That is why Andreas M. Antonopoulos, a founding partner at Nemertes Research, a technology research company in Mokena, Ill., says, ‘The threat is complexity itself.’ Change is the fuel of business, but it also introduces complexity, Mr. Antonopoulos said, whether by bringing together incompatible computer networks or simply by growing beyond the network’s ability to keep up. ‘We have gone from fairly simple computing architectures to massively distributed, massively interconnected and interdependent networks,’ he said, adding that as a result, flaws have become increasingly hard to predict or spot. Simpler systems could be understood and their behavior characterized, he said, but greater complexity brings unintended consequences. ‘On the scale we do it, it’s more like forecasting weather,’ he said.”
The key to resilience is understanding not only how systems work but also how they break. The cascading effects referred to in the article can produce unexpected results. Surprises are rarely the good kind when it comes to moments of crisis. The Skype case study is particularly intriguing.
“In the case of Skype, the company — which says it has more than 220 million users, with millions online at any time — was deluged on Aug. 16 with login attempts by computers that had restarted after downloading a security update for Microsoft’s Windows operating system. A company employee, Villu Arak, posted a note online that blamed a ‘massive restart of our users’ computers across the globe within a very short time frame’ for the 48-hour failure, saying it had overtaxed the network. Though the company has software to ‘self-heal’ in such situations, ‘this event revealed a previously unseen software bug’ in the program that allocates computing resources.”
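The mechanism described in the quote, a very large number of machines restarting and trying to log in within a very short time frame, can be illustrated with a toy simulation. The sketch below is only an illustration of that general effect and says nothing about how Skype's software actually works. The function name, the client count, and the 60-second spread are all hypothetical choices made for the example. It compares the busiest second when every client logs in at once with the busiest second when each client first waits a random delay.

```python
import random
from collections import Counter


def peak_logins_per_second(num_clients, spread_seconds, seed=0):
    """Return the login count in the busiest one-second window.

    Hypothetical model: after a mass restart, each client waits a random
    delay in [0, spread_seconds) before logging in. A spread of 0 models
    every client logging in at the same instant.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    buckets = Counter()
    for _ in range(num_clients):
        delay = rng.uniform(0, spread_seconds) if spread_seconds > 0 else 0.0
        buckets[int(delay)] += 1  # group logins into one-second buckets
    return max(buckets.values())


if __name__ == "__main__":
    clients = 100_000  # illustrative figure, not taken from the article
    print("all at once:  ", peak_logins_per_second(clients, 0))
    print("spread over 60s:", peak_logins_per_second(clients, 60))
```

With no spread, the peak equals the full client population. With the random delay, the same total work arrives at a far lower peak rate, which is the kind of difference that decides whether a login service is overtaxed or merely busy.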
When consulting with large corporations, I encourage the use of alternative future discussions that focus on challenges outside the corporation that could impact processes within the organization. Hindsight is easy. That Microsoft, which still dominates the operating system world, could do something with a major negative impact on operations reliant on that operating system should have been predictable. Foresight, however, is not as easy as hindsight. Corporations that can find people who can think outside the proverbial box and see potential challenges before they emerge need to hang on to them; they are corporate treasures. Peter Schwartz of Global Business Network, for example, has made a remarkable living by being able to imagine the unthinkable. As a result, he has made many a company more resilient and prepared. Back to John Schwartz’s article:
“As computer networks are cobbled together, said Matt Moynahan, the chief executive of Veracode, a security company, ‘the Law of the Weakest Link always seems to prevail.’ Whatever flaw or weakness allows a problem to occur compromises the entire system, just as one weak section of a levee can inundate an entire community, he said. … [T]he precursor to the Internet known as the Arpanet collapsed for four hours in 1980 after years of smooth functioning. According to Dr. Neumann of SRI, the collapse ‘resulted from an unforeseen interaction among three different causes’ that included what he called ‘an overly lazy garbage collection algorithm’ that allowed the errors to accumulate and overwhelm the fledgling network. Where are the weaknesses most likely to have grave consequences? Every expert has a suggestion. … Dr. Bellovin at Columbia said he also worried about what might happen with the massively complex antimissile systems that the government is developing. ‘It’s a system you can’t really test until the real thing happens,’ he said. There are better ways. Making systems strong enough to recover quickly from the inevitable glitches and problems can keep disruption to a minimum. … The best answer, Dr. Neumann says, is to build computers that are secure and stable from the start. A system with fewer flaws also deters hackers, he said. ‘If you design the thing right in the first place, you can make it reliable, secure, fault tolerant and human safe,’ he said. ‘The technology is there to do this right if anybody wanted to take the effort.’ He was part of an effort that began in the 1960s to develop a rock-solid network-operating system known as Multics, but those efforts gave way to more commercially successful systems. Multics’ creators were so farsighted, Dr. Neumann recalled, that its designers even anticipated and prevented the ‘Year 2000’ problem that had to be corrected in other computers. 
That flaw, known as Y2K, caused some machines to malfunction if they detected dates after Jan. 1, 2000. Billions of dollars were spent to prevent problems. Dr. Neumann, who has been preaching network stability since the 1960s, said, ‘The message never got through.’ Pressures to ship software and hardware quickly and to keep costs at a minimum, he said, have worked against more secure and robust systems. ‘We throw this together, shrink wrap it and throw it out there,’ he said. ‘There’s no incentive to do it right, and that’s pitiful.’”
It seems to be human nature that our perspective shifts only after a crisis. Undoubtedly a network crisis will eventually shift corporate perspectives about network robustness. In other words, the crisis will supply the incentive for which Dr. Neumann is looking. Unfortunately, it will have to be a large crisis that generates huge disruptions. I can’t predict what that disruption might be (perhaps something to do with the global transfer of money), but a network disruption is likely to occur, and let’s hope it doesn’t put anyone’s life at risk. Enterprise Resilience Management was developed with the intention of helping identify and prevent or mitigate such crises, and I’m sure others will eventually develop similar approaches. Let’s hope they are implemented and the dire predictions of a big crisis are proven false.