Coping with human error in the router world

The November 2004 issue of ACM Queue contains an article entitled Coping with Human Error in IT Systems by Aaron B. Brown of IBM Research. This article got me thinking about how modern routers cope with human errors.

One of the first nuggets of knowledge comes early in the article.

Human error happens for many reasons, but in the end it almost always comes down to a mismatch between a human operator’s mental model of the IT environment and the environment’s actual state.

This statement is as obvious as it is important. From my experiences managing large, complex networks it couldn’t be more true. Thinking about it, I can trace almost all of my mistakes that have caused service interruptions to not properly understanding the network topology or even the hardware and software involved. This idea also emphasizes how important it is that the people working together to manage a network or any other complex system maintain close contact and always communicate changes to ensure each individual understands the current state of the system.

The article discusses four approaches for coping with human error: error prevention, spatial replication, temporal replication, and temporal replication with re-execution. Error prevention in this context refers to better training of the error-prone humans as well as tools that reduce errors. Spatial replication involves having multiple copies of the data; think RAID here. In temporal replication the system state is replicated in time. For example, saving the system state every five minutes would provide temporal replication. Your daily backups (you do daily backups, right?) are temporal replication. Temporal replication with re-execution adds the ability to replay the changes that have happened since the last replica was saved to recover from human errors.

While discussing error prevention the author says:

A good example of this error interception can be seen in the way that many e-mail clients can be configured to batch and delay sending outgoing mail for several minutes, providing a recover window during which an erroneously or rashly sent message can be recalled, discarded, or edited. This and similar buffering-based strategies are particularly effective because they leverage the human ability to self-detect errors: psychologists report that 70 to 86 percent of errors can be detected immediately after they are committed, even if they cannot be anticipated.

This paragraph got me thinking about the differences in the command line interfaces (CLIs) used by the various router vendors.

By far the best known and understood router CLI is from Cisco IOS. This interface has so much momentum that many vendors basically copy the IOS CLI and proudly state that their CLI is almost identical to IOS. Foundry is one such vendor. The IOS CLI goes completely against the above quoted paragraph. Any issued command is executed immediately. On many occasions I have been bitten by this behavior. There is nothing worse than hitting the enter key on a command that changes an interface IP just as you realize that you fat-fingered the IP address. Mistakes in more complex changes, such as modifications to network routes, also seem to have a way of becoming obvious just as you hit enter. The instant-apply CLI also has a nasty way of making two inter-dependent commands very difficult to execute.

Contrast this with the CLI used on the Alteon application switches. Changes made in this CLI do not take effect immediately. At any point the ‘diff’ command allows the operator to see all of the pending configuration changes. The ‘apply’ command makes the pending changes take effect. The article goes on to illuminate the primary problem that I have with delayed apply CLIs like the Alteon’s:

Error interception can also create confusion by breaking the immediate-feedback loop that people expect in interactive scenarios – imagine the havoc that a two-minute command execution delay would cause for an operator working at a command line to troubleshoot a system outage.

On many occasions I have issued commands to the Alteon switch and waited for them to take effect having forgotten to run ‘apply’. Whether this is simply because I too have become accustomed to the IOS way of doing things I do not know. On reflection, the ability to use the ‘diff’ command to preview all pending changes has probably prevented some of my errors from causing operational problems.

Even if we assume that the non-instant apply CLIs do prevent some errors from becoming operational problems, there are network errors that no amount of previewing will prevent. These errors are not typos or the occasional brain-dead moment but changes that interact with other systems in unexpected ways. To recover from errors of this type, some form of replication is required. In this respect, the router vendors do not seem to be very advanced. If a command is executed that makes the router unreachable from all other nodes on the network there are only two options: connect a console cable or power cycle the device. Connecting a console cable isn’t all that hard if you happen to be at the same physical location or if there is some form of out of band access to the console port. A good example of console port out of band access is the console servers produced by Cyclades. By having one of these units at each location with an attached modem the administrator can dial in to diagnose and repair the problem remotely. Of course, this requires the existence of a separate network for out of band access. With the convergence of the IP and PSTN networks I wonder where this out of band access will come from in the future.

For now let’s assume there is no out of band communication to the device. How can we recover from router configuration mistakes? A common network administration practice is to save the known good configuration to the device’s flash memory and then schedule the device to reboot after some short time interval; usually this time period would be five minutes or less. At this point changes can be made. If the changes are successful the scheduled reboot can be canceled. If the changes were not successful the scheduled reboot will bring the network back to a functional state but will result in a temporary loss of service. This method gives network administrators a crude form of temporal redundancy. It is possible that spatial redundancy (having multiple links and routers serve each customer) will hide the fact that the router was temporarily out of service. However, spatial redundancy can often be prohibitively expensive in the network world. Good routers and bridges are not cheap and neither is burying new fibre.

So what can the router vendors do to cope with human errors?

One possibility is the addition of a low level administrative interface that operates at the link layer between routers. Such an interface allows communication with the affected node from an adjacent node as long as the link layer is still operational. I have seen this feature on some business class DSL shelves and modems. Though useful in many situations, a link layer administrative interface does not allow the administrator to recover from changes that negatively affect link layer connectivity. In reality, this is just another form of out of band access anyway.

Even when out of band access methods exist, human intervention is still necessary for the system to recover. Something more automatic is required.

One possibility is leaving the CLI world and using some form of administrative client program. This would allow for a more human error tolerant communication channel between the administrator and the router. For example the router could respond to every command with a ‘command completed’ message sent to the client. The client would then acknowledge this message. If the router does not receive this acknowledgment in a set amount of time the change is then automatically reversed restoring connectivity between the administrator and the router. I know of no system that implements this idea but I wouldn’t be surprised if it has been implemented somewhere.

Another option is to have the router take a snapshot of current network traffic and other operational statistics immediately before a command is executed. If these operational statistics change negatively after a configuration change has been applied the router could automatically undo the change. The biggest problem with this approach is defining exactly what would constitute a negative effect.

Both of the above solutions have some merit but the solution I am most fond of is adding a feature that simply allows all changes to be reversed after a set amount of time. This is very similar to the scheduled reboot approach that was discussed earlier. The advantage of automatically reversing the change is that it would not have all of the negative effects of a complete system restart. Most modern, high-end network equipment has the ability to function as both a router and a bridge. This allows the same physical interfaces that carry routed layer three packets to also carry VLANs (layer two bridging). In many situations, configuration changes can affect the ability of the device to route packets but not affect the forwarding of layer two packets. The most common example of this is making a typo when entering the IP address during a device re-addressing. The scheduled reboot feature will allow for recovery in this situation but it also means that the forwarding of the VLAN traffic stops during the reboot. If it were possible to schedule a short interval after which the router undoes any changes that have been applied, the interruption of layer two forwarding could be avoided. I expect this feature would be more difficult to implement reliably than it first appears. Many commands could be issued after the undo has been scheduled. Not all of them can be undone in the opposite order from which they were applied. Some kind of command dependency data may be required to compute a safe set of commands to return the device to the previous state. In situations where no safe undo commands could be calculated the router could simply fall back on rebooting. Perhaps this feature has been implemented in a router model that I have not yet had the opportunity to manage.
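To make the timed undo idea concrete, here is a minimal shell sketch. It is not how any router implements this; a plain text file stands in for the running configuration, the function and file names are hypothetical, and the whole file is restored at once, which sidesteps the command-ordering problem described above.

```shell
# Sketch only: a plain text file stands in for a router's running
# configuration. All function and file names here are hypothetical.

# begin_change FILE SECONDS
# Save a known-good copy of FILE, then start a background timer that
# restores it after SECONDS unless confirm_change is called first.
begin_change() {
    cp "$1" "$1.known-good"
    # Output is discarded so the timer does not hold the terminal or a pipe open.
    ( sleep "$2" && cp "$1.known-good" "$1" ) >/dev/null 2>&1 &
    echo $! > "$1.rollback-pid"
}

# confirm_change FILE
# Cancel the pending rollback so the edited FILE stays in effect.
confirm_change() {
    kill "$(cat "$1.rollback-pid")" 2>/dev/null || true
    rm -f "$1.rollback-pid" "$1.known-good"
}
```

The intended use is to call begin_change with the file and a timeout, edit the file, and call confirm_change only once the change is known to be good. If the administrator is locked out and never confirms, the timer puts the old file back without restarting anything else.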

Like the rest of the IT world it appears that router vendors have a long way to go before their products can cope with human errors automatically.

Alan Kay quote

If you look at software today, through the lens of the history of engineering, it’s certainly engineering of a sort – but it’s the kind of engineering that people without the concept of the arch did. Most software today is very much like an Egyptian pyramid with millions of bricks piled on top of each other, with no structural integrity, but just done by brute force and thousands of slaves.

— Alan Kay, ACM Queue, Vol. 2, No. 9

x86_64 FC4 and Open Office

While attempting to compile some software on my x86_64 FC4 system I ran into a strange problem. For some reason the compile was trying to link against an i386 library. My first thought was why are there i386 libraries on my x86_64 Linux installation? Well it turns out that OpenOffice is not 64-bit clean. So, in order to have OpenOffice in x86_64 FC4 all libraries on which OpenOffice depends must be present in i386 form. This leads to duplication since the rest of the system wants the x86_64 versions. Of course this wastes a bit of disk space but disks are cheap. What is more unfortunate is that loading the i386 version of OpenOffice requires a whole bunch of i386 libraries to be loaded into memory when x86_64 equivalents are already loaded.

Lately, I have been using Gnumeric and Abiword for my office application needs so I do not require OpenOffice. Thus, removing OpenOffice and all other i386 packages from my system was the simple solution to my library linking problems.

Gnumeric and Abiword are available in the extras repository; just run “yum install gnumeric abiword”.

Terror and Liberalism

I just finished reading Terror and Liberalism by Paul Berman. The primary thesis of this book is that the current wave of Islamist (Berman is careful to distinguish between Islam and Islamism) extremism and terrorism is a continuation of the anti-liberal movements of last century. Basically, Berman argues that the same instincts that drove Mussolini, Hitler and even Stalin are at the heart of the terrorist Islamist movements. Also interesting is the idea that communism and fascism, opposites on the left-right political map, are actually two tendrils of the same beast. Both are reactions to liberal societies.

Berman also spends considerable time discussing the prominent thinkers of the Islamist movements, including Qutb, who believed the truly dangerous part of American life was not capitalism but the separation of Church and State. I would love to provide a summary of this section and many others but I doubt I could do the book justice.

I really enjoyed this book and would happily recommend it to anyone interested in the subject.

Below are a few quotes from the book that may be of interest.

Fascism and communism were violent enemies of each other – bitter opposites. But, caught in a certain light, the bitter opposites looked oddly similar. . . . Was it possible that fascism and communism were somehow related? Mightn’t both of those movements have evolved out of some other, deeper, primordial inspiration?
— Page 22

On the liberal ideals of Europe and North America before the First World War.

It was an insistence on freedom of thought and freedom of action – not on absolute freedom, but on something truer, stronger, and more reliable than absolute freedom, which is relative freedom: a freedom that recognizes the existence of other freedoms, too. Freedom consciously arrived at. Freedom that is chosen, and not just bestowed by God on high.
— Page 38

On the results of World War I.

Every last thing that people in the nineteenth century had believed about human advancement, the conviction that progress was inevitable, the satisfied belief that Western Europe and North America had discovered the royal road to wealth and freedom and that everyone else was bound to follow sooner or later, the grand optimism, the feeling of certainty on behalf of all the world – every brick in that magnificent edifice came tumbling down.
— Page 41

Totalitarian movements always, but always, rise up in rebellion against the liberal values of the West. That is their purpose.
— Page 99

In the totalitarian movements of the twentieth century, everyone has thought about the First World War and its aftermath. For those were the years when the liberal project of the nineteenth century finally went to pieces – the years when the simpleminded principles of rational thought and inevitable progress began to look, in their ingenuousness, grotesque and mendacious. Those were the years, in the immediate aftermath of the world war, when the new mass movements arose for no other purpose than to declare the old liberal project of the nineteenth century a lie – a gigantic deception foisted on mankind in the interest of plunder, devastation, conspiracy, and ruin.
— Page 118

The suicide bombings produced a philosophical crisis among everyone around the world who wanted to believe that a rational logic governs the world – a crisis for everyone whose fundamental beliefs would not be able to acknowledge the existence of pathological mass political movements.
— Page 143

What do the citizens of a proper liberal society feel in their hearts? A passion for solidarity and self-government. What do those citizens do? They devote themselves to those principles, until the last measure, if necessary. Liberalism is a doctrine that, in the name of tolerance, shuns absolutes; but liberalism does not shun every absolute.
— Page 170

The whole purpose of totalitarianism, Schlesinger wrote in 1949, was to combat the “anxiety” that is aroused by the lure of other, better ideas.
— Page 190

FC4 and CD verification

For the last several versions the Fedora Core (and previously RedHat) distribution has had the ability to verify that the downloaded CD images were successfully transferred to the newly burned discs. For people who download the images and create CDs themselves this is a fabulous feature; I am sure it has saved people from broken installations. However, as I discovered, it can also lead to a bit of pain.

Last week I downloaded all of the FC4 disc images and proceeded to burn them to CD. After rebooting to install using the new media I discovered that the CD verification was failing for three of the five discs. So, I burned them again. Same result. Having used the CD verification for many years I had no reason to doubt it. Eventually I gave up and asked Bob to burn me a copy. Strangely, these CDs failed the verification phase as well.

Realizing that something strange was going on I started googling for similar experiences. It turns out that the CD verification can fail on certain hardware. I had simply never run into this problem before because this was my first Fedora install on my new computer.

The solution is to boot the installation kernel with an option which tells it not to use DMA for IDE devices. At the GRUB prompt type ‘linux ide=nodma’. After doing this all discs passed their tests. There is one catch though: the Fedora installer is quite smart. If you use a kernel option to do the installation the installer decides this option must be required for successful operation and keeps it in the installed system’s boot configuration. After installation I had to remove “ide=nodma” from /etc/grub.conf.

If the above wasn’t enough of an adventure I also managed to cause myself some extra pain. When I asked for a copy of FC4 to be created for me I never specified which version. My new computer has an x86_64 processor. The FC4 installation discs I borrowed were for the i386 version. After a day or so of use I realized the mistake and reinstalled with the discs that first caused the problems.

Graduation

Last Thursday was convocation (graduation) day at UWO. This year convocation turned out to be quite a media event due to the presence of Dr. Henry Morgentaler.

London Free Press article
Article from UWO’s website
Text of Dr. Morgentaler’s speech

Dr. Morgentaler was speaking for the morning convocation ceremony so I did not get to hear his speech. The speaker for the afternoon ceremony was Dr. Bessie Borwein. The full text of her speech can be found here.

“Bachelor of Science – Honors Computer Science with distinction” is what the paper now hanging on my wall says. I also received the University of Western Ontario Gold medal for Honors Computer Science for having the highest average in the program.

I was not planning on attending convocation but I am now glad I did. Missing the ceremony would have probably been a source of regret later in life. It is strange to me how much I now value that piece of paper hanging on the wall.

I have put some pictures from the day in my photo gallery. No, there are no Morgentaler pictures there.

Today’s bash lesson

Today I noticed that a bash script I created a couple of weeks ago to calculate some log file statistics was no longer working properly.

The culprit was the following line:

REGEX=`printf "^%s %i %.2i:%.2i\n" ${MONTH} ${DAY} ${HOUR} ${TMP_MIN}`

This regular expression was designed to match lines that start with a certain date and time. Lines like the following:

May 28 12:02

The regular expression was no longer matching the log file entries I wanted it to. The problem was the day value. When I wrote and tested the script the day was double digits. Now, at the beginning of a new month, the days are single digits. The log file I was analyzing pads the day field to always be two digits wide. Thus, the regular expression no longer matched the lines because there was an unexpected space. Figuring out what was broken and fixing the regular expression didn’t take long.

REGEX=`printf "^%s %2i %.2i:%.2i\n" ${MONTH} ${DAY} ${HOUR} ${TMP_MIN}`

The script now worked properly but I noticed something weird. I had added a debugging statement below the REGEX definition.

echo ${REGEX}

This debugging statement was printing the regular expression with one space between the month and day, not the expected two. Yet, the script appeared to work perfectly. What was going on?

In trying to figure this out I went as far as creating a quick C program so that I could see exactly what was being passed in as the arguments.

Of course, it turned out to be something that should have been obvious. The bash echo command prints out each of the passed arguments with a single space between them. Because the variable was unquoted, bash split the string in the REGEX variable on whitespace and passed each part to echo as an individual argument. The ‘fix’ was to enclose the bash variable in quotes so that it would be considered a single argument.

echo "${REGEX}"

Bash programming rule #x: Always put double quotes around variables.
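Both halves of this lesson can be reproduced in a few lines. The values below are made up stand-ins for the script’s variables; the point is only to show the padded day and the effect of quoting.

```shell
# Hypothetical values standing in for the script's variables.
MONTH=Jun DAY=3 HOUR=9 TMP_MIN=5

# %2i pads the day with a space to two characters, matching the log format.
REGEX=$(printf "^%s %2i %.2i:%.2i" "${MONTH}" "${DAY}" "${HOUR}" "${TMP_MIN}")

# Unquoted: the shell splits the value on whitespace, so echo receives
# three arguments and rejoins them with single spaces.
UNQUOTED=$(echo ${REGEX})

# Quoted: echo receives one argument and the double space survives.
QUOTED=$(echo "${REGEX}")

echo "unquoted: ${UNQUOTED}"
echo "quoted:   ${QUOTED}"
```

The unquoted line shows one space between the month and day while the quoted line shows two, which is exactly the mismatch that had me confused.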

Vim tips for DOS text files

DOS (Windows) uses CR-LF to mark the end of lines in text files. Unix just uses LF. Wikipedia has a long article on these differences if you are interested.

Viewed in older versions of vim, DOS text files had a ^M at the end of every line. This made identification of text files that had been uploaded via binary mode FTP very easy. It seems recent versions of vim auto-detect the text file type and no longer show the ^M by default.

Vim can be told to not try the DOS text file type with the ‘:set fileformats=unix’ command. If you set this option DOS text files will have the familiar ^M at the end of each line.

The text file type can be changed to Unix for the current buffer (file being edited) by ‘:set fileformat=unix’. Opening a DOS text file, setting the type to be Unix and then saving the file will convert it to a Unix text file.
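The same conversion can be done outside of vim. This is a minimal sketch using the standard tr utility rather than anything vim-specific, and the file names are made up for the example. Note that it removes every carriage return in the file, not only those at line ends.

```shell
# Create a small DOS-style file (hypothetical name) with CR-LF line endings.
printf 'one\r\ntwo\r\n' > dos.txt

# Strip every carriage return, leaving Unix LF line endings.
tr -d '\r' < dos.txt > unix.txt
```

Comparing the sizes of the two files is a quick sanity check: the Unix copy should be smaller by one byte per line.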