The November 2004 issue of ACM Queue contains an article entitled Coping with Human Error in IT Systems by Aaron B. Brown of IBM Research. This article got me thinking about how modern routers cope with human errors.
One of the first nuggets of knowledge comes early in the article.
Human error happens for many reasons, but in the end it almost always comes down to a mismatch between a human operator’s mental model of the IT environment and the environment’s actual state.
This statement is as obvious as it is important. From my experiences managing large, complex networks it couldn’t be more true. Thinking about it, I can trace almost all of my mistakes that have caused service interruptions to not properly understanding the network topology or even the hardware and software involved. This idea also emphasizes how important it is that the people working together to manage a network or any other complex system maintain close contact and always communicate changes to ensure each individual understands the current state of the system.
The article discusses four approaches for coping with human error: error prevention, spatial replication, temporal replication and temporal replication with re-execution. Error prevention in this context refers to better training of the error prone humans as well as tools that reduce errors. Spatial replication involves having multiple copies of the data; think RAID here. In temporal replication the system state is replicated in time. For example saving the system state every five minutes would provide temporal replication. Your daily backups (you do daily backups right?) are temporal replication. Temporal replication with re-execution adds the ability to replay the changes that have happened since the last replica was saved to recover from human errors.
While discussing error prevention the author says:
A good example of this error interception can be seen in the way that many e-mail clients can be configured to batch and delay sending outgoing mail for several minutes, providing a recover window during which an erroneously or rashly sent message can be recalled, discarded, or edited. This and similar buffering-based strategies are particularly effective because they leverage the human ability to self-detect errors: psychologists report that 70 to 86 percent of errors can be detected immediately after they are committed, even if they cannot be anticipated.
This paragraph got me thinking about the differences in the command line interfaces (CLIs) used by the various router vendors.
By far the best known and understood router CLI is from Cisco IOS. This interface has so much momentum that many vendors basically copy the IOS CLI and proudly state that their CLI interface is almost identical to IOS. Foundry is one such vendor. The IOS CLI goes completely against the above quoted paragraph. Any issued command is executed immediately. On many occasions I have been bitten by this behavior. There is nothing worse than hitting the enter key on command that changes an interface IP just as you realize that you fat fingered the IP address. More complex changes, such as modifications to network routes, also seem to have a way of becoming obvious problems just as you hit enter. The instant apply CLI also has a nasty way of making two inter-dependent commands very difficult to execute.
Contrast this with the CLI used on the Alteon application switches. Changes made in this CLI do not take effect immediately. At any point the ‘diff’ command allows the operator to see all of the pending configuration changes. The ‘apply’ command makes the pending changes take effect. The article goes on to illuminate the primary problem that I have with delayed apply CLIs like the Alteon’s:
Error interception can also create confusion by breaking the immediate-feedback loop that people expect in interactive scenarios – imagine the havoc that a two-minute command execution delay would cause for an operator working at a command line to troubleshoot a system outage.
On many occasions I have issued commands to the Alteon switch and waited for them to take effect having forgotten to run ‘apply’. Whether this is simply because I too have become accustomed to the IOS way of doing things I do not know. On reflection, the ability to use the ‘diff’ command to preview all pending changes has probably prevented some of my errors from causing operational problems.
Even if we assume that the non-instant apply CLIs do prevent some errors from becoming operational problems there are network changes that no amount of previewing will prevent. These errors are not typos or the occasional brain-dead moment but changes that interact with other systems in unexpected ways. To prevent errors of this type some form of replication is required. In this respect, the router vendors do not seem to be very advanced. If a command is executed that makes the router unreachable from all other nodes on the network there are only two options: connect a console cable or power cycle the device. Connecting a console cable isn’t all that hard if you happen to be at the same physical location or if there is some from of out of band access to the console port. A good example of console port out of band access is the console servers produced by Cyclades. By having one of these units at each location with an attached modem the administrator can dial-in to diagnose and repair the problem remotely. Of course, this requires the existence of a separate network for out of band access. With the convergence of the IP and PSTN networks I wonder where this out of band access will come from in the future.
For now lets assume there is no out of band communication to the device. How can we recover from router configuration mistakes? A common network administration practice is to save the known good configuration to the device’s flash memory and then schedule the device to reboot after some short time interval; usually this time period would be five minutes or less. At this point changes can be made. If the changes are successful the scheduled reboot can be canceled. If the changes were not successful the scheduled reboot will bring the network back to a functional state but will result in a temporary loss of service. This method gives network administrators a crude form of temporal redundancy. It is possible that spatial redundancy (having multiple links and routers serve each customer) will hide the fact that the router was temporarily out of service. However, spatial redundancy can often be prohibitively expensive in the network world. Good routers and bridges are not cheap and neither is burying new fibre.
So what can the router vendors do to cope with human errors?
One possibility is the addition of a low level administrative interface that operates at the link layer between routers. Such an interface allows communication with the effected node from an adjacent node as long as the link layer is still operational. I have seen this feature on some business class DSL shelves and modems. Though useful in many situations, a link layer administrative interface does not allow the administrator to recover from changes that negatively effect link layer connectivity. In reality, this is just another form of out of band access anyway.
Even when out of band access methods exist, human intervention is still necessary for the system to recover. Something more automatic is required.
One possibility is leaving the CLI world and using some form of administrative client program. This would allow for a more human error tolerant communication channel between the administrator and the router. For example the router could respond to every command with a ‘command completed’ message sent to the client. The client would then acknowledge this message. If the router does not receive this acknowledgment in a set amount of time the change is then automatically reversed restoring connectivity between the administrator and the router. I know of no system that implements this idea but I wouldn’t be surprised if it has been implemented somewhere.
Another option is to have the router take a snapshot of current network traffic and other operational statistics immediately before a command is executed. If these operational statistics change negatively after a configuration change has been applied the router could automatically undo the change. The biggest problem with this approach is defining exactly what would constitute a negative effect.
Both of the above solutions have some merit but the solution I am most fond of is adding a feature that simply allows all changes to be reversed after a set amount of time. This is very similar to the scheduled reboot approach that was discussed earlier. The advantage of automatically reversing the change is that it would not have all of the negative effects of a complete system restart. Most modern, high-end network equipment has the ability to function as both a router and a bridge. This allows the same physical interfaces that carry routed layer three packets to also be carrying VLANs (layer two bridging). In many situations, configuration changes can effect the ability of the device to route packets but not effect the forwarding of layer two packets. The most common example of this is making a typo when entering the IP address during a device re-addressing. The scheduled reboot feature will allow for recovery in this situation but it also means that the forwarding of the VLAN traffic stops. If it were possible to schedule a short interval after which the router undoes any changes that have been applied, the interruption of layer two forwarding could be avoided. I expect this feature would be more difficult to implement reliably that it at first appears. Many commands could be issued after the undo has been scheduled. Not all of them can be undone in the opposite order from which they were applied. Some kind of command dependency data may be required to compute a safe set of commands to return the device to the previous state. In situations where no safe undo commands could be calculated the router could simply fall back on rebooting. Perhaps this feature has been implemented in a router model that I have not yet had the opportunity to manage.
Like the rest of the IT world it appears that router vendors have a long way to go before their products can cope with human errors automatically.