As service expectations rise for IP networks, Juniper Networks has focused on developing the continuous systems operators need to keep their networks running 24 hours a day, seven days a week. A continuous systems approach means that the Juniper engineers explore the different causes of service degradation and outages—whether resulting from planned maintenance, unexpected failures, or human factors—and find ways to avert and mitigate them for delivery of high uptime.
Continuous Systems for Delivery of High Network Uptime
As service expectations continue to rise for IP networks, Juniper Networks has gone beyond the view of high availability as simply redundant equipment or links to develop a continuous systems approach. The continuous systems perspective is a holistic view of high availability for the delivery of applications and services without disruption or degradation.
Developing these continuous systems means more than developing individual features, protocols, or products. It means identifying the many potential sources of service degradation and outages—whether resulting from planned maintenance, unexpected failures, or human factors—and finding ways to avert and mitigate them for delivery of high uptime. During the product development cycle, Juniper engineers consider each device’s redundancy, failover mechanisms, and operations to develop functions that support seamless service continuity.
Some of the resulting mechanisms are well known, such as Juniper’s separate routing and forwarding planes and the modular architecture of the JUNOS software. Others are more distinctive to Juniper, including our disciplined development process, error-resilient configuration, autoscripting abilities, unified in-service software upgrade, and deep automation of interaction with customer support systems.
Many different types of events and errors can cause disruption to network availability. This article explores different causes of network downtime and the mechanisms and features of JUNOS software for delivering high network uptime. We start with the planned events that operators set up to maintain and upgrade their equipment, move into unplanned downtime caused by device, link and system failures and then move to what many believe is the most frequent cause of downtime—human factors.
Reducing the Time and Risk of Planned Events
Carrying the essential communications and operations of global businesses running numerous automated applications, the network no longer experiences off-peak traffic periods in which to slot scheduled outages for maintenance or upgrades. Operators must find ways to reduce the time and risk of planned maintenance events by avoiding the need for fixes and streamlining changes and upgrades.
The Juniper approach to helping operators reduce planned maintenance begins with the highly disciplined development process of JUNOS software. New versions follow a single release train in which developers deliver a superset of features. Juniper engineers adhere to high standards of development to maintain this single train. For instance, they only add features to new releases and use extensive regression testing so that they know if new code has unexpectedly and critically affected a previously working feature, and they fix the problem before releasing the new version.
The disciplined process brings inherent stability to the system that is not possible with other software options, reducing the time that operators must schedule for its planned maintenance. Moreover, this methodical approach ensures a well-understood, extensively tested code base on which to build new continuous system mechanisms. Helping to further reduce the time and risk of upgrades, Juniper is now delivering unified in-service software upgrade (unified ISSU). Unified ISSU enables the complete upgrade from one JUNOS software version to another on supported platforms with dual Routing Engines without disruption on the control plane and with minimal disruption of traffic. The advantages of unified ISSU are covered more completely in this month’s Technical Insider column.
Reducing the Number, Duration and Severity of Unplanned Events
Another opportunity area for increasing uptime, well-recognized by the industry, is to find ways to reduce the number, duration and severity of unplanned events that occur due to failures in the devices, links and systems of the network.
Devices running JUNOS software have a well-deserved reputation for continuous performance and operational stability. The inherent stability of the single release train and the modular design of the software contribute to its high uptime in the field. From the very beginning, Juniper developers focused on building a modular operating system that provides the inherent fault tolerance to resist internal failures.
In addition to these engineering elements, JUNOS software offers High Availability features to minimize downtime triggered by many different types of unplanned events. These features include automated mechanisms for rapid detection and response to events, fast failover to redundant systems, self-healing of networks, scripted diagnostics and even restoration, among others.
For example, the automation afforded by JUNOS software event policies allows network engineers to increase their level of active monitoring of the network. They can create early warning systems that not only detect emerging problems but, with operations scripts, also take immediate steps to diagnose the root cause, make changes to avert further issues and outages, and help operators restore normal operations more quickly. With these tools, operations teams can capture operational procedures in scripts instead of on paper, sharing scarce expertise widely by automating their troubleshooting knowledge and recovery plans. Moreover, scripting enables continuous improvement: as each network outage is diagnosed, the most experienced engineers can script the proactive avoidance steps that help to prevent repeat occurrences.
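The event-policy idea described above can be pictured with a small Python sketch. This is only an illustration of the pattern (an event name mapped to a list of automated actions), not JUNOS syntax or any Juniper API; the class `EventPolicies`, the event name `link-down`, and both action functions are hypothetical names chosen for the example.

```python
from collections import defaultdict


class EventPolicies:
    """Hypothetical dispatcher: maps an event name to automated actions."""

    def __init__(self):
        self._policies = defaultdict(list)
        self.log = []  # results of every action that has run

    def on(self, event_name, action):
        # Register an action (a stand-in for an operations script).
        self._policies[event_name].append(action)

    def handle(self, event_name, details):
        # Run every action registered for this event, in registration order.
        actions = self._policies.get(event_name, [])
        for action in actions:
            self.log.append(action(details))
        return len(actions)


def collect_interface_diagnostics(details):
    # Placeholder for a scripted diagnostic step.
    return f"collected diagnostics for {details['interface']}"


def notify_operations(details):
    # Placeholder for alerting the operations team.
    return f"notified operations about {details['interface']}"


policies = EventPolicies()
policies.on("link-down", collect_interface_diagnostics)
policies.on("link-down", notify_operations)
policies.handle("link-down", {"interface": "uplink-1"})
```

Capturing the response as code rather than as a paper procedure is what lets a team add a new action after each diagnosed outage.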
Recently developed functionality to reduce downtime from unplanned events includes automation of responses to OAM events, automated interaction with Juniper’s customer support team, and the transparent switchover of routing and bridging functions (discussed in this month’s Technical Insider column). These automated mechanisms further expand the operator toolset for fast and often proactive response to avert and mitigate downtime caused by unexpected system and network events.
Averting Downtime Caused by Human Factors
Less explored by most networking vendors is what many argue is the greatest source of downtime: human factors. With the complexity of modern networks, it’s all too easy for even an experienced engineer to put a firewall across the wrong interface (like the one they are using to communicate with the router), mistype an IP address on a filter list, enter just one line of a lengthy command set in the wrong order, or misconfigure a service with a syntax error or missing argument. Detailed procedure manuals and double-checking can alleviate some of these issues, but at the expense of slower response times. During an emergency, pressure and frequent interruptions can significantly increase the likelihood of an error.
Networking vendors have historically left human error issues to their customers, offering only basic training and knowledge bases to help them manage when something goes wrong. Juniper Networks has maintained a long-standing focus on the human factor aspects of operations in its JUNOS software by simplifying and automating key processes that can be prone to human error.
For example, the JUNOS CLI is easy to learn, with a feel that is similar to other command sets. Prominent improvements over other systems include multiple features for error-resilient configuration: changes are stored in a candidate file, operators can roll back to any of 50 prior configurations, and automated rollback can be triggered in remote systems accidentally isolated during configuration changes.
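A minimal Python sketch can make the candidate-and-rollback idea concrete. It is a toy model under stated assumptions, not the JUNOS implementation: the class `ConfigStore`, its method names, and the `timeout()` hook that stands in for an expired confirmation window are all hypothetical. Only the limit of 50 prior configurations comes from the text above.

```python
from copy import deepcopy

MAX_ROLLBACKS = 50  # the article cites rollback to 50 prior configurations


class ConfigStore:
    """Toy model of a candidate/active configuration with rollback history."""

    def __init__(self, initial):
        self.active = deepcopy(initial)
        self.candidate = deepcopy(initial)
        self.history = []  # history[0] is the most recently replaced config
        self._pending_confirm = False

    def set(self, key, value):
        # Edits touch only the candidate; the active config is unchanged.
        self.candidate[key] = value

    def commit(self, confirmed=False):
        # Activate the candidate and remember the configuration it replaces.
        self.history.insert(0, deepcopy(self.active))
        del self.history[MAX_ROLLBACKS:]  # keep at most 50 prior configs
        self.active = deepcopy(self.candidate)
        self._pending_confirm = confirmed

    def confirm(self):
        # The operator can still reach the device, so keep the change.
        self._pending_confirm = False

    def rollback(self, n):
        # Load the n-th prior configuration (1 = most recent) as the candidate.
        self.candidate = deepcopy(self.history[n - 1])

    def timeout(self):
        # Assumed hook: the confirmation window expired without confirm(),
        # so restore the previous configuration automatically.
        if self._pending_confirm:
            self.rollback(1)
            self.commit()


store = ConfigStore({"mgmt-filter": "allow"})
store.set("mgmt-filter", "deny-all")  # a mistake that would isolate the device
store.commit(confirmed=True)
store.timeout()                       # no confirmation arrives: auto rollback
```

Because edits land in the candidate first, a mistake stays harmless until commit, and a commit that cuts off the operator is undone without anyone needing access to the device.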
The most frustrating human errors are those that have happened before, because they repeat known mistakes that operations teams could ideally prevent. JUNOS software commit scripts directly address this challenge by letting teams customize the commit verifications that run before a candidate configuration becomes active. Lead engineers of the operations teams can develop a library of scripts to ensure that configurations comply with business and network policies. Moreover, these advanced scripting tools include a macro capability that can condense repeated complex configurations to only a few configuration lines and variables, saving teams valuable time in setting up and changing configurations.
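The sketch below illustrates the general shape of such pre-commit verification and macro expansion in Python. It does not use the JUNOS commit script language or any Juniper API; the configuration keys, the two check functions, and `expand_customer_macro` are hypothetical names invented for the example. The two checks mirror errors mentioned earlier: a mistyped IP address on a filter list and a filter placed across the interface used to reach the device.

```python
import ipaddress


def check_addresses(config):
    """Reject filter entries that are not valid IP addresses or networks."""
    errors = []
    for entry in config.get("filter_sources", []):
        try:
            ipaddress.ip_network(entry, strict=False)
        except ValueError:
            errors.append(f"invalid address in filter list: {entry}")
    return errors


def check_management_access(config):
    """Reject a filter applied to the interface used to reach the device."""
    mgmt = config.get("management_interface")
    if mgmt in config.get("filtered_interfaces", []):
        return [f"filter would block management interface {mgmt}"]
    return []


# The team's library of checks; each diagnosed outage can add a new one.
CHECKS = [check_addresses, check_management_access]


def verify(config):
    """Run every check; an empty result means the commit may proceed."""
    errors = []
    for check in CHECKS:
        errors.extend(check(config))
    return errors


def expand_customer_macro(name, vlan):
    """Expand two variables into a repeated block of related settings."""
    return {
        f"interface-{name}": {"vlan": vlan, "description": f"customer {name}"},
        f"filter-{name}": {"apply_to": f"interface-{name}"},
    }


candidate = {
    "filter_sources": ["10.0.0.0/24", "10.0.0.300"],  # second entry mistyped
    "management_interface": "mgmt0",
    "filtered_interfaces": ["mgmt0"],
}
problems = verify(candidate)  # non-empty, so this candidate is rejected
```

Keeping the checks in one list reflects the library idea in the text: each recurring mistake becomes a function that every future commit runs automatically.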
The Bottom Line
The engineering foundations of continuous systems are rooted in the long-standing design and development philosophies of JUNOS software; this is not a feature or attribute that can be easily retrofitted. As part of our move to continuous systems, Juniper Networks has conceived, designed, and implemented a suite of automated mechanisms and procedures that speed recognition and response to system problems to deliver high uptime for the applications and services entrusted to your high-performance networks.
For more information about Juniper’s Continuous Systems methodology and technical advances, read our white paper, What's Behind Network Downtime?