Intense competition and growing demand for ever more complex services have made end-users increasingly intolerant of network outages of any kind. With this in mind, Juniper has a fixed focus on developing continuous systems that do not disrupt or degrade services. This month, we examine two continuous system mechanisms of Juniper: nonstop active routing (NSR), which provides uninterrupted routing during a switchover of the routing engine, and unified in-service software upgrade (unified ISSU), which enables operators to upgrade the entire operating system without interrupting routing.

Going Beyond High Availability to Develop Continuous Systems

As service providers converge high-demand, critical services onto IP infrastructures, even a relatively small disruption can harm an end user’s perception of those services. And with automated applications (transactions, synchronization and backups), widespread globalization and other drivers of round-the-clock traffic, today’s always-on network no longer has off-peak periods in which to schedule outages for maintenance or upgrades. With network outages of any kind unacceptable, whether unplanned or planned, high availability has taken on critical importance for service providers.

With this in mind, the engineers and developers at Juniper Networks have gone beyond the concept of high availability to focus on developing continuous systems that broadly consider how to avert disruption and degradation of services. The design approach considers the many potential sources of downtime and provides fail-safe mechanisms for when problems do occur, along with tools that speed and automate identification, isolation and recovery, and that can respond proactively to avert failures before they happen.

This month, we examine two continuous system mechanisms of Juniper: nonstop active routing (NSR), which leverages redundant Routing Engines (REs) to provide uninterrupted routing, and unified in-service software upgrade (unified ISSU), which enables the complete upgrade from one software release to another with no disruption to routing.

Enhancing Uptime with Nonstop Active Routing

The ability of network devices to switch Routing Engines transparently, without disrupting routing or forwarding, is a requirement for continuous systems. Today, many routers employ redundant Routing Engines so that a backup processor can take over operations if the primary Routing Engine fails.

Graceful Routing Engine switchover (GRES) is the essential mechanism that allows the backup Routing Engine to automatically assume mastership of the routing and system control functions, with no disruption of packet forwarding. Although this effectively eliminates packet loss on the affected router during the switchover, GRES is only a partial solution. Neighboring routers still detect the switchover from the primary to the backup Routing Engine and react accordingly: every adjacent device processes the link/node topology change, performs best path selection and updates all of its neighbors, which in turn update their neighbors until the entire network has received the update. Then, when the new master starts routing for the node, the process repeats. The result is a lot of network churn and processing for no effective change.

Graceful restart provides extensions to routing protocols that enable adjoining peers to recognize a switchover as a transitional event, so that they do not begin reconverging network paths. When graceful restart has been negotiated across a routing adjacency, the neighbors do not update their own neighbors when a node stops routing. Instead, they enter an actively monitored wait state. During this wait, the node that has stopped routing is assumed to be still forwarding traffic using preserved state, a capability often called nonstop forwarding (NSF). The graceful wait interval is configured by the user and negotiated between the nodes; it is often several seconds long. Because forwarding during this interval is not backed by active routing, the restarting NSF node may send traffic toward a destination that is no longer valid, which is often called blackholing traffic.

The graceful restart solution requires that all peers run the standardized protocol extensions, and a change in the network can cause graceful restart to stop. If any connected router does not support the graceful restart protocol extensions (or was mistakenly not configured to use them), that node responds immediately to the absence of routing and propagates the routing change to the network. Even the nodes that are in the graceful wait period receive the topology change, and because it arrives over a link that is not directly connected to the restarting node, they process the routing change, exit graceful wait and cause the network to converge.
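The all-or-nothing nature of graceful restart can be illustrated with a toy model. The sketch below is not a protocol implementation; the `Peer` class, the `peer_actions` function and the action labels are hypothetical names chosen for illustration, and it assumes (as the paragraph above describes) that every waiting helper eventually hears the change advertised by a non-participating peer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Peer:
    name: str
    # True only if the extension is both supported and configured (assumption).
    graceful_restart: bool


def peer_actions(peers: List[Peer]) -> Dict[str, str]:
    """Return what each neighbor does when the restarting node stops routing."""
    # One non-participating peer is enough to advertise the topology change.
    someone_advertises = any(not p.graceful_restart for p in peers)
    actions = {}
    for p in peers:
        if not p.graceful_restart:
            actions[p.name] = "advertise-change"
        elif someone_advertises:
            # The helper learns of the change indirectly and gives up waiting.
            actions[p.name] = "exit-wait-and-reconverge"
        else:
            actions[p.name] = "graceful-wait"
    return actions


if __name__ == "__main__":
    print(peer_actions([Peer("A", True), Peer("B", True)]))
    print(peer_actions([Peer("A", True), Peer("B", False)]))
```

In this model, graceful wait holds only when every peer participates; a single misconfigured neighbor is enough to make the others reconverge as well.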

Nonstop active routing (NSR) is an alternative to graceful restart that offers multiple advantages: it is transparent to network peers, does not require peer participation, does not drop adjacencies or sessions, has minimal impact on convergence, and allows the switchover to occur at any point, no matter how much routing is in flux.

The big difference from a system architecture point of view is that both Routing Engines actively run the routing processes and receive routing messages from the network neighbors. Selecting the master is then a matter of choosing one of the two running Routing Engines and connecting its outbound message queue to the network so that it communicates with the neighbors. Nonstop active routing is self-contained and does not rely on helper routers (as graceful restart does) to assist the routing platform in restoring routing protocol information. Nonstop bridging extends these benefits to the Layer 2 protocols implemented in Ethernet switching. Together these features enable a Routing Engine switchover that is transparent to neighbors, maintaining Layer 2 and Layer 3 stability for supported platforms and protocols.
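The architecture described above can be sketched as a small simulation. This is a conceptual model only, not how JUNOS is implemented: the `RoutingEngine` and `Router` classes, the message strings and the interface names in the usage are all hypothetical. It shows both engines processing every inbound message while only the master's outbound queue reaches the network.

```python
class RoutingEngine:
    def __init__(self, name):
        self.name = name
        self.routes = {}     # prefix -> next hop, built independently per engine
        self.outbound = []   # messages this engine would send to neighbors

    def receive(self, prefix, next_hop):
        self.routes[prefix] = next_hop
        self.outbound.append("update-ack " + prefix)


class Router:
    def __init__(self):
        self.engines = [RoutingEngine("RE0"), RoutingEngine("RE1")]
        self.master = 0      # index of the engine whose queue is connected
        self.sent = []       # (engine name, message) pairs seen by neighbors

    def receive(self, prefix, next_hop):
        # Both engines process every inbound routing message.
        for engine in self.engines:
            engine.receive(prefix, next_hop)
        self._flush()

    def _flush(self):
        # Only the master's queue is connected to the network; the backup's
        # identical messages are discarded.
        for index, engine in enumerate(self.engines):
            if index == self.master:
                self.sent.extend((engine.name, m) for m in engine.outbound)
            engine.outbound.clear()

    def switchover(self):
        # Mastership moves to the other engine; no state has to be rebuilt.
        self.master = 1 - self.master


if __name__ == "__main__":
    router = Router()
    router.receive("10.0.0.0/8", "ge-0/0/0")
    router.switchover()
    router.receive("192.168.0.0/16", "ge-0/0/1")
    print(router.sent)
```

Because the backup engine already holds the same routes at the moment of `switchover()`, the neighbors in this model see an unbroken stream of messages and have nothing to reconverge on.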

Reducing the Time and Risk of Planned Events with Unified ISSU

Juniper has long offered service providers software and hardware functionality to minimize scheduled downtime for routine maintenance, including:

  • A single software release train that delivers each new version as a superset of features, with every release passing extensive regression testing with no critical errors, bringing inherent stability to the system that is not possible with other software options.
  • Hot-swappable interfaces, which enable operators to insert or remove hardware components without resetting the entire device.

Unified in-service software upgrade (unified ISSU) delivers yet another advancement for reducing the time and risk of planned maintenance events. Unified ISSU enables the complete upgrade from one JUNOS software version to another on supported dual Routing Engine platforms with no disruption on the control plane and with minimal disruption of traffic. For example, customers can use unified ISSU today to migrate their T640 platforms from JUNOS 9.0 to JUNOS 9.1 software.

Additionally, unified ISSU streamlines upgrades with its automated operations functions. For example, one of the first steps ISSU performs is to verify that all hardware installed in the system is compatible with and supported by the current JUNOS release, followed by a check that all hardware and configured features are supported by the new JUNOS release. These checks provide detailed messages to the operator, who can then correct discrepancies, let ISSU take unsupported hardware offline and bring it back online after the upgrade, or abort ISSU. This automated checking saves operators significant time and is more accurate than performing the same checks manually.
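The order of these checks can be sketched as follows. This is an illustrative model under stated assumptions, not the JUNOS implementation: the function names, the release dictionaries, the hardware labels and the decision strings are all hypothetical, and treating an unsupported configured feature as a reason to abort is an assumption made here to keep the example small.

```python
def issu_precheck(hardware, features, current, target):
    """Compare installed hardware and configured features against two releases.

    `current` and `target` are dicts with 'hardware' and 'features' sets
    describing what each release supports (hypothetical data shape).
    """
    report = {
        # Step 1: is everything installed supported by the running release?
        "unsupported_in_current": sorted(set(hardware) - current["hardware"]),
        # Step 2: are hardware and configured features supported by the new one?
        "unsupported_hw_in_target": sorted(set(hardware) - target["hardware"]),
        "unsupported_features_in_target": sorted(set(features) - target["features"]),
    }
    clean = not any(report.values())
    report["clean"] = clean
    return report


def decide(report, offline_unsupported):
    """Pick an outcome from the report and the operator's choice."""
    if report["clean"]:
        return "upgrade"
    if report["unsupported_in_current"] or report["unsupported_features_in_target"]:
        # Assumption: these discrepancies must be corrected by the operator first.
        return "abort"
    # Only target-release hardware gaps remain: offline it, or abort.
    return "offline-hardware-then-upgrade" if offline_unsupported else "abort"


if __name__ == "__main__":
    current = {"hardware": {"pic-a", "pic-b"}, "features": {"ospf", "bgp"}}
    target = {"hardware": {"pic-a"}, "features": {"ospf", "bgp"}}
    report = issu_precheck(["pic-a", "pic-b"], ["ospf"], current, target)
    print(report)
    print(decide(report, offline_unsupported=True))
```

Separating the report from the decision mirrors the text: the checks produce messages for the operator, and the operator chooses how to proceed.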

The Juniper approach contrasts with others that deliver quick bug fixes under the name of “code patches,” “SMUs,” or other terms. Applying one patch from the most recent version, then another from a second new version, yet another from a third, and so on, can quickly lead to a patchwork of many different versions. Regression testing of all the potential combinations quickly becomes unwieldy, with a rapidly multiplying potential for major defects and caveats.
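A rough count shows why the combinations multiply so quickly. The sketch below assumes, purely for illustration, that each patch can be independently applied or not applied, so that n patches yield 2^n possible software states; real patch systems may constrain this with dependencies, and the function name is hypothetical.

```python
def patch_combinations(num_patches):
    """Number of distinct installed-patch sets if each patch is optional."""
    return 2 ** num_patches


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, "patches ->", patch_combinations(n), "possible combinations")
```

Under this assumption, ten optional patches already give 1,024 states to regression test, whereas a single release train has one state per release.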

The Juniper unified ISSU design therefore upgrades the entire operating system from one major release to another. This preserves the completeness and quality of the regression testing performed for each software release. Upgrade paths are available from any supported release to another. Juniper’s unified ISSU allows operators to upgrade software without disrupting Layer 3 adjacencies or routing, Layer 2 keepalives, or link management. Executing the upgrade also requires that graceful Routing Engine switchover and nonstop active routing (NSR) be enabled on the device. The ability to upgrade a complete operating system without these disruptions is an important step forward for IP networking, and the direct result of Juniper’s focus on developing continuous systems.

The Bottom Line

The consequence of an outage in a modern multiservice network can be extraordinarily high, yet operations teams face many challenges to increasing network availability.

The continuous systems perspective of Juniper Networks provides a multifaceted approach to developing software that proactively considers all of the underlying factors of downtime. Among the most recent advances are nonstop active routing, unified ISSU and other essential mechanisms for service providers (and their customers) to deliver nonstop services, even during routing engine restarts and system upgrades.

For more information about Juniper’s technical advances, see the JUNOS Software High Availability Configuration Guide.