High Availability Techniques for CCNP ENCOR


TECCRS-2001

Enterprise High Availability


Design and Architectures
Samer Theodossy, Principal Engineer
Dana Daum, Technical Solutions Architect
Maren Kostede, Communications Architect
Junmei Zhang, Technical Marketing Eng.
High Availability World Coverage


TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 3
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• Foundations of the Structured Network Design
• High Availability Architectures:
• Enterprise Wired LAN
• Enterprise Wireless LAN
• Enterprise Data Center
• High Availability System Recovery Analysis

Agenda Schedule & Logistics
08:30 - 10:30
10:30 - 10:45 Break
10:45 - 12:45
12:45 - 14:30 Lunch
14:30 - 16:30
16:30 - 16:45 Break
16:45 - 18:45 Hurray! We are done!!!
Presenters: Samer, Dana, Maren, Junmei
Slide icons: For Your Reference; Key Concept or Design Point
We value your feedback: Don't forget to complete your online session evaluations
[Figure: enterprise reference network. Headquarters (access, distribution, and core switches; wireless LAN controllers; WAAS), data center (Nexus switches, UCS rack-mount and blade servers, storage, firewalls, Cisco ACE, communications managers, WAAS Central Manager), Internet edge (Internet routers, firewalls, RA-VPN, DMZ switch and servers, guest wireless LAN controller, web and email security appliances), WAN aggregation (WAN routers, MPLS WANs), regional site, remote sites, and teleworker/mobile worker.]
Enterprise-Class Availability
Campus Systems Approach to High Availability
• System-level resiliency
• Network-level redundancy
• Enhanced management
• Human ear notices the difference in voice within 150–200 msec
  • 10 consecutive G711 packet loss
• Video loss is even more noticeable
• 200-msec end-to-end campus convergence

Application tiers (ultimate goal: 100%):
• Next-Generation Apps: Video Conf., Unified Messaging, Global Outsourcing, E-Business, Wireless Ubiquity
• Mission Critical Apps: Databases, Order-Entry, CRM, ERP
• Desktop Apps: E-mail, File and Print

APPLICATIONS DRIVE REQUIREMENTS FOR HIGH AVAILABILITY NETWORKING
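The link between "10 consecutive G711 packet loss" and the 200-msec convergence target can be checked with a small calculation. The sketch below assumes 20 ms of audio per G.711 packet, a common default that the slide does not state; the function names are illustrative only.

```python
# Assumption: 20 ms of audio per G.711 packet (a common default, not from the slide).
PACKETIZATION_MS = 20


def audio_gap_ms(lost_packets: int, packet_ms: int = PACKETIZATION_MS) -> int:
    """Duration of audio lost when this many consecutive packets are dropped."""
    return lost_packets * packet_ms


def packets_lost_during(outage_ms: int, packet_ms: int = PACKETIZATION_MS) -> int:
    """Number of whole voice packets that fall inside an outage window."""
    return outage_ms // packet_ms


if __name__ == "__main__":
    print("Audio gap for 10 lost packets (ms):", audio_gap_ms(10))
    print("Packets lost during a 200 ms outage:", packets_lost_during(200))
```

Under that assumption, ten lost packets and a 200 ms outage describe the same event, which is why the convergence target sits at the edge of what a listener notices.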
Cisco HA Evolution

No Redundancy
• No redundant units
• Failure on Supervisor causes reload
• Line cards reload on failure
• Outage: 10's of minutes

Redundancy with RPR
• Adding redundant units
• Failure on Active Sup causes switchover
• Standby unit is in STANDBY_COLD state
• Line cards reload after switchover
• Startup configuration synchronized to peer
• Outage: several minutes

Redundancy with RPR+
• Adding redundant units
• Failure on Active Sup causes switchover
• Standby unit is in STANDBY_WARM state
• Line cards reload after switchover
• Startup configuration synchronized to peer
• Running configuration synchronized to peer and applied after switchover
• Outage: several seconds

Redundancy with SSO
• Adding redundant units
• Failure on Active Sup causes switchover
• Standby unit is in STANDBY_HOT state
• Line cards stay up after switchover
• Startup configuration synchronized to peer
• Running configuration synchronized to peer and applied
• Outage: order of milliseconds
Defining Levels of Availability*

• Continuous Availability (CA) = system is designed to operate 7 days a week, 24 hours a day, with resiliency and redundancy mechanisms to handle all unplanned or planned events
• Continuous Operations (CO) = system is designed to operate 7 days a week, 24 hours a day, with resiliency and redundancy mechanisms to handle both unplanned faults and planned maintenance events
• High Availability (HA) = system is designed to a specified service level with resiliency and redundancy mechanisms to handle unplanned faults

* References de facto industry terminology
Measure Availability End-End from the User Perspective

OSI layers and example measurement tools:
• Application, Presentation, Session: Custom Application Scripts, HTML, TCL, Python, many others
• Transport, Network: ICMP Ping, IP Traceroute, Bidirectional Forwarding Detection, IP SLA
• Data-Link: UDLD, STP, REP
• Physical: Cable Testers / Power Meters

* Layer 8 is not an official part of the OSI reference model
Measure and Analyze Every Event
Analyze and Automate
• Measure all previous points
  • Note each in trouble tickets
  • Analyze trends
• Automation
  • Trouble ticketing
  • Technology/database
  • Electronic bonding
• Redundant network design and resiliency features
  • Required for very high availability

Total service downtime, from Actual Fault Starts to Up Time, is made up of: Failure Detection Time, Notification Time, Diagnosis Time, Dispatch Time, Arrival Time, Repair Time
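One way to act on "note each in trouble tickets" is to record a timestamp per milestone and derive the phase durations from them. The following is a minimal sketch; the milestone keys and the sample timestamps are hypothetical and only the phase names come from the slide.

```python
from datetime import datetime, timedelta

# Milestone key (hypothetical) paired with the slide's name for the phase it ends.
PHASES = [
    ("detected", "Failure Detection Time"),
    ("notified", "Notification Time"),
    ("diagnosed", "Diagnosis Time"),
    ("dispatched", "Dispatch Time"),
    ("arrived", "Arrival Time"),
    ("repaired", "Repair Time"),
]


def phase_durations(fault_start, milestones):
    """Split one outage into per-phase durations, in timeline order."""
    durations = {}
    previous = fault_start
    for key, phase in PHASES:
        durations[phase] = milestones[key] - previous
        previous = milestones[key]
    return durations


def total_downtime(durations):
    """Total service downtime is the sum of all phases."""
    return sum(durations.values(), timedelta())


# Made-up trouble-ticket timestamps, for illustration only.
start = datetime(2019, 1, 1, 2, 0, 0)
ticket = {
    "detected": start + timedelta(minutes=2),
    "notified": start + timedelta(minutes=5),
    "diagnosed": start + timedelta(minutes=35),
    "dispatched": start + timedelta(minutes=50),
    "arrived": start + timedelta(minutes=110),
    "repaired": start + timedelta(minutes=140),
}

if __name__ == "__main__":
    phases = phase_durations(start, ticket)
    for name, length in phases.items():
        print(f"{name}: {length}")
    print("Total service downtime:", total_downtime(phases))
```

Tracking the phases separately shows where time is actually lost, for example whether detection or dispatch dominates, which is the input the trend analysis needs.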
What to Automate?
• Day 0: Device Provisioning (Deployment Automation)
• Day 1: Configuration (Open Programmable Interfaces)
• Day 2: Monitoring (Telemetry)
Source: 2016 Cisco Study
Main Operational Challenges
• 95%: Network Changes Performed Manually
• 70%: Policy Violations Due to Human Error
• 75%: OpEx spent on Network Visibility and Troubleshooting
Source: 2016 Cisco Study

CANNOT Keep Pace with the Demands of Digital Business
High Availability Design Principles
Key Principles
• Enterprise network design architectures continue to evolve to meet business and technology needs, but the key principles of high availability network design still apply:
  • Add redundancy and resiliency components as needed to meet the business requirements.
  • Simplify network designs and configurations through virtualization techniques.
  • Implement network-monitoring tools with automation where appropriate, and analyze all aspects of network outages for indications of where improvement is needed.
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
  • Availability Modeling
  • Stateful Switchover, Non-Stop Forwarding, and Non-Stop Routing
  • Stackwise480 and Stackwise
  • In Service Software Upgrades
• Foundations of the Structured Network Design
• High Availability Architectures
• High Availability System Recovery Analysis
Why Use System and Network Availability Modeling?
• Planning and Engineering
• Architecture validation
• Design tradeoff analysis/decisions
• Request for Proposal (RFP)
• Service Level Agreement (SLA)

Option 1 $ Option 2 $$ Option 3 $$$

Predicted Availability Ratings Are Not Guarantees
• Predicted Availability ratings are not guarantees of network availability.
• Ratings are based on industry-standard methodologies and statistical analysis.
• Useful in making design decisions and comparing different options.
Predicted Availability Rating
Function of Mean Time Between Failure and Mean Time to Repair
Availability Equation
Increase MTBF

Decrease MTTR

Predicted Availability Equation (Basic)

Availability Equation

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

MTBF = Mean Time Between Failure
MTTR = Mean Time To Repair
Predicted Availability Equations

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

• No Spare: 74,116 hrs. / (74,116 hrs. + 24 hrs.) = 0.999676, or 2 hrs 50.3 min. per year
• Spare Available: 74,116 hrs. / (74,116 hrs. + 4 hrs.) = 0.999946, or 28 min. per year
• Redundancy!!! (sub-second): 74,116 hrs. / (74,116 hrs. + .00833) = 0.999999, or .526 min. per year
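The first two rows of the worked example can be reproduced with a few lines of Python. This is a sketch rather than the tool used for the slides: the function names are illustrative, and the year length of 8,766 hours (365.25 days) is an assumption chosen because it matches the slide's rounded downtime figures.

```python
# Assumption: a year of 365.25 days; the slide does not state the value it used.
HOURS_PER_YEAR = 365.25 * 24


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(avail: float, hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Expected downtime per year, in minutes, for a given availability."""
    return (1.0 - avail) * hours_per_year * 60.0


if __name__ == "__main__":
    mtbf = 74_116  # hours, from the slide
    for label, mttr in [("No spare", 24), ("Spare available", 4)]:
        a = availability(mtbf, mttr)
        print(f"{label}: availability {a:.6f}, "
              f"about {annual_downtime_minutes(a):.1f} min of downtime per year")
```

The same MTBF yields very different annual downtime depending on repair time alone, which is the point of the comparison: shortening MTTR is often cheaper than raising MTBF.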
The Redundancy Effect

Single Points of Failure (Blocks in Series)
• Unit 1: Linecard, 99.999%, ~5 min/yr
• Unit 2: Supervisor, 99.999%, ~5 min/yr
• Availability = 99.998%
• Downtime = ~10 min/yr

Redundant Components (Blocks in Parallel)
• Unit 1: Supervisor, 99.999%, ~5 min/yr
• Unit 2: Supervisor, 99.999%, ~5 min/yr
• Availability = 99.999999%
• Downtime = ~0.0053 min/yr
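The series and parallel combinations can be sketched with the textbook formulas: blocks in series multiply their availabilities, and blocks in parallel fail only when all of them are down. The code below is an idealized illustration that assumes independent failures and instantaneous failover, so its parallel result is more optimistic than the slide's 99.999999% and is not expected to match it exactly. Function names and the year length are assumptions.

```python
from functools import reduce

# Assumption: a year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def series(*avails: float) -> float:
    """Blocks in series: the system is up only if every block is up."""
    return reduce(lambda acc, a: acc * a, avails, 1.0)


def parallel(*avails: float) -> float:
    """Blocks in parallel: the system is down only if every block is down."""
    all_down = reduce(lambda acc, a: acc * (1.0 - a), avails, 1.0)
    return 1.0 - all_down


def downtime_minutes_per_year(avail: float) -> float:
    """Expected annual downtime in minutes for a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    unit = 0.99999  # each block is 99.999% available, as on the slide
    in_series = series(unit, unit)
    in_parallel = parallel(unit, unit)
    print(f"Series:   {in_series:.7%}, {downtime_minutes_per_year(in_series):.2f} min/yr")
    print(f"Parallel: {in_parallel:.10%}, {downtime_minutes_per_year(in_parallel):.6f} min/yr")
```

The series case reproduces the slide's roughly 10 min/yr, while the parallel case shows why redundancy moves a component out of the list of meaningful downtime contributors.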
Example of Predicted Availability Rating (No Redundancy)
Catalyst 2960XR-48TS-I

Part                     MTBF (hours)   MTTR     Predicted Availability   Annual Downtime
Catalyst 2960XR-48TS-I   438,130        4 hrs.   99.99908704%             --
Power Supply             1,000,000      4 hrs.   99.99960000%             --
SFP-10GSR Uplink         2,294,776      4 hrs.   99.99982569%             --
System MTBF              268,947                 99.99851274%             7.82 min

All single points of failure combined in a series calculation
Chassis X Power Supply X Uplink = System MTBF
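The System MTBF row of the Catalyst 2960XR-48TS-I table can be reproduced by adding the failure rates (1/MTBF) of the parts in series and inverting the sum. That this is the method behind the slide is an assumption, although it matches the published figures; the function names and the 365.25-day year are likewise illustrative.

```python
# Assumption: a year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def series_mtbf(*mtbf_hours: float) -> float:
    """Combine parts in series by summing their failure rates (1 / MTBF)."""
    return 1.0 / sum(1.0 / m for m in mtbf_hours)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # MTBF values in hours, taken from the table: chassis, power supply, uplink.
    system = series_mtbf(438_130, 1_000_000, 2_294_776)
    avail = availability(system, 4)  # 4-hour MTTR, as in the table
    downtime = (1.0 - avail) * MINUTES_PER_YEAR
    print(f"System MTBF: {system:,.0f} hrs")
    print(f"Predicted availability: {avail:.8%}")
    print(f"Annual downtime: {downtime:.2f} min")
```

Because failure rates add, the least reliable part dominates the result: here the chassis alone accounts for most of the system failure rate.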
Example of Predicted Availability Rating (With Redundancy)
Catalyst 2960XR-48TS-I

Part                           MTBF (hours)   MTTR     Switchover time (sec.)   Combined MTBF (hrs.)   Predicted Availability   Annual Downtime
Catalyst 2960XR-48TS-I         438,130        4 hrs.   --                       438,130                99.99908704%             --
Power Supply (Redundant)       1,000,000      4 hrs.   0                        125,001,000,002        100.00000000%            --
SFP-10GSR Uplink (Redundant)   2,294,776      4 hrs.   .500                     658,251,906,130        100.00000000%            --
System MTBF                                                                     438,128                99.99908704%             4.80 min.

Redundant components combined in parallel calculation
Chassis X Combined Power Supply X Combined Uplink = System MTBF
Example of Predicted Availability Rating (Catalyst 3850 No Redundancy)
Catalyst WS-C3850-48F

Part                           MTBF (hours)   MTTR     Predicted Availability   Annual Downtime
Catalyst C3850-48F             241,050        4 hrs.   99.99834062%             --
Power Supply PWR-C1-1100WAC    392,174        4 hrs.   99.99898005%             --
C3850-NM-2-10G                 4,319,170      4 hrs.   99.99990732%             --
SFP-10GSR Uplink               2,294,776      4 hrs.   99.99982569%             --
System MTBF                    135,761                 99.99705371%             15.50 min

All single points of failure combined in a series calculation
Chassis X Power Supply X Uplink = System MTBF
Example of Predicted Availability Rating (With Component Redundancy)
Catalyst WS-C3850-48F

Part                           MTBF (hours)   MTTR     Switchover time (sec.)   Combined MTBF      Predicted Availability   Annual Downtime
Catalyst C3850-48F             241,050        4 hrs.   --                       241,050 hrs.       99.99834062%             --
Power Supply PWR-C1-1100WAC    392,174        4 hrs.   0                        19,225,447,959     99.99999999%             --
SFP-10GSR Uplink               2,294,776      4 hrs.   .500                     658,251,906,038    100.00000000%            --
C3850-NM-2-10G                 4,319,170      4 hrs.   --                       4,319,170          99.99990732%             --
System MTBF                                                                     228,297            99.99824793%             9.22 min.

Redundant components combined in parallel calculation
Chassis X Combined Power Supply X Combined Uplink X Uplink Module = System MTBF
Example of Predicted Availability Rating (With Stackwise480 Redundancy, Single Attached)
Catalyst WS-C3850-48F

Part                           MTBF (hours)   MTTR     Switchover time   Combined MTBF        Combined Availability   Annual Downtime
Catalyst C3850-48F             241,050        4 hrs.   --                241,050              99.99834062%            --
Power Supply PWR-C1-1100WAC    392,174        4 hrs.   .001              19,225,447,959       99.99999999%            --
C3850-NM-2-10G                 4,319,170      4 hrs.   .500              2,328,453,946,134    100.0000000%            --
SFP-10GSR Uplink               2,294,776      4 hrs.   .500              658,251,906,038      100.0000000%            --
System MTBF                                                              241,047              99.99834061%            8.73 min.

Redundant components combined in parallel calculation
Combined Chassis X Combined Power Supply X Combined Uplink = System MTBF
Example of Predicted Availability Rating (Catalyst 4507R+E Non Redundant)
Catalyst WS-C4507R+E

Part                             MTBF        MTTR     Combined MTBF      Combined Availability   Annual Downtime
Chassis with Fans WS-C4507R+E    248,630     4 hrs.   248,630 hrs.       99.99839121%            --
Power Supply PWR-C45-6000ACV     341,356     4 hrs.   341,356 hrs.       99.99882822%            --
WS-X45-SUP8-E                    451,610     4 hrs.   451,610 hrs.       99.99911429%            --
SFP-10GSR Uplink                 2,294,776   4 hrs.   658,251,906,038    99.99999956%            --
WS-X4748-RJ45-E                  402,386     4 hrs.   402,386 hrs.       99.99900594%            --
System MTBF                                           82,735 hrs.        99.99516543%            25.43 min.

Components combined in series calculation
Chassis X Power Supply X Line Card X Supervisor Module X SFP Uplink = System MTBF
Example of Predicted Availability Rating (Catalyst 4507R+E With Redundancy)
Catalyst WS-C4507R+E with Redundancy

Part                             MTBF        MTTR     Switchover time   Combined MTBF         Combined Availability   Annual Downtime
Chassis with Fans WS-C4507R+E    248,630     4 hrs.   --                248,630 hrs.          99.99839121%            --
Power Supply PWR-C45-6000ACV     341,356     0 hrs.   0                 14,565,831,200 hrs.   99.99882822%            --
WS-X45-SUP8-E                    451,610     0 hrs.   .500              25,494,400,625 hrs.   99.99911429%            --
SFP-10GSR Uplink                 2,294,776   0 hrs.   .500              658,251,906,038       99.99999956%            --
WS-X4748-RJ45-E                  402,386     4 hrs.   --                402,386 hrs.          99.99900594%            --
System MTBF                                                             153,673 hrs.          99.99739714%            13.69 min.

Redundant components combined in parallel calculation
Chassis X Combined Power Supply X Line Card X Combined Supervisor Module X Combined SFP Uplink = System MTBF
Example of Predicted Availability Rating (Catalyst 6800XL Non Redundant)
Catalyst 6800XL

Part                     MTBF (hours)   MTTR     Combined MTBF Hrs.   Combined Availability   Annual Downtime
Chassis C6807-XL         638,440        4 hrs.   638,440              99.99937348%            --
C6807-XL-FAN=            3,077,880      4 hrs.   3,077,880            99.99987004%            --
SFP-10GSR                2,294,776      4 hrs.   2,294,776            99.99982569%            --
Supervisor VS-S2T-10G    231,910        4 hrs.   231,910              99.99827522%            --
WS-X6904-40G-2T          256,490        4 hrs.   256,490              99.99844051%            --
C6800-XL-3KW-AC*         3,000,000      4 hrs.   3,000,000            99.99986667%            --
System MTBF                                      91,987               99.99565168%            22.87 min.

Components combined in series calculation
Chassis X Fan Tray X Power Supply X Line Card X Supervisor Module X SFP Uplink = System MTBF
Example of Predicted Availability Rating (Catalyst 6800XL With Redundancy)
Catalyst 6800XL with Redundancy

Part                     MTBF Hrs.   MTTR Hrs.   Switchover time (seconds)   Combined MTBF Hrs.   Combined Availability   Annual Downtime
Chassis C6807-XL         638,444     4 Hrs.      --                          638,440              99.99937348%            --
C6807-XL-FAN=            3,077,880   4 Hrs.      --                          3,077,880            99.99987004%            --
SFP-10GSR                451,610     4 Hrs.      .500                        2,633,000,739,868    100.00000000%           --
Supervisor VS-S2T-10G    2,294,776   4 Hrs.      .500                        26,891,355,961       99.99999997%            --
WS-X6904-40G-2T          402,386     4 Hrs.      .500                        32,893,816,541       99.99999998%            --
C6800-XL-3KW-AC*         3,000,000   4 Hrs.      0                           4,500,003,000,001    100.00000000%           --
System MTBF                                                                  528,687              99.99924347%            3.98 min.

Redundant components combined in parallel calculation
Chassis X Combined Power Supply X Combined Line Card X Combined Supervisor Module X Combined SFP Uplink = System MTBF
Choosing the Right Platform and Network Design
It is More Than Just Predicted Availability Ratings
• Design to business requirements
• Use Predicted Availability ratings as part of your overall design considerations
• Common factors that dictate platform selection:
• Backplane throughput and performance
• Interface types and port densities
• Scalability for future growth/ investment protection
• Software upgrade procedures
• Software feature support
• Simplicity / Ease of Use

Control Plane and Data Plane
• Control Plane: CPU, Software, Memory
  • EIGRP, OSPF, BGP, SNMP, LDP, STP, CDP
  • FIB
• Data Plane: ASICs, High-Speed TCAMs
  • FIB
[Figure: a packet with SRC A and DST B traveling from host A to host B]
Control Plane and Data Plane
Definitions for our context
• Control Plane – Protocols or signaling traffic associated with routing. Typically this is traffic sourced from a router or destined to a router. Examples include BGP, OSPF, EIGRP, ICMP, etc.
  • Processed by a CPU
  • May also include exception traffic that needs special services applied
  • Commonly referred to as the “Slow Path”
  • May also include management protocols including SNMP, Telnet, HTTP, etc., AKA “Management Plane”
• Data Plane – Traffic forwarded through a device.
  • Processed by hardware ASICs
  • In the context of a switching device, typically this is traffic processed completely by the device’s hardware ASICs
  • Commonly referred to as the “Fast Path”
Stateful Switchover (SSO)
• Stateful Switchover (SSO) – A software facility within Cisco IOS that synchronizes specific Cisco IOS processes between an Active Supervisor Engine and a Redundant Standby Supervisor Engine for the purpose of redundancy.
  • Redundancy Facility – synchronizes application states
  • Checkpointing Facility – synchronizes data structures
Redundant Supervisors – IOS
Active – Standby Model

Active Supervisor
• Control Plane
  • Console access
  • Manages Configurations
  • Manages Chassis Environmentals
  • L2 – L3 Protocols
• Data Plane
  • Hardware-based switching

Standby Supervisor
• Not part of the active forwarding path
• Multiple Redundancy modes
  • COLD Standby
  • WARM Standby
  • HOT Standby
• Synchronization
  • CF – Checkpoint Facility
  • RF – Redundancy Facility
Stateful Switchover Mode – IOS
SSO-Aware and SSO-Compliant IOS Applications

The Active Supervisor and the Standby Hot Supervisor each run Cisco IOS, kept in step through the Redundancy Facility and the Checkpointing Facility.
• SSO-Compliant Applications: Routing Protocols, NetFlow, Cisco Discovery Protocol, …and more
• SSO-Aware Applications: Forwarding Information Base, IEEE 802.1x, PAgP / LACP, …and more
SSO Compliant Redundancy Clients
IOS Partial List Example
Router# show redundancy clients
clientID = 0 clientSeq = 0 RF_INTERNAL_MSG
clientID = 1319 clientSeq = 1 Cat6k Platform Swove
clientID = 5030 clientSeq = 2 Redundancy Mode RF

Management & Services
• EEM Server RF CLIENT
• SNMP HA RF Client
• Switch SPAN client
• MQC QoS
• Call-Home RF Client
• Port Security Client
• IKE RF Client
• IPSEC RF Client
• CRYPTO RSA

L2 Services
• Frame Relay
• HDLC
• LLDP
• PPP RF
• MPLS VPN HA
• LDP HA
• AToM manager
• Cat6k PAgP/LACP
• Spanning-Tree Protocol
• LAN-Switch PAgP/LACP
• LAN-Switch Private V
• VLAN Mapping
• CTS HA

L3 Services
• IPROUTING NSF RF
• ARP
• L3 Mobility Manager
• IP multicast RF Client
• Network RF Client
• HSRP
• GLBP
• BFD RF Client
• DHCP Snooping
• Cat6k MLS Multicast
• SLB RF Client

Platform Specific
• Cat6k Inline Power
• Cat6k OIR
• Cat6k QoS Manager
• CWAN VLAN RF Client
• Cat6k Feature Manager
• Cat6k SPA TSM
• Cat6k Online Diag HA
• Cat6k Platform
• Config Sync RF client
• Cat6k Startup Config
SSO by itself Does Not Provide Redundancy for the Routing Protocols

Graceful Restart, Non-Stop Forwarding and Non-Stop Routing
• Non-Stop Forwarding was developed by Cisco to maintain traffic forwarding by a router experiencing a control plane switchover event. The router essentially synchronizes its Forwarding Information Base between an Active and Standby Route Processor and signals its routing neighbors to continue forwarding traffic while routing topology information is exchanged.
• The IETF developed standards-based implementations similar to Cisco NSF.
• The IETF implementations use different terminology, including the term “Graceful Restart” to describe the signaling used between the routers.
• Graceful Restart (GR) and Non-Stop Forwarding (NSF) are terms often used interchangeably.
• Graceful Restart/Non-Stop Forwarding as well as Non-Stop Routing (NSR) all allow the forwarding of data packets to continue along known routes while the routing protocol information is being restored (in the case of Graceful Restart) or refreshed (in the case of Non-Stop Routing) following a processor switchover.
• Each routing protocol has its own unique implementation and signaling mechanisms.
Routing Protocol Redundancy With NSF

Before switchover:
• Active Supervisor Engine (Slot 1): the EIGRP RIB (prefix, next hop), OSPF RIB (prefix, next hop), and ARP table (IP, MAC) are populated.
• Standby Supervisor Engine (Slot 2): the EIGRP RIB, OSPF RIB, and ARP table are empty (-).
• The FIB table is synchronized from the active to the standby supervisor by SSO (Redundancy Facility and Checkpoint Facility), so both hold the same entries:

Prefix    Next Hop
[Link]    aabbcc:ddee32
[Link]    adbb32:d34e43
[Link]    aa25cc:ddeee8
Routing Protocol Redundancy With NSF

After switchover, the Supervisor Engine in Slot 2 is active. The slide sequence shows:
1. The EIGRP RIB, OSPF RIB, and ARP table are empty (-), while the FIB table still holds its three entries (next hops aabbcc:ddee32, adbb32:d34e43, aa25cc:ddeee8), so forwarding continues.
2. GR/NSF signaling per protocol and synchronization per protocol begin.
3. The EIGRP RIB is repopulated (prefix, next hop).
4. The OSPF RIB is repopulated (prefix, next hop).
5. The ARP table is repopulated (IP, MAC).
Non Stop Forwarding Router Roles
• Non-Stop Forwarding (NSF) allows a router to continue forwarding data along routes that are already known, while the routing protocol information is being restored
• NSF Aware router or NSF Helper router*
  • A router running NSF-compatible software, capable of assisting a neighbor router perform an NSF restart
• NSF Capable router
  • A router configured to perform an NSF restart, therefore able to rebuild routing information from a neighbor NSF-aware or NSF-capable router
* NSF Helper - This term is used in IETF terminology
[Figure: an NSF Capable device with redundant supervisors connected to NSF Aware neighbors]
NSF/SSO Switchover Operation – IOS
[Figure: numbered switchover sequence, steps 1 to 12. Step 1: the Active Supervisor fails and the standby becomes the Newly Active Supervisor. Control plane (RP CPU, control path): OSPF, EIGRP, IS-IS, BGP, Routing Information Base, ARP table. Cisco IOS CEF tables (Global Epoch = 1): FIB table with columns Prefix, Next Hop, Interface, Epoch (entries for 10.2 on Vlan 10 and 192.1 on Vlan 192) and Adjacency table with columns Next Hop, MAC, Epoch. Data plane (forwarding path): hardware FIB table and adjacency table. Neighbor: NSF Aware Router.]
Non-Stop Forwarding
OSPF Implementation Example
[Diagram: OSPF restart message flow between an NSF Capable and an NSF Aware router, shown side by side for Cisco NSF and IETF NSF.
Cisco NSF: after the restart event, Fast Hellos are exchanged at a 2-second interval (RS bit set in one direction, RS bit clear in the other), followed by an out-of-band sync (Database Description, LSA requests/updates), then Hellos with the RS bit clear.
IETF NSF: after the restart event (GR), the graceful restart is announced with an LS Update carrying a Grace LSA, acknowledged with an LS ACK; OSPF discovery (Hello), database exchange (Database Description), and LSA requests/updates follow, then Hellos.]
NSF Configuration - IOS
Capable vs. Helper Configuration
• Configuration is required to enable “NSF Capable”
• Configuration is NOT required to enable “NSF Helper” with default settings
• Helper supports both types on the device

router eigrp 1
 nsf
!
router ospf 1
 nsf ietf
!
router isis 1
 nsf cisco

core1# show ip ospf nsf


Routing Process "ospf 1"
IETF Non-Stop Forwarding enabled
restart-interval limit: 120 sec
IETF NSF helper support enabled
IETF NSF helper strict-lsa-checking enabled
Cisco NSF helper support enabled
OSPF restart state is NO_RESTART
Handle 2162698, Router ID [Link], checkpoint Router ID [Link]
Config wait timer interval 10, timer not running
Dbase wait timer interval 120, timer not running

NSF Interoperability
Interoperability between different Cisco devices
• The Graceful Restart extensions used in NX-OS are based on the IETF RFCs, except for EIGRP, which is Cisco proprietary and can interoperate with Cisco NSF.
• This implies that for OSPFv2, OSPFv3, and BGP the GR extensions are compatible with versions of IOS that use the RFC-based extensions
[Diagram: two switches configured with “router ospf 1” / “graceful-restart” are shown interoperating (✔) with a peer configured with “router ospf 1” / “nsf ietf”; a peer configured with “router ospf 1” / “nsf cisco” is shown for contrast]

Non-Stop Routing (NSR)

• Cisco IOS Non-Stop Routing preserves the state information (prefixes and related data) in the Routing Information Base across Supervisor Engine (Route Processor) switchover events.
• Helpful in environments where peer routers are not managed by the same entity or are not capable of supporting NSF awareness
• Consider that Non-Stop Routing does consume more control plane resources, such as memory and CPU compute cycles, compared to NSF

Routing Protocol Redundancy With NSR
[Diagram: the Active Supervisor Engine (slot 1) and the Standby Supervisor Engine (slot 2) hold identical EIGRP RIB, OSPF RIB, ARP, and FIB tables, kept in sync through SSO using the Redundancy Facility and the Checkpoint Facility]

Routing Protocol Redundancy With NSR
[Diagram: after switchover, the supervisor in slot 2 is the Active Supervisor Engine and already holds the EIGRP RIB, OSPF RIB, ARP, and FIB tables]
No additional signaling required to maintain topology

NSR Deployment Scenario
Case Study: MPLS VPN Provider Edge

• Provider PE device can use NSR for peering with the CE devices
• Use NSF for peering with the internal P devices or Route Reflectors
• NSF Aware peers are not needed for the CE device
• Control plane resources can be optimized by using NSF and NSR together
[Diagram: MPLS VPN with CE devices attached to a PE, which connects to P devices]

NSR Configuration - IOS
• Configuration is required to enable NSR

router eigrp 1
 nsr
!
router ospf 1
 nsr
!
router isis 1
 nsr

Comparing NSF and NSR

Metric | Non-Stop Forwarding | Non-Stop Routing
Configuration required | Yes, per protocol instance on the NSF-capable device. No configuration required on NSF-aware devices for Interior Gateway Protocols. BGP requires GR configuration on both the NSF-capable and the NSF-helper device. | Yes, per protocol instance. BGP also requires per-peer configuration.
Which routing protocols are supported | EIGRP, OSPFv2, OSPFv3, ISIS, BGP, LDP, etc. | ISIS, BGP, OSPFv2, etc.
Synchronizes routing protocol state and RIB information across redundant control planes | No | Yes
Consumes additional CPU and memory resources | Negligible | Yes, scaling with the number of routes per protocol
Requires specific feature support on peer routers | Yes | No

High Availability at Different Layers
Standalone Chassis Redundant Core
Redundant Supervisors: Yes or No? Catalyst 6500
• Redundant topologies with equal-cost multipaths (ECMP) provide sub-second convergence
• NSF/SSO provides superior availability in environments with non-redundant paths
• RP convergence is dependent on IGP and tuning
[Chart: seconds of lost voice for link failure, node failure, NSF/SSO, and OSPF convergence]
Redundant Supervisors: Yes or No? Catalyst 6500
• HSRP doesn’t flap on Supervisor SSO switchover
• Reduces the need for sub-second HSRP timers
• SSO Aware HSRP
  • 6500-E - 12.2(33)SXH
  • 4500 - 12.2(31)SG
[Chart: seconds of lost voice]

Design Considerations for NSF/SSO
Where Does It Make Sense?
• Access switch is the single point of failure in a best-practices HA design
• Supervisor failure is the most common cause of access switch service outages
• Recommended design with NSF/SSO provides for sub-600 msec recovery of voice and data traffic
[Chart: seconds of lost voice]

Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• Availability Modeling
• Stateful Switchover, Non-Stop Forwarding, and Non-Stop Routing
• Stackwise480 and Stackwise
• In Service Software Upgrades

• Foundations of the Structured Network Design


• High Availability Architectures
• High Availability System Recovery Analysis

Catalyst 9300 Series
Cisco Stackwise-480

Stacking Cable – Close-up

Stacking
Cable

Cable Lengths
• 0.5m
• 1m
• 3m

Understanding the Stack Ring
• 6 rings in total
  • 3 rings go East
  • 3 rings go West
• Each ring is 40G
• Total Stack BW = 240G
  • With Spatial Reuse = 480G
• Packets are segmented/reassembled in HW (256 byte segments)
Is math really an opinion?
[Diagram: ASIC and stack interfaces, assuming 4 x 24-port Cat9K switches]
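The bandwidth figures above follow from simple arithmetic. The sketch below only restates the slide's numbers (6 rings at 40G each, doubled when spatial reuse is counted); the function and constant names are illustrative and do not come from any Cisco tool.

```python
# Figures taken from the slide: 3 rings east, 3 rings west, 40G per ring.
RINGS_EAST = 3
RINGS_WEST = 3
RING_GBPS = 40


def stack_bandwidth_gbps(spatial_reuse: bool = False) -> int:
    """Return the aggregate stack ring bandwidth in Gbps.

    With spatial_reuse=True the raw figure is doubled, matching the
    slide's "With Spatial Reuse = 480G" line.
    """
    total = (RINGS_EAST + RINGS_WEST) * RING_GBPS
    return total * 2 if spatial_reuse else total


if __name__ == "__main__":
    print("Raw ring capacity (Gbps):", stack_bandwidth_gbps())
    print("With spatial reuse (Gbps):", stack_bandwidth_gbps(spatial_reuse=True))
```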

Understanding Spatial Reuse
Doubling the capacity of my stack

• Destination Stripping: packet travels ½ the rings and is taken out of the stack by the destination
[Diagram: assuming 4 x 24-port 9300 switches, numbered 1-4 around the ring]
Stack Ring Healing
• Detection is by hardware
• Software is notified immediately
• Ring Wrap initiated immediately (1-2ms)
For Recovery –
• Hardware detects other side
• Software validates the link and so it brings up the connection gracefully
• Unwrap is slower than Wrap
• All rings wrap
• 240Gbps when wrapped
[Diagram: example shows 4 x 24-port Cat9K switches with one stack link failed (X)]

IOS XE Software Internals Overview
[Diagram: IOS XE software architecture, grouped into the Infra Domain, LC Domain, and RP Domain on top of the Kernel. Labeled blocks include IOSd RP, Wireless Controller, Forwarding & Feature Mgr (FFM), Stack Manager (3K), Platform Manager, System Manager, Forwarding Engine Driver, Packet Delivery Service, Features PD, UADP ASIC Drivers, Platform Drivers, Low Level APIs, Internal IPC, Transports (TCP/SCTP/UDP), External Libraries/Utilities, Comet, Service Location, Interface Manager, HA, Consolidated Logging, Licensing Services, and the Availability Framework.]

UADP
Designed for Flexibility
• UADP provides an unparalleled degree of flexibility in an access switch
• Excellent for encapsulations, which often need recirculation
• Parse depth of 256 Bytes
• 15 programmable stages
• Up to 250 frames across stages at one time…
• Ability to handle current and future protocols – extremely flexible and capable

VXLAN as a protocol had not even been invented when the UADP ASIC was designed …
Yet UADP forwards VXLAN in hardware, at high performance in IOS-XE 16.3+ … thanks to the FlexParser
VXLAN is a complex protocol …

Underlay
• Outer MAC Header (14 Bytes, plus 4 Bytes optional)
  • Dest. MAC (48 bits): Next-Hop MAC Address
  • Source MAC (48 bits): Src VTEP MAC Address
  • VLAN Type 0x8100 (16 bits)
  • VLAN ID (16 bits)
  • Ether Type 0x0800 (16 bits)
• Outer IP Header (20 Bytes)
  • IP Header Misc. Data (72 bits)
  • Protocol 0x11 (UDP) (8 bits)
  • Header Checksum (16 bits)
  • Source IP (32 bits): Src RLOC IP Address
  • Dest. IP (32 bits): Dst RLOC IP Address
• UDP Header (8 Bytes)
  • Source Port (16 bits): hash of inner L2/L3/L4 headers of original frame; enables entropy for ECMP load balancing
  • VXLAN Port (16 bits): UDP 4789
  • UDP Length (16 bits)
  • Checksum 0x0000 (16 bits)
• VXLAN Header (8 Bytes)
  • VXLAN Flags RRRRIRRR (8 bits)
  • Segment ID (16 bits): allows 64K possible SGTs
  • VN ID (24 bits): allows 16M possible VRFs
  • Reserved (8 bits)
Overlay
• Inner (Original) MAC Header
• Inner (Original) IP Header
• Original Payload
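The header sizes and ID widths on this slide can be checked with a little arithmetic. The following is a sketch that uses only the byte and bit counts shown on the slide; the constant and function names are made up for illustration.

```python
# Header sizes in bytes and field widths in bits, as listed on the slide.
OUTER_MAC_BYTES = 14
VLAN_TAG_BYTES = 4        # optional 802.1Q tag (0x8100)
OUTER_IP_BYTES = 20
UDP_BYTES = 8
VXLAN_BYTES = 8
SEGMENT_ID_BITS = 16      # carries the SGT
VN_ID_BITS = 24           # carries the virtual network ID


def encap_overhead_bytes(vlan_tagged: bool = False) -> int:
    """Bytes added in front of the original frame by the encapsulation."""
    overhead = OUTER_MAC_BYTES + OUTER_IP_BYTES + UDP_BYTES + VXLAN_BYTES
    return overhead + (VLAN_TAG_BYTES if vlan_tagged else 0)


def id_space(bits: int) -> int:
    """Number of distinct values an unsigned field of this width can hold."""
    return 2 ** bits


if __name__ == "__main__":
    print("Overhead, untagged:", encap_overhead_bytes())
    print("Overhead, tagged:  ", encap_overhead_bytes(vlan_tagged=True))
    print("Segment ID values: ", id_space(SEGMENT_ID_BITS))
    print("VN ID values:      ", id_space(VN_ID_BITS))
```

The 16-bit and 24-bit widths are where the slide's "64K possible SGTs" and "16M possible VRFs" come from.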
Stack Discovery

• Switches boot.
• Stack Interfaces brought online
• Infra and LC Domains boot in parallel
• Stack Discovery Protocol discovers the stack topology – broadcast, followed by neighbor-cast
• Active Election begins after Discovery exits
[Diagram: four stack members, each with an LC domain and an Infra domain]

Stack Active Election
Rules of Election

• The stack (or switch) whose member has the higher user-configurable priority 1–15
• The switch or stack whose member has the lowest MAC address

%IOSXE-1-PLATFORM: process stack-mgr: %STACKMGR-1-ACTIVE_ELECTED: Switch 1 has been elected ACTIVE.
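The two election rules listed on this slide can be sketched as a sort key: highest priority first, then lowest MAC address as the tie-breaker. This is an illustrative model only, assuming these are the only two rules in play; the actual stack manager may apply additional criteria not shown here. The `Member` type and function names are hypothetical.

```python
from typing import List, NamedTuple


class Member(NamedTuple):
    switch: int
    priority: int   # 1-15, higher is preferred
    mac: str        # dotted-hex form, e.g. "2037.06cf.0e80"


def mac_to_int(mac: str) -> int:
    """Convert a dotted-hex MAC string to an integer for comparison."""
    return int(mac.replace(".", ""), 16)


def elect_active(members: List[Member]) -> Member:
    """Pick the member with the highest priority; break ties on lowest MAC."""
    return min(members, key=lambda m: (-m.priority, mac_to_int(m.mac)))


if __name__ == "__main__":
    stack = [
        Member(1, 1, "2037.06cf.0e80"),
        Member(2, 1, "2037.06cf.3380"),
        Member(3, 1, "2037.06cf.1400"),
        Member(4, 1, "2037.06cf.3000"),
    ]
    print("Elected:", elect_active(stack))
```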

Define Stack Roles
Minimal Downtime
• Power up the first switch that you want to make Active
• Configure Priority of the switch (1-15)
  • 1 by default – the higher the better
• Power up the second member that you want to make Standby
• Configure Priority less than the Active
• Power up the rest of the members

Catalyst9300#switch 1 priority 15
Catalyst9300#switch 2 priority 14
Catalyst9300#switch 3 priority 13
Catalyst9300#switch 4 priority 12

*Priority command is a global command

Catalyst 9K Stack similarity to Catalyst 6500

Catalyst 9K Stack:
• Active and Standby units
• Run IOSd, WCM, etc. on Active/Standby
• Synchronize information
• Active programs Data plane for members
• Member switches act as Line cards – connected via the Stack Cable
Catalyst 6500:
• Active and Standby Supervisors
• Run IOS on Supervisors
• Synchronize information
• Active programs all DFCs
• DFCs run a subset of IOS for LCs

Show switch with SSO
Switch# show switch
Switch/Stack Mac Address : 2037.06cf.0e80
                                             H/W   Current
Switch#   Role    Mac Address     Priority Version  State
------------------------------------------------------------
*1       Active   2037.06cf.0e80     10     V01     Ready
 2       Standby  2037.06cf.3380     8      V00     Ready
 3       Member   2037.06cf.1400     6      V00     Ready
 4       Member   2037.06cf.3000     4      V00     Ready
* Indicates which member is providing the “stack Identity” (aka “stack MAC”)
Stack MAC follows the Active initially

Show switch detail output
Switch# show switch detail
Switch/Stack Mac Address : 2037.06cf.0e80
                                             H/W   Current
Switch#   Role    Mac Address     Priority Version  State
------------------------------------------------------------
*1       Active   2037.06cf.0e80     10     V01     Ready
 2       Standby  2037.06cf.3380     8      V00     Ready
 3       Member   2037.06cf.1400     6      V00     Ready
 4       Member   2037.06cf.3000     4      V00     Ready

Stack Port Information
         Stack Port Status             Neighbors
Switch#  Port 1     Port 2           Port 1   Port 2
--------------------------------------------------------
  1         OK         OK               2        4
  2         OK         OK               3        1
  3         OK         OK               4        2
  4         OK         OK               1        3
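The neighbor columns in this output can be used to confirm that the stack is cabled as a full ring. The sketch below assumes the neighbor table has already been extracted into a dictionary of `switch -> (port 1 neighbor, port 2 neighbor)`; the function name and data structure are illustrative and not part of any Cisco tooling.

```python
from typing import Dict, Optional, Tuple

# switch number -> (neighbor on port 1, neighbor on port 2); None = no neighbor
Neighbors = Dict[int, Tuple[Optional[int], Optional[int]]]


def is_full_ring(neighbors: Neighbors) -> bool:
    """Walk the port-1 links and check they form one closed loop.

    Each hop must be reciprocal: if A reaches B on port 1, then B must
    list A as its port-2 neighbor.
    """
    if not neighbors:
        return False
    start = next(iter(neighbors))
    visited = set()
    current = start
    while current not in visited:
        visited.add(current)
        port1_peer = neighbors[current][0]
        if port1_peer not in neighbors:
            return False  # missing or unknown neighbor: ring is open
        if neighbors[port1_peer][1] != current:
            return False  # link is not reciprocal
        current = port1_peer
    return current == start and len(visited) == len(neighbors)


if __name__ == "__main__":
    # Values copied from the "show switch detail" output on the slide.
    table = {1: (2, 4), 2: (3, 1), 3: (4, 2), 4: (1, 3)}
    print("Full ring:", is_full_ring(table))
```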

Catalyst 9000 – HA State Machine
• Active starts RP Domain locally
• Programs hardware on all LC Domains
• Traffic starts once hardware is programmed
• Starts 2min Timer to elect Standby in parallel
• Active elects Standby
• Standby starts RP Domain locally
• Starts Bulk Sync with Active RP
• Standby reaches “Standby Hot”
[Diagram: four stack members with LC and Infra domains; the Active (A) and Standby (S) members also run an RP domain; 2min timer]

Show redundancy states
Switch# show redundancy states
my state = 13 -ACTIVE
peer state = 8 -STANDBY HOT
Mode = Duplex
Unit ID = 1

Redundancy Mode (Operational) = SSO
Redundancy Mode (Configured) = SSO
Redundancy State = SSO
Manual Swact = enabled

Communications = Up

client count = 76
client_notification_TMR = 360000 milliseconds
keep_alive TMR = 9000 milliseconds
keep_alive count = 0
keep_alive threshold = 9
RF debug mask = 0

Callouts:
• my state: terminal state for Active Unit
• peer state: terminal state for Standby Unit for SSO
• Unit ID: slot number of Active Unit
• Communications: communication channel status between the Active/Standby RP units
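A quick way to act on this output is to parse the `key = value` lines and check the terminal states. The following is a sketch, assuming the output has been captured as text; the readiness criteria (Active, Standby Hot, operational mode SSO, communications up) are read from this slide, and the function names are hypothetical.

```python
from typing import Dict

# Trimmed sample based on the output shown on the slide.
SAMPLE = """\
my state = 13 -ACTIVE
peer state = 8 -STANDBY HOT
Mode = Duplex
Unit ID = 1
Redundancy Mode (Operational) = SSO
Redundancy Mode (Configured) = SSO
Redundancy State = SSO
Communications = Up
"""


def parse_states(text: str) -> Dict[str, str]:
    """Split each 'key = value' line into a lower-cased key and its value."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def sso_ready(fields: Dict[str, str]) -> bool:
    """True when both units are in their terminal SSO states."""
    return (
        fields.get("my state", "").endswith("ACTIVE")
        and fields.get("peer state", "").endswith("STANDBY HOT")
        and fields.get("redundancy mode (operational)") == "SSO"
        and fields.get("communications") == "Up"
    )


if __name__ == "__main__":
    print("SSO ready:", sso_ready(parse_states(SAMPLE)))
```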

Show Redundancy Command Output…
Switch#sh redundancy
Redundant System Information :
------------------------------
Available system uptime = 29 weeks, 2 days, 11 hours, 47 minutes
Switchovers system experienced = 2
Standby failures = 0
Last switchover reason = user_forced

Hardware Mode = Duplex
Configured Redundancy Mode = SSO
Operating Redundancy Mode = SSO
Maintenance Mode = Disabled
Communications = Up

Current Processor Information :
------------------------------
Active Location = slot 1
Current Software state = ACTIVE
Uptime in current state = 1 week, 4 days, 22 hours, 38 minutes
Image Version = Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M),
Version 03.03.03E RELEASE SOFTWARE (fc1)

Peer Processor Information :
------------------------------
Standby Location = slot 2
Current Software state = STANDBY HOT
Uptime in current state = 1 week, 4 days, 22 hours, 34 minutes
Image Version = Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M),
Version 03.03.03E RELEASE SOFTWARE (fc1)

Callouts:
• Available system uptime: system uptime
• Image Version: image version of current unit

StackWise Virtual Architecture
Extending StackWise Architecture

• Cisco StackWise Virtual extends proven back-panel technology over front-panel network ports
• Cisco StackWise Virtual simplifies the distribution layer by combining two common Cat 9K series chassis into a single logical entity
Does it look familiar? VSS
[Diagram: Dist-1 formed by SW-1 and SW-2 (Cat 9K) joined over 40G/10G links]

StackWise Virtual Architecture
Resilient Software Design

• Cisco StackWise Virtual supports 1+1 Inter-Chassis SSO redundancy providing non-stop communication
• Consistent SSO and NSF capable protocols and features on both deployment models
[Diagram: Dist-1 formed by SW-1 and SW-2 (Cat 9K) joined over 40G/10G links]

StackWise Virtual Architecture
Simplified. Scalable.
• Cisco StackWise Virtual supports a unified control and management plane architecture
• Complex network designs get simplified with Multi-Chassis EtherChannels (MEC)
• Improved application performance with deterministic network resiliency during various planned or unplanned failures.
[Diagram: Core, Distribution, and Access layers; at the distribution layer, Dist-1 is formed by SW-1 and SW-2 (Cat 9K) joined over 40G/10G links]
SW Patching in IOS-XE

SMU Patching Support is only for the Cat9K product family

Adding a SMU file
9300#install add file flash:cat9k-universalk9.2017-03-17_21.53_zhangyu.[Link]
install_add: START Sun Mar 26 [Link] UTC 2017
SUCCESS: Finished copying package(s) to the selected switch(es)
SUCCESS: install_add /flash/cat9k-universalk9.2017-03-17_21.53_zhangyu.[Link] Sun Mar 26 [Link] UTC 2017

Activating SMU
9300#install activate file flash:cat9k-universalk9.2017-03-17_21.53_zhangyu.[Link]
install_activate: START Sun Mar 26 [Link] UTC 2017
2 install_activate: Activating SMU...
This operation requires a reload of the system. Do you want to proceed? [y/n]y
2 install_activate: Reloading the box to complete activation of the SMU...

Committing it
9300#install commit
install_commit: START Sun Mar 26 [Link] UTC 2017
SUCCESS: install_commit Sun Mar 26 [Link] UTC 2017

Any failures/reloads between activate and commit result in a rollback

SMU Deployment Experience with Cisco DNA Center
• Download SMU to APIC-EM file server
• Analyze SMU impact
• Test SMU on Pilot setup
• Schedule SMU deployment
[Diagram: the network admin uses the Cisco DNA Center app; the SMU and its ReadMe move from the SMU server ([Link]) to the APIC-EM file server, then to the pilot site and the production sites]

Stackable Best Practices
Stacking Convergence
Multi-Layer Access
Not a recommended design
• Active unit with uplink failure introduces two failures
  • Active control plane
  • Uplink interface
• When the Active fails, the Standby will take over.
• Upstream, HSRP / GLBP will detect link down, and D2 will start answering to the virtual MAC 0000.0c07.ac00
• Downstream traffic is re-routed to D2 via L3 link
[Diagram: distribution switches D1 (HSRP Active) and D2 (HSRP Standby) advertise summary subnets and connect at L2 to an access stack S1-S3 (single logical switch with Active and Standby members). HSRP vIP [Link], vMAC 0000.0c07.ac00; hosts use the vIP as gateway and resolve it to 0000.0c07.ac00.]

Stacking Convergence
Multi-Layer Access
• Active unit failure (without uplink)
• When the Active fails, the Standby will take over
• No HSRP/GLBP failover: while the new Active is being elected, the HSRP/GLBP MAC address is still used by the rest of the stack for data forwarding
• No downstream re-route convergence
[Diagram: distribution switches D1 (HSRP Active) and D2 (HSRP Standby) advertise summary subnets and connect at L2 to an access stack S1-S3 (single logical switch with Standby and Active members). HSRP vIP [Link], vMAC 0000.0c07.ac00; hosts use the vIP as gateway and resolve it to 0000.0c07.ac00.]

Catalyst 9300 StackWise
Routed Access
• CLI “stack-mac persistent timer 0” enables MAC consistency
  • This is the default value for 3850/9300
  • This is a change from the existing stacking model
• New Active inherits the MAC address of the previous Active
• No MAC changes for end hosts and adjacent routers, significantly improves upstream recovery
• Caution
  • Do not re-introduce the 3x50/9300 elsewhere in order to avoid duplicate MAC in your network
[Diagram: distribution switches advertise summary subnets and connect at L3 to an access stack S1-S3 (single logical switch with Standby and Active members). NO MAC changes: hosts keep resolving their gateway [Link] to [Link].7c80.]

Changing Stack Mac on Cat9K Switches
• By default the timer value is set to indefinite (0)
  • System continues to keep the selected stack MAC after switchover
  • Avoids protocol flapping
• How to change it
  • A new command was introduced: switch#stack-mac update force

Catalyst9k#show switch
Switch/Stack Mac Address : 2037.06cf.0e80
Mac persistency wait time: Indefinite
                                             H/W   Current
Switch#   Role    Mac Address     Priority Version  State
------------------------------------------------------------
*1       Active   2037.06cf.0e80     10     V01     Ready
 2       Standby  2037.06cf.3380     8      V00     Ready
 3       Member   2037.06cf.1400     6      V00     Ready
 4       Member   2037.06cf.3000     4      V00     Ready

[Overlaid slide builds: a second “show switch” output shows switch 1 as Removed (MAC 0000.0000.0000) and switch 2 as Active, with switches 3 and 4 as Members; the stack MAC values shown are 2037.06cf.0e80 and 2037.06cf.3380]

Key Recommendations for Stacking
• Run the stack in full ring mode to get full bandwidth
• Configure the Active switch priority and Standby switch priority
• Predetermine which switch is the Active and which is the Standby that will become the Active should the Active fail
• Simplifies operations
• Configure Active and Standby unit without uplinks if possible
• If deploying a stack of 4 or more switches, keep the Active and Standby switches without uplinks; this simplifies convergence and reduces outage time
• Do Not change the stack-mac timer value
• By default the value is 0 (indefinite)
• Avoids protocol flapping
• There is a command to change the stack-mac when needed
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• Availability Modeling
• Stateful Switchover, Non-Stop Forwarding, and Non-Stop Routing
• Stackwise480 and Stackwise
• In Service Software Upgrades

• Foundations of the Structured Network Design


• High Availability Architectures
• High Availability System Recovery Analysis

ISSU Overview
• ISSU provides a mechanism to perform software upgrades and downgrades without taking the switch out of service
• Leverages the capabilities of NSF and SSO to allow the switch to forward traffic during Supervisor IOS upgrade (or downgrade)
• Key technology is the ISSU Infrastructure
  • Allows SSO between different versions
[Diagram: Catalyst 9400 with Active Sup and Standby Sup (SSO) and Line Cards]

In Service Software Upgrades
Streamlined Process for Software Upgrades/Downgrades

1. ISSU Loadversion
2. ISSU Runversion
3. ISSU Acceptversion (Optional)
4. ISSU Commitversion

Stateful Switchover Mode – IOS
ISSU Client and Versioning Infrastructure
[Diagram: the Active Supervisor runs Cisco IOS Version 1 and the Standby Hot Supervisor runs Cisco IOS Version 2. On each, ISSU Versioning and ISSU Clients sit alongside the Redundancy Facility and the Checkpointing Facility. HA-Compliant Applications: Routing Protocols, NetFlow, Cisco Discovery Protocol …and more. HA-Aware Applications: Forwarding Information Base, Port Manager, PAgP / LACP …and more.]


ISSU Client and Infrastructure Interactions – IOS
[Diagram: ISSU negotiation between Application XYZ on the Active Supervisor (ISSU Client V1, ISSU Endpoint V1, sending MSG V1) and Application XYZ on the Hot Standby Supervisor (ISSU Client V3, ISSU Endpoint V3, supporting V1, V2, V3, sending MSG V3), each through its Versioning Infrastructure]
1. Register Client: each client registers its Client ID, message capabilities, message versions, card type…; the Versioning Infrastructure stores the client info
2. Propose Capabilities / Capabilities Negotiation: endpoints agree on a common set of capabilities
3. Propose Message Version / Message Version Negotiation: endpoints agree on a common message version (in the example, they agree on V1)
4. Compatible? (Y/N): if compatible, then message exchange can proceed
5. Message Exchange, with Message Transformation between the versions
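The message version negotiation above can be modeled as finding a version both endpoints support. In this sketch the choice of the highest common version is an assumption made for illustration; the slide only shows a V1 endpoint and a V1/V2/V3 endpoint agreeing on V1. The function name is hypothetical.

```python
from typing import Optional, Set


def negotiate_version(local: Set[int], peer: Set[int]) -> Optional[int]:
    """Return a message version both endpoints support, or None.

    Assumption: when several versions are common, the highest one is
    chosen. Returning None models the "not compatible" branch.
    """
    common = local & peer
    return max(common) if common else None


if __name__ == "__main__":
    active = {1}          # endpoint on the Active Supervisor (V1 only)
    standby = {1, 2, 3}   # endpoint on the Hot Standby Supervisor
    agreed = negotiate_version(active, standby)
    if agreed is None:
        print("Incompatible: message exchange cannot proceed")
    else:
        print(f"Agreed on V{agreed}: message exchange can proceed")
```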

ISSU Dual Supervisor – Catalyst 9400
ISSU Process
Dual Supervisors
• ISSU Process leverages SSO/NSF Architecture
• Uplinks on both active and standby SUP are forwarding traffic
• Convergence is less than 200 msec
[Diagram: Start ISSU on a Catalyst 9400 with Active Supervisor, Standby Supervisor (SSO), and Line Card]

C9K ISSU
Dual Supervisor ISSU

3 Step Process (granular control on the upgrade process, with ability to roll back)
• Install add file <tftp/ftp/flash/disk:*.bin>
• Install activate ISSU
• Install commit

1 Step Process (single command to perform complete ISSU)
• Install add file <tftp/ftp/flash/disk:*.bin> activate ISSU commit

C9K ISSU Workflow
Dual Supervisor ISSU
1. ISSU started; image is expanded on Active and Standby (S1 Active on V1, S2 Standby on V1)
2. Standby reloads with the new V2 image (S1 Active on V1, S2 Standby on V2); the abort timer starts
   • If S2 fails to become standby it will revert back to step 1
3. Auto-switchover causes S2 to become the new Active, and S1 reloads with the new V2 image (S1 Standby on V2, S2 Active on V2)
   • An expired abort timer will revert to step 2 and then step 1
4. ‘Commit’ keyword stops the abort timer
5. ISSU complete (S1 Standby on V2, S2 Active on V2)
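The workflow, including the abort-timer rollback, can be sketched as a small state model. This is a simplified illustration of the steps on the slide, not the actual install logic; the `Chassis` type and `run_issu` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chassis:
    s1: str = "V1"          # image running on supervisor S1
    s2: str = "V1"          # image running on supervisor S2
    active: str = "S1"      # which supervisor is Active
    committed: bool = False


def run_issu(commit_before_timeout: bool) -> Chassis:
    """Walk the five workflow steps and return the final chassis state."""
    c = Chassis()
    # Step 1: image expanded on both supervisors (no state change yet).
    # Step 2: standby S2 reloads with V2; the abort timer starts.
    c.s2 = "V2"
    # Step 3: auto-switchover makes S2 Active; S1 reloads with V2.
    c.active, c.s1 = "S2", "V2"
    if commit_before_timeout:
        # Steps 4-5: commit stops the abort timer; ISSU is complete.
        c.committed = True
        return c
    # Abort timer expired: revert to step 2, then to step 1.
    c.active, c.s1 = "S1", "V1"
    c.s2 = "V1"
    return c


if __name__ == "__main__":
    print("Committed in time:", run_issu(commit_before_timeout=True))
    print("Timer expired:    ", run_issu(commit_before_timeout=False))
```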
Stackwise Virtual - ISSU
C9K ISSU
Stackwise Virtual ISSU and Dual Supervisor ISSU

3 Step Process (granular control on the upgrade process, with ability to roll back)
• Install add file <tftp/ftp/flash/disk:*.bin>
• Install activate ISSU
• Install commit

1 Step Process (single command to perform complete ISSU)
• Install add file <tftp/ftp/flash/disk:*.bin> activate ISSU commit

Stackwise Virtual ISSU
ISSU Process

[Diagram: Install ISSU on two Catalyst 9500-24Q switches joined by a StackWise Virtual link and a dual-active detection link, each moving between 16.8.1 and 16.8.2. The auto-switchover produces a first and a second sub-second traffic convergence event.]

Enhanced Fast Software Upgrade – Catalyst 9000
Achieving High Availability on Catalyst 9300
Enhanced Fast Software Upgrade
• eFSU provides a mechanism to upgrade and downgrade the software image by segregating the Control Plane and Data Plane update
• It updates the control plane by leveraging the NSF/GR Architecture with Flush and Re-Learn mechanism to reduce the impact on the data plane
[Diagram: Control-Plane RIB (Prefix, Next Hop) above the Data Plane FIB Table (Prefix, Next HOP)]

Enhanced Fast Software Upgrade
Regular Upgrade vs. Enhanced Fast Software Upgrade Process

Regular Upgrade
#Install add file image activate commit
Traffic is impacted throughout the upgrade cycle

Enhanced Fast Software Upgrade (16.10.1*)
#Install add file image activate reloadfast enhanced commit
< 30 seconds of traffic impact

* Limited Controlled Availability in 16.10.1

Enhanced Fast Software Upgrade
CLI commands

• FSU is supported only in install mode
• One step command which activates the fast software upgrade and commits it
9300# install add file flash:cat9k_iosxe.BLD_V1610 activate reloadfast enhanced commit
• Fast Reload without Software upgrade
9300# Reload Fast Enhanced

Enhanced Fast Software Upgrade – VSS system
VSS Software Upgrade on Catalyst 6500
Enhanced Fast Software Upgrade (EFSU)
Preparation Steps
1. Before ISSU software upgrade, VSS Switch-1 and Switch-2 will be running the old software image.
2. Install the new image to the same location on the file systems of both Supervisors
3. Make sure the boot register is configured for auto boot 0x2102
Execute Upgrade
1. ISSU Loadversion
[Diagram: Switch-1 (VSS Active, old version) and Switch-2 connected by the VSL, each with a WS-X6708-10G line card. Switch-2 reloads (R) and moves through STANDBY COLD to VSS Standby Hot on the new version. Legend: R = Reload, SO = Switchover. Chart: 100% and 50% levels across steps 1-4, with SW2 marked.]
VSS Software Upgrade on Catalyst 6500
Enhanced Fast Software Upgrade (EFSU)
Preparation Steps
1. Before ISSU software upgrade, VSS Switch-1 and Switch-2 will be running the old software image.
2. Install the new image to the same location on the file systems of both Supervisors
3. Make sure the boot register is configured for auto boot 0x2102
Execute Upgrade
1. ISSU Loadversion
2. ISSU Runversion
3. ISSU Acceptversion (Optional)
[Diagram: a switchover (SO) makes Switch-2 the VSS Active on the new version; Switch-1 reloads (R) and moves through STANDBY COLD to VSS Standby Hot. Legend: R = Reload, SO = Switchover. Chart: 100% and 50% levels across steps 1-4, with SW2 and SW1 marked.]
VSS Software Upgrade on Catalyst 6500
Enhanced Fast Software Upgrade (EFSU)
Preparation Steps
1. Before ISSU software upgrade, VSS Switch-1 and Switch-2 will be running the old software image.
2. Install the new image to the same location on the file systems of both Supervisors
3. Make sure the boot register is configured for auto boot 0x2102
Execute Upgrade
1. ISSU Loadversion
2. ISSU Runversion
3. ISSU Acceptversion (Optional)
4. ISSU Commitversion
[Diagram: Switch-2 is VSS Active; Switch-1 reloads (R) and moves through STANDBY COLD to VSS Standby Hot on the new version. Legend: R = Reload, SO = Switchover. Chart: 100% and 50% levels across steps 1-4, with SW2, SW1, and SW1 marked.]
VSS Quad SUP SSO - Catalyst 6500
• In Chassis Standby SUP in each Switch
  • This will keep the unit up and running when the other chassis is reloaded
• We take advantage of this for EFSU
• There are 2 Upgrade Modes
  • Standard EFSU
  • Staggered EFSU
[Diagram: Switch ID 1 with ICA (SSO Act) and ICS; Switch ID 2 with ICA (SSO Stby) and ICS]

EFSU Quad Sup
Normal Quad Sup Upgrade vs. Staggered Quad Sup Upgrade

Normal Quad Sup Upgrade
1. ISSU Loadversion (whole Standby Sw2 chassis reload)
2. ISSU Runversion (whole Active Sw1 chassis reload)
3. ISSU Acceptversion (Optional)
4. ISSU Commitversion (whole Standby Sw1 chassis reload)

Staggered Quad Sup Upgrade
1. ISSU Loadversion (2nd Sup on Standby Chassis - ICS)
2. ISSU Loadversion – Step 2 (Switchover with the Standby Chassis, LCs reload)
3. ISSU Runversion (Chassis S/O)
4. ISSU Commitversion (ICS on new Standby Chassis)
5. ISSU Commitversion – Step 2 (Reload on the new Standby Chassis LC)

[Charts: 100% and 50% levels per step for each mode, with the reloading chassis (SW 1 or SW 2) marked]
Cisco IOS ISSU Summary
• ISSU is a software upgrade/downgrade procedure
• Changes the risk assessment criteria
• Minimizes the impact of upgrades/downgrades
• Allows for a trial period with automated rollback
• Less downtime
• Both software versions must be ISSU compatible
in order to achieve an SSO-based upgrade
• Software version compatibility includes
• 18 month rolling window between software releases of the same train
• Same license level required between versions

Graceful Insertion and
Removal - GIR
Graceful Insertion and Removal for Catalyst 9000
Isolation of Switch from network

Change window begins.


Start Maintenance

One command!
Pre-change System Snapshot
Graceful Insertion and Removal for Catalyst 9000
Return Switch into network

Change window begins.

Stop Maintenance

One command!
Pre-change System Snapshot
Graceful Insertion and Removal
Isolation of Switch from network

• Isolate a switch from the network in order to perform debugging or an upgrade.
• Isolate: All protocols are gracefully brought down, but the switch itself is not shut down.
• Entering Maintenance Mode:
• EGP -> IGP in Parallel -> L2 (shutdown port)
• Exiting Maintenance Mode:
• L2 -> IGP in Parallel -> EGP
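
The ordering above can be sketched in a few lines of Python. This is only an illustrative model, assuming each configured protocol instance is tagged as EGP, IGP, or L2; the function name `gir_stages` and the instance names are hypothetical and are not part of any Cisco API.

```python
from typing import Dict, List

# Stage order when entering maintenance mode, as listed on the slide.
ENTER_ORDER = ["EGP", "IGP", "L2"]


def gir_stages(protocols: Dict[str, str], entering: bool = True) -> List[List[str]]:
    """Group protocol instances into ordered stages.

    `protocols` maps a (hypothetical) instance name to its class:
    "EGP", "IGP" or "L2". Instances inside one stage may be handled
    in parallel; stages themselves run one after another.
    """
    order = ENTER_ORDER if entering else list(reversed(ENTER_ORDER))
    stages = []
    for cls in order:
        members = sorted(name for name, c in protocols.items() if c == cls)
        if members:
            stages.append(members)
    return stages


if __name__ == "__main__":
    example = {
        "bgp 65000": "EGP",
        "ospf 1": "IGP",
        "isis 1": "IGP",
        "access ports": "L2",
    }
    print("enter:", gir_stages(example, entering=True))
    print("exit: ", gir_stages(example, entering=False))
```

Exiting maintenance mode is simply the reverse stage order, which is why a single ordered list is enough for both directions.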

Graceful Insertion and Removal
Default and Customizable Templates

• Default Template
• System Generated Profile based on the switch configuration
• Customized Template
• User Configured Profile based on specific configuration or use case

9300L#show system mode maintenance template default
System Mode: Normal
default maintenance-template details:
router isis 1
shutdown l2

9300L#show system mode maintenance template test
System Mode: Normal
Maintenance Template test details:
shutdown l2

Graceful Insertion and Removal
Snapshots
• Automatic Snapshots
• Snapshots are automatically generated when entering and exiting maintenance mode
• Captures operational data from the running system like VLANs, routes, etc.
• User Configured Snapshots
• Snapshots can be collected manually for comparing and troubleshooting

Switch#show system snapshots compare before_maintenance after_maintenance
================================================================================
Feature Tag .before_maintenance .after_maintenance
================================================================================
[interface]
--------------------------------------------------------------------------------
[Name:Vlan1]
packetsinput 181587 **181589**
[Name:GigabitEthernet1/0/3]
packetsinput 101531 **101550**
broadcasts 80893 **80910**
packetsoutput 211568 **211594**
[Name:GigabitEthernet1/0/8]
output [Link], **[Link],**
packetsinput 6915 **6918**
packetsoutput 57677 **57706**
[Name:GigabitEthernet1/0/17]
packetsinput 101528 **101550**
broadcasts 80891 **80910**
packetsoutput 211570 **211600**
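
The idea behind the compare output can be illustrated with a small Python sketch. It assumes a snapshot has already been parsed into a nested dictionary (interface name, then counter name); that data model and the function name `compare_snapshots` are assumptions made for illustration, not the switch's internal format.

```python
from typing import Dict, Optional, Tuple

# Hypothetical parsed form: {interface name: {counter name: value}}
Snapshot = Dict[str, Dict[str, int]]
Change = Tuple[Optional[int], Optional[int]]


def compare_snapshots(before: Snapshot, after: Snapshot) -> Dict[str, Dict[str, Change]]:
    """Return only the counters whose value differs between two snapshots."""
    changes: Dict[str, Dict[str, Change]] = {}
    for name in sorted(set(before) | set(after)):
        b = before.get(name, {})
        a = after.get(name, {})
        diff = {
            key: (b.get(key), a.get(key))
            for key in sorted(set(b) | set(a))
            if b.get(key) != a.get(key)
        }
        if diff:
            changes[name] = diff
    return changes


if __name__ == "__main__":
    # Counter values taken from the sample output on the slide.
    before = {
        "Vlan1": {"packetsinput": 181587},
        "GigabitEthernet1/0/3": {"packetsinput": 101531, "broadcasts": 80893},
    }
    after = {
        "Vlan1": {"packetsinput": 181589},
        "GigabitEthernet1/0/3": {"packetsinput": 101550, "broadcasts": 80910},
    }
    for interface, counters in compare_snapshots(before, after).items():
        for counter, (old, new) in counters.items():
            print(f"{interface} {counter}: {old} -> {new}")
```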

GIR Summary
• GIR is used to isolate a switch for:
• Maintenance
• HW upgrade
• SW upgrade
• Works well in an L3 end to end network
• Order of Maintenance is
• EGP -> IGP (in parallel) -> L2 shutdown
• HSRP/VRRP can be leveraged without causing issue on switchover

• If StackWise Virtual is deployed, you don’t need to do GIR to upgrade those switches
• Leverage ISSU with StackWise Virtual instead

Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• Foundations of the Structured Network Design
• Modularity, Hierarchy, and Structure
• Leveraging Hardware-Based Path Restoration
• High Availability Architectures
• High Availability System Recovery Analysis

[Figure: enterprise reference architecture. Headquarters (access switches, distribution switches, core switches, WAN routers, WAAS, wireless LAN controllers), Data Center (Nexus switches, UCS rack-mount servers and blade chassis, storage, Cisco ACE, firewalls, communications managers, WAAS Central Manager), Internet Edge (Internet routers, firewalls, RA-VPN, DMZ switch and servers, guest wireless LAN controller, web and email security appliances), WAN aggregation, and regional sites, remote sites, and teleworkers / mobile workers connected over MPLS WANs and the Internet.]
Hierarchical network design
High availability using modularity, hierarchy, and structure

• Each layer in hierarchy has a specific role
• Modular topology: building blocks
• Modularity makes it easy to grow, understand, and troubleshoot
• Structure creates small fault domains and predictable network behavior: clear demarcations and isolation
• Promotes load balancing and resilience
[Figure: building block made of access, distribution, core, distribution, and access layers]

Hierarchical network design
• Core
• Connectivity, availability and scalability
• Distribution
• Aggregation for wiring and traffic flows
• Policy and network control point (FHRP, L3 summarization)

• Access
• Physical – Ethernet wired 10/100/1000(802.3z)/mGig(802.3bz);
802.3af(PoE), 802.3at(PoE+), and Cisco Universal POE (UPOE)
• Policy enforcement – security: 802.1x, port security, DAI, IPSG, DHCP
snooping; identification: CDP/LLDP; QoS: policing, marking, queuing
• Traffic control – IGMP snooping, broadcast control

Hierarchical network design
Do I need a core layer?
• It is a question of operational complexity and a question of scale
• n x (n-1) scaling
• Routing peers
• Fiber, line cards and port counts ($,€,£)
• Capacity planning considerations
• Easier to track traffic flows from a block to the common core than to ‘n’ other blocks
• Geographic factors may also influence the design
• Multi-building interconnections may have fiber limitations
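
A quick calculation makes the scaling argument concrete. The sketch below is a simplification that treats each distribution block as a single node and assumes a core of two switches with one link from every block to each core switch; the function names are illustrative only.

```python
def full_mesh_links(blocks: int) -> int:
    """Point-to-point links needed to fully mesh `blocks` distribution blocks."""
    return blocks * (blocks - 1) // 2


def full_mesh_interfaces(blocks: int) -> int:
    """Interfaces consumed by that mesh: the n x (n-1) figure (two per link)."""
    return blocks * (blocks - 1)


def core_links(blocks: int, core_switches: int = 2) -> int:
    """Links needed when every block connects once to each core switch.

    The core size of two is an assumption; links between the core
    switches themselves are not counted.
    """
    return blocks * core_switches


if __name__ == "__main__":
    print("blocks  mesh links  mesh interfaces  links via core")
    for n in range(2, 9):
        print(f"{n:>6}  {full_mesh_links(n):>10}  "
              f"{full_mesh_interfaces(n):>15}  {core_links(n):>14}")
```

Under these assumptions the mesh grows quadratically while the core design grows linearly, so the core starts to pay off in fiber, line cards, and routing peers once more than a handful of blocks exist.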

Structured campus network design

• Optimize data load-sharing, redundancy design for best application performance


• Diversify uplink network paths with cross-stack and dual-sup access-layer switches
• Build distributed and full-mesh network paths between Distribution and Access-layer switches
High availability design optimization
of the elements
• Optimize the interaction of the physical redundancy with the network protocols
• Provide the necessary amount of redundancy
• Pick the right protocol for the requirement
• Optimize the tuning of the protocol
• The network looks like this so that we can map the protocols onto the physical topology

What we are trying to avoid!

Optimizing network convergence
Failure detection and recovery
• Optimal high availability network design attempts to
leverage ‘local’ switch fault detection and recovery
• Design should leverage the hardware capabilities of
the switches to detect and recover traffic flows
based on these ‘local’ events
• Design principle –
Hardware failure detection and recovery is both
faster and more deterministic
• Design principle –
Software failure detection mechanisms provide a
secondary, not primary, fault detection and recovery
mechanism in the optimal design

Optimizing network convergence
Layer 1 link redundancy and failure detection
• Direct point to point fiber provides for fast failure detection

• Do not disable auto-negotiation on GigE and 10GigE interfaces

• IEEE 802.3z and 802.3ae link negotiation define the use of Remote Fault
Indicator & Link Fault Signaling mechanisms

• IOS debounce –
• The value for GigE and 10GigE fiber ports is 10 msec
• Minimum for copper is 300 msec
• NX-OS debounce – currently 100 msec by default
• All 1G and 10G SFP / SFP+ based interfaces (MM, SM, CX-1) are changing to a default of 10 msec
• RJ45 based copper interfaces on NX-OS will remain at 100 msec

• Design principle
Understand how hardware choices and tuning impact fault detection and
response to link failures

Optimizing network convergence
Layer 2 software fault detection (e.g. UDLD)
• While 802.3z and 802.3ae link negotiation provide for L1 fault detection, hardware ASIC failures can still occur
• UDLD provides an L2 based keep-alive mechanism that confirms bi-directional L2 connectivity
• Each switch port configured for UDLD will send UDLD protocol packets (at L2) containing the port’s own device / port ID, and the neighbor’s device / port IDs seen by UDLD on that port
• If the port does not see its own device / port ID echoed in the incoming UDLD packets, the link is considered unidirectional and is shut down
• Design principle –
Redundant fault detection mechanisms are required (SW as a backup to HW where possible)
[Figure: two switches with crossed Tx / Rx fibers exchanging UDLD keepalives]
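
The echo check described above can be sketched as follows. This is a deliberately simplified model that ignores UDLD timers, detection phases, and normal versus aggressive mode; the `UdldPacket` class and `link_is_bidirectional` function are hypothetical names used for illustration only.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

PortId = Tuple[str, str]  # (device ID, port ID)


@dataclass
class UdldPacket:
    """Simplified UDLD message: who sent it and which neighbors it has heard."""
    sender: PortId
    neighbors_seen: Set[PortId] = field(default_factory=set)


def link_is_bidirectional(local: PortId, received: UdldPacket) -> bool:
    """True when the neighbor echoes our own device / port ID back to us."""
    return local in received.neighbors_seen


if __name__ == "__main__":
    local = ("SwitchA", "Gi1/0/1")
    neighbor = ("SwitchB", "Gi1/0/2")

    healthy = UdldPacket(sender=neighbor, neighbors_seen={local})
    broken = UdldPacket(sender=neighbor, neighbors_seen=set())

    print("healthy link bidirectional:", link_is_bidirectional(local, healthy))
    print("broken link bidirectional: ", link_is_bidirectional(local, broken))
```

In the broken case the neighbor is still transmitting, so Layer 1 looks healthy from the local side, but the missing echo shows that the local transmit path is not being received.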

Optimizing network convergence
Layer 2 and 3 – Why use routed interfaces?
L3 routed interfaces allow faster convergence than L2 switchport with an associated L3 SVI

[Link].042 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet3/1, changed state to down
[Link].050 UTC: %LINK-3-UPDOWN: Interface GigabitEthernet3/1, changed state to down
[Link].050 UTC: IP-EIGRP(Default-IP-Routing-Table:100): Callback: route_adjust GigabitEthernet3/1

[Link].813 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2/1, changed state to down
[Link].821 UTC: %LINK-3-UPDOWN: Interface GigabitEthernet2/1, changed state to down
[Link].069 UTC: %LINK-3-UPDOWN: Interface Vlan301, changed state to down
[Link].069 UTC: IP-EIGRP(Default-IP-Routing-Table:100): Callback: route_adjust Vlan301

Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• High Availability Architectures:
• Enterprise Wired LAN
• Multilayer Campus Distribution and HA Considerations
• Simplified Distribution and HA Advantages
• Extending HA Advantages by Simplifying Virtualization
• Enterprise Data Center
• Enterprise Wireless LAN
• High Availability System Recovery Analysis

Optimizing the Layer 2 design – spanning tree

Looped design:
• At least some VLANs span multiple access switches
• Layer 2 loops
• Layer 2 and 3 running over link between distribution
• Blocked links
• More typical of a “classic” data center design

Loop-free design:
• Each access switch has unique VLANs
• No Layer 2 loops
• Layer 3 link between distribution
• No blocked links
• More typical of a campus LAN design

Optimizing the Layer 2 design
Non-STP-blocking topologies converge fastest

• When STP is not blocking uplinks, recovery of access to distribution link failures is accomplished based on L2 CAM updates, not on Spanning Tree protocol recovery
• Time to restore traffic flows is based on:
• Time to detect link failure + Time to purge the HW
CAM table and begin to flood the traffic
• No dependence on external events (no need to wait
for Spanning Tree convergence)
• Behavior is deterministic

Optimizing the Layer 2 design
PVST+, Rapid PVST+, MST
• PVST+ (pre 802.1D-2004): traditional spanning tree
• Rapid-PVST+ (802.1w) greatly improves the restoration times for any VLAN that requires a topology convergence due to link UP
• Rapid-PVST+ also greatly improves convergence time over BackboneFast for any indirect link failures
• Rapid PVST+
• Scales to large size (up to 16,000 logical ports)
• Easy to implement, proven, scales
• MST (802.1s)
• Permits very large scale STP implementations (up to 75,000 logical ports)
Optimizing the Layer 2 design
Complex topologies take longer to converge
• Time to converge is dependent on the protocol
implemented – 802.1D, 802.1s, or 802.1w
• It is also dependent on –
• Size and shape of the L2 topology (how deep is the tree)
• Number of VLANs being trunked across each link
• Number of logical ports in the VLAN on each switch
• Non-congruent topologies take longer to converge.
Restricting the topology is necessary to reduce
convergence times
• Prune all unnecessary VLANs from trunk configuration

Optimizing the Layer 2 design
STP toolkit – PortFast and BPDU guard
• PortFast is configured on edge ports to let them move quickly to forwarding, bypassing listening and learning; it also avoids
TCN (Topology Change Notification) messages
• BPDU guard can prevent loops by moving PortFast
configured interfaces that receive BPDUs to errdisable state
• BPDU guard prevents ports configured with PortFast from
being incorrectly connected to another switch
• When enabled globally, BPDU guard applies to all interfaces
that are in an operational PortFast state
Switch(config-if)#spanning-tree portfast
Switch(config-if)#spanning-tree bpduguard enable
1w2d: %SPANTREE-2-BLOCK_BPDUGUARD: Received BPDU on port FastEthernet3/1 with BPDU Guard enabled. Disabling port.
1w2d: %PM-4-ERR_DISABLE: bpduguard error detected on Fa3/1, putting Fa3/1 in err-disable state

Optimizing the Layer 2 design
STP best practices for campus

• The root bridge should stay where you put it


• Define the STP primary (and backup) root
• Rootguard
• Loopguard or bridge assurance
• UDLD
• There is a reasonable limit to broadcast and
multicast traffic volumes
• Configure storm control on backup links to
aggressively rate limit broadcast and
multicast

Layer 2 access with Layer 3 distribution
First hop redundancy protocols (FHRP)
• HSRP, GLBP, and VRRP are used to provide a resilient
default gateway / first hop address to end stations
• A group of routers act as a single logical router providing
first hop router redundancy
• Protect against multiple failures
• Distribution switch failure
• Uplink failure
• Default recovery is ~10 Seconds

First Hop Redundancy
Sub-Second Timers Improve Convergence
interface Vlan4
ip address [Link] [Link]
standby 1 ip [Link]
standby 1 timers msec 250 msec 750
standby 1 priority 150
standby 1 preempt
standby 1 preempt delay minimum 180

interface Vlan4
ip address [Link] [Link]
glbp 1 ip [Link]
glbp 1 timers msec 250 msec 750
glbp 1 priority 150
glbp 1 preempt
glbp 1 preempt delay minimum 180

interface Vlan4
ip address [Link] [Link]
vrrp 1 description Master VRRP
vrrp 1 ip [Link]
vrrp 1 timers advertise msec 250
vrrp 1 preempt delay minimum 180

HSRP preemption—why it is desirable
• Spanning tree root and HSRP primary aligned
• When spanning tree root is re-introduced, traffic will take a two-hop path to HSRP active
• HSRP preemption will allow HSRP to follow the spanning tree topology

FHRP design considerations
Preempt delay needs to be longer than boot time
• HSRP is not always aware of the status of
the entire switch and network
• Ensure that you provide enough time for the
entire system to be up – diagnostics (full or
partial), L1 (line cards), L2 (STP),
L3 (IGP convergence)
• Tune delay and preempt delay conservatively
as the network is already forwarding data
interface Vlan402
. . .
standby delay minimum 60 reload 600
standby 1 ip [Link]
standby 1 timers msec 250 msec 750
standby 1 priority 110
standby 1 preempt delay minimum 60 reload 600
standby 1 authentication ese
standby 1 name HSRP-Voice
hold-queue 2048 in

‘standby delay’ controls how long the interface needs to be up before HSRP starts, and ‘preempt delay’ controls how long to wait after HSRP establishes a neighbour relationship.
You should configure both.
Sub-second timer considerations
HSRP, GLBP, OSPF, PIM
• Evaluate your network before implementing any sub-second timers
• Certain events can impact the ability of the switch to process sub-
second timers
• Application of large ACL
• OIR of line cards in Catalyst 6500/6800
• The volume of control plane traffic can also impact the ability to process
• 250 / 750 msec GLBP & HSRP timers are only valid in designs with less
than 150 VLAN instances (Catalyst 6x00 in the distribution)
• Spanning Tree size

FHRP design considerations—
asymmetric routing (unicast flooding)
• Alternating HSRP Active between distribution
switches can be used for upstream load balancing
• This can cause a problem with unicast flooding
• ARP timer defaults to four hours and CAM timer
defaults to five minutes
• The ARP entry is still valid, but no matching L2 CAM table
entry exists
• In many cases when the HSRP standby needs to
forward a frame, it will have to unicast flood the
frame since its CAM table is empty
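
The timer mismatch can be put into numbers with a small sketch. It assumes the worst case in which no frame from the host reaches the standby switch to refresh its CAM entry; the function name is illustrative. The 270 second value is the ARP tuning this deck suggests on the unicast flooding solutions slide.

```python
def flooding_window_seconds(arp_timeout: int, cam_timeout: int) -> int:
    """Seconds during which an ARP entry can outlive its CAM entry.

    While the window is open, the switch still knows the destination MAC
    (ARP) but not the port behind it (CAM), so it floods the frame.
    Returns 0 when the ARP entry ages out first.
    """
    return max(0, arp_timeout - cam_timeout)


if __name__ == "__main__":
    default_arp = 4 * 60 * 60   # ARP timer default: four hours
    default_cam = 5 * 60        # CAM timer default: five minutes

    print("defaults:  ", flooding_window_seconds(default_arp, default_cam), "seconds")
    print("ARP at 270:", flooding_window_seconds(270, default_cam), "seconds")
```

Keeping the ARP timer below the CAM timer closes the window, because the ARP refresh generates traffic that repopulates the CAM table before the entry expires.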

FHRP design considerations—
asymmetric routing (unicast flooding) solutions
• Using ‘V’ based design with unique voice and data VLANs
per access switch, this problem has no user impact
• Don’t deploy stacking switches (i.e. daisy-chained switches)
that depend on spanning tree for managing interconnects in
the stack
• Tune the ARP timer to 270 seconds and leave the CAM timer at its
default; if there are more than 10,000 ARP entries, change the CAM timers instead
• Deploy Multichassis EtherChannel with a virtual switching
technology (VSS or vPC) in the distribution block

Even with faster convergence from RPVST+
we still have to wait for FHRP convergence
• FHRP protocol based forwarding topologies
• Load balancing based on Per-Port or Per-VLAN
• Protocol-based fault detection and recovery –
• Recommended to configure per-VLAN aggressive timers to keep user experience impact within a <1 second boundary
• Limited network scale for system reliability
• Sub-second protocol timers must be avoided on SSO capable network

HSRP Config
interface Vlan2
ip address [Link] [Link]
standby 1 ip [Link]
standby 1 timers msec 250 msec 750
standby 1 priority 150
standby 1 preempt
standby 1 preempt delay minimum 180

[Figure: FHRP Active and FHRP Standby distribution switches; bar chart of convergence (msec, axis 0 to 1000) with aggressive SVI timers on 6500-Sup2T and 4500-Sup7E]
Multilayer campus network design—
It is a good solid design, but…
• Utilizes multiple control protocols
• Spanning tree (802.1w), HSRP / GLBP, EIGRP, OSPF
• Convergence is dependent on multiple factors –
• FHRP – 900msec to 9 seconds
• Spanning tree – Up to 50 seconds
• Load balancing –
• Asymmetric forwarding
• HSRP / VRRP – per subnet
• GLBP – per host
• Unicast flooding in looped design
• STP, if it breaks badly, has no inherent mechanism to stop the loop
[Figure: bar chart of convergence time in seconds (axis 0 to 60): Looped PVST+ (No RPVST+) 50; Non-looped Default FHRP 9.1; Non-looped Sub-Second FHRP 0.91]
Campus wired LAN design
Option 1: Traditional multilayer campus (BRKCRS-2031)

Logical topology: L3 core/dist., L2 dist./acc.
Physical topology: 2 core, 2 dist./acc.

• Common design since the 1990s
• Complex configurations (prone to human error) related to spanning-tree, load balancing, unicast and multicast routing
• Requires heavy performance tuning resulting from reliance on FHRPs (HSRP, VRRP, GLBP)

Survives device and link failures
Easy mitigation of Layer 2 looping concerns
Rapid detection/recovery from failures
Layer 2 across all access blocks within distribution
Device-level CLI configuration simplicity
Automated network and policy provisioning included
Transforming multilayer campus
Before: Layer 3 distribution with Layer 2 access

IGP IGP Layer 3

Layer 2

Simplification with routed access design
After: Layer 3 distribution with Layer 3 access

IGP IGP Layer 3

IGP IGP

Layer 2

• Move the Layer 2 / 3 demarcation to the network edge


• Leverages Layer 2 only on the access ports, but builds a Layer 2 loop-free network
• Design motivations – Simplified control plane, ease of troubleshooting, highest availability

Routed access advantages
Simplified control plane
• Simplified Control Plane
• No STP feature placement (root bridge,
loopguard, …)
• No default gateway redundancy setup/tuning
(HSRP, VRRP, GLBP ...)
• No matching of STP/HSRP priority
• No asymmetric flooding
• No L2/L3 multicast topology inconsistencies
• No Trunking Configuration Required

• L2 Port Edge features still apply:


• Spanning Tree Portfast
• Spanning Tree BPDU Guard
• Port Security, DHCP Snooping, DAI, IPSG
• Storm Control

Routed access advantages
Simplified network recovery
• Routed access network recovery is
dependent on L3 re-route
• Time to restore upstream traffic flows
is based on ECMP re-route
• Time to detect link failure
• Process the removal of the lost routes
from the SW RIB
• Update the HW FIB
• Time to restore downstream flows is
based on a routing protocol re-route
• Time to detect link failure
• Time to determine new route
• Process the update for the SW RIB
• Update the HW FIB
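
The two recovery paths above can be compared as simple time budgets. The sketch below only sums per-step times; every millisecond value in it is a hypothetical placeholder chosen for illustration, not a measured or documented figure, and the step names are labels invented for this example.

```python
from typing import Dict, Iterable

# Steps for each direction, as listed on the slide.
UPSTREAM_STEPS = ("detect_link_failure", "remove_routes_from_rib", "update_hw_fib")
DOWNSTREAM_STEPS = ("detect_link_failure", "determine_new_route",
                    "update_rib", "update_hw_fib")


def restore_time_ms(step_times_ms: Dict[str, float], steps: Iterable[str]) -> float:
    """Sum the per-step times (msec) along one recovery path."""
    return sum(step_times_ms[step] for step in steps)


if __name__ == "__main__":
    # Hypothetical placeholder values, in milliseconds.
    times = {
        "detect_link_failure": 10,
        "remove_routes_from_rib": 20,
        "determine_new_route": 50,
        "update_rib": 20,
        "update_hw_fib": 30,
    }
    print("upstream (ECMP re-route):      ", restore_time_ms(times, UPSTREAM_STEPS), "msec")
    print("downstream (protocol re-route):", restore_time_ms(times, DOWNSTREAM_STEPS), "msec")
```

The structural point is that the downstream path carries one extra step, the routing protocol determining a new route, so it depends on protocol tuning in a way the upstream ECMP path does not.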

Routed access advantages
Faster convergence times
• RPVST+ convergence times dependent on FHRP tuning
• Proper design and tuning can achieve sub-second times
• EIGRP converges <200 msec
• OSPF converges <200 msec with LSA and SPF tuning
[Figure: bar chart of upstream convergence time in seconds (axis 0 to 2) for RPVST+ FHRP, OSPF, and EIGRP]
Routed access advantages
A single router per subnet: simplified multicast
• Layer 2 access has two multicast routers per access subnet, RPF checks
and split roles between routers
• Routed access has a single multicast router which simplifies multicast
topology and avoids RPF check altogether

Routed access advantages
Ease of troubleshooting

• Routing troubleshooting tools
• Consistent troubleshooting: access, dist, core
• show ip route / show ip cef
• Traceroute
• Ping and extended pings
• Extensive protocol debugs
• IP SLA from the Access Layer
• Failure differences
• Routed topologies fail closed, i.e. neighbor loss
• Layer 2 topologies fail open, i.e. broadcast and unknowns flooded

switch#sh ip cef [Link]
[Link]/24
nexthop [Link] TenGigabitEthernet9/4

Why isn’t routed access deployed everywhere?
Routed access design constraints

• VLANs don’t span across multiple wiring closet switches/switch stacks

Does this impact your requirements?

• IP addressing changes: more DHCP scopes and subnets of smaller sizes increase management and operational complexity
• Deployed access platforms must be able to support routing features
[Figure: access and distribution switches interconnected with L3 links]

Campus wired LAN design
Option 2: Layer 3 routed access (BRKCRS-3036)

Logical topology: L3 everywhere, L2 edge only
Physical topology: 2 core, 2 dist./acc.

• Complexity reduced for Layer 2 (STP, trunks, etc.)
• Elimination of FHRP and associated timer tuning
• Requires more Layer 3 subnet planning; might not support Layer 2 adjacency requirements

Survives device and link failures
Easy mitigation of Layer 2 looping concerns
Rapid detection/recovery from failures
Layer 2 across all access blocks within distribution
Device-level CLI configuration simplicity
Automated network and policy provisioning included
Traditional multilayer campus design

Simplified end-to-end VSS design

Data Center

Comparison – standalone (multilayer) versus VSS

Unified system architecture

Catalyst VSS setup
LAN distribution layer

[Figure: Switch 1 and Switch 2 joined by the VSL]

1) Prepare standalone switches for VSS

Switch 1:
Router#conf t
Router(config)# hostname VSS-Sw1
VSS-Sw1(config)#switch virtual domain 100
VSS-Sw1(config-vs-domain)# switch 1

Switch 2:
Router#conf t
Router(config)# hostname VSS-Sw2
VSS-Sw2(config)#switch virtual domain 100
VSS-Sw2(config-vs-domain)# switch 2

2) Configure Virtual Switch Link

Switch 1:
VSS-Sw1(config)#interface port-channel 63
VSS-Sw1(config-if)#switch virtual link 1
VSS-Sw1(config)#interface range tengigabit 5/4-5
VSS-Sw1(config-if)#channel-group 63 mode on
VSS-Sw1(config-if)#no shutdown

Switch 2:
VSS-Sw2(config)#interface port-channel 64
VSS-Sw2(config-if)#switch virtual link 2
VSS-Sw2(config)#interface range tengigabit 5/4-5
VSS-Sw2(config-if)#channel-group 64 mode on
VSS-Sw2(config-if)#no shutdown

3) Validate Virtual Switch Link operation

VSS-Sw1# show etherchannel 63 ports
AND
VSS-Sw2# show etherchannel 64 ports
Ports in the group:
-------------------
Port: Te5/4 Port state = Up Mstr In-Bndl
Port: Te5/5 Port state = Up Mstr In-Bndl

Catalyst VSS setup
LAN distribution layer

4) Enable virtual mode operation (on both switches)

Switch 1:
VSS-Sw1# switch convert mode virtual
Do you want to proceed? (yes/no) yes

Switch 2:
VSS-Sw2# switch convert mode virtual
Do you want to proceed? (yes/no) yes

• The switch now renumbers from y/z to x/y/z
• When process is complete, save configuration when prompted, switch reloads and forms VSS.

5) Verify operation and rename switch

VSS-Sw1# show switch virtual redundancy
• Check for both switches visible, Supervisors in SSO mode, second Supervisor in Standby-hot status
VSS-Sw1(config)# hostname VSS
VSS(config)#

6) Configure dual-active detection

• Connect a Gigabit Link between the VSS switches
VSS(config)# switch virtual domain 100
VSS(config-vs-domain)# dual-active detection fast-hello
VSS(config)# interface range gigabit1/1/24, gigabit2/1/24
VSS(config-if-range)# dual-active fast-hello
VSS(config-if-range)# no shut

*Feb 25 [Link].294: %VSDA-SW2_SPSTBY-5-LINK_UP: Interface Gi2/1/24 is now dual-active detection capable
*Feb 25 [Link].323: %VSDA-SW1_SP-5-LINK_UP: Interface Gi1/1/24 is now dual-active detection capable

7) Configure the system virtual MAC address

VSS(config)# switch virtual domain 100
VSS(config-vs-domain)# mac-address use-virtual

Configured Router mac address is different from operational value. Change will take effect
after config is saved and the entire Virtual Switching System (Active and Standby) is reloaded.

BRKCRS-3035: Advanced Enterprise Campus Design: Virtual Switching System (VSS)


“Is there an easier way to enable VSS?”
Use Easy VSS to configure from a single console port

Prerequisites:
• Switches running same software with feature support (C4K:3.6E, C6K:15.2(1)SY1)
• Links to be used for VSLs up with CDP communication
1) C6K - Enable Easy VSS feature, convert, and reload VSS

[Figure: VSS-Sw1 and VSS-Sw2 connected by the VSL]

VSS-Sw1# switch virtual easy
VSS-Sw1# switch convert mode easy links ?

Local Interface    Remote Interface    Hostname
TenGigabit3/4      TenGigabit3/4       VSS-Sw2
TenGigabit4/4      TenGigabit4/4       VSS-Sw2

VSS-Sw1# switch convert mode easy links T3/4 T4/4 domain 100
VSS-Sw1(config)# switch virtual domain 100
VSS-Sw1(config-vs-domain)# mac-address use-virtual
VSS-Sw1# copy running-config startup-config
VSS-Sw1# reload

2) Verify operation and rename switch

VSS-Sw1# show switch virtual redundancy
• Check that both switches are visible, Supervisors are in SSO mode, and the second Supervisor is in Standby-hot status
VSS-Sw1(config)# hostname VSS
VSS(config)#

3) Configure dual-active detection

• Connect a Gigabit link between the VSS switches
VSS(config)# switch virtual domain 100
VSS(config-vs-domain)# dual-active detection fast-hello
VSS(config)# interface range gigabit1/1/24, gigabit2/1/24
VSS(config-if-range)# dual-active fast-hello
VSS(config-if-range)# no shut

VSS dual supervisor inter-chassis redundancy
• VSS dual supervisor (single sup per chassis) supports inter-chassis SSO redundancy.
  • Single in-chassis supervisor: SSO Active or Standby role.
  • Stateful SSO synchronization and redundancy between virtual switches
• Single-supervisor system design:
  • Supervisor switchover requires chassis reset, including all linecard and service modules
  • Network capacity is reduced until the system returns to operational state
• Consistent redundancy design between modular Catalyst 6500E/6800/4500E and fixed Catalyst 4500X/3850 systems

[Figure: Active and Standby chassis joined by the VSL; labels: NSF Recovery, Reduced Capacity]

Catalyst quad-supervisor NSF/SSO redundancy
• Dual in-chassis supervisors, each in a different redundancy mode
  • In-chassis Active Supervisor (ICA): SSO Active OR Standby-Hot (switchover target)
  • In-chassis Standby Supervisor (ICS): Standby-Hot (Chassis)
• VSS Quad-Sup protects network availability and capacity with dual redundancy domains: between chassis and within chassis
• Stateful SSO synchronization between multiple redundancy domains
• Complete system configuration and parameters synchronization
• Catalyst 6x00 with Sup6T or Sup2T pairs, Catalyst 4500E with Sup8E, 7E, and 7L-E

[Figure: SW1 and SW2 joined by the VSL. Inter-chassis sup redundancy between the chassis, intra-chassis sup redundancy within each. SW1: ICA - SSO Active, ICS - STANDBY-HOT (Chassis). SW2: ICA - SSO Standby, ICS - STANDBY-HOT (Chassis)]

6500-VS4O#show switch virtual redundancy | inc Switch|Mode|Current|Fabric
My Switch Id = 1
Peer Switch Id = 2
Configured Redundancy Mode = sso
Operating Redundancy Mode = sso
Switch 1 Slot 6 Processor Information :
Current Software state = ACTIVE
Fabric State = ACTIVE
Switch 1 Slot 5 Processor Information :
Current Software state = STANDBY HOT (CHASSIS)
Fabric State = ACTIVE
Switch 2 Slot 6 Processor Information :
Current Software state = STANDBY HOT (switchover target)
Fabric State = ACTIVE
Switch 2 Slot 5 Processor Information :
Current Software state = STANDBY HOT (CHASSIS)
Fabric State = ACTIVE
Understanding Virtual Switch Link
• Inter-chassis system link
  • No network protocol operations
  • Invisible in network topology
  • Transparent to network-level troubleshooting
• VSL control link
  • Carries all system internal control traffic
  • Single member-link; dynamic election during boot
  • Shared interface for network/data traffic
  • < 50 msec switchover to pre-determined VSL path
• Payload overhead
  • Every single packet is encapsulated with a Virtual Switch Header (VSH)
  • Non-bridgeable and non-routable.
  • VSL must be directly connected between two virtual switch systems

[Figure: VSL with a control link at each end; frame format: VSH | L2 | L3 | Payload | CRC]

4500E-VSS#show switch virtual link

Executing the command on VSS member switch role = VSS Active, id = 1
VSL Status : UP
VSL Uptime : 1 day, 1 hour, 16 minutes
VSL Control Link : Te1/3/1

Executing the command on VSS member switch role = VSS Standby, id = 2
VSL Status : UP
VSL Uptime : 1 day, 1 hour, 17 minutes
VSL Control Link : Te2/3/1

6500E/6800/4500E VSS dual sup – VSL design
Two Cisco recommended designs
Profile 1: Two VSL links on Supervisor
[Figure: Sup to Sup VSL]
• Cost-effective solution that leverages both uplinks. Continue to use non-VSL capable linecards for 10G core connections.
• Redundant fibers connect through common fabric and ASICs, which could leave system stability vulnerable.
• Optimal and preset VSL parameters: load balancing, QoS, HA, traffic engineering, dual-active, etc.
• Restricted to a bundle of 2 x VSL ports, or 20G switching capacity, per virtual-switch node.

Profile 2: Diversified VSL between Supervisor and VSL-capable Linecard
[Figure: Sup to Sup VSL plus linecard VSL]
• Redundant and diversified fibers between supervisor and next-gen VSL-capable linecards.
• Same design as Profile 1, but increases system reliability because the VSL ports are diversified across different fabric/ASICs.
• Optimal and preset VSL parameters: load balancing, QoS, HA, traffic engineering, dual-active, etc.
• Flexible to scale up to 8 x VSL for high-density systems to aggregate uplinks, service modules, single-homed devices, etc.

6500E/6800 VSS quad-supervisor VSL design
RPR-WARM
Sup2T/6T quad-supervisor NSF/SSO VSL redundancy

[Figure: SW1 and SW2 chassis with Sup-1 through Sup-4, connected by the VSL]

• Same design as Profile 1 (dual sup)
• Flexible to increase VSL capacity
• Continue to leverage existing non-VSL 10G linecards for uplink connections
• Retains all original VSL benefits
• Vulnerable design during any supervisor self-recovery fault incident
6500E/6800 VSS quad-supervisor VSL design
SSO advantage
Sup2T/6T quad-supervisor NSF/SSO VSL redundancy

[Figure: two quad-supervisor VSL layouts between SW1 and SW2 (Sup-1 through Sup-4): the Profile 1 style VSL and the recommended full-mesh VSL]

Profile 1 style VSL:
• Same design as Profile 1 (dual sup)
• Flexible to increase VSL capacity
• Continue to leverage existing non-VSL 10G linecards for uplink connections
• Retains all original VSL benefits
• Vulnerable design during any supervisor self-recovery fault incident

Recommended: Full-Mesh VSL on Quad-Sup
• Highly redundant and cost-effective VSL design.
• Increases overall VSL capacity
• Maintains 20G VSL capacity during supervisor failure.
• Increases network reliability by minimizing the dual-active probability
4500X VSS – VSL network design
• Fixed switch hardware architecture:
  • 24 or 48 10G/1G front panel ports
  • 8 port 1G/10G Pluggable Uplink Module
• Any ports can be bundled into the VSL EtherChannel.
• Recommended to use front-panel ports to build VSL connections. This minimizes system instability during accidental uplink module OIR/reset.
• Split VSL member-link interfaces across different internal ASIC groups:

ASIC Group               4500X 16-Port            4500X 32-Port
                         ASIC to Port Mapping     ASIC to Port Mapping
Internal Stub ASIC 1     1-8                      1-8
Internal Stub ASIC 2     9-16                     9-16
Internal Stub ASIC 3     N/A                      17-24
Internal Stub ASIC 4     N/A                      25-32

[Figure: 4500-X SW-1 and SW-2 with VSL member links on front-panel ports, Ten1/1/1 to Ten2/1/1 and Ten1/1/9 to Ten2/1/9]

• Consistent software design and VSL function as 4500E
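The ASIC-to-port mapping above can be checked mechanically before choosing VSL member ports. The following is a minimal sketch, assuming eight consecutive front-panel ports per internal stub ASIC as the table shows; the function names are hypothetical and are not part of any Cisco tool.

```python
from typing import Iterable


def asic_group(port: int, ports_per_asic: int = 8) -> int:
    """Return the internal stub ASIC group (1-based) for a front-panel port.

    Assumes the slide's mapping: ports 1-8 -> ASIC 1, 9-16 -> ASIC 2, and so on.
    """
    if port < 1:
        raise ValueError("front-panel ports are numbered from 1")
    return (port - 1) // ports_per_asic + 1


def vsl_links_are_diversified(ports: Iterable[int]) -> bool:
    """True when every VSL member port sits on a different ASIC group."""
    groups = [asic_group(p) for p in ports]
    return len(set(groups)) == len(groups)


# Ports 1 and 9 (as in Ten1/1/1 and Ten1/1/9) land on different ASIC groups.
print(asic_group(1), asic_group(9))
print(vsl_links_are_diversified([1, 9]))
print(vsl_links_are_diversified([1, 2]))
```

Ports 1 and 9 match the figure's Ten1/1/1 and Ten1/1/9 pairing, which places the two VSL member links on separate ASIC groups.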

Cisco Catalyst platforms and transitions
Where is VSS?

[Figure: Catalyst platform transitions along an axis from access switching to backbone switching. Existing platforms: Cisco Catalyst 2960X/XR Series, 3850 copper, 4500E Series, 3850F/4500-X, 6840-X/6880-X. Newer platforms: Cisco Catalyst 9200 Series, 9300 Series, 9400 Series, 9500 Series]

“How can I simplify my distribution without VSS?”
StackWise Virtual

• Fixed switch hardware architecture with distributed forwarding architecture


• First available on WS-3850-48XS
• Available on Catalyst 3850-24XS, 3850-12XS, 9500-16X, 9500-40X, 9500-12Q, 9500-24Q,
9500-48Y4C, 9500-24Y4C, 9500-32QC, 9500-32C, 9404R/9407R Sup1/Sup1-XL
(check software release notes for versions and additional hardware)
• StackWise Virtual Link between two nodes (10Gb or 40Gb)

• Both StackWise Virtual members must have consistent Cisco IOS-XE and license
[Figure: StackWise Virtual pair of WS-3850-48XS switches at the distribution layer, joined by the SVL and a Fast Hello link, with the access layer below]

Cisco StackWise Virtual (SWV) setup
LAN distribution layer

[Figure: 3850-D1 and 3850-D2 joined by the SVL]

1) Prepare standalone switches for SWV

3850-D1#conf t
3850-D1(config)# stackwise-virtual
3850-D1(config-stackwise-vir)# domain <1-255>

3850-D2#conf t
3850-D2(config)# stackwise-virtual
3850-D2(config-stackwise-vir)# domain <1-255>

2) Configure StackWise Virtual links
*Automatically creates EtherChannel (128)

3850-D1(config)# interface range FortyG x/y/z - x/y/z
3850-D1(config-if)# stackwise-virtual link 1

3850-D2(config)# interface range FortyG x/y/z - x/y/z
3850-D2(config-if)# stackwise-virtual link 1

3) Configure dual-active detection (fast hello)

3850-D1(config)# interface range TenG x/y/z - x/y/z
3850-D1(config-if)# stackwise-virtual dual-active-detection

3850-D2(config)# interface range TenG x/y/z - x/y/z
3850-D2(config-if)# stackwise-virtual dual-active-detection

4) Save and reload to convert

3850-D1# copy run start
3850-D1# reload

3850-D2# copy run start
3850-D2# reload

Note: Maximum of 8 SVL member links and 4 dual-active detection links
Virtual Switch Link capacity planning
• Plan VSL capacity to reduce congestion points and to handle failures and specific configurations
• Supported VSL interface types:
  • Catalyst 6500E/6800: 10G and 40G
  • Catalyst 4500E/4500X: 1G and 10G
  • Catalyst 3850: 1G, 10G, and 40G
• Four major factors:
  • Total uplink bandwidth per chassis: ability to handle data re-route during uplink failures without network congestion
  • Handling egress data to single-homed devices (non-recommended design)
  • Catalyst 6500E/6800 services module integration may require centralized forwarding on the remote chassis
  • Remote network services such as SPAN
• Up to 8 member-links supported in the VSL EtherChannel (implement in powers of 2 for optimal forwarding decisions)

[Figure: VSL between the two chassis, with an analyzer attached]
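As a rough illustration of the first planning factor and the power-of-2 guidance, the sketch below sizes a VSL bundle. It assumes, purely for illustration, that the VSL should be able to carry the full uplink bandwidth of one chassis when those uplinks fail; the function name and that sizing rule are assumptions, not a Cisco formula.

```python
import math

MAX_VSL_MEMBER_LINKS = 8  # upper limit stated on the slide


def vsl_member_links(uplink_gbps_per_chassis: float, link_gbps: float) -> int:
    """Smallest power-of-2 member-link count whose capacity covers the uplinks.

    Illustrative assumption: the VSL must carry all uplink traffic of one
    chassis during an uplink failure without congestion.
    """
    if uplink_gbps_per_chassis <= 0 or link_gbps <= 0:
        raise ValueError("bandwidth values must be positive")
    needed = math.ceil(uplink_gbps_per_chassis / link_gbps)
    links = 1
    while links < needed:
        links *= 2  # stay on powers of 2: 1, 2, 4, 8
    if links > MAX_VSL_MEMBER_LINKS:
        raise ValueError("requirement exceeds 8 VSL member links; use faster links")
    return links


# 30G of uplinks over 10G VSL members needs 3 links, rounded up to 4.
print(vsl_member_links(30, 10))
```

Other factors from the list (single-homed devices, service modules, SPAN) would add to the required bandwidth and are not modeled here.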
VSS – single-homed connections
• Regardless of system mode (VSS or standalone), single-homed connections are not recommended
  • Cannot leverage any distributed VSS architecture benefits.
• Non-congruent Layer 2 or Layer 3 network design with:
  • Centralized network control-plane processing over the VSL
  • Asymmetric forwarding plane. Ingress data may traverse the VSL interface and oversubscribe the ports
• Single point of failure in various faults: link/SFP/module failure, SSO switchover, ISSU, etc.
• A single-homed switch cannot be a trusted switch for dual-active detection purposes

[Figure: SW-1 (ACTIVE) and SW-2 (HOT-STANDBY) joined by the VSL, with access switches A1 and A2 single-homed]

VSS – multi-homed physical connections
• Redundant network paths per system deliver the best architectural approach
• Parallel Layer 2 paths between bridges build a sub-optimal topology:
  • Creates an STP loop. Except for the root port, all other ports are in blocking mode
  • Slow network convergence
• Parallel Layer 3 paths double the control-plane processing load:
  • ACTIVE switch needs to handle the control-plane load of local and remote-chassis interfaces
  • Multiple unicast and multicast neighbor adjacencies
  • Redundant routing and forwarding topologies

[Figure: SW-1 (ACTIVE) and SW-2 (HOT-STANDBY) joined by the VSL, with access switches A1 and A2 multi-homed over parallel links, forming an STP loop]

VSS – Multichassis EtherChannel

• MEC enables:
  • Simplified STP loop-free network topology
  • L3 control-plane and network design consistent with a traditional standalone-mode system
  • Deterministic sub-second network recovery
• MECs can be deployed in two modes: Layer 2 or Layer 3
• MEC scalability support varies by system:
  • Catalyst 6500E supports 512 L2/L3 MEC
  • Catalyst 4500E and 4500X support 256 L2 MEC
  • Catalyst 3850-48XS supports 127 L2/L3 MEC

[Figure: SW-1 (ACTIVE) and SW-2 (HOT-STANDBY) joined by the VSL, with access switches A1 and A2 connected by MEC]

Simplified STP network topology with VSS
• VSS simplifies STP. VSS does not eliminate STP. Never disable STP.
• Multiple parallel Layer 2 network paths build an STP loop
• VSS with MEC builds a single loop-free network that utilizes all available links.
• Distributed EtherChannel minimizes STP complexity compared to a standalone distribution design
• The STP toolkit should be deployed to safeguard the multilayer network

[Figure: standalone design with an STP BLK port compared with a loop-free L2 EtherChannel]

Traditional distribution design
Redundant design with sub-optimal topology and complex operation
Stabilize network topology with several L2 features:
• STP Primary and Backup Root Bridge
• Rootguard
• Loopguard or Bridge Assurance
• STP Edge Protection
Protocol restricted forwarding topology
• STP FWD/ALT/BLK Port
• Single Active FHRP Gateway
• Asymmetric forwarding
• Unicast Flood
Protocol dependent driven network recovery:
• PVST/RPVST+ and FHRP Tuning

Resiliency versus performance/scale tradeoff: HSRP
• Multichassis EtherChannel based forwarding topologies
• Per-Flow Load Balancing based on Layer 2 to Layer 4 + VLANs

[Figure: distribution pair with FHRP Active and FHRP Standby]

interface Vlan2
 ip address [Link] [Link]
 standby 1 ip [Link]
 standby 1 timers msec 250 msec 750
 standby 1 priority 150
 standby 1 preempt
 standby 1 preempt delay minimum 180

[Chart: "SVI - Aggressive Time" and "Convergence (msec)" for 6500-Sup2T and 4500-Sup7E; scale 0 to 1000]

Resiliency versus performance/scale tradeoff: VSS
• Multichassis EtherChannel based forwarding topologies
• Per-Flow Load Balancing based on Layer 2 to Layer 4 + VLANs
• Hardware-Based Fault Detection and Recovery
• Deterministic network convergence with a simple approach
• Increases Network Scale for system reliability
• No reliability compromise to enable path and system-level Quad-Sup redundancy

[Figure: VSS-SW1, Multilayer VSS]

[Chart: Network Scale And Convergence. Two panels, "SVI - Aggressive Time" and "SVI (Validated Limit)", each shown with "Convergence (msec)" for 6500-Sup2T and 4500-Sup7E; scale 0 to 1000]

PIM timers also need tuning
• Multicast recovery depends on PIM DR failure detection in the Layer 2 network
• PIM routers exchange the PIM expiration time in query messages
• DR failure detection: ~90 seconds (30 sec. hello * 3 multiplier)
• Tune the PIM query interval to sub-second, as with FHRP, for faster multicast convergence
• Sub-second protocol timers must be avoided on SSO-capable networks

[Figure: distribution pair with one switch as PIM DR]

interface Vlan2
 ip pim sparse-mode
 ip pim query-interval 250 msec
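The detection-time arithmetic on this slide can be written out as a small sketch. It follows the slide's approximation (hello interval times a multiplier of 3); the function name is hypothetical, and real implementations may derive the hold time slightly differently.

```python
def pim_dr_failure_detection_s(hello_interval_s: float, multiplier: int = 3) -> float:
    """Approximate PIM DR failure detection time, per the slide's formula."""
    if hello_interval_s <= 0 or multiplier < 1:
        raise ValueError("hello interval and multiplier must be positive")
    return hello_interval_s * multiplier


# Default 30-second hello versus the tuned 250 msec query interval.
print(pim_dr_failure_detection_s(30))
print(pim_dr_failure_detection_s(0.25))
```

With the 250 msec interval from the configuration above, the same formula gives a sub-second detection time, which is why the slide pairs this tuning with a warning for SSO-capable networks.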

Simplified and robust multicast network design
using VSS
• Single PIM DR system in the Layer 2 network to process IGMP from host receivers
• Doubles multicast forwarding performance across all Multichassis EtherChannel member links
• Optimize the multicast network with PIM stub configuration
• Rapid, deterministic and simple multicast design
  • Hardware-based sub-second fault detection and recovery.
  • Eliminates the aggressive timer requirement and improves system performance and scalability

[Figure: VSS-SW1 as the single PIM-DR]

interface Vlan2
 ip pim passive

Multichassis EtherChannel load sharing
• The MEC hash algorithm is computed independently by each virtual switch to load-share via its local physical ports.
• The 8-bit computation across the member links of an MEC is done independently on a per virtual-switch node basis.
• The recommended total number of member links bundled in a single MEC remains consistent with the single-chassis EtherChannel section.
• Recommendation: deploy EtherChannel member links in a 2^n ratio, evenly distributed to each virtual switch, for the best load-sharing result.

[Figure: SW-1 and SW-2]

Per Switch MEC Flow Distribution Matrix (bits per port; X = port not in bundle)

Member   Port1  Port2  Port3  Port4  Port5  Port6  Port7  Port8
Links    Bit    Bit    Bit    Bit    Bit    Bit    Bit    Bit
1        8      X      X      X      X      X      X      X
2        4      4      X      X      X      X      X      X
3        3      3      2      X      X      X      X      X
4        2      2      2      2      X      X      X      X
5        2      2      2      1      1      X      X      X
6        2      2      1      1      1      1      X      X
7        2      1      1      1      1      1      1      X
8        1      1      1      1      1      1      1      1
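The flow distribution matrix follows a simple pattern: eight hash buckets are spread as evenly as possible over the member links, with the leftover buckets going to the lowest-numbered ports. The sketch below reproduces the table rows under that reading; the function name is hypothetical.

```python
from typing import List

HASH_BUCKETS = 8  # the matrix distributes 8 bits across the member links


def buckets_per_member_link(member_links: int) -> List[int]:
    """Return how many of the 8 hash buckets each member link receives."""
    if not 1 <= member_links <= HASH_BUCKETS:
        raise ValueError("an EtherChannel bundle has 1 to 8 member links")
    base, extra = divmod(HASH_BUCKETS, member_links)
    # The first `extra` links carry one additional bucket.
    return [base + 1 if i < extra else base for i in range(member_links)]


for n in range(1, HASH_BUCKETS + 1):
    print(n, buckets_per_member_link(n))
```

Only bundles of 1, 2, 4, or 8 links give every port the same share, which is the reason behind the power-of-2 recommendation.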

Optimize EtherChannel load balancing
• Load-share egress data traffic based on the input hash
• Optimal load sharing results with:
  • Bucket-based load sharing: bundle member links in powers of 2 (2/4/8)
  • Multiple variations of input for the hash (L2 to L4)
• Recommended algorithm *:
  • Access: Src/Dst IP
  • 6500E/6800 Dist/Core: Src/Dst IP + Src/Dst L4 Ports
  • 4500E / 4500X Dist: Src/Dst IP
* May vary based on your network traffic pattern

Layer     Default            Recommended
Core      src-dst-ip vlan    src-dst-mixed-ip-port
Dist      src-dst-ip vlan    src-dst-mixed-ip-port vlan
Access    src-mac            src-dst-ip
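To make the per-flow idea concrete, here is a minimal sketch of hashing on source/destination IP plus L4 ports into eight buckets and mapping the bucket to a member link. CRC32 stands in for the hash purely for illustration; the actual hardware hash on Catalyst platforms is not described in this deck, and the function names are hypothetical.

```python
import zlib

HASH_BUCKETS = 8


def flow_bucket(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Map a flow's L3/L4 fields to one of 8 buckets (illustrative CRC32 hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % HASH_BUCKETS


def member_link_for_flow(src_ip: str, dst_ip: str, src_port: int,
                         dst_port: int, member_links: int) -> int:
    """Pick the member link (0-based) for a flow; same flow, same link."""
    if not 1 <= member_links <= HASH_BUCKETS:
        raise ValueError("an EtherChannel bundle has 1 to 8 member links")
    return flow_bucket(src_ip, dst_ip, src_port, dst_port) % member_links


# Two flows between the same hosts can take different links once L4 ports
# are part of the hash input; each individual flow stays on one link.
print(member_link_for_flow("10.0.0.1", "10.0.1.1", 49152, 443, 4))
print(member_link_for_flow("10.0.0.1", "10.0.1.1", 49153, 443, 4))
```

Including L4 ports in the input gives the hash more variation between flows that share the same host pair, which is the motivation for the mixed IP-port recommendation at distribution and core.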

Summary: Multichassis EtherChannel performs better in any network design
• The network recovery mechanism varies across distribution designs:
  • Standalone: protocol and timer dependent
  • VSS: hardware dependent
• VSS logical distribution system:
  • Single P2P STP topology
  • Single Layer 3 gateway
  • Single PIM DR system
• Distributed and synchronized forwarding tables: MAC address, ARP cache, IGMP
• All links are fully utilized based on EtherChannel load balancing

[Chart: Convergence (sec), 0 to 1, for L2-FHRP versus L2-MEC; series: Upstream, Downstream, Multicast]

VSS-enabled campus core design
• Extend VSS architectural benefits to campus
core layer network
• VSS enabled core increases capacity,
optimizes network topologies and simplifies
system operations
• Key VSS-enabled core best practices:
  • Protect network availability and capacity with Catalyst 6800 Sup6T Quad-Sup NSF/SSO
  • Simplify network topology and routing database with a single MEC
  • Leverage self-engineered VSS and MEC capabilities for deterministic network fault detection and recovery

[Figure: VSS-enabled campus core connected to the Data Center]

VSS core network design alternatives

[Figure: four alternative VSS core topologies, each with SW1 and SW2 joined by a VSL]

Catalyst 6500/6800 VSS-enabled campus
design
ECMP forwarding table construction
• ACTIVE switch is responsible for:
  • Constructing two software tables: Routing Information Base (RIB) and Forwarding Information Base (FIB)
  • Synchronizing software FIB tables to local and remote chassis supervisor and network modules
• ECMP forwarding also favors locally attached interfaces
• Hardware FIB inserts entries for ECMP routes using locally attached links
• If all local links fail, the FIB is programmed to forward across the VSL link as a last resort

[Figure: SW1 (ACTIVE) and SW2 (HOT_STANDBY) with uplinks T1/2/1, T1/2/1, T2/2/1, T2/2/2 and port channels Po1 and Po2; legend: Unicast Forwarding Path, Multicast Forwarding Path]

Table                                       Entries
Unicast ECMP Software RIB (System-Wide)     Four ECMP RIB entries
Unicast ECMP Software FIB (System-Wide)     Four ECMP FIB entries
Unicast ECMP Switch-1 Hardware FIB          Two SW1 HW FIB entries
Unicast ECMP Switch-2 Hardware FIB          Two SW2 HW FIB entries
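The local-preference rule for the hardware FIB can be sketched in a few lines. This is a simplified model, not the platform's actual programming logic: each ECMP path is an (interface, switch) pair, and the interface names below are illustrative placeholders rather than the ones in the figure.

```python
from typing import List, Tuple

VSL = "VSL"  # last-resort path toward the peer chassis


def program_hw_fib(ecmp_paths: List[Tuple[str, int]], local_switch: int) -> List[str]:
    """Return the egress interfaces a chassis installs for an ECMP route.

    Locally attached links are preferred; the VSL is used only when no
    local ECMP link remains.
    """
    local = [iface for iface, switch_id in ecmp_paths if switch_id == local_switch]
    return local if local else [VSL]


# System-wide software FIB: four ECMP paths, two per chassis (names are made up).
system_wide = [("uplink-1a", 1), ("uplink-1b", 1), ("uplink-2a", 2), ("uplink-2b", 2)]

sw1_entries = program_hw_fib(system_wide, local_switch=1)
sw2_entries = program_hw_fib(system_wide, local_switch=2)

# Both SW1 uplinks fail: only SW2's paths remain in the system-wide table.
remaining = [path for path in system_wide if path[1] != 1]
sw1_after_failure = program_hw_fib(remaining, local_switch=1)

print(sw1_entries, sw2_entries, sw1_after_failure)
```

This mirrors the table above: four system-wide entries, two hardware entries per chassis, and the VSL only after every local link is gone.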
Summary – optimizing core performance (1/2)
HW Driven Forwarding Topology & High Availability

[Figure: VSS-Core with VSS-Dist compared with Standalone-Core with Standalone-Dist; legend: Unicast Forwarding Path, Multicast Forwarding Path]



Summary – optimizing core performance (2/2)
HW Driven Forwarding Topology & High Availability

[Figure: Standalone-Core with VSS-Dist compared with Standalone-Core with Standalone-Dist; legend: Unicast Forwarding Path, Multicast Forwarding Path]




Simple core network design delivers
deterministic network recovery
• Routing-protocol-independent network convergence in a large-scale campus core
• ECMP prefix-independent convergence (PIC) with 6x00 (VSS/standalone) from 12.2(33)SXI2
  • Cisco Express Forwarding (CEF) optimization in IOS software.
  • Default behavior: no additional configuration or tuning required
• Hardware-based fault detection and recovery in MEC/EC designs

[Figure: SW1 (ACTIVE) and SW2 (HOT_STANDBY) with uplinks T1/2/1, T1/2/1, T2/2/1, T2/2/2 and port channels Po1 and Po2]

[Chart: Convergence (sec), 0 to 3.5, at scales of 500, 1000, 5000, 10000, 15000, 20000, and 25000; series: ECMP (W/o PIC), ECMP (With PIC), MEC]

VSS core simplifies multicast operation, improves
performance and redundancy (1/2)
• Standalone core needs anycast MSDP peering for RP redundancy
• ECMP builds a single multicast forwarding path and relies on protocol-based fault detection and recovery

[Figure: two core PIM RPs with Anycast-MSDP peering and two distribution PIM routers; PIM Join and a single OIL]

VSS core simplifies multicast operation, improves
performance and redundancy (2/2)
• VSS-based Catalyst systems enable PIM RP redundancy with resilient technologies
• MEC increases multicast forwarding capacity by utilizing all member links and provides hardware-based fault detection and recovery

[Figure: single logical PIM RP in the core and single logical PIM router at distribution, with a single logical PIM interface, a single logical OIL, a PIM Join, and multiple multicast forwarding paths]

Simplified multicast network design delivers
deterministic network recovery
• ECMP multicast recovery depends on mroute scale and can take seconds.
• MEC/EC multicast recovery is hardware-based, scale-independent, and sub-second.

[Chart: Convergence (sec), 0 to 6, at scales of 100, 500, 1000, and 5000; series: ECMP, MEC/EC]

Implementing non-stop forwarding

• VSS software design is built on NSF/SSO architecture.
• Catalyst 4500E, 4500X and 6500E/6800 deployed in VSS mode must enable NSF. No configuration is required on an NSF helper system
• NSF capability must be manually enabled for all Layer 3 routing protocols:
  • EIGRP, OSPF, ISIS, BGP, MPLS, etc.
• In a VRF environment, NSF must be manually enabled on each per-VRF IGP instance
• Multicast NSF capability is ON by default

[Chart: Inter-Chassis NSF/SSO Recovery Analysis; Convergence (sec), 0 to 16; Without NSF versus With NSF]
Sub-second protocol timers and NSF/SSO

• NSF is intended to provide availability through route convergence avoidance
• Fast IGP timers are intended to provide availability through fast route convergence
• In an NSF environment, the dead timer must be greater than:
  • SSO recovery + routing protocol restart + time to send first hello
• Recommendation:
  • Do not configure Layer 2 protocols with aggressive timers, e.g. Fast UDLD
  • Do not configure Layer 3 protocols with aggressive timers, e.g. OSPF Fast Hello, BFD, etc. Keep all protocol timers at default settings

OSPF fast-hello configuration used for the aggressive-timer test:

interface Port-Channel 10
 ip ospf dead-interval minimal hello-multiplier 4

[Figure: Core, VSS distribution pair (VSL), and Access layer of Catalyst 2K/3K/4K switches]

[Charts: "Link and Switch Failure Analysis - Default OSPF Timer" and "Link Failure Analysis - Aggressive OSPF Timer"; each 0 to 0.3, Upstream and Downstream]
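The dead-timer rule on this slide is a simple inequality, sketched below. The recovery times in the example are made-up placeholders chosen only to show the comparison; measure the real values on your own platform. The function name is hypothetical.

```python
def dead_timer_is_nsf_safe(dead_interval_s: float,
                           sso_recovery_s: float,
                           protocol_restart_s: float,
                           first_hello_s: float) -> bool:
    """True when the neighbor's dead timer outlasts an SSO switchover.

    Implements the slide's rule: dead timer > SSO recovery
    + routing protocol restart + time to send first hello.
    """
    return dead_interval_s > sso_recovery_s + protocol_restart_s + first_hello_s


# Hypothetical recovery times (seconds), for illustration only.
SSO, RESTART, FIRST_HELLO = 2.0, 3.0, 1.0

# A 40-second dead interval leaves ample margin.
print(dead_timer_is_nsf_safe(40.0, SSO, RESTART, FIRST_HELLO))
# A 1-second dead interval (OSPF fast hello) expires before the restart completes.
print(dead_timer_is_nsf_safe(1.0, SSO, RESTART, FIRST_HELLO))
```

When the check fails, the neighbor declares the adjacency down during the switchover and the route convergence that NSF was meant to avoid happens anyway.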

Campus wired LAN design
Option 3: Layer 2 access with “simplified” distribution (BRKCRS-1500)

• Leading campus design for easy configuration and operation when using stacking or similar technology (VSS, StackWise Virtual)
• Flexibility to support Layer 2 services within distribution blocks, without FHRPs.
• Easy to scale and manage

Logical topology: L3: core/dist.; L2: dist./acc.
Physical topology: 2 core; 2 dist./acc.

Evaluation criteria listed on the slide (per-criterion ratings not recoverable from the extraction):
• Survives device and link failures
• Easy mitigation of Layer 2 looping concerns
• Rapid detection/recovery from failures
• Layer 2 across all access blocks within distribution
• Device-level CLI configuration simplicity
• Automated network and policy provisioning included
VSS best practices summary (1/2)
• Design each VSS domain with unique ID

• Configure “mac-address use-virtual” under virtual switch configuration mode

• Select appropriate VSS capable system that fits in network and solution requirements

• Deploy 6500/6800 Quad-sup NSF/SSO for mission-critical networks to protect network availability and capacity
• Do not compromise network foundation baselines. Deploy full-mesh physical connections for
redundancy and load sharing across the network
• MEC enables network benefits with VSS. Bundle all physical connections into single logical
connection for simplified and resilient network topologies
• Layer 3 MEC is highly recommended for 4500E/X VSS enabled Campus network

• Always use link bundling protocols – Cisco PAgP or IETF LACP

• Configure “no ip routing protocol purge-interface” to optimize ECMP based network convergence time
VSS best practices summary (2/2)

• Plan and design VSL with appropriate capacity, diversification and redundancy

• Configure “nsf” under L3 routing protocols

• Keep Layer 2 and Layer 3 protocol timers at factory default. Do not enable protocols with
aggressive timers
• Configure redundant dual active trusted ePAgP neighbors (L2/L3)

• Configure redundant dual-active detection mechanisms: ePAgP and Fast Hello

• Exclude dual active management interface for connectivity and troubleshooting

• Remember that the “reload” command on 6500/6800 resets both virtual-switch chassis, whereas on 4500E/X it resets the ACTIVE switch. Issue “redundancy reload shelf” on 4500E/X to reload both the ACTIVE and STANDBY systems

Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• High Availability Architectures:
• Enterprise Wired LAN
• Multilayer Campus Distribution and HA Considerations
• Simplified Distribution and HA Advantages
• Extending HA Advantages by Simplifying Virtualization
• Enterprise Data Center
• Enterprise Wireless LAN

• High Availability System Recovery Analysis

Hop-by-hop network virtualization
Multi-VRF architecture overview
• Two preset network setup:
• Hop-by-hop network segmentation with logical connection
• Build control and data-plane over each logical connection

Hop-by-hop network virtualization
Data-plane isolation

Multi-VRF: Campus network design alternatives
Standalone devices

Multi-VRF: Campus network design alternatives
Cisco VSS

M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (1 of 4)

Standalone Design
10 VRF Sample Design

Each core : 40 Adj

M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (2 of 4)

Standalone Design VSS Design


10 VRF Sample Design 10 VRF Sample Design
Each core : 40 Adj VSS core : 0 Adj

M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (3 of 4)

Standalone Design VSS Design


10 VRF Sample Design 10 VRF Sample Design
Each core : 160 Adj VSS core : 240 Adj

M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (4 of 4)

Standalone Design VSS Design


10 VRF Sample Design 10 VRF Sample Design
Each core :480 Adj VSS core : 880 Adj
Edge : 80 Adj Edge : 160 Adj

• Standalone uses a distributed control plane. VSS uses a centralized control plane
• With ECMP, the centralized VSS control plane carries roughly 2X the adjacencies of each standalone system, depending on network design
Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (1 of 4)

VSS-ECMP Design
10 VRF Sample Design

VSS core : 240 Adj

Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (2 of 4)

VSS-ECMP Design VSS-MEC Design


10 VRF Sample Design 10 VRF Sample Design

VSS core : 240 Adj VSS core : 100 Adj

Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (3 of 4)

VSS-ECMP Design VSS-MEC Design


10 VRF Sample Design 10 VRF Sample Design

VSS core : 880 Adj

Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (4 of 4)

VSS-ECMP Design VSS-MEC Design


10 VRF Sample Design 10 VRF Sample Design

VSS core : 880 Adj VSS core : 260 Adj


Edge : 80 Adj Edge : 20 Adj

• Simplify virtualized network design with EC and MEC. This reduces control-plane adjacencies by up to 4X, depending on network design
• Hardware driven, scale-independent and deterministic network availability
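The Edge figures on this slide (80 adjacencies with ECMP, 20 with MEC) are consistent with a simple counting model. The sketch below assumes, as an illustration only, that every VRF forms one unicast and one multicast adjacency on each routed link, and that the edge has four routed links under ECMP that collapse into one logical link under MEC. The slides do not state this formula, and the core figures depend on topology details not shown here.

```python
def control_plane_adjacencies(vrfs: int, routed_links: int,
                              protocols_per_link: int = 2) -> int:
    """Count per-VRF adjacencies under the stated assumption.

    protocols_per_link=2 stands for one unicast plus one multicast
    adjacency per VRF on each routed link (an illustrative assumption).
    """
    if vrfs < 0 or routed_links < 0 or protocols_per_link < 0:
        raise ValueError("counts must be non-negative")
    return vrfs * routed_links * protocols_per_link


ecmp_edge = control_plane_adjacencies(vrfs=10, routed_links=4)  # four routed links
mec_edge = control_plane_adjacencies(vrfs=10, routed_links=1)   # one logical MEC
print(ecmp_edge, mec_edge, ecmp_edge // mec_edge)
```

Under these assumptions the model reproduces the 4X reduction quoted for the edge in the 10-VRF sample design.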
MPLS-based campus network architecture
Edge and core network design
LSR/LER LSR/LER

LSP
Core

IP/MPLS

LSP LSP

Distribution
LSP LSP LSP

LER LER LER LER LER LER

IP

Simplified underlay = simplified overlay (before)
P/PE P/PE

VPN PE Management
MP-iBGP PE Systems

MPLS Label Paths


MPLS LDP Adjacencies

P/PE P/PE P/PE P/PE


VPN Unicast Forwarding Paths P/PE P/PE

(with BGP Multipath)

Simplified underlay = simplified overlay (after)
[Diagram: one P/PE node and three PE nodes shown with MP-iBGP VPN PE peering, MPLS label paths, MPLS LDP adjacencies, VPN unicast forwarding paths (without BGP multipath), and PE management systems]

MPLS before VSS
• IGP Tuning
• OSPF LSA/SPF Tuning
• BGP Tunings
• MP-iBGP Multipath
• BGP Prefix-Independent Convergence
• MPLS LDP Tuning
• MPLS LDP Session Protection
• BFD
• MPLS TE Link Protection
• MPLS TE Node Protection
• Network/System Redundancy Tradeoff
• Protocol Dependent Recovery
• Control/Management/Forwarding Complexity
[Diagram: multiple P/PE nodes]

MPLS VSS benefits summary
Before VSS:
• IGP Tuning
• OSPF LSA/SPF Tuning
• BGP Tunings
• MP-iBGP Multipath
• BGP Prefix-Independent Convergence
• MPLS LDP Tuning
• MPLS LDP Session Protection
• BFD
• MPLS TE Link Protection
• MPLS TE Node Protection
• Network/System Redundancy Tradeoff
• Protocol Dependent Recovery
• Control/Management/Forwarding Complexity

With VSS:
• Scale-independent Recovery
• Network/System Level Redundancy
• Hardware Driven Recovery
• Increase VPN Unicast Capacity
• Increase VPN Multicast Capacity
• Simplified Virtual Network
• Control-plane Simplicity
• Operational Simplicity
• L2-L4 Load Sharing
[Diagram: one P/PE node and three PE nodes]

[Diagram: enterprise reference architecture. Headquarters (core, distribution and access switches, wireless LAN controllers, WAAS), data center (Nexus switches, UCS rack-mount and blade servers, storage, Cisco ACE, firewalls, communications managers, WAAS Central Manager), Internet edge (Internet routers, RA-VPN firewall, DMZ switch and servers, guest wireless LAN controller, web and email security appliances), WAN aggregation (WAN routers, MPLS WANs, Internet), regional site, remote sites, and teleworker/mobile worker with hardware and software VPN]

What’s different in your network today versus a
decade ago? How does it affect availability?

• Mobility: Bring Your Own Device; devices in the workspace
• IoT: auto-detect non-user devices; devices everywhere
• Cyber Security: networking and security; advanced threats

Key Challenges for Traditional Networks

Difficult to Segment
• Ever increasing number of users and endpoint types
• Ever increasing number of VLANs and IP Subnets

Complex to Manage
• Multiple steps, user credentials, complex interactions
• Multiple touch-points

Slower Issue Resolution
• Separate user policies for wired and wireless networks
• Unable to find users when troubleshooting

Traditional Networks Cannot Keep Up!

What if you could do this?
Cisco Software-Defined Access
• Enables:
  • Host mobility
  • Network segmentation
  • Role-based access control
• It is an overlay network on top of the network underlay
  • Control plane based on LISP
  • Data plane based on VXLAN
  • Policy plane based on TrustSec
[Diagram: border nodes and edge nodes; a logical Layer 2 overlay and a logical Layer 3 overlay on top of the physical topology]
Software-Defined Access Design Guide - CVD
[Link]

SD-Access
Why overlays?

Simple Transport Forwarding
Flexible Virtual Services

SD-Access
Types of overlays

Campus wired LAN design
Option 4: Cisco Software-Defined Access (BRKCRS-1501, many others)

Logical topology: L2/L3, flexible overlays
• Uses advantages of a routed access physical design, with Layer 2 capable logical overlay design
• Provisioning and policy automation
• Integrates wireless into the same policy
• Requires automation to simplify configuration
Physical topology: 2 core, 2 dist./acc.
Design attributes:
• Survives device and link failures
• Easy mitigation of Layer 2 looping concerns
• Rapid detection/recovery from failures
• Layer 2 across all access blocks within distribution
• Device-level CLI configuration simplicity
• Automated network and policy provisioning included
Cisco DNA Appliance—What about HA?
Outside of control plane, and has 2+1 clustering capabilities

DN1-HW-APL
• Specs: based on UCS M4, 44 cores, 256 GB RAM, 12 TB SSD
• Scale and performance: 5000 devices (1000 switches/routers/WLCs + 4000 APs), 25,000 clients
• SDA design: Small or Medium

DN2-HW-APL
• Specs: based on UCS M5, 44 cores, 256 GB RAM, 16 TB SSD
• Scale and performance: 5000 devices (1000 switches/routers/WLCs + 4000 APs), 25,000 clients
• SDA design: Small or Medium

DN2-HW-APL-L
• Specs: based on UCS M5, 56 cores, 384 GB RAM, 16 TB SSD
• Scale and performance: 8000 devices (2000 switches/routers/WLCs + 6000 APs), 40,000 clients
• SDA design: Medium or Large
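
The scale figures above lend themselves to a simple sizing lookup. The sketch below is illustrative only: the `Appliance` type and `smallest_fit` helper are hypothetical names, only the total device and client limits from the slide are modelled, and the split between switches/routers/WLCs and APs and the 2+1 clustering are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Appliance:
    sku: str
    max_devices: int
    max_clients: int


# Limits copied from the slide; everything else about the appliances is left out.
APPLIANCES = (
    Appliance("DN1-HW-APL", 5000, 25_000),
    Appliance("DN2-HW-APL", 5000, 25_000),
    Appliance("DN2-HW-APL-L", 8000, 40_000),
)


def smallest_fit(devices: int, clients: int,
                 catalog: Sequence[Appliance] = APPLIANCES) -> Optional[Appliance]:
    """Return the smallest appliance whose limits cover the requirement, or None."""
    for appliance in sorted(catalog, key=lambda a: (a.max_devices, a.max_clients)):
        if devices <= appliance.max_devices and clients <= appliance.max_clients:
            return appliance
    return None


if __name__ == "__main__":
    choice = smallest_fit(devices=6000, clients=20_000)
    print(choice.sku if choice else "no single appliance fits")
```

When two SKUs share the same limits, the one listed first in the catalog is returned, so the ordering of `APPLIANCES` is itself a design choice.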

Missed One?
Cisco Software-Defined Access sessions are available online
@[Link]
Cisco Live Barcelona - Session Map
Tuesday (Jan 29) to Friday (Feb 01), time slots 08:00-11:00, 11:00-13:00, 13:00-15:00, 15:00-18:00

• BRKCRS-2821: SD-Access Integration
• BRKCRS-2825: SD-Access Scale
• BRKCRS-2812: SD-Access Migration
• BRKCLD-2412: Cross-Domain Policy
• BRKCRS-3811: SD-Access Policy
• BRKCRS-2810: SD-Access Solution
• BRKCRS-1449: ISE & SD-Access
• BRKCRS-1501: Validated Design
• BRKCRS-3810: SD-Access Deep Dive
• BRKCRS-2815: Connect SD-Access Sites
• BRKCRS-2814: SD-Access Assurance
• BRKARC-2020: Troubleshoot SD-Access
• LTRACI-2636: ACI + SD-Access Lab
• LTRCRS-2810: SD-Access Lab
• BRKEWN-2021: SD-Access Demo
• BRKEWN-2020: SD-Access Wireless

SD-Access resources
Related Sessions (Reference)

Cisco SD-Access - 8H Technical Seminar - TECCRS-3810
• Monday, Jan 28 8:30 AM - 6:45 PM

Cisco SD-Access Fabric
Cisco SD-Access - A Look Under the Hood - BRKCRS-2810
• Tuesday, Jan 29 11:00 AM - 1:00 PM
Cisco SD-Access - Technology Deep Dive - BRKCRS-3810
• Tuesday, Jan 29 2:30 PM - 4:00 PM
Cisco SD-Access - Connecting Multiple Sites - BRKCRS-2815
• Wednesday, Jan 30 11:00 AM - 1:00 PM
Cisco SD-Access – Assurance and Analytics - BRKCRS-2814
• Wednesday, Jan 30 4:30 PM - 6:00 PM
Cisco SD-Access - Troubleshooting the Fabric - BRKARC-2020
• Thursday, Jan 31 2:30 PM - 4:00 PM
Cisco SD-Access Campus Cisco Validated Design - BRKCRS-1501
• Friday, Feb 01 9:00 AM - 11:00 AM

Cisco SD-Access Integration
Cisco SD-Access - Connecting to the DC, Firewall, WAN & More! - BRKCRS-2821
• Wednesday, Jan 30 8:30 AM - 10:30 AM
Cisco SD-Access - Scaling to Hundreds of Sites - BRKCRS-2825
• Wednesday, Jan 30 2:30 PM - 4:00 PM
Cisco SD-Access – Integrating Existing Network - BRKCRS-2812
• Friday, Feb 01 11:30 AM - 1:30 PM

Cisco SD-Access Policy
Simplifying and Securing the Cisco Digital Network Architecture - BRKCRS-1449
• Tuesday, Jan 29 5:00 PM - 6:30 PM
Group-Based Policy for On-Prem, Hybrid & Cloud with Cisco DNA - BRKCLD-2412
• Wednesday, Jan 30 2:30 PM - 4:00 PM
Cisco SD-Access - Policy Driven Manageability - BRKCRS-3811
• Thursday, Jan 31 2:30 PM - 4:00 PM

Cisco SD-Access Wireless
How to Setup SD-Access Wireless from Scratch - BRKEWN-2021
• Thursday, Jan 31 8:30 AM - 10:30 AM
Cisco SD-Access - Wireless Integration - BRKEWN-2020
• Friday, Feb 01 9:00 AM - 11:00 AM

Cisco SD-Access Labs
Cisco SD-Access & ACI Integration - Hands-on Lab - LTRACI-2636
• Tuesday, Jan 29 2:15 PM - 6:15 PM
Cisco SD-Access - Hands-on Lab - LTRCRS-2810
• Wednesday, Jan 30 9:00 AM - 1:00 PM

Campus wired LAN design options—summary
Design options and design notes:
• Traditional Multilayer Campus (BRKCRS-2031): protocols / tuning
• Layer 3 Routed Access (BRKCRS-3036): L3 planning, limited L2
• L2 Access / Simplified Distribution (BRKCRS-1500): flexible, easy, scalable
• SD-Access / Fabric for Campus (BRKCRS-1501 and many others): flexible, tools to simplify

Physical topology: 2 core, 2 dist./acc.

On-line library at [Link]
How do I get there?
Successful deployments… …start with a plan.

Photos showing Basílica i Temple Expiatori de la Sagrada Família


High availability wired campus design
Key principles
• Choices when interconnecting devices can affect network availability
• Choose hardware-based detection and recovery mechanisms over software for faster convergence
• EtherChannel and Multichassis EtherChannel are powerful tools for convergence and scale
• Overall design choices (multilayer vs. routed access vs. simplified distribution) require the introduction of supporting protocols that affect network availability
• Simplifying the network and improving network availability improves other services overlaid on that network
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• Foundations of the Structured Network Design
• High Availability Architectures:
• Enterprise Wired LAN
• Enterprise Wireless LAN
• Enterprise Data Center
• High Availability System Recovery

Who connected to a wired network today?

… a typical day of a connected life…

• Home: Wi-Fi
• Driving: LTE
• Office: Wi-Fi
• Walk to lunch: LTE
• Restaurant: LTE
• Shopping, hotspots: Wi-Fi

No Wireless == No Network Access

Section Objective

What is the acceptable network downtime?
• Minutes are OK
• < 1 minute
• < 1 second

The goal of this section is to show you how to design and deploy a highly available wireless network to reduce network downtime

Wireless High Availability concepts
• Good news: all the High Availability concepts and best practices we have seen for wired are
applicable to wireless access as well
• Bad news: wireless is not wired
• Wired: shielded, isolated access
• Wireless: thin air, no electromagnetic protection (Ch 1, Ch 6, Ch 11)

We use the air to transmit packets; it is a shared medium, and it is unlicensed. Enough?

Agenda
• High Availability (HA), the theory of operations:
• What to do at the Radio Frequency layer?
• Controller HA for different Deployment Modes:
Centralized (Cloud/non-Cloud)
SD-Access
FlexConnect
Mobility Express
• HA Design and Deployment Practices
• Wireless Assurance: proactively monitor your network!
• Key takeaways

RF HA – how to build redundancy at the RF layer?

Access Points Access Switches Aggregation Switches Wireless Controller

• Creating a stable, predictable RF environment (Proper Design, Site Survey)


• Dealing with RF that is continuously changing (RRM and RF Management)
• Coping with coverage holes from an AP going down (RRM and RF Management)
Radio Frequency (RF) High Availability

• Site Survey, site survey….and site survey


• Use “Active” survey
• Coverage vs. Capacity
• Consider Client type (ex. Smartphone vs. Laptop)

Client behavior (ex. smartphone):
• "My antenna gain is 4 times smaller"
• "My power is half of my brother MacBook"
• "I try to connect to 5GHz and stay connected until the signal is REALLY bad, and then move to another BSSID if it is REALLY better"
Adaptive 802.11r, FastLane, iOS Analytics

Radio Frequency (RF) High Availability

• Site Survey, site survey….and site survey


• Use “Active” survey
• Coverage vs. Capacity
• Consider Client type (ex. Smartphone vs. Laptop)

• AP positioning and antenna choice is Key


• Use common sense
• Light source analogy
• Internal antennas are designed to be mounted on ceiling
• External antennas: use same antennas on all connectors

• Tools
• What you use is less important than how you use it
• Use the same tool to compare results

RF High Availability: Cisco RRM

• What are the objectives of Radio Resource Management (RRM)?
  • Provide a system-wide RF view of the network at the Controller (only Cisco!!)
  • Dynamically balance the network and mitigate changes
  • Manage spectrum efficiency so as to provide the optimal throughput under changing conditions

• What's in RRM?
  • DCA: Dynamic Channel Assignment
  • TPC: Transmit Power Control
  • CHDM: Coverage Hole Detection and Mitigation

• RRM best practices
  • Set RRM to auto for most deployments (High Density is a special case)
  • Design for most radios set at mid power level (level 3, for example)
  • Use RF Profiles to customize RRM settings per area/group of APs

RF High Availability: Cisco RRM
• RRM DCA in action
  • RRM will determine the optimal channel plan based on AP layout
  • A rogue AP is detected on channel 11
  • RRM will assess the RF and take a decision in less than 10 min
  • Channel change is triggered to improve the RF
  • Note how the 3 non-overlapping channels are still maintained!
  • With a limited AP-based view of the RF, each AP will avoid channel 11, reducing overall network capacity

RRM has a system view of RF. AP view would be limited and could result in a sub-optimal RF plan
[Diagram: APs on channels 1, 6 and 11 before and after the channel change]

RF High Availability: Cisco RRM
RRM Coverage Hole Detection and Mitigation (CHDM) in action
• RRM will determine the optimal power plan based on AP layout
• Each client RSSI is tracked by the AP and reported to the WLC
• If an AP fails…


RF High Availability: Cisco RRM
RRM CHDM in action
• RRM will determine the optimal power plan based on AP layout
• Each client RSSI is tracked by the AP and reported to the WLC
• If an AP fails…
• The CHDM algorithm kicks in and increases power of neighboring cells within 90 secs
• Clients roam to new APs
• This happens if the CHDM conditions are met:
  • Clients are below the RSSI threshold
  • Min failed clients per AP (3 by default)
  • Coverage exception level per AP (25% by default)
  • Failed packets (number and %)
• These checks are needed to avoid false positives

RRM details and more: Improve WLAN Spectrum Quality with Cisco’s advanced RF (BRKEWN-3010)
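
The CHDM conditions can be read as a guard that must pass before neighboring cells raise their power. The sketch below is a simplified illustration, not Cisco's algorithm: the RSSI threshold value is a hypothetical placeholder, the failed-packet condition is left out, and combining the client-count and percentage checks with a logical AND is an assumption. Only the defaults of 3 clients and 25% come from the slide.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ChdmThresholds:
    rssi_dbm: float = -80.0               # hypothetical threshold, not from the slide
    min_failed_clients: int = 3           # slide: min failed clients per AP, default 3
    coverage_exception_pct: float = 25.0  # slide: coverage exception level, default 25%


def coverage_hole_suspected(client_rssi_dbm: Sequence[float],
                            t: ChdmThresholds = ChdmThresholds()) -> bool:
    """Return True when enough clients of one AP sit below the RSSI threshold."""
    if not client_rssi_dbm:
        return False
    failed = sum(1 for rssi in client_rssi_dbm if rssi < t.rssi_dbm)
    failed_pct = 100.0 * failed / len(client_rssi_dbm)
    # Both checks must pass, which is what filters out a single weak client.
    return failed >= t.min_failed_clients and failed_pct >= t.coverage_exception_pct


if __name__ == "__main__":
    print(coverage_hole_suspected([-85, -90, -82, -60]))
    print(coverage_hole_suspected([-85, -60, -60, -60]))
```

Requiring both a minimum count and a minimum share of clients mirrors the slide's point that the checks exist to avoid false positives.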
RF High Availability
Flexible Radio Assignment (FRA)
• FRA-auto (default value) or Manual
• Auto 2.4 -> 5GHz or Monitor Mode
• Transition to 2.4 GHz if coverage drops
[Diagram labels: radios serving 5GHz, radios serving 2.4GHz, and a radio monitoring 2.4-5GHz]

FRA: Supported on the Cisco Aironet 2800/3800/4800 Series Access Points

Summary

• Cisco provides well-engineered Access Points, Antennas, and Radio Resource Management features in the controllers
• However, you need to understand the general concepts of radio; otherwise, it is very easy to end up implementing a network in a sub-optimal way:

“RF Matters”
… adding a Wireless Controller (functionality)

Private or public
Cloud

Access Points Access Switches Aggregation Switches Wireless Controller

Mobility Express Centralized/


SD-Access
SD-Access/FlexConnect
Wireless Controller modes fitting different requirements

• Centralized: ease of deployment and simplified management for large campuses. Cloud and non-Cloud options.
• SDA-Wireless: policy segmentation and consistent wired-wireless management
• FlexConnect: eliminate the need for a Controller at every site for a distributed deployment. Cloud and non-Cloud options.
• Mobility Express: controller-less deployment for distributed deployments and small sites
• Configure / Set up: from a web browser or Cisco wireless app, use the setup wizard to enable multiple APs simultaneously
[Diagram labels: LAN, Campus, Fabric, WAN]

Cisco Wireless Controller Options

Catalyst 9800 Controller Series (launched Nov 2018)
• Catalyst 9800-40: 2000 APs
• Catalyst 9800-80: 6000 APs
• C9800 on Switch (SD-Access only)
• Catalyst 9800-Cloud (private and public)
• Catalyst 9800-Cloud (private): 3000-6000 APs
• ENCS
Scale axis: 200 APs, 1000 APs, 2000 APs, 3000 APs, 6000 APs

AireOS WLCs
• Mobility Express: 50-100 APs
• WLC 3504: 150 APs
• WLC 5520: 1500 APs
• WLC 8540: 6000 APs
Cisco Catalyst 9800 Series – Wireless benefits

Powered by IOS XE
Open and Programmable
Trustworthy Solutions
Modular operating system

Deploy Anywhere
• On-Prem, Private/Public cloud, Embed wireless on a 9k switch
• AWS GovCloud ready
• Scale as you grow

Always-on
• Software updates with no disruption
• Rolling AP upgrades
• Seamlessly add new AP models

Secure
• Detect encrypted threats with Encrypted Traffic Analytics (ETA)
• Integration with StealthWatch
• Automated macro/micro segmentation with SDA
• WPA3 Support*
*Future

High Availability
Reducing downtime for Upgrades and Unplanned Events

• Unplanned Events (device and network interruptions): SSO Active-Standby; N+1 Primary, Secondary; Per AP Primary, Secondary, Tertiary
• Controller Software Update: Software Maintenance Updates (SMU^)
• Access Point Updates: new AP model & AP updates*
• Software Image Upgrades: wireless controller image upgrades

Centralized Mode High Availability: SSO and N+1

Client SSO
Requirements:
• Catalyst 9800 Series
• 5520, 8540, 3504 WLC
• L2 connection
• Same HW+SW version
• 1:1 box redundancy
Benefits:
• Active client state is synched
• AP state is synched
• No application downtime
• Network uptime
• No license needed on secondary Controller

N+1 Redundancy (Deterministic/Stateless HA, a.k.a. primary/secondary/tertiary)
Requirements:
• Each Controller has to be configured separately
Benefits:
• Available on all controllers
• Crosses L3 boundaries
• Flexible: 1:1, N:1, N:N

Wireless Controller HA - Centralized Mode

N+1 Redundancy
• Administrator statically assigns APs a primary, secondary, and/or tertiary controller
  • Assigned from controller interface (per AP) or Prime Infrastructure (template-based)
  • You need to specify Name and IP if WLCs are not in the same Mobility Group
• Pros:
  • Predictability: easier operational management
  • Support for L3 network between WLCs
  • Flexible redundancy design options: 1:1, N:1, N:N:1
  • WLCs can be of different HW and SW (*)
  • “Fallback” option in the case of failover
  • Can overload APs on controllers (using AP priority)
• Cons:
  • Stateless redundancy. There is a network downtime when the WLC fails
  • More upfront planning and configuration

(*) AP will need to upgrade/downgrade code upon joining

[Diagram: WLAN-Controller-A, WLAN-Controller-B and WLAN-Controller-C reached over an IP network, with three access points configured as follows]
• AP 1: Primary WLAN-Controller-1, Secondary WLAN-Controller-2, Tertiary WLAN-Controller-3
• AP 2: Primary WLAN-Controller-2, Secondary WLAN-Controller-3, Tertiary WLAN-Controller-1
• AP 3: Primary WLAN-Controller-3, Secondary WLAN-Controller-2, Tertiary WLAN-Controller-1

N+1 Redundancy
Global backup Controllers

• Used if there are no primary/secondary/tertiary WLCs configured on the AP
• The backup controllers are added to the primary discovery response message to the AP

Where to configure:
• Catalyst 9800 Controller Series: Configuration > AP Join >…
• AireOS: Wireless > High Availability

N+1 Redundancy
AP Failover mechanism
< 30-45 sec (*)

When configured with Primary and backup Controllers:
• AP uses heartbeats to validate current WLC connectivity
• Upon losing a heartbeat to the Primary, AP sends 5 consecutive heartbeats every 3 seconds (default)
  • Configurable to a minimum of 3 keepalives every 2 sec
• If no reply, AP declares the WLC dead and starts the join process to the first backup WLC candidate:
  • Backup is the first alive WLC in this order: primary, secondary, tertiary, global primary, global secondary.
• With N+1 failover, AP goes back to discovery state just to make sure the backup WLC is UP and then immediately starts the JOIN process
• With N+1, AP periodically checks for Primary to come back online and falls back to it (AP fallback can be disabled)

[State diagram labels: AP Boots UP, Reset, Discovery, DTLS Setup, Join, Image Data, Config, Run; WLC failure detected]

(*) With Fast Heartbeat and minimum values for keepalive
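
The failover rules above can be sketched in a few lines. This is a hedged illustration rather than the AP's real state machine: `detection_window_s` assumes the retry heartbeats are spaced one interval apart, and the role names and `first_alive_backup` helper are hypothetical. Only the candidate order and the 5 x 3 sec and 3 x 2 sec figures come from the slide.

```python
from typing import Dict, Iterable, Optional

# Candidate order as listed on the slide.
BACKUP_ORDER = ("primary", "secondary", "tertiary", "global_primary", "global_secondary")


def detection_window_s(heartbeats: int, interval_s: float) -> float:
    """Rough time to declare a WLC dead: retry heartbeats times their spacing."""
    return heartbeats * interval_s


def first_alive_backup(configured: Dict[str, str], alive: Iterable[str]) -> Optional[str]:
    """Return the first configured WLC, in slide order, that is currently alive."""
    alive_set = set(alive)
    for role in BACKUP_ORDER:
        wlc = configured.get(role)
        if wlc is not None and wlc in alive_set:
            return wlc
    return None


if __name__ == "__main__":
    ap_config = {"primary": "WLC-1", "secondary": "WLC-2", "global_primary": "WLC-9"}
    print(detection_window_s(5, 3))   # default retries
    print(detection_window_s(3, 2))   # minimum configurable values
    print(first_alive_backup(ap_config, alive=["WLC-2", "WLC-9"]))
```

Under these assumptions the detection window drops from 15 to 6 seconds with the minimum values; the rest of the quoted 30-45 sec would then be spent in discovery and join.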

N+1 Redundancy
AP Fast Heartbeat
< 30-45 sec (*)

• Fast Heartbeats lower the amount of time it takes to detect Primary controller failure
• How Fast Heartbeat works
  • AP sends these packets, by default every 1 sec
  • When the fast heartbeat timer expires, the AP sends a fast echo request to the WLC 3 times (configurable)
  • If there is no response, the primary is considered dead and the AP selects an available controller from its “backup controller” list in the order of primary, secondary, tertiary, primary backup controller, and secondary backup controller.
• Fast Heartbeat is only supported for Local and Flex mode

N+1 Redundancy
AP Primary Discovery Request Timer

• The access point periodically sends primary discovery requests to the Primary WLC to know when it is back online. Default is 120 sec.
• If AP Fallback is enabled (default), the AP automatically rejoins the Primary controller

N+1 Redundancy
AP Failover Priority

• Assign priorities to APs: Critical, High, Medium, Low
• Critical priority APs get precedence over all other APs when joining a controller
• In a failover situation, a higher priority AP will be allowed to join ahead of all other APs
• If the backup controller doesn’t have enough licenses (ex. multiple Primary WLCs fail), existing lower priority APs will be dropped to accommodate higher priority APs

[Diagram: a failed WLC and an overloaded backup WLC; the Critical-priority AP fails over and a Medium-priority AP is dropped]
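
The priority rule can be illustrated with a small admission function. This is a sketch under stated assumptions, not controller code: `Priority`, `AP` and `admit` are hypothetical names, licenses are treated as a simple AP count, and only one AP is evicted per join attempt.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple


class Priority(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class AP:
    name: str
    priority: Priority


def admit(joined: List[AP], newcomer: AP, licenses: int) -> Tuple[List[AP], List[AP]]:
    """Return (APs joined afterwards, APs dropped) for one join attempt."""
    if len(joined) < licenses:
        return joined + [newcomer], []
    if not joined:
        return [], [newcomer]  # no licenses at all
    lowest = min(joined, key=lambda ap: ap.priority)
    if lowest.priority < newcomer.priority:
        # Controller is full: evict the lowest-priority AP to make room.
        kept = [ap for ap in joined if ap is not lowest]
        return kept + [newcomer], [lowest]
    return list(joined), [newcomer]


if __name__ == "__main__":
    current = [AP("ap1", Priority.MEDIUM), AP("ap2", Priority.HIGH)]
    after, dropped = admit(current, AP("ap3", Priority.CRITICAL), licenses=2)
    print([ap.name for ap in after], [ap.name for ap in dropped])
```

In the example, the Critical AP displaces the Medium one, matching the diagram on the slide.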

N+1 Redundancy
Typical Design
< 30-45 sec

• Most common design is N+1 with a redundant WLC in a geographically separate location
• Downtime can be held to 30-45 sec when using the faster heartbeat to detect failure
• Use AP priority in case of oversubscription of the redundant WLC

APs configured with:
Primary: WLAN-Local
Secondary: WLC-BKP

[Diagram: WLC-BKP in a geo-separated DC, reached across the campus IP network from the primary locations, each with a WLAN-Local controller]

For more info:


[Link]
nology/hi_avail/N1_HA_Overview.html
Wireless Controller HA

Centralized Mode -
Stateful Switch Over
(SSO)
High Availability (Client SSO)
A direct physical connection between Active and Standby Redundant Ports or Layer 2 connectivity is
required to provide stateful redundancy within or across datacenters
Sub-second failover and zero SSID outage
Active Wireless Hot-Standby Wireless
Controller Controller

C9800-40-K9
Redundancy Port Connectivity
RP via L2
Gigabit SFP RP port Gigabit SFP RP port

C9800-80-K9

Active Wireless Hot-Standby Wireless


Controller Redundancy Port Connectivity
Controller
RP Via L2
Example for AireOS Controller:
[Link]

The only supported SFPs on the Gigabit RP port are: GLC-SX-MMD and GLC-LH-SMD
C9800 Private Cloud Deployment: ESXi

Client SSO High Availability


C9800-CL-K9
[Diagram: active and standby C9800-CL virtual controllers (vWLC1, vWLC2), each with CP and DP, running on ESXi hosts and connected through vSwitches and physical switches; the HA interfaces provide Redundancy Port (RP) connectivity via L2]

Stateful Switchover (SSO) < 1 sec

• HA Pairing is possible only between the same type of hardware and software versions
• True Box to Box High Availability i.e. 1:1
• One WLC in Active state and second WLC in Hot Standby state
• Secondary continuously monitors the health of Active WLC via dedicated link

• Configuration on Active is synched to Standby WLC


• This happens at startup and incrementally at each configuration change on the Active

• What else is synched between Active and Standby?


• Licenses, AP CAPWAP state, Clients in “RUN” state

• Downtime during failover is greatly reduced:


• 2 - 100 msec for a box failover (Active WLC crashes, system hangs, manual reset or forced switch-over)
• 350-500 msec in the case of power failure on the Active WLC (no signaling to the peer is possible)
• Few seconds in the case of network failover (gateway not reachable)

Stateful Switchover (SSO)
Failover sequence

1. Redundancy role negotiation and config sync
2. APs associate with Active controller
3. Client associates with Active through AP
4. Active failure: notify peer / or missing keepalive
5. Standby WLC sends out GARP
6. Standby becomes Active:
   • AP DB and Client DB are already synced to standby controller
   • AP CAPWAP tunnel session intact
   • Client session intact, client does not re-associate*

Effective downtime for the client is: Detection time + Switchover time

[Diagram: Active and Standby controllers above the campus access layer, with the CAPWAP tunnel and client session preserved across the switchover]
video: [Link]
Stateful Switchover (SSO)
Other important things to keep in mind..

• There is no preemption in Controller SSO:
  • When the failed Active WLC comes back online, it will join as Hot Standby
• Recommendations:
  • In Service Software Upgrade (ISSU) is currently not supported: plan for downtime when upgrading software. Many improvements for the Catalyst 9800 Controller: see next section.
  • Physical connection between Redundant Ports should be done first, before HA configuration
  • Keepalive and Peer Discovery timers should be left at default values for better performance

High Availability: Cisco Catalyst 9800 Wireless Controller Differentiators
Reducing downtime for Upgrades and Unplanned Events

• Unplanned Events (device and network interruptions): SSO Active-Standby; N+1 Primary, Secondary; Per AP Primary, Secondary, Tertiary
• Controller Software Update, Software Maintenance Updates (SMU^): Hot Patch (no Wireless Controller reboot), auto install on Standby; Cold Patch, HA install on SSO pair
• Access Point Updates, new AP model & AP updates*: Rolling AP Update (no Wireless Controller reboot); AP Device Pack for new AP models; flexible per-site, per-model updates
• Software Image Upgrades (wireless controller image upgrades): N+1 Hitless Rolling AP Upgrade

Wireless Controller HA: Catalyst 9800 only

High Availability: Cisco Catalyst 9800 Wireless Controller Differentiators
Reducing downtime for Upgrades and Unplanned Events
Legend: 16.10 Supported; Supported after 16.10; ^ MD Release Only

• Unplanned Events (device and network interruptions): SSO Active-Standby; N+1 Primary, Secondary; Per AP Primary, Secondary, Tertiary
• Controller Software Update, Software Maintenance Updates (SMU^): Hot Patch (no Wireless Controller reboot), auto install on Standby; Cold Patch, HA install on SSO pair
• Access Point Updates, new AP model & AP updates*: Rolling AP Update (no Wireless Controller reboot); AP Device Pack for new AP models; flexible per-site, per-model updates
• Software Image Upgrades (wireless controller image upgrades): N+1 Hitless Rolling AP Upgrade

Controller and AP software upgrades
Legend: Future; SMU on MD Release only

• SMU (controller PSIRTs, fixes): controller update or bug fixes
• AP Service Pack (updates on APs): AP update or bug fixes
• AP Device Pack (new AP model support): hot-patchable support for Device Pack

Contain impact within release
• Fixes for defects and security issues without need to requalify a new release

Faster resolution to critical issues
• Provide fixes to critical issues found in network devices that are time-sensitive
Wireless Controller SMU
Wireless Controller SMU installation options

• Software Maintenance Update (SMU) is the ability to apply patch fixes on a software release in the customer network
• Current mechanism relies on Engineering Specials
  • Entire image is rebuilt and delivered to customer

Hot Patch (no Wireless Controller reboot, auto install on Standby)
• Inline replace of functions without restarting the process
• On SSO systems, the patch will be applied on both active and standby without any reload

Cold Patch (Wireless Controller reboot)
• Install of a SMU will require a system reload
• On SSO systems, SMU updates can be installed on the HA pair with zero downtime (follows ISSU path and both Standby & Active controller reloaded, but there is no impact to AP and Client session)

• SMU infrastructure will be available in 16.10 FCS release
• SMUs for C9800 will be available starting the first MD Release

Catalyst 9800 SMU Cold Patch + AP Service Pack

Follows ISSU path and both Standby & Active controller reloaded, but there is no impact to AP and Client session.

1. Install SMU on Standby
2. Switchover to activate SMU (the Standby becomes Active)
3. Install SMU on new Standby

Rolling AP upgrade if the AP image needs update (reset APs in a staggered way)
TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 314
Rolling AP Upgrade: Choose how aggressive…
N=4 Neighbor APs N=8 Neighbor APs N=24 Neighbor APs

User selects % of APs to upgrade in one go [5, 15, 25]


For 25%, Neighbors marked = 6 [Expected number of iterations ~ 5]
For 15%, Neighbors marked = 12 [Expected number of iterations ~ 12]
For 5%, Neighbors marked = 24 [Expected number of iterations ~ 22]
Rolling AP Upgrade - Client Steering
• Clients steered from candidate APs
to non-candidate APs
• 802.11v BSS Transition Request
• Disassociation imminent
• If clients do not honor this, they will be de-authenticated before AP reload

N+1 Rolling AP Upgrade
Wireless Controller image upgrade using N+1 staging Controller
(Diagram: Primary controller moving from version X to X+1 and an N+1 controller at version X+1 in the same Mobility Group; a rolling upgrade is triggered and upgraded APs move between them.)

1. Device auto selects candidate APs based on selected % and RRM AP Neighbor Map

2. Upgrade process kicks-in
• Image download to Primary Wireless Controller
• Image pre-download to APs
• Selective redirect of clients using 11v
• APs moved to N+1 Wireless Controller in rolling manner
• Primary Wireless Controller Reboot
• APs moved back to Primary Wireless Controller (optional)

3. Monitor progress on the Device
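
Step 1 above can be pictured as a greedy pass over the RRM neighbor map: pick an AP as a candidate, then mark its neighbors so that they stay in service and can take the steered clients. The sketch below is only an illustration under stated assumptions, not the controller's actual selection algorithm. The names `neighbor_map`, `pick_candidates` and `plan_rolling_upgrade` are hypothetical, the neighbor map is assumed to be symmetric, and the per-iteration cap simply applies the selected percentage to the total AP count.

```python
import math


def pick_candidates(pending, neighbor_map, max_batch):
    """Greedily pick up to max_batch APs so that no two picked APs are neighbors."""
    chosen, blocked = [], set()
    for ap in sorted(pending):
        if len(chosen) >= max_batch:
            break
        if ap in blocked:
            continue
        chosen.append(ap)
        # Neighbors of a candidate are marked and stay up during this iteration.
        blocked.update(neighbor_map.get(ap, ()))
    return chosen


def plan_rolling_upgrade(neighbor_map, percent):
    """Return a list of iterations; each iteration lists the APs reloaded together."""
    all_aps = set(neighbor_map)
    max_batch = max(1, math.floor(len(all_aps) * percent / 100))
    pending, plan = set(all_aps), []
    while pending:
        batch = pick_candidates(pending, neighbor_map, max_batch)
        plan.append(batch)
        pending.difference_update(batch)
    return plan


# Hypothetical example: 8 APs in a ring, each hearing its two adjacent APs.
ring = {f"ap{i}": {f"ap{(i - 1) % 8}", f"ap{(i + 1) % 8}"} for i in range(8)}
plan = plan_rolling_upgrade(ring, percent=25)
```

A lower percentage gives smaller batches and therefore more iterations, which matches the trade-off between upgrade speed and coverage described for the 5/15/25 % options.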


Wireless Controller HA
Software Defined-Access Wireless

Software Defined Access: Bringing Intent Based Networking to Life
Cisco DNA Center: Policy, Automation, Analytics
• Automated Network Fabric: Single Fabric for Wired & Wireless with simple Automation
• Identity-Based Policy & Segmentation: Decouples Security & QoS from VLAN and IP Address
• Insights & Telemetry: Analytics and Insights into User and Application behavior
• User Mobility: Policy stays with User
(Diagram: fabric with border (B) and control-plane (C) nodes, an SDA Extension, an IoT Network and an Employee Network.)

Catalyst 9800 SD-Access Wireless
Cisco DNA Center: Policy, Automation, Analytics
• SD-Access Wireless Campus: Controller Appliance or Private Cloud
• SD-Access Wireless Distributed Sites: Embedded Wireless on “Cat 9k Switch”
• Sites connected over SD-WAN (Viptela): MPLS | Metro | 4G/5G/LTE | Internet
• IoT and User Mobility: Seamless Mobility, Policy stays with user

Software Defined-Access: Roles and Terminology
• Cisco DNA Controller – Enterprise SDN Controller for Automation & Assurance. GUI management abstraction via multiple Service Apps
• Identity Services – NAC & ID Systems (e.g. ISE) for dynamic Endpoint to Group mapping and Policy definition
• Control-Plane (CP) Node – Map System that manages Endpoint to Device relationships
• Fabric Border Nodes – A Fabric device (e.g. Core) that connects External L3 network(s) to the SDA Fabric
• Fabric Edge Nodes – A Fabric device (e.g. Access or Distribution) that connects Wired Endpoints to the SDA Fabric
• Fabric Wireless Controller – Wireless Controller (WLC) that is fabric-enabled and participates in the LISP control plane
• Fabric Mode APs – Access Points that are fabric-enabled. Wireless traffic is VXLAN encapsulated at AP
(Diagram labels: Cisco DNA Controller, ISE / AD Identity Services, Fabric Mode WLC, Fabric Border (B), Control-Plane Nodes (C), Intermediate Nodes (Underlay), Fabric Edge Nodes, Fabric Mode APs, CAPWAP (Control), VXLAN (Data).)

SD-Access Wireless: Redundancy Considerations
• WLC registers wireless clients in Host Tracking DB
• Control Plane (CP) redundancy is supported in Active / Active configuration
• WLC is configured with two CP nodes with information sync across both
• Stateful redundancy with WLC SSO pair. Active WLC updates Control nodes
(Diagram: Active/Standby WLC SSO pair sending client updates to two control-plane nodes (C) next to a border node (B).)

Platforms supporting SD-Access Wireless
On Switch (optimized for Distributed Branches):
• Cisco IOS® XE Software
• Cat 9300, Cat 9500
• 200 AP, 4k Clients
• SD-Access wireless with Cat9800 Software Package
• Indirect AP Support
• Centralize Control Plane
• Always on Fabric with robust HA
On Private Cloud (Small and Medium Campus):
• Cisco IOS® XE Software
• C9800-CL
• 1k AP, 10k Clients
• 3k AP, 32k Clients
• 6k AP, 64k Clients^
• Scale on demand
• Designed for IoT
• Always on Fabric with robust HA
On Appliance (Medium and Large Campus):
• Cisco IOS® XE Software: C9800-40-K9, C9800-80-K9
• Cisco AireOS Software: WLC 3504 (SW8.8), WLC 5520 (SW8.8), WLC 8540 (SW8.8)
• Designed for IoT
• Always on Fabric with robust HA

Wireless Controller HA
FlexConnect Mode
FlexConnect quick recap…
• CAPWAP management and data plane are split:
  • Central Switching (SSID data traffic sent to WLC)
  • Local Switching (SSID data traffic sent to local VLAN)
• Two modes of operation from AP perspective:
  • Connected (when WLC is reachable)
  • Standalone (when WLC is not reachable)
(Diagram: Controller Cluster at the Central Site, WAN, and a FlexConnect Branch Office showing the Central Switching and Local Switching paths.)

FlexConnect HA
FlexConnect Local Switching
• Limitations:
  • L2 roaming
  • Flex Groups for AAA Local Auth.
  • Fault Tolerance: identical configuration on N+1 controllers
• Benefits:
  • Upon WLC failure AP stays up and clients are not disconnected
  • Equivalent to Client SSO
  • AAA survivability available
FlexConnect Central Switching
• Limitations: Same as Centralized mode
• Benefits: Same as Centralized mode

Clients at locally switched SSIDs stay connected at Controller/WAN outage
• Local Switching SSIDs: all connected Clients stay connected!
• CAPWAP Control – UDP 5246
• CAPWAP Data – UDP 5247
(Diagram: Data Center with AAA/RADIUS, Prime and the Wireless Controller; a WAN outage; the Access Point in the Branch Office.)

Impact of WAN Outage or Controller Failure
1. Controller failure:
N+1 HA Design:
• No impact for locally switched SSIDs
• FlexConnect AP will search for backup WLC and resume client sessions with centrally switched SSIDs.
1:1 HA Design with Client SSO:
• No impact for centrally switched SSIDs: centrally and locally switched SSIDs stay up.
2. WAN failure / Controller not reachable:
• Access Point will continue to transmit/receive data on locally switched SSIDs.
• Connected clients stay connected
• Fast roaming is possible for clients with CCKM/OKC/802.11r support
• New clients can connect if local RADIUS or local authentication is provided.
• Lost features: RRM, wIDS, location, WebAuth, NAC
(Diagram: Controller Cluster at the Central Site (failure point 1), WAN (failure point 2), and a FlexConnect Branch Office with Local Switching.)

Wireless Controller HA
Mobility Express
Cisco Mobility Express: Controller Function embedded into the access point
• Runs WLAN Controller on access point
• Investment Protection - Add controller without changing Access Point
• Best Practices activated by default & in built redundancy
• Mobile app/WebUI to configure up to 100 access points
• Simple UI monitors, manages and troubleshoots your network

Mobility Express Overview
• One AP runs as Mobility Master (think about it as a local Virtual WLC)
• Controller and APs are in the same L2 broadcast domain.
• Based on FlexConnect architecture
• Mobility Express supports client central authentication and local switching of traffic
(Diagram: a MASTER AP among AIR-AP1852I-B-K9, AIR-AP2802I-B-K9, AIR-AP3802I-B-K9, AIR-AP3702I-B-K9, AIR-AP3702E-B-K9 and AIR-AP2702I-B-K9 access points.)

Mobility Express: Master AP Redundancy
• If Master AP fails, another Mobility Express capable AP is elected
automatically.
• Newly elected Master AP has same IP and config as original Master AP.

• Preferred Master can be set (AireOS 8.7)

• Election of a new controller using VRRP


• Heartbeat exchanged every 10 s with Master AP
• After 3 missed heartbeats: Master election initiated - all Master capable APs participate
• APs fall into standalone mode during the election process (takes about 30 seconds)
• Standalone Access Points join the newly elected Master and go to connected mode

• Election Priorities
1. Most capable Access Points. 3800 > 2800 > 1800.
2. AP with least client load
3. In case of tie, election based on lowest MAC Address
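
The three election priorities above map naturally onto a sort key. The following is a minimal sketch of that ordering only, not the VRRP-based election itself; the `AccessPoint` class, the `CAPABILITY_RANK` table and the `elect_master` function are hypothetical names introduced for illustration, and the capability ranking covers just the three series named on the slide.

```python
from dataclasses import dataclass

# Higher rank = more capable; ordering taken from the slide (3800 > 2800 > 1800).
CAPABILITY_RANK = {"3800": 3, "2800": 2, "1800": 1}


@dataclass(frozen=True)
class AccessPoint:
    name: str
    series: str        # "1800", "2800" or "3800"
    client_count: int  # current client load
    mac: str           # e.g. "00:11:22:33:44:55"


def elect_master(candidates):
    """Pick the new Master AP: most capable, then least loaded, then lowest MAC."""
    def key(ap):
        mac_value = int(ap.mac.replace(":", ""), 16)
        return (-CAPABILITY_RANK[ap.series], ap.client_count, mac_value)
    return min(candidates, key=key)


aps = [
    AccessPoint("ap-a", "1800", 0, "00:00:00:00:00:01"),
    AccessPoint("ap-b", "2800", 12, "00:00:00:00:00:02"),
    AccessPoint("ap-c", "2800", 4, "00:00:00:00:00:03"),
]
new_master = elect_master(aps)  # ap-c: a 2800 beats the 1800, and has fewer clients than ap-b
```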

Mobility Express Master Election Process
Election priorities (P):
1. Most capable Access Point, e.g. 2800 vs. 1850
2. Least Client Load
3. Lowest MAC address
(Diagram: a MASTER AP among AIR-AP1852I-B-K9, AIR-AP2802I-B-K9, AIR-AP3802I-B-K9, AIR-AP3702I-B-K9, AIR-AP3702E-B-K9 and AIR-AP2702I-B-K9 access points.)

Mobility Express WLAN Deployment Options
• Single Office: Mobility Express
• Distributed Office: Mobility Express
• Distributed Enterprise: Mobility Express in Branch, Controller Based in campus

Agenda
• High Availability (HA), the theory of operations:
• What to do at the Radio Frequency layer?
• Controller HA for different Deployment Modes:
Centralized (Cloud/non-Cloud)
SD-Access
FlexConnect
Mobility Express
• HA Design and Deployment Practices
• Wireless Assurance: proactively monitor your network!
• Key takeaways

HA Best Practices: Connecting an AP to the
wired network

Recommendations:
• Create redundancy throughout the access layer by connecting APs to different switches/stack members/linecards
• If the AP is in Local mode, configure the port as access with STP PortFast, BPDU guard, etc.
• If the AP is in Flex mode and Local Switching, configure the port as trunk and allow only the VLANs you need

HA Best Practices: Connecting a Single
Controller to the wired network

1) To a single Modular Switch or Stack
• Use Trunk EtherChannel (EC)/LAG
• Trunk only the required VLANs to the Controller
• 2/4/8 ports in a bundle to optimize load sharing
• Spread ports across Line Cards/Stack members

2) To Redundant Distribution Switches in a VSS pair
• Same as Option 1
• Spread ports across VSS members

HA Best Practices: Connecting HA pair to the
wired network

Option 1: to single Modular Switch or Stack
• The HA pair of WLCs should be considered as separate WLCs with exactly the same configuration
• Ports on both WLCs are UP but only the ones on the Active WLC are forwarding data traffic
• On WLC side: use the same physical ports to connect to the network, e.g. ports 1-4 on WLC1 and ports 1-4 on WLC2
• On switch side the configuration has to be the same. If using LAG, for example, two Port-channels should be used with the same configuration (same mode, same VLANs, same native VLAN, etc.)
• General recommendations for connecting a single WLC (Option 1) also apply
(Diagram: Active WLC and Standby WLC connected over L2 to a single switch or stack through trunk port-channels Po 1 and Po 2, with the same configuration on both.)

HA Best Practices: Connecting a Client SSO
Controller Cluster to the wired network (VSS)
Option 2: to VSS pair

• Use EtherChannel from each Wireless Controller to Distribution VSS
• Spread the links in each EtherChannel among the two physical switches: this will prevent a Wireless Controller switchover upon a failure of one of the VSS switches
• Redundancy Port (RP) connected to the respective uplink switches.
• The AP/Clients are up after an SSO. It is a seamless transition and there are no drops on the client.
(Diagram: Active WLC and Standby WLC, each with an EtherChannel to both VSS switches.)

HA Best Practices: Connecting a Client SSO
Controller Cluster to the wired network (HSRP)
Option 3: to HSRP pair

• Controller devices are connected to 2 HSRP routers (Active and Standby).
• The uplink is a port-channel. RP connected to the respective uplink routers.
• Failover of HSRP Active to Standby induces a switchover of Cisco Catalyst 9800 Wireless Controller HA pair.
• The AP/Clients are up after an SSO. It is a seamless transition and there are no drops on the client.

HA Deployment Best
Practices

Focus on Campus
HA Deployment Best Practices
Campus

• What is the acceptable downtime for your business applications?


• No downtime? Go with Stateful Switchover (Client SSO).
• Are 30 seconds to a few minutes OK? Go with N+1 to have more deployment flexibility
• What is the downtime to upgrade a HA pair and how to minimize it?
• Catalyst 9800 Wireless Controller: use built-in Rolling SW Upgrade
• AireOS Controllers (details for reference only):
• Plan for additional backup controller
• Use Prime Infrastructure Rolling SW Updates Feature

HA Deployment Best Practices for Campus
N+1 (Primary Controller, Secondary Controller):
• Approx. 30 sec failover time (AP + Client affected)
• No config sync (risk: config mismatch)
• AP load balancing
• L2 or L3
SSO (Active WLC, Standby WLC):
• Sub-second failover (Client + AP not affected)
• Config sync
• One active, one standby (no AP load balancing)
• L2 connection needed
SSO + 1 (Primary Controller, Secondary Controller):
• Adds redundancy and simplifies operation during maintenance (e.g. SW updates)
SSO + SSO (Primary Controller, Secondary Controller):
• Adds redundancy and simplifies operation during maintenance (e.g. SW updates)

HA Deployment Best Practices
Campus
• What is the acceptable downtime for your business applications?
• No downtime? Go with AireOS Stateful Switchover
• Are 30 seconds to a few minutes OK? Go with N+1 to have more deployment flexibility
• What is the downtime to upgrade a HA pair and how to minimize it?
• What is the recommended HA deployment in a multi-site Campus?
1. Use 2-Tier Redundancy (SSO and N+1) HA deployment
• Use SSO in the main site (Primary WLC)
• Use Secondary/Tertiary in redundancy sites
2. For max resiliency use SSO in all sites

Multi-site Campus: Combine SSO with N+1
• SSO pair can act as the Primary Controller and be deployed with single Secondary and Tertiary WLC
• Network downtime:
  • No network downtime for single controller failure in the Primary DC
  • On failure of both Active and Standby WLC, APs will fall back to secondary and further to configured tertiary controller
• Recommendations:
  • Make sure that AP Fallback is enabled
  • Use AP Failover priority in case of oversubscription of the backup WLC
  • Useful to reduce downtime for SSO pair software upgrade
(Diagram: Primary SSO pair (.2 and .3) on 9.6.61.x/24 in the Main Data Centre with PI and ISE; Secondary WLC in DC 1 and Tertiary WLC in DC 2, reached from the Campus Access layer over the IP network.)
AP Config:
Primary WLC – [Link]
Secondary WLC – [Link]
Tertiary WLC – [Link]

Multi-Site Campus: SSO everywhere!
• Each site can be its own separated SSO architecture
• Full site redundancy by assigning primary, secondary, tertiary to the APs.
• Max level of High Availability: no network downtime upon controller failure within any site.
(Diagram: Primary SSO pair on 9.6.61.x in the Main Data Centre with ISE and PI; Secondary SSO pair on 9.6.62.x in DC 1; Tertiary SSO pair on 9.6.63.x in DC 2; each pair uses .2 and .3 and is reached from the Campus Access layer over the IP network.)
AP Config:
Primary WLC – [Link]
Secondary WLC – [Link]
Tertiary WLC – [Link]

HA Deployment Best
Practice

Focus on Branch
HA Deployment Best Practices: Branch Key
Design Questions

Local Controller (Appliance/virtual):
• Specific per branch configuration
• Independency from WAN quality
• Reduced configuration on switches
• Full feature support
• L3 roaming supported
Mobility Express:
• Specific per branch configuration
• Independency from WAN quality
• Low hardware footprint (Controller running on Access Point)
FlexConnect:
• Single pane of Mgmt. & Troubleshooting
• Reduced branch footprint
• Built-in resiliency
• Perfect fit for centralized IT Team

HA questions:
• Is the branch independent from the Central site from an operational perspective?
• What is the traffic flow of your application? Are the APP servers centrally located?
• Is there a local Internet breakout? How do you authenticate new users if WAN/Controller is
down? Where is the AAA server located?

FlexConnect Branch Summary
“Central Controller Cluster for thousands of Sites and Access Points”
Key Facts:
• “Cloud Controller” (private or public)
• Ease of Operations: single point of configuration for up to 6000 APs
When to use:
• Perfect for centralized IT Team
High Availability:
• If controller not reachable: Local Data path stays UP and Clients stay connected, you can use AAA survivability
• SSO at central site provides control plane survivability
Keep in Mind:
• Switchport as Trunk if SSID/VLAN separation needed
• WAN Performance
• Some feature limitations (compared with local Controller)
(Diagram: Data Centre with Campus Services, ISE, PI and a WLC SSO pair; WAN; remote location with FlexConnect APs.)

Local Controller Branch Summary
“Do your clients need full Enterprise feature set (even if WAN is down)?”
Key Facts:
• Position one or two controllers per branch
• Full feature set available
When to use:
• WAN Bandwidth and latency is a concern
• Simple configuration on the switch port connected to the Access Point desired
• Branch/local IT staff requires configuration outside of corporate standard
High Availability:
• Full features available if WAN is down
• Use N+1 or SSO for site controller redundancy
• Local Authentication, DHCP, DNS required for full WAN Independency
Keep in Mind:
• Need to manage each site individually
• Prime Infrastructure should be considered for central manageability
(Diagram: Data Centre with Campus Services, ISE and PI; WAN; remote location with a WLC and Local Services: AAA, DHCP, DNS.)

Mobility Express Branch Summary
“Quick and Easy setup, no additional Hardware, WAN Independency”
Key Facts:
• It’s a Wireless Controller running on an Access Point!
When to use:
• WAN independency is required and low hardware footprint is desired.
• Ideal for new deployments using 18xx/28xx/38xx Series Access Points
High Availability:
• Self-Healing redundancy
• Independent from WAN
• Local AAA, DHCP, DNS for full WAN independency
Keep in Mind:
• Switchport as Trunk if SSID/VLAN separation needed
• Per branch configuration and management
• Consider adding Prime Infrastructure or Cisco DNA Center for central management
(Diagram: Data Centre with Campus Services, ISE and PI; WAN; remote location with Mobility Express APs and Local Services: AAA, DHCP, DNS.)

Cisco Wireless
Assurance:
Proactively Monitor your
Network
Cisco DNA Center can manage all wireless deployment modes for Automation and Assurance
Cisco DNA Center: Policy, Automation, Analytics
• SDA-Wireless: Policy Segmentation and consistent wired-wireless management
• Centralized: Ease of Deployment and management for large campuses
• FlexConnect: Eliminate the need for a Controller at every Site for a distributed deployment
• Mobility Express: Simplified Controller-less deployment for distributed deployments and small sites
• Configure / Set up: From a web browser or Cisco wireless app, use the setup wizard to enable multiple APs simultaneously

Continuous Verification
Configs, Changes, Routing, Security
Services, Compliance, Audits
Successful Rollouts, Operational Continuity

Insights & Visibility


Visibility, Context, Historical
Insights, Prediction
Minimize Downtime, User Productivity

Corrective Actions
Guided Remediation, Automated Updates
System Optimization
IT Productivity

Wireless Streaming Telemetry Architecture
Purpose-Built for Cisco DNA Assurance
*Available with 16.10.1s and Cisco DNAC 1.2.8 or later

Telemetry sources reporting to Cisco DNA Center:
AP2/3/4800K (gRPC/Protobuf):
• HTTP 2.0/gRPC based
• Anomaly Event, RF Stat, PCAP, Spectrum
• Scheduled and Automated
ME, WLC3504/5520/8540 (https/JWT):
• Supported from AireOS 8.5
• Real-Time client event
Catalyst 9800 Series (TLS/TDL):
• KPI Parity with AireOS
• Immediate Event Update
• Embedded Wireless in Cat9300
Active Sensor AP1800S (AP WSA/JWT):
• HTTPS for Automation and reporting
• PnP-based Provisioning
• Fully Managed by Cisco DNAC

Cisco DNA Center: Built ground-up for Assurance
Real-Time and Context based Telemetry:
• Client RF stats, Onboarding state and location (<5 sec)
• Client Onboarding Health with Sankey charts for better analysis
• Near-Real time Client tracking
1800s Sensor to validate user experience:
• Floor reassignment to make 1800s sensor mobile
• Speed tests to validate Cloud app connectivity
• IP SLA tests for Real-time AppX assessment
Intelligent Capture for proactive troubleshooting:
• Live and In-Service capture of Onboarding failures with PCAPs
• Spectrum Analyzer for analyzing Interference sources
• On-Demand AP stats for Wi-Fi troubleshooting

Wireless Assurance: Client Onboarding
1. Actionable Dashboards: Onboarding Sankey charts for better analysis
2. Real-time Correlation: Correlate Onboarding events with poor RF and client location for RCA
3. Intelligent Capture: Onboarding failures with In-service PCAPs

Wireless Assurance: Sensors to monitor SLAs
Sensor based SLA Monitoring
1. Simulate Client perspective: 1800s Sensor is mobile with floor re-assignment
2. Active Testing: Test the cloud app performance and Real-time AppX assessment
3. SLA Dashboard: Onboarding, Network Services, Cloud App Performance and IP SLA

Cisco Sensors: Intelligence of Cisco DNA Assurance to the edge
Aironet 1800S Active Sensor: Purpose-built Hardware for Analytics
• 2x2 with 2 spatial streams
• Multiple powering options
  - PoE Power
  - USB Type “C” power
  - Direct AC Power Plug
• Integrated BLE
• Ultra compact form factor
AP as a Sensor* (1800/2800/3800/4800):
• Can be configured as a dedicated Sensor when it is set to AP as a Sensor
• Automatically converted to Sensor or AP by Cisco DNAC
Capabilities:
• Onboarding & Services Tests
• Configure Tests Remotely
• Global Issue Creation
• Dynamic Sensor Test Trigger
• SLA Dashboard
Test Your Network Anywhere at Any time at Real-world Client Level

*AP2800/3800/4800 w/ 8.5MR4 or [Link]

Wireless Assurance: Client Health and Intelligent Capture
Client and Network Experience
1. Health Dashboard: Near-Real time Client tracking (<60 sec) and Top N AP analytics
2. Client 360: Historical Time travel with client RF correlated with the Onboarding events
3. Intelligent Capture: On-Demand AP stats for Wi-Fi troubleshooting

Know your clients!
Client Insights – Apple iOS Analytics
1. Device Profile: supports per device-group Policies and Analytics. The client shares these details:
  • Model, e.g. iPhone 7
  • OS details, e.g. iOS 11
2. Wi-Fi Analytics: insights into the client's view of the network. The client shares these details:
  • BSSID
  • RSSI
  • Channel #
3. Assurance: provides clarity into the reliability of connectivity. The client shares these details:
  • Error code explaining why it previously disconnected

Cisco DNA Wireless Assurance
• Be proactive: Use Sensor-based verification for critical services!
• Know your clients: Cisco/Apple WiFi iOS Analytics.
• Intelligent Capture: Whose fault is it? “Always on” packet capture helps to differentiate between RF and application/client issues.
• Go back in time: What happened yesterday/last week?
• Actionable Insights: Provide guidance on how to solve the issue.

I would like to leave you
with…
Key Takeaways

• High Availability for Wireless is a multi-level approach, starting from Level 1 (RF)
• You have different solutions to choose from, based on the downtime that is acceptable for your business application
• Cisco Controller SSO eliminates network downtime upon controller failure
• Wireless Assurance is key to assess your network stability and proactively test

Selected additional Wireless Sessions…
For Your
Reference

• Cisco DNA Wireless Assurance: Isolate problems for faster troubleshooting - BRKEWN-2034
• Tuesday, Jan 29, 2:30 PM - 4:00 PM | Hall 8.0, Session Room A108

• Advanced Troubleshooting of Cisco Catalyst 9800 Wireless Controller - BRKEWN-3013


• Friday, Feb 01, 11:30 AM - 1:00 PM | Hall 8.0, Session Room C129

• Cisco DNA Center Assurance and Analytics– Reducing Time to resolution using Big Data and
Machine Learning - BRKNMS-2542
• Wednesday, Jan 30, 2:30 PM - 4:00 PM | Hall 8.0, Session Room D134

• Cisco SD-Access Wireless Integration - BRKEWN-2020


• Friday, Feb 01, 9:00 AM - 11:00 AM | Hall 8.0, Session Room C118

• How to setup an SD Access Wireless fabric from scratch - BRKEWN-2021


• Thursday, Jan 31, 8:30 AM - 10:30 AM | Hall 8.0, Session Room C126

• Improve Enterprise WLAN Spectrum Quality with Cisco's advanced RF capacities (RRM, CleanAir,
ClientLink, etc) - BRKEWN-3010
• Wednesday, Jan 30, 8:30 AM - 10:30 AM | Hall 8.0, Session Room A103
• Thursday, Jan 31, 11:00 AM - 1:00 PM | Hall 8.0, Session Room C126

Agenda
• Enterprise Data Center High Availability (DC HA)
• DC Switch NX-OS HA Architecture and HA Features
• DC Network HA Design and Operational Best Practices
  • Legacy DC with vPC
  • Programmable Fabric
  • Application Centric Infrastructure (ACI)
  • Programmable Network
• Key Takeaways

Data Center HA Section Objective
• Focus is on Enterprise Data Center Network
• High Availability design options and best practices
• High Availability operational best practices
• Same principle: The Enterprise Campus Network High Availability concepts
are applicable to Data Center network
• Same goal: minimize network downtime

NX-OS HA Architecture
• Fully distributed modular design
• Control-plane & data-plane separation
• Service restart-ability
• Non-disruptive SSO* & ISSU
(Diagram: layered stack with feature modules, the Feature API, Management Infrastructure and HA Infrastructure APIs, system-infrastructure modules, platform-dependent hardware-related modules, hardware drivers, netstack and kernel.)
*SSO only available on dual-sup Nexus 7x00 and 9500

NX-OS Service Restart-ability
• Stateful Restart with Persistent Storage Service (PSS)
  • Checkpoints states to PSS
  • Recover states from PSS upon restart
• Stateful Restart with Graceful Restart
  • Recover states based on information from other services and/or network
  • Mainly routing protocols
• Stateless Restart
  • Fresh start, no trace of former instantiation

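The difference between a stateful restart from a checkpoint and a stateless restart can be sketched in a few lines of Python. This is an illustration only: the class name, the JSON file standing in for PSS, and the `hsrp` example are assumptions, not how NX-OS implements the Persistent Storage Service.

```python
import json
import tempfile
from pathlib import Path


class CheckpointedService:
    """Toy service that checkpoints its runtime state to a file-backed store."""

    def __init__(self, name: str, store_dir: Path) -> None:
        self.name = name
        # Hypothetical stand-in for PSS: one JSON file per service.
        self._path = store_dir / f"{name}.json"
        self.state: dict = {}

    def update(self, key: str, value: str) -> None:
        self.state[key] = value
        # Checkpoint after every state change.
        self._path.write_text(json.dumps(self.state))

    def restart(self, stateful: bool) -> None:
        # A stateful restart recovers the checkpoint; a stateless one starts fresh.
        if stateful and self._path.exists():
            self.state = json.loads(self._path.read_text())
        else:
            self.state = {}


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        svc = CheckpointedService("hsrp", Path(tmp))
        svc.update("group1", "active")
        svc.state = {}                    # simulate the process crashing
        svc.restart(stateful=True)
        after_stateful = dict(svc.state)
        svc.restart(stateful=False)
        after_stateless = dict(svc.state)
        return after_stateful, after_stateless
```

Calling `demo()` returns the recovered state after the stateful restart and an empty state after the stateless one.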
NX-OS NSF with Stateful Fault Recovery
If a fault occurs in a process…
• HA manager determines best recovery action (restart process, switchover to redundant supervisor)
• Process restarts with no impact on data plane
• State checkpointing (PSS) allows instant, stateful process recovery
• Software utilizes Graceful Restart where appropriate
[Diagram labels: control-plane processes (OSPF, BGP, HSRP, LACP, STP, PIM, IPv6, TCP/UDP, etc.), software RIB, HA Manager, Linux kernel; "Restart process!"; graceful restart and routing updates; table update; Nexus data plane with hardware FIB]

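The recovery decision described above can be sketched as a small policy function. The restart limit, the enum names, and the fallback when no standby supervisor is present are assumptions made for the example; the slides only state that the HA manager chooses between restarting the process and switching over to the redundant supervisor.

```python
from enum import Enum


class Fault(Enum):
    PROCESS_CRASH = "process crash"
    KERNEL_PANIC = "kernel panic"
    SUP_HARDWARE = "supervisor hardware failure"


class Action(Enum):
    RESTART_PROCESS = "restart process"
    SUPERVISOR_SWITCHOVER = "supervisor switchover"
    SUPERVISOR_RELOAD = "supervisor reload"  # assumption: not stated on the slides


MAX_RESTARTS = 3  # hypothetical policy value


def recovery_action(fault: Fault, failed_restarts: int, has_standby_sup: bool) -> Action:
    """Pick a recovery action for a fault, preferring the least disruptive one."""
    if fault is Fault.PROCESS_CRASH and failed_restarts < MAX_RESTARTS:
        return Action.RESTART_PROCESS
    # Repeated restart failures, kernel panics, and hardware faults escalate.
    if has_standby_sup:
        return Action.SUPERVISOR_SWITCHOVER
    return Action.SUPERVISOR_RELOAD
```

For example, `recovery_action(Fault.PROCESS_CRASH, 0, True)` selects a process restart, while a kernel panic on a dual-supervisor system selects a switchover.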
NX-OS NSF with Stateful Supervisor Switchover
• Supervisor switchover triggers:
• HA policy initiated
• Process restarts have failed
• When the kernel fails (or panics)
• When the Supervisor experiences a hardware failure
• User initiated – system switchover
[Diagram labels: Active Sup, Standby Sup, LC - NSF]

NX-OS NSF Configuration
• The Nexus products are “NSF Capable” by default for all the routing protocols in all NX-OS software releases.
• No additional configuration is required unless you need to modify the default NSF timers.

Nexus# show running-config ospf all

!Command: show running-config ospf all
!Time: Tue May 19 [Link] 2009

version 4.2(1)
feature ospf

<snip>

router ospf 1
  graceful-restart
  graceful-restart grace-period 60
  area [Link] authentication message-digest

NX-OS Software Maintenance Upgrade
Direction
• Non-Disruptive Bug Fix for re-startable/stateful processes
• Works with or without ISSU
• For Operationally Impacting Bugs with no workaround
• Platform and process specific

• Limited number of Patches supported
• Not every bug will have a patch
• May be disruptive

NX-OS Stateful Process Restart & Patching
• NX-OS services Checkpoint their runtime state to the Persistent Storage Service
When a process is patched…
• Install process applies new patch
• HA manager restarts process
• Process restarts with patched code and no impact on data plane
• State is recovered, operation resumes
• Total Recovery Time ~10s of ms
[Diagram labels: Control-Plane processes (OSPF 1, EIGRP, BGP, HSRP 1, HSRP 2, OTV, vPC, UDLD, SSH, IGMP, STP), "Restart process!" on BGP, Management Infrastructure, HA Infrastructure, Hardware Drivers, Netstack, Kernel, Data-Plane]

Software Patching: CLI procedure
Apply an SMU:
1. Copy to Device (from the SMU repository)
2. N7K> Install Add
3. N7K> Install Activate (SMU Applied)
4. N7K> Install Commit (SMU Committed)
Remove an SMU:
1. N7K> Install Deactivate
2. N7K> Install Commit (SMU Committed)
3. N7K> Install Remove (SMU Removed)
Verify:
• Show Install Active
• Show Install Committed
• Show Install Inactive
• Show Install Packages

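The order of the install steps can be checked with a small state machine. This is a sketch: the state names (`absent`, `inactive`, `active`, `committed`, `deactivated`) are invented for the example, and only the command order comes from the slide.

```python
# Hypothetical state names; the commands mirror the install steps on the slide.
TRANSITIONS = {
    ("absent", "add"): "inactive",
    ("inactive", "activate"): "active",
    ("active", "commit"): "committed",
    ("committed", "deactivate"): "deactivated",
    ("deactivated", "commit"): "inactive",
    ("inactive", "remove"): "absent",
}


def apply(state: str, command: str) -> str:
    """Return the next SMU state, or raise if the command is out of order."""
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(f"'install {command}' is not valid in state '{state}'") from None


def run(commands: list, state: str = "absent") -> str:
    """Apply a sequence of install commands starting from the given state."""
    for command in commands:
        state = apply(state, command)
    return state
```

For instance, `run(["add", "activate", "commit"])` ends in the committed state, and trying to activate a patch that was never added raises a `ValueError`.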
Dual Supervisor Standard ISSU
N7K# install all kickstart bootdisk:7.2-kickstart system bootdisk:7.3-system

1. Upgrade standby supervisor
2. Reload standby supervisor
3. Perform SSO
4. Upgrade standby supervisor
5. Reload standby supervisor
6. Upgrade LCs & FEX in series *
[Diagram: Sup 1 and Sup 2 move from Release 7.2 to Release 7.3, exchanging the Active and Standby roles at the SSO]

Fixed Switch (ToR) Standard ISSU
• Control plane is inactive during reload while data plane is forwarding
• ISSU is disruptive for L3 services
• ISSU is non-disruptive for L2 services
• STP enabled switches cannot be present downstream
Upgrade sequence:
1. Reload supervisor
2. Load new version
3. Restore control plane and configuration
4. Reconcile with Data Plane
[Diagram labels: Control-Plane, Supervisor, Version 7.2 to 7.3, Data-Plane]

Fixed Switch (ToR) Enhanced ISSU (N3K/N9K)
#install all nxos [Link]
• Container B spawned to boot up with NX-OS (V2)
• Container B becomes Active
• Container A, running NX-OS (V1), destroyed
• NX-OS upgraded to V2 with ~3-5 seconds impact to Control plane traffic
[Diagram: Container A (NX-OS V1) and Container B (NX-OS V2) on the Host OS (Linux)]

In-Service Software Upgrade

ISSU | NX-OS Switch | Traffic Loss
Standard ISSU | Dual supervisor modular switch: N9500, N7700, N7000 | Control plane: <3-5 sec; Data plane: 0/no service disruption
Standard ISSU | Fixed switch: N9300, N3000, N5500, N5600, N6000 | Control plane: < 120 sec; Data plane: 0/no service disruption
Enhanced ISSU | Fixed switch: N9300, N3000 | Control plane: <3-5 sec; Data plane: 0/no service disruption

NX-OS ISSU Best Practices
• For Layer 2 and Layer 3 protocols with sensitive timers, the timeout value should be increased. Otherwise, the upgrade will be disruptive
• vPC ToR best practices
  • Make sure that both vPC peers are in the same mode (traditional ISSU mode or enhanced ISSU mode)
  • Connect host using port-channel to a pair of vPC ToR
  • If ToR vPC is STP root bridge: enable peer-switch to avoid STP root change during ISSU
  • If ToR vPC is not STP root bridge: enable all ports as edge/edge trunk ports

Graceful Insertion and Removal for NXOS
Isolation of Switch from network
• Change window begins.
• One command: system mode maintenance
• Pre-change System Snapshot

Graceful Insertion and Removal for NXOS
Return of Switch into network
• Change window complete.
• One command: no system mode maintenance
• Post-change System Snapshot

Configuration Profiles
• Maintenance-mode profile is applied when entering GIR mode.
• Normal-mode profile is applied when GIR mode is exited.

Automatic Profiles
• Generated by default
• Parses configuration to determine changes going into and out of GIR
• Changes based on base protocol configuration settings.
• Use: Maintenance Windows

Custom Profiles
• User created profile for maintenance-mode and normal-mode
• Flexible selection of protocols for isolation
• Use: maintenance windows and isolation during troubleshooting

Graceful Insertion and Removal Feature
Graceful Removal with Isolate command
• New CLI 'isolate' in all Unicast Protocols
• Make Nexus undesirable for all transit traffic
• Maintain Protocol Adjacencies
• Send route withdrawals/worse metrics
• Local route states are maintained.
• Multicast follows Unicast for RPF
• Feature available: N5K/6K: 7.3(0)N1(1); N7K: 7.3(0)D1(1); N9K/N3K: 7.0(3)I2(1)

//Sends route withdraws
router bgp 33
  isolate
//Poisons the routes by sending highest metric
router eigrp 1
  isolate
//Advertises max-metric router-lsa
router ospf 1
  isolate
//Refreshes LSPs with overload-bit on
router isis 1
  isolate

GIR – Platform Specifics

Nexus 5K/6K
• Shutdown: Only support shutdown mode from 7.1(0)N1(1)
  Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+, Interfaces
• Isolate: Default mode is isolate from 7.3(0)N1(1), shutdown is optional mode
  Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+ (shutdown only), Interfaces (shutdown only)

Nexus 7K
• Shutdown: Only support shutdown mode from 7.2(0)D1(1)
  Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+, Interfaces
• Isolate: Default mode is isolate from 7.3(0)D1(1), shutdown is optional mode
  Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+ (shutdown only), Interfaces (shutdown only)

Nexus 9K/3K
• Shutdown and Isolate: Default mode is isolate from 7.0(3)I2(1), shutdown is optional mode
  Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, PIM (on vPC), RIP, vPC (shutdown only), Interfaces (shutdown only)

Putting it all Together
What to use? GIR Mode? Patching? ISSU? All of them?

Option | Critical Bug Fix & PSIRT | Hardware Upgrade | New Features
ISSU | ✓ | X | ✓
GIR + Cold Boot | ✓ | X | ✓
GIR + Disruptive Installer | ✓ | X | ✓
SMU Restart | ✓ | X | X
GIR + SMU Reload | ✓ | X | X
GIR | X | ✓ | X

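The option matrix above lends itself to a simple lookup. The sketch below encodes the table as a Python dictionary; the key names (`bug_fix`, `hardware_upgrade`, `new_features`) are shorthand invented for the example.

```python
# Each option maps to the situations it fits, copied from the matrix above.
OPTIONS = {
    "ISSU": {"bug_fix": True, "hardware_upgrade": False, "new_features": True},
    "GIR + Cold Boot": {"bug_fix": True, "hardware_upgrade": False, "new_features": True},
    "GIR + Disruptive Installer": {"bug_fix": True, "hardware_upgrade": False, "new_features": True},
    "SMU Restart": {"bug_fix": True, "hardware_upgrade": False, "new_features": False},
    "GIR + SMU Reload": {"bug_fix": True, "hardware_upgrade": False, "new_features": False},
    "GIR": {"bug_fix": False, "hardware_upgrade": True, "new_features": False},
}


def options_for(situation: str) -> list:
    """List the maintenance options marked as suitable for a situation."""
    return [name for name, fits in OPTIONS.items() if fits[situation]]
```

A hardware upgrade leaves GIR as the only option, while a critical bug fix can be handled by every option except plain GIR.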
Data Center Fabric Technology Evolution
• 2008 and before: STP
• 2009: vPC
• 2010: FabricPath
• 2014: VXLAN F&L
• 2015-2019: VXLAN EVPN and ACI

Cisco Data Center Network Solutions
• Classic Ethernet & VPC
• Programmable Fabric
• Application Centric Infrastructure
• Programmable Network

High Availability Design Principle
Structure, Modularity, and Hierarchy

• Structured design
• Allows you to manage and understand traffic flows, and network failure behavior
• Modular design
• Allows for easier evolution and change to the network
• Hierarchical design
• Provides for improved scalability
• Separates network services into manageable building blocks

High Availability Design Principle
Structure, Modularity, and Hierarchy
• Optimize the interaction of the physical redundancy with the network
protocols
• Provide the necessary amount of redundancy
• Pick the right protocol for the requirement
• Optimize the tuning of the protocol
• Optimize network convergence failure detection and recovery
• Optimal high availability network design attempts to leverage ‘local’ switch fault
detection and recovery
• Design should leverage the hardware capabilities of the switches to detect and
recover traffic flows based on these ‘local’ events

vPC Feature Overview
vPC Terminology
• vPC Peer
• vPC Peer Keepalive Link
• vPC Domain
• Peer Link
• CFS
• vPC
• Orphan Port
• Orphan Device
[Diagram: vPC peers S1 (P) and S2 (S) form a vPC domain below a Layer 3 cloud, joined by the peer link and the peer keepalive link; S3 attaches through a vPC, and an orphan device attaches through an orphan port]

vPC is supported on Cisco Nexus switches (N5k, N6k, N7k, N9k, N3k)

vPC Failure Scenario
vPC Peer-Keepalive Link up & vPC Peer-Link down
• vPC peer-link failure (link loss):
  • vPC peer-keepalive up
  • Status of other vPC peer known
  • Secondary vPC peer disables all vPCs
  • Traffic forwarded by vPC primary
[Diagram: S1 (P, primary vPC) and S2 (S, secondary vPC) exchange keepalive heartbeats over the vPC peer-keepalive link while vPC_PLink is down; the secondary vPC member ports toward SW3 (vPC1) and SW4 (vPC2) are suspended]

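The peer-link failure behavior can be sketched as a single predicate. The function name and the string role values are assumptions for the example, and only the scenario on the slide (peer-link down, peer-keepalive up) is modeled; every other combination simply returns `True` here as a simplification.

```python
def vpc_member_ports_forwarding(role: str, peer_link_up: bool, keepalive_up: bool) -> bool:
    """Return True if this vPC peer keeps its vPC member ports forwarding.

    Models only the slide's scenario: when the peer-link is down but the
    peer-keepalive still shows the other peer alive, the secondary peer
    suspends its vPC member ports and the primary keeps forwarding.
    """
    if role not in ("primary", "secondary"):
        raise ValueError(f"unknown vPC role: {role}")
    if role == "secondary" and not peer_link_up and keepalive_up:
        return False
    return True
```

With the peer-link down and the keepalive up, the call returns `False` for the secondary and `True` for the primary, so attached devices keep one active path.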
Legacy DC HA Design with vPC
• Core Layer
  • Layer 3 ECMP for multipath redundancy
• Aggregation Layer
  • HSRP / VRRP / GLBP with vPC for active/active gateway
  • Use default FHRP timers
• Access Layer
  • Connect to a pair of Aggregation switches via Layer 2 port-channel
  • Redundant uplinks
[Diagram: Core S1/S2, Aggregation S3/S4, Access S5/S6]

Legacy DC HA Design with vPC
• Access Layer
  • Double-sided vPC connecting to Aggregation layer
  • Higher resilience
  • Different vPC domain ID
[Diagram: Core; Aggregation in vPC Domain 10; Access in vPC Domain 20]

VSS vs vPC
• Catalyst VSS: Merge Data and Control Plane
• Nexus vPC: Merge Data Plane only!! Control Plane still separate
[Diagram: Catalyst Non VSS vs VSS; Nexus Non vPC vs vPC]

VSS Design vs vPC Design
• Don’t design VSS and vPC in same way for Layer 3!
• vPC: Layer 3 routed uplink with ECMP
[Diagram: Catalyst VSS vs Nexus vPC]

vPC – Layer 2 Data Center Interconnect (DCI)
[Diagram: DC 1 (vPC domains 11 and 10) and DC 2 (vPC domains 21 and 20), each with CORE, AGGR, and ACCESS layers and a Server Cluster, interconnected by a Layer 2 vPC port-channel over long distance dark fiber, with 802.1AE optional. Ports are annotated with the legend below.]
Port legend:
• N: Network port
• E: Edge or portfast
• -: Normal port type
• B: BPDUguard
• F: BPDUfilter
• R: Rootguard

vPC Best Practices
vPC General Deployment Best Practices
• Unique vPC domain IDs in the contiguous Layer 2 domain
• Enable vPC peer-gateway, to act as the active gateway for packets addressed to the router MAC of the vPC peer
  • Keeps forwarding of traffic local to the vPC node and avoids use of the peer-link
• Enable vPC peer-switch, so that the vPC peer switches appear as a single logical entity
  • Optimized BPDU processing
• Enable auto-recovery to address two cases of single switch behavior
  • Peer-link fails and after a while primary switch fails
  • Both vPC peers are reloaded and only one comes back up
• Enable vPC orphan-ports suspend to prevent orphan device traffic blackhole during peer-link failure

vPC Configuration Best Practices
Enable Object-tracking
• vPC object tracking tracks both peer-link and uplinks in a list of Boolean OR
• Object Tracking triggered when the track object goes down
• Suspends the vPCs on the impaired device.
• Traffic forwarded over the remaining vPC peer

! Track the vpc peer link and uplinks
track 1 interface port-channel11 line-protocol
track 2 interface Ethernet1/1 line-protocol
track 3 interface Ethernet1/2 line-protocol

! Combine all tracked objects into one.
! “OR” means if ALL objects are down, this object will go down
track 10 list boolean OR
  object 1
  object 2
  object 3

! If object 10 goes down on the primary vPC peer,
! system will switch over to other vPC peer and disable all local vPCs
vpc domain 1
  track 10
[Diagram: vPC peers S1 and S2 with uplinks to S4 and S5 and a vPC down to S3]

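The Boolean OR semantics of the tracked list are easy to get backwards, so a short Python sketch may help. The function names and the interface dictionary are illustrative only; the interface names are the ones from the configuration above.

```python
def track_list_or(line_protocol_up: dict) -> bool:
    """A Boolean OR track list is up while at least one member object is up."""
    return any(line_protocol_up.values())


def local_vpcs_suspended(line_protocol_up: dict) -> bool:
    """The local vPCs are suspended only when the combined track object is down."""
    return not track_list_or(line_protocol_up)


# Example: peer-link and one uplink are down, but one uplink is still up.
partial_failure = {
    "port-channel11": False,
    "Ethernet1/1": False,
    "Ethernet1/2": True,
}

# Example: the peer-link and both uplinks are down.
total_failure = {name: False for name in partial_failure}
```

With `partial_failure` the list stays up and nothing is suspended; only `total_failure`, where every tracked interface is down, suspends the local vPCs.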
vPC Hitless Role Change Feature

Without vPC Hitless Role Change
• Traffic interruption.
• Manually flap vPC peer link
• Not Graceful

With vPC Hitless Role Change
• No traffic interruption.
• CLI: “vpc role preempt”
• Graceful

Note: supported from N7k: 7.3(0)D1(1), N9k/3k: 7.0(3)I7(1)
Not supported on N5k/6k

vPC Shutdown Feature
• Isolates a switch from the vPC complex
• Isolated switch can be debugged, reloaded, or even removed physically, without affecting the vPC traffic going through the non-isolated switch

switch# configure terminal
switch(config)# vpc domain 100
switch(config-vpc)# shutdown

Note: supported from N7k: 7.3(0)D1(1), N9k: 7.0(3)I2(2), N5k/6k: 6.0(2)N2(1)
[Diagram: Primary S1 and Secondary S2 in a vPC to S3]

Graceful Insertion and Removal Example
FHRP with vPC Switch Isolation using GIR
• Use automatic profile to go into GIR.

//Enter maintenance mode using the system mode maintenance command:
switch# configure terminal
switch(config)# system mode maintenance
Following configuration will be applied:
router ospf 100
  isolate
vpc domain 2
  shutdown

Do you want to continue (y/n)? [no] y
[Diagram labels: Core Network; L3: isolate unicast routing protocol; L2: vPC shutdown]

Legacy DC HA Design with vPC Key Takeaways
• To minimize Legacy DC down time:
• Follow vPC design best practices
• Follow vPC configuration best practices
• Follow vPC operation best practices

vPC References
• vPC Design and Configuration Best Practices:
  [Link]design/vpc_best_practices_design_guide.pdf

Cisco Data Center Network Solutions
Programmable Fabric
• Standards-based
• VXLAN BGP EVPN
• Forwarding & Multi-Tenancy
• Disaggregated Management
• Open NX-OS

Programmable Fabric Underlay HA Design
• Structured design with Spine, Leaf and Border Leaf
  • Allows you to manage traffic flow, network failure
• Layer 3 IP fabric with point-to-point link:
  • Better stability, faster convergence
  • Redundant links with ECMP
• Scale out spine leaf design
  • Better scalability and availability
[Diagram: Pod 1 with spines, leaf VTEPs, and border leaf VTEPs connecting to the External Layer-3 Network]

Programmable Fabric Underlay HA Design
• Structured design with Border Spine and Leaf
• Layer 3 IP fabric with point-to-point link:
  • Better stability, faster convergence
  • Redundant links with ECMP
• Scale out spine leaf design
  • Better scalability and availability
[Diagram: Pod 1 with border spines connecting to the External Layer-3 Network and leaf VTEPs]

Programmable Fabric Overlay HA Design
• VXLAN EVPN based overlay
• Same “Anycast” SVI IP/MAC is enabled at all VTEPs/ToRs
  • Better availability, no IP gateway relearning
  • Optimal traffic forwarding, no hair-pinning to GW
  • Enable host mobility
[Diagram: overlay across spines, leaf VTEPs, and border leaf VTEPs toward the External Layer-3 Network; every leaf carries the same SVI IP Address, MAC: 0000.1111.2222, IP: [Link]]

Programmable Fabric Host HA Design
• Host connects to a pair of vPC leaf VTEP directly (recommended)
• Host connects to a pair of vPC leaf VTEP via FEX (not recommended)
• Redundant host uplinks
[Diagram: overlay across spines, leaf VTEPs, and border leaf VTEPs toward the External Layer-3 Network]

vPC Leaf VTEP Best Practices for HA
• vPC leaf VTEP best practices
  • Enable peer-gateway
  • Enable peer switch
  • Enable IP ARP Sync
  • Use separate loopback address for VTEP source address
    • Control plane and data plane separation
    • Loopback0 is for underlay and overlay routing
    • Loopback1 with secondary IP is for VTEP source data plane

vpc domain 100
  peer-switch
  peer-keepalive destination [Link] source [Link]
  delay restore 150
  peer-gateway
  ip arp synchronize
  ipv6 nd synchronize

interface nve1
  host-reachability protocol bgp
  source-interface loopback1
  source-interface hold-down-time 180

interface loopback0
  ip address [Link]/32
  ip router ospf UNDERLAY area [Link]
  ip pim sparse-mode

interface loopback1
  ip address [Link]/32
  ip address [Link]/32 secondary
  ip router ospf UNDERLAY area [Link]
  ip pim sparse-mode

vPC Delay Restore and Source Hold Down Timer
• vPC delay restore
  • If the host connection toward the recovering Leaf 2 is brought up before the ToR can successfully establish routing adjacencies with the fabric and the peer vPC leaf node, traffic is temporarily black-holed
  • Tuning delay-restore is required regardless of the SW release
• Source hold-down timer
  • If the advertisement of the Anycast VTEP address happens before the vPC peer-link and the vPC leg connection to the host are recovered, traffic will be black-holed as well
  • A “source-interface hold-down-time” keeps the VTEP address (Loopback1) down for 180 sec (default), supported from 7.0(3)I2(2)
[Diagram labels: Spine, Leaf 1, Leaf 2 (recovering device); control plane adjacencies not recovered yet; vPC Peer-Link connection not fully established; Host-to-Leaf connection not recovered yet; Anycast VTEP Advertisement]

vPC Leaf VTEP HA Best Practices
• vPC leaf VTEP best practices
  • Enable a Layer 3 link between the two vPC VTEPs to connect them in the underlay network so that when one VTEP loses all its uplinks, it can still learn the routes through its vPC peer, and forward the traffic via its peer
  • Layer 3 link can be a dedicated link or via point-to-point VLAN SVI over vPC peer-link
[Diagram: vPC-VTEP-1 and vPC-VTEP-2 with a vPC port-channel to the host, attached to the underlay network with IP ECMP load sharing]

Programmable Fabric External Routing HA Design
• The two Border Leaf VTEPs are independent of each other. They each individually exchange routes with the external routing devices, and advertise the external routes into the EVPN fabric
• Distributed Anycast Gateway on the internal VTEPs Leafs
• BGP multi-pathing needs to be enabled on the internal VTEP to leverage both border leaf VTEPs
[Diagram: spine route reflectors (RR), VXLAN overlay with EVPN MP-BGP, leaf VTEPs with Anycast Gateway, and border leaf VTEPs running the routing protocol of choice toward IP routing in the Global Default VRF Instance or User Space VRF Instances]

Programmable Fabric External Routing HA Design
• The two Border Spines are both VTEPs. They each individually exchange routes with the external routing devices, and advertise the external routes into the EVPN fabric.
• Distributed Anycast Gateway on the internal VTEPs Leafs
• BGP multi-pathing needs to be enabled on the internal VTEP to leverage both border leaf VTEPs
[Diagram: border spine VTEPs acting as route reflectors (RR) and running the routing protocol of choice toward IP routing, VXLAN overlay with EVPN MP-BGP, and leaf VTEPs with Anycast Gateway]

Programmable Fabric Multi-X Connectivity (DCI)
VXLAN Multi-Site (2017+)
• Multiple Fabrics with Integrated DCI (DCI2)
• Hierarchical design at both overlay and underlay: Better Scale and Failure Domain Isolation between Fabrics
Recommended DCI Architecture Going Forward!!!
[Diagram: Fabric #1 (EVPN Control-Plane Domain 1, Data-Plane Domain 1) and Fabric #2 (EVPN Control-Plane Domain 2, Data-Plane Domain 2), each with VTEPs and bare-metal hosts, joined across the DCI by a BGP EVPN overlay and data plane]

VXLAN Multi-Site
Main Use Cases
• Scale-Up Model to Build a Large Intra-DC Network
• Network Extension across Multiple Sites
• Integration with Legacy Networks (Coexistence and/or Migration)

VXLAN Multi-Site Border Gateways HA Design
Possible BGWs deployment models:
• Anycast Border Gateways (supported since day 1) and recommended for interconnecting VXLAN EVPN fabrics
• VPC Border Gateways (supported since 9.2(1))

• Border Gateways used for Layer 2 and Layer 3 Site-to-Site communication (East-West traffic)
• Border Gateways are often deployed also as Border Leaf nodes for Site to External Layer 3 communication (North-South traffic)
[Diagrams: Site 1 with Anycast Border Gateways (four BGW VTEPs); Site 1 with VPC Border Gateways (a pair of BGW VTEPs)]

VXLAN Multi-Site VPC Border Gateways
DCI Use Cases
• Migration/Coexistence of a legacy site with one (or more) new VXLAN EVPN fabrics
• Replacing ‘legacy’ DCI solutions (vPC, OTV, VPLS, etc.)
[Diagrams: a Greenfield Site (BGWs, spines, VTEPs) connected to a Legacy Site through BGWs; Legacy Site 1 connected to Legacy Site 2 through BGWs]

VXLAN Multi-Site
Failure Detection on Anycast BGWs – Fabric Isolation
• The Site-Internal interfaces on BGW nodes are constantly tracked to determine their status (‘evpn multisite fabric-tracking’ command)
• If all the Site-Internal interfaces are detected as down:
  1. The isolated BGW stops advertising PIP/VIP addresses toward the Site-External network
  2. The remaining BGWs perform new DF elections for the L2VNIs owned by the isolated BGW
• As a result, the BGW becomes isolated from both the Site-Internal and Site-External networks
• Seamless BGW node reinsertion using a “delay-restore” timer for the VIP address
[Diagram: Site 1 with four BGW VTEPs between the Site-External and Site-Internal networks (spines), sharing the Multi-Site VIP [Link]; PIP-BGW2, PIP-BGW3, and PIP-BGW4 remain advertised]

VXLAN Multi-Site
Failure Detection on Anycast BGWs – DCI Isolation
• The Site-External interfaces on BGW nodes are also tracked to determine their status (‘evpn multisite dci-tracking’ command)
• If all the Site-External interfaces are detected as down, the isolated BGW node:
  1. Stops advertising VIP VTEP address toward the Site-Internal network
  2. Withdraws BGP EVPN Type-4 advertisements (triggering a new DF election between other BGWs)
  3. Starts functioning as a regular VTEP (PIP still up)
• As a result, the BGW continues to operate as a Site-Internal VTEP
• Seamless BGW node reinsertion using a “delay-restore” timer for the VIP address
[Diagram: Site 1 with four BGW VTEPs (PIP-BGW1 to PIP-BGW4) sharing the Multi-Site VIP [Link], between the Site-Internal network and the Site-External DC Core (Layer-3 Unicast)]

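The two isolation cases can be summarized in a short Python sketch. The dataclass, field names, and role strings are invented for the example, and the case where both interface sets are down is treated as fabric isolation, which is an assumption rather than something the slides state.

```python
from dataclasses import dataclass


@dataclass
class BgwState:
    advertises_vip: bool   # Multi-Site VIP advertised
    advertises_pip: bool   # BGW's own PIP advertised
    role: str              # hypothetical label for the resulting behavior


def bgw_state(fabric_links_up: int, dci_links_up: int) -> BgwState:
    """Derive BGW behavior from the tracked Site-Internal and Site-External links."""
    if fabric_links_up == 0:
        # Fabric isolation: PIP and VIP are no longer advertised.
        return BgwState(advertises_vip=False, advertises_pip=False, role="isolated")
    if dci_links_up == 0:
        # DCI isolation: VIP is withdrawn, PIP stays up, node acts as a regular VTEP.
        return BgwState(advertises_vip=False, advertises_pip=True, role="regular VTEP")
    return BgwState(advertises_vip=True, advertises_pip=True, role="border gateway")
```

For example, `bgw_state(2, 0)` describes a node that has lost every DCI link: it stops advertising the VIP but keeps its PIP and continues as a regular VTEP.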
Graceful Insertion and Removal Example
VXLAN BGP EVPN Leaf vPC VTEP Isolation with GIR
• Use automatic profile to go into GIR.

//Enter maintenance mode using the system mode maintenance command:
switch# configure terminal
switch(config)# system mode maintenance
Following configuration will be applied:
ip pim isolate
router bgp 1
  isolate
router ospf UNDERLAY
  isolate
vpc domain 1000
  shutdown

Do you want to continue (yes/no)? [no] y

System mode operation completed successfully
switch(config)# end
[Diagram: spines S10 and S20 in the VXLAN BGP EVPN fabric above a vPC leaf pair serving Host1 and Host2]

Graceful Insertion and Removal Example
VXLAN BGP EVPN Leaf vPC VTEP Isolation with GIR
• Use automatic profile to come out of GIR.

//Exit maintenance mode using the no system mode maintenance command:
switch# configure terminal
switch(config)# no system mode maintenance
Following configuration will be applied:
vpc domain 1000
  no shutdown
router ospf UNDERLAY
  no isolate
router bgp 1
  no isolate
no ip pim isolate

Do you want to continue (yes/no)? [no] y

System mode operation completed successfully
switch(config)# end
[Diagram: spines S10 and S20 in the VXLAN BGP EVPN fabric above a vPC leaf pair serving Host1 and Host2]

Graceful Insertion and Removal Example
VXLAN BGP EVPN Spine RR Isolation with GIR
• Use automatic profile to go into GIR.

//Enter maintenance mode using the system mode maintenance command:
switch# configure terminal
switch(config)# system mode maintenance
Following configuration will be applied:
router bgp 100
  isolate
router ospf 1
  isolate

Do you want to continue (yes/no)? [no] y

System mode operation completed successfully
switch(config)# end
[Diagram: spines S10 and S20 in the VXLAN BGP EVPN fabric above leaves serving Host1 and Host2]

Programmable Fabric HA Takeaways
• Programmable fabric HA design
• Spine leaf L3 IP fabric with ECMP
• VXLAN EVPN fabric with Anycast GW
• vPC for host
• Multiple DC fabric HA design
• VXLAN Multi-Site
• Follow configuration and operational best practices to minimize down time
for different failure scenarios

Programmable Fabric Resources
• VXLAN Network with MP-BGP EVPN Control Plane Design Guide
  • [Link]switches/[Link]
• VXLAN EVPN Multi-Site Design and Deployment White Paper
  • [Link]switches/[Link]#_Toc498025653
• BRKDCN-3378 Building DataCenter Networks with VXLAN BGP EVPN
• BRKDCN-2035 VXLAN BGP EVPN based Multi-Site

Cisco Data Center Network Solutions
[Diagram columns: Classic Ethernet & vPC, Programmable Fabric, Application Centric Infrastructure, Programmable Network; application tiers Web, App, DB]

• VXLAN-based
• Forwarding, Multi-Tenancy & Security
• Integrated Controller with Enhanced APIs

ACI Fabric Underlay HA Design
• Zero touch provision
• Structured design
• Layer 3 IP fabric with point-to-point link:
• Better stability, faster convergence
• Redundant links with ECMP
• Scale out spine leaf design
• Better scalability and availability
[Diagram: Application Policy Infrastructure Controller attached to the ACI fabric; SVI IP address with MAC 0000.1111.2222 and IP [Link]]

ACI Fabric Overlay HA Design
• eVXLAN EVPN based overlay
• Same “Anycast” SVI IP/MAC is enabled at all VTEPs/ToRs
• Better availability, no IP gateway relearning
• Optimal traffic forwarding, no hair-pinning to GW
• Enable host mobility
[Diagram: Application Policy Infrastructure Controller attached to the ACI fabric overlay; SVI IP address with MAC 0000.1111.2222 and IP [Link]]

ACI Fabric Host vPC HA Design
[Diagram: ACI spine nodes and ACI leaf nodes forming the ACI fabric, with hosts attached below]
• Host vPC to ACI leaf nodes
• Host vPC to FEX

vPC in ACI Fabric
• Differences between ACI vPC and standard vPC
• No Peer Link is required
• Peer communication/path recovery happens via the Fabric
• CFS (Cisco Fabric Services) is replaced by IFS (ACI Fabric Services), which is based on Zero Message Queue (ZMQ)
• Forwarding selection (which peer will forward a frame)
• Within the Fabric the vPC interfaces use an anycast VTEP which is active on both vPC peers
[Diagram: host or switch dual-homed to two leaf VTEPs that share a vPC anycast VTEP; ACI Fabric Services (ZMQ) runs between the peers across the fabric]

ACI Port Tracking Policy for Uplink Failure Detection
• The port tracking policy specifies:
• Number of uplink connections that trigger the policy
• A delay timer for bringing the leaf switch access ports back up
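
The two policy parameters above can be read as a simple decision rule. The sketch below is illustrative only: the function name, the "at or below" comparison, and the sample values are assumptions, not the APIC implementation.

```python
def access_ports_up(active_uplinks, trigger_count,
                    seconds_since_recovery, restore_delay_s):
    """Return True if the leaf access ports should be up.

    active_uplinks: fabric uplinks currently up on the leaf.
    trigger_count: uplink count at or below which the policy fires
                   (assumed comparison).
    seconds_since_recovery: time since the uplinks rose above the trigger;
                            pass float("inf") if the policy never fired.
    restore_delay_s: delay timer before access ports are brought back up.
    """
    if active_uplinks <= trigger_count:
        return False  # policy triggered: access ports are held down
    return seconds_since_recovery >= restore_delay_s


if __name__ == "__main__":
    # Illustrative values: trigger at 0 uplinks, 120 s restore delay.
    for uplinks, elapsed in [(0, 0), (2, 30), (2, 120)]:
        print(uplinks, elapsed, access_ports_up(uplinks, 0, elapsed, 120))
```

Holding the access ports down until the delay expires gives the leaf time to rebuild its fabric state before hosts start sending traffic to it again.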

ACI Fabric Convergence Improvement
• Convergence improvement from sub-seconds to 200 ms for ACI 3.1
• With new Cloudscale ASIC N9Ks
• Failure scenarios with convergence improvement:
• Fabric (between leaf and spine): link failure, Spine reload/upgrade, Spine linecard reload, Leaf reload/upgrade, power failure of Spine
• Access link/node with vPC or portchannel
• External (Border Leaf) connectivity (L3 out): link failure, Border Leaf reload/upgrade
• Achieved by special ASIC capability and software design

ACI Fabric Convergence Improvement
• Failure scenarios not covered:
• Double failure, L2/L3 multicast, copper links, process crashes on Leaf/Spine/Border Leaf, etc.
• Convergence for traffic from an endpoint (EP) to the ACI fabric depends on how fast the EP is able to divert traffic to the ACI Leaf
• Convergence for traffic from an external node to the ACI fabric depends on how fast the external node is able to divert traffic to ACI

Fabric Fast Convergence - Enable LBX

It’s a per-leaf configuration.
Fabric ERSPAN can’t be enabled per uplink port with this feature enabled.

Access Link with vPC or PortChannel Fast Convergence - Debounce Policy Configuration
Reduce the debounce timer from the default 100 ms to 10 ms for faster convergence, under Fabric Access Policy
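
The effect of the debounce timer can be sketched as follows. This is a simplified model under stated assumptions (a link-down event is reported only once the loss of signal has lasted for the full debounce interval); the function names are hypothetical.

```python
def link_down_reported(signal_loss_ms, debounce_ms):
    """Report link-down only when the loss outlasts the debounce timer."""
    return signal_loss_ms >= debounce_ms


def added_detection_delay_ms(debounce_ms):
    """In this model the debounce timer adds its full length to detection."""
    return debounce_ms


if __name__ == "__main__":
    saving = added_detection_delay_ms(100) - added_detection_delay_ms(10)
    print("Detection time saved (ms):", saving)
    # A 50 ms loss of signal is hidden at 100 ms but reported at 10 ms.
    print(link_down_reported(50, 100), link_down_reported(50, 10))
```

The trade-off is that a shorter timer also reports brief flaps that the 100 ms default would have suppressed.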

ACI Fabric Fast Convergence Best Practices
• Always use vPC
• Distribute scale
• 100 L3out per Leaf
• 50 BD per Leaf
• Use static EPG instead of L2out

ACI Fabric Maintenance Mode
• New decommission option from ACI 3.0(1k)
• Helps to isolate the switch in the ACI fabric while keeping management access to the switch
• Prior to ACI 3.0(1k): decommission options are Regular or Remove from controller.
• The switch reboots and is wiped of all configuration
• ACI 3.0(1k): Maintenance Mode (Debug mode) or decommission
• With Maintenance mode, the switch is not in the active forwarding path
• It can be accessed via the management port, and logs can be collected for debugging

Spine Node Maintenance Mode Decommission
• IS-IS on spine advertises routes with max metric
• OSPF, EIGRP and BGP do graceful shutdown on IPN/GOLF link
[Diagram: spine connected to IPN and GOLF. IS-IS sets max metric, so traffic goes through different paths. IPN ports are still up but the OSPF neighbor is down.]

Spine Node Insertion Recommission
• Spine switch reboots and is wiped of all configuration
• After the switch comes up and is discovered by APIC, the policy is programmed on the switch
• After the switch configuration is done, the switch establishes IS-IS, OSPF and BGP peers. Then the switch will be in the active forwarding path. Max metric will be set for 10 mins during startup. Thus, paths through the switch will be less preferred for internal traffic for 10 mins.

Leaf Node Maintenance Mode Decommission
• IS-IS on Leaf node advertises route with max metric
• OSPF, EIGRP and BGP do graceful shutdown
• vPC shuts down Keep-Alive & Peer Link
• Shut down all front panel ports and directly connected IFC ports (cuts laser on the port)
[Diagram: leaf sets max metric so traffic goes through different paths; graceful shutdown for L3out; vPC shutdown; front panel ports shut down]

Leaf Node Insertion Recommission
• The switch reboots and is wiped of all configuration
• After the switch comes up and is discovered by APIC, the policy is programmed on the switch
• After the switch configuration is done, the switch will establish IS-IS, OSPF and BGP peers. Then the switch will be in the active forwarding path. Max metric will be set for 10 mins during startup. Thus, paths through the switch will be less preferred for internal traffic for 10 mins.
• There is a 2 min delay before the vPC ports are brought up.
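
The two recommission timers can be laid out on one timeline. The sketch below assumes both timers start at the same moment (when the leaf has its configuration and begins peering), which the slide does not state; the function and key names are hypothetical.

```python
MAX_METRIC_HOLD_S = 10 * 60   # max metric advertised for 10 minutes
VPC_PORT_DELAY_S = 2 * 60     # vPC ports held down for 2 minutes


def leaf_state(seconds_since_start):
    """Return the timer-driven state of a recommissioned leaf."""
    return {
        "vpc_ports_up": seconds_since_start >= VPC_PORT_DELAY_S,
        "max_metric": seconds_since_start < MAX_METRIC_HOLD_S,
    }


if __name__ == "__main__":
    for t in (60, 300, 600):
        print(t, leaf_state(t))
```

Between the two timers the vPC ports are up while the leaf is still less preferred inside the fabric, so the leaf eases back into the forwarding path rather than taking full load at once.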

ACI 3.0 Release
ACI Multi-Site
[Diagram: Site 1 (Availability Zone ‘A’) and Site 2 (Availability Zone ‘B’) connected over an Inter-Site Network, with VXLAN in the data plane and MP-BGP EVPN in the control plane; Multi-Site Orchestrator with REST API and GUI]

• Separate ACI Fabrics with independent APIC clusters
• No latency limitation between Fabrics
• ACI Multi-Site Orchestrator pushes cross-fabric configuration to multiple APIC clusters providing scoping of all configuration changes
• MP-BGP EVPN control plane between sites
• Data Plane VXLAN encapsulation across sites
• End-to-end policy definition and enforcement

ACI Multi-Site
Main Use Cases
• Scale-Up Model to Build a Large Intra-DC Network
• Data Center Interconnect (DCI)

ACI Policy Upgrade
• Ability to upgrade all switches and controllers in the fabric from one place, with a single click
• Requires the upload of the new controller and switch image
• Then, create a firmware group
• Finally, create maintenance groups as needed to define which switches get upgraded at what time
• Controllers are upgraded through a different “Controller Firmware” Policy
• Controllers are kicked off at the same time (sort of like a single maintenance group) and upgrade sequentially.
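
One way to define maintenance groups so that redundant switches are not upgraded in the same window is to alternate nodes between two groups. The sketch below is a minimal illustration that assumes adjacent node IDs are redundant peers; the function name and group names are hypothetical and are not APIC objects.

```python
def split_into_groups(node_ids):
    """Alternate sorted node IDs between two maintenance groups."""
    groups = {"group-a": [], "group-b": []}
    for index, node in enumerate(sorted(node_ids)):
        key = "group-a" if index % 2 == 0 else "group-b"
        groups[key].append(node)
    return groups


if __name__ == "__main__":
    # Leaf pair 101/102 and spine pair 201/202 (illustrative IDs).
    print(split_into_groups([101, 102, 201, 202]))
```

With this split, one member of each pair keeps forwarding while the other group is being upgraded.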

ACI Maintenance Group Logic

ACI HA Key Takeaways
• ACI is a turnkey solution for Data Center fabric with built-in HA and full automation
• ACI integrates all the best practices and lessons we learned from previous technologies

ACI Fabric Resources
For Your
Reference

• Cisco ACI Multi-Site Architecture White Paper


• [Link]
centric-infrastructure/[Link]

• BRKACI-2125 ACI Multi-Site Architecture and Deployment

• BRKACI-3101 ACI Under the Hood - How Your Configuration is Deployed

Cisco Data Center Network Solutions
[Diagram columns: Classic Ethernet & vPC, Programmable Fabric, Application Centric Infrastructure, Programmable Network; application tiers Web, App, DB]

• Open NX-OS
• Enhanced APIs and Automation Ecosystem (DevOps)

Nexus Device Programmability
• Power on Auto Provisioning (PoAP)
• On-box Python Scripting
• NX-OS Software Development Kit (SDK)
• Configuration Management Tools

Cisco Nexus Power on Auto Provisioning (PoAP)

1 Power up phase: start the Power On Auto-Provisioning process
2 DHCP Discover phase: get IP address, gateway, script server and script file
3 Download the script file onto the switch and execute the script
4 Download configuration, license and software images onto the switch
5 Reboot if needed. Switch up and running with the downloaded image and config
[Diagram: Nexus switch, default gateway, script server, DHCP server, and license, configuration and software server]

Deploy and Manage POAP Using DCNM
Deploy using POAP Script
• Download POAP script from GitHub:
• https://[Link]/datacenter/nexus9000/blob/master/nx-os/poap/[Link]
Nexus 9000 Programmability
On-box Python
• Python script can be run in interactive or non-interactive mode

• Please store scripts in the bootflash:scripts directory of the switch

Interactive Mode

switch# python
Python 2.7.5 (default, Nov 5 2016, [Link])
>>> cli("conf ; interface loopback 1")
''
>>> clip('where detail')
mode:
username: admin
vdc: TSI-N9508-stand-alone
routing-context vrf: default

Non Interactive (script) Mode

switch# dir bootflash:scripts
946 Oct 30 [Link] 2013 [Link]
7009 Sep 19 [Link] 2013 [Link]
22760 Oct 31 [Link] 2012 [Link]

switch# python bootflash:/scripts/[Link]
Or
switch# source [Link]
-----------------------------------------
Started running CRC checker script
finished running CRC checker script
------------------------------------------
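
A non-interactive script built on the same cli() call could look like the sketch below. The cli module is only present in the on-box NX-OS interpreter, so the fallback function here is a stand-in added purely so the sketch also runs off-box; run_commands is a hypothetical helper name.

```python
try:
    from cli import cli  # on-box NX-OS Python module
except ImportError:
    def cli(command):
        """Off-box stand-in (assumption): echo the command as fake output."""
        return "stub output for: {}\n".format(command)


def run_commands(commands):
    """Run each CLI command and return a {command: output} dictionary."""
    return {command: cli(command).strip() for command in commands}


if __name__ == "__main__":
    for command, output in run_commands(["show clock"]).items():
        print("{} -> {}".format(command, output))
```

Saved under bootflash:scripts, a file like this would be started with the python or source commands shown above.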

Python Use Cases
“Off-Box” Python: scripts executed externally from switch (Python on a Linux server reaching the NX-OS device over SSH/NETCONF)
“On-Box” Python: scripts executed locally on switch (Python running on the NX-OS device)
Use cases:
• provisioning automation
• configuration management automation
• automating Embedded Event Manager
• telemetry / operational data
• application development
• controller use cases including APIC, POAP

Auto Back-up Use Case
“On-Box” Python and EEM
Cisco Nexus 9000 Python SDK User Guide:
[Link]
sdk-user-guide-and-api-reference

[Diagram: Nexus 93xx running EEM]
• Python script creates a back-up file and sends it to a TFTP server
• EEM triggers on-box Python script
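
The Python side of this use case might be sketched as below. The server address (a documentation address), the file-naming scheme, the management VRF, and the helper names are all assumptions for illustration; the off-box stand-in for cli() exists only so the sketch runs outside the switch.

```python
import time


def backup_filename(hostname, when=None):
    """Build a timestamped file name such as leaf1-20190101-000000.cfg."""
    stamp = time.strftime("%Y%m%d-%H%M%S", when or time.localtime())
    return "{}-{}.cfg".format(hostname, stamp)


def backup_command(server, filename, vrf="management"):
    """Return the copy command that sends the running config to TFTP."""
    return "copy running-config tftp://{}/{} vrf {}".format(
        server, filename, vrf)


if __name__ == "__main__":
    try:
        from cli import cli  # on-box NX-OS Python module
    except ImportError:
        def cli(command):    # off-box stand-in (assumption)
            return command
    # 192.0.2.10 and "leaf1" are placeholder values.
    print(cli(backup_command("192.0.2.10", backup_filename("leaf1"))))
```

An EEM applet would then call this script whenever its trigger event occurs.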

NX-OS SDK (Software Development Kit)
• NX-OS SDK enables on-box custom applications to access NX-OS native functionality
[Diagram labels: Nexus 9K; Custom Applications (Python, C++, etc.); Existing 3rd Party Linux Applications; Linux – Native Shell or Guest Shell; Linux Networking Stack; NX-OS CLI; L2, L3, Interfaces, Platform, etc.]

Nexus Programmability
Configuration Management Tools

NX-OS Programmability Resources
For Your
Reference

• BRKACI-2025 Maximizing Network Programmability and Automation with Open NX-OS

Data Center HA
Key Takeaways
High Availability Enterprise Data Center Design
Key Principles
• Follow HA design and operational best practices to minimize network
downtime

Reconvergence
Effect on “Mission-Critical”, Real-Time Operations
• First step on the Moon – July 20, 1969 … how it really happened …
“OK, I’m going to step off the LEM now”
“That’s one small step for man”
“One giant leap for mankind”
LEM = Lunar Excursion Module = the Lunar Lander

Reconvergence
Effect on “Mission-Critical”, Real-Time Operations
• And how it would have looked with … standard HSRP timers …

Reconvergence
Effect on “Mission-Critical”, Real-Time Operations
• And how it would have looked with … 3-second reconvergence …

Reconvergence
Effect on “Mission-Critical”, Real-Time Operations
• And how it would have looked with … 500-msec re-convergence …

Tuning Your Network Design and Reconvergence Can Be a “Giant Leap” for Your Network – and Your Application – Availability!
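
The three reconvergence cases can be compared by the number of packets a real-time stream loses during the outage. The sketch below assumes a 50 packets-per-second flow (typical of a voice call with 20 ms packetization) and takes the default HSRP hold time of 10 seconds as the outage with standard timers; the function name is hypothetical.

```python
def packets_lost(outage_seconds, packets_per_second=50):
    """Packets dropped while the network reconverges."""
    return int(round(outage_seconds * packets_per_second))


if __name__ == "__main__":
    cases = [
        ("standard HSRP timers (10 s hold time)", 10.0),
        ("3-second reconvergence", 3.0),
        ("500-msec reconvergence", 0.5),
    ]
    for label, outage in cases:
        print("{}: {} packets lost".format(label, packets_lost(outage)))
```

At this rate the three cases lose 500, 150 and 25 packets respectively, which is the difference between a dropped sentence and a barely audible gap.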

Published design guides
[Link]/go/cvd

Thank you
