This document outlines basic troubleshooting steps and issues that I have encountered over the years with 4500 Series Switches, specifically Supervisors 4-5, though most of it is relevant to all models.
Additional Reading:
BRKCRS-3142 - Troubleshooting Cisco 4500 Series Switches
Scope
The test environment used here is 1 Core switch, two Distribution switches, and two Access switches per Distribution switch. Always keep in mind where the issue is occurring and what is being impacted. For some of the situations below, I will mention whether it is common for all devices to be impacted or only certain ones. Please note that this is not necessarily Cisco's best practice, but it is how I had things set up when these issues came up in testing.
High CPU Process Troubleshooting
Typical Causes:
a) Multicast Traffic (IGMP)
b) Spanning-Tree Traffic
c) Consecutive Port Changes (Port-Flapping)
Enabling CPU "Output" to a Wireshark Session
1. From the CLI, issue the following commands. They take the CPU as the source and mirror its traffic to an interface so it can be captured with Wireshark:
monitor session 1 source cpu
monitor session 1 destination interface Gi1/1
2. Once that is completed, you can simply attach Wireshark to the destination port to see what the CPU is processing at that time.
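Before attaching the capture, it can help to verify the SPAN session took effect. A minimal sketch, assuming the session number and destination port from above:
show monitor session 1    ! confirm the source (CPU) and destination (Gi1/1) are as configured
no monitor session 1      ! remove the session once you are done capturing, so the port returns to normal use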
How to Fix "Multicast Traffic"
Using the CPU output/Wireshark setup from above, as soon as you attach the capture to the switch you will be able to see all of the traffic hitting the CPU. When multicast traffic is the problem, it is typically very noticeable almost immediately: the traffic leaving the offending device (L2) will impact all of the switches on the path from there to the upstream gateway (L3). This is why you should either design your network so the Distribution layer is L3, or, if you require that your Access/Distribution layers stay entirely Layer 2, accept the above as a risk.
In the Wireshark capture you will see traffic coming from a specific source going to a multicast destination, and when it is the problem, it will typically not be a small amount of traffic. Each of these packets requires direct CPU processing on your switches and cannot be offloaded to the ASIC for fast switching. Once you find the device causing the issue, simply shut down that user's port, watch the network start to recover, and then analyze what caused the behavior.
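As a rough sketch of that last step: take the offending source MAC from the capture, trace it to an edge port, and shut the port. The MAC address and interface below are placeholders, and on some older IOS trains the command is spelled "show mac-address-table" instead:
show mac address-table address 0011.2233.4455   ! find the port the offending MAC was learned on (placeholder MAC)
configure terminal
 interface GigabitEthernet2/10                  ! placeholder interface taken from the output above
  shutdown
 end
show processes cpu sorted                       ! confirm CPU utilization starts dropping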
Spanning Tree & Port Changes
In the example above, I ran into an odd issue where STP was causing problems due to constant re-convergence/notifications. This was with Rapid-PVST+, and the switch was the Root. In this situation it made sense: one of the EtherChannel ports was on one blade, and another port was on another blade. The switch had a hardware failure causing a blade to fail outright, shutting down all of the ports plugged into it, including the EtherChannel member port, which caused the convergence. After a period of time the switch attempted to bring the blade back up, causing it to re-converge again. The lesson learned here: if you are having STP issues where high CPU is hitting one of your switches, check to make sure it is not a reconvergence issue due to port flapping or hardware problems. If it is also generating from two switches, verify the legs/connectivity between them.
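To track down where topology changes are coming from, a short sketch of the commands I would start with (the VLAN number is a placeholder):
show spanning-tree vlan 10 detail   ! topology change count, when the last change occurred, and the port it came from
show module                         ! check blade/line card status for hardware failures
show etherchannel summary           ! verify the state of EtherChannel member ports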
Another common and recommended check is to look at any associated trunk ports attached to that switch, as the cause could be a rogue switch plugged in. Common commands:
show proc cpu hist          ! CPU utilization graphs for the last seconds/minutes/hours
show proc cpu sorted        ! processes sorted by current CPU usage
show errdisable recovery    ! errdisable recovery status ("errdisable recovery cause <cause>" is the config form; see the sketch below)
show cdp neighbors          ! directly attached Cisco devices, handy for spotting rogue switches
show interface trunk        ! trunking ports and allowed VLANs
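If flapping ports are going errdisabled, a minimal sketch of enabling automatic recovery; the cause and interval here are assumptions, so pick what fits your environment:
configure terminal
 errdisable recovery cause bpduguard   ! assumed cause; use the cause seen in "show errdisable recovery"
 errdisable recovery interval 300      ! assumed interval, in seconds
 end
show errdisable recovery               ! confirm the timer and enabled causes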
More information will be added as things break :)