Same L3 switch, new problem. Some vital stats for reference:
- Cisco Catalyst 3750E
- 250+ VLAN interfaces
- 100+ Gig/10GE interfaces
- OSPF
- PBR is used to specify a different next-hop
- The same route-map is applied to each interface so that the route-map only needs one corresponding access-list for the match clause.
Our monitoring alerted us to consistently high CPU load. We followed the checklist for troubleshooting 3750 CPU load here and didn't see anything that particularly caught our eye. Checking revision control (you have configuration revision control, don't you?), we found a change commit that correlated perfectly with the increase in CPU utilisation. It turns out that Cisco's checklist was relevant after all, and this particular section explained it:
When configuring match criteria in a route map, follow these guidelines:
- Do not match ACLs with deny ACEs. Packets that match a deny ACE are sent to the CPU, which could cause high CPU utilization.
The ACL used by the route-map had four deny lines at the start, with a combined match count in the hundreds of millions. Going by the guideline above, every one of those packets had been sent to the CPU.
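A check like this is easy to automate against the configs you already keep in revision control. The following is only a rough sketch in Python, not a tool we actually ran: the function name, the sample config, and the ACL and route-map names are all made up for illustration, and it assumes the usual indented IOS layout with named or numbered ACLs. It collects every ACL referenced by a route-map `match ip address` clause and reports any deny ACEs in it.

```python
import re

# Hypothetical sample configuration; all names and addresses are invented.
SAMPLE_CONFIG = """\
ip access-list extended PBR-MATCH
 deny   ip any 10.0.0.0 0.255.255.255
 deny   ip any 192.168.0.0 0.0.255.255
 permit ip 172.16.0.0 0.0.255.255 any
!
ip access-list extended MGMT-ONLY
 deny   ip any any
!
route-map PBR-NEXT-HOP permit 10
 match ip address PBR-MATCH
 set ip next-hop 192.0.2.1
!
"""

# An ACE may carry a leading sequence number, e.g. "10 deny ip any any".
DENY_RE = re.compile(r"^(?:\d+\s+)?deny\b")


def find_deny_aces_in_pbr_acls(config_text):
    """Return {acl_name: [deny ACE lines]} for ACLs matched by any route-map."""
    acls = {}           # ACL name or number -> list of ACE lines
    referenced = set()  # ACLs named in "match ip address" clauses
    current_acl = None
    in_route_map = False

    for raw in config_text.splitlines():
        line = raw.strip()
        if not raw.startswith(" "):
            # A top-level line ends whatever section we were in.
            current_acl = None
            in_route_map = False
            named = re.match(r"ip access-list (?:standard|extended) (\S+)", line)
            numbered = re.match(r"access-list (\d+) (.*)", line)
            if named:
                current_acl = named.group(1)
                acls.setdefault(current_acl, [])
            elif numbered:
                acls.setdefault(numbered.group(1), []).append(numbered.group(2))
            elif line.startswith("route-map "):
                in_route_map = True
            continue

        if current_acl is not None:
            acls[current_acl].append(line)
        elif in_route_map and line.startswith("match ip address "):
            tokens = line.split()[3:]
            # "match ip address prefix-list ..." refers to prefix lists, not ACLs.
            if tokens and tokens[0] != "prefix-list":
                referenced.update(tokens)

    findings = {}
    for name in sorted(referenced):
        denies = [ace for ace in acls.get(name, []) if DENY_RE.match(ace)]
        if denies:
            findings[name] = denies
    return findings


if __name__ == "__main__":
    for acl, aces in find_deny_aces_in_pbr_acls(SAMPLE_CONFIG).items():
        print(f"ACL {acl} is used for PBR and contains deny ACEs:")
        for ace in aces:
            print(f"  {ace}")
```

Run as a pre-commit or post-commit hook, something along these lines would have flagged the change before the CPU graph did. Note that it deliberately ignores ACLs that are not referenced by a route-map, since deny ACEs are perfectly normal elsewhere.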
Lesson learned: when you think you've read all of the best practice guides, there'll be another one you haven't had to read until something goes wrong.