7/18/2019

7508E CPU Util% is high (1)

I saw quite slow response on one of my lab routers, a 7508E running the latest EOS release.

mlagA.10:18:17(config)#show ver
Arista DCS-7508
Hardware version:    06.00
....
Software image version: 4.22.0.1F

The CPU is only 64% idle, so utilization is not low, and the two busiest processes are ksoftirqd threads.

mlagA.10:18:13(config)#show proc top once | more
%Cpu(s): 29.6 us,  3.9 sy,  0.0 ni, 64.0 id,  0.1 wa,  0.5 hi,  2.0 si,  0.0 st
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
   23 root      20   0     0    0    0 R  89.6  0.0  26:04.70 ksoftirqd/2
   17 root      20   0     0    0    0 R  68.5  0.0  24:27.80 ksoftirqd/1
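
To script this check, the summary line can be split into its fields. This is a minimal Python sketch, not an EOS tool: `parse_cpu_summary` is a hypothetical helper, and the input is the `%Cpu(s)` line copied from the output above. Keep in mind that per-process %CPU in top is normally relative to a single core, so a ksoftirqd thread at 89.6% can sit next to a chassis-wide 64% idle figure.

```python
import re

def parse_cpu_summary(line):
    """Return a dict such as {'us': 29.6, 'id': 64.0} from a top '%Cpu(s):' line."""
    # Each field looks like "29.6 us": a number followed by a two-letter tag.
    fields = re.findall(r"([\d.]+)\s+([a-z]{2})", line)
    return {name: float(value) for value, name in fields}

line = "%Cpu(s): 29.6 us,  3.9 sy,  0.0 ni, 64.0 id,  0.1 wa,  0.5 hi,  2.0 si,  0.0 st"
cpu = parse_cpu_summary(line)

# Everything that is not idle counts as busy here.
busy = round(100.0 - cpu["id"], 1)
print(f"busy={busy}% softirq={cpu['si']}%")
```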

I suspect some unexpected traffic is hitting the CPU, so I check the output of "show cpu counters queue | nz".

yo412.mlagA.10:22:38(config)#clear counters
!!! even a single command, clear counters, takes almost 10 sec to complete !!!


yo412.mlagA.10:22:47(config)#show cpu counters queue | nz | more
Arad3/0:
CoPP Class                 Queue                    Pkts             Octets           DropPkts         DropOctets
Aggregate
-----------------------------------------------------------------------------------------------------------------
CoppSystemL3LpmOverflow    Et3/6/1                  1753             473344              74945           21049856
CoppSystemL3LpmOverflow    Et3/6/2                  1112             307200              73605           20702976
CoppSystemL3LpmOverflow    Et3/6/3                   610             166912              86302           23954432
CoppSystemL3LpmOverflow    Et3/6/4                  1178             320256              77089           21414656

It looks like a lot of packets are hitting the CPU, even though CoPP drops most of them. This is a fully loaded chassis, so the aggregate punted traffic is still too heavy for an x86 CPU.
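
To put a number on "most of them", the drop ratio per queue can be computed from the counters above. This is a rough Python sketch: `drop_ratios` is a hypothetical helper, and it assumes Pkts counts packets passed up to the CPU while DropPkts counts packets discarded by CoPP.

```python
def drop_ratios(output):
    """Map (CoPP class, queue) to the fraction of packets that were dropped."""
    ratios = {}
    for row in output.splitlines():
        cols = row.split()
        # Data rows have exactly six columns with a numeric Pkts field;
        # the header, the separator and the chip name do not.
        if len(cols) != 6 or not cols[2].isdigit():
            continue
        pkts, drops = int(cols[2]), int(cols[4])
        total = pkts + drops
        ratios[(cols[0], cols[1])] = drops / total if total else 0.0
    return ratios

output = """\
Arad3/0:
CoPP Class                 Queue                    Pkts             Octets           DropPkts         DropOctets
Aggregate
CoppSystemL3LpmOverflow    Et3/6/1                  1753             473344              74945           21049856
CoppSystemL3LpmOverflow    Et3/6/2                  1112             307200              73605           20702976
CoppSystemL3LpmOverflow    Et3/6/3                   610             166912              86302           23954432
CoppSystemL3LpmOverflow    Et3/6/4                  1178             320256              77089           21414656
"""

ratios = drop_ratios(output)
for (copp_class, queue), ratio in ratios.items():
    print(f"{copp_class} {queue}: {ratio:.1%} dropped")
```

Under that assumption, every one of the four queues drops more than 97% of what it receives.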

Try to tcpdump the packets that arrive on et3/6/1 and are punted to the CPU. Surprisingly, not many show up...

mlagA.10:35:23(config)#bash tcpdump -nvvi et3_6_1
tcpdump: listening on et3_6_1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:35:36.066689 00:1c:73:46:0d:b0 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110
10:35:39.530341 00:1c:73:3b:e0:22 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110
^C

2 packets captured

Next, mirror this port to the CPU and run tcpdump on the mirror interface. (This feature is only supported on 7500E/R or 7280R devices.)

mlagA.10:37:46(config)#monitor session 1 source et3/6/1 rx
mlagA.10:39:12(config)#monitor session 1 destination cpu

mlagA.10:39:15(config)#bash tcpdump -nvi mirror0
tcpdump: listening on mirror0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:40:08.198512 1e:af:14:08:18:02 > 00:aa:aa:aa:bb:cc, ethertype 802.1Q (0x8100), length 252: vlan 1408, p 0, ethertype IPv4, 
    100.14.8.119.30485 > 220.200.16.1.24659: Flags [R.UW], seq 0:194, ack 0, win 61689, urg 0, length 194 
10:40:08.199078 1e:af:14:09:18:01 > 00:aa:aa:aa:bb:cc, ethertype 802.1Q (0x8100), length 252: vlan 1409, p 0, ethertype IPv4, 
    100.14.9.118.30504 > 220.200.17.1.24648: Flags [PUEW], seq 0:194, win 62028, urg 0, length 194
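
With a longer capture it is easier to tally destination addresses than to read them line by line. This is a small Python sketch, assuming the tcpdump text has been saved to a string: `destination_counts` and the `FLOW` pattern are hypothetical names, and the sample is just the two packets shown above.

```python
import re
from collections import Counter

# Matches "src_ip.src_port > dst_ip.dst_port:" as printed by tcpdump -n for IPv4.
FLOW = re.compile(r"(\d+(?:\.\d+){3})\.(\d+) > (\d+(?:\.\d+){3})\.(\d+):")

def destination_counts(capture):
    """Count packets per destination IPv4 address in tcpdump text output."""
    counts = Counter()
    for line in capture.splitlines():
        match = FLOW.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts

capture = """\
10:40:08.198512 1e:af:14:08:18:02 > 00:aa:aa:aa:bb:cc, ethertype 802.1Q (0x8100), length 252: vlan 1408, p 0, ethertype IPv4,
    100.14.8.119.30485 > 220.200.16.1.24659: Flags [R.UW], seq 0:194, ack 0, win 61689, urg 0, length 194
10:40:08.199078 1e:af:14:09:18:01 > 00:aa:aa:aa:bb:cc, ethertype 802.1Q (0x8100), length 252: vlan 1409, p 0, ethertype IPv4,
    100.14.9.118.30504 > 220.200.17.1.24648: Flags [PUEW], seq 0:194, win 62028, urg 0, length 194
"""

counts = destination_counts(capture)
for address, packets in counts.most_common():
    print(address, packets)
```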

Do we have a route to these destinations? No....

mlagA.10:40:08(config)#sh ip route 220.200.17.1
VRF: default
....
Gateway of last resort is not set

Create a null route for this prefix. The response is better, and "show cpu counters queue | nz" is back to normal now, with no CoppSystemL3LpmOverflow anymore.

mlagA.10:54:28(config)#ip route 220.200.0.0/16 null0
mlagA.10:54:59(config)#sh cpu counters queue | nz | more
Arad3/0:
CoPP Class                 Queue                    Pkts             Octets           DropPkts         DropOctets
Aggregate
-----------------------------------------------------------------------------------------------------------------
CoppSystemIgmp             Et3/1/2                   160              10240                  0                  0
CoppSystemIgmp             Et3/1/4                   160              10240                  0                  0
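
As a sanity check, the null route prefix can be tested against the two destinations seen in the mirror capture. This is a minimal Python sketch using the standard ipaddress module. The variable names are my own and nothing here talks to the switch.

```python
import ipaddress

# Prefix from the "ip route 220.200.0.0/16 null0" command above.
null_route = ipaddress.ip_network("220.200.0.0/16")

# Destination addresses seen in the mirror0 capture.
destinations = ["220.200.16.1", "220.200.17.1"]

covered = {d: ipaddress.ip_address(d) in null_route for d in destinations}
blackholed_addresses = null_route.num_addresses

print(covered)
print(blackholed_addresses)
```

Both destinations fall inside the prefix. A /16 discards a large block of addresses, which is fine in a lab, but a narrower prefix would probably be safer on a production box.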


But the CPU is still busy, and the busiest process has changed from ksoftirqd to SandFap. Hmmm.... why?

mlagA.10:57:19(config)#sh proc top once | more
%Cpu(s): 30.2 us,  4.1 sy,  0.0 ni, 62.1 id,  0.1 wa,  0.5 hi,  3.1 si,  0.0 st
...
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
12874 root      20   0 1001m 357m 192m R 100.4  2.2 110:39.04 SandFap
13025 root      20   0 1001m 357m 192m S  69.6  2.2 110:54.53 SandFap
16765 root      20   0 1001m 359m 193m S  51.2  2.2  96:56.18 SandFap
