4/16/2019

How to debug ISIS/BFD neighbor issue

Topology:

[R1/DCS-7512N]:et11/36/1--{180.13.1.0/24}--et11/1:[R2/DCS-7050QX]

* both routers run 4.20.11M
* Between R1 and R2, ISIS/BFD is enabled shown as below

Router Configuration:

R1 configuration:

router isis isis.def
   net 49.0001.0000.0000.0001.00
   is-type level-1
   log-adjacency-changes
   redistribute connected
   spf-interval 1 50 100
   timers lsp generation 1 10 1000
   !
   address-family ipv4 unicast
      maximum-paths 32
   !
   address-family ipv6 unicast
      maximum-paths 32
!
interface Ethernet11/36/1
   description UpEt_A_3.1
   speed forced 10000full
   no switchport
   ip address 180.13.1.1/24
   ipv6 address 180:13:1::1/64
   isis enable isis.def
   isis bfd
   isis circuit-type level-1
   isis network point-to-point
   isis authentication mode md5 level-1
   isis authentication key 7 QI3r+/7nP3Q= level-1

R2 configuration:
router isis isis.def
   net 49.0001.0000.0000.0103.00
   is-type level-1
   log-adjacency-changes
   spf-interval 1 50 100
   timers lsp generation 1 10 1000
   !
   address-family ipv4 unicast
      maximum-paths 32
   !
   address-family ipv6 unicast
      maximum-paths 32
!
interface Ethernet11/1
   description UpEt_A_3.1
   speed forced 10000full
   no switchport
   ip address 180.13.1.100/24
   ipv6 address 180:13:1::100/64
   isis enable isis.def
   isis bfd
   isis circuit-type level-1
   isis network point-to-point
   isis authentication mode md5 level-1
   isis authentication key 7 QI3r+/7nP3Q= level-1

Symptom:

R1: No ISIS and BFD neighbor


R1.13:53:14(config)#show isis nei | grep uspine3
R1.13:53:16(config)#show bfd nei | grep 180.13
R1.13:53:18(config)#

R2: ISIS neighbor stuck in INIT, and BFD is down

R2#sh isis nei
Instance  VRF      System Id        Type Interface          SNPA              State Hold time   Circuit Id
isis.def  default  R1               L1   Ethernet11/1       P2P               INIT  26          2D

R2#sh bfd nei

VRF name: default
-----------------
DstAddr                    MyDisc        YourDisc             Interface         Type               LastUp             LastDown                LastDiag    State
-------------------- --------------- --------------- --------------------- ------------ -------------------- -------------------- ----------------------- -----
180.13.1.1             2628313401               0      Ethernet11/1(47)       normal       04/15/19 13:25       04/15/19 13:26       Nbr Signaled Down     Down

Step1: R1/R2 - "show lldp nei" good and ping works, so no connectivity issue but high latency


-- show lldp nei --
R2#sh lldp nei
Port       Neighbor Device ID               Neighbor Port ID           TTL
Et11/1     R2                               Ethernet11/36/1            120

-- ping -- R2#ping 180.13.1.1
PING 180.13.1.1 (180.13.1.1) 72(100) bytes of data.
80 bytes from 180.13.1.1: icmp_seq=1 ttl=64 time=390 ms
80 bytes from 180.13.1.1: icmp_seq=2 ttl=64 time=437 ms
....

-- show interface <intf> | grep error --
R1.13:46:01(config)#sh int et11/36/1 | grep error
     0 input errors, 0 CRC, 0 alignment, 0 symbol, 0 input discards

     0 output errors, 0 collisions

R2.13:45:34(config)#sh int et11/1 | grep error
     0 input errors, 0 CRC, 0 alignment, 0 symbol, 0 input discards

     0 output errors, 0 collisions

Step2: R2's Down BFD shows NO incoming bfd pkts

-- show bfd neighbor dest-ip <ip> detail --
R2#sh bfd neighbors dest-ip 180.13.1.1 detail
VRF name: default
-----------------
Peer Addr 180.13.1.1, Intf Ethernet11/1, Type normal, State Down
VRF default, LAddr 180.13.2.100, LD/RD 2628313401/0
Session state is Down and not using echo function
Last Up Apr 15 13:25:27 2019
Last Down Apr 15 13:26:23 2019
Last Diag: Nbr Signaled Down
TxInt: 1000, RxInt: 1000, Multiplier: 3
Received RxInt: 300, Received Multiplier: 3
Rx Count: 0, Rx Interval (ms) min/max/avg: 0/0/0 last: never
Tx Count: 87681, Tx Interval (ms) min/max/avg: 508/996/870 last: 627 ms ago
Detect Time: 3000
Sched Delay: 1*TxInt: 87898, 2*TxInt: 0, 3*TxInt: 0, GT 3*TxInt: 0

R2 sends out BFD but no receiving

-- bash tcpdump -ni <intf> udp --
[admin@R2 ~]$ tcpdump -ni et11_1 udp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on et11_1, link-type EN10MB (Ethernet), capture size 262144 bytes
11:48:22.309536 28:99:3a:3b:15:39 > 44:4c:a8:a5:11:40, ethertype IPv4 (0x0800), length 66: 180.13.1.100.57657 > 180.13.1.1.bfd-control: BFDv1, Control, State Down, Flags: [none], length: 24


11:48:23.301574 28:99:3a:3b:15:39 > 44:4c:a8:a5:11:40, ethertype IPv4 (0x0800), length 66: 180.13.1.100.57657 > 180.13.1.1.bfd-control: BFDv1, Control, State Down, Flags: [none], length: 24

Same observation on R1, only receiving no sending

R1.13:53:18(config)#bash tcpdump -ni et11_36_2 udp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on et11_36_2, link-type EN10MB (Ethernet), capture size 262144 bytes
13:55:10.568533 28:99:3a:3b:15:39 > 44:4c:a8:a5:11:40, ethertype IPv4 (0x0800), length 66: 180.13.2.100.57657 > 180.13.2.1.bfd-control: BFDv1, Control, State Down, Flags: [none], length: 24

13:55:11.445074 28:99:3a:3b:15:39 > 44:4c:a8:a5:11:40, ethertype IPv4 (0x0800), length 66: 180.13.2.100.57657 > 180.13.2.1.bfd-control: BFDv1, Control, State Down, Flags: [none], length: 24

ISIS should notify the BFD to establish session after ISIS neighbor up. So the symptom is narrowed down to ISIS neighbor issue. 

Step 3: Tcpdump ISIS IIH (isis hello) on both sides 

-- show ver | grep MAC to get system MAC --

R1#show ver | grep MAC
System MAC address:  444c.a8a5.1140

R2#show ver | grep MAC
System MAC address:  2899.3a3b.1539

R2: IIH 2-way and looks good, R1/R2 send IIH every 10 seconds

-- bash tcpdump -ni <intf> iih --
[admin@R2 ~]$ tcpdump -ni et11_1 iih
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on et11_1, link-type EN10MB (Ethernet), capture size 262144 bytes
11:18:16.104275 44:4c:a8:a5:11:40 > 09:00:2b:00:00:05, 802.3, length 1500: LLC, dsap OSI (0xfe) Individual, ssap OSI (0xfe) Command, ctrl 0x03: OSI NLPID IS-IS (0x83): p2p IIH, src-id 0000.0000.0001, length 1497
11:18:18.000062 28:99:3a:3b:15:39 > 09:00:2b:00:00:05, 802.3, length 1500: LLC, dsap OSI (0xfe) Individual, ssap OSI (0xfe) Command, ctrl 0x03: OSI NLPID IS-IS (0x83): p2p IIH, src-id 0000.0000.0103, length 1497
11:18:24.723990 44:4c:a8:a5:11:40 > 09:00:2b:00:00:05, 802.3, length 1500: LLC, dsap OSI (0xfe) Individual, ssap OSI (0xfe) Command, ctrl 0x03: OSI NLPID IS-IS (0x83): p2p IIH, src-id 0000.0000.0001, length 1497
11:18:27.597320 28:99:3a:3b:15:39 > 09:00:2b:00:00:05, 802.3, length 1500: LLC, dsap OSI (0xfe) Individual, ssap OSI (0xfe) Command, ctrl 0x03: OSI NLPID IS-IS (0x83): p2p IIH, src-id 0000.0000.0103, length 1497

R1: IIH, only 1 way, from R1 to R2, no R2 to R1

[admin@bn303 ~]$ tcpdump -ni et11_36_2 iih
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on et11_36_2, link-type EN10MB (Ethernet), capture size 262144 bytes
11:23:53.268387 44:4c:a8:a5:11:40 > 09:00:2b:00:00:05, 802.3, length 1500: LLC, dsap OSI (0xfe) Individual, ssap OSI (0xfe) Command, ctrl 0x03: OSI NLPID IS-IS (0x83): p2p IIH, src-id 0000.0000.0001, length 1497

11:24:02.823894 44:4c:a8:a5:11:40 > 09:00:2b:00:00:05, 802.3, length 1500: LLC, dsap OSI (0xfe) Individual, ssap OSI (0xfe) Command, ctrl 0x03: OSI NLPID IS-IS (0x83): p2p IIH, src-id 0000.0000.0001, length 1497

IIH packets are destinated to well-known LAN broadcast address (0900:2b00:0005). These packets can't reach from R2 to R1. 

Step 4: Arping

Arping shows similar, R1 to R2 is good, but seeing broadcast packets (all ff) drop on the direction of R2 to R1 even with arp ping

* Arp ping is to send our brocast L2 arp request to probe hosts

[admin@R1 ~]$ arping -b -I et11_36_1 180.13.1.100
ARPING 180.13.1.100 from 180.13.1.1 et11_36_1
Unicast reply from 180.13.1.100 [28:99:3A:3B:15:39]  419.967ms
Unicast reply from 180.13.1.100 [28:99:3A:3B:15:39]  409.205ms
Unicast reply from 180.13.1.100 [28:99:3A:3B:15:39]  401.616ms
Unicast reply from 180.13.1.100 [28:99:3A:3B:15:39]  410.797ms
^CSent 4 probes (4 broadcast(s))
Received 4 response(s)

[admin@R2 ~]$ arping -b -I et11_1 180.13.1.1
ARPING 180.13.1.1 from 180.13.1.100 et11_1
^CSent 9 probes (9 broadcast(s))

Received 0 response(s)

[admin@R2 ~]$ tcpdump -ni et11_1 arp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on et11_1, link-type EN10MB (Ethernet), capture size 262144 bytes
11:46:04.859574 28:99:3a:3b:15:39 > Broadcast, ethertype ARP (0x0806), length 42: Request who-has 180.13.1.1 (Broadcast) tell 180.13.1.100, length 28

11:46:05.859645 28:99:3a:3b:15:39 > Broadcast, ethertype ARP (0x0806), length 42: Request who-has 180.13.1.1 (Broadcast) tell 180.13.1.100, length 28

At this point, the issue is narrowed down to, broadcast packets from R2 to R1 are dropped unexpectedly, which causes R1 not seeing R2's IIH packets, then no ISIS/BFD neighbor.  

Step 5: hardware issue, but who drops the packets?

Ok, now we can say, most likely the problem is hardware, but who? The best way to nail down the culprit is to mirror packets to cpu, to see if the broadcast packets leave the R2. 

* The feature of mirroring interface packets to CPU is only supported on Sand (Jericho/Arad) platform. 

R1.14:32:14(config)#monitor session 1 source ethernet 11/36/1 rx
R1.14:32:19(config)#monitor session 1 destination cpu

Seeing IIH from R2 to R1 on the mirror cpu. << This is the packets received on the wire of et11/36/1

R1.14:32:20(config)#bash tcpdump -ni mirror0 iih
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on mirror0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:33:04.105249 28:99:3a:3b:15:39 > 09:00:2b:00:00:05, 802.3, length 1500: LLC, dsap OSI (0xfe) Individual, ssap OSI (0xfe) Command, ctrl 0x03: OSI NLPID IS-IS (0x83): p2p IIH, src-id 0000.0000.0103, length 1497
14:33:13.257565 28:99:3a:3b:15:39 > 09:00:2b:00:00:05, 802.3, length 1500: LLC, dsap OSI (0xfe) Individual, ssap OSI (0xfe) Command, ctrl 0x03: OSI NLPID IS-IS (0x83): p2p IIH, src-id 0000.0000.0103, length 1497
14:33:21.554871 28:99:3a:3b:15:39 > 09:00:2b:00:00:05, 802.3, length 1500: LLC, dsap OSI (0xfe) Individual, ssap OSI (0xfe) Command, ctrl 0x03: OSI NLPID IS-IS (0x83): p2p IIH, src-id 0000.0000.0103, length 1497

But not on the physical ethernet port << This is the packets received by the kernel via et11/36/1

R1.14:33:22(config)#bash tcpdump -ni et11_36_1 iih
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on et11_36_1, link-type EN10MB (Ethernet), capture size 262144 bytes
14:33:35.056748 44:4c:a8:a5:11:40 > 09:00:2b:00:00:05, 802.3, length 1500: LLC, dsap OSI (0xfe) Individual, ssap OSI (0xfe) Command, ctrl 0x03: OSI NLPID IS-IS (0x83): p2p IIH, src-id 0000.0000.0001, length 1497

14:33:42.990686 44:4c:a8:a5:11:40 > 09:00:2b:00:00:05, 802.3, length 1500: LLC, dsap OSI (0xfe) Individual, ssap OSI (0xfe) Command, ctrl 0x03: OSI NLPID IS-IS (0x83): p2p IIH, src-id 0000.0000.0001, length 1497

Now we know the broadcast packets are dropped on R1 between eth and kernel. 

Step 6: Counters

And this is approved by the counter output on both R1 and R2

R2/Trident - No loss

R2#clear counters
R2#clear platform trident counters
R2#sh cpu counters queue | nz

--------------------------------------------------------------------------------
                                 Linecard0/0
--------------------------------------------------------------------------------
Queue                                          Counter/pkts*          Drops/pkts
---------------                          ------------------- -------------------
OSPF/ISIS                                                  2                   0
BFD                                                       10                   0
R2#sh platform trident counters interface e11/1 | egrep -i 'card|drop' | nz
R2#

R1/Jericho - Loss seen

R1#clear hardware counter drop
R1#clear counters

R1#sh cpu counters queue | grep 'Et11/36/1' | nz

CoppSystemLldp             Et11/36/1                  4                924                  0                  0

R1#sh hardware counter drop | grep o11/1
A     Jericho11/1   DeqDeletePktCnt                :           39148 : 2019-04-16 11:59:39 : 2019-04-16 11:59:39
P     Jericho11/1   dropVoqInNullRoute             :      3800339040 : 2019-04-16 11:59:39 : 2019-04-16 11:59:39

R2#sh hardware counter drop | grep o11/1
A     Jericho11/1   DeqDeletePktCnt                :           39153 : 2019-04-16 11:59:39 : 2019-04-16 11:59:45
P     Jericho11/1   dropVoqInNullRoute             :      3800605701 : 2019-04-16 11:59:39 : 2019-04-16 11:59:45

So, should goes to DeqDeletePktCnt or dropVoqInNullRoute. What does they mean?

No comments:

Post a Comment