4/09/2018

Arista EOS, MLAG (2) - Dual-Primary Detection

MLAG Dual-Primary Detection

This is a long-awaited MLAG feature, introduced in EOS 4.20.1F around late 2017. When the peer link goes down, the secondary takes over as primary. But sometimes the problem is only on the peer link and the peer is still alive. Without this feature, both switches end up as MLAG primary, which disrupts traffic, for example with bursty traffic loops.

With this feature enabled, MLAG communicates with its peer over the management interface in addition to the peer link. Since the out-of-band management network is considered less likely to be congested, this prevents the dual-primary scenario above.

How does it work?
  • When the peer link goes down, the secondary takes over as primary immediately.
  • Meanwhile, it starts dual-primary detection.
  • If the secondary peer still receives heartbeats, it concludes that a dual-primary condition exists. As a result, the secondary peer disables ALL interfaces to avoid a loop.
  • When the peer link comes back up, MLAG negotiation starts and the pair recovers.
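
The four steps above can be sketched as a small state machine. This is only a toy model of the behaviour described in this post, not Arista's implementation: the class name, the state strings, and the assumption that the peer goes back to "secondary" after renegotiation are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SecondaryPeer:
    """Toy model of the original MLAG secondary (hypothetical, for illustration)."""
    state: str = "secondary"
    interfaces_errdisabled: bool = False

    def peer_link_down(self) -> None:
        # Step 1: the secondary takes over as primary as soon as the peer link drops.
        self.state = "primary"

    def detection_result(self, heartbeat_received: bool) -> None:
        # Steps 2-3: if heartbeats still arrive over the management network,
        # the old primary is alive, so this is a dual-primary condition.
        if self.state == "primary" and heartbeat_received:
            self.state = "dual-primary-detected"
            # Mirrors "action errdisable all-interfaces" from the config.
            self.interfaces_errdisabled = True

    def peer_link_up(self) -> None:
        # Step 4: MLAG renegotiates; returning to "secondary" is an assumption.
        self.state = "secondary"
        self.interfaces_errdisabled = False


peer = SecondaryPeer()
peer.peer_link_down()
peer.detection_result(heartbeat_received=True)
print(peer.state, peer.interfaces_errdisabled)
```

If no heartbeat arrives (the peer is really dead), `detection_result(False)` leaves the switch as primary with its interfaces up, which is the normal failover case.
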
Here is the configuration:

mlag configuration
  peer-address heartbeat 172.30.134.180 
  ! heartbeat via management ip address
  dual-primary detection delay 10 action errdisable all-interfaces

How to verify:

Arista.EOS#show mlag det
MLAG Configuration:
domain-id              :       pg.mlag.leaf1
local-interface        :            Vlan4094
peer-address           :       192.168.255.0
peer-link              :    Port-Channel2000
hb-peer-address        :      172.30.134.181
peer-config            :          consistent

MLAG Status:
state                  :              Active
negotiation status     :           Connected
peer-link status       :                  Up
local-int status       :                  Up
system-id              :   46:4c:a8:97:83:7d
dual-primary detection :          Configured

What happens if the heartbeat connection has an issue? If there is a misconfiguration, such as a missing vrf in the hb-peer-address line, or the out-of-band management network has a connectivity problem, the system will report:

Arista.EOS#sh mlag det
MLAG Configuration:
domain-id              :       pg.mlag.leaf1
......
MLAG Detailed Status:
....
Heartbeat timeouts since reboot :                   1
UDP heartbeat alive             :               False

Arista.EOS#show logg | grep MLAG-3

Apr 10 00:06:28 Arista.EOS Mlag: %MLAG-3-PEER_HEARTBEAT_TIMEOUT: MLAG stopped receiving UDP heartbeats from the peer 172.30.134.180.
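
If you want to check this from a script, here is a minimal sketch that parses the "key : value" lines of the output above. It assumes you already have the text of `show mlag detail` in a string (how you collect it is up to you). The function names are hypothetical, and only the field names shown in this post are used.

```python
def parse_mlag_detail(text: str) -> dict:
    """Turn 'key : value' lines into a dict; headings and prompts are skipped."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values like MAC addresses stay whole.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def heartbeat_problem(fields: dict) -> bool:
    # "False" is the value shown above when UDP heartbeats have stopped.
    return fields.get("UDP heartbeat alive") == "False"


sample = """\
MLAG Configuration:
domain-id              :       pg.mlag.leaf1
system-id              :   46:4c:a8:97:83:7d
dual-primary detection :          Configured
Heartbeat timeouts since reboot :                   1
UDP heartbeat alive             :               False
"""

fields = parse_mlag_detail(sample)
print(fields["dual-primary detection"], heartbeat_problem(fields))
```
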

MLAG split-brain

When an MLAG split brain happens (the trigger is disconnecting the peer link), both leaves could hash BPDUs to one peer, so the other peer does not receive any BPDUs. All of its ports then stay in forwarding, which causes a loop.

STP may kick in and put ports into designated-dispute mode. But after 2 x the fdWhile timer, another round of negotiation starts, which forms a bursty loop every 2 seconds.

An interesting RFE 11825. 
