5/22/2018

MLAG Fast Convergence - MAC Redirection/Promotion

https://eos.arista.com/eos-4-18-0f/mlag-unicast-convergence/

Problem Description:


Consider the following setup 

  • mlagA and mlagB are 2 MLAG peers, with port-channel 2000 (Po2000) as the peer link.
  • The host MAC 0000:1111:2222 is learnt on MLAG 10 and A is the owner.
    • In other words, MAC 0000:1111:2222 is a local MAC on A and a remote MAC on B.
  • A bit of background:
    • The MAC address and ARP information are all synced during boot-up.
    • After that, only the MAC table is synced. For example, A tells B that MAC a.b.c was learnt from MLAG Po10, from a single interface, or from a remote VTEP.
    • So in the MAC table, a MAC entry has at least 4 states:
      • learnedDynamic (local mlag)
      • peerDynamic (remote mlag)
      • learnedRemoteDynamic (vxlan)
      • peerRemoteDynamic (remote vxlan)
    • This information is shown by the command: show mac address mlag-peer
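
As a rough illustration of the four states listed above, the following Python sketch classifies an entry by who learnt it (this switch or the MLAG peer) and by where the host sits (MLAG/local interface or VXLAN). The state names come from the notes; the `MacState` tuple and the helper names are hypothetical and are not an EOS API.

```python
from typing import List, NamedTuple


class MacState(NamedTuple):
    learnt_by: str   # "local" = learnt by this switch, "peer" = synced from the MLAG peer
    transport: str   # "mlag" = local/MLAG interface, "vxlan" = remote VTEP


# State names as listed in the notes; the tuple fields are an illustrative model only.
MAC_STATES = {
    "learnedDynamic": MacState("local", "mlag"),
    "peerDynamic": MacState("peer", "mlag"),
    "learnedRemoteDynamic": MacState("local", "vxlan"),
    "peerRemoteDynamic": MacState("peer", "vxlan"),
}


def is_peer_owned(state: str) -> bool:
    """Return True when the entry was synced from the MLAG peer."""
    return MAC_STATES[state].learnt_by == "peer"


def peer_owned_states() -> List[str]:
    """Return the names of all states that belong to the peer, sorted."""
    return sorted(name for name in MAC_STATES if is_peer_owned(name))
```

In this model, the peer-owned states are the ones at risk when the peer goes away, which is what the rest of these notes are about.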



Now consider two kinds of failure:

  • Link failure, which causes 2 loss events: link down and link up
    • When B's Po10 goes down, all MACs are re-programmed from Po10 to Po2000. In old releases before 4.18.1F, this MAC move is done one by one. (#1)
    • When B's Po10 comes back up, the ACL on the peer link that blocks BUM traffic is applied immediately to break the L2 loop, while the MAC move needs time to complete. (#2)
  • Node failure, which causes 3 loss events: node down, node up and reload-delay timeout
    • Node down: 100s of msec of loss, depending on scale
    • Node up: 100s of msec of loss (#3)
      • Why is there loss when the peer link comes up? At this time, peerB has no uplink or downlink up; they are all held in reload-delay.
      • Remember the MAC sync mentioned above? A needs to sync up with B on the MAC table, so on A the MACs learnt from B are flushed!
      • A has to relearn these MACs, 50% all of a sudden. This is still OK for locally switched packets because hardware flooding kicks in.
      • But it is bad for VXLAN, which requires software flooding for head-end replication.
    • Reload-delay timeout: 100s of msec of loss, actually twice
      • iBGP or IGP L3 routing is needed between the 2 peers.
      • The peer link comes up fast, much earlier than the MLAG and non-MLAG interfaces. So when the non-MLAG or MLAG interfaces come up, they can send traffic to the peer link before the optimal path has converged.
Feature and solution

From 4.18F, a feature called MLAG fast MAC redirection was developed to address the issues above. This feature has 2 aspects:

1. MAC redirect, for loss #1 and #2
  • With this feature, the interface attribute of the impacted MACs still points to MLAG Po10 in the host table. So there is no MAC move at all.
  • The Strata and Sand implementations are slightly different but follow the same idea.
  • On Sand, a recirc channel on each Arad/Jericho chip is used to recycle the MLAG-destined packets over to the peer link.
  • Requirements and limitations:
    • On Strata, the peer link must be a LAG, not an Ethernet interface.
    • On Sand, "platform sand lag hardware-only" must be enabled. I believe only a hardware LAG can share a member port, the recirc channel.
    • It cannot co-exist with MLAG ASU2.
2. MAC address promotion, targeting loss #3
  • When 1) the peer reboots or 2) there is a hitful restart of the forwarding plane, the remote MACs (learnt via the peer switch) are flushed, which causes:
    • 1) Before the MACs are re-learnt, packets need to be flooded. This is still OK in a pure L2/L3 environment because it is done by hardware.
    • 2) Software forwarding of VXLAN packets. That's a big issue because it results in drops by CoPP.
  • Why are the MACs flushed?
    • When peerB comes up, MACs are synced from A to B. The remote MACs on A are flushed. No MAC, then flooding.
  • Solution:
    • When peerB goes down, ownership of the MACs is transferred to peerA.
    • When peerB recovers, *ALL* MACs are synced from A to B.
  • Details:
    • When peerB goes down, peerA enters the failover state.
    • peerDynamic and peerLearnedRemote entries become learnedDynamic and learnedRemoteDynamic.
    • This is NOT done for single-leg hosts.
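
The promotion step described above can be sketched in Python as follows. This is only a model of the idea, not the EOS implementation. It assumes that the "peerLearnedRemote" state in the details is the same state listed earlier as peerRemoteDynamic, and it simply leaves single-leg hosts untouched, since the notes do not say what happens to them. The function name and every MAC other than 0000:1111:2222 are made up for illustration.

```python
# Promotion map: peer-owned state -> locally owned state (state names from the notes).
PROMOTION = {
    "peerDynamic": "learnedDynamic",
    "peerRemoteDynamic": "learnedRemoteDynamic",
}


def promote_on_peer_down(mac_table, single_leg_macs):
    """Return a new MAC table with peer-owned entries promoted to local ownership.

    mac_table: dict of MAC -> state name, as seen on the surviving peer.
    single_leg_macs: MACs of hosts attached only to the failed peer (not promoted).
    """
    promoted = {}
    for mac, state in mac_table.items():
        if mac in single_leg_macs:
            # Assumption: single-leg hosts are skipped; their real handling is not covered here.
            promoted[mac] = state
        else:
            # Entries that are already local keep their state.
            promoted[mac] = PROMOTION.get(state, state)
    return promoted


# Hypothetical table on peerA just before peerB goes down.
table_on_a = {
    "0000:1111:2222": "learnedDynamic",
    "0000:1111:3333": "peerDynamic",
    "0000:1111:4444": "peerRemoteDynamic",
    "0000:1111:5555": "peerDynamic",  # single-leg host behind peerB only
}
after_failover = promote_on_peer_down(table_on_a, {"0000:1111:5555"})
```

Because A now owns the promoted entries, there is nothing for it to flush when peerB comes back; A just syncs the whole table to B.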
Misc:
  • From AD1554:
    • If all uplinks are L3 interfaces, it is preferable to keep the non-MLAG reload-delay timer < the MLAG timer, so that
      • upstream/L3 comes up before downstream/L2;
      • in this way, S-N traffic should see no loss.
    • With "reload-delay mode lacp standby" enabled, the non-MLAG timer needs to be >= the MLAG timer:
      • MLAG interfaces with LACP are kept warm for LAG membership table and MAC table programming.
      • But upstream/L3 must come up after the L2/downstream/MLAG interfaces, otherwise S-N traffic is blackholed.
  • From AD3152:
    • SandL3Unicast: manages NH and ensures that the EEDB does not change
    • SandACL: programs a DROP ACL on the peerLinkRecircPort so that packets from the peer link do not go back to the peer link
    • Assigning the LagMemberID is interesting: this peerLinkRecircPort needs a member ID. What about overflow?
    • LAG member, C/D bits
      • C = Collecting, D = Distributing
      • For a static LAG, C/D=True and the port is added to the LAG
      • With LACP enabled, a member can be added only when C=True and D=True.
      • peerLinkRecircPort is always C/D=False/True
    • LC removal event
      • If all member ports are on this LC,
        • L3 will have some downtime since all ARP entries are lost;
        • L2 should be fine once the recirc port is programmed.
      • If at least 1 member is on another LC, L2/L3 should be fine
    • This is quite complicated!
      • 3/1, 3/36, 4/1 are local members of MLAG Po10
      • When all 3 ports go down, all 3 members are retained with C/D=False
      • PeerLinkRecircFap is added from either 3/0, 3/2 or 4/0
      • If LC3 is pulled, PeerLinkRecircFap 4/0 is added.
      • And there are a lot of combinations of events: LAG config change, member port down...
    • 4 events:
      • LC removal
      • LAG config change: unconfig and change config
      • Member port down or cable unplug
      • port-ch shutdown
  • CLIs:
    • show plat trident counter int e27/1
    • A drop counted at the VLAN boundary means the VLAN ID was missed. In other words, the VLAN ID programming on this interface is not done yet.
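
The reload-delay timer guidance from AD1554 above can be written down as a small check. This is a minimal Python sketch under the assumptions stated in these notes; the function name, parameter names and the timer values in the usage lines are made up for illustration and are not EOS defaults.

```python
def check_reload_delay(non_mlag_timer: int, mlag_timer: int, lacp_standby: bool) -> bool:
    """Return True when the timer pair follows the guidance in the AD1554 notes.

    Timers are in seconds. The rule without LACP standby assumes all uplinks are L3.
    """
    if lacp_standby:
        # "reload-delay mode lacp standby": L3 uplinks must not come up
        # before the MLAG interfaces, otherwise S-N traffic is blackholed.
        return non_mlag_timer >= mlag_timer
    # Without LACP standby (all-L3 uplinks): upstream should come up first.
    return non_mlag_timer < mlag_timer


# Hypothetical values, for illustration only.
ok_without_standby = check_reload_delay(240, 300, lacp_standby=False)
ok_with_standby = check_reload_delay(300, 300, lacp_standby=True)
```

Note that the first rule is only described as preferable, while the LACP standby rule is described as a requirement, so a real check might warn in one case and reject in the other.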
Reference:
  • PeerOne Vxlan + MLAG
  • AD3398, BG141435/96642
  • AD3152
  • AD1554
