Arista EOS/Networking Tips by Solomon Yang: MLAG Fast Convergence - MAC Redirection/Promotion

5/22/2018

MLAG Fast Convergence - MAC Redirection/Promotion

https://eos.arista.com/eos-4-18-0f/mlag-unicast-convergence/

Problem Description:

Consider the following setup:

mlagA and mlagB are 2 MLAG peers connected by a peer link, port-ch 2000.
The host MAC 0000:1111:2222 is learnt on MLAG 10, and A is the owner.

In other words, MAC 0000:1111:2222 is a local MAC on A and a remote MAC on B.

Some background:

The MAC address and ARP information are both synced during boot-up.
After that, only the MAC table is synced. For example, A tells B that MAC a.b.c was learnt on MLAG po10, on a single interface, or from a remote VTEP.
So in the MAC table, a MAC has at least 4 states:

learnedDynamic (local MLAG)
peerDynamic (remote MLAG)
learnedRemoteDynamic (VXLAN)
peerRemoteDynamic (remote VXLAN)

To see this information, use the command: show mac address mlag-peer
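
As a quick way to keep the 4 states straight, here is a minimal Python sketch. The enum values are the state names listed above; the class name and the two helper functions are made up for illustration and are not part of EOS.

```python
from enum import Enum


class MacEntryType(Enum):
    # State names as listed in the notes above.
    LEARNED_DYNAMIC = "learnedDynamic"                # local MLAG
    PEER_DYNAMIC = "peerDynamic"                      # remote MLAG (synced from peer)
    LEARNED_REMOTE_DYNAMIC = "learnedRemoteDynamic"   # VXLAN
    PEER_REMOTE_DYNAMIC = "peerRemoteDynamic"         # remote VXLAN (synced from peer)


def is_owned_locally(entry_type):
    """True if this switch learnt the MAC itself rather than via peer sync."""
    return entry_type in (
        MacEntryType.LEARNED_DYNAMIC,
        MacEntryType.LEARNED_REMOTE_DYNAMIC,
    )


def is_vxlan(entry_type):
    """True for the two VXLAN-related states."""
    return entry_type in (
        MacEntryType.LEARNED_REMOTE_DYNAMIC,
        MacEntryType.PEER_REMOTE_DYNAMIC,
    )
```

In the setup above, 0000:1111:2222 would be learnedDynamic on A and peerDynamic on B.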

Now suppose we have:

A link failure, which causes 2 periods of loss: link down and link up.

When B's po 10 goes down, all MACs are re-programmed from po 10 to po 2000, and in old releases before 4.18.1F the MAC moves are done one by one. (#1)
When B's po 10 comes back up, the ACL on the peer link that blocks BUM traffic takes effect immediately to break the L2 loop, while the MAC moves need time to complete. (#2)

A node failure, which introduces 3 periods of loss: node down, node up and reload-delay timeout.

Node down: 100s of msec of loss, depending on scale.
Node up: 100s of msec of loss. (#3)

Why is there loss when the peer link comes up? At this time, peerB has no uplink or downlink up; they are all in reload-delay.
Remember the MAC sync mentioned above? A needs to sync up with B on the MAC table, so on A the MACs learnt from B are flushed!
A has to relearn these MACs, 50% all of a sudden. This is still OK for locally switched packets because hardware flooding kicks in.
But it is bad for VXLAN, which requires software flooding for head-end replication.

Reload-delay timeout: 100s of msec of loss, actually 2 times.

iBGP or IGP L3 routing is needed between the 2 peers.
The peer link comes up fast, much earlier than the MLAG and non-MLAG interfaces. So when the non-MLAG or MLAG interfaces come up, they can send traffic to the peer link before the optimal path has converged.

Feature and solution

Starting from 4.18F, a feature called MLAG fast MAC redirection was developed to address the issues above. This feature has 2 aspects:

1. MAC redirect, for loss #1 and #2

With this feature, the interface attribute of the impacted MACs still points to MLAG po 10 in the host table. So, no move at all.
The Strata and Sand implementations are slightly different but follow the same idea.
On Sand, a recirc channel on each Arad/Jericho chip is used to recycle the MLAG-destined packets over to the peer link.

Requirements and limitations:

The peer link must be a LAG, not an Ethernet interface, on Strata.
On Sand, "platform sand lag hardware-only" must be enabled. I believe only a hardware LAG can share a member port, the recirc channel.
MLAG ASU2 cannot co-exist with this feature.
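
To illustrate why the redirect helps with loss #1, here is a toy Python model (not how the hardware is programmed; all function names and the dict-based tables are assumptions for illustration). It contrasts rewriting every MAC entry with installing a single port-level redirect while the MAC entries keep pointing at the MLAG port-channel.

```python
def fail_link_with_mac_move(mac_table, failed_port, peer_link):
    """Old behaviour: rewrite every MAC entry that pointed at the failed port.

    Returns the number of table writes, which grows with the MAC scale.
    """
    writes = 0
    for mac, port in mac_table.items():
        if port == failed_port:
            mac_table[mac] = peer_link
            writes += 1
    return writes


def fail_link_with_redirect(redirects, failed_port, peer_link):
    """Redirect behaviour: MAC entries are untouched, one redirect is installed."""
    redirects[failed_port] = peer_link
    return 1


def egress_port(mac_table, redirects, mac):
    """Resolve the port a frame for `mac` leaves on, honouring any redirect."""
    port = mac_table[mac]
    return redirects.get(port, port)


# Hypothetical scale: 1000 hosts behind Po10, peer link is Po2000.
table = {f"0000.1111.{i:04x}": "Po10" for i in range(1000)}
redirects = {}
print(fail_link_with_redirect(redirects, "Po10", "Po2000"))  # 1 write
print(egress_port(table, redirects, "0000.1111.0000"))       # Po2000
```

When the link comes back, removing the single redirect restores forwarding on po 10 without any MAC move, which is the idea behind fixing loss #2 as well.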

2. MAC address promotion, targeting #3.

When 1) the peer reboots or 2) there is a hitful restart of the forwarding plane, the remote MACs (learnt via the peer switch) are flushed, which causes:
1) Before a MAC is re-learnt, packets need to be flooded. This is still OK in a pure L2/L3 environment because it is done by hardware.
2) Software forwarding of VXLAN packets. That's a big issue because it results in drops by CoPP.

Why are the MACs flushed?

When peerB comes up, MACs are synced from A to B. These remote MACs are flushed. No MAC, then flooding.

Solution:

When peerB is down, the ownership of the MACs is transferred to peerA.
When peerB recovers, *ALL* MACs are synced from A to B.

Details:

When peerB is down, peerA enters the failover state.
peerDynamic and peerLearnedRemote entries are promoted to learnedDynamic and learnedRemoteDynamic respectively.
Single-leg hosts are NOT promoted.
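
The promotion step can be sketched in a few lines of Python. This is only a model of the notes above: the type names are the ones written in the Details list, while the dict layout, the `single_leg` flag and the function name are assumptions. The notes do not say what happens to single-leg entries, so the sketch simply leaves them unchanged.

```python
# Promotion map, using the state names from the Details list above.
PROMOTION = {
    "peerDynamic": "learnedDynamic",
    "peerLearnedRemote": "learnedRemoteDynamic",
}


def promote_on_peer_failure(entries):
    """Return a new list of MAC entries as peerA might hold them in failover.

    Each entry is a dict with a "mac" and a "type"; the optional "single_leg"
    flag (hypothetical) marks hosts attached to the failed peer only.
    """
    result = []
    for entry in entries:
        new_type = PROMOTION.get(entry["type"])
        if new_type is not None and not entry.get("single_leg", False):
            # Ownership moves to the surviving peer.
            result.append({**entry, "type": new_type})
        else:
            # Already local, or a single-leg host: left as-is in this sketch.
            result.append(dict(entry))
    return result
```

Because the promoted entries are now owned by A, there is nothing to flush when peerB comes back; A just syncs everything to B.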

Misc:

From AD1554:

If all uplinks are L3 interfaces, then it is preferable to keep the non-MLAG reload-delay timer < the MLAG timer, so that:

Upstream/L3 comes up first, before downstream/L2.
In this way, S-N traffic should see no loss.

With "reload-delay mode lacp standby" enabled, the non-MLAG timer needs to be >= the MLAG timer:

MLAG interfaces with LACP are kept warm for LAG membership table and MAC table programming.
But upstream/L3 must come up after the L2/downstream/MLAG interfaces, otherwise S-N traffic is blackholed.
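
The two timer rules above are easy to mix up, so here is a small Python check that encodes them. It only restates the guidance in these notes; the function name and the example timer values are hypothetical, and it assumes the all-L3-uplinks case.

```python
def reload_delay_follows_guidance(mlag_delay_s, non_mlag_delay_s, lacp_standby):
    """Check reload-delay timers against the AD1554 guidance noted above.

    lacp_standby: True if "reload-delay mode lacp standby" is enabled.
    """
    if lacp_standby:
        # Uplinks must come up no earlier than the warm MLAG interfaces.
        return non_mlag_delay_s >= mlag_delay_s
    # Default mode: upstream/L3 first, then downstream/L2.
    return non_mlag_delay_s < mlag_delay_s


# Hypothetical timer values, in seconds.
print(reload_delay_follows_guidance(300, 240, lacp_standby=False))  # True
print(reload_delay_follows_guidance(300, 240, lacp_standby=True))   # False
```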

From AD3152:

SandL3Unicast - manages NH and ensures the EEDB does not change
SandACL - programs a DROP ACL on the peerLinkRecircPort to prevent packets from the peer link going back to the peer link
Assigning the LagMemberID is interesting: this peerLinkRecircPort needs a member ID. What about overflow?
LAG member, C/D bit

C = Collecting, D = Distributing
For a static LAG, C/D=True and the member is added to the LAG.
If LACP is enabled, a member can be added only when C=True and D=True.
peerLinkRecircPort is always C/D=False/True.
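
A minimal Python sketch of the C/D rules above, for regular member ports only. The function name is made up; the notes do not explain how the peerLinkRecircPort itself is admitted with C=False, so it is recorded here as a constant and not run through the same check.

```python
# (collecting, distributing) for the recirc port, as stated in the notes.
PEER_LINK_RECIRC_CD = (False, True)


def can_join_lag(collecting, distributing, lacp_enabled):
    """Decide whether a regular member port can be added to the LAG."""
    if not lacp_enabled:
        # Static LAG: C/D are treated as True, so the member is always added.
        return True
    # LACP: both Collecting and Distributing must be set.
    return collecting and distributing
```

For example, an LACP member that is Collecting but not yet Distributing is not added, while the same port in a static LAG is.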

LC removal event

If all member ports are on this LC:

L3 will have some downtime since all ARP entries are lost;
L2 should be fine once the recirc port is programmed.

If at least 1 member is on another LC, L2/L3 should be fine.
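
The two LC-removal cases can be written down as a tiny Python helper. This only restates the expectations above; the function name, the "slot/port" string format and the returned wording are assumptions for illustration.

```python
def lc_removal_impact(member_ports, removed_lc):
    """Summarise the expected L2/L3 impact of pulling one linecard.

    member_ports: local MLAG member ports as "slot/port" strings, e.g. "3/1".
    removed_lc:   slot number of the linecard being removed.
    """
    survivors = [p for p in member_ports if int(p.split("/")[0]) != removed_lc]
    if survivors:
        # At least one member lives on another linecard.
        return {"l2": "fine", "l3": "fine"}
    return {
        "l2": "fine once recirc port is programmed",
        "l3": "some downtime, ARP entries lost",
    }
```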

This is quite complicated!

3/1, 3/36, 4/1 are local members of MLAG Po 10.
All 3 ports down: all 3 members are retained with C/D=False.
peerLinkRecircFap is added from either 3/0, 3/2 or 4/0.
If LC3 is pulled, peerLinkRecircFap 4/0 is added.
And there are a lot of combinations of events: LAG config change, member port down...

4 events:

LC removal
LAG config change: unconfig and change config
Member port down or cable unplug
port-ch shutdown

CLIs:

show plat trident counter int e27/1
Drop counts in Vlan boundary = VLAN ID missed. In other words, the VLAN ID programming on this interface is not done yet.

Reference:

PeerOne Vxlan + MLAG
AD3398, BG141435/96642
AD3152
AD1554
