Problem Description:
Consider the following setup
- mlagA and mlagB are 2 MLAG peers, with port-ch 2000 as the peer-link;
- The host MAC - 0000:1111:2222 is learnt on MLAG 10 and A is the owner.
- In other words, MAC 0000:1111:2222 is a local MAC on A and a remote (peer) MAC on B.
- A bit of background:
- The MAC address and ARP information are all sync'ed during boot-up;
- After that, only the MAC table is sync'ed. For example, A tells B that MAC a.b.c is learnt from mlag po10, or from a single (non-mlag) interface, or from a remote vtep.
- So in the MAC table, a MAC has at least 4 states:
- learnedDynamic (local mlag),
- peerDynamic (remote mlag),
- learnedRemoteDynamic (vxlan)
- peerRemoteDynamic (remote vxlan)
- Get this information with the command: show mac address mlag-peer
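To make the four states easier to keep apart, here is a minimal Python sketch of them. It is only an illustrative model, not EOS code: the `MacState` enum, `is_peer_owned` helper and the two dictionaries are hypothetical names, and only the state strings come from the notes above.

```python
from enum import Enum


class MacState(Enum):
    """MAC entry states named in the notes (illustrative model only)."""
    LEARNED_DYNAMIC = "learnedDynamic"               # local mlag
    PEER_DYNAMIC = "peerDynamic"                     # remote mlag (owned by the peer)
    LEARNED_REMOTE_DYNAMIC = "learnedRemoteDynamic"  # vxlan
    PEER_REMOTE_DYNAMIC = "peerRemoteDynamic"        # remote vxlan (owned by the peer)


def is_peer_owned(state: MacState) -> bool:
    """True when the entry was synced from the MLAG peer rather than learnt locally."""
    return state in (MacState.PEER_DYNAMIC, MacState.PEER_REMOTE_DYNAMIC)


# The example host from the setup: A is the owner, B only has the synced copy.
mac_table_a = {"0000:1111:2222": MacState.LEARNED_DYNAMIC}
mac_table_b = {"0000:1111:2222": MacState.PEER_DYNAMIC}
```

In this model, the same host MAC is locally owned in A's table and peer-owned in B's table, which is the A-local / B-remote relationship described in the setup.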

Now, say we have:
- link failure, which causes 2 loss events: link down and link up
- when B's po 10 is down, all MACs are re-programmed from po 10 to po 2000; in old releases before 4.18.1F this MAC move is done one by one. (#1)
- when B's po 10 is back up, the ACL on the peerLink blocks BUM traffic immediately to break the L2 loop, while the MAC move still needs time to complete. (#2)
- node failure, which introduces 3 loss events: node down, node up and reload-delay timeout
- node down, hundreds of msec of loss depending on scale
- node up, hundreds of msec of loss (#3)
- Why is there loss when the peerlink comes up? At this time, peerB has no uplink or downlink up; they are all still in reload-delay.
- Remember the MAC sync mentioned above? A needs to sync up with B on the MAC table, so on A the MACs learnt from B will be flushed!!
- A has to relearn these MACs, 50% all of a sudden. Still ok for locally switched packets because the hw flooding kicks in.
- But bad for Vxlan, which requires software flood for the head-end-replication.
- reload-delay timeout, hundreds of msec of loss, actually 2 times
- Need to have iBGP or IGP L3 routing between the 2 peers.
- The peerlink comes up fast, much earlier than the mlag/non-mlag interfaces. So when the non-mlag or mlag interfaces come up, they can send traffic to the peer link before the optimal path has converged.
Feature and solution
So from 4.18F, a feature called MLAG fast MAC redirection was developed to address the above issues. This feature has 2 aspects:
1. MAC redirect, for #1 and #2 loss
- With this feature, the interface attribute of the impacted MACs still points to MLAG po 10 in the host table. So, no move at all.
- Strata and Sand implementation are slightly different but same idea.
- On Sand, it is to use a recirc channel on each Arad/Jericho chip to recycle the MLAG destined packets over to peer-link.
- Requirements and limitations:
- On Strata, the peerlink must be a LAG, not an Ethernet interface.
- On Sand, "platform sand lag hardware-only" must be enabled; I believe only a hw LAG can share a member port, i.e. the recirc channel.
- MLAG ASU2 cannot co-exist.
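The difference between the old MAC move and the redirect idea can be sketched with a toy Python model. This is an assumption-laden illustration, not how Strata or Sand actually program hardware: the function names, the `redirects` dictionary and the port strings are all hypothetical. It only shows why a per-MAC rewrite scales with the number of MACs while a port-level redirect is a single update.

```python
def mac_move(mac_table, failed_port, peer_link):
    """Old behaviour (toy model): rewrite every MAC entry that points at the failed port."""
    writes = 0
    for mac, port in list(mac_table.items()):
        if port == failed_port:
            mac_table[mac] = peer_link
            writes += 1
    return writes


def mac_redirect(redirects, failed_port, peer_link):
    """Redirect idea (toy model): leave the MAC entries alone, add one port-level redirect."""
    redirects[failed_port] = peer_link
    return 1


def egress_port(mac_table, redirects, mac):
    """Resolve the port a frame leaves on, honouring any redirect on that port."""
    port = mac_table[mac]
    return redirects.get(port, port)


# Three hosts behind MLAG Po10, one on a non-MLAG port.
table = {"aa": "Po10", "bb": "Po10", "cc": "Po10", "dd": "Et1"}

moved = dict(table)
writes_with_move = mac_move(moved, "Po10", "Po2000")

redirects = {}
writes_with_redirect = mac_redirect(redirects, "Po10", "Po2000")
```

Here `writes_with_move` grows with the number of MACs on the failed port-channel, while `writes_with_redirect` stays at one and the MAC entries in `table` still point at Po10, matching the "no move at all" point above.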
- When 1) the peer reboots or 2) the fwding plane does a hitful restart, the remote MACs (learnt via the peer switch) will be flushed, which causes
- 1) before the MAC is re-learnt, packets need to be flooded. Still ok in a pure L2/L3 environment because it is done by hw;
- 2) software forwarding of Vxlan packets. That's a big issue because it results in drops by CoPP.
- Why are the MACs flushed?
- when peerB is up, MACs are sync'ed from A to B. These remote MACs are flushed. No MAC, then flooding.
- Solution:
- when peerB is down, the ownership of its MACs is transferred to peerA
- when peerB recovers, *ALL* MAC are sync'ed from A to B
- Details:
- when peerB is down, peerA enters failover state;
- peerDynamic and peerLearnedRemote entries are changed to learnedDynamic and learnedRemoteDynamic respectively
- NOT single-leg host
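The ownership transfer in the details above can be sketched as a small state conversion. This is a hypothetical Python model, not the EOS implementation: `FAILOVER_MAP`, `enter_failover` and `single_leg_macs` are made-up names, and reading "NOT single-leg host" as "MACs of hosts attached only to the failed peer are left untouched" is my assumption.

```python
# State names are taken from the notes; the mapping structure itself is illustrative.
FAILOVER_MAP = {
    "peerDynamic": "learnedDynamic",
    "peerLearnedRemote": "learnedRemoteDynamic",
}


def enter_failover(mac_table, single_leg_macs=frozenset()):
    """Return a new table in which peer-owned entries become locally owned.

    Assumption: MACs listed in single_leg_macs (hosts attached only to the
    failed peer) are not taken over and keep their existing state.
    """
    new_table = {}
    for mac, state in mac_table.items():
        if mac in single_leg_macs:
            new_table[mac] = state
        else:
            new_table[mac] = FAILOVER_MAP.get(state, state)
    return new_table


# peerA's view before peerB goes down.
before = {
    "0000:aaaa:0001": "peerDynamic",
    "0000:aaaa:0002": "peerLearnedRemote",
    "0000:aaaa:0003": "learnedDynamic",
    "0000:aaaa:0004": "peerDynamic",  # single-leg host behind peerB only
}
after = enter_failover(before, single_leg_macs={"0000:aaaa:0004"})
```

Because peerA now owns these entries, nothing has to be flushed when peerB recovers; A simply syncs the whole table back to B.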
Misc:
- From AD1554:
- If all uplinks are L3 interfaces, then it is preferable to keep non-mlag reload-delay timer < mlag timer, so
- Upstream/L3 up first before downstream/L2;
- In this way, S-N traffic should be no loss.
- With "reload-delay mode lacp standby" enabled, the non-mlag timer needs to be >= the mlag timer,
- MLAG interfaces with LACP are kept warm for LAG membership table, MAC table programming.
- But upstream/L3 must come up after the L2/downstream/mlag interfaces, otherwise S/N traffic is blackholed.
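The two timer rules from AD1554 above can be captured in a tiny sanity check. This is a sketch in Python with a hypothetical function name and arbitrary example values in seconds; it only encodes the ordering stated in the notes and does not read any real switch configuration.

```python
def reload_delay_order_ok(mlag_timer, non_mlag_timer, lacp_standby=False):
    """Check the timer ordering described in the notes.

    - default mode, all uplinks L3: non-mlag timer < mlag timer
      (upstream/L3 comes up before downstream/L2)
    - "reload-delay mode lacp standby": non-mlag timer >= mlag timer
      (upstream/L3 comes up after the mlag interfaces)
    """
    if lacp_standby:
        return non_mlag_timer >= mlag_timer
    return non_mlag_timer < mlag_timer


# Example values only; pick timers that suit the platform and scale.
default_mode_ok = reload_delay_order_ok(mlag_timer=300, non_mlag_timer=240)
standby_mode_ok = reload_delay_order_ok(mlag_timer=300, non_mlag_timer=330, lacp_standby=True)
standby_mode_bad = reload_delay_order_ok(mlag_timer=300, non_mlag_timer=240, lacp_standby=True)
```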
- From AD3152:
- SandL3Unicast - managing NH and ensuring EEDB no change
- SandACL - programming a DROP ACL on the peerLinkRecircPort to avoid pkts from the peerLink going back to the peerLink
- Assigning LagMemberID is interesting: this peerLinkRecircPort needs a member id. What about overflow?
- LAG member, C/D bit
- C = collecting, D = Distributing
- if static LAG, C/D=True, added to LAG
- if LACP is enabled, a member can be added only when C=True and D=True.
- peerLinkRecircPort is always C/D=False/True
- LC removal event
- If all member ports are on this LC,
- L3 will have some downtime since losing all ARP entries;
- L2 should be fine once recirc port is programmed.
- If at least 1 member on another LC, L2/L3 should be fine
- This is quite complicated!
- 3/1, 3/36, 4/1 are local members of mlag Po 10
- all 3 ports down, all 3 members retained with C/D=False
- PeerLinkRecircFap is added from either 3/0, 3/2 or 4/0
- If LC3 is pulled, peerLinkRecircFap 4/0 is added.
- And a lot of combinations of events: LAG config change, member port down...
- 4 events:
- LC removal
- LAG config change: unconfig and change config
- Member port down or cable unplug
- port-ch shutdown
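The C/D rule for LAG members listed earlier can be written down as a small predicate. This is a hedged Python sketch with hypothetical names (`can_join_lag`, `PEER_LINK_RECIRC_CD`); it only restates the rule from the notes and says nothing about how the recirc port is actually admitted to the LAG in hardware.

```python
def can_join_lag(collecting, distributing, lacp_enabled):
    """Membership rule as stated in the notes.

    - static LAG: C/D are treated as True, so the member is added
    - LACP: the member is added only when both C and D are True
    """
    if not lacp_enabled:
        return True
    return collecting and distributing


# From the notes: the peerLinkRecircPort is always C/D = False/True.
PEER_LINK_RECIRC_CD = (False, True)

static_member = can_join_lag(collecting=False, distributing=False, lacp_enabled=False)
lacp_up_member = can_join_lag(collecting=True, distributing=True, lacp_enabled=True)
lacp_down_member = can_join_lag(collecting=False, distributing=False, lacp_enabled=True)
```

A member port that is down but retained with C/D=False, as in the LC removal example, fails the LACP check in this model, which is the situation where traffic has to go via the recirc port instead.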
- CLIs:
- show plat trident counter int e27/1
- Drop counts in Vlan boundary = Vlan ID missed. In other words, the VLAN id programming on this interface is not done yet.
- PeerOne Vxlan + MLAG
- AD3398, BG141435/96642
- AD3152
- AD1554