Best Practices for Small-Scale AI Compute Backend Fabric
Preface

This guide introduces the standardized networking solution for a small-scale AI computing backend fabric, together with configuration guidance and a maintenance manual. The solution implements a single-tier Clos network using Asterfusion data center switches, based on the Rail-only architecture.

Target Audience

Intended for solution planners, designers, and on-site implementation engineers who are familiar with Asterfusion data center switches, RoCE, PFC, ECN, and related technologies.

Overview

The Rail-only architecture is an ideal design for a small-scale AI backend fabric. As shown in the figure above, the Rail-only architecture adopts a single-tier network design, physically partitioning the entire cluster network into 8 independent rails. Communication between GPUs on different nodes stays within a rail, achieving single-hop connectivity.

Compared to the traditional Clos architecture, the Rail-only architecture eliminates the spine layer. By reducing network tiers, it saves on the number of switches and optical modules, thereby reducing hardware costs. It is a low-cost, high-performance network architecture specifically tailored for large AI model training in small-scale compute clusters.

Typical Configuration Example

Network Topology

This example illustrates an AI cluster consisting of 32 compute nodes (128 GPUs total, 4 per server), with 4 CX732Q-N switches deployed as leaf nodes. The key design principles are summarized as follows:

- Each GPU connects to a dedicated NIC; NICs follow the "NIC N to Leaf N" rule.
- Independent subnets per rail.
- Single-tier Clos architecture.
- Easy RoCE enabled on leaf switches.

The gateway VLAN IP address planning is as follows:

Table 1. Gateway VLAN IP address planning

| Device name | VLAN | Gateway IP address |
| --- | --- | --- |
| Leaf1 | 101 | 10.10.1.1/26 |
| Leaf2 | 102 | 10.10.1.65/26 |
| Leaf3 | 103 | 10.10.1.129/26 |
| Leaf4 | 104 | 10.10.1.193/26 |
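The address plan in Table 1 can be reproduced programmatically, which helps when checking that each rail subnet has room for all 32 nodes. The following is a minimal sketch using Python's standard `ipaddress` module. It assumes the four rail subnets are carved out of 10.10.1.0/24 (implied by the /26 gateways, not stated explicitly) and that the gateway is the first usable address; the function and constant names are hypothetical.

```python
import ipaddress

# Assumption: the rail subnets come from 10.10.1.0/24, as the /26 gateways
# in Table 1 imply. VLAN IDs start at 101, matching Table 1.
SUPERNET = ipaddress.ip_network("10.10.1.0/24")
FIRST_VLAN = 101
NODES_PER_RAIL = 32  # one NIC per compute node on each rail


def plan_rails(supernet=SUPERNET, prefix=26):
    """Return one dict per rail: leaf name, VLAN ID, gateway, NIC capacity."""
    plan = []
    for index, subnet in enumerate(supernet.subnets(new_prefix=prefix)):
        gateway = subnet.network_address + 1  # first usable address
        plan.append({
            "leaf": f"leaf{index + 1}",
            "vlan": FIRST_VLAN + index,
            "gateway": f"{gateway}/{subnet.prefixlen}",
            # exclude network address, broadcast address, and the gateway
            "nic_addresses": subnet.num_addresses - 3,
        })
    return plan


if __name__ == "__main__":
    for rail in plan_rails():
        print(rail["leaf"], rail["vlan"], rail["gateway"], rail["nic_addresses"])
```

Each /26 leaves 61 addresses for NICs after the gateway is reserved, so a rail with 32 nodes fits with room to spare.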
Configuration Overview

Table 2. Configuration overview

| Task | Configuration roadmap |
| --- | --- |
| Configure leaf switch | (Optional) Configure NIC-side interface breakout; configure gateway VLAN and IP address; enable Easy RoCE |

Configuring Leaf Switches

(Optional) Configure NIC-Side Interface Breakout

When connecting 400G NICs to CX864E-N switches, split each 800G port into two 400G interfaces.

Table 3. Interface breakout configuration

| Step | Leaf1 |
| --- | --- |
| Enter global configuration mode | `configure terminal` |
| Configure breakout for 800G interfaces | `interface range ethernet 0/0-0/504` |
| | `breakout 2x400G[200G]` |
| | `!` |
| If the current version does not support batch configuration | `interface ethernet 0/0` |
| | `breakout 2x400G[200G]` |
| | `!` |

After completing the configuration, verify the interface status using the `show interface summary` command.

Configure Gateway VLAN and IP Address

Table 4. Configuring VLAN and interface IP addresses

| Step | Leaf1 |
| --- | --- |
| Configure hostname | `hostname leaf1` |
| Enter global configuration mode | `configure terminal` |
| Create gateway VLAN and configure IP | `vlan 101` |
| | `!` |
| | `interface vlan 101` |
| | `ip address 10.10.1.1/26` |
| | `exit` |
| | `!` |
| Add interfaces to the VLAN | `interface range ethernet 0/0-0/248` |
| | `switchport access vlan 101` |
| | `!` |
| If the current version does not support batch configuration | `interface ethernet 0/0` |
| | `switchport access vlan 101` |
| | `!` |
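When batch configuration is unavailable, the per-interface fallback has to be repeated for every port, which is tedious to type by hand. The sketch below generates those lines for one leaf. It assumes front-panel ports are named `0/0`, `0/8`, ... with a stride of 8 (inferred from 32 ports ending at `0/248`, not confirmed by this guide); the helper names are hypothetical, and the actual interface names should be checked with `show interface summary` before use.

```python
def expand_range(first, last, stride=8):
    """List interface names from 0/<first> to 0/<last>.

    The stride of 8 is an assumption about port numbering, not a
    documented fact; adjust it to match the names shown on the switch.
    """
    return [f"0/{number}" for number in range(first, last + 1, stride)]


def per_interface_vlan_config(vlan_id, first=0, last=248, stride=8):
    """Build the non-batch form of 'add interfaces to the VLAN'."""
    lines = []
    for name in expand_range(first, last, stride):
        lines.append(f"interface ethernet {name}")
        lines.append(f"switchport access vlan {vlan_id}")
        lines.append("!")
    return lines


if __name__ == "__main__":
    print("\n".join(per_interface_vlan_config(101)))
```

The same `expand_range` helper can be pointed at `0/0` through `0/504` to produce the per-port breakout commands.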
Verify VLAN configuration using the `show vlan summary` command.

Enable Easy RoCE

The CX-N series switches support queues 0 to 7 (8 queues in total). Queue 3 and queue 4 are lossless (up to two lossless queues are supported), while the others are lossy. The default template uses the system default DSCP mapping. PFC and ECN are enabled for queue 3 and queue 4, and strict priority (SP) scheduling is set for queues 6 and 7.

When creating a template, you can specify three parameters:

- `cable-length`: Specifies the cable length, which affects PFC and ECN parameter calculations. Options: 5m / 40m / 100m / 300m. If the exact length is unavailable, choose the closest value (e.g., choose 5m for a 10m cable).
- `incast-level`: Specifies the traffic incast model, which affects PFC parameter calculation. Options: low (e.g., 1:1) / medium (e.g., 3:1) / high (e.g., 10:1). `low` is typically used for GPU backend fabric.
- `traffic-model`: Specifies the business type (throughput-sensitive, latency-sensitive, or balanced), which affects ECN parameter calculation. Options: throughput / latency / balance. `balance` and `throughput` are typically used for GPU backend fabric.

If the provided lossless RoCE configuration does not fully suit your scenario, refer to the linked document (docid xdrtk4hdivmfb0cavs04w) for fine-tuning.

Table 5. Enabling Easy RoCE

| Step | Leaf1 |
| --- | --- |
| (Optional) Modify lossless queues; requires save and reload to take effect | `no priority-flow-control enable 3` |
| | `no priority-flow-control enable 4` |
| | `priority-flow-control enable <queue-id>` |
| | `write` |
| | `reload` |
| Select Easy RoCE template and apply to all interfaces | `qos roce lossless cable-length 5m incast-level low traffic-model throughput` |
| | `qos service-policy roce_lossless_5m_low_throughput` |

Verify RoCE configuration using the `show qos roce` command.

Maintenance

RoCE Parameter Adjustment/Optimization

When default configurations are insufficient, use the following commands to optimize performance.

Modify DSCP Mapping

Table 6. Modifying DSCP
mapping

| Step | Command |
| --- | --- |
| Check running config for DSCP map name | `show running-config` |
| Enter global configuration mode | `configure terminal` |
| Enter DSCP map configuration view | `diffserv-map type ip-dscp roce_lossless_diffserv_map` |
| Map specific DSCP to COS value | `ip-dscp <dscp-value> cos <cos-value>` |
| Map all DSCP to a default COS | `default <cos-value>` |
| Use system default DSCP mapping | `default copy` |

Note: The COS value represents the queue ID the packet is mapped to.

Modify Queue Scheduling Policy

If the interface has been bound to a lossless RoCE policy, unbind it before modifying.

Table 7. Modifying queue scheduling policy

| Step | Command |
| --- | --- |
| Check running config for policy name | `show running-config` |
| Enter global configuration mode | `configure terminal` |
| Enter lossless RoCE policy view | `policy-map roce_lossless_name` |
| Configure SP mode scheduling | `queue-scheduler priority queue <queue-id>` |
| Configure DWRR mode scheduling | `queue-scheduler queue-limit percent <queue-weight> queue <queue-id>` |

Adjust PFC and ECN Thresholds

ECN thresholds are adjusted via `min_th`, `max_th`, and `probability`:

- `min_th` sets the lower absolute value for ECN marking (bytes).
- `max_th` sets the upper absolute value for ECN marking (bytes).
- `probability` sets the maximum marking probability [1-100].

PFC thresholds are adjusted via the dynamic threshold coefficient `dynamic_th`:

$$\text{PFC threshold} = 2^{\text{dynamic\_th}} \times \text{remaining available buffer}$$

Other parameters can remain unchanged during modification.

Recommended values for CX864E-N:

| Parameter | Recommended values |
| --- | --- |
| PFC dynamic_th | 1, 2, 3 |
| WRED min (bytes) | 1,000,000 / 2,000,000 / 3,000,000 |
| WRED max (bytes) | 8,000,000 / 10,000,000 / 12,000,000 |
| WRED probability (%) | 10, 30, 50, 70, 90 |

Recommended values for other models:

| Parameter | Recommended values |
| --- | --- |
| PFC dynamic_th | 1, 2, 3 |
| WRED min (bytes) | 1,000,000 / 2,000,000 / 3,000,000 |
| WRED max (bytes) | 4,000,000 / 5,000,000 / 6,000,000 |
| WRED probability (%) | 10, 30, 50, 70, 90 |

Note: Try ECN adjustment first, then PFC. You can follow
the principle `WRED min < WRED max < PFC XON < PFC XOFF`. This ensures that ECN triggers rate adjustment early during congestion to avoid unnecessary PFC, while still allowing PFC to trigger promptly when necessary to prevent packet loss.

Table 8. Adjusting PFC and ECN thresholds

| Step | Command |
| --- | --- |
| Get WRED and buffer template names | `show running-config` |
| Enter global configuration mode | `configure terminal` |
| Enter ECN configuration view | `wred roce_lossless_ecn` |
| Adjust ECN thresholds | `mode ecn gmin <min_th> gmax <max_th> gprobability <probability>` |
| Enter PFC configuration view | `buffer-profile roce_lossless_profile` |
| Adjust PFC thresholds | `mode lossless dynamic <dynamic_th> size <size> xoff <xoff> xon-offset <xon-offset>` |

Common O&M Commands

Interface Status Maintenance

Table 9. Interface status information

| Step | Command |
| --- | --- |
| View interface status | `show interface summary` |
| View L3 interface IP and status | `show ip interfaces` |
| View VLAN configuration | `show vlan summary` |
| View interface counters | `show counters interface` |

Common Table Entry Maintenance

Table 10. Common table entries

| Step | Command |
| --- | --- |
| View LLDP neighbors | `show lldp neighbor { summary \| interface <interface-name> }` |
| View local MAC address table | `show mac-address` |
| View local ARP table | `show arp` |

RoCE Statistics Maintenance

Table 11. RoCE statistics

| Step | Command |
| --- | --- |
| View RoCE configuration | `show qos roce [ all \| summary \| <roce-profile-name> ]` |
| View interface policy bindings | `show interface policy-map` |
| View RoCE statistics by queue | `show counters qos roce interface ethernet <interface-name> queue <queue-id>` |
| Clear all RoCE counters | `clear counters qos roce` |
| View PFC counters | `show counters priority-flow-control` |
| View ECN counters | `show counters ecn` |

Appendix: Configuration Files (Sample)

Leaf1

    !
    hostname leaf1
    !
    interface loopback 0
    ip address 10.1.0.111/32
    !
    interface vlan 101
    ip address 10.10.1.1/26
    exit
    !
    interface range ethernet 0/0-0/248
    switchport access vlan 101
    !
    qos roce lossless cable-length 5m incast-level low traffic-model throughput
    qos service-policy roce_lossless_5m_low_throughput
    !

Leaf2

    !
    hostname leaf2
    !
    interface loopback 0
    ip address 10.1.0.112/32
    !
    interface vlan 102
    ip address 10.10.1.65/26
    exit
    !
    interface range ethernet 0/0-0/248
    switchport access vlan 102
    !
    qos roce lossless cable-length 5m incast-level low traffic-model throughput
    qos service-policy roce_lossless_5m_low_throughput
    !

Leaf3

    !
    hostname leaf3
    !
    interface loopback 0
    ip address 10.1.0.113/32
    !
    interface vlan 103
    ip address 10.10.1.129/26
    exit
    !
    interface range ethernet 0/0-0/248
    switchport access vlan 103
    !
    qos roce lossless cable-length 5m incast-level low traffic-model throughput
    qos service-policy roce_lossless_5m_low_throughput
    !

Leaf4

    !
    hostname leaf4
    !
    interface loopback 0
    ip address 10.1.0.114/32
    !
    interface vlan 104
    ip address 10.10.1.193/26
    exit
    !
    interface range ethernet 0/0-0/248
    switchport access vlan 104
    !
    qos roce lossless cable-length 5m incast-level low traffic-model throughput
    qos service-policy roce_lossless_5m_low_throughput
    !
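The four sample files differ only in hostname, loopback address, VLAN ID, and gateway address, so they can be rendered from a single template. The following is a minimal sketch of that idea; the `LEAVES` table, `TEMPLATE`, and `render_leaf` are hypothetical names, and the hyphenated and underscored command spellings in the template are written as this sketch assumes the CLI expects them, so check them against the switch before applying the output.

```python
# Per-leaf values copied from the sample configuration files above.
LEAVES = [
    {"name": "leaf1", "loopback": "10.1.0.111/32", "vlan": 101, "gateway": "10.10.1.1/26"},
    {"name": "leaf2", "loopback": "10.1.0.112/32", "vlan": 102, "gateway": "10.10.1.65/26"},
    {"name": "leaf3", "loopback": "10.1.0.113/32", "vlan": 103, "gateway": "10.10.1.129/26"},
    {"name": "leaf4", "loopback": "10.1.0.114/32", "vlan": 104, "gateway": "10.10.1.193/26"},
]

# Assumed command spellings; only the four placeholders vary per leaf.
TEMPLATE = "\n".join([
    "!",
    "hostname {name}",
    "!",
    "interface loopback 0",
    "ip address {loopback}",
    "!",
    "interface vlan {vlan}",
    "ip address {gateway}",
    "exit",
    "!",
    "interface range ethernet 0/0-0/248",
    "switchport access vlan {vlan}",
    "!",
    "qos roce lossless cable-length 5m incast-level low traffic-model throughput",
    "qos service-policy roce_lossless_5m_low_throughput",
    "!",
])


def render_leaf(leaf):
    """Fill the shared template with one leaf's values."""
    return TEMPLATE.format(**leaf)


if __name__ == "__main__":
    for leaf in LEAVES:
        print(f"--- {leaf['name']} ---")
        print(render_leaf(leaf))
```

Keeping the per-leaf values in one table makes it easier to spot a mismatched VLAN or gateway than comparing four hand-edited files.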

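The threshold rules from the Adjust PFC and ECN Thresholds section can be sanity-checked before a change is applied. The sketch below evaluates the PFC threshold formula and the ordering principle for a candidate set of values. The function names are hypothetical, and the remaining-buffer, XON, and XOFF figures in the usage lines are made-up illustrations, not recommended settings.

```python
def pfc_threshold(dynamic_th, remaining_buffer_bytes):
    """PFC threshold = 2^dynamic_th x remaining available buffer."""
    return (2 ** dynamic_th) * remaining_buffer_bytes


def ordering_ok(wred_min, wred_max, pfc_xon, pfc_xoff):
    """Check the principle WRED min < WRED max < PFC XON < PFC XOFF."""
    return wred_min < wred_max < pfc_xon < pfc_xoff


if __name__ == "__main__":
    # Illustrative remaining buffer of 4,000,000 bytes (an assumption).
    for dynamic_th in (1, 2, 3):
        print(dynamic_th, pfc_threshold(dynamic_th, 4_000_000))

    # WRED values from the CX864E-N recommendations; the XON and XOFF
    # values are hypothetical and only demonstrate the ordering check.
    print(ordering_ok(1_000_000, 8_000_000, 9_000_000, 10_000_000))
```

Because the threshold scales with the buffer that is still free, raising `dynamic_th` by one doubles the threshold for the same remaining buffer.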