MPLS                                                           A. Mahale
Internet-Draft                                          Cerebras Systems
Intended status: Informational                               K. Kompella
Expires: 23 April 2026                                      V. P. Beeram
                                                        Juniper Networks
                                                                D. Patel
                                                                     AMD
                                                         20 October 2025


                         MPLS for AIDC Probing
                     draft-amahale-mpls-for-aidc-00

Abstract

   This document describes a method for using Multi-Protocol Label
   Switching (MPLS) encapsulation to perform scalable and
   vendor-agnostic network probing within AI/ML data centers. The goal
   is to detect and isolate gray failures (non-deterministic hardware
   and software faults) in large-scale lossless networks. The approach
   enables targeted probing at per-link and per-node granularity,
   independent of IP/BGP control plane operation, and is extensible to
   various Clos topologies.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 23 April 2026.

Copyright Notice

   Copyright (c) 2025 IETF Trust and the persons identified as the
   document authors. All rights reserved.

Mahale, et al.            Expires 23 April 2026                 [Page 1]

Internet-Draft                MPLS for AIDC                 October 2025

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Revised
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Problem Statement: Gray Failures
   3.  Network Probing Overview
     3.1.  End-to-End Probing
     3.2.  Direct Network Probing
   4.  MPLS Operations for Probing in AIDC
     4.1.  MPLS Refresher
     4.2.  MPLS Operations and Label Stack Use
     4.3.  MPLS Myths and Clarifications
   5.  MPLS for AI Cluster Network Probing
     5.1.  Topology Considerations
     5.2.  Scaling and Label Allocation
     5.3.  Failure Correlation and Control Plane Independence
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgments
   9.  References
   Authors' Addresses

1.  Introduction

   The advent of large-scale AI/ML data centers and the adoption of
   lossless networking paradigms have increased the operational risk of
   gray failures: partial, intermittent, or non-deterministic faults in
   network components. These failures are notoriously difficult to
   detect and isolate, especially in high-performance environments that
   rely on flow-control and congestion-control mechanisms such as
   Priority Flow Control (PFC), Explicit Congestion Notification (ECN),
   and Data Center Quantized Congestion Notification (DCQCN).

   This document proposes a vendor-agnostic probing mechanism
   leveraging MPLS encapsulation to detect gray failures.
   The technique provides deterministic path visibility and decouples
   the probing infrastructure from the routing control plane.

2.  Problem Statement: Gray Failures

   Gray failures are partial or intermittent faults in network
   components such as switches, routers, NICs, optics, and cables that
   do not manifest as complete outages. These failures often evade
   monitoring systems and may take hours or days to isolate.

   Traditional white-box monitoring approaches that rely on ASIC error
   counters or register captures are insufficient because the failure
   mode is not deterministic. An ASIC may be unable to detect its own
   fault when a specific functional block wedges or stalls.

   Proactive network probing provides external validation of data-plane
   health by exercising network paths in controlled ways.

3.  Network Probing Overview

   Network probing supplements standard telemetry and monitoring
   systems by introducing synthetic traffic to verify forwarding
   correctness. There are two primary categories of network probing
   mechanisms:

   *  End-to-End Probing

   *  Direct Network Probing

   Prober machines are generally co-located with the AI compute nodes
   and attached to the leaf switches.

3.1.  End-to-End Probing

   End-to-end probing systems, such as [Pingmesh], rely on hosts
   sending probes to all other hosts to verify reachability. While
   effective at identifying connectivity and control-plane issues, this
   approach lacks sufficient entropy to exercise all paths in large
   Clos networks and does not directly map probe failures to specific
   components.

3.2.  Direct Network Probing

   Direct network probing targets intermediate hops and links directly.
   Implementations such as [NetNorad] generate probes toward every
   network node or link from each source. This provides superior fault
   localization but presents challenges in targeting granularity,
   particularly in pure IP networks where multiple encapsulations may
   be required.
   MPLS provides a clean mechanism to express path and target semantics
   through label stacking.

4.  MPLS Operations for Probing in AIDC

4.1.  MPLS Refresher

   An MPLS label stack entry is a 4-byte shim inserted between the
   Ethernet and IP headers. Each entry contains:

   *  20-bit Label field

   *  3-bit Traffic Class (TC) field, used for QoS

   *  1-bit Bottom of Stack (BOS) flag

   *  8-bit Time to Live (TTL)

   MPLS allows multiple labels to be stacked to represent a sequence of
   hops or links, enabling source routing and fine-grained path
   control. The 20-bit label space (1,048,576 possible labels) is
   sufficient for even the largest data centers.

   MPLS probes may also be used to exercise lossless and lossy queues
   by mapping the Traffic Class bits to specific hardware queues.

4.2.  MPLS Operations and Label Stack Use

   The POP operation is central to MPLS probing. Each network element
   that receives a packet with a label stack removes the top label and
   forwards the remaining packet out the link bound to that label. The
   next hop then acts on the new top label, or on the IP header once
   the stack is empty.

   This enables hierarchical targeting where each label represents a
   link or node. A single probe packet can traverse multiple layers of
   the Clos fabric.

4.3.  MPLS Myths and Clarifications

   _MPLS needs an additional protocol to function._

   MPLS is an encapsulation, not a routing protocol. Labels MAY be
   distributed using dynamic protocols such as LDP, RSVP-TE, or BGP,
   but they MAY also be configured statically. Static assignment is
   sufficient for probing use cases.

   _MPLS is complex to implement in ASICs._

   An MPLS lookup is simpler than an IPv4/IPv6 longest-prefix-match
   (LPM) lookup, as it is an exact match on a fixed-width 20-bit key.
   Modern ASICs implement MPLS forwarding efficiently using hash-based
   tables.

   _Network ASICs cannot handle multiple labels._

   Most modern data center ASICs support 8 or more MPLS labels in the
   stack, sufficient for multi-stage Clos topologies. The Stage-2
   probes described in Section 5.1 need only four.
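
   As an illustration of the entry layout in Section 4.1 and the POP
   operation in Section 4.2, the following Python sketch packs a label
   stack into 4-byte entries and pops them one at a time. The names
   LabelStackEntry, encode_stack, and pop, as well as the default TTL
   of 64, are assumptions made for this example and are not defined by
   this document.

```python
import struct
from typing import List, NamedTuple, Tuple


class LabelStackEntry(NamedTuple):
    """One 4-byte MPLS label stack entry (hypothetical helper type)."""
    label: int  # 20-bit label value
    tc: int     # 3-bit Traffic Class
    bos: bool   # Bottom of Stack flag
    ttl: int    # 8-bit Time to Live


def encode_entry(entry: LabelStackEntry) -> bytes:
    """Pack an entry as label(20) | tc(3) | bos(1) | ttl(8), big-endian."""
    if not 0 <= entry.label < (1 << 20):
        raise ValueError("label must fit in 20 bits")
    if not 0 <= entry.tc < (1 << 3):
        raise ValueError("traffic class must fit in 3 bits")
    if not 0 <= entry.ttl < (1 << 8):
        raise ValueError("TTL must fit in 8 bits")
    word = (entry.label << 12) | (entry.tc << 9) | (int(entry.bos) << 8) | entry.ttl
    return struct.pack("!I", word)


def decode_entry(raw: bytes) -> LabelStackEntry:
    """Unpack the first 4 bytes of raw into an entry."""
    (word,) = struct.unpack("!I", raw[:4])
    return LabelStackEntry(
        label=word >> 12,
        tc=(word >> 9) & 0x7,
        bos=bool((word >> 8) & 0x1),
        ttl=word & 0xFF,
    )


def encode_stack(labels: List[int], tc: int = 0, ttl: int = 64) -> bytes:
    """Encode labels top-first; only the last entry has BOS set."""
    if not labels:
        raise ValueError("a label stack needs at least one label")
    last = len(labels) - 1
    return b"".join(
        encode_entry(LabelStackEntry(label, tc, index == last, ttl))
        for index, label in enumerate(labels)
    )


def pop(stack: bytes) -> Tuple[LabelStackEntry, bytes]:
    """Remove the top entry; return it with the remaining bytes."""
    return decode_entry(stack), stack[4:]
```

   For example, encode_stack([20, 30]) yields the 8 bytes
   00 01 40 40 00 01 e1 40: label 20 with BOS clear, followed by
   label 30 with BOS set. Calling pop() on that value returns the
   label-20 entry and leaves the 4-byte label-30 entry as the new top
   of stack.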
   _SRv6 obsoletes MPLS._

   While Segment Routing over IPv6 (SRv6) provides similar
   functionality, it requires IPv6 and adds considerable packet
   overhead and implementation complexity. MPLS offers a lightweight
   and control-plane-agnostic alternative suitable for IPv4-only
   environments.

5.  MPLS for AI Cluster Network Probing

5.1.  Topology Considerations

   Prober to Leaf1 to SP1 to Leaf1 to Prober

   +--------+                     +--------+
   |        ||ETH|30|IP|          |        |
   |  SP1   |                     |  SP2   |
   |        |                     |        |
   |        |                     |        |
   +---+--+-+                     +--+--+--+
    30 |  |40                      30|  |40
       |  +-------------------------+|  |
       ||ETH|IP|                    ||  |
       |  +-------------------------++  |
    20 |  |25                     20|   |25
   +---+--+-+                     +-+---+--+
   |        |                     |        |
   |        |                     |        |
   | LEAF1  |                     | LEAF2  |
   |        |                     |        |
   +----+---+                     +--------+
        |
        ||ETH|20|30|IP|
        ||ETH|IP|
        |
   +----+----+
   |         |
   | PROBER  |
   +---------+

   Prober to Leaf1 to SP2 to Leaf2 to SP2 to Leaf1 to Prober

   +--------+                     +--------+
   |        |                     |        |
   |  SP1   |    |ETH|40|25|30|IP||  SP2   | |ETH|30|IP|
   |        |                     |        |
   |        |                     |        |
   +---+--+-+                     +--+--+--+
    30 |  |40                      30|  |40
       |  +-------------------------+|  |
       |                            ||  |
       |  +-------------------------++  |
    20 |  |25 |ETH|IP|            20|   |25
   +---+--+-+                     +-+---+--+
   |        |                     |        |
   |        |                     |        |
   | LEAF1  |                     | LEAF2  ||ETH|25|30|IP|
   |        |                     |        |
   +----+---+                     +--------+
        |
        ||ETH|25|40|25|30|IP|
        ||ETH|IP|
        |
   +----+----+
   |         |
   | PROBER  |
   +---------+

   In a three-stage Clos topology, MPLS labels are provisioned
   statically for each leaf and spine link. A probe server connected
   to each leaf switch sends labeled packets that describe both the
   forward and return paths. For example:

   *  Stage-1 probes target directly connected leaf-spine links.

   *  Stage-2 probes target remote leaf-spine links.

   Each probe's IP destination is set to the local probe server, so the
   packet returns to the prober once all labels are popped.

   Labels have local significance and may be reused across the
   topology, simplifying configuration.

   Example:

      Label 20: LF-1 to SP-1 link
      Label 30: SP-1 to LF-1 link

   The prober sends a two-label packet (20, 30) for Stage-1 probing.
   LF-1 pops label 20 and forwards the packet, now carrying only label
   30, to SP-1. SP-1 pops label 30 and returns the packet toward LF-1,
   which delivers it to the prober based on the IP destination,
   completing the loop.

   Stage-2 probing uses four-label stacks to test multi-hop paths. A
   complete mesh of connectivity can be established across the entire
   data center fabric.

5.2.  Scaling and Label Allocation

   MPLS provides a 20-bit label space (1,048,576 labels). Even large
   topologies with thousands of leaf-spine interfaces consume only a
   small fraction of the space.

   Label provisioning may be static or automated via controller
   software. A consistent label pattern can be repeated across leaves
   and spines, reducing operational overhead.

   An example label allocation scheme can use a formula such as:

      2 * (m + M * n) + 100

   where m is the leaf index, n is the spine index, and M is the number
   of leaf switches. For instance, with M = 4, leaf index m = 1 and
   spine index n = 2 give 2 * (1 + 4 * 2) + 100 = 118.

5.3.  Failure Correlation and Control Plane Independence

   Because MPLS probing operates entirely in the data plane, it
   continues to function even if the IP or BGP control plane is
   unavailable. End-to-end probing shares fate with routing protocols,
   whereas MPLS probing isolates data-plane verification.

   This separation provides improved resilience and faster failure
   localization in large AI/ML clusters.

6.  Security Considerations

   MPLS probing MUST be isolated from tenant or production traffic.
   Probes SHOULD be rate-limited and authenticated where feasible.
   Incorrect label provisioning can cause unintended forwarding loops
   or leakage into production paths. The MPLS TTL bounds the lifetime
   of any packet caught in such a loop, so it cannot circulate
   indefinitely.

7.  IANA Considerations

   To be added.

8.  Acknowledgments

9.  References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [NetNorad] Facebook Engineering, "NetNorad: End-to-End and Direct
              Probing in Production Networks", 2016.

   [Pingmesh] Microsoft Research, "Pingmesh: A Large-Scale System for
              Data Center Network Latency Measurement", 2015.

Authors' Addresses

   Aditya Mahale
   Cerebras Systems
   Email: aditya.ietf@gmail.com

   Kireeti Kompella
   Juniper Networks
   Email: kireeti.ietf@gmail.com

   Vishnu Pavan Beeram
   Juniper Networks
   Email: vbeeram@juniper.net

   Devang Patel
   AMD
   Email: devang.patel@amd.com