Network Working Group Danny Cohen (Myricom) Internet Draft Craig Lund (Mercury) expires in six months Tony Skjellum (MSU) Thom McMahon (MSU) and Robert George (MSU) February 1997 Proposed Specification for the PacketWay Protocol draft-ietf-pktway-protocol-spec-03.txt expires August 1997 Status of this Memo This document is an independent submission. Comments should be submitted to the PktWay@myri.com mailing list. Distribution of this memo is unlimited. This document is an Internet-Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts. Internet Drafts are draft documents valid for a maximum of six months, and may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material, or to cite them other than as a "working draft" or "work in progress." To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the internet-drafts Shadow Directories on: ftp.is.co.za (Africa) nic.nordu.net (Europe) ds.internic.net (US East Coast) ftp.isi.edu (US West Coast) munnari.oz.au (Pacific Rim) Abstract PacketWay's goal is to move data from a "Source" (a node on a System Area Network) to a "Destination" (another node, probably on another System Area Network) at the high performance available on these SANs. Sources and Destinations can be physical things (a processor or a smart memory board). They can also be "logical" things, such as a group of cooperating processes. 
PktWay-WG <01>                D R A F T                  February 1997


          Proposed Specification for the PacketWay Protocol
          ------------------

     Danny Cohen (Myricom), Craig Lund (Mercury Computers),
     Tony Skjellum (MSU), Thom McMahon (MSU) and Robert George (MSU)


This page....................................1
Cheat-sheet..................................2
Introduction.................................3
Notations....................................4

Part-1: PacketWay EEP Messages.......................5
        PktWay Message Structure.....................5

Part-2: PacketWay RRP Messages......................15
        The Basic Model.............................16
        Node Attributes.............................17

Part-3: PacketWay RRP Message Format................19
        RRP Message sub-types.......................19
        The Structure of RRP messages...............20
        RRP Record Format...........................23
        RRP Message Examples........................26

Appendix-A: Enumerations................................31
Appendix-B: Example of the use of RRP for discovery.....35
Appendix-C: Routing Tables..............................43
Appendix-D: Glossary....................................45
Appendix-E: Acronyms and Abbreviations..................47

Please send your comments re this draft to PktWay@myri.com.
PktWay at a Glance
+-----------------

       2   6                24                    16                16
PW-Hdr+-+------+-------+--------+---------+--------+--------+--------+--------+
   PH1|V|  P   |   Destination-Address    | Type-Extension  |   Packet-Type   |
      +-+-+---++--------------------------+-+------+--------+-----------------+
   PH2| E | PL|  Data-Length (8B-words)   |h|  RZ  |0    Source-Address       |
      +---+---+--------+--------+---------+-+------+--------+--------+--------+
        4   3              25              1   7    1           23

       2   6    2   6       8        8        8        8        8        8
      +--------+--------+--------+--------+--------+--------+--------+--------+
L2RH  |vv000000|11LLLLLL|  SR01  |  SR02  |........|........|........|........|
      +--------+--------+--------+--------+--------+--------+--------+--------+
                 Length

       2   6     4   6      8        8        8        8        8        8
      +--------+--------+--------+--------+--------+--------+--------+--------+
Symbol|vv000000|1011ssss|ssssssss|ssssssss| Length |  data  |........|........|
      +--------+--------+--------+--------+--------+--------+--------+--------+
                  <---- Symbol Type --->

       2   6       8        8        8        8        8        8        8
Opt'l +--------+--------+--------+--------+--------+--------+--------+--------+
hdr   |TCtttttt|LLLLLLLL|  data  |........|........|........|........|........|
fields+--------+--------+--------+--------+--------+--------+--------+--------+
      T: 0=optional, 1=mandatory; C: 0=more OH-fields follow, 1=last OH-field

          8        8        8        8        8        8        8        8
RRP   +--------+--------+--------+--------+--------+--------+--------+--------+
Record|  RTyp  |   PL   |       RL        |........|........|........|........|
      +--------+--------+--------+--------+--------+--------+--------+--------+

RRP-messages: GVL2, L2SR, RDRC, TELL, INFO, HRTO, WRU;
RTyp:         ADDR, NAME, CAPA, LADR, SRQR, MTUR;


INTRODUCTION
------------

PacketWay is an open family of specifications for internetworking high-performance System Area Networks (SANs) and high-performance LANs.
Even though most modern SANs have much in common (such as high rates, low latency, low BER, being packet networks made of point-to-point links with flow control, and the usage of source routes), each is an island unto itself, incapable of direct inter-communication with other SANs. PacketWay's goal is to "internet" such SANs and high-performance LANs.

The core of the PktWay protocol is its End/End Protocol (EEP) and its Router/Router Protocol (RRP). Above the core several extensions are expected to be defined (and implemented), including dynamic resource and routing discovery, secure-PktWay, and multicast-PktWay.

Part-1 describes the PacketWay EEP (End/End Protocol). Part-2 describes the PacketWay RRP (Router/Router Protocol). Part-3 defines the format of the RRP packets. Other PacketWay layers, such as the PacketWay dynamic discovery, security, multicast, and a PktWay Server Layer, will be described in documents to be provided later.

Some basic PacketWay terminology requires explanation.

PacketWay interconnects high-performance System Area Networks (SANs). Each SAN contains some "nodes". At least one node in each SAN is also a PacketWay "router", connected to more than one SAN.

PacketWay's goal is to move data from a "Source" (e.g., a node on a SAN) to a "Destination" (e.g., a node on another SAN). Sources and Destinations can be physical entities (a processor or a smart memory board). They can also be logical entities (a group of cooperating processes). These nodes include sources, destinations, and routers.

Within each instance of PacketWay all nodes have unique 24-bit PacketWay addresses. A system designer can assign these "PacketWay Addresses" manually. Alternatively, the optional PacketWay Server Layer provides a way to assign and discover addresses dynamically. Throughout this document "address" always means the 24-bit PacketWay address.

SANs also may have PacketWay addresses, aka SAN-IDs.
They are also 24-bit quantities, sharing the address space with the nodes. These addresses, of SANs and nodes, are unique within each instance of PacketWay. To optimize for performance, PacketWay has a data transfer mode that leverages the native message routing schemes used within the SANs. This mode uses a "Planned Transfer" paradigm. During the planning phase, a source collects information on optimal routes to a destination, expressed in the various native formats of the intervening SANs. A source later uses this information for low latency transfers to that destination. In PacketWay, the transfer phase of a Planned Transfer is called "L2-forwarding." Appendix-B shows an example of the planning phase. Introduction <04> PktWay-WG PacketWay also optionally supports a more traditional data transfer mode that requires no planning. Such transfers specify the destinations by their addresses only. PacketWay calls this more traditional approach "L3-forwarding." PacketWay packets travel through SANs encapsulated inside the native packet format of each SAN, by being prefixed with the routing header and followed by the tail as required by that SAN. PacketWay packets get to their destinations by Level-2 (L2) forwarding, Level-3 (L3) forwarding, or a combination thereof. In L3-forwarding (similar to IP forwarding), the L2-routing through each SAN is determined by an inter-SAN router upon entering that SAN. The router prefixes the packet with an L2 routing header (such as a source route) corresponding to the destination address specified in the packet. It is a task for that router to determine the L2-routing-header corresponding to the given PacketWay-address. In L2-forwarding the source prefixes the packet with all the L2-routing headers needed along the path to the destination. Each router has only to get the L2-routing-header from the leading L2RH (L2-Routing-Header record) that was provided by the source. PacketWay does not provide Segmentation and Reassembly (SAR). 
Therefore, the length of a packet cannot exceed the minimum MTU (Maximum Transmission Unit) along its path.

PacketWay does not detect errors. It only gathers error detection information from the SANs and inter-SAN routers that a packet transits.

PacketWay is big-Endian 8B-word based.


NOTATIONS
+--------

8B      means "8-byte" (64 bits).
0x      indicates hexadecimal values (e.g., 0x0100 is 2^8=256-decimal).
0b      indicates binary values (e.g., 0b0100 is 4-decimal).
xxx     indicates a field that is discarded without any checking
        (e.g., padding).
[fff]   indicates that fff is an optional field.

Length fields never include themselves, and therefore may be zero.


Part-1: PacketWay EEP messages
-------------------------------

The PacketWay MESSAGE STRUCTURE
+------------------------------

PacketWay messages have 6 components, including 4 optional ones:

   [1]: [Optional Sequence of L2-Routing-Headers Records (L2RHs)
         and Symbols]
   [2]: EEP Header (16 bytes)                 (PH)
   [3]: [Optional Header fields]              (OH)
   [4]: [Optional, Most likely: Data Block]   (DB)
   [5]: [Optional Trailer fields]             (OT)
   [6]: EEP Trailer (8 bytes)                 (TAIL)

Re [1]: as explained later, if the 9th+10th bits of a message are 0b11 then the message starts with an L2RH. If the 9th through the 12th bits of a message are 0b1011 then the message starts with a "symbol". The other values of these bits indicate the lack of L2RH and symbols and that the message begins with the EEP-header.

Re [3]: if the h-bit in the EEP header [2] is 1 then there are optional header fields. The sequence of these header fields is terminated with a word whose 2 MSBytes are 0xFF00.

Re [4]: if DL, in the EEP header, is greater than zero then a Data Block (DB) is included in this message.

Re [5]: the optional header fields, [3], may indicate that some optional trailer fields are present after the DB, [4]. The order and the formats of the trailer fields are defined by the optional header fields.
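The rule in Re [1] for telling what a message starts with can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the returned labels are hypothetical, and the "9th bit" is taken to be the most significant bit of the second byte, as in the L2RH and symbol formats.

```python
def classify_leading_word(word: bytes) -> str:
    """Classify the leading 8B-word of a PacketWay message.

    Only the second byte is inspected: its two MSbits are the 9th and
    10th bits of the message, its four MSbits the 9th through 12th.
    The labels returned here are illustrative, not defined by the draft.
    """
    if len(word) < 2:
        raise ValueError("need at least the first two bytes of the message")
    second = word[1]
    if second >> 6 == 0b11:        # 9th+10th bits are 0b11
        return "L2RH"
    if second >> 4 == 0b1011:      # 9th through 12th bits are 0b1011
        return "SYMBOL"
    return "EEP-HEADER"            # any other value: the EEP header starts here
```

A router would apply such a check to each leading word in turn, since zero or more L2RHs and symbols may precede the EEP header.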
It is expected that most messages will have Data Blocks (DB), and that most messages will not have Optional Header fields (OH), nor trailer fields (OT). [1], the leading L2RHs and symbols are consumed by the SANs before reaching the destination which receives only the other components, [2] through [6]. These parts, [2] to [6], constitute the End/End Protocol of PacketWay. TAIL, the EEP trailer, [6] may be modified along the way to the destination, unlike [2], [3] and [4], which arrive exactly as sent by the source. Each PacketWay packet may be first L2-forwarded (zero or more times) before being L3-forwarded (zero or more times). Although PacketWay headers and trailers are always in Big Endian order, the byte order of the Data Block is not defined by PacketWay. Since all the elements of PacketWay (L2RHs, EEP-headers, optional fields, data, and EEP-trailers) are always multiples of 8B-words, it is recommended that PacketWay headers (and data) be aligned on 8B-boundaries. RRP-Msgs <06> PktWay-WG [1]: Optional Sequence of L2-Routing-Headers Records (L2RHs) and Symbols +----------------------------------------------------------------------- A PacketWay source may specify native routes, by placing the native routes before the PacketWay Header. The native routes (for all SANs and LANs beyond the initial one) must appear within a sequence of PacketWay L2-Routing-Header records (L2RH). The contents of the L2RH are totally SAN dependent, with the exception of the first 2 bytes that distinguish this record from an EEP-header and also provide the Length (L) indicating the number of routing bytes of that L2RH (not including these 2 bytes). L is always between 0 and 63. The total number of bytes in the L2RH is L+2, packed in [(L+9)/8] 8B-words (where the square brackets [] indicate the integer part of the quantity within). It's up to each SAN to provide padding, if needed, to fill the L2RH words. Each L2RH is defined by the entity that will process it. 
In addition to routing information per se, it may also include demuxing information such as a local message-type. For example, over Myrinet it should end with 0x0300 which is the Myrinet-type assigned to PacketWay.

The L2 header must contain enough information to allow a router to quickly create any necessary local routing headers and trailers. PacketWay implementations that support L2-forwarding must document their unique L2 header requirements.

When a PacketWay message is encapsulated inside any native SAN message (Paragon or Myrinet, for example), it's up to that SAN to distinguish between it and its own native packets. This is not a PacketWay issue. For example, Myrinet uses its Message-Type to recognize PacketWay messages.

PacketWay-Routers on the boundaries between SANs are asked to forward packets with either L2 or L3 routings. The former start with an L2RH (having both its 9th and its 10th bits set to 1), whereas the latter start with PacketWay-addresses (with other values for these 2 bits).

FORMAT: Each L2RH is in the format:

+--------+--------+--------+--------+--------+--------+--------+--------+
|vv000000|11LLLLLL|  SR01  |  SR02  |........|........|........|  xxx   | L2RH
+--------+--------+--------+--------+--------+--------+--------+--------+

The first 2 bits are 0b00 for the working version of the protocol. They may have other values for experimental versions.

The next 6 bits should be all zeroes.

The next two bits must be 0b11 to indicate that this is an L2RH record.

The next 6 bits are the byte count of the routing information that starts in the third byte and is followed by as many padding bytes as needed to fill to the next 8B-boundary. The length of the routing information is expected to be between 1 and 63 bytes.

This 0b11 was chosen to be consistent with the 0b11 of PktWay-addresses, as described in [2] below.
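The L2RH layout and its length arithmetic can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the padding bytes are written as zeroes only for convenience, since the draft leaves padding to each SAN.

```python
from typing import Tuple


def encode_l2rh(route: bytes, version: int = 0) -> bytes:
    """Build one L2RH record from L routing bytes (0 <= L <= 63)."""
    length = len(route)
    if not 0 <= length <= 63:
        raise ValueError("L must be between 0 and 63")
    if not 0 <= version <= 3:
        raise ValueError("version is a 2-bit field")
    words = (length + 9) // 8                  # integer part of (L+9)/8
    record = bytes([version << 6, 0xC0 | length]) + route
    # Zero padding is an assumption; the draft leaves padding to the SAN.
    return record + bytes(8 * words - len(record))


def decode_l2rh(buf: bytes) -> Tuple[bytes, int]:
    """Return (routing bytes, total record size in bytes)."""
    if len(buf) < 2 or buf[1] >> 6 != 0b11:
        raise ValueError("not an L2RH record")
    length = buf[1] & 0x3F
    total = 8 * ((length + 9) // 8)
    if len(buf) < total:
        raise ValueError("truncated L2RH record")
    return buf[2:2 + length], total
```

With 5 routing bytes the record occupies one 8B-word (second byte 0b11000101); with 13 routing bytes it occupies two.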
EXAMPLES:

An L2RH with an SR with 5 routing bytes:

         0b11 L=5     #1       #2       #3       #4       #5    padding
+--------+--------+--------+--------+--------+--------+--------+--------+
|vv000000|11000101|  SR01  |  SR02  |  SR03  |  SR04  |  SR05  |  xxx   | L2RH
+--------+--------+--------+--------+--------+--------+--------+--------+
          ^^      |<---------- routing information ----------->|

An L2RH with an SR with 13 routing bytes:

         0b11 L=13    #1       #2       #3       #4       #5       #6
+--------+--------+--------+--------+--------+--------+--------+--------+
|vv000000|11001101|  SR01  |  SR02  |  SR03  |  SR04  |  SR05  |  SR06  | L2RH
+--------+--------+--------+--------+--------+--------+--------+--------+
|  SR07  |  SR08  |  SR09  |  SR10  |  SR11  |  SR12  |  SR13  |  xxx   |
+--------+--------+--------+--------+--------+--------+--------+--------+
    #7       #8       #9      #10      #11      #12      #13    padding

Symbols (to be defined later) may be mixed among the L2RHs, before the EEP-header. Their format is:

+--------+--------+--------+--------+--------+--------+--------+--------+
|vv000000|1011ssss|ssssssss|ssssssss| Length |  data  |........|........|Symbol
+--------+--------+--------+--------+--------+--------+--------+--------+
            <---- Symbol Type --->


[2]: EEP Header (16 bytes) (PH)
+------------------------------

 2   6                24                    16                16
+-+------+-------+--------+---------+--------+--------+--------+--------+
|V|  P   |   Destination-Address    | Type-Extension  |   Packet-Type   |PH1
+-+-+---++--------------------------+-+------+--------+--------+--------+
| E | PL| Data-Length>=0 (8B-words) |h|  RZ  |0    Source-Address       |PH2
+---+---+--------+--------+---------+-+------+--------+--------+--------+
  4   3              25              1   7               24

These fields are described below.


[3]: Optional Header Fields (OH)
+-------------------------------

A PacketWay-message has optional header fields (OH) if the Option-Flag (h) is set to 1 in the EEP-header.
Each OH is in the format:

+--------+--------+--------+--------+--------+--------+--------+--------+
|TCtttttt|LLLLLLLL|  data  |........|........|........|........|........| OH
+--------+--------+--------+--------+--------+--------+--------+--------+

The first byte indicates the optional header field type (OH-TYPE).

The first bit, T, of the first byte indicates the processing of this OH-TYPE:

   T=0: Optional  (may drop this field if this OH-TYPE is unknown)
   T=1: Mandatory (should not process this message if this OH-TYPE
                   is unknown)

The second bit, C, of the first byte indicates whether more optional header fields follow (i.e., whether this is the last OH field of this message).

   C=0: More Optional header fields follow
   C=1: End of Optional header fields group

The other 6 bits of this byte, tttttt, define application-specific OH-TYPEs.

The second byte is the byte-count of the data for this field that starts in the third byte, and is padded with as many padding bytes as needed to fill 8B-words. E.g., L=(0-6) implies one 8B-word. L=(7-14) implies two. L does not include itself, and can range from 0 to 255.

Example: An Optional Header Field (OH) with a mandatory OH-TYPE and 4 data bytes:

            L=4       #1       #2       #3       #4    padding  padding
+--------+--------+--------+--------+--------+--------+--------+--------+
|1xtttttt|00000100| data01 | data02 | data03 | data04 |  xxx   |  xxx   | OH
+--------+--------+--------+--------+--------+--------+--------+--------+
                  |<------------- value ------------->|


[4]: Optional Data Block (DB)
+----------------------------

The DB is free for applications to use in any way. Routers must not modify this field.

The DB has DL 8B-words, including optional padding (at the end) of PL bytes. Hence, the number of data bytes is 8*DL-PL. Both DL and PL are defined in the EEP-header.

The maximum length of the DB is 8*(2^25-1) bytes, just under 256 MB.
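The padding rules of [3] and [4], and the way DL and PL land in the 16-byte EEP header of [2], can be sketched in Python as follows. This is a hedged illustration only: all function names are hypothetical, zero-valued padding is an assumption, and the PT/TE values used with it are placeholders rather than values from the Enumeration Appendix.

```python
import struct


def _check(name: str, value: int, bits: int) -> int:
    """Reject values that do not fit in an unsigned field of `bits` bits."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")
    return value


def encode_oh(oh_type: int, data: bytes, mandatory: bool, last: bool) -> bytes:
    """Build one OH field: TCtttttt, L, data, padded to an 8B boundary."""
    _check("oh_type", oh_type, 6)
    _check("L", len(data), 8)
    first = (int(mandatory) << 7) | (int(last) << 6) | oh_type
    field = bytes([first, len(data)]) + data
    return field + bytes(-len(field) % 8)      # zero padding is an assumption


def pad_data_block(data: bytes):
    """Return (DL, PL, padded DB) such that len(data) == 8*DL - PL."""
    pl = -len(data) % 8                        # 0..7 padding bytes
    dl = _check("DL", (len(data) + pl) // 8, 25)
    return dl, pl, data + bytes(pl)


def pack_eep_header(da, pt, te=0, sa=0, dl=0, pl=0, e=0, h=0,
                    priority=0, version=0) -> bytes:
    """Pack PH1 and PH2 as two big-endian 8B-words; RZ is left at zero."""
    ph1 = ((_check("V", version, 2) << 62) | (_check("P", priority, 6) << 56)
           | (_check("DA", da, 24) << 32) | (_check("TE", te, 16) << 16)
           | _check("PT", pt, 16))
    ph2 = ((_check("E", e, 4) << 60) | (_check("PL", pl, 3) << 57)
           | (_check("DL", dl, 25) << 32) | (_check("h", h, 1) << 31)
           | _check("SA", sa, 24))
    return struct.pack(">QQ", ph1, ph2)
```

A sender could, for example, call `pad_data_block` on its payload and pass the resulting DL and PL to `pack_eep_header`, setting `h=1` only when at least one OH field follows the header.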
[5]: Optional Trailer Fields (OT)
+--------------------------------

A PacketWay-message has optional trailer fields (OT) if so indicated in an Optional Header field, e.g., an OH field may indicate that a CRC64 is in the OT.

An OT may have just the data for an OH defined above (in the EEP header), or be a stand alone field in the same format as OH.

The OT-fields are in the order defined by the OHs. For example, if an OH-field indicating that a CRC32 is in the OT is followed by another OH-field indicating that a CRC64 is in the OT, then the OT with the CRC32 should be followed by the OT with the CRC64.


[6]: EEP Trailer (TAIL)
+----------------------

The TAIL consists of only the Error Indication (EI) field which is a single 8B-word.

Routers may start forwarding packets toward their destinations before detecting transmission errors (wormhole routing). The EI field provides such routers with a means to append an error indication to the end of a packet. An all zero EI value means that no error was indicated. Any non zero EI value indicates one or more errors.

The packet source will usually initialize the EI field to all zeros. However, as an alternative example, a memory board may create a packet with a non zero EI field (EI=1) that indicates that a parity error was detected by the memory board.

Each router does an arithmetic left shift on the EI field by one bit, unless its MSbit is 1. Routers that detect transmission errors also set the LSbit (after the shift) to 1. This provides the ability to identify which routers have indicated errors (if the route is known).


THE DETAILS OF THE EEP-HEADER, [2]
+---------------------------------

                                   Bytes.bits
   Version                (V)         0.2
   Priority               (P)         0.6
   Destination Address    (DA)        3.0
   Packet Type Extension  (TE)        2.0
   Packet Type            (PT)        2.0
   Endianness             (E)         0.4
   Padding Length         (PL)        0.3
   Data Length            (DL)        3.1
   Options flag           (h)         0.1
   Reserved               (RZ)        0.7
   Source Address         (SA)        3.0


Version (V)   2 bits

This field is static.
Its 2 bits are 0b00 for the working version of the protocol. These bits should have other values for experimental versions.


Priority (P)   unsigned integer, 6 bits

It is anticipated that some SANs, especially those working in real time, will want to implement priorities. This field supports such usage. All ones is the highest priority, and all zeroes the lowest. Ideally, packets with higher priority should gain access to contested resources before packets with lower priority. Implementations may ignore the Priority field.


Destination Address (DA)   24 bits

This field contains the PacketWay address of the destination. Addresses are unique within each instance of PacketWay.

Nodes should have addresses assigned to them. The method of assigning addresses to PacketWay nodes is not specified here. Examples of potentially addressable PacketWay nodes include: groups of cooperating processes, an entire MPP, or each of an MPP's many processors or processes.

All half-routers (as defined in Part-2) must have addresses so that they can exchange control and configuration packets with other routers.

The 24-bit PacketWay address space is divided into several segments, each identified by the most significant bit(s) of the address.

    MSbits  | Segment  | Count | Range
   ---------+----------+-------+-------------------
     0XXX   | Physical |  8M   | 0x000000-0x7FFFFF
     100X   | Unused   |  2M   | 0x800000-0x9FFFFF
     1010   | Logical  |  1M   | 0xA00000-0xAFFFFF
     1011   | Symbol   |  1M   | 0xB00000-0xBFFFFF
     11XX   | L2RH     |  4M   | 0xC00000-0xFFFFFF

                           PHYSICAL ADDRESS MAP

                                                Logical
Segment:             Physical             Unused   ^ Symbol    L2RH
         /---------------^--------------\ /--^--\ / \ / \ /------^------\
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
         |       .       :       .       |       |   |   |       .       |
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
Memory:  0                               8M      10M 11M 12M             16M

The 0b11xx was chosen for L2RH to be consistent with the 0b11 indication of L2RHs, as described in [1] above.
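The segment table above can be expressed as a short Python sketch that looks only at the 4 MSbits of a 24-bit address. The function name and the returned strings are illustrative choices, not identifiers defined by this draft.

```python
def address_segment(addr: int) -> str:
    """Return the address-space segment of a 24-bit PacketWay address."""
    if not 0 <= addr <= 0xFFFFFF:
        raise ValueError("PacketWay addresses are 24-bit quantities")
    top4 = addr >> 20                  # the 4 most significant bits
    if top4 < 0b1000:                  # 0XXX
        return "Physical"
    if top4 in (0b1000, 0b1001):       # 100X
        return "Unused"
    if top4 == 0b1010:                 # 1010
        return "Logical"
    if top4 == 0b1011:                 # 1011
        return "Symbol"
    return "L2RH"                      # 11XX
```

Because the L2RH and Symbol segments share their leading bits with the second byte of L2RH and symbol records, the same test distinguishes those records from an EEP header whose second byte begins a destination address.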
LAs, "Logical Addresses", (for broadcast and for multicast groups) are also in this address space. An address is a "Logical Address" if its 4 MSbits are set to 0b1010.

Certain RRP messages specify addresses either as a unique address or as a set of addresses, by (min,max) or by (value,mask).

A few Physical-addresses are reserved:

   0x000000  Undefined address (illegal where an address is expected)

   0x7FFFFE  ("Hey-You!") This address could be used at power up to
             address nodes or routers, over point-to-point links.
             ("If you receive it, it's for you.")

   0x7FFFFF  (Broadcast) This address is reserved for broadcast
             operations which may be added in later versions.
             ("If you receive it, it's for you.")


Type Extension (TE)   2 bytes

An extension of the following PT field. Logically, the TE should be after the PT. However, the PT is 8B-word aligned, and so is easier to process than the TE, which is not 8B-aligned. Since the PT is more frequently used than the TE, it was assigned to the better aligned field.


Packet Type (PT)   2 bytes

The intent of the PT field is to provide all the information needed for demuxing in support of multiple protocol layers. Whereas traditional protocol layering requires several stages of sequential demuxing, PacketWay is expected to provide enough information to support a single combined demuxing (such as in support of zero copy TCP). PT values to support popular parallel programming APIs such as MPI will be defined.

The Enumeration Appendix (A1) defines several values for this PT field. Some PTs also use the preceding 2 bytes of the Type Extension (TE) field for passing PT-specific parameters.

However, layered protocols cannot be ignored. The PT field can also define data blocks as containing IP, SNMP, ATM, Ethernet, and other popular layered protocols. The PT will then be used for that purpose, as done throughout the internet (e.g., "ether-types").
For example, here are PT values a memory board may need: PT Meaning --------- -------------------------------------------------------- MEM-WRITE -- Treat the first 8 bytes of the Data Block as a local memory-address and write the remaining data into memory. MEM-READ -- Treat the data block as 2 8B-memory-addresses and an 8B-byte count. Generate a return WRITE packet containing the first address, followed by the appropriate data, that was read from the second address. The PT field will also indicate the commands used in the PacketWay Router/Router configuration and control Protocol (RRP). We will define a special PT value that specifies that the Data Block contains an embedded PacketWay message, complete with another EEP header, and, potentially, prefixed L2-Routing-Headers. This feature will allow to use L3-routing to an intermediate node, followed by L2-routing from there to the final destination. Special Types RRP - PacketWay's Router/Router protocol (see Part-2). ERR - Error reporting packet, usually sent to the Source Address (SA, see below) in response to a PacketWay message that could not be properly handled, such as "Destination Unknown." The TE indicates the nature of the error (e.g., UNK) as defined in the Enumeration Appendix (A4). PktWay-WG <13> EEP-Msgs Endianness (E), 4 bits The idea is that if the SAN interface of the receiving-node detects Endianness that is different than its own and if the entire Data Block (DB) consists of N-byte fields, then it may kick in byte-swapping hardware for N-byte fields, saving much work for the receiving node. e, the first bit (MSbit) of E, indicates that the DB is in Big-Endian order (e=0) or in Little-Endian order (e=1). The next 3 bits could control hardware byte swapping, if any, which assumes that all the data consists of words of the same length. 
   e000: don't swap, it's 8-bit data
   e001: swap as if all the data is 16-bit words
   e010: swap as if all the data is 32-bit words
   e011: swap as if all the data is 64-bit words
   e100: swap as if all the data is 128-bit words
   e101: illegal and reserved for future use
   e110: illegal and reserved for future use
   e111: illegal and reserved for future use


Pad Length (PL)   unsigned integer, 3 bits

The number of padding bytes that were added at the end of the DB (i.e., from the end of the data to the end of the DB). PL can be between 0 and 7.


Data Length (DL)   unsigned integer, 25 bits

Length, in 8B-words, of the data block (not including the L2RHs, EEP-header, OH, OT, and TAIL), including any optional padding. Hence, the net length of the Data Block is 8*DL-PL bytes.

The minimum is zero, and the maximum length is (2^25-1)*8 bytes, just under 2^28 bytes (256 MBytes).


Optional Header-Field Flag (h)   1 bit

This bit is set to 1 if there is one (or more) optional header fields following the standard 16-byte EEP-header.


Reserved (RZ)   7 bits

This field is reserved for future use. Applications should neither use it, nor count on others not to use it. In this version it should be always set to zero (0b0000000).


Source Address (SA)   24 bits

This field contains the physical address of the packet's original source in the same format as DA. However, unlike DA, the SA must be a physical address. Filling in this field is optional. A value of zero means that the SA is not specified.

Routers may use this field to identify the sender to which error messages may be returned.


Part-2: PacketWay RRP messages
------------------------------

PacketWay is an open family of specifications for internetworking System Area Networks (SANs).

Part-2 of the PktWay specification describes the RRP (Router to Router Protocol) part of the PacketWay-protocol. The RRP is built on top of the PacketWay-EEP described in Part-1.
Part-3 defines and discusses the format of the RRP packets. We introduce some new terminology within this document. A PacketWay Router always bridges (at least) two SANs. The Router consists of three parts: the "Half Router" (HR) attached to the first SAN, the HR attached to the second SAN, and their interconnection. PacketWay does not define the nature of this interconnection. However, we believe the PCI Local Bus de facto standard will become a very popular link. This document specifies a series of options that allow system designers to deploy PacketWay routers of varying levels of intelligence. Each router is considered as a set of interconnected Half-Routers (HRs), each being a full fledged address-bearing node on some SAN. There are several implementation levels of PktWay, indicated by a letter code, which are specified differently for nodes and routers. The higher the letter code ("A" = lowest), the more interoperability and adaptability result. System designers may choose the level of implementation to best suit their needs. Node implementation levels ("A" being the lowest): Level-A: Built-in L2 source routes Level-B: Built-in L3 addresses (dynamic update of first HR) Level-C: Requesting and receiving dynamic information Level-A nodes send messages by using L2-forwarding, by specifying SRs (in L2RHs) that are hard-coded into them, without the ability to dynamically acquire or modify them. Level-B nodes have, in addition, the ability to send messages by using L3-forwarding, by specifying addresses that are hard-coded into them (without the ability to dynamically acquire them). These nodes can ask HRs for the best first HR for any destination node (specified by its address) and for the SR to destination nodes. In addition they can also handle re-direct messages, telling them which HR to use for given nodes. Level-C nodes can also locate L3-nodes by asking HRs to provide the attributes of nodes specified by addresses, names, and/or capabilities. 
They also respond to such queries by reporting their own attributes. Router implementation levels ("A" being the lowest): Level-A: Forwarding according to L2 source routes Level-B: Handling L3 addresses, and dynamic first HR (re-direct, etc) Level-C: Supporting node discovery RRP-msgs <16> PktWay-WG HRs can support nodes of the same (or lower) implementation level. Level-A routers support only L2-Forwarding, and do not support the planning phase of Planned Transfers. Therefore, nodes which use Level-A routers must have the necessary native routes hard-coded into them (e.g., burned into a PROM somewhere). Level-B routers also support L3-Forwarding, and advise nodes about the first HR to use for each destination. They add the planning phase of Planned Transfers (by supporting requests for routes, [GVL2] and [L2SR]). Level-C routers help nodes discover (resources by capabilities). PktWay is designed for the highest implementation levels, but will interoperate with instances of PktWay using lower implementation levels. THE BASIC PACKETWAY MODEL The basic model of PktWay is a set of SANs (System Area Networks), each with its own conventions and protocols, using a common protocol (PktWay) for interconnection. The interconnection between SANs is via PacketWay-routers. A router between SAN-A and SAN-B is composed of two interconnected processes, each a node on a SAN, complying with the conventions of their SANs.. These processes are known as HRs ("Half-Routers") or "SAN-interfaces." These HRs may be implemented by two separate "boxes" with an inter-SAN communication link between them, or inside a single "multi-homed" box that has interfaces to both SANs, interconnected via some bus or SAN. RRP defines (via message structure and behavior) the interactions between HRs, and between HRs and computing nodes. RRP does not define the lower level protocols that deliver its messages (over links, or between processes in multi-homed routers). 
In particular, RRP does not define the inter-SAN interconnection links between the HRs -- these are left for mutual agreements among the implementors. These links are expected to range from serial fibers to PCI buses. An optional PPP-like protocol may be defined later for these links.

It is assumed that each HR has a Routing Table (RT) for its own SAN (aka Local Routing Table, LRT), with (at least) the addresses of all the nodes, and the source routes to each of them from the HR. This information could be dynamic or static, even manually configured. The HRs may (or may not) perform dynamic mapping of their SANs.

It is also assumed that each node, on each SAN/LAN, knows the SR to at least one HR on its SAN/LAN.

In L2 operation under level C, when a source node, SA, needs to send a message to a destination node, DA, it first asks any of the HRs on its [SA's] SAN for a source route (SR) from HR to DA. That HR would (1) provide such an SR, or (2) reply with a "re-direct" message, suggesting to ask another HR which is also on SA's SAN, or (3) report no knowledge of DA (using the UNK error message).

SA may ask more than one HR for SRs to the same DA and use any algorithm to choose which of these SRs to use. RRP does not specify whether (and how) to cache SRs.

In L3 operation, when a source node, SA, needs to send a message to a destination node, DA, it sends that message to any of the HRs on its SAN, using L2, expecting L3-forwarding to DA, using DA's PacketWay address. That HR would either (1) forward the message toward DA, and possibly return to SA a "re-direct" message, suggesting to use, in the future, another HR on SA's SAN for DA, or (2) report no knowledge of DA (using the UNK error message).

Under level C, nodes may be located by PacketWay-addresses, names, or capabilities, but only addresses may be used for routing.
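The three possible answers of an HR to a request for a source route can be sketched in Python as follows. This is a toy model under stated assumptions: the class, its two tables, and the order in which the alternatives are tried are hypothetical, and the reply tags simply reuse the RRP names [L2SR], [RDRC] and [ERR/UNK] as plain strings.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union


@dataclass
class HalfRouter:
    """Toy model of an HR answering route requests from nodes on its SAN."""
    # Destination address -> source route from this HR (hypothetical table).
    routes: Dict[int, bytes] = field(default_factory=dict)
    # Destination address -> address of another HR on the same SAN.
    redirects: Dict[int, int] = field(default_factory=dict)

    def handle_route_request(self, da: int) -> Tuple[str, Union[bytes, int]]:
        """Answer a request for a source route to destination address `da`."""
        if da in self.routes:                  # (1) provide an SR
            return ("L2SR", self.routes[da])
        if da in self.redirects:               # (2) re-direct to another HR
            return ("RDRC", self.redirects[da])
        return ("ERR/UNK", da)                 # (3) no knowledge of DA
```

A source node could query several such HRs and keep whichever route it prefers; how it chooses, and whether it caches the result, is left open by RRP.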
NODE ATTRIBUTES
+--------------

Each node has: Physical Address, Name, Capabilities, and Logical-Addresses

Address (Physical): 3 bytes, flat, unique in this PacketWay

Name: flat, globally unique (e.g., IP address), arbitrary length

Capabilities: regular GP node, router, PacketWay-server, NFS, paging
      server, M/C server, DSP, printer, ....  Some capabilities may need
      additional parameters (e.g., SAN-ID for routers, and
      resolution+colors for printers).  The capabilities are defined in
      the Enumeration Appendix (A5).

Logical-Addresses: a set of (logical) addresses to which this node
      requests to listen.  Logical addresses designate multicast and
      broadcast groups.  The control of the Logical-Addresses (a la IGMP)
      is not defined in this document.  It will be designed by the
      applications that use it (e.g., PktWay-multicast).  The management
      of logical addresses (e.g., JOIN and LEAVE) is not defined yet.

Part-3: PacketWay RRP Message Format
-------------------------------------

RRP messages are PacketWay messages with PT="RRP" in their EEP-header.
The EEP-header is followed by some (zero or more) RRP-records according
to their RRP-type, followed (always) by the TAIL which is the EI field.

The RRP-records constitute the DB of the PacketWay-message.  They must be
in Big-Endian order, with e=0 in the EEP-header.

The RRP-Type is carried in the TE field of the EEP-header.  Following are
the RRP messages, with their RRP-type:

RRP MESSAGE SUBTYPES
+-------------------

RRP-      Impl'n
Type      Levels Description
+-------  ------ -----------------------------------------------
[GVL2]    BC     Please give me L2-routes to node (address)
                 The reply to [GVL2] is [L2SR], [RDRC], or [ERR/UNK].
[L2SR]    BC     Here are L2-routes to node (address)
[RDRC]    BC     Re-direct to node (address) via a neighbor HR(address)
[TELL]    C      Please tell me about node (address, name, capabilities)
                 The reply to [TELL] is [INFO], or [ERR/UNK].
[INFO]    C      Info about node (address, name, capabilities, LAs)
[HRTO]    BC     Which HR should I use for node (address)
                 The reply to [HRTO] is [RDRC], or [ERR/UNK].
[WRU?]    C      Who/what-Are-You?
                 The reply to [WRU?] is [INFO].

RRP also uses the following error messages:

[ERR/UNK]       BC   Destination Unknown (address)
[ERR/HRDOWN]    BC   HR Down
[ERR/LINKDOWN]  BC   Link Down
[ERR/GENERAL]   ABC  General error message

All these messages may be sent from nodes or from HRs, to nodes or to
HRs.  The format of these messages is defined in this part.

The implementation levels are:

   Level-A: pre-wired (static) native routing, "MAC"-based operation
   Level-B: L3 forwarding (planned transfers), IP-like operation
   Level-C: Node discovery (static routing)

The RRP records are:

   RTyp  Description
   ----  ----------------------------------
   ADDR  Address
   NAME  Name
   CAPA  Capability
   LADR  Logical Addresses
   SRQR  Source Route and its Quality (SR,Q)
   MTUR  MTU (for the previous SRQR)

THE STRUCTURE OF THE RRP MESSAGES
+--------------------------------

The RRP-records are made of one or more 8B-words.

In the following the RRP-type is in [] and its implementation level in ().
Each message ends with a TAIL which is not shown here.

* [GVL2] (BC)  Please give me L2-routes from you to node (address)

     PH    (with [PT/TE]=[RRP/GVL2])
     ADDR  (address of the node for which SR is requested)

* [L2SR] (BC)  Here are L2-routes to node (address)

     PH    (with [PT/TE]=[RRP/L2SR])
     ADDR  (address of the node for which SR is provided)
     SRQR  (SR with Q)
     MTUR  (MTU for the above SR)

  This message may have several (SRQR,MTUR)s, one for each SR.

* [RDRC] (BC)  Re-direct to node (address) via a neighbor HR (address)

     PH    (with [PT/TE]=[RRP/RDRC])
     ADDR  (address of the destination node for which re-direct is issued)
     ADDR  (address of the HR to be used for that destination node)

  The above addresses are expected to be physical (but they may be
  otherwise).
* [TELL] (C)  Please tell me about node (address | name | capabilities)

     PH    (with [PT/TE]=[RRP/TELL])
     ADDR  (address of the node for which more information is requested)
  or
     PH    (with [PT/TE]=[RRP/TELL])
     NAME  (name of the node for which more information is requested)
  or
     PH    (with [PT/TE]=[RRP/TELL])
     CAPA  (capabilities for which nodes are requested)

  This message may have several CAPA's, one for each capability.

  [TELL] identifies a node by an address and/or a name and/or
  capabilities.  If more than one attribute is specified (e.g., an
  address and a name) any node that meets any of them should be
  considered (like an implied OR).

* [INFO] (C)  Info about node (address, name, capabilities)

     PH    (with [PT/TE]=[RRP/INFO])
     ADDR  (address of the node for which more information is requested)
     NAME  (name of the node for which more information is requested)
     CAPA  (capabilities for which nodes are requested)
     LADR  (Logical-Addresses for the requested node)

  This message may have several CAPA's, one for each capability.
  For nodes without NAME or LADR, these records are omitted.

  [INFO] provides all the known information about that node: address,
  name, capabilities, and logical-addresses.

* [HRTO] (BC)  Which HR should I use for node (address)

     PH    (with [PT/TE]=[RRP/HRTO])
     ADDR  (address of the node for which initial HR is requested)

* [WRU?] (C)  Who/what-Are-You?

     PH    (with [PT/TE]=[RRP/WRU?]
            and [DA]=0x7FFFFE)

* [ERR/UNK] (BC)  Destination Unknown (address)

     PH    (with [PT/TE]=[ERROR/UNK])
     XXXX  (XXXX of the Destination node for which the requested
            information is not available),
           where XXXX is the ADDR and/or NAME and/or CAPA
           of the node(s) about which this message is sent

* [ERR/HRDOWN] (BC)  HR Down (or Router-Down)

     PH    (with [PT/TE]=[ERROR/HRDOWN])
     ADDR  (address of the HR that is down)
     ADDR  (the other address of the router that is down)

* [ERR/LINKDOWN] (BC)  Link Down

     PH    (with [PT/TE]=[ERROR/LINKDOWN])
     ADDR  (address of one end of the link that is down)
     ADDR  (address of the other end of the link that is down)

* [ERR/GENERAL] (ABC)

     PH    (with [PT/TE]=[ERROR/GENERAL])
     XX    (The entire message that caused that error: PH+OH+DB+TAIL)

RRP RECORD FORMAT
+----------------

Each RRP-record starts with an 8B-word header as shown below.  Its first
byte identifies the record type (RTyp).  The second byte is the Pad-Count
byte (PL) indicating the number of padding bytes.  The third and the
fourth bytes (RL) are the length (in 8B-words) of the record, excluding
the record header, hence it may be zero.  The rest of the header bytes
depend on the record type (RTyp).

+--------+--------+--------+--------+--------+--------+--------+--------+
|  RTyp  |   PL   |       RL        |        |        |        |        |Record
+--------+--------+--------+--------+--------+--------+--------+--------+

Some records that have an arbitrary length are "right justified" and have
PL padding bytes before the data.  This is Padding Before Data [PBD].

Some records that have an arbitrary length are "left justified" and have
PL padding bytes after the data.  This is Padding After Data [PAD].

In either case the total number of data bytes is (8*RL+4-PL): the 4 bytes
that follow RL in the header word, plus the 8*RL bytes of the following
words, less the PL padding bytes.

Following are the RRP-records.  These records are the building blocks
used to construct RRP-messages.

In the following xxx indicates bytes that are discarded, such as for
padding.  It is recommended to set them to all-0.
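
The common record header can be decoded with a few lines of code.  The
sketch below is illustrative only; RecordHeader, parse_record_header(),
record_size() and data_byte_count() are names introduced here.  It counts
data bytes as the 4 bytes that share the header word with RTyp/PL/RL,
plus the 8*RL bytes that follow, less PL, which is the count that agrees
with the record layouts drawn in this part (e.g., a NAME record with
PL=3 and RL=1 carries 9 characters).  The RTyp codes used in the usage
note are those of Appendix A3.

```python
import struct
from typing import NamedTuple


class RecordHeader(NamedTuple):
    rtyp: int   # byte 0: record type code
    pl: int     # byte 1: number of padding bytes
    rl: int     # bytes 2-3: number of 8B-words that follow the header word


def parse_record_header(word: bytes) -> RecordHeader:
    """Decode RTyp, PL and RL from the first 8B-word of a record."""
    if len(word) < 8:
        raise ValueError("a record header is one full 8B-word")
    rtyp, pl, rl = struct.unpack(">BBH", word[:4])   # RRP is Big-Endian
    return RecordHeader(rtyp, pl, rl)


def record_size(h: RecordHeader) -> int:
    """Total record size in bytes, including the header word."""
    return 8 * (h.rl + 1)


def data_byte_count(h: RecordHeader) -> int:
    """Data bytes: 4 in the header word + 8*RL following, less PL padding."""
    return 8 * h.rl + 4 - h.pl
```

For example, the header bytes 42,3,0,1 (a NAME record with PL=3, RL=1)
describe a 16-byte record holding 9 data bytes, and 41,0,0,0 (a
single-address ADDR record) describes an 8-byte record holding 4.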
===> [ADDR] Node-Address Record [PAD]

This record specifies either a single address (with AT=1) or a range of
addresses (with AT=2 followed by AT=3, or AT=4 followed by AT=5).
AT is the "Address-Type".

     0        1        2        3        4        5        6        7
+--------+--------+--------+--------+--------+--------+--------+--------+
| "ADDR" |  PL=0  |      RL=0       |  AT=1  |      PktWay-Address      |
+--------+--------+--------+--------+--------+--------+--------+--------+

or:
     0        1        2        3        4        5        6        7
+--------+--------+--------+--------+--------+--------+--------+--------+
| "ADDR" |  PL=4  |      RL=1       |  AT=2  |    Min-PktWay-Address    |
+--------+--------+--------+--------+--------+--------+--------+--------+
|  AT=3  |    Max-PktWay-Address    |  xxx   |  xxx   |  xxx   |  xxx   |
+--------+--------+--------+--------+--------+--------+--------+--------+

or:
     0        1        2        3        4        5        6        7
+--------+--------+--------+--------+--------+--------+--------+--------+
| "ADDR" |  PL=4  |      RL=1       |  AT=4  |   PktWay-Address-Value   |
+--------+--------+--------+--------+--------+--------+--------+--------+
|  AT=5  |   PktWay-Address-Mask    |  xxx   |  xxx   |  xxx   |  xxx   |
+--------+--------+--------+--------+--------+--------+--------+--------+

The address-mask follows the address-value and is itself followed by 4
padding bytes.

The above addresses may be physical or logical.

The address X is specified by an ADDR record if:

   if AT=1:     X == PktWay-Address
   if AT=2,3:   Min-PktWay-Address <= X <= Max-PktWay-Address
   if AT=4,5:   (PktWay-Address-Mask & X) == PktWay-Address-Value

An ADDR-record defines only one PktWay-address (or one range), unlike an
LADR record that may specify multiple addresses and multiple
address-ranges.

If the ADDR record is followed by other records that describe the same
node (such as NAME, CAPA, LADR, SRQR, and MTUR) then the RL of the ADDR
record also covers all these records.  All these records apply to all the
addresses specified in this ADDR-record.  Needless to say, NAME is not
expected to appear within a record that specifies more than one address.
Hence, if an ADDR-record with AT=1 has RL>0, or if an ADDR-record with
AT>1 has RL>1 (i.e., an RL larger than the record needs for its own
addresses), then this ADDR-record includes additional records (such as
CAPA, LADR, SRQR, and/or MTUR) about the specified address(es).

===> [NAME] Node-Name Record [PAD]
     (e.g., a name with 9 characters: A1..A9):

     0        1        2        3        4        5        6        7
+--------+--------+--------+--------+--------+--------+--------+--------+
| "NAME" |  PL=3  |      RL=1       |   A1   |   A2   |   A3   |   A4   |Name
+--------+--------+--------+--------+--------+--------+--------+--------+
|   A5   |   A6   |   A7   |   A8   |   A9   |  xxx   |  xxx   |  xxx   |
+--------+--------+--------+--------+--------+--------+--------+--------+

===> [CAPA] Node-Capability Record [PAD]
     (e.g., with 9 parameter bytes):

     0        1        2        3        4        5        6        7
+--------+--------+--------+--------+--------+--------+--------+--------+
| "CAPA" |  PL=2  |      RL=1       | CC=Cx  |   P1   |   P2   |   P3   |cap
+--------+--------+--------+--------+--------+--------+--------+--------+
|   P4   |   P5   |   P6   |   P7   |   P8   |   P9   |  xxx   |  xxx   |
+--------+--------+--------+--------+--------+--------+--------+--------+

Byte#4 is the Capability Code, CC, followed by as many parameter bytes
as needed.

The capability codes are listed in the Enumeration Appendix (A5).

The number of bytes used by the parameters is 8*RL+3-PL (the data bytes
of the record less the one CC byte).
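
The NAME and CAPA layouts above are both "left justified" [PAD] records,
so one helper can build either.  The following is a sketch under stated
assumptions, not a normative encoder: encode_name(), encode_capa(),
capa_param_count() and the private helper are names introduced here, the
RTyp codes 42 and 43 are those listed in Appendix A3, and the padding
bytes are set to zero as recommended.  The parameter count is computed as
8*RL+3-PL, which is the value that matches the drawn examples (PL=2,
RL=1 gives 9 parameter bytes).

```python
import struct

RTYP_NAME = 42   # Appendix A3
RTYP_CAPA = 43   # Appendix A3


def _pad_after(rtyp: int, data: bytes) -> bytes:
    """Build a [PAD] record: header, data, then PL zero padding bytes."""
    # Four data bytes share the header word; the rest needs RL more 8B-words.
    rl = max(0, -(-(len(data) - 4) // 8))      # ceil((len - 4) / 8), never < 0
    pl = 8 * rl + 4 - len(data)                # unused bytes at the end
    return struct.pack(">BBH", rtyp, pl, rl) + data + bytes(pl)


def encode_name(name: bytes) -> bytes:
    """NAME record carrying the given name bytes."""
    return _pad_after(RTYP_NAME, name)


def encode_capa(code: int, params: bytes = b"") -> bytes:
    """CAPA record: capability code CC followed by its parameter bytes."""
    return _pad_after(RTYP_CAPA, bytes([code]) + params)


def capa_param_count(pl: int, rl: int) -> int:
    """Parameter bytes in a CAPA record: data bytes less the CC byte."""
    return 8 * rl + 3 - pl
```

With these helpers, a 9-character name yields PL=3 and RL=1, and a
capability with 9 parameter bytes yields PL=2 and RL=1, as drawn above.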
===> [LADR] Logical-Addresses Record [PAD]
     (e.g., 2 logical addresses and a range of logical addresses):

     0        1        2        3        4        5        6        7
+--------+--------+--------+--------+--------+--------+--------+--------+
| "LADR" |  PL=4  |      RL=2       |  AT=1  |1010   Logical-Address-#1 |LogAdr
+--------+--------+--------+--------+--------+--------+--------+--------+
|  AT=2  |1010 Min-Logical-Address  |  AT=3  |1010 Max-Logical-Address  |
+--------+--------+--------+--------+--------+--------+--------+--------+
|  AT=1  |1010   Logical-Address-#2 |  xxx   |  xxx   |  xxx   |  xxx   |
+--------+--------+--------+--------+--------+--------+--------+--------+

Whereas an ADDR-record defines only one PktWay-address (or one range), an
LADR record may specify multiple addresses (each with AT=1) and multiple
ranges (each with a pair of AT=2,3 or AT=4,5).

===> [SRQR] Source-Route Record [PBD], with Q for that route.
     (e.g., a combined SR with 13 bytes and an SR with 4 bytes)

This record carries one, or more, L2RHs (2 in the following example).

     0        1        2        3        4        5        6        7
+--------+--------+--------+--------+--------+--------+--------+--------+
| "SRQR" |  PL=2  |      RL=3       |  xxx   |  xxx   |        Q        |SR+Q
+--------+--------+--------+--------+--------+--------+--------+--------+
|vv000000|11 L=13B|  SR01  |  SR02  |  SR03  |  SR04  |  SR05  |  SR06  |L2RH#1
+--------+--------+--------+--------+--------+--------+--------+--------+
|  SR07  |  SR08  |  SR09  |  SR10  |  SR11  |  SR12  |  SR13  |  xxx   |
+--------+--------+--------+--------+--------+--------+--------+--------+
|vv000000|11 L=4B |  SR01  |  SR02  |  SR03  |  SR04  |  xxx   |  xxx   |L2RH#2
+--------+--------+--------+--------+--------+--------+--------+--------+

Q (the Route Quality) is an unsigned 16-bit integer.  The units are not
defined here.  It is assumed that it is monotonic with all-0 being the
best and all-1 the worst.

If there is an MTUR (MTU-record) for that SR it should follow this SRQR
record.  However, the RL of this SRQR does not include the RL of the MTUR.
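
To make the SRQR layout concrete, the sketch below walks one SRQR record
and returns Q together with the source-route bytes of each L2RH.  It is
an illustration under assumptions that this part does not state
explicitly: that the second L2RH byte is the two tag bits "11" followed
by a 6-bit length L, and that each L2RH is padded to a whole number of
8B-words (both read off the example above); parse_srqr() is a name
introduced here, and the RTyp code 45 is taken from Appendix A3.  The vv
bits of the first L2RH byte are ignored.

```python
import struct
from typing import List, Tuple

RTYP_SRQR = 45   # Appendix A3


def parse_srqr(record: bytes) -> Tuple[int, List[bytes]]:
    """Return (Q, [source-route bytes of each L2RH]) for one SRQR record."""
    rtyp, pl, rl = struct.unpack(">BBH", record[:4])
    if rtyp != RTYP_SRQR:
        raise ValueError("not an SRQR record")
    if pl != 2:
        raise ValueError("this sketch expects the PL=2 layout drawn above")
    if len(record) < 8 * (rl + 1):
        raise ValueError("truncated record")
    (q,) = struct.unpack(">H", record[6:8])     # Q follows the 2 padding bytes
    body = record[8:8 * (rl + 1)]               # the RL words after the header
    routes: List[bytes] = []
    off = 0
    while off < len(body):
        tag_len = body[off + 1]
        if tag_len >> 6 != 0b11:                # ASSUMPTION: "11" marks an L2RH
            raise ValueError("expected an L2RH")
        length = tag_len & 0x3F                 # ASSUMPTION: low 6 bits are L
        if off + 2 + length > len(body):
            raise ValueError("L2RH runs past the end of the record")
        routes.append(body[off + 2:off + 2 + length])
        # ASSUMPTION: each L2RH is padded up to the next 8B-word boundary.
        off += -(-(2 + length) // 8) * 8
    return q, routes
```

Applied to the example above (RL=3, one 13-byte SR and one 4-byte SR),
the walk consumes 16 bytes for L2RH#1 and 8 bytes for L2RH#2, which
accounts for all three words covered by RL.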
===> [MTUR] MTU record [PBD]: 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ | "MTUR" | PL=0 | RL=0 | MTU (in 8B-words) |MTU +--------+--------+--------+--------+--------+--------+--------+--------+ The MTU record provides the MTU for the SR defined before (by an SRQR). The value of 0 means indefinite MTU (i.e., any length is OK). RRP-Format <26> PktWay-WG RRP MESSAGE EXAMPLES +------------------- Node-S asks HR1 to provide an L2RH to node-X: ==> [GVL2] Please give me L2-routes from you to node-X 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | HR1-Address | "GVL2" | "R R P" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=1 (8B-words) |0| RZ |0 S-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | "ADDR" | PL=0 | RL=0 | AT=1 | X-Address |Dest +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ ==> [L2SR] HR1 replies with two L2-routes to node-X with Qs and MTUs (e.g., an SR of 2 L2RHs (of 5+4 bytes), and an SR an L2RH of 3 bytes) 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | S-Address | "L2SR" | "R R P" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=8 (8B-words) |0| RZ |0 HR1-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | "ADDR" | PL=0 | RL=7 | AT=1 | X-Address |Addr +--------+--------+--------+--------+--------+--------+--------+--------+ | "SRQR" | PL=2 | RL=2 | xxx | xxx | Q |SR+Q +--------+--------+--------+--------+--------+--------+--------+--------+ |vv000000|11 L=5B | SR01 | SR02 | SR03 | SR04 | SR05 | xxx |L2RH 
+--------+--------+--------+--------+--------+--------+--------+--------+ |vv000000|11 L=4B | SR01 | SR02 | SR03 | SR04 | xxx | xxx |L2RH +--------+--------+--------+--------+--------+--------+--------+--------+ | "MTUR" | PL=0 | RL=0 | MTU (in 8B-words) |MTU +--------+--------+--------+--------+--------+--------+--------+--------+ | "SRQR" | PL=2 | RL=1 | xxx | xxx | Q |SR+Q +--------+--------+--------+--------+--------+--------+--------+--------+ |vv000000|11 L=3B | SR01 | SR02 | SR03 | xxx | xxx | xxx |L2RH +--------+--------+--------+--------+--------+--------+--------+--------+ | "MTUR" | PL=0 | RL=0 | MTU (in 8B-words) |MTU +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ ==> [RDRC] HR1 redirects Node-S to use HR2 for node-X 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | S-Address | "RDRC" | "R R P" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=2 (8B-words) |0| RZ |0 HR1-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | "ADDR" | PL=0 | RL=0 | AT=1 | X-Address |Dest +--------+--------+--------+--------+--------+--------+--------+--------+ | "ADDR" | PL=0 | RL=0 | AT=1 | HR2-Address |via-HR +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ PktWay-WG <27> RRP-Format ==> [TELL] Please tell me about Node-X (address | name | capabilities) This message may have any of the following 3 forms: If by PacketWay-address: 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | HR1-Address | "TELL" | "R R P" |PH1 
+---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=1 (8B-words) |0| RZ |0 S-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | "ADDR" | PL=0 | RL=0 | AT=1 | X-Address |Addr +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ If by name (e.g., a name with 9 characters: A1...A9): 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | HR1-Address | "TELL" | "R R P" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=2 (8B-words) |0| RZ |0 S-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | "NAME" | PL=3 | RL=1 | A1 | A2 | A3 | A4 |Name +--------+--------+--------+--------+--------+--------+--------+--------+ | A5 | A6 | A7 | A8 | A9 | xxx | xxx | xxx | +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ If by capabilities (e.g., 2 capabilities, C1 with 2 parameter bytes, and C2 with no parameter bytes): 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | HR1-Address | "TELL" | "R R P" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=2 (8B-words) |0| RZ |0 S-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | "CAPA" | PL=1 | RL=0 | CC=C1 | P1 | P2 | xxx |cap +--------+--------+--------+--------+--------+--------+--------+--------+ | "CAPA" | PL=3 | RL=0 | CC=C2 | xxx | xxx | xxx |cap +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated 
along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ A "TELL" may specify several nodes, by addresses, names, and capabilities. Any node that matches any of the specifications will be included in the reply. RRP-Format <28> PktWay-WG ==> [INFO] Info about Node-X (address, name, capabilities) e.g., a name with 9 characters (A1...A9) and 3 capabilities (Cx, Cy, and Cz): 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | S-Address | "INFO" | "R R P" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=7 (8B-words) |0| RZ |0 HR1-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | "ADDR" | PL=0 | RL=6 | AT=1 | X-Address | * +--------+--------+--------+--------+--------+--------+--------+--------+ * | "NAME" | PL=3 | RL=1 | A1 | A2 | A3 | A4 | * +--------+--------+--------+--------+--------+--------+--------+--------+ * | A5 | A6 | A7 | A8 | A9 | xxx | xxx | xxx | * +--------+--------+--------+--------+--------+--------+--------+--------+ * | "CAPA" | PL=1 | RL=0 | CC=Cx | P1 | P2 | xxx | * +--------+--------+--------+--------+--------+--------+--------+--------+ * | "CAPA" | PL=3 | RL=0 | CC=Cy | xxx | xxx | xxx | * +--------+--------+--------+--------+--------+--------+--------+--------+ * | "CAPA" | PL=5 | RL=1 | CC=Cz | P1 | P2 | P3 | * +--------+--------+--------+--------+--------+--------+--------+--------+ * | P4 | P5 | P6 | xxx | xxx | xxx | xxx | xxx | * +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ The INFO records aggregate all the nodes that meet any of the attributed specified in the TELL record. When such aggregation is used, the DL (data length) in the PH is the sum of the RLs in all the ADDR fields. 
(*) The ADDR, NAME, and CAPA records are repeated for each applicable node. Same also for LADR, SRQR, and MTUR, if any. If several capabilities are specified in [TELL], any node that has any of these capabilities should be reported in [INFO]. ==> [HRTO] Node-S asks HR1 which HR to use for Node-X. 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | HR1-Address | "HRTO" | "R R P" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=1 (8B-words) |0| RZ |0 S-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | "ADDR" | PL=0 | RL=0 | AT=1 | X-Address |Dest +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ PktWay-WG <29> RRP-Format ==> [WRU?] Who/what-Are-You? 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P |01111111|11111111|11111110| "WRU?" | "R R P" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=0 (8B-words) |0| RZ |0 S-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ This is addressed to 0x7FFFFE, the "Hey-You" address. ==> [ERR/UNK] Destination Unknown (address). HR1 tells Node-S that he does not know about Node-X. 
0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | S-Address | UNK | "E R R" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=1 (8B-words) |0| RZ |0 HR1-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | "ADDR" | PL=0 | RL=0 | AT=1 | X-Address |Addr +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ ==> [ERR/HRDOWN] HR Down (2 addresses). HR1 tells Node-S that HR-X is down 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | S-Address | "HRDOWN" | "E R R" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=2 (8B-words) |0| RZ |0 HR1-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | "ADDR" | PL=0 | RL=0 | AT=1 | HRX-Address-1 |Addr +--------+--------+--------+--------+--------+--------+--------+--------+ | "ADDR" | PL=0 | RL=0 | AT=1 | HRX-Address-2 |Addr +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ HR1 knows 2 addresses of the downed router. 
RRP-Format <30> PktWay-WG ==> [ERR/LINKDOWN] Link Down (2 addresses) 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | S-Address | "LINKDOWN" | "E R R" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=2 (8B-words) |0| RZ |0 HR1-Address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | "ADDR" | PL=0 | RL=0 | AT=1 | A-Addr | +--------+--------+--------+--------+--------+--------+--------+--------+ | "ADDR" | PL=0 | RL=0 | AT=1 | B-Addr | +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ This message reports that the link between A-Addr and B-Addr is down. ==> [ERR/GENERAL] General error: HR1 tells node-S that it (HR1) could not handle the enclosed message) 0 1 2 3 4 5 6 7 +--------+--------+--------+--------+--------+--------+--------+--------+ |00 P | S-Address | GENERAL | "E R R" |PH1 +---+----+--------+--------+--------+-+------+--------+--------+--------+ |E=0|PL=0| Data-Length=? (8B-words) |0| RZ |0 HR1-address |PH2 +---+----+--------+--------+--------+-+------+--------+--------+--------+ | |Data |<------The entire message that could not be handled by the sender----->|Data | |Data +--------+--------+--------+--------+--------+--------+--------+--------+ | 64 zero bits, unless any error was indicated along the path |TAIL +--------+--------+--------+--------+--------+--------+--------+--------+ This message reports that the enclosed message could not be handled by its receiver (the sender of this error message). PktWay-WG <31> Appendix-A Appendix-A: Enumerations ------------------------ (A1) PacketWay Packet Types +-------------------------- The EEP header reserves 4 bytes for signaling from the source node directly to the destination node. 
They are the PACKET TYPE (PT), and the TYPE EXTENSION (TE), 2 bytes each.

This list defines values for the PACKET-TYPE (PT) 2B-field.  Each
packet-type has its own interpretation of the TE and the h-fields.

   2B-Code       Packet Type
   +----------   ----------------------
        0        Illegal
        1        RRP
        2        Embedded PacketWay Packet
        3        MEM-READ
        4        MEM-WRITE

                 Higher level protocols:
       21        IP
       22        SNMP
       23        ATM

                 Link layer Protocols
       50        Ethernet (E10)
       51        Ethernet (E100)
       52        Ethernet (E1000)
       53        Myrinet
       54        Fibre Channel
       55        RACEway
       56        SCI
       57        VME

                 Application level protocols:
       81        MPI
       82        PVM

                 Secure Protocols
      121        Secure (1)
      122        Secure (2)
      123        Secure (3)

   1,024-2,047   User defined

     65,535      ERR (for Error)

More values will be assigned.  "Ether-types" should be added with a
pointer to those used by the Internet.

(A2) RRP Messages (Type Extensions of PT="RRP")
+----------------------------------------------

   RRP-
   Type    Code  Description
   +------ ----  ----------------------------------------------------
             0   Illegal
   GVL2     21   Please give me L2-routes from you to node (address)
   L2SR     22   Here are L2-routes to node (address)
   RDRC     23   Re-direct to node (address) via a neighbor HR (address)
   TELL     24   Please tell about node (address | name | capabilities)
   INFO     25   Info about node (address, name, capabilities)
   HRTO     26   Which HR should I use for node (address)
   WRU?     27   Who/what-Are-You?
   GVRT     28   Please give me your RTs
   RTBL     29   Here is an RT

Throughout this document the RRP messages are indicated by their type
(e.g., RDRC for re-direct).  In actual messages the code is used (e.g.,
23 for RDRC).
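
An implementation would typically carry the two tables above as
enumerations.  The sketch below is one way to do so; the class names, the
member spellings (e.g., WRU for "WRU?", since "?" cannot appear in an
identifier) and the describe() helper are choices made here for
illustration, while the numeric codes are taken from the tables above.

```python
from enum import IntEnum


class PacketType(IntEnum):
    """PT codes from table (A1)."""
    RRP = 1
    EMBEDDED_PKTWAY = 2
    MEM_READ = 3
    MEM_WRITE = 4
    IP = 21
    SNMP = 22
    ATM = 23
    ETHERNET_E10 = 50
    ETHERNET_E100 = 51
    ETHERNET_E1000 = 52
    MYRINET = 53
    FIBRE_CHANNEL = 54
    RACEWAY = 55
    SCI = 56
    VME = 57
    MPI = 81
    PVM = 82
    SECURE_1 = 121
    SECURE_2 = 122
    SECURE_3 = 123
    ERR = 65535


class RrpType(IntEnum):
    """TE codes for PT=RRP, from table (A2)."""
    GVL2 = 21
    L2SR = 22
    RDRC = 23
    TELL = 24
    INFO = 25
    HRTO = 26
    WRU = 27      # "WRU?" in the text
    GVRT = 28
    RTBL = 29


def is_user_defined(pt: int) -> bool:
    """PT values 1,024 through 2,047 are user defined."""
    return 1024 <= pt <= 2047


def describe(pt: int, te: int) -> str:
    """Human-readable label for a (PT, TE) pair; TE is decoded only for RRP."""
    if pt == 0:
        return "Illegal"
    if is_user_defined(pt):
        return f"User-defined({pt})"
    try:
        name = PacketType(pt).name
    except ValueError:
        return f"Unassigned({pt})"
    if pt == PacketType.RRP:
        if te == 0:
            return "RRP/Illegal"
        try:
            return f"RRP/{RrpType(te).name}"
        except ValueError:
            return f"RRP/Unassigned({te})"
    return name
```

For instance, describe(1, 23) yields "RRP/RDRC", and describe(1500, 0)
yields "User-defined(1500)".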
(A3) RRP records
+---------------

   RTyp    Code  Description
   +------ ----  ----------------------------------------------------
             0   Illegal
   ADDR     41   Address record for one or many nodes
   NAME     42   Node Name record
   CAPA     43   Node Capability record
   LADR     44   Node Logical Addresses record
   SRQR     45   Source Route record and its Quality (SR, Q)
   MTUR     46   MTU record (for the previous SRQR)

Throughout this document the RRP records are indicated by their RTyp
(e.g., ADDR for address).  In actual messages the code is used (e.g.,
41 for ADDR).

(A4) Error Messages
+------------------

   Subtype    Code  Description
   ---------  ----  ----------------------------------------------
                0   Illegal
   UNK         71   Unknown (address)
   HRDOWN      72   HR-Down (and the links associated with it)
   LINKDOWN    73   Link-Down (between two HRs)
   GENERAL     74   General error message

Throughout this document the error messages are indicated by their
subtype (e.g., LINKDOWN for Link-Down).  In actual messages the code is
used (e.g., 73 for LINKDOWN).

(A5) PacketWay Node Capabilities
+-------------------------------

   Code  Capability                 Parameters
   +---  ------------------------   --------------------------------------
     0   Illegal
     1   GP Computing Node
     2   Router                     SAN-IDs, 1+3 Bytes each
     3   PacketWay Server
     4   Network Multicast Server
     5   NFS
     6   NPS (Paging Server)
     7   Floating-point DSP         IEEE word-sizes (in bytes), 1B per size
     8   Fixed-point DSP            word-sizes (in bytes), 1B per size
     9   Printer
   253   Secure PacketWay HR
   254   Multicast agent for its SAN
   255   SAN

(A6) Optional Header Fields Types (OH)
+-------------------------------------

The MSbit of the type field (T) is the type-of-type.  Its assignment is:

   0b0: Optional  (may drop this OH if its type, tttttt, is unknown)
   0b1: Mandatory (should not process this OH if its type is unknown)

The next bit is the "Completion bit" (C).  Its assignment is:

   0b0: More options follow
   0b1: This is the last option field

The 6 LSbits, tttttt, are the type field.
Their assignment is:

   0x00:       Illegal
   0x01:       TBD
   0x02:       CRC32 here
   0x03:       CRC32 following in the OT (after the DB)
   0x04:       CRC64 here
   0x05:       CRC64 following in the OT (after the DB)
   0x06:       There is an OT (Optional Trailer)
   0x07-0x3D:  TBD
   0x3E:       Cryptographic data

(A7) Byte Order (Endianness)
+---------------------------

A 4 bit field (E) is used to indicate Endianness, with e being its first
bit (MSbit).

   e=0: Big-Endian order
   e=1: Little-Endian order

The 3 LSbits indicate the size of data chunks (must be the same for the
entire data block) to allow hardware swapping:

   e000: don't swap, it's 8-bit data
   e001: swap as if all the data is 16-bit words
   e010: swap as if all the data is 32-bit words
   e011: swap as if all the data is 64-bit words
   e100: swap as if all the data is 128-bit words
   e101: illegal and reserved for future use
   e110: illegal and reserved for future use
   e111: illegal and reserved for future use

(A8) Symbol Types (ST)
+---------------------

The format of Symbols is:

+--------+--------+--------+--------+--------+--------+--------+--------+
|vv000000|1011ssss|ssssssss|ssssssss| Length |  data  |........|........|
+--------+--------+--------+--------+--------+--------+--------+--------+
            <---- Symbol Type --->

   Code      Symbol Type
   --------  -----------------
   0x00000:  Reserved
   0x00001:  Multicast
   0x00002:  SCID

Appendix-B: Example of the use of RRP (over Myrinets)
-----------------------------------------------------

In this example Node1 on SAN1 (with MTU=16KB) is looking for an automatic
spectral analyzer (CC=X).  It uses TELL (s1) to ask its default router
(RTRA1, the half of RouterA connected to SAN#1) which nodes have this
capability.  RTRA1 knows of no such node, and says so by replying to
Node1 with ERROR/UNK (s2).  (The extent to which RTRA1 checks with others
before sending this reply is not specified here.)
Failing to find such an analyzer, Node1 looks for a DSP that handles IEEE
floating-point 64-bit data.

   (s3)  Node1 asks RTRA1 to provide the list of floating-point DSPs
         that can handle 64-bit IEEE data.
   (s4)  RTRA1 provides the addresses of both Node2 and Node3.  For its
         own reasons Node1 decides to use Node2.
   (s5)  Node1 asks RTRA1 which router to use for Node2.
   (s6)  RTRA1 suggests using RouterB.
   (s7)  Node1 uses L3-forwarding, via RouterB, to verify Node2's
         capabilities, by asking Node2 for information about itself.
   (s8)  Node2 provides this information, which Node1 likes.
   (s9)  Node1 asks RouterB for L2RH(s) to Node2.
   (s10) RouterB provides the requested L2RH with its MTU of 1,024
         8B-words (8KB).
   (s11) Finally, Node1 starts sending data to Node2 using L2-forwarding.

Similarly, Node2 may ask its default router which HR to use for Node1 and
for L2RH(s) to Node1.

If Node1 had only a Level-A implementation then it should have the
combined L2RH from itself to RouterB and from there to Node2 pre-wired,
saving all this message exchange.

  +-------+          +--0--+   SAN1   +--0--+          +--0--+
  | Node1 +----------3 SW0 1----------3 SW1 1----------3 SW2 1  MTU=16KB
  +-------+          +--2--+          +--2--+          +--2--+
                        |                                 |
            RTRA1  ***********       +---+---+       ***********  RTRB1
                   * RouterA *       | Node2 |       * RouterB *
            RTRA3  ***********       +---+---+       ***********  RTRB2
                        |                |                |
  +-------+   SAN3   +--0--+          +--0--+   SAN2   +--0--+
  | Node3 +----------3 SW3 1          3 SW4 1----------3 SW5 1  MTU=8KB
  +-------+          +--2--+          +--2--+          +--2--+

The sequence of messages is shown below.

(s1) Node1 sends a [TELL] message asking its default router (RTRA1) to
     provide a list of nodes with the capability code X (CC=X).

     Node1 knows that RTRA1 is on its network, with SR={2,PW}={2,3,0},
     where PW=0x0300 is the 16-bit Myrinet-type assigned to PacketWay.
     Myrinet is described here with absolute addresses.
     0        1        2        3        4        5        6        7
+-----------------------------------------------------------------------+
|    <---- The L2-header needed to get from Node1 to RouterA1 ---->     |
| It may be any number of bytes.  In this example it is 3 bytes: {2,PW} |
+--------+--------+--------+--------+--------+--------+--------+--------+
|00    P |          RTRA1           |     "TELL"      |     "R R P"     |PH1
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|E=0|PL=0| Data-Length=1 (8B-words) |0|  RZ  |0         Node1           |PH2
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|     "CAPA"      |  PL=3  |  RL=0  |  CC=X  |  xxx   |  xxx   |  xxx   |Spect
+--------+--------+--------+--------+--------+--------+--------+--------+
|      64 zero bits, unless any error was indicated along the path      |TAIL
+--------+--------+--------+--------+--------+--------+--------+--------+

This asks for information about nodes with capability-X.

(s2) RTRA1 uses [ERR/UNK] to tell Node1 that no such node is known to
RTRA1.

     0        1        2        3        4        5        6        7
+--------+--------+--------+--------+--------+--------+--------+--------+
|    <---- The L2-header needed to get from RouterA1 to Node1 ---->     |
| It may be any number of bytes.  In this example it is 3 bytes: {3,PW} |
+---+----+--------+--------+--------+--------+--------+--------+--------+
|00    P |          Node1           |       UNK       |     "E R R"     |PH1
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|E=0|PL=0| Data-Length=1 (8B-words) |0|  RZ  |0         RTRA1           |PH2
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|     "CAPA"      |  PL=3  |  RL=0  |  CC=X  |  xxx   |  xxx   |  xxx   |Spect
+--------+--------+--------+--------+--------+--------+--------+--------+
|      64 zero bits, unless any error was indicated along the path      |TAIL
+--------+--------+--------+--------+--------+--------+--------+--------+

(s3) Node1 sends another [TELL] message to RTRA1 asking for a list of
floating-point DSPs that handle 64-bit IEEE data (CC=7,8).
     0        1        2        3        4        5        6        7
+-----------------------------------------------------------------------+
|    <---- The L2-header needed to get from Node1 to RouterA1 ---->     |
| It may be any number of bytes.  In this example it is 3 bytes: {2,PW} |
+--------+--------+--------+--------+--------+--------+--------+--------+
|00    P |          RTRA1           |     "TELL"      |     "R R P"     |PH1
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|E=0|PL=0| Data-Length=1 (8B-words) |0|  RZ  |0         Node1           |PH2
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|     "CAPA"      |  PL=2  |  RL=0  |  CC=7  |   8    |  xxx   |  xxx   |64-DSP
+--------+--------+--------+--------+--------+--------+--------+--------+
|      64 zero bits, unless any error was indicated along the path      |TAIL
+--------+--------+--------+--------+--------+--------+--------+--------+

(s4) RTRA1 uses [INFO] to provide the addresses and capabilities of
both Node2 and Node3 (the former only 64 bits, the latter both 32 and
64).

     0        1        2        3        4        5        6        7
+-----------------------------------------------------------------------+
|    <---- The L2-header needed to get from RouterA1 to Node1 ---->     |
| It may be any number of bytes.  In this example it is 3 bytes: {3,PW} |
+--------+--------+--------+--------+--------+--------+--------+--------+
|00    P |          Node1           |     "INFO"      |     "R R P"     |PH1
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|E=0|PL=0| Data-Length=4 (8B-words) |0|  RZ  |0         RTRA1           |PH2
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|     "ADDR"      |  PL=0  |  RL=1  |  AT=1  |          Node2           |adr2
+--------+--------+--------+--------+--------+--------+--------+--------+
|     "CAPA"      |  PL=2  |  RL=0  |  CC=7  |   8    |  xxx   |  xxx   |FP-DSP
+--------+--------+--------+--------+--------+--------+--------+--------+
|     "ADDR"      |  PL=0  |  RL=1  |  AT=1  |          Node3           |adr3
+--------+--------+--------+--------+--------+--------+--------+--------+
|     "CAPA"      |  PL=1  |  RL=0  |  CC=7  |   4    |   8    |  xxx   |FP-DSP
+--------+--------+--------+--------+--------+--------+--------+--------+
|      64 zero bits, unless any error was indicated along the path      |TAIL
+--------+--------+--------+--------+--------+--------+--------+--------+

It is possible that Node2 and Node3 are specified by addresses that are
not physical addresses.

(s5) For its own reasons Node1 decides to use Node2 and sends [HRTO] to
ask RTRA1 which HR to use for Node2.

     0        1        2        3        4        5        6        7
+-----------------------------------------------------------------------+
|    <---- The L2-header needed to get from Node1 to RouterA1 ---->     |
| It may be any number of bytes.  In this example it is 3 bytes: {2,PW} |
+--------+--------+--------+--------+--------+--------+--------+--------+
|00    P |          RTRA1           |     "HRTO"      |     "R R P"     |PH1
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|E=0|PL=0| Data-Length=1 (8B-words) |0|  RZ  |0         Node1           |PH2
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|     "ADDR"      |  PL=0  |  RL=0  |  AT=1  |          Node2           |Dest
+--------+--------+--------+--------+--------+--------+--------+--------+
|      64 zero bits, unless any error was indicated along the path      |TAIL
+--------+--------+--------+--------+--------+--------+--------+--------+

(s6) RTRA1 uses [RDRC] to re-direct Node1 to Node2 via RouterB.

     0        1        2        3        4        5        6        7
+-----------------------------------------------------------------------+
|    <---- The L2-header needed to get from RouterA1 to Node1 ---->     |
| It may be any number of bytes.  In this example it is 3 bytes: {3,PW} |
+--------+--------+--------+--------+--------+--------+--------+--------+
|00    P |          Node1           |     "RDRC"      |     "R R P"     |PH1
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|E=0|PL=0| Data-Length=2 (8B-words) |0|  RZ  |0         RTRA1           |PH2
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|     "ADDR"      |  PL=0  |  RL=0  |  AT=1  |          Node2           |Dest
+--------+--------+--------+--------+--------+--------+--------+--------+
|     "ADDR"      |  PL=0  |  RL=0  |  AT=1  |          RTRB1           |via-HR
+--------+--------+--------+--------+--------+--------+--------+--------+
|      64 zero bits, unless any error was indicated along the path      |TAIL
+--------+--------+--------+--------+--------+--------+--------+--------+

Node1 knows how to get to RouterB over its SAN.

(s7) Node1 uses [TELL] (still using L3-forwarding via RouterB) to
verify Node2's capabilities, by asking Node2 for information about
itself.
     0        1        2        3        4        5        6        7
+-----------------------------------------------------------------------+
|    <---- The L2-header needed to get from Node1 to RouterB1 ---->     |
| It may be any number of bytes.  Here it is 5 bytes: {1,1,2,PW}        |
+--------+--------+--------+--------+--------+--------+--------+--------+
|00    P |          Node2           |     "TELL"      |     "R R P"     |PH1
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|E=0|PL=0| Data-Length=1 (8B-words) |0|  RZ  |0         Node1           |PH2
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|     "ADDR"      |  PL=0  |  RL=0  |  AT=1  |          Node2           |Addr
+--------+--------+--------+--------+--------+--------+--------+--------+
|      64 zero bits, unless any error was indicated along the path      |TAIL
+--------+--------+--------+--------+--------+--------+--------+--------+

(s8) Node2 uses [INFO] (via RouterB2, also using L3-forwarding) to
provide more information to Node1 about Node2 than what RTRA1 did.

     0        1        2        3        4        5        6        7
+-----------------------------------------------------------------------+
|    <---- The L2-header needed to get from Node2 to RouterB2 ---->     |
| It may be any number of bytes.  Here it is 4 bytes: {1,0,PW}          |
+--------+--------+--------+--------+--------+--------+--------+--------+
|00    P |          Node1           |     "INFO"      |     "R R P"     |PH1
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|E=0|PL=0| Data-Length=5 (8B-words) |0|  RZ  |0         Node2           |PH2
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|     "ADDR"      |  PL=0  |  RL=4  |  AT=1  |          Node2           |
+--------+--------+--------+--------+--------+--------+--------+--------+
|     "NAME"      |  PL=7  |  RL=1  |  "S"   |  "u"   |  "p"   |  "e"   |
+--------+--------+--------+--------+--------+--------+--------+--------+
|  "r"   |  xxx   |  xxx   |  xxx   |  xxx   |  xxx   |  xxx   |  xxx   |
+--------+--------+--------+--------+--------+--------+--------+--------+
|     "CAPA"      |  PL=1  |  RL=0  |  CC=7  |   4    |   8    |  xxx   |FP-DSP
+--------+--------+--------+--------+--------+--------+--------+--------+
|     "CAPA"      |  PL=3  |  RL=0  |  CC=5  |  xxx   |  xxx   |  xxx   |NFS
+--------+--------+--------+--------+--------+--------+--------+--------+
|      64 zero bits, unless any error was indicated along the path      |TAIL
+--------+--------+--------+--------+--------+--------+--------+--------+

Node2 provided more information about itself than RTRA1 did, such as
its name, "Super", its ability to handle 32-bit IEEE floating point (in
addition to 64-bit), and its being an NFS (CC=5).

(s9) Node1 uses [GVL2] to ask RouterB for L2RH(s) from RouterB to Node2.

     0        1        2        3        4        5        6        7
+-----------------------------------------------------------------------+
|    <---- The L2-header needed to get from Node1 to RouterB1 ---->     |
| It may be any number of bytes.  Here it is 5 bytes: {1,1,2,PW}        |
+--------+--------+--------+--------+--------+--------+--------+--------+
|00    P |          RTRB1           |     "GVL2"      |     "R R P"     |PH1
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|E=0|PL=0| Data-Length=1 (8B-words) |0|  RZ  |0         Node1           |PH2
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|     "ADDR"      |  PL=0  |  RL=0  |  AT=1  |          Node2           |Dest
+--------+--------+--------+--------+--------+--------+--------+--------+
|      64 zero bits, unless any error was indicated along the path      |TAIL
+--------+--------+--------+--------+--------+--------+--------+--------+

(s10) RouterB uses [L2SR] to provide Node1 with an L2RH from RTRB2 to
Node2, with its Q and MTU. Here it is {3,0,PW} from RouterB to Node2.

     0        1        2        3        4        5        6        7
+-----------------------------------------------------------------------+
|    <---- The L2-header needed to get from RouterB1 to Node1 ---->     |
| It may be any number of bytes.  Here it is 5 bytes: {3,3,3,PW}        |
+--------+--------+--------+--------+--------+--------+--------+--------+
|00    P |          Node1           |     "L2SR"      |     "R R P"     |PH1
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|E=0|PL=0| Data-Length=4 (8B-words) |0|  RZ  |0         RTRB1           |PH2
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|     "ADDR"      |  PL=0  |  RL=3  |  AT=1  |          Node2           |Dest
+--------+--------+--------+--------+--------+--------+--------+--------+
|     "SRQR"      |  PL=2  |  RL=1  |  xxx   |  xxx   |        Q        |SR+Q
+--------+--------+--------+--------+--------+--------+--------+--------+
|vv000000|11 L=4B |   3    |   0    |   3    |   0    |  xxx   |  xxx   |L2RH
+--------+--------+--------+--------+--------+--------+--------+--------+
|     "MTUR"      |  PL=1  |  RL=0  |      MTU=1,024 (in 8B-words)      |MTU
+--------+--------+--------+--------+--------+--------+--------+--------+
|      64 zero bits, unless any error was indicated along the path      |TAIL
+--------+--------+--------+--------+--------+--------+--------+--------+

The MTU in the MTUR above is the lesser of the MTUs of both networks.

The RL (record-length) of the last MTUR-record is included both in the
RL of the preceding SRQR-record and in the RL of the preceding
ADDR-record (since the RL of the SRQR is included in the RL of the
ADDR).

(s11) Finally, Node1 starts sending data to Node2 using L2-forwarding.

     0        1        2        3        4        5        6        7
+-----------------------------------------------------------------------+
|    <---- The L2-header needed to get from Node1 to RouterB1 ---->     |
| It may be any number of bytes.  Here it is 5 bytes: {1,1,2,PW}        |
+--------+--------+--------+--------+--------+--------+--------+--------+
|vv000000|11 L=4B |   3    |   0    |   3    |   0    |  xxx   |  xxx   |L2RH
+--------+--------+--------+--------+--------+--------+--------+--------+
|00    P |          Node2           |Sensor.SubType=? |    "Sensor"     |PH1
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|E=3|PL=0| Data-Length=? (8B-words) |0|  RZ  |0         Node1           |PH2
+---+----+--------+--------+--------+-+------+--------+--------+--------+
|                                                                       |Data
| <------------------- The sensor data goes here ---------------------> |....
|                                                                       |Data
+--------+--------+--------+--------+--------+--------+--------+--------+
|      64 zero bits, unless any error was indicated along the path      |TAIL
+--------+--------+--------+--------+--------+--------+--------+--------+

E=3 (0b0011) indicates that all the data is 64-bit, in Big Endian order.

Again, if Node1 had only a Level-A implementation then it would have
pre-wired the combined L2RH from itself to RouterB and from there to
Node2, saving all this message exchange.

All the messages shown in this appendix start with local L2 routing
bytes needed to get across either SAN1 or SAN2 (indicated with "The
L2-header needed to get from ... to ..."), which are not L2RHs. The
difference is that these bytes are in front of the packet, exposed to
the local switches, whereas the L2RHs are only exposed to
PacketWay-entities.
These local L2 routing bytes are the actual bytes required by the SANs
and are likely to be consumed as the message traverses the SAN, unlike
the L2RHs, which stay intact until converted to actual routing bytes.

The L2RHs start with 0b0000000011 followed by the number of routing
bytes in that L2RH, and possibly also by several bytes of padding.

Appendix-C: Routing Tables (RTs)
--------------------------------

Using only Levels A, B, and C of PktWay does not require coordination
of how routing tables (local and external/remote) are structured.
However, it is anticipated that in the future dynamic inter-SAN routing,
mapping, and resource discovery will be added to PktWay.

This appendix discusses a recommended structure for routing tables.

It is desired that implementors who are looking only at the Level A-C
document will not create some arbitrary internal representation for
their routing tables that will hamper future interoperability.
Instead, it is expected that keeping in mind the future need for common
routing-table structures will lead to structures that interoperate
easily once the PktWay specification is extended to include dynamic
inter-SAN routing, mapping, and resource discovery.

For that purpose this section includes a discussion of routing tables
which is NOT a part of this specification.

Routing tables provide the information needed for finding SRs to
destinations specified by their addresses. In future levels of the
PktWay specification the RTs will provide means to identify nodes also
by names and/or capabilities.

The RTs are based on "maps" for SANs prepared by "mappers", local nodes
on the SANs with maps (i.e., routing tables, LRTs) of their SANs,
obtained dynamically or statically. The inter-SAN routing process
depends on the exchange of these maps to form local and remote RTs.
The attributes of an RT are:

   SN        Serial Number of this RT (by RCVF)
   SAN-ID    ID of the SAN which this RT describes
   RCVF      List of Received-From physical addresses or SAN-IDs
             (history)
   CSR+Q     Common Source Route for the entire RT and its Q
   MinMTU    Min MTU for this RT (along the above CSR)
   Local-RT  Node-Structures, for nodes on the SAN specified by this RT

The Local-RT has one or more Node-Structures for each node on the SAN
specified by this RT. These Node-Structures are of the form:

   Address                 Physical address on this SAN
   [Name]                  Optional
   [Capabilities]          Optional
   [Logical-Address(es)]   Optional: The LA(s) to which it listens
   SR                      From the mapper to the node specified by
                           this structure

Each SR entry (and the CSR, too) contains Q, the quality of the SR, an
unsigned 16-bit integer. The units are not defined yet. It is assumed
that Q is monotonic (sort of analogous to latency, hence additive) with
all-0 being the best and all-1 the worst.

Until otherwise defined, let Q be an unsigned-integer in microseconds.
In updating it, its value should be clipped to the maximum value
(~64msec).

The CSR has a MinMTU which is the minimal MTU along the entire CSR.

The RCVF is the list of the physical addresses along which this RT was
forwarded. Its entries are either HR-addresses or SAN-IDs. The purpose
of the RCVF is to identify the genealogy of a composite route. It could
be used for preventing routing loops.

The RCVF could have been derived from the CSR, if only the HRs could
parse the CSR and associate HR-addresses with SRs and SAN-IDs with
HR-addresses, which should not be assumed.

Different RTs for the same SAN may be kept. Each RCVF has its own SN.

The Node-Structure (in an RT) has SRs from the mapper (of that RT) to
that node. The CSR is an SR to the same mapper. Hence, by catenating
the CSR to the beginning of the SR in the Node-Structure, an SR is
derived all the way from the local node (where the RT resides) to the
remote node.
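
The RT attributes, the Node-Structure, and the catenation of the CSR
with a node's SR can be illustrated with a minimal sketch. It is not
part of this specification. All class, field, and function names below
are hypothetical, Q is taken to be in microseconds as suggested above,
and clipping is done at the all-ones 16-bit value (0xFFFF), which is
assumed to be the ceiling that the text approximates as ~64msec.

```python
from dataclasses import dataclass, field
from typing import List, Optional

Q_MAX = 0xFFFF  # all-ones: the worst Q a 16-bit field can express


def add_q(q1: int, q2: int) -> int:
    """Add two Q values (Q is additive), clipping at the 16-bit maximum."""
    return min(q1 + q2, Q_MAX)


@dataclass
class SourceRoute:
    hops: List[int]  # Level-2 routing bytes, in order of use
    q: int           # quality of this SR (assumed microseconds)


@dataclass
class NodeStructure:
    address: int                 # physical address on this SAN
    sr: SourceRoute              # from the mapper to this node
    name: Optional[str] = None   # optional
    capabilities: List[int] = field(default_factory=list)       # optional
    logical_addresses: List[int] = field(default_factory=list)  # optional


@dataclass
class RoutingTable:
    sn: int                        # serial number (by RCVF)
    san_id: int                    # the SAN this RT describes
    rcvf: List[int]                # HR-addresses or SAN-IDs (history)
    csr: SourceRoute               # common SR to the mapper, with its Q
    min_mtu: int                   # minimal MTU along the CSR (8B-words)
    local_rt: List[NodeStructure]  # nodes on the described SAN

    def route_to(self, address: int) -> Optional[SourceRoute]:
        """Catenate the CSR with the node's SR to get an end-to-end SR."""
        for node in self.local_rt:
            if node.address == address:
                return SourceRoute(self.csr.hops + node.sr.hops,
                                   add_q(self.csr.q, node.sr.q))
        return None  # the node is not described by this RT
```

For example, with a CSR of {1,1,2} (Q=40) and a Node-Structure whose SR
is {3,0} (Q=25), route_to() yields the SR {1,1,2,3,0} with Q=65. The
addresses and Q values used in such an example are made up for
illustration only.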
Each SAN has a unique SAN-ID, known to the HRs on it. The SAN-IDs
share the PacketWay-address space with the nodes. Hence, a SAN-ID is
also a unique 24-bit physical PacketWay-address (starting with a 0 bit).

Appendix-D: Glossary
--------------------

Address: A unique designation of a node (actually an interface to that
   node) or a SAN.

Buddy-HR: HRs are "buddies" if they are on the same SAN.

Cut-Thru: See wormhole.

Destination: The node for which a packet is intended.

Dynamic-Routing: Routing according to dynamic information (i.e.,
   acquired at run time, rather than pre-set).

Endianness: The property of being Big-Endian or Little-Endian
   (transmission order, etc.)

Ethertype: A 16-bit value designating the type of Level-3 packets
   carried by a Level-2 communication system.

HR: Half-Router, the part of a router that handles one network only.

L2-Forwarding: Forwarding based on Level-2 (i.e., data-link layer of
   the ISORM) information, e.g., the native technique of each SAN or
   LAN. Also called "source routing."

L3-Forwarding: Forwarding based on end-to-end Level-3 (i.e., network
   layer of the ISORM) addresses. Also called "destination routing."

MAC: Message Authentication Code.

Map: The topology of a network.

Mapper: A node on a SAN/LAN that has the map and an RT for that
   network. It is expected that the mapper dynamically updates the map
   and the RT.

Multi-homed Node: A node with more than one network interface, where
   each interface has its own address.

Node: Whatever can send and receive packets (e.g., a computer, an MPP,
   a software process, etc.)

Node structure: A C-struct (or equivalent) containing values for some
   attributes of a node.

Planned Transfer: Transfer of information that occurs after an initial
   phase in which the sender decides which Level-2 route to use for
   that transfer.
RCVF: The "Received From" set includes all the physical addresses
   through which an RT was disseminated, starting with that of the
   mapper that created that RT.

Re-direct-message: A message that tells nodes which HR should be used
   in order to get to a certain remote address (or range of addresses).

Router: The inter-SAN communication device.

SAN: System Area Network.

Security Context: A relationship between 2 (or more) nodes that defines
   how the nodes utilize security services to communicate securely.

Source: The node that created a packet.

Source-Route: A Level-2 route that is chosen for a packet by its source.

Symbol: Data preceding the EEP header of a PktWay message, interleaved
   with the L2RHs.

Twin-HR: Two HRs are twins if they both are parts of the same inter-SAN
   router.

Wormhole-routing: (aka cut-thru routing) forwarding packets out of
   switches as soon as possible, without storing the entire packet in
   the switch (unlike store-and-forward).

Zero-copy TCP: A TCP system that copies data directly between the user
   area and the network device, bypassing OS copies.
Appendix-E: Acronyms and Abbreviations
--------------------------------------

   0bNNNN   The binary number NNNN (e.g., 0b0100 is 4-decimal)
   0xNNNN   The hexadecimal number NNNN (e.g., 0x0100 is 256-decimal)
   8B       8 byte (64 bits) entity
   ADDR     The Address-record of RRP
   API      Application/Program Interface
   AT       Address Type
   ATM      Asynchronous Transfer Mode
   B        Byte (e.g., 4B)
   b        bit (e.g., 32b)
   BC       Byte Count (of parameters)
   BER      Bit Error Rate
   CAPA     The capability-record of RRP
   CC       Capability Code
   CSR      Common Source-Route
   DA       Destination Address
   DB       Data Block
   DL       Data Length (in 8B words)
   DSP      Digital Signal Processor
   e        The MSbit of E
   E        The Endianness field (in the EEP header)
   EEP      End/End Protocol
   EI       Error Indication
   GP       General Purpose
   GVL2     An RRP message, requesting L2 route to a given destination
   GVRT     An RRP message asking an HR to give its routing tables
   h        Optional header fields flag
   HR       Half Router
   HRTO     An RRP message asking which HR to use for a given destination
   ID       Identification
   IGMP     Internet Group Management Protocol
   INFO     An RRP message providing information about nodes
   IP       The Internet protocol
   ISORM    The ISO Reference Model
   L        Length field (exclusive of itself)
   L2       Level-2 of the ISORM (Link)
   L2RH     Level-2 Routing Header
   L2SR     Source Route
   L3       Level-3 of the ISORM (Network)
   LA       Logical Address
   LADR     The Logical-addresses-record of RRP
   LAN      Local Area Network
   LRT      Local Routing Table
   LSbit    Least Significant bit
   LSbyte   Least Significant byte
   MPI      Message Passing Interface
   MPP      Massively Parallel Processing system
   MSbit    Most Significant bit
   MSbyte   Most Significant byte
   MSU      Mississippi State University
   MTU      Maximum Transmission Unit
   MTUR     The MTU-record of RRP
   M/C      Multicast
   NAME     The name-record of RRP
   NFS      Network File Server
   OH       Optional Header field
   OH-TYPE  The Type of an Optional Header field
   OT       Optional Trailer field
   P        The Priority field
   PAD      Padding After Data
   PBD      Padding Before Data
   PCI      The Peripheral Component Interconnect "standard"
   PH       PacketWay Header
   PL       Padding Length (always in bytes)
   PPP      The Point-to-Point Protocol
   PROM     Programmable ROM (Read-Only-Memory)
   PT       Packet Type (2B)
   PVM      Parallel Virtual Machine
   PW       The Myrinet Packet Type assigned to PktWay (PW=0x0300)
   Q        Quality (of a path)
   RCVF     Received-From list, or the Received-From record of RRP
   RDRC     A re-direct message of RRP
   RH       Routing Header
   RID      Record ID
   RL       Record Length (in 8B-words)
   RRP      Router/Router Protocol
   RT-hd    RT (Routing Table) header
   RT       Routing Table
   RTBL     An RRP message providing a Routing Table
   RTHD     The Routing-Table-Header record of RRP
   RTyp     RRP's Record Type
   RZ       The Reserved field (in the EEP header)
   SA       Source Address
   SAN      System Area Network
   SAN-ID   The 24-bit PktWay-address of a SAN
   SAR      Segmentation and Reassembly
   SN       Serial Number
   SNID     SAN-ID
   SNMP     Simple Network Management Protocol
   SR       Source Route (always at Level-2)
   SRQR     The Source-Route-and-Q-record of RRP
   ST       Symbol Type
   TAIL     PacketWay EEP Trailer
   TE       Type Extension (2B)
   TELL     An RRP message requesting information about nodes partially
            specified
   UNK      Unknown
   V        Version
   WRU?     An RRP message asking its recipient to identify itself
   XRT      External Routing Table
   xxx      A padding byte

draft-ietf-pktway-protocol-spec-03.txt                            [end]