INTERNET DRAFT                                          Randy Haagens
                                                  Hewlett-Packard Co.
Expires January 2001                                        July 2000
<draft-haagens-ips-iscsireqs-00.txt>

                   iSCSI (Internet SCSI) Requirements


Status of this Memo


     This document is an Internet-Draft and is in full conformance with
     all provisions of Section 10 of RFC2026.  Internet-Drafts are work-
     ing documents of the Internet Engineering Task Force (IETF), its
     areas, and its working groups. Note that other groups may also dis-
     tribute working documents as Internet-Drafts.  Internet-Drafts are
     draft documents valid for a maximum of six months and may be
     updated, replaced, or obsoleted by other documents at any time. It
     is inappropriate to use Internet-Drafts as reference material or to
     cite them other than as "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt.  The list of Internet-
     Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.



Comments

     Comments should be sent to the ips mailing list (ips@ece.cmu.edu)
     or to the author(s).


Abstract

     This document explains the motivation behind an efficient transport
     of SCSI commands on top of TCP/IP and describes scenarios where
     such a transport will be used. The document also enumerates and
     discusses requirements for supporting SCSI on top of IP.


Scope

     We propose to define a mapping of SCSI protocol to TCP/IP so that
     SCSI storage controllers (principally disk and tape arrays and
     libraries) can be attached to IP networks, notably Gigabit Ethernet
     (GbE) and 10 Gigabit Ethernet (10 GbE).



Randy Haagens                                                   [Page 1]

Internet-Draft              TCP RDMA option                 July 7, 2000


Motivation

     We seek timely adoption of a protocol mapping for block storage
     over IP networks.  Accordingly, we have chosen to work with the
     existing SCSI architecture and commands and also the existing
     TCP/IP transport layer.  Both these protocols are widely-deployed
     and well-understood.  Using them means a minimum of new invention,
     the most rapid possible adoption, and the greatest compatibility
     with Internet architecture, protocols, and equipment.

     The iSCSI protocol is a mapping of SCSI to TCP, and constitutes a
     "SCSI transport" as defined by the SCSI SAM-2 document [SAM-2,
     p. 3, "Transport Protocols"].


1 Applicability

     Traditionally, volume/block-oriented storage controllers (e.g.,
     disk array controllers, tape library controllers) have supported
     the SCSI-3 protocol, and have been attached to computers through
     the SCSI parallel bus or through Fibre Channel.  File-oriented
     storage controllers have supported the NFS and/or CIFS protocols,
     and have been attached directly to IP networks such as Ethernet.

     The IP/Ethernet infrastructure offers compelling advantages for
     volume/block-oriented storage attachment compared to current
     approaches:

          * Increasing performance and reduced cost driven by Internet
          economics and "IP convergence"

          * Seamless conversion from local to wide area using IP routers

          * Emerging availability of "IP datatone" service from car-
          riers, in preference to ATM or SONET or T-1, T-3 services

          * Protocols and middleware for management, security and QoS

          * Economies arising from the need to install and operate only
          a single type of network

     The following applications for iSCSI are contemplated:

          * Local storage access, consolidation, clustering and pooling
          (as in the data center)

          * Remote disk access (as for a storage utility)

          * Local and remote synchronous and asynchronous mirroring
          between storage controllers

          * Local and remote backup and restore

          * Evolution with SCSI to support the emerging object-oriented
          storage model

     And the following connection topologies are contemplated:

          * Point-to-point direct connections

          * Dedicated storage LAN, consisting of one or more LAN seg-
          ments

          * Shared LAN, carrying a mix of traditional LAN traffic plus
          storage traffic

          * LAN-to-WAN extension using IP routers or carrier-provided
          "IP datatone"

          * Private networks and the public Internet

     The iSCSI standard will permit SCSI volume/block-oriented devices
     to be attached directly to IP networks such as Ethernet.  The
     SCSI-3 command sets (defined by the ANSI NCITS T10 committee) will
     be mapped to TCP.  iSCSI is this mapping, and is analogous to (but
     not the same as) SCSI-FCP (aka "FCP"), which is the mapping of SCSI
     to Fibre Channel.

     Local-area storage networks will be built using Ethernet LAN
     switches.  These networks may be dedicated to storage, or shared
     with traditional Ethernet uses, as determined by cost, performance,
     administration, and security considerations.  In the local area,
     TCP's adaptive retransmission timers will provide for automatic and
     rapid error detection and recovery, compared to alternative techno-
     logies.

     IP LAN-WAN routers will be used to extend the IP storage network to
     the wide area, permitting remote disk access (as for a storage
     utility), synchronous and asynchronous remote mirroring, and remote
     backup and restore (as for tape vaulting).  In the WAN, TCP end-
     to-end will avoid the need for specialized equipment for protocol
     conversion, ensure data reliability, cope with network congestion,
     and automatically adapt retransmission strategies to WAN delays.

     The full realization of iSCSI will involve the following elements:
     (1) Completion of Requirements (this document) and Specification
     documents; (2) Development of Ethernet storage NICs and related
     driver and protocol software; (3) Development of compatible storage
     controllers; and (4) The likely development of translating gateways
     to provide connectivity between the Ethernet storage network and
     the Fibre Channel and/or parallel-bus SCSI domains.

     Products will initially be offered for Gigabit Ethernet attachment,
     with rapid migration to 10 GbE.  For performance competitive with
     alternative SCSI transports, it will be necessary to implement the
     performance path of the full protocol stack in hardware.  These new
     storage NICs will perform full-stack processing of a complete SCSI
     task, analogous to today's SCSI and Fibre Channel HBAs.  They typi-
     cally also will support all host protocols that use TCP, including
     NFS, CIFS and HTTP.

     A key goal is not to require modifications to the current IP and
     Ethernet infrastructure to support storage traffic over TCP.
     Nevertheless, the performance and security requirements of storage
     will create opportunities for improvement in security protocols and
     QoS implementations.  The addition of storage traffic to local- and
     wide-area internets (and even to the public Internet) may introduce
     increased requirements for traffic monitoring and engineering in
     those environments.

     It is contemplated that many organizations initially will choose to
     operate storage networks based on iSCSI that are independent of
     (isolated from) their current data networks except for secure rout-
     ing of storage management traffic.  These organizations will bene-
     fit from the high performance/cost of IP equipment and a unified
     management architecture, compared to alternative means of building
     storage networks.  As security and QoS evolve, it will become more
     reasonable to build combined networks with shared infrastructure;
     nevertheless, it is likely that sophisticated users will choose to
     keep their storage subnetworks isolated, for the best control of
     security and QoS.

     The proposed charter of the IETF IP SCSI Working Group (IPSWG)
     describes the broad goal of mapping SCSI to IP.  Within that broad
     charter, many transport alternatives may be considered.  Our ini-
     tial work focuses on TCP, and this Requirements document is res-
     tricted to that domain of interest.  At the current time, we do not
     seek a more generic requirements statement that would justify the
     choice of TCP (or another protocol) as transport, since the merits
     of using TCP are readily evident to the working group participants.


2 Definitions

     Certain definitions are offered here, with references to the origi-
     nal document where applicable, in order to clarify the discussion
     of requirements.  Throughout the text, use of defined terms is
     emphasized by producing them in bold face type.  Definitions
     without references are the work of the authors and reviewers of
     this document.

     Logical Unit (LU): A target-resident entity that implements a dev-
     ice model and executes SCSI commands sent by an application client
     [SAM-2, section 3.1.50, p. 7].

     Logical Unit Number (LUN): A 64-bit identifier for a logical unit
     [SAM-2, section 3.1.52, p. 7].

     SCSI Device:  A device that is connected to a service delivery sub-
     system and supports an SCSI application protocol [SAM-2, section
     3.1.78, p. 9].

     Service Delivery Port (SDP): A device-resident interface used by
     the application client, device server, or task manager to enter and
     retrieve requests and responses from the service delivery subsys-
     tem.  Synonymous with port (SAM-2 section 3.1.61) [SAM-2, section
     3.1.89, p. 9].

     Target: An SCSI device that receives SCSI commands and directs such
     commands to one or more logical units for execution [SAM-2 section
     3.1.97, p. 10].

     Task: An object within the logical unit representing the work asso-
     ciated with a command or a group of linked commands [SAM-2, section
     3.1.98, p. 10].

     Transaction: A cooperative interaction between two objects, involv-
     ing the exchange of information or the execution of some service by
     one object on behalf of the other [SAM-2, section 3.1.109, p. 10].
     [A transaction seems to be a smaller unit than a task.]


3 Requirements

     In the sections that follow, actual requirements statements are
     flagged with [R].  Related discussion is flagged with [D].

     The requirements are somewhat arbitrarily grouped into categories.
     This is for convenience only.  No semantic meaning is to be implied
     from the category names.


3.1 General

     [R] Support block storage IO over IP networks.

          [D] Our initial approach uses SCSI for the block storage pro-
          tocol, and TCP/IP for the network transport.

     [R] Minimize optional features; but when allowed, (1) Allow for
     option negotiation at session establishment (login); (2) Provide
     for signaling an error (reject) when an unsupported feature is
     requested.
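
          As an illustration only, login-time negotiation with a reject
          path might be sketched as follows.  The option names and
          values are hypothetical and are not taken from any iSCSI
          specification; the sketch shows only the accept/reject split
          that this requirement asks for.

```python
# Hypothetical option table; names and values are illustrative only.
TARGET_SUPPORTS = {
    "connections_per_session": {1, 2, 4},
    "ordered_delivery": {"yes", "no"},
}


def negotiate(requested, supported=TARGET_SUPPORTS):
    """Split a login request into accepted options and rejected names."""
    accepted, rejected = {}, []
    for name, value in requested.items():
        if value in supported.get(name, ()):
            accepted[name] = value
        else:
            rejected.append(name)   # unsupported: signal an error (reject)
    return accepted, sorted(rejected)


accepted, rejected = negotiate({
    "connections_per_session": 4,
    "ordered_delivery": "yes",
    "header_compression": "yes",    # not offered by this target
})
print("accepted:", accepted)
print("rejected:", rejected)
```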


3.2 Performance/Cost

     In general, iSCSI must allow implementations to equal or improve on
     the current state of the art for SCSI interconnects.

     [R] Low delay communication.

          [D] Conventional storage access is of a stop-and-wait or
          remote procedure call type.  Applications typically employ
          very little pipelining of their storage accesses, and so
          storage access delay directly impacts performance.  The delay
          imposed by current storage interconnects, including protocol
          processing, is generally in the range of 100 microseconds.
          The use of caching in storage controllers means that many
          storage accesses complete almost instantly, and so the delay
          of the interconnect can have a high relative impact on overall
          performance.

     [R] High bandwidth, bandwidth aggregation.

          [D] The bandwidth (transfer rate, MB/sec) supported by storage
          controllers is rapidly increasing, due to several factors: (1)
          Increase in disk spindle and controller performance; (2) Use
          of ever-larger caches, and improved caching algorithms; (3)
          Increased scale of storage controllers (number of supported
          spindles, speed of interconnects).  Not only must iSCSI
          provide for full utilization of available link bandwidth, it
          also must exploit parallelism (multiple connections) at the
          device interfaces and within the interconnect fabric.

     [R] Low CPU utilization, equal to or better than current technol-
     ogy.

          [D] For competitive performance, the iSCSI protocol must allow
          three key implementation choices to be realized: (1) iSCSI
          must make it possible to build I/O adapters that handle an
          entire SCSI task, as alternative SCSI transport implementa-
          tions do.  (2) The protocol must permit "zero-copy" memory
          architectures, where the I/O adapter reads or writes host
          memory exactly once per disk transaction. (3) The protocol
          must not impose complex operations on the host software, which
          would increase host instruction path length relative to alter-
          natives.

     [R] Cost competitive with alternative storage network technologies.


3.3 SCSI

     [R] Collaboration with ANSI NCITS T10 (SCSI)

          [D] iSCSI is a new SCSI "transport" [SAM-2].  Being the inter-
          section of SCSI and TCP, iSCSI has potential impact on T10 as
          well as on IETF.  However, a stated requirement (below) is
          that iSCSI shall have no impact on T10 architecture or command
          sets.  Collaboration with T10 will be necessary to achieve
          this requirement.

          [D] Collaboration with T10 concerns three phases of T10
          activity: (1) Past.  For T10 work completed in the past, and
          well-documented in T10 standards publications, we will seek
          assistance in properly interpreting those standards; (2)
          Present.  For T10 work that is ongoing, or recently completed
          (but not widely published), we will seek review of our work by
          individuals active in T10, and/or the participation of those
          individuals in the IETF process; (3) Future.  For compatibil-
          ity with future T10 work, it is essential that iSCSI be a leg-
          itimate and recognized "SCSI transport", no less so than the
          several other SCSI transports.  SCSI command standards must
          evolve within the context of all existing SCSI transports.

          [D] Storage attachment to IP networks will engender an unpre-
          cedented potential for device sharing.  This alone may impact
          future T10 work.

     [R] Supported SCSI Device types.  iSCSI shall support all SCSI dev-
     ice types.  Our primary focus is on supporting "larger" devices:
     host computers and storage controllers (disk arrays, tape library
     controllers).

          [D] Supported SCSI Devices will typically have adequate memory
          to implement the TCP transport and required iSCSI session
          state, and a cost structure that can support VLSI for full-
          stack protocol acceleration. Generally, a controller will be
          interposed between the iSCSI (typically Ethernet) connections
          and the drive interface (typically parallel SCSI or Fibre
          Channel).  In the longer term, it will become feasible, due to
          the march of technology, to support iSCSI economically in disk
          spindle and tape mechanism controllers.

     [R] Support SCSI SAM-2 architecture model.

          [D] It would be helpful to produce a document discussing iSCSI
          with reference to SAM-2.  No promises.

     [R] Reliable Transport.  The iSCSI mapping provides the SCSI-3 com-
     mand layer with a reliable transport, equal to or greater in relia-
     bility than the parallel SCSI bus, and providing in-order delivery,
     as suggested by SAM-2.

          [D] See [SAM-2, p. 17.] "The function of the service delivery
          subsystem is to transport an error-free copy of the request or
          response between the sender and the receiver..." [SAM-2, p.
          22] "The manner in which ordering constraints are established
          is implementation-specific.  An implementation may choose to
          delegate this responsibility...to the service delivery port.
          In some cases, in-order delivery may be an intrinsic property
          of the transport subsystem or a requirement established by the
          SCSI protocol standard.  For convenience, the SCSI architec-
          ture model assumes in-order delivery to be a property of the
          service delivery subsystem.  This assumption is made to sim-
          plify the description of behavior and does not constitute a
          requirement."

     [R] Support for SCSI Task Queuing.

          [D] SAM-2 defines task queuing, and so strictly speaking, we
          don't need to call this out specifically.  However, task queu-
          ing is not widely implemented today; and it will increase in
          importance with WAN IP networks, given speed-of-light delays.
          We are particularly interested in supporting task queuing for
          pipelined remote backup and asynchronous disk mirroring.

          [D] Just because iSCSI supports task queuing doesn't mean that
          the end SCSI node is required to do so also.  Task queuing is
          an optional feature of SCSI.

     [R] Supports all SCSI-3 command sets [SPC-2, SBC, etc.].  There
     will be no requirement by T10 to modify the SCSI command documents.
     No modifications are required of the SCSI command layer implementa-
     tion, except possibly to lengthen task timers to accommodate wide-
     area delays due to speed-of-light and switching.

          [D] Note the restriction to SCSI-3 command sets.  There are
          potential problems with gateways between iSCSI and SCSI-2
          parallel bus devices.  It may not be feasible to transport
          SCSI-2 commands over iSCSI.  Gateways that wish to support
          older SCSI-2 devices may have to proxy for those devices,
          using SCSI-3 commands.

     [R] Forward compatibility with future revisions of SCSI architec-
     ture and protocol.  Attention to clean layering of protocols.

          [D] This is a difficult requirement to achieve in practice,
          since we cannot predict how SCSI will evolve.  However, care-
          ful attention to protocol layering principles will help ensure
          this result.

     [R] Gateways to parallel SCSI [SPI-X] and to SCSI-FCP [FCP, FCP-2].
     It will be possible to construct "translating" gateways so that
     iSCSI hosts can talk to SCSI-X devices; so that SCSI-X devices can
     talk to each other over an iSCSI network; and so that SCSI-X hosts
     can talk to iSCSI devices (where SCSI-X refers to parallel SCSI,
     SCSI-FCP, or SCSI over any other transport).

          [D] This requirement is implied by support for SAM-2, but is
          worthy of emphasis.

          [D] These are true application protocol gateways, and not just
          bridge/routers.  The different standards have only the SCSI-3
          command set layer in common.  These gateways are not mere
          packet forwarders.  We need to look into their remote proxy
          behavior.

          [D] Adequate liaison must be established with related stan-
          dards bodies, principally ANSI T10 (SCSI).


3.4 iSCSI Session Layer

     [R] SCSI command, data, and response transactions occur in a TCP
     connection that is determined by the initiator, in advance of
     starting the SCSI task.

          [D] This requirement allows the initiator to assign the data
          transfer phase of a task to a given data transfer engine, at
          initiation of the task.

     [R?] TCP connection allegiance.  SCSI commands, data and status
     information for a given task shall flow within the same single TCP
     connection.

          [D] This is a stronger statement than the one above, and is
          left here as a potential requirement, mostly so that it will
          be clear that the discussion topics below pertain to the
          notion of channel allegiance.

          [D] SAM-2 seems to require this channel allegiance: "A task
          involving one initiator-target pair shall not specify a third
          SCSI device to participate in transmitting and receiving the
          remote procedure model elements for that task.  Thus, an SMU
          initiator [e.g., a host computer] shall not create a task
          using one service delivery port with the expectation that the
          data transfer or status return for that task would occur via a
          different service delivery port" [SAM-2, section 4.10.7,
          p. 33].  Of course, interpretation of this clause depends on
          the definition of service delivery port.  If a service
          delivery port is a TCP connection, then channel allegiance is
          pretty clearly required.  But if a service delivery port is an
          iSCSI session or an abstract target device, then the interpre-
          tation of this clause is less clear.

          [D] We have found a number of other possible virtues in chan-
          nel allegiance: (1) It supports multiple instances of the TCP
          protocol engine being controlled by a single iSCSI session
          layer; (2) Failure of a TCP connection will affect only a sub-
          set of the extant tasks (those that use the failed connec-
          tion); (3) All TCP connections are used in exactly the same
          manner; (4) There is no need to have more than one IP port
          defined for the iSCSI protocol, which is firewall-friendly.
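
          The second virtue above can be illustrated with a minimal
          sketch.  The Session class, its method names, and the
          round-robin choice of connection are all hypothetical; the
          only property taken from the text is that each task is bound
          to one connection before it starts and stays there.

```python
class Session:
    """Hypothetical session that pins each task to one TCP connection."""

    def __init__(self, connection_ids):
        self.connection_ids = list(connection_ids)
        self.task_connection = {}   # task tag -> connection id
        self._issued = 0

    def start_task(self, task_tag):
        """Choose the connection before the task starts; it never changes."""
        conn = self.connection_ids[self._issued % len(self.connection_ids)]
        self._issued += 1
        self.task_connection[task_tag] = conn
        return conn

    def connection_failed(self, conn):
        """Drop the connection; return only the tasks that were bound to it."""
        self.connection_ids.remove(conn)
        affected = sorted(
            tag for tag, c in self.task_connection.items() if c == conn
        )
        for tag in affected:
            del self.task_connection[tag]
        return affected


session = Session(["conn-0", "conn-1", "conn-2"])
for tag in range(6):
    session.start_task(tag)
affected = session.connection_failed("conn-1")
print("tasks needing recovery:", affected)
print("tasks unaffected:", sorted(session.task_connection))
```

          With six tasks dealt across three connections, the failure of
          one connection disturbs two tasks and leaves the other four
          running.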

     [R] Command striping (load balancing) across multiple host and dev-
     ice interfaces.  It shall be possible to utilize multiple con-
     current paths between hosts and devices for the purpose of load
     balancing.

          [D] Load balancing refers to concurrent tasks from a single
          initiator.  There is no ordering constraint among these tasks.
          We aim to distribute these tasks (commands and their related
          data and status) across multiple host ports, links, switch
          ports and device ports, in order to achieve aggregate perfor-
          mance equal to a multiple of single link performance.

     [R] Command ordering for tape backup and asynchronous remote mir-
     roring.  It must be possible to pipeline commands to a device, and
     to have them executed in order by that device, as prescribed by
     SAM-2.

          [D] Ordering can be maintained by allowing each command to
          complete before issuing the next.  But that means there is no
          pipelining.  For tape backup in the local area, this may be
          adequate, as the tape controller buffer can be made suffi-
          ciently large to cover the lower duty cycle of data transfers,
          and LAN speeds are fast enough to burst-fill the buffer.  But
          in the wide area, a method of pipelining commands and
          responses is needed if the slower WAN link is to be filled
          continuously with data.
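
          The cost of completing each command before issuing the next
          can be estimated with simple arithmetic.  Every figure in the
          sketch below (link rate, round-trip time, command size) is a
          hypothetical value chosen only to show the effect, and the
          model assumes one full round trip of idle time per command.

```python
import math

# All figures are hypothetical, chosen only to illustrate the effect.
LINK_BPS = 100e6             # WAN link rate, bits per second
RTT_SEC = 0.050              # round-trip time, seconds
COMMAND_BYTES = 64 * 1024    # data moved by one command


def transfer_seconds(nbytes, link_bps):
    """Time the link is busy carrying one command's data."""
    return nbytes * 8 / link_bps


def stop_and_wait_utilization(nbytes, link_bps, rtt_sec):
    """Fraction of time the link is busy with one command outstanding."""
    busy = transfer_seconds(nbytes, link_bps)
    return busy / (busy + rtt_sec)


def commands_to_fill(nbytes, link_bps, rtt_sec):
    """Commands that must be in flight to keep the link continuously busy."""
    busy = transfer_seconds(nbytes, link_bps)
    return math.ceil((busy + rtt_sec) / busy)


utilization = stop_and_wait_utilization(COMMAND_BYTES, LINK_BPS, RTT_SEC)
depth = commands_to_fill(COMMAND_BYTES, LINK_BPS, RTT_SEC)
print(f"stop-and-wait link utilization: {utilization:.1%}")
print(f"commands in flight needed to fill the link: {depth}")
```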

          [D] This brings up an issue, if commands are sent in different
          TCP connections.  Although a single TCP connection delivers an
          ordered byte stream, there is no ordering constraint between
          TCP connections.  So command striping across TCP connections
          will result in the commands possibly being executed out of
          order, unless the commands themselves are numbered, and can be
          put back into order.  SCSI does not provide a means for put-
          ting commands back in order, but requires that functionality
          of the "transport".

          [D] We contemplate bonding multiple TCP connections into an
          iSCSI session for the purpose of ordered command striping.  A
          command reference number (CRN) will allow iSCSI to receive
          commands in order from the initiator SCSI command layer, and
          deliver them in order to its peer command layer in the target.
          Note that this mechanism can be employed at all times, because
          delivering commands in order never hurts, even if the SCSI
          layer imposes no ordering constraints among them.  This is the
          safest route, in fact, as it upholds the SAM-2 expectation of
          in-order delivery.  We expect the ability to support a session
          consisting of multiple channels to be optional.
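
          A minimal sketch of this idea follows.  It assumes that CRNs
          start at zero and never wrap, and it deals commands to
          connections round-robin; the class and function names are
          hypothetical, and none of this is a wire format.

```python
class OrderedDelivery:
    """Target-side reorder buffer keyed by command reference number."""

    def __init__(self, first_crn=0):
        self.next_crn = first_crn   # next CRN owed to the SCSI layer
        self.pending = {}           # CRN -> command that arrived early

    def receive(self, crn, command):
        """Accept a command from any connection; return those now in order."""
        self.pending[crn] = command
        ready = []
        while self.next_crn in self.pending:
            ready.append(self.pending.pop(self.next_crn))
            self.next_crn += 1
        return ready


def stripe(commands, n_connections):
    """Initiator side: number the commands and deal them round-robin."""
    queues = [[] for _ in range(n_connections)]
    for crn, command in enumerate(commands):
        queues[crn % n_connections].append((crn, command))
    return queues


queues = stripe(["WRITE a", "WRITE b", "WRITE c", "WRITE d"], 2)
target = OrderedDelivery()
delivered = []
# Simulate the second connection running ahead of the first.
for crn, command in queues[1] + queues[0]:
    delivered.extend(target.receive(crn, command))
print(delivered)
```

          Each connection still delivers its own commands in order; the
          reorder buffer only restores the ordering between
          connections.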

     [R] Recovery at the session layer.  The session layer specification
     shall explicitly address recovery at the session layer (from a
     failed TCP connection, for example).

          [D] TCP will recover from data loss due to bit errors or
          congestion.  But what if a TCP connection fails (hangs)?  The
          specification needs to address this issue.

          [D] Another case that we should consider is loss of session
          state at either the target or the initiator, for example, when
          a target is power cycled.  Should it be possible to restore
          the session in this case, or will we have to report service
          delivery failure to the SCSI layer, for recovery at that
          level?  In the case of a recovered session, we're concerned
          about "ghost IOs" that may inappropriately linger from a pre-
          vious session.


3.5 Transport, Network and Link

     [R] Works with existing installed Ethernet and IP WAN infrastruc-
     ture.  iSCSI should not require any modification to Ethernet hubs,
     switches or WAN routers to achieve minimum acceptable performance,
     QoS and security.

          [D] Using existing and off-the-shelf technology will allow
          iSCSI to fully leverage the cost, performance and rapid
          improvement of widely-deployed IP LAN and WAN technologies.
          Therefore, iSCSI cannot require the installation of special,
          non-standard features in the underlying technology.  However,
          it may be desirable to apply certain optimizations that will
          enhance storage protocol performance, or the performance of
          other protocols in the presence of the storage protocol.

     [R] Joint operation (coexistence) with other IP protocols.  iSCSI
     shall not preclude concurrent operation with any of the protocols
     in the IP protocol suite, and shall be a good Internet citizen.

          [D] Many organizations will choose to operate iSCSI storage
          networks as separate networks from their traditional data
          networks, connected by a router only for management traffic.
          This approach delivers the most manageable environment from a
          performance and security perspective, and is analogous to
          today's separate Fibre Channel storage networks, except for
          the obvious benefits that derive from using LAN technologies.
          On the other hand, some organizations will favor using fewer
          networks, and mixing storage with other types of traffic.
          This practice will be more prevalent in the wide area, where
          dedicated storage links exact a high price.  For these
          reasons, graceful coexistence is required.  Over time,
          improved support for the QoS and security features inherent in
          IP and Ethernet protocols will make it more and more
          reasonable to combine storage with other types of network
          traffic.

          [D] When storage is transported over the wider Internet, it
          must be done in a way that respects TCP's bandwidth management
          and congestion avoidance algorithms.  This is one of the rea-
          sons for selecting TCP as the transport.  We feel that TCP
          itself is a good Internet citizen, and our best chance for
          compatibility.

     [R] Uses TCP/IP.  iSCSI is a protocol mapping from SCSI to TCP.

          [D] While we don't preclude consideration of alternative tran-
          sports, we have focused our attention on TCP. Given wide-area
          functions in a storage controller, and the resulting need for
          TCP support, inclusion of an alternative local-area transport
          may imply an increment of cost, not a cost savings; and it
          certainly represents an increment of complexity.

     [R] Link Independent.  iSCSI is defined for all IP networks, and is
     link-independent.  All IP-compatible LAN and WAN links are sup-
     ported.  Specifically, there are no dependencies on Ethernet.

          [D] We may nevertheless want to benefit from certain link
          capabilities like Ethernet port aggregation and PPP multi-
          link.  But the spec should not depend on these capabilities
          for its viability.

     [R] LAN-, MAN- and WAN-capable.  SCSI Devices that implement iSCSI
     will be capable of communicating with similarly-equipped devices
     and host computers over any IP network, whether local, metropoli-
     tan, or wide-area in scale.

          [D] iSCSI is used not only for local area disk block access
          and tape operations.  It also is used for remote disk access
          (as for a storage utility), remote disk mirroring, and remote
          backup and restore (as for tape vaulting).  Using TCP in the
          iSCSI end nodes means that the protocol is scalable from the
          local to the wide area.

     [R] Handles high bandwidth x delay fabrics.

          [D] This requirement must be clarified further, as an exten-
          sion of the WAN requirement.  Consider that the TCP pipe at 10
          Gbps x 200 msec holds 250 megabytes.  Will TCP sequence counts
          be up to this, or will they wrap too frequently?
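
          The arithmetic behind the 250 MB figure, and behind the
          sequence-count question, can be sketched as follows.  This is
          an illustration only: the function names are ours, and the
          calculation assumes the unmodified 32-bit TCP sequence space
          with the link running continuously at full rate.

```python
# Illustrative arithmetic only; these names are not part of any specification.

def pipe_bytes(link_bps: float, rtt_seconds: float) -> float:
    """Bytes in flight needed to keep the link full (bandwidth x delay)."""
    return link_bps * rtt_seconds / 8


def seq_wrap_seconds(link_bps: float, seq_bits: int = 32) -> float:
    """Seconds for a link at full rate to consume the whole sequence space."""
    return (2 ** seq_bits) / (link_bps / 8)


if __name__ == "__main__":
    for gbps in (1, 10):
        bps = gbps * 1e9
        print(f"{gbps} Gbps x 200 msec: "
              f"pipe = {pipe_bytes(bps, 0.200) / 1e6:.0f} MB, "
              f"sequence wrap = {seq_wrap_seconds(bps):.1f} s")
```

          Under these assumptions a 10 Gbps stream consumes the 32-bit
          sequence space in about 3.4 seconds, or roughly 17 round
          trips of 200 msec.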

     [R] Recovery of data stream processing immediately after TCP seg-
     ment drop.

          [D] In a conventional TCP implementation, loss of a TCP seg-
          ment means that stream processing must stop until that segment
          is recovered, which takes a network round trip to accomplish.
          Following the example above, we would be obliged to catch 250
          MB of data into an anonymous buffer before we could resume
          stream processing; later, this data would need to be moved to
          its proper location.  We seek some means of putting data
          directly where it belongs, and avoiding extra data movement in
          the case of segment drop.

          [D] Two possibilities are known at this time: (1) A Remote DMA
          feature added to TCP headers (in the options field) would
          allow the data portion of subsequent TCP segments to be placed
          directly, even though the iSCSI protocol headers have not been
          parsed; (2) A means of recovering iSCSI framing in the TCP
          stream would allow iSCSI protocol processing to continue, and
          the data to be put in its proper location.
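
          The idea of placing data directly, regardless of arrival
          order, can be sketched as below.  We assume, purely for
          illustration, that each arriving segment can be associated
          with a destination offset (whether carried in a TCP option or
          recovered from iSCSI framing); the function and its arguments
          are hypothetical.

```python
# Sketch of direct data placement; the (offset, payload) pairing is an
# assumption made for illustration and is not defined by this draft.

def place_segments(buffer_len, segments):
    """Copy each payload straight to its final offset, in arrival order.

    Returns the destination buffer and whether every byte has arrived.
    """
    dest = bytearray(buffer_len)
    filled = bytearray(buffer_len)  # 1 marks a byte that has been received
    for offset, payload in segments:
        end = offset + len(payload)
        if offset < 0 or end > buffer_len:
            raise ValueError("segment falls outside the destination buffer")
        dest[offset:end] = payload
        filled[offset:end] = b"\x01" * len(payload)
    return bytes(dest), all(filled)
```

          Segments that arrive after a dropped segment are stored in
          their final location at once; only the missing range remains
          to be filled when the retransmission arrives.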

     [R?] Framing.  Some method of framing iSCSI protocol units within
     the TCP stream may be required.

          [D] We are unresolved as to whether this is a requirement.
          The more basic requirement, described above, is to be able to
          recover the processing of the data stream immediately after a
          segment drop.  Framing is one way to recover processing.

          [D] The conventional way to locate higher-level protocol
          headers in the TCP stream is simply by parsing from the begin-
          ning of the stream, and never making a mistake.  Is this suf-
          ficient?  Or, should we use some other means such as byte
          stuffing or use of the push bit?  Related, how do we ensure
          that data actually is transmitted, and doesn't languish in a
          TCP buffer somewhere?

          [D] As an example of the problem: suppose a TCP segment is
          lost due to congestion, and it happens to contain an iSCSI
          header.  At that point, stream synchronization will be lost,
          as we cannot find the next iSCSI header.  Following the exam-
          ple above, we're obliged to catch 250 MB of data before we can
          resume iSCSI operation.  If we could find the next iSCSI
          header, we could implement an optimization (non-traditional
          for TCP implementations) that would require us only to catch a
          single iSCSI message's-worth of data.  Subsequent iSCSI mes-
          sages could be decoded, and the data put where it belongs
          (even though command ordering constraints would preclude act-
          ing upon the data until the missing SCSI command is received
          and inspected for ordering constraints).

          [D] Several methods have been discussed for providing framing
          by TCP: (1) A flag could be added in the TCP options that
          indicates that this segment begins a next-level Protocol Data
          Unit (PDU); (1a) Method 1 could be combined with a remote DMA
          mechanism for TCP; (2) The TCP transmitter function could be
          modified so that it emits a TCP segment for every next-level
          PDU, effectively turning TCP into a reliable, sequenced,
          datagram protocol.  Protocols such as iSCSI would then need to
          limit their PDUs to less than the maximum TCP segment size
          (which is dictated by link considerations), if IP fragmenta-
          tion is to be avoided.

          [D] Other methods could work above TCP.  (1) Byte stuffing is
          an old technique for framing within byte streams; its main
          disadvantage is that every byte must be processed by the fram-
          ing mechanism, which would make software implementation
          impractical; (2) A special marker header could be placed
          periodically in the TCP stream.  These headers would be found
          by doing arithmetic on TCP sequence numbers.  They contain
          information about the exact location of iSCSI PDUs.
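
          The sequence-number arithmetic for method (2) might look like
          the following sketch.  It assumes markers are placed at a
          fixed interval measured from the first data byte of the
          connection, and that the interval is a power of two so that
          wrap of the 32-bit sequence space does not disturb the
          calculation.  The interval and all names are hypothetical.

```python
# Sketch: locate periodic markers inside a TCP segment from sequence numbers.
# The marker interval and the function name are assumptions for illustration.

SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide


def marker_offsets(seg_seq, seg_len, first_data_seq, interval):
    """Return byte offsets within the segment at which a marker begins."""
    # Position of the segment's first byte in the stream, modulo 2**32.
    stream_pos = (seg_seq - first_data_seq) % SEQ_MOD
    # Distance from that byte to the next multiple of the interval.
    first = (-stream_pos) % interval
    return list(range(first, seg_len, interval))
```

          A receiver that has lost an earlier segment can apply this to
          any later segment, without having parsed the stream from the
          beginning.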

     [R?] Error detection.  Stronger CRC.

          [D] The TCP checksum is rather weak as error detection goes.
          It is supported by the link layer check codes (CRC-32 for Eth-
          ernet).  Is that sufficient?  We don't have strong protection
          from re-assembly errors.  Routers modify the frame and recom-
          pute the CRC.  Even switches recompute CRCs when adding VLAN
          tags, although good implementations do the CRC recomputation
          incrementally.  The TCP checksum is our only end-to-end pro-
          tection.  If the TCP checksum is not sufficient, do we intro-
          duce some kind of check on the SCSI data buffers by the iSCSI
          layer?  Possibilities: byte count, CRC.  Whatever we do, it
          must be possible to compute these check codes on the fly, as
          data is transferred from NIC to memory, without making a
          second pass over the data once it is in memory.
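
          Computing a check code on the fly can be sketched as follows.
          CRC-32 as provided by Python's zlib module is used here only
          as a stand-in, since no particular check code has been
          chosen; the function name and the chunked input are
          assumptions for illustration.

```python
# Sketch: update a CRC as each chunk is copied to memory, so that no second
# pass over the buffer is needed. CRC-32 is a stand-in, not a chosen code.
import zlib


def receive_with_crc(chunks, dest):
    """Copy chunks into dest (a preallocated bytearray) and return the CRC."""
    crc = 0
    pos = 0
    for chunk in chunks:
        dest[pos:pos + len(chunk)] = chunk  # data movement
        crc = zlib.crc32(chunk, crc)        # running CRC over the same bytes
        pos += len(chunk)
    return crc
```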

          [D] We are considering using the IPsec message digest func-
          tion for this purpose.  It's already defined, and it could be
          used as a check code (only) using well-known keys; hence,
          without introducing the key distribution problem.  Using IPsec
          in conjunction with TCP would not require a modification to
          TCP.  A concern about using the IPsec message digest function
          is that it may be more difficult to compute at high speed than
          a simpler CRC.
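
          A keyed digest used purely as a check code might be sketched
          as below.  We assume HMAC-SHA-1 truncated to 96 bits, one of
          the authentication transforms defined for IPsec; the all-zero
          key is a hypothetical stand-in for a published well-known
          key, and the function names are ours.

```python
# Sketch: a keyed message digest used as a check code with a well-known key.
# The key value below is hypothetical; no such key has been defined.
import hashlib
import hmac

WELL_KNOWN_KEY = b"\x00" * 20  # placeholder only


def check_code(data: bytes) -> bytes:
    """HMAC-SHA-1 over the data, truncated to 96 bits (12 bytes)."""
    return hmac.new(WELL_KNOWN_KEY, data, hashlib.sha1).digest()[:12]


def verify(data: bytes, code: bytes) -> bool:
    """True if the received data matches the received check code."""
    return hmac.compare_digest(check_code(data), code)
```

          Because the key is public, this detects accidental corruption
          only; it gives no protection against deliberate modification.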

          [D] But is TCP truly an end-to-end protocol?  The notion of an
          end-to-end error check is that it and the data it protects
          pass through the network unchanged, but possibly subject to
          errors while on a link or in a memory.  At the receiving end
          node, checking the CRC verifies the correct receipt of data.
          In some cases, such as the use of a SOCKS proxy server or
          perhaps a NAT, the connection is not end-to-end, but is the
          concatenation of two end-to-end connections.  In these cases,
          the iSCSI PDU (message) may be a better candidate for CRC pro-
          tection.

          [D] When considering a CRC at the iSCSI layer, we will give
          consideration to separate CRCs for iSCSI headers and data, and
          to the need to intersperse CRCs within long data messages.

     [R] Selective TCP retransmission.

          [D] Given the long delays in the WAN, using TCP selective
          retransmission must be supported by iSCSI, in order to minim-
          ize the bandwidth impact of retransmission.

     [R] Firewall friendly.  The protocol's use of IP addressing and TCP
     port numbers should be firewall friendly.

          [D] This probably means that all connection requests should be
          addressed to a specific, well-known TCP port.  That way,
          firewalls can filter based on source and destination IP
          addresses, and destination (target) port number.  The source
          (initiator) port number also should be well-known for the ini-
          tial TCP connection.  Additional TCP connections would require
          different source port numbers (for uniqueness), but could be
          opened after a security dialogue on the control channel.
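
          The kind of filtering this would permit can be sketched as
          follows.  The port number is a hypothetical placeholder (no
          port has been assigned), the addresses are documentation
          prefixes, and the rule structure is ours rather than that of
          any particular firewall.

```python
# Sketch: filter connection requests on source/destination address and
# destination port. ISCSI_PORT is a made-up placeholder value.
import ipaddress
from dataclasses import dataclass

ISCSI_PORT = 5000  # hypothetical; this draft does not assign a port


@dataclass(frozen=True)
class Rule:
    src_net: str   # network permitted to initiate connections
    dst_net: str   # network containing the storage controllers
    dst_port: int  # well-known target port


def permits(rule, src_ip, dst_ip, dst_port):
    """True if the connection request matches the rule."""
    return (ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.src_net)
            and ipaddress.ip_address(dst_ip) in ipaddress.ip_network(rule.dst_net)
            and dst_port == rule.dst_port)
```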

     [R] Possible to move data directly from end-to-end, without having
     retransmission buffers in the middle.

          [D] This is an important implementation detail.  In an iSCSI
          system, each of the end nodes (for example host computer and
          storage controller) has ample memory; but the intervening
          nodes (NIC, switches) do not.  We contemplate a WAN-scale
          retransmission requirement of 25 MB (1 Gbps) or 250 MB (10
          Gbps, see earlier footnote).  Therefore, it must not be neces-
          sary for intervening nodes to buffer data.

     [R] Conservative in use of TCP and session-layer connections.  The
     number required should not scale directly with the number of sup-
     ported LUs.

          [D] TCP connection and iSCSI session state are fairly expen-
          sive in terms of memory consumed both on- and off-chip (we
          contemplate VLSI implementation).  At a minimum, we seek to
          support only the number of connections required to achieve
          required bandwidth and delay characteristics between hosts and
          storage controllers.

     [R] Compatible with both IPv4 and IPv6.

          [D] We need to add a literal format for IPv6 addresses in tar-
          get domain names.


3.6 Naming

     [R] Naming.  Whenever possible, iSCSI shall support the naming
     architecture of SAM-2.  Deviations and uncertainties will be made
     explicit, and comment/resolution invited.

          [D] It may be necessary to provide a unique naming scheme for
          SCSI LUs.  Fibre Channel does so using WWNs.  There's some
          indication that the T10 Security work will complicate this
          problem through LUN renumbering.  The manner of determining a
          unique, worldwide, unchanging LU name must be determined.  We
          will attempt to make use of SPC-2 provisions for LU Identif-
          iers (Vital product data page 83h [SPC-2, p. 203] ).

          [D] We need to resolve whether the notion of "target" is
          relevant to iSCSI.  Does an iSCSI session connect to a target?
          Can it subsequently address multiple targets and LUs or just a
          bunch of LUs?

          [D] We need to provide an understanding of just what a Service
          Delivery Port (SDP) is in the iSCSI context.  Is it an IP end-
          point?  A session endpoint?  A virtual device (target) that a
          session can be connected to?  SAM-2 seems to equate an SDP
          with a target address, "...the application clients in each
          initiator have the ability to discover that logical units in
          the SMU target are accessible via multiple Target Identifiers
          (service delivery ports)..." [SAM-2, pp. 12-13]

     [R] URLs.  It shall be possible to name SCSI devices and possibly
     LUs using a URL syntax.  These names shall be global (uniform) and
     suitable for passing as handles between SCSI application clients.

     [R] Domain names.  The Domain Name Service (DNS) shall be used to
     resolve the <hostname> portion of the URL to one or more IP
     addresses.  When a hostname resolves to multiple addresses, these
     addresses shall be equivalent for functional (possibly not perfor-
     mance) purposes.

          [D] This means that the addresses can be used interchangeably
          as long as we don't care about performance.  For example, the
          same set of SCSI targets and/or LUs (tbd) must be accessible
          from each of these addresses.
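
          Resolution of a hostname to several interchangeable addresses
          can be sketched with the standard resolver interface.  The
          function name is ours, and the port argument is whatever port
          iSCSI eventually uses (none is assigned here).

```python
# Sketch: resolve a hostname to all of its addresses (IPv4 and IPv6).
import socket


def resolve_all(hostname, port):
    """Return the distinct IP addresses for hostname, in resolver order."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        address = sockaddr[0]
        if address not in addresses:
            addresses.append(address)
    return addresses
```

          An initiator could then try the returned addresses in turn,
          relying on the requirement that each gives access to the same
          set of targets and/or LUs.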

     [R] Deal with the complications of the new SCSI security architec-
     ture [99-245r8].

          [D] Pay attention to the proxy naming architecture defined by
          the new security model.  In this new model, SCSI Logical Unit



Randy Haagens                                                  [Page 17]

Internet-Draft              TCP RDMA option                 July 7, 2000


          Numbers (LUNs) can be mapped in a manner that gives each host
          (more correctly, each AccessID) a unique LU map.  Thus, a
          given LU within a target may be addressed by different LUNs.
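
          The consequence for naming can be illustrated with a small
          sketch.  The AccessID strings and LU names below are made up;
          the point is only that the same LU appears under different
          LUNs depending on who is asking.

```python
# Sketch: per-AccessID LU maps. All identifiers here are hypothetical.

LU_MAPS = {
    "access-id-1": {0: "LU-A", 1: "LU-B"},
    "access-id-2": {0: "LU-B"},
}


def lu_for(access_id, lun):
    """Return the LU that this AccessID reaches at the given LUN, if any."""
    return LU_MAPS.get(access_id, {}).get(lun)


def luns_addressing(lu_name):
    """Return {access_id: lun} for every map that exposes lu_name."""
    return {access_id: lun
            for access_id, lu_map in LU_MAPS.items()
            for lun, lu in lu_map.items()
            if lu == lu_name}
```

          Here LU-B is LUN 1 for one AccessID and LUN 0 for the other,
          so a LUN alone cannot serve as a handle passed between hosts.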

     [R] Support SCSI 3rd-party operations.

          [D] The key issue here relates to the naming architecture for
          SCSI LUs.  We need to determine a method of passing a name or
          handle between parties.


3.7 Security

     [R] Authentication.  At a minimum, iSCSI parties shall participate
     in a simple principals authentication protocol.  This protocol
     shall involve a minimum of encryption and no special hardware for
     implementation.

     [R] Bootstrapping.  It shall be possible to negotiate higher levels
     of security than the minimum, technique to be defined.

     [R] Data encryption.  Data encryption shall be optional, but when
     implemented, shall be done in a manner prescribed by iSCSI, by
     reference to other standards.

     [R] Compatible with IP protocol suite security protocols for the
     present and future.

          [D] We anticipate incorporating IPsec (host-to-host) and
          SSL/TLS (TCP connection) security into the iSCSI protocol by
          reference, and as options.  Adherence to good layering will
          ensure (as much as possible) that future security developments
          at the IP and TCP layers can be utilized by iSCSI.

     [R] Permits use of firewall for security screening.

          [D] It's important to allow a firewall to be used to offload
          authentication from the end node.  This is a possible means of
          defending against Denial of Service (DoS) assaults, from a
          less-trusted area of the network.  We assume that the
          firewall(s) have much greater processing power for dismissing
          bogus connection requests than do the end nodes.


3.8 Topology Discovery

          [D] OK, we said we'd leave this for later.  But why not open
          the discussion?

     [R] iSCSI shall have no impact on the use of conventional IP net-
     work discovery techniques.

          [D] IP discovery techniques are well-evolved.  Various network
          management platforms have ways of discovering IP addresses,
          such as mining router caches.  We assume that these techniques
          will be used, and will find all of the IP end points that con-
          tain iSCSI nodes.

     [R] iSCSI shall provide some means of determining that a discovered
     IP end point in fact is an iSCSI node.

          [D] This requirement is just a placeholder.  Generally in IP
          discovery, there is some way of determining the type of the
          discovered device.  Possibly this is due to the presence of
          the SNMP protocol and specific MIB variables.  In this case,
          SNMP is the bootstrap protocol.  Alternatively, one could
          probe various TCP port numbers to determine if there exists a
          higher-level protocol at each port (the port number would tell
          you which protocol).  To be determined.  But in any case, some
          means is needed to determine that an iSCSI entity is present
          at an IP end point.
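
          The port-probing alternative can be sketched as follows.  The
          function name is ours.  Note that a successful connection
          shows only that something is listening at the port; it does
          not by itself prove that the listener speaks iSCSI.

```python
# Sketch: probe a TCP port to see whether anything accepts connections there.
import socket


def port_is_open(host, port, timeout=1.0):
    """True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```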

     [R] When a device supports multiple IP end points, some means of
     determining the IP connection topology is needed.

          [D] A device may support multiple end points, yet it may not
          be reasonable to bind any combination of the end points
          together into an iSCSI session.  For example, a port con-
          troller (aka channel group) card may have four ports that can
          be bound together.  The storage controller may support four of
          these port controllers, yet not allow the binding together
          into a session of TCP connections made on different port con-
          trollers.

          [D] A really simple solution to this problem would be to
          define a means of describing port topology, and provide for
          reading that description either from a MIB or directly from
          the iSCSI layer (with a command).
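
          One possible shape for such a description is sketched below,
          using the port controller example above.  The controller
          names, the addresses, and the rule that a session may span
          only one port controller are assumptions for illustration.

```python
# Sketch: a port topology description and a check of whether a set of end
# points may be bound into one session. All names and addresses are made up.

PORT_TOPOLOGY = {
    "port-controller-0": {"192.0.2.10", "192.0.2.11",
                          "192.0.2.12", "192.0.2.13"},
    "port-controller-1": {"192.0.2.20", "192.0.2.21",
                          "192.0.2.22", "192.0.2.23"},
}


def controller_of(endpoint):
    """Return the port controller that owns the end point, or None."""
    for controller, endpoints in PORT_TOPOLOGY.items():
        if endpoint in endpoints:
            return controller
    return None


def can_bind(endpoints):
    """True if all end points are known and sit on one port controller."""
    controllers = {controller_of(endpoint) for endpoint in endpoints}
    return len(controllers) == 1 and None not in controllers
```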

     [R] SCSI protocol-dependent techniques shall be used for further
     discovery beyond the iSCSI layer.

          [D] Discovery is a complex process.  But SCSI provides
          specific hooks for doing the work, and all we need to do is
          transport the commands associated with this process.  Gen-
          erally the SCSI discovery process involves using the Report
          LUNs command to determine which LUs are addressable at a given



Randy Haagens                                                  [Page 19]

Internet-Draft              TCP RDMA option                 July 7, 2000


          service delivery port.  Subsequently, the true identity of
          each LU (i.e., name) is discovered by reading Vital product data
          page 83h.  By comparing LU IDs, the discovery process can find
          that a given LU is accessible through multiple paths.
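
          The comparison step can be sketched as below.  The input is
          assumed to be a list of (service delivery port, LUN, LU ID)
          tuples already gathered with Report LUNs and page 83h; the
          tuple layout and the function name are ours.

```python
# Sketch: find LUs reachable through more than one path by comparing LU IDs.
from collections import defaultdict


def multipath_lus(reports):
    """Map each LU ID seen on several paths to its list of (port, lun)."""
    paths = defaultdict(list)
    for port, lun, lu_id in reports:
        paths[lu_id].append((port, lun))
    return {lu_id: found for lu_id, found in paths.items() if len(found) > 1}
```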

          [D] We need only verify that this SCSI mechanism is suffi-
          cient.  Hopefully, we will not need to augment SCSI at the
          iSCSI layer.


3.9 Management

     [R] IP-based management protocols.  It shall be possible (but not
     required) to use IP-based management protocols such as SNMP and RMI
     in conjunction with iSCSI.  However, the present effort will not
     define the management architecture for iSCSI networks.

     [R] SCSI management protocols.  It shall be possible to use SCSI
     commands for management (e.g., SCSI Enclosure Services, SES commands)
     to manage iSCSI devices.


3.10 Interoperability

     [R] It must be possible for hosts and devices that implement only
     those features specified in the RFC to interoperate.

     [R] Software implementation is possible using conventional TCP/IP
     protocol stack.

          [D] Although some low-performance products may contemplate an
          all-software implementation, we expect the majority of iSCSI
          products to employ hardware protocol acceleration.  This
          requirement really is here to solve two problems: (1) Proof of
          interoperability, by compatibility with extant TCP implementa-
          tions; (2) Prototyping, where the iSCSI protocol is first
          implemented in software using these conventional stacks.
          These prototypes will likely become the early reference imple-
          mentations.


4 References

     [SAM-2] ANSI NCITS.  Weber, Ralph O., editor.  SCSI Architecture
     Model -2 (SAM-2).  T10 Project 1157-D.  rev 13, 22 Mar 2000.

     [SPC-2] ANSI NCITS.  Weber, Ralph O., editor.  SCSI Primary Com-
     mands - 2 (SPC-2).  T10 Project 1236-D.  rev 18, 21 May 2000.



     [CAM-3] ANSI NCITS.  Dallas, William D., editor.  Information Tech-
     nology - Common Access Method - 3 (CAM-3).  X3T10 Project 990D.
     rev 3, 16 Mar 1998.

     [99-245r8] Hafner, Jim.  A Detailed Proposal for Access Controls.
     T10/99-245 revision 8, 26 Apr 2000.

     [SPI-X] ANSI NCITS.  SCSI Parallel Interface - X.

     [FCP] ANSI NCITS.  SCSI-3 Fibre Channel Protocol [ANSI X3.269:1996]

     [FCP-2] ANSI NCITS.  SCSI-3 Fibre Channel Protocol - 2 [T10/1144-D]


5 Author



     Randy Haagens
     Roseville, R5U-P5/R5
     Hewlett-Packard Company
     8000 Foothills Blvd. MS 5668
     Roseville, CA 95747-5668
     USA

     Phone: +1 916 785 4578
     Email: randy_haagens@hp.com


Expires January 2001