D. Otis
Internet Draft                                                  SANlight
Document: draft-otis-scsi-ip-00.txt                              10/2/00
Category: Informational                                  Expires  4/2/01
 
                              SCSI over IP 
 
 
Status of this Memo 
 
   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026 [1].
    
   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups.  Note that 
   other groups may also distribute working documents as Internet-
   Drafts.  Internet-Drafts are draft documents valid for a maximum of 
   six months and may be updated, replaced, or obsoleted by other 
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in 
   progress." 
     
   The list of current Internet-Drafts can be accessed at 
   http://www.ietf.org/ietf/1id-abstracts.txt  
   The list of Internet-Draft Shadow Directories can be accessed at 
   http://www.ietf.org/shadow.html. 
    
1. Abstract 
    
   This document gives an overview of the SCSI over IP considerations
   that led to the FC-SCTP-IP draft and suggests possible
   implementations of that draft.  Two basic architectures are covered,
   with the intent of illustrating the decisions that lead to both
   simple FC encapsulation and native IP access.
    
2. Basic Architectures 
    
   The optical delay between facilities is 8 microseconds per mile, or
   5 microseconds per kilometer.  Limiting a Metro Area Network (MAN)
   to a distance of 125 fiber miles results in a point-to-point delay
   of 1 millisecond.  A Wide Area Network (WAN) of 15K miles results in
   a point-to-point delay of 120 milliseconds.  These two distances are
   used as examples throughout.
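
   The delay figures above follow directly from the per-mile constant.
   A minimal sketch of that arithmetic, where the function and constant
   names are illustrative and not part of any draft:

```python
# Propagation constant taken from the text: 8 microseconds per fiber mile.
US_PER_MILE = 8.0

MAN_MILES = 125       # example MAN distance from the text
WAN_MILES = 15_000    # example WAN distance from the text


def one_way_delay_ms(miles: float) -> float:
    """Point-to-point fiber delay in milliseconds for a distance in miles."""
    return miles * US_PER_MILE / 1000.0


if __name__ == "__main__":
    for name, miles in (("MAN", MAN_MILES), ("WAN", WAN_MILES)):
        print(f"{name}: {miles} miles -> {one_way_delay_ms(miles):.0f} ms")
```

   A round trip doubles these values, which is what bounds the number
   of responses a single thread can receive per second.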
    
   Direct access to the storage device provides the best performance
   for a one-to-one association between a client access point and the
   target device; otherwise, the benefit of caching is reduced by
   network delay.  Communication buffering can also add significant
   delay and needs to be considered part of the architecture.
    
   Network delays necessitate placing the storage cache adjacent to the
   client; otherwise, its performance benefit is significantly reduced.
   Caching placed adjacent to storage may be justified if it is shared
   by multiple remote clients as a means to reduce load on the storage
   device.  Such shared storage becomes problematic unless restricted
   to read-only access; otherwise, locking and releasing the associated
   records before and after each access operation adds delay and
   complexity.
    
   In reviewing the basic architectures, a cache adjacent to storage is
   considered only for read-only volumes when used on a MAN or WAN,
   because such a cache serves multiple clients.  Otherwise, an
   abstraction of the basic block device becomes appropriate, such as a
   file or database server.  These abstractions reduce the
   locking/releasing of disparate blocks to a simpler opening/closing
   of objects.  This brings the architectures down to the following:
     - One-to-One Client Side Cache (CSC)
     - Many-to-One Server Side Cache (SSC) (read-only unless on a LAN)
    
3. Why not just use abstraction servers? 
    
   To ensure the reliability of abstraction servers, the device-level
   interface, the Storage Area Network (SAN), must be accessible to
   more than one such server.  The amount of RAM required by an
   abstraction server typically runs above 1% of volume space to
   prevent thrashing on references to extents, namespace, and
   permissions, for example.  RAM is about 100 times the expense of
   hard storage, so RAM sized at 1% of the volume costs about as much
   as the volume itself, and abstracting the data doubles the cost.  In
   addition, abstraction servers are sized for the number of clients
   and the volume space, which makes scaling difficult.  By allowing
   clients remote access to the SAN in an exclusive fashion using their
   own abstraction servers, the best performance is achieved and
   servers are properly scaled.
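
   The cost argument above is a one-line calculation.  A minimal
   sketch, using the 1% RAM fraction and the 100:1 price ratio from the
   text as defaults; the function name is illustrative only:

```python
def relative_cost(ram_fraction: float = 0.01,
                  ram_price_ratio: float = 100.0) -> float:
    """Cost of storage plus server RAM, relative to storage alone.

    ram_fraction:    RAM size as a fraction of volume space (text: 1%).
    ram_price_ratio: price of RAM per byte relative to hard storage
                     (text: about 100).
    """
    return 1.0 + ram_fraction * ram_price_ratio


if __name__ == "__main__":
    print(f"relative cost with abstraction server: {relative_cost():.1f}x")
```

   Any RAM fraction above 1% pushes the total beyond twice the cost of
   the raw storage.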
    
   Direct SAN access would be through a translation of Fibre Channel
   (FC) at a switch that aggregates FC nodes into a single SCTP IP
   connection.  Simple encapsulation ensures that no state information
   is held at any translation node.  Ethernet provides a great deal of
   flexibility with respect to the possible data rates and paths of the
   IP connection: a single 1G-bit port, for example, or a single
   connection spread across four 1G-bit ports with load-balanced
   redundant paths.  10G-bit ports will be available shortly.  A review
   is required to determine whether the FC structures are suitable.
    
4. Do FC or FCP structures need to change?
    
   A 10-kilometer fiber places only 3 frames of data in flight at
   1G-bit.  At the MAN distance, this number increases to 60 frames
   and, at the WAN distance, it rises to 7K frames.  These numbers rise
   to 600 and 70K frames should the connection run at 10G-bit.  The FC
   limit of 512 concurrent exchanges (256 in each direction) appears to
   be a problem, as only 1K sequences per second could be supported at
   the WAN distance.
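
   The frames-in-flight figures can be approximated as the one-way
   delay multiplied by the line rate, divided by the frame size.  The
   sketch below assumes a 2112-byte frame payload and ignores framing
   and encoding overhead; both are assumptions made here, as the text
   does not state the frame size it used:

```python
FRAME_BYTES = 2112   # assumption: full-size FC frame payload, no overhead

# One-way delays for the example distances in the text, in seconds.
DELAY_10KM_S = 50e-6   # 10 km at 5 microseconds per kilometer
DELAY_MAN_S = 1e-3     # 125 fiber miles
DELAY_WAN_S = 120e-3   # 15K fiber miles


def frames_in_flight(one_way_delay_s: float, link_bps: float,
                     frame_bytes: int = FRAME_BYTES) -> float:
    """Frames occupying the fiber in one direction at a given line rate."""
    bytes_in_flight = one_way_delay_s * link_bps / 8.0
    return bytes_in_flight / frame_bytes


if __name__ == "__main__":
    for name, delay in (("10 km", DELAY_10KM_S),
                        ("MAN", DELAY_MAN_S),
                        ("WAN", DELAY_WAN_S)):
        for rate in (1e9, 10e9):
            n = frames_in_flight(delay, rate)
            print(f"{name:>5} at {rate / 1e9:.0f}G-bit: {n:,.0f} frames")
```

   Under these assumptions the results land close to the rounded
   figures quoted above.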
    
   With such a WAN, however, each thread could receive only 4 responses
   per second, which implies that 256 concurrent threads would be
   required to reach this concurrent sequence limit.  Over such large
   distances, only applications demanding large transfers of 1.2M-bytes
   would achieve the maximum bandwidth of 10G-bit from a single
   initiator.  However, the reason for such a SSC being used would be 
   to support multiple clients, and each additional client reduces the 
   requisite transfer for maximum connection utilization.  Should the 
   average transfer be 8K-bytes, then 160 ATM clients would be required 
   if all are at the example WAN distance running 10G-bit from the 
   server. 
    
   For the MAN distance at 10G-bit, FC would be limited to 128K
   sequences per second, with 500 responses per second per thread,
   which again implies that 256 concurrent threads would be required to
   reach the limit.  Should the average transfer be 8K-bytes, then only
   two clients limited by 40% would consume the 10G-bit connection.
   Once at a LAN distance, the concurrent sequence limit would never
   become significant, even at 10G-bit, as there would be only 30
   frames of data in flight at the 10-kilometer distance.
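
   The sequence-rate figures for the MAN and WAN cases can be
   reproduced with a short calculation.  This sketch assumes each
   outstanding sequence completes in exactly one round trip, with no
   drive or processing time; the names are illustrative only:

```python
EXCHANGES_PER_DIRECTION = 256   # from the text: 512 total, 256 each way


def responses_per_second(one_way_delay_s: float) -> float:
    """Responses a single thread can receive, one round trip per response."""
    return 1.0 / (2.0 * one_way_delay_s)


def max_sequences_per_second(one_way_delay_s: float) -> float:
    """Sequence rate when every concurrent exchange slot is kept busy."""
    return EXCHANGES_PER_DIRECTION * responses_per_second(one_way_delay_s)


def transfer_bytes_to_fill(link_bps: float, one_way_delay_s: float) -> float:
    """Average transfer size at which the sequence limit fills the link."""
    return (link_bps / 8.0) / max_sequences_per_second(one_way_delay_s)


if __name__ == "__main__":
    for name, delay in (("MAN", 1e-3), ("WAN", 120e-3)):
        rate = max_sequences_per_second(delay)
        size = transfer_bytes_to_fill(10e9, delay)
        print(f"{name}: {responses_per_second(delay):.1f} responses/s "
              f"per thread, {rate:,.0f} sequences/s, "
              f"{size:,.0f} bytes per transfer to fill 10G-bit")
```

   At the WAN distance this gives roughly 1K sequences per second and a
   transfer size near 1.2M-bytes, in line with the rounded figures in
   the text.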
    
   As a practical consideration for an SSC, should the concurrent
   sequence constraint become a limit for a particular remote client,
   its parallel processes could either be identified as different
   initiators related to a sub-processor or simply be isolated on
   different streams, should the cache support native SCTP access.
   Regardless of the concurrent sequence constraint of FC, a storage
   device is mechanically limited and can provide only about 200
   accesses per second.  As such, when the drive is accessed directly,
   the number of sequences is still bound by the drive, regardless of
   the data rate or distance.  This mechanical limit will not be
   supplanted for dozens of generations, even with WAN use.
    
   With most commerce, the cache hit rate is relatively low, so it is
   doubtful that the FC protocol would limit a single initiator on a
   single stream, even without sequence expansion by means of multiple
   streams.  The prominent use of a SCSI-IP protocol, however, will be
   to communicate directly with the storage device, so keeping state
   information within the domain of the client and device server, and
   not at an intermediary node, is an important consideration for
   reliability.  Re-engineering the FC header structures runs a
   significant risk of introducing stateful translations.  In the end,
   it will be hard to justify modifying these structures should
   reliability suffer as a result.
    
5. Flow Control at the Sub-Node      
    
   It is important to prevent internal nodes within an SSC
   configuration from becoming congested or from consuming excessive
   resources.  The communication buffer is a separate resource from the
   FIFO delivering to the sub-nodes that obtain access to the various
   RAID or JBOD devices.  To provide a deterministic response from the
   sub-nodes, the depth of the FIFO must be controlled.  As little as
   8M-byte of FIFO would introduce 100 milliseconds of delay at typical
   FC rates.  As such, flow control must be able to resolve down to
   these sub-nodes.  By attaching each FIFO to an SCTP stream, this
   control then resolves to the FIFO.  The stream can also have
   priorities assigned to allow urgent
   commands a means of bypassing lower priority commands or of being 
   processed out of order. 
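
   The relationship between FIFO depth and queuing delay can be
   sketched as follows.  The 100M-byte per second drain rate is an
   assumption made here for a 1G-bit FC link; the text says only
   "typical FC rates", and the names are illustrative:

```python
FC_BYTES_PER_S = 100_000_000   # assumption: about 100M-byte/s for 1G-bit FC


def fifo_delay_ms(fifo_bytes: int,
                  drain_bytes_per_s: float = FC_BYTES_PER_S) -> float:
    """Worst-case wait behind a full FIFO draining at a constant rate."""
    return fifo_bytes / drain_bytes_per_s * 1000.0


def max_fifo_bytes(delay_budget_ms: float,
                   drain_bytes_per_s: float = FC_BYTES_PER_S) -> float:
    """Largest FIFO depth that keeps queuing delay within a budget."""
    return delay_budget_ms / 1000.0 * drain_bytes_per_s


if __name__ == "__main__":
    print(f"8M-byte FIFO: {fifo_delay_ms(8_000_000):.0f} ms")
    print(f"10 ms budget: {max_fifo_bytes(10.0):,.0f} bytes")
```

   With this assumed rate, an 8M-byte FIFO adds about 80 milliseconds,
   the same order as the 100 milliseconds cited above.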
    
   The flow control works in a similar fashion to that of FC Class-3,
   and only Class-3 is indicated as being supported for FC media.
   Rather than sending single "R_RDY" credit tokens, a token count
   relays the buffer credits being returned to the stream.  Flow
   control resolves to the stream and not to the initiator, unless each
   initiator is mapped to its own stream; this is left to the
   implementer.  The typical case would be to have fewer than 20 drives
   connected to an FC node that feeds the SCTP stream.  As there are
   65,535 streams available, 1.3M drives could be accommodated by the
   SCTP protocol with a single connection.  It is doubtful, however,
   that more than a few dozen FC nodes would be found on a single
   device.
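
   The per-stream credit scheme described above can be sketched from
   the sender's side as follows.  This is a hypothetical illustration
   of counted credit return, not the FC-SCTP-IP wire format; the class,
   method, and constant names are all assumptions:

```python
from dataclasses import dataclass

MAX_STREAMS = 65_535       # stream count quoted in the text
DRIVES_PER_STREAM = 20     # upper end of the "typical case" in the text


@dataclass
class StreamCredit:
    """Sender-side view of the buffer credits granted for one stream."""
    credits: int

    def try_send(self) -> bool:
        """Consume one credit per frame; refuse to send when none remain."""
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def credits_returned(self, count: int) -> None:
        """Apply a token count from the receiver, several credits at once."""
        if count < 0:
            raise ValueError("credit count must be non-negative")
        self.credits += count


def addressable_drives(streams: int = MAX_STREAMS,
                       per_stream: int = DRIVES_PER_STREAM) -> int:
    """Drives reachable over one connection at a given fan-out per stream."""
    return streams * per_stream


if __name__ == "__main__":
    stream = StreamCredit(credits=2)
    print([stream.try_send() for _ in range(3)])   # third send is refused
    stream.credits_returned(3)                     # one message, 3 credits
    print(stream.credits, addressable_drives())
```

   Because each stream keeps its own credit count, a congested FIFO
   stalls only the stream attached to it and not the whole connection.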
    
6. Why SCTP 
 
   The essential advantages of using SCTP rather than TCP are:
     - Headers contained within one frame. 
     - Objects aligned at 32-bit boundaries. 
     - Out of sequence frame processing. 
     - Standard authentication. 
     - Independent streams under common control. 
     - Session restart. 
     - Improved error detection. 
     - Prevention of blind spoofing and denial of service attacks. 
     - Standard Heartbeat and multi-homing.  (Optional) 
 
   With VI promoted for new file systems, it is clear that CPU and
   memory overhead are two areas receiving attention.  SCSI does not
   require VI to allow zero-copy data placement or out-of-sequence
   processing; however, reliance on the encapsulation structure and the
   ability to identify object boundaries within the stream are highly
   important.  To date, TCP does not allow for these features and, with
   its sizeable install base, it is doubtful it ever will.  SCTP
   provides the requisite features for implementing new versions of VI,
   RPC, and SCSI, allowing out-of-sequence delivery while meeting the
   congestion control requirement found with TCP.  Those expecting to
   change TCP as a means to solve the problem of locating objects
   within a persistent stream may be surprised by the rapidity with
   which SCTP fills that void and preempts these efforts.  The API for
   TCP simply does not provide the type of interface required, nor does
   it provide for the reliability.  When it comes to storage, little
   can be said to be more paramount than reliability.  Perhaps only
   performance would receive as much scrutiny, and SCTP wins on both
   counts for several reasons.
    
   Use of SCTP also allows a low overhead for simple encapsulation as
   well as providing the means for native IP access.  Rather than
   divergent standards, SCTP offers a unified means of communicating
   with a SAN.  TCP does not provide that ability.
  
      
7. Acknowledgments 
    
   Randall R. Stewart [randall@stewart.chicago.il.us] 
   For his timely inputs on SCTP making this far more enjoyable. 
    
8. Author's Addresses 
    
   Douglas Otis 
   SANlight Inc. 
   160 Saratoga #46 
   Santa Clara, CA 95051 
    
   Phone: (408) 260-1400 x2001 
   Email: dotis@sanlight.net 
    
    
   Full Copyright Statement 
    
   "Copyright (C) The Internet Society (date).  All Rights Reserved.  
   This document and translations of it may be copied and furnished to 
   others, and derivative works that comment on or otherwise explain it 
   or assist in its implementation may be prepared, copied, published 
   and distributed, in whole or in part, without restriction of any 
   kind, provided that the above copyright notice and this paragraph 
   are included on all such copies and derivative works.  However, this 
   document itself may not be modified in any way, such as by removing 
   the copyright notice or references to the Internet Society or other 
   Internet organizations, except as needed for the purpose of 
   developing Internet standards in which case the procedures for 
   copyrights defined in the Internet Standards process must be 
   followed, or as required to translate it into languages other than 
   English. 
    
   The limited permissions granted above are perpetual and will not be 
   revoked by the Internet Society or its successors or assigns. 
   This document and the information contained herein is provided on 
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET 
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR 
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.