Network File System Version 4                                 S. Faibish
Internet-Draft                                                  D. Black
Intended status: Informational                                  Dell EMC
Expires: January 6, 2021                                      C. Hellwig
                                                            July 6, 2020
                          

          Using the Parallel NFS (pNFS) SCSI/NVMe Layout 
              draft-faibish-nfsv4-scsi-nvme-layout-00

Abstract

   This document explains how to use the Parallel Network File System 
   (pNFS) SCSI Layout Type with transports that use the NVMe over 
   Fabrics protocols. This draft takes the earlier SCSI-over-NVMe 
   draft by C. Hellwig and extends it to support all the transport 
   protocols supported by NVMe over Fabrics, in addition to the SCSI 
   transport protocol introduced in the pNFS SCSI Layout. The 
   proposed transports cover several fabrics, including the RDMA 
   transports.
  
Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of Internet-Draft Shadow Directories can be accessed at 
   https://www.ietf.org/standards/ids/internet-draft-mirror-sites/.

   This Internet-Draft will expire on January 6, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.









Faibish                   Expires  January 6, 2021              [Page 1]

Internet-Draft      pNFS SCSI/NVMe Layout over Fabrics         July 2020 


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.
 
Table of Contents 

   1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 2 
     1.1. Conventions Used in This Document . . . . . . . . . . . . .  2 
     1.2. General Definitions . . . . . . . . . . . . . . . . . . . .  2 
   2. SCSI Layout mapping to NVMe . . . . . . . . . . . . . . . . . .  3 
     2.1. Volume Identification . . . . . . . . . . . . . . . . . . .  7 
     2.2. Client Fencing  . . . . . . . . . . . . . . . . . . . . . .  7 
       2.2.1. Reservation Key Generation  . . . . . . . . . . . . . .  8 
       2.2.2. MDS Registration and Reservation  . . . . . . . . . . .  8 
       2.2.3. Client Registration . . . . . . . . . . . . . . . . . .  8 
       2.2.4. Fencing Action  . . . . . . . . . . . . . . . . . . . .  8 
       2.2.5. Client Recovery after a Fence Action  . . . . . . . . .  9 
     2.3. Volatile write caches . . . . . . . . . . . . . . . . . . . 10 
   3. Security Considerations . . . . . . . . . . . . . . . . . . . . 10 
   4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . . 10 
   5. Normative References  . . . . . . . . . . . . . . . . . . . . . 11 
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . .  11
 
1. Introduction 

   The pNFS Small Computer System Interface (SCSI) layout [RFC8154] is 
   a layout type that allows NFS clients to directly perform I/O to 
   block storage devices while bypassing the MDS. It is specified 
   using concepts from the SCSI protocol family for the data path to 
   the storage devices. This document explains how to access PCI 
   Express, RDMA, or Fibre Channel devices that use the NVM Express 
   protocol [NVME] through the SCSI layout. This document does not 
   amend the pNFS SCSI layout document in any way; instead, it 
   explains how to map the SCSI constructs used in the pNFS SCSI 
   layout document to NVMe concepts using the NVMe SCSI translation 
   reference. 
 

1.1. Conventions Used in This Document 

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 
   document are to be interpreted as described in [RFC2119]. 



Faibish                 Expires  January 6, 2021                [Page 2]

Internet-Draft      pNFS SCSI/NVMe Layout over Fabrics         July 2020 

1.2. General Definitions 

   The following definitions are provided for the purpose of providing
   an appropriate context for the reader. 

   Client The "client" is the entity that accesses the NFS server's 
   resources. The client may be an application that contains the logic 
   to access the NFS server directly. The client may also be the 
   traditional operating system client that provides remote file 
   system services for a set of applications.

   Server/Controller The "server" is the entity responsible for 
   coordinating client access to a set of file systems and is 
   identified by a server owner. 

2. SCSI Layout mapping to NVMe 

   The SCSI layout definition [RFC8154] references only a few 
   SCSI-specific concepts directly. 

   NVM Express [NVME] Base Specification revision 1.4 and prior 
   revisions define a register-level interface for host software to 
   communicate with a non-volatile memory subsystem over PCI Express
   (NVMe over PCIe). The NVMe over Fabrics specification [NVMEoF] 
   defines extensions to NVMe that enable operation over other 
   interconnects (NVMe over Fabrics). The NVM Express Base 
   Specification revision 1.4 is referred to as the NVMe Base 
   specification.
   
   The goal of this draft is to enable an implementer who is 
   familiar with the pNFS SCSI layout [RFC8154] and the NVMe 
   standards (both NVMe-oF 1.1 and NVMe 1.4) to implement the 
   pNFS SCSI layout over NVMe-oF. The extensions mapped in this 
   document refer to specific NVMe Transports defined in NVMe 
   Transport binding specifications. This document refers to the 
   NVMe Transport binding specifications for FC, RDMA, and TCP 
   [RFC7525]. The NVMe Transport binding specification for Fibre 
   Channel is defined in INCITS 540 Fibre Channel - Non-Volatile 
   Memory Express [FC-NVMe]. 

   NVMe over Fabrics differs from the NVMe Base specification in the 
   following ways:
   - There is a one-to-one mapping between I/O Submission Queues and 
     I/O Completion Queues. NVMe over Fabrics does not support multiple 
     I/O Submission Queues being mapped to a single I/O Completion 
     Queue;







Faibish                 Expires  January 6, 2021                [Page 3]

Internet-Draft      pNFS SCSI/NVMe Layout over Fabrics         July 2020  

   - NVMe over Fabrics does not define an interrupt mechanism that 
     allows a controller to generate a host interrupt. It is the 
     responsibility of the host fabric interface (e.g., Host Bus 
     Adapter) to generate host interrupts;

   - NVMe over Fabrics does not use the Create I/O Completion Queue, 
     Create I/O Submission Queue, Delete I/O Completion Queue, and 
     Delete I/O Submission Queue commands. NVMe over Fabrics does 
     not use the Admin Submission Queue Base Address (ASQ), Admin 
     Completion Queue Base Address (ACQ), and Admin Queue Attributes 
     (AQA) properties (i.e., registers in PCI Express). Queues are 
     created using the Connect command;
   - NVMe over Fabrics uses the Disconnect command to delete an I/O 
     Submission Queue and corresponding I/O Completion Queue;
   - Metadata, if supported, shall be transferred as a contiguous part 
     of the logical block. NVMe over Fabrics does not support 
     transferring metadata from a separate buffer;
   - NVMe over Fabrics does not support PRPs but requires use of SGLs 
     for Admin, I/O, and Fabrics commands. This differs from NVMe over 
     PCIe where SGLs are not supported for Admin commands and are 
     optional for I/O commands;
   - NVMe over Fabrics does not support Completion Queue flow control. 
     This requires that the host ensures there are available Completion 
     Queue slots before submitting new commands; and
   - NVMe over Fabrics allows Submission Queue flow control to be 
     disabled if the host and controller agree to disable it. If 
     Submission Queue flow control is disabled, the host is required 
     to ensure that there are available Submission Queue slots before 
     submitting new commands. 

   NVMe over Fabrics requires the underlying NVMe Transport to provide 
   reliable NVMe command and data delivery. An NVMe Transport is an 
   abstract protocol layer independent of any physical interconnect 
   properties. An NVMe Transport may expose a memory model, a message 
   model, or a combination of the two. A memory model is one in which 
   commands, responses and data are transferred between fabric nodes 
   by performing explicit memory read and write operations while a 
   message model is one in which only messages containing command 
   capsules, response capsules, and data are sent between fabric nodes. 

   The only memory model NVMe Transport supported by NVMe [NVME] is 
   PCI Express, as defined in the NVMe Base specification. While 
   differences exist between NVMe over Fabrics and NVMe over PCIe 
   implementations, both implement the same architecture and command 
   sets. However, this document's use of the NVMe SCSI translation 
   reference relies only on NVMe over Fabrics, not on the memory 
   model. NVMe over Fabrics utilizes the protocol layering shown in 
   Figure 1. The native fabric communication services and the Fabric 
   Protocol and Physical Fabric layers in Figure 1 are outside the 
   scope of this specification.
                             


Faibish                 Expires  January 6, 2021                [Page 4]

Internet-Draft      pNFS SCSI/NVMe Layout over Fabrics         July 2020 

                         +-------------------+ 
                         |  pNFS host SCSI   | 
                         | layout over NVMe  | 
                         +---------+---------+ 
                                   |           
                                   v           
                         +-------------------+ 
                         | NVMe over Fabrics | 
                         +---------+---------+ 
                                   |           
                                   v           
                         +-------------------+ 
                         | Transport Binding | 
                         +---------+---------+ 
                                   |          
                                   v            
                         +--------------------+ 
                         | NVMe Transport svc | 
                         +---------+----------+ 
                                   |            
                                   v            
                         +-------------------+  
                         |    NVMe Transport |  
                         +---------+---------+  
                                   |            
                                   v            
                        +--------------------+ 
                        | Fabric Protocol    | 
                        +---------+----------+ 
                                  |            
                                  v            
                         +-------------------+ 
                         | Physical Fabric   | 
                         +---------+---------+ 
                                   |           
                                   v           
                      +------------------------+ 
                      | pNFS SCSI layout       | 
                      | server/NVMe controller | 
                      +------------------------+ 
            Figure 1: pNFS SCSI over NVMe over Fabrics Layering

   An NVM subsystem port may support multiple NVMe Transports if more 
   than one NVMe Transport binding specification exists for the 
   underlying fabric (e.g., an NVM subsystem port identified by a 
   Port ID may support both iWARP and RoCE). This draft also defines 
   an NVMe binding implementation that uses the RDMA Transport type. 
   The RDMA Transport is RDMA Provider agnostic. The diagram in 
   Figure 2 illustrates the layering of the RDMA Transport and common 
   RDMA providers (iWARP, InfiniBand, and RoCE) within the host and 
   the NVM subsystem. 

Faibish                 Expires  January 6, 2021                [Page 5]

Internet-Draft      pNFS SCSI/NVMe Layout over Fabrics         July 2020 

                         +-------------------+ 
                         |      NVMe Host    | 
                         +---------+---------+ 
                         |  RDMA Transport   | 
                +--------+---+------------+--+---------+
                |    iWARP   | Infiniband |    RoCE    |
                +------------+-----++-----+------------+
                                   || RDMA Fabric
                                   vv
                +------------+------------+--+---------+
                |    iWARP   | Infiniband |    RoCE    |
                +---------+--+------------+---+--------+
                          |  RDMA Transport   | 
                          +-------------------+
                          |   NVM Subsystem   | 
                          +-------------------+
                Figure 2: RDMA Transport Protocol Layers

   NVMe over Fabrics allows multiple hosts to connect to different 
   controllers in the NVM subsystem through the same port. All other 
   aspects of NVMe over Fabrics multi-path I/O and namespace sharing 
   are equivalent to that defined in the NVMe Base specification.

   An association is established between a host and a controller when 
   the host connects to a controller's Admin Queue using the Fabrics 
   Connect command. Within the Connect command, the host specifies 
   the Host NQN, NVM Subsystem NQN, Host Identifier, and may request a 
   specific Controller ID or may request a connection to any available 
   controller. The host is the pNFS client, and the controller is the
   NFSv4 server. The pNFS clients connect to the server over the 
   various fabric transports; direct PCIe connection is excluded. 
   While an association exists between a host and 
   a controller, only that host may establish connections with I/O 
   Queues of that controller. 
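
   As a non-normative illustration, the C sketch below shows the
   layout of the 1024-byte Connect command data defined in [NVMEoF],
   which carries the Host Identifier, requested Controller ID, NVM
   Subsystem NQN, and Host NQN; the offsets are reproduced here for
   convenience only, and [NVMEoF] remains authoritative.

      /* Informative sketch of the NVMe over Fabrics Connect command
       * data (1024 bytes); consult [NVMEoF] for the authoritative
       * definition. */
      #include <stdint.h>

      struct nvmeof_connect_data {
          uint8_t  hostid[16];    /* Host Identifier                */
          uint16_t cntlid;        /* requested Controller ID        */
          uint8_t  rsvd18[238];   /* reserved                       */
          char     subnqn[256];   /* NVM Subsystem NQN, NUL padded  */
          char     hostnqn[256];  /* Host NQN, NUL padded           */
          uint8_t  rsvd768[256];  /* reserved                       */
      };

   On a Linux host this command is normally issued by the kernel NVMe
   over Fabrics host driver (for example, when an administrator runs
   "nvme connect"), not by the pNFS client itself.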

   NVMe over Fabrics supports both fabric secure channel and NVMe 
   in-band authentication. An NVM subsystem may require a host to 
   use fabric secure channel, NVMe in-band authentication, or both. 
   The Discovery Service indicates if fabric secure channel shall be 
   used for an NVM subsystem. The Connect response indicates if NVMe 
   in-band authentication shall be used with that controller. For
   SCSI over NVMe over Fabrics, only the in-band authentication model 
   is used, because the fabric secure channel is only supported with 
   the PCIe transport memory model, which is not supported by the 
   SCSI layout protocol.





Faibish                 Expires  January 6, 2021                [Page 6]

Internet-Draft      pNFS SCSI/NVMe Layout over Fabrics         July 2020

2.1. Volume Identification 


   The pNFS SCSI layout uses the Device Identification VPD page (page 
   code 0x83) from [SPC4] to identify the devices used by a layout. 
   There are several ways to build SCSI Device Identification 
   descriptors from NVMe Identify data, including the Controller 
   Identify Attributes specific to NVMe over Fabrics specified in the 
   Identify Controller fields in Section 4.1 of [NVMEoF]. This 
   document uses a subset of this information to identify the LUs 
   backing pNFS SCSI layouts.

   To be used as storage devices for the pNFS SCSI layout, NVMe 
   devices MUST support the EUI-64 [RFC8154] value in the Identify 
   Namespace data; methods based on the Serial Number of legacy 
   devices might not be suitable for unique addressing and thus MUST 
   NOT be used. UUID identification can be added by using an enum 
   value large enough to avoid conflict with whatever T10 might do in 
   a future version of the SCSI standard [SBC3] (the underlying SCSI 
   field in SPC is 4 bits, so an enum value of 32 MUST be used in 
   this draft). For NVMe, these identifiers are obtained from the 
   Namespace Identification Descriptors in NVMe 1.4 (returned by the 
   Identify command with the CNS field set to 03h). 
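
   As a non-normative example, the C sketch below (assuming a Linux
   host, the NVMe passthrough ioctl interface, and a namespace
   exposed as a block device such as /dev/nvme0n1) issues an Identify
   command with CNS set to 03h and walks the returned Namespace
   Identification Descriptor list looking for the EUI-64 descriptor
   (NIDT 01h).

      /* Sketch: read the EUI-64 of an NVMe namespace via Identify,
       * CNS = 03h (Namespace Identification Descriptor list). */
      #include <stdint.h>
      #include <string.h>
      #include <sys/ioctl.h>
      #include <linux/nvme_ioctl.h>

      int get_eui64(int fd, uint32_t nsid, uint8_t eui64[8])
      {
          uint8_t buf[4096];
          struct nvme_passthru_cmd cmd;

          memset(buf, 0, sizeof(buf));
          memset(&cmd, 0, sizeof(cmd));
          cmd.opcode   = 0x06;            /* Identify                */
          cmd.nsid     = nsid;
          cmd.addr     = (uintptr_t)buf;
          cmd.data_len = sizeof(buf);
          cmd.cdw10    = 0x03;            /* CNS 03h                 */
          if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) != 0)
              return -1;

          /* Each descriptor: NIDT, NIDL, 2 reserved bytes, NID. */
          for (size_t off = 0; off + 4 < sizeof(buf); ) {
              uint8_t nidt = buf[off], nidl = buf[off + 1];
              if (nidt == 0 || nidl == 0)
                  break;                  /* end of list             */
              if (nidt == 0x01 && nidl == 8) {
                  memcpy(eui64, &buf[off + 4], 8);
                  return 0;               /* EUI-64 found            */
              }
              off += 4 + nidl;
          }
          return -1;                      /* no EUI-64 reported      */
      }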

2.2. Client Fencing 

   The SCSI layout uses Persistent Reservations to provide client 
   fencing. For this, both the MDS and the clients have to register a 
   key with the storage device, and the MDS has to create a 
   reservation on the storage device. The pNFS SCSI protocol 
   implements fencing using persistent reservations (PRs), similar to 
   the fencing method used by existing shared disk file systems. To 
   allow fencing of individual systems, each system MUST use a unique 
   persistent reservation key. The following subsections provide a 
   full mapping of the required PERSISTENT RESERVE IN and PERSISTENT 
   RESERVE OUT SCSI commands to NVMe commands; this mapping MUST be 
   used when NVMe devices are used as storage devices for the pNFS 
   SCSI layout. 

2.2.1. Reservation Key Generation

   Prior to establishing a reservation on a namespace, a host shall 
   become a registrant of that namespace by registering a reservation 
   key. This reservation key may be used by the host as a means of 
   identifying the registrant (host), authenticating the registrant, 
   and preempting a failed or uncooperative registrant. This document 
   assigns the burden to generate unique keys to the MDS, which MUST 
   generate a key for itself before exporting a volume and a key for 
   each client that accesses SCSI layout volumes.
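
   How the MDS constructs these keys is an implementation choice; as
   one purely illustrative approach, the C sketch below derives
   64-bit keys from an MDS-local identifier and a per-client index so
   that the MDS key and all client keys remain distinct.

      /* Illustrative key-generation policy only; any scheme that
       * yields unique 8-byte keys for the MDS and each client is
       * acceptable. */
      #include <stdint.h>

      static uint64_t make_rkey(uint32_t mds_id, uint32_t client_idx)
      {
          /* client_idx 0 is reserved for the MDS itself. */
          return ((uint64_t)mds_id << 32) | client_idx;
      }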

   One important difference between SCSI Persistent Reservations  
   and NVMe Reservations is that NVMe reservation keys always apply 
   to all controllers used by a host (as indicated by the NVMe Host 
   Identifier). 

Faibish                 Expires  January 6, 2021                [Page 7]

Internet-Draft      pNFS SCSI/NVMe Layout over Fabrics         July 2020

   This behavior is somewhat similar to setting the ALL_TG_PT bit when 
   registering a SCSI reservation key, but it is guaranteed to work 
   reliably.

2.2.2. MDS Registration and Reservation

   Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
   MDS needs to prepare the volume for fencing using NVMe 
   Reservations. Registering a reservation key with a namespace 
   creates an association between a host and a namespace. A host that 
   is a registrant of a namespace may use any controller with which 
   that host is associated (i.e., that has the same Host Identifier; 
   refer to Section 5.21.1.26 of [NVME]) to access that namespace as 
   a registrant.  
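
   As a non-normative sketch (assuming a Linux host, the NVMe
   passthrough ioctl interface, a little-endian host, and an MDS that
   has already registered its own key with a Reservation Register
   command as described for clients in Section 2.2.3.2), the MDS
   could then create the reservation with a Reservation Acquire
   command using the Exclusive Access - Registrants Only reservation
   type (4h), the NVMe analogue of the SCSI type 8h used by
   [RFC8154].

      /* Sketch: MDS acquires an Exclusive Access - Registrants Only
       * (4h) reservation after registering mds_key.  Assumes an open
       * fd for the namespace device (e.g., /dev/nvme0n1). */
      #include <stdint.h>
      #include <string.h>
      #include <sys/ioctl.h>
      #include <linux/nvme_ioctl.h>

      int mds_reserve(int fd, uint32_t nsid, uint64_t mds_key)
      {
          /* Data: CRKEY then PRKEY, little-endian on the wire. */
          uint64_t keys[2] = { mds_key, 0 };
          struct nvme_passthru_cmd cmd;

          memset(&cmd, 0, sizeof(cmd));
          cmd.opcode   = 0x11;          /* Reservation Acquire       */
          cmd.nsid     = nsid;
          cmd.addr     = (uintptr_t)keys;
          cmd.data_len = sizeof(keys);
          /* RACQA 000b (Acquire), RTYPE 4h in bits 15:08.           */
          cmd.cdw10    = (0x4 << 8) | 0x0;
          return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
      }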

2.2.3. Client Registration

2.2.3.1 SCSI client 

   Before performing the first I/O to a device returned from a
   GETDEVICEINFO operation, the client will register the 
   reservation key returned by the MDS with the storage device 
   by issuing a "PERSISTENT RESERVE OUT" command with a service 
   action of REGISTER with the "SERVICE ACTION RESERVATION KEY" set 
   to the reservation key.
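
   A non-normative sketch of this registration on a Linux client,
   assuming the SCSI generic (SG_IO) passthrough interface and an
   open file descriptor for the LU's block device, is shown below.
   The fencing actions in Section 2.2.4.1 use the same CDB format
   with a different service action, the TYPE field set, and the
   victim's key in the "SERVICE ACTION RESERVATION KEY" field.

      /* Sketch: PERSISTENT RESERVE OUT / REGISTER via SG_IO.
       * Assumes an open fd for the LU (e.g., /dev/sdX) and the key
       * returned by the MDS. */
      #include <stdint.h>
      #include <string.h>
      #include <sys/ioctl.h>
      #include <scsi/sg.h>

      int scsi_pr_register(int fd, uint64_t key)
      {
          unsigned char cdb[10], param[24], sense[32];
          sg_io_hdr_t io;

          memset(cdb, 0, sizeof(cdb));
          memset(param, 0, sizeof(param));
          cdb[0] = 0x5F;            /* PERSISTENT RESERVE OUT        */
          cdb[1] = 0x00;            /* service action: REGISTER      */
          cdb[8] = 24;              /* parameter list length         */
          /* SERVICE ACTION RESERVATION KEY, big-endian, bytes 8-15. */
          for (int i = 0; i < 8; i++)
              param[8 + i] = (unsigned char)(key >> (56 - 8 * i));

          memset(&io, 0, sizeof(io));
          io.interface_id    = 'S';
          io.dxfer_direction = SG_DXFER_TO_DEV;
          io.cmd_len         = sizeof(cdb);
          io.cmdp            = cdb;
          io.dxfer_len       = sizeof(param);
          io.dxferp          = param;
          io.mx_sb_len       = sizeof(sense);
          io.sbp             = sense;
          io.timeout         = 10000;   /* milliseconds              */
          return ioctl(fd, SG_IO, &io);
      }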

2.2.3.2 NVMe Client

   A client registers a reservation key by executing a Reservation 
   Register command (refer to [NVME] section 6.11) on the namespace 
   with the Reservation Register Action (RREGA) field cleared to 
   000b (i.e., Register Reservation Key) and supplying a reservation 
   key in the New Reservation Key (NRKEY) field. A client that is a 
   registrant of a namespace may register the same reservation key 
   value multiple times with the namespace on the same or different 
   controllers. There are no restrictions on the reservation key 
   value used by hosts with different Host Identifiers.
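
   A non-normative sketch of this registration, again assuming a
   Linux host with the NVMe passthrough ioctl interface and a
   little-endian host, is shown below.

      /* Sketch: register the MDS-supplied key with the namespace by
       * issuing Reservation Register with RREGA = 000b and the key
       * in NRKEY.  Assumes an open fd for the namespace device. */
      #include <stdint.h>
      #include <string.h>
      #include <sys/ioctl.h>
      #include <linux/nvme_ioctl.h>

      int nvme_register_key(int fd, uint32_t nsid, uint64_t nrkey)
      {
          /* Data: CRKEY (not used for Register), then NRKEY. */
          uint64_t keys[2] = { 0, nrkey };
          struct nvme_passthru_cmd cmd;

          memset(&cmd, 0, sizeof(cmd));
          cmd.opcode   = 0x0D;          /* Reservation Register      */
          cmd.nsid     = nsid;
          cmd.addr     = (uintptr_t)keys;
          cmd.data_len = sizeof(keys);
          cmd.cdw10    = 0x0;           /* RREGA 000b: Register      */
          return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
      }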

2.2.4. Fencing Action

2.2.4.1 SCSI client 

   In case of a non-responding client, the MDS fences the client by
   issuing a "PERSISTENT RESERVE OUT" command with the service action
   set to "PREEMPT" or "PREEMPT AND ABORT", the "RESERVATION KEY" field
   set to the server's reservation key, the service action "RESERVATION
   KEY" field set to the reservation key associated with the non-
   responding client, and the "TYPE" field set to 8h (Exclusive Access
   - Registrants Only). 



Faibish                 Expires  January 6, 2021                [Page 8]

Internet-Draft      pNFS SCSI/NVMe Layout over Fabrics         July 2020

2.2.4.2 NVMe Client

   A host that is a registrant may preempt a reservation and/or 
   registration by executing a Reservation Acquire command (refer to 
   Section 6.10 of [NVME]), setting the Reservation Acquire Action 
   (RACQA) field 
   to 001b (Preempt), and supplying the current reservation key 
   associated with the host in the Current Reservation Key (CRKEY)
   field. The CRKEY value shall match that used by the registrant to 
   register with the namespace. If the CRKEY value does not match, 
   then the command is aborted with status Reservation Conflict. 

   If the PRKEY field value does not match that of the current 
   reservation holder and is equal to 0h, then the command is aborted 
   with status Invalid Field in Command. A Reservation Preempted 
   notification occurs on all controllers in the NVM subsystem that 
   are associated with hosts that have their registrations removed as 
   a result of the actions taken in this section, except those 
   associated with the host that issued the Reservation Acquire 
   command. After the MDS preempts a client, all client I/O to the LU 
   fails. The client SHOULD at this point return any layout that 
   refers to the device ID that points to the LU.
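
   A non-normative sketch of this fencing action by the MDS, under
   the same Linux passthrough assumptions as the earlier examples, is
   shown below: the MDS supplies its own key in CRKEY and the key of
   the non-responding client in PRKEY.

      /* Sketch: MDS preempts (fences) a client's registration with
       * Reservation Acquire, RACQA = 001b (Preempt).  Assumes an
       * open fd for the namespace device. */
      #include <stdint.h>
      #include <string.h>
      #include <sys/ioctl.h>
      #include <linux/nvme_ioctl.h>

      int nvme_preempt(int fd, uint32_t nsid,
                       uint64_t mds_key, uint64_t client_key)
      {
          uint64_t keys[2] = { mds_key, client_key }; /* CRKEY, PRKEY */
          struct nvme_passthru_cmd cmd;

          memset(&cmd, 0, sizeof(cmd));
          cmd.opcode   = 0x11;          /* Reservation Acquire       */
          cmd.nsid     = nsid;
          cmd.addr     = (uintptr_t)keys;
          cmd.data_len = sizeof(keys);
          /* RACQA 001b (Preempt), RTYPE 4h in bits 15:08.           */
          cmd.cdw10    = (0x4 << 8) | 0x1;
          return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
      }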

2.2.5. Client Recovery after a Fence Action

   A client that detects a Reservation Conflict NVMe status code (I/O
   error) on the storage device MUST commit all layouts that use the
   storage device through the MDS, return all outstanding layouts for
   the device, forget the device ID, and unregister the reservation 
   key. 
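
   Unregistering the key can be done with the same Reservation
   Register command used in Section 2.2.3.2, with RREGA set to 001b
   (Unregister Reservation Key) and the client's key supplied in
   CRKEY; a non-normative Linux sketch follows.

      /* Sketch: client unregisters its reservation key as part of
       * recovery after being fenced.  Assumes an open fd for the
       * namespace device. */
      #include <stdint.h>
      #include <string.h>
      #include <sys/ioctl.h>
      #include <linux/nvme_ioctl.h>

      int nvme_unregister_key(int fd, uint32_t nsid, uint64_t crkey)
      {
          uint64_t keys[2] = { crkey, 0 };   /* CRKEY, NRKEY unused */
          struct nvme_passthru_cmd cmd;

          memset(&cmd, 0, sizeof(cmd));
          cmd.opcode   = 0x0D;          /* Reservation Register      */
          cmd.nsid     = nsid;
          cmd.addr     = (uintptr_t)keys;
          cmd.data_len = sizeof(keys);
          cmd.cdw10    = 0x1;           /* RREGA 001b: Unregister    */
          return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
      }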

   Future GETDEVICEINFO calls MAY refer to the storage device again, 
   in which case the client will perform a new registration based on 
   the key provided. If a reservation holder attempts to obtain a 
   reservation of a different type on a namespace for which that host 
   already is the reservation holder, then the command is aborted with 
   status Reservation Conflict. It is not an error if a reservation 
   holder attempts to obtain a reservation of the same type on a 
   namespace for which that host already is the reservation holder.

   NVMe over Fabrics [NVMEoF] utilizes the same controller 
   architecture as that defined in the NVMe Base specification [NVME].











Faibish                 Expires  January 6, 2021                [Page 9]

Internet-Draft      pNFS SCSI/NVMe Layout over Fabrics         July 2020

 
   This includes using Submission and Completion Queues to execute 
   commands between a host and a controller. Section 8.20 of the 
   [NVME] Base specification describes the relationship between a 
   controller (the MDS) and a namespace associated with the clients. 
   In the static controller model used by the SCSI layout, controllers 
   that may be allocated to a particular client may have different 
   states at the time the association is established.     

2.3. Volatile write caches 

   The Volatile Write Cache Enable (WCE) bit (i.e., bit 00) of the 
   Volatile Write Cache feature (Feature Identifier 06h) is the Write 
   Cache Enable field returned by the NVMe Get Features command; see 
   Section 5.21.1.6 of [NVME]. If a volatile write cache is enabled 
   on an NVMe device used as a storage device for the pNFS SCSI 
   layout, the MDS MUST ensure that the volatile write cache is 
   flushed using the NVMe Flush command. If there is no volatile 
   write cache on the device, attempts to access this NVMe feature 
   cause errors: a Get Features command specifying the Volatile Write 
   Cache feature identifier is expected to fail with Invalid Field in 
   Command status.
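
   As a non-normative sketch under the same Linux passthrough
   assumptions, the MDS could query the Volatile Write Cache feature
   and, if a volatile write cache is present and enabled, issue a
   Flush command; the Get Features call is expected to fail on
   devices without a volatile write cache, as noted above.

      /* Sketch: read the Volatile Write Cache feature (FID 06h) and,
       * if WCE (bit 0) is set, flush the namespace with the NVMe
       * Flush command.  Assumes an open fd for the namespace device. */
      #include <stdint.h>
      #include <string.h>
      #include <sys/ioctl.h>
      #include <linux/nvme_ioctl.h>

      int flush_if_volatile_cache(int fd, uint32_t nsid)
      {
          struct nvme_passthru_cmd cmd;

          memset(&cmd, 0, sizeof(cmd));
          cmd.opcode = 0x0A;            /* Get Features (admin)      */
          cmd.cdw10  = 0x06;            /* FID 06h: Volatile Wr Cache */
          if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) != 0)
              return 0;                 /* no volatile write cache   */
          if (!(cmd.result & 0x1))
              return 0;                 /* WCE bit clear             */

          memset(&cmd, 0, sizeof(cmd));
          cmd.opcode = 0x00;            /* Flush (NVM command set)   */
          cmd.nsid   = nsid;
          return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
      }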

3. Security Considerations 

   Since no protocol changes are proposed here, no new security 
   considerations apply. However, the protocol assumes that the NVMe 
   Authentication commands are implemented in the NVMe Security 
   Protocol, as the format of the data to be transferred depends on 
   the Security Protocol. The Authentication Receive/Response 
   commands return the appropriate data corresponding to an 
   Authentication Send command as defined by the rules of the 
   Security Protocol. As the current draft only supports the NVMe 
   over Fabrics in-band protocol, the authentication requirements 
   for security commands are based on the security protocol indicated 
   by the SECP field in the command and DO NOT require authentication
   when used for NVMe in-band authentication. When used for other 
   purposes, in-band authentication of the commands is required. 
 

4. IANA Considerations 
   The document does not require any actions by IANA. 











Faibish                 Expires  January 6, 2021               [Page 10]

Internet-Draft      pNFS SCSI/NVMe Layout over Fabrics         July 2020

5. Normative References 

   [NVME] NVM Express, Inc., "NVM Express Revision 1.4", June 10, 2019.
 
   [NVMEoF] NVM Express, Inc., "NVM Express over Fabrics Revision 
            1.1", July 26, 2019. 

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 
             Requirement Levels", March 1997. 

   [RFC8154] Hellwig, C., "Parallel NFS (pNFS) Small Computer System 
             Interface (SCSI) Layout", May 2017. 

   [SBC3] INCITS Technical Committee T10, "SCSI Block Commands-3", 
          ANSI INCITS INCITS 514-2014, ISO/IEC 14776-323, 2014. 

   [SPC4] INCITS Technical Committee T10, "SCSI Primary Commands-4", 
          ANSI INCITS 513-2015, 2015. 

   [FC-NVMe] INCITS Technical Committee T11, "Fibre Channel - 
              Non-Volatile Memory Express", ANSI INCITS 540, 2018.

   [RFC7525] Sheffer, Y., Holz, R., and P. Saint-Andre, 
              "Recommendations for Secure Use of Transport Layer 
              Security (TLS) and Datagram Transport Layer Security 
              (DTLS)", BCP 195, May 2015.

Authors' Addresses

   Sorin Faibish
   Dell EMC 
   228 South Street
   Hopkinton, MA  01774
   United States of America

   Phone: +1 508-249-5745
   Email: faibish.sorin@dell.com

   David Black
   Dell EMC 
   176 South Street
   Hopkinton, MA  01748
   United States of America

   Phone: +1 774-350-9323
   Email: david.black@dell.com

   Christoph Hellwig 
   Email: hch@lst.de 





Faibish                 Expires  January 6, 2021               [Page 11]