INTERNET-DRAFT                                                 M. Wittle
draft-wittle-dafs-00.txt                         Network Appliance, Inc.
Expires March 2002                                        September 2001


                    Direct Access File System (DAFS)


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt. The list of
   Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2001). All Rights Reserved.

Key Words

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

Abstract

   The Direct Access File System (DAFS) is a file access and management
   protocol designed for local file-sharing or clustered environments.
   It addresses two primary goals:

   o  Provide low-latency, high-throughput, and low-overhead data
      movement that takes advantage of modern memory-to-memory
      networking technology.

   o  Define a set of file management and file access operations for
      local file-sharing requirements.


Wittle                                                          [Page 1]

INTERNET-DRAFT         Direct Access File System          September 2001


Table of Contents

Chapter 1. Introduction to the Direct Access File System Protocol .... 7
1.1.    New System Trends ............................................ 7
1.1.1.     Local File-Sharing Architecture ........................... 8
1.2.    New Networking Technology Trends ............................. 8
1.2.1.     Direct Access Transport ................................... 9
1.3.    The DAFS Opportunity ........................................ 10

Chapter 2. DAFS Overview ............................................ 11
2.1.    DAFS Goals .................................................. 11
2.2.    Local File-Sharing Requirements ............................. 11
2.3.    Direct Access Transport ..................................... 15
2.3.1.     DAT Glossary ............................................. 15
2.3.2.     DAT Description .......................................... 17
2.3.3.     DAT Requirements ......................................... 19
2.3.4.     Physical Interconnect .................................... 21
2.4.    DAFS Protocol ............................................... 21
2.4.1.     DAFS Deployment Models ................................... 24
2.4.2.     DAFS File Name Space ..................................... 24
2.4.3.     DAFS Terminology ......................................... 25

Chapter 3. Communication Model ...................................... 28
3.1.    Session Management .......................................... 28
3.1.1.     Security Model ........................................... 29
3.1.2.     Session Attributes ....................................... 33
3.1.3.     Session Operations ....................................... 42
3.1.4.     Sharing Sessions ......................................... 44
3.2.    Message Handling ............................................ 44
3.2.1.     DAT Data Transfer Operations ............................. 44
3.2.2.     DAT Error Reporting ...................................... 45
3.2.3.     Mapping DAFS onto Memory-to-Memory Architectures ......... 46
3.2.4.     Separate Communications Channel for RDMA Read Operations . 51
3.2.5.     Checksums ................................................ 53
3.2.6.     Message Flow Control ..................................... 54

Chapter 4. File System Operations ................................... 62
4.1.    Concepts and Structures ..................................... 62
4.1.1.     DAFS and NFS Version 4 ................................... 62
4.1.2.     Typographical Conventions ................................ 62
4.1.3.     Recurring Differences Between DAFS and NFS Version 4 ..... 63
4.1.4.     Objects Naming And Filehandles ........................... 64
4.1.5.     Named Attributes ......................................... 74
4.2.    Data Transfer Operations .................................... 75
4.2.1.     Send-Receive ............................................. 75
4.2.2.     RDMA Transfers ........................................... 77
4.2.3.     Batch I/O Operations ..................................... 79
4.2.4.     Server Caching Hints ..................................... 80
4.3.    Request Chaining ............................................ 80
4.3.1.     Chaining Restrictions .................................... 81
4.3.2.     Chaining Flags ........................................... 85
4.3.3.     Chaining and Flow Control ................................ 86
4.3.4.     Chaining and Recovery .................................... 86
4.4.    Locking and Access Control .................................. 88
4.4.1.     Locking .................................................. 88
4.4.2.     Shared Key Reservations ................................. 111
4.4.3.     Access Control Lists (ACLs) ............................. 112
4.4.4.     Fencing ................................................. 119
4.5.    NFS-Derived Operations ..................................... 122

Chapter 5. Failure Recovery ........................................ 124
5.1.    Exactly Once Semantics ..................................... 124
5.2.    Server Response Cache ...................................... 124
5.2.1.     Response Cache .......................................... 124
5.2.2.     Response Cache Handling of OPNreq Decrease .............. 126
5.2.3.     Handling Batch I/O Requests ............................. 128
5.2.4.     Server Response Cache in Stable Storage ................. 128
5.2.5.     Use of the Server Response Cache ........................ 128
5.2.6.     Response Cache Operations ............................... 129
5.3.    Server Failover ............................................ 130
5.3.1.     Changing failover_locations ............................. 131

Chapter 6. Message Formats ......................................... 132
6.1.    Message Headers and Common Structures ...................... 132
6.1.1.     Message Format .......................................... 132
6.1.2.     Request Header .......................................... 139
6.1.3.     Response Header ......................................... 141
6.1.4.     Basic Types ............................................. 141
6.1.5.     File Attributes ......................................... 147
6.1.6.     File System Attributes .................................. 155
6.1.7.     Direct Operations ....................................... 165
6.1.8.     Cache Hints ............................................. 167
6.1.9.     Authentication .......................................... 167
6.1.10.    Procedures .............................................. 169
6.2.    Connection and Security Management ......................... 172
6.2.1.     DAFS_PROC_CLIENT_CONNECT ................................ 172
6.2.2.     DAFS_PROC_CLIENT_AUTH ................................... 176
6.2.3.     DAFS_PROC_SERVER_AUTH ................................... 179
6.2.4.     DAFS_PROC_CLIENT_CONNECT_AUTH ........................... 181
6.2.5.     DAFS_PROC_CONNECT_BIND .................................. 183
6.2.6.     DAFS_PROC_DISCONNECT .................................... 186
6.2.7.     DAFS_PROC_SECINFO ....................................... 187
6.2.8.     DAFS_PROC_REGISTER_CRED ................................. 189
6.2.9.     DAFS_PROC_RELEASE_CRED .................................. 192
6.3.    Response Cache ............................................. 193
6.3.1.     DAFS_PROC_CHECK_RESPONSE ................................ 193
6.3.2.     DAFS_PROC_FETCH_RESPONSE ................................ 195
6.3.3.     DAFS_PROC_DISCARD_RESPONSES ............................. 196
6.4.    Fencing Procedures ......................................... 197
6.4.1.     DAFS_PROC_GET_FENCING_LIST .............................. 197
6.4.2.     DAFS_PROC_SET_FENCING_LIST .............................. 198
6.5.    File System Procedures ..................................... 201
6.5.1.     DAFS_PROC_NULL .......................................... 201
6.5.2.     DAFS_PROC_ACCESS ........................................ 202
6.5.3.     DAFS_PROC_APPEND_INLINE ................................. 206
6.5.4.     DAFS_PROC_APPEND_DIRECT ................................. 209
6.5.5.     DAFS_PROC_BATCH_SUBMIT .................................. 212
6.5.6.     DAFS_PROC_CACHE_HINT .................................... 216
6.5.7.     DAFS_PROC_CLOSE ......................................... 219
6.5.8.     DAFS_PROC_COMMIT ........................................ 221
6.5.9.     DAFS_PROC_CREATE ........................................ 225
6.5.10.    DAFS_PROC_DELEGPURGE .................................... 228
6.5.11.    DAFS_PROC_DELEGRETURN ................................... 230
6.5.12.    DAFS_PROC_GET_ROOT_HANDLE ............................... 231
6.5.13.    DAFS_PROC_GETATTR_INLINE ................................ 232
6.5.14.    DAFS_PROC_GETATTR_DIRECT ................................ 235
6.5.15.    DAFS_PROC_GET_FSATTR .................................... 238
6.5.16.    DAFS_PROC_HURRY_UP ...................................... 241
6.5.17.    DAFS_PROC_LINK .......................................... 243
6.5.18.    DAFS_PROC_LOCK .......................................... 246
6.5.19.    DAFS_PROC_LOCKT ......................................... 250
6.5.20.    DAFS_PROC_LOCKU ......................................... 253
6.5.21.    DAFS_PROC_LOOKUP ........................................ 255
6.5.22.    DAFS_PROC_LOOKUPP ....................................... 258
6.5.23.    DAFS_PROC_NVERIFY ....................................... 260
6.5.24.    DAFS_PROC_OPEN .......................................... 262
6.5.25.    DAFS_PROC_OPENATTR ...................................... 275
6.5.26.    DAFS_PROC_OPEN_DOWNGRADE ................................ 277
6.5.27.    DAFS_PROC_READ_INLINE ................................... 279
6.5.28.    DAFS_PROC_READ_DIRECT ................................... 283
6.5.29.    DAFS_PROC_READDIR_INLINE ................................ 287
6.5.30.    DAFS_PROC_READDIR_DIRECT ................................ 292
6.5.31.    DAFS_PROC_READLINK_INLINE ............................... 299
6.5.32.    DAFS_PROC_READLINK_DIRECT ............................... 299
6.5.33.    DAFS_PROC_REMOVE ........................................ 301
6.5.34.    DAFS_PROC_RENAME ........................................ 305
6.5.35.    DAFS_PROC_SETATTR_INLINE ................................ 309
6.5.36.    DAFS_PROC_SETATTR_DIRECT ................................ 312
6.5.37.    DAFS_PROC_VERIFY ........................................ 316
6.5.38.    DAFS_PROC_WRITE_INLINE .................................. 318
6.5.39.    DAFS_PROC_WRITE_DIRECT .................................. 324
6.6.    Back-Control Directives .................................... 330
6.6.1.     DAFS_PROC_BC_NULL ....................................... 330
6.6.2.     DAFS_PROC_BC_BATCH_COMPLETION ........................... 331
6.6.3.     DAFS_PROC_BC_GETATTR .................................... 332
6.6.4.     DAFS_PROC_BC_RECALL ..................................... 334

Chapter 7. Error Status Result Codes ............................... 336
Chapter 8. Security and IANA Considerations ........................ 346
8.1.    Security Considerations .................................... 346
8.2.    IANA Considerations ........................................ 346
Chapter 9. Bibliography ............................................ 347
Chapter 10. Author Information and Acknowledgements ................ 349
10.1.   Editor ..................................................... 349
10.2.   Authors .................................................... 349
10.3.   Comments ................................................... 349
10.4.   Acknowledgements ........................................... 349

Appendix A. DAFS Name Service ...................................... 350
A.1.    Introduction ............................................... 350
A.2.    DAFS Name Space ............................................ 350
A.3.    DAFS Name .................................................. 350
A.4.    DAFS Location .............................................. 351
A.4.1.     DAT Location ............................................ 351
A.4.2.     DAFS Directory Path ..................................... 353
A.4.3.     DAFS Version ............................................ 353
A.5.    DAFS Names and Locations ................................... 353
A.6.    Name Space Repository ...................................... 354
A.7.    LDAP Schema ................................................ 355
A.8.    References ................................................. 358

Appendix B. DAT Semantics .......................................... 359
B.1.    DAT Glossary ............................................... 359
B.2.    DAT Model .................................................. 361
B.3.    DAT Provider ............................................... 361
B.4.    Transport Endpoints and Connections ........................ 362
B.5.    DAT Memory Semantics ....................................... 364
B.6.    DAT Data Transfer Operations and Connection Properties ..... 365

Appendix C. DAT Name Service ....................................... 368

Appendix D. DAFS Mapping to VI Architecture ........................ 370
D.1.    Terminology Mapping from DAT to VI ......................... 370
D.2.    Additional VI Terminology .................................. 372
D.3.    DAT Requirements Mapping ................................... 373
D.4.    VI & Connections ........................................... 375
D.4.1.     VI Discriminators ....................................... 375
D.4.2.     VI Connection Attributes ................................ 375
D.4.3.     VI Endpoint Attributes .................................. 376
D.4.4.     DAFS Flow Control Initialization ........................ 376
D.4.5.     VI Disconnect ........................................... 377
D.5.    VI Architecture Memory Semantics ........................... 377
D.6.    VI Data Transfer Operations ................................ 377
D.7.    Name Service Mapping for VI Architecture ................... 378
D.8.    DAFS Client Discriminators ................................. 380
D.9.    Design Notes ............................................... 381
D.9.1.     Connection Establishment ................................ 382
D.9.2.     Memory Registration ..................................... 382
D.9.3.     NIC Attributes .......................................... 382
D.10.   References ................................................. 383

Appendix E. DAFS Mapping to InfiniBand Reliable Connection ......... 384
E.1.    Terminology Mapping from DAT to InfiniBand ................. 384
E.2.    Additional InfiniBand Terminology .......................... 386
E.3.    DAT Requirements Mapping ................................... 387
E.4.    IBA Model .................................................. 390
E.5.    InfiniBand Architecture Transport Endpoints and Connections  391
E.5.1.     Proxy Communications Managers ........................... 392
E.5.2.     Partitions .............................................. 392
E.5.3.     DAFS Connection Establishment Requirements .............. 393
E.5.4.     Disconnect .............................................. 396
E.5.5.     Automatic Path Migration ................................ 396
E.6.    IBA Memory Semantics ....................................... 397
E.6.1.     Memory Regions and Memory Windows ....................... 397
E.6.2.     Protection Domains ...................................... 397
E.7.    IBA Data Transfer Operations ............................... 398
E.8.    DAFS Name Service Mapping for InfiniBand Reliable Connection 399
E.9.    DAFS Client Connection Request PrivateData ................. 400
E.10.   References ................................................. 402
        Full Copyright Statements .................................. 403


1.  Introduction to the Direct Access File System Protocol

   This chapter introduces the Direct Access File System (DAFS) proto-
   col. It describes the technology trends that created the need for
   DAFS and how DAFS fulfills the need.

   The need for DAFS arose out of three trends. The first two trends
   involve the deployment of large systems. The third trend is related
   to new networking technologies.

1.1.  New System Trends

   The first trend is the separation of storage systems from application
   servers. This separation enables storage to be managed and scaled
   independently from the applications, operating systems, and machine
   architectures that are attached to the storage system. Typical appli-
   cations include databases and collaboration software. The storage can
   be accessed either through block access protocols (for example, SCSI)
   or file access protocols (for example, NFS, CIFS). However, many
   choose file access protocols because of the following benefits:

   o  Hide storage details

      File access protocols hide the details of the underlying storage
      system from application server software while enabling the file
      system resident in the file server to take advantage of storage
      system geometry. This keeps applications running smoothly without
      retuning after each change in storage capacity.

   o  Enable controlled data management

      File access protocols provide fine-grained data management. Data
      access permission, storage utilization, backup, and even disaster
      recovery can be controlled at the individual file and user level.
      Data management operations affect the specified application data
      only, rather than all the blocks in the volume.

   o  Off-load application servers

      File access protocols can relieve application servers of running
      file system software and reduce application server I/O
      requirements by eliminating the need to transfer file system
      meta-data. This is sometimes application-dependent, because
      Network Attached Storage (NAS) protocols can add TCP/IP processing
      overhead, though in general the overhead is no worse than native
      file system overhead.

   The second trend is the rapid growth of the Internet. This growth
   requires service providers to develop architectures that are
   resilient to failure and can rapidly scale in both computing power
   and storage capacity. The resulting designs spread the service load
   across a set of application servers. The application servers can be
   large machines or relatively small ones. This architecture is
   resilient in that if one application server fails another can take
   its place. The architecture is also scalable, because extra computing
   resources can be added by simply adding more application servers.
   Typical applications include email, news, web servers, geographical
   information systems, and clustered databases.

   Often these scalable designs also separate the storage from the
   application servers. Many service providers choose file access proto-
   cols because of the benefits they provide with storage separation, as
   discussed above, but in addition:

   o  Simplify data sharing

      File access protocols enable data to be easily shared even among
      heterogeneous systems. The file access paradigm is the same
      whether processes are sharing file data on a single machine or on
      a distributed system. This provides all the application servers
      access to a common pool of data for load balancing. It also allows
      another application server to take over data previously accessed
      by a failed application server.

1.1.1.  Local File-Sharing Architecture

   When used with file access protocols, the system architectures
   described above are referred to as local file-sharing architectures.
   In both cases, the system consists of a limited number of application
   servers and is typically geographically constrained (within a data
   center) and under the control of a single set of administrators who
   are responsible for configuration and maintenance.

   The application servers are connected to storage over a dedicated,
   high-speed interconnection network. The application servers are
   relatively homogeneous in both hardware and software, and typically
   run a limited set of high-performance applications. In contrast, wide
   file-sharing involves a large number of diverse machines that
   typically provide direct services to end users.

1.2.  New Networking Technology Trends

   The third trend is the advent of standard memory-to-memory
   interconnection networks. These networks grew out of the research in
   tightly coupled distributed systems. The tightly coupled
   interconnection networks (sometimes called Cache Coherent Non-Uniform
   Memory Access, or CC-NUMA) were highly proprietary and were tightly
   integrated with particular computer architectures. Eventually, more
   loosely coupled versions (not cache coherent, i.e., Non-Uniform
   Memory Access or NUMA) that could use generic I/O interfaces, such as
   PCI, were developed. All these interconnect technologies support some
   form of remote memory access and are designed for low latency and
   high throughput. They are intended for use within data centers and
   are not designed to support high-latency, wide-area data transport.
   These are sometimes called System Area Networks. Examples of these
   networks include Virtual Interface Architecture [VIA], [VIDG],
   InfiniBand Architecture [IB], and the WARP protocol for the Internet
   [WARP].

1.2.1.  Direct Access Transport

   This section describes the set of transport capabilities that the
   DAFS protocol depends on. These capabilities are referred to as the
   Direct Access Transport (DAT) and are defined in Appendix B, "DAT
   Semantics". The DAT semantics are the minimal set of transport
   capabilities that DAFS requires to provide high-performance DAFS
   implementations. The DAT semantics can be mapped onto networks that
   support memory-to-memory operations, such as Virtual Interface
   Architecture, InfiniBand Architecture, and WARP. DAT does not define
   a specific transport layer interface, but describes the functionality
   and concepts necessary to support the DAFS protocol.

   The DAT-based network provides two fundamental capabilities beyond
   those of traditional networking architecture. The first capability is
   called Remote Direct Memory Access (RDMA), which is the ability to
   move data directly between a local memory buffer and a specified
   memory buffer on a remote node. The second capability allows
   application software to directly access Channel Adapter hardware (a
   Channel Adapter is sometimes called a Network Interface Card or
   Network Adapter), bypassing the operating system. This operating
   system bypass capability enables application programs to directly
   address the hardware and to initiate I/O operations without operating
   system intervention. Channel Adapters that support DAT capability can
   be implemented on a variety of interconnection networks such as Fibre
   Channel, Ethernet, InfiniBand, and proprietary fabrics.

   The advantages of a Channel Adapter with DAT capability are the
   following features implemented on the Channel Adapter:

   o  Packet fragmentation and reassembly

   o  Reliable data delivery

   o  Multiplexing and demultiplexing data from different connections

   o  Checksum computations

   However, these advantages can also be had with traditional networking
   by implementing transport protocols such as TCP in a Channel Adapter.
   The IETF WARP proposal defines a protocol that provides such
   capabilities for TCP and SCTP. The key advantages of DAT are that
   RDMA operations not only provide applications with a mechanism to
   separate bulk data transport from control information, they also
   provide a way to specify exactly where the data belongs. Traditional
   message transfer operations enable only the destination target of a
   message to specify the specific location on the destination node
   where the message payload is deposited. Remote memory writes allow
   the operation initiator to specify the target memory location on the
   destination node. Remote memory read operations allow the operation
   initiator to specify both the remote memory location that is to be
   the source of a data "fetch" operation and the local destination
   where the fetched remote memory contents are to be deposited.
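
   The difference between these three transfer styles can be sketched
   with a toy Python model. This is an illustration only and not a DAT
   interface: the Node class, its register, send, receive, rdma_write,
   and rdma_read methods, and the buffer handles "cbuf" and "sbuf" are
   all hypothetical names, and a bytearray stands in for a registered
   memory buffer. Memory protection and completion reporting are
   omitted.

```python
class Node:
    """Toy node whose bytearrays stand in for registered memory buffers."""

    def __init__(self):
        self.memory = {}   # buffer handle -> bytearray
        self.inbox = []    # payloads delivered by send, awaiting receive

    def register(self, handle, size):
        self.memory[handle] = bytearray(size)

    def _region(self, handle, offset, length):
        # Reject transfers that would fall outside the named buffer.
        buf = self.memory[handle]
        if offset < 0 or length < 0 or offset + length > len(buf):
            raise ValueError("transfer falls outside the registered buffer")
        return buf

    def send(self, peer, payload):
        # Send-receive: the initiator supplies only the payload.
        peer.inbox.append(bytes(payload))

    def receive(self, handle, offset=0):
        # The receiver alone decides where the payload is deposited.
        payload = self.inbox.pop(0)
        buf = self._region(handle, offset, len(payload))
        buf[offset:offset + len(payload)] = payload

    def rdma_write(self, peer, local, local_off, remote, remote_off, length):
        # RDMA write: the initiator names the target location on the peer.
        src = self._region(local, local_off, length)
        dst = peer._region(remote, remote_off, length)
        dst[remote_off:remote_off + length] = src[local_off:local_off + length]

    def rdma_read(self, peer, remote, remote_off, local, local_off, length):
        # RDMA read: the initiator names the remote source and the
        # local destination.
        src = peer._region(remote, remote_off, length)
        dst = self._region(local, local_off, length)
        dst[local_off:local_off + length] = src[remote_off:remote_off + length]


client, server = Node(), Node()
client.register("cbuf", 16)
server.register("sbuf", 16)

client.memory["cbuf"][0:5] = b"hello"
client.rdma_write(server, "cbuf", 0, "sbuf", 4, 5)   # lands at sbuf[4:9]

server.memory["sbuf"][0:4] = b"DAFS"
client.rdma_read(server, "sbuf", 0, "cbuf", 8, 4)    # lands at cbuf[8:12]

client.send(server, b"ctl")
server.receive("sbuf", 12)                           # server picks sbuf[12:15]
```

   In the sketch, only the receive call lets the destination node
   choose the placement; both RDMA calls are driven entirely by the
   initiating node, which is the property the paragraph above
   describes.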

   The advantages are significant. Consider a typical network file
   access protocol like NFS. In NFS, each I/O operation or I/O operation
   reply, such as read or write, is embodied in a header. The header is
   inserted into the network byte stream followed by any user data. The
   receiver needs to parse the header and determine the appropriate
   place to put any user data that follows. For example, during a read
   operation the requester needs to parse the reply packet header to
   determine with which of the many possible outstanding read operations
   this data is associated. Then it needs to determine the destination
   data buffer associated with the request and copy the data into the
   buffer or otherwise cause the data to be copied there. RDMA opera-
   tions allow the file server to directly place the data into a desti-
   nation buffer specified in the request without any parsing or copy-
   ing. These advantages combine to provide extremely low-overhead and
   low-latency messaging and bulk data transfer.
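
   The two reply paths can be contrasted with a small Python model. It
   is a sketch under stated assumptions, not the NFS or DAFS wire
   format: the 8-byte header layout (request id and data length), the
   function names, and the pending-request table are all hypothetical,
   and a slice assignment into a bytearray stands in for an RDMA write
   performed by the server.

```python
import struct

# Hypothetical reply header: request id and data length, network byte order.
HEADER = struct.Struct("!II")


def inline_reply(request_id, data):
    """Server side, inline style: header and user data share one stream."""
    return HEADER.pack(request_id, len(data)) + data


def handle_inline_reply(stream, pending):
    """Client side: parse the header, match the request, copy the data."""
    request_id, length = HEADER.unpack_from(stream, 0)
    buf = pending.pop(request_id)      # which outstanding read is this?
    start = HEADER.size
    buf[:length] = stream[start:start + length]   # copy out of the stream
    return request_id, length


def direct_read(file_data, offset, length, client_buf):
    """Server side, direct style: place the data straight into the buffer
    named in the request; the reply then needs to carry only a status."""
    client_buf[:length] = file_data[offset:offset + length]
    return length


file_data = b"0123456789abcdef"

# Inline path: two reads are outstanding, and the reply for request 9
# arrives first, so the client must parse the header to find the buffer.
pending = {7: bytearray(4), 9: bytearray(4)}
buf9 = pending[9]
handle_inline_reply(inline_reply(9, file_data[4:8]), pending)

# Direct path: the request already named the destination buffer.
direct_buf = bytearray(4)
direct_read(file_data, 8, 4, direct_buf)
```

   In the inline path the client performs the header parse, the lookup,
   and the copy for every reply. In the direct path none of that work
   is done by the client, because the request itself identified where
   the data belongs.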

1.3.  The DAFS Opportunity

   Local file-sharing architectures and memory-to-memory interconnection
   networks are ideally suited for one another. One of the main reasons
   local file-sharing architectures are deployed is to gain high
   throughput and high performance. A file access protocol using such a
   network enables extremely low-overhead access to shared data. The
   protocol off-loads file system processing and meta-data I/O from the
   application servers and eliminates protocol processing overhead,
   while it preserves the advantages of file access. Using the virtual
   interface capability provides local file-sharing applications more
   control over the high-performance data path. Any performance issues
   can be addressed directly in application libraries without requiring
   OS patches.


   Lastly, note that current file access protocols were designed for a
   wide-area file-access environment. Local file-sharing architectures
   use the current protocols, but the applications sometimes have to
   work around specific deficiencies [Christianson] in the areas of file
   locking and fencing. A file access protocol specifically designed for
   local file sharing environments can provide the necessary semantics
   to improve application performance in these situations and eliminate
   many complexities of current protocols that were designed to deal
   with widely varying latencies and unreliable networks.


2.  DAFS Overview

   This chapter provides an overview of DAFS and describes the following
   topics:

   o  DAFS goals

   o  Local file-sharing requirements

   o  The Direct Access Transport

   o  The DAFS protocol

2.1.  DAFS Goals

   The Direct Access File System (DAFS) is a file access and management
   protocol designed for local file-sharing or clustered environments.
   It addresses two primary goals:

   o  Provide low-latency, high-throughput, and low-overhead data move-
      ment that takes advantage of modern memory-to-memory networking
      technologies.

   o  Define a set of file management and file access operations for
      local file-sharing requirements.

   The DAFS protocol takes advantage of system area networks that pro-
   vide Direct Access Transport (DAT) capabilities. The DAFS protocol
   defines file access operations that use remote memory-to-memory copy
   and other high performance primitives provided by DAT.

   The current revision of DAFS borrows heavily from the IETF NFS
   Version 4 specification [Shepler] to provide a full set of file
   management operations. Although recent enhancements to NFS are often
   directed toward improvements for "wide-sharing" environments, a large
   number of NFS file operations define basic semantics fully
   appropriate for use in a local file-sharing architecture.

   In areas where DAFS is not intended to add significant value beyond
   existing systems, it seems best to build on that work, rather than
   duplicate the effort. We'd like to explicitly acknowledge that many
   of the file operations defined by DAFS are either based on or are
   directly a result of work done by authors of the NFS v4 specification
   and contributors to that IETF working group.

2.2.  Local File-Sharing Requirements

   Local file-sharing has a number of unique requirements for file
   access protocols:

   o  Optimize for high-throughput, low-latency networks

      Local file-sharing architectures use high-throughput, low-latency
      networks. Current file access protocols are optimized for
      general-purpose internetworking, in which latencies can vary
      dramatically and packet processing is expensive. Memory-to-memory
      networks have low latencies and very low packet processing
      overhead.

   o  Optimize for high-throughput, low-overhead client implementations

      In the local file-sharing environment, client machines often run
      one application that is trying to achieve high throughput by hav-
      ing many operations pending on the storage system and by fully
      utilizing client CPU resources. This is a different environment
      from wide sharing, where software running on the client machine
      is typically only trying to service a single user's requests. In
      a wide-sharing environment, the client software issues only a
      limited number of concurrent operations on the storage system and
      client CPU resources are usually not fully utilized.

   o  Support for different operating system file access semantics

      The local file-sharing environment typically consists of similar
      client machines running the same base operating system. However,
      the base operating system can differ between different local
      file-sharing applications. At a minimum, DAFS needs to be able to
      support the file access semantics for UNIX and NT.

   o  High-speed consistent locking

      Many local file-sharing applications share data among clients.
      Typically, this is accomplished by locking and unlocking the
      shared data. Older NFS protocols have loose data consistency and
      require complete lock and unlock messages for each interval of
      shared access, resulting in relatively low performance. DAFS needs
      to enforce data consistency while locked, allow lock caching, and
      provide for fast transfer of cached locks between client machines.

   o  Client failure recovery

      When a client fails, it is unacceptable that other clients are
      locked out for long intervals and unable to access data that was
      in use by the failed client. At the same time, releasing the
      locks of a failed client needs to support orderly recovery
      so as not to compromise the data integrity.

   o  File server reboot or network interruption recovery

      Applications SHOULD be able to survive a file server reboot or a
      temporary network interruption without failing.

   o  File server failover

      Applications SHOULD be able to continue without failing when a
      file server fails over to an alternate file server that has a
      consistent copy of the data. The failover SHOULD be supported in
      the DAFS protocol and not rely on transport-level routing tricks.

   o  Fencing

      Clustered application servers maintain their own notion of nodes
      that are considered a part of the cluster. Nodes ejected from the
      cluster need to be prevented from accessing shared data.

   o  Online migration

      Local file-sharing architectures are intended to be highly avail-
      able systems that go long periods without need for reboot. They
      are also intended to enable storage scalability. For this reason
      it SHOULD be possible to move file systems (or finer units of
      data) among the available file servers without requiring applica-
      tion servers to reboot to connect to the new data location.

   o  Security

      Memory-to-memory networking, by its very nature, requires some
      trust between machines. (For example, exporting memory with only a
      small protection key in hardware.) However, DAFS SHOULD provide
      some level of user authentication to meet the general need for
      trust in this environment.

   o  Flow control

      Local file-sharing client machines are typically high-speed appli-
      cation servers. The file server needs to be able to throttle each
      client to ensure fairness and to avoid file server congestion.

   o  Enhanced locking

      Existing approaches to file and byte-range locking are not ade-
      quate for data sharing between active processes. If the entire
      cluster crashes while a lock is held (for example, due to a power
      failure), the file data may be inconsistent after the cluster
      reboots because the lock was broken. A better semantic would be
      to have a "stateful" lock that informed the first lock attempt
      after a lock is broken that a failure occurred. An even better
      semantic would be to roll back file changes to the state of the
      file at the time the broken lock was granted.

2.3.  Direct Access Transport

   General-purpose computer networks are designed for a very open and
   diverse environment. Computers of diverse types in many widely
   dispersed locations running a variety of software from different
   manufacturers and under the control of different administrative
   domains can communicate with each other. The communication can occur
   at any time using many different protocols that pass through several
   different network types and switches that can fragment and reorder
   packets. The underlying network protocols are designed to deal with
   these situations, but with significant host cost in both CPU utiliza-
   tion and host memory requirements. Some of the sources of this over-
   head are:

   o  Network packet fragmentation and reassembly

   o  Multiplexing and demultiplexing data from different connections

   o  Realignment of user data following transmission

   o  Checksum computations

   o  Buffer space allocations sized for largest transfer unit

   o  Operating system buffer copying and various overhead costs.

   The Direct Access Transport semantic provides a set of standard
   facilities that address many of the deficiencies in standard network-
   ing transport protocols. Furthermore, DAT mapped onto
   memory-to-memory networks such as FC-VI [T11-FCVI], VI/TCP [Dicecco],
   IB, and WARP offloads most of this work onto a Channel Adapter.

2.3.1.  DAT Glossary

   The following terms are useful in describing the Direct Access Tran-
   sport.

   Channel Adapter

      Channel Adapter is a host-resident device that transfers messages
      to and from host memory associated with a specific Endpoint and a
      Fabric.

   Channel Adapter Address

      Channel Adapter Address is the address of a Channel Adapter on
      the Fabric.

   Connection Qualifier

      The Connection Qualifier is a value that the Connection Manager
      uses to associate an incoming Connection request with the entity
      providing the service.

   DAT Consumer

      DAT Consumer is an Upper Layer Protocol or application that
      requires Direct Access Transport services.

   DAT Provider

      DAT Provider is the mechanism that provides the Transport services
      for a Direct Access application.

   Data Transfer Completion (DTC)

      DTC is the status of a completed data transfer operation.

   Data Transfer Operation (DTO)

      DTO is a requested data movement transfer submitted to a DAT Pro-
      vider.

   Endpoint

      Endpoint is the local part of a Connection that supports posting
      data transfer operation requests.

   Fabric

      A Fabric is a network with RDMA capabilities.

   Operation Type

      The Operation types in DAT are Send, Receive, RDMA Read or RDMA
      Write data transfer operations (DTO).

   RDMA

      Remote Direct Memory Access is an operation involving access of
      local memory by the remote Endpoint. There are two RDMA opera-
      tions: RDMA Read and RDMA Write.

   RDMA Memory Region Context (RMR Context)

      RDMA Memory Region Context (RMR Context) is a representation for
      an arbitrary-sized, registered, contiguous virtual space that
      belongs to a Channel Adapter so it can support Remote DMA opera-
      tions on the Connection whose local Endpoint belongs to the Chan-
      nel Adapter.

   RMR Target Address

      RMR Target Address specifies the memory address within a region of
      memory represented by RDMA Memory Region Context. (The specifica-
      tion can be either by virtual address or offset from the start of
      the memory represented by the RMR Context.)

2.3.2.  DAT Description

   DAT specifies a connection-oriented, peer-to-peer communication
   architecture. A pair of hosts that want to communicate need to first
   establish a connection between them through a pair of connected end-
   points. The mechanisms for the connection establishment and for con-
   nection endpoint creation are specific to individual transports that
   provide DAT semantics. Each node can have many connections to the
   same remote node or to other nodes. DAT is designed so that user
   processes can initiate data transfer with low overhead and without
   operating system intervention. There are three basic data transfer
   operations (DTO):

   o  Send

      The sender's DAT Provider forwards the payload of a send DTO into
      the memory of the receiver specified by a receive DTO on the other
      side of the connection. Upon the completion of the receive DTO
      corresponding to the send DTO, the remote DAT Consumer is noti-
      fied.

   o  Remote DMA (RDMA) write

      The originator copies the payload of an RDMA Write DTO from the
      local memory to a remote memory on the remote node identified by
      an RMR Target Address and an RDMA Memory Region Context (RMR Con-
      text) created by the remote DAT Consumer of the connection.

   o  Remote DMA (RDMA) read

      The originator of an RDMA Read copies the payload of an RDMA Read
      DTO from a remote memory on the remote node identified by an RMR
      Target Address and an RDMA Memory Region Context (RMR Context)
      created by the remote DAT Consumer into local memory.

   RDMA write and RDMA read provide remote memory access without
   receiver software intervention. They provide the basic bulk data
   transfer primitives. The send operation provides fast messaging. Send
   can transfer any amount of data, but is typically used for smaller
   messages that can contain an RMR Context and an RMR Target Address
   for the RDMA operations to use. It is important to understand that
   DAT does not specify a physical implementation nor define an API. It
   merely specifies a common communication style and set of capabili-
   ties. Further information is available at:

   o  Appendix B. "DAT Semantics" defines DAT capabilities that DAFS
      requires.

   o  Appendix D. "DAFS Mapping to VI Architecture" defines a mapping of
      DAT onto Virtual Interface Architecture.

   o  Appendix E. "DAFS Mapping to InfiniBand Reliable Connection" pro-
      vides a mapping of DAT onto InfiniBand HCA.
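
   The three data transfer operations described in this section can be
   illustrated with a small in-process model. This is only a sketch
   under stated assumptions: DAT defines no API, so the Endpoint class,
   its method names, and the use of small integers as RMR Contexts are
   all hypothetical, and Python byte arrays stand in for registered
   memory.

```python
"""Toy, in-process model of the three DAT data transfer operations."""
from collections import deque


class Endpoint:
    """One side of a connection; every name here is hypothetical."""

    def __init__(self):
        self.peer = None
        self.regions = {}         # RMR Context -> registered bytearray
        self.posted = deque()     # receive buffers posted by the Consumer
        self.completed = deque()  # payloads of completed receives, in order
        self._next_context = 1

    def register(self, size):
        """Register memory and return its RMR Context (a small integer)."""
        context = self._next_context
        self._next_context += 1
        self.regions[context] = bytearray(size)
        return context

    def post_recv(self, size):
        """Post a receive DTO backed by a buffer of the given size."""
        self.posted.append(bytearray(size))

    def send(self, payload):
        """Send: place the payload in the peer's next posted receive."""
        if not self.peer.posted:
            raise RuntimeError("no receive DTO posted on the remote endpoint")
        buf = self.peer.posted.popleft()
        if len(payload) > len(buf):
            raise RuntimeError("payload larger than posted receive buffer")
        buf[:len(payload)] = payload
        self.peer.completed.append(bytes(buf[:len(payload)]))

    def rdma_write(self, context, target, payload):
        """RDMA Write: copy local data into the peer's registered memory."""
        region = self.peer._region(context, target, len(payload))
        region[target:target + len(payload)] = payload

    def rdma_read(self, context, target, length):
        """RDMA Read: copy data out of the peer's registered memory."""
        region = self.peer._region(context, target, length)
        return bytes(region[target:target + length])

    def _region(self, context, target, length):
        """Validate an (RMR Context, RMR Target Address, length) triple."""
        region = self.regions.get(context)
        if region is None or target < 0 or target + length > len(region):
            raise RuntimeError("invalid RMR Context or RMR Target Address")
        return region


def connect():
    """Create two connected endpoints."""
    a, b = Endpoint(), Endpoint()
    a.peer, b.peer = b, a
    return a, b
```

   In this sketch a client would register a buffer, pass the returned
   RMR Context to the server in a send, and the server would then fill
   the buffer with rdma_write() without any further client software
   involvement.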

   DAT functionality can eliminate most of the following networking pro-
   tocol inefficiencies:

   o  Fragmentation and reassembly

      Channel Adapters typically perform all fragmentation and reassem-
      bly. In addition, the DAT RDMA operations are self-addressing and
      this enables the Channel Adapter to break up a large transfer
      operation into independent units with a length that is appropriate
      to the underlying packet size.

   o  Multiplexing and demultiplexing data from different connections

      Channel Adapters allow multiple DAT connections to be established.
      Data sent by different DAT connections are multiplexed and demul-
      tiplexed directly by the hardware. An RDMA Memory Region Target
      Address (RMR Target Address) is only valid within its RDMA Memory
      Region Context (RMR Context). Typically, RMR Contexts are only
      valid within the connection where they are used. The hardware
      automatically identifies the connection to which the RDMA data
      belongs and translates the RDMA address to the underlying physical
      address in that context.

   o  Realignment of user data following transmission

      A DAT-capable node can place data at specific addresses in a
      remote node. This lets senders separate control information from
      bulk data, and place bulk data in properly aligned locations.
      Receivers need not extract the actual user data from the stream of
      packets and copy it to properly aligned buffers.

   o  Checksum computation

      Networks that provide DAT semantics typically do not require an
      end-to-end software checksum, because the processing at inter-
      mediate switches is extremely simple and the data is protected by
      the underlying cell checksum that is checked by hardware.

   o  Buffer space allocations sized for largest transfer unit

      Networks that support DAT semantics can provide senders with
      immediate useful knowledge of the state of buffers in the
      receiver. Senders can also place large send DTO messages and small
      send DTO messages into appropriately sized buffers on the
      receiver. The receiver need not have a large number of maximally
      sized packet buffers reserved for the networking hardware just in
      case one or more client machines send to it. In addition, DAT-
      capable networks usually have higher level protocols that separate
      buffers on a per sender basis providing further segregation of
      buffer resource utilization.

   o  Operating system buffer copying and various overhead costs

      DAT allows applications to completely bypass operating systems for
      data transfer. In addition, DAT provides an efficient communica-
      tion model permitting large I/O throughput at little CPU cost.
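
   Because each RDMA operation carries its own target address, a large
   transfer can be split into units that are placed independently. The
   following sketch illustrates the idea; the function names and the
   max_unit parameter (standing in for the underlying packet payload
   size) are assumptions made for illustration, not part of DAT.

```python
def fragment_rdma_write(target_address, payload, max_unit):
    """Split one RDMA Write into independent (address, data) units.

    Each unit carries its own target address, so the units can be
    placed in any order without a reassembly buffer.
    """
    if max_unit <= 0:
        raise ValueError("max_unit must be positive")
    return [
        (target_address + offset, payload[offset:offset + max_unit])
        for offset in range(0, len(payload), max_unit)
    ]


def place(memory, units):
    """Apply units to a bytearray, deliberately in reverse order."""
    for address, data in reversed(units):
        memory[address:address + len(data)] = data
    return memory
```

   Placing the units in reverse order, as place() does, still produces
   the original payload at the target address, which is the property
   that lets a Channel Adapter avoid reassembly buffering.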

2.3.3.  DAT Requirements

   DAFS depends on the following network transport capabilities that are
   provided by DAT:

   o  DAT supports a connection that provides send-recv message
      transfers and RDMA Read and Write operations.

   o  DAT supports reliable connections that provide the following
      features:

      o  All data transfer operations submitted to the DAT Provider com-
         plete successfully in the absence of errors, with data
         delivered uncorrupted, in the order defined by the ordering
         rules.

      o  Corruption of the data delivered to the local Consumer is
         detected as an error and reported to the Consumer.

      o  Data loss (inability to deliver data to the remote endpoint of
         the connection, or to the local endpoint for RDMA Read) is
         detected as an error and reported to the Consumer.

      o  Upon detection of an error, the connection is broken and all
         outstanding and in-progress data transfer operations complete
         with an error.

      o  There is a one-to-one correspondence between send operations on
         one endpoint of the connection and recv operations on the other
         endpoint of the connection.

      o  There is no correspondence between RDMA operations on one end-
         point of the connection and recv or send data transfer opera-
         tion on the other endpoint of the connection.

      o  Data Transfer Operation Completion means that the Consumer can
         reclaim resources associated with the operation including the
         memory that contains the data.

      o  Ordering rules:

         o  The data payload for the send operation matching a receive
            operation is delivered into the receive-indicated memory
            buffer prior to the receive completion.

         o  Receive operations on a connection are completed in the
            order of posting of their corresponding sends.

         o  Each RDMA write operation posted on a connection prior to a
            send operation has its data payload delivered to the target
            memory region prior to the completion of the receive opera-
            tion matching that send.

   o  DAT supports multiple connections between the same or different
      pairs of nodes (client-server pairs).

   o  An RDMA Memory Region Context (RMR Context) supports RDMA opera-
      tions for the set of DAT connections that are associated with it.
      The association between a connection and an RMR Context is esta-
      blished by the local endpoint of the connection where the Memory
      Region is located.

   o  The same RMR Context can be associated with multiple connections.
      In addition, a connection can have multiple RMR Contexts associ-
      ated with it.

   o  The DAT Provider allows the DAT Consumer to create multiple RDMA
      Memory Region Contexts from the same memory.

   o  DAT supports connection management including the client-server
      connection establishment and connection termination by either side
      of the connection.
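
   The delivery and ordering requirements above can be illustrated
   with a toy model of one direction of a reliable connection. The
   class and method names are hypothetical, and treating a send that
   finds no posted receive as a connection-breaking error is an
   assumption of this sketch rather than a stated DAT requirement.

```python
from collections import deque


class ReliableConnection:
    """One direction of a reliable connection (hypothetical model)."""

    def __init__(self, target_memory_size):
        self.memory = bytearray(target_memory_size)  # remote RDMA target
        self.posted_recvs = deque()  # receive tags posted by the receiver
        self.in_flight = deque()     # sender's DTOs, in posting order
        self.completions = []        # (tag, status, payload) at the receiver
        self.failed = []             # sender DTOs completed with an error
        self.broken = False

    def post_recv(self, tag):
        self.posted_recvs.append(tag)

    def post_rdma_write(self, address, data):
        self.in_flight.append(("write", address, bytes(data)))

    def post_send(self, data):
        self.in_flight.append(("send", None, bytes(data)))

    def deliver(self):
        """Process in-flight DTOs strictly in posting order."""
        while self.in_flight and not self.broken:
            dto = self.in_flight.popleft()
            kind, address, data = dto
            if kind == "write":
                if address < 0 or address + len(data) > len(self.memory):
                    self._break(dto)
                else:
                    self.memory[address:address + len(data)] = data
            elif not self.posted_recvs:
                self._break(dto)  # assumed: send with no posted receive
            else:
                # Every earlier RDMA write is already in memory here, so
                # the receiver sees its data when this receive completes.
                tag = self.posted_recvs.popleft()
                self.completions.append((tag, "ok", data))

    def _break(self, failed_dto):
        """Break the connection and fail everything still outstanding."""
        self.broken = True
        self.failed = [failed_dto] + list(self.in_flight)
        self.in_flight.clear()
        while self.posted_recvs:
            tag = self.posted_recvs.popleft()
            self.completions.append((tag, "error", b""))
```

   A sender that posts an RDMA write followed by a send can therefore
   use the send as a "data is ready" notification: in this model, as in
   the ordering rules above, the written data is in place before the
   matching receive completes.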

   For more information on DAT, see the following:

   o  DAT defines the transport layer semantics necessary to support the
      DAFS protocol in Appendix B. "DAT Semantics".

   o  The mappings of DAT functionality on VI Architecture and Infini-
      Band memory-to-memory interconnection networks are provided in
      Appendix D. "DAFS Mapping to VI Architecture" and Appendix E.
      "DAFS Mapping to InfiniBand Reliable Connection".

2.3.4.  Physical Interconnect

   The DAFS architecture does not specify or mandate any specific physi-
   cal interconnect technology. However, the media chosen SHOULD exhibit
   the following characteristics:

   o  The interconnect needs to support the transport requirements of
      protocols that feature remote memory-to-memory communications.

   o  The interconnect SHOULD be high speed and low latency.

   o  The interconnect needs to be highly reliable. Media errors and
      connection breaks SHOULD be rare.

   Examples of physical interconnect Channel Adapters that provide the
   above features are FC-VI, VI/TCP, and IB HCA.

2.4.  DAFS Protocol

   The DAFS definition has two principal goals: to provide a
   high-performance file access solution by taking advantage of the
   remote memory-to-memory communication model, and to address the
   data-sharing needs of distributed local file-sharing applications.
   The main attributes of DAFS are:

   o  Client-server communication model

      DAFS uses a request-response message paradigm between the client
      and server for communication.  This method is used both for file
      operations initiated by the client and for "back-control" direc-
      tives initiated by the server.

   o  Session-based protocol that leverages underlying DAFS communica-
      tion channels

      DAFS establishes Sessions between the client and server that are
      used to simplify authentication and manage ongoing aspects of the
      communication. The DAFS protocol leverages underlying communica-
      tion channel primitives to allow control of errors on a communica-
      tion channel basis, rather than for each DAFS message.

   o  Security

      The remote memory-to-memory communication architecture, by its
      nature, requires some trust between machines. The DAFS protocol
      authenticates clients to servers and servers to clients. It can
      also authenticate individual users within a client-server Session.

   o  Optimized for high-throughput and low-latency networks

      Local file-sharing client machines can be relatively high-
      bandwidth, multiprocessor systems that can generate many threads'
      worth of load on the file server. DAFS is optimized for high
      throughput and takes advantage of the low latency characteristics
      of the network. The protocol supports multiple outstanding opera-
      tions within a single connection.

   o  Chaining

      DAFS allows a series of dependent operations to be submitted con-
      currently without waiting for intermediate results. The dependent
      operations can be pipelined without stalling.

   o  Flow control

      Local file-sharing client machines are typically high-speed
      servers. DAFS provides mechanisms for the file server to throttle
      clients independently to ensure fairness and to avoid file server
      congestion.

   o  Internationalization

      All user-accessible strings use internationalized representations.

   o  Multiple operating system file access semantics

      The local file-sharing environment is typically composed of simi-
      lar client machines running the same base operating system. How-
      ever, the base operating system may differ between different local
      file-sharing applications. DAFS supports the file access
      semantics for both UNIX and NT.

   o  High speed consistent locking

      Many local file-sharing applications share data. Typically, this
      is done by locking and unlocking the shared data. DAFS enforces
      data consistency while files are locked, and allows lock caching
      through delegation.

   o  Enhanced file locking

      DAFS provides "stateful" locks that inform the first lock attempt
      after a lock is broken that this event occurred. This addresses a
      problem with existing approaches that may leave file data incon-
      sistent following a power-failure-induced crash of an entire
      cluster.

      DAFS also provides "rollback" locks that roll back file changes to
      the state of the file at the time the broken lock was granted.

   o  Atomic write append

      In addition to general support for enclosing file access opera-
      tions within file locking semantics, DAFS provides an optimized
      mechanism for the most common case: atomically appending new data
      to the end of a file.

   o  Client failure recovery

      When a client fails, it is unacceptable that the failed client
      locks out other clients for long intervals from accessing data in
      use. DAFS uses lease-based locks to ensure timely file availabil-
      ity after client failure.

   o  File server reboot or network interruption recovery

      Applications can recover when a file server reboots or the network
      suffers a temporary interruption.

   o  File server failover

      Applications need not fail when a file server fails over to an
      alternate file server that has a consistent copy of the data. DAFS
      supports failover and does not rely on network routing tricks.

   o  Fencing

      Clustered application servers often maintain their own notion of
      nodes that are considered a part of the cluster. DAFS prevents
      client systems ejected from the cluster from accessing shared
      data.
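
   The interaction between lease-based locks and "stateful" locks can
   be sketched as follows. The LockTable class, its return strings, and
   the explicit "now" argument are hypothetical; the actual DAFS lock
   operations are not defined in this section.

```python
class LockTable:
    """Lease-based locks with a "broken" indication (hypothetical API)."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.holders = {}    # resource -> (client, lease expiry time)
        self.broken = set()  # resources whose last lock was broken

    def renew(self, client, now):
        """Renew every lease held by a client, e.g. on any message."""
        for resource, (holder, _) in list(self.holders.items()):
            if holder == client:
                self.holders[resource] = (holder, now + self.lease_seconds)

    def lock(self, client, resource, now):
        """Return "granted", "granted-after-break", or "denied"."""
        held = self.holders.get(resource)
        if held is not None:
            holder, expiry = held
            if holder != client and now < expiry:
                return "denied"
            if holder != client:
                # The lease ran out while the lock was held: break it.
                self.broken.add(resource)
        self.holders[resource] = (client, now + self.lease_seconds)
        if resource in self.broken:
            # Tell the first locker after a break that a failure occurred.
            self.broken.discard(resource)
            return "granted-after-break"
        return "granted"

    def unlock(self, client, resource):
        """Release a lock if the caller is the current holder."""
        if self.holders.get(resource, (None, 0))[0] == client:
            del self.holders[resource]
```

   Passing the time in explicitly keeps the sketch deterministic. A
   client that receives "granted-after-break" knows that the data may
   be inconsistent and can run its own recovery before proceeding.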

2.4.1.  DAFS Deployment Models

   DAFS is a client-server distributed file access protocol. It can be
   implemented on any underlying network that supports DAT capabilities.
   The implementation mappings are especially interesting for DAFS
   client implementations. A number of implementations are possible. The
   next few paragraphs describe a few basic types:

   1) Application uses DAFS via a user-level library that implements
      DAFS.

   2) Application uses DAFS via a user-level library that implements
      DAFS, but also implements a transparency layer that hides the
      details of the DAFS implementation.

   3) Application uses DAFS via a kernel-level DAFS file system imple-
      mentation.

   Implementation 1 implements client file access using the DAFS proto-
   col in a user library. The DAFS client library exports the file
   access primitives directly as an API, perhaps exposing issues like
   memory registration. The application itself does not change to use
   the new API, but its OS adaptation library does. Many high-
   performance applications have such adaptation libraries to make the
   application code easier to port among different operating systems.

   Implementation 2 uses the same DAFS client library as Implementation
   1, but adds a transparency library so that the user I/O library can
   behave as if it was supported by the underlying OS using the standard
   OS interfaces.

   Implementation 3 implements a DAFS client as a file system installed
   into the underlying kernel. The application accesses DAFS through the
   standard OS interfaces.

2.4.2.  DAFS File Name Space

   To access the DAFS namespace, when a DAFS client first comes up in a
   local sharing network, the client needs to enumerate a list of avail-
   able servers. While it is possible that the client might have a pre-
   configured set of servers, it is desirable for a "clean" client to be
   able to join the network, presuming the servers are willing to pro-
   vide it DAFS service. A number of name service mechanisms can be
   defined to provide this "bootstrap" name service. See Appendix A.
   "DAFS Name Service".

2.4.3.  DAFS Terminology

   This document uses the following terminology:

   Back-control Channel

      The Back-control channel is a communication channel between a DAFS
      client and server that is an OPTIONAL part of a DAFS Session. The
      primary purpose of the Back-control channel is to allow the
      server to send unsolicited messages to the client.

   Client-id-string

      The Client-id-string is a string that is selected by the client
      and is intended to uniquely identify that client. It is presented
      to the server when a Session is established and is opaque to the
      server. All Sessions created using the same client-id-string can
      be considered as being joined together representing the same
      client. This serves a similar purpose as the client-opaque-string
      in NFS Version 4.

   Client-verifier

      The Client-verifier is a 64-bit quantity, which identifies an
      instance of a client. The client-verifier is presented to the
      server at Session establishment along with the client-id-string.
      The server uses this to determine that all Sessions (which should
      have been disconnected) for the old instance of the client are
      gone and that all locks for that client need to be freed. This
      serves a similar purpose as the client verifier in NFS Version 4.

   Client-id

      Client-id is a 64-bit quantity chosen by the server that is used
      as a shorthand identifier for the client-id-string, client-
      verifier pair supplied in the connection request for a Session. It
      provides a component of subsequent lock owner identifiers for the
      client, and is associated with credentials supplied for use by the
      client.

   Communication channel

      A communication channel is a transport protocol level abstraction
      that provides a communication connection between two endpoints
      across a network. A DAFS communication channel provides the set of
      transport delivery and error handling requirements listed previ-
      ously in 2.3.3., "DAT Requirements". DAFS Protocol Specification
      1.0 makes an assumption that there is one-to-one correspondence
      between a DAFS communication channel and a DAT connection.

      Note: A many-to-one mapping of DAFS communication channels onto a
            DAT connection could be considered for addition later if
            this feature becomes a requirement, for example, for scala-
            bility purposes. At a minimum it will require adding
            Session-id to the header of DAFS messages.

   Operation Channel

      The Operation channel is a communication channel between a DAFS
      client and server that is a REQUIRED part of a DAFS Session. The
      primary purpose of the Operation channel is to allow the client
      to send operation request messages to the server.

   RDMA-read Channel

      The RDMA-read channel is a communication channel between a DAFS
      client and server that is an OPTIONAL part of a DAFS Session. The
      primary purpose of the RDMA-read channel is to allow the server
      to originate RDMA read operations targeted to the client.

   Response Cache

      The Response Cache is an OPTIONAL, server-maintained, Session-
      based cache that holds the results of recent state-modifying
      requests.

   Session

      A DAFS Session is an abstraction that allows a DAFS client and
      server to create and manage a collection of communication channels
      for exchanging messages.

   Session-id

      The Session-id is a 64-bit identifier used as a shorthand designa-
      tion for a communication Session between a client and server. The
      server returns it to the client when a Session is successfully
      established. It serves both to identify Sessions for recovery pur-
      poses and as evidence that the client still exists. In this latter
      role, it does not have to be explicitly sent, because it is
      implied by any message sent to the server on that Session. Because
      it is used for recovery, including recovery that could involve
      service failover between multiple DAFS servers, it MUST be unique
      across any set of DAFS servers that share a failure recovery
      mechanism.

   State-id

      The State-id is a 64-bit opaque quantity that is assigned by the
      server when a file is opened and serves as a shorthand representa-
      tion of the lockowner that has the file opened. The state-id is
      passed as a compact lockowner representation in file lock and
      close requests and is valid for any Session associated with the
      Client. In some respects, it is similar to the NFS Version 4
      state-id. However, in DAFS the state-id does not have to be unique
      across server reboots, because the DAFS Session detects server
      reboots. It does not have to change on each locking request,
      because there are not going to be delayed transmissions pending in
      routers. Finally, it plays no role in lease renewal, because any
      message for a Session associated with that client suffices to
      renew leases.
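
   The relationship between the client-id-string, client-verifier,
   Client-id, and Session-id defined above can be sketched from the
   server's point of view. This is a minimal illustration with
   hypothetical names; identifiers are drawn from a simple counter,
   whereas a real server would need 64-bit values that satisfy the
   uniqueness requirements stated above.

```python
import itertools


class SessionServer:
    """Server-side bookkeeping for client identity (hypothetical names)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.clients = {}   # client-id-string -> (verifier, client-id)
        self.sessions = {}  # session-id -> client-id
        self.locks = {}     # client-id -> set of locked resources

    def establish(self, client_id_string, verifier):
        """Return (client-id, session-id) for a newly created Session."""
        known = self.clients.get(client_id_string)
        if known is not None and known[0] != verifier:
            # A different verifier means a new instance of the client:
            # drop the old instance's Sessions and free its locks.
            old_id = known[1]
            self.sessions = {s: c for s, c in self.sessions.items()
                             if c != old_id}
            self.locks.pop(old_id, None)
            known = None
        if known is None:
            # First Session for this (client-id-string, verifier) pair.
            known = (verifier, next(self._ids))
            self.clients[client_id_string] = known
            self.locks[known[1]] = set()
        session_id = next(self._ids)
        self.sessions[session_id] = known[1]
        return known[1], session_id
```

   Two Sessions established with the same client-id-string and
   client-verifier share one Client-id in this sketch, while a changed
   client-verifier causes the state of the old client instance to be
   discarded.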


3.  Communication Model

   This chapter provides the basic design of the DAFS communication
   model and describes

   o  Session management

   o  Message handling.

3.1.  Session Management

   DAFS communication is a session-based protocol that utilizes a
   request-response model of message exchange between client and
   server. A DAFS Session provides a common communication environment
   between the client and server. The session design incorporates a
   number of long-lived attributes, including authentication and
   authorization, features related to segments of the file system name
   space, message flow control, and transport-level resource
   management. A Session MUST be established before DAFS file
   operations can be performed.

   Rationale: Session-based message transfer enables DAFS to take full
              advantage of a number of attributes of the local file-
              sharing environment. First, to make highly effective use
              of RDMA operations for the transfer of bulk data between
              client and server, most DAFS implementations might prefer
              to preallocate and advertise large transfer buffers on
              specific communication channels. DAFS makes use of the
              Session mechanism to assist in the management of these
              resources. Second, within the local file-sharing environ-
              ment, the trust relationship between client and server
              permits the use of a set of shorthand credentials to be
              associated with the Session after initial authentication
              has been completed. Third, under a Session-based paradigm,
              message exchanges between client and server can be managed
              to provide recovery semantics following system or subcom-
              ponent failure.

   DAFS Sessions have the following primary functions:

   o  Establishing and negotiating DAFS protocol options to be used dur-
      ing the Session

   o  Authenticating the client and server

   o  Linking lower-level connections (for example, DAFS communication
      channels) into a logical entity

   o  Providing context for credential management, message flow control
      management, DAFS file operations, and recovery operations.

3.1.1.  Security Model

   The DAFS security model is based on a trusted client-server relation-
   ship, as would be expected in a local file-sharing environment. To
   provide basic security and establish the trust relationship between
   client and server, authentication is performed as part of the initial
   communication for setting up a Session. This includes authenticating
   the client to server, and OPTIONALLY authenticating the server to
   client.

   After initial authentication has succeeded, the trusted client can
   specify alternate sets of credentials when performing normal DAFS
   operations. Clients can preregister multiple credentials with the
   server (that is, server-side credentials caching) and obtain in
   return opaque cookies to be used in subsequent DAFS operations to
   identify individual credentials to the server. This avoids repeated
   transmission and analysis of the identical credentials that would
   otherwise appear on numerous requests in the local sharing environ-
   ment.

   In addition, when a client (specified in the connection request by
   the Client-id-string field) authenticates multiple Sessions using the
   same authentication type and principal identifier, credentials
   registered on one Session are available to all of that client's Ses-
   sions, and are associated with the same opaque cookies on each of
   those Sessions.

   Rationale: Preregistration leverages the connection-oriented nature
              of the transport protocol, so that full credentials do not
              have to be passed and subsequently translated into an
              internal form by the server for each operation.
              Preauthentication is particularly valuable, because many
              local file-sharing applications use only one credential
              for the entire duration of the connection.

   A DAFS server MAY also support untrusted clients, which cannot alter
   their credentials once the Session is established. For more
   information, see 3.1.1.1.2., "Untrusted Clients".

3.1.1.1.  Authentication

   Each DAFS client MUST authenticate itself to a DAFS server as part of
   Session initialization. This is distinct from any credential cookie
   used for individual DAFS operations.


   The DAFS client and server can support one or more authentication
   mechanisms. The client can use a SECINFO operation to query which
   mechanism(s) are supported by the server.

   After client authentication has succeeded, OPTIONAL server authenti-
   cation can also be performed, to authenticate the server to the
   client.

3.1.1.1.1.  GSS API Authentication

   DAFS allows clients to authenticate themselves, and to represent
   other entities, including users and machines. During the connect
   phase, the client can authenticate its identity to the server, using
   DAFS_PROC_CLIENT_AUTH or DAFS_PROC_CLIENT_CONNECT_AUTH, and can
   request that the server authenticate its identity to the client,
   using DAFS_PROC_SERVER_AUTH. DAFS provides that this can be done with
   a high degree of security, by employing the General Security Service
   Application Program Interface (GSS API) in both the client and
   server. The DAFS protocol provides a GSS API flavor of authentica-
   tion, which provides the mechanism needed to exchange the tokens gen-
   erated by GSS_API between the client and the server.

   The GSS API [Linn] provides a generic wrapper for different security
   mechanisms. The most widely used security mechanism currently sup-
   ported by GSS API is Kerberos Version 5. The DAFS protocol provides
   sufficient support for authentication of clients under GSS_API,
   regardless of the underlying security mechanism.

   DAFS does not provide any facility for integrity or privacy services
   under GSS API at its protocol layer. This could be provided by a
   lower level network protocol used by DAFS. DAFS does not authenticate
   each and every packet exchanged in a DAFS Session. Rather, it authen-
   ticates the Session, and relies on the secure nature of the Session
   to prevent an interloper from interjecting rogue packets into the
   client or server.

   The normal usage of the GSS API by the client and server is as fol-
   lows: The client makes a call to GSS_Init_sec_context, specifying
   mutual authentication, but not requesting delegation, replay
   detection, or out-of-sequence detection. This returns to the client a
   gss_token, which is an opaque carrier for authentication information
   that will be sent to the server as part of the client authentication.
   Upon receiving the token in a DAFS_PROC_CLIENT_AUTH or
   DAFS_PROC_CLIENT_CONNECT_AUTH call, the server will make a call to
   GSS_Accept_sec_context. If successful, this will generate another
   gss_token, which needs to be returned to the client in the response
   to the procedure called.


   The interaction between client and server can require multiple phases
   to complete authentication. If the major status code returned by GSS
   API to the GSS_Accept_sec_context call is GSS_S_CONTINUE_NEEDED, then
   the client needs to make a subsequent GSS_Init_sec_context call, with
   the gss_token returned by the server as input, and then make another
   DAFS_PROC_CLIENT_AUTH request, simply for the purpose of delivering
   the gss_token returned by that call to GSS_Init_sec_context to the
   server.
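
   The multi-phase exchange described above can be sketched as a simple
   client loop. The following Python fragment is illustrative only and
   is not part of the protocol: ToyMechanism, authenticate, and the
   numeric status values are hypothetical stand-ins, not a real GSS API
   binding. Each loop iteration models one DAFS_PROC_CLIENT_AUTH
   request/response pair.

```python
# Hypothetical stand-ins for GSS API major status codes (values assumed).
GSS_S_COMPLETE = 0
GSS_S_CONTINUE_NEEDED = 1


class ToyMechanism:
    """Simulates a security mechanism that needs `legs` client tokens."""

    def __init__(self, legs):
        self.legs = legs
        self.accepted = 0

    def init_sec_context(self, input_token):
        # Client side: produce the next outgoing token. The first call
        # receives None; later calls receive the server's last token.
        return "client-token"

    def accept_sec_context(self, token):
        # Server side: consume one token and report whether the client
        # must send another one.
        self.accepted += 1
        if self.accepted >= self.legs:
            return GSS_S_COMPLETE, "server-token-final"
        return GSS_S_CONTINUE_NEEDED, "server-token-%d" % self.accepted


def authenticate(mech, max_rounds=8):
    """Run the client loop; return the number of auth requests sent."""
    server_token = None
    for round_no in range(1, max_rounds + 1):
        client_token = mech.init_sec_context(server_token)
        # Models one DAFS_PROC_CLIENT_AUTH request and its response.
        status, server_token = mech.accept_sec_context(client_token)
        if status == GSS_S_COMPLETE:
            return round_no
    raise RuntimeError("authentication did not complete")
```

   A mechanism that completes in one leg needs a single request; one
   that returns GSS_S_CONTINUE_NEEDED twice needs three.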

   Since the client specifies the mutual authentication flag in its
   GSS_Init_sec_context call, the client and server are mutually authen-
   ticated once the gss context is created, and there is no need for the
   client to make a DAFS_PROC_SERVER_AUTH call to authenticate the
   server. For this reason, the server response to DAFS_PROC_SERVER_AUTH
   with the AUTH_GSS security flavor is void.

   The gss context created is associated with a DAFS Session. For that
   reason, there is no exchange of gss context handles between the
   client and server. The client and server MUST both destroy the gss
   context associated with a DAFS Session when tearing down that Ses-
   sion.

   Depending on the type of authentication, re-authentication might be
   necessary periodically in order to renew the authentication. Before
   the authentication expires, the client SHOULD initiate a new sequence
   of DAFS authentication operations using either DAFS_PROC_AUTH (for
   the Operation channel) or DAFS_PROC_BIND (for optional channels).
   This sequence proceeds in the same manner as the initial authentica-
   tion sequence. Each channel MUST renew its authentication indepen-
   dently.

   Rationale: The GSS integrity and privacy features are unnecessary for
              use in a local file-sharing environment, where they add
              processing overhead without extra security.

3.1.1.1.2.  Untrusted Clients

   A DAFS server MAY support untrusted clients. An untrusted client has
   its identification and credentials established at initial Session
   connection time. The DAFS server automatically registers the initial
   set of credentials supplied during Session creation as the creden-
   tials to be used for the life of the Session.

   The following constraints are placed on an untrusted client:

   o  The client SHALL NOT use the PROC_REGISTER_CRED or
      PROC_RELEASE_CRED requests.

   o  The cred_handle for all requests MUST be zero.

   o  The AUTH_NONE authentication method is disallowed.

   o  The AUTH_DEFAULT, AUTH_GSS and AUTH_NAME authentication methods
      are allowed, if supported by the DAFS server.

   Rationale: AUTH_NONE provides no accountability or identification so
              no credentials are available to use for the registered
              defaults. Thus, it is disallowed.
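
   The constraints above amount to two checks, one at connection time
   and one per request. A minimal sketch follows, assuming that
   procedures and authentication methods are represented as strings;
   the function names are hypothetical and the draft does not define
   how a server enforces these rules.

```python
# Methods the draft allows for untrusted clients (if the server
# supports them). AUTH_NONE is deliberately absent.
ALLOWED_AUTH = {"AUTH_DEFAULT", "AUTH_GSS", "AUTH_NAME"}

# Requests an untrusted client SHALL NOT use.
FORBIDDEN_PROCS = {"PROC_REGISTER_CRED", "PROC_RELEASE_CRED"}


def untrusted_connect_ok(auth_method, server_supported):
    """Connection-time check: method is allowed and server supports it."""
    return auth_method in ALLOWED_AUTH and auth_method in server_supported


def untrusted_request_ok(proc, cred_handle):
    """Per-request check: no credential management, cred_handle is zero."""
    return proc not in FORBIDDEN_PROCS and cred_handle == 0
```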

3.1.1.2.  Credential Registration and Caching

   DAFS clients can pre-register a credential with the DAFS server for
   use during file operations, and in return, obtain an opaque creden-
   tial cookie. The cookies are used by the client in subsequent DAFS
   operations instead of passing full user credentials. The number of
   credentials that can be cached by the server for a Session is speci-
   fied during Session setup.

   After a client determines that a set of credentials is no longer
   needed, the client advises the server that the set of credentials can
   be released.
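
   Server-side credential caching can be pictured as a bounded table
   keyed by cookie. The sketch below is an assumption-laden
   illustration: the CredentialCache class, the use of small integers
   as cookies, and the error raised when the table is full are all
   hypothetical. The draft requires only that cookies be opaque to the
   client and that the table size be fixed during Session setup.

```python
import itertools


class CredentialCache:
    """Bounded cookie-to-credential table for one client (sketch)."""

    def __init__(self, max_credentials):
        self.max_credentials = max_credentials
        self._by_cookie = {}
        self._next_cookie = itertools.count(1)

    def register(self, credential):
        # Refuse registration once the negotiated limit is reached.
        if len(self._by_cookie) >= self.max_credentials:
            raise RuntimeError("credential cache full")
        cookie = next(self._next_cookie)
        self._by_cookie[cookie] = credential
        return cookie

    def lookup(self, cookie):
        # Resolve the shorthand cookie carried in a DAFS operation.
        return self._by_cookie[cookie]

    def release(self, cookie):
        # The client has advised that this credential is not needed.
        del self._by_cookie[cookie]
```

   Releasing a cookie frees a slot, so a later registration can succeed
   even when the table was previously full.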

3.1.1.3.  Client Identifiers

   A DAFS client is identified by the client-id-string and client-
   verifier supplied in the connection request for a Session. The
   client-id-string, client-verifier pair is mapped to a shorthand
   client-id identifier that is subsequently identified with the client.
   In determining the appropriate client-id to use for a given client-
   id-string, client-verifier pair, the server SHALL apply the following
   rules:

   Case A: the server has no record of the client-id-string.

           The server can treat this case as though the client has
           never connected to the server. Specifically, the server
           SHALL generate a new client-id and save the authentication
           mechanism and principal for checking against future uses of
           the client-id-string.

   Case B: the server does have a record of the client-id-string, and
           there is an active client-id for it which was returned for
           the same client-id-string, client-verifier pair.

           This case corresponds to a client instance reconnecting after
           a Session was disconnected due to a transport error, or a
           client instance connecting additional Sessions.

           In this case, so long as the client successfully
           authenticates using the same authentication mechanism and
           principal that caused the server to generate the client-id,
           the server SHALL return the same client-id as it returned
           for the previous use of the (client-id-string,
           client-verifier) pair. If the client attempts to
           authenticate using either a different authentication
           mechanism or principal, then an appropriate error is
           returned.

   Case C: the server does have a record of the client-id-string, and
           there is an active client-id for it which was returned for
           the same client-id-string, but a different client-verifier.

           This case corresponds to a restart of a client instance
           (e.g., client reboot).

           In this case the client MUST authenticate using the same
           authentication mechanism and principal as was used for the
           previous instance of the client-id-string.

           If the client attempts to authenticate using either a dif-
           ferent authentication mechanism or principal, then an
           appropriate error is returned.

           If the server finds that the principal matches the one
           previously registered for the client-id-string, then all
           locking state associated with the old client-id SHALL be
           immediately released by the server.

   Note: When the client uses DAFS_PROC_CLIENT_CONNECT followed by
         DAFS_PROC_CLIENT_AUTH (rather than
         DAFS_PROC_CLIENT_CONNECT_AUTH) and case C above occurs, there
         is a window in which the server needs to generate a new
         client-id but the server can not clean up the old state that
         goes along with the old client-id. The server MUST wait until
         the client successfully authenticates as the appropriate prin-
         cipal using the appropriate mechanism (as discussed in case C)
         before releasing locking state.
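
   The three cases above can be sketched as a single lookup routine.
   The following Python fragment is a non-normative illustration: the
   ClientRegistry class, the AuthError exception, and the integer
   client-ids are hypothetical. It models the combined
   connect-and-authenticate path, so the window described in the Note
   does not arise; the "released" list merely records which client-ids
   had their locking state released.

```python
import itertools


class AuthError(Exception):
    """Mechanism or principal differs from the recorded one."""


class ClientRegistry:
    def __init__(self):
        self._records = {}            # client-id-string -> record
        self._ids = itertools.count(1)
        self.released = []            # client-ids whose locks were freed

    def connect(self, id_string, verifier, mechanism, principal):
        rec = self._records.get(id_string)
        if rec is None:
            # Case A: unknown client-id-string, create a new client-id.
            rec = {"verifier": verifier,
                   "auth": (mechanism, principal),
                   "client_id": next(self._ids)}
            self._records[id_string] = rec
            return rec["client_id"]
        if rec["auth"] != (mechanism, principal):
            # Cases B and C both require the original mechanism and
            # principal; anything else is an error.
            raise AuthError("mechanism or principal mismatch")
        if rec["verifier"] == verifier:
            # Case B: reconnect or additional Session, same client-id.
            return rec["client_id"]
        # Case C: client restart. Release the old locking state and
        # issue a new client-id for the new client instance.
        self.released.append(rec["client_id"])
        rec["verifier"] = verifier
        rec["client_id"] = next(self._ids)
        return rec["client_id"]
```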

3.1.2.  Session Attributes

   A number of aspects of the client to server communication are nego-
   tiated at the time the DAFS Session is established.


3.1.2.1.  Session Identifier

   Upon successful completion of a client request to establish a DAFS
   Session, the DAFS server returns a unique Session identifier to the
   client. The Session identifier associates a series of DAFS operation
   messages, independent of the lower-level transport implementation
   used to exchange messages.

3.1.2.2.  Session Options

   A number of Session attributes are established during Session ini-
   tialization. The client specifies requested values for the negotiated
   attributes in the Session request message, and the server returns the
   values for those attributes that will be used for the duration of the
   Session. Session options are:

   Protocol Version

      The client requests a particular protocol version for the Session.
      If the server supports that version, it responds with the same
      version number. Otherwise, the server responds with an error indi-
      cation and can also return a protocol version number that is sup-
      ported.

   Endianness

      The client requests a byte ordering endianness for the Session.
      The endianness of the initial connection request sent by the
      client can be either "little-endian" or "big-endian." The server
      determines the client's byte ordering by examining the protocol-
      defined static value in the message header. The server's response
      message to the connection request, and all subsequent messages
      exchanged on the Session will use the endianness chosen by the
      client.

   Maximum Number of Credentials

      The client requests that the server provides storage for a certain
      number of credentials. The server responds with the number that it
      will store. Later the client can use the DAFS_PROC_REGISTER_CRED
      operation to store credentials with the server and receive a
      "shorthand" credential identifier to be used in subsequent DAFS
      operations.

   Maximum Request Size

      The client requests that the size of the buffers allocated on the
      server to receive send-receive style request messages sent from
      the client is set to this value. The server responds with the
      buffer size that will be allocated on the server for receiving
      those requests.

   Maximum Response Size

      The client requests that the size of the buffers allocated on the
      client to receive send-receive style response messages sent from
      the server is set to this value. The server responds with the
      buffer size that MUST be allocated on the client for receiving
      those responses.

   Use Back-control Channel

      The client requests that it be allowed to bind an additional DAFS
      communication channel to this Session to support the transmission
      of server-initiated "back-control" directive messages on a
      separate channel. The server responds with whether the Back-
      control Channel will be used.

   Use RDMA Read Channel

      The client requests that it be allowed to bind an additional DAFS
      communication channel to this Session to support RDMA read opera-
      tions. The server responds with whether the RDMA-read Channel will
      be used.

   Use Checksums

      The client requests that all user data transferred in read and
      write operations be checksummed. The server responds with whether
      or not data will be checksummed.

   Inline Write Header Size

      The client requests the offset (in bytes) from the start of a
      DAFS_PROC_WRITE_INLINE message where the user data being
      transferred is located. This value is the sum of the DAFS message
      and operation header sizes and the padding size. This provides
      improved data alignment following write transfers to the server.
      The server responds with the offset where it will expect the data.
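
   The request/response shape of the option negotiation above can be
   sketched as follows. This is an illustration under stated
   assumptions, not the draft's algorithm: the dictionary keys, the
   SERVER table, and the policies of capping numeric options at a
   server limit and enabling a feature only when both sides want it
   are all hypothetical. The draft says only that the server returns
   the values that will be used for the Session. The RDMA-read Channel
   option is omitted because its rules are described separately.

```python
# Hypothetical server capabilities; every value here is an assumption.
SERVER = {
    "versions": {1},
    "max_credentials": 16,
    "max_request_size": 8192,
    "max_response_size": 8192,
    "use_back_control": True,
    "use_checksums": False,
}


def negotiate(request, server=SERVER):
    """Return the server's reply to a Session request (sketch only)."""
    if request["version"] not in server["versions"]:
        # Error indication, plus a version the server does support.
        return {"error": "unsupported_version",
                "supported_version": max(server["versions"])}
    reply = {"version": request["version"]}
    # Assumed policy: numeric options are capped at the server limit.
    for key in ("max_credentials", "max_request_size",
                "max_response_size"):
        reply[key] = min(request[key], server[key])
    # Assumed policy: a feature is used only if requested and supported.
    for key in ("use_back_control", "use_checksums"):
        reply[key] = request[key] and server[key]
    return reply
```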

3.1.2.3.  Multiple Communication Channels

   A DAFS Session includes at least one communication channel and can
   include up to two additional special-purpose channels.

   Note: All DAFS communication channels defined for a Session share the
         property that if an RMR Context can be used on one of them it
         can be used on any of them for RDMA operations.

   Rationale: DAFS is designed to support the key advantages of fast,
              low-overhead remote read and write operations. The basic
              message flow pattern needed to support those operations
              can be accomplished using only
              one transport channel. Adding features that require
              server-initiated messages introduces additional complex-
              ity. However, that complexity is introduced

               o  only for those clients who require the feature

               o  in a way that minimizes complexity and performance
                  overhead for the more common I/O operations.

              Thus, to support these optional features, DAFS optionally
              introduces extra complexity in the area of Session manage-
              ment in the form of establishing an additional communica-
              tion channel.


3.1.2.3.1.  DAFS Operation Channel

   A DAFS Session includes at least one communication channel (for exam-
   ple, DAT connection) for transporting DAFS operation messages between
   the client and server. The message flow consists of a DAFS operation
   request message sent from the client to the server, followed by a
   DAFS operation response message sent from the server to the client.
   The DAFS client initiates all request/response pairs on the DAFS
   Operation Channel.

3.1.2.3.2.  Creation of Special Purpose Channels

   The client creates all communication channels between the client and
   server. Connection and authentication of each channel normally con-
   sists of a connection request being sent from the client to the
   server, followed by a response message being sent from the server to
   the client. For some types of Session authentication, this initial
   paired message exchange MAY be followed by subsequent paired message
   exchanges that continue until the authentication process is complete.
   The entire sequence of exchanges is single-threaded, requiring that
   the client and server each make one receive buffer available for the
   next message throughout the sequence. Once the connection and authen-
   tication sequence is complete, the message flow on the channel is
   dictated by its intended purpose.

   If an optional, special-purpose channel is to be used, it MUST be
   bound to the Session after the initial connection and authentication
   message sequence has completed successfully on the Operation Channel,
   but before issuing any request on the Operation Channel that would
   require the use of the special-purpose channel.

   Note: If the client intends to issue requests that require the use of
         an optional channel, then the client SHOULD create and bind
         those channels to the Session as soon as possible after com-
         pleting the initial connection and authentication message
         sequence on the Operation Channel. Any delay in establishing
         these optional channels could increase the risk that a resource
         shortage on the server could cause an error in establishing the
         optional channel.

         However, the client is NOT REQUIRED to create these optional
         channels if it will not issue requests that would require their
         use. The specific requests that, depending on the results of
         the Session option negotiation, can introduce a requirement for
         the use of an optional channel are:

         o  Back-control Channel

            o  Any operation that requests a delegation


            o  DAFS_PROC_BATCH_SUBMIT


         o  RDMA-read Channel

            o  Any operation that requests the server to initiate an
               RDMA operation
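
   The list above can be read as a mapping from a request to the
   optional channels it depends on. The sketch below assumes a
   hypothetical required_channels function, string channel names, and
   a "negotiated" dictionary holding the outcome of Session option
   negotiation; none of these names appear in the protocol.

```python
def required_channels(proc, negotiated, wants_delegation=False,
                      server_initiated_rdma=False):
    """Return the optional channels a request depends on (sketch)."""
    needed = set()
    # Delegations and batch submission rely on server-initiated
    # messages, which travel on the Back-control Channel.
    if negotiated.get("use_back_control") and (
            wants_delegation or proc == "DAFS_PROC_BATCH_SUBMIT"):
        needed.add("back-control")
    # Server-initiated RDMA uses the RDMA-read Channel when negotiated.
    if negotiated.get("use_rdma_read") and server_initiated_rdma:
        needed.add("rdma-read")
    return needed
```

   A client can evaluate this before its first such request and bind
   any missing channel to the Session at that point, although binding
   early remains the safer choice for the reasons given in the Note.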


3.1.2.3.3.  Back-control Channel

   A DAFS Session can include a separate communication channel for tran-
   sporting DAFS back-control operation messages between the server and
   the client. Following the initial connection and authentication
   sequence of messages, this second message flow consists of a DAFS
   back-control directive request message sent from the server to the
   client, followed by a DAFS response message sent from the client to
   the server.

   The DAFS client creates this channel by initiating a request/response
   message pair to bind this channel to a previously established Ses-
   sion. The bind request can be followed by a sequence of client ini-
   tiated request/response message pairs needed to complete the channel
   authentication.

   Once the initial connection and authentication sequence is complete,
   the DAFS server initiates all subsequent request/response pairs on
   the Back-control Channel. This channel is OPTIONAL in that the
   client creates it to support DAFS features that are OPTIONAL and
   require server-initiated request/response messages. For instance, the
   delegation feature requires the server to initiate delegation revoca-
   tion message pairs with the client. This second channel provides for
   that requirement.

   Rationale: Separating the Back-control Channel from the Operation
              Channel also separates the traffic on the two channels
              onto two separate DAT Connections, each with its own DAT
              Endpoint and Connection parameters. This provides indepen-
              dent flow control on each channel. In addition, it allows
              the client implementations to handle back-control requests
              in a separate flow of control, without additional parsing
              on the common data path for command responses.

3.1.2.3.4.  RDMA-read Channel

   A DAFS Session can include a communication channel to be used
   exclusively for RDMA read operations initiated by the DAFS server.

   A client that intends to issue requests that require the DAFS server
   to make use of RDMA read operations MUST specify that intent by indi-
   cating the "use_rdma_read" Session connection option. The server
   accepts or rejects the use of the RDMA-read Channel. If accepted,
   then the server will use this communication channel to issue RDMA
   read operations when performing direct read operations from the
   client's memory.

   If the DAFS server specifies the use of an RDMA-read Channel during
   the connection option negotiation, then the client is REQUIRED to
   create and make use of such a channel to issue any DAFS request that
   would use the RDMA-read Channel. Otherwise, the server MAY return an
   error to any DAFS request on the DAFS Operation Channel that would
   involve an RDMA read operation.

   The DAFS client creates this channel by initiating a request/response
   message pair to bind this channel to a previously established Ses-
   sion. The bind request can be followed by a sequence of client ini-
   tiated request/response message pairs needed to complete the channel
   authentication. Once the initial connection and authentication
   sequence is complete, subsequent use of the channel is limited to
   server-initiated RDMA read operations.
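
   The direction of traffic on the three channel types described in
   this section can be summarized in a few lines. The sketch below is
   illustrative: the initiator function and the "bind" and "steady"
   phase labels are hypothetical names for the connection and
   authentication sequence and for the traffic that follows it.

```python
def initiator(channel, phase):
    """Return which side initiates exchanges on a channel (sketch).

    phase is "bind" during connection and authentication, and
    "steady" once that sequence has completed.
    """
    if phase not in ("bind", "steady"):
        raise ValueError("unknown phase: %r" % (phase,))
    if channel == "operation":
        # The client initiates every request/response pair.
        return "client"
    if channel in ("back-control", "rdma-read"):
        # The client creates and authenticates the channel; after
        # that, only the server initiates traffic on it.
        return "client" if phase == "bind" else "server"
    raise ValueError("unknown channel: %r" % (channel,))
```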


3.1.2.3.4.1.  Use of the RDMA-read Channel Optional for Server

   The RDMA-read Channel is OPTIONAL for the server. If a client's con-
   nection request specifies the use of an RDMA-read Channel (meaning
   that it intends to issue requests that call for the use of that chan-
   nel), the server MAY respond by accepting the use of an RDMA-read
   Channel, or MAY respond by rejecting the use of an RDMA-read Channel.
   If the server accepts the use of the RDMA-read Channel, then the
   client MUST create it; otherwise, the client MUST NOT create it.

   If a server has rejected the use of an RDMA-read Channel, and the
   client attempts to create one anyway, the server MAY return an error
   response to the DAFS_PROC_CONNECT_BIND request.

   If a client's connection request does not specify the use of an
   RDMA-read Channel (meaning that it does not intend to issue requests
   that call for the use of that channel), then the server MAY reject
   any DAFS request on the DAFS Operation Channel that would have
   involved the use of the RDMA-read Channel. This is simpler than
   dropping the Session or requiring that the RDMA-read Channel be set
   up within some length of time after the DAFS Operation Channel.

   The RDMA-read Channel is NOT OPTIONAL for the client. Regardless of
   whether the client's connection request specifies the use of an RDMA-
   read Channel or not, if the server's response specifies an RDMA-read
   Channel, then the client MUST create the channel if the client
   intends to issue requests that require its use.

   Should the server specify the use of an RDMA-read Channel, and the
   client does not create it, then the server MAY reject any subsequent
   DAFS request on the DAFS Operation Channel that would have made use
   of the RDMA-read Channel.

3.1.2.3.4.2.  Use of the RDMA-read Channel Not Optional for Client

   To make the RDMA-read Channel effective, all clients that issue DAFS
   requests that use RDMA read operations MUST use the RDMA-read Channel
   if it is specified by the server during Session negotiation.

   Rationale: VI does not currently define a "selective signaling"
              (per-request interrupt flag) capability, while the
              InfiniBand software transport interface does provide a
              selective signaling semantic. Hence, a portable
              application cannot rely on the transport layer providing
              such a capability. To remedy this VI deficiency, DAFS
              defines the RDMA-read Channel that a DAT Provider MAY be
              able to provide. This removes the need for that transport
              layer to provide a selective signaling capability. Notice
              that if the DAT optionally provides selective signaling,
              then a DAT Consumer can use that capability directly, thus
              avoiding the need for an additional DAFS communication
              channel to serve as the RDMA-read Channel.

              Consider a simple VI Architecture-based implementation
              where each DAFS communication channel is mapped onto a
              separate VI connection. Consider what happens if the
              server wants to use the RDMA-read Channel, and all
              clients but one establish and make use of an RDMA-read
              Channel. Assume the single client without an RDMA-read
              Channel tries to perform a DAFS_PROC_WRITE_DIRECT or
              DAFS_PROC_BATCH_SUBMIT request.


              The only communication channels available to the server to
              post the RDMA read will be either the DAFS Operation Chan-
              nel for the Session or the back-control directive channel
              for the Session. Both of these channels have their send
              work queues tied to a completion queue on which the DAFS
              server normally does not want to have interrupts enabled.
              Typically, this is a global send completion queue shared
              across all DAFS Sessions. So, if the DAFS server is about
              to satisfy a DAFS_PROC_WRITE_DIRECT or
              DAFS_PROC_BATCH_SUBMIT request, the server posts the RDMA
              read to either the DAFS operation or back-control direc-
              tive channel. To get timely notification for the RDMA read
              completion, the DAFS server will need to enable interrupts
              on the global send completion queue.


              Thus, a single client could put the server into the posi-
              tion of needing to take an interrupt for every send or
              RDMA write completion. That would defeat the point of the
              RDMA-read Channel.


              Implementations that support selective signaling need not
              use an RDMA-read Channel.


3.1.2.3.5.  Direct I/O channel negotiation

   Depending on the details of the server's and client's DAT transport
   support, it might be desirable for either the client or server to
   perform RDMA operations on the Operation Channel or on the RDMA-read
   Channel. This functionality is negotiated as part of the Session.

   1) Server supports DIRECT I/O on both the RDMA-read Channel and the
      Operation Channel.

      o  If the client requests use_rdma_channel == TRUE, then the
         server replies with use_rdma_channel == TRUE. DIRECT I/O
         proceeds on the RDMA-read Channel.

      o  If the client requests use_rdma_channel == FALSE, and later
         tries DIRECT I/O on the Operation Channel, the server performs
         the DIRECT I/O on the Operation Channel.

   2) Server requires use of the RDMA-read Channel for DIRECT I/O.

      o  If the client requests use_rdma_channel == TRUE, the server
         replies with use_rdma_channel == TRUE. DIRECT I/O proceeds on
         the RDMA-read Channel.

      o  If the client requests use_rdma_channel == FALSE, the server
         replies with use_rdma_channel == TRUE. If the client later
         tries DIRECT I/O on the Operation Channel, the server returns
         a DAFSERR_ENOTSUPP error.

   3) Server does not support the RDMA-read Channel, but supports DIRECT
      I/O on the Operation Channel.

      o  If the client requests use_rdma_channel == TRUE, the server
         replies with use_rdma_channel == FALSE. If the client later
         tries DIRECT I/O on the Operation Channel, the server performs
         the DIRECT I/O on the Operation Channel.

      o  If the client requests use_rdma_channel == FALSE, the scenario
         proceeds as in the previous situation.

   4) Server does not support DIRECT I/O at all.

      o  If the client requests use_rdma_channel == TRUE, the server
         replies with use_rdma_channel == FALSE. If the client later
         tries DIRECT I/O on the Operation Channel, the server returns
         an error.

      o  If the client requests use_rdma_channel == FALSE, the scenario
         proceeds as in the previous situation.
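
   The four cases above can be summarized as a small decision table.
   The following Python sketch is illustrative only: the ServerMode
   names, the function names, and the "ERROR" placeholder for case 4 are
   hypothetical, and the reply in case 1 when the client requests FALSE
   is assumed to echo the client's request, which the text above does
   not state.

```python
from enum import Enum


class ServerMode(Enum):
    # Hypothetical labels for the four server configurations listed above.
    BOTH = 1                # case 1: RDMA-read Channel and Operation Channel
    RDMA_CHANNEL_ONLY = 2   # case 2: RDMA-read Channel required
    OPERATION_ONLY = 3      # case 3: no RDMA-read Channel
    NO_DIRECT_IO = 4        # case 4: no DIRECT I/O at all


def negotiate(mode, client_use_rdma_channel):
    """Return the server's use_rdma_channel reply for a client request."""
    if mode is ServerMode.BOTH:
        # Assumption: the server echoes the client's preference.
        return client_use_rdma_channel
    if mode is ServerMode.RDMA_CHANNEL_ONLY:
        return True
    return False  # cases 3 and 4


def direct_io_on_operation_channel(mode):
    """Outcome when the client later tries DIRECT I/O on the Operation Channel."""
    if mode in (ServerMode.BOTH, ServerMode.OPERATION_ONLY):
        return "OK"
    if mode is ServerMode.RDMA_CHANNEL_ONLY:
        return "DAFSERR_ENOTSUPP"
    return "ERROR"  # case 4: the text does not name the error code
```

   A client could consult the reply to decide whether to establish the
   RDMA-read Channel before issuing its first DIRECT request.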

3.1.2.3.6.  Special Channel Setup Handling

   The DAFS server might need to be able to differentiate a client's
   connection requests for various types of Session channels. The imple-
   mentation requirements are specific to the details of connection
   establishment for each particular transport. For more discussion,
   see:

   o  Appendix B. "DAT Semantics" for details of DAT connection
      establishment

   o  Appendix D. "DAFS Mapping to VI Architecture" for details of the
      DAT mapping to VI

   o  Appendix E. "DAFS Mapping to InfiniBand Reliable Connection" for
      details of the DAT mapping to IB.

3.1.2.4.  Session Response Cache

   The DAFS server and client negotiate the use of a cache of operation
   results of recent state-modifying requests issued on each Session.
   State-modifying requests are those that change the state of a file
   or other file system object or a file lock on the server (see 4.3.,
   "Request Chaining" for further description of state-modifying
   requests). The maximum number of outstanding requests allowed for the
   Session determines the number of entries in the cache. This value is
   negotiated during Session initialization. The Session identifier pro-
   vides an identifier for accessing the Response Cache during recovery
   processing following a system failure.

   Rationale: DAFS uses a session-based protocol with a fixed number of
              outstanding requests for each Session. This provides an
              upper bound on the total number of entries in the Response
              Cache. The Session-id provides a unique identifier for
              accessing the cache following a failure. Using these ele-
              ments, the DAFS server can maintain the state necessary to
              ensure at-most-once semantics for state-modifying
              operations following a failure.
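
   As an illustration of the bounded cache described above, the
   following Python sketch keeps one slot per allowed outstanding
   request and replays a cached result instead of re-executing a
   state-modifying operation. It is a sketch under stated assumptions:
   the slot index and request_id used as the lookup key are
   hypothetical, since this section does not define how cache entries
   are addressed.

```python
class ResponseCache:
    """Per-Session cache sized by the negotiated outstanding-request limit."""

    def __init__(self, session_id, max_outstanding):
        self.session_id = session_id
        # The entry count is fixed when the Session is initialized.
        self.slots = [None] * max_outstanding

    def execute(self, slot, request_id, operation):
        """Run a state-modifying operation at most once per (slot, request_id)."""
        cached = self.slots[slot]
        if cached is not None and cached[0] == request_id:
            return cached[1]          # duplicate: replay the saved result
        result = operation()
        self.slots[slot] = (request_id, result)
        return result

    def check(self, slot, request_id):
        """Report whether a result for this request is held in the cache."""
        cached = self.slots[slot]
        return cached is not None and cached[0] == request_id
```

   Because the number of slots never exceeds the negotiated limit, the
   server's memory commitment per Session is known in advance.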

3.1.2.5.  Session Persistence

   A DAFS Session persists only as long as the DAFS Operation Channel
   exists. If the DAFS Operation Channel is lost, the client MUST estab-
   lish a new Session and re-authenticate and possibly re-register all
   credentials before continuing with additional DAFS requests. The
   client can use the new Session to transmit queries to the server
   about the state of DAFS requests that were outstanding at the time of
   disconnection. This information is available in the Response Cache
   associated with the old Session.

3.1.3.  Session Operations

   The following DAFS operations are provided for DAFS Session manage-
   ment.

   DAFS_PROC_CLIENT_CONNECT

      Create a new Session using the current transport connection, nego-
      tiating basic protocol configuration for the Session.

   DAFS_PROC_CLIENT_AUTH

      Authenticate the client to the server, establishing trust for the
      Session.

   DAFS_PROC_SERVER_AUTH

      Authenticate the server to the client, establishing trust in the
      reverse direction for the Session.

   DAFS_PROC_CLIENT_CONNECT_AUTH

      Create a new Session and authenticate the client to the server.

   DAFS_PROC_CONNECT_BIND

      Bind a new transport connection to an existing Session.

   DAFS_PROC_CLIENT_DISCONNECT

      Terminate a Session.

   DAFS_PROC_SECINFO

      Enumerate server-supported security authentication methods.

   DAFS_PROC_REGISTER_CRED

      Register credentials with the server for subsequent usage by the
      client.

   DAFS_PROC_RELEASE_CRED

      Remove previously registered credentials that are no longer needed
      by the client.

   The following DAFS operations provide recovery of Response Cache
   information for a previous Session.

   DAFS_PROC_CHECK_RESPONSE

      Check a disconnected Session's Response Cache for the results of a
      request.

   DAFS_PROC_FETCH_RESPONSE

      Fetch information from a disconnected Session's Response Cache.

   DAFS_PROC_DISCARD_RESPONSES

      Discard the contents of a disconnected Session's Response Cache.

3.1.4.  Sharing Sessions

   DAFS does not specify how a single DAFS Session is used by applica-
   tions. However, it does provide a mechanism to facilitate the sharing
   of a single Session, as might be the case if a multithreaded
   application wants to multiplex threads onto a Session.

   Each DAFS message contains a 64-bit field in the message header that
   the client is free to use to identify or tag a request. This tag is
   opaque to the server. The server returns the unmodified 64-bit number
   in the response, which the client can then use to efficiently match
   the response with the originator of the request. The DAFS protocol
   does not place any constraints on what this 64-bit tag contains.
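
   One way a client might use the tag is sketched below in Python. The
   SessionMultiplexer class and its method names are hypothetical; the
   protocol only guarantees that the server echoes the 64-bit value
   unchanged, so using a counter as the tag is just one possible choice.

```python
import itertools


class SessionMultiplexer:
    """Client-side table that maps opaque 64-bit tags to request originators."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._pending = {}   # tag -> originator (for example, a thread name)

    def tag_request(self, originator):
        # Keep the tag within 64 bits, the size of the header field.
        tag = next(self._counter) & 0xFFFFFFFFFFFFFFFF
        self._pending[tag] = originator
        return tag

    def route_response(self, tag):
        # The server returned the tag unmodified; look up who asked.
        return self._pending.pop(tag)
```

   Responses can arrive in any order; each one is routed by a single
   dictionary lookup rather than by inspecting the response body.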

3.2.  Message Handling

   The DAT provides a rich set of data transfer primitives. Efficient
   use of those primitives is affected by a number of software interface
   and hardware support attributes. The DAFS protocol defines
   traditional send/reply messages, as well as remote DMA-based
   operations. The request-response model uses data buffers transmitted
   in-line with the messages, whereas the bulk data transfer model uses
   "direct" buffers transmitted via RDMA independent of the messages.
   The RDMA model does not require the use of intermediate buffers
   within the file system or transport. The DAFS protocol defines a
   message flow control mechanism to help manage the various buffer
   resources.

3.2.1.  DAT Data Transfer Operations

   DAT provides a rich set of features for data transfer operations:
   RDMA writes, RDMA reads, and traditional send/receive. Data buffers
   for all data transfer operations consist of a scatter/gather list of
   one or more memory segments. Furthermore, the targeted applications
   of the local file-sharing environment and the DAFS protocol suggest a
   rich set of application requirements including async I/O and list I/O
   [POSIX].

3.2.1.1.  RDMA

   The RDMA write and RDMA read facilities enable one end of a
   communication channel to write directly into, or read directly from,
   the address space of its peer. The advantages of RDMA transfers over
   the traditional send/receive model are that data copies can be
   avoided and that receive buffers need not all be sized to hold the
   maximum transfer size of data.

3.2.1.2.  Scatter/Gather

   The initiating side of an RDMA data transfer can provide a set of
   buffers to use rather than requiring all data in the transfer to be
   contiguous in the virtual address space of the process. This can be
   used, for instance, to retrieve a message header into one buffer and
   a data payload into a separate data buffer.
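
   The header/payload split mentioned above can be modeled locally as
   follows. This Python sketch only illustrates the scatter idea with
   in-process buffers; the scatter function is hypothetical and does not
   represent any DAT interface.

```python
def scatter(data, buffers):
    """Fill each bytearray in 'buffers' in order; return the bytes placed."""
    offset = 0
    for buf in buffers:
        chunk = data[offset:offset + len(buf)]
        buf[:len(chunk)] = chunk      # same-length slice, buffer size unchanged
        offset += len(chunk)
    return offset
```

   For example, passing a 4-byte header buffer followed by an 8-byte
   payload buffer places the first 4 bytes of the data in the header
   buffer and the next 8 bytes in the payload buffer.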

3.2.1.3.  RDMA Memory Registration

   All memory that will be accessed by a DAT Channel Adapter in support
   of RDMA transfer operations needs to be registered with the Channel
   Adapter. Memory registration serves a number of purposes. First, it
   allows the operating system to pin the memory so it will be memory-
   resident during an I/O transfer. Secondly, it provides the Channel
   Adapter with the physical address mapping for the memory region.
   Finally, it associates the memory region with an RMR Context and set
   of protection attributes. The RMR Context can be used to ensure that
   the memory region is accessible for Remote DMA over the DAT
   connection(s) that are associated with that RMR Context only. This
   restricts access of a particular memory region to particular hosts or
   applications. The protection attributes indicate whether RDMA read or
   write is allowable on the memory region.

   As a result of memory registration, the DAT consumer is returned an
   RMR Context. This RMR Context MUST be provided to the Channel Adapter
   whenever it is asked to reference the memory region by an RDMA data
   transfer operation (DTO).
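
   The checks implied by registration can be sketched as a toy model in
   Python. Everything here is an assumption made for illustration: the
   ChannelAdapter and Registration classes, their method names, and the
   use of small integers as RMR Contexts are hypothetical and are not
   taken from any DAT provider interface. Pinning and address
   translation are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Registration:
    base: int                      # starting address of the region
    length: int                    # region size in bytes
    allow_read: bool               # RDMA read permitted
    allow_write: bool              # RDMA write permitted
    connections: set = field(default_factory=set)


class ChannelAdapter:
    """Toy model of the protection checks tied to an RMR Context."""

    def __init__(self):
        self._regions = {}
        self._next_context = 1

    def register(self, base, length, allow_read, allow_write, connections):
        ctx = self._next_context   # hypothetical: contexts are small integers
        self._next_context += 1
        self._regions[ctx] = Registration(
            base, length, allow_read, allow_write, set(connections))
        return ctx

    def check_rdma(self, ctx, connection, address, length, write):
        reg = self._regions.get(ctx)
        if reg is None or connection not in reg.connections:
            return False           # unknown context or wrong connection
        if address < reg.base or address + length > reg.base + reg.length:
            return False           # outside the registered region
        return reg.allow_write if write else reg.allow_read
```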

3.2.2.  DAT Error Reporting

   DAT provides guarantees for data delivery and data transfer operation
   (DTO) completions as stated in 2.3.3., "DAT Requirements". DAT makes
   some guarantees as to the type of errors it detects, but it makes no
   statements as to the timeliness of the reporting of these errors. It
   is up to DAFS client and server implementations to address any defi-
   ciencies in the timely error detection/reporting features of any
   given DAT provider.

   Some of the DAT and DAFS protocol tools that handle timely detection
   of errors include:

   o  DAFS null operations for the main and Back-control Channels that
      could be used to ping a peer.

   o  Ability of either the client or server to break a connection at
      any time.

   o  Support for multiple connections between client and server.

   The DAFS specification does not mandate any particular strategy for
   timely error detection. In fact, the level of error detection sup-
   ported by the DAT provider will dictate the degree of error detection
   that DAFS implementations will need to perform.

   Note: Once a DAT interaction between the DAT Provider and the DAT
         Consumer is defined, a timeout parameter for some synchronous
         data transfer operations (DTO) can also be used for controlling
         timely detection of errors.

   Client and server implementations are free to implement any mechanism
   and enforce any timeliness constraints they see fit. Typically, the
   request initiator (the client on the main channel, the server on the
   back channel) is responsible for error detection enhancements on its
   channel. Possible solutions include:

   o  Use of a per-operation timeout.

   o  Use of keep-alive messages (pings) using the NULL procedures. The
      sender could then restrict the use of timers to these messages
      only and not to every operation.
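
   The keep-alive option can be sketched as a small state machine. The
   Python below is one possible arrangement, not a mandated design: the
   KeepAlive class, its parameters, and the returned action strings are
   hypothetical. A single timer is armed only while a NULL procedure is
   outstanding, so ordinary operations carry no per-operation timer.

```python
class KeepAlive:
    """Decide when to send a NULL procedure and when to give up on a peer."""

    def __init__(self, idle_interval, reply_timeout):
        self.idle_interval = idle_interval   # quiet time before pinging
        self.reply_timeout = reply_timeout   # time allowed for the ping reply
        self.last_traffic = 0.0
        self.ping_sent_at = None

    def on_traffic(self, now):
        # Any message from the peer proves liveness and cancels the ping timer.
        self.last_traffic = now
        self.ping_sent_at = None

    def poll(self, now):
        if self.ping_sent_at is not None:
            if now - self.ping_sent_at >= self.reply_timeout:
                return "break_connection"
            return "wait"
        if now - self.last_traffic >= self.idle_interval:
            self.ping_sent_at = now
            return "send_null"
        return "idle"
```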

3.2.3.  Mapping DAFS onto Memory-to-Memory Architectures

   The key characteristics of the memory-to-memory architecture that
   impact the definition of the file access operations are the Remote
   DMA (RDMA) data transfer facilities. The RDMA write and RDMA read
   facilities enable one end of a communication channel to directly
   write or read into the memory of its peer. The advantage of RDMA
   transfers over the traditional send/receive model is that data copies
   can be avoided.

   DAFS assumes that there is one-to-one correspondence between a DAFS
   communication channel and a DAT connection.

   DAFS operations are defined that take advantage of remote memory
   access. For instance, in addition to read file and write file
   operations that include the data to be transferred in the response or
   request message, new read and write operations are defined that
   include the memory address of the client's destination/source buffer.

   Given the RDMA and send/receive models supplied by DAT, DAFS requests
   and responses are divided into two categories:

   o  messages that transfer large variable-sized (usually greater than
      1 KB) bulk user data

   o  messages that are bounded in size by the file access protocol.

3.2.3.1.  Small Bounded-Size Transfers

   The DAFS message exchange is mapped onto Direct Access Transport mes-
   sages in the following manner:

   1) The DAFS request message is placed in transport-registered memory.
      A preallocated send data transfer operation (DTO) buffer is ini-
      tialized so that its data segment's buffer virtual address points
      to the DAFS message. More generally, the send DTO buffer can con-
      tain a "gather" list of virtual address pointers, paired with
      corresponding buffer lengths, that describe a (potentially virtu-
      ally non-contiguous) series of memory buffers containing the DAFS
      message. The message is sent using the transport-specific send
      interface.

   2) On the receiving end, a server is REQUIRED to have a pre-
      allocated, registered transport-level receive data transfer opera-
      tion (DTO) buffer ready to accept a request message. The DTO
      buffer's data segment(s) form a scatter list of <virtual address,
      length> pairs that describe a preallocated virtual buffer that
      meets the agreed upon maximum message size. It is up to the server
      implementation to detect message reception through polling or
      blocking calls to the transport's receive interface using a
      transport-specific API.

   3) The server builds a response and sends it in a manner similar to
      step 1. Note that as part of the flow control agreement, the
      client MUST post as many receive descriptor buffers as there are
      outstanding requests.

   4) The client receives the response in a manner similar to step 2. As
      in step 2, the DAFS architecture does not mandate a mechanism for
      detecting the arrival of responses. It is therefore possible for a
      single client thread to asynchronously deliver requests to a
      server and collect responses at a later time.
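
   The receive-posting rule in step 3 can be illustrated with a minimal
   client-side model. This Python sketch assumes a hypothetical
   ClientChannel class; it does not call any real transport interface
   and only tracks that one maximum-size receive buffer is posted for
   every outstanding request.

```python
class ClientChannel:
    """Model of a client that posts one receive buffer per outstanding request."""

    def __init__(self, max_outstanding, max_message_size):
        self.max_outstanding = max_outstanding
        self.max_message_size = max_message_size
        self.posted_receives = []   # pre-posted receive buffers, oldest first
        self.sent = []              # requests handed to the transport
        self.outstanding = 0

    def send_request(self, message):
        if len(message) > self.max_message_size:
            raise ValueError("message exceeds the agreed maximum size")
        if self.outstanding >= self.max_outstanding:
            raise RuntimeError("flow control: too many outstanding requests")
        # Post the receive buffer before sending, so a response can land.
        self.posted_receives.append(bytearray(self.max_message_size))
        self.sent.append(bytes(message))
        self.outstanding += 1

    def receive_response(self, message):
        buf = self.posted_receives.pop(0)
        buf[:len(message)] = message
        self.outstanding -= 1
        return bytes(buf[:len(message)])
```

   Because requests are queued without waiting, a single thread can
   submit several requests and collect the responses later.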

3.2.3.2.  Bulk Data Transfers

   A small number of operations might require the transfer of large and
   variable-sized user data frames. Typical RPC-based distributed file
   systems encode and transmit bulk data in the same fashion as bounded-
   sized requests. The sequence of operations is the same as described
   earlier in 3.2.3.1., "Small Bounded-Size Transfers" and the data
   packets usually have a fixed size header followed immediately (inline
   to the header) by bulk data.

   The problem with this encoding of messages is that the bulk data
   never lands at the desired memory location on the destination. This
   implies that a data copy needs to be performed to place the bulk data
   at the intended destination. Using RDMA operations, it is possible,
   with modifications to the packet encoding, to place the data directly
   where it is desired. The packet encoding changes so that the bulk
   data does not follow the header, but rather the header contains a
   memory reference to where the bulk data can be found.

   All RDMA operations are initiated by the DAFS server.

3.2.3.2.1.  Client to Server Bulk Data Transfer

   In cases where the bulk data flow is from the client to the server
   (as in DAFS write), a single DAFS operation maps to the following
   messages:

   1) The client sends to the server a message that contains the DAFS
      header plus a list of <RMR Context, RMR Target Address, length>
      triples that describe where the bulk data resides. Sending this
      request does not differ from the first step in the bounded-size
      request processing, in that it is also a single send descriptor
      buffer with the data segment pointing to the header. However, in
      this message the header contains memory address information for
      the bulk data buffer.

   2) The server receives the request as in the bounded-size case. The
      request, however, is not complete, because the bulk data is
      missing. The next step handles the bulk data transfer.

   3) The server decodes the request and posts an RDMA read data
      transfer operation (DTO) using the addressing information
      contained in the request. Note that the client is not involved,
      because the RDMA operation does not contain any immediate data,
      does not require the use of any of the client's receive buffers,
      and is not matched against any receive DTOs submitted by the
      client. Also note that now that the server has the file
      information, it is possible for the server to place the contents
      of the bulk transfer directly
      into the buffer that the native file system requires. (However,
      this is an implementation issue that is not mandated by the DAFS
      architecture.)

      Finally, note that depending on the server implementation and the
      optional channels negotiated for the Session, the RDMA operation
      MAY be posted on either the RDMA-read Channel or the Operation
      Channel.

   4) The server sends the response (bounded in size) as described in
      the bounded-sized response case.

   The client receives the response in a manner similar to the bounded-
   size case.
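
   The message sequence above can be simulated in a few lines of Python.
   This is a sketch under stated assumptions: ClientMemory,
   handle_write_direct, and the dictionary layout of the request and
   response are hypothetical stand-ins for registered client memory, the
   server's request handler, and the DAFS message encoding.

```python
class ClientMemory:
    """Hypothetical model of client memory registered for RDMA read."""

    def __init__(self):
        self.regions = {}   # rmr_context -> (base_address, bytearray)

    def register(self, rmr_context, base_address, data):
        self.regions[rmr_context] = (base_address, bytearray(data))

    def rdma_read(self, rmr_context, address, length):
        base, data = self.regions[rmr_context]
        offset = address - base
        if offset < 0 or offset + length > len(data):
            raise ValueError("RDMA read outside registered region")
        return bytes(data[offset:offset + length])


def handle_write_direct(request, client_memory):
    # Step 3: one RDMA read per <RMR Context, RMR Target Address, length>
    # triple carried in the request header; the client takes no action.
    chunks = [client_memory.rdma_read(ctx, addr, length)
              for ctx, addr, length in request["segments"]]
    file_data = b"".join(chunks)
    # Step 4: the bounded-size response carries no bulk data.
    response = {"status": "OK", "count": len(file_data)}
    return response, file_data
```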

3.2.3.2.2.  Server to Client Bulk Data Transfer

   In cases where the bulk data flow is from the server to the client
   (as in a DAFS read operation), a single DAFS operation maps to the
   following transport-level messages:

   1) The client sends the DAFS header, containing the memory informa-
      tion for the bulk data buffer, the same as for step 1 for client
      to server bulk data transfers. The address encoded in the request
      refers to the client location where data to be transferred from
      the server to the client will be placed.

   2) The server receives the DAFS request as in step 2 of client to
      server bulk data transfers.

   3) The server posts an RDMA write data transfer operation (DTO) to
      move the bulk data directly to the address advertised in the
      request header. (As stated in 2.3.3., "DAT Requirements" the
      server's RDMA write DTO is not matched against any receive DTO
      pre-submitted by the client.)

   4) The server sends the response just as in step 3 for small bounded-
      size transfers.

   5) The client receives the response in a manner similar to step 4 of
      the small bounded-size case. However, the client is aware that the
      response does not include the bulk data and that it can be found
      at the location that the client specified in the request.
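
   The read direction can be modeled the same way. In the Python sketch
   below, handle_read_direct, the request dictionary, and the
   client_regions table are hypothetical; the slice assignment stands in
   for the server's RDMA write into the buffer the client advertised.

```python
def handle_read_direct(request, file_data, client_regions):
    """Place file bytes directly into the client buffer named in the request."""
    ctx, address, length = request["target"]
    base, buf = client_regions[ctx]        # registered client buffer
    offset = address - base
    start = request["file_offset"]
    payload = file_data[start:start + length]
    if offset < 0 or offset + len(payload) > len(buf):
        raise ValueError("RDMA write outside registered region")
    # Step 3: the "RDMA write"; no client receive buffer is consumed.
    buf[offset:offset + len(payload)] = payload
    # Step 4: the response reports status only; the data is already in place.
    return {"status": "OK", "count": len(payload)}
```

   After the response arrives, the client reads the data from its own
   buffer at the address it supplied in the request.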

   The DAFS file access protocol defines bulk transfer operations that
   parallel the traditional RPC model. The bulk transfer operations take
   advantage of DAT RDMA capabilities, while the small bounded-size
   operations use the traditional send/receive model. DAFS defines
   standard
   APPEND_INLINE, READ_INLINE, WRITE_INLINE, and READDIR_INLINE opera-
   tions as well as APPEND_DIRECT, READ_DIRECT, WRITE_DIRECT, and
   READDIR_DIRECT operations. DAFS defines two versions, inline and
   direct, for any operation where the data size might be large due to
   variable length fields.

   Rationale: To promote pipelining of DAFS messages through send and
              receive data transfer operations (DTO), the client and
              server need to agree to pre-submit multiple receive DTOs.
              However, DAFS messages are variable length, and the order
              of requests is unpredictable. Therefore, maintaining
              variable-length send DTOs would require additional
              messages (or additional synchronized communication channels)
              in order for the receiver to allocate and pre-submit a
              receive DTO for an appropriately sized DAT buffer. A sim-
              ple solution to this problem is to preallocate the
              maximally-sized DAT buffers capable of receiving the larg-
              est DAFS message. This wastes space, because most DAFS
              messages are smaller than 1 KB, and only a few messages
              with variable-sized fields (for example, file attributes
              and long file names) might become much larger than 1 KB.
              Thus, DAFS essentially implements two buffer sizes: small
              and large. Small buffers are used for normal send-receive
              traffic. Anything that will not fit in a small buffer has
              its bulk data portion transferred directly between client
              and server buffers using RDMA.

              The benefits of this approach are primarily:


               o  the performance benefits of simplifying the negotia-
                  tion of buffer sizes through the use of uniformly
                  sized DAT buffers for send/receive data transfer
                  operations (DTO)

               o  the ease of implementation that comes from managing
                  fixed-size buffers.

              However, during Session establishment, DAFS leaves open
              the option for the client to negotiate the buffer size.
              There might be an implementation or application where
              operation latency is very critical and memory space is
              very cheap. In this case, the additional memory costs of
              large buffers can be traded off against the reduced
              latency of the DAFS Send-Receive model of communication
              versus the DAFS RDMA model.
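
   A client might pick between the inline and direct variants of an
   operation with a simple size test, as sketched below in Python. The
   choose_variant function and its header_len parameter are hypothetical
   and the rule is an assumption; the protocol does not mandate how a
   client makes this choice.

```python
def choose_variant(operation, payload_len, header_len, negotiated_buffer_size):
    """Return the inline form if the whole message fits in one small buffer."""
    if header_len + payload_len <= negotiated_buffer_size:
        return operation + "_INLINE"
    # Too large for a send/receive buffer: move the bulk data by RDMA.
    return operation + "_DIRECT"
```

   With a larger negotiated buffer size, more requests take the inline
   path, which is the memory-for-latency trade described above.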

3.2.4.  Separate Communications Channel for RDMA Read Operations

   DAFS optionally defines a separate communication channel, commonly
   mapped onto a separate DAT connection, that can be used specifically
   for RDMA read operations issued by the server. This option might
   provide a significant reduction in the latency of
   DAFS_PROC_WRITE_DIRECT operations.

   Rationale: The DAFS server should rarely need to block waiting for
              the send data transfer operation (DTO) of a DAFS response
              message to complete. One implementation would be for the
              server to specify that all completion notifications for
              send DTOs on all DAFS Sessions be associated with a single
              Completion Queue (CQ). The server could then periodically
              poll this CQ to harvest send completions in a group,
              rather than taking an interrupt on each individual send
              completion.

              On the other hand, to provide good response time for a
              DAFS_PROC_WRITE_DIRECT request, the server should be able
              to receive immediate notification (that is, a hardware
              interrupt) when the RDMA read completes. A server imple-
              mentation could use a single CQ that ties together the
              following:


               o  the completion notifications for receive DTOs of the
                  DAT connections for the DAFS Operation Channels

               o  the completion notifications for receive DTOs of the
                  DAT connections for the back-control directives chan-
                  nels

               o  the completion notifications for the receive DTOs of
                  the DAT connections for the RDMA-read Channels.

              The server can then have a single worker that blocks on
              this CQ when idle.


              With this setup, under moderate load the server receives
              timely interrupts for RDMA read completions, but does not
              have to take interrupts for the (potentially numerous)
              ordinary send completions that merely indicate that DAFS
              response resources can be reclaimed.


   Note: Even with this scheme, per data transfer operation (DTO)
         interrupt control might be highly desirable for the following
         reasons:

         o  DAT RDMA read does not support remote gather. Thus, if the
            DAFS_PROC_WRITE_DIRECT specifies N+1 virtually noncontiguous
            client buffers, the server will need to post N+1 separate
            RDMA read operations. The server only cares when the final
            RDMA read completes. Unfortunately, without per-DTO
            interrupt control, the server can be interrupted when each
            of the N+1 RDMA read operations completes, rather than just
            when the desired final RDMA read operation completes.

         o  Adding a third communication channel per DAFS Session is
            potentially expensive, because communication channels are a
            limited resource.
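
   With per-DTO interrupt control, the server could request a completion
   notification for the final RDMA read only. The Python sketch below
   shows just that planning step; plan_rdma_reads and the boolean
   signal flag are hypothetical and do not correspond to a specific DAT
   provider option.

```python
def plan_rdma_reads(segments):
    """Pair each client segment with a flag: notify on completion or not."""
    last = len(segments) - 1
    # Only the final read asks for a notification; earlier reads are silent.
    return [(segment, index == last) for index, segment in enumerate(segments)]
```

   For N+1 segments this yields N unsignaled reads followed by one
   signaled read, so the server takes a single interrupt per request.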

3.2.4.1.  Error Detection on the RDMA-read Channel

   A problem could arise when the RDMA-read Channel is employed, because
   it is difficult to detect and recover from errors on that channel.
   The reason is that the client needs to create it, but only the server
   knows whether the channel is functioning correctly.

   The client creates the communication channel by establishing a DAT
   connection, creating RMR Contexts for RDMA read, associating RMR Con-
   texts with the DAT connection, connecting it to the server, and bind-
   ing it to the DAFS Session. Following the initial connection and
   authentication message exchanges, the client does not post descriptor
   buffers to this communication channel, since it will be used solely
   by the client's transport to satisfy the server's RDMA read opera-
   tions.

   On the other hand, once the initial connection and authentication
   sequence is complete, the server will not submit any receive data
   transfer operations (DTO) to the DAT connection for this communica-
   tion channel, because the client will not send on the channel. In
   fact, that is the motivation: traffic on the channel is never inter-
   leaved, so that completions can be efficiently handled. The only
   sends to the channel are for server RDMA read operations, and these
   occur for non-inline client transfers only.

   It is only when the server has an opportunity to perform such a
   transfer that the status of the channel will be checked. The client
   software will never see an error from the channel. When a failure
   condition occurs on the RDMA-read Channel, the server can disconnect
   the client's Session connection, forcing the client to recover as if
   a network partition or server restart had been encountered.

3.2.5.  Checksums

   The DAFS protocol defines the OPTIONAL use of checksums on all mes-
   sages exchanged on a Session. During Session creation, the client can
   specify this option. If specified, two separate checksums will be
   computed:

   message_checksum

      A checksum for the message, including headers and any inline data

   direct_checksum

      If there is an RDMA transfer associated with the operation, a
      checksum for the bulk data transferred via direct RDMA.

   The message_checksum is transmitted in the message header for both
   request and response messages. It is computed by the message sender
   and inserted into the message header. If the request includes an RDMA
   operation, the direct_checksum is computed for the RDMA data buffer
   and inserted into the operation header of the request or response
   accompanying the RDMA operation. The message receiver verifies the
   checksums on receipt of the message. If the server detects a checksum
   failure, a checksum error status is returned. If the client detects a
   checksum failure, it SHOULD take appropriate action.

   The use of checksums is negotiated during Session creation. The
   client requests the use of checksums in the connection request mes-
   sage, and the server replies with an acknowledgement that checksums
   will be used in the connection response message. Neither the client's
   connection request nor the server's connection reply message includes
   a checksum. However, all subsequent messages for the Session, includ-
   ing messages transferred on optional channels, will include a check-
   sum.

   The checksum is computed on the message with the DAFS message headers
   in the endian byte ordering specified for the Session (i.e., DAFS
   network byte order). As an input to the checksum algorithm, the value
   of the checksum field itself is "zero". The checksum value computed
   is based on the ones-complement Fletcher-32 checksum [Fletcher],
   [Sklower] using a checksum computation modulus of 65535.

   o  S1 and S2 are 16-bit quantities. The checksum is computed on the
      data 2 bytes at a time, treating pairs of contiguous data bytes as
      a single 16-bit data word. The resulting values of S1 and S2 are
      placed in the 32-bit DAFS checksum field, each as a 16-bit
      quantity.

   o  S1 is given the initial value 0x0101. Starting from that initial
      value, S1 is computed to be the ones-complement sum of the data
      taken 2 bytes at a time (as described in the preceding paragraph)
      with a modulo function applied by subtracting 65535 whenever the
      value of S1 becomes larger than 65535. If the length of the data
      is an odd number of bytes, then S1 is computed as if an additional
      byte containing the value "zero" had been appended to the end of
      the data. The zero pad byte is not included in the transmission of
      the data.

   o  S2 is given the initial value of 0x0000. Starting from that ini-
      tial value, S2 is computed to be the 16-bit sum of the data multi-
      plied by the position of the data from the end of the packet. No
      multiplication is actually necessary in the algorithm. The multi-
      plication effect results from the way the sum is accumulated. S2
      accumulates values of S1 after S1 is updated; so a given 16-bit
      data word appears multiple times in S2. The number of times each
      16-bit data word appears in S2 depends on its position from the
      end of the packet. The same 65535 modulo function is applied to S2
      as it is computed.
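
   The S1 and S2 computation described above can be sketched as
   follows. This is a minimal illustration in Python, not a normative
   implementation. The function names are hypothetical, and two details
   that this section does not specify are assumptions: each 16-bit word
   is read most-significant byte first, and S2 is placed in the upper
   half of the 32-bit checksum field. The caller is assumed to have
   zeroed the checksum field before passing the data in.

```python
MODULUS = 65535      # checksum computation modulus from the text
S1_SEED = 0x0101     # initial value of S1
S2_SEED = 0x0000     # initial value of S2


def fold(value: int) -> int:
    """Apply the modulo step: subtract 65535 when the value exceeds it."""
    return value - MODULUS if value > MODULUS else value


def dafs_checksum_sums(data: bytes) -> tuple:
    """Return (S1, S2) for data whose checksum field is already zero."""
    if len(data) % 2:
        data += b"\x00"  # virtual zero pad byte; it is never transmitted
    s1, s2 = S1_SEED, S2_SEED
    for i in range(0, len(data), 2):
        # Assumption: 16-bit words are read most-significant byte first.
        word = int.from_bytes(data[i:i + 2], "big")
        s1 = fold(s1 + word)
        s2 = fold(s2 + s1)  # S2 accumulates S1 after S1 is updated
    return s1, s2


def pack_checksum(s1: int, s2: int) -> int:
    """Assumed layout: S2 in the high 16 bits, S1 in the low 16 bits."""
    return (s2 << 16) | s1
```

   Because S1 starts at 0x0101, an all-zero block yields non-zero sums
   in this sketch, which matches the stated purpose of the seed.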

   Rationale: This optional approach to checksumming provides a "last
              check" on the hardware and software implementations
              involved in providing DAFS, without imposing a performance
              penalty. Better fast than slow, but more importantly,
              better safe than sorry.

              The initial value for S1 is chosen as a non-zero checksum
              seed in order to detect an invalid all-zero block, while
              providing byte order independence to keep the checksum
              algorithm simple.


3.2.6.  Message Flow Control

3.2.6.1.  Requesters and Responders

   The DAFS protocol defines the roles of client and server. The client
   is the party that initiates the Session and submits file-level
   requests (DAFS operations) to the server. The server waits for con-
   nection requests from clients and processes the file-level requests.

   All DAFS communication is of a 'request-response' nature. For the
   DAFS operations that form the bulk of the communication, the client
   is the requester submitting the DAFS operation, and the server is the
   responder, processing the DAFS operation and sending the response.

   But the DAFS protocol also defines a set of "back-control" directives
   (for example, delegation revocation, asynchronous notification of
   operation completion) that the server can send to the client. For
   these directives, the server takes the role of the requester and the
   client is the responder.

3.2.6.2.  Flow Control Requirements

   DAT includes no provision for flow control. Under DAT (see 2.3.3.,
   "DAT Requirements"), the communicating DAFS parties "A" and "B" MUST
   guarantee, through some mechanism external to DAT itself, that:

   o  "A" will not attempt to send data via a send data transfer opera-
      tion (DTO) when "B" does not have a receive DTO pre-submitted and
      waiting to receive the data.

   o  "A" will not attempt to send more data than the size of the buffer
      of the receive data transfer operation (DTO) submitted on "B".

   The flow control mechanism does not require out-of-band communication
   of "send credits," nor are buffers disassociated from application-
   level operations. A requester knows how many requests it is allowed
   to have outstanding at any given time, and it never has to wait for
   send credits separately from waiting for application-level operations
   to complete.

   Finally, the flow control mechanism enables the server to provide
   congestion control at the server through back pressure on requesters
   to reduce the rate of incoming requests.

3.2.6.3.   Overview-Use of Communication Channel Facilities

   As explained earlier in 3.1.2.3., "Multiple Communication Channels",
   a DAFS Session consists of one or more communication channels mapped
   onto one or more DAT connections. The following list describes the
   types of communication channels:

   o  A REQUIRED DAFS Operation Channel over which the client sends DAFS
      operation requests and the server sends responses.

   o  An optional Back-control Channel over which the server sends
      directive requests and the client sends responses. The client can
      decline the use of all DAFS features that necessitate use of this
      channel, in which case the client need not create and manage this
      channel.

   o  An optional RDMA-read Channel over which the server issues RDMA
      read operations. This communication channel is an adjunct used to
      provide a separate channel for issuing RDMA read operations and is
      not involved in message flow control.

   As a part of channel creation, DAFS establishes a flow control proto-
   col for the channel. However, at least one successful message
   exchange is necessary in order to establish the flow control proto-
   col. The first message exchange between a DAFS client and server is
   governed by these two rules:

   1) The DAFS server listens for an incoming connection on the DAT
      transport by posting a buffer where a connection request is to be
      stored. The client's initial connection request operation MAY be:

      o     DAFS_PROC_CLIENT_CONNECT,

      o     DAFS_PROC_CONNECT_AND_AUTH,

      o     or, for an optional channel, DAFS_PROC_CLIENT_BIND.

      The server MUST post a buffer at least 4-KB in size to receive
      this request. This provides space for authentication data for the
      channel that MAY be included with the connection request.

   2) The DAFS client MUST be prepared to receive a reply to the initial
      connection request. Synchronization requirements regarding the
      posting of a buffer to receive the DAFS connection reply are
      transport dependent. However, the client MUST post a buffer of at
      least 4-KB in size to receive this reply. This provides space for
      authentication data for the channel that MAY be included with the
      connection request.

   This initial exchange of a DAFS connection request and response mes-
   sage contains the flow control negotiation parameters that will
   govern the subsequent packet exchange on the channel. The following
   flow control values are negotiated:

   OPNreq

      Number of DAFS operations (or back-control directives) that the
      requester can submit simultaneously on the channel. Depending on
      which channel, the requester might be the client or the server.
      For the DAFS Operation Channel, OPNreq is a limit on the client;
      for the Back-control Channel, it is a limit on the server. The
      value can be dynamically renegotiated throughout the lifetime of
      the DAFS Session. For more information, see 3.2.6.4.2., "Maximum
      Number of Simultaneous Outstanding Requests". OPNreq MUST be >= 1
      at all times.


   OPSZreq

      The maximum size of a single DAFS operation request or back-
      control directive request.

   OPSZresp

      The maximum size of a single DAFS operation response or back-
      control response.

   The discussion that follows refers to "requester" and "responder."
   For the DAFS Operation Channel, the requester is the client issuing
   DAFS operations and the responder is the server. For the back-control
   directive channel, the requester is the server and the responder is
   the client. Nreq, SZreq, and SZresp refer to the values (negotiated
   or static) appropriate for the direction under consideration.

   On a given channel, the requester and responder use the following
   protocol to satisfy the flow control requirements described earlier
   in 3.2.6.2., "Flow Control Requirements":

   o  The requester submits requests using send descriptors no larger
      than SZreq.

   o  The responder responds using send descriptors no larger than
      SZresp. If the response to a submitted request will not fit in a
      buffer of size SZresp, an error indication is returned instead.

   o  When no requests are outstanding, the responder guarantees that at
      least Nreq receive buffers of size >= SZreq are posted.

   o  While processing some number of requests M <= Nreq on a given
      channel, the responder need only be prepared to receive (Nreq - M)
      more requests. So upon receiving a request, the responder need not
      immediately post another receive buffer. But before sending a
      response, the responder MUST post another receive buffer (unless
      Nreq is being reduced), because the requester will recognize,
      upon receipt of the response, that only (M - 1) requests are
      simultaneously outstanding and might submit another. For more
      information, see 3.2.6.4.2., "Maximum Number of Simultaneous
      Outstanding Requests".

   o  The requester guarantees that it will never have more than Nreq
      requests outstanding. When exactly Nreq requests are outstanding,
      the requester MUST delay submitting the next request until it
      receives a response to a previously submitted request. When the
      response is received, the requester might or might not be able to
      immediately submit the next request. For more information, see
      3.2.6.4.2., "Maximum Number of Simultaneous Outstanding Requests".
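
   The requester's side of the rules above can be sketched as follows.
   This is a minimal illustration in Python with hypothetical class and
   method names. It holds Nreq fixed for simplicity and does not model
   the dynamic renegotiation described in 3.2.6.4.2.

```python
class Requester:
    """Requester-side bookkeeping for one channel (Nreq held constant)."""

    def __init__(self, nreq: int, sz_req: int):
        if nreq < 1:
            raise ValueError("Nreq MUST be >= 1")
        self.nreq = nreq          # maximum simultaneous requests
        self.sz_req = sz_req      # maximum request size in bytes
        self.outstanding = 0      # requests sent, response not yet seen

    def try_submit(self, request: bytes) -> bool:
        """Return True if the request may be sent now, False if it must wait."""
        if len(request) > self.sz_req:
            raise ValueError("request is larger than SZreq")
        if self.outstanding >= self.nreq:
            return False          # delay until a response arrives
        self.outstanding += 1
        return True

    def on_response(self) -> None:
        """Account for a response to a previously submitted request."""
        if self.outstanding == 0:
            raise RuntimeError("response received with nothing outstanding")
        self.outstanding -= 1
```

   With Nreq = 2, a third call to try_submit() returns False until
   on_response() has been called once.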


3.2.6.4.  Flow Control Specifics

   Flow control between a requester and responder requires that the two
   parties agree on two types of information:

   o  Maximum size of request and response. This governs the size of the
      receive buffers posted by the requester and responder.

   o  Maximum number of requests that are allowed to be simultaneously
      outstanding. This governs the number of receive descriptor buffers
      that the responder maintains on its transport-level receive work
      queue.

3.2.6.4.1.  Maximum Request/Response Sizes

   The DAFS protocol allows the maximum size values OPSZreq and OPSZresp
   to be negotiated on a per-Session basis when the Session is created.
   After the DAFS Session has been established, these values remain in
   effect for the lifetime of the Session.

   The protocol allows the values to be negotiated on a per-Session
   basis to permit the client and server control over the following:

   o  maximum amount of data in a WRITE_INLINE request (impacts
      OPSZreq of the DAFS Operation Channel)

   o  maximum amount of data that can be returned in a READ_INLINE
      response (impacts OPSZresp of the DAFS Operation Channel)

   o  maximum number of entries in a NOTIFY_BATCH_COMPLETE back-control
      directive message (impacts OPSZreq of the back-control directive
      channel).

3.2.6.4.2.  Maximum Number of Simultaneous Outstanding Requests

   Unlike the size values, the DAFS protocol considers the maximum
   number of simultaneous outstanding requests for each channel, OPNreq,
   to be dynamic. The protocol provides the following capabilities:

   o  Requester can ask for an increase or decrease in Nreq in any
      request packet. This allows a client to request additional
      server-side resources (for example, additional outstanding receive
      data transfer operations) during periods of heavy DAFS activity.
      This also allows a client to "be a good citizen" and yield
      resources during times of reduced activity.

   o  Responder can increase or decrease Nreq in any response packet.
      This allows the server to reduce the amount of server-side
      resources (for example, reduce outstanding receive data transfer
      operations) it dedicates to a single client Session. This might be
      necessary to accept additional incoming connections on a given
      NIC, or to throttle back the rate of incoming DAFS operations from
      a single overly active client. It also allows the server to
      restore resources to an active client when they become available.

   Note: This value MUST always be >= 1; if Nreq were 0, there would be
         no mechanism for the client to request or the server to grant
         an increased value.

   The mechanism by which the DAFS protocol supports dynamic negotiation
   of Nreq is described as follows:

   At any given time, the responder has dedicated resources for two
   classes of requests on a given Session:

   o  a set H of requests currently being handled (request received,
      response not yet sent)

   o  a set P of requests, not yet received, for which receive data
      transfer operations (DTO) are currently submitted.

   The union of these two sets constitutes the complete set of requests
   that the responder is currently equipped to handle. Thus:

           Nreq = num_elements(set H) + num_elements(set P)

   The header of every request contains a value Desired_Nreq, which
   lets the requester ask for an increase or decrease in the value of
   Nreq. But the responder is under no obligation to honor the
   requester's desire regarding Nreq; the responder is the sole owner of
   the value of Nreq.

   The requester cannot assume any action by the responder until noti-
   fied via the Target_Nreq field in a received response (see the fol-
   lowing paragraph). Furthermore, the requester MUST be prepared for a
   change in the Nreq value when processing any response, whether a
   change was asked for or not. There is no reason for the requester to
   distinguish between solicited and unsolicited changes in Nreq.

   The responder can change the value of Nreq at any time, either in
   response to a change request or on its own, and notifies the reques-
   ter of the current value of Nreq using the Target_Nreq field in the
   header of every response. But note that the Target_Nreq value commun-
   icated in every response is the Nreq value that the responder is aim-
   ing for, and is NOT necessarily the same as the Nreq value in effect
   at the time of the response.

   Each request that arrives at the responder consumes one outstanding
   receive data transfer operation (DTO); this reduces the
   num_elements(set P) by 1 and increases num_elements(set H) by 1. The
   responder does not need to submit another receive DTO at this point,
   but instead can delay submitting another receive DTO until it is
   ready to send the response.

   Prior to sending any response, the responder can take one of three
   courses of action, depending on whether it wants to maintain,
   increase, or reduce the value of Nreq:

   o  If the responder wants to maintain the value of Nreq at its
      current value (Target_Nreq = Nreq), it submits one receive data
      transfer operation (DTO) (possibly reusing the same memory buffer
      of the receive DTO used for the request just completed).

   o  If the responder wants to increase the value of Nreq by M
      (Target_Nreq = Nreq + M), it submits M + 1 receive data transfer
      operations (DTOs).

   o  If the responder wants to decrease the value of Nreq by M
      (Target_Nreq = Nreq - M), it simply declines to submit any receive
      data transfer operation (DTO) prior to sending the response. When
      the response is sent, num_elements(set H) is reduced by 1 without
      a corresponding increase in num_elements(set P), thus reducing
      Nreq by 1.
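
   The three courses of action above can be sketched with simple
   counters for set H and set P. This is a minimal illustration in
   Python, assuming hypothetical class and method names; it models only
   the bookkeeping, not the DAT transport itself.

```python
class Responder:
    """Tracks set H (being handled) and set P (receive DTOs posted)."""

    def __init__(self, nreq: int):
        if nreq < 1:
            raise ValueError("Nreq MUST be >= 1")
        self.handling = 0       # num_elements(set H)
        self.posted = nreq      # num_elements(set P)

    @property
    def nreq(self) -> int:
        # Nreq = num_elements(set H) + num_elements(set P)
        return self.handling + self.posted

    def receive_request(self) -> None:
        """An arriving request consumes one posted receive DTO."""
        if self.posted == 0:
            raise RuntimeError("no receive DTO is posted")
        self.posted -= 1
        self.handling += 1

    def send_response(self, target_nreq: int) -> int:
        """Post receive DTOs as needed, then send; returns Target_Nreq."""
        if target_nreq < 1:
            raise ValueError("Target_Nreq MUST be >= 1")
        if self.handling == 0:
            raise RuntimeError("no request is being handled")
        if target_nreq >= self.nreq:
            # Maintain (M = 0) or increase by M: post M + 1 receive DTOs.
            self.posted += (target_nreq - self.nreq) + 1
        # Otherwise post nothing, so Nreq drops by one with this response.
        self.handling -= 1
        return target_nreq
```

   Starting from Nreq = 4, a response with Target_Nreq = 6 raises Nreq
   to 6 at once, while a later response with Target_Nreq = 3 lowers it
   only to 5.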

   When the requester receives a response, it recalculates the value of
   Nreq using the following formula:

           Nreq = MAX(previous_Nreq - 1, Target_Nreq)

   where Target_Nreq is the value contained in the response header.

   Thus:

   o  Increases in the value of Nreq take place immediately; this is
      possible because the responder can submit receive data transfer
      operations (DTOs) at any time.

   o  Decreases in the value of Nreq take place gradually; the value can
      be reduced by only 1 on each processed request. This is necessary
      because DAT does not provide a way to cancel outstanding and
      in-progress data transfer operations (DTOs). The only ways to
      retrieve an outstanding DTO are for it to be consumed by an
      incoming request, or for the DAT connection to be terminated.
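
   The asymmetry between increases and decreases follows directly from
   the recalculation formula. A minimal sketch in Python, with
   hypothetical function names, applies the formula to a sequence of
   received Target_Nreq values:

```python
def next_nreq(previous_nreq: int, target_nreq: int) -> int:
    """Nreq = MAX(previous_Nreq - 1, Target_Nreq)."""
    return max(previous_nreq - 1, target_nreq)


def nreq_trace(nreq: int, targets: list) -> list:
    """Return the requester's Nreq after each response in turn."""
    trace = []
    for target in targets:
        nreq = next_nreq(nreq, target)
        trace.append(nreq)
    return trace
```

   For example, nreq_trace(8, [5, 5, 5, 5]) steps down one response at
   a time through 7, 6, 5, and then stays at 5, whereas
   nreq_trace(5, [9]) jumps straight to 9.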

   The response contains the responder's target value for Nreq rather
   than the current value. This allows the responder to explicitly
   notify the requester that the responder needs to reclaim resources
   associated with (current Nreq - Target_Nreq) additional requests. In
   this situation, the requester SHOULD take necessary action to allow
   the responder to reclaim those resources in a timely manner.

   Thus, although Nreq can be decreased by at most one on each response,
   Target_Nreq can be reduced by M. This provides the server a way to
   tell the client that it intends to reduce Nreq by one after each of
   the next M requests. It also acts as a hint to the client: if the
   client does not have M requests queued to issue, it SHOULD issue M
   null requests to allow the server to reduce Nreq. Failure to do so
   can result in connection failure.

   Note: Consider the case in which a server needs to reclaim resources
         associated with a particular DAFS Session. The server notifies
         the requester of a reduced Target_OPNreq value. The client
         receives and recognizes the change, but does not have another
         request to issue. In this case the server might be forced to
         terminate the Session in order to reclaim (all) resources that
         had been dedicated to that Session. To avoid this situation, a
         client that is notified of a reduced Target_OPNreq SHOULD issue
         NULL DAFS operation requests to consume receive descriptors on
         the server, allowing the server to achieve its resource recla-
         mation.


4.  File System Operations

   DAFS file system operations can be divided into different areas:

   o  Concepts and structures

      Key objects being managed include file names, filehandles, and
      access credentials. DAFS shares a heritage with NFS, but differs
      in some important ways.

   o  Data transfer

      Key DAFS operations focus on the efficient transfer of data
      between client and server. Typically these operations take advan-
      tage of RDMA capability.

   o  Request chaining

      DAFS chaining is similar in concept to NFS Version 4 compound
      operations, but is tailored to the DAFS environment.

   o  Locking and access control

      DAFS provides operations to support sharing files with a robust
      failure recovery framework.

   o  Standard file system support

      A number of DAFS operations are intended to be functionally
      equivalent to NFS Version 4 operations.

4.1.  Concepts and Structures

4.1.1.  DAFS and NFS Version 4

   DAFS concepts and procedures are described in the sections that fol-
   low. Some of these are based on the NFS Version 4 protocol. The dis-
   cussion of these DAFS procedures includes quoted remarks from the NFS
   Version 4 specification.

4.1.2.  Typographical Conventions

   Some DAFS procedure descriptions contain references to the NFS Ver-
   sion 4 protocol as described in the Internet Society's RFC 3010 docu-
   ment. These references appear inside quotation marks. At the end of
   each quotation appears a reference to the RFC document. The abbrevi-
   ated references look like this: (RFC 3010, pp. xxx-yyy) where xxx-
   yyy refers to the pages where the quoted text is found.

   Whenever DAFS differs in terminology from the quoted NFS Version 4
   text, the DAFS equivalent term appears inside square brackets [].
   Such simple substitution includes but is not limited to procedure
   names and error codes.

4.1.3.  Recurring Differences Between DAFS and NFS Version 4

4.1.3.1.  Filehandles in Compound vs. Chaining

   Most file system actions operate on a file object. An NFS Version 4
   or a DAFS procedure requires a filehandle that specifies the file
   object to act upon. In the NFS Version 4 protocol, the filehandle is
   obtained from the COMPOUND operation's current filehandle. For DAFS
   operations, the filehandle is obtained from the arguments in the pro-
   cedure supplied by the client, unless the operation is chained and
   the DAFS_CHF_FH flag is set in the DAFS header. In the chained case,
   the filehandle is the one saved by the previous operation in the
   chain.

   When a DAFS operation completes successfully, the filehandle used by
   the operation becomes available for use by the operation that follows
   in the chain. This is not the case, however, if the operation
   generates a new filehandle, as DAFS_PROC_LOOKUP does. In that case,
   the newly generated handle becomes available for use by the operation
   that follows in the chain.
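
   The filehandle selection rule for chained operations can be sketched
   as follows. This is a minimal illustration in Python, not a
   description of the wire format: the Op class and its field names are
   hypothetical, the chf_fh field stands in for the DAFS_CHF_FH header
   flag, and every operation in the chain is assumed to succeed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Op:
    name: str
    arg_fh: Optional[str] = None     # filehandle supplied in the arguments
    chf_fh: bool = False             # stands in for the DAFS_CHF_FH flag
    yields_fh: Optional[str] = None  # new filehandle generated, if any


def filehandles_used(chain: List[Op]) -> List[str]:
    """Return the filehandle each operation in the chain acts upon."""
    saved = None   # filehandle saved by the previous operation
    used = []
    for op in chain:
        fh = saved if op.chf_fh else op.arg_fh
        if fh is None:
            raise ValueError(op.name + ": no filehandle available")
        used.append(fh)
        # A newly generated handle replaces the one the operation used.
        saved = op.yields_fh if op.yields_fh is not None else fh
    return used
```

   A DAFS_PROC_LOOKUP on a directory followed by a chained
   DAFS_PROC_GETATTR therefore acts first on the directory handle and
   then on the handle that the lookup produced.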

4.1.3.2.  Credentials

   NFS Version 4 requests are enclosed in an RPC request. The RPC header
   for every operation contains a set of credentials that identifies the
   user requesting file service. The DAFS protocol has a procedure to
   register user credentials. This procedure returns a credentials
   handle. Subsequent DAFS requests need only include the credentials
   handle obtained via the credentials registration.

4.1.3.3.  Attribute Bitmaps

   In DAFS, the attribute data structure used in procedures such as
   DAFS_PROC_GETATTR has two bitmaps. The included bitmap determines
   the attribute fields that are present in the attributes packet, that
   is, the fields for which memory is allocated. The valid bitmap
   represents the attributes with actual values that the server was able
   to return. The valid bitmap is a subset of the included bitmap.
   Therefore, it is possible to have attribute fields present as
   indicated in the included bitmap, but with no valid values.
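
   The relationship between the two bitmaps can be sketched with
   ordinary bit operations. This is a minimal illustration in Python;
   the attribute names and bit positions are hypothetical and are not
   taken from the DAFS attribute definitions.

```python
# Hypothetical attribute bits, for illustration only.
ATTR_SIZE = 1 << 0
ATTR_MTIME = 1 << 1
ATTR_OWNER = 1 << 2


def bitmaps_consistent(included: int, valid: int) -> bool:
    """True when the valid bitmap is a subset of the included bitmap."""
    return (valid & ~included) == 0


def present_but_not_valid(included: int, valid: int) -> int:
    """Bits for fields that occupy space but carry no valid value."""
    return included & ~valid
```

   If the included bitmap names all three attributes and the valid
   bitmap names only size and owner, present_but_not_valid() returns
   the mtime bit: the field is allocated in the packet but the server
   returned no value for it.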

   In contrast, the NFS Version 4 attributes contain one bitmap only.
   The remaining attribute structure consists of fields with valid
   attribute values.


4.1.4.  Object Naming and Filehandles

   The DAFS name space is structured similarly to the NFS Version 4 name
   space. This section includes some text quoted from the NFS Version 4
   specification that describes the name space. It also includes text
   from the same NFS Version 4 specification that describes the proper-
   ties of filehandles.

           "7. NFS Server Name Space

           7.1 Server Exports

           On a UNIX server the name space describes all the  files
           reachable  by    pathnames  under  the root directory or
           "/".  On a Windows NT server  the name space constitutes
           all  the  files  on disks named by mapped  disk letters.
           NFS  server  administrators  rarely  make   the   entire
           server's   file  system  name  space  available  to  NFS
           clients.  More often  portions of  the  name  space  are
           made  available  via an 'export' feature." (RFC 3010, p.
           47)

   Text is omitted regarding use of the mount protocol in previous ver-
   sions of the NFS protocol.

           "7.2 Browsing Exports

           The NFS version 4 protocol provides  a  root  filehandle
           that  clients  can  use to obtain filehandles for  these
           exports via a multi-component  LOOKUP.   A  common  user
           experience    is  to  use  a  graphical  user  interface
           (perhaps a file 'Open' dialog window)  to  find  a  file
           via  progressive browsing through a directory tree.  The
           client must be able to move from one export  to  another
           export  via  single-component, progressive LOOKUP opera-
           tions." (RFC 3010, p. 48)

   In DAFS, the root filehandle is obtained via a
   DAFS_PROC_GETROOTHANDLE procedure call.

   Text about previous versions of the NFS protocol and the use of MOUNT
   capabilities has not been quoted here.

           "7.3 Server Pseudo-Filesystem


           NFS version 4  servers  avoid  this  name  space  incon-
           sistency by presenting all the exports within the frame-
           work of a single server name space.  An  NFS  version  4
           client  uses  LOOKUP  and  READDIR  operations to browse
           seamlessly from one export to another.  Portions of  the
           server  name space that are not exported are bridged via
           a 'pseudo file system' that provides a view of  exported
           directories  only.   A  pseudo  file system has a unique
           fsid and behaves like a normal, read only file system.

           Based on the construction of the server's name space, it
           is possible that multiple pseudo file systems may exist.
           For example,


                         /a         pseudo file system
                         /a/b       real file system
                         /a/b/c     pseudo file system
                         /a/b/c/d   real file system


           Each  of  the  pseudo  file  systems  are   consider[ed]
           separate  entities  and  therefore  will  have  a unique
           fsid." (RFC 3010, p. 48)

   DAFS file systems do not have an fsid as described for the NFS Ver-
   sion 4 case in the quoted text above. Instead, DAFS file systems have
   a unique FSHandle. This FSHandle is obtained via a
   DAFS_PROC_GETFSATTR procedure call. The FSHandle is also a visible
   part of the DAFS filehandle that a DAFS client can consult to determine
   when a new file system has been reached during a pathname traversal.
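
   A client could use the visible FSHandle to notice file system
   boundaries during a pathname traversal, as in the following minimal
   sketch in Python. The FileHandle class, its field types, and the
   function name are hypothetical; the actual filehandle layout is not
   defined in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileHandle:
    fshandle: bytes   # client-visible file system identifier
    fileid: bytes     # opaque to the client


def crossings(path_handles: List[FileHandle]) -> List[int]:
    """Return the indexes at which the traversal enters a new file system."""
    return [
        i
        for i in range(1, len(path_handles))
        if path_handles[i].fshandle != path_handles[i - 1].fshandle
    ]
```

   Only the FSHandle is compared; the fileid is never interpreted.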

           "7.4 Multiple Roots

           The DOS and Windows operating environments are sometimes
           described  as having 'multiple roots'.  File systems are
           commonly represented as disk letters.  MacOS  represents
           file  systems as top level names.  NFS version 4 servers
           for these platforms can construct a pseudo  file  system
           above  these  root  names so that disk letters or volume
           names are simply directory  names  in  the  pseudo  root
           tree.

           7.5 Filehandle Volatility

           The nature of the server's pseudo file system is that it
           is  a logical representation of file system(s) available
           from the server. Therefore, the pseudo  file  system  is
           most  likely  constructed dynamically when the server is
           first instantiated.  It is expected that the pseudo file
           system  may  not  have an on disk counterpart from which
           persistent  filehandles  could  be  constructed.    Even
           though  it  is  preferable  that the server provide per-
           sistent filehandles for the pseudo file system, the  NFS
           client should expect that pseudo file system filehandles
           are volatile.  This can be  confirmed  by  checking  the
           associated 'fh_expire_type' attribute for those filehan-
           dles in question.  If the filehandles are volatile,  the
           NFS  client  must  be  prepared  to recover a filehandle
           value (e.g. with a multi-component LOOKUP) when  receiv-
           ing an error of NFS4ERR_FHEXPIRED.

           7.6 Exported Root

           If the server's root file system is exported, one  might
           conclude  that a pseudo-file system is not needed.  This
           would be wrong. Assume the following file systems  on  a
           server:


                               /       disk1  (exported)
                               /a      disk2  (not exported)
                              /a/b    disk3  (exported)


           Because disk2 is not exported, disk3 cannot  be  reached
           with  simple    LOOKUPs.  The server must bridge the gap
           with a pseudo-file system.

           7.7 Mount Point Crossing

           The server file system environment may be constructed in
           such  a  way  that  one file system contains a directory
           which is 'covered' or mounted upon by a second file sys-
           tem.  For example:


                         /a/b            (file system 1)
                         /a/b/c/d        (file system 2)


           The pseudo file system  for  this  server  may  be  con-
           structed to look like:


                         /               (place holder/not exported)
                         /a/b            (file system 1)
                         /a/b/c/d        (file system 2)


           It is the server's responsibility to present the  pseudo
           file  system  that  is  complete  to the client.  If the
           client sends a  lookup request for the path  '/a/b/c/d',
           the server's response is the filehandle of the file sys-
           tem '/a/b/c/d'. In previous versions of the  NFS  proto-
           col,   the  server  would  respond  with  the  directory
           '/a/b/c/d'  within the file system '/a/b'.

           The NFS client will be able to determine if it crosses a
           server  mount  point  by  a  change  in the value of the
           'fsid' attribute.

           7.8 Security Policy and Name Space Representation

           The application of the server's security policy needs to
           be  carefully  considered  by  the implementor.  One may
           choose to limit  the  viewability  of  portions  of  the
           pseudo  file  system based on the server's perception of
           the client's ability to  authenticate  itself  properly.
           However,  with  the support of multiple security mechan-
           isms and the ability to negotiate the appropriate use of
           these  mechanisms,   the  server  is  unable to properly
           determine if a  client  will  be  able  to  authenticate
           itself.   If,  based on its policies, the server chooses
           to limit the contents of the  pseudo  file  system,  the
           server  may  effectively hide file systems from a client
           that may otherwise have legitimate access."  (RFC  3010,
           pp. 49-50)

           "4. Filehandles

           The filehandle in the  NFS  protocol  is  a  per  server
           unique  identifier  for  a file system object.  The con-
           tents of  the  filehandle  are  opaque  to  the  client.
           Therefore, the server is responsible for translating the
           filehandle to an internal  representation  of  the  file
           system  object.   Since  the  filehandle is the client's
           reference to an object and the  client  may  cache  this
           reference,  the server SHOULD not reuse a filehandle for
           another file system object. If the server needs to reuse
           a filehandle value, the time elapsed before reuse SHOULD
           be large enough such that it is unlikely the client  has
           a cached copy of the reused filehandle value.  Note that
           a client may cache a filehandle for a  very  long  time.
           For  example,  a  client  may  cache  NFS  data to local
           storage as a method to expand its effective  cache  size
           and  as  a means to survive client restarts.  Therefore,
           the lifetime of a cached filehandle  may  be  extended."
           (RFC 3010, p. 23)

   DAFS filehandles are mostly opaque to the client. They contain a
   client-visible FSHandle field as well as an opaque fileid field.

           "4.1 Obtaining The First Filehandle

           The operations of the NFS protocol are defined in  terms
           of one or more filehandles.  Therefore, the client needs
           a filehandle to initiate communication with the server."
           (RFC 3010, p. 24)

   References to the use of the mount protocol in previous versions of
   the NFS protocol have been removed. The DAFS protocol defines a
   special filehandle, called the Root Filehandle, that is used to
   initiate this communication.

           "4.1.1 Root Filehandle

           The  first  of  the  special  filehandles  is  the  ROOT
           filehandle. The ROOT filehandle is the 'conceptual' root
           of the  file system name space at the NFS server."  (RFC
           3010, p. 24)

   The client obtains the ROOT filehandle with the
   DAFS_PROC_GETROOTHANDLE operation. The DAFS client uses this
   filehandle to traverse the file name space provided by the server.
   See "7. NFS Server Name Space" from the NFS Version 4 specification
   as quoted above, and the DAFS notes on name space issues also found
   in this section, for a description of the name space presented by a
   DAFS server.

   The NFS Version 4 specification description of the public filehandle
   is omitted here as this filehandle concept is not part of the DAFS
   protocol.

           "4.2 Filehandle Types

           In the NFS version 2 and 3 protocols, there was one type
           of  filehandle  with  a single set of semantics. The NFS
           version 4 protocol introduces a new type  of  filehandle
           in  an  attempt  to  accommodate certain server environ-
           ments.  The first type of  filehandle  is  'persistent'.
           The semantics of a persistent filehandle are the same as
           the filehandles of the NFS version 2  and  3  protocols.
           The   second  type  of  filehandle  is  the   'volatile'
           filehandle.

           The volatile filehandle  type  is  being  introduced  to
           address  server  functionality  or implementation issues
           which  make  correct  implementation  of  a   persistent
           filehandle  infeasible.  Some server environments do not
           provide a file system level invariant that can  be  used
           to  construct  a  persistent  filehandle. The underlying
           server file system may not provide the invariant or  the
           server's file system programming interfaces may not pro-
           vide access to the needed invariant.  Volatile  filehan-
           dles may ease the implementation of server functionality
           such as hierarchical storage management or  file  system
           reorganization  or  migration.   However,  the  volatile
           filehandle increases the implementation burden  for  the
           client.  However this increased burden is deemed accept-
           able based on the overall gains achieved by  the  proto-
           col.

           Since the client will  need  to  handle  persistent  and
           volatile  filehandle  differently,  a  file attribute is
           defined which may be used by the client to determine the
           filehandle  types  being  returned  by the server." (RFC
           3010, p. 25)

   Disregard the reference to file system migration in the previous
   paragraph: the DAFS protocol does not support migration.

           "4.2.1 General Properties of a Filehandle

           The filehandle contains all the information  the  server
           needs to distinguish an individual file.  To the client,
           the filehandle is opaque. The client stores  filehandles
           for  use in a later request and can compare two filehan-
           dles from the same server for equality by doing a  byte-
           by-byte comparison.  However, the client MUST NOT other-
           wise interpret  the  contents  of  filehandles.  If  two
           filehandles  from  the  same server are equal, they MUST
           refer to the same file.  If  they  are  not  equal,  the
           client  may  use  information provided by the server, in
           the form of file attributes, to determine  whether  they
           denote  the  same  files or different files.  The client
           would do this as  necessary  for  client  side  caching.
           Servers  SHOULD try to maintain a one-to- one correspon-
           dence between filehandles and  files  but  this  is  not
           required.   Clients MUST use filehandle comparisons only
           to improve performance, not for correct  behavior.   All
           clients  need  to be prepared for situations in which it
           cannot be determined whether two filehandles denote  the
           same  object  and  in  such  cases, avoid making invalid
           assumptions which might cause incorrect behavior."  (RFC
           3010, pp. 25-26)

   DAFS filehandles are mostly opaque to the client. They contain a
   client-visible FSHandle field as well as an opaque fileid field.
   The opaque fileid field has the same properties as the NFS
   filehandle described in the quoted text above.

   Although the FSHandle is a client-visible field within the DAFS
   filehandle, the FSHandle itself is opaque to the client. In other
   words, a DAFS client MUST NOT interpret the contents of the FSHandle
   field, except for testing it for equality to determine if two file
   objects reside within the same server's file system.
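
   The comparison rules above can be sketched as follows. This is a
   minimal illustration and not the DAFS wire format: the
   DafsFilehandle class and its field names are hypothetical, and both
   fields are modeled as raw bytes from a single server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DafsFilehandle:
    """Hypothetical in-memory model of a DAFS filehandle."""
    fshandle: bytes  # client-visible, but compared only for equality
    fileid: bytes    # opaque to the client


def same_file_system(a: DafsFilehandle, b: DafsFilehandle) -> bool:
    # The only permitted use of FSHandle contents: an equality test.
    return a.fshandle == b.fshandle


def known_same_object(a: DafsFilehandle, b: DafsFilehandle) -> bool:
    # Equal filehandles MUST refer to the same file. Unequal handles
    # prove nothing, so False means "unknown", not "different".
    return a == b
```

   A False result from known_same_object() is deliberately weak: as the
   quoted text states, the client would then fall back on file
   attributes, and only to improve performance, never for correctness.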

           "As an example, in the  case  that  two  different  path
           names when traversed at the server terminate at the same
           file system object, the server SHOULD  return  the  same
           filehandle  for each path. This can occur if a hard link
           is used to create two file names which refer to the same
           underlying  file  object and associated data.  For exam-
           ple, if paths /a/b/c and /a/d/c refer to the same  file,
           the  server  SHOULD  return the same filehandle for both
           path names traversals.

           4.2.2 Persistent Filehandle

           A persistent filehandle is defined  as  having  a  fixed
           value  for  the  lifetime  of  the file system object to
           which it refers.  Once the server creates the filehandle
           for  a  file  system  object, the server MUST accept the
           same filehandle for the object for the lifetime  of  the
           object.   If  the  server  restarts  or  reboots the NFS
           server must honor the same filehandle value as it did in
           the server's previous instantiation."  (RFC 3010, p. 26)

   Reference to file system migration has been removed.

           "The persistent  filehandle  will  be  become  stale  or
           invalid  when  the  file system object is removed.  When
           the server is presented  with  a  persistent  filehandle
           that refers to a deleted object, it MUST return an error
           of NFS4ERR_STALE.  A filehandle may  become  stale  when
           the  file  system  containing  the  object  is no longer
           available.  The file system may become unavailable if it
           exists  on  removable  media  and the media is no longer
           available at the server or the file system in whole  has
           been  destroyed  or  the  file  system  has  simply been
           removed from the server's name space (i.e. unmounted  in
           a Unix environment).

           4.2.3 Volatile Filehandle

           A volatile filehandle does not share the same  longevity
           characteristics  of a persistent filehandle.  The server
           may determine that a volatile filehandle  is  no  longer
           valid  at  many different points in time.  If the server
           can definitively determine that  a  volatile  filehandle
           refers  to  an  object that has been removed, the server
           should return DAFSERR_STALE to the  client  (as  is  the
           case  for  persistent  filehandles).  In all other cases
           where the server determines that a  volatile  filehandle
           can  no  longer  be  used,  it should return an error of
           NFS4ERR_FHEXPIRED.      The     mandatory      attribute
           'fh_expire_type' is used by the client to determine what
           type of filehandle the server is providing for a partic-
           ular  file  system. This attribute is a bitmask with the
           following values:

           FH4_PERSISTENT

              The value of FH4_PERSISTENT is  used  to  indicate  a
              persistent   filehandle,  which  is  valid  until the
              object is removed from the  file system.  The  server
              will  not  return NFS4ERR_FHEXPIRED for this filehan-
              dle. FH4_PERSISTENT is defined as a  value  in  which
              none of the bits specified below are set.

           FH4_NOEXPIRE_WITH_OPEN

              The filehandle will not expire while client  has  the
              file  open.  If  this  bit  is  set,  then the values
              FH4_VOLATILE_ANY  or  FH4_VOL_RENAME  do  not  impact
              expiration  while the file is open.  Once the file is
              closed or if the FH4_NOEXPIRE_WITH_OPEN bit is false,
              the rest of the volatile related bits apply.

           FH4_VOLATILE_ANY

              The filehandle may expire at any time and will expire
              during system migration and rename.

           FH4_VOL_RENAME

              The filehandle may expire  due  to  a  rename.   This
              includes  a  rename  by  the  requesting  client or a
              rename  by  another  client.   May  only  be  set  if
              FH4_VOLATILE_ANY is not set.

           Servers which provide volatile filehandles should deny a
           RENAME  or  REMOVE that would affect an OPEN file or any
           of the components leading to the OPEN file. In addition,
           the  server  should  deny all RENAME or REMOVE  requests
           during the grace or lease period upon   server restart.

           The  reader  may  be  wondering  why  there  are   three
           FH4_VOL*  bits  and why FH4_VOLATILE_ANY is exclusive of
           FH4_VOL_MIGRATION  and   FH4_VOL_RENAME.    If   the   a
           filehandle  is  normally  persistent  but cannot persist
           across a file set migration, then the  presence  of  the
           FH4_VOL_MIGRATION  or  FH4_VOL_RENAME  tells  the client
           that it can treat the file handle as persistent for pur-
           poses  of  maintaining a file name to file handle cache,
           except for the specific event   described  by  the  bit.
           However,  FH4_VOLATILE_ANY  tells  the  client  that  it
           should not maintain such a cache for unopened files.   A
           server    MUST   not   present   FH4_VOLATILE_ANY   with
           FH4_VOL_RENAME  as  this   will   lead   to   confusion.
           FH4_VOLATILE_ANY   implies  that  the  file  handle will
           expire upon migration or rename, in  addition  to  other
           events." (RFC 3010, pp. 26-27)

   The description for FH4_VOL_MIGRATION has been removed. For readabil-
   ity purposes, references to this flag were kept in the above para-
   graph. Disregard these references.
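
   As an illustration of how a client might interpret the
   'fh_expire_type' bitmask quoted above, consider the sketch below.
   The numeric bit values are assumptions taken from the NFS Version 4
   definitions; this document does not state them, and a DAFS
   implementation may differ. FH4_VOL_MIGRATION is left out because
   DAFS does not support migration. The helper function names are
   hypothetical.

```python
# Assumed bit values (NFS Version 4 definitions); not stated by DAFS.
FH4_PERSISTENT = 0x0
FH4_NOEXPIRE_WITH_OPEN = 0x1
FH4_VOLATILE_ANY = 0x2
FH4_VOL_RENAME = 0x8


def is_valid_expire_type(mask: int) -> bool:
    # FH4_VOL_RENAME may only be set if FH4_VOLATILE_ANY is not set.
    both = FH4_VOLATILE_ANY | FH4_VOL_RENAME
    return (mask & both) != both


def may_expire(mask: int, file_is_open: bool) -> bool:
    """Return True if a filehandle of this type can expire right now."""
    if (mask & (FH4_VOLATILE_ANY | FH4_VOL_RENAME)) == 0:
        return False  # no volatile bit set: nothing expires the handle
    if file_is_open and (mask & FH4_NOEXPIRE_WITH_OPEN) != 0:
        return False  # volatile bits do not apply while the file is open
    return True
```

   A client could call may_expire() to decide whether it needs to
   retain path components for a given filehandle.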

           "4.2.4 One Method of Constructing a Volatile Filehandle

           As mentioned, in some instances a  filehandle  is  stale
           (no  longer  valid; perhaps because the file was removed
           from the server) or it is expired (the  underlying  file
           is  valid  but  since the filehandle is volatile, it may
           have expired).  Thus the server  needs  to  be  able  to
           return NFS4ERR_STALE in the former case and
           NFS4ERR_FHEXPIRED in the latter case. This can  be  done
           by careful construction of the volatile filehandle.  One
           possible implementation follows.


           A volatile filehandle, while opaque to the client  could
           contain:

           [volatile bit = 1 | server boot time | slot | generation
           number]

           * slot is an index in  the  server  volatile  filehandle
           table

           * generation number is the  generation  number  for  the
           table entry/slot

           If the server boot time is less than the current  server
           boot  time, return NFS4ERR_FHEXPIRED.  If slot is out of
           range,  return  NFS4ERR_BADHANDLE.   If  the  generation
           number does not match, return  NFS4ERR_FHEXPIRED.

           When the server reboots, the table is gone (it is  vola-
           tile).

           If volatile bit is 0, then it is a persistent filehandle
           with a different structure following it.

           4.3 Client Recovery From Filehandle Expiration

           If possible, the client SHOULD recover from the  receipt
           of  an NFS4ERR_FHEXPIRED error.  The client must take on
           additional responsibility so that it may prepare  itself
           to recover from the expiration of a volatile filehandle.
           If the server returns persistent filehandles, the client
           does not need these additional steps.

           For volatile filehandles, most commonly the client  will
           need  to  store  the  component  names leading up to and
           including the file  system  object  in  question.   With
           these  names,  the  client  should be able to recover by
           finding a filehandle in the name  space  that  is  still
           available  or  by  starting  at the root of the server's
           file system name space.

           If the expired filehandle refers to an object  that  has
           been  removed from the file system, obviously the client
           will not be able to recover from the expired filehandle.

           It is also possible that the expired  filehandle  refers
           to  a  file  that  has  been  renamed.   If the file was
           renamed by another client, again it is possible that the
           original client will not be able to recover. However, in
           the case that the client itself is renaming the file and
           the  file is open, it is possible that the client may be
           able to recover.  The client can determine the new  path
           name based on the processing of the rename request.  The
           client can then regenerate the new filehandle  based  on
           the  new  path  name. The client could also use the com-
           pound operation mechanism to construct a set  of  opera-
           tions  like:


                   RENAME A B
                   LOOKUP B . . ." (RFC 3010, pp. 28-29)


   The DAFS protocol does not support the COMPOUND procedure. Instead, a
   client can issue the rename and lookup within a DAFS chain. The GETFH
   call that follows the lookup in the original example has been
   omitted from the quote, since DAFS neither supports nor requires it:
   DAFS_PROC_LOOKUP returns the filehandle for B.
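
   The volatile filehandle construction quoted above (one possible
   implementation, per RFC 3010) can be sketched as follows. The field
   widths, the VolatileFhTable class, and the use of plain strings for
   the error results are all assumptions made for illustration; the
   quoted text specifies only the order of the fields and the checks.

```python
import struct

# Hypothetical layout: 1-byte volatile flag, 8-byte boot time,
# 4-byte slot index, 4-byte generation number (network byte order).
FH_FORMAT = "!BQII"


class VolatileFhTable:
    def __init__(self, boot_time: int, slots: int):
        self.boot_time = boot_time
        self.generation = [0] * slots  # lost on reboot (volatile)

    def make(self, slot: int) -> bytes:
        return struct.pack(FH_FORMAT, 1, self.boot_time, slot,
                           self.generation[slot])

    def expire(self, slot: int) -> None:
        self.generation[slot] += 1  # invalidates handles already issued

    def check(self, fh: bytes) -> str:
        volatile, boot, slot, gen = struct.unpack(FH_FORMAT, fh)
        if volatile != 1:
            return "PERSISTENT"  # different structure; not handled here
        if boot < self.boot_time:
            return "FHEXPIRED"   # issued before the current boot
        if slot >= len(self.generation):
            return "BADHANDLE"
        if gen != self.generation[slot]:
            return "FHEXPIRED"
        return "OK"
```

   Detecting a removed object (the stale case) is outside this sketch;
   it depends on what the server records when an object is deleted.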

4.1.5.  Named Attributes

   The DAFS protocol supports three classes of file object attributes:
   mandatory, recommended, and named. Mandatory and recommended attri-
   butes are discussed in 6.1.5., "File Attributes" and 6.1.6., "File
   System Attributes". The named attributes model is borrowed from the
   NFS Version 4 specification and its description follows in the quote
   below:

           "These attributes are not supported by  direct  encoding
           in the NFS Version 4 protocol but are accessed by string
           names rather than numbers and correspond to an  uninter-
           preted  stream  of  bytes which are stored with the file
           system object.  The name space for these attributes  may
           be  accessed  by  using  the  OPENATTR  operation.   The
           OPENATTR operation returns a filehandle  for  a  virtual
           "attribute  directory"  and  further perusal of the name
           space may be done using READDIR and LOOKUP operations on
           this  filehandle.  Named attributes may then be examined
           or changed by normal READ and WRITE  and  CREATE  opera-
           tions  on  the  filehandles  returned  from  READDIR and
           LOOKUP.  Named attributes may have attributes.

           It is recommended that servers support  arbitrary  named
           attributes.   A  client should not depend on the ability
           to store any named attributes in the server's file  sys-
           tem.   If  a  server  does  support  named attributes, a
           client which is also able to handle them should be  able
           to  copy a file's data and meta-data with complete tran-
           sparency from one location to another; this would  imply
           that  names  allowed  for  regular directory entries are
           valid for named attribute names as well.

           Names of attributes will not be controlled by this docu-
           ment  or  other IETF standards track documents.  See the
           section 'IANA Considerations' for  further  discussion."
           (RFC 3010, p. 31)

   The portion of the "IANA Considerations" section that pertains to
   named attributes follows:

           "The NFS version 4 protocol provides for the association
           of  named  attributes to files.  The name space identif-
           iers for these attributes are defined as  string  names.
           The  protocol does not define the specific assignment of
           the name space for these file attributes;  the  applica-
           tion developer or system vendor is allowed to define the
           attribute, its semantics, and the associated name.  Even
           though  this  name  space  will not be specifically con-
           trolled  to   prevent   collisions,    the   application
           developer  or  system  vendor  is strongly encouraged to
           provide the name assignment and associated semantics for
           attributes  via an Informational RFC.  This will provide
           for interoperability   where  common  interests  exist."
           (RFC 3010, p. 174)

4.2.  Data Transfer Operations

4.2.1.  Send-Receive

4.2.1.1.   Inline Bulk Data Transfer

   A traditional send-receive model is provided for small transfers
   and for environments in which remote DMA operations are not desired.
   This is termed INLINE data transfer because the data is sent inline
   with the write request or read response.

   Typical I/O operations specified using DAFS INLINE operations are:

   o  DAFS_PROC_READ_INLINE

   o  DAFS_PROC_WRITE_INLINE

   o  DAFS_PROC_APPEND_INLINE

   Transport-level scatter/gather facilities can be used by applications
   to help avoid data copies. A negotiated padding of write request
   headers enables the server to use scatter/gather to receive
   WRITE_INLINE data directly into its buffers. DAFS provides an
   OPTIONAL Session attribute to govern the use of padding between the
   end of a DAFS message header and the data being transferred by the
   DAFS_PROC_WRITE_INLINE operation. The OPTIONAL Session attribute
   "inline_write_header_size" specifies the number of bytes used to pad
   inline write headers to a more convenient offset from the beginning
   of the message. The number of bytes specified by
   inline_write_header_size is the length of the DAFS message header,
   the write operation message header, and all padding bytes up to the
   start of the inline data.

   The DAFS client can write inline data with two different alignments.
   If the client is writing a data buffer that begins following the
   negotiated padding length, then the client sets the "padded_write"
   flag for that DAFS_PROC_WRITE_INLINE operation and includes the
   padding bytes. Otherwise, the client clears the "padded_write" flag
   and sends the data buffer with no padding so that it immediately
   follows the operation header.
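
   The two alignments can be sketched as a simple message layout
   function. This is an illustration only: the function name is
   hypothetical, the header encoding is not modeled (the flag itself
   would be carried inside the operation header), and the use of zero
   bytes for padding is an assumption.

```python
def build_write_inline(headers: bytes, data: bytes,
                       inline_write_header_size: int,
                       padded_write: bool) -> bytes:
    """Lay out one WRITE_INLINE message as a byte string.

    `headers` stands for the DAFS message header plus the write
    operation header; its encoding is outside this sketch.
    """
    if not padded_write:
        # Data immediately follows the operation header.
        return headers + data
    pad = inline_write_header_size - len(headers)
    if pad < 0:
        raise ValueError("headers exceed inline_write_header_size")
    # Headers plus padding occupy exactly inline_write_header_size
    # bytes, so the data begins at that offset in the message.
    return headers + bytes(pad) + data
```

   With a 40-byte header and a negotiated size of 64, a padded write
   carries 24 padding bytes and its data starts at offset 64.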

   Rationale: While some applications want to use the RDMA operations,
              other applications might not. Aligning the data portion of
              a data transport message provides a non-RDMA mechanism
              that can be used to effect zero-copy file read and write
              operations.

   Note: The mechanism is the scatter-gather capability used in
         conjunction with the padding of headers. A server DAT
         connection for the DAFS operational channel that is to receive
         file write operations would submit receive data transport
         operations describing a buffer with two (or more) memory
         chunks: one pointing to a header buffer of length
         "inline_write_header_size", and one (or more) pointing to a
         data buffer. When a client wants to perform inline writes of a
         large buffer or set of buffers, it can negotiate the
         inline_write_header_size option, pad the header to this size,
         and set the padded_write flag for such DAFS_PROC_WRITE_INLINE
         requests. This causes the data payload to land in the server's
         data buffer.

         The reason for inserting padding into the
         DAFS_PROC_WRITE_INLINE operation is to help the server create
         well-aligned data buffers. However, since these buffers are
         used to receive all requests, segmented buffers might introduce
         inconvenience for other unpadded requests. For this reason, the
         value chosen for inline_write_header_size might need to be at
         least as large as the value chosen for max_request_size so that
         unpadded requests fit within the first segment of the server's
         receive buffer.

         File read operations can use this technique if they are for
         synchronous reads for a single client. Because the
         DAFS_PROC_READ_INLINE has a fixed-length response header, the
         client can post a single receive for the request with one seg-
         ment identifying the response header and the next segment(s)
         identifying the user data buffer(s).
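
   The effect of a segmented receive on a padded write can be modeled
   in software as shown below. A real transport would fill the posted
   segments itself; the scatter_receive() function, the buffer sizes,
   and the message contents here are hypothetical stand-ins.

```python
def scatter_receive(message: bytes, segments: list) -> None:
    """Fill the receive segments in order, as a transport would."""
    offset = 0
    for seg in segments:
        chunk = message[offset:offset + len(seg)]
        seg[:len(chunk)] = chunk
        offset += len(chunk)
    if offset < len(message):
        raise ValueError("message larger than posted receive buffers")


HEADER_SIZE = 64                       # inline_write_header_size
header_buf = bytearray(HEADER_SIZE)    # first segment: padded header
data_buf = bytearray(4096)             # second segment: file data

# A padded write: 40 header bytes, 24 padding bytes, then the data.
padded = b"H" * 40 + bytes(24) + b"file data"
scatter_receive(padded, [header_buf, data_buf])
```

   Because the header is padded to exactly the size of the first
   segment, the payload begins at the start of data_buf and needs no
   further copy to be aligned.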

4.2.1.2.  Inline Append

   In addition to inline bulk data movement, the
   DAFS_PROC_APPEND_INLINE operation provides features specific to
   appending data to the end of an existing file. The DAFS append
   operations determine the current file size and write the data into
   the file as a single atomic step. This prevents concurrent appends
   by multiple clients from overwriting each other's data.
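
   The atomicity requirement can be illustrated with a toy server-side
   file. The AppendOnlyFile class is hypothetical, and a lock is only
   one way a server might make the size lookup and the write
   indivisible.

```python
import threading


class AppendOnlyFile:
    """Toy server-side file; a lock stands in for server atomicity."""

    def __init__(self):
        self._data = bytearray()
        self._lock = threading.Lock()

    def append(self, payload: bytes) -> int:
        # Determining the size and writing happen under one lock, so
        # no other append can observe the same end-of-file offset.
        with self._lock:
            offset = len(self._data)
            self._data += payload
            return offset

    def size(self) -> int:
        with self._lock:
            return len(self._data)
```

   Each call returns the offset at which its payload was placed, so two
   concurrent callers never receive the same offset.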

4.2.1.3.  Inline Meta-Data Transfer

   Most DAFS operations that do not include bulk data require only a
   small send and receive buffer size. However, there are a few opera-
   tions that include variable-sized fields that benefit from RDMA when
   the amount of data being transferred is large. By providing two vari-
   ants of these operations, DAFS reduces the buffer space requirement
   of the protocol by allowing the standard inline buffers to be smaller
   (large transfers can use the RDMA-based operation variant). These
   operations are called INLINE operations because the data is sent
   inline with the DAFS message header. A number of traditional func-
   tions have been implemented using DAFS INLINE operations:

   o  DAFS_PROC_GETATTR_INLINE

   o  DAFS_PROC_READDIR_INLINE

   o  DAFS_PROC_READLINK_INLINE

   o  DAFS_PROC_SETATTR_INLINE

4.2.2.  RDMA Transfers

4.2.2.1.  Memory Registration

   The DAT requires that host memory accessed by the transport Channel
   Adapter for RDMA operations be registered with the Channel Adapter
   before it is used. DAFS does not specify how that memory registration
   is done.


4.2.2.2.  Direct Bulk Data Transfer

   Transport-level RDMA features are exported to DAFS users via DIRECT
   versions of read and write. These operations pass RMR Contexts and
   RMR Target Addresses in their request messages rather than INLINE
   data. It is the responsibility of the DAFS user to manage the memory
   registrations appropriately.

   DAFS defines the following RDMA-based data transfer operations:

   o  DAFS_PROC_READ_DIRECT

   o  DAFS_PROC_WRITE_DIRECT

   o  DAFS_PROC_APPEND_DIRECT

   The underlying transport-level RDMA I/O operations to support these
   DAFS requests are issued by the server. An application file read
   translates into a send of a DAFS_PROC_READ_DIRECT request to the
   server. The server performs an RDMA write of the data requested
   directly to the client's indicated buffer, and then follows it with a
   send of a response message.

   A file write operation translates into a send of a
   DAFS_PROC_WRITE_DIRECT request to the server containing the DAT RMR
   Context and RMR Target Addresses of the client's buffer. The server
   then performs an RDMA read to read that buffer's contents into its
   chosen destination buffer. The server ends the operation with a send
   of a response message to the client.
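
   The direction of the RDMA operations in the two flows above can be
   modeled as follows. This is a software simulation under stated
   assumptions: ClientMemory stands in for registered memory, an
   integer key stands in for an RMR Context, a byte offset stands in
   for an RMR Target Address, and the response is a plain dictionary.
   None of these names come from the DAFS or DAT interfaces.

```python
class ClientMemory:
    """Stands in for buffers registered for RDMA (hypothetical)."""

    def __init__(self):
        self.regions = {}  # RMR context -> bytearray

    def register(self, context: int, size: int) -> bytearray:
        self.regions[context] = bytearray(size)
        return self.regions[context]


class Server:
    def __init__(self, client_memory: ClientMemory, file_data: bytes):
        self.mem = client_memory
        self.file = bytearray(file_data)

    def read_direct(self, context, target, offset, length) -> dict:
        data = self.file[offset:offset + length]
        # "RDMA write": the server places data in the client's buffer...
        self.mem.regions[context][target:target + len(data)] = data
        # ...and then sends the response message.
        return {"status": "OK", "count": len(data)}

    def write_direct(self, context, target, offset, length) -> dict:
        # "RDMA read": the server pulls data from the client's buffer.
        data = self.mem.regions[context][target:target + length]
        self.file[offset:offset + length] = data
        return {"status": "OK", "count": len(data)}
```

   In both directions the client only sends a request and receives a
   response; every data movement is initiated by the server.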

4.2.2.3.  Direct Meta-Data Transfer

   For operations that transfer variable-length data fields that can be
   significantly larger than the base message size, DAFS includes a
   DIRECT variant of the operation. The data is sent via an RDMA opera-
   tion, separate from the DAFS message header. These DAFS DIRECT opera-
   tion variants are:

   o  DAFS_PROC_GETATTR_DIRECT

   o  DAFS_PROC_READDIR_DIRECT

   o  DAFS_PROC_READLINK_DIRECT

   o  DAFS_PROC_SETATTR_DIRECT
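
   A client might select between the INLINE and DIRECT variants of a
   meta-data operation with a simple size test, as sketched below. The
   protocol does not mandate this policy; the function name and the
   inline_capacity parameter are hypothetical, and for operations
   whose result size is not known in advance the client would have to
   supply an estimate.

```python
def choose_variant(operation: str, payload_len: int,
                   inline_capacity: int) -> str:
    """Pick the INLINE or DIRECT variant of a meta-data operation.

    inline_capacity is the number of payload bytes that fit in the
    standard message buffer (a hypothetical, negotiated value).
    """
    base = "DAFS_PROC_" + operation.upper()
    if payload_len <= inline_capacity:
        return base + "_INLINE"
    return base + "_DIRECT"
```

   For example, a directory listing expected to exceed the inline
   buffer would be requested with DAFS_PROC_READDIR_DIRECT.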


4.2.3.  Batch I/O Operations

   Most DAFS requests are received by the server and acted upon
   immediately. When the request is complete, the server returns the
   results to the client. Batch I/O operations introduce a new model of
   interaction between the client and server. In this model, the client
   requests one or more I/O operations and informs the server that both
   the reads or writes themselves and the notification of their
   completion can be performed asynchronously to the request. The server
   can take advantage of this asynchronous batch processing to optimize
   both the completion of the RDMA operations and the commitment of user
   data to stable storage.

   The results of the I/O operations are sent to the client as a
   request-response callback on the Back-control Channel of the DAFS
   Session. The batch I/O completion message can contain the results of
   one or more previously issued batch I/O requests.

   Note: The client is responsible for the synchronization of batch I/O
         operations with other operations on the file. Batch I/Os
         interact with DAFS locks in the same manner that synchronous
         I/Os do.

   The batch time window argument provides a hint to the server about
   the client's throughput requirements so that the server can optimize
   I/O gathering mechanisms to better support the throughput require-
   ment. Although the batch I/O operation does not provide a guarantee
   that the write operation will be completed within the batch window,
   the server is expected to give batch requests that have reached the
   window the same priority as a normal synchronous I/O operation.

   Batch I/Os give a client the ability to go beyond the synchronous
   request-response model provided by the normal DAFS flow control
   mechanism. It is therefore possible for a client to overwhelm a
   server with asynchronous batch I/O requests. To protect itself, a
   server can either perform an asynchronous batch I/O in the standard
   synchronous fashion or return an error (such as EWHOACOWBOY) to
   notify the client that the server is congested and that the client
   needs to slow down its generation of batch requests. The EWHOACOWBOY
   error instructs the client to refrain from posting additional batch
   I/O requests until it has received a batch write completion message.
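
   The client-side reaction to a congestion error can be sketched as a
   small throttle. The BatchClient class, the submit callable, and the
   use of status strings are hypothetical; only the rule that the
   client holds further batch requests until a completion message
   arrives comes from the text above.

```python
class BatchClient:
    """Hypothetical client-side throttle for batch I/O requests."""

    def __init__(self, submit):
        self._submit = submit    # callable(request) -> status string
        self._held = []          # requests waiting for a completion
        self._congested = False

    def post(self, request) -> None:
        if self._congested:
            self._held.append(request)
            return
        if self._submit(request) == "EWHOACOWBOY":
            self._congested = True
            self._held.append(request)

    def on_batch_completion(self) -> None:
        # A completion message arrived: resume posting held requests.
        self._congested = False
        held, self._held = self._held, []
        for request in held:
            self.post(request)
```

   If the server is still congested when held requests are reposted,
   the first rejection sets the flag again and the remainder are held
   for the next completion message.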

   Rationale: The batch write operation provides a high-bandwidth
              mechanism for transferring a large batch of I/O requests
              where the application's latency requirement is based on
              completion of the entire batch rather than completion of
              each individual request. By batching request completions
              within completion notification messages, the mechanism
              supports high-bandwidth streams of I/O requests.

4.2.4.  Server Caching Hints

   DAFS cache hints provide a way to supply information to the server
   regarding which file data the client would like the server to cache
   on writeback, and which file data the client would like the server to
   prefetch into the server's cache. The hints include two types of
   information.

   First, the client can supply information about its predicted access
   pattern for the file, if it is known. This general hint informs the
   server's caching policies for the file. Second, specific byte range
   cache "weighting" hints are provided, indicating predictions about
   the client's intentions regarding future read and write file access
   to the byte range.

   DAFS allows the client to send cache hints to the server as part of
   normal read, write, and append requests, or in separate cache hint
   messages.

   Rationale: The purpose of cache hints is to convey information the
              application knows about future data use to the server.
              This should help the server to intelligently schedule its
              I/Os and let the application treat the server's memory as
              a second level cache. The assumption is that the server
              ages data in its cache. Hints are intended to aid the
              assignment of weights for server cache management. Hints
              can be provided with any read/write request, or a separate
              request can be made to update a cache hint. Cache hint
              requests can also be part of a batch I/O request.

              The goal for the cache hints is to provide the server with
              all possible information so that it can maximize its cach-
              ing efficiency. However, hints can always be safely
              ignored (at a possible performance penalty), and hints are
              not persistent across server failures. The fact that these
              are only hints allows the server to benefit if it can,
              without additional resource commitment. The DAFS protocol
              does not dictate what actions a server SHOULD take upon
              reception of cache hints or even that the server needs to
              take any actions.


4.3.  Request Chaining

   To enable the server to process multiple dependent
   requests without incurring a round-trip delay between each request,
   the DAFS protocol implements request chaining. The motivation for
   chaining is similar to that for the COMPOUND feature of NFS Version
   4. Even with the relatively low communication latencies expected in
   the DAFS environment, there are considerable benefits from pipelining
   multiple requests so that multiple dependent requests do not incur a
   latency equal to the sum of the queuing, round-trip, and processing
   latencies for each of the requests.

   Chaining differs from NFS Version 4 COMPOUND in that each dependent
   request continues to retain its own separate identity (for flow-
   control and Response Cache purposes). This enables better utilization
   of memory by providing tighter bounds on request and response buffer
   sizes and limits the amount of information being stored in the
   Response Cache.

   Chaining is defined for requests made by the client on the Operation
   Channel. It cannot be used for requests made by the server on the
   Back-control Channel.

   When multiple requests are chained, the server MUST execute them in
   the order they were sent by the client. A request cannot be started
   until the previous request has completed. If a previous dependent
   request encountered an error, subsequent requests are aborted with a
   DAFSERR_CHAIN_BROKEN error. The client can use chaining flags,
   described in 4.3.2., "Chaining Flags", to specify that certain infor-
   mation be passed between requests in a chain.
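
   The ordering and abort rules above can be sketched as a server-side
   loop. This is a minimal sketch: the execute_chain function, the
   (name, operation) pair layout, and the "OK" and "ERROR" status
   strings are assumptions made for illustration. Only
   DAFSERR_CHAIN_BROKEN comes from this document.

```python
def execute_chain(requests):
    """Run chained requests in order; abort the rest after an error.

    requests: list of (name, operation) pairs, where operation is a
    zero-argument callable returning a status string. The pair layout
    and the "OK" status string are assumptions of this sketch.
    """
    results = []
    broken = False
    for name, operation in requests:
        if broken:
            # A previous dependent request failed: do not execute this one.
            results.append((name, "DAFSERR_CHAIN_BROKEN"))
            continue
        status = operation()   # finishes before the next request starts
        results.append((name, status))
        if status != "OK":
            broken = True
    return results
```

   For example, if the second of three chained requests fails, the
   third is reported as DAFSERR_CHAIN_BROKEN and is never executed.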

4.3.1.  Chaining Restrictions

   For the purposes of chaining (and for the Response Cache as discussed
   in 5.2.1., "Response Cache"), all requests are divided into five
   categories:

   o  Special requests

   o  Bulk fetch requests

   o  Simple requests

   o  FS-state-modifying requests

   o  Lock-state-modifying requests.

   Special requests are used in the setup and maintenance of DAFS Ses-
   sions. Such requests cannot be chained to any other request. The spe-
   cial operations are:

   o  DAFS_PROC_CLIENT_CONNECT

   o  DAFS_PROC_CLIENT_AUTH

   o  DAFS_PROC_SERVER_AUTH

   o  DAFS_PROC_CLIENT_CONNECT_AUTH

   o  DAFS_PROC_CONNECT_BIND

   o  DAFS_PROC_DISCONNECT

   o  DAFS_PROC_SECINFO

   o  DAFS_PROC_REGISTER_CRED

   o  DAFS_PROC_RELEASE_CRED

   o  DAFS_PROC_GET_FENCING_LIST

   o  DAFS_PROC_SET_FENCING_LIST

   o  DAFS_PROC_DISCARD_RESPONSES.

   Bulk fetch requests retrieve a significant amount of file system
   data, but do not change the file system state (with the possible
   exception of file access times). Such requests can be chained, but
   they MUST NOT be followed by an fs-state-modifying request in the
   chain. This allows a request chain to be reissued after server
   failure without requiring the server to save very large amounts of
   response data in stable storage. The bulk fetch requests are:

   o  DAFS_PROC_GETATTR_DIRECT

   o  DAFS_PROC_GETATTR_INLINE

   o  DAFS_PROC_GET_FSATTR

   o  DAFS_PROC_READLINK_DIRECT

   o  DAFS_PROC_READLINK_INLINE

   o  DAFS_PROC_READ_DIRECT

   o  DAFS_PROC_READ_INLINE

   o  DAFS_PROC_READDIR_DIRECT

   o  DAFS_PROC_READDIR_INLINE.

   Simple requests do not modify file system data and return only a
   relatively small quantity of data. These can be chained together with
   other requests, but when they precede an fs-state-modifying request,
   they SHOULD be marked with a special flag in the request header that
   indicates to the server that they are to be saved in the Response
   Cache. This allows a request chain that includes simple requests fol-
   lowed by state-modifying requests to be completed properly after
   server failure. The simple requests are:

   o  DAFS_PROC_ACCESS

   o  DAFS_PROC_CACHE_HINT

   o  DAFS_PROC_CHECK_RESPONSE

   o  DAFS_PROC_FETCH_RESPONSE

   o  DAFS_PROC_GET_ROOTHANDLE

   o  DAFS_PROC_LOOKUP

   o  DAFS_PROC_LOOKUPP

   o  DAFS_PROC_NVERIFY

   o  DAFS_PROC_NULL

   o  DAFS_PROC_OPENATTR

   o  DAFS_PROC_VERIFY.

   Fs-state-modifying requests modify file system data. These requests
   are always saved in the Response Cache (see 5.2.1., "Response
   Cache"). As mentioned previously, fs-state-modifying requests cannot
   appear in a chain following a bulk fetch request. If this occurs, the
   server SHOULD return DAFSERR_CHAIN_FORM when the fs-state-modifying
   request is encountered. Fs-state-modifying requests are:

   o  DAFS_PROC_APPEND_DIRECT

   o  DAFS_PROC_APPEND_INLINE

   o  DAFS_PROC_BATCH_SUBMIT

   o  DAFS_PROC_CLOSE

   o  DAFS_PROC_COMMIT

   o  DAFS_PROC_CREATE

   o  DAFS_PROC_DELEGPURGE

   o  DAFS_PROC_DELEGRETURN

   o  DAFS_PROC_HURRY_UP

   o  DAFS_PROC_LINK

   o  DAFS_PROC_OPEN

   o  DAFS_PROC_REMOVE

   o  DAFS_PROC_RENAME

   o  DAFS_PROC_SETATTR_DIRECT

   o  DAFS_PROC_SETATTR_INLINE

   o  DAFS_PROC_WRITE_DIRECT

   o  DAFS_PROC_WRITE_INLINE.

   Lock-state-modifying requests modify volatile locking state on the
   server while making no change to stable file system state. These
   requests are also saved in the Response Cache. Unlike fs-state-
   modifying requests, lock-state-modifying requests can follow bulk
   fetch requests in the same chain. Note that DAFS_PROC_OPEN, even
   though it primarily affects locking state, is an fs-state-modifying
   request because, with the create option, it has an effect on stable
   file system storage. In addition, due to the Delete-on-Last-Close
   semantics of DAFS_PROC_CLOSE, it is also classified as fs-state-
   modifying. The lock-state-modifying requests are:

   o  DAFS_PROC_OPEN_DOWNGRADE

   o  DAFS_PROC_LOCK

   o  DAFS_PROC_LOCKT

   o  DAFS_PROC_LOCKU.
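
   The five categories and the two ordering restrictions can be
   expressed as a lookup table and a chain check. The procedure names
   are taken from the lists above; the category and check_chain
   functions and the category label strings are hypothetical helpers
   introduced for this sketch.

```python
PREFIX = "DAFS_PROC_"

_GROUPS = {
    "special": """CLIENT_CONNECT CLIENT_AUTH SERVER_AUTH
        CLIENT_CONNECT_AUTH CONNECT_BIND DISCONNECT SECINFO
        REGISTER_CRED RELEASE_CRED GET_FENCING_LIST SET_FENCING_LIST
        DISCARD_RESPONSES""",
    "bulk-fetch": """GETATTR_DIRECT GETATTR_INLINE GET_FSATTR
        READLINK_DIRECT READLINK_INLINE READ_DIRECT READ_INLINE
        READDIR_DIRECT READDIR_INLINE""",
    "simple": """ACCESS CACHE_HINT CHECK_RESPONSE FETCH_RESPONSE
        GET_ROOTHANDLE LOOKUP LOOKUPP NVERIFY NULL OPENATTR VERIFY""",
    "fs-state-modifying": """APPEND_DIRECT APPEND_INLINE BATCH_SUBMIT
        CLOSE COMMIT CREATE DELEGPURGE DELEGRETURN HURRY_UP LINK OPEN
        REMOVE RENAME SETATTR_DIRECT SETATTR_INLINE WRITE_DIRECT
        WRITE_INLINE""",
    "lock-state-modifying": "OPEN_DOWNGRADE LOCK LOCKT LOCKU",
}

# Map each full procedure name to its category label.
CATEGORY = {
    PREFIX + name: label
    for label, names in _GROUPS.items()
    for name in names.split()
}


def category(proc):
    """Return the chaining category of a procedure name."""
    return CATEGORY[proc]


def check_chain(procs):
    """Return the index of the first request that violates the
    chaining restrictions, or None if the chain is well formed."""
    seen_bulk_fetch = False
    for i, proc in enumerate(procs):
        cat = category(proc)
        if cat == "special" and len(procs) > 1:
            return i            # special requests cannot be chained
        if cat == "bulk-fetch":
            seen_bulk_fetch = True
        elif cat == "fs-state-modifying" and seen_bulk_fetch:
            return i            # fs-state change after a bulk fetch
    return None
```

   A server following this sketch would report DAFSERR_CHAIN_FORM at
   the returned index. A lock, read, unlock chain passes the check
   because lock-state-modifying requests can follow bulk fetch
   requests.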

   When a chain contains bulk fetch requests that are followed by
   lock-state-modifying requests, special care is necessary to recover
   from a Session disconnect. Consider a chain consisting of a lock
   request followed by a number of read requests, and a final unlock
   request. Such a chain can be used to provide an atomic fetch of the
   contents
   of two noncontiguous regions of a file, without the possibility of an
   update occurring between the read requests. In the event of discon-
   nection, if the Response Cache shows that the unlock request has not
   executed, then a partial chain can be issued, starting at the first
   read request for which no response was received in the old Session.
   On the other hand, if the unlock request does appear as executed in
   the Response Cache, the chain would have to be reissued in its
   entirety, because there could be a read for which a response was not
   received in the old Session. Because the read requests are not avail-
   able in the Response Cache, if a response was not previously
   received, then reissuing those read requests is the only option. A
   consistent set of data could only be obtained by reissuing the chain
   as a whole.
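
   The recovery decision for the lock, read, unlock chain can be
   sketched as follows. The function name, the argument layout, and
   the request labels are assumptions for illustration. Returning
   nothing to reissue when every response already arrived is also an
   assumption, because the text above does not address that case.

```python
def recover_atomic_read_chain(chain, replies, unlock_in_response_cache):
    """Pick the requests to reissue after a Session disconnect.

    chain: request labels in order, e.g. ["LOCK", "READ a", "LOCKU"].
    replies: set of labels whose responses arrived in the old Session.
    unlock_in_response_cache: True if the Response Cache shows that
        the final unlock request executed.
    """
    missing = [label for label in chain if label not in replies]
    if not missing:
        return []               # assumed: nothing left to recover
    if unlock_in_response_cache:
        # The lock was released and read responses are not kept in the
        # Response Cache, so only a full reissue gives consistent data.
        return list(chain)
    # The unlock has not executed: resume at the first unanswered request.
    first = chain.index(missing[0])
    return list(chain[first:])
```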

4.3.2.  Chaining Flags

   Chaining is specified in the chaining flags field of the request
   header.  This field is present for all requests, even those for which
   chaining cannot be done.

   The chaining flags are defined as follows:

   DAFS_CHF_FORW   (0x01)

      Indicates that there is a subsequent dependent request chained to
      this one. If this flag is not set, this is the last or only
      request in the current chain. If this flag is set and the current
      request is a special request, a DAFSERR_CHAIN_FORM error results.

   DAFS_CHF_BACK   (0x02)

      Indicates that there is a previous dependent request in this chain
      that was sent immediately preceding this one. If this flag is not
      set, this is the first or only request in the current chain. If
      the previous request specified DAFS_CHF_FORW and the current one
      does not have DAFS_CHF_BACK, a DAFSERR_CHAIN_FORM error is
      returned. This is also true in the converse case: the current
      request specifies DAFS_CHF_BACK and the previous request sent does
      not have DAFS_CHF_FORW set.

   DAFS_CHF_SAVE   (0x04)

      Indicates that the current request is a simple request that SHOULD
      be saved in the Response Cache because a state-modifying request
      follows it in the current chain. If the current request is not a
      simple request or if DAFS_CHF_FORW is not set, a
      DAFSERR_CHAIN_FORM error results.

   DAFS_CHF_FH   (0x08)

      Indicates that the filehandle for the current operation is to be
      the filehandle used for or generated by the previous operation. In
      this case, the filehandle specified in the request header is
      ignored. If DAFS_CHF_BACK is not set on this request, a
      DAFSERR_CHAIN_FORM error results.

   DAFS_CHF_STATEID   (0x10)

      Indicates that the State-id for the current operation is to be
      taken from the one used for or generated by the previous operation
      within the chain that used or generated a State-id, in preference
      to the one specified in the operation itself, which is then
      ignored. If DAFS_CHF_BACK is not set on this request, a
      DAFSERR_CHAIN_FORM error results.
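
   The flag values and the form checks described above can be sketched
   as follows. The flag values and the DAFSERR_CHAIN_FORM name come
   from this section; the ChainFlags class, the check_flags function,
   the category strings, and the "OK" result are assumptions made for
   illustration.

```python
from enum import IntFlag


class ChainFlags(IntFlag):
    FORW = 0x01      # DAFS_CHF_FORW: a dependent request follows
    BACK = 0x02      # DAFS_CHF_BACK: a dependent request precedes
    SAVE = 0x04      # DAFS_CHF_SAVE: save simple request in Response Cache
    FH = 0x08        # DAFS_CHF_FH: take filehandle from previous operation
    STATEID = 0x10   # DAFS_CHF_STATEID: take State-id from earlier operation


def check_flags(prev_flags, flags, category):
    """Validate one request's chaining flags.

    prev_flags: flags of the request sent immediately before this one,
        or None if there was none.
    category: request category string, e.g. "special" or "simple".
    """
    prev_forw = prev_flags is not None and bool(prev_flags & ChainFlags.FORW)
    has_forw = bool(flags & ChainFlags.FORW)
    has_back = bool(flags & ChainFlags.BACK)

    if has_forw and category == "special":
        return "DAFSERR_CHAIN_FORM"
    if prev_forw != has_back:
        # FORW on the previous request and BACK on this one must agree.
        return "DAFSERR_CHAIN_FORM"
    if (flags & ChainFlags.SAVE) and (category != "simple" or not has_forw):
        return "DAFSERR_CHAIN_FORM"
    if (flags & (ChainFlags.FH | ChainFlags.STATEID)) and not has_back:
        return "DAFSERR_CHAIN_FORM"
    return "OK"
```

   For example, a LOOKUP sent with FORW and SAVE set, followed by a
   WRITE sent with BACK and FH set, passes both checks.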

4.3.3.  Chaining and Flow Control

   For flow-control purposes, each request within a chain is considered
   separately. Thus, because of flow-control restrictions, a client
   issuing a chain of requests might be unable to issue all the requests
   within the chain without waiting for some to finish. In the worst
   case, when OPNreq equals one, the client would wait for each request
   to finish before issuing a subsequent one, vitiating the latency
   reduction benefits of chaining.

   If a client chooses to use chaining in a situation in which flow con-
   trol prevents all the requests from being issued immediately, the
   client MUST ensure that requests that are not intended to be part of
   the current request chain are not issued concurrently. For example,
   if a Session is multiplexed among multiple threads, a chain of
   requests from one thread MUST NOT be interspersed with a request or a
   chain of requests from a second thread. If an unrelated request is
   issued while an uncompleted chain exists, this will generally result
   in a DAFSERR_CHAIN_FORM error.
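
   One way for a multithreaded client to meet this requirement is to
   hold a per-Session lock for the whole time a chain is being issued.
   The Session class and its issue_chain method below are hypothetical,
   and the wire list merely stands in for the transport.

```python
import threading


class Session:
    """Hypothetical Session shared by several client threads."""

    def __init__(self):
        self._chain_lock = threading.Lock()
        self.wire = []   # requests in the order they were issued

    def issue_chain(self, requests):
        """Issue every request of one chain without interleaving."""
        with self._chain_lock:
            for request in requests:
                # A real client might block here waiting for flow-control
                # permission; the lock stays held, so no request from
                # another thread can slip into the middle of the chain.
                self.wire.append(request)
```

   An unchained request would be issued through the same method as a
   chain of length one, so that it also waits for any uncompleted
   chain.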

   A server can perform internal batching of chained responses (for
   example, to optimize CPU resources) by waiting for the end of the
   chain to trigger action. When doing so, the server SHOULD ensure that
   it does not cause a flow-control-constrained client using chaining to
   wait for an unduly long time, or forever. The server SHOULD NOT
   wait for an additional request when existing flow-control restric-
   tions would prevent a client from sending that request.

4.3.4.  Chaining and Recovery

   In the event of disconnection, server reboot, or server failover, the
   client SHOULD recover cleanly so that the results of all state-
   modifying operations are correctly retrieved, the operations that
   were not executed before the failure are properly retried, and no
   state-modifying operation is erroneously performed more than once. In
   the case of disconnection and reconnection without server failure,
   the server MAY provide Response Cache information that enables the
   client to do this successfully. In addition, the server can maintain
   sufficient state within stable storage to enable the client to
   reconnect and recover when a server failure occurs.

   When disconnection occurs, chaining-related information is immedi-
   ately forgotten. Individual requests within the chain are individu-
   ally recorded. State-fetching operations are not recorded, because
   they can be safely reissued. State-modifying operations are recorded
   in a Response Cache (optionally on stable storage). Traversal opera-
   tions that preceded state-modifying operations within a chain are
   also saved in the Response Cache because they have been specially
   marked to indicate this condition using the DAFS_CHF_SAVE flag.

   When request chaining is in effect, the client might have received
   responses for an initial subset of the requests in a chain. For those
   requests within the chain that have been issued, but for which a
   response has not been received, the client might need to determine
   which requests have in fact been executed. There are two cases to
   consider:

   o  If none of the unreplied-to requests are state-modifying opera-
      tions, then all these requests can be reissued. A new chain SHOULD
      be issued that includes only the unreplied-to requests. In some
      cases, information that in the original chain was passed from a
      previous operation will be explicitly entered into the parameters
      for the new chain. This information is available, because all such
      information is available either from the original requests or the
      earlier responses for requests that have completed.

   o  If some of the unreplied-to requests are state-modifying requests,
      then all the preceding requests are available in the Response
      Cache if they have been executed, because they are either state-
      modifying requests or simple requests marked specially to be
      recorded in the Response Cache. Thus the client can determine,
      with the server's help, exactly which requests need to be reis-
      sued. As in the previous case, information that was passed from a
      previous operation in the chain might need to be explicitly
      entered into parameters for the new chain. This information is
      available, because the response for all previous requests is
      available from the original response or from the Response Cache.
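
   The two cases above can be sketched as a single client-side
   decision. The function name, the (request id, category) pairs, and
   the representation of the Response Cache as a set of executed
   request ids are assumptions. The sketch does not handle the bulk
   fetch before unlock situation discussed in 4.3.1., "Chaining
   Restrictions".

```python
STATE_MODIFYING = {"fs-state-modifying", "lock-state-modifying"}


def requests_to_reissue(unreplied, response_cache):
    """Return the ids of the requests the client needs to reissue.

    unreplied: list of (request_id, category) pairs, in chain order,
        for which no response was received.
    response_cache: set of request ids that the server reports as
        executed and recorded.
    """
    if not any(cat in STATE_MODIFYING for _, cat in unreplied):
        # Case 1: no state-modifying requests, so all can be reissued.
        return [rid for rid, _ in unreplied]
    # Case 2: executed requests are found in the Response Cache, so
    # only the ones that are absent from it are reissued.
    return [rid for rid, _ in unreplied if rid not in response_cache]
```

   In the second case the client would fetch the responses of the
   recorded requests from the Response Cache instead of executing
   those requests again.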


4.4.  Locking and Access Control

4.4.1.  Locking

   DAFS locking extends NFS locking with two new capabilities: PERSIST
   locks and AUTORECOVER locks. PERSIST locks survive client and server
   failures and become broken rather than revoked. AUTORECOVER locks
   have rollback semantics associated with them.

4.4.1.1.  DAFS/NFS locking differences

   DAFS locking is based on the NFS Version 4 locking model. This sec-
   tion presents the major differences between the DAFS and NFS Version
   4 file locking semantics.

   o  Client-id Management

      The NFS Version 4 protocol has two operations to set the client-
      id: SETCLIENTID and SETCLIENTID_CONFIRM. In DAFS, the client-ids
      are established when a connection between the client and the
      server is established by DAFS_PROC_CLIENT_CONNECT or
      DAFS_PROC_CLIENT_CONNECT_AUTH. Given the requirements for a reli-
      able transport layer, there is no need for a confirmation step
      when a client connects to the server and establishes a client-id.

      DAFS servers associate a DAFS session with the client-id generated
      when the session was connected. The DAFS locking requests do not
      explicitly include the client-id in the arguments as this informa-
      tion can be obtained from the session that received the request.

      In the event that a DAFS client receives a DAFSERR_STALE_CLIENTID
      or DAFSERR_STALE_STATEID, it obtains a new client-id by reconnect-
      ing to the DAFS server. These errors will most likely occur when
      the client has failed to renew its leases and the server has freed
      the client's locking state.

   o  State-id Management

      The NFS Version 4 protocol requires that the state-id be changed
      whenever the locking state it represents on the server changes.
      NFS operations such as LOCK and LOCKU return state-ids since these
      operations modify the lockowner's locking state on the server.

      DAFS servers set up locking state when a lockowner opens a file
      with the DAFS_PROC_OPEN procedure. Once established, state-ids do
      not change when locking state changes. Therefore, lock-state-modi-
      fying operations like DAFS_PROC_LOCK and DAFS_PROC_LOCKU do not
      return state-ids.

   o  Leases and their renewal

      NFS Version 4 servers use leases to detect clients that have
      crashed. A server is allowed to free locking state from a client
      with expired leases. DAFS servers employ client leases in a simi-
      lar manner.

      Leases are renewed by an NFS client when it issues a procedure
      with a valid state id, such as LOCK, LOCKU, RENEW or WRITE, among
      others. DAFS clients, on the other hand, renew leases by issuing
      any DAFS procedure, including DAFS_PROC_NULL. The DAFS protocol
      does not require a special RENEW lease procedure like the NFS Ver-
      sion 4 protocol does. The DAFS server is able to renew leases when
      a DAFS request is received because the session-based communication
      model allows the server to quickly identify the client that origi-
      nated the request.

      Note that the DAFS lease renewal mechanism has low overhead. An
      active client does not need to issue special requests to renew
      leases as the normal DAFS request traffic implicitly renews the
      client's leases. An inactive client need only send a
      DAFS_PROC_NULL procedure every lease expiration period to renew
      all its leases.

   o  Share Reservations

      DAFS locking supports share reservations as described in the NFS
      Version 4 specification. In addition, DAFS introduces the concept
      of Shared Key Reservations. See 4.4.2., "Shared Key Reservations"
      for a description of shared keys and how they relate to share
      reservations.

   o  Failure Recovery

      A DAFS client identifies itself when connecting to the server
      using a client id string and a verifier. When a client re-
      establishes a connection after a client failure, it uses a new
      verifier. The server releases any locking state it holds for a
      client whose verifier has changed. Unlike the NFS Version 4 proto-
      col, a change in verifier in a DAFS connect request always results
      in the release of all locking state associated with the client
      represented by the client-id-string in the connect arguments.

      See 4.4.1.5., "Client Failure and Recovery" and 4.4.1.6., "Server
      Failure and Recovery" for further discussion of DAFS locking state
      recovery after client and server crashes.

   o  Migration and Replication

      The DAFS protocol does not support migration or replication of
      file systems. The portion of the NFS V4 specification that
      describes locking functionality when files are replicated or
      migrated is omitted from the quoted text below.
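
   The client-id, lease renewal, and failure recovery differences
   listed above can be sketched from the server's side. The LockServer
   class, its fields, the explicit now argument, and keying the table
   by the client id string (rather than by Session) are simplifying
   assumptions. Only the error name DAFSERR_STALE_CLIENTID is taken
   from this document.

```python
class LockServer:
    """Hypothetical server-side table of clients, leases, and locks."""

    def __init__(self, lease_period):
        self.lease_period = lease_period
        self.clients = {}   # client id string -> state dictionary

    def connect(self, client_id_string, verifier, now):
        """Establish the client-id at connect time, with no confirm step."""
        entry = self.clients.get(client_id_string)
        if entry is None or entry["verifier"] != verifier:
            # New client, or a changed verifier after a client failure:
            # all locking state held for this client is released.
            entry = {"verifier": verifier, "locks": set()}
            self.clients[client_id_string] = entry
        entry["expires"] = now + self.lease_period

    def request(self, client_id_string, now, lock=None):
        """Handle any request; every request renews the client's leases."""
        entry = self.clients.get(client_id_string)
        if entry is None or now > entry["expires"]:
            # Leases expired: the server may free the locking state,
            # and the client has to reconnect for a new client-id.
            self.clients.pop(client_id_string, None)
            return "DAFSERR_STALE_CLIENTID"
        entry["expires"] = now + self.lease_period
        if lock is not None:
            entry["locks"].add(lock)
        return "OK"
```

   In this sketch an idle client keeps its locks by calling request
   with no lock argument, the analogue of DAFS_PROC_NULL, at least
   once per lease period.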

4.4.1.2.  NFS Version 4 Locking

   Chapter 8 of RFC 3010 describes NFS Version 4 locking, and portions
   of that chapter are included below. The quoted text applies to DAFS
   locking except as noted in 4.4.1.1., "DAFS/NFS locking differences".

           "8.  File Locking and Share Reservations

           Integrating locking into the NFS [DAFS] protocol  neces-
           sarily  causes  it to be state-full.  With the inclusion
           of 'share' file locks the protocol becomes substantially
           more dependent on state than the traditional combination
           of NFS and NLM [XNFS]. There  are  three  components  to
           making this state manageable:

           o  Clear division between client and server

           o  Ability to reliably detect inconsistency in state
           between client and server

           o  Simple and robust recovery mechanisms

           In this model, the server owns  the  state  information.
           The  client  communicates  its view of this state to the
           server as needed.  The client is  also  able  to  detect
           inconsistent state before modifying a file.

           To support Win32 'share' locks it is necessary to atomi-
           cally   OPEN   or   CREATE  files.   Having  a  separate
           share/unshare operation would not allow  correct  imple-
           mentation  of  the  Win32  OpenFile  API.   In  order to
           correctly implement share semantics,  the  previous  NFS
           protocol  mechanisms  used  when  a  file  is  opened or
           created (LOOKUP, CREATE, ACCESS) need  to  be  replaced.
           The   NFS   version   4  [DAFS]  protocol  has  an  OPEN
           [DAFS_PROC_OPEN] operation that subsumes the functional-
           ity  of  LOOKUP,  CREATE,  and ACCESS.  However, because
           many operations require a  filehandle,  the  traditional
           LOOKUP  [DAFS_PROC_LOOKUP]  is  preserved  to map a file
           name to filehandle without  establishing  state  on  the
           server.   The  policy  of  granting  access or modifying
           files is managed by the server  based  on  the  client's
           state.   These  mechanisms  can implement policy ranging
           from advisory only locking to full mandatory locking.

           8.1.  Locking

           It is assumed that manipulating a lock is rare when com-
           pared  to READ and WRITE operations.  It is also assumed
           that crashes and network partitions are relatively rare.
           Therefore it is important that the READ and WRITE opera-
           tions have a lightweight mechanism to indicate  if  they
           possess a held lock.  A lock request contains the heavy-
           weight information required  to  establish  a  lock  and
           uniquely define the lock owner.

           The following sections describe the transition from  the
           heavy  weight  information  to the eventual stateid used
           for most client and server locking  and  lease  interac-
           tions.

           8.1.1.  Client ID

           For each LOCK request, the client must  identify  itself
           to the server.

           This is done in such a way as to allow for correct  lock
           identification  and  crash recovery.  Client identifica-
           tion is accomplished with two values.

           o  A verifier that is used to detect client reboots.

           o  A variable length opaque array to uniquely define a
           client.

              For an operating system this may be a fully qualified
              host name or IP address.  For a user level NFS [DAFS]
              client it may additionally contain a  process  id  or
              other unique sequence." (RFC 3010, pp. 51-52)

   The DAFS protocol defines a client_id structure. See 4.4.1.1.,
   "DAFS/NFS locking differences" for a description of DAFS client_id
   management.

           "It is  possible  through  the  mis-configuration  of  a
           client  or  the  existence  of  a  rogue client that two
           clients end up using the same nfs_client_id  [client-id-
           string]" (RFC 3010, p. 52)

   The DAFS protocol client_id negotiation is similar to NFS; however,
   there is no confirmation step. See 4.4.1.1., "DAFS/NFS locking
   differences" for a description of DAFS client_id management.

           "The following describes the two scenarios  of  negotia-
           tion.

           1  Client has never connected to the server

              In this case the client  generates  an  nfs_client_id
              [client-id-string] and unless another client  has  the
              same nfs_client_id.id [client-id-string]  field,  the
              server  accepts the request.  The server also records
              the principal (or principal to uid mapping) from  the
              credential  in  the  RPC  request  that  contains the
              nfs_client_id negotiation request (SETCLIENTID opera-
              tion) [DAFS Connect operation]." (RFC 3010, p. 52)

           "2  Client is re-connecting to the server after a client
           reboot

              In  this  case,  the  client   still   generates   an
              nfs_client_id     [client-id-string]     but      the
              nfs_client_id.id [client-id-string] field will be the
              same  as the nfs_client_id.id [client-id-string] gen-
              erated prior to reboot.  If the server finds that the
              principal/uid is equal to the previously 'registered'
              nfs_client_id.id [client-id-string], then locks asso-
              ciated   with   the   old  nfs_client_id  [client-id-
              string] are   immediately    released.     If     the
              principal/uid  is  not  equal,  then  this is a rogue
              client and the request is returned  in  error."  (RFC
              3010, pp. 52-53)

   Since DAFS has no message retransmissions, there is no need for a
   confirmation step during the re-connection.

           "In both cases, upon success,  NFS4_OK  [DAFS_STATUS_OK]
           is   returned.   To  help  reduce  the  amount  of  data
           transferred on OPEN  and  LOCK,  the  server  will  also
           return  a  unique 64-bit clientid value that is a short-
           hand reference to the  nfs_client_id  [client-id-string]
           values  presented  by  the client.  From this point for-
           ward, the client will  use  the  clientid  to  refer  to
           itself.

           The clientid assigned by the server should be chosen  so
           that   it  will  not conflict with a clientid previously
           assigned by the  server.   This  applies  across  server
           restarts  or reboots.  When a clientid is presented to a
           server and that clientid is  not  recognized,  as  would
           happen after a server reboot, the server will reject the
           request   with    the    error    NFS4ERR_STALE_CLIENTID
           [DAFSERR_STALE_CLIENTID].  When this happens, the client
           must obtain a new clientid by  use  of  the  SETCLIENTID
           [DAFS  Connect]  operation and then proceed to any other
           necessary recovery for the server reboot case  (See  the
           section 'Server Failure and Recovery').

           The client must also employ  the  SETCLIENTID  operation
           when     it     receives     a     NFS4ERR_STALE_STATEID
           [DAFSERR_STALE_STATEID] error using  a  stateid  derived
           from  its  current clientid, since this also indicates a
           server reboot which has invalidated the existing  clien-
           tid  (see  the  next  section 'nfs_lockowner and stateid
           Definition' for details).

           8.1.2.  Server Release of Clientid

           If the server determines that the client holds no  asso-
           ciated  state for its clientid, the server may choose to
           release the clientid. The server may  make  this  choice
           for  an  inactive  client so that resources are not con-
           sumed by those intermittently active  clients.   If  the
           client  contacts  the  server  after  this  release, the
           server must ensure the client receives  the  appropriate
           error      so      that      it     will     use     the
           SETCLIENTID/SETCLIENTID_CONFIRM sequence  [DAFS  Connect
           operation]  to  establish  a  new identity. It should be
           clear that the server must be very hesitant to release a
           clientid  since  the  resulting  work  on  the client to
           recover from such an event will be the same burden as if
           the server had failed and restarted.  Typically a server
           would not release a clientid unless there  had  been  no
           activity from that client for many minutes.

           8.1.3.  nfs_lockowner and stateid Definition

           When requesting a lock, the client must present  to  the
           server  the  clientid and an identifier for the owner of
           the requested lock. These two fields are  referred to as
           the  nfs_lockowner  [owner]  and the definition of those
           fields are:

           o  A clientid returned by the server as part of the
           client's use of the SETCLIENTID operation.

           o  A variable length opaque array used to uniquely
           define the owner of a lock managed by the client.

              This may be a thread id, process id, or other  unique
              value." (RFC 3010, p. 54)

   Because the DAT transport is required to deliver messages in order,
   DAFS does not need to maintain sequence information for lock state.
   Therefore, stateids are not returned by successful lock operations.
   However, DAFS does issue a stateid when a file is opened, to
   implement consistency between subsequent lock and I/O operations.

           "The stateid is used as a  shorthand  reference  to  the
           nfs_lockowner,  since the server will be maintaining the
           correspondence between them.

           The server is free to form the  stateid  in  any  manner
           that  it  chooses  as  long  as  it is able to recognize
           invalid  and  out-of-date  stateids.  This   requirement
           includes  those  stateids generated by earlier instances
           of the server.  From this, the client  can  be  properly
           notified  of  a  server restart.  This notification will
           occur when the client presents a stateid to  the  server
           from a previous instantiation.

           The server must be able  to  distinguish  the  following
           situations and return the error as specified:

           o  The stateid was generated by an earlier server
           instance (i.e. before a server reboot).  The error
           NFS4ERR_STALE_STATEID [DAFSERR_STALE_STATEID] should be
           returned.

           o  The stateid was generated by the current server
           instance but the stateid no longer designates the
           current locking state for the lockowner-file pair in
           question (i.e. one or more locking operations has
           occurred).  The error NFS4ERR_OLD_STATEID
           [DAFSERR_OLD_STATEID] should be returned.

              This error condition will only occur when the  client
              issues  a  locking  request  which  changes a stateid
              while an I/O request that uses that stateid  is  out-
              standing.

           o  The stateid was generated by the current server
           instance but the stateid does not designate a locking
           state for any active lockowner-file pair.  The error
           NFS4ERR_BAD_STATEID [DAFSERR_BAD_STATEID] should be
           returned.

              This error condition will occur when there has been a
              logic  error  on  the  part  of the client or server.
              This should not happen.

           One mechanism that may be used to satisfy these require-
           ments  is  for  the server to divide stateids into three
           fields:

           o  A server verifier which uniquely designates a partic-
           ular server instantiation.

           o  An index into a table of locking-state structures.

           o  A sequence value which is incremented for each
           stateid that is associated with the same index into the
           locking-state table.

           By matching the incoming stateid and  its  field  values
           with the state held at the server, the server is able to
           easily determine if a stateid is valid for  its  current
           instantiation  and  state.  If the stateid is not valid,
           the appropriate error can be supplied to the client.

           8.1.4.  Use of the stateid

           All READ and WRITE operations contain a stateid.  If the
           nfs_lockowner  performs  a  READ  or WRITE on a range of
           bytes within a locked  range,  the  stateid  (previously
           returned  by  the  server) must be used to indicate that
           the appropriate lock (record or share)  is  held."  (RFC
           3010, pp. 53-55)
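
   As an illustration of the three-field mechanism suggested in the
   quoted section 8.1.3, the following Python sketch classifies an
   incoming stateid.  The Stateid layout, the LockStateTable class,
   and the use of strings for results are assumptions made for this
   example only; neither RFC 3010 nor DAFS mandates them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stateid:
    verifier: int  # designates one server instantiation
    index: int     # slot in the locking-state table
    seq: int       # incremented for each stateid issued for that slot


class LockStateTable:
    """Hypothetical server table; only the fields needed for checking."""

    def __init__(self, verifier):
        self.verifier = verifier
        self.current_seq = {}  # index -> sequence value now in force

    def issue(self, index):
        # A new stateid for the same slot gets the next sequence value.
        seq = self.current_seq.get(index, 0) + 1
        self.current_seq[index] = seq
        return Stateid(self.verifier, index, seq)

    def release(self, index):
        # The slot no longer describes an active lockowner-file pair.
        self.current_seq.pop(index, None)

    def check(self, sid):
        if sid.verifier != self.verifier:
            return "DAFSERR_STALE_STATEID"  # earlier server instance
        current = self.current_seq.get(sid.index)
        if current is None or sid.seq > current:
            return "DAFSERR_BAD_STATEID"    # no such active state
        if sid.seq < current:
            return "DAFSERR_OLD_STATEID"    # superseded by a later one
        return "OK"
```

   Because each check is a field comparison plus one table lookup, a
   server built along these lines can pick among the three errors
   without searching its lock state.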

   DAFS defines the special stateid value of zero for use when issuing
   DAFS_PROC_SETATTR_INLINE and DAFS_PROC_SETATTR_DIRECT, for all opera-
   tions that do not change the file size.

           "An explicit lock may not be granted  while  a  READ  or
           WRITE  operation  with  conflicting  implicit locking is
           being performed." (RFC 3010, p. 55)

   DAFS does not define explicit lock sequencing because of the in-order
   and at-most-once requirements that it places on the DAT transport.

           "8.1.7.  Releasing nfs_lockowner State


           When a particular nfs_lockowner [owner] no longer  holds
           open or file locking state at the server, the server may
           choose to release the sequence number  state  associated
           with the nfs_lockowner.  The server may make this choice
           based on lease expiration, for the reclamation of server
           memory,  or  other  implementation specific details.  In
           any event, the server is able to  do  this  safely  only
           when  the nfs_lockowner [owner] no longer is being util-
           ized by the client.  The server may choose to  hold  the
           nfs_lockowner   [owner]   state   in   the   event  that
           retransmitted  requests  are  received.   However,   the
           period  to  hold this state is implementation specific."
           (RFC 3010, p. 57)

   DAFS does not define special handling for message retransmissions.

           "8.2.  Lock Ranges

           The protocol allows a lock owner to request a lock  with
           one  byte range and then either upgrade or unlock a sub-
           range of the initial lock.  It  is  expected  that  this
           will  be  an  uncommon  type  of  request.  In any case,
           servers or server file systems may not be able  to  sup-
           port  sub-range  lock  semantics.   In  the event that a
           server receives a  locking  request  that  represents  a
           sub-range  of  current locking state for the lock owner,
           the   server   is   allowed   to   return   the    error
           NFS4ERR_LOCK_RANGE  [DAFSERR_LOCK_RANGE] to signify that
           it does not support sub-range lock  operations.   There-
           fore,  the  client  should  be  prepared to receive this
           error and, if  appropriate,  report  the  error  to  the
           requesting application.

           The  client  is  discouraged  from  combining   multiple
           independent  locking  ranges  that happen to be adjacent
           into a single request since the server may  not  support
           sub-range  requests  and  for  reasons  related  to  the
           recovery of file locking state in the  event  of  server
           failure. As discussed in the section "Server Failure and
           Recovery" below, the server may employ certain optimiza-
           tions  during  recovery  that work effectively only when
           the client's behavior during lock recovery is similar to
           the client's locking behavior prior to server failure.
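
   The sub-range case described in the quoted section 8.2 can be
   sketched as follows.  The (offset, length) tuple representation
   and both function names are assumptions for illustration; a real
   server would consult its own lock records.

```python
def is_sub_range(request, held):
    """True if request lies inside held without being identical to it.

    Both arguments are (offset, length) pairs with length > 0.
    """
    req_start, req_end = request[0], request[0] + request[1]
    held_start, held_end = held[0], held[0] + held[1]
    inside = held_start <= req_start and req_end <= held_end
    return inside and (req_start, req_end) != (held_start, held_end)


def lock_or_unlock(request, held_ranges, supports_sub_range):
    # held_ranges: ranges already locked by the same lock owner.
    if not supports_sub_range:
        if any(is_sub_range(request, held) for held in held_ranges):
            return "DAFSERR_LOCK_RANGE"
    return "OK"
```

   A client that always locks and unlocks exactly the ranges it
   originally requested never triggers the error in this sketch.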

           8.3.  Blocking Locks

           Some clients require the support of blocking locks.  The
           NFS  version  4  [DAFS]  protocol  must  not  rely  on a
           callback mechanism and therefore is unable to  notify  a
           client  when  a previously denied lock has been granted.
           Clients have no choice but to continually poll  for  the
           lock.   This  presents a fairness problem.  Two new lock
           types are added, READW and WRITEW, and are used to indi-
           cate  to  the  server  that  the  client is requesting a
           blocking lock.  The server should  maintain  an  ordered
           list  of  pending  blocking locks.  When the conflicting
           lock is released, the server may wait the  lease  period
           for  the  first  waiting  client to re-request the lock.
           After the lease period expires the next  waiting  client
           request  is  allowed  the  lock. Clients are required to
           poll at an interval sufficiently small that it is likely
           to  acquire  the lock in a timely manner.  The server is
           not required to maintain a list of pending blocked locks
           as  it  is  used  to  increase  fairness and not correct
           operation.  Because of the  unordered  nature  of  crash
           recovery,  storing of lock state to stable storage would
           be required to guarantee ordered  granting  of  blocking
           locks.

           Servers may also note the lock types and delay returning
           denial  of  the  request  to allow extra time for a con-
           flicting lock to  be  released,  allowing  a  successful
           return.   In  this  way, clients can avoid the burden of
           needlessly frequent  polling  for  blocking  locks.  The
           server  should  take  care in the length of delay in the
           event the client retransmits the request.
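
   One possible shape for the ordered list of pending blocking locks
   is sketched below.  The PendingLocks class, its method names, and
   the use of plain numbers for time are assumptions for this
   example; the quoted text does not require a server to keep such a
   list at all.

```python
from collections import deque


class PendingLocks:
    """Hypothetical FIFO of clients waiting on one conflicting lock."""

    def __init__(self, lease_period):
        self.lease_period = lease_period
        self.waiters = deque()   # client names, oldest request first
        self.head_since = None   # when the head waiter's turn began

    def denied(self, client):
        # Record a READW/WRITEW request that could not be granted.
        if client not in self.waiters:
            self.waiters.append(client)

    def released(self, now):
        # The conflicting lock was released; the head waiter's turn starts.
        self.head_since = now

    def may_grant(self, client, now):
        # Called when a client polls again for the lock.
        if self.head_since is None:
            return False  # the conflicting lock is still held
        # Drop waiters that did not re-request within a lease period.
        while self.waiters and now - self.head_since >= self.lease_period:
            self.waiters.popleft()
            self.head_since += self.lease_period
        if self.waiters and self.waiters[0] != client:
            return False  # an earlier waiter still has priority
        if self.waiters:
            self.waiters.popleft()
        self.head_since = None  # the lock is held again
        return True
```

   The queue affects only fairness.  A server that drops it, for
   example after a restart, still grants locks correctly.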

           8.4.  Lease Renewal

           The purpose of a lease is to allow a  server  to  remove
           stale  locks  that are held by a client that has crashed
           or is otherwise unreachable.  It is not a mechanism  for
           cache  consistency  and lease renewals may not be denied
           if the lease interval has not expired." (RFC  3010,  pp.
           57-58)

   Any DAFS message received by the server from a client acts to renew
   the client's current leases.

           "This approach allows for  low  overhead  lease  renewal
           which  scales  well.   In  the typical case no extra RPC
           calls are required for lease renewal and  in  the  worst
           case  one  RPC  is  required  every lease period (i.e. a
           RENEW [NULL] operation).  The number of  locks  held  by
           the  client  is  not  a  factor  since all state for the
           client is involved with the lease renewal action.


           Since all operations that create a new lease also  renew
           existing leases, the server must maintain a common lease
           expiration time for all valid leases for a given client.
           This lease time can then be easily updated upon implicit
           lease renewal actions.
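
   A minimal sketch of a single per-client expiration time, renewed
   implicitly by any message from that client, might look like this.
   The ClientLease class and its method names are hypothetical, and
   time is passed in as a plain number to keep the example
   deterministic.

```python
class ClientLease:
    """One expiration time shared by all state held for one client."""

    def __init__(self, lease_period, now):
        self.lease_period = lease_period
        self.expires_at = now + lease_period

    def expired(self, now):
        return now >= self.expires_at

    def on_message(self, now):
        # Any message from the client renews an unexpired lease.
        if self.expired(now):
            return False  # too late: the server may already free locks
        self.expires_at = now + self.lease_period
        return True
```

   Because only one timestamp is updated, the cost of renewal in this
   sketch does not depend on how many locks the client holds.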

           8.5.  Crash Recovery

           The important requirement in crash recovery is that both
           the  client  and  the  server  know  when  the other has
           failed. Additionally, it is required that a client  sees
           a  consistent  view  of  data  across server restarts or
           reboots.  All READ and WRITE operations  that  may  have
           been  queued  within  the client or network buffers must
           wait until the client  has  successfully  recovered  the
           locks protecting the READ and WRITE operations.

           8.5.1.  Client Failure and Recovery

           In the event that a client fails, the server may recover
           the  client's  locks  when  the  associated  leases have
           expired.  Conflicting locks from another client may only
           be  granted  after this lease expiration.  If the client
           is able to restart  or  reinitialize  within  the  lease
           period the client may be forced to wait the remainder of
           the lease period before obtaining new locks.

           To minimize client delay upon restart, lock requests are
           associated  with  an  instance of the client by a client
           supplied verifier.  This verifier is part of the initial
           SETCLIENTID [DAFS Connect] call made by the client.

           The server  returns  a  clientid  as  a  result  of  the
           SETCLIENTID  [DAFS  Connect]  operation." (RFC 3010,
           p. 59)

   DAFS does not require a confirmation step when the client receives a
   client_id as the result of a successful DAFS Connect request.

           "The clientid in combination with an opaque owner  field
           is  then  used  by the client to identify the lock owner
           for OPEN.  This chain of associations is  then  used  to
           identify all locks for a particular client.

           Since the verifier will be changed by  the  client  upon
           each initialization, the server can compare a new verif-
           ier to the verifier associated with currently held locks
           and  determine  that  they do not match.  This signifies
           the client's new instantiation and  subsequent  loss  of
           locking  state.   As  a  result,  the  server is free to
           release all locks held which are associated with the old
           clientid which was derived from the old verifier.

           For secure environments, a change in the  verifier  must
           only  cause  the  release  of  locks associated with the
           authenticated requester.  This is required to prevent  a
           rogue entity from freeing otherwise valid locks.

           Note that the verifier must have the same uniqueness
           properties as the verifier for the COMMIT operation.
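
   The verifier comparison described above can be sketched as
   follows.  The ConnectTable class, the use of a name string to stand
   for the authenticated requester, and the lock descriptions are all
   assumptions for illustration and are not part of the DAFS Connect
   definition.

```python
import itertools


class ConnectTable:
    """Hypothetical server state keyed by authenticated client name."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.by_name = {}  # name -> (verifier, clientid)
        self.locks = {}    # clientid -> set of lock descriptions

    def connect(self, name, verifier):
        known = self.by_name.get(name)
        if known is not None and known[0] == verifier:
            return known[1]  # same client instance, same clientid
        if known is not None:
            # New verifier: the client restarted, so locks tied to the
            # old clientid can be released without waiting for the
            # lease to expire.
            self.locks.pop(known[1], None)
        clientid = next(self._next_id)
        self.by_name[name] = (verifier, clientid)
        self.locks[clientid] = set()
        return clientid
```

   Keying the table by the authenticated name means a changed
   verifier can only release locks that belong to the same requester.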

           8.5.2.  Server Failure and Recovery

           If the server loses locking state (usually as  a  result
           of  a  restart or reboot), it must allow clients time to
           discover this fact and re-establish the lost locking
           state.  The client must be able to re-establish the
           locking state  without  having  the  server  deny  valid
           requests  because  the  server  has  granted conflicting
           access to another client.  Likewise,  if  there  is  the
           possibility  that  clients  have  not yet re-established
           their locking state for a file, the server must disallow
           READ and WRITE operations for that file. The duration of
           this recovery period is equal to  the  duration  of  the
           lease period.

           A client can determine that  server  failure  (and  thus
           loss  of  locking  state) has occurred, when it receives
           one   of   two   errors.    The    NFS4ERR_STALE_STATEID
           [DAFSERR_STALE_STATEID]   error   indicates   a  stateid
           invalidated   by   a    reboot    or    restart.     The
           NFS4ERR_STALE_CLIENTID   [DAFSERR_STALE_CLIENTID]  error
           indicates a clientid invalidated by reboot  or  restart.
           When either of these is received, the client must
           establish a new clientid (See the section  'Client  ID')
           and re-establish the locking state as discussed below.

           The period of special handling of locking and READs  and
           WRITEs,  equal  in  duration  to  the  lease  period, is
           referred to as the 'grace  period'.   During  the  grace
           period,  clients  recover locks and the associated state
           by reclaim-type locking  requests  (i.e.  LOCK  requests
           with  reclaim  set  to  true  and OPEN operations with a
           claim type of CLAIM_PREVIOUS).  During the grace period,
           the  server  must  reject  READ and WRITE operations and
           non-reclaim locking requests (i.e. other LOCK  and  OPEN
           operations) with an error of NFS4ERR_GRACE.

           If the server can reliably  determine  that  granting  a
           non-reclaim  request  will not conflict with reclamation
           of locks by other clients, the NFS4ERR_GRACE error  does
           not  have  to  be  returned  and  the non-reclaim client
           request can be serviced.  For the server to be  able  to
           service  READ  and  WRITE  operations  during  the grace
           period, it must again be able to guarantee that no  pos-
           sible  conflict could arise between an impending reclaim
           locking request and the READ or WRITE operation.  If the
           server   is   unable   to   offer  that  guarantee,  the
           NFS4ERR_GRACE error must be returned to the client.

           For a server to provide simple,  valid  handling  during
           the grace period, the easiest method is to simply reject
           all non-reclaim locking  requests  and  READ  and  WRITE
           operations     by     returning     the    NFS4ERR_GRACE
           [DAFSERR_GRACE]  error.   However,  a  server  may  keep
           information about granted locks in stable storage.  With
           this information, the server could determine if a  regu-
           lar  lock  or READ or WRITE operation can be safely pro-
           cessed.

           For example, if a count of locks  on  a  given  file  is
           available  in  stable  storage,  the  server  can  track
           reclaimed locks for the file and when all reclaims  have
           been processed, non-reclaim locking requests may be pro-
           cessed.  This way the server can ensure that non-reclaim
           locking   requests  will  not  conflict  with  potential
           reclaim requests. With respect to I/O requests,  if  the
           server  is able to determine that there are no outstand-
           ing reclaim requests for  a  file  by  information  from
           stable  storage  or  another similar mechanism, the pro-
           cessing of I/O requests could proceed normally  for  the
           file.

           To reiterate, for a server that allows non-reclaim  lock
           and  I/O  requests  to  be  processed  during  the grace
           period, it MUST  determine  that  no  lock  subsequently
           reclaimed will be rejected and that no lock subsequently
           reclaimed would have prevented any  I/O  operation  pro-
           cessed during the grace period.

           Clients  should  be   prepared   for   the   return   of
           NFS4ERR_GRACE  [DAFSERR_GRACE]  errors  for  non-reclaim
           lock and I/O requests.  In this case the  client  should
           employ  a  retry mechanism for the request.  A delay (on
           the order of several seconds) between retries should  be
           used to avoid overwhelming the server.  Further discussion
           of the general issue is included in [Floyd].  The client
           must  account for the server that is able to perform I/O
           and non-reclaim locking requests within the grace period
           as well as those that can not do so.

           A reclaim-type  locking  request  outside  the  server's
           grace  period can only succeed if the server can guaran-
           tee that no conflicting lock or  I/O  request  has  been
           granted since reboot or restart.
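
   The grace-period rules above can be condensed into the following
   sketch.  The function names, the provably_safe flag, and the
   "RECLAIM_REFUSED" placeholder result are assumptions for this
   example; only DAFSERR_GRACE is taken from the text.

```python
import time


def admit(kind, now, restart_time, lease_period, provably_safe=False):
    """Server-side decision for one request.

    kind is "reclaim" for reclaim-type locking requests; any other
    value ("lock", "read", "write") is an ordinary request.
    provably_safe is a hypothetical flag: the server has determined,
    for example from stable storage, that no conflict is possible.
    """
    in_grace = now - restart_time < lease_period
    if kind == "reclaim":
        # After the grace period a reclaim succeeds only if nothing
        # conflicting can have been granted since the restart.
        return "OK" if (in_grace or provably_safe) else "RECLAIM_REFUSED"
    if in_grace and not provably_safe:
        return "DAFSERR_GRACE"
    return "OK"


def retry_on_grace(send, delay=5.0, attempts=10, sleep=time.sleep):
    """Client side: resend while the server answers DAFSERR_GRACE."""
    result = "DAFSERR_GRACE"
    for _ in range(attempts):
        result = send()
        if result != "DAFSERR_GRACE":
            break
        sleep(delay)  # several seconds, to avoid overwhelming the server
    return result
```

   The client loop makes no assumption about whether the server can
   service ordinary requests during the grace period; it simply
   retries when told to.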

           8.5.3.  Network Partitions and Recovery

           If the duration of a network partition is  greater  than
           the lease period provided by the server, the server will
           not have received a lease renewal from the client.  If
           this  occurs, the server may free all locks held for the
           client.  As a result, all stateids held  by  the  client
           will  become  invalid or stale.  Once the client is able
           to reach the server after such a network partition,  all
           I/O  submitted  by  the client with the now invalid sta-
           teids will fail with  the  server  returning  the  error
           NFS4ERR_EXPIRED  [DAFSERR_EXPIRED].   Once this error is
           received, the client will suitably notify  the  applica-
           tion that held the lock.

           As a courtesy to the client or as an  optimization,  the
           server  may continue to hold locks on behalf of a client
           for which recent communication has extended  beyond  the
           lease  period.   If  the  server  receives a lock or I/O
           request that conflicts with one of these courtesy locks,
           the server must free the courtesy lock and grant the new
           request.

           If the server continues to hold locks beyond the expira-
           tion  of  a  client's  lease,  the  server MUST employ a
           method of recording this fact  in  its  stable  storage.
           Conflicting lock requests from another client may be
           serviced after the lease expiration.  There are  various
           scenarios  involving  server failure after such an event
           that require the storage of these lease  expirations  or
           network partitions.  One scenario is as follows:

              A client holds a lock at the server and encounters  a
              network  partition and is unable to renew the associ-
              ated lease.  A second client  obtains  a  conflicting
              lock  and  then  frees  the  lock.   After the unlock
              request by the second client, the server  reboots  or
              reinitializes.  Once the server recovers, the network
              partition heals and the original client  attempts  to
              reclaim the original lock.

           In this scenario and without any state information,  the
           server  will allow the reclaim and the client will be in
           an inconsistent state because the server or  the  client
           has no knowledge of the conflicting lock.

           The server may choose to store this lease expiration  or
           network partitioning state in a way that will only iden-
           tify the client as a whole.  Note that this  may  poten-
           tially  lead to lock reclaims being denied unnecessarily
           because of a  mix  of  conflicting  and  non-conflicting
           locks.   The server may also choose to store information
           about each lock that has an expired lease with an  asso-
           ciated  conflicting  lock.  The choice of the amount and
           type of state information that is stored is left to  the
           implementor.   In  any case, the server must have enough
           state information to enable correct recovery from multi-
           ple partitions and multiple server failures." (RFC 3010,
           pp. 59-63)

   DAFS does not require explicit handling of lock request timeouts.

           "8.7.  Server Revocation of Locks

           At any point, the server can  revoke  locks  held  by  a
           client  and  the client must be prepared for this event.
           When the client detects that its locks have been or  may
           have  been  revoked, the client is responsible for vali-
           dating the state  information  between  itself  and  the
           server.   Validating  locking state for the client means
           that it must verify  or  reclaim  state  for  each  lock
           currently held.

           The first instance of lock revocation is upon server
           reboot or re-initialization.  In this instance the
           client will receive an error (NFS4ERR_STALE_STATEID or
           NFS4ERR_STALE_CLIENTID) [DAFSERR_STALE_STATEID or
           DAFSERR_STALE_CLIENTID] and the client will proceed
           with normal crash recovery as described in the previous
           section.

           The second lock revocation event  is  the  inability  to
           renew the lease period.  While this is considered a rare
           or  unusual  event,  the  client  must  be  prepared  to
           recover.   Both  the  server  and client will be able to
           detect the failure to renew the lease and are capable of
           recovering without data corruption.  The server
           tracks the last renewal event serviced for the client
           and  knows  when  the lease will expire.  Similarly, the
           client must track operations which will renew the  lease
           period.   Using the time that each such request was sent
           and the time that the corresponding reply was  received,
           the  client should bound the time that the corresponding
           renewal could have  occurred  on  the  server  and  thus
           determine  if it is possible that a lease period expira-
           tion could have occurred.

           The third lock revocation event can occur as a result of
           administrative  intervention  within  the  lease period.
           While this is considered a rare event,  it  is  possible
           that  the  server's administrator has decided to release
           or revoke a particular lock held by the  client.   As  a
           result  of  revocation, the client will receive an error
           of NFS4ERR_EXPIRED [DAFSERR_EXPIRED] and  the  error  is
           received  within the lease period for the lock.  In this
           instance  the  client   may   assume   that   only   the
           nfs_lockowner's  locks have been lost.  The client noti-
           fies the lock holder appropriately.  The client may  not
           assume the lease period has been renewed as a result of
           a failed operation.

           When the client determines the  lease  period  may  have
           expired,  the  client  must  mark all locks held for the
           associated  lease  as  'unvalidated'.   This  means  the
           client has been unable to re-establish or confirm the
           appropriate lock state with the server. As described  in
           the  previous  section  on  crash  recovery,  there  are
           scenarios in which  the  server  may  grant  conflicting
           locks  after  the lease period has expired for a client.
           When it is possible that the lease period  has  expired,
           the  client  must  validate  each lock currently held to
           ensure that a conflicting lock has not been granted. The
           client  may  accomplish  this  task  by  issuing  an I/O
           request, either a pending I/O  or  a  zero-length  read,
           specifying the stateid associated with the lock in ques-
           tion. If the response to the  request  is  success,  the
           client  has  validated all of the locks governed by that
           stateid and re-established the appropriate state between
           itself  and  the server.  If the I/O request is not suc-
           cessful, then one or more of the locks  associated  with
           the  stateid  was  revoked  by the server and the client
           must notify the owner.
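
   The client-side bookkeeping described in the quoted section 8.7
   can be sketched as below.  The renewal happened on the server at
   some point between the send time and the receive time of the
   request, so this sketch conservatively measures from the send
   time.  The ClientLeaseTracker class, its method names, and the
   "valid"/"unvalidated" labels are assumptions for illustration.

```python
class ClientLeaseTracker:
    """Hypothetical client record of lease renewals and held locks."""

    def __init__(self, lease_period):
        self.lease_period = lease_period
        self.last_sent = None      # send time of the latest renewal
        self.last_received = None  # when its reply arrived
        self.locks = {}            # stateid -> "valid" or "unvalidated"

    def add_lock(self, stateid):
        self.locks[stateid] = "valid"

    def renewed(self, sent, received):
        # Record a request that succeeded and therefore renewed the lease.
        if self.last_sent is None or sent > self.last_sent:
            self.last_sent, self.last_received = sent, received

    def may_have_expired(self, now):
        # The server renewed no earlier than the send time, so the
        # lease may have run out lease_period after that moment.
        if self.last_sent is None:
            return True
        return now - self.last_sent >= self.lease_period

    def check(self, now):
        if self.may_have_expired(now):
            for stateid in self.locks:
                self.locks[stateid] = "unvalidated"

    def validated(self, stateid, io_ok):
        # io_ok: result of an I/O request (possibly a zero-length
        # read) issued with this stateid.
        if io_ok:
            self.locks[stateid] = "valid"
        else:
            del self.locks[stateid]  # revoked; the owner must be told
```

   Measuring from the send time errs toward extra validation rather
   than toward trusting a lock that the server may have given away.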


           8.8.  Share Reservations

           A share reservation is a mechanism to control access  to
           a file.  It is a separate and independent mechanism from
           record locking. When a client opens a file, it issues an
           OPEN  operation  to  the  server  specifying the type of
           access required (READ, WRITE, or BOTH) and the  type  of
           access to deny others (deny NONE, READ, WRITE, or BOTH).
           If the OPEN fails the client will fail the application's
           open request.

           Pseudo-code definition of the semantics:


                   if ((request.access & file_state.deny) ||
                       (request.deny & file_state.access))
                        return (NFS4ERR_DENIED) [DAFSERR_DENIED]


           The constants  used  for  the  OPEN  and  OPEN_DOWNGRADE
           operations  for  the  access and deny fields are as fol-
           lows:


                   const OPEN4_SHARE_ACCESS_READ   = 0x00000001;
                   const OPEN4_SHARE_ACCESS_WRITE  = 0x00000002;
                   const OPEN4_SHARE_ACCESS_BOTH   = 0x00000003;


                   const OPEN4_SHARE_DENY_NONE     = 0x00000000;
                   const OPEN4_SHARE_DENY_READ     = 0x00000001;
                   const OPEN4_SHARE_DENY_WRITE    = 0x00000002;
                   const OPEN4_SHARE_DENY_BOTH     = 0x00000003;
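
   The share reservation test can be expressed as a runnable Python
   sketch using the constants listed above.  The share_conflict
   function name and its parameter names are assumptions for
   illustration; file_access and file_deny stand for the union of the
   bits held by existing opens of the file.

```python
OPEN4_SHARE_ACCESS_READ = 0x00000001
OPEN4_SHARE_ACCESS_WRITE = 0x00000002
OPEN4_SHARE_ACCESS_BOTH = 0x00000003

OPEN4_SHARE_DENY_NONE = 0x00000000
OPEN4_SHARE_DENY_READ = 0x00000001
OPEN4_SHARE_DENY_WRITE = 0x00000002
OPEN4_SHARE_DENY_BOTH = 0x00000003


def share_conflict(req_access, req_deny, file_access, file_deny):
    """True if the OPEN must fail with DAFSERR_DENIED."""
    # The request wants access that an existing open denies, or it
    # wants to deny access that an existing open already holds.
    return bool((req_access & file_deny) or (req_deny & file_access))
```

   For example, a request for read access conflicts with an existing
   open that denies READ, but not with one that denies only WRITE.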


           8.9.  OPEN/CLOSE Operations

           To provide correct share semantics, a  client  MUST  use
           the  OPEN operation to obtain the initial filehandle and
           indicate the desired access and what if  any  access  to
           deny.   Even  if  the client intends to use a stateid of
           all 0's or all 1's, it must still obtain the  filehandle
           for  the  regular  file  with  the OPEN operation so the
           appropriate share semantics can be applied.  For clients
           that  do  not  have  a  deny  mode built into their open
           programming interfaces, deny equal  to  NONE  should  be
           used.

           The OPEN operation with the CREATE flag also subsumes
           the CREATE operation for regular files as used in previ-
           ous versions of the NFS protocol.  This allows a  create
           with a share to be done atomically.

           The CLOSE operation removes all share locks held by  the
           nfs_lockowner  on  that file.  If record locks are held,
           the client SHOULD release all  locks  before  issuing  a
           CLOSE.   The  server  MAY  free all outstanding locks on
           CLOSE but some servers may not support the  CLOSE  of  a
           file  that still has record locks held.  The server MUST
           return failure if any locks would exist after the CLOSE.

           The LOOKUP operation will return  a  filehandle  without
           establishing  any  lock  state on the server.  Without a
           valid stateid, the server will assume the client has the
           least  access.  For  example,  a  file  opened with deny
           READ/WRITE  cannot  be  accessed  using   a   filehandle
           obtained  through  LOOKUP  because  it  would not have a
           valid stateid (i.e. using a stateid of all bits 0 or all
           bits 1).

           8.10.  Open Upgrade and Downgrade

           When an OPEN is done for a file and  the  lockowner  for
           which  the open is being done already has the file open,
           the result is to upgrade the open file status maintained
           on the server to include the access and deny bits speci-
           fied by the new OPEN as well as those for  the  existing
           OPEN.  The result is that there is one open file, as far
           as the protocol is concerned, and it includes the  union
           of the access and deny bits for all of the OPEN requests
           completed.  Only a single CLOSE will be  done  to  reset
           the  effects of both OPEN's.  Note that the client, when
           issuing the OPEN, may not know that the same file is  in
           fact being opened. The above only applies if both OPEN's
           result in the OPEN'ed object  being  designated  by  the
           same filehandle.

           When the server chooses to export  multiple  filehandles
           corresponding  to  the same file object and returns dif-
           ferent filehandles on two different OPEN's of  the  same
           file object, the server MUST NOT 'OR' together the access
           and deny bits and coalesce the two open files.   Instead
           the  server  must maintain separate OPEN's with separate
           stateid's and will  require  separate  CLOSE's  to  free
           them.

           When multiple open files on the client are merged into a
           single  open file object on the server, the close of one
           of the open files (on the client) may necessitate change
           of  the  access  and deny status of the open file on the
           server.  This is because the union  of  the  access  and
           deny  bits for the remaining open's may be smaller (i.e.
           a proper subset) than  previously.   The  OPEN_DOWNGRADE
           operation  is  used to make the necessary change and the
           client should use it to update the server so that  share
           reservation  requests by other clients are handled prop-
           erly.

           8.11.  Short and Long Leases

           When determining the time period for the  server  lease,
           the  usual lease tradeoffs apply.  Short leases are good
           for fast server recovery at a cost  of  increased  RENEW
           [DAFS_PROC_NULL]  or  READ  (with zero length) requests.
           Longer leases are certainly kinder and gentler to  large
           internet  servers trying to handle very large numbers of
           clients.  The number of RENEW [DAFS_PROC_NULL]  requests
           drop in proportion to the lease time.  The disadvantages
           of long leases are slower recovery after server  failure
           (server  must wait for leases to expire and grace period
           before granting new lock requests)  and  increased  file
           contention  (if  client  fails  to  transmit  an  unlock
           request then  server  must  wait  for  lease  expiration
           before granting new locks).

           Long leases are usable if the server is  able  to  store
           lease  state in non-volatile memory.  Upon recovery, the
           server can reconstruct the lease  state  from  its  non-
           volatile  memory and continue operation with its clients
           and therefore long leases are not an issue.

           8.12.  Clocks and Calculating Lease Expiration

           To avoid the need for synchronized clocks,  lease  times
           are  granted  by  the  server as a time delta.  However,
           there is a requirement that the client and server clocks
           do  not drift excessively over the duration of the lock.
           There is also the issue of propagation delay across  the
           network  which  could  easily  be  several  hundred mil-
           liseconds as well as the possibility that requests  will
           be lost and need to be retransmitted.

           To take  propagation  delay  into  account,  the  client
           should  subtract it from lease times (e.g. if the client
           estimates the one-way propagation delay  as  200  msec,
           then  it  can  assume that the lease is already 200 msec
           old when it gets it).  In addition, it will take another
           200  msec  to get a response back to the server.  So the
           client must send a lock renewal or write  data  back  to
           the  server  400  msec  before  the lease would expire."
           (RFC 3010, pp. 63-67)
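
   The arithmetic in the quoted passage can be illustrated with a
   short Python sketch. The function name and its parameters are
   hypothetical and are not part of the DAFS protocol; the one-way
   delay is assumed to be the client's own estimate.

```python
def renewal_deadline_ms(lease_ms: int, one_way_delay_ms: int) -> int:
    """Return how long (in msec) after receiving a lease the client may
    wait before it must send a renewal. Illustrative only.

    The lease is already one_way_delay_ms old when it arrives, and the
    renewal needs another one_way_delay_ms to reach the server, so the
    client gives up twice the one-way delay.
    """
    if lease_ms < 0 or one_way_delay_ms < 0:
        raise ValueError("times must be non-negative")
    # A lease shorter than the round trip must be renewed immediately,
    # so the wait is clamped at zero rather than going negative.
    return max(0, lease_ms - 2 * one_way_delay_ms)
```

   With a 30 second lease and an estimated one-way delay of 200 msec,
   the sketch allows the client to wait 29600 msec, that is, 400 msec
   less than the nominal lease time.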

4.4.1.3.  PERSIST Locks

   PERSIST locks are provided so that if a lock-protected sequence of
   I/O operations is interrupted, the protecting lock is not made avail-
   able again until the lockholder (or a cooperating client) has an
   opportunity to repair any inconsistencies in the data that resulted
   from the interruption. The events that might cause such an
   interruption are a client failure, a network partition that results
   in lock lease expiration and revocation, a server failure after
   which the lock cannot be reclaimed successfully, or some combination
   of these events (for example, power loss to an entire cluster of
   DAFS clients and servers).

   The model of PERSIST locks is that, rather than being released
   following certain failure events, the locks instead become "broken".
   The state of a broken lock MUST survive client failures, server
   failures, and network partitions. Where a normal lock would have
   become subject to revocation, PERSIST locks enter a state of being
   breakable. The specific conditions for a lock to become breakable are
   either a client lease expiration or a server restart grace period
   expiration. When a lock is breakable, any conflicting lock request
   causes the lock to be broken. The lock also becomes broken when a
   client re-initialization (e.g., reboot) occurs, regardless of whether
   it was breakable at the time of the client re-initialization.

   When a PERSIST lock is broken, it behaves much like a normal lock
   relative to read and write operations. Read and write requests will
   receive one of the status values:

   o  DAFS_ERR_LOCKED: if a null State-id (all zeroes) was specified.

   o  DAFS_ERR_STALE_STATEID: if a non-null State-id was specified.

   It is only when clients either try to acquire a lock using the
   DAFS_PROC_LOCK operation or inquire about a lock using the
   DAFS_PROC_LOCKT operation that they see it is broken via the status
   DAFS_ERR_LOCK_BROKEN.

   To release a broken PERSIST lock, a client issues a DAFS_PROC_LOCK
   request with the REPAIR option. The client application would then
   presumably perform some recovery action to repair the data contained
   within the locked region and then could release the lock with
   DAFS_PROC_LOCKU.

   A lock is specified to be PERSIST by setting the PERSIST option on
   the DAFS_PROC_LOCK request.
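
   The life cycle of a PERSIST lock described in this section can be
   sketched as a small state table. This is a minimal illustration,
   not a definitive implementation: the state names, the event names,
   and both functions are hypothetical labels chosen for this sketch.
   Only the status strings are taken from the text. The renewal of a
   breakable lock is included on the assumption that the client renews
   its lease before any conflicting request is serviced.

```python
from enum import Enum


class LockState(Enum):
    HELD = "held"
    BREAKABLE = "breakable"
    BROKEN = "broken"
    RELEASED = "released"


# (current state, event) -> next state. Event names are hypothetical.
PERSIST_TRANSITIONS = {
    (LockState.HELD, "lease_expired"): LockState.BREAKABLE,
    (LockState.HELD, "grace_period_expired"): LockState.BREAKABLE,
    (LockState.HELD, "client_reinit"): LockState.BROKEN,
    (LockState.HELD, "unlock"): LockState.RELEASED,
    (LockState.BREAKABLE, "lease_renewed"): LockState.HELD,
    (LockState.BREAKABLE, "conflicting_lock_request"): LockState.BROKEN,
    (LockState.BREAKABLE, "client_reinit"): LockState.BROKEN,
    # A broken lock is reacquired with the REPAIR option; the repairing
    # client then holds it and can release it with an unlock.
    (LockState.BROKEN, "lock_with_repair"): LockState.HELD,
}


def next_state(state: LockState, event: str) -> LockState:
    """Return the next state; events that do not apply change nothing."""
    return PERSIST_TRANSITIONS.get((state, event), state)


def io_status_when_broken(stateid_is_null: bool) -> str:
    """Status returned to a read or write while the lock is broken."""
    if stateid_is_null:
        return "DAFS_ERR_LOCKED"
    return "DAFS_ERR_STALE_STATEID"
```

   In this sketch an unlock event has no effect on a broken lock,
   which mirrors the rule that a broken lock is released only after it
   has been reacquired with the REPAIR option.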

4.4.1.4.  Auto-Recovery Locks

   AUTORECOVER locks provide a limited UNDO or rollback recovery ser-
   vice. In the absence of failures, an AUTORECOVER lock behaves exactly
   like a normal NFS Version 4 lock.

   In failure conditions, however, the server guarantees that any modif-
   ications made to a file that were done under the protection of an
   AUTORECOVER lock are undone before the lock is released and made
   available to other clients. Failure conditions are server restart
   grace period expirations, client lease expirations, and client re-
   initializations.

   This recovery is limited in the sense that there is no atomicity to
   actions performed on different files or even different byte regions
   of the same file if those regions were protected with different
   locks. If the failure occurs during the release of locks, those
   DAFS_PROC_LOCKU requests that completed will have no recovery
   associated with them, whereas the locks that have not yet been
   released will have recovery actions performed.
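
   One way to picture the per-lock rollback described above is an undo
   log of pre-images kept for each locked region. The Python sketch
   below is only an illustration under that assumption; this document
   does not specify how a server implements the rollback, and the
   class and method names are hypothetical.

```python
class AutoRecoverRegion:
    """In-memory byte region guarded by one hypothetical AUTORECOVER lock."""

    def __init__(self, data: bytearray):
        self.data = data
        self._undo = []  # (offset, original bytes), oldest entry first

    def write(self, offset: int, payload: bytes) -> None:
        end = offset + len(payload)
        if offset < 0 or end > len(self.data):
            raise ValueError("write outside the locked region")
        # Save the bytes about to be overwritten before modifying them.
        self._undo.append((offset, bytes(self.data[offset:end])))
        self.data[offset:end] = payload

    def unlock(self) -> None:
        """Normal release: keep the modifications, drop the undo log."""
        self._undo.clear()

    def rollback(self) -> None:
        """Failure path: restore pre-images, newest first."""
        while self._undo:
            offset, original = self._undo.pop()
            self.data[offset:offset + len(original)] = original
```

   Because each region keeps its own log, rolling back one lock says
   nothing about regions protected by other locks, which is the
   limitation the preceding paragraph describes.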

   A lock is specified to be an AUTORECOVER lock by setting the
   AUTORECOVER option on the DAFS_PROC_LOCK request.

   If a lock is requested with both AUTORECOVER and PERSIST options, the
   rollback associated with the lock is delayed until the lock becomes
   broken. Before the lock becomes broken the client can reclaim the
   lock and either continue on or forcibly roll back. A new lock type,
   ABORT_T, is defined to forcibly roll back an AUTORECOVER lock.

   AUTORECOVER locks are an OPTIONAL feature.

   An implementation can restrict AUTORECOVER locks to locks that cover
   the entire valid range for byte-range locks (i.e., from 0 to 2^64-1
   bytes), thus preventing multiple simultaneous AUTORECOVER locks on a
   single file. If an implementation does not support AUTORECOVER locks
   or it only supports AUTORECOVER locks that cover the entire valid
   byte range, and an AUTORECOVER lock is attempted specifying a
   smaller byte range, then the error status DAFSERR_NOTSUPP is
   returned.

4.4.1.5.  Client Failure and Recovery

   Client failures are seen by the server in the following two ways:

   o  as lease expirations

   o  as Sessions with a new client verifier indicating a client re-
      initialization

   The effect of lease expirations on PERSIST locks is to put them in a
   breakable state. When the lock is in a breakable state, the client
   still has an opportunity to renew a lease if it does so before any
   conflicting DAFS_PROC_LOCK request is serviced. This allows the
   client to continue operation following a network partition and
   recovery. If a conflicting lock is requested before the lease is
   renewed, the lock becomes broken. The lock request causing the break
   as well as any subsequent conflicting lock requests will receive the
   status DAFS_ERR_LOCK_BROKEN. Repair of the lock requires a client to
   request the lock with the REPAIR option and to then release it.

   The effect of lease expiration can be summarized by lock type:

   o  Normal Lock: Revoke the lock. This action can take place immedi-
      ately or the server can defer it until there is a conflicting
      request by another client.

   o  PERSIST Lock: Make the lock breakable. After a lock is made break-
      able it is made broken when a conflicting lock request occurs.

   o  AUTORECOVER Lock: Roll back and revoke the lock. After a lock is
      made breakable it is made broken when a conflicting lock request
      occurs.

   o  PERSIST - AUTORECOVER Lock: Make the lock breakable. After a lock
      is made breakable it is made broken when a conflicting lock
      request occurs.

   The effect of a client re-initialization on PERSIST locks is to put
   them into the broken state. Conflicting lock requests will receive
   the status DAFS_ERR_LOCK_BROKEN. Repair of the lock requires a client
   to obtain the lock with the REPAIR option, and then release it.

   The effect of client re-initialization can be summarized by lock
   type:

   o  Normal Lock: Release the lock.

   o  PERSIST Lock: Make the lock broken.

   o  AUTORECOVER Lock: Roll back and revoke the lock.

   o  PERSIST - AUTORECOVER Lock: Make the lock broken.
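
   The two summaries in this section can be restated as lookup tables
   keyed by the lock options. The following Python sketch is only a
   transcription aid; the action strings, event strings, and function
   name are hypothetical labels and do not appear in the protocol.

```python
# Server action per failure event, keyed by (persist, autorecover).
LEASE_EXPIRATION = {
    (False, False): "revoke",
    (True, False): "make_breakable",
    (False, True): "roll_back_and_revoke",
    (True, True): "make_breakable",
}
CLIENT_REINITIALIZATION = {
    (False, False): "release",
    (True, False): "make_broken",
    (False, True): "roll_back_and_revoke",
    (True, True): "make_broken",
}


def server_action(event: str, persist: bool, autorecover: bool) -> str:
    """Look up what the server does to a lock when `event` occurs."""
    tables = {
        "lease_expiration": LEASE_EXPIRATION,
        "client_reinitialization": CLIENT_REINITIALIZATION,
    }
    if event not in tables:
        raise ValueError(f"unknown event: {event}")
    return tables[event][(persist, autorecover)]
```

   The tables make one point easy to see: whenever the PERSIST option
   is set, the lock is never simply released, whether or not
   AUTORECOVER is also set.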

4.4.1.6.  Server Failure and Recovery

   Clients can detect server failures when they establish a new Session
   after a previous Session with that server has been disconnected. The
   client presents the same client-id-string and client-verifier that it
   used to establish the previous Session, but the server returns a
   different client-id. If the server had not re-initialized, it would
   return the same client-id as the client had used to establish the
   previous Session.

   Immediately following a server restart, the server enters a "grace
   period" equal in length to the lease period.  During this time read,
   write, and lock requests other than Reclaim lock requests return the
   error DAFSERR_GRACE, unless the server can determine that all valid
   locks for the file have already been reclaimed.  Locking behavior
   during the grace period is the same for all locks regardless of
   whether they are normal, PERSIST, or AUTORECOVER locks.
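
   The grace-period rule above can be expressed as a single predicate.
   The sketch below is illustrative only; the function name and the
   three boolean parameters are hypothetical, and how a server
   determines that all valid locks for a file have been reclaimed is
   left open here, as it is in the text.

```python
from typing import Optional


def grace_period_status(in_grace: bool,
                        is_reclaim: bool,
                        all_locks_reclaimed: bool) -> Optional[str]:
    """Return "DAFSERR_GRACE" if a read, write, or lock request must be
    refused during the grace period, or None if it may be processed."""
    if in_grace and not is_reclaim and not all_locks_reclaimed:
        return "DAFSERR_GRACE"
    return None
```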

   During the grace period, clients can reclaim locks using the
   DAFS_PROC_LOCK operation with the reclaim option. If an AUTORECOVER
   lock is reclaimed during the grace period, any modifications made to
   the file while it was protected by the lock will be reflected in the
   file. Note that modifications made using asynchronous requests (i.e.,
   unstable DAFS_PROC_WRITE_INLINE and DAFS_PROC_WRITE_DIRECT operations
   that have not been committed yet, and modifications made with the
   DAFS_PROC_BATCH_SUBMIT operation that have not received completion
   notification yet) might not be reflected in the file. If a
   DAFS_PROC_COMMIT had been done, the file will reflect the write
   operation. The client has the option of explicitly rolling back any
   changes by issuing a DAFS_PROC_LOCK request with the ABORT_T lock
   type.

   After the grace period expires, non-PERSIST locks that were not
   reclaimed are made available to all clients. Any such locks that were
   AUTORECOVER locks will have their associated modifications undone
   before any conflicting lock is granted. The server MAY allow reclaim
   of locks to occur after the grace period has ended, but only if it
   can be sure that no conflicting locks have been granted and released
   since the grace period ended.

   PERSIST locks that are not reclaimed during the grace period enter
   the state of being breakable. The lock remains in a breakable state
   until the first conflicting lock request arrives at the server. At
   that time the lock becomes broken. If it is an AUTORECOVER lock, the
   rollback will be performed at this time. The lock remains broken
   until a client attempts to reacquire the lock with the REPAIR option
   and subsequently releases the lock.

   If a PERSIST lock becomes breakable at the end of the grace period,
   the client can still reclaim it, as long as the server is sure that
   no intervening conflicting lock has been granted (that is, the lock
   was not repaired and then made breakable because of a different PER-
   SIST lock).

4.4.2.  Shared Key Reservations

   DAFS extends NFS share reservations with one new capability: SHARE
   KEY reservations. SHARE KEY reservations enable a set of cooperating
   clients (identified by a single shared KEY) to simultaneously access
   a file while at the same time denying access to cooperating clients
   that are not members of the original set (identified by a different
   KEY than the original one).

   SHARE KEY reservations are provided to aid a clustered application to
   detect rogue instances of the application that are trying to perform
   conflicting access to a file. Such rogue access is now a common
   source of corruption in clustered applications. SHARE KEY reserva-
   tions allow a clustered application to have all components of a clus-
   ter instance share a SHARE KEY reservation. Thus multiple clients
   participating in the cluster instance can access the file, but when a
   client participating in a different cluster instance tries to access
   the file, then access is denied.

   SHARE KEY reservation checking is in addition to ordinary NFSv4-style
   share reservation checking.

   Pseudocode definition of the semantics:

          // Do the checking for NFSv4-style open semantics
          if ((request.access & file_state.deny) ||
              (request.deny & file_state.access)) {
              return DAFSERR_DENIED;
          }

          // Do special SHARE KEY handling, if appropriate
          if (request.share_key_type) {
              if (!file_state.share_key_type) {
                  // No SHARE KEY is established on the file yet:
                  // adopt the requester's key.
                  file_state.key = request.key;
              } else if (request.key != file_state.key) {
                  return DAFSERR_KEY_MISMATCH;
              }
              file_state.share_key_count++;
          }

          // Request will succeed. Update the remaining state.
          file_state.access |= request.access;
          file_state.deny |= request.deny;


   The CLOSE operation decrements the file_state.share_key_count for any
   SHARE KEY locks held by the dafs_lockowner on that file. Similarly,
   file_state.access and file_state.deny are updated so that they
   reflect share reservations held by other dafs_lockowners on that
   file.
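
   A runnable rendering of the SHARE KEY check may help when testing
   an implementation. The Python sketch below follows the pseudocode
   in this section under two stated assumptions: a file with a
   share_key_count of zero is treated as having no key established,
   and a request carries a key only when it asks for a SHARE KEY
   reservation. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileState:
    access: int = 0
    deny: int = 0
    key: Optional[bytes] = None
    share_key_count: int = 0


def check_open(state: FileState, access: int, deny: int,
               key: Optional[bytes] = None) -> str:
    """Apply one OPEN to `state`; return "OK" or an error status."""
    # NFSv4-style share reservation check.
    if (access & state.deny) or (deny & state.access):
        return "DAFSERR_DENIED"
    # SHARE KEY handling, only when the request supplies a key.
    if key is not None:
        if state.share_key_count == 0:
            state.key = key  # first holder establishes the key (assumed)
        elif key != state.key:
            return "DAFSERR_KEY_MISMATCH"
        state.share_key_count += 1
    # The request succeeds; fold its bits into the file state.
    state.access |= access
    state.deny |= deny
    return "OK"
```

   For example, after one client opens a file with key b"A", a second
   OPEN with key b"B" is refused with DAFSERR_KEY_MISMATCH, while
   another OPEN with key b"A" succeeds and raises the count to two.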

4.4.3.  Access Control Lists (ACLs)

   The access control lists in DAFS and NFS version 4 are the same. The
   NFS Version 4 description of ACLs is quoted below. For the most part,
   the quote applies to DAFS. The exceptions are the naming of data
   structures and constants that have changed to adhere to the DAFS pro-
   tocol naming conventions. See 6.1.4., "Basic Types" for the DAFS-
   equivalent names of data structures and constants.

           "The NFS [DAFS] ACL attribute is an array of access con-
           trol  entries  (ACE).   There are various access control
           entry types. The server is able to communicate which ACE
           types  are  supported by returning the appropriate value
           within the aclsupport attribute.  The types of ACEs  are
           defined as follows:

                   Type    Description
                   ALLOW   Explicitly grants the access defined
                           in acemask4 to the file or
                           directory
                   DENY    Explicitly denies the access defined
                           in acemask4 to the file or
                           directory.
                   AUDIT   LOG (system dependent) any access
                           attempt to a file or directory which
                           uses any of the access methods
                           specified in acemask4.
                   ALARM   Generate a system ALARM (system
                           dependent) when any access attempt is
                           made to a file or directory for the
                           access methods specified in acemask4.


           The NFS ACE attribute is defined as follows:


                   typedef uint32_t        acetype4;
                   typedef uint32_t        aceflag4;
                   typedef uint32_t        acemask4;

                   struct nfsace4 {
                       acetype4      type;
                       aceflag4      flag;
                       acemask4      access_mask;
                       utf8string    who;
                   };


           To determine if an ACCESS or OPEN request succeeds  each
           nfsace4 entry is processed in order by the server.  Only
           ACEs which have a 'who'  that matches the requester  are
           considered.  Each ACE is processed until all of the bits
           of the requester's access have been ALLOWED.  Once a bit
           (see  below)  has been ALLOWED by an ACCESS_ALLOWED_ACE,
           it is no longer considered in the  processing  of  later
           ACEs.  If  an ACCESS_DENIED_ACE is encountered where the
           requester's mode still has unALLOWED bits in common with
           the 'access_mask' of the ACE, the request is denied.

           The bitmask constants used to represent the above defin-
           itions within the aclsupport attribute are as follows:

                   const ACL4_SUPPORT_ALLOW_ACL    = 0x00000001;
                   const ACL4_SUPPORT_DENY_ACL     = 0x00000002;
                   const ACL4_SUPPORT_AUDIT_ACL    = 0x00000004;
                   const ACL4_SUPPORT_ALARM_ACL    = 0x00000008;


           5.9.1.  ACE type

           The semantics of the "type" field  follow  the  descrip-
           tions provided above.

           The bitmask constants used for the  type  field  are  as
           follows:


                   const ACE4_ACCESS_ALLOWED_ACE_TYPE= 0x00000000;
                   const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001;
                   const ACE4_SYSTEM_AUDIT_ACE_TYPE  = 0x00000002;
                   const ACE4_SYSTEM_ALARM_ACE_TYPE  = 0x00000003;


           5.9.2.  ACE flag

           The "flag" field contains values based on the  following
           descriptions.

           ACE4_FILE_INHERIT_ACE

           Can be placed on a directory and indicates that this ACE
           should be added to each new non-directory file created.

           ACE4_DIRECTORY_INHERIT_ACE

           Can be placed on a directory and indicates that this ACE
           should be added to each new directory created.

           ACE4_INHERIT_ONLY_ACE

           Can be placed on a directory but does not apply  to  the
           directory,  only  to  newly created files/directories as
           specified by the above two flags.

           ACE4_NO_PROPAGATE_INHERIT_ACE

           Can be placed  on  a  directory.  Normally  when  a  new
           directory  is  created  and  an ACE exists on the parent
           directory which  is  marked  ACL4_DIRECTORY_INHERIT_ACE,
           two  ACEs  are  placed on the new directory. One for the
           directory itself and one which is an inheritable ACE for
           newly  created  directories.  This flag tells the server
           to not place an ACE on the newly created directory which
           is  inheritable  by subdirectories of the created direc-
           tory.

           ACE4_SUCCESSFUL_ACCESS_ACE_FLAG

           ACL4_FAILED_ACCESS_ACE_FLAG

           Both indicate for AUDIT and ALARM which state to log the
           event.   On  every ACCESS or OPEN call which occurs on a
           file or directory which has  an  ACL  that  is  of  type
           ACE4_SYSTEM_AUDIT_ACE_TYPE                            or
           ACE4_SYSTEM_ALARM_ACE_TYPE, the attempted access is com-
           pared  to the ace4mask of these ACLs. If the access is a
           subset of ace4mask and the identifier  match,  an  AUDIT
           trail or an ALARM is generated.  By default this happens
           regardless of the success or failure of  the  ACCESS  or
           OPEN call.

           The flag ACE4_SUCCESSFUL_ACCESS_ACE_FLAG  only  produces
           the  AUDIT  or  ALARM if the ACCESS or OPEN call is suc-
           cessful.  The  ACE4_FAILED_ACCESS_ACE_FLAG  causes   the
           ALARM or AUDIT if the ACCESS or OPEN call fails.

           ACE4_IDENTIFIER_GROUP

           Indicates that the "who" refers to a  GROUP  as  defined
           under Unix.

           The bitmask constants used for the  flag  field  are  as
           follows:

                   const ACE4_FILE_INHERIT_ACE        = 0x00000001;
                   const ACE4_DIRECTORY_INHERIT_ACE   = 0x00000002;
                   const ACE4_NO_PROPAGATE_INHERIT_ACE
                                                      = 0x00000004;
                   const ACE4_INHERIT_ONLY_ACE        = 0x00000008;
                   const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG
                                                      = 0x00000010;
                   const ACE4_FAILED_ACCESS_ACE_FLAG
                                                      = 0x00000020;
                   const ACE4_IDENTIFIER_GROUP        = 0x00000040;


           5.9.3.  ACE Access Mask

           The access_mask field contains values based on the  fol-
           lowing:

                   Access             Description
                   READ_DATA          Permission to read the
                                      data of the file
                   LIST_DIRECTORY     Permission to list the
                                      contents of a directory
                   WRITE_DATA         Permission to modify the
                                      file's data
                   ADD_FILE           Permission to add a new
                                      file to a directory
                   APPEND_DATA        Permission to append data
                                      to a file
                   ADD_SUBDIRECTORY   Permission to create a
                                      subdirectory to a
                                      directory
                   READ_NAMED_ATTRS   Permission to read the
                                      named attributes of a file
                   WRITE_NAMED_ATTRS  Permission to write the
                                      named attributes of a file
                   EXECUTE            Permission to execute a
                                      file
                   DELETE_CHILD       Permission to delete a
                                      file or directory within
                                      a directory
                   READ_ATTRIBUTES    The ability to read basic
                                      attributes (non-acls) of a
                                      file
                   WRITE_ATTRIBUTES   Permission to change basic
                                      attributes (non-acls) of a
                                      file
                   DELETE             Permission to Delete the
                                      file
                   READ_ACL           Permission to Read the ACL
                   WRITE_ACL          Permission to Write the
                                      ACL
                   WRITE_OWNER        Permission to change the
                                      owner
                   SYNCHRONIZE        Permission to access file
                                      locally at the server with
                                      synchronous reads and
                                      writes


           The bitmask constants used for the access mask field are
           as follows:

                   const ACE4_READ_DATA            = 0x00000001;
                   const ACE4_LIST_DIRECTORY       = 0x00000001;
                   const ACE4_WRITE_DATA           = 0x00000002;
                   const ACE4_ADD_FILE             = 0x00000002;
                   const ACE4_APPEND_DATA          = 0x00000004;
                   const ACE4_ADD_SUBDIRECTORY     = 0x00000004;
                   const ACE4_READ_NAMED_ATTRS     = 0x00000008;
                   const ACE4_WRITE_NAMED_ATTRS    = 0x00000010;
                   const ACE4_EXECUTE              = 0x00000020;
                   const ACE4_DELETE_CHILD         = 0x00000040;
                   const ACE4_READ_ATTRIBUTES      = 0x00000080;
                   const ACE4_WRITE_ATTRIBUTES     = 0x00000100;
                   const ACE4_DELETE               = 0x00010000;
                   const ACE4_READ_ACL             = 0x00020000;
                   const ACE4_WRITE_ACL            = 0x00040000;
                   const ACE4_WRITE_OWNER          = 0x00080000;
                   const ACE4_SYNCHRONIZE          = 0x00100000;


           5.9.4.  ACE who

           There are several special identifiers ("who") which need
           to  be understood universally. Some of these identifiers
           cannot be understood when an  NFS  client  accesses  the
           server,  but  have meaning when a local process accesses
           the file. The ability to display and modify  these  per-
           missions is permitted over NFS.

                   Who             Description
                   "OWNER"         The owner of the file.
                   "GROUP"         The group associated with the
                                   file.
                   "EVERYONE"      The world.
                   "INTERACTIVE"   Accessed from an interactive
                                   terminal.
                   "NETWORK"       Accessed via the network.
                   "DIALUP"        Accessed as a dialup user to
                                   the server.
                   "BATCH"         Accessed from a batch job.
                   "ANONYMOUS"     Accessed without any
                                   authentication.
                   "AUTHENTICATED" Any authenticated user
                                   (opposite of ANONYMOUS)
                   "SERVICE"       Access from a system
                                   service.


           To avoid conflict, these special identifiers are distin-
           guish  by  an appended "@" and should appear in the form
           "xxxx@" (note: no domain name after the "@").  For exam-
           ple: ANONYMOUS@." (RFC 3010, pp. 40-44)
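
   The ACE processing order described in the quoted text can be
   sketched in Python as follows. This is a simplified illustration
   and not a definitive implementation: the function name and the
   tuple layout are hypothetical, 'who' matching is reduced to plain
   string equality (group and special identifiers are ignored), AUDIT
   and ALARM entries are skipped, and a request whose bits are never
   all ALLOWED is assumed to be denied, which the quoted text does not
   state explicitly. The constant values are those given in the quote.

```python
ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000
ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001
ACE4_READ_DATA = 0x00000001
ACE4_WRITE_DATA = 0x00000002


def access_permitted(aces, requester: str, requested_mask: int) -> bool:
    """Evaluate an ACL given as (type, access_mask, who) tuples in order."""
    remaining = requested_mask  # bits not yet ALLOWED
    for ace_type, access_mask, who in aces:
        if remaining == 0:
            break  # every requested bit has been ALLOWED
        if who != requester:
            continue  # only ACEs naming the requester are considered
        if ace_type == ACE4_ACCESS_ALLOWED_ACE_TYPE:
            remaining &= ~access_mask
        elif ace_type == ACE4_ACCESS_DENIED_ACE_TYPE:
            if remaining & access_mask:
                return False  # a still-unALLOWED bit is denied
    return remaining == 0
```

   Order matters in this scheme: a DENY entry placed before an ALLOW
   entry for the same bit refuses the request, while the same DENY
   entry placed after the ALLOW entry has no effect on that bit.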

4.4.4.  Fencing

   Cluster systems with shared resources need to "fence off" access to
   shared resources by nodes when those nodes lose membership in the
   cluster quorum. This is done primarily to prevent a failing or mis-
   behaving node from improperly accessing the shared resource, and more
   specifically to "drain" all outstanding I/O requests to the resource
   from the evicted node. Draining is necessary to permit other nodes in
   the quorum to repair any damage done to the resource by the evicted
   node. To perform this repair (that is, recovery) the recovering node
   needs to know that no more I/Os will be executed by the evicted node.

   Fencing has also been described as "client access revocation." This
   is an accurate description in the file server environment.

   The problem can be broken down into the following subproblems:

   o  Access revocation - Preventing further access to the resource by
      the node.

   o  Draining - Providing indication of when all outstanding I/O
      requests of the node have completed or been cancelled.

   o  Authorization and access control - Assuring that agents issuing
      fence operations are authorized to do so.

   o  Concurrency control - Avoiding race conditions that could result
      in incorrect or hung system states.

   The DAFS Fencing mechanism is described below, in terms of

   o  subjects (i.e., DAFS clients),

   o  objects (e.g., file systems),

   o  permissions (i.e., allow or deny), and

   o  operations (e.g., get and set permissions)

4.4.4.1.  Fencing Subjects

   A Fencing subject is the active entity whose access to a file or
   set of files is being controlled. A DAFS client can associate a
   "fence_id_string" with a Session to the DAFS server by specifying it
   in the fence_id_string field of the DAFS_PROC_CLIENT_CONNECT
   operation (this is a new argument field added to that operation).
   The fence_id_string is similar in concept to the existing DAFS
   Client-id-string argument, but does not overload Fencing semantics
   onto the Client-id.

   Rationale: DAFS Fencing is intended to address access control between
              a set of cooperating DAFS clients. The cooperating clients
              need to

               o  each use a unique Fence_id_string, and

               o  make the set of Fence_id_strings in use known to some
                  central authority (e.g., cluster manager) for
                  administering the Fencing mechanism.

              Since this level of cooperation is needed, Fencing is not
              meant to protect against malicious attacks. Being "spoof-
              proof" is NOT REQUIRED.


4.4.4.2.  Fencing Object

   A Fencing Object is defined by the DAFS filehandle and object_flag
   argument fields in the Fencing administrative operations. The
   object_flag specifies whether the Object being Fenced is the file
   associated with the filehandle, or the file system specified by the
   FShandle part of the filehandle.

   Note: For the case of Fencing a fs_handle, it is up to the underlying
         DAFS server side file system to export fs_handles to the DAFS
         server in a way that the implementation-specific unit of
         storage (e.g., file system) that is associated with the
         fs_handle can be described well enough so that users who want
         to use the Fencing feature can place the set of files that need
         to be fenced as a unit into the underlying file system
         appropriately.

4.4.4.3.  Fencing Permissions

   Fencing permissions are defined by a "Fencing_list" of
   Fence_id_strings. The list designates the Fence_id_strings, and thus
   the DAFS clients, that are allowed access (vis-a-vis Fencing) to the
   Object (defined by the filehandle and object_flag). The Fencing_list
   is stored persistently by the DAFS server. A null Fencing_list is a
   special case that means all DAFS clients are allowed access to the
   Object. A non-null Fencing_list means that all DAFS clients with
   connections that specify a Fence_id_string in the Fencing_list can
   access the Object.
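
   The permission rule above can be sketched as a single membership
   test. This is only an illustration, not part of the protocol: the
   function name is hypothetical, a null Fencing_list is modeled as
   Python's None, and the sketch assumes that a Session with no
   Fence_id_string is denied once a non-null list is in effect (the
   text above does not state that case explicitly).

```python
from typing import Optional, Set


def fencing_allows(fencing_list: Optional[Set[str]],
                   fence_id_string: Optional[str]) -> bool:
    """Return True if a Session with this Fence_id_string may access the Object."""
    # A null Fencing_list is the special case: all DAFS clients are allowed.
    if fencing_list is None:
        return True
    # Assumption: a Session that supplied no Fence_id_string is denied
    # when a non-null list is in effect.
    if fence_id_string is None:
        return False
    # Otherwise only Sessions whose Fence_id_string is on the list pass.
    return fence_id_string in fencing_list
```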

4.4.4.4.  Fencing Operations

   The Fencing operations used to manage the Fencing_list, and to cause
   Fencing access controls to be in effect, are
   DAFS_PROC_SET_FENCING_LIST and DAFS_PROC_GET_FENCING_LIST.

   The ability to set the Fencing_list for a filehandle object is
   reserved to the owner of the object, or a trusted Client. The ability
   to set the Fencing_list for a file system is reserved to trusted
   Clients.

   The set operation atomically updates the Fencing_list, adding or
   removing Fence_id_strings from the existing list, or overwriting the
   existing list, as specified by the argument flags.
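
   As a rough sketch of the atomic update described above, the server
   could hold a lock while it computes the new list. The mode names
   (ADD, REMOVE, OVERWRITE) and the class are hypothetical; the draft
   only says that argument flags select the behavior. The null-list
   special case is left out of this sketch.

```python
import threading
from enum import Enum
from typing import FrozenSet, Iterable


class FenceUpdate(Enum):
    """Hypothetical stand-ins for the argument flags of the set operation."""
    ADD = 1
    REMOVE = 2
    OVERWRITE = 3


class FencingList:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._ids: FrozenSet[str] = frozenset()

    def update(self, mode: FenceUpdate, ids: Iterable[str]) -> FrozenSet[str]:
        """Apply one update atomically; return the Fence_id_strings it removed."""
        ids = frozenset(ids)
        with self._lock:  # readers never observe a half-applied update
            old = self._ids
            if mode is FenceUpdate.ADD:
                new = old | ids
            elif mode is FenceUpdate.REMOVE:
                new = old - ids
            else:  # OVERWRITE replaces the existing list
                new = ids
            self._ids = new
            # The removed ids are the ones whose Sessions would need draining.
            return old - new

    def get(self) -> FrozenSet[str]:
        with self._lock:
            return self._ids
```

   Returning the removed Fence_id_strings is a design choice of the
   sketch: it gives the caller the set of Sessions whose in-progress
   operations would have to be drained.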

   A side-effect of the set operation when invoked with the deny flag is
   to

   1) drain (i.e., abort or complete) any in-progress operations
      received on a Session with the just-denied Fence_id_string. For
      all subsequent requests on a Session that has the just-denied
      Fence_id_string, the server MUST enforce the denial of access
      implied by the new Fencing_list. This requires determining that
      the request is associated with a denied Fence_id_string (e.g.,
      determining that the request's Session has a denied
      Fence_id_string), and matching the filehandle in the request to
      Objects that are Fenced.

   2) if the Object fenced includes all DAFS file objects (directories,
      files, symlinks, etc.) provided by the DAFS Server, then all
      existing Sessions associated with the just-denied
      Fence_id_string can be closed in error. Subsequent attempts to
      create a Session that contains the just-denied Fence_id_string can
      be returned in error.

   Rationale: These two effects of Fencing provide a range of capabil-
              ity. First, if the Object to be Fenced includes all
              Objects provided by the DAFS server, the runtime checks
              needed to implement Fencing are reduced and performance is
              enhanced. Second, by defining Fencing to include only some
              of the Objects provided by the DAFS server, multiple sets
              of cooperating DAFS clients (e.g., application clusters)
              can be supported on the same DAFS server with some cost in
              runtime performance.

   Note: Fencing an Object does not destroy other file state (e.g.,
         locks) associated with the Client. This state is controlled by
         lease expiration.

4.5.  NFS-Derived Operations

   The NFS Version 4 file system specification, RFC 3010, identifies a
   large set of file operations that are common to many file system
   environments. Most of these file operations are common to any file
   system and are not specific to either a wide-sharing or local file-
   sharing environment. A number of these common file operations do not
   involve bulk data movement between client and server, and the seman-
   tics are not significantly enhanced through the use of memory-to-
   memory architectures. For this reason, the DAFS file system currently
   incorporates these operation semantics as defined in NFS Version 4.

   Although the operation semantics are the same for these operations,
   the message packet format and other aspects of the communication
   between client and server are specific to DAFS and incompatible with
   NFS Version 4.

   Specifically, DAFS incorporates the following operational semantics
   from NFS Version 4 as specified in the corresponding NFS operation:

   o  DAFS_PROC_NULL

   o  DAFS_PROC_ACCESS

   o  DAFS_PROC_CLOSE

   o  DAFS_PROC_COMMIT

   o  DAFS_PROC_CREATE

   o  DAFS_PROC_DELEGPURGE

   o  DAFS_PROC_DELEGRETURN

   o  DAFS_PROC_LINK

   o  DAFS_PROC_LOOKUP

   o  DAFS_PROC_LOOKUPP

   o  DAFS_PROC_NVERIFY

   o  DAFS_PROC_OPEN

   o  DAFS_PROC_OPENATTR

   o  DAFS_PROC_OPEN_DOWNGRADE

   o  DAFS_PROC_REMOVE

   o  DAFS_PROC_RENAME

   o  DAFS_PROC_RENEW

   o  DAFS_PROC_VERIFY

   o  DAFS_PROC_BC_NULL

   o  DAFS_PROC_BC_RECALL.


5.  Failure Recovery

   This chapter describes failure recovery in the following topics:

   o  Exactly-Once semantics

   o  Server Response Cache

   o  Server Failover.

5.1.  Exactly Once Semantics

   DAFS supports "exactly once" semantics in the face of connection and
   server failures. Building upon the characteristics of DAT message
   delivery (see 2.3.3., "DAT Requirements"), DAFS makes an important
   assumption: DAFS requests are not repeatedly reissued until a
   response is received. During a DAFS Session, the server will not
   receive multiple copies of a request sent by the client. This means
   that the server does not need to check each request to see if it is a
   spurious repetition of a request performed earlier. Further, because
   there are no retransmissions, the server will not erroneously execute
   any request twice because of an overflow of a time-based reply cache.

   It is possible for a DAFS communication channel to fail without an
   indication to the client. It is expected that clients will implement
   timeouts. Typically the timeouts will specify long values, simply to
   detect failed DAFS communication channels. In these error cases, the
   client will destroy the existing channel and create a new one. In the
   event of abnormal disconnection, whether because of a timeout, or
   some other error, DAFS defines a Response Cache that enables the
   client to determine which requests, issued before the disconnection,
   were executed and which were not executed. The client is able to
   reissue only those requests not executed previously. This mechanism
   prevents a request from being executed more than once. Requests that
   do not modify file system state are not included in the Response
   Cache because these can be reissued harmlessly.

5.2.  Server Response Cache

5.2.1.  Response Cache

   As an option, negotiated when each Session is created, DAFS servers
   maintain a Response Cache that stores the results of requests that
   are not guaranteed to have reached the issuing client. If the use of
   the Response Cache is negotiated and agreed to during Session crea-
   tion, then the server MUST store these results for all state-
   modifying file system requests (see 4.3.1., "Chaining Restrictions"
   for a list) and for chained requests marked with the DAFS_CHF_SAVE
   flag (see 4.3.2., "Chaining Flags"). The server can store results for
   other requests but is NOT REQUIRED to because such requests can be
   re-executed harmlessly.

   Rationale: The Response Cache is an optional DAFS server behavior.
              Its use is negotiated when a Session is created. A client
              requests a Response Cache in order to improve its service
              guarantees following a failure in the client, server, or
              network. Use of the Response Cache could introduce some
              loss of performance for the Session, particularly since
              the content of the Response Cache needs to survive server
              failure. The server is generally expected to follow the
              wishes of the client in respect to Response Cache use.
              However, as with other Session options, the server MAY
              decline a request for Response Cache use, or MAY always
              maintain a Response Cache regardless of client requests.

   For each request that a client issued but for which it did not
   receive a response, the Response Cache enables the client to
   determine whether or not the request was executed by the server, and
   the results of the request. The Response Cache excludes requests
   that can be safely reissued, and is maintained after a Session has
   been disconnected only when the Session is disconnected abnormally
   due to an error. Thus, "exactly once" semantics can be maintained
   across disconnection and server failure.

   When a new Session is established, information from the Response
   Cache of the old Session can be used to determine the set of requests
   that were in progress. This includes requests that were in transit to
   the server, requests being executed by the server, and requests whose
   response was in transit to the client. This enables the client and
   server to agree on the state of the file system so that no request
   will be executed more than once (with the exception of requests that
   can be reissued harmlessly).

   Rationale: The session orientation of DAFS combined with the reliable
              delivery semantics of DAT enable DAFS to avoid reissued
              requests and responses. By using the limits flow-control
              places on the size of the set of outstanding requests, the
              client and the server bound the set of requests whose
              state needs to be determined following a failure. The
              client can interrogate the server and resolve ambiguity.
              Then the client and server can proceed from a known file
              state.

   The number of responses that need to be stored in the Response Cache
   is at most OPNreq, because the flow control algorithm limits the
   number of client requests per channel that can be in progress at the
   same time to OPNreq. However, the server does not know which
   particular Response Cache entries can be reused when a new client
   request is received. Therefore, the DAFS protocol partitions the
   Response Cache into OPNreq "streams" that a client can use to submit
   requests to the server. A given stream can have only one request in
   progress at a time. Note that this is a way of restating the flow
   control algorithm explained earlier in 3.2.6., "Message Flow Con-
   trol".

   A Response Cache entry is associated with each stream and it contains
   the most recent state-modifying request (or saved chained request)
   issued for that stream.

   Each request is identified by a 32-bit transaction identifier that
   consists of a 16-bit stream ID and a 16-bit sequence number. The
   sequence number is incremented for each request sent on the given
   stream. Wrap around of the 16-bit sequence number does not pose any
   special difficulties because the transaction ID is used only to
   resolve uncertainties about which requests have been processed.
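
   The transaction identifier and the one-entry-per-stream cache can be
   sketched as follows. The draft does not say which half of the 32-bit
   identifier carries the stream ID; placing it in the high 16 bits is
   an assumption of this sketch, as are all function and class names.

```python
from typing import List, Optional, Tuple


def pack_xid(stream_id: int, seq: int) -> int:
    """Build a 32-bit transaction ID (assumed layout: stream ID in the high half)."""
    if not 0 <= stream_id <= 0xFFFF:
        raise ValueError("stream ID must fit in 16 bits")
    return (stream_id << 16) | (seq & 0xFFFF)


def unpack_xid(xid: int) -> Tuple[int, int]:
    """Split a transaction ID into (stream_id, sequence_number)."""
    return (xid >> 16) & 0xFFFF, xid & 0xFFFF


def next_seq(seq: int) -> int:
    """Advance the per-stream sequence number, wrapping at 16 bits."""
    return (seq + 1) & 0xFFFF


class ResponseCache:
    """One entry per stream: the most recent saved request on that stream."""

    def __init__(self, opnreq: int) -> None:
        self.entries: List[Optional[Tuple[int, str]]] = [None] * opnreq

    def store(self, xid: int, response: str) -> None:
        stream_id, seq = unpack_xid(xid)
        # A newer request on the same stream replaces the older entry.
        self.entries[stream_id] = (seq, response)

    def check(self, xid: int) -> Optional[str]:
        stream_id, seq = unpack_xid(xid)
        if stream_id >= len(self.entries):
            return None
        entry = self.entries[stream_id]
        if entry is not None and entry[0] == seq:
            return entry[1]
        return None
```

   Because a stream has at most one request in progress, a lookup only
   has to compare the queried sequence number with the single entry
   saved for that stream.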

5.2.2.  Response Cache Handling of OPNreq Decrease

   When OPNreq is decreased, the highest numbered streams expire. For
   example, if the old OPNreq value was N, the highest numbered stream
   would be stream N-1. If OPNreq is decreased by 1, then the stream
   numbered N-1 expires. Requests already issued by the client that use
   the newly expired stream id will be completed normally, but the
   stream id cannot be used for subsequent requests. If there is
   currently a request outstanding that specifies the newly expired
   stream id, then in order to keep the total number of outstanding
   requests below Nreq, the client is REQUIRED to refrain from issuing
   new requests on some other valid stream id, until the outstanding
   request has completed.

   Note that OPNreq can be decreased by at most one at a time. For more
   information, see 3.2.6.4., "Flow Control Specifics".

   However, the number of Response Cache entries cannot be immediately
   decreased, because no entry can be deleted from the Response Cache
   until the server verifies that the client has received the associated
   response. If the old OPNreq value was N, then the Response Cache
   entry associated with stream N-1 might contain an entry that cannot
   be deleted immediately. The reason is that the client might have
   transmitted a new request for this stream before receiving notifica-
   tion of OPNreq being decreased. The new request that the client might
   have sent on the stream might need its response stored in the
   Response Cache, pending confirmation of its receipt by the client.

   When the client receives the response that contains the new OPNreq
   value, it will stop using the expired stream. The next request that
   the client sends will contain the new OPNreq value, and this serves
   to acknowledge both the new value for OPNreq and the fact that one or
   more streams have expired.

5.2.2.1.  Freeing Entries in a Stream

   The server relies on DAT connection ordering rules to determine when
   the client has acknowledged a new OPNreq value. The response message
   that contained the new OPNreq also contained a particular stream id.
   When the server subsequently receives a request that also specifies
   that particular stream id, the server is assured that the client has
   received the response that contained the new OPNreq value. The
   Response Cache entry for the expired stream can now be deleted.

5.2.2.2.  Freeing Entries in the Highest Numbered Stream

   If the new value of OPNreq is sent to the client in a response mes-
   sage that contains a stream id that is itself about to expire (N-1 in
   the example from the previous section), then the server cannot use
   receipt of a new request with that stream id as an indication that
   the client received the new value. The reason is that the client
   will no longer use that stream id, and therefore the server will not
   receive subsequent messages that specify that stream id. In this
   case, acknowledgement of the new OPNreq value is based on the receipt
   of a request that the server can confirm was sent by the client after
   the client received the response containing the new value of OPNreq.
   If the server makes no further changes to the value of OPNreq, the
   server can confirm receipt of the new value when it receives a
   request from the client that contains the new value. The Response
   Cache entry for the expired stream can now be deleted.
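
   The two acknowledgement rules described so far (a later request on
   the announcing stream, or, when that stream is itself the one
   expiring, a request that carries the new OPNreq value) might be
   tracked on the server roughly as shown here. The class and its
   fields are hypothetical, and the sketch covers a single decrease
   with no further change to OPNreq.

```python
class OpnreqDecrease:
    """Server-side record of one pending OPNreq decrease (illustrative only)."""

    def __init__(self, old_opnreq: int, announce_stream_id: int) -> None:
        self.new_opnreq = old_opnreq - 1         # decreases by one at a time
        self.expired_stream_id = old_opnreq - 1  # highest numbered stream
        # Stream ID of the response that carried the new OPNreq value.
        self.announce_stream_id = announce_stream_id
        self.acknowledged = False

    def on_request(self, stream_id: int, request_opnreq: int) -> bool:
        """Process one incoming request; return True once receipt is confirmed."""
        if self.announce_stream_id != self.expired_stream_id:
            # Rule of 5.2.2.1: a later request on the announcing stream
            # shows that the client saw the response on that stream.
            if stream_id == self.announce_stream_id:
                self.acknowledged = True
        elif request_opnreq == self.new_opnreq:
            # Rule of 5.2.2.2: the announcing stream has expired, so rely
            # on a request that carries the new OPNreq value.
            self.acknowledged = True
        return self.acknowledged
```

   Once on_request returns True, the Response Cache entry for the
   expired stream could be released.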

   When the server has received OPNreq requests from the client after
   having sent the response containing the new value of OPNreq, the
   server is assured that the client has received the response. The
   Response Cache entry for the expired stream can now be deleted.

   It could be that the next response sent by the server also decreases
   OPNreq. In the worst case, OPNreq-1 responses could be sent, each on
   the highest stream id valid at the time it is sent. In this case,
   when the server has received OPNreq requests from the client after
   having sent the first response containing the new value of OPNreq,
   the server is assured that the client has received all of the
   responses. Typically the delay in confirming receipt of the new
   OPNreq value will be shorter than the worst case, and the Response
   Cache entries for the expired streams can be released.


5.2.3.  Handling Batch I/O Requests

   The DAFS_PROC_BATCH_SUBMIT operation requires special consideration.
   If a disconnection occurs before the final
   DAFS_PROC_BC_BATCH_COMPLETE message is sent to indicate that all I/O
   operations are complete, the Response Cache will not contain an entry
   for the batch submission message even though some of the individual
   I/O requests might have completed. Clients will need to reissue the
   DAFS_PROC_BATCH_SUBMIT operation in such circumstances. If other
   clients were accessing the same areas as the batch I/O requests, the
   original sequence of operations will be altered. In such cases, the
   semantics of the repeated I/O operations MAY be different from a sin-
   gle occurrence. It is up to the client using batch I/O requests to
   use them in circumstances where such semantic differences are accept-
   able.

5.2.4.  Server Response Cache in Stable Storage

   If the use of the Response Cache is negotiated and agreed to during
   Session creation, then, in order to provide exactly once semantics
   across server failures, the DAFS server MUST keep its Response Cache
   in stable storage. The server MUST NOT place an entry in the Response
   Cache if the corresponding operation is not reflected in the file
   system. Conversely, if the operation is reflected in the file system
   state, the corresponding entry MUST appear in the Response Cache. A
   mismatch between the file system state and the Response Cache could
   result in an operation being performed more than once or not per-
   formed at all.

   Note: Ensuring agreement between the file system data and the
         Response Cache involves recording operation parameters for
         requests that modify file system state in low-latency stable
         storage (for example, nonvolatile RAM) before performing the
         operation. Following a failure, the server consults the saved
         information and
         uses it to formulate the Response Cache as it will appear to
         the client when new Sessions are established. In some cases,
         the server can determine whether the requested operation was
         completed by examining the file system. In other cases (for
         example, write operations), the operation can be repeated as
         part of server reboot but before allowing any other user access
         to the file system. In all of these cases, the server needs to
         deny access to the modified file system data by other requests,
         before marking the current request complete.
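
   The approach in the note above resembles write-ahead logging. The
   sketch below simulates it with plain dictionaries: "journal" stands
   in for low-latency stable storage and "files" for the file system,
   and both are assumed to survive a crash. All names are hypothetical,
   and journal pruning is omitted.

```python
from typing import Dict, Tuple


class JournalingServer:
    def __init__(self, journal: Dict[int, Tuple[str, bytes]],
                 files: Dict[str, bytes]) -> None:
        self.journal = journal  # stands in for nonvolatile RAM
        self.files = files      # stands in for the file system
        self.response_cache: Dict[int, str] = {}

    def write(self, xid: int, name: str, data: bytes) -> str:
        # 1. Record the operation parameters before performing the operation.
        self.journal[xid] = (name, data)
        # 2. Perform the operation.
        self.files[name] = data
        # 3. Record the result so that it can be fetched after a failure.
        self.response_cache[xid] = "OK"
        return "OK"

    def recover(self) -> None:
        """Rebuild the Response Cache from the journal after a restart."""
        for xid, (name, data) in self.journal.items():
            # Repeating a whole-file write is harmless, so replay it
            # before any other access to the file system is allowed.
            self.files[name] = data
            self.response_cache[xid] = "OK"
```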

5.2.5.  Use of the Server Response Cache

   Assuming that use of the Response Cache was agreed to during Session
   establishment, then following a disconnection, the server MUST save
   the Response Cache information so that it can be used by the client
   upon reconnection. As part of disconnection processing, the server
   MUST ensure that no request issued before disconnection is still
   being executed and that the Response Cache entries associated with
   the disconnected Session can no longer be modified. The information
   in the Response Cache MUST be saved until the client reinitializes,
   reconnects to the server, and queries the Response Cache for the pre-
   vious Session, or for an implementation defined period. The reconnec-
   tion identification can use the same client verifier or, following a
   client reboot, can use a different client verifier.

   The client obtains information from the saved Response Cache by
   specifying the Session-ID of the disconnected Session together with
   the transaction ID. The DAFS_PROC_CHECK_RESPONSE request determines
   whether such an entry exists for the specified request. The
   DAFS_PROC_FETCH_RESPONSE request retrieves the response. The response
   returned to DAFS_PROC_FETCH_RESPONSE is the same as would have been
   returned for the original request.

   Because the server MUST execute operations in a chain in order, all
   Response Cache entries for chained requests will be ordered. (Note,
   however, that the actual responses can be delivered in a different
   order.) If the client queries the Response Cache during replay and
   finds that the last operation in a chain has been completed success-
   fully, then all other operations in that chain were also completed
   successfully. Nonchained operations can complete in any order.

   After response information is obtained for all requests that the
   client needs to verify results for, the client issues a
   DAFS_PROC_DISCARD_RESPONSES request and then proceeds with the rest
   of the necessary recovery. This will include reestablishing any
   necessary cached credentials for the new Session. Note that such
   credentials are NOT REQUIRED to access the Response Cache because no
   file system requests are executed at that time; only the results of
   previously executed requests are obtained.

   When reconnection occurs because of a server reboot or failover,
   locks SHOULD be reclaimed before issuing any new requests. In this
   context, new requests include any previously issued requests that
   were not found in the Response Cache, because they either were not
   executed or could be reexecuted safely.

   After any necessary recovery is done, the client can reissue requests
   that were not found in the Response Cache.
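
   The client-side recovery sequence in this section can be outlined as
   follows. The three server methods are hypothetical Python stand-ins
   for DAFS_PROC_CHECK_RESPONSE, DAFS_PROC_FETCH_RESPONSE, and
   DAFS_PROC_DISCARD_RESPONSES; FakeServer exists only so that the
   sketch can be run, and requests and responses are plain strings.

```python
from typing import Dict, List, Tuple


class FakeServer:
    """In-memory stand-in for a server's saved Response Cache."""

    def __init__(self, saved: Dict[Tuple[int, int], str]) -> None:
        self.saved = dict(saved)  # (session_id, xid) -> response

    def check_response(self, session_id: int, xid: int) -> bool:
        return (session_id, xid) in self.saved

    def fetch_response(self, session_id: int, xid: int) -> str:
        return self.saved[(session_id, xid)]

    def discard_responses(self, session_id: int) -> None:
        self.saved = {k: v for k, v in self.saved.items()
                      if k[0] != session_id}


def recover_in_flight(server, old_session_id: int,
                      in_flight: Dict[int, str]
                      ) -> Tuple[Dict[int, str], List[str]]:
    """Resolve every request that was in flight at disconnection.

    Returns (results, to_reissue): responses recovered from the cache,
    and the requests that were not found and so can be issued again.
    """
    results: Dict[int, str] = {}
    to_reissue: List[str] = []
    for xid, request in in_flight.items():
        if server.check_response(old_session_id, xid):
            results[xid] = server.fetch_response(old_session_id, xid)
        else:
            to_reissue.append(request)
    # All queries are done, so the saved cache can be released.
    server.discard_responses(old_session_id)
    # Remaining recovery (credentials, lock reclaim) would happen here,
    # before the caller reissues anything in to_reissue.
    return results, to_reissue
```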

5.2.6.  Response Cache Operations

   The following DAFS operations are provided for DAFS Response Cache
   access and management.

   DAFS_PROC_CHECK_RESPONSE

      Check a disconnected Session's Response Cache for the results of a
      request.

   DAFS_PROC_FETCH_RESPONSE

      Fetch information from a disconnected Session's Response Cache.

   DAFS_PROC_DISCARD_RESPONSES

      Discard the information in a disconnected Session's Response
      Cache.

5.3.  Server Failover

   Optionally, a file system's failover_locations attribute can be used
   to specify an alternate location to be used to obtain access to the
   file system in the event of server failure. Clients can retrieve the
   failover_locations attribute when they cross into a new file system
   to determine if alternate locations exist. The file system handle
   returned by lookup, lookup parent, and open requests SHOULD be exam-
   ined to see if it contains a file system handle previously unknown to
   the client. At this point, the file system attribute
   failover_locations SHOULD be retrieved to determine the proper place
   to perform failover for that file system.

   If a disconnection occurs, clients will normally attempt to reconnect
   to the server. If this fails, the alternate locations can be used.
   These MAY be the same for all file systems or there MAY be different
   alternate servers for different file systems.
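
   A minimal sketch of the reconnection order just described: try the
   original server first, then each alternate location in turn. The
   try_connect callable is a hypothetical hook that returns a session
   object, or None when the location cannot be reached; the order in
   which alternates are tried is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

S = TypeVar("S")


def connect_with_failover(
    primary: str,
    failover_locations: Sequence[str],
    try_connect: Callable[[str], Optional[S]],
) -> Tuple[str, S]:
    """Return (location, session) for the first location that accepts."""
    for location in [primary, *failover_locations]:
        session = try_connect(location)
        if session is not None:
            return location, session
    raise ConnectionError("no server reachable for this file system")
```

   Because alternates can differ per file system, a client would call
   this once for each file system it has in use.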

   After a functioning alternate server is found for a given file sys-
   tem, recovery is similar to a server reboot. One difference is that
   the client might already have existing Sessions with some or all of
   the alternate servers specified by the failover_locations attribute.

   In any case, the client SHOULD obtain Response Cache information for
   each request that was in flight at the time of disconnection. Because
   the alternate server can be different for different file systems, the
   Response Cache information for each in-flight message MAY need to be
   obtained from different servers. Servers MUST ensure that the
   Response Cache information is propagated to the appropriate alternate
   server for the file system being accessed by the request.

   After the Response Cache information is obtained, recovery proceeds
   as it does in other server failure cases, including establishing
   cached credentials and reclamation of client locks. The client needs
   to be prepared to perform these activities on multiple servers if
   some of its locks are located on file systems that have failed over
   to different alternate servers.

5.3.1.  Changing failover_locations

   When the value of failover_locations changes, the server sets a
   special flag, DAFS_SPCOND_FAILOVER, in the special condition flag
   field of the response header of any response to a request from a
   client that has not fetched the new value. The client SHOULD notice
   the special condition flag and retrieve the failover_locations
   attribute for all file systems so that the client will fail over to
   the correct location in the event of server failure. After the
   client has interrogated failover_locations for file systems where
   the value has changed, the DAFS_SPCOND_FAILOVER flag is reset until a
   subsequent value change causes it to be set again.

   Note: There might be significant advantage in introducing a way for
         the client to obtain failover_location information in a more
         efficient manner. For instance, the client might look up a
         dataset name in a name server or distributed directory and get
         a list of potential servers. See Appendix A. "DAFS Name Ser-
         vice" for more information.


6.  Message Formats

   This chapter describes the format of requests and responses in the
   DAFS protocol. Some highlights of the layout of DAFS messages
   include:

   o  DAFS requests and their responses are encoded as individual
      message packets. Limiting each packet to one DAFS operation places
      a reasonable upper bound on the size of the buffer for the DAT
      receive data transfer operation (DTO) that is preallocated and
      submitted on the DAT connections underlying the DAFS communication
      channels.

   o  Message formats are organized to consolidate fixed-length fields
      at the beginning of messages. Rearranging fields in this manner
      fits with the fixed/variable-sized segregated encoding.

   o  Operations that make use of RDMA capabilities define an inline
      portion of the message along with the format of the data that is
      transferred in a "direct" buffer using remote DMA.

   o  All DAFS messages start with either a request or response header.
      Immediately following the request header is a procedure-specific
      component that contains the arguments of the request. Similarly,
      immediately following the response header in each response message
      is a procedure-specific component that contains the information
      that forms the result.

   In the remaining portions of this section, the discussion focuses on
   the format of DAFS message headers, the Session connection management
   procedures, the Response Cache management operations, the client-
   initiated file system requests, and the server-initiated DAFS back-
   control directives. For each procedure, the functionality of the pro-
   cedure is defined and the format of the argument and result portions
   of the request and response messages are given.

   Note that for each procedure, the identities of the requester and
   responder (for example, DAFS client or server) are implied. For most
   procedures, the DAFS client is the requester and the DAFS server is
   the responder. But for the back-control procedures (such as
   DAFS_PROC_BC_NULL and DAFS_PROC_BC_GETATTR), the DAFS server is the
   requester and the DAFS client is the responder.

6.1.  Message Headers and Common Structures

6.1.1.  Message Format


6.1.1.1.  Overall Packet Format

   All defined message formats have the following attributes:

   o  All multi-byte fields use the byte ordering negotiated for the
      DAFS Session.

   o  The offsets of all 2-byte fields MUST be 2-byte aligned; the
      offsets of all 4-byte fields MUST be 4-byte aligned; the offsets
      of all 8-byte fields MUST be 8-byte aligned; and variable-sized
      fields MUST be padded to ensure the proper alignment of the next
      field. All messages MUST be 8-byte aligned.

   o  Because UTF-8 encoding is used for all string fields, multiple
      octets could be needed to encode a single character. See the fol-
      lowing discussion on variable-sized fields.
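
   The alignment rules above reduce to rounding an offset up to the
   next multiple of the field size. The helper names below are
   hypothetical, and the sketch assumes that alignments are powers of
   two (2, 4, or 8).

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


def pad_len(length: int, next_alignment: int) -> int:
    """Padding bytes needed after a variable-sized field of this length."""
    return align_up(length, next_alignment) - length
```

   For example, a 13-byte UTF-8 string followed by an 8-byte field
   needs three bytes of padding.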

   DAFS packets delivered inline using send and receive data transfer
   operation (DTO) buffers are laid out in two sections: the first con-
   tains all fields that are fixed in size; the second, herein referred
   to as the heap, contains fields that are variable in size. Each
   variable-sized field has an entry in the fixed-sized section
   consisting of a 4-byte offset into the heap where the actual
   variable-sized data is encoded. Some operations, such as read and
   write inline, place the count in the fixed portion of the packet:
   this allows the variable-sized data itself to be aligned at natural
   buffer boundaries if the fixed-size portion is padded accordingly.
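
   To make the fixed-section and heap layout concrete, the sketch below
   encodes a toy message with one 32-bit value and one string. The
   message itself is invented for illustration. The sketch also assumes
   big-endian byte order (the real order is negotiated per Session),
   that the offset is measured from the start of the heap, and that the
   heap entry holds a 32-bit length followed by the UTF-8 bytes.

```python
import struct
from typing import Tuple

FIXED = struct.Struct(">II")  # 32-bit value, then 32-bit heap offset
LENGTH = struct.Struct(">I")  # length prefix of a heap entry


def encode(value: int, name: str) -> bytes:
    data = name.encode("utf-8")
    heap = LENGTH.pack(len(data)) + data
    heap += b"\x00" * (-len(heap) % 8)  # pad so the message is 8-byte aligned
    fixed = FIXED.pack(value, 0)        # the only heap entry sits at offset 0
    return fixed + heap


def decode(packet: bytes) -> Tuple[int, str]:
    value, offset = FIXED.unpack_from(packet, 0)
    entry = FIXED.size + offset         # the heap starts after the fixed section
    (length,) = LENGTH.unpack_from(packet, entry)
    start = entry + LENGTH.size
    return value, packet[start:start + length].decode("utf-8")
```

   A reader can locate every fixed-sized field without parsing any
   variable-sized data, which is the point of segregating the two
   sections.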

   DAFS defines data packets that contain variable-sized fields inside
   the format of a variable-sized field. This nesting is handled by
   recursively encoding each variable-sized field: its fixed contents
   are encoded and any nested variable-sized field is encoded using an
   offset into the next available location in the heap, which will
   contain the length and variable-sized contents as before.

   In addition to inline data, some operations define data transfers
   using RDMA functionality. Each such operation defines the format of
   the information transferred via RDMA.

   Encoding of unions is done in a manner similar to what most C
   compilers do: unions are encoded in a memory chunk large enough to
   hold the largest arm of the union. Union arms that are smaller than
   the allocated chunk are padded to fill in the unused portion of the
   union's memory. Encoding unions in this fashion makes them fixed in
   size regardless of which set of data is included. Fixed-sized unions
   ease computations of fixed offsets within packets.
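
   The effect of fixed-size unions on offset computation can be seen
   with an ordinary C union. The following sketch uses hypothetical
   type and field names (sample_union, sample_fixed) that do not appear
   in any DAFS message; it only illustrates that a field placed after
   the union has the same offset whichever arm is in use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical union with arms of different sizes. */
union sample_union {
    uint8_t  small_arm;   /* 1 byte used, the remainder is padding */
    uint64_t large_arm;   /* largest arm fixes the encoded size    */
};

/* Hypothetical fixed section containing the union. */
struct sample_fixed {
    uint32_t           discriminator;
    uint32_t           pad;       /* keeps the union 8-byte aligned   */
    union sample_union body;
    uint32_t           trailer;   /* same offset for every union arm  */
};
```

   Here offsetof(struct sample_fixed, trailer) is 16 regardless of the
   discriminator value, which is the property that joins give up.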

   There is also a need for union-like constructs (called joins) whose
   encoding attempts to minimize memory usage. This encoding does not
   use padding for arms that are shorter than the maximum size, thus
   using only as much memory as is needed. The downside of this
   encoding is the inability to compute fixed offsets after one of
   these constructs appears in a DAFS packet; that is, definitions
   containing joins appear in the variable sections of structure
   definitions.

6.1.1.2.  Data Definition Language

   DAFS packet formats are described in a C-like language. The main
   differences are as follows:

   o  Counted arrays with an upper bound are described using angle
      brackets. For example, an array of at most N 32-bit numbers
      would be described as int32 Sample_CountedArray<N>; This
      definition translates to a leading 32-bit unsigned
      number-of-entries field followed by the array itself.

      A counted array with zero entries is defined to be an array with
      zero in the number of entries field, followed by no array entries.

   o  A counted array with no upper bound is also defined using angle
      brackets, but the maximum size is omitted, for example: int32
      Sample_VarCountedArray<>; The encoding is the same as that used
      for the counted arrays discussed previously.

   o  In contrast to counted arrays, the DAFS definition language also
      supports the more traditional arrays (as in C). These arrays are
      defined using the standard square brackets, and their encoding
      omits the leading 32-bit number-of-entries field. A sample
      definition: int32 Array[N]; An unbounded instance of a traditional
      array is specified by int32 Array[]; omitting the maximum size.

   o  Unions and joins explicitly define the discriminator by using
      switch-style syntax. Unlike switch statements, there is no need
      to add 'break' statements at the end of each case. An empty case
      statement can be used to share the format with the next nonempty
      case.

                   [union | join] switch (discriminator)
                      {
                      case A:
                         .
                      case B:
                         .
                         .
                      default:
                      }


   o  Determining whether a field is encoded in the initial fixed-size
      fields section or the subsequent variable-sized fields section
      requires understanding whether the field is itself variable
      length or contains variable-length fields. Inline comments are
      used to note which fields are variable length in each message
      packet. These fields are encoded in the message packet with an
      offset value in the fixed-size section pointing to the start of
      the (variable-size) value of the field in the variable-size
      section. This variable-size section of the message is called the
      "heap." The order in which variable-size fields are listed in a
      structure definition is not necessarily the order in which they
      will be encoded for transmission; the encoded order is determined
      by the offsets into the "heap" found in the fixed sections of the
      packet. Moreover, not all variable portions of a structure are
      necessarily included in all packets. Take the case of a union
      that contains variable fields in different arms: only those
      variable fields in the selected arm will appear in the
      transmitted packet.

   o  In addition, comments are used to note when an RDMA buffer is
      referenced through the use of a direct_op_buffer field in the
      message. These comments are labeled "DIRECT:". Although direct
      sections are noted via comments sequentially following the
      structure definitions, the actual memory buffers involved in the
      transfer will rarely be laid out right after the inline data.
      Moreover, the transfer of the contents of the buffer will occur
      as separate transport operations.

6.1.1.3.  Alignment of Variable Length Fields

   There are two items to be considered when dealing with a
   variable-sized field:

   1) Encoding of a variable sized field.

   2) Calculation of offsets.

6.1.1.3.1.  Encoding of a variable sized field

   There are three types of variable sized fields:

   1) counted arrays

      Encoding of a counted array is as follows:

      o  Counted arrays always begin on an 8-byte boundary. The encoding
         of a counted array remains the same irrespective of where it
         appears on the heap.


      o  The individual components of a counted array are naturally
         aligned.

      o  Counted arrays begin with a 4-byte count.

      o  Next is a 4-byte pad, if the elements of the counted array are
         8-byte aligned.

      o  Counted arrays are not padded to any natural boundary at the
         end. Any padding is dictated by the alignment requirements of
         the next item in the heap.

   2) joins

      Encoding of a join is straightforward: only the selected arm is
      encoded, with no padding out to the size of the largest arm.

   3) pathnames

      A pathname is a counted array of utf8strings, which themselves are
      variable-sized. They are not encoded recursively. A pathname
      always begins on an 8-byte boundary just like any other counted
      array and consists of a 4-byte count followed by a 4-byte pad and
      the utf8strings themselves. As before, a utf8string is not padded
      to any natural boundary at the end, and any padding that is
      necessary is dictated by the alignment that is REQUIRED for the
      next item.
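
   The counted array rules above can be sketched in C for the case of
   8-byte elements. The function name encode_counted_u64 and its
   parameters are hypothetical, and the sketch assumes that the host
   byte order is the byte order negotiated for the Session, so no byte
   swapping is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode a counted array of 64-bit elements into the heap, starting at
 * or after byte position pos. On success, stores the position where the
 * array begins in *start_out and returns the position just past the
 * last element; returns 0 if the heap buffer is too small.
 * Hypothetical helper; not part of the DAFS definition. */
static size_t encode_counted_u64(uint8_t *heap, size_t heap_len, size_t pos,
                                 const uint64_t *elems, uint32_t count,
                                 size_t *start_out)
{
    size_t start, need;

    assert(heap != NULL && start_out != NULL);
    if (count > heap_len / 8)
        return 0;                          /* cannot fit; avoids overflow */

    start = (pos + 7) & ~(size_t)7;        /* begin on an 8-byte boundary */
    need  = 4 + 4 + (size_t)count * 8;     /* count, pad, elements        */
    if (start > heap_len || heap_len - start < need)
        return 0;

    memcpy(heap + start, &count, 4);       /* 4-byte count                */
    memset(heap + start + 4, 0, 4);        /* 4-byte pad for 8-byte items */
    if (count != 0)
        memcpy(heap + start + 8, elems, (size_t)count * 8);

    *start_out = start;
    return start + need;                   /* no trailing pad is added    */
}
```

   The returned position is deliberately left unaligned; as stated
   above, any padding after the array is dictated by the next item
   placed in the heap.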

6.1.1.3.2.  Calculation of offsets

   Any procedure request or result will have:

   1) Header

   2) Procedure-specific fixed section

   3) Heap.

   Hereafter the procedure-specific fixed section is called the "fixed
   section."

   Offset fields specify the number of bytes from the beginning of the
   innermost scope that contains the offset field up to the beginning of
   the variable length field pointed to.

   Definition of a scope:

   1) Fixed section creates the outer scope.


   2) Any variable sized field (a counted array or a join) creates an
      inner scope.

   3) The above two constructs (fixed section and variable sized field)
      are the only ones that can define a scope.

   Some examples:

   1) Fixed section contains an offset to a counted array of RDMA
      buffers:

      o  The offset is the number of bytes from the beginning of the
         fixed section up to the beginning of the counted array of RDMA
         buffers that it points to.

   2) Fixed section contains an offset to a counted array of read/write
      requests and each read/write request contains an offset to a
      counted array of file chunks.

      o  The offset in the fixed section is the number of bytes from the
         beginning of the fixed section up to the beginning of the
         counted array of read/write requests that it points to.

      o  The offset in the read/write request is the number of bytes
         from the beginning of the counted array of read/write requests
         up to the beginning of the counted array of file chunks that it
         points to.

   3) Fixed section contains an offset to a counted array of directory
      entries; each directory entry contains an offset to a join of
      attributes; and each join of attributes contains an offset to an
      owner name.

      o  The offset in the fixed section is the number of bytes from the
         beginning of the fixed section up to the beginning of the
         counted array of directory entries that it points to.

      o  The offset in the directory entry is the number of bytes from
         the beginning of the counted array of directory entries up to
         the beginning of the attributes join that it points to.

      o  The offset in the attributes join is the number of bytes from
         the beginning of the attributes join up to the beginning of the
         owner name.
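
   The scope rule in these examples can be sketched as a small C helper
   that resolves an offset relative to the start of its enclosing
   scope. The name dafs_resolve and the bounds check against the end of
   the message are assumptions made for illustration; this
   specification does not define such a function.

```c
#include <stddef.h>
#include <stdint.h>

/* Resolve an offset field. The offset counts bytes from the start of
 * the innermost scope containing the offset field. Returns NULL when
 * the result would fall outside the message.
 * Hypothetical helper; not part of the DAFS definition. */
static const uint8_t *dafs_resolve(const uint8_t *scope_start,
                                   const uint8_t *msg_end,
                                   uint32_t offset)
{
    if (scope_start == NULL || scope_start > msg_end)
        return NULL;
    if ((size_t)(msg_end - scope_start) <= offset)
        return NULL;
    return scope_start + offset;
}
```

   For example 2 above, the counted array of read/write requests is
   found with dafs_resolve(fixed_section, end, offset_in_fixed), and a
   counted array of file chunks is then found with
   dafs_resolve(requests, end, offset_in_request), where requests is
   the pointer returned by the first call.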

6.1.1.4.  Basic Data Types

   The basic building blocks of DAFS messages are:


   o  dafs_int8, dafs_int16, dafs_int32, dafs_int64: 1-, 2-, 4-, and 8-
      byte signed quantities.

   o  dafs_uint8, dafs_uint16, dafs_uint32, dafs_uint64: 1-, 2-, 4-, and
      8-byte unsigned quantities.

   o  dafs_opaque8, dafs_opaque16, dafs_opaque32, dafs_opaque64: 1-, 2-,
      4-, and 8-byte opaque containers. These containers are truly
      opaque and do not require byte swapping by a host with an endian
      encoding different than the encoding agreed upon for use in the
      DAFS Session.

   o  typedef dafs_uint8 dafs_boolean; TRUE is defined as 1, FALSE as 0.

   o  Enumeration types are encoded as dafs_uint32.

   o  typedef dafs_uint8 dafs_utf8string<>; This defines a
      variable-sized UTF-8 string. The dafs_uint32 array count field
      precedes the array entries and is the number of octets in the
      array, not the number of characters.

   o  typedef dafs_uint32 dafs_var_offset_type; Heap offset to the
      beginning of a variable-sized field. The offset is in bytes.
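
   One possible C rendering of the fixed-size basic types is shown
   below, using the exact-width types from <stdint.h>. Representing the
   opaque containers as unsigned integers of the same width is an
   assumption of this sketch; an implementation only has to preserve
   their size and avoid byte swapping them.

```c
#include <stdint.h>

typedef int8_t   dafs_int8;
typedef int16_t  dafs_int16;
typedef int32_t  dafs_int32;
typedef int64_t  dafs_int64;

typedef uint8_t  dafs_uint8;
typedef uint16_t dafs_uint16;
typedef uint32_t dafs_uint32;
typedef uint64_t dafs_uint64;

/* Opaque containers: same width as the integers, never byte swapped. */
typedef uint8_t  dafs_opaque8;
typedef uint16_t dafs_opaque16;
typedef uint32_t dafs_opaque32;
typedef uint64_t dafs_opaque64;

typedef dafs_uint8 dafs_boolean;      /* TRUE is 1, FALSE is 0 */
#define DAFS_TRUE  ((dafs_boolean)1)  /* macro names are illustrative */
#define DAFS_FALSE ((dafs_boolean)0)

typedef dafs_uint32 dafs_var_offset_type;  /* heap offset in bytes */
```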

6.1.1.5.  Endianness

   All DAFS messages exchanged between clients and servers MUST adhere
   to the endianness requirement agreed upon by both parties when a DAFS
   Session is established. This includes client-initiated requests as
   well as server-initiated back-control directives.

   There is a simple rule for negotiating the endianness of data
   encoding: the client chooses. After the encoding format is chosen,
   neither party can change it during the lifetime of the Session.
   Moreover, because the server can cache replies to a client's recent
   state-modifying requests for recovery purposes, clients are
   strongly encouraged to request the same endianness when
   establishing Sessions that will be used to reissue requests of a
   previous failed Session. In practice this is not expected to be an
   issue, because clients are expected to always prefer one encoding
   over the other.

   Rationale: Unlike other network-based file access protocols, DAFS can
              enable extremely low overhead client access. The server
              also benefits from a reduction in protocol processing, but
              still has work to do to simply service the request.
              Because of this asymmetry in processing overhead, DAFS
              biases in favor of low-overhead clients by letting the
              client specify the endianness of the in-memory data
              structures.
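
   Because the client chooses the byte order, the party whose native
   order differs (typically the server) converts multi-byte fields on
   receipt. The following C sketch shows one way to do this for 4-byte
   fields; the enum and function names are hypothetical and are not
   defined by this specification.

```c
#include <stdint.h>

/* Hypothetical names for the two possible Session byte orders. */
enum dafs_byte_order { DAFS_ORDER_BIG, DAFS_ORDER_LITTLE };

/* Determine the byte order of the host at run time. */
static enum dafs_byte_order dafs_host_order(void)
{
    const uint16_t probe = 1;
    return (*(const uint8_t *)&probe == 1) ? DAFS_ORDER_LITTLE
                                           : DAFS_ORDER_BIG;
}

/* Reverse the four bytes of a 32-bit value. */
static uint32_t dafs_swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Convert a 4-byte field, loaded as-is from the message, into host
 * order, given the byte order negotiated for the Session. */
static uint32_t dafs_wire_to_host32(uint32_t wire,
                                    enum dafs_byte_order session)
{
    return (session == dafs_host_order()) ? wire : dafs_swap32(wire);
}
```

   When the Session order matches the host order the conversion is a
   no-op, which is the low-overhead case the rationale above favors for
   clients. Opaque fields are never passed through such a conversion.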

6.1.1.6.  Internationalization Support

   The encoding/representation of strings brings up the issue of
   internationalization support in the protocol. DAFS requires the use
   of UTF-8 encoding for strings, including file names. The following
   paragraph is taken from the NFS Version 4 specification and applies
   to the DAFS protocol as well:

           "The primary issue in which NFS needs to deal with
           internationalization, or I18n, is with respect to file
           names and other strings as used within the protocol. The
           choice of string representation must allow reasonable
           name/string access to clients which use various languages.
           The UTF-8 encoding of the UCS as defined by [ISO10646]
           allows for this type of access and follows the policy
           described in 'IETF Policy on Character Sets and
           Languages', [RFC2277]."  (RFC 3010, p. 91)

6.1.2.  Request Header

   All DAFS requests begin with the following header:

        struct DAFS_Request_Header
           {
           dafs_uint32                header_magic;
           dafs_uint32                protocol_version;
           dafs_uint16                desired_nreq;
           dafs_uint16                chain_flags;
           dafs_uint16                stream_id;
           dafs_uint16                seq_number;
           dafs_opaque64              analyzer;
           dafs_checksum_type         message_checksum;
           dafs_cred_handle_type      cred_handle;
           dafs_uint32                procedure;
           dafs_uint32                request_len;
           };


   Fields:

   header_magic

      The magic sequence 0x44 0x41 0x46 0x53 ('D' 'A' 'F' 'S') is used
      to mark each message header. It also determines the endianness
      of the message. The first transmitted message on a Session
      determines the endianness for the Session, and all subsequent
      messages MUST use the same endianness. The magic sequence can
      also be used as a sanity check and to aid identification of
      DAFS-related packets on bus analyzers, etc.

   protocol_version

      The protocol version used by the client for the connection is
      specified in the message header. Once a Session is created, all
      messages exchanged on the Session MUST specify the same protocol
      version.

      If the server does not accept this version of the protocol, but
      does accept another version of the DAFS protocol, it will respond
      to the client with a message header containing the DAFS
      header_magic field, and a protocol version the server will accept.
      The client can then retry the DAFS connect operation with that
      protocol version, or any other that it wishes to try. The client
      can also determine the protocol version supported by the server
      through the use of the DAFS name service.

   desired_nreq

      Flow control field as described in 3.2.6., "Message Flow Control".

   chain_flags

      Chaining flags for this request as described in 4.3., "Request
      Chaining".

   stream_id

      Slot number portion of the transaction ID for this request. The
      responder grants to the requester some number of requests, OPNreq,
      which can be simultaneously outstanding at any given time. The
      client manages this pool of requests as a collection of mail slots
      or mailboxes, and guarantees not to submit a request to a specific
      slot before receiving the result from a request previously
      submitted to that slot. The stream_id value MUST be between 0 and
      OPNreq - 1. See 5.2.1., "Response Cache" for more information.

   seq_number

      The rest of the transaction ID for this request. This is a
      sequence number that is incremented for each request sent on a
      given slot. The combined value of stream_id and seq_number serves
      to identify requests in the Response Cache for the purposes of
      failure recovery. For more information, see 5.2.1., "Response
      Cache".

   analyzer

      The server can make no assumption about the value of this field,
      but MUST simply return it in the response message for this
      request.

   Note: The analyzer field is opaque to the DAFS server. It is simply
         returned in each response message associated with a given
         request. The client MAY store information in this field that it
         deems helpful to it when the response is received.

   message_checksum

      Room for an optional checksum, if the use of one has been nego-
      tiated, as described in 3.2.5., "Checksums".

   cred_handle

      Credentials to be used for authorizing the request as
      described in 3.1.1., "Security Model".

   procedure

      Procedure number for this request, as defined below.

   request_len

      Length in bytes of the entire request, including the header.
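
   The request header maps directly onto a C structure of 40 bytes with
   no compiler-inserted padding, given the field sizes defined in this
   section. The sketch below spells the DAFS types with <stdint.h>
   equivalents; the two helper functions and the packing of stream_id
   and seq_number into a single 32-bit key are assumptions made for
   illustration, not requirements of the protocol.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint16_t S2; uint16_t S1; } dafs_checksum_type;

struct DAFS_Request_Header {
    uint32_t           header_magic;
    uint32_t           protocol_version;
    uint16_t           desired_nreq;
    uint16_t           chain_flags;
    uint16_t           stream_id;
    uint16_t           seq_number;
    uint64_t           analyzer;          /* dafs_opaque64         */
    dafs_checksum_type message_checksum;
    uint32_t           cred_handle;       /* dafs_cred_handle_type */
    uint32_t           procedure;
    uint32_t           request_len;
};

/* Basic checks taken from the field descriptions: stream_id lies in
 * 0..OPNreq-1 and request_len covers at least the header itself.
 * Hypothetical helper; returns nonzero when both checks pass. */
static int dafs_request_header_ok(const struct DAFS_Request_Header *h,
                                  uint16_t opnreq)
{
    if (h->stream_id >= opnreq)
        return 0;
    if (h->request_len < sizeof *h)
        return 0;
    return 1;
}

/* Combine the two halves of the transaction ID into one lookup key.
 * The packing order is an arbitrary choice for this sketch. */
static uint32_t dafs_transaction_key(const struct DAFS_Request_Header *h)
{
    return ((uint32_t)h->stream_id << 16) | h->seq_number;
}
```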

6.1.3.  Response Header

   All DAFS responses begin with the following header:


        struct DAFS_Response_Header
           {
           dafs_uint32                header_magic;
           dafs_uint32                protocol_version;
           dafs_uint16                target_nreq;
           dafs_uint16                spec_cond;
           dafs_uint16                stream_id;
           dafs_uint16                seq_number;
           dafs_opaque64              analyzer;
           dafs_checksum_type         message_checksum;
           dafs_uint32                status;
           dafs_uint32                response_len;
           dafs_uint32                reserved;
           };


   Fields:

   header_magic

      The magic sequence 0x44 0x41 0x46 0x52 ('D' 'A' 'F' 'R') is used
      to mark each reply message header. This can be used as a sanity
      check and to aid identification of DAFS-related packets on bus
      analyzers, etc.

   protocol_version

      Protocol version of the packet. For messages on a successfully
      created Session, this is the same as the protocol_version that was
      received in the request. In a reply to a connection request that
      has failed because of a protocol mismatch, this field contains the
      next lowest numbered protocol version supported by the server, or
      zero if no such alternative is supported.

   target_nreq

      Flow control field as described in 3.2.6., "Message Flow Control".

   spec_cond

      Flags for special conditions that the client SHOULD note and act
      upon. The server SHOULD set all undefined flags to zero. The
      flags currently defined follow:

      o  DAFS_SPCOND_LMOVED (0x0001): Indicates that a lease for this
         client has been migrated to a new server and that the client
         SHOULD renew its leases at that new location.

      o  DAFS_SPCOND_FAILOVER (0x0002): Indicates that a value for the
         failover_locations attribute previously fetched by the client
         has changed and that the client SHOULD fetch new values for
         all file systems.

   stream_id

      Field value copied from the request header to which this response
      pertains.

   seq_number

      Field value copied from the request header to which this response
      pertains.

   analyzer

      The server can make no assumption about the value of this field,
      but MUST simply copy it from the request message for this
      response.

   message_checksum

      Room for an optional checksum, if the use of one has been nego-
      tiated, as described in 3.2.5., "Checksums".

   status

      Result code for the operation.

   response_len

      Length in bytes of the entire response, including the header.

   reserved

      Reserved for future use. This reserved field forces the message
      header to a multiple of 8 bytes, ensuring that operation headers
      will be aligned on 8-byte boundaries.
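
   A client can test the spec_cond flags described above with simple
   masks. The following C sketch uses the two flag values from this
   section; the function names are hypothetical.

```c
#include <stdint.h>

#define DAFS_SPCOND_LMOVED   0x0001u
#define DAFS_SPCOND_FAILOVER 0x0002u

/* Nonzero when the client should renew its leases at a new location. */
static int dafs_lease_moved(uint16_t spec_cond)
{
    return (spec_cond & DAFS_SPCOND_LMOVED) != 0;
}

/* Nonzero when the client should refetch failover_locations values. */
static int dafs_failover_changed(uint16_t spec_cond)
{
    return (spec_cond & DAFS_SPCOND_FAILOVER) != 0;
}

/* Flags not defined in this section; the server is expected to leave
 * these at zero. */
static uint16_t dafs_undefined_spcond(uint16_t spec_cond)
{
    return (uint16_t)(spec_cond &
                      ~(DAFS_SPCOND_LMOVED | DAFS_SPCOND_FAILOVER));
}
```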

6.1.4.  Basic Types

   This section defines some basic DAFS types that will be used when
   describing packet formats for DAFS requests and responses.


        typedef dafs_uint64           dafs_attr_bitmap_type;
        typedef dafs_opaque64         dafs_session_id_type;
        typedef dafs_opaque64         dafs_client_id_type;
        typedef dafs_opaque64         dafs_state_id_type;
        typedef dafs_opaque32         dafs_cred_handle_type;
        typedef dafs_uint32           dafs_status_type;

        typedef struct dafs_checksum
           {
           dafs_uint16                S2;
           dafs_uint16                S1;
           } dafs_checksum_type;

        typedef dafs_uint32           dafs_memhandle_type;
        typedef dafs_memhandle_type   dafs_rmr_context_type;

        typedef dafs_uint64           dafs_rmr_target_address_type;


   An RMR Context identifies a virtually contiguous buffer that other
   systems can read from or write to using RDMA capabilities.


        typedef dafs_opaque64         dafs_FSHandle_type[2];
        typedef dafs_opaque64         dafs_verifier_type;
        typedef dafs_utf8string       dafs_component_type;  /* heap */
        typedef dafs_component_type<> dafs_pathname_type;   /* heap */
        typedef dafs_opaque8<>        dafs_lockowner_type;  /* heap */
        typedef dafs_opaque8<>        dafs_client_string_type;
                                                            /* heap */

        typedef dafs_utf8string       dafs_fencing_id_type; /* heap */
        typedef dafs_fencing_id_type<>  dafs_fence_array_type;

        typedef struct dafs_filehandle
           {
           dafs_FSHandle_type         fshandle;
           dafs_opaque64              fileid[6];
           } dafs_filehandle_type;

        typedef struct dafs_fs_location
           {
           dafs_utf8string            server;              /* heap */
           dafs_pathname_type         root_path;           /* heap */
           } dafs_fs_location_type;

        typedef struct dafs_fs_locations
           {
           dafs_pathname_type         fs_root;             /* heap */
           dafs_fs_location_type      locations<>;         /* heap */
           };

        typedef struct dafs_ace
           {
           dafs_uint32                type;
           dafs_uint32                flag;
           dafs_uint32                access_mask;
           dafs_utf8string            who;                 /* heap */
           } dafs_ace_type;


   The bitmask values for the type field above:

        #define DAFS_ACE_ACCESS_ALLOWED_ACE_TYPE    0x00000000
        #define DAFS_ACE_ACCESS_DENIED_ACE_TYPE     0x00000001
        #define DAFS_ACE_SYSTEM_AUDIT_ACE_TYPE      0x00000002
        #define DAFS_ACE_SYSTEM_ALARM_ACE_TYPE      0x00000004


   The bitmask values for the flag field above:

        #define DAFS_ACE_FILE_INHERIT_ACE           0x00000001
        #define DAFS_ACE_DIRECTORY_INHERIT_ACE      0x00000002
        #define DAFS_ACE_NO_PROPAGATE_INHERIT_ACE   0x00000004
        #define DAFS_ACE_INHERIT_ONLY_ACE           0x00000008
        #define DAFS_ACE_SUCCESSFUL_ACCESS_ACE_FLAG 0x00000010
        #define DAFS_ACE_FAILED_ACCESS_ACE_FLAG     0x00000020
        #define DAFS_ACE_IDENTIFIER_GROUP           0x00000040


   The values for the access_mask field above:

        #define DAFS_ACE_READ_DATA                  0x00000001
        #define DAFS_ACE_LIST_DIRECTORY             0x00000001
        #define DAFS_ACE_WRITE_DATA                 0x00000002
        #define DAFS_ACE_ADD_FILE                   0x00000002
        #define DAFS_ACE_APPEND_DATA                0x00000004
        #define DAFS_ACE_ADD_SUBDIRECTORY           0x00000004
        #define DAFS_ACE_READ_NAMED_ATTRS           0x00000008
        #define DAFS_ACE_WRITE_NAMED_ATTRS          0x00000010
        #define DAFS_ACE_EXECUTE                    0x00000020
        #define DAFS_ACE_DELETE_CHILD               0x00000040
        #define DAFS_ACE_READ_ATTRIBUTES            0x00000080
        #define DAFS_ACE_WRITE_ATTRIBUTES           0x00000100
        #define DAFS_ACE_DELETE                     0x00010000
        #define DAFS_ACE_READ_ACL                   0x00020000
        #define DAFS_ACE_WRITE_ACL                  0x00040000
        #define DAFS_ACE_WRITE_OWNER                0x00080000
        #define DAFS_ACE_SYNCHRONIZE                0x00100000
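
   The access_mask values are single bits that can be combined and
   tested with bitwise operations. The C sketch below repeats a few of
   the values defined above and checks whether an ACE covers a set of
   requested access bits. The structure sample_ace and the function
   ace_covers are hypothetical; the sketch does not describe how a
   server evaluates a complete access control list.

```c
#include <stdint.h>

#define DAFS_ACE_ACCESS_ALLOWED_ACE_TYPE 0x00000000u
#define DAFS_ACE_READ_DATA               0x00000001u
#define DAFS_ACE_LIST_DIRECTORY          0x00000001u
#define DAFS_ACE_WRITE_DATA              0x00000002u
#define DAFS_ACE_EXECUTE                 0x00000020u

/* Fixed-size part of an ACE. The "who" string lives in the heap and
 * is omitted from this sketch. */
struct sample_ace {
    uint32_t type;
    uint32_t flag;
    uint32_t access_mask;
};

/* Nonzero when every bit in wanted is present in the ACE's mask. */
static int ace_covers(const struct sample_ace *ace, uint32_t wanted)
{
    return (ace->access_mask & wanted) == wanted;
}
```

   Note that DAFS_ACE_READ_DATA and DAFS_ACE_LIST_DIRECTORY share the
   same bit, as do the other file and directory pairs listed above, so
   the meaning of a bit depends on the type of the file object.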

        typedef struct dafs_specdata
           {
           dafs_uint64                specdata1;
           dafs_uint64                specdata2;
           } dafs_specdata_type;

        typedef struct dafs_change_info
           {
           dafs_uint64                before;
           dafs_uint64                after;
           dafs_uint32                atomic;
           dafs_uint32                pad;
           } dafs_change_info_type;

   Servers set the atomic field to TRUE if they can modify a file
   object and obtain the before and after times atomically.


        typedef struct dafs_time_type
           {
           dafs_int64                 seconds;
           dafs_uint32                nseconds;
           } dafs_time_type;


   A positive value in the seconds field refers to a time after 0-hour
   January 1, 1970 UTC (Coordinated Universal Time). A negative value
   refers to a time before 0-hour January 1, 1970 UTC.

        enum dafs_timeset_how
           {
           SET_TO_SERVER_TIME         = 1,
           SET_TO_CLIENT_TIME         = 2
           };

        typedef struct dafs_settime
           {
           dafs_time_type             client_time; /* CLIENT TIME */
           enum dafs_timeset_how      how;
           } dafs_settime_type;


6.1.5.  File Attributes

   The DAFS attributes format definition provides for selective support
   and retrieval of a subset of attributes and for extending the number
   of supported attributes in a straightforward manner. The DAFS defini-
   tion separates attributes into two subsets: file attributes and file
   system attributes.

   File attributes that are labeled mandatory MUST be supported by all
   DAFS server implementations. A DAFS server MAY support non-mandatory
   attributes, and a DAFS client MUST NOT rely on a server implementing
   any of these attributes.

   The goal of the attribute encoding scheme proposed here is to
   consume bandwidth only for the attributes requested. When fetching
   attributes, a client can request a subset of all possible attributes.
   Note that by virtue of using joins, the size of the file attributes
   structure is variable and, therefore, attributes are placed in the
   variable section of packet definitions.

   The server is REQUIRED to return a packet formatted to contain all
   requested attributes. If the server does not support or is unable to
   supply a requested attribute, it MUST include the space for that
   attribute in the response and indicate that it is invalid in the
   valid attributes field of the response. The "blank" fields for
   requested-but-unsupported attributes are included to enable the
   client to quickly locate its requested attributes using precomputed
   offsets based on the sizes of the attribute fields it requested.
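
   A client can find the attributes that were requested but not
   supplied by comparing the two bitmaps in the response. The C sketch
   below assumes that a cleared bit in the valid bitmap marks an
   attribute whose field is present but blank; the function name is
   hypothetical.

```c
#include <stdint.h>

typedef uint64_t dafs_attr_bitmap_type;

/* Bits set in the result identify attributes that occupy space in the
 * response (they are in included) but carry no usable value (they are
 * not in valid). Hypothetical helper. */
static dafs_attr_bitmap_type
dafs_unsupplied_attrs(dafs_attr_bitmap_type included,
                      dafs_attr_bitmap_type valid)
{
    return included & ~valid;
}
```

   Because the blank fields are still present, the client's precomputed
   offsets stay correct; the result of this comparison only tells the
   client which of those fields to ignore.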

   The DAFS file attributes definitions follow. Attribute 1 is
   represented by the least significant bit in the attributes bit map,
   and subsequent attributes are represented by the corresponding bit.
   The server MUST support those labeled "mandatory."

   DAFS_FATTR_NAMED_ATTR (1)

      Boolean that indicates server support of named attributes.

   DAFS_FATTR_ARCHIVE   (2)

      Boolean that indicates whether a file has been archived (backed
      up) since the time of last modification. This attribute is
      writable by the client.

   DAFS_FATTR_HIDDEN   (3)

      Determines if file is hidden for Win32 purposes. This attribute is
      writable by the client.

   DAFS_FATTR_SYSTEM   (4)

      Whether file object is a system file object for Win32 purposes.
      This attribute is writable by the client.

   DAFS_FATTR_OBJECT_TYPE   (5)

      File object type. Support for this file attribute is mandatory.
      See 6.5.9., "DAFS_PROC_CREATE" for supported types.

   DAFS_FATTR_MODE   (6)

      Access mode bits - Unix style. This attribute is writable by the
      client.

   DAFS_FATTR_NUM_LINKS   (7)

      Number of links pointing to file object.

   DAFS_FATTR_CHANGE   (8)

      Server-generated value that changes any time the file object
      changes. A server MAY use the modification time if the granularity
      is appropriate. Support for this file attribute is mandatory.

   DAFS_FATTR_OBJECT_SIZE   (9)

      Size, in bytes, of file object. Attribute is writable by the
      client. Support for this file attribute is mandatory.

   DAFS_FATTR_FILE_ID   (10)

      Unique id for this file object. Id is unique within the same
      FSHandle space. Support for this file attribute is mandatory.

   DAFS_FATTR_SPACE_USED   (11)

      File system bytes allocated to this file object.

   DAFS_FATTR_TIME_ACCESS   (12)

      The last time this object was accessed.

   DAFS_FATTR_TIME_ACCESS_SET   (13)

      Set the time last accessed to this value. Client write-only attri-
      bute.

   DAFS_FATTR_TIME_BACKUP   (14)

      Last time this object was backed up. This attribute is writable by
      the client.

   DAFS_FATTR_TIME_CREATE   (15)

      Creation time of this object. This is not the Unix-style c-time.
      This attribute is writable by the client.

   DAFS_FATTR_TIME_DELTA   (16)

      Time granularity supported by server.

   DAFS_FATTR_TIME_METADATA   (17)

      Last time this object's metadata changed. This attribute is writ-
      able by the client.

   DAFS_FATTR_TIME_MODIFY   (18)

      Last time this object's contents were modified.

   DAFS_FATTR_TIME_MODIFY_SET   (19)

      Set the time last modified to this value. Client write-only
      attribute.

   DAFS_FATTR_RAW_DEV   (20)

      Identifier for raw devices.

   DAFS_FATTR_FILEHANDLE   (21)

      Filehandle for this object. Support for this file attribute is
      mandatory.

   DAFS_FATTR_ACL   (22)

      Access control list for this object. This attribute is writable by
      the client.

   DAFS_FATTR_MIME_TYPE   (23)

      MIME body type/subtype. This attribute is writable by the client.

   DAFS_FATTR_OWNER   (24)

      The owner of this object. This attribute is writable by the
      client.

   DAFS_FATTR_OWNER_GROUP   (25)

      This object's owner's group. This attribute is writable by the
      client.

   The "bitset" pseudo-function is used to define the specification of
   the file attributes structure. Bitset(x, y) is TRUE if the attribute
   bit "y" is set in the attribute bitmap "x".

        typedef struct file_attr
           {
           attr_bitmap_type           included;
           attr_bitmap_type           valid;

           join switch (bitset(included, DAFS_FATTR_NAMED_ATTR))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } named_attributes;

           join switch (bitset(included, DAFS_FATTR_ARCHIVE))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } archive;

           join switch (bitset(included, DAFS_FATTR_HIDDEN))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } hidden;

           join switch (bitset(included, DAFS_FATTR_SYSTEM))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } system;

           join switch (bitset(included, DAFS_FATTR_OBJECT_TYPE))
              {
              case TRUE:
                 dafs_uint32          contents;
              case FALSE:
                 void;
              } object_type;

           join switch (bitset(included, DAFS_FATTR_MODE))
              {
              case TRUE:
                 dafs_uint32          contents;
              case FALSE:
                 void;
              } mode;

           join switch (bitset(included, DAFS_FATTR_NUM_LINKS))
              {
              case TRUE:
                 dafs_uint32          contents;
              case FALSE:
                 void;
              } num_links;

           join switch (bitset(included, DAFS_FATTR_CHANGE))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } change;

           join switch (bitset(included, DAFS_FATTR_OBJECT_SIZE))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } object_size;

           join switch (bitset(included, DAFS_FATTR_FILE_ID))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } file_id;

           join switch (bitset(included, DAFS_FATTR_SPACE_USED))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } space_used;

           join switch (bitset(included, DAFS_FATTR_TIME_ACCESS))
              {
              case TRUE:
                 dafs_time_type       time;
              case FALSE:
                 void;
              } time_access;

           join switch (bitset(included, DAFS_FATTR_TIME_ACCESS_SET))
              {
              case TRUE:
                 dafs_settime_type    settime;
              case FALSE:
                 void;
              } time_access_set;

           join switch (bitset(included, DAFS_FATTR_TIME_BACKUP))
              {
              case TRUE:
                 dafs_time_type       time;
              case FALSE:
                 void;
              } time_backup;

           join switch (bitset(included, DAFS_FATTR_TIME_CREATE))
              {
              case TRUE:
                 dafs_time_type       time;
              case FALSE:
                 void;
              } time_create;

           join switch (bitset(included, DAFS_FATTR_TIME_DELTA))
              {
              case TRUE:
                 dafs_time_type       time;
              case FALSE:
                 void;
              } time_delta;

           join switch (bitset(included, DAFS_FATTR_TIME_METADATA))
              {
              case TRUE:
                 dafs_time_type       time;
              case FALSE:
                 void;
              } time_metadata;

           join switch (bitset(included, DAFS_FATTR_TIME_MODIFY))
              {
              case TRUE:
                 dafs_time_type       time;
              case FALSE:
                 void;
              } time_modify;

           join switch (bitset(included, DAFS_FATTR_TIME_MODIFY_SET))
              {
              case TRUE:
                 dafs_settime_type    settime;
              case FALSE:
                 void;
              } time_modify_set;

           join switch (bitset(included, DAFS_FATTR_RAW_DEV))
              {
              case TRUE:
                 dafs_specdata_type   contents;
              case FALSE:
                 void;
              } specdata;

           join switch (bitset(included, DAFS_FATTR_FILEHANDLE))
              {
              case TRUE:
                 dafs_filehandle_type contents;
              case FALSE:
                 void;
              } filehandle;

           join switch (bitset(included, DAFS_FATTR_ACL))
              {
              case TRUE:
                 dafs_ace_type        acl<>;
              case FALSE:
                 void;
              } acl;

           join switch (bitset(included, DAFS_FATTR_MIME_TYPE))
              {
              case TRUE:
                 dafs_utf8string      mimetype;             /* heap */
              case FALSE:
                 void;
              } mime_type;

           join switch (bitset(included, DAFS_FATTR_OWNER))
              {
              case TRUE:
                 dafs_utf8string      owner;                /* heap */
              case FALSE:
                 void;
              } owner;

           join switch (bitset(included, DAFS_FATTR_OWNER_GROUP))
              {
              case TRUE:
                 dafs_utf8string      owner_group;          /* heap */
              case FALSE:
                 void;
              } owner_group;

        } dafs_file_attr_type;
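
   The following C sketch illustrates the bitset() test that gates each
   join switch arm above. It is not part of the protocol definition. It
   assumes that attr_bitmap_type fits in a single 64-bit word and that
   an attribute's number is also its bit position; this section states
   neither. The helper names dafs_bitset and dafs_attr_bit are
   hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the attribute bitmap fits in one 64-bit word. */
typedef uint64_t attr_bitmap_type;

/* Attribute numbers as listed in this section. */
#define DAFS_FATTR_OBJECT_SIZE   9
#define DAFS_FATTR_FILE_ID       10
#define DAFS_FATTR_OWNER         24

/* bitset(x, y): true if attribute bit y is set in bitmap x.
 * Assumption: attribute number y maps to bit position y. */
static bool dafs_bitset(attr_bitmap_type x, unsigned y)
{
    return y < 64 && ((x >> y) & 1u) != 0;
}

/* Hypothetical helper: a bitmap with only attribute bit y set. */
static attr_bitmap_type dafs_attr_bit(unsigned y)
{
    return y < 64 ? (attr_bitmap_type)1 << y : 0;
}
```

   Under these assumptions, a decoder would read the TRUE arm of a join
   switch only when dafs_bitset(included, ...) holds for that attribute,
   and would consume no bytes for the FALSE arm.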


6.1.6.  File System Attributes

   The DAFS file system attributes definitions follow. The same encoding
   described for DAFS file attributes applies to file system attributes.

   DAFS_FSATTR_LINK_SUPPORT   (1)

      Denotes server support for hard links on this file system. Support
      for this file system attribute is mandatory.

   DAFS_FSATTR_SYMLINK_SUPPORT   (2)

      Denotes server support for symbolic links on this file system.
      Support for this file system attribute is mandatory.

   DAFS_FSATTR_CAN_SET_TIME   (3)

      Denotes server support for setting the access/modify times of a
      file object.

   DAFS_FSATTR_CASE_INSENSITIVE   (4)

      If TRUE, file names on a server are case insensitive. FALSE other-
      wise.

   DAFS_FSATTR_CASE_PRESERVING   (5)

      If TRUE, server preserves file name case. FALSE otherwise.

   DAFS_FSATTR_CHOWN_RESTRICTED   (6)

      Denotes server restrictions on setting the owner and owner group
      file attributes by non-privileged users.

   DAFS_FSATTR_HOMOGENEOUS   (7)

      TRUE if all objects in a file system have the same file system
      attribute values. FALSE otherwise.

   DAFS_FSATTR_NO_TRUNC   (8)

      Indicates whether the server truncates file names that exceed its
      maximum supported length or rejects them with an error.

   DAFS_FSATTR_UNIQUE_HANDLE   (9)

      Whether server guarantees that a file object is always represented
      by the same unique handle.

   DAFS_FSATTR_LEASE_TIME   (10)

      The locking lease time. This value is in seconds. Since any lease
      renewal message renews all of the client's leases on the receiving
      DAFS server, this value SHOULD be the same for all file systems
      provided by a single DAFS server. Support for this file system
      attribute is mandatory.

   DAFS_FSATTR_RD_ATTR_ERROR   (11)

      Error the server returns when it fails to obtain attributes during
      a DAFS_PROC_READDIR_INLINE or DAFS_PROC_READDIR_DIRECT operation.
      Support for this file system attribute is mandatory.

   DAFS_FSATTR_ACL_SUPPORT   (12)

      ACL types supported by the server.

   DAFS_FSATTR_MAX_LINK   (13)

      Maximum number of hard links to a file object allowed.

   DAFS_FSATTR_MAX_NAME   (14)

      Maximum number of characters allowed in a file object's name.

   DAFS_FSATTR_SUPPORTED_FATTR   (15)

      Bitmap representing the supported file attributes in this server.
      Support for this file system attribute is mandatory.

   DAFS_FSATTR_SUPPORTED_FSATTR   (16)

      Bitmap representing the supported file system attributes in this
      server. Support for this file system attribute is mandatory.

   DAFS_FSATTR_FILES_AVAILABLE   (17)

      Number of files available on this file system for use by the user
      issuing this request.

   DAFS_FSATTR_FILES_FREE   (18)

      Number of free files in this file system.

   DAFS_FSATTR_FILES_TOTAL   (19)

      Total number of files in this file system.

   DAFS_FSATTR_MAX_FILE_SIZE   (20)

      File system's maximum file size, in bytes.

   DAFS_FSATTR_MAX_READ   (21)

      Maximum read size allowed for objects in this file system.

   DAFS_FSATTR_MAX_WRITE   (22)

      Maximum write size allowed for objects in this file system.

   DAFS_FSATTR_QUOTA_HARD   (23)

      Available space in bytes that MAY be allocated to this file object
      before allocations are refused. This space is not specifically
      reserved for this object and MAY be allocated to other objects in
      this file system that by some rule belong to a common set.

   DAFS_FSATTR_QUOTA_SOFT   (24)

      Available space in bytes that MAY be allocated to this file object
      before a warning is issued. This space is not specifically
      reserved for this object and MAY be allocated to other objects in
      this file system that by some rule belong to a common set.

   DAFS_FSATTR_QUOTA_USED   (25)

      Disk space, in bytes, used by this file object and possibly others
      in a set that share the space reported in DAFS_FSATTR_QUOTA_HARD.

   DAFS_FSATTR_SPACE_AVAILABLE   (26)

      Amount of space, in bytes, available in this file system.

   DAFS_FSATTR_SPACE_FREE   (27)

      Number of bytes in this file system that are free.

   DAFS_FSATTR_SPACE_TOTAL   (28)

      File system's total size, in bytes.

   DAFS_FSATTR_FSHANDLE   (29)

      The FSHandle associated with this file system. Support for this
      file system attribute is mandatory.

   DAFS_FSATTR_FAILOVER_LOCATIONS   (30)

      List of alternate server locations that might serve this file sys-
      tem in the event of a server failure.

   DAFS_FSATTR_MAX_APPEND   (32)

      The maximum number of bytes that can be atomically appended to the
      end of a file. Any DAFS_PROC_APPEND_INLINE or
      DAFS_PROC_APPEND_DIRECT operation that specifies more bytes than
      DAFS_FSATTR_MAX_APPEND is returned in error without any data being
      added to the file. DAFS_FSATTR_MAX_APPEND MUST be set to 65536 or
      greater. Support for this file system attribute is mandatory.

   DAFS_FSATTR_PREF_IO_SIZE   (33)

      Server's preferred I/O size for this file system, in bytes.

   DAFS_FSATTR_FH_EXPIRE_TYPE   (34)

      Volatility of filehandles in this file system.
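
   The DAFS_FSATTR_MAX_APPEND rules above can be sketched as two
   client-side checks in C. This is an illustration under stated
   assumptions, not part of the protocol: the names
   DAFS_MIN_MAX_APPEND, dafs_max_append_valid, and dafs_append_fits
   are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the smallest value a server may advertise
 * in DAFS_FSATTR_MAX_APPEND (65536, per the attribute definition). */
#define DAFS_MIN_MAX_APPEND 65536u

/* True if an advertised DAFS_FSATTR_MAX_APPEND value satisfies the
 * "MUST be set to 65536 or greater" requirement. */
static bool dafs_max_append_valid(uint64_t max_append)
{
    return max_append >= DAFS_MIN_MAX_APPEND;
}

/* True if an append of byte_count bytes does not exceed the limit.
 * A request that exceeds it is returned in error with no data added. */
static bool dafs_append_fits(uint64_t byte_count, uint64_t max_append)
{
    return byte_count <= max_append;
}
```

   A client might call dafs_append_fits() before issuing
   DAFS_PROC_APPEND_INLINE or DAFS_PROC_APPEND_DIRECT and split or
   reject oversized appends locally rather than wait for the error.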

   The bitmap values for the DAFS_FSATTR_ACL_SUPPORT attribute are:

        #define DAFS_ACL_SUPPORT_ALLOW        0x00000001
        #define DAFS_ACL_SUPPORT_DENY         0x00000002
        #define DAFS_ACL_SUPPORT_AUDIT        0x00000004
        #define DAFS_ACL_SUPPORT_ALARM        0x00000008

   The bitmap values for the DAFS_FSATTR_FH_EXPIRE_TYPE attribute are:

        #define DAFS_FH_PERSISTENT            0x00000000
        #define DAFS_FH_NOEXPIRE_WITH_OPEN    0x00000001
        #define DAFS_FH_VOLATILE_ANY          0x00000002
        #define DAFS_FH_VOL_RENAME            0x00000008
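
   As a hedged illustration of how a client might interpret the two
   bitmaps above, the C sketch below tests individual bits. Because
   DAFS_FH_PERSISTENT is zero, it cannot be detected with a mask and is
   compared for equality instead. The helper names dafs_acl_supports
   and dafs_fh_is_persistent are hypothetical, and treating "no bits
   set" as the only persistent case is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DAFS_ACL_SUPPORT_ALLOW        0x00000001
#define DAFS_ACL_SUPPORT_DENY         0x00000002
#define DAFS_ACL_SUPPORT_AUDIT        0x00000004
#define DAFS_ACL_SUPPORT_ALARM        0x00000008

#define DAFS_FH_PERSISTENT            0x00000000
#define DAFS_FH_NOEXPIRE_WITH_OPEN    0x00000001
#define DAFS_FH_VOLATILE_ANY          0x00000002
#define DAFS_FH_VOL_RENAME            0x00000008

/* True if every bit in ace_kind is set in the acl_support bitmap. */
static bool dafs_acl_supports(uint32_t acl_support, uint32_t ace_kind)
{
    return (acl_support & ace_kind) == ace_kind;
}

/* True if the file system reports no filehandle volatility bits.
 * Assumption: persistence is signalled only by the value zero. */
static bool dafs_fh_is_persistent(uint32_t fh_expire_type)
{
    return fh_expire_type == DAFS_FH_PERSISTENT;
}
```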


   The dafs_filesys_attr_type structure definition follows.

        typedef struct filesys_attr
           {
           attr_bitmap_type           included;
           attr_bitmap_type           valid;

           join switch (bitset(included, DAFS_FSATTR_LINK_SUPPORT))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } link_support;

           join switch (bitset(included, DAFS_FSATTR_SYMLINK_SUPPORT))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } symlink_support;

           join switch (bitset(included, DAFS_FSATTR_CAN_SET_TIME))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } can_set_time;

           join switch (bitset(included, DAFS_FSATTR_CASE_INSENSITIVE))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } case_insensitive;

           join switch (bitset (included,
                                  DAFS_FSATTR_CASE_PRESERVING))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } case_preserving;

           join switch (bitset (included,
                                DAFS_FSATTR_CHOWN_RESTRICTED))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } chown_restricted;

           join switch (bitset(included, DAFS_FSATTR_HOMOGENEOUS))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } homogeneous;

           join switch (bitset(included, DAFS_FSATTR_NO_TRUNC))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } no_trunc;

           join switch (bitset(included,
                               DAFS_FSATTR_UNIQUE_HANDLE))
              {
              case TRUE:
                 dafs_boolean         contents;
              case FALSE:
                 void;
              } unique_handle;

           join switch (bitset(included, DAFS_FSATTR_LEASE_TIME))
              {
              case TRUE:
                 dafs_uint32          contents;
              case FALSE:
                 void;
              } lease_time;

           join switch (bitset(included, DAFS_FSATTR_RD_ATTR_ERROR))
              {
              case TRUE:
                 dafs_uint32          contents;
              case FALSE:
                 void;
              } rd_attr_error;

           join switch (bitset(included, DAFS_FSATTR_ACL_SUPPORT))
              {
              case TRUE:
                 dafs_uint32          contents;
              case FALSE:
                 void;
              } acl_support;

           join switch (bitset(included, DAFS_FSATTR_MAX_LINK))
              {
              case TRUE:
                 dafs_uint32          contents;
              case FALSE:
                 void;
              } max_link;

           join switch (bitset(included, DAFS_FSATTR_MAX_NAME))
              {
              case TRUE:
                 dafs_uint32          contents;
              case FALSE:
                 void;
              } max_name;

           join switch (bitset(included,
                               DAFS_FSATTR_SUPPORTED_FATTR))
              {
              case TRUE:
                 dafs_attr_bitmap_type  contents;
              case FALSE:
                 void;
              } supported_fattr;

           join switch (bitset(included,
                               DAFS_FSATTR_SUPPORTED_FSATTR))
              {
              case TRUE:
                 dafs_attr_bitmap_type  contents;
              case FALSE:
                 void;
              } supported_fsattr;

           join switch (bitset(included,
                               DAFS_FSATTR_FILES_AVAILABLE))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } files_available;

           join switch (bitset(included, DAFS_FSATTR_FILES_FREE))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } files_free;

           join switch (bitset(included, DAFS_FSATTR_FILES_TOTAL))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } files_total;

           join switch (bitset(included, DAFS_FSATTR_MAX_FILE_SIZE))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } max_file_size;

           join switch (bitset(included, DAFS_FSATTR_MAX_READ))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } max_read;

           join switch (bitset(included, DAFS_FSATTR_MAX_WRITE))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } max_write;

           join switch (bitset(included, DAFS_FSATTR_QUOTA_HARD))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } quota_hard;

           join switch (bitset(included, DAFS_FSATTR_QUOTA_SOFT))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } quota_soft;

           join switch (bitset(included, DAFS_FSATTR_QUOTA_USED))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } quota_used;

           join switch (bitset(included,
                               DAFS_FSATTR_SPACE_AVAILABLE))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } space_available;

           join switch (bitset(included, DAFS_FSATTR_SPACE_FREE))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } space_free;

           join switch (bitset(included, DAFS_FSATTR_SPACE_TOTAL))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } space_total;

           join switch (bitset(included, DAFS_FSATTR_FSHANDLE))
              {
              case TRUE:
                 dafs_FSHandle_type   contents;
              case FALSE:
                 void;
              } fshandle;

           join switch (bitset(included,
                               DAFS_FSATTR_FAILOVER_LOCATIONS))
              {
              case TRUE:
                 dafs_fs_locations_type  failoverlocations<>;
              case FALSE:
                 void;
              } failover_locations;

           join switch (bitset(included, DAFS_FSATTR_MAX_APPEND))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } max_append;

           join switch (bitset(included, DAFS_FSATTR_PREF_IO_SIZE))
              {
              case TRUE:
                 dafs_uint64          contents;
              case FALSE:
                 void;
              } pref_io_size;

           join switch (bitset(included, DAFS_FSATTR_FH_EXPIRE_TYPE))
              {
              case TRUE:
                 dafs_uint32          contents;
              case FALSE:
                 void;
              } fh_expire_type;

        } dafs_filesys_attr_type;


6.1.7.  Direct Operations

   The DAFS direct read and write operations use the following common
   structure.

        typedef struct dafs_direct_op_buffer
           {
           dafs_uint64                buffer_address;
           dafs_uint32                buffer_byte_count;
           dafs_memhandle_type        buffer_handle;
           } dafs_direct_op_buffer_type;

        typedef dafs_direct_op_buffer_type<>  dafs_dob_array_type;

        typedef struct dafs_file_chunk
           {
           dafs_uint64                offset;
           dafs_uint32                byte_count;
           dafs_cache_hint_type       cache_hint;
           } dafs_file_chunk_type;

        typedef dafs_file_chunk_type<>  dafs_chunk_array_type;

        typedef struct dafs_read_write_request
           {
           dafs_filehandle_type       filehandle;
           dafs_uint64                request_id; /* per session */
           dafs_state_id_type         state_id;
           dafs_rw_flag               rw_flag;
           dafs_chunk_array_type      chunks;
           dafs_dob_array_type        request_buffers;
           dafs_checksum_type         direct_checksum;
           } dafs_rw_request_type;

        typedef dafs_rw_request_type<>   dafs_rw_request_array_type;

        typedef struct dafs_completion_notification
           {
           dafs_uint64                request_id;
           dafs_status_type           status;
           dafs_uint32                bytes_transferred;
           dafs_checksum_type         direct_checksum;
           dafs_uint32                pad;
           } dafs_completion_notification_type;

        typedef dafs_completion_notification_type<>
                                          dafs_completion_array_type;
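
   The C sketch below mirrors two of the structures above and shows one
   plausible client-side sanity check: that the request buffers are
   large enough to hold the bytes named by the file chunks. This check
   is an assumption of the example and is not a requirement stated in
   this section. The representation of dafs_memhandle_type as a 64-bit
   value and the helper function names are likewise hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: memory handles are opaque 64-bit values. */
typedef uint64_t dafs_memhandle_type;
typedef uint32_t dafs_cache_hint_type;

typedef struct dafs_direct_op_buffer {
    uint64_t             buffer_address;
    uint32_t             buffer_byte_count;
    dafs_memhandle_type  buffer_handle;
} dafs_direct_op_buffer_type;

typedef struct dafs_file_chunk {
    uint64_t             offset;
    uint32_t             byte_count;
    dafs_cache_hint_type cache_hint;
} dafs_file_chunk_type;

/* Sum of byte_count over all chunks, widened to avoid overflow. */
static uint64_t dafs_chunk_bytes(const dafs_file_chunk_type *c, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += c[i].byte_count;
    return total;
}

/* Sum of buffer_byte_count over all direct operation buffers. */
static uint64_t dafs_buffer_bytes(const dafs_direct_op_buffer_type *b,
                                  size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += b[i].buffer_byte_count;
    return total;
}

/* Hypothetical check: the buffers can hold all of the chunk data. */
static bool dafs_buffers_cover_chunks(const dafs_direct_op_buffer_type *b,
                                      size_t nb,
                                      const dafs_file_chunk_type *c,
                                      size_t nc)
{
    return dafs_buffer_bytes(b, nb) >= dafs_chunk_bytes(c, nc);
}
```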


6.1.8.  Cache Hints

   The DAFS I/O and cache hint operations define the following common
   structures.

        typedef dafs_uint32           dafs_access_pattern_type;

        #define DAFS_CACHE_HINT_NORMAL         0
        #define DAFS_CACHE_HINT_RANDOM         1
        #define DAFS_CACHE_HINT_SEQUENTIAL     2
        #define DAFS_CACHE_HINT_WILLNEED       3
        #define DAFS_CACHE_HINT_DONTNEED       4

        typedef dafs_uint32           dafs_cache_hint_type;

        #define dafs_prefetch                  0x01

        #define dafs_readhint_1                0x02
        #define dafs_readhint_2                0x04
        #define dafs_readhint_3                0x06
        #define dafs_readhint_4                0x08
        #define dafs_readhint_5                0x0A
        #define dafs_readhint_6                0x0C
        #define dafs_readhint_7                0x0E

        #define dafs_writehint_1               0x10
        #define dafs_writehint_2               0x20
        #define dafs_writehint_3               0x30
        #define dafs_writehint_4               0x40
        #define dafs_writehint_5               0x50
        #define dafs_writehint_6               0x60
        #define dafs_writehint_7               0x70
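
   The values listed above suggest, although this section does not say
   so explicitly, that a dafs_cache_hint_type value packs three fields:
   the prefetch flag in bit 0, a read hint level (1 to 7) in bits 1-3,
   and a write hint level (1 to 7) in bits 4-6. The C sketch below
   extracts those fields under that assumption. The mask, shift, and
   function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t dafs_cache_hint_type;

/* A few of the values defined in this section. */
#define dafs_prefetch      0x01
#define dafs_readhint_3    0x06
#define dafs_writehint_5   0x50

/* Hypothetical masks inferred from the listed values. */
#define DAFS_READHINT_MASK    0x0Eu
#define DAFS_READHINT_SHIFT   1
#define DAFS_WRITEHINT_MASK   0x70u
#define DAFS_WRITEHINT_SHIFT  4

/* True if the prefetch bit is set. */
static bool dafs_hint_prefetch(dafs_cache_hint_type h)
{
    return (h & dafs_prefetch) != 0;
}

/* Read hint level, 0 if none is present. */
static unsigned dafs_hint_read_level(dafs_cache_hint_type h)
{
    return (h & DAFS_READHINT_MASK) >> DAFS_READHINT_SHIFT;
}

/* Write hint level, 0 if none is present. */
static unsigned dafs_hint_write_level(dafs_cache_hint_type h)
{
    return (h & DAFS_WRITEHINT_MASK) >> DAFS_WRITEHINT_SHIFT;
}
```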


6.1.9.  Authentication

   The DAFS connection and authentication operations define the follow-
   ing common structures.

        enum dafs_auth_type
           {
           DAFS_AUTH_NONE             = 0,
           DAFS_AUTH_TEXT             = 1,
           DAFS_AUTH_GSS              = 2,
           DAFS_AUTH_DEFAULT          = 3
           };

        struct dafs_auth_text
           {
           dafs_utf8string            auth_id;
           dafs_utf8string            auth_password;
           };

        enum dafs_gss_procedure
           {
           DAFS_GSS_OP_INIT           = 1,
           DAFS_GSS_OP_CONTINUE_INIT  = 2
           };

        enum dafs_gss_service                   /* GSS service used */
           {
           DAFS_GSS_SVC_AUTH          = 1,
           DAFS_GSS_SVC_INTEGRITY     = 2,
           DAFS_GSS_SVC_PRIVACY       = 3
           };

        typedef struct
           {
           enum dafs_gss_procedure    procedure;
           enum dafs_gss_service      service;
           dafs_opaque8               token<>;
                                            /* heap, gss context */
           } dafs_auth_gss;

        typedef union switch (enum dafs_auth_type auth_type)
           {
           case DAFS_AUTH_NONE:
              void;
           case DAFS_AUTH_TEXT:
              dafs_auth_text          auth_text;
           case DAFS_AUTH_GSS:
              dafs_auth_gss           auth_gss;
           case DAFS_AUTH_DEFAULT:
              void;
           } dafs_auth_req;

        typedef struct
           {
           dafs_uint32                gss_major;
           dafs_uint32                gss_minor;
           dafs_opaque8               gss_token<>;
                                            /* continue context */
           } dafs_auth_gss_res;

        typedef union switch (enum dafs_auth_type auth_type)
           {
           case DAFS_AUTH_NONE:
           case DAFS_AUTH_TEXT:
           case DAFS_AUTH_DEFAULT:
              void;
           case DAFS_AUTH_GSS:
              dafs_auth_gss_res       auth_gss_res;
           } dafs_auth_res;
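
   The two discriminated unions above differ in which arms carry data.
   The C sketch below restates that as two predicates, which a decoder
   might use to decide whether any bytes follow the auth_type
   discriminant. The function names are hypothetical, and the sketch
   is an illustration rather than part of the protocol definition.

```c
#include <stdbool.h>

enum dafs_auth_type {
    DAFS_AUTH_NONE    = 0,
    DAFS_AUTH_TEXT    = 1,
    DAFS_AUTH_GSS     = 2,
    DAFS_AUTH_DEFAULT = 3
};

/* dafs_auth_req: the TEXT and GSS arms carry a body; the NONE and
 * DEFAULT arms are void. */
static bool dafs_auth_req_has_body(enum dafs_auth_type t)
{
    return t == DAFS_AUTH_TEXT || t == DAFS_AUTH_GSS;
}

/* dafs_auth_res: only the GSS arm carries a body. */
static bool dafs_auth_res_has_body(enum dafs_auth_type t)
{
    return t == DAFS_AUTH_GSS;
}
```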


6.1.10.  Procedures

   DAFS defines the following operations with the associated procedure
   numbers.

        #define DAFS_PROC_CLIENT_AUTH             100
        #define DAFS_PROC_CLIENT_CONNECT          101
        #define DAFS_PROC_CLIENT_CONNECT_AUTH     102
        #define DAFS_PROC_CONNECT_BIND            103
        #define DAFS_PROC_DISCONNECT              104
        #define DAFS_PROC_REGISTER_CRED           105
        #define DAFS_PROC_RELEASE_CRED            106

        #define DAFS_PROC_SECINFO                 108
        #define DAFS_PROC_SERVER_AUTH             109
        #define DAFS_PROC_CHECK_RESPONSE          110
        #define DAFS_PROC_FETCH_RESPONSE          111
        #define DAFS_PROC_DISCARD_RESPONSES       112
        #define DAFS_PROC_ACCESS                  113
        #define DAFS_PROC_CACHE_HINT              114
        #define DAFS_PROC_CLOSE                   115
        #define DAFS_PROC_COMMIT                  116
        #define DAFS_PROC_CREATE                  117
        #define DAFS_PROC_DELEGPURGE              118
        #define DAFS_PROC_DELEGRETURN             119

        #define DAFS_PROC_GET_FSATTR              122
        #define DAFS_PROC_GET_ROOT_HANDLE         123
        #define DAFS_PROC_GETATTR_INLINE          124
        #define DAFS_PROC_GETATTR_DIRECT          125
        #define DAFS_PROC_LINK                    126
        #define DAFS_PROC_LOCK                    127
        #define DAFS_PROC_LOCKT                   128
        #define DAFS_PROC_LOCKU                   129
        #define DAFS_PROC_LOOKUP                  130
        #define DAFS_PROC_LOOKUPP                 131
        #define DAFS_PROC_NULL                    132
        #define DAFS_PROC_NVERIFY                 133
        #define DAFS_PROC_OPEN                    134
        #define DAFS_PROC_OPEN_DOWNGRADE          135
        #define DAFS_PROC_OPENATTR                136
        #define DAFS_PROC_READ_INLINE             137
        #define DAFS_PROC_READ_DIRECT             138
        #define DAFS_PROC_READDIR_INLINE          139
        #define DAFS_PROC_READDIR_DIRECT          140
        #define DAFS_PROC_READLINK_INLINE         141
        #define DAFS_PROC_READLINK_DIRECT         142
        #define DAFS_PROC_REMOVE                  143
        #define DAFS_PROC_RENAME                  144
        #define DAFS_PROC_SETATTR_INLINE          145
        #define DAFS_PROC_SETATTR_DIRECT          146
        #define DAFS_PROC_VERIFY                  147

        #define DAFS_PROC_BATCH_SUBMIT            148
        #define DAFS_PROC_WRITE_INLINE            149
        #define DAFS_PROC_WRITE_DIRECT            150

        #define DAFS_PROC_BC_GETATTR              151
        #define DAFS_PROC_BC_NULL                 152
        #define DAFS_PROC_BC_RECALL               153

        #define DAFS_PROC_BC_BATCH_COMPLETION     155

        #define DAFS_PROC_APPEND_INLINE           156
        #define DAFS_PROC_APPEND_DIRECT           157
        #define DAFS_PROC_GET_FENCING_LIST        158
        #define DAFS_PROC_SET_FENCING_LIST        159
        #define DAFS_PROC_HURRY_UP                160
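
   The procedure numbers above run from 100 to 160, but 107, 120, 121,
   and 154 have no name in the list. A receiver might reject those
   numbers along with anything out of range. The C sketch below shows
   one way to express that check; the names DAFS_PROC_FIRST,
   DAFS_PROC_LAST, and dafs_proc_is_defined are hypothetical, and how a
   server actually responds to an unassigned number is not covered
   here.

```c
#include <stdbool.h>

/* Hypothetical bounds taken from the list above. */
#define DAFS_PROC_FIRST  100   /* DAFS_PROC_CLIENT_AUTH */
#define DAFS_PROC_LAST   160   /* DAFS_PROC_HURRY_UP */

/* True if proc has a name in the procedure list. */
static bool dafs_proc_is_defined(unsigned proc)
{
    if (proc < DAFS_PROC_FIRST || proc > DAFS_PROC_LAST)
        return false;

    switch (proc) {
    case 107:   /* no name listed for these four numbers */
    case 120:
    case 121:
    case 154:
        return false;
    default:
        return true;
    }
}
```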


6.2.  Connection and Security Management

   This section gives the message packet definitions for the connection
   and security operations. A description of the features and rationale
   associated with these operations can be found in 3.1.3., "Session
   Operations".

6.2.1.  DAFS_PROC_CLIENT_CONNECT

   SUMMARY

   Begins a DAFS Session from a client to a server, including negotia-
   tion of basic protocol information and parameters for use of the DAFS
   Operation Channel associated with the Session.

   ARGUMENTS

        struct DAFS_Client_Connect_Args
           {
           dafs_uint32                use_checksums;
           dafs_uint32                use_response_cache;
           dafs_uint32                max_credentials;
           dafs_uint32                max_request_size;
           dafs_uint32                max_response_size;
           dafs_uint32                max_requests;
           dafs_uint32                inline_write_header_size;
           dafs_uint32                use_back_control_channel;
           dafs_uint32                use_rdma_read_channel;
           dafs_utf8string            fence_id_string;
           dafs_var_offset_type       client_id_string;
           dafs_verifier_type         client_verifier;
           };


   RESULTS

        struct DAFS_Client_Connect_Res
           {
           dafs_session_id_type       session_id;
           dafs_client_id_type        client_id;
           dafs_uint32                use_checksums;
           dafs_uint32                use_response_cache;
           dafs_uint32                max_credentials;
           dafs_uint32                max_request_size;
           dafs_uint32                max_response_size;
           dafs_uint32                max_requests;
           dafs_uint32                inline_write_header_size;
           dafs_uint32                use_back_control_channel;
           dafs_uint32                use_rdma_read_channel;
           };


   DESCRIPTION

   Establish a DAFS Session from client to server, including establish-
   ment of initial protocol settings. The DAFS_PROC_CLIENT_CONNECT
   operation, the DAFS_PROC_CLIENT_CONNECT_AUTH operation, or the
   DAFS_PROC_CONNECT_BIND operation MUST be the first operation
   sent on a newly created communication channel. After successfully
   completing this operation, the DAFS_PROC_SECINFO operation can be
   issued to discover what authentication mechanism the server supports.
   Then, the DAFS_PROC_CLIENT_AUTH operation can be used to authenticate
   the client to the server. (The DAFS_PROC_CLIENT_CONNECT_AUTH opera-
   tion is available to combine the connection and authentication opera-
   tions in a single step if it is not necessary to discover the
   server's authentication support mechanisms.)

   The connection request specifies a client-id-string and client-
   verifier so that the server can identify the Session as being associ-
   ated with a particular client instantiation.

   The returned session_id field can be used to identify the Session and
   the Response Cache associated with the Session in the event of con-
   nection failure.

   The returned Client-id is associated with the shared set of creden-
   tials registered on any Session created using this Client-id. Also,
   the Client-id can be used when interpreting lock information returned
   by DAFS_PROC_LOCKT operations.

   A client specifies desired options as part of the Session and Opera-
   tion Channel, and the server responds with the values to be used for
   the Session and the Operation Channel.

   o  Whether data transferred in DAFS is checksummed. If use_checksums
      is TRUE, then checksums will be generated and checked. If the
      client sets use_checksums to TRUE, then the server MUST return
      use_checksums set to TRUE.

   o  Whether the server is maintaining a Response Cache for the Ses-
      sion. If use_response_cache is TRUE, then the server maintains the
      Response Cache.

   o  Maximum number of credentials that can be associated with this
      client (specify 0 to use server default). This is set during the
      first Session that a client creates. The value is ignored on
      subsequent connection requests issued by the same client.
      Following the establishment of the first Session for a client,
      subsequent connection response messages will specify the same
      maximum credential value as the initial response.

   o  Maximum operation request size on this channel, in bytes (specify
      0 to use server default).

   o  Maximum operation response size on this channel, in bytes (specify
      0 to use server default).

   o  Maximum number of operation requests outstanding on this channel
      (specify 0 to use server default).

   o  inline_write_header_size gives the padding amount between the end
      of a DAFS message header and the data being transferred by
      DAFS_PROC_WRITE_INLINE operations, in bytes (specify 0 to use
      server default).

   o  Specify whether a separate channel can be bound to the current
      Session for use in processing back-control messages from server to
      client. If use_back_control_channel is TRUE, then the client can
      establish an additional transport connection for that purpose (see
      DAFS_PROC_CONNECT_BIND operation).

   o  Specify whether a separate channel can be bound to the current
      Session for use in processing RDMA read operations from server to
      client. If use_rdma_read_channel is TRUE, then the client can
      establish an additional transport connection for that purpose (see
      DAFS_PROC_CONNECT_BIND operation).
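
   As an illustration of the "specify 0 to use server default"
   convention in the list above, the following is a minimal sketch of
   how a server might choose the value it returns for one numeric
   option. The struct dafs_limit type, the dafs_negotiate function, and
   the clamping policy are assumptions made for this example; the
   protocol only requires that the server return the value to be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical server-side limits for one negotiable option. */
struct dafs_limit {
    uint32_t server_default;  /* used when the client sends 0 */
    uint32_t server_max;      /* largest value the server will grant */
};

/*
 * Choose the value to place in the connect response.
 * Assumed policy: 0 selects the server default; any other request is
 * granted as-is unless it exceeds the server maximum.
 */
static uint32_t dafs_negotiate(uint32_t requested,
                               const struct dafs_limit *lim)
{
    assert(lim != NULL && lim->server_default <= lim->server_max);
    if (requested == 0)
        return lim->server_default;
    return requested > lim->server_max ? lim->server_max : requested;
}
```

   With limits of { 128, 512 }, a request of 0 yields 128 and a request
   of 65536 yields 512 under this assumed policy.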

   The DAFS server MAY accept the options as requested, or MAY respond
   with different values. The server-determined option values are
   returned as part of the response message. It is the client's
   responsibility to check and verify whether the server options are
   acceptable to it. As an example, the client can request a maximum
   outstanding-request limit of 65536, but the server can respond with a
   limit of 512 due to resource restrictions.
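
   The client-side check described above could be sketched as follows.
   The struct connect_opts type carries only a subset of the negotiated
   fields, and the notion of client "minimums" is a local policy assumed
   for this example, not something the protocol defines. The checksum
   rule is the one stated for use_checksums.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the negotiable connect fields. */
struct connect_opts {
    uint32_t use_checksums;
    uint32_t max_request_size;
    uint32_t max_response_size;
    uint32_t max_requests;
};

/* Returns true if the server's response values are usable by the client. */
static bool connect_opts_acceptable(const struct connect_opts *req,
                                    const struct connect_opts *res,
                                    const struct connect_opts *min)
{
    assert(req != NULL && res != NULL && min != NULL);
    /* A client that set use_checksums to TRUE must get TRUE back. */
    if (req->use_checksums && !res->use_checksums)
        return false;
    /* Assumed local policy: reject limits below what the client needs. */
    return res->max_request_size  >= min->max_request_size &&
           res->max_response_size >= min->max_response_size &&
           res->max_requests      >= min->max_requests;
}
```

   A client that finds the response unacceptable would typically
   disconnect; what it does next is outside the scope of this sketch.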

   The server returns DAFSERR_ILLEGAL_PROT if the protocol version
   requested by the client is not supported. The client SHOULD retry
   with a lower protocol version number. Protocol_version contains a
   suggested protocol version to use that would be supported.
   DAFSERR_ILLEGAL_STATE indicates that the protocol has already been
   negotiated.

   ERRORS

   DAFSERR_STATUS_OK

   DAFSERR_CHAIN_FORM

   DAFSERR_ILLEGAL_PROT

   DAFSERR_ILLEGAL_STATE

   DAFSERR_INVAL


6.2.2.  DAFS_PROC_CLIENT_AUTH

   SUMMARY

   Authenticates the client to the server.

   ARGUMENTS

        struct DAFS_Client_Auth_Args
           {
           dafs_auth_req              auth_req;
           };


   RESULTS

        struct DAFS_Client_Auth_Res
           {
           dafs_auth_res              auth_res;
           dafs_boolean               trusted;
           };


   DESCRIPTION

   Establishes the primary DAFS authentication from client to server for
   a DAFS Session.

   Authentication is performed after the DAFS Session is created, but
   prior to other DAFS operations, with the exception of the SECINFO
   operation, which can be used to determine which security authentica-
   tion mechanisms can be used with the server.

   A response message containing a DAFSERR_NOT_AUTHENTICATED error will
   be returned in response to any other request messages received when
   the DAFS Session has not yet been authenticated.
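
   The ordering rules above (a connect operation first, then only
   SECINFO or CLIENT_AUTH until the Session is authenticated) can be
   sketched as a small server-side gate. All enum names here are
   hypothetical and do not reflect protocol procedure numbers. The
   verdict for a wrong first operation is not specified in this section,
   so V_REJECT_FIRST_OP is a placeholder assumed for the example.

```c
#include <assert.h>
#include <stdbool.h>

enum proc {
    PROC_CLIENT_CONNECT, PROC_CLIENT_CONNECT_AUTH, PROC_CONNECT_BIND,
    PROC_SECINFO, PROC_CLIENT_AUTH, PROC_OTHER
};

enum chan_state { CH_NEW, CH_CONNECTED, CH_AUTHENTICATED };

enum verdict {
    V_OK,
    V_ILLEGAL_STATE,       /* stands in for DAFSERR_ILLEGAL_STATE */
    V_NOT_AUTHENTICATED,   /* stands in for DAFSERR_NOT_AUTHENTICATED */
    V_REJECT_FIRST_OP      /* placeholder; not defined by this section */
};

/* Decide whether procedure p is acceptable in channel state s. */
static enum verdict gate(enum chan_state s, enum proc p)
{
    bool is_connect = p == PROC_CLIENT_CONNECT ||
                      p == PROC_CLIENT_CONNECT_AUTH ||
                      p == PROC_CONNECT_BIND;

    assert(p <= PROC_OTHER);
    switch (s) {
    case CH_NEW:            /* first operation must be a connect */
        return is_connect ? V_OK : V_REJECT_FIRST_OP;
    case CH_CONNECTED:      /* connected but not yet authenticated */
        if (is_connect)
            return V_ILLEGAL_STATE;
        return (p == PROC_SECINFO || p == PROC_CLIENT_AUTH)
                   ? V_OK : V_NOT_AUTHENTICATED;
    case CH_AUTHENTICATED:
        return is_connect ? V_ILLEGAL_STATE : V_OK;
    }
    return V_REJECT_FIRST_OP;
}
```

   A channel opened with DAFS_PROC_CLIENT_CONNECT_AUTH would move
   directly from CH_NEW to CH_AUTHENTICATED on success.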

   DAFS servers MUST support at least one of the following authentica-
   tion methods:

   DAFS_AUTH_NONE

      Authentication is NOT REQUIRED. The client is trusted to provide
      credentials as needed.

   DAFS_AUTH_TEXT

      Session is authenticated using an auth_id and clear text password.

   DAFS_AUTH_GSS

      Session is authenticated using the GSS framework (see RFC 2743).

   DAFS_AUTH_DEFAULT

      MAY be used for untrusted clients and indicates that the default
      credentials are to be used for the DAFS Session.

   A DAFS server providing maximum security needs to support
   DAFS_AUTH_GSS. If the DAFS server supports DAFS_AUTH_GSS, it MUST
   identify itself in GSS-API via a GSS_C_NT_HOSTBASED_SERVICE name
   type. GSS_C_NT_HOSTBASED_SERVICE names are of the form
   service@hostname.

   For DAFS, the "service" element is "dafs".

   Implementations of security mechanisms will convert dafs@hostname to
   various different forms. For Kerberos V4, the following form is
   RECOMMENDED:

      dafs/hostname.


   This would be a server principal in the Kerberos Key Distribution
   Center database.
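
   As a small illustration of the two name forms mentioned above, the
   sketch below formats "dafs@hostname" and "dafs/hostname" strings. The
   function name dafs_service_name and the host name used in the usage
   note are invented for this example; importing the resulting string
   into a GSS-API name is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "dafs<sep><host>" into out. sep is '@' for the host-based
 * service form or '/' for the principal form shown above.
 * Returns 0 on success, -1 on bad input or if out is too small.
 */
static int dafs_service_name(char sep, const char *host,
                             char *out, size_t outlen)
{
    int n;

    assert(sep == '@' || sep == '/');
    if (host == NULL || out == NULL || strlen(host) == 0)
        return -1;
    n = snprintf(out, outlen, "dafs%c%s", sep, host);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

   For a host named filer1.example.com, the two calls produce
   "dafs@filer1.example.com" and "dafs/filer1.example.com".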

   If the client desires mutual authentication with the DAFS server, it
   SHOULD set the mutual_req_flag in its call to GSS_Init_sec_context
   (see RFC 2743). All GSS mechanisms used by DAFS MUST support mutual
   authentication between client and server.

   The response from the server includes the trusted field, which speci-
   fies whether the client is trusted to issue subsequent
   DAFS_PROC_REGISTER_CRED operations.

   If the server fails to authenticate the client due to incorrect
   authentication information, it returns DAFSERR_NOT_AUTHORIZED.

   ERRORS

   DAFSERR_STATUS_OK

   DAFSERR_CHAIN_FORM

   DAFSERR_INVAL

   DAFSERR_NOT_AUTHORIZED

   DAFSERR_NOTSUPP


6.2.3.  DAFS_PROC_SERVER_AUTH

   SUMMARY

   Authenticates a server to a client.

   ARGUMENTS

        struct DAFS_Server_Auth_Args
           {
           dafs_auth_type             auth_type;
           };


   RESULTS

        struct DAFS_Server_Auth_Res
           {
           union switch (auth_type)
              {
              case DAFS_AUTH_NONE:
                 void;
              case DAFS_AUTH_TEXT:
                 dafs_auth_text       auth_text;
              case DAFS_AUTH_GSS:
                 void;
              case DAFS_AUTH_DEFAULT:
                 void;
              } auth_data;
           };


   DESCRIPTION

   Authenticate the DAFS server to the DAFS client. This operation is
   OPTIONAL and can be used only after the DAFS client has been authen-
   ticated to the server.

   The same authentication methods are used for server authentication as
   for client authentication. For information on authentication methods,
   see 6.2.2., "DAFS_PROC_CLIENT_AUTH".

   In the case where the client has authenticated using DAFS_AUTH_GSS,
   further authentication of the server to the client is NOT REQUIRED,
   as the client and server MAY mutually authenticate during the initial
   client authentication to the server. This is accomplished by the
   client by setting the mutual_req_flag on its call to
   GSS_Init_sec_context, and by checking the result in the mutual_state
   returned. All GSS implementations used by DAFS MUST support mutual
   authentication.

   If the server fails to authenticate the client, it returns
   DAFSERR_NOT_AUTHORIZED.

   ERRORS

   DAFSERR_STATUS_OK

   DAFSERR_CHAIN_FORM

   DAFSERR_INVAL

   DAFSERR_NOT_AUTHORIZED

   DAFSERR_NOTSUPP


6.2.4.  DAFS_PROC_CLIENT_CONNECT_AUTH

   SUMMARY

   Creates a DAFS Session and authenticates the client in a single
   step.

   ARGUMENTS

        struct DAFS_Client_Connect_Auth_Args
           {
           dafs_uint32                use_checksums;
           dafs_uint32                use_response_cache;
           dafs_uint32                max_credentials;
           dafs_uint32                max_request_size;
           dafs_uint32                max_response_size;
           dafs_uint32                max_requests;
           dafs_uint32                inline_write_header_size;
           dafs_uint32                use_back_control_channel;
           dafs_uint32                use_rdma_read_channel;
           dafs_utf8string            fence_id_string;
           dafs_var_offset_type       client_id_string;
           dafs_verifier_type         client_verifier;
           dafs_auth_req              auth_req;
           };


   RESULTS

        struct DAFS_Client_Connect_Auth_Res
           {
           dafs_session_id_type       session_id;
           dafs_client_id_type        client_id;
           dafs_uint32                use_checksums;
           dafs_uint32                use_response_cache;
           dafs_uint32                max_credentials;
           dafs_uint32                max_request_size;
           dafs_uint32                max_response_size;
           dafs_uint32                max_requests;
           dafs_uint32                inline_write_header_size;
           dafs_uint32                use_back_control_channel;
           dafs_uint32                use_rdma_read_channel;
           dafs_auth_res              auth_res;
           dafs_boolean               trusted;
           };


   DESCRIPTION

   This operation combines DAFS_PROC_CLIENT_CONNECT and
   DAFS_PROC_CLIENT_AUTH into a single operation, creating a DAFS Ses-
   sion and authenticating it in a single step. It is appropriate to use
   when the authentication method is already known. See 6.2.1.,
   "DAFS_PROC_CLIENT_CONNECT" and 6.2.2., "DAFS_PROC_CLIENT_AUTH" for
   more information on individual fields.

   A return status of DAFSERR_ILLEGAL_PROT indicates that the protocol
   version requested by the client is not supported. The client SHOULD
   retry with a lower protocol version number. Alt_protocol_version con-
   tains a suggested protocol version to use that would be supported. If
   the protocol has already been negotiated, the server returns
   DAFSERR_ILLEGAL_STATE.

   ERRORS

   DAFSERR_STATUS_OK

   DAFSERR_CHAIN_FORM

   DAFSERR_ILLEGAL_PROT

   DAFSERR_ILLEGAL_STATE

   DAFSERR_INVAL

   DAFSERR_NOT_AUTHENTICATED

   DAFSERR_NOTSUPP


6.2.5.  DAFS_PROC_CONNECT_BIND

   SUMMARY

   Binds a new communication channel to an existing DAFS Session. Used
   to bind the Back-control Channel or RDMA-read channel to an existing
   Session and negotiate the parameters associated with that channel.

   ARGUMENTS

        const DAFS_BACK_CONTROL_CHANNEL  = 1;
        const DAFS_RDMA_READ_CHANNEL     = 2;


        struct DAFS_Connect_Bind_Args
           {
           dafs_session_id_type       session_id;
           dafs_uint16                channel_use;
           dafs_uint32                max_request_size;
           dafs_uint32                max_response_size;
           dafs_uint32                max_requests;
           dafs_auth_req              auth_req;
           };


   RESULTS

        struct DAFS_Connect_Bind_Res
           {
           dafs_uint32                max_request_size;
           dafs_uint32                max_response_size;
           dafs_uint32                max_requests;
           dafs_auth_res              auth_res;
           dafs_boolean               trusted;
           };


   DESCRIPTION

   Binds a new communication channel to an existing DAFS Session. A new
   binding can be created by the client if multiple channels are permit-
   ted for a single DAFS Session and an additional channel is being
   added to the DAFS Session.

   A client specifies desired options as part of the DAFS channel, and
   the server responds with the values to be used for the channel
   requested.

   o  Maximum operation request size on this channel, in bytes (specify
      0 to use server default)

   o  Maximum operation response size on this channel, in bytes (specify
      0 to use server default)

   o  Maximum number of operation requests outstanding on this channel
      (specify 0 to use server default)

   The DAFS server MAY accept the options as requested, or MAY respond
   with different values. The server-determined option values are
   returned as part of the response message. It is the client's respon-
   sibility to check and verify whether the server options are accept-
   able to it. As an example, the client (or server, depending on the
   message flow on the particular channel) can request a maximum
   outstanding-request limit of 65536, but the response MAY be a limit
   of 512 due to resource restrictions.

   Only DAFS Sessions initially created with one of the use_xxx_channel
   attributes set are permitted to have subsequent DAFS communication
   channels bound to them. When a client binds an additional DAFS
   communication channel, mapped onto a separate channel, to a DAFS
   Session, it specifies whether this DAFS communication channel is to
   be used for RDMA read or for back-control directives.
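
   The rule above can be sketched as a check a server might apply to an
   incoming bind request. The channel_use constants are the ones given
   in the ARGUMENTS section; struct session_flags and bind_permitted are
   hypothetical names, and the error a server returns when the check
   fails is not shown because this section does not tie one to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DAFS_BACK_CONTROL_CHANNEL 1
#define DAFS_RDMA_READ_CHANNEL    2

/* Hypothetical record of what was negotiated when the Session began. */
struct session_flags {
    uint32_t use_back_control_channel;
    uint32_t use_rdma_read_channel;
};

/* True if a channel of the requested use may be bound to the Session. */
static bool bind_permitted(const struct session_flags *s,
                           uint16_t channel_use)
{
    assert(s != NULL);
    switch (channel_use) {
    case DAFS_BACK_CONTROL_CHANNEL:
        return s->use_back_control_channel != 0;
    case DAFS_RDMA_READ_CHANNEL:
        return s->use_rdma_read_channel != 0;
    default:
        return false;   /* unknown channel_use value */
    }
}
```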

   A client binding to an existing DAFS Session MUST authenticate itself
   successfully using the same identification as the original DAFS com-
   munication channel. (For example, if an initial DAFS Session was
   created and authenticated as user O then a subsequent client cannot
   attempt to bind to the Session and authenticate itself as user S.)

   The response from the server includes the trusted field, which speci-
   fies whether the client is trusted to issue subsequent
   DAFS_PROC_REGISTER_CRED operations.

   ERRORS

   DAFSERR_STATUS_OK

   DAFSERR_CHAIN_FORM

   DAFSERR_ILLEGAL_STATE

   DAFSERR_INVAL

   DAFSERR_NOT_AUTHORIZED

   DAFSERR_NOTSUPP

   DAFSERR_UNKNOWN_SESSION


6.2.6.  DAFS_PROC_DISCONNECT

   SUMMARY

   Terminates a DAFS Session.

   ARGUMENTS

   None.

   RESULTS

   None.

   DESCRIPTION

   Disconnect a DAFS Session and all communication channels that have
   been associated with it using DAFS_PROC_CONNECT_BIND. This operation
   can be used at any time after a Session has been connected.

   ERRORS

   DAFSERR_STATUS_OK

   DAFSERR_CHAIN_FORM

   DAFSERR_INVAL

   DAFSERR_NOTSUPP


6.2.7.  DAFS_PROC_SECINFO

   SUMMARY

   Queries the server for available security information.

   ARGUMENTS

   None.

   RESULTS

        struct authtype
           {
           dafs_uint32                  auth_type;
           union switch (auth_type)
              {
              case DAFS_AUTH_NONE:
              case DAFS_AUTH_TEXT:
                 void;
              case DAFS_AUTH_GSS:
                 dafs_opaque8           oid<>;
                 dafs_uint32            qop;
                 enum dafs_gss_service  gss_service;
              case DAFS_AUTH_DEFAULT:
                 void;
              } authinfo;
           };


        struct DAFS_Secinfo_Res
           {
           struct authtype              secinfo<>;          /* heap */
           };


   DESCRIPTION

   Determine which authentication methods the server supports. This is
   typically used after DAFS_PROC_CLIENT_CONNECT and before
   DAFS_PROC_CLIENT_AUTH to determine which authentication mechanisms
   can be used.

   The result returns an array of auth_type (DAFS_AUTH_TEXT etc.) each
   with associated authentication-method specific information:

   o  For auth_type DAFS_AUTH_TEXT or DAFS_AUTH_NONE, empty associated
      information is returned.

   o  For auth_type DAFS_AUTH_GSS, the gss Object Identifier, the sup-
      ported service, and supported qop values for that service are
      returned. The object identifier is encoded as a variable length
      array of opaque bytes. It is NOT REQUIRED to encode the length
      field of the oid as part of the opaque data of the oid field.
      Rather, the length field of the oid is encoded as the length field
      of the opaque array, and the data bytes of the oid are transferred
      as the entire contents of the opaque8 byte array.

   ERRORS

   DAFSERR_STATUS_OK

   DAFSERR_CHAIN_FORM

   DAFSERR_NOTSUPP


6.2.8.  DAFS_PROC_REGISTER_CRED

   SUMMARY

   Registers credentials.

   ARGUMENTS

           const DAFS_CRED_NAME     = 0;
           const DAFS_CRED_ID       = 1;
           const DAFS_CRED_GSS      = 2;
           const DAFS_CRED_DEFAULT  = 3;


        struct DAFS_Register_Cred_Args
           {
           dafs_uint32                cred_type;
           union switch (cred_type)
              {
              case DAFS_CRED_ID:
                 dafs_int32           uid;
                 dafs_int32           gid;
                 dafs_int32           groups<>;            /* heap */
              case DAFS_CRED_NAME:
                 dafs_utf8string      name;                /* heap */
              case DAFS_CRED_GSS:
                 dafs_opaque8         name<>;              /* heap */
              case DAFS_CRED_DEFAULT:
                 void;
              } cred_data;
           };


   RESULTS

        struct DAFS_Register_Cred_Res
           {
           dafs_cred_handle_type      cred_handle;
           };


   DESCRIPTION

   This operation is used to advise a server that a set of credentials
   is active for the client associated with the Session. For each valid
   set of credentials specified, the server returns a client-unique
   credential handle that can be used for subsequent DAFS messages. The
   new set of credentials can be used during subsequent DAFS operations
   on any Session that is used for communicating between this client
   instance and the current server instance.

   Note that accessing credentials is distinct from authentication. The
   client has already performed appropriate authentication, and is
   trusted by the server to request proxy credentials as necessary to
   perform DAFS operations.

   The server can support enhanced credentials, such as multiple proto-
   col support or id mappings. Because of this the client only needs to
   provide the identification component of the credential set (for exam-
   ple, DAFS_CRED_NAME). The client MAY also want to provide more tradi-
   tional UNIX-style credentials to the server as a hint of the result-
   ing credential set (DAFS_CRED_ID). The client MAY also request that a
   server-defined default set of credentials be used
   (DAFS_CRED_DEFAULT).

   The server is REQUIRED to keep credentials for as long as it is
   REQUIRED to keep a Client-id. If there are no currently connected
   Sessions, and the lease times of all locks have expired, it is
   permissible for the server to release all client state, including
   any credentials associated with the client. A client that fails to
   reconnect quickly enough to avoid the release of previous client
   state can detect this case by noticing that the returned Client-id
   has changed, prompting it to re-register its credentials.

   The following types of credentials can be specified:

   o  DAFS_CRED_NAME: The username identifying the individual or entity
      to be associated with the credentials.

   o  DAFS_CRED_ID: The numeric uid/gid to be used in identifying the
      credentials.

   o  DAFS_CRED_GSS: A GSS-based set of credentials. This is specified
      as a mechanism name, i.e. a GSS mechanism-specific format name, as
      returned by GSS_Inquire_context, and GSS_Display_name, for exam-
      ple.

   o  DAFS_CRED_DEFAULT: Set of default credentials supplied by the
      server.
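
   For readers more familiar with C than with the union-switch notation
   used in the ARGUMENTS section, the credential types listed above
   could be held in memory as a tagged union along the following lines.
   This is only a sketch: struct dafs_cred and the two constructor
   functions are hypothetical, plain C integer types stand in for the
   dafs_ types, and the on-the-wire layout is not represented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Discriminant values as given in the ARGUMENTS section. */
enum {
    DAFS_CRED_NAME    = 0,
    DAFS_CRED_ID      = 1,
    DAFS_CRED_GSS     = 2,
    DAFS_CRED_DEFAULT = 3
};

/* Hypothetical in-memory form of one credential. */
struct dafs_cred {
    uint32_t cred_type;
    union {
        struct {                     /* DAFS_CRED_ID */
            int32_t uid, gid;
            const int32_t *groups;
            size_t ngroups;
        } id;
        const char *name;            /* DAFS_CRED_NAME */
        struct {                     /* DAFS_CRED_GSS */
            const uint8_t *bytes;
            size_t len;
        } gss;
    } u;                             /* DAFS_CRED_DEFAULT uses no arm */
};

static struct dafs_cred dafs_cred_from_name(const char *name)
{
    struct dafs_cred c;

    assert(name != NULL);
    memset(&c, 0, sizeof c);
    c.cred_type = DAFS_CRED_NAME;
    c.u.name = name;
    return c;
}

static struct dafs_cred dafs_cred_from_id(int32_t uid, int32_t gid)
{
    struct dafs_cred c;

    memset(&c, 0, sizeof c);
    c.cred_type = DAFS_CRED_ID;
    c.u.id.uid = uid;
    c.u.id.gid = gid;                /* no supplementary groups */
    return c;
}
```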

   The server MUST support at least one credential type.

   The number of credentials that a client can register is negotiated
   during the client's initial Session creation. See 6.2.1.,
   "DAFS_PROC_CLIENT_CONNECT" for more information. If the client
   attempts to register more credentials than were negotiated, the
   server will return an error. The client can use
   DAFS_PROC_RELEASE_CRED to release an existing credential.
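
   The interaction between the negotiated credential limit, registration
   and release can be sketched with a fixed-size table. The names
   cred_table, cred_register and cred_release, the CRED_SLOTS bound, and
   the use of a slot index as the handle are all assumptions made for
   this example; the format of a real credential handle is not defined
   here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CRED_SLOTS 16   /* hypothetical compile-time upper bound */

struct cred_table {
    uint32_t max_credentials;     /* value negotiated at first connect */
    bool     in_use[CRED_SLOTS];
};

/* Returns a handle (slot index), or -1 once the limit is reached. */
static int cred_register(struct cred_table *t)
{
    uint32_t i;

    assert(t != NULL && t->max_credentials <= CRED_SLOTS);
    for (i = 0; i < t->max_credentials; i++) {
        if (!t->in_use[i]) {
            t->in_use[i] = true;
            return (int)i;
        }
    }
    return -1;
}

/* Frees a handle so that a later registration can reuse the slot. */
static bool cred_release(struct cred_table *t, int handle)
{
    assert(t != NULL);
    if (handle < 0 || (uint32_t)handle >= t->max_credentials ||
        !t->in_use[handle])
        return false;
    t->in_use[handle] = false;
    return true;
}
```

   With a negotiated limit of 2, a third registration fails until one of
   the first two handles has been released.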

   The server returns DAFSERR_INVAL if there are too many groups in the
   groups list.

   ERRORS

   DAFSERR_STATUS_OK

   DAFSERR_CHAIN_FORM

   DAFSERR_INVAL

   DAFSERR_NOT_AUTHORIZED

   DAFSERR_NOTSUPP


6.2.9.  DAFS_PROC_RELEASE_CRED

   SUMMARY

   Releases registered credentials.

   ARGUMENTS

        struct DAFS_Release_Cred_Args
           {
           dafs_cred_handle_type      cred_handle;
           };


   RESULTS

   None.

   DESCRIPTION

   The RELEASE_CRED message is used to advise a server that a set of
   credentials (as obtained from the REGISTER_CRED operation) is no
   longer REQUIRED for the client. The cred_handle specifies which
   credential handle is to be released.

   ERRORS

   DAFSERR_STATUS_OK

   DAFSERR_CHAIN_FORM

   DAFSERR_NOT_AUTHORIZED

   DAFSERR_NOTSUPP


6.3.  Response Cache

   This section gives the message packet definitions for the operations
   that provide Response Cache recovery following a failure. A descrip-
   tion of the features and rationale associated with these operations
   can be found in 5.2., "Server Response Cache".

6.3.1.  DAFS_PROC_CHECK_RESPONSE

   SUMMARY

   Determines the availability of cached results from a previously
   issued fs-state-modifying operation.

   ARGUMENTS

        struct DAFS_Check_Response_Args
           {
           dafs_session_id_type       session_id;
           dafs_uint16                xid_stream;
           dafs_uint16                xid_seq;
           dafs_uint32                procedure;
           };


   RESULTS

   None.

   DESCRIPTION

   CHECK_RESPONSE is used to determine the availability of a Response
   Cache entry when recovering from disconnection, server reboot, or
   server failover. The client uses it so that it can determine which of
   its in-flight requests have actually been executed.

   The server returns DAFSERR_STATUS_OK when a Response Cache entry is
   present, indicating that the specified request was performed and has
   response information that can be fetched using the
   DAFS_PROC_FETCH_RESPONSE procedure. DAFSERR_NO_XID_MATCH indicates
   that the request identified by the xid_stream and xid_seq did not
   have an entry in the cache. In general, this is an indication that
   the request was not executed before the session represented by the
   session_id failed.

   The server might not have knowledge of the session represented by the
   session_id. In this case, DAFSERR_UNKNOWN_SESSION will be returned.
   Assuming the client submitted a valid session_id, this status is
   returned because the server lost state and it does not maintain its
   response caches in stable storage.
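
   A client recovering its in-flight requests might map the three
   statuses discussed above to actions as in the sketch below. The enum
   names are hypothetical and do not carry protocol error numbers.
   Reissuing a request on a no-match result follows the "in general"
   reading above and is an assumption, as is treating an unknown session
   as an undetermined outcome that the caller must resolve by other
   means.

```c
#include <assert.h>

/* Stand-ins for the statuses returned by DAFS_PROC_CHECK_RESPONSE. */
enum check_status {
    CHECK_STATUS_OK,
    CHECK_NO_XID_MATCH,
    CHECK_UNKNOWN_SESSION
};

enum recovery_action {
    FETCH_CACHED_RESPONSE,  /* request ran; fetch its saved result */
    REISSUE_REQUEST,        /* no cache entry; assume it did not run */
    OUTCOME_UNKNOWN         /* server lost the session's cache */
};

/* Choose what to do for one in-flight request after reconnecting. */
static enum recovery_action recovery_for(enum check_status s)
{
    switch (s) {
    case CHECK_STATUS_OK:
        return FETCH_CACHED_RESPONSE;
    case CHECK_NO_XID_MATCH:
        return REISSUE_REQUEST;
    case CHECK_UNKNOWN_SESSION:
        return OUTCOME_UNKNOWN;
    }
    assert(!"unexpected check status");
    return OUTCOME_UNKNOWN;
}
```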

   ERRORS

   DAFSERR_STATUS_OK

   DAFSERR_CHAIN_BROKEN

   DAFSERR_NO_XID_MATCH

   DAFSERR_UNKNOWN_SESSION


6.3.2.  DAFS_PROC_FETCH_RESPONSE

   SUMMARY

   Retrieves the cached results from fs-state-modifying operations and
   chained operations.

   ARGUMENTS

        struct DAFS_Fetch_Response_Args
           {
           dafs_session_id_type       session_id;
           dafs_uint16                xid_stream;
           dafs_uint16                xid_seq;
           dafs_uint32                procedure;
           };


   RESULTS

   See "Implementation" below.

   DESCRIPTION

   The result from DAFS_PROC_FETCH_RESPONSE is the result from the ori-
   ginal request. The header (xid, analyzer, and other fields) is taken
   from the DAFS_PROC_FETCH_RESPONSE request. The results proper are
   taken from the response in the server's Response Cache.

   If the response is not found, the result is zero-length. This is not
   expected to happen, because the caller SHOULD use
   DAFS_PROC_CHECK_RESPONSE to find out if the results it is intending
   to fetch have been cached by the server. Otherwise, the status
   returned is the status from the fetched response, not the status of
   the fetch operation itself.

   IMPLEMENTATION

   A recommendation is to make these operations chainable. Making the
   operations chainable ensures that a CHECK_RESPONSE, FETCH_RESPONSE
   chain can be issued to check for the response and fetch it, if the
   response is in the server's Response Cache.

   ERRORS

   See "Description" above.


6.3.3.  DAFS_PROC_DISCARD_RESPONSES

   SUMMARY

   Tells server that client is finished with the Response Cache for a
   disconnected Session.

   ARGUMENTS

        struct DAFS_Discard_Responses_Args
           {
           dafs_session_id_type       session_id;
           };


   RESULTS

   None.

   DESCRIPTION

   This operation lets the server remove the Response Cache for the
   specified disconnected Session. If the client does not do this, the
   Response Cache will be maintained until client reinitialization.

   IMPLEMENTATION

   ERRORS

   DAFSERR_STATUS_OK

   DAFSERR_CHAIN_FORM

   DAFSERR_NOT_AUTHORIZED

   DAFSERR_UNKNOWN_SESSION


6.4.  Fencing Procedures

   This section describes the fencing operations used to manage the
   Fencing_list and put special fencing access controls into effect.

6.4.1.  DAFS_PROC_GET_FENCING_LIST

   SUMMARY

   Return the Fencing_Id_List for a file or file system.

   ARGUMENTS

        struct DAFS_Get_Fencing_List_Args
           {
           dafs_filehandle_type          filehandle;
                                               /* file or fshandle */
           };


   RESULTS

        struct DAFS_Get_Fencing_List_Res
           {
           dafs_fence_array_type      fence_list;          /* heap */
           };


   DESCRIPTION

   The get fencing list procedure returns the current fencing list for
   the file or file system specified by the filehandle argument. The
   ability to get the Fencing_list for a filehandle object is reserved
   to the owner of the object, or a trusted Client.

   IMPLEMENTATION

   ERRORS

   DAFSERR_CHAIN_FORM

   DAFSERR_INVAL

   DAFSERR_NOT_AUTHORIZED


6.4.2.  DAFS_PROC_SET_FENCING_LIST

   SUMMARY

   Set the Fencing_Id_List for a file or file system.

   ARGUMENTS

        enum object_type
           {
           FILE                       = 1,
           FILESYSTEM                 = 2
           };


        enum access_type
           {
           ALLOW                      = 1,
           DENY                       = 2
           };


        enum update_type
           {
           OVERWRITE                  = 1,
           APPEND                     = 2,
           REMOVE                     = 3
           };


        struct DAFS_Set_Fencing_List_Args
           {
           dafs_filehandle_type       filehandle; /* file/fshandle */
           object_type                object;
           access_type                access;     /* allow or deny */
           update_type                update;     /* overwrite, append,
                                                     or remove */
           dafs_fence_array_type      fence_list;
           };


   RESULTS

   None.


   DESCRIPTION

   The set operation atomically updates the Fencing_list, adding or
   removing Fence_id_strings from the existing list, or overwriting the
   existing list, as specified by the argument flags.

   The filehandle specifies the file or file system that the fencing
   list will be associated with. The object argument indicates whether
   the file or file system is the object to be fenced.

   The fencing list field is the list of fence_id_strings that is to be
   added or removed from the fencing list.

   The access field specifies whether the fencing list specifies a list
   of fence_id_strings that are to be allowed access to the fenced
   object, or denied access to that object.

   The update field specifies whether the fencing list argument modifies
   an existing fencing list for the object, or, in the case of allowing
   access, that it overwrites the existing list.

   A side-effect of the set operation when invoked with access = DENY
   is to

   1) drain (i.e., abort or complete) any in-progress operations
      received on a DAFS Session with the just-denied Fence_id_string.
      On all subsequent requests on a Session that has the associated
      just-denied Fence_id_string, the server MUST enforce the denial
      of access implied by the new Fencing_list. This requires
      determining that the request is associated with a denied
      Fence_id_string (e.g., determining that the request's Session has
      a denied Fence_id_string), and matching the filehandle in the
      request to Objects that are Fenced.

   2) if the Object fenced includes all DAFS file objects (directories,
      files, symlinks, etc.) provided by the DAFS Server, then all
      existing DAFS Sessions associated with the just-denied
      Fence_id_string can be closed in error. Subsequent attempts to
      create a Session that contains the just-denied Fence_id_string can
      be returned in error.

   The ability to set the Fencing_list for a filehandle object is
   reserved to the owner of the object, or a trusted Client. The
   ability to set the Fencing_list for a file system is reserved to
   trusted Clients.
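
   The three update modes can be pictured with a small in-memory
   sketch. It is illustrative only: the fixed-size list, the
   FENCE_MAX and FENCE_ID_LEN limits, and the fence_update() helper
   are assumptions made for the example and are not defined by this
   protocol. The update is applied to a scratch copy and committed in
   one assignment, which is one way to keep the change all-or-nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limits and types; the protocol does not define them. */
#define FENCE_MAX    16
#define FENCE_ID_LEN 32

enum update_type { OVERWRITE = 1, APPEND = 2, REMOVE = 3 };

struct fence_list {
    size_t count;
    char   ids[FENCE_MAX][FENCE_ID_LEN];
};

/* Return the index of id in the list, or -1 if it is absent. */
static int fence_find(const struct fence_list *l, const char *id)
{
    for (size_t i = 0; i < l->count; i++)
        if (strcmp(l->ids[i], id) == 0)
            return (int)i;
    return -1;
}

/* Apply the update to a scratch copy and commit it only on success,
   so the caller observes either the old list or the complete new one.
   Returns 0 on success, -1 on an invalid or oversized update. */
int fence_update(struct fence_list *cur, enum update_type how,
                 const char *const *ids, size_t n)
{
    assert(cur != NULL && (ids != NULL || n == 0));
    if (how != OVERWRITE && how != APPEND && how != REMOVE)
        return -1;

    struct fence_list next = *cur;
    if (how == OVERWRITE)
        next.count = 0;                 /* start from an empty list */

    for (size_t i = 0; i < n; i++) {
        int at = fence_find(&next, ids[i]);
        if (how == REMOVE) {
            if (at < 0)
                continue;               /* nothing to remove */
            if ((size_t)at + 1 < next.count)
                memmove(next.ids[at], next.ids[at + 1],
                        (next.count - (size_t)at - 1) * FENCE_ID_LEN);
            next.count--;
        } else {
            if (at >= 0)
                continue;               /* already present */
            if (next.count == FENCE_MAX ||
                strlen(ids[i]) >= FENCE_ID_LEN)
                return -1;              /* cur is left untouched */
            strcpy(next.ids[next.count++], ids[i]);
        }
    }
    *cur = next;                        /* single commit point */
    return 0;
}
```

   A real server would also need to serialize this commit against
   concurrent requests; the sketch leaves locking out.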

   IMPLEMENTATION

   It is server implementation specific how to handle a Set_Fencing_List
   operation on a filehandle that already has a Fencing_list defined for
   the associated dafs_FShandle, or a Set_Fencing_List operation on a
   dafs_FShandle that already has a Fencing_list defined for the
   filehandle.

   ERRORS

   DAFSERR_CHAIN_FORM

   DAFSERR_INVAL

   DAFSERR_NOT_AUTHORIZED


6.5.  File System Procedures

   This section describes the individual operations, along with the for-
   mats of the arguments portion of the request and the results portion
   of the response. All operations in this section are initiated by
   clients and are processed and responded to by servers.

6.5.1.  DAFS_PROC_NULL

   SUMMARY

   No operation.

   ARGUMENTS

   None.

   RESULTS

   None.

   DESCRIPTION

           "Standard NULL procedure. Void [no] arguments, void [no]
           results.  This procedure has no functionality associated
           with it. Because of this, it is sometimes used to  meas-
           ure the overhead of processing a service request. There-
           fore, the server should ensure that no unnecessary  work
           is done in servicing this procedure." (RFC 3010, p. 102)

   One other potential use of this procedure is as a means for a reques-
   ter to receive flow control information from the server in a timely
   fashion during an otherwise quiet period. See 3.2.6., "Message Flow
   Control" for a discussion.

   A side effect of a DAFS_PROC_NULL request is the renewal of a
   client's lock leases.


6.5.2.  DAFS_PROC_ACCESS

   SUMMARY

   Checks an object's access rights.

   ARGUMENTS

           const ACCESS_READ          = 0x00000001;
           const ACCESS_LOOKUP        = 0x00000002;
           const ACCESS_MODIFY        = 0x00000004;
           const ACCESS_EXTEND        = 0x00000008;
           const ACCESS_DELETE        = 0x00000010;
           const ACCESS_EXECUTE       = 0x00000020;


        struct DAFS_Access_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_uint32                access;
           };


   RESULTS

        struct DAFS_Access_Res
           {
           dafs_uint32                supported;
           dafs_uint32                access;
           };


   DESCRIPTION

           "ACCESS [DAFS_PROC_ACCESS] determines the access  rights
           that a user, as identified by the credentials in the RPC
           request, has with respect  to  the  file  system  object
           specified by the current filehandle." (RFC 3010, p. 105)

   DAFS does not use a separate RPC section of the message to transmit
   credentials information. See 4.1.3.2., "Credentials" for an
   explanation of DAFS credential handles. In addition, DAFS explicitly
   exchanges filehandle information in argument and result message
   fields. Further, DAFS provides server-based implicit transmission of
   a "current_filehandle" between chained operations. See 4.1.3.1.,
   "Filehandles in Compound vs. Chaining" for differences in filehandle
   management in DAFS procedures.

           "The client encodes the set of access rights that are to
           be  checked in the bit mask 'access.'  The server checks
           the permissions encoded in the bit mask.  If a status of
           NFS4_OK  [DAFS_STATUS_OK] is returned, two bit masks are
           included  in  the  response.   The  first,  'supported',
           represents  the  access  rights for which the server can
           verify reliably.  The second, 'access,'  represents  the
           access  rights  available to the user for the filehandle
           provided.  On success, the  current  filehandle  retains
           its value.

           Note that the supported field will contain only as many
           values as was originally sent in the arguments. For
           example, if the client sends an ACCESS [DAFS_PROC_ACCESS]
           operation with only the ACCESS4_READ [ACCESS_READ] value
           set and the server supports this value, the server will
           return only ACCESS4_READ [ACCESS_READ] even if it could
           have reliably checked other values.

           The results of this operation are  necessarily  advisory
           in  nature. A return status of NFS4_OK  [DAFS_STATUS_OK]
           and the appropriate bit set in the  bit  mask  does  not
           imply  that such access will be allowed to the file sys-
           tem object in the future. This is because access  rights
           can be revoked by the server at any time.

           The following access permissions may be requested:

           ACCESS4_READ [ACCESS_READ]: Read data from file or  read
           a directory.

           ACCESS4_LOOKUP [ACCESS_LOOKUP]:  Look up  a  name  in  a
           directory (no meaning for non-directory objects).

           ACCESS4_MODIFY [ACCESS_MODIFY]:  Rewrite  existing  file
           data or modify existing directory entries.

           ACCESS4_EXTEND [ACCESS_EXTEND]:  Write new data  or  add
           directory entries.

           ACCESS4_DELETE  [ACCESS_DELETE]:    Delete  an  existing
           directory entry (no meaning for non-directory objects).

           ACCESS4_EXECUTE [ACCESS_EXECUTE]: Execute file (no
           meaning for a directory)." (RFC 3010, pp. 105-106)

   IMPLEMENTATION

           "For the NFS version 4 [and DAFS] protocol, the  use  of
           the  ACCESS  [DAFS_PROC_ACCESS] procedure when opening a
           regular file  is  deprecated  in  favor  of  using  OPEN
           [DAFS_PROC_OPEN].

           In general, it is  not  sufficient  for  the  client  to
           attempt  to  deduce access permissions by inspecting the
           uid, gid, and mode fields in the file attributes  or  by
           attempting  to  interpret the contents of the ACL attri-
           bute.  This is because the server may perform uid or gid
           mapping  or  enforce  additional access control restric-
           tions.  It is also possible that the server may  not  be
           in the same ID space as the client.  In these cases (and
           perhaps others), the client can not reliably perform  an
           access check with only current file attributes.

           In the NFS version 2 protocol, the only reliable way  to
           determine whether an operation was allowed was to try it
           and see if it succeeded or  failed.   Using  the  ACCESS
           [DAFS_PROC_ACCESS]   procedure in the NFS version 4 [and
           DAFS] protocol, the client can ask the server  to  indi-
           cate  whether  or  not one or more classes of operations
           are permitted. The ACCESS  [DAFS_PROC_ACCESS]  operation
           is  provided  to  allow  clients to check before doing a
           series of operations which  will  result  in  an  access
           failure.  The OPEN [DAFS_PROC_OPEN] operation provides a
           point where the server can verify  access  to  the  file
           object  and  method  to  return  that information to the
           client.   The  ACCESS  [DAFS_PROC_ACCESS]  operation  is
           still  useful for directory operations or for use in the
           case the UNIX API 'access' is used on the client.

           The information returned by the server in response to an
           ACCESS  [DAFS_PROC_ACCESS]   call  is not permanent.  It
           was correct at the exact time that the server  performed
           the  checks, but not necessarily afterwards.  The server
           can revoke access permission at any time.

           The client should use the effective credentials  of  the
           user  to  build  the  authentication  information in the
           ACCESS  [DAFS_PROC_ACCESS]  request  used  to  determine
           access rights." (RFC 3010, pp. 106-107)

   In DAFS, a client needs to register the user's effective credentials
   and include the credentials handle thus obtained in the DAFS header
   of the ACCESS operation.

           "It is the effective user and group credentials that are
           used in subsequent read and write operations.

           "Many  implementations  do  not  directly  support   the
           ACCESS4_DELETE   [ACCESS_DELETE]  permission.  Operating
           systems  like  UNIX  will  ignore   the   ACCESS4_DELETE
           [ACCESS_DELETE]  bit  if  set  on an access request on a
           non-directory object.  In these systems, delete  permis-
           sion  on  a file is determined by the access permissions
           on the directory in which the file resides,  instead  of
           being  determined by the permissions of the file itself.
           Therefore, the mask returned  enumerating  which  access
           rights  can  be  determined will have the ACCESS4_DELETE
           [ACCESS_DELETE]  value set to 0.  This indicates to  the
           client that the server was unable to check that particu-
           lar access right. The ACCESS4_DELETE [ACCESS_DELETE] bit
           in  the access mask returned will then be ignored by the
           client." (RFC 3010, pp. 106-107)
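
   As an illustration of how a client might combine the two result
   masks, the sketch below treats a right as decided only when its bit
   is set in "supported". The access_answer enumeration and the
   access_check() helper are hypothetical names used for this example;
   only the ACCESS_* constants come from the ARGUMENTS above.

```c
#include <assert.h>
#include <stdint.h>

#define ACCESS_READ     0x00000001u
#define ACCESS_LOOKUP   0x00000002u
#define ACCESS_MODIFY   0x00000004u
#define ACCESS_EXTEND   0x00000008u
#define ACCESS_DELETE   0x00000010u
#define ACCESS_EXECUTE  0x00000020u

/* Hypothetical tri-state answer; not part of the protocol. */
enum access_answer {
    ANSWER_UNKNOWN = 0,   /* server could not verify this right */
    ANSWER_GRANTED = 1,
    ANSWER_REFUSED = 2
};

/* Interpret one access right from a DAFS_Access_Res-style reply.
   'right' must be exactly one ACCESS_* bit. */
enum access_answer access_check(uint32_t supported, uint32_t access,
                                uint32_t right)
{
    assert(right != 0 && (right & (right - 1)) == 0);

    /* A bit missing from 'supported' means the matching bit in
       'access' carries no information and is ignored. */
    if ((supported & right) == 0)
        return ANSWER_UNKNOWN;

    return (access & right) ? ANSWER_GRANTED : ANSWER_REFUSED;
}
```

   Because the results are advisory, a granted answer is still only a
   hint; the later operation can fail if the server has revoked access
   in the meantime.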

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_BROKEN

   DAFSERR_CHAIN_FORM

   DAFSERR_DELAY

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.3.  DAFS_PROC_APPEND_INLINE

   SUMMARY

   Appends data to the end of a file. The data to be written is part of
   the request packet and is passed inline.

   ARGUMENTS

        enum dafs_append_stable_how
           {
           DATA_SYNC                  = 1,
           FILE_SYNC                  = 2
           };


        struct DAFS_Append_Inline_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_state_id_type         state_id;
           dafs_append_stable_how     stable_how;
           dafs_uint32                byte_count;
           dafs_uint32                write_padded;
           dafs_cache_hint_type       cache_hint;
           dafs_opaque8               data[byte_count];
           };


   RESULTS

        struct DAFS_Append_Inline_Res
           {
           dafs_uint64                offset;
           dafs_verifier_type         verifier;
           dafs_append_stable_how     committed;
           };


   DESCRIPTION

   Using the append inline procedure, a client requests that the server
   atomically append the data at the end of the file. The request is
   atomic with respect to:

   1) the end-of-file. The server ensures that the determination of the
      current end-of-file offset and appending the data at that offset
      are atomic with respect to other write operations on the file.

   2) data to be written. The server ensures that either all of the data
      is appended to the file, or none of the data is appended to the
      file.

   Append operations MUST specify either Data_Sync or File_Sync stabil-
   ity.

   If the append request specifies a byte count larger than the
   DAFS_FSATTR_MAX_APPEND file system attribute associated with the
   file, the server MAY return the error value DAFSERR_WRITE_TOOBIG and
   not append any of the data.

   IMPLEMENTATION

   The append operation is REQUIRED to support both atomicities
   described. The server MAY need to provide local buffer space for up
   to DAFS_FSATTR_MAX_APPEND bytes of data, and MAY need to adjust the
   resulting file size in order to eliminate any effects of a partially
   completed append operation.
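
   The two atomicity requirements can be sketched against a toy
   in-memory file. This is not a server implementation: MAX_APPEND
   stands in for DAFS_FSATTR_MAX_APPEND, and FILE_CAP, struct toy_file,
   the append_status values, and toy_append() are hypothetical names.
   The sketch always rejects an oversized request, although the text
   above only says the server MAY do so.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_APPEND 8    /* stands in for DAFS_FSATTR_MAX_APPEND */
#define FILE_CAP   16   /* hypothetical capacity of the toy file */

enum append_status { APPEND_OK = 0, APPEND_TOOBIG = 1, APPEND_NOSPC = 2 };

struct toy_file {
    uint64_t      size;             /* current end-of-file offset */
    unsigned char bytes[FILE_CAP];
};

/* Append all of the data or none of it. On success, *offset receives
   the end-of-file offset at which the data was placed. On failure,
   neither the file nor *offset is modified. */
enum append_status toy_append(struct toy_file *f, const void *data,
                              uint32_t byte_count, uint64_t *offset)
{
    assert(f != NULL && offset != NULL);

    if (byte_count > MAX_APPEND)
        return APPEND_TOOBIG;       /* cf. DAFSERR_WRITE_TOOBIG */
    if (byte_count > FILE_CAP - f->size)
        return APPEND_NOSPC;        /* cf. DAFSERR_NOSPC */

    /* A real server would hold a per-file lock from here to the size
       update, so no other writer can move end-of-file in between. */
    uint64_t eof = f->size;
    if (byte_count > 0)
        memcpy(f->bytes + eof, data, byte_count);
    f->size = eof + byte_count;     /* single commit point */

    *offset = eof;
    return APPEND_OK;
}
```

   All checks that can fail run before any byte is copied, which is
   what keeps a rejected request from leaving partial data behind.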

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_DENIED

   DAFSERR_DQUOT

   DAFSERR_EXPIRED

   DAFSERR_FBIG

   DAFSERR_FHEXPIRED


   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_LEASE_MOVED

   DAFSERR_LOCKED

   DAFSERR_MOVED

   DAFSERR_NOSPC

   DAFSERR_OLD_STATEID

   DAFSERR_RESOURCE

   DAFSERR_ROFS

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_STATEID

   DAFSERR_WRITE_TOOBIG


6.5.4.  DAFS_PROC_APPEND_DIRECT

   SUMMARY

   Initiates a write append to file using data retrieved via RDMA read
   directly from client memory buffers.

   ARGUMENTS

        enum dafs_append_stable_how
           {
           DATA_SYNC                  = 1,
           FILE_SYNC                  = 2
           };


        struct DAFS_Append_Direct_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_state_id_type         state_id;
           dafs_append_stable_how     stable_how;
           dafs_uint32                byte_count;
           dafs_cache_hint_type       cache_hint;
           dafs_checksum_type         direct_checksum;
           dafs_direct_op_buffer      write_data_buffer;
           };


        /* DIRECT: opaque write_data_buffer[byte_count]; */


   RESULTS

        struct DAFS_Append_Direct_Res
           {
           dafs_uint64                offset;
           dafs_verifier_type         verifier;
           dafs_append_stable_how     committed;
           };


   DESCRIPTION

   See DAFS_PROC_APPEND_INLINE for a description.


   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_BROKEN

   DAFSERR_CHAIN_FORM

   DAFSERR_DELAY

   DAFSERR_DENIED

   DAFSERR_DQUOT

   DAFSERR_EXPIRED

   DAFSERR_FBIG

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_LEASE_MOVED

   DAFSERR_LOCKED

   DAFSERR_MOVED

   DAFSERR_NOSPC

   DAFSERR_OLD_STATEID

   DAFSERR_RDMA-READ_CHANNEL_UNUSABLE

   DAFSERR_RESOURCE

   DAFSERR_ROFS

   DAFSERR_SERVERFAULT


   DAFSERR_STALE

   DAFSERR_STALE_STATEID

   DAFSERR_WRITE_TOOBIG


6.5.5.  DAFS_PROC_BATCH_SUBMIT

   SUMMARY

   Submits a batch of I/O requests to the server.

   ARGUMENTS

        typedef enum
           {
           DAFS_READ_OP               = 1,
           DAFS_WRITE_OP              = 2
           } dafs_rw_flag;


        struct DAFS_Batch_Submit_Args
           {
           dafs_rw_request_array_type requests;            /* heap */
           dafs_uint32                usec_window;
           dafs_uint32                num_completions;
           dafs_boolean               synchronous;
           };


   RESULTS

        struct DAFS_Batch_Submit_Res
           {
           dafs_completion_array_type completions;         /* heap */
           };


   DESCRIPTION

   The DAFS_PROC_BATCH_SUBMIT operation is used to initiate a set of I/O
   requests from and to regular files. This can be used as both a syn-
   chronous list-io mechanism as well as an asynchronous I/O request. In
   synchronous mode, the server completes all I/O requests, then reports
   each of their status results in a single reply message to the DAFS
   client. In asynchronous mode, the server reports results over the
   Back Channel.

   Each I/O request is described by the "dafs_read_write_request" struc-
   ture, which includes a request_id, unique to the DAFS Session. The
   request_id allows a client to match completion notifications to
   previously issued requests. It also includes a filehandle, a
   read/write flag, a state id, and a checksum field. Lastly, each
   dafs_read_write_request contains two lists, one of file chunks and
   another of memory buffers. The number of elements in those two lists
   does not need to match, nor does the size of the individual file
   regions need to match the size of the individual memory buffers, but
   the total number of bytes specified in the file region list MUST
   match the total number of bytes specified in the memory list. These
   two lists represent a combined scatter-gather list, either moving
   data from the file to the specified addresses in the case of a read,
   or from the addresses to a file in the case of a write. The server
   can decompose the two lists to a single list of transfers of memory
   to file regions, and is free to perform any optimizations. The decom-
   posed list can be thought of as the list of transfers defined by
   dividing all the data to be moved for a given filehandle at every
   point that separates one memory buffer from another or one file
   region from another. This list can easily be generated by running a
   cursor through each of the file list and memory buffer list provided.
   The condensed description using two lists, however, makes certain
   optimizations trivial. Two such examples are the case of I/O from a
   single file region into multiple memory buffers, or from a single
   memory buffer into discontiguous segments of a file.
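
   The cursor walk described above can be sketched as follows. The
   structure layouts and the decompose() helper are assumptions made
   for the example; the actual dafs_read_write_request encoding is
   defined elsewhere in this document. The function emits one transfer
   per span between consecutive boundaries of either list and reports
   an error when the two byte totals differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts for the two lists and the decomposed result. */
struct file_region { uint64_t offset; uint32_t length; };
struct mem_buffer  { uint64_t addr;   uint32_t length; };
struct transfer {
    uint64_t file_offset;
    uint64_t mem_addr;
    uint32_t length;
};

/* Walk both lists with one cursor each. Returns the number of
   transfers written to out, or -1 if the byte totals do not match or
   out is too small. */
int decompose(const struct file_region *fr, size_t nfr,
              const struct mem_buffer *mb, size_t nmb,
              struct transfer *out, size_t max_out)
{
    size_t i = 0, j = 0, n = 0;
    uint32_t fdone = 0, mdone = 0;  /* bytes used in current element */

    assert(out != NULL || max_out == 0);

    while (i < nfr && j < nmb) {
        uint32_t fleft = fr[i].length - fdone;
        uint32_t mleft = mb[j].length - mdone;
        uint32_t len = fleft < mleft ? fleft : mleft;

        if (len > 0) {
            if (n == max_out)
                return -1;
            out[n].file_offset = fr[i].offset + fdone;
            out[n].mem_addr    = mb[j].addr + mdone;
            out[n].length      = len;
            n++;
        }
        fdone += len;
        mdone += len;
        /* Advance whichever cursor reached an element boundary. */
        if (fdone == fr[i].length) { i++; fdone = 0; }
        if (mdone == mb[j].length) { j++; mdone = 0; }
    }

    /* Any bytes left over on either side mean the totals differ. */
    for (; i < nfr; i++, fdone = 0)
        if (fr[i].length != fdone)
            return -1;
    for (; j < nmb; j++, mdone = 0)
        if (mb[j].length != mdone)
            return -1;

    return (int)n;
}
```

   For example, file regions of 100 and 50 bytes paired with memory
   buffers of 30 and 120 bytes decompose into three transfers of 30,
   70, and 50 bytes.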

   If the "synchronous" flag is set in the DAFS_Batch_Submit_Args struc-
   ture, then the server treats the batch operation as a list of syn-
   chronous I/O requests to be performed immediately. When all the
   requests complete (either successfully or in error), the server sends
   a single reply on the Operation Channel containing the status of each
   I/O request. In this case, the server ignores the values of
   "usec_window" and "num_completions".

   Otherwise, if "synchronous" is not set, then the message becomes an
   asynchronous batch submit. In the asynchronous case, the client MAY
   set the "usec_window" parameter as a hint to the server indicating
   how quickly the client would like to have the batch requests
   satisfied. The client can also set "num_completions" as a hint to
   tell the server how many completions it would like reported at once.
   Note that the server SHOULD attempt to respect these hints, but is
   NOT REQUIRED to do so. That is, a client MUST be prepared to receive
   a different number of completions than it requested, or to wait
   longer than desired for a given completion. The server sends a reply
   message on the Operation Channel acknowledging the batch request.
   Then, as each request is completed, the server sends
   DAFS_PROC_BC_BATCH_COMPLETION messages to the client over the
   Back-control Channel.

   The server is not obligated to complete the requests in the batch in
   any specific order or with any atomicity obligations. Also, if a
   client submits multiple asynchronous batch operations, the server MAY
   coalesce them and report completions from multiple batches in the
   same back-channel completion notification or future request poll.

   The "num_completions" parameter is global to a DAFS Session. If more
   than one batch of asynchronous I/O requests is in progress, the
   server will respect the most recently received "num_completions"
   parameter, and MAY combine requests from different
   DAFS_PROC_BATCH_SUBMIT messages into the same
   DAFS_PROC_BC_BATCH_COMPLETION message. The client can use
   DAFS_PROC_BATCH_SUBMIT to change the current value of
   "num_completions" without submitting any additional requests. To do
   so, the client sets "num_completions" to the desired size, clears the
   synchronous flag, and issues the DAFS_PROC_BATCH_SUBMIT message with
   a zero-length I/O request list. Conversely, a client can submit more
   I/O requests in a batch without changing the Session's current
   "num_completions" parameter by setting "num_completions" to 0 in the
   new batch message.

   If the client and server have negotiated the use of data checksums,
   each batch I/O request is independently checksummed, using the DAFS
   checksum. For write requests, the client fills in the checksum field
   of the dafs_read_write_request structure, which the server verifies.
   For read requests, the server fills in the checksum field of the
   proper dafs_completion_notification structure, which the client veri-
   fies upon receipt of the data.

   IMPLEMENTATION

   In general, the DAFS_PROC_BATCH_SUBMIT response message SHOULD NOT be
   sent to the client until the server is able to allocate the resources
   necessary to store/queue the list of requests.

   This delays the operation channel flow control credits somewhat, but
   it automatically causes the client to slow down when the server
   experiences transient loads. It allows the server to respond rapidly
   on any Session as long as it has resources, but it need not dedicate
   resources for large BATCH_SUBMIT requests until they are requested.

   In some sense, this is no different than any other 'normal' message -
   the response isn't sent until the message is processed.

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID


   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_DENIED

   DAFSERR_DQUOT

   DAFSERR_EXPIRED

   DAFSERR_FBIG

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_LEASE_MOVED

   DAFSERR_LOCKED

   DAFSERR_MOVED

   DAFSERR_NOSPC

   DAFSERR_OLD_STATEID

   DAFSERR_RDMA-READ_CHANNEL_UNUSABLE

   DAFSERR_RESOURCE

   DAFSERR_ROFS

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_STATEID

   DAFSERR_WHOA_COWBOY

   DAFSERR_WRITE_TOOBIG


6.5.6.  DAFS_PROC_CACHE_HINT

   SUMMARY

   Provide the server with cache management hints.

   ARGUMENTS

        struct DAFS_Cache_Hint_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_uint64                offset;
           dafs_uint32                count;
           dafs_access_pattern_type   access_pattern;
           dafs_cache_hint_type       cache_hint;
           };


   RESULTS

   None.

   DESCRIPTION

   The access_pattern indicates the client's predicted general access
   pattern for the filehandle, if the pattern is known.

   The cache_hint defines a bit-field argument containing the cache hint
   for the specified byte range of the file. The first bit indicates
   whether a prefetch of this data is to be performed. The following 3
   bits indicate the likelihood of a read in the near future. The 3 bits
   following those indicate the likelihood that a write will be per-
   formed on this data in the near future. Detailed definitions for each
   of the cache_hint bits are given below.

   Cache_hint set to NULL is a special value that means that the hint is
   not being used or the client does not know currently what cache
   weighting to assign to the byte range. In the NULL case the server
   SHOULD assume default values (prefetch 0, dafs_readhint_4,
   dafs_writehint_4).

   dafs_prefetch

      Dafs_prefetch indicates that the given data range is to be
      prefetched. If this is set and accompanies a read or write request
      then it is ignored. It is intended to be used when the
      DAFS_PROC_CACHE_HINT function is called. If the prefetch bit is
      set and both the readhint and writehint are negative (i.e.,
      readhint < 0x8 and writehint < 0x40) then the request SHOULD be
      ignored. If not negative and the server is not otherwise busy it
      is then recommended that the requested data be prefetched into the
      server's cache. If prefetching is not supported by the server then
      an error DAFSERR_PREFETCH_NOT_SUPPORTED is returned (provided that
      the hint is not accompanying a read or write request in which case
      read/write errors are to be reported).

   dafs_readhint_1

      The client is confident that it will not read the data again in
      the near future.

   dafs_readhint_2

      The client believes there is a good chance that it will not read
      the data again in the near future.

   dafs_readhint_3

      The client believes there is a better than even chance that it
      will not read the data again in the near future.

   dafs_readhint_4

      The client does not know whether it will read the data again or
      not (default value).

   dafs_readhint_5

      The client believes there is a better than even chance that it
      will read this data again in the near future.

   dafs_readhint_6

      The client believes there is a good chance that it will read this
      data again in the near future.

   dafs_readhint_7

      The client believes there is an excellent chance that it will read
      the data again in the near future.

   dafs_writehint_1

      The client is confident that it will not write the data again in
      the near future.


   dafs_writehint_2

      The client believes there is a good chance that it will not write
      the data again in the near future.

   dafs_writehint_3

      The client believes there is a better than even chance that it
      will not write the data again in the near future.

   dafs_writehint_4

      The client does not know whether it will write the data again or
      not (default value).

   dafs_writehint_5

      The client believes there is a better than even chance that it
      will write this data again in the near future.

   dafs_writehint_6

      The client believes there is a good chance that it will write this
      data again in the near future.

   dafs_writehint_7

      The client believes there is an excellent chance that it will
      write the data again in the near future.

   IMPLEMENTATION

   The DAFS protocol does not dictate what actions a server takes upon
   receipt of cache hints, or even that the server take any action at
   all. Nor does the protocol dictate how long a server makes use of a
   hint issued by the client.
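
   As a non-normative illustration, the Python sketch below shows one
   way a server might fold the seven hint levels into a cache policy.
   The function name, the use of plain integers 1 through 7 for the
   levels, and the symmetric scoring around level 4 are assumptions made
   for this sketch; the protocol defines none of them.

```python
# Hypothetical sketch: turn a hint level (1..7, as in dafs_readhint_1..7 or
# dafs_writehint_1..7) into a cache retention bias. Not part of the protocol.

DEFAULT_LEVEL = 4  # level 4 is the "does not know" default described above


def retention_bias(level: int) -> int:
    """Return a value in -3..3.

    Negative values suggest the data is a good eviction candidate,
    positive values suggest keeping it cached, and 0 means no preference.
    """
    if not 1 <= level <= 7:
        raise ValueError(f"hint level out of range: {level}")
    return level - DEFAULT_LEVEL
```

   A server that ignores hints entirely remains conformant; the bias is
   only one possible input to an eviction decision.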

   ERRORS

   DAFSERR_CHAIN_BROKEN

   DAFSERR_CHAIN_FORM

   DAFSERR_PREFETCH_NOT_SUPPORTED


6.5.7.  DAFS_PROC_CLOSE

   SUMMARY

   Closes a file.

   ARGUMENTS

        struct DAFS_Close_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_state_id_type         state_id;
           };


   RESULTS

   None.

   DESCRIPTION

           "The CLOSE [DAFS_PROC_CLOSE]  operation  releases  share
           reservations  for  the  file as specified by the current
           filehandle.  The  share  reservations  and  other  state
           information  released  at the server as a result of this
           CLOSE [DAFS_PROC_CLOSE] is only associated with the sup-
           plied  stateid [state_id].  The sequence id provides for
           the correct ordering." (RFC 3010, p. 108)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

           "State associated with other OPENs [DAFS_PROC_OPENs]  is
           not affected.

           If record locks are held, the client SHOULD release  all
           locks  before  issuing  a  CLOSE [DAFS_PROC_CLOSE].  The
           server   MAY   free    all    outstanding    locks    on
           CLOSE [DAFS_PROC_CLOSE] but some servers may not support
           the CLOSE [DAFS_PROC_CLOSE] of a  file  that  still  has
           record  locks  held.   The server MUST return failure if
           any    locks    would    exist    after    the     CLOSE
           [DAFS_PROC_CLOSE]." (RFC 3010, p. 108)

   The DAFS_PROC_CLOSE operation provides "Delete On Last Close" seman-
   tics. Once a file has been opened, the DAFS Server MUST continue to
   provide access to the file to the Clients that have the file open,
   even after the file has been removed, up until the number of Clients
   that have the file open has dropped to zero. However, once the file
   has been removed, subsequent lookup and open operations will fail.
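
   The "Delete On Last Close" behavior can be pictured with the Python
   sketch below. It is illustrative only: the class, its method names,
   and the in-memory dictionaries are hypothetical, and for simplicity
   it counts individual opens rather than distinct Clients.

```python
class OpenFileTable:
    """Hypothetical server-side bookkeeping for delete-on-last-close."""

    def __init__(self):
        self.names = {}        # name -> file id (the visible name space)
        self.data = {}         # file id -> contents
        self.open_counts = {}  # file id -> number of current opens
        self.removed = set()   # file ids removed while still open

    def create(self, name, file_id, contents=b""):
        self.names[name] = file_id
        self.data[file_id] = contents
        self.open_counts[file_id] = 0

    def open(self, name):
        # Lookup and open fail once the name has been removed.
        if name not in self.names:
            raise FileNotFoundError(name)
        file_id = self.names[name]
        self.open_counts[file_id] += 1
        return file_id

    def remove(self, name):
        file_id = self.names.pop(name)
        if self.open_counts[file_id] == 0:
            self._destroy(file_id)
        else:
            self.removed.add(file_id)  # defer until the last close

    def read(self, file_id):
        # Existing opens keep working after the remove.
        return self.data[file_id]

    def close(self, file_id):
        self.open_counts[file_id] -= 1
        if self.open_counts[file_id] == 0 and file_id in self.removed:
            self.removed.discard(file_id)
            self._destroy(file_id)

    def _destroy(self, file_id):
        del self.data[file_id]
        del self.open_counts[file_id]
```

   In this sketch the storage is reclaimed only in close(), when the
   open count of a removed file reaches zero.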

   IMPLEMENTATION

   ERRORS

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_EXPIRED

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_ISDIR

   DAFSERR_LEASE_MOVED

   DAFSERR_MOVED

   DAFSERR_OLD_STATEID

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_STATEID


6.5.8.  DAFS_PROC_COMMIT

   SUMMARY

   Commits data cached on the server as the result of previous asynchro-
   nous write requests.

   ARGUMENTS

        struct DAFS_Commit_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_uint64                offset;
           dafs_uint32                count;
           };


   RESULTS

        struct DAFS_Commit_Res
           {
           dafs_verifier_type         writeverf;
           };


   DESCRIPTION

           "The  COMMIT  [DAFS_PROC_COMMIT]  operation  forces   or
           flushes data to stable storage for the file specified by
           the current file handle." (RFC 3010, p. 109)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

           "The flushed data is that which was  previously  written
           with      a     WRITE     [DAFS_PROC_WRITE_INLINE     or
           DAFS_PROC_WRITE_DIRECT] operation which had  the  stable
           field set to UNSTABLE4 [UNSTABLE].

           The offset specifies the position within the file  where
           the  flush  is  to  begin.   An offset value of 0 (zero)
           means to flush data starting at  the  beginning  of  the
           file.   The  count specifies the number of bytes of data
           to flush.  If count is 0 (zero), a flush from offset  to
           the end of the file is done.


           The server returns a write verifier upon successful com-
           pletion  of  the  COMMIT  [DAFS_PROC_COMMIT].  The write
           verifier is used by  the  client  to  determine  if  the
           server  has  restarted  or  rebooted between the initial
           WRITE(s)           [DAFS_PROC_WRITE_INLINEs           or
           DAFS_PROC_WRITE_DIRECTs]       and       the      COMMIT
           [DAFS_PROC_COMMIT].  The client does this  by  comparing
           the  write verifier returned from the initial writes and
           the verifier returned by the  COMMIT  [DAFS_PROC_COMMIT]
           procedure.   The server must vary the value of the write
           verifier at each server event or instantiation that  may
           lead  to a loss of uncommitted data.  Most commonly this
           occurs when  the  server  is  rebooted;  however,  other
           events at the server may result in uncommitted data loss
           as well." (RFC 3010, pp. 109-110)
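
   The offset and count rules and the verifier comparison quoted above
   can be sketched in Python as follows. The function names, and the
   choice to clamp the range to the current file size, are assumptions
   of this sketch rather than requirements of the protocol.

```python
def commit_range(offset: int, count: int, file_size: int) -> tuple:
    """Return the half-open byte range [start, end) that a commit covers.

    A count of 0 means "from offset to the end of the file"; an offset
    of 0 with a count of 0 therefore covers the whole file.
    """
    if offset < 0 or count < 0 or file_size < 0:
        raise ValueError("offset, count and file_size must be non-negative")
    start = min(offset, file_size)
    end = file_size if count == 0 else min(offset + count, file_size)
    return start, end


def server_lost_state(write_verifier: bytes, commit_verifier: bytes) -> bool:
    """True if the verifier changed between the writes and the commit."""
    return write_verifier != commit_verifier
```

   A client that sees server_lost_state() return True would treat its
   uncommitted data as possibly lost and write it again.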

   IMPLEMENTATION

           "The COMMIT [DAFS_PROC_COMMIT] procedure is  similar  in
           operation  and  semantics  to  the POSIX fsync(2) system
           call that synchronizes a  file's  state  with  the  disk
           (file  data  and  metadata  is flushed to disk or stable
           storage). COMMIT [DAFS_PROC_COMMIT]  performs  the  same
           operation  for  a  client,   flushing any unsynchronized
           data and metadata on the server to the server's disk  or
           stable storage for the specified file. Like fsync(2), it
           may be that there is some modified data or  no  modified
           data  to  synchronize.  The data may have been synchron-
           ized by the server's normal periodic buffer synchroniza-
           tion  activity.  COMMIT [DAFS_PROC_COMMIT] should return
           NFS4_OK [DAFS_STATUS_OK], unless there has been an unex-
           pected error.

           COMMIT [DAFS_PROC_COMMIT] differs from fsync(2) in  that
           it  is  possible  for the client to flush a range of the
           file (most likely  triggered  by  a  buffer-reclamation
           scheme  on  the  client  before [the] file has been com-
           pletely written).

           The server implementation of  COMMIT  [DAFS_PROC_COMMIT]
           is  reasonably  simple.   If  the server receives a full
           file COMMIT [DAFS_PROC_COMMIT] request, that is starting
           at  offset 0 and count 0, it should do the equivalent of
           fsync()'ing the file.  Otherwise, it should  arrange  to
           have  the  cached  data in the range specified by offset
           and count to be flushed  to  stable  storage.   In  both
           cases,  any  metadata  associated  with the file must be
           flushed to stable storage before returning.  It  is  not
           an error for there to be nothing to flush on the server.
           This means that the data and metadata that needed to  be
           flushed  have  already  been  flushed or lost during the
           last server failure.

           The client implementation of  COMMIT  [DAFS_PROC_COMMIT]
           is  a  little  more  complex.  There are two reasons for
           wanting to commit a client  buffer  to  stable  storage.
           The  first  is  that the client wants to reuse a buffer.
           In this case, the offset and count  of  the  buffer  are
           sent  to  the  server  in  the COMMIT [DAFS_PROC_COMMIT]
           request.  The server then flushes any cached data  based
           on  the offset and count, and flushes any metadata asso-
           ciated with the file.  It then returns the status of the
           flush  and the write verifier.  The other reason for the
           client to generate a COMMIT [DAFS_PROC_COMMIT] is for  a
           full  file flush, such as may be done at close.  In this
           case, the client would gather all  of  the  buffers  for
           this  file  that contain uncommitted data, do the COMMIT
           [DAFS_PROC_COMMIT] operation with an  offset  of  0  and
           count  of  0,  and  then free all of those buffers.  Any
           other dirty buffers would be sent to the server  in  the
           normal fashion.

           After a buffer is written by the client with the  stable
           parameter  set  to UNSTABLE4 [UNSTABLE], the buffer must
           be considered as modified by the client until the buffer
           has  either been flushed via a COMMIT [DAFS_PROC_COMMIT]
           operation or written via a WRITE operation [one of  DAFS
           WRITE   operations]   with   stable   parameter  set  to
           FILE_SYNC4 or DATA_SYNC4 [FILE_SYNC or DATA_SYNC].  This
           is  done  to  prevent  the  buffer  from being freed and
           reused before the data can be flushed to stable  storage
           on the server.

           When a response is returned from either a WRITE [one  of
           DAFS    WRITE    operation    flavors]   or   a   COMMIT
           [DAFS_PROC_COMMIT] operation and  it  contains  a  write
           verifier  that  is different than previously returned by
           the server, the client will need to  retransmit  all  of
           the  buffers  containing  uncommitted cached data to the
           server.  How this is to be done is up to  the  implemen-
           tor.   If  there is only one buffer of interest, then it
           should probably be sent back over  in  a  WRITE  request
           [one of DAFS WRITE operation flavors] with the appropri-
           ate stable parameter.  If there is more than one buffer,
           it might be worthwhile retransmitting all of the buffers
           in WRITE requests [one of DAFS  WRITE  request  flavors]
           with  the  stable  parameter set to UNSTABLE4 [UNSTABLE]
           and then retransmitting  the  COMMIT  [DAFS_PROC_COMMIT]
           operation  to  flush  all  of  the data on the server to
           stable storage.  The timing of these retransmissions  is
           left to the implementor.

           The above description applies to  page-cache-based  sys-
           tems  as  well  as buffer-cache-based systems.  In those
           systems, the virtual memory system will need to be modi-
           fied  instead  of the buffer cache." (RFC 3010, pp. 110-
           111)
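
   The client-side bookkeeping described in the quoted text might look
   like the following Python sketch. The class and method names are
   hypothetical, buffers are keyed by file offset for simplicity, and
   the sketch leaves the actual retransmission to the caller.

```python
class UncommittedTracker:
    """Hypothetical client-side record of UNSTABLE writes for one file."""

    def __init__(self):
        self.verifier = None  # last write verifier seen from the server
        self.pending = {}     # offset -> data written UNSTABLE, not committed

    def on_unstable_write(self, offset, data, verifier):
        """Record an UNSTABLE write reply; return buffers needing resend."""
        resend = self._check(verifier)
        self.pending[offset] = data  # held until a commit succeeds
        return resend

    def on_commit(self, verifier):
        """Handle a commit reply; return buffers needing resend."""
        resend = self._check(verifier)
        if not resend:
            self.pending.clear()  # data is stable, buffers may be reused
        return resend

    def _check(self, verifier):
        # A changed verifier means earlier uncommitted data may be lost.
        changed = self.verifier is not None and verifier != self.verifier
        self.verifier = verifier
        return dict(self.pending) if changed else {}
```

   Buffers returned for resend stay in pending, which matches the rule
   that a buffer is treated as modified until a commit succeeds.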

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_ISDIR

   DAFSERR_LOCKED

   DAFSERR_MOVED

   DAFSERR_RESOURCE

   DAFSERR_ROFS

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.9.  DAFS_PROC_CREATE

   SUMMARY

   Creates a non-regular file object.

   ARGUMENTS

           const DAFS_TYPE_INVALID    = 0;
           const DAFS_TYPE_DIR        = 2;
           const DAFS_TYPE_BLK        = 3;
           const DAFS_TYPE_CHR        = 4;
           const DAFS_TYPE_LNK        = 5;
           const DAFS_TYPE_SOCK       = 6;
           const DAFS_TYPE_FIFO       = 7;
           const DAFS_TYPE_ATTRDIR    = 8;
           const DAFS_TYPE_NAMEDATTR  = 9;


        struct DAFS_Create_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_component_type        component;           /* heap */
           dafs_uint32                obj_type;
           union switch (obj_type)
              {
              case DAFS_TYPE_LNK:
                 dafs_utf8string_type  linkdata;           /* heap */
              case DAFS_TYPE_BLK:
              case DAFS_TYPE_CHR:
                 dafs_specdata_type    specdata;
              case DAFS_TYPE_SOCK:
              case DAFS_TYPE_FIFO:
              case DAFS_TYPE_DIR:
                 void;
              } create_type;
           dafs_file_attr              attr;               /* heap */
           };


   RESULTS

        struct DAFS_Create_Res
           {
           dafs_filehandle_type       filehandle;
           dafs_change_info_type      change_info;
           };


   DESCRIPTION

           "The CREATE [DAFS_PROC_CREATE] operation creates a  non-
           regular  file  object  in a directory with a given name.
           The OPEN [DAFS_PROC_OPEN]  procedure  MUST  be  used  to
           create a regular file.

           The objname [component] specifies the name for  the  new
           object.  If  the  objname  [component] has a length of 0
           (zero), the error NFS4ERR_INVAL [DAFSERR_INVAL] will  be
           returned.  The objtype [obj_type] determines the type of
           object to be created: directory, symlink, etc.

           If an object of the same  name  already  exists  in  the
           directory,    the   server   will   return   the   error
           NFS4ERR_EXIST [DAFSERR_EXIST].

           For the directory where the new file object was created,
           the   server   returns  change_info4  [dafs_change_info]
           information in cinfo  [change_info].   With  the  atomic
           field of the change_info4 [dafs_change_info] struct, the
           server will indicate if  the  before  and  after  change
           attributes  were obtained atomically with respect to the
           file object creation.

           If the objname has a length of  0 (zero), or if  objname
           does   not   obey   the   UTF-8  definition,  the  error
           NFS4ERR_INVAL [DAFSERR_INVAL] will be returned.

           The current filehandle is replaced by that  of  the  new
           object." (RFC 3010, p. 113)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   DAFS_PROC_CREATE allows a client to specify attributes to be set at
   the time an object of type DAFS_TYPE_DIR is created. Setting attri-
   butes for other object types is not permitted and will result in the
   server returning DAFSERR_INVAL. Setting the attributes when creating
   a directory saves the client the need to issue a SETATTR call after
   the CREATE.
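
   The argument checks described above can be sketched in Python as
   follows. This is illustrative only: the function name, the use of a
   dictionary as the directory, the order of the checks, and the use of
   DAFSERR_BADTYPE for types outside the create_type union are
   assumptions of the sketch.

```python
# Type codes as listed in the ARGUMENTS section.
DAFS_TYPE_DIR = 2
DAFS_TYPE_BLK = 3
DAFS_TYPE_CHR = 4
DAFS_TYPE_LNK = 5
DAFS_TYPE_SOCK = 6
DAFS_TYPE_FIFO = 7

# Types that appear as arms of the create_type union.
CREATABLE_TYPES = {
    DAFS_TYPE_DIR, DAFS_TYPE_BLK, DAFS_TYPE_CHR,
    DAFS_TYPE_LNK, DAFS_TYPE_SOCK, DAFS_TYPE_FIFO,
}


def check_create(directory: dict, component: str, obj_type: int, attr=None) -> str:
    """Validate a create request and, on success, add the new entry."""
    if obj_type not in CREATABLE_TYPES:
        return "DAFSERR_BADTYPE"  # assumed mapping, see lead-in
    if len(component) == 0:
        return "DAFSERR_INVAL"    # zero-length name
    if attr and obj_type != DAFS_TYPE_DIR:
        return "DAFSERR_INVAL"    # attributes are only allowed for directories
    if component in directory:
        return "DAFSERR_EXIST"    # an object of that name already exists
    directory[component] = obj_type
    return "DAFS_STATUS_OK"
```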

   IMPLEMENTATION

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BADTYPE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DQUOT

   DAFSERR_EXIST

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NAMETOOLONG

   DAFSERR_NOSPC

   DAFSERR_NOTDIR

   DAFSERR_NOTSUPP

   DAFSERR_RESOURCE

   DAFSERR_ROFS

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.10.  DAFS_PROC_DELEGPURGE

   SUMMARY

   Purges delegations awaiting recovery.

   ARGUMENTS

   None.

   RESULTS

   None.

   DESCRIPTION

           "Purges all of the delegations awaiting recovery  for  a
           given  client.   This is useful for clients which do not
           commit delegation information to stable storage to indi-
           cate  that  conflicting  requests need not be delayed by
           the server awaiting recovery of delegation information.

           This operation should be used  by  clients  that  record
           delegation  information on stable storage on the client.
           In this case, DELEGPURGE  [DAFS_PROC_DELEGPURGE]  should
           be issued immediately after doing delegation recovery on
           all delegations know[n] to the client.   Doing  so  will
           notify the server that no additional delegations for the
           client will be recovered allowing it to free  resources,
           and  avoid delaying other clients who make requests that
           conflict with the unrecovered delegations.  The  set  of
           delegations  known  to  the server and the client may be
           different.  The reason for this is  that  a  client  may
           fail after making a request which resulted in delegation
           but before it received the results and committed them to
           the client's stable storage." (RFC 3010, p. 114)

   DAFS_PROC_DELEGPURGE takes no arguments. Delegations are purged for
   the client associated with the session on which the DELEGPURGE
   request arrives.

   ERRORS

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE_CLIENTID


6.5.11.  DAFS_PROC_DELEGRETURN

   SUMMARY

   Returns delegation.

   ARGUMENTS

        struct DAFS_DelegReturn_Args
           {
           dafs_state_id_type         state_id;
           };


   RESULTS

   None.

   DESCRIPTION

           "Returns the delegation represented by the given stateid
           [state_id]." (RFC 3010, p. 115)

   ERRORS

   DAFSERR_BAD_STATEID

   DAFSERR_OLD_STATEID

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE_STATEID


6.5.12.  DAFS_PROC_GET_ROOT_HANDLE

   SUMMARY

   Retrieves the root filehandle from a server.

   ARGUMENTS

   None.

   RESULTS

        struct DAFS_Get_Root_Handle_Res
           {
           dafs_filehandle_type       root_handle;
           };


   DESCRIPTION

   This procedure returns the root of the server file name space. See
   4.1.4., "Objects Naming And Filehandles" for a detailed description
   of server name space in the DAFS protocol.

   ERRORS

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN


6.5.13.  DAFS_PROC_GETATTR_INLINE

   SUMMARY

   Gets attributes of an object.

   ARGUMENTS

        struct DAFS_GetAttr_Inline_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_attr_bitmap_type      attr_request_bitmap;
           };


   RESULTS

        struct DAFS_GetAttr_Inline_Res
           {
           dafs_file_attr_type        obj_attributes;
           };


   DESCRIPTION

           "The GETATTR [DAFS_PROC_GETATTR_INLINE]  operation  will
           obtain  attributes  for the file system object specified
           by the current filehandle." (RFC 3010, p. 116)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

           "The   client    sets    a    bit    in    the    bitmap
           [attr_request_bitmap]  argument for each attribute value
           that it would like the server  to  return.   The  server
           returns an attribute bitmap that indicates the attribute
           values for which it was able to return, followed by  the
           attribute values ordered lowest attribute number first."
           (RFC 3010, p. 116)

   See 4.1.3.3., "Attribute Bitmaps" for an explanation of DAFS encoding
   of attributes.

           "The server must return a value for each attribute  that
           the client requests if the attribute is supported by the
           server.  If the server does not support an attribute  or
           cannot  approximate  a  useful  value  then  it must not
           return the attribute value and must not set  the  attri-
           bute bit in the result bitmap.

           The server must return an error if it supports an attri-
           bute  but  cannot  obtain  its  value.   In that case no
           attribute values will be returned.

           All servers must support  the  mandatory  attributes  as
           specified  in the section 'File Attributes'." (RFC 3010,
           p. 116)

   See 6.1.5., "File Attributes" for a list of DAFS' mandatory file
   attributes.

           "On success, the current filehandle retains its  value."
           (RFC 3010, p. 116)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   IMPLEMENTATION

   The server MUST return an object attributes structure with all the
   attribute fields that the client requested in the attr_request_bitmap
   parameter. If a server does not support or cannot provide a requested
   attribute, it MUST still include space for the attribute but mark its
   contents as invalid by not setting the corresponding bit in the valid
   bitmap of the returned obj_attributes.
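
   The rule above (reserve space for every requested attribute, but mark
   only the ones actually provided as valid) can be sketched in Python.
   The function name, the integer bitmap, and the dictionary of slots
   are assumptions of this sketch and do not reflect the DAFS wire
   encoding.

```python
def build_getattr_reply(request_bitmap: int, available: dict):
    """Return (valid_bitmap, slots) for a requested attribute bitmap.

    available maps an attribute bit number to the value the server can
    provide. Every requested bit gets a slot; slots the server cannot
    fill hold None and their bit is left clear in valid_bitmap.
    """
    if request_bitmap < 0:
        raise ValueError("request_bitmap must be non-negative")
    valid_bitmap = 0
    slots = {}
    bit = 0
    remaining = request_bitmap
    while remaining:
        if remaining & 1:
            if bit in available:
                slots[bit] = available[bit]
                valid_bitmap |= 1 << bit
            else:
                slots[bit] = None  # space reserved, contents invalid
        remaining >>= 1
        bit += 1
    return valid_bitmap, slots
```

   A client would consult valid_bitmap before reading any slot.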

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.14.  DAFS_PROC_GETATTR_DIRECT

   SUMMARY

   Gets attributes of an object.

   ARGUMENTS

        struct DAFS_GetAttr_Direct_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_attr_bitmap_type      attr_request_bitmap;
           dafs_direct_op_buffer      data_buffer;
           };


   RESULTS

        struct DAFS_GetAttr_Direct_Res
           {
           dafs_checksum_type         direct_checksum;
           };


        /* Server copies the object attributes directly
           into the data_buffer passed by the client in
           the arguments structure */


        /* DIRECT: dafs_file_attr_type obj_attributes;  */


   DESCRIPTION

           "The GETATTR [DAFS_PROC_GETATTR_DIRECT]  operation  will
           obtain  attributes  for the file system object specified
           by the current filehandle." (RFC 3010, p. 116)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

           "The   client    sets    a    bit    in    the    bitmap
           [attr_request_bitmap]  argument for each attribute value
           that it would like the server  to  return.   The  server
           returns an attribute bitmap that indicates the attribute
           values for which it was able to return, followed by  the
           attribute values ordered lowest attribute number first."
           (RFC 3010, p. 116)

   See 4.1.3.3., "Attribute Bitmaps" for an explanation of DAFS encoding
   of attributes.

           "The server must return a value for each attribute  that
           the client requests if the attribute is supported by the
           server.  If the server does not support an attribute  or
           cannot  approximate  a  useful  value  then  it must not
           return the attribute value and must not set  the  attri-
           bute bit in the result bitmap.

           The server must return an error if it supports an attri-
           bute  but  cannot  obtain  its  value.   In that case no
           attribute values will be returned.

           All servers must support  the  mandatory  attributes  as
           specified  in the section 'File Attributes'." (RFC 3010,
           p. 116)

   See 6.1.5., "File Attributes" for a list of DAFS' mandatory file
   attributes.

           "On success, the current filehandle retains its  value."
           (RFC 3010, p. 116)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   IMPLEMENTATION

   The server MUST return an object attributes structure with all the
   attribute fields that the client requested in the attr_request_bitmap
   parameter. If a server does not support or cannot provide a requested
   attribute, it MUST still include space for the attribute but mark its
   contents as invalid by not setting the corresponding bit in the valid
   bitmap of the returned obj_attributes.

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.15.  DAFS_PROC_GET_FSATTR

   SUMMARY

   Gets the attributes of the file system where a filehandle resides.

   ARGUMENTS

        struct DAFS_Get_FSAttr_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_attr_bitmap_type      attr_request_bitmap;
           };


   RESULTS

        struct DAFS_Get_FSAttr_Res
           {
           dafs_filesys_attr_type     obj_attributes;
           };


   DESCRIPTION

           "The  GETATTR  [DAFS_PROC_GET_FSATTR]   operation   will
           obtain  attributes  for the file system object specified
           by the current filehandle." (RFC 3010, p. 116)

   This DAFS operation returns attributes for the file system to which
   this file object belongs. See 4.1.3.1., "Filehandles in Compound vs.
   Chaining" for differences in filehandle management in DAFS procedures.

           "The   client    sets    a    bit    in    the    bitmap
           [attr_request_bitmap]  argument for each attribute value
           that it would like the server  to  return.   The  server
           returns an attribute bitmap that indicates the attribute
           values for which it was able to return, followed by  the
           attribute values ordered lowest attribute number first."
           (RFC 3010, p. 116)

   See 4.1.3.3., "Attribute Bitmaps" for an explanation of DAFS encoding
   of attributes.

           "The server must return a value for each attribute  that
           the client requests if the attribute is supported by the
           server.  If the server does not support an attribute  or
           cannot  approximate  a  useful  value  then  it must not
           return the attribute value and must not set  the  attri-
           bute bit in the result bitmap. The server must return an
           error if it supports an attribute but cannot obtain  its
           value.   In  that  case  no  attribute  values  will  be
           returned.

           All servers must support  the  mandatory  attributes  as
           specified  in the section 'File Attributes'." (RFC 3010,
           p. 116)

   See 6.1.6., "File System Attributes" for a list of DAFS' mandatory
   file system attributes.

           "On success, the current filehandle retains its  value."
           (RFC 3010, p. 116)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   IMPLEMENTATION

   The server MUST return an object attributes structure with all the
   attribute fields that the client requested in the attr_request_bitmap
   parameter. If a server does not support or cannot provide a requested
   attribute, it MUST still include space for the attribute but mark its
   contents as invalid by not setting the corresponding bit in the valid
   bitmap of the returned obj_attributes.

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.16.  DAFS_PROC_HURRY_UP

   SUMMARY

   Sets to zero the usec_window of an I/O request that was previously
   submitted to the server by DAFS_PROC_BATCH_SUBMIT.

   ARGUMENTS

        struct DAFS_Hurry_Up_Args
           {
           dafs_uint64                request_id;   /* per session */
           };


   RESULTS

   None.

   DESCRIPTION

   The DAFS_PROC_HURRY_UP operation is used to inform the server that
   the client would like the server to "hurry up" processing of a
   single outstanding dafs_read_write_request from an asynchronous
   DAFS_PROC_BATCH_SUBMIT request (that is, the server has acknowledged
   receipt of the request by responding to the DAFS_PROC_BATCH_SUBMIT,
   but has not yet completed it by identifying it in
   DAFS_PROC_BC_BATCH_COMPLETION).  Specifically, this is intended for
   the case when the original request was submitted with a non-zero
   usec_window (indicating that the client would tolerate an unusually
   long latency), but circumstances have changed on the client such that
   the client would like the operation to be treated in an "ordinary"
   fashion rather than a "latency-insensitive" fashion.

   The request_id, unique to the DAFS Session, allows the server to
   identify the correct outstanding dafs_read_write_request. If the
   server does not have an outstanding dafs_read_write_request on this
   Session with a matching request_id (because the client never
   submitted such a request, because the client has submitted the
   request but the server has not yet acknowledged receipt of it, or
   because the server has completed it and reported it in a
   DAFS_PROC_BC_BATCH_COMPLETION), the server SHALL return the error
   DAFSERR_BATCH_REQUEST_NOT_FOUND.

   Note: this operation is intended only to instruct the server to alter
         the processing priority associated with the indicated
         dafs_read_write_request. A typical server implementation would
         move the request from a "high latency operation queue" to an
         ordinary operation queue. It is recommended that the server
         complete the DAFS_PROC_HURRY_UP operation as soon as it has
         altered the processing priority of the indicated
         dafs_read_write_request.

         There is no expectation that the server complete the indicated
         dafs_read_write_request before completing the
         DAFS_PROC_HURRY_UP operation.

   The usec_window parameter to DAFS_PROC_BATCH_SUBMIT is a hint that
   the server is free to ignore. If the server ignores the usec_window
   parameter, then the server MAY return DAFSERR_NOTSUPP to
   DAFS_PROC_HURRY_UP. However, if the server does interpret the
   usec_window parameter to DAFS_PROC_BATCH_SUBMIT, the server SHALL NOT
   return DAFSERR_NOTSUPP.
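
   As an illustration only, the server-side handling described above
   might be sketched in C as follows. The pending_request record, the
   hu_status values, and the server_honors_usec_window flag are
   hypothetical names introduced for this sketch; they are not defined
   by this specification, and the numeric status values are
   placeholders. Checking for DAFSERR_NOTSUPP before searching for the
   request is also an assumption; no ordering is mandated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder status values; the real numeric codes are not
 * reproduced in this sketch. */
enum hu_status {
    HU_OK = 0,
    HU_ERR_BATCH_REQUEST_NOT_FOUND,
    HU_ERR_NOTSUPP
};

/* Hypothetical server-side record of one dafs_read_write_request. */
struct pending_request {
    uint64_t request_id;   /* unique within the DAFS Session */
    uint32_t usec_window;  /* 0 means "ordinary" latency */
    int      outstanding;  /* acknowledged but not yet completed */
};

/* Sketch of DAFS_PROC_HURRY_UP handling for a single session. */
enum hu_status
hurry_up(struct pending_request *reqs, size_t nreqs,
         int server_honors_usec_window, uint64_t request_id)
{
    size_t i;

    assert(reqs != NULL || nreqs == 0);

    /* A server that ignores the usec_window hint MAY refuse. */
    if (!server_honors_usec_window)
        return HU_ERR_NOTSUPP;

    for (i = 0; i < nreqs; i++) {
        if (reqs[i].outstanding && reqs[i].request_id == request_id) {
            reqs[i].usec_window = 0;   /* now treated as ordinary */
            return HU_OK;
        }
    }
    /* Never submitted, not yet acknowledged, or already completed. */
    return HU_ERR_BATCH_REQUEST_NOT_FOUND;
}
```

   In this sketch, "moving the request to an ordinary queue" is
   modeled simply by clearing usec_window; a real server would adjust
   whatever scheduling structure it uses.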

   ERRORS

   DAFSERR_BATCH_REQUEST_NOT_FOUND

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_NOTSUPP


6.5.17.  DAFS_PROC_LINK

   SUMMARY

   Creates a hard link to a file.

   ARGUMENTS

        struct DAFS_Link_Args
           {
           dafs_filehandle_type       source_object;
           dafs_filehandle_type       destination_object;
           dafs_utf8string            newname;             /* heap */
           };


   RESULTS

        struct DAFS_Link_Res
           {
           dafs_change_info_type      change_info;
           };


   DESCRIPTION

           "The LINK [DAFS_PROC_LINK] operation  creates  an  addi-
           tional  newname  for  the  file represented by the saved
           filehandle, as set  by  the  SAVEFH  operation,  in  the
           directory  represented  by the current filehandle." (RFC
           3010, p. 118)

   When the DAFS_PROC_LINK operation occurs within a DAFS operation
   chain (see 4.3.2., "Chaining Flags", for a description of chaining),
   the DAFS chain current_filehandle specifies the target directory, and
   the source object and newname are taken from the message arguments.

           "The existing file and the target directory must  reside
           within  the same file system on the server.  On success,
           the current filehandle will continue to  be  the  target
           directory." (RFC 3010, p. 118)

   The target directory is also passed along in a DAFS chain. See
   4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in
   filehandle management in DAFS procedures.

           "For  the   target   directory,   the   server   returns
           change_info4  [dafs_change_info]  information  in  cinfo
           [change_info].    With   the   atomic   field   of   the
           change_info4  [dafs_change_info] struct, the server will
           indicate if the before and after change attributes  were
           obtained atomically with respect to the link creation.

           If the newname has a length of 0 (zero), or  if  newname
           does   not   obey   the   UTF-8  definition,  the  error
           NFS4ERR_INVAL [DAFSERR_INVAL]  will be  returned."  (RFC
           3010, pp. 118-119)
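
   The newname check quoted above might be sketched in C as follows.
   This is a minimal illustration, not a normative algorithm: the
   function names and the NAME_* status values are hypothetical, and
   the sketch assumes that "the UTF-8 definition" means the
   well-formedness rules of RFC 3629 (no overlong forms, no
   surrogates, nothing above U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>

enum { NAME_OK = 0, NAME_ERR_INVAL = 1 };   /* placeholder values */

/* Returns 1 if s[0..len) is well-formed UTF-8 (RFC 3629 rules). */
static int utf8_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t need, k;

        if (c < 0x80) {                  /* single-byte (ASCII) */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;                    /* bad lead byte */
        }
        if (len - i - 1 < need)
            return 0;                    /* sequence is truncated */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                    /* overlong, too big, surrogate */
        i += need + 1;
    }
    return 1;
}

/* Applies the newname rule: empty or malformed names are rejected. */
int check_newname(const unsigned char *name, size_t len)
{
    assert(name != NULL || len == 0);
    if (len == 0 || !utf8_well_formed(name, len))
        return NAME_ERR_INVAL;
    return NAME_OK;
}
```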

   IMPLEMENTATION

           "Changes to any property of the 'hard' linked files  are
           reflected  in  all  of the linked files.  When a link is
           made to a file, the attributes for the file should  have
           a  value  for  numlinks  [num_links] that is one greater
           than the value before the LINK operation.

           The comments under RENAME  [DAFS_PROC_RENAME]  regarding
           object and target residing on the same file system apply
           here as well. The comments  regarding  the  target  name
           applies as well.

           Note that symbolic links are  created  with  the  CREATE
           [DAFS_PROC_CREATE] operation." (RFC 3010, p. 119)

   ERRORS

   DAFSERR_ACCES

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_DQUOT

   DAFSERR_EXIST

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_ISDIR

   DAFSERR_MLINK

   DAFSERR_MOVED

   DAFSERR_NAMETOOLONG

   DAFSERR_NOSPC

   DAFSERR_NOTDIR

   DAFSERR_NOTSUPP

   DAFSERR_RESOURCE

   DAFSERR_ROFS

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_XDEV


6.5.18.  DAFS_PROC_LOCK

   SUMMARY

   Creates a lock.

   ARGUMENTS

           #define RECLAIM            1
           #define PERSIST            2
           #define AUTORECOVERY       4


        enum dafs_lock_type
           {
           READ_LT                    = 1,
           WRITE_LT                   = 2,
           READW_LT                   = 3,     /* blocking read */
           WRITEW_LT                  = 4,     /* blocking write */
           ABORT_LT                   = 5      /* rollback the lock */
           };


        struct DAFS_Lock_Args
           {
           dafs_filehandle_type       filehandle;
           enum dafs_lock_type        lock_type;
           dafs_uint32                options;           /* Bitmap */
           dafs_state_id_type         state_id;
           dafs_uint64                offset;
           dafs_uint64                length;
           };


   RESULTS

        struct DAFS_Lock_Res
           {
           union switch (status)
              {
              case DAFS_STATUS_OK:
                 void;
              case DAFSERR_DENIED:
              case DAFSERR_LOCK_BROKEN:
                 dafs_client_id       owner_clientid;
                 dafs_lockowner_type  owner;               /* heap */
                 dafs_uint64          offset;
                 dafs_uint64          length;
                 enum dafs_lock_type  lock_type;
              default:
                 void;
              } lock_res;
           };


   DESCRIPTION

           "The LOCK [DAFS_PROC_LOCK] operation requests  a  record
           lock  for  the  byte  range  specified by the offset and
           length parameters.  The lock type is also  specified  to
           be one of the nfs4_lock [dafs_lock_type] types.  If this
           is a reclaim request,  the  reclaim  parameter  will  be
           TRUE." (RFC 3010, p. 120)

   In DAFS, reclaim operations are specified by setting the RECLAIM bit
   in the options field of the arguments.

   If this is a persistent lock, the options parameter will include
   PERSIST. If this is an auto recovery lock, the options parameter will
   include AUTORECOVERY. The options parameter value is formed by OR'ing
   together the desired options. If a server does not support a locking
   option, it returns DAFSERR_NOTSUPP.
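
   A minimal C sketch of how the options bitmap might be composed and
   checked follows. The check_lock_options function, the "supported"
   mask, and the LOCK_OPT_* status values are hypothetical; treating
   bits outside the three defined options as unsupported is an
   assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define RECLAIM       1
#define PERSIST       2
#define AUTORECOVERY  4

enum { LOCK_OPT_OK = 0, LOCK_OPT_NOTSUPP = 1 };  /* placeholders */

/* Returns LOCK_OPT_NOTSUPP when the request asks for any option bit
 * outside the set this (hypothetical) server supports. */
int check_lock_options(uint32_t requested, uint32_t supported)
{
    /* The supported set may only contain defined option bits. */
    assert((supported & ~(uint32_t)(RECLAIM | PERSIST | AUTORECOVERY)) == 0);

    return (requested & ~supported) ? LOCK_OPT_NOTSUPP : LOCK_OPT_OK;
}
```

   For example, a client requesting a persistent auto recovery lock
   would pass PERSIST | AUTORECOVERY in the options field.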

           "Bytes in a file may be locked even if those  bytes  are
           not  currently  allocated to the file.  To lock the file
           from a  specific  offset  through  the  end-of-file  (no
           matter how long the file actually is) use a length field
           with all bits set to 1 (one).  To lock the entire  file,
           use  an  offset  of  0 (zero) and a length with all bits
           set to 1.  A length of 0 is reserved and should  not  be
           used.

           In the case that the lock is denied, the owner,  offset,
           and  length  of  a  conflicting lock are returned." (RFC
           3010, p. 120)

   DAFS also returns the client-id for the client that owns the con-
   flicting lock.
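
   The byte-range conventions quoted above (a length with all bits
   set means "through end-of-file"; a length of 0 is reserved) can be
   illustrated with a small overlap test. This is a sketch only: the
   function names and LOCK_TO_EOF are hypothetical, and treating a
   range whose end would exceed the largest 64-bit offset as running
   to end-of-file is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define LOCK_TO_EOF UINT64_MAX   /* length with all bits set to 1 */

/* Offset of the last byte covered by (off, len); saturates at the
 * largest offset when the range runs to end-of-file. */
static uint64_t range_last(uint64_t off, uint64_t len)
{
    assert(len != 0);            /* a length of 0 is reserved */
    if (len == LOCK_TO_EOF || len - 1 > UINT64_MAX - off)
        return UINT64_MAX;
    return off + (len - 1);
}

/* Returns 1 if the two byte ranges share at least one byte. */
int ranges_overlap(uint64_t off_a, uint64_t len_a,
                   uint64_t off_b, uint64_t len_b)
{
    return off_a <= range_last(off_b, len_b) &&
           off_b <= range_last(off_a, len_a);
}
```

   Whether two overlapping ranges actually conflict also depends on
   the lock types and owners involved, which this sketch does not
   model.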

           "On success, the current filehandle retains its  value."
           (RFC 3010, p. 120)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   IMPLEMENTATION

           "If the server is unable to determine the  exact  offset
           and  length of the conflicting lock, the same offset and
           length that were provided in  the  arguments  should  be
           returned  in  the denied results.  The File Locking sec-
           tion contains a full description of this and  the  other
           file locking operations." (RFC 3010, p. 120)

   See 4.4., "Locking and Access Control", for a full description of
   this and the other file locking operations.

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_DENIED

   DAFSERR_EXPIRED

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_ISDIR

   DAFSERR_LEASE_MOVED

   DAFSERR_LOCK_BROKEN

   DAFSERR_LOCK_RANGE

   DAFSERR_MOVED

   DAFSERR_NOTSUPP

   DAFSERR_OLD_STATEID

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_CLIENTID

   DAFSERR_STALE_STATEID


6.5.19.  DAFS_PROC_LOCKT

   SUMMARY

   Tests for a file lock.

   ARGUMENTS

        struct DAFS_LockT_Args
           {
           dafs_filehandle_type       filehandle;
           enum dafs_lock_type        lock_type;
           dafs_lockowner_type        owner;               /* heap */
           dafs_uint64                offset;
           dafs_uint64                length;
           };


   RESULTS

        struct DAFS_LockT_Res
           {
           union switch (status)
              {
              case DAFSERR_DENIED:
              case DAFSERR_LOCK_BROKEN:
                 dafs_client_id       owner_clientid;
                 dafs_lockowner_type  owner;             /* heap */
                 dafs_uint64          offset;
                 dafs_uint64          length;
                 enum dafs_lock_type  lock_type;
              case DAFS_STATUS_OK:
              default:
                 void;
              } results;
           };


   DESCRIPTION

           "The LOCKT [DAFS_PROC_LOCKT] operation tests the lock as
           specified  in  the  arguments.  If  a  conflicting  lock
           exists, the owner, offset, and length of the conflicting
           lock  are  returned;  if  no lock is held, nothing other
           than NFS4_OK [DAFS_STATUS_OK] is returned."  (RFC  3010,
           p. 121)

   DAFS also returns the client-id for the client that owns the con-
   flicting lock.

           "On success, the current filehandle retains its  value."
           (RFC 3010, p. 121)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   IMPLEMENTATION

           "If the server is unable to determine the  exact  offset
           and  length of the conflicting lock, the same offset and
           length that were provided in  the  arguments  should  be
           returned  in  the denied results.  The File Locking sec-
           tion contains further discussion  of  the  file  locking
           mechanisms." (RFC 3010, pp. 121-122)

   See 4.4., "Locking and Access Control", for further discussion of the
   file locking mechanisms.
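
   The fallback rule quoted above for the denied results might be
   sketched in C as follows. The lock_range structure and the
   denied_range function are hypothetical; a NULL "exact" pointer
   stands for the case in which the server cannot determine the
   conflicting range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the offset/length pair of a lock. */
struct lock_range {
    uint64_t offset;
    uint64_t length;
};

/* Chooses the range reported in the denied results: the exact
 * conflicting range when known, otherwise the requested range. */
struct lock_range
denied_range(const struct lock_range *exact,
             uint64_t req_offset, uint64_t req_length)
{
    struct lock_range r;

    assert(req_length != 0);     /* a length of 0 is reserved */

    if (exact != NULL) {
        r = *exact;
    } else {
        r.offset = req_offset;
        r.length = req_length;
    }
    return r;
}
```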

           "LOCKT     [DAFS_PROC_LOCKT]     uses     nfs_lockowner4
           [lockowner_type]  instead of a stateid4 [state_id_type],
           as LOCK [DAFS_PROC_LOCK]  does, to identify the owner so
           that  the  client does not have to open the file to test
           for the existence of a lock." (RFC 3010, p. 122)

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_DENIED

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_ISDIR

   DAFSERR_LEASE_MOVED

   DAFSERR_LOCK_BROKEN

   DAFSERR_LOCK_RANGE

   DAFSERR_MOVED

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_CLIENTID


6.5.20.  DAFS_PROC_LOCKU

   SUMMARY

   Unlocks a file.

   ARGUMENTS

        struct DAFS_LOCKU_Arg
           {
           dafs_filehandle_type       filehandle;
           enum dafs_lock_type        lock_type;
           dafs_state_id_type         state_id;
           dafs_uint64                offset;
           dafs_uint64                length;
           };


   RESULTS

   None.

   DESCRIPTION

           "The  LOCKU  [DAFS_PROC_LOCKU]  operation  unlocks   the
           record lock specified by the parameters.

           On success, the current filehandle retains  its  value."
           (RFC 3010, p. 123)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   IMPLEMENTATION

   See 4.4., "Locking and Access Control", for a full description of
   this and the other file locking procedures.

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_EXPIRED

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_LOCK_RANGE

   DAFSERR_LEASE_MOVED

   DAFSERR_MOVED

   DAFSERR_OLD_STATEID

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_CLIENTID

   DAFSERR_STALE_STATEID


6.5.21.  DAFS_PROC_LOOKUP

   SUMMARY

   Looks up a file object given its name.

   ARGUMENTS

        struct DAFS_Lookup_Args
           {
           dafs_filehandle_type       directory;
           dafs_pathname_type         path;                /* heap */
           };


   RESULTS

        struct DAFS_Lookup_Res
           {
           dafs_filehandle_type       filehandle;
           dafs_uint32                component_count;
           };


   DESCRIPTION

           "This operation LOOKUPs or finds a  file  system  object
           starting  from  the  directory  specified by the current
           filehandle." (RFC 3010, p. 124)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

           "LOOKUP [DAFS_PROC_LOOKUP] evaluates the  pathname  con-
           tained  in  the array of names and obtains a new current
           filehandle from the final name.  All but the final  name
           in the list must be the names of directories.

           If the pathname cannot be  evaluated  either  because  a
           component  does not exist or because the client does not
           have permission to evaluate a  component  of  the  path,
           then  an error will be returned and the current filehan-
           dle will be unchanged.

           If the path is a zero length  array,  if  any  component
           does  not obey the UTF-8 definition, or if any component
           in the path is of zero length, the  error  NFS4ERR_INVAL
           [DAFSERR_INVAL] will be returned." (RFC 3010, p. 124)

   If a DAFS_PROC_LOOKUP request contains multiple pathname segments,
   the response packet specifies the number of pathname components
   successfully looked up in the component_count result field, and the
   filehandle of the last component successfully looked up. A
   component_count value of 0 indicates that the lookup of the first
   component failed and that the content of the returned filehandle is
   invalid. A component_count value greater than 0 indicates that one
   or more component lookups succeeded.

   If the server encounters an error before all pathname components are
   looked-up, the error status returned is either
   DAFSERR_NO_PARTIAL_INFO, meaning that the component_count field and
   filehandle are both invalid, or the error status that applies to the
   component the operation failed on.
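
   A client might interpret a DAFS_PROC_LOOKUP response along the
   lines of the following C sketch. The lk_status and lk_outcome
   enumerations and the classify_lookup function are hypothetical
   names introduced for illustration, and LK_ERR_OTHER stands for any
   error status other than DAFSERR_NO_PARTIAL_INFO.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder statuses; real numeric codes are not shown here. */
enum lk_status { LK_OK = 0, LK_ERR_NO_PARTIAL_INFO, LK_ERR_OTHER };

enum lk_outcome {
    LK_COMPLETE,      /* every component was looked up */
    LK_RESUME,        /* filehandle valid; resume at component_count */
    LK_FIRST_FAILED,  /* filehandle invalid; nothing was looked up */
    LK_NO_INFO        /* component_count and filehandle both invalid */
};

/* Classifies a response to a lookup of "requested" components. */
enum lk_outcome
classify_lookup(enum lk_status status, uint32_t component_count,
                uint32_t requested)
{
    assert(requested > 0);       /* client sent at least one component */

    if (status == LK_ERR_NO_PARTIAL_INFO)
        return LK_NO_INFO;       /* do not trust the other fields */
    if (status == LK_OK)
        return LK_COMPLETE;

    /* The error applies to the component at index component_count. */
    return component_count == 0 ? LK_FIRST_FAILED : LK_RESUME;
}
```

   In the LK_RESUME case the returned filehandle identifies the last
   component successfully looked up, so a client could retry or
   report the failure starting from that point.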

   IMPLEMENTATION

           "NFS version 4 [and DAFS] servers depart from the seman-
           tics   of  previous  NFS  versions  in  allowing  LOOKUP
           [DAFS_PROC_LOOKUP] requests to cross mountpoints on  the
           server.   The client can detect a mountpoint crossing by
           comparing the fsid attribute of the directory  with  the
           fsid  attribute  of  the  directory  looked  up.  If the
           fsids[dafs_FS_Handles] are different then the new direc-
           tory is a server mountpoint.  Unix clients that detect a
           mountpoint crossing will  need  to  mount  the  server's
           filesystem.  This  needs to be done to maintain the file
           object  identity  checking  mechanisms  common  to  Unix
           clients.

           Servers that limit NFS  [DAFS]  access  to  'shares'  or
           'exported'   filesystems   should   provide   a  pseudo-
           filesystem into which the exported  filesystems  can  be
           integrated, so that clients can browse the server's name
           space.  The clients view of a pseudo filesystem will  be
           limited to paths that lead to exported filesystems.

           Note: previous versions of the protocol assigned special
           semantics  to  the  names  '.'  and '..'.  NFS version 4
           [DAFS] assigns no special semantics to these names.  The
           LOOKUPP  [DAFS_PROC_LOOKUPP]   operator  must be used to
           lookup a parent directory.

           Note that this procedure does not follow symbolic links.
           The  client  is responsible for all parsing of filenames
           including filenames that are modified by symbolic  links
           encountered during the lookup process.

           If the current file handle supplied is not  a  directory
           but   a   symbolic   link,   the  error  NFS4ERR_SYMLINK
           [DAFSERR_SYMLINK] is returned  as  the  error.  For  all
           other non-directory file types, the error NFS4ERR_NOTDIR
           [DAFSERR_NOTDIR] is returned." (RFC 3010, p. 125)
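
   Two of the rules quoted above lend themselves to a short C sketch:
   detecting a mountpoint crossing by comparing file system
   identifiers, and choosing the error when a lookup starts at a
   non-directory. The fs_id layout, the obj_type enumeration, the
   LKP_* status values, and both function names are hypothetical; the
   actual identifier format is not shown in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical file system identifier. */
struct fs_id {
    uint64_t fs_major;
    uint64_t fs_minor;
};

enum obj_type { OBJ_DIRECTORY, OBJ_SYMLINK, OBJ_OTHER };
enum { LKP_OK = 0, LKP_ERR_SYMLINK, LKP_ERR_NOTDIR };  /* placeholders */

/* Client side: differing identifiers mean the lookup crossed into
 * another server file system (a server mountpoint). */
int crossed_mountpoint(const struct fs_id *dir, const struct fs_id *found)
{
    assert(dir != NULL && found != NULL);
    return dir->fs_major != found->fs_major ||
           dir->fs_minor != found->fs_minor;
}

/* Server side: status for the object at which a lookup starts. */
int lookup_start_status(enum obj_type t)
{
    if (t == OBJ_DIRECTORY)
        return LKP_OK;
    return t == OBJ_SYMLINK ? LKP_ERR_SYMLINK : LKP_ERR_NOTDIR;
}
```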

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NAMETOOLONG

   DAFSERR_NOENT

   DAFSERR_NOTDIR

   DAFSERR_NO_PARTIAL_INFO

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_SYMLINK


6.5.22.  DAFS_PROC_LOOKUPP

   SUMMARY

   Looks up parent directory.

   ARGUMENTS

        struct DAFS_Lookupp_Args
           {
           dafs_filehandle_type       filehandle;
           };


   RESULTS

        struct DAFS_Lookupp_Res
           {
           dafs_filehandle_type       filehandle;
           };


   DESCRIPTION

           "The current filehandle is assumed to refer to a regular
           directory or a named attribute directory." (RFC 3010, p.
           126)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

           "LOOKUPP [DAFS_PROC_LOOKUPP]  assigns the filehandle for
           its  parent  directory to be the current filehandle.  If
           there  is  no   parent   directory   an   NFS4ERR_ENOENT
           [DAFSERR_NOENT] error must be returned.

           Therefore,  NFS4ERR_ENOENT   [DAFSERR_NOENT]   will   be
           returned  by  the server when the current filehandle  is
           at the root or top of  the  server's  file  tree."  (RFC
           3010, p. 126)

   IMPLEMENTATION

           "As    for    LOOKUP     [DAFS_PROC_LOOKUP],     LOOKUPP
           [DAFS_PROC_LOOKUPP] will also cross mountpoints.

           If the current filehandle is not a  directory  or  named
           attribute    directory,    the    error   NFS4ERR_NOTDIR
           [DAFSERR_NOTDIR] is returned." (RFC 3010, p. 126)
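
   The two error cases described above for DAFS_PROC_LOOKUPP might be
   modeled in C as follows. The node structure, the LP_* status
   values, and the lookupp function are hypothetical; is_directory is
   assumed to be set for both regular directories and named attribute
   directories.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory model of a file system object. */
struct node {
    int                is_directory;  /* directory or named attr dir */
    const struct node *parent;        /* NULL at the root of the tree */
};

enum { LP_OK = 0, LP_ERR_NOENT, LP_ERR_NOTDIR };   /* placeholders */

/* On success stores the parent in *out; *out is left unchanged on
 * error. */
int lookupp(const struct node *cur, const struct node **out)
{
    assert(cur != NULL && out != NULL);

    if (!cur->is_directory)
        return LP_ERR_NOTDIR;
    if (cur->parent == NULL)
        return LP_ERR_NOENT;          /* already at the top of the tree */
    *out = cur->parent;
    return LP_OK;
}
```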

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NOENT

   DAFSERR_NOTDIR

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.23.  DAFS_PROC_NVERIFY

   SUMMARY

   Verifies difference in attributes.

   ARGUMENTS

        struct DAFS_Nverify_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_file_attr_type        obj_attributes;
           };


   RESULTS

   None.

   DESCRIPTION

           "This operation is used to prefix a sequence  of  opera-
           tions  to  be  performed  if one or more attributes have
           changed on some filesystem object.  If  all  the  attri-
           butes  match  then the error NFS4ERR_SAME [DAFSERR_SAME]
           must be returned.

           On success, the current filehandle retains  its  value."
           (RFC 3010, p. 127)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   IMPLEMENTATION

           "This operation is useful as a cache  validation  opera-
           tor.   If  the object to which the attributes belong has
           changed then the following  operations  may  obtain  new
           data associated with that object." (RFC 3010, p. 127)

           "In the case that a recommended attribute  is  specified
           in  the  NVERIFY  [DAFS_PROC_NVERIFY]  operation and the
           server does not support that attribute for the file sys-
           tem  object, the error NFS4ERR_NOTSUPP [DAFSERR_NOTSUPP]
           is returned to the client." (RFC 3010, p. 127)
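
   The decision made by DAFS_PROC_NVERIFY might be sketched in C as
   follows. This is an illustration under simplifying assumptions:
   attributes are modeled as a small bitmask plus an array of 64-bit
   values, and attr_set, NATTR, the NV_* status values, and the
   nverify function are all hypothetical names. Checking for
   unsupported attributes before comparing values is also an
   assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NATTR 4   /* number of attributes in this toy model */

/* Hypothetical attribute set: bit i of mask says value[i] is
 * present (in a request) or supported (on the object). */
struct attr_set {
    uint32_t mask;
    uint64_t value[NATTR];
};

enum nv_status { NV_OK = 0, NV_ERR_SAME, NV_ERR_NOTSUPP };  /* placeholders */

enum nv_status
nverify(const struct attr_set *req, const struct attr_set *obj)
{
    unsigned i;

    assert(req != NULL && obj != NULL);
    assert((req->mask >> NATTR) == 0);

    /* An attribute the server does not support for this object. */
    if (req->mask & ~obj->mask)
        return NV_ERR_NOTSUPP;

    for (i = 0; i < NATTR; i++) {
        if ((req->mask & (1u << i)) && req->value[i] != obj->value[i])
            return NV_OK;        /* something changed: continue */
    }
    return NV_ERR_SAME;          /* all requested attributes match */
}
```

   A result of NV_OK here corresponds to the case in which the
   operations that follow may go on to obtain new data for the
   object.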

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NOTSUPP

   DAFSERR_RESOURCE

   DAFSERR_SAME

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.24.  DAFS_PROC_OPEN

   SUMMARY

   Opens a regular file.

   ARGUMENTS

        enum createmode
           {
           UNCHECKED                  = 0,
           GUARDED                    = 1,
           EXCLUSIVE                  = 2
           };


        enum opentype
           {
           OPEN_NOCREATE              = 0,
           OPEN_CREATE                = 1
           };


        enum open_claim_type
           {
           CLAIM_NULL                 = 0,
           CLAIM_PREVIOUS             = 1,
           CLAIM_DELEGATE_CUR         = 2,
           CLAIM_DELEGATE_PREV        = 3,
           CLAIM_CREATE_UNLINKED      = 4
           };


        enum open_delete_disp
           {
           DELETE_DONT_CARE           = 0,
           DELETE_DENY                = 1
           };


        enum limit_by
           {
           DAFS_LIMIT_SIZE            = 1,
           DAFS_LIMIT_BLOCKS          = 2
           };


        enum open_delegation_type
           {
           OPEN_DELEGATE_NONE         = 0,
           OPEN_DELEGATE_READ         = 1,
           OPEN_DELEGATE_WRITE        = 2
           };


           const DAFS_OPEN_SHARE_ACCESS_READ      = 0x00000001;
           const DAFS_OPEN_SHARE_ACCESS_WRITE     = 0x00000002;
           const DAFS_OPEN_SHARE_ACCESS_BOTH      = 0x00000003;


           const DAFS_OPEN_SHARE_DENY_NONE        = 0x00000000;
           const DAFS_OPEN_SHARE_DENY_READ        = 0x00000001;
           const DAFS_OPEN_SHARE_DENY_WRITE       = 0x00000002;
           const DAFS_OPEN_SHARE_DENY_BOTH        = 0x00000003;


           const DAFS_OPEN_SHARE_KEY_NONE         = 0x00000000;
           const DAFS_OPEN_SHARE_KEY_BOTH         = 0x00000003;


        struct DAFS_Open_Args
           {
           enum open_claim_type              claim_type;
           union switch (claim_type)
              {
              case CLAIM_NULL:
                 dafs_filehandle_type        dir_handle;
                 dafs_pathname_type          claimnull_pathname;
                                                            /*heap */


              case CLAIM_PREVIOUS:
                 dafs_filehandle_type        filehandle;
                 dafs_uint32                 delegate_type;


              case CLAIM_DELEGATE_CUR:
                 dafs_filehandle_type        dir_handle;
                 dafs_pathname_type          claimdelcur_pathname;
                                                            /*heap */
                 dafs_state_id_type          claimdelcur_stateid;


              case CLAIM_DELEGATE_PREV:
                 dafs_filehandle_type        dir_handle;
                 dafs_pathname_type          claimdelprev_pathname;
                                                            /*heap */


              case CLAIM_CREATE_UNLINKED:
                 dafs_filehandle_type        dir_handle;
              } open_claim;


           enum opentype                     open_type;
           union switch (open_type)
              {
              case OPEN_CREATE:
                 enum createmode             createmode;
                 union createhow switch (createmode)
                    {
                    case UNCHECKED:
                    case GUARDED:
                       dafs_file_attr_type   createattrs;
                    case EXCLUSIVE:
                       dafs_verifier_type    create_verifier;
                    };
              default:
                 void;
              } openflag;


           enum open_delete_disp             delete_disp;
           dafs_lockowner_type               owner;        /* heap */
           dafs_uint32                       share_access;
           dafs_uint32                       share_deny;
           dafs_uint32                       share_key_type;
           dafs_uint32                       pad;
           dafs_uint64                       share_key;
           };


   RESULTS

           const OPEN_RESULT_MLOCK                = 0x00000001;


        struct DAFS_Open_Res
           {
           dafs_filehandle_type              filehandle;
           dafs_state_id_type                state_id;
           dafs_change_info_type             change_info;
           dafs_uint32                       component_count;
           dafs_uint32                       result_flags;
           enum open_delegation_type         delegation_type;


           union switch (delegation_type)
              {
              case OPEN_DELEGATE_NONE:
                 void;


              case OPEN_DELEGATE_READ:
                 dafs_state_id_type          state_id;
                 dafs_uint32                 recall;
                 dafs_uint32                 permissions_acetype;
                 dafs_uint32                 permissions_aceflag;
                 dafs_uint32                 permissions_acemask;
                 dafs_utf8string             readdel_who;
                                                        /* heap */


              case OPEN_DELEGATE_WRITE:
                 dafs_state_id_type          state_id;
                 dafs_uint32                 recall;
                 enum limit_by               limitby;
                 union space_limit switch (limitby)
                    {
                    case DAFS_LIMIT_SIZE:
                       dafs_uint64           filesize;
                    case DAFS_LIMIT_BLOCKS:
                       dafs_uint32           num_blocks;
                       dafs_uint32           bytes_per_block;
                    };
                 dafs_uint32                 permissions_acetype;
                 dafs_uint32                 permissions_aceflag;
                 dafs_uint32                 permissions_acemask;
                 dafs_utf8string             writedel_who; /* heap */
              } opendelegation;
           };


           "WARNING TO CLIENT IMPLEMENTORS

           OPEN       [DAFS_PROC_OPEN]       resembles       LOOKUP
           [DAFS_PROC_LOOKUP] in that it generates a filehandle for
           the client to  use.   Unlike  LOOKUP  [DAFS_PROC_LOOKUP]
           though,  OPEN  [DAFS_PROC_OPEN]  creates server state on
           the filehandle.  In normal circumstances, the client can
           only  release  this state with a CLOSE [DAFS_PROC_CLOSE]
           operation.  CLOSE  [DAFS_PROC_CLOSE]  uses  the  current
           filehandle to determine which file to close." (RFC 3010,
           p. 132)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

           "Simply waiting for the lease on the file to  expire  is
           insufficient  because  the server may maintain the state
           indefinitely as long as another client does not  attempt
           to  make  a  conflicting  access to the same file." (RFC
           3010, p. 132)

   DESCRIPTION

           "The  OPEN  [DAFS_PROC_OPEN]  operation  creates  and/or
           opens  a  regular  file in a directory with the provided
           name.  If the file does not  exist  at  the  server  and
           creation  is  desired,  specification  of  the method of
           creation is provided by the openhow [open_type]  parame-
           ter.   The  client  has  the  choice  of  three creation
           methods: UNCHECKED, GUARDED, or EXCLUSIVE.

           UNCHECKED means that the file should  be  created  if  a
           file  of  that  name  does not exist and encountering an
           existing regular file of that name is not an error.  For
           this  type  of create, createattrs specifies the initial
           set of attributes for the file.  The set  of  attributes
           may  includes  any  writable attribute valid for regular
           files.  When an UNCHECKED create encounters an  existing
           file,  the  attributes  specified  by createattrs is not
           used, except that when an object_size of zero is  speci-
           fied,  the  existing  file  is truncated.  If GUARDED is
           specified, the server  checks  for  the  presence  of  a
           duplicate  object  by name before performing the create.
           If  a  duplicate  exists,  an  error  of   NFS4ERR_EXIST
           [DAFSERR_EXIST]  is  returned  as  the  status.   If the
           object does not  exist,  the  request  is  performed  as
           described for UNCHECKED.

           EXCLUSIVE  specifies  that  the  server  is  to   follow
           exclusive   creation   semantics,   using  the  verifier
           [create_verifier] to ensure exclusive  creation  of  the
           target.   The  server should check for the presence of a
           duplicate object by name.  If the object does not exist,
           the  server  creates  the object and stores the verifier
           with the object.  If  the  object  does  exist  and  the
           stored  verifier  matches  the client provided verifier,
           the server uses the existing object as the newly created
           object.  If the stored verifier does not match,  then an
           error of NFS4ERR_EXIST [DAFSERR_EXIST] is  returned.  No
           attributes  may  be  provided  in  this  case, since the
           server may use an attribute  of  the  target  object  to
           store the verifier.

           For   the   target   directory,   the   server   returns
           change_info4   [dafs_change_info_type]   information  in
           cinfo  [change_info].  With  the  atomic  field  of  the
           change_info4  [dafs_change_info_type] struct, the server
           will indicate if the before and after change  attributes
           were  obtained atomically with respect to the link crea-
           tion.

           Upon successful  creation,  the  current  filehandle  is
           replaced  by  that  of  the  new object." (RFC 3010, pp.
           132-133)


   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.
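
   The three creation methods quoted above can be summarized in a short
   decision function. This is only a sketch under stated assumptions:
   the names create_mode, create_status, target_state, and
   resolve_create are hypothetical and do not appear in the DAFS
   definitions, and the verifier is assumed to be 64 bits wide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of the DAFS definitions. */
typedef enum { CREATE_UNCHECKED, CREATE_GUARDED, CREATE_EXCLUSIVE } create_mode;
typedef enum { ST_OK, ST_EXIST } create_status;

typedef struct {
    bool     exists;          /* an object of that name is already present */
    uint64_t stored_verifier; /* verifier saved by an earlier exclusive create
                                 (assumed to be 64 bits wide) */
} target_state;

/* Decide the outcome of a create request against the current target state.
 * *do_create is set to true when the server must create the object. */
static create_status resolve_create(create_mode mode, const target_state *t,
                                    uint64_t client_verifier, bool *do_create)
{
    *do_create = !t->exists;
    if (!t->exists)
        return ST_OK;            /* every method creates a missing object */

    switch (mode) {
    case CREATE_UNCHECKED:
        return ST_OK;            /* an existing file is not an error */
    case CREATE_GUARDED:
        return ST_EXIST;         /* a duplicate name is an error */
    case CREATE_EXCLUSIVE:
        /* A matching verifier marks a repeated request: reuse the object. */
        return t->stored_verifier == client_verifier ? ST_OK : ST_EXIST;
    }
    return ST_EXIST;             /* not reached for valid modes */
}
```

   ST_EXIST stands in for DAFSERR_EXIST. Truncation of an existing file
   on an UNCHECKED create with an object_size of zero is left out of
   the sketch.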

           "The OPEN [DAFS_PROC_OPEN]  procedure provides  for  DOS
           SHARE  capability  with  the  use of the access and deny
           fields of  the  OPEN  [DAFS_PROC_OPEN]  arguments.   The
           client  specifies  at OPEN [DAFS_PROC_OPEN] the required
           access and deny modes.  For clients that do not directly
           support  SHAREs  (i.e. Unix), the expected deny value is
           DENY_NONE [DAFS_OPEN_SHARE_DENY_NONE].  In the case that
           there  is  a  existing  SHARE reservation that conflicts
           with  the  OPEN  [DAFS_PROC_OPEN]  request,  the  server
           returns  the error NFS4ERR_DENIED [DAFSERR_DENIED].  For
           a complete SHARE request, the client must provide values
           for   the   owner   and   seqid   fields  for  the  OPEN
           [DAFS_PROC_OPEN] argument.  For additional discussion of
           SHARE  semantics  see  the  section  on  "Share Reserva-
           tions"." (RFC 3010, p. 133)

   The DAFS locking model does not require the use of sequence ids.
   Therefore, the DAFS_PROC_OPEN arguments structure does not contain a
   seqid field.
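
   The conflict test for SHARE reservations described in the quoted
   text can be sketched as a pair of bit tests. The constant values
   below are an assumption: they mirror the NFS version 4 values in RFC
   3010 and are not taken from the DAFS definitions. The names
   share_state and share_conflicts are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, mirroring the NFS version 4 constants in RFC 3010. */
#define SHARE_ACCESS_READ  0x1u
#define SHARE_ACCESS_WRITE 0x2u
#define SHARE_ACCESS_BOTH  0x3u
#define SHARE_DENY_NONE    0x0u
#define SHARE_DENY_READ    0x1u
#define SHARE_DENY_WRITE   0x2u
#define SHARE_DENY_BOTH    0x3u

/* Hypothetical record of one reservation already held on the file. */
typedef struct {
    uint32_t access;
    uint32_t deny;
} share_state;

/* A new open conflicts when it asks for access that the existing open
 * denies, or when it denies access that the existing open already has. */
static bool share_conflicts(const share_state *held,
                            uint32_t access, uint32_t deny)
{
    return (access & held->deny) != 0 || (deny & held->access) != 0;
}
```

   A server following this sketch would return DAFSERR_DENIED whenever
   share_conflicts() is true for any reservation held by another owner.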

           "In the case that the client is recovering state from  a
           server  failure,  the  reclaim [claim_type] field of the
           OPEN [DAFS_PROC_OPEN] argument is used to  signify  that
           the request is meant to reclaim state previously held.

           The   'claim   [claim_type]'   field   of    the    OPEN
           [DAFS_PROC_OPEN] argument is used to specify the file to
           be opened and the state  information  which  the  client
           claims  to  possess.   There  are four basic claim types
           which  cover  the  various  situations   for   an   OPEN
           [DAFS_PROC_OPEN].  They are as follows:

           CLAIM_NULL

              For the client, this is a  new  OPEN  [DAFS_PROC_OPEN]
              request  and  there is no previous state associate[d]
              with the file for the client.

           CLAIM_PREVIOUS

              The client is claiming  basic  OPEN  [DAFS_PROC_OPEN]
              state  for  a file that was held previous to a server
              reboot. Generally used when  a  server  is  returning
              persistent  file handles; the client may not have the
              file name to reclaim the OPEN [DAFS_PROC_OPEN].


           CLAIM_DELEGATE_CUR

              The  client  is  claiming  a  delegation   for   OPEN
              [DAFS_PROC_OPEN]   as  granted  by  the server.  Gen-
              erally this is done as part of  recalling  a  delega-
              tion.

           CLAIM_DELEGATE_PREV

              The client is claiming a delegation granted to a pre-
              vious   client   instance;   used  after  the  client
              reboots.

           For OPEN [DAFS_PROC_OPEN] requests whose claim  type  is
           other  than  CLAIM_PREVIOUS  (i.e.  requests  other than
           those devoted to reclaiming opens after a server reboot)
           that  reach the server during its grace or lease expira-
           tion  period,   the   server   returns   an   error   of
           NFS4ERR_GRACE [DAFSERR_GRACE].

           For any OPEN [DAFS_PROC_OPEN] request,  the  server  may
           return  an  open delegation,  which allows further opens
           and closes to  be  handled  locally  on  the  client  as
           described  in  the  section  Open Delegation.  Note that
           delegation [delegation_type]  is up  to  the  server  to
           decide.   The client should never assume that delegation
           [delegation_type]  will or will not be granted in a par-
           ticular  instance.   It  should  always  be prepared for
           either  case.  A  partial  exception  is   the   reclaim
           (CLAIM_PREVIOUS)  case,  in  which  a delegation type is
           claimed.   In  this  case,  delegation  will  always  be
           granted,  although  the  server may specify an immediate
           recall in the delegation structure.

           The rflags [result_flags] returned by a successful  OPEN
           [DAFS_PROC_OPEN]  allow the server to return information
           governing  how  the  open  file  is   to   be   handled.
           OPEN4_RESULT_MLOCK  [OPEN_RESULT_MLOCK] indicates to the
           caller that mandatory locking is in effect for this file
           and  the  client should act appropriately with regard to
           data cached on the client.   OPEN4_RESULT_CONFIRM  indi-
           cates  that  the  client  MUST  execute  an OPEN_CONFIRM
           operation before using the open file."  (RFC  3010,  pp.
           133-134)

   There is no need for confirming a DAFS_PROC_OPEN. Consequently, this
   flag does not exist in the DAFS protocol and there is no correspond-
   ing open confirm operation.
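
   The grace-period rule in the quoted text (only CLAIM_PREVIOUS opens
   are accepted while the server is in its grace period) reduces to a
   single test. The sketch below is illustrative only; the enumeration
   values and the function name open_rejected_in_grace are
   hypothetical.

```c
#include <stdbool.h>

/* Hypothetical enumeration of the four basic claim types. */
typedef enum {
    CLAIM_NULL,
    CLAIM_PREVIOUS,
    CLAIM_DELEGATE_CUR,
    CLAIM_DELEGATE_PREV
} claim_type;

/* Returns true when the server should answer with DAFSERR_GRACE:
 * the request arrived during the grace period and is not a reclaim. */
static bool open_rejected_in_grace(claim_type claim, bool in_grace_period)
{
    return in_grace_period && claim != CLAIM_PREVIOUS;
}
```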


           "If the file [claimnull_pathname,  claimdelcur_pathname,
           or claimdelprev_pathname] is a zero length array, if any
           component does not obey the UTF-8 definition, or if  any
           component  in  the  path  is  of  zero length, the error
           NFS4ERR_INVAL [DAFSERR_INVAL] will be returned.

           When an OPEN [DAFS_PROC_OPEN] is done and the  specified
           lockowner  [owner]  already has the resulting filehandle
           open, the result is to 'OR' together the new  share  and
           deny  status together with the existing status.  In this
           case, only a  single  CLOSE  [DAFS_PROC_CLOSE]  need  be
           done,  even though multiple OPEN's [DAFS_PROC_OPEN] were
           completed." (RFC 3010, pp. 132-134)

   If the OPEN claim_type is CLAIM_CREATE_UNLINKED, then DAFS_PROC_OPEN
   creates an unlinked regular file in the file system in which the
   specified directory is located. A subsequent link request can be used
   to link the file into that directory, or into any other directory
   that could hold a link to the file if the file were already linked in
   the specified directory.

   The OPEN procedure provides for Share Key Reservations with the use
   of the share_key_type and share_key fields of the OPEN arguments. The
   client specifies at OPEN the share_key_type, and if the
   share_key_type is not SHARE_KEY_NONE, the client also specifies the
   target SHARE KEY. For clients that wish to bypass SHARE KEY verifica-
   tion (i.e. all legacy clients), the expected share_key_type value is
   SHARE_KEY_NONE. If there is an existing SHARE KEY reservation that
   conflicts with the OPEN request, the server returns the error
   DAFSERR_KEY_MISMATCH.

   See DAFS_PROC_LOOKUP for a description of the component_count result
   field.

   IMPLEMENTATION

           "The OPEN [DAFS_PROC_OPEN]  procedure  contains  support
           for  EXCLUSIVE  create.  The mechanism is similar to the
           support in NFS version 3 [RFC1813].  As in  NFS  version
           3,  this mechanism provides reliable exclusive creation.
           Exclusive create is invoked when  the  how  [createmode]
           parameter  is  EXCLUSIVE.  In this case, the client pro-
           vides a verifier [create_verifier] that  can  reasonably
           be  expected  to  be  unique.  A combination of a client
           identifier, perhaps the client network  address,  and  a
           unique  number  generated by the client, perhaps the RPC
           transaction identifier, may be appropriate." (RFC  3010,
           p. 135)


   DAFS does not use RPC. Clients could use the equivalent
   stream_id/seq_number information that is already generated for the
   DAFS header.
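
   One way a client might derive a verifier from the header values is
   to pack them into a single integer. The sketch below assumes that
   stream_id and seq_number each fit in 32 bits and that the verifier
   is 64 bits wide; neither width is stated in this section, and the
   function name make_create_verifier is hypothetical.

```c
#include <stdint.h>

/* Pack the two header values into one 64-bit verifier.
 * Assumption: both inputs are 32-bit quantities. */
static uint64_t make_create_verifier(uint32_t stream_id, uint32_t seq_number)
{
    return ((uint64_t)stream_id << 32) | (uint64_t)seq_number;
}
```

   Because the two inputs occupy disjoint bit ranges, distinct
   (stream_id, seq_number) pairs always produce distinct verifiers.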

           "If the object does not exist, the  server  creates  the
           object  and  stores  the  verifier [create_verifier]  in
           stable storage. For file systems that do not  provide  a
           mechanism  for the storage of arbitrary file attributes,
           the server may use one or more elements  of  the  object
           meta-data  to  store the verifier [create_verifier]. The
           verifier [create_verifier]  must  be  stored  in  stable
           storage  to  prevent erroneous failure on retransmission
           of the request. It is assumed that an  exclusive  create
           is being performed because exclusive semantics are crit-
           ical to the application. Because of the expected  usage,
           exclusive  CREATE  does  not rely solely on the normally
           volatile duplicate request  cache  for  storage  of  the
           verifier." (RFC 3010, p. 135)

   DAFS clients MAY rely on the persistent response cache for exclusive-
   create semantics, if use of the Response Cache has been agreed to for
   the Session. DAFS servers, however, MUST handle create verifiers as
   described here, regardless of whether the server implements per-
   sistent response caches.

           "The duplicate request cache in  volatile  storage  does
           not  survive  a  crash  and may actually flush on a long
           network partition, opening failure windows.  In the UNIX
           local  file  system  environment,  the  expected storage
           location for the verifier on creation is  the  meta-data
           (time  stamps)  of  the  object.  For  this  reason,  an
           exclusive object create may not include  initial  attri-
           butes because the server would have nowhere to store the
           verifier.

           If the server can not  support  these  exclusive  create
           semantics, possibly because of the requirement to commit
           the verifier to stable storage, it should fail the  OPEN
           [DAFS_PROC_OPEN] request with the error, NFS4ERR_NOTSUPP
           [DAFSERR_NOTSUPP].

           During  an  exclusive  CREATE  request,  if  the  object
           already  exists,   the  server reconstructs the object's
           verifier   and   compares   it   with    the    verifier
           [create_verifier]  in  the  request.  If they match, the
           server treats the request as a success. The  request  is
           presumed  to  be  a  duplicate of an earlier, successful
           request for which the reply was lost and that the server
           duplicate  request  cache  mechanism did not detect.  If
           the verifiers do not match, the request is rejected with
           the status, NFS4ERR_EXIST [DAFSERR_EXIST].

           Once the client has  performed  a  successful  exclusive
           create,      it      must      issue      a      SETATTR
           [DAFS_PROC_SETATTR_INLINE  or  DAFS_PROC_SETATTR_DIRECT]
           to set the correct object attributes.  Until it does so,
           it should not rely upon any of  the  object  attributes,
           since  the  server  implementation  may need to overload
           object meta-data to store the verifier.  The  subsequent
           SETATTR  must  not occur in the same COMPOUND request as
           the OPEN.   This  separation  will  guarantee  that  the
           exclusive  create  mechanism  will  continue to function
           properly in the face of retransmission of the  request."
           (RFC 3010, pp. 135-136)

   The setattr and open-exclusive MAY be part of the same DAFS chain.
   However, a DAFS client SHOULD be aware that this could cause problems
   if the server crashes and does not keep persistent response cache
   information. In that event, the completion status of both the
   exclusive create and the setattr might be unknown to the client, and
   the client might not be able to determine accurately whether the file
   was created exclusively.

           "Use of the GUARDED attribute does not provide  exactly-
           once  semantics.   In particular, if a reply is lost and
           the server does not detect  the  retransmission  of  the
           request,  the  procedure  can  fail  with  NFS4ERR_EXIST
           [DAFSERR_EXIST], even though the  create  was  performed
           successfully." (RFC 3010, p. 136)

   DAFS clients do not retransmit a request on an active session. This
   type of error would occur if a client reissues the request on a new
   session and either the (volatile) response cache has been lost or the
   client does not properly check the cache.

           "For SHARE reservations, the client must specify a value
           for   access  that  is  one  of  READ,  WRITE,  or  BOTH
           [DAFS_OPEN_SHARE_ACCESS_READ,
           DAFS_OPEN_SHARE_ACCESS_WRITE                          or
           DAFS_OPEN_SHARE_ACCESS_BOTH].  For deny, the client must
           specify    one   of   NONE,   READ,   WRITE,   or   BOTH
           [DAFS_OPEN_SHARE_DENY_NONE,   DAFS_OPEN_SHARE_DENY_READ,
           DAFS_OPEN_SHARE_DENY_WRITE,                           or
           DAFS_OPEN_SHARE_DENY_BOTH].  If the client fails  to  do
           this,    the    server    must    return   NFS4ERR_INVAL
           [DAFSERR_INVAL].


           If the final component provided to OPEN [DAFS_PROC_OPEN]
           is   a   symbolic   link,   the   error  NFS4ERR_SYMLINK
           [DAFSERR_SYMLINK] will be returned to the client.  If an
           intermediate  component of the pathname provided to OPEN
           is   a   symbolic   link,   the   error   NFS4ERR_NOTDIR
           [DAFSERR_NOTDIR]  will  be returned to the client." (RFC
           3010, pp. 135-136)

   For SHARE KEY reservations the client specifies a value for
   share_key_type that is one of SHARE_KEY_NONE or SHARE_KEY_BOTH. If
   the client fails to do this, the server returns DAFSERR_INVAL.

   If the server cannot support SHARE KEY semantics and the
   share_key_type is not SHARE_KEY_NONE, the server fails the OPEN
   request and returns the error DAFSERR_NOTSUPP.

   The open_delete_disp flags specify a disposition by which subsequent
   remove requests are handled with respect to the open file. See
   6.5.33., "DAFS_PROC_REMOVE" for more on file removal semantics.

   An open operation that specifies a delete disposition that is not
   fully supported by the server results in DAFSERR_DENYDISP_NOTSUPP
   status. Note: If the server supports multiple protocols, then
   requesting a disposition of DELETE_DENY MAY result in the server
   returning either this error or DAFS_STATUS_OK.

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_BROKEN

   DAFSERR_CHAIN_FORM

   DAFSERR_DELAY

   DAFSERR_DENYDISP_NOTSUPP

   DAFSERR_EXIST

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO


   DAFSERR_ISDIR

   DAFSERR_MOVED

   DAFSERR_NOENT

   DAFSERR_NOTDIR

   DAFSERR_NOTSUPP

   DAFSERR_NO_PARTIAL_INFO

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_SYMLINK


6.5.25.  DAFS_PROC_OPENATTR

   SUMMARY

   Opens the named attribute directory.

   ARGUMENTS

        struct DAFS_OpenAttr_Args
           {
           dafs_filehandle_type       filehandle;
           };


   RESULTS

        struct DAFS_OpenAttr_Res
           {
           dafs_filehandle_type       filehandle;
           };


   DESCRIPTION

           "The OPENATTR [DAFS_PROC_OPENATTR] operation is used  to
           obtain  the  filehandle of the named attribute directory
           associated with the current filehandle." (RFC  3010,  p.
           137)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS procedures.

           "The result of the OPENATTR [DAFS_PROC_OPENATTR] will be
           a   filehandle   to   an   object   of  type  NF4ATTRDIR
           [DAFS_TYPE_ATTRDIR].   From  this  filehandle,   READDIR
           [DAFS_PROC_READDIR]  and LOOKUP [DAFS_PROC_LOOKUP]  pro-
           cedures can be used to obtain filehandles for the  vari-
           ous  named  attributes associated with the original file
           system object. Filehandles  returned  within  the  named
           attribute  directory  will  have  a type of NF4NAMEDATTR
           [DAFS_TYPE_NAMEDATTR]." (RFC 3010, p. 137)

   IMPLEMENTATION

           "If the server does not support named attributes for the
           current   filehandle,   an   error   of  NFS4ERR_NOTSUPP
           [DAFSERR_NOTSUPP]  will be returned to the client." (RFC
           3010, p. 137)

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_BROKEN

   DAFSERR_CHAIN_FORM

   DAFSERR_DELAY

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NOENT

   DAFSERR_NOTSUPP

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.26.  DAFS_PROC_OPEN_DOWNGRADE

   SUMMARY

   Reduces open file access rights.

   ARGUMENTS

        struct DAFS_Open_Downgrade_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_state_id_type         state_id;
           dafs_uint32                share_access;
           dafs_uint32                share_deny;
           dafs_uint32                share_key_type;
           dafs_uint32                pad;
           dafs_uint64                share_key;
           };


   RESULTS

   None.

   DESCRIPTION

           "This operation is used to adjust the  access  and  deny
           bits  for  a given open.  This is necessary when a given
           lockowner opens the same file multiple times  with  dif-
           ferent  access  and  deny  flags.   In this situation, a
           close of one of the open's may  change  the  appropriate
           access  and  deny  flags  to remove bits associated with
           open's no longer in effect.

           The access and deny bits  specified  in  this  operation
           replace  the  current  ones for the specified open file.
           If either the access or the deny mode specified includes
           bits not in effect for the open, the error NFS4ERR_INVAL
           [DAFSERR_INVAL] should be returned.   Since  access  and
           deny  bits  are  subsets of those already granted, it is
           not possible for this request to be  denied  because  of
           conflicting share reservations.

           On success, the current filehandle retains  its  value."
           (RFC 3010, p. 141)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   This operation is also used to release the SHARE KEY reservation held
   by a given open. This is necessary when a given lockowner wishes to
   exit the group of lockowners with SHARE KEY access to the file.

   If share_key_type is SHARE_KEY_NONE and the lockowner previously held
   a SHARE KEY reservation on the file, then that lockowner's SHARE KEY
   reservation is released (and hence file_state.share_key_count is
   decremented). If share_key_type is not SHARE_KEY_NONE, and
   share_key_type and share_key do not both match those of the current
   open, then the error DAFSERR_INVAL is returned. Since this definition
   only permits SHARE KEY reservations to be released, and not acquired,
   it is not possible for this request to be denied because of
   conflicting SHARE KEY reservations.
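
   The downgrade rules above (new access and deny bits must be subsets
   of those in effect, and a SHARE KEY reservation can only be kept or
   released) can be sketched as follows. The structure open_state, the
   status names, the function open_downgrade, and the value chosen for
   SHARE_KEY_NONE are all hypothetical; share_key_count stands in for
   file_state.share_key_count.

```c
#include <stdint.h>

#define SHARE_KEY_NONE 0u  /* assumed value, for illustration only */

/* Hypothetical per-open state kept by the server. */
typedef struct {
    uint32_t access;          /* access bits currently in effect */
    uint32_t deny;            /* deny bits currently in effect */
    uint32_t share_key_type;
    uint64_t share_key;
} open_state;

typedef enum { DG_OK, DG_INVAL } dg_status;

static dg_status open_downgrade(open_state *o, uint32_t *share_key_count,
                                uint32_t access, uint32_t deny,
                                uint32_t key_type, uint64_t key)
{
    /* Requested bits must already be in effect for this open. */
    if ((access & ~o->access) != 0 || (deny & ~o->deny) != 0)
        return DG_INVAL;

    /* A key that is kept must match the current open exactly. */
    if (key_type != SHARE_KEY_NONE &&
        (key_type != o->share_key_type || key != o->share_key))
        return DG_INVAL;

    /* The new bits replace the current ones. */
    o->access = access;
    o->deny = deny;

    /* SHARE_KEY_NONE releases a reservation that was previously held. */
    if (key_type == SHARE_KEY_NONE && o->share_key_type != SHARE_KEY_NONE) {
        o->share_key_type = SHARE_KEY_NONE;
        o->share_key = 0;
        (*share_key_count)--;
    }
    return DG_OK;
}
```

   DG_INVAL stands in for DAFSERR_INVAL. No conflict check against
   other opens is needed, since the request can only narrow what is
   already held.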

   ERRORS

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_BROKEN

   DAFSERR_CHAIN_FORM

   DAFSERR_EXPIRED

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_MOVED

   DAFSERR_OLD_STATEID

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_STATEID


6.5.27.  DAFS_PROC_READ_INLINE

   SUMMARY

   Reads data from a file. The data transfer is done inline using memory
   pointed to by the read descriptor buffers.

   ARGUMENTS

        struct DAFS_Read_Inline_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_state_id_type         state_id;
           dafs_uint64                offset;
           dafs_uint32                byte_count;
           dafs_cache_hint_type       cache_hint;
           };


   RESULTS

        struct DAFS_Read_Inline_Res
           {
           dafs_uint32                eof;
           dafs_uint32                bytes_read;
           dafs_opaque8               read_data[bytes_read];
                                /* Split count & data for alignment */
           };


   DESCRIPTION

           "The READ [DAFS_PROC_READ_INLINE] operation  reads  data
           from the regular file identified by the current filehan-
           dle." (RFC 3010, p. 144)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS procedures.

           "The  client  provides  an  offset  of  where  the  READ
           [DAFS_PROC_READ_INLINE]   is   to   start  and  a  count
           [byte_count] of how many  bytes  are  to  be  read.   An
           offset  of  0  (zero) means to read data starting at the
           beginning of the file.  If offset  is  greater  than  or
           equal  to  the  size  of  the  file, the status, NFS4_OK
           [DAFS_STATUS_OK],  is  returned  with  a   data   length
           [bytes_read]   set  to  0 (zero) and eof is set to TRUE.
           The READ [DAFS_PROC_READ_INLINE] is  subject  to  access
           permissions checking.

           If the client specifies a count [byte_count] value of  0
           (zero),  the  READ  [DAFS_PROC_READ_INLINE] succeeds and
           returns 0 (zero) bytes of data again subject  to  access
           permissions  checking.  The  server may choose to return
           fewer bytes than specified by the  client.   The  client
           needs  to check for this condition and handle the condi-
           tion appropriately.

           The   stateid    [state_id]    value    for    a    READ
           [DAFS_PROC_READ_INLINE]    request  represents  a  value
           returned from a previous record lock or  share  reserva-
           tion  request.   Used  by  the server to verify that the
           associated lock is  still  valid  and  to  update  lease
           timeouts for the client." (RFC 3010, pp. 144-145)

   In DAFS, leases are updated by any DAFS procedure, including
   DAFS_PROC_NULL. DAFS servers use information associated with the ses-
   sion of the incoming request to determine which client's leases to
   renew.

           "If the read ended at the end-of-file  (formally,  in  a
           correctly  formed  READ [DAFS_PROC_READ_INLINE] request,
           if offset + count [byte_count] is equal to the  size  of
           the  file),  or the read request extends beyond the size
           of the file (if offset + count [byte_count]  is  greater
           than  the  size  of  the file), eof is returned as TRUE;
           otherwise   it   is   FALSE.     A    successful    READ
           [DAFS_PROC_READ_INLINE]  of  an  empty  file will always
           return eof as TRUE.

           On success, the current filehandle retains  its  value."
           (RFC 3010, pp. 144-145)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS requests.
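
   The rules for bytes_read and eof quoted above can be checked with a
   small helper. This is a sketch, not a server implementation: the
   names read_result and plan_read are hypothetical, and max_xfer is an
   assumed per-request limit that models the server choosing to return
   fewer bytes than requested.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     eof;
    uint32_t bytes_read;
} read_result;

/* Compute the reply fields for a read of byte_count bytes at offset.
 * max_xfer is a hypothetical cap on what the server returns at once. */
static read_result plan_read(uint64_t file_size, uint64_t offset,
                             uint32_t byte_count, uint32_t max_xfer)
{
    read_result r = { false, 0 };

    if (offset >= file_size) {   /* at or past the end: no data, eof TRUE */
        r.eof = true;
        return r;
    }

    uint64_t n = file_size - offset;      /* bytes available from offset */
    if (n > byte_count)
        n = byte_count;
    if (n > max_xfer)
        n = max_xfer;                     /* server may return fewer bytes */

    r.bytes_read = (uint32_t)n;
    r.eof = (offset + n == file_size);    /* returned data reached the end */
    return r;
}
```

   For a 100-byte file, a request for 50 bytes at offset 90 yields
   bytes_read = 10 with eof TRUE, while a request for 100 bytes at
   offset 0 with max_xfer = 64 yields bytes_read = 64 with eof FALSE.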

   IMPLEMENTATION

           "It is possible for the  server  to  return  fewer  than
           count [byte_count] bytes of data.  If the server returns
           less than the count requested and eof set to FALSE,  the
           client should issue another READ [DAFS_PROC_READ_INLINE]
           to get the remaining data.  A  server  may  return  less
           data  than  requested  under several circumstances.  The
           file may  have  been  truncated  by  another  client  or
           perhaps  on  the  server  itself, changing the file size
           from what the requesting client believes to be the case.
           This would reduce the actual amount of data available to
           the client.  It is possible that the server may back off
           the  transfer  size  and reduce the read request return.
           Server resource exhaustion may also occur  necessitating
           a smaller read return.

           If  the  file  is  locked  the  server  will  return  an
           NFS4ERR_LOCKED  [DAFSERR_LOCKED]  error.  Since the lock
           may be of short  duration,  the  client  may  choose  to
           retransmit  the  READ  [DAFS_PROC_READ_INLINE]   request
           (with   exponential   backoff)   until   the   operation
           succeeds." (RFC 3010, p. 145)
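
   A client-side loop that handles short reads and the LOCKED case as
   described above might look like the following sketch. All names are
   hypothetical: rd_status, rd_reply, read_fn, wait_fn, and read_fully
   are not DAFS definitions, and fake_server with its two helpers is an
   in-memory stand-in included only so that the loop can be exercised.
   The initial delay and the retry limit are arbitrary choices.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef enum { RD_OK, RD_LOCKED, RD_ERROR } rd_status;  /* hypothetical */

typedef struct {
    rd_status status;
    bool      eof;
    uint32_t  bytes_read;
} rd_reply;

/* Hypothetical hooks: issue one read, and wait before a retry. */
typedef rd_reply (*read_fn)(void *ctx, uint64_t offset, uint32_t count,
                            uint8_t *dst);
typedef void (*wait_fn)(void *ctx, unsigned delay_ms);

/* Read up to count bytes; returns bytes obtained, or -1 on failure. */
static int64_t read_fully(read_fn do_read, wait_fn do_wait, void *ctx,
                          uint64_t offset, uint32_t count, uint8_t *dst,
                          unsigned max_lock_retries)
{
    uint32_t done = 0;
    unsigned delay_ms = 10, retries = 0;

    while (done < count) {
        rd_reply r = do_read(ctx, offset + done, count - done, dst + done);
        if (r.status == RD_LOCKED) {
            if (retries++ >= max_lock_retries)
                return -1;
            do_wait(ctx, delay_ms);
            delay_ms *= 2;                /* exponential backoff */
            continue;
        }
        if (r.status != RD_OK || r.bytes_read > count - done)
            return -1;
        done += r.bytes_read;
        if (r.eof || r.bytes_read == 0)   /* zero bytes: avoid spinning */
            break;
        retries = 0;
        delay_ms = 10;
    }
    return (int64_t)done;
}

/* In-memory stand-in for a server, used only to exercise read_fully(). */
typedef struct {
    const uint8_t *data;
    uint64_t       size;
    uint32_t       max_xfer;     /* cap on bytes returned per read */
    unsigned       locked_left;  /* LOCKED replies still to be given */
    unsigned       waits;        /* number of times the client backed off */
} fake_server;

static rd_reply fake_read(void *ctx, uint64_t offset, uint32_t count,
                          uint8_t *dst)
{
    fake_server *s = ctx;
    rd_reply r = { RD_OK, false, 0 };

    if (s->locked_left > 0) {
        s->locked_left--;
        r.status = RD_LOCKED;
        return r;
    }
    if (offset >= s->size) {
        r.eof = true;
        return r;
    }
    uint64_t n = s->size - offset;
    if (n > count)
        n = count;
    if (n > s->max_xfer)
        n = s->max_xfer;
    memcpy(dst, s->data + offset, (size_t)n);
    r.bytes_read = (uint32_t)n;
    r.eof = (offset + n == s->size);
    return r;
}

static void fake_wait(void *ctx, unsigned delay_ms)
{
    (void)delay_ms;                       /* no real sleep in the stand-in */
    ((fake_server *)ctx)->waits++;
}
```

   With a 10-byte file, a 4-byte transfer cap, and two initial LOCKED
   replies, a request for 8 bytes at offset 0 backs off twice and then
   completes in two reads.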

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_BROKEN

   DAFSERR_CHAIN_FORM

   DAFSERR_DELAY

   DAFSERR_DENIED

   DAFSERR_EXPIRED

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_LEASE_MOVED

   DAFSERR_LOCKED

   DAFSERR_MOVED


   DAFSERR_NXIO

   DAFSERR_OLD_STATEID

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_STATEID


6.5.28.  DAFS_PROC_READ_DIRECT

   SUMMARY

   Reads data from a file and returns it using RDMA writes directly into
   client memory buffers.

   ARGUMENTS

        struct DAFS_Read_Direct_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_state_id_type         state_id;
           dafs_uint64                offset;
           dafs_uint32                byte_count;
           dafs_cache_hint_type       cache_hint;
           dafs_dob_array_type        read_data_buffers;
           };


   RESULTS

        struct DAFS_Read_Direct_Res
           {
           dafs_uint32                eof;
           dafs_uint32                bytes_read;
           dafs_checksum_type         direct_checksum;
           };


        /* Data placed in read_data_buffers advertised by client */

           /* DIRECT: dafs_opaque8     readdata[bytes_read]; */


   DESCRIPTION

           "The READ [DAFS_PROC_READ_DIRECT] operation  reads  data
           from the regular file identified by the current filehan-
           dle." (RFC 3010, p. 144)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS procedures.

           "The  client  provides  an  offset  of  where  the  READ
           [DAFS_PROC_READ_DIRECT]   is   to   start  and  a  count
           [byte_count]  of how many bytes  are  to  be  read.   An
           offset  of  0  (zero) means to read data starting at the
           beginning of the file.  If offset  is  greater  than  or
           equal  to  the  size  of  the  file, the status, NFS4_OK
           [DAFS_STATUS_OK],  is  returned  with  a   data   length
           [bytes_read]  set  to  0  (zero) and eof is set to TRUE.
           The READ [DAFS_PROC_READ_DIRECT] is  subject  to  access
           permissions checking.

           If the client specifies a count [byte_count] value of  0
           (zero),  the  READ  [DAFS_PROC_READ_DIRECT] succeeds and
           returns 0 (zero) bytes of data again subject  to  access
           permissions  checking.  The  server may choose to return
           fewer bytes than specified by the  client.   The  client
           needs  to check for this condition and handle the condi-
           tion appropriately.

           The   stateid   [state_id]     value    for    a    READ
           [DAFS_PROC_READ_DIRECT]    request  represents  a  value
           returned from a previous record lock or  share  reserva-
           tion  request.  Used  by  the  server to verify that the
           associated lock is  still  valid  and  to  update  lease
           timeouts for the client." (RFC 3010, pp. 144-145)

   In DAFS, leases are updated by any DAFS procedure, including
   DAFS_PROC_NULL. DAFS servers use information associated with the ses-
   sion of the incoming request to determine which client's leases to
   renew.

           "If the read ended at the end-of-file  (formally,  in  a
           correctly  formed  READ [DAFS_PROC_READ_DIRECT] request,
           if offset + count [byte_count] is equal to the  size  of
           the  file),  or the read request extends beyond the size
           of the file (if offset + count [byte_count]  is  greater
           than  the  size  of  the file), eof is returned as TRUE;
           otherwise   it   is   FALSE.     A    successful    READ
           [DAFS_PROC_READ_DIRECT]  of  an  empty  file will always
           return eof as TRUE.

           On success, the current filehandle retains  its  value."
           (RFC 3010, pp. 144-145)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS requests.

   The file data is written directly into the specified client memory
   buffers using RDMA write.
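
   The end-of-file rules quoted above can be summarized in a short C
   sketch. The names read_outcome and model_read are hypothetical and
   are not part of DAFS. The sketch models only the arithmetic on
   offset, byte_count, and the file size; a real server may still
   return fewer bytes than this model computes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical holder that mirrors the eof and bytes_read fields
   of DAFS_Read_Direct_Res. */
struct read_outcome {
    uint32_t bytes_read;
    bool     eof;
};

/* Models the quoted rules for a file of file_size bytes. */
static struct read_outcome
model_read(uint64_t file_size, uint64_t offset, uint32_t byte_count)
{
    struct read_outcome out = { 0, false };

    if (offset >= file_size) {
        /* Read starts at or past end of file: zero bytes, eof TRUE. */
        out.eof = true;
        return out;
    }

    /* Compare against the remaining bytes so that offset + byte_count
       is never formed and cannot overflow. */
    uint64_t available = file_size - offset;
    if ((uint64_t)byte_count >= available) {
        /* The request reaches or extends past the end of the file. */
        out.bytes_read = (uint32_t)available;
        out.eof = true;
    } else {
        out.bytes_read = byte_count;
    }
    assert(out.bytes_read <= byte_count);
    return out;
}
```

   With a 100-byte file, a request for 40 bytes at offset 90 yields 10
   bytes with eof set, and any request at offset 100 or beyond yields
   0 bytes with eof set.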


   IMPLEMENTATION

           "It is possible for the  server  to  return  fewer  than
           count [byte_count] bytes of data.  If the server returns
           less than the count requested and eof set to FALSE,  the
           client should issue another READ [DAFS_PROC_READ_DIRECT]
           to get the remaining data.  A  server  may  return  less
           data  than  requested  under several circumstances.  The
           file may  have  been  truncated  by  another  client  or
           perhaps  on  the  server  itself, changing the file size
           from what the requesting client believes to be the case.
           This would reduce the actual amount of data available to
           the client.  It is possible that the server may back off
           the  transfer  size  and reduce the read request return.
           Server resource exhaustion may also occur  necessitating
           a smaller read return.

           If  the  file  is  locked  the  server  will  return  an
           NFS4ERR_LOCKED  [DAFSERR_LOCKED]  error.  Since the lock
           may be of short  duration,  the  client  may  choose  to
           retransmit  the  READ  [DAFS_PROC_READ_DIRECT]   request
           (with   exponential   backoff)   until   the   operation
           succeeds." (RFC 3010, p. 145)
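
   A client loop that follows this advice might look like the sketch
   below. Everything named here (sim_server, sim_read, read_fully, the
   status values, and the 10 ms initial backoff) is an assumption made
   for illustration. The simulated server stands in for a real DAFS
   session so that the loop can be exercised on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative status values; the real codes are defined elsewhere. */
enum { ST_OK = 0, ST_LOCKED = 1 };

struct read_reply {
    int      status;
    uint32_t bytes_read;
    bool     eof;
};

/* Simulated server, present only to exercise the loop below. */
struct sim_server {
    uint64_t file_size;
    uint32_t max_reply;       /* largest transfer per reply */
    unsigned locked_replies;  /* lock errors returned before any data */
    unsigned slept_ms;        /* total backoff the client asked for */
};

/* Stands in for one DAFS_PROC_READ_DIRECT round trip. */
static struct read_reply
sim_read(struct sim_server *s, uint64_t offset, uint32_t byte_count)
{
    struct read_reply r = { ST_OK, 0, false };

    if (s->locked_replies > 0) {
        s->locked_replies--;
        r.status = ST_LOCKED;
        return r;
    }
    if (offset >= s->file_size) {
        r.eof = true;
        return r;
    }
    uint64_t avail = s->file_size - offset;
    uint32_t n = byte_count < s->max_reply ? byte_count : s->max_reply;
    if (n >= avail) {
        n = (uint32_t)avail;
        r.eof = true;
    }
    r.bytes_read = n;
    return r;
}

/* Reads up to want bytes from offset.  Reissues the request after a
   short reply and backs off exponentially while the file is locked.
   Returns the final status; *total receives the bytes transferred. */
static int
read_fully(struct sim_server *s, uint64_t offset, uint32_t want,
           unsigned max_lock_retries, uint32_t *total)
{
    unsigned delay_ms = 10;   /* assumed initial backoff */
    unsigned lock_retries = 0;

    *total = 0;
    while (*total < want) {
        struct read_reply r = sim_read(s, offset + *total, want - *total);

        if (r.status == ST_LOCKED) {
            if (lock_retries++ >= max_lock_retries)
                return ST_LOCKED;
            s->slept_ms += delay_ms;  /* a real client would sleep here */
            delay_ms *= 2;
            continue;
        }
        if (r.status != ST_OK)
            return r.status;

        assert(r.bytes_read <= want - *total);
        *total += r.bytes_read;
        if (r.eof || r.bytes_read == 0)
            break;            /* end of file, or no forward progress */
    }
    return ST_OK;
}
```

   The retry limit is a design choice of this sketch. The quoted text
   lets the client retry until the operation succeeds, but a bound
   keeps an application from waiting indefinitely on a long-held lock.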

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_DENIED

   DAFSERR_EXPIRED

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL


   DAFSERR_IO

   DAFSERR_LOCKED

   DAFSERR_LEASE_MOVED

   DAFSERR_MOVED

   DAFSERR_NXIO

   DAFSERR_OLD_STATEID

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_STATEID


6.5.29.  DAFS_PROC_READDIR_INLINE

   SUMMARY

   Reads the contents of a directory. The data transfer is done inline,
   that is, using memory pointed to by the read descriptor buffers.

   ARGUMENTS

        struct DAFS_Readdir_Inline_args
           {
           dafs_filehandle_type       filehandle;
           dafs_uint64                cookie;
           dafs_verifier_type         cookieverf;
           dafs_uint32                dircount;
           dafs_uint32                maxcount;
           dafs_attr_bitmap_type      attr_request_bitmap;
           };


   RESULTS

        struct direntry
           {
           dafs_uint64                cookie;
           dafs_file_attr_type        attrs;               /* heap */
           dafs_var_offset_type       name_offset;         /* heap */
           };


        struct DAFS_Readdir_Inline_Res
           {
           dafs_verifier_type         cookieverf;
           dafs_uint32                eof;
           dafs_struct_direntry       entries<>;           /* heap */
           };


   DESCRIPTION

           "The   READDIR   [DAFS_PROC_READDIR_INLINE]    operation
           retrieves  a variable number of entries from a file sys-
           tem directory and returns  client  requested  attributes
           for  each  entry  along  with  information  to allow the
           client to request  additional  directory  entries  in  a
           subsequent   READDIR  [DAFS_PROC_READDIR_INLINE]."  (RFC
           3010, p. 147)

   A client is free to use the cookies and cookie verifiers obtained by
   previous DAFS readdir operations, regardless of whether the opera-
   tions were done INLINE or DIRECT. Keep this in mind when reading the
   remainder of the description of this DAFS procedure.

           "The arguments contain a cookie  value  that  represents
           where   the  READDIR  [DAFS_PROC_READDIR_INLINE]  should
           start within the directory.  A value of 0 (zero) for the
           cookie  is used to start reading at the beginning of the
           directory.         For        subsequent         READDIR
           [DAFS_PROC_READDIR_INLINE]  requests,  the client speci-
           fies a cookie value that is provided by the server on  a
           previous READDIR [DAFS_PROC_READDIR_INLINE] request.

           The cookieverf value should be set to 0 (zero) when  the
           cookie  value  is  0  (zero) (first directory read).  On
           subsequent  requests,  it  should  be  a  cookieverf  as
           returned  by the server.  The cookieverf must match that
           returned by the READDIR  [DAFS_PROC_READDIR_INLINE]   in
           which the cookie was acquired.

           The dircount portion of the argument is a  hint  of  the
           maximum  number  of  bytes of directory information that
           should be returned. This value represents the length  of
           the  names of the directory entries and the cookie value
           for these  entries.   This  length  represents  the  XDR
           [DAFS]  encoding of the data (names and cookies) and not
           the length in the  native  format  of  the  server.  The
           server may return less data.

           The maxcount value of the argument is the maximum number
           of  bytes  for the result.  This maximum size represents
           all of the data being  returned  and  includes  the  XDR
           [DAFS  encoding]  overhead.   The server may return less
           data.  If the server is unable to return a single direc-
           tory   entry   within  the  maxcount  limit,  the  error
           NFS4ERR_READDIR_NOSPC  [DAFSERR_READDIR_NOSPC]  will  be
           returned to the client.

           Finally, attrbits [attr_request_bitmap]  represents  the
           list  of  attributes  to  be returned for each directory
           entry supplied by the server.

           On successful return, the server's response will provide
           a  list  of  directory  entries.   Each of these entries
           contains the name of the directory entry, a cookie value
           for   that  entry,  and  the  associated  attributes  as
           requested." (RFC 3010, p. 147)

   See 4.1.3.3., "Attribute Bitmaps" for a discussion on attribute
   encoding in DAFS.
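
   The interaction between dircount and maxcount described above can
   be sketched from the server's side as follows. The entry_size
   structure, the pack_entries function, the fixed_bytes parameter,
   and the status values are hypothetical. The sketch also assumes
   that the encoded size of each entry is already known, and it
   treats dircount as a hint that never prevents the first entry
   from being returned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ST_OK = 0, ST_READDIR_NOSPC = 1 };  /* illustrative values */

struct entry_size {
    uint32_t dir_bytes;    /* encoded name plus cookie */
    uint32_t total_bytes;  /* dir_bytes plus encoded attributes */
};

/* Chooses how many leading entries go into one reply.
   fixed_bytes is the assumed size of the reply fields that precede
   the entries.  *n_fit receives the number of entries selected. */
static int
pack_entries(const struct entry_size *e, size_t n, uint32_t fixed_bytes,
             uint32_t dircount, uint32_t maxcount, size_t *n_fit)
{
    uint64_t total = fixed_bytes;
    uint64_t dir = 0;
    size_t i;

    assert(n_fit != NULL);
    for (i = 0; i < n; i++) {
        if (total + e[i].total_bytes > maxcount)
            break;          /* maxcount is a hard limit */
        if (dircount != 0 && i > 0 && dir + e[i].dir_bytes > dircount)
            break;          /* dircount is a hint; zero means ignore */
        total += e[i].total_bytes;
        dir += e[i].dir_bytes;
    }
    *n_fit = i;

    /* Not even one entry fits within maxcount. */
    return (i == 0 && n > 0) ? ST_READDIR_NOSPC : ST_OK;
}
```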

           "The cookie value is only meaningful to the  server  and
           is  used  as  a  'bookmark' for the directory entry.  As
           mentioned, this cookie is used by the client for  subse-
           quent  READDIR  [DAFS_PROC_READDIR_INLINE] operations so
           that it may continue reading a directory.  The cookie is
           similar  in  concept  to a READ offset but should not be
           interpreted as such by the client. Ideally,  the  cookie
           value  should  not  change  if the directory is modified
           since the client may be caching these values.

           In some cases, the server may encounter an  error  while
           obtaining the attributes for a directory entry.  Instead
           of  returning  an   error   for   the   entire   READDIR
           [DAFS_PROC_READDIR_INLINE]  operation,  the  server  can
           instead  return   the   attribute   'fattr4_rdattr_error
           [DAFS_FATTR_RDATTR_ERROR]'.   With  this,  the server is
           able to communicate the failure to the  client  and  not
           fail  the entire operation in the instance of what might
           be a transient  failure.   Obviously,  the  client  must
           request              the             fattr4_rdattr_error
           [DAFS_FATTR_RDATTR_ERROR] attribute for this  method  to
           work  properly.   If  the  client  does  not request the
           attribute, the  server  has  no  choice  but  to  return
           failure        for        the       entire       READDIR
           [DAFS_PROC_READDIR_INLINE] operation.

           For some file system environments, the directory entries
           '.' and '..'  have special meaning and in other environ-
           ments, they may not.  If the server supports these  spe-
           cial  entries  within  a  directory,  they should not be
           returned  to  the  client  as  part   of   the   READDIR
           [DAFS_PROC_READDIR_INLINE]  response.   To  enable  some
           client environments, the cookie values of 0,  1,  and  2
           are  to  be  considered  reserved.   Note  that the Unix
           client will use these values when combining the server's
           response  and  local  representations  to enable a fully
           formed Unix directory presentation to the application.

           For READDIR [DAFS_PROC_READDIR_INLINE] arguments, cookie
           values  of  1  and  2 should not be used and for READDIR
           [DAFS_PROC_READDIR_INLINE] results cookie values  of  0,
           1, and 2 should not [be] returned.

           On success, the current filehandle retains  its  value."
           (RFC 3010, pp. 147-148)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS procedures.
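
   The cookie handling described above can be sketched from the
   client's side as follows. The readdir_pos structure and the helper
   functions are hypothetical, and the 8-byte verifier length is an
   assumption; the actual layout of dafs_verifier_type is defined
   elsewhere in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VERF_LEN 8  /* assumed verifier size */

/* Where the next readdir request should resume. */
struct readdir_pos {
    uint64_t      cookie;
    unsigned char cookieverf[VERF_LEN];
};

/* Position for the first request: cookie 0 and a zero verifier. */
static struct readdir_pos first_pos(void)
{
    struct readdir_pos p;

    p.cookie = 0;
    memset(p.cookieverf, 0, sizeof p.cookieverf);
    return p;
}

/* A request carries 0 (start) or a server-issued cookie, not 1 or 2. */
static bool request_cookie_ok(uint64_t cookie)
{
    return cookie != 1 && cookie != 2;
}

/* A server-issued cookie is not one of the reserved values 0, 1, 2. */
static bool result_cookie_ok(uint64_t cookie)
{
    return cookie > 2;
}

/* Advances the position after a reply.  The cookie of the last entry
   is stored together with the verifier from the same reply, so the
   pair always matches.  Returns false and leaves *p unchanged if the
   server handed back a reserved cookie. */
static bool next_pos(struct readdir_pos *p, uint64_t last_cookie,
                     const unsigned char reply_verf[VERF_LEN])
{
    assert(p != NULL);
    if (!result_cookie_ok(last_cookie))
        return false;
    p->cookie = last_cookie;
    memcpy(p->cookieverf, reply_verf, VERF_LEN);
    return true;
}
```

   Because cookies and verifiers are interchangeable between the
   INLINE and DIRECT procedures, the same position can be carried
   from one kind of request to the other.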

   IMPLEMENTATION

           "The server's file system directory representations  can
           differ  greatly.   A client's programming interfaces may
           also be bound to the local operating  environment  in  a
           way  that  does  not  translate well into the NFS [DAFS]
           protocol.  Therefore the use of the  dircount  and  max-
           count  fields are provided to allow the client the abil-
           ity to provide guidelines to the server.  If the  client
           is  aggressive about attribute collection during a READ-
           DIR [DAFS_PROC_READDIR_INLINE], the server has  an  idea
           of  how  to  limit  the  encoded response.  The dircount
           field provides a hint on the  number  of  entries  based
           solely  on the names of the directory entries.  Since it
           is a hint, it may be possible that a dircount  value  is
           zero.   In  this  case, the server is free to ignore the
           dircount value and return directory information based on
           the specified maxcount value.

           The cookieverf may be used by the server to help  manage
           cookie  values  that  may  become stale.  It should be a
           rare occurrence that a  server  is  unable  to  continue
           properly   reading   a   directory   with  the  provided
           cookie/cookieverf pair.  The server  should  make  every
           effort  to avoid this condition since the application at
           the client may not be able to properly handle this  type
           of failure.

           The use of the cookieverf will also protect  the  client
           from  using  READDIR  [DAFS_PROC_READDIR_INLINE]  cookie
           values that may be stale.  For example, if the file sys-
           tem has been migrated, the server may or may not be able
           to  use  the  same  cookie  values  to  service  READDIR
           [DAFS_PROC_READDIR_INLINE]  as the previous server used.
           With the client providing the cookieverf, the server  is
           able  to provide the appropriate response to the client.
           This prevents the case where the  server  may  accept  a
           cookie  value  but  the underlying directory has changed
           and the response is invalid from the client's context of
           its previous READDIR [DAFS_PROC_READDIR_INLINE].


           Since some servers will not be returning  '.'  and  '..'
           entries  as  has been done with previous versions of the
           NFS protocol, the client that requires these entries  be
           present  in READDIR [DAFS_PROC_READDIR_INLINE] responses
           must fabricate them." (RFC 3010, pp. 148-149)
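
   A client that needs the '.' and '..' entries might fabricate them
   along the lines of the sketch below. The local_dirent structure and
   the with_dot_entries function are hypothetical. Using the reserved
   cookie values 1 and 2 for the fabricated entries is one possible
   local convention, suggested by the reservation of those values in
   the DESCRIPTION; it is not a requirement stated by this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Client-local view of one directory entry. */
struct local_dirent {
    uint64_t    cookie;
    const char *name;
};

/* Builds the listing shown to the application: fabricated '.' and
   '..' first, then the entries returned by the server.  Returns the
   number of entries written, or 0 if out cannot hold them all. */
static size_t
with_dot_entries(const struct local_dirent *server, size_t n_server,
                 struct local_dirent *out, size_t out_cap)
{
    if (out_cap < 2 || n_server > out_cap - 2)
        return 0;
    assert(out != NULL);

    out[0].cookie = 1;      /* reserved value, used locally for '.'  */
    out[0].name = ".";
    out[1].cookie = 2;      /* reserved value, used locally for '..' */
    out[1].name = "..";
    for (size_t i = 0; i < n_server; i++)
        out[2 + i] = server[i];
    return n_server + 2;
}
```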

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_COOKIE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NOTDIR

   DAFSERR_NOTSUPP

   DAFSERR_READDIR_NOSPC

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_TOOSMALL


6.5.30.  DAFS_PROC_READDIR_DIRECT

   SUMMARY

   Reads directory and returns data using RDMA write directly into
   client memory buffers.

   ARGUMENTS

        struct DAFS_Readdir_Direct_args
           {
           dafs_filehandle_type       filehandle;
           dafs_uint64                cookie;
           dafs_verifier_type         cookieverf;
           dafs_uint32                dircount;
           dafs_uint32                maxcount;
           dafs_attr_bitmap_type      attr_request_bitmap;
           dafs_direct_op_buffer      readdir_data_buffers<>;
           };


   RESULTS

        struct DAFS_Readdir_Direct_Res
           {
           dafs_verifier_type         cookieverf;
           dafs_uint32                eof;
           dafs_checksum_type         direct_checksum;
           };


           /* Readdir entries are returned in
              readdir_data_buffers
              specified in the arguments */
           /* DIRECT: struct direntry entries<>; */


   DESCRIPTION

           "The   READDIR   [DAFS_PROC_READDIR_DIRECT]    operation
           retrieves  a variable number of entries from a file sys-
           tem directory and returns  client  requested  attributes
           for  each  entry  along  with  information  to allow the
           client to request additional directory entries in a sub-
           sequent  READDIR [DAFS_PROC_READDIR_DIRECT]." (RFC 3010,
           p. 147)

   A client is free to use the cookies and cookie verifiers obtained by
   previous DAFS readdir operations, regardless of whether the opera-
   tions were done INLINE or DIRECT. Keep this in mind when reading the
   remainder of the description of this DAFS procedure.

           "The arguments contain a cookie  value  that  represents
           where   the  READDIR  [DAFS_PROC_READDIR_DIRECT]  should
           start within the directory.  A value of 0 (zero) for the
           cookie  is used to start reading at the beginning of the
           directory.         For        subsequent         READDIR
           [DAFS_PROC_READDIR_DIRECT]  requests,  the client speci-
           fies a cookie value that is provided by the server on  a
           previous READDIR [DAFS_PROC_READDIR_DIRECT] request.

           The cookieverf value should be set to 0 (zero) when  the
           cookie  value  is  0  (zero) (first directory read).  On
           subsequent  requests,  it  should  be  a  cookieverf  as
           returned  by the server.  The cookieverf must match that
           returned by the READDIR  [DAFS_PROC_READDIR_DIRECT]   in
           which the cookie was acquired.

           The dircount portion of the argument is a  hint  of  the
           maximum  number  of  bytes of directory information that
           should be returned. This value represents the length  of
           the  names of the directory entries and the cookie value
           for these  entries.   This  length  represents  the  XDR
           [DAFS]  encoding of the data (names and cookies) and not
           the length in the native  format  of  the  server.   The
           server may return less data.

           The maxcount value of the argument is the maximum number
           of  bytes  for the result.  This maximum size represents
           all of the data being  returned  and  includes  the  XDR
           [DAFS  encoding]  overhead.   The server may return less
           data.  If the server is unable to return a single direc-
           tory   entry   within  the  maxcount  limit,  the  error
           NFS4ERR_READDIR_NOSPC  [DAFSERR_READDIR_NOSPC]  will  be
           returned to the client.

           Finally, attrbits [attr_request_bitmap]  represents  the
           list  of  attributes  to  be returned for each directory
           entry supplied by the server.

           On successful return, the server's response will provide
           a list of directory entries.  Each of these entries con-
           tains the name of the directory entry,  a  cookie  value
           for   that  entry,  and  the  associated  attributes  as
           requested." (RFC 3010, p. 147)

   See 4.1.3.3., "Attribute Bitmaps" for a discussion on attribute
   encoding in DAFS.

           "The cookie value is only meaningful to the  server  and
           is  used  as  a  'bookmark' for the directory entry.  As
           mentioned, this cookie is used by the client for  subse-
           quent  READDIR  [DAFS_PROC_READDIR_DIRECT] operations so
           that it may continue reading a directory.  The cookie is
           similar  in  concept  to a READ offset but should not be
           interpreted as such by the client. Ideally,  the  cookie
           value  should  not  change  if the directory is modified
           since the client may be caching these values.

           In some cases, the server may encounter an  error  while
           obtaining the attributes for a directory entry.  Instead
           of  returning  an   error   for   the   entire   READDIR
           [DAFS_PROC_READDIR_DIRECT]  operation,  the  server  can
           instead  return   the   attribute   'fattr4_rdattr_error
           [DAFS_FATTR_RDATTR_ERROR]'.  With  this,  the  server is
           able to communicate the failure to the  client  and  not
           fail  the entire operation in the instance of what might
           be a transient  failure.   Obviously,  the  client  must
           request              the             fattr4_rdattr_error
           [DAFS_FATTR_RDATTR_ERROR] attribute for this  method  to
           work  properly.   If  the  client  does  not request the
           attribute, the  server  has  no  choice  but  to  return
           failure        for        the       entire       READDIR
           [DAFS_PROC_READDIR_DIRECT] operation.

           For some file system environments, the directory entries
           '.' and '..'  have special meaning and in other environ-
           ments, they may not.  If the server supports these  spe-
           cial  entries  within  a  directory,  they should not be
           returned  to  the  client  as  part   of   the   READDIR
           [DAFS_PROC_READDIR_DIRECT]  response.   To  enable  some
           client environments, the cookie values of 0,  1,  and  2
           are  to  be  considered  reserved.   Note  that the Unix
           client will use these values when combining the server's
           response  and  local  representations  to enable a fully
           formed Unix directory presentation to the application.

           For READDIR [DAFS_PROC_READDIR_DIRECT] arguments, cookie
           values  of  1  and  2 should not be used and for READDIR
           [DAFS_PROC_READDIR_DIRECT] results cookie values  of  0,
           1, and 2 should not [be] returned.


           On success, the current filehandle retains  its  value."
           (RFC 3010, pp. 147-148)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS procedures.
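
   The per-entry error reporting described above can be sketched as a
   server-side decision. All names in the sketch are hypothetical. It
   assumes that the attribute request bitmap can be viewed as a single
   64-bit word and that DAFS_FATTR_RDATTR_ERROR occupies bit 11;
   neither assumption is taken from this document (see 4.1.3.3.,
   "Attribute Bitmaps" for the actual encoding).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { ST_OK = 0, ST_IO = 5 };       /* illustrative values */
#define ATTR_RDATTR_ERROR_BIT 11u    /* assumed bit position */

struct entry_result {
    bool has_rdattr_error;   /* entry carries the error attribute */
    int  rdattr_error;       /* status recorded for this entry */
};

/* attr_status[i] is the outcome of fetching attributes for entry i.
   Returns ST_OK when every failure could be reported per entry.
   Otherwise returns the first failure, which fails the whole
   operation. */
static int
resolve_attr_errors(const int *attr_status, size_t n,
                    uint64_t attr_request_bitmap,
                    struct entry_result *out)
{
    /* Did the client ask for the per-entry error attribute? */
    bool per_entry =
        (attr_request_bitmap >> ATTR_RDATTR_ERROR_BIT) & 1u;

    assert(n == 0 || (attr_status != NULL && out != NULL));
    for (size_t i = 0; i < n; i++) {
        out[i].has_rdattr_error = false;
        out[i].rdattr_error = ST_OK;
        if (attr_status[i] == ST_OK)
            continue;
        if (!per_entry)
            return attr_status[i];
        out[i].has_rdattr_error = true;
        out[i].rdattr_error = attr_status[i];
    }
    return ST_OK;
}
```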

   IMPLEMENTATION

           "The server's file system directory representations  can
           differ  greatly.   A client's programming interfaces may
           also be bound to the local operating  environment  in  a
           way  that  does  not  translate well into the NFS [DAFS]
           protocol.  Therefore the use of the  dircount  and  max-
           count  fields are provided to allow the client the abil-
           ity to provide guidelines to the server.  If the  client
           is  aggressive about attribute collection during a READ-
           DIR [DAFS_PROC_READDIR_DIRECT], the server has  an  idea
           of  how  to  limit  the  encoded response.  The dircount
           field provides a hint on the  number  of  entries  based
           solely  on the names of the directory entries.  Since it
           is a hint, it may be possible that a dircount  value  is
           zero.   In  this  case, the server is free to ignore the
           dircount value and return directory information based on
           the specified maxcount value.

           The cookieverf may be used by the server to help  manage
           cookie  values  that  may  become stale.  It should be a
           rare occurrence that a  server  is  unable  to  continue
           properly   reading   a   directory   with  the  provided
           cookie/cookieverf pair.  The server  should  make  every
           effort  to avoid this condition since the application at
           the client may not be able to properly handle this  type
           of failure.

           The use of the cookieverf will also protect  the  client
           from  using  READDIR  [DAFS_PROC_READDIR_DIRECT]  cookie
           values that may be stale.  For example, if the file sys-
           tem has been migrated, the server may or may not be able
           to  use  the  same  cookie  values  to  service  READDIR
           [DAFS_PROC_READDIR_DIRECT]  as the previous server used.
           With the client providing the cookieverf, the server  is
           able  to provide the appropriate response to the client.
           This prevents the case where the  server  may  accept  a
           cookie  value  but  the underlying directory has changed
           and the response is invalid from the client's context of
           its previous READDIR [DAFS_PROC_READDIR_DIRECT].

           Since some servers will not be returning  '.'  and  '..'
           entries  as  has been done with previous versions of the
           NFS protocol, the client that requires these entries  be
           present  in READDIR [DAFS_PROC_READDIR_DIRECT] responses
           must fabricate them." (RFC 3010, pp. 148-149)
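
   One way a server could use the cookieverf to detect stale cookies
   is sketched below. The check_cookie function, the 8-byte verifier
   length, and the status values are assumptions. Returning
   DAFSERR_BAD_COOKIE on a mismatch is also an assumption; that error
   appears in the ERRORS list, but the text above does not say which
   error a server returns in this case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ST_OK = 0, ST_BAD_COOKIE = 1 };  /* illustrative values */
#define VERF_LEN 8                      /* assumed verifier size */

/* Decides whether a request's cookie can still be honored.
   current_verf is the verifier the server would issue for the
   directory as it exists now. */
static int
check_cookie(uint64_t cookie, const unsigned char verf[VERF_LEN],
             const unsigned char current_verf[VERF_LEN])
{
    assert(verf != NULL && current_verf != NULL);

    if (cookie == 0)
        return ST_OK;          /* start of directory, nothing to verify */
    if (memcmp(verf, current_verf, VERF_LEN) != 0)
        return ST_BAD_COOKIE;  /* cookie predates a change or migration */
    return ST_OK;
}
```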

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_COOKIE

   DAFSERR_CHAIN_BROKEN

   DAFSERR_CHAIN_FORM

   DAFSERR_DELAY

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NOTDIR

   DAFSERR_NOTSUPP

   DAFSERR_READDIR_NOSPC

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_TOOSMALL


6.5.31.  DAFS_PROC_READLINK_INLINE

   SUMMARY

   Reads the contents of a symbolic link. Contents of the link are
   returned inline.

   ARGUMENTS

        struct DAFS_Readlink_Inline_Args
           {
           dafs_filehandle_type       filehandle;
           };


   RESULTS

        struct DAFS_Readlink_Inline_Res
           {
           utf8string_type            link;                /* heap */
           };


   DESCRIPTION

           "READLINK  [DAFS_PROC_READLINK_INLINE]  reads  the  data
           associated  with  a  symbolic link.  The data is a UTF-8
           string that is opaque to the server.  That  is,  whether
           created  by  an  NFS [DAFS] client or created locally on
           the server, the data in a symbolic link  is  not  inter-
           preted when created, but is simply stored.

           On success, the current filehandle retains  its  value."
           (RFC 3010, p. 150)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS procedures.

   IMPLEMENTATION

           "A symbolic link is nominally a pointer to another file.
           The  data  is not necessarily interpreted by the server,
           just stored in the file. It is  possible  for  a  client
           implementation to store a path name that is not meaning-
           ful to the server operating system in a  symbolic  link.
           A READLINK [DAFS_PROC_READLINK_INLINE] operation returns
           the data to the client for interpretation. If  different
           implementations  want to share access to symbolic links,
           then they must agree on the interpretation of  the  data
           in the symbolic link.

           The READLINK  [DAFS_PROC_READLINK_INLINE]  operation  is
           only  allowed on objects of type NF4LNK [DAFS_TYPE_LNK].
           The  server  should  return  the  error,   NFS4ERR_INVAL
           [DAFSERR_INVAL],  if  the  object is not of type, NF4LNK
           [DAFS_TYPE_LNK]." (RFC 3010, p. 150)
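
   The type check described above can be sketched as follows. The
   fs_object structure, the do_readlink function, and all numeric
   values are hypothetical stand-ins for a server's internal state.
   The sketch shows only that a non-link object yields an INVAL
   status and that link data is handed back without interpretation.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values; the real constants are defined elsewhere. */
enum obj_type { TYPE_REG = 1, TYPE_DIR = 2, TYPE_LNK = 5 };
enum { ST_OK = 0, ST_INVAL = 22 };

/* Hypothetical server-side view of a file system object. */
struct fs_object {
    enum obj_type type;
    const char   *link_data;  /* stored verbatim when the link was made */
};

/* Returns the stored link contents for symbolic links only. */
static int
do_readlink(const struct fs_object *obj, const char **link_out)
{
    assert(obj != NULL && link_out != NULL);

    if (obj->type != TYPE_LNK)
        return ST_INVAL;          /* not a symbolic link */
    *link_out = obj->link_data;   /* returned without interpretation */
    return ST_OK;
}
```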

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NOTSUPP

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.32.  DAFS_PROC_READLINK_DIRECT

   SUMMARY

   Reads the contents of a symbolic link. Contents of the link are
   returned via RDMA operations to the buffer specified by the client.

   ARGUMENTS

        struct DAFS_Readlink_Direct_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_direct_op_buffer      buffer;
           };


   RESULTS

        struct DAFS_Readlink_Direct_Res
           {
           dafs_checksum_type         direct_checksum;
           };


           /* Contents of link are returned in buffer
              described in the arguments packet */
           /* DIRECT: dafs_utf8string_type linkcontents; */


   DESCRIPTION

           "READLINK  [DAFS_PROC_READLINK_DIRECT]  reads  the  data
           associated  with  a  symbolic link.  The data is a UTF-8
           string that is opaque to the server.  That  is,  whether
           created  by  an  NFS [DAFS] client or created locally on
           the server, the data in a symbolic link  is  not  inter-
           preted when created, but is simply stored.

           On success, the current filehandle retains  its  value."
           (RFC 3010, p. 150)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS procedures.

   IMPLEMENTATION


           "A symbolic link is nominally a pointer to another file.
           The  data  is not necessarily interpreted by the server,
           just stored in the file. It is  possible  for  a  client
           implementation to store a path name that is not meaning-
           ful to the server operating system in a  symbolic  link.
           A READLINK [DAFS_PROC_READLINK_DIRECT] operation returns
           the data to the client for interpretation. If  different
           implementations  want to share access to symbolic links,
           then they must agree on the interpretation of  the  data
           in the symbolic link.

           The READLINK  [DAFS_PROC_READLINK_DIRECT]  operation  is
           only  allowed on objects of type NF4LNK [DAFS_TYPE_LNK].
           The  server  should  return  the  error,   NFS4ERR_INVAL
           [DAFSERR_INVAL],  if  the  object is not of type, NF4LNK
           [DAFS_TYPE_LNK]." (RFC 3010, p. 150)

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NOTSUPP

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.33.  DAFS_PROC_REMOVE

   SUMMARY

   Removes a file object.

   ARGUMENTS

        enum removemode
           {
           UNCHECKED_REMOVE           = 0,
           CHECK_OPEN                 = 1
           };


        struct DAFS_Remove_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_component_type        target;              /* heap */
           enum removemode            remove_mode;
           };


   RESULTS

        struct DAFS_Remove_Res
           {
           dafs_change_info_type      change_info;
           };


   DESCRIPTION

           "The   REMOVE   [DAFS_PROC_REMOVE]   operation   removes
           (deletes)  a  directory entry named by filename from the
           directory corresponding  to  the  current   filehandle."
           (RFC 3010, p. 151)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS procedures.

           "If the entry in the directory was the last reference to
           the  corresponding file system object, the object may be
           destroyed." (RFC 3010, p. 152)

   Notice DAFS exceptions for open files in this DESCRIPTION section.

           "For the directory where the filename was  removed,  the
           server   returns   change_info4  [dafs_change_info_type]
           information in cinfo  [change_info].   With  the  atomic
           field   of   the   change_info4  [dafs_change_info_type]
           struct, the server will indicate if the before and after
           change  attributes were obtained atomically with respect
           to the removal.

           If the target has a length of 0  (zero),  or  if  target
           does   not   obey   the   UTF-8  definition,  the  error
           NFS4ERR_INVAL [DAFSERR_INVAL] will be returned.

           On success, the current filehandle retains  its  value."
           (RFC 3010, pp. 151-152)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS procedures.

   DAFS_PROC_REMOVE has the ability to guard against the removal of
   files that are currently open. If remove_mode is set to
   UNCHECKED_REMOVE, then the server will attempt to delete the file
   regardless of any outstanding open references, with one exception:
   the request will fail with DAFSERR_DENYDISP_CONFLICT if the file is
   currently open with a DELETE_DENY disposition. If remove_mode is set
   to CHECK_OPEN, then the server will attempt to delete the file only
   if the file is not currently open. If it is open, the request fails
   with DAFSERR_DENYDISP_CONFLICT. The request will fail with
   DAFSERR_DENYDISP_NOTSUPP if the server is unable to guarantee this
   behavior for DAFS clients.

   [Note: a remove request with CHECK_OPEN may or may not succeed if
   the server is unable to detect open references from other protocols.]

   The DAFS_PROC_REMOVE operation provides "Delete On Last Close" seman-
   tics. Once a file has been opened, the DAFS Server MUST continue to
   provide access to the file to the Clients that have the file open,
   even after the file has been removed, up until the number of Clients
   that have the file open has dropped to zero. However, once the file
   has been removed, subsequent lookup and open operations will fail.

   IMPLEMENTATION

           "NFS versions 2 and  3  required  a  different  operator
           RMDIR  for  directory  removal.   NFS  version  4 [DAFS]
           REMOVE [DAFS_PROC_REMOVE] can  be  used  to  delete  any
           directory entry independent of its file type.

           The concept of last reference is server  specific.  How-
           ever,  if  the numlinks field in the previous attributes
           of the object had the value 1,  the  client  should  not
           rely on referring to the object via a file handle. Like-
           wise, the client should not rely on the resources  (disk
           space,  directory  entry, and so on) formerly associated
           with the object becoming  immediately  available.  Thus,
           if  a  client  needs  to be able to continue to access a
           file after using REMOVE to remove it, the client  should
           take  steps  to  make  sure  that the file will still be
           accessible. The usual mechanism used is  to  RENAME  the
           file from its old name to a new hidden name." (RFC 3010,
           p. 152)

   DAFS supports delete-on-last-close. A client does not have to rename
   a file in order to protect its access to an open file that is being
   removed. The rename MAY be necessary if the client wants to prevent
   deletion of a file that is NOT open but for which the client holds a
   filehandle obtained via a lookup operation.

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_DENYDISP_CONFLICT

   DAFSERR_DENYDISP_NOTSUPP

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NAMETOOLONG

   DAFSERR_NOENT

   DAFSERR_NOTDIR

   DAFSERR_NOTEMPTY

   DAFSERR_NOTSUPP

   DAFSERR_RESOURCE

   DAFSERR_ROFS

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.34.  DAFS_PROC_RENAME

   SUMMARY

   Renames a directory entry.

   ARGUMENTS

        struct DAFS_Rename_Args
           {
           dafs_filehandle_type       sourcedir;
           dafs_filehandle_type       targetdir;
           dafs_component_type        oldname;             /* heap */
           dafs_component_type        newname;             /* heap */
           };


   RESULTS

        struct DAFS_Rename_Res
           {
           dafs_change_info_type      source;
           dafs_change_info_type      target;
           };


   DESCRIPTION

           "The RENAME  [DAFS_PROC_RENAME]  operation  renames  the
           object  identified  by  oldname  in the source directory
           corresponding to the saved filehandle,  as  set  by  the
           SAVEFH  operation,  to  newname  in the target directory
           corresponding to the current filehandle." (RFC 3010,  p.
           153)

   When the DAFS_PROC_RENAME operation occurs within a DAFS operation
   chain (see 4.3.2., "Chaining Flags", for a description of chaining),
   the DAFS chain current_filehandle specifies the target directory, and
   the source directory, oldname, and newname are taken from the message
   arguments.

           "The operation is required to be atomic to  the  client.
           Source  and  target  directories must reside on the same
           file system on the  server.   On  success,  the  current
           filehandle  will  continue  to be the target directory."
           (RFC 3010, p. 153)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in management of filehandles in DAFS procedures.

           "If the target directory already contains an entry  with
           the  name, newname, the source object must be compatible
           with the target: either both are non-directories or both
           are directories and the target must be empty.  If compa-
           tible, the existing target is removed before the  rename
           occurs.   If they are not compatible or if the target is
           a directory but not empty, the server  will  return  the
           error, NFS4ERR_EXIST [DAFSERR_EXIST].

           If oldname and newname both refer to the same file (they
           might   be  hard  links  of  each  other),  then  RENAME
           [DAFS_PROC_RENAME] should perform no action  and  return
           success.

           For   both   directories   involved   in   the    RENAME
           [DAFS_PROC_RENAME],  the  server  returns  change_info4
           [dafs_change_info_type] information.   With  the  atomic
           field   of   the   change_info4  [dafs_change_info_type]
           struct, the server will indicate if the before and after
           change  attributes were obtained atomically with respect
           to the rename.

           If the oldname or newname has a length of 0  (zero),  or
           if  oldname  or  newname does not obey the UTF-8 defini-
           tion, the error NFS4ERR_INVAL  [DAFSERR_INVAL]  will  be
           returned." (RFC 3010, pp. 153-154)
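
   The target compatibility rule and the zero-length name rule quoted
   above can be sketched as a single check. This is a minimal sketch
   under stated assumptions: struct dir_entry, check_rename, and the
   RENAME_* names are hypothetical, UTF-8 validation is omitted, and
   the case where oldname and newname refer to the same file is not
   modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the outcomes described in the text. */
enum rename_status {
    RENAME_OK,                  /* no existing target */
    RENAME_OK_REPLACES_TARGET,  /* compatible target is removed first */
    RENAME_EXIST,               /* DAFSERR_EXIST */
    RENAME_INVAL                /* DAFSERR_INVAL */
};

/* Hypothetical summary of a directory entry. */
struct dir_entry {
    bool exists;
    bool is_dir;
    bool is_empty;  /* meaningful only when is_dir is true */
};

enum rename_status check_rename(size_t oldname_len, size_t newname_len,
                                const struct dir_entry *src,
                                const struct dir_entry *dst)
{
    assert(src != NULL && dst != NULL);

    /* Zero-length names are rejected. */
    if (oldname_len == 0 || newname_len == 0)
        return RENAME_INVAL;

    /* Nothing at newname: plain rename. */
    if (!dst->exists)
        return RENAME_OK;

    /* Both must be non-directories, or both must be directories. */
    if (src->is_dir != dst->is_dir)
        return RENAME_EXIST;

    /* A directory target must be empty. */
    if (dst->is_dir && !dst->is_empty)
        return RENAME_EXIST;

    return RENAME_OK_REPLACES_TARGET;
}
```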

   The DAFS_PROC_RENAME operation provides "Delete On Last Close" seman-
   tics. Once a file has been opened, the DAFS Server MUST continue to
   provide access to the file to the Clients that have the file open,
   even after the file has been renamed, up until the number of Clients
   that have the file open has dropped to zero. However, once the file
   has been renamed, subsequent lookup and open operations that use its
   old name will fail.

   IMPLEMENTATION

           "The RENAME [DAFS_PROC_RENAME] operation must be  atomic
           to  the client.  The statement 'source and target direc-
           tories must reside  on  the  same  file  system  on  the
           server' means that the fsid fields in the attributes for
           the directories are the same. If  they  reside  on  dif-
           ferent    file    systems,   the   error,   NFS4ERR_XDEV
           [DAFSERR_XDEV], is returned.

           A filehandle may or may not become stale or expire on  a
           rename.   However,   server  implementors  are  strongly
           encouraged to attempt to keep file handles from becoming
           stale or expiring in this fashion.

           On some servers, the filenames, '.' and '..', are  ille-
           gal  as either oldname or newname.  In addition, neither
           oldname nor newname can  be  an  alias  for  the  source
           directory.    These   servers  will  return  the  error,
           NFS4ERR_INVAL [DAFSERR_INVAL],  in  these  cases."  (RFC
           3010, p. 154)
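
   The same-file-system requirement and the '.' and '..' restriction
   can be sketched as a pre-check that runs before the rename itself.
   The names below (struct sketch_fsid, precheck_rename, the
   reject_dot_names flag, and the PRECHECK_* values) are hypothetical,
   the two-part fsid layout is an assumption, and the order of the
   two checks is arbitrary. Detection of aliases for the source
   directory is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for DAFSERR_XDEV and DAFSERR_INVAL. */
enum precheck_status { PRECHECK_OK, PRECHECK_XDEV, PRECHECK_INVAL };

/* Assumed layout of the fsid attribute; not defined in this section. */
struct sketch_fsid {
    uint64_t major_id;
    uint64_t minor_id;
};

static bool is_dot_name(const char *name)
{
    return strcmp(name, ".") == 0 || strcmp(name, "..") == 0;
}

enum precheck_status precheck_rename(struct sketch_fsid src_dir,
                                     struct sketch_fsid dst_dir,
                                     const char *oldname,
                                     const char *newname,
                                     bool reject_dot_names)
{
    assert(oldname != NULL && newname != NULL);

    /* Source and target directories must share the same fsid. */
    if (src_dir.major_id != dst_dir.major_id ||
        src_dir.minor_id != dst_dir.minor_id)
        return PRECHECK_XDEV;

    /* Only some servers treat "." and ".." as illegal names. */
    if (reject_dot_names && (is_dot_name(oldname) || is_dot_name(newname)))
        return PRECHECK_INVAL;

    return PRECHECK_OK;
}
```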

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_DQUOT

   DAFSERR_EXIST

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_ISDIR

   DAFSERR_MOVED

   DAFSERR_NAMETOOLONG

   DAFSERR_NOENT

   DAFSERR_NOSPC

   DAFSERR_NOTDIR

   DAFSERR_NOTEMPTY

   DAFSERR_NOTSUPP

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_XDEV


6.5.35.  DAFS_PROC_SETATTR_INLINE

   SUMMARY

   Sets the attributes of a file object.

   ARGUMENTS

        struct DAFS_Setattr_Inline_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_state_id_type         state_id;
           dafs_file_attr_type        obj_attributes;
           };


   RESULTS

        struct DAFS_Setattr_Inline_Res
           {
           dafs_attr_bitmap_type      attr_request_bitmap;
           };


   DESCRIPTION

           "The   SETATTR   [DAFS_PROC_SETATTR_INLINE]    operation
           changes  one  or more of the attributes of a file system
           object.  The new attributes are specified with a  bitmap
           and the attributes that follow the bitmap in bit order.

           The  stateid  [state_id]  is  necessary   for   SETATTRs
           [DAFS_PROC_SETATTR_INLINEs]  that  change  the size of a
           file (modify the attribute object_size).   This  stateid
           represents  a record lock, share reservation, or delega-
           tion   which   must   be   valid   for    the    SETATTR
           [DAFS_PROC_SETATTR_INLINE]  to  modify the file data.  A
           valid stateid would always be specified. When  the  file
           size  is  not changed, the special stateid consisting of
           all bits 0 (zero) should be used.

           On either success  or  failure  of  the  operation,  the
           server  will  return  the attrsset [attr_request_bitmap]
           bitmask to represent what (if any) attributes were  suc-
           cessfully set." (RFC 3010, p. 160)

   See 4.1.3.3., "Attribute Bitmaps" for a description of file attribute
   encoding in DAFS.

           "On success, the current filehandle retains its  value."
           (RFC 3010, p. 160)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   IMPLEMENTATION

           "The file size attribute is used to request  changes  to
           the  size of a file. A value of 0 (zero) causes the file
           to be truncated, a value less than the current  size  of
           the  file  causes  data  from new size to the end of the
           file to be  discarded,  and  a  size  greater  than  the
           current  size  of  the file causes logically zeroed data
           bytes to be added to the end of the file.   Servers  are
           free  to  implement this using holes or actual zero data
           bytes. Clients should not make any assumptions regarding
           a  server's  implementation of this feature, beyond that
           the bytes returned will be zeroed.  Servers must support
           extending      the     file     size     via     SETATTR
           [DAFS_PROC_SETATTR_INLINE].

           SETATTR [DAFS_PROC_SETATTR_INLINE]   is  not  guaranteed
           atomic.  A failed SETATTR [DAFS_PROC_SETATTR_INLINE] may
           partially change a file's attributes.

           Changing   the   size   of   a   file    with    SETATTR
           [DAFS_PROC_SETATTR_INLINE]    indirectly   changes   the
           time_modify.  A client must account  for  this  as  size
           changes can result in data deletion.

           If server and client times differ, programs that compare
           client  time to file times can break. A time maintenance
           protocol should be  used  to  limit  client/server  time
           skew.

           If the server cannot successfully set all the attributes
           it  must  return an NFS4ERR_INVAL [DAFSERR_INVAL] error.
           If the server can only support 32 bit offsets and sizes,
           a SETATTR [DAFS_PROC_SETATTR_INLINE]  request to set the
           size of a file to larger than can be represented  in  32
           bits  will be rejected with this same error." (RFC 3010,
           pp. 160-161)
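
   The size-change behavior quoted above (truncate, discard the tail,
   or extend with logically zeroed bytes) can be sketched against an
   in-memory buffer. This is a sketch under stated assumptions, not a
   server implementation: struct mem_file, set_size, the
   server_is_32bit flag, and the SETSIZE_* names are hypothetical, and
   SETSIZE_NOMEM is a local allocation failure rather than a DAFS
   error code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum setsize_status {
    SETSIZE_OK,
    SETSIZE_INVAL,  /* stands in for DAFSERR_INVAL */
    SETSIZE_NOMEM   /* local to this sketch */
};

/* Hypothetical in-memory file: contents plus current size. */
struct mem_file {
    unsigned char *data;
    uint64_t       size;
};

enum setsize_status set_size(struct mem_file *f, uint64_t new_size,
                             int server_is_32bit)
{
    assert(f != NULL);

    /* A server limited to 32-bit sizes rejects larger requests. */
    if (server_is_32bit && new_size > UINT32_MAX)
        return SETSIZE_INVAL;
    if (new_size > SIZE_MAX)
        return SETSIZE_NOMEM;

    /* Size 0 truncates the file completely. */
    if (new_size == 0) {
        free(f->data);
        f->data = NULL;
        f->size = 0;
        return SETSIZE_OK;
    }

    unsigned char *p = realloc(f->data, (size_t)new_size);
    if (p == NULL)
        return SETSIZE_NOMEM;   /* the original buffer is left intact */

    /* Growing: the added bytes must read back as zero. */
    if (new_size > f->size)
        memset(p + f->size, 0, (size_t)(new_size - f->size));

    f->data = p;
    f->size = new_size;
    return SETSIZE_OK;
}
```

   The sketch stores real zero bytes. A server is equally free to
   represent the extension as a hole, since clients may only rely on
   the bytes reading back as zero.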

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_BROKEN

   DAFSERR_CHAIN_FORM

   DAFSERR_DELAY

   DAFSERR_DENIED

   DAFSERR_DQUOT

   DAFSERR_EXPIRED

   DAFSERR_FBIG

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NOSPC

   DAFSERR_NOTSUPP

   DAFSERR_OLD_STATEID

   DAFSERR_PERM

   DAFSERR_RESOURCE

   DAFSERR_ROFS

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_STATEID


6.5.36.  DAFS_PROC_SETATTR_DIRECT

   SUMMARY

   Sets the attributes of a file object.

   ARGUMENTS

        struct DAFS_Setattr_Direct_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_state_id_type         state_id;
           dafs_checksum_type         direct_checksum;
           dafs_direct_op_buffer      obj_attributes;
           };

            /* DIRECT: file_attr_type   obj_attributes; */


   RESULTS

        struct DAFS_Setattr_Direct_Res
           {
           dafs_attr_bitmap_type      attr_request_bitmap;
           };


   DESCRIPTION

           "The   SETATTR   [DAFS_PROC_SETATTR_DIRECT]    operation
           changes  one  or more of the attributes of a file system
           object.  The new attributes are specified with a  bitmap
           and the attributes that follow the bitmap in bit order.

           The  stateid  [state_id]  is  necessary   for   SETATTRs
           [DAFS_PROC_SETATTR_DIRECTs]  that  change  the size of a
           file (modify the attribute object_size).   This  stateid
           represents  a record lock, share reservation, or delega-
           tion   which   must   be   valid   for    the    SETATTR
           [DAFS_PROC_SETATTR_DIRECT]  to  modify the file data.  A
           valid stateid would always be specified. When  the  file
           size  is  not changed, the special stateid consisting of
           all bits 0 (zero) should be used.

           On either success  or  failure  of  the  operation,  the
           server  will  return  the attrsset [attr_request_bitmap]
           bitmask to represent what (if any) attributes were  suc-
           cessfully set." (RFC 3010, p. 160)

   See 4.1.3.3., "Attribute Bitmaps" for a description of file attribute
   encoding in DAFS.

           "On success, the current filehandle retains its  value."
           (RFC 3010, p. 160)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   IMPLEMENTATION

           "The file size attribute is used to request  changes  to
           the  size of a file. A value of 0 (zero) causes the file
           to be truncated, a value less than the current  size  of
           the  file  causes  data  from new size to the end of the
           file to be  discarded,  and  a  size  greater  than  the
           current  size  of  the file causes logically zeroed data
           bytes to be added to the end of the file.   Servers  are
           free  to  implement this using holes or actual zero data
           bytes. Clients should not make any assumptions regarding
           a  server's  implementation of this feature, beyond that
           the bytes returned will be zeroed.  Servers must support
           extending      the     file     size     via     SETATTR
           [DAFS_PROC_SETATTR_DIRECT].

           SETATTR [DAFS_PROC_SETATTR_DIRECT]   is  not  guaranteed
           atomic.  A failed SETATTR [DAFS_PROC_SETATTR_DIRECT] may
           partially change a file's attributes.

           Changing   the   size   of   a   file    with    SETATTR
           [DAFS_PROC_SETATTR_DIRECT]    indirectly   changes   the
           time_modify.  A client must account  for  this  as  size
           changes can result in data deletion.

           If server and client times differ, programs that compare
           client  time to file times can break. A time maintenance
           protocol should be  used  to  limit  client/server  time
           skew.

           If the server cannot successfully set all the attributes
           it  must  return an NFS4ERR_INVAL [DAFSERR_INVAL] error.
           If the server can only support 32 bit offsets and sizes,
           a SETATTR [DAFS_PROC_SETATTR_DIRECT]  request to set the
           size of a file to larger than can be represented  in  32
           bits  will be rejected with this same error." (RFC 3010,
           pp. 160-161)

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_DENIED

   DAFSERR_DQUOT

   DAFSERR_EXPIRED

   DAFSERR_FBIG

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_MOVED

   DAFSERR_NOSPC

   DAFSERR_NOTSUPP

   DAFSERR_OLD_STATEID

   DAFSERR_PERM

   DAFSERR_RESOURCE

   DAFSERR_ROFS

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_STATEID


6.5.37.  DAFS_PROC_VERIFY

   SUMMARY

   Verifies equality of attributes.

   ARGUMENTS

        struct DAFS_Verify_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_file_attr_type        obj_attributes;
           };


   RESULTS

       None.

   DESCRIPTION

           "The VERIFY [DAFS_PROC_VERIFY] operation is used to ver-
           ify  that  attributes have a value assumed by the client
           before proceeding with following operations in the  com-
           pound request." (RFC 3010, p. 165)

   DAFS_PROC_VERIFY can be used in a similar fashion inside a DAFS
   chain.

           "If any of the attributes do not match  then  the  error
           NFS4ERR_NOT_SAME  [DAFSERR_NOT_SAME]  must  be returned.
           The current filehandle retains its value after  success-
           ful completion of the operation." (RFC 3010, p. 165)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   IMPLEMENTATION

           "In the case that a recommended attribute  is  specified
           in  the  VERIFY   [DAFS_PROC_VERIFY]  operation  and the
           server does not support that attribute for the file sys-
           tem  object, the error NFS4ERR_NOTSUPP [DAFSERR_NOTSUPP]
           is returned to the client." (RFC 3010, p. 165)

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY

   DAFSERR_FHEXPIRED

   DAFSERR_INVAL

   DAFSERR_MOVED

   DAFSERR_NOTSUPP

   DAFSERR_NOT_SAME

   DAFSERR_RESOURCE

   DAFSERR_SERVERFAULT

   DAFSERR_STALE


6.5.38.  DAFS_PROC_WRITE_INLINE

   SUMMARY

   Writes data to a file. The data to be written is part of the request
   packet and is passed inline.

   ARGUMENTS

        enum stable_how
           {
           UNSTABLE                   = 0,
           DATA_SYNC                  = 1,
           FILE_SYNC                  = 2
           };


        struct DAFS_Write_Inline_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_state_id_type         state_id;
           dafs_uint64                offset;
           dafs_uint32                byte_count;
           stable_how                 stable_how;
           dafs_uint32                write_padded;
           dafs_cache_hint_type       cache_hint;
           dafs_opaque8               data[byte_count];
                                              /* heap or padded */
           };


   RESULTS

        struct DAFS_Write_Inline_Res
           {
           dafs_uint32                count;
           stable_how                 committed;
           dafs_verifier_type         verifier;
           };


   DESCRIPTION

           "The WRITE [DAFS_PROC_WRITE_INLINE] operation is used to
           write  data  to  a  regular  file.   The  target file is
           specified by the current filehandle." (RFC 3010, p. 167)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

           "The offset specifies the offset where the  data  should
           be  written.  An  offset  of 0 (zero) specifies that the
           write should start at the beginning of  the  file.   The
           count  [byte_count]  represents  the  number of bytes of
           data that are to be written.  If the count  [byte_count]
           is  0  (zero),  the WRITE [DAFS_PROC_WRITE_INLINE]  will
           succeed and return a count of 0 (zero) subject  to  per-
           missions checking.  The server may choose to write fewer
           bytes than requested by the client.

           Part of the write request is a specification of how  the
           write is to be performed.  The client specifies with the
           stable parameter the method of how the  data  is  to  be
           processed  by  the  server.   If  stable  is  FILE_SYNC4
           [FILE_SYNC], the server must  commit  the  data  written
           plus  all  file system metadata to stable storage before
           returning results.  This corresponds to the NFS  version
           2  protocol semantics.  Any other behavior constitutes a
           protocol   violation.    If   stable    is    DATA_SYNC4
           [DATA_SYNC], then the server must commit all of the data
           to stable storage and enough of the metadata to retrieve
           the  data  before  returning.  The server implementor is
           free to implement DATA_SYNC4  [DATA_SYNC]  in  the  same
           fashion  as  FILE_SYNC4 [FILE_SYNC], but with a possible
           performance drop.  If stable  is  UNSTABLE4  [UNSTABLE],
           the  server  is  free to commit any part of the data and
           the metadata to stable storage, including all  or  none,
           before  returning  a  reply  to  the client. There is no
           guarantee whether or when any uncommitted data will sub-
           sequently  be  committed  to  stable  storage.  The only
           guarantees made by the server are that it will not  des-
           troy  any  data  without  changing the value of verf and
           that it will not commit the data and metadata at a level
           less than that requested by the client.

           The stateid returned from  a  previous  record  lock  or
           share  reservation  request  is  provided as part of the
           argument.  The stateid is used by the server  to  verify
           that  the  associated  lock is still valid and to update
           lease timeouts for the client." (RFC 3010, p. 167)

   DAFS servers renew leases whenever any DAFS request (including NULL)
   is received from a client. Leases are renewed based on the client
   associated with the session on which the request is received.

           "Upon successful completion, the following  results  are
           returned.  The  count  result  is the number of bytes of
           data written to the file. The  server  may  write  fewer
           bytes  than requested. If so, the actual number of bytes
           written starting at location, offset, is returned.

           The server also returns an indication of  the  level  of
           commitment  of  the  data and metadata via committed. If
           the server committed all data  and  metadata  to  stable
           storage,   committed   should   be   set  to  FILE_SYNC4
           [FILE_SYNC]. If the level of commitment was at least  as
           strong  as DATA_SYNC4 [DATA_SYNC], then committed should
           be set to DATA_SYNC4 [DATA_SYNC].  Otherwise,  committed
           must  be returned as UNSTABLE4 [UNSTABLE]. If stable was
           FILE_SYNC4 [FILE_SYNC],  then  committed  must  also  be
           FILE_SYNC4 [FILE_SYNC]: anything else constitutes a pro-
           tocol violation. If stable was  DATA_SYNC4  [DATA_SYNC],
           then   committed   may   be  FILE_SYNC4   or  DATA_SYNC4
           [FILE_SYNC or DATA_SYNC]: anything  else  constitutes  a
           protocol  violation. If stable was UNSTABLE4 [UNSTABLE],
           then committed may be either FILE_SYNC4, DATA_SYNC4,  or
           UNSTABLE4 [FILE_SYNC, DATA_SYNC, or UNSTABLE].

           The final portion of the result is the  write  verifier,
           verf [verifier]. The write verifier is a cookie that the
           client can use  to  determine  whether  the  server  has
           changed     state    between    a    call    to    WRITE
           [DAFS_PROC_WRITE_INLINE] and a subsequent call to either
           WRITE       [DAFS_PROC_WRITE_INLINE]      or      COMMIT
           [DAFS_PROC_COMMIT]." (RFC 3010, pp. 167-168)

   DAFS servers use the same write verifiers during a single DAFS server
   instance, whether the write operation is done INLINE or DIRECT. The
   client can then apply the same verifier tests regardless of the data
   transfer method chosen (inline or direct).

           "This cookie must be consistent during a single instance
           of the NFS version 4 [DAFS] protocol service and must be
           unique between instances of the  NFS  version  4  [DAFS]
           protocol server, where uncommitted data may be lost.

           If a client writes data to the server  with  the  stable
           argument  set  to  UNSTABLE4  [UNSTABLE]  and  the reply
           yields a committed response of DATA_SYNC4  or  UNSTABLE4
           [DATA_SYNC  or UNSTABLE], the client will follow up some
           time in the  future  with  a  COMMIT  [DAFS_PROC_COMMIT]
           operation  to  synchronize outstanding asynchronous data
           and metadata with the server's stable  storage,  barring
           client error. It is possible that due to client crash or
           other error that a subsequent COMMIT  [DAFS_PROC_COMMIT]
           will not be received by the server.

           On success, the current filehandle retains  its  value."
           (RFC 3010, pp. 167-168)
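
   The client-side bookkeeping implied by the write verifier and the
   COMMIT follow-up can be sketched as two small predicates. The
   verifier layout is an assumption: dafs_verifier_type is not defined
   in this section, so the sketch uses a hypothetical 8-byte opaque
   value. needs_commit and server_instance_changed are likewise
   hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Taken from the ARGUMENTS section of DAFS_PROC_WRITE_INLINE. */
enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* Assumed size; the real dafs_verifier_type may differ. */
#define SKETCH_VERIFIER_SIZE 8

struct sketch_verifier {
    unsigned char bytes[SKETCH_VERIFIER_SIZE];
};

/*
 * Per the quoted text, a COMMIT follow-up is expected when the write
 * was requested as UNSTABLE and the reply reported a commitment level
 * of DATA_SYNC or UNSTABLE.
 */
bool needs_commit(enum stable_how requested, enum stable_how committed)
{
    return requested == UNSTABLE && committed != FILE_SYNC;
}

/*
 * Compare the verifier saved from an earlier write reply with the one
 * returned by a later write or commit. A difference indicates that the
 * server has changed state in between, so uncommitted data may have
 * been lost and would need to be sent again.
 */
bool server_instance_changed(const struct sketch_verifier *saved,
                             const struct sketch_verifier *latest)
{
    assert(saved != NULL && latest != NULL);
    return memcmp(saved->bytes, latest->bytes, SKETCH_VERIFIER_SIZE) != 0;
}
```

   Because DAFS servers use the same write verifiers for inline and
   direct writes, the same comparison applies whichever transfer
   method produced the saved verifier.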

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

   IMPLEMENTATION

           "It is possible for the server to write fewer than count
           [byte_count]  bytes  of  data.  In this case, the server
           should not return an error unless no data was written at
           all.   If the server writes less than count [byte_count]
           bytes,   the   client   should   issue   another   WRITE
           [DAFS_PROC_WRITE_INLINE] to write the remaining data.

           It is assumed that the act of writing  data  to  a  file
           will  cause the time_modified of the file to be updated.
           However, the time_modified of the  file  should  not  be
           changed  unless  the  contents  of the file are changed.
           Thus,  a  WRITE  [DAFS_PROC_WRITE_INLINE]  request  with
           count  [byte_count]   set  to  0  should  not  cause the
           time_modified of the file to be updated.

           The definition of stable storage has been historically a
           point  of contention.  The following expected properties
           of stable storage may help in resolving design issues in
           the implementation. Stable storage is persistent storage
           that survives:

              1. Repeated power failures.

              2. Hardware failures (of  any  board,  power  supply,
              etc.).

              3. Repeated software crashes, including reboot cycle.

           This definition does not address failure of  the  stable
           storage module itself.

           The verifier is defined to allow a client to detect dif-
           ferent  instances  of  an  NFS version 4 [DAFS] protocol
           server over which cached, uncommitted data may be  lost.
           In  the most likely case, the verifier allows the client
           to detect server reboots.  This information is  required
           so  that  the  client  can  safely determine whether the
           server could have lost cached data.  If the server fails
           unexpectedly  and  the  client has uncommitted data from
           previous WRITE [DAFS_PROC_WRITE_INLINE]  requests  (done
           with the stable argument set to UNSTABLE4 [UNSTABLE] and
           in which the result committed was returned as  UNSTABLE4
           [UNSTABLE]  as well) it may not have flushed cached data
           to stable storage. The burden  of  recovery  is  on  the
           client  and  the client will need to retransmit the data
           to the server.

           A suggested verifier would be to use the time  that  the
           server  was  booted  or  the  time  the  server was last
           started (if  restarting  the  server  without  a  reboot
           results in lost buffers).

           The committed field in the results allows the client  to
           do  more effective caching.  If the server is committing
           all WRITE requests to stable  storage,  then  it  should
           return  with  committed  set  to FILE_SYNC4 [FILE_SYNC],
           regardless of the value of the stable field in the argu-
           ments.  A  server  that  uses  an  NVRAM accelerator may
           choose to implement this policy.   The  client  can  use
           this  to increase the effectiveness of the cache by dis-
           carding cached data that has already been  committed  on
           the server.

           Some   implementations    may    return    NFS4ERR_NOSPC
           [DAFSERR_NOSPC] instead of NFS4ERR_DQUOT [DAFSERR_DQUOT]
           when a user's quota is exceeded." (RFC  3010,  pp.  168-
           169)
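
   The short-write rule above can be sketched as a client-side loop.
   This is a minimal illustration, not part of the protocol: write_fn,
   write_all, and the demo_* names are hypothetical, and the callback
   stands in for one DAFS_PROC_WRITE_INLINE exchange.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical transport hook: performs one WRITE exchange and stores the
 * number of bytes the server accepted in *written. Returns 0 on success or
 * a nonzero error status. Not part of the DAFS specification. */
typedef int (*write_fn)(void *ctx, uint64_t offset, const uint8_t *data,
                        uint32_t byte_count, uint32_t *written);

/* Write len bytes at offset, reissuing WRITE for any remainder after a
 * short write. Returns 0 on success or the first error status seen. */
static int write_all(write_fn do_write, void *ctx, uint64_t offset,
                     const uint8_t *data, uint32_t len)
{
    while (len > 0) {
        uint32_t written = 0;
        int status = do_write(ctx, offset, data, len, &written);
        if (status != 0)
            return status;
        if (written == 0 || written > len)
            return -1;          /* success with no progress is not expected */
        offset += written;
        data += written;
        len -= written;
    }
    return 0;
}

/* Demo stand-in for a server that accepts at most 4 bytes per request. */
struct demo_file { uint8_t bytes[64]; unsigned calls; };

static int demo_write(void *ctx, uint64_t offset, const uint8_t *data,
                      uint32_t byte_count, uint32_t *written)
{
    struct demo_file *f = ctx;
    uint32_t n = byte_count < 4 ? byte_count : 4;
    if (offset + n > sizeof f->bytes)
        return -1;
    memcpy(f->bytes + offset, data, n);
    f->calls++;
    *written = n;
    return 0;
}
```

   With the demo server, a 10-byte write is carried by three requests
   of 4, 4, and 2 bytes.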

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY


   DAFSERR_DENIED

   DAFSERR_DQUOT

   DAFSERR_EXPIRED

   DAFSERR_FBIG

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_LEASE_MOVED

   DAFSERR_LOCKED

   DAFSERR_MOVED

   DAFSERR_NOSPC

   DAFSERR_OLD_STATEID

   DAFSERR_RESOURCE

   DAFSERR_ROFS

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_STATEID

   DAFSERR_WRITE_TOOBIG


6.5.39.  DAFS_PROC_WRITE_DIRECT

   SUMMARY

   Initiates a write to a file using data retrieved via RDMA read
   directly from client memory buffers.

        enum stable_how
           {
           UNSTABLE                   = 0,
           DATA_SYNC                  = 1,
           FILE_SYNC                  = 2
           };


   ARGUMENTS

        struct DAFS_Write_Direct_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_state_id_type         state_id;
           dafs_uint64                offset;
           dafs_uint32                byte_count;
           stable_how                 stable_how;
           dafs_cache_hint_type       cache_hint;
           dafs_checksum_type         direct_checksum;
           dafs_direct_op_buffer      write_data_buffers<>;
           };

        /* DIRECT: opaque writedata[buffer_byte_count]; */


   RESULTS

        struct DAFS_Write_Direct_Res
           {
           dafs_uint32                count;
           stable_how                 committed;
           dafs_verifier_type         verifier;
           };


   DESCRIPTION

           "The WRITE [DAFS_PROC_WRITE_DIRECT] operation is used to


           write data to a regular file.  The target file is speci-
           fied by the current filehandle." (RFC 3010, p. 167)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.

           "The offset specifies the offset where the  data  should
           be  written.  An  offset  of 0 (zero) specifies that the
           write should start at the beginning of  the  file.   The
           count  [byte_count]  represents  the  number of bytes of
           data that are to be written.  If the count  [byte_count]
           is  0  (zero),  the WRITE [DAFS_PROC_WRITE_DIRECT]  will
           succeed and return a count of 0 (zero) subject  to  per-
           missions checking.  The server may choose to write fewer
           bytes than requested by the client.

           Part of the write request is a specification of how  the
           write is to be performed.  The client specifies with the
           stable parameter the method of how the  data  is  to  be
           processed  by  the  server.   If  stable  is  FILE_SYNC4
           [FILE_SYNC], the server must  commit  the  data  written
           plus  all  file system metadata to stable storage before
           returning results.  This corresponds to the NFS  version
           2  protocol semantics.  Any other behavior constitutes a
           protocol violation. If stable is DATA_SYNC4 [DATA_SYNC],
           then  the  server  must commit all of the data to stable
           storage and enough of the metadata to retrieve the  data
           before  returning.   The  server  implementor is free to
           implement DATA_SYNC4 [DATA_SYNC] in the same fashion  as
           FILE_SYNC4  [FILE_SYNC], but with a possible performance
           drop.  If stable is UNSTABLE4 [UNSTABLE], the server  is
           free  to commit any part of the data and the metadata to
           stable storage, including all or none, before  returning
           a  reply to the client. There is no guarantee whether or
           when any uncommitted data will subsequently be committed
           to  stable  storage.  The  only  guarantees  made by the
           server are that it will not  destroy  any  data  without
           changing  the  value of verf and that it will not commit
           the  data  and  metadata  at  a  level  less  than  that
           requested by the client.

           The stateid returned from  a  previous  record  lock  or
           share  reservation  request  is  provided as part of the
           argument.  The stateid is used by the server  to  verify
           that  the  associated  lock is still valid and to update
           lease timeouts for the client." (RFC 3010, p. 167)

   DAFS servers renew leases whenever any DAFS request (including NULL)
   is received from a client. Leases are renewed based on the client
   associated with the session on which the request is received.

           "Upon successful completion, the following  results  are
           returned.  The  count  result  is the number of bytes of
           data written to the file. The  server  may  write  fewer
           bytes  than requested. If so, the actual number of bytes
           written starting at location, offset, is returned.

           The server also returns an indication of  the  level  of
           commitment  of  the  data and metadata via committed. If
           the server committed all data  and  metadata  to  stable
           storage,   committed   should   be   set  to  FILE_SYNC4
           [FILE_SYNC]. If the level of commitment was at least  as
           strong  as DATA_SYNC4 [DATA_SYNC], then committed should
           be set to DATA_SYNC4 [DATA_SYNC].  Otherwise,  committed
           must  be returned as UNSTABLE4 [UNSTABLE]. If stable was
           FILE_SYNC4 [FILE_SYNC],  then  committed  must  also  be
           FILE_SYNC4 [FILE_SYNC]: anything else constitutes a pro-
           tocol violation. If stable was  DATA_SYNC4  [DATA_SYNC],
           then   committed   may   be  FILE_SYNC4   or  DATA_SYNC4
           [FILE_SYNC or DATA_SYNC]: anything  else  constitutes  a
           protocol  violation. If stable was UNSTABLE4 [UNSTABLE],
           then committed may be either FILE_SYNC4, DATA_SYNC4,  or
           UNSTABLE4 [FILE_SYNC, DATA_SYNC, or UNSTABLE].

           The final portion of the result is the  write  verifier,
           verf [verifier]. The write verifier is a cookie that the
           client can use  to  determine  whether  the  server  has
           changed     state    between    a    call    to    WRITE
           [DAFS_PROC_WRITE_DIRECT] and a subsequent call to either
           WRITE       [DAFS_PROC_WRITE_DIRECT]      or      COMMIT
           [DAFS_PROC_COMMIT]." (RFC 3010, pp. 167-168)

   DAFS servers use the same write verifiers during a single DAFS server
   instance, whether the write operation is done INLINE or DIRECT. The
   client can then apply the same verifier tests regardless of the data
   transfer method chosen (inline or direct).

           "This cookie must be consistent during a single instance
           of the NFS version 4 [DAFS] protocol service and must be
           unique between instances of the  NFS  version  4  [DAFS]
           protocol server, where uncommitted data may be lost.

           If a client writes data to the server  with  the  stable
           argument  set  to  UNSTABLE4  [UNSTABLE]  and  the reply
           yields a committed response of DATA_SYNC4  or  UNSTABLE4
           [DATA_SYNC  or UNSTABLE], the client will follow up some
           time in the  future  with  a  COMMIT  [DAFS_PROC_COMMIT]
           operation  to  synchronize outstanding asynchronous data
           and metadata with the server's stable  storage,  barring
           client error. It is possible that due to client crash or
           other error that a subsequent COMMIT  [DAFS_PROC_COMMIT]
           will not be received by the server.

           On success, the current filehandle retains  its  value."
           (RFC 3010, pp. 167-168)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences
   in filehandle management in DAFS procedures.
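
   The verifier rules above can be sketched from the client's point of
   view. This is an illustration under stated assumptions: the 8-byte
   verifier size is borrowed from NFS version 4, and pending_writes,
   note_write_verifier, and commit_confirms are hypothetical names.
   Because the same verifier is used for inline and direct writes, one
   record covers both transfer methods.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed 8-byte opaque verifier, as in NFS version 4; the actual size of
 * dafs_verifier_type is defined elsewhere in the specification. */
#define VERIFIER_SIZE 8
typedef struct { uint8_t bytes[VERIFIER_SIZE]; } verifier_t;

/* Client-side record of unstable writes awaiting COMMIT (hypothetical). */
struct pending_writes {
    bool       have_verifier;   /* false until the first unstable write */
    verifier_t verifier;        /* verifier seen on those writes */
};

/* Record the verifier from a WRITE_INLINE or WRITE_DIRECT reply. Returns
 * true if it differs from the one already held, meaning earlier
 * uncommitted data may have been lost and must be rewritten. */
static bool note_write_verifier(struct pending_writes *p, const verifier_t *v)
{
    bool changed = p->have_verifier &&
                   memcmp(p->verifier.bytes, v->bytes, VERIFIER_SIZE) != 0;
    p->verifier = *v;
    p->have_verifier = true;
    return changed;
}

/* Compare the verifier from a COMMIT reply with the one held. Returns
 * true if the server instance is unchanged since the pending writes. */
static bool commit_confirms(const struct pending_writes *p, const verifier_t *v)
{
    return p->have_verifier &&
           memcmp(p->verifier.bytes, v->bytes, VERIFIER_SIZE) == 0;
}
```

   A client would retransmit its uncommitted data whenever either
   function reports a verifier mismatch.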

   IMPLEMENTATION

           "It is possible for the server to write fewer than count
           [byte_count]  bytes  of  data.  In this case, the server
           should not return an error unless no data was written at
           all.   If the server writes less than count [byte_count]
           bytes,   the   client   should   issue   another   WRITE
           [DAFS_PROC_WRITE_DIRECT] to write the remaining data.

           It is assumed that the act of writing  data  to  a  file
           will  cause the time_modified of the file to be updated.
           However, the time_modified of the  file  should  not  be
           changed  unless  the  contents  of the file are changed.
           Thus,  a  WRITE  [DAFS_PROC_WRITE_DIRECT]  request  with
           count  [byte_count]  set  to  0  should  not  cause  the
           time_modified of the file to be updated.

           The definition of stable storage has been historically a
           point  of contention.  The following expected properties
           of stable storage may help in resolving design issues in
           the implementation. Stable storage is persistent storage
           that survives:

              1. Repeated power failures.

              2. Hardware failures (of  any  board,  power  supply,
              etc.).

              3. Repeated software crashes, including reboot cycle.

           This definition does not address failure of  the  stable
           storage module itself.

           The verifier is defined to allow a client to detect dif-
           ferent  instances  of  an  NFS version 4 [DAFS] protocol
           server over which cached, uncommitted data may be  lost.
           In  the most likely case, the verifier allows the client
           to detect server reboots.  This information is  required
           so  that  the  client  can  safely determine whether the
           server could have lost cached data.  If the server fails
           unexpectedly  and  the  client has uncommitted data from
           previous WRITE [DAFS_PROC_WRITE_DIRECT]  requests  (done
           with the stable argument set to UNSTABLE4 [UNSTABLE] and
           in which the result committed was returned as  UNSTABLE4
           [UNSTABLE]  as well) it may not have flushed cached data
           to stable storage. The burden  of  recovery  is  on  the
           client  and  the client will need to retransmit the data
           to the server.

           A suggested verifier would be to use the time  that  the
           server  was  booted  or  the  time  the  server was last
           started (if  restarting  the  server  without  a  reboot
           results in lost buffers).

           The committed field in the results allows the client  to
           do  more effective caching.  If the server is committing
           all WRITE requests to stable  storage,  then  it  should
           return  with  committed  set  to FILE_SYNC4 [FILE_SYNC],
           regardless of the value of the stable field in the argu-
           ments.  A  server  that  uses  an  NVRAM accelerator may
           choose to implement this policy.   The  client  can  use
           this  to increase the effectiveness of the cache by dis-
           carding cached data that has already been  committed  on
           the server.

           Some   implementations    may    return    NFS4ERR_NOSPC
           [DAFSERR_NOSPC] instead of NFS4ERR_DQUOT [DAFSERR_DQUOT]
           when a user's quota is exceeded." (RFC  3010,  pp.  168-
           169)
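
   The relationship between the requested stable level and the returned
   committed level can be sketched as a small client-side check. The
   write_outcome values and classify_write are hypothetical names used
   only for illustration; the stable_how values are those defined for
   this procedure.

```c
#include <stdbool.h>

/* Values as defined by the stable_how enumeration. */
enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* Hypothetical client-side outcomes; not defined by DAFS. */
enum write_outcome {
    WRITE_VIOLATION,      /* committed is weaker than requested, or invalid */
    WRITE_DISCARD_OK,     /* FILE_SYNC: cached copy may be discarded */
    WRITE_DATA_STABLE,    /* DATA_SYNC was requested and honored */
    WRITE_COMMIT_PENDING  /* UNSTABLE request: follow up with COMMIT */
};

static enum write_outcome classify_write(enum stable_how requested,
                                         enum stable_how committed)
{
    /* The server must not commit at a level below the one requested. */
    if (committed < requested || committed > FILE_SYNC)
        return WRITE_VIOLATION;
    /* Fully committed data no longer needs to stay in the client cache. */
    if (committed == FILE_SYNC)
        return WRITE_DISCARD_OK;
    /* An UNSTABLE request answered below FILE_SYNC needs a later COMMIT. */
    if (requested == UNSTABLE)
        return WRITE_COMMIT_PENDING;
    return WRITE_DATA_STABLE;
}
```

   In this sketch, a server that always returns FILE_SYNC (for example,
   one using an NVRAM accelerator) lets the client discard cached data
   after every write.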

   ERRORS

   DAFSERR_ACCES

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_CHAIN_FORM

   DAFSERR_CHAIN_BROKEN

   DAFSERR_DELAY


   DAFSERR_DENIED

   DAFSERR_DQUOT

   DAFSERR_EXPIRED

   DAFSERR_FBIG

   DAFSERR_FHEXPIRED

   DAFSERR_GRACE

   DAFSERR_INVAL

   DAFSERR_IO

   DAFSERR_LEASE_MOVED

   DAFSERR_LOCKED

   DAFSERR_MOVED

   DAFSERR_NOSPC

   DAFSERR_OLD_STATEID

   DAFSERR_RESOURCE

   DAFSERR_ROFS

   DAFSERR_SERVERFAULT

   DAFSERR_STALE

   DAFSERR_STALE_STATEID

   DAFSERR_WRITE_TOOBIG


6.6.  Back-Control Directives

   This section describes the individual operations that a server can
   submit to the client, along with the formats of the arguments portion
   of the request and the results portion of the response. In this sec-
   tion, the requests go from server to client and the replies from
   client to server.

6.6.1.  DAFS_PROC_BC_NULL

   SUMMARY

   No operation.

   ARGUMENTS

   None.

   RESULTS

   None.

   DESCRIPTION

           "Standard NULL procedure.  Void argument, void response.
           Even  though there is no direct functionality associated
           with  this  procedure,  the  server  will  use   CB_NULL
           [DAFS_PROC_BC_NULL]  to  confirm the existence of a path
           for RPCs from server to client." (RFC 3010, p. 102)

   DAFS does not use RPC. A server considers a path to the client to
   exist when a session contains a bound back channel.

   ERRORS

   None.


6.6.2.  DAFS_PROC_BC_BATCH_COMPLETION

   SUMMARY

   Notifies the client of completed batch I/O operations.

   ARGUMENTS

        struct DAFS_Batch_Submit_Res
           {
           dafs_completion_notification_type completions<>; /* heap */
           };


   RESULTS

   None.

   DESCRIPTION

   The DAFS_PROC_BC_BATCH_COMPLETION back-control directive is used by
   the server to notify the client that one or more outstanding batch
   I/O requests have completed.

   For each completed I/O request, the server returns the batch request
   ID from the original client I/O request, its completion status, the
   number of bytes transferred, and in the case of reads, an optional
   checksum. For write requests, or all requests if checksumming is not
   enabled, the server sets the checksum field to 0. Note that the
   server MAY return short reads or writes, in which case the status is
   successful but the byte count varies from the original request.

   The returned completions MAY have been initiated by any number of
   prior batch operations; they need not come from a single batch
   submission. If the client has set "num_completions" in a
   DAFS_PROC_BATCH_SUBMIT operation, the server SHOULD return
   completions in batches of the specified size, but it is NOT REQUIRED
   to return that number and MAY return any number of completions at
   once (subject to the negotiated maximum message size). In particular,
   if a server has completed all outstanding batch I/O requests, it
   SHOULD NOT delay reporting those completions over the back channel,
   regardless of the client's desired batch completion size.
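
   A client might apply a received set of completions as sketched here.
   The struct layouts and field names are assumptions made for this
   example; the real dafs_completion_notification_type is defined
   elsewhere in this specification. The sketch matches completions to
   outstanding requests by batch request ID, without assuming that they
   come from a single submission, and records the remainder left by a
   short read or write.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of one completion notification. */
struct completion {
    uint64_t request_id;   /* batch request ID from the original request */
    uint32_t status;       /* 0 (DAFS_STATUS_OK) or a DAFSERR_* value */
    uint32_t byte_count;   /* bytes actually transferred */
    uint32_t checksum;     /* 0 for writes or when checksums are disabled */
};

/* Hypothetical client-side record of an outstanding batch request. */
struct outstanding {
    uint64_t request_id;
    uint32_t requested;    /* bytes originally requested */
    uint32_t remaining;    /* bytes still to transfer after completion */
    int      done;         /* nonzero once a completion has been applied */
    uint32_t status;
};

/* Apply one back-channel message's completions to the outstanding table.
 * Returns the number of completions whose request ID was not found. */
static size_t apply_completions(struct outstanding *reqs, size_t nreqs,
                                const struct completion *c, size_t nc)
{
    size_t unknown = 0;
    for (size_t i = 0; i < nc; i++) {
        struct outstanding *r = NULL;
        for (size_t j = 0; j < nreqs; j++) {
            if (!reqs[j].done && reqs[j].request_id == c[i].request_id) {
                r = &reqs[j];
                break;
            }
        }
        if (r == NULL) {
            unknown++;
            continue;
        }
        r->done = 1;
        r->status = c[i].status;
        /* A successful short read or write leaves a remainder to reissue. */
        if (c[i].status == 0 && c[i].byte_count < r->requested)
            r->remaining = r->requested - c[i].byte_count;
        else
            r->remaining = 0;
    }
    return unknown;
}
```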

   ERRORS


6.6.3.  DAFS_PROC_BC_GETATTR

   SUMMARY

   Requests the attributes of a file that has been delegated to a
   client.

   ARGUMENTS

        struct DAFS_BC_Getattr_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_attr_bitmap_type      attr_request_bitmap;
           };


   RESULTS

        struct DAFS_BC_Getattr_Res
           {
           dafs_file_attr_type        obj_attributes;      /* heap */
           };


   DESCRIPTION

           "The CB_GETATTR [DAFS_PROC_BC_GETATTR] operation is used
           to obtain the attributes modified by an open delegate to
           allow    the    server    to    respond    to    GETATTR
           [DAFS_PROC_GETATTR_INLINE and DAFS_PROC_GETATTR_DIRECT]
           requests for a file which is the subject of an open
           delegation.

           If the handle specified is not one for which the  client
           holds  a  write  open  delegation,  an NFS4ERR_BADHANDLE
           [DAFSERR_BADHANDLE] error is returned."  (RFC  3010,  p.
           173)

   IMPLEMENTATION

           "The client returns attrbits and the  associated  attri-
           bute  values  only  for  attributes  that  it may change
           (change, time_modify, object_size)." (RFC 3010, p. 173)

   See 4.1.3.3., "Attribute Bitmaps" for a description of attribute
   encoding in DAFS.
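
   The restriction on returned attributes amounts to masking the
   requested bitmap. The sketch below assumes, purely for illustration,
   that the bitmap fits in one 64-bit word and uses made-up bit
   positions; the actual encoding is the one given in 4.1.3.3.

```c
#include <stdint.h>

/* Hypothetical bit assignments for illustration only; the real encoding
 * is given in 4.1.3.3., "Attribute Bitmaps". */
#define ATTR_CHANGE       (UINT64_C(1) << 0)
#define ATTR_OBJECT_SIZE  (UINT64_C(1) << 1)
#define ATTR_TIME_MODIFY  (UINT64_C(1) << 2)
#define ATTR_OWNER        (UINT64_C(1) << 3)   /* not changeable by delegate */

/* Reduce a requested bitmap to the attributes an open delegate may
 * change, which are the only ones the client returns. */
static uint64_t delegate_reply_bitmap(uint64_t requested)
{
    return requested & (ATTR_CHANGE | ATTR_OBJECT_SIZE | ATTR_TIME_MODIFY);
}
```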


   ERRORS

   DAFSERR_BADHANDLE

   DAFSERR_RESOURCE


6.6.4.  DAFS_PROC_BC_RECALL

   SUMMARY

   Recalls an open delegation.

   ARGUMENTS

        struct DAFS_BC_Recall_Args
           {
           dafs_filehandle_type       filehandle;
           dafs_state_id_type         state_id;
           dafs_uint32                truncate;
           };


   RESULTS

   None.

   DESCRIPTION

           "The CB_RECALL [DAFS_PROC_BC_RECALL] operation  is  used
           to begin the process of recalling an open delegation and
           returning it to the server.

           The truncate flag is used to optimize recall for a  file
           which is about to be truncated to zero.  When it is set,
           the client is freed of obligation to propagate  modified
           data  for  the  file  to  the server, since this data is
           irrelevant.

           If the handle specified is not one for which the  client
           holds   an   open   delegation,   an   NFS4ERR_BADHANDLE
           [DAFSERR_BADHANDLE] error is returned.

           If  the  stateid  [state_id]  specified   is   not   one
           corresponding  to an open delegation for the file speci-
           fied   by   the   filehandle,   an   NFS4ERR_BAD_STATEID
           [DAFSERR_BAD_STATEID]  is returned." (RFC 3010, pp. 173-
           174)

   IMPLEMENTATION

           "The client should reply to  the  callback  immediately.
           Replying  does  not  complete the recall.  The recall is
           not complete until the delegation is  returned  using  a
           DELEGRETURN [DAFS_PROC_DELEGRETURN]." (RFC 3010, p. 174)
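
   The work that remains after the client has replied to the recall can
   be sketched as a short plan. The names recall_step and plan_recall
   are hypothetical, and the sketch assumes the reply itself has already
   been sent.

```c
#include <stdbool.h>

/* Hypothetical steps a client might take after replying to a recall. */
enum recall_step { STEP_FLUSH_DIRTY_DATA, STEP_DELEGRETURN };

/* Fill steps[] (capacity at least 2) with the work that remains after the
 * client has already replied to DAFS_PROC_BC_RECALL. Returns the count. */
static int plan_recall(bool truncate, bool has_dirty_data,
                       enum recall_step steps[2])
{
    int n = 0;
    /* With truncate set, modified data need not be propagated. */
    if (has_dirty_data && !truncate)
        steps[n++] = STEP_FLUSH_DIRTY_DATA;
    /* The recall completes only when the delegation is returned. */
    steps[n++] = STEP_DELEGRETURN;
    return n;
}
```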

   ERRORS

   DAFSERR_BADHANDLE

   DAFSERR_BAD_STATEID

   DAFSERR_RESOURCE


7.  Error Status Result Codes

   If a DAFS operation request fails, an error status will be entered
   into the reply message header status field.

   The following is a list of error codes and their numeric values:


        const DAFS_STATUS_OK                      =     0;
        const DAFSERR_PERM                        =     1;
        const DAFSERR_NOENT                       =     2;
        const DAFSERR_IO                          =     5;
        const DAFSERR_NXIO                        =     6;
        const DAFSERR_ACCES                       =    13;
        const DAFSERR_EXIST                       =    17;
        const DAFSERR_XDEV                        =    18;
        const DAFSERR_NODEV                       =    19;
        const DAFSERR_NOTDIR                      =    20;
        const DAFSERR_ISDIR                       =    21;
        const DAFSERR_INVAL                       =    22;
        const DAFSERR_FBIG                        =    27;
        const DAFSERR_NOSPC                       =    28;
        const DAFSERR_ROFS                        =    30;
        const DAFSERR_MLINK                       =    31;
        const DAFSERR_NAMETOOLONG                 =    63;
        const DAFSERR_NOTEMPTY                    =    66;
        const DAFSERR_DQUOT                       =    69;
        const DAFSERR_STALE                       =    70;
        const DAFSERR_BADHANDLE                   = 10001;
        const DAFSERR_BAD_COOKIE                  = 10003;
        const DAFSERR_NOTSUPP                     = 10004;
        const DAFSERR_TOOSMALL                    = 10005;
        const DAFSERR_SERVERFAULT                 = 10006;
        const DAFSERR_BADTYPE                     = 10007;
        const DAFSERR_DELAY                       = 10008;
        const DAFSERR_SAME                        = 10009;
        const DAFSERR_DENIED                      = 10010;
        const DAFSERR_EXPIRED                     = 10011;
        const DAFSERR_LOCKED                      = 10012;
        const DAFSERR_GRACE                       = 10013;
        const DAFSERR_FHEXPIRED                   = 10014;
        const DAFSERR_SHARE_DENIED                = 10015;
        const DAFSERR_WRONGSEC                    = 10016;
        const DAFSERR_CLID_INUSE                  = 10017;
        const DAFSERR_RESOURCE                    = 10018;
        const DAFSERR_MOVED                       = 10019;
        const DAFSERR_NOFILEHANDLE                = 10020;
        const DAFSERR_MINOR_VERS_MISMATCH         = 10021;
        const DAFSERR_STALE_CLIENTID              = 10022;
        const DAFSERR_STALE_STATEID               = 10023;
        const DAFSERR_OLD_STATEID                 = 10024;
        const DAFSERR_BAD_STATEID                 = 10025;
        const DAFSERR_BAD_SEQID                   = 10026;
        const DAFSERR_NOT_SAME                    = 10027;
        const DAFSERR_LOCK_RANGE                  = 10028;
        const DAFSERR_SYMLINK                     = 10029;
        const DAFSERR_READDIR_NOSPC               = 10030;
        const DAFSERR_LEASE_MOVED                 = 10031;
        const DAFSERR_ILLEGAL_PROT                = 15002;
        const DAFSERR_ILLEGAL_STATE               = 15003;
        const DAFSERR_UNKNOWN_SESSION             = 15004;
        const DAFSERR_NOXID_MATCH                 = 15005;
        const DAFSERR_NOT_AUTHORIZED              = 15006;
        const DAFSERR_NOT_FOUND                   = 15007;
        const DAFSERR_RDMA_READ_CHANNEL_UNUSABLE  = 15008;
        const DAFSERR_CHAIN_FORM                  = 15009;
        const DAFSERR_CHAIN_BROKEN                = 15010;
        const DAFSERR_GSS_CONTINUE_INIT           = 15011;
        const DAFSERR_BAD_SESSION                 = 15012;
        const DAFSERR_NO_CREDS                    = 15013;
        const DAFSERR_CRHAND_CONFLICT             = 15014;
        const DAFSERR_DENYDISP_CONFLICT           = 15015;
        const DAFSERR_DENYDISP_NOTSUPP            = 15016;
        const DAFSERR_KEY_MISMATCH                = 15017;
        const DAFSERR_WRITE_TOOBIG                = 15018;
        const DAFSERR_BACK_CHANNEL_UNUSABLE       = 15019;
        const DAFSERR_CHKSUM                      = 15020;
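
   The values above fall into three numeric ranges. The helper below is
   a sketch that reports the range of a status value; the range names
   are hypothetical labels chosen for this example, and the function
   does not check that a value is actually defined within its range.

```c
#include <stdint.h>

/* Hypothetical labels for the numeric ranges seen in the list above. */
enum dafs_status_range {
    DAFS_RANGE_OK,             /* 0: DAFS_STATUS_OK */
    DAFS_RANGE_BASIC,          /* 1..70: DAFSERR_PERM .. DAFSERR_STALE */
    DAFS_RANGE_NFS4_STYLE,     /* 10001..10031: values shared with the
                                  corresponding NFS4ERR_* codes */
    DAFS_RANGE_DAFS_SPECIFIC,  /* 15002..15020: session, chaining, etc. */
    DAFS_RANGE_UNKNOWN
};

static enum dafs_status_range dafs_status_range(uint32_t status)
{
    if (status == 0)
        return DAFS_RANGE_OK;
    if (status >= 1 && status <= 70)
        return DAFS_RANGE_BASIC;
    if (status >= 10001 && status <= 10031)
        return DAFS_RANGE_NFS4_STYLE;
    if (status >= 15002 && status <= 15020)
        return DAFS_RANGE_DAFS_SPECIFIC;
    return DAFS_RANGE_UNKNOWN;
}
```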


   The following list gives the name and description of each DAFS error.

   DAFS_STATUS_OK

      Indicates the operation completed successfully.

   DAFSERR_ACCES

      Permission denied. The caller does not have the correct permission
      to perform the requested operation. Contrast this with
      DAFSERR_PERM, which restricts itself to owner or privileged user
      permission failures.

   DAFSERR_BADHANDLE

      Illegal DAFS file handle. The file handle failed internal con-
      sistency checks.

   DAFSERR_BADTYPE

      An attempt was made to create an object of a type not supported by
      the server.


   DAFSERR_BAD_COOKIE

      READDIR cookie is stale.

   DAFSERR_BAD_STATEID

      A State-id generated by the current server instance, but which
      does not designate any locking state (either current or super-
      seded) for a current lockowner-file pair, was used.

   DAFSERR_BATCH_REQUEST_NOT_FOUND

      The server does not have an outstanding asynchronous
      dafs_read_write_request on a session when it receives a hurry up
      request.

   DAFSERR_CHAIN_BROKEN

      A chained operation was received after a previous chain operation
      failed, breaking the current DAFS chain.

   DAFSERR_CHAIN_FORM

      Error returned when a chained operation does not adhere to the
      chaining rules stated in the chaining section of this spec.

   DAFSERR_CHKSUM

      A checksum mismatch error occurred.

   DAFSERR_CLID_INUSE

      The client id is already in use by another client.

   DAFSERR_DELAY

      The server initiated the request, but was not able to complete it
      in a timely fashion. The client SHOULD wait and then retry the
      request. For example, this error is returned from a server that
      supports hierarchical storage and receives a request to process a
      file that has been migrated. In this case, the server SHOULD start
      the immigration process and respond to the client with this error.
      This error MAY also occur when a necessary delegation recall makes
      processing a request in a timely fashion impossible.
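
      The wait-and-retry behavior can be sketched as below. The
      callback types, retry_on_delay, and the demo_* stubs are
      hypothetical, and the doubling wait is one possible policy chosen
      for this example rather than a requirement of the protocol.

```c
#include <stdint.h>

#define DAFSERR_DELAY 10008   /* value from the error code list */

/* Hypothetical hooks: issue the request once, and wait for a while. */
typedef uint32_t (*request_fn)(void *ctx);
typedef void (*sleep_fn)(void *ctx, unsigned milliseconds);

/* Reissue a request while the server answers DAFSERR_DELAY, doubling the
 * wait each time, for at most max_attempts tries. Returns the last status
 * received. */
static uint32_t retry_on_delay(request_fn issue, sleep_fn wait, void *ctx,
                               unsigned max_attempts, unsigned first_wait_ms)
{
    uint32_t status = DAFSERR_DELAY;
    unsigned wait_ms = first_wait_ms;
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        status = issue(ctx);
        if (status != DAFSERR_DELAY)
            break;
        if (attempt + 1 < max_attempts) {
            wait(ctx, wait_ms);
            wait_ms *= 2;
        }
    }
    return status;
}

/* Demo stubs: a server that reports DELAY twice, then succeeds. */
struct demo_state { unsigned calls; unsigned slept_ms; };

static uint32_t demo_issue(void *ctx)
{
    struct demo_state *s = ctx;
    return ++s->calls <= 2 ? DAFSERR_DELAY : 0;
}

static void demo_wait(void *ctx, unsigned ms)
{
    struct demo_state *s = ctx;
    s->slept_ms += ms;          /* record the wait instead of sleeping */
}
```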

   DAFSERR_DENIED

      An attempt to lock a file is denied. Since this MAY be a temporary
      condition, the client is encouraged to retry the lock request
      until the lock is accepted.

   DAFSERR_DENYDISP_CONFLICT

      This error is returned from a REMOVE request because of a conflict
      in dispositions regarding the removal of open files.

   DAFSERR_DENYDISP_NOTSUPP

      This error is returned from an OPEN or REMOVE request when the
      server is unable to support the requested remove dispositions.

   DAFSERR_DQUOT

      Resource (quota) hard limit exceeded. The user's resource limit on
      the server has been exceeded.

   DAFSERR_EXIST

      File exists. The file specified already exists.

   DAFSERR_EXPIRED

      A lease that is being used in the current procedure has expired.

   DAFSERR_FBIG

      File is too large. The operation would have caused a file to grow
      beyond the server's limit.

   DAFSERR_FHEXPIRED

      The file handle provided is volatile and has expired at the
      server.

   DAFSERR_GRACE

      The server is in its recovery or grace period.

   DAFSERR_GSS_CONTINUE_INIT

      The reply message is an intermediate result of a multi-step
      sequence of GSS authentication message exchanges.

   DAFSERR_ILLEGAL_PROT

      Protocol version is invalid for the client. The client SHOULD
      retry with a lower protocol version number. Protocol_version con-
      tains a suggested protocol version that is supported.

   DAFSERR_ILLEGAL_STATE

      The protocol has already been negotiated.

   DAFSERR_INVAL

      Invalid argument or unsupported argument for an operation. Two
      examples are attempting a READLINK on an object other than a sym-
      bolic link or attempting to SETATTR a time field on a server that
      does not support this operation.

   DAFSERR_IO

      I/O error. A hard error (for example, a disk error) occurred
      while processing the requested operation.

   DAFSERR_ISDIR

      Is a directory. The caller specified a directory in a non-
      directory operation.

   DAFSERR_KEY_MISMATCH

      Attempt to obtain a KEY SHARE reservation is denied because a KEY
      SHARE reservation already exists with a different key.

   DAFSERR_LEASE_MOVED

      A lease being renewed is associated with a file system that has
      been migrated to a new server.

   DAFSERR_LOCKED

      A read or write operation was attempted on a locked file.

   DAFSERR_LOCK_BROKEN

      A lock request or lock test was attempted on a persistent lock
      that has been marked as broken.

   DAFSERR_LOCK_RANGE

      A lock request is operating on a sub-range of a current lock for
      the lock owner, and the server does not support this type of
      request.

   DAFSERR_MLINK

      Too many hard links.

   DAFSERR_MOVED

      The file system that contains the current filehandle object has
      been relocated or migrated to another server. The client MAY
      obtain the new file system location by obtaining the
      "fs_locations" attribute for the current filehandle.

   DAFSERR_NAMETOOLONG

      The filename in an operation was too long.

   DAFSERR_NODEV

      No such device.

   DAFSERR_NOENT

      No such file or directory. The file or directory name specified
      does not exist.

   DAFSERR_NOFILEHANDLE

      The specified file handle value is invalid.

   DAFSERR_NOSPC

      No space left on device. The operation would have caused the
      server's file system to exceed its limit.

   DAFSERR_NOTDIR

      Not a directory. The caller specified a non-directory in a direc-
      tory operation.

   DAFSERR_NOTEMPTY

      An attempt was made to remove a directory that was not empty.

   DAFSERR_NOTSUPP

      Operation is not supported.

   DAFSERR_NOT_AUTHORIZED

      Authorization failed.

   DAFSERR_NOT_FOUND

      The specific file does not exist.

   DAFSERR_NOT_SAME

      This error is returned by the VERIFY operation to signify that the
      attributes compared were not the same as the attributes provided
      in the client's request.

   DAFSERR_NOXID_MATCH

      The specified request has no Response Cache entry. Generally this
      indicates that the request was not executed before the Session was
      disconnected.

   DAFSERR_NXIO

      I/O error. No such device or address.

   DAFSERR_OLD_STATEID

      A State-id that designates the locking state for a lockowner-file
      at an earlier time was used.

   DAFSERR_PERM

      Not owner. The operation was not allowed because the caller is
      either not a privileged user (root) or not the owner of the target
      of the operation.

   DAFSERR_PREFETCH_NOT_SUPPORTED

      The server does not support the prefetch cache hint.

   DAFSERR_RDMA_READ_CHANNEL_UNUSABLE

      The request requires the use of the RDMA-read Channel, but that
      channel has not been established.

   DAFSERR_READDIR_NOSPC

      The encoded response to a READDIR request exceeds the size limit
      set by the initial request.

   DAFSERR_RESOURCE

      While processing a member of a chained set of operations, the
      server exhausted its available resources and cannot continue
      processing procedures within the chain.

   DAFSERR_ROFS

      Read-only file system. A modifying operation was attempted on a
      read-only file system.

   DAFSERR_SAME

      This error is returned by the NVERIFY operation to signify that
      the attributes compared were the same as the attributes provided
      in the client's request.

   DAFSERR_SERVERFAULT

      An error occurred on the server that does not map to any of the
      legal DAFS protocol error values. The client SHOULD translate this
      into an appropriate error. UNIX clients MAY choose to translate
      this to EIO.

   DAFSERR_SHARE_DENIED

      An attempt to OPEN a file with a share reservation has failed
      because of a share conflict.

   DAFSERR_STALE

      Invalid file handle. The file handle given as an argument was
      invalid. The file referred to by that file handle no longer exists
      or access to it has been revoked.

   DAFSERR_STALE_CLIENTID

      The Client-id specified as an argument is not recognized by the
      server.

   DAFSERR_STALE_STATEID

      A State-id generated by an earlier server instance was used.

   DAFSERR_SYMLINK

      The current file handle provided for a LOOKUP is not a directory
      but a symbolic link. The final component of the OPEN path is a
      symbolic link.

   DAFSERR_TOOSMAL

      Buffer size is too small.

   DAFSERR_UNKNOWN_SESSION

      The Session-id specified in the request is not known to the
      server.

   DAFSERR_VERS_MISMATCH

      The DAFS server does not support the specified version.

   DAFSERR_WRITE_TOOBIG

      A request to write data to a file exceeds the maximum allowed I/O
      size for the target server.

   DAFSERR_WRONGSEC

      The security mechanism being used by the client for the procedure
      does not match the server's security policy. The client SHOULD
      change the security mechanism being used and retry the operation.

   DAFSERR_XDEV

      Attempted to perform a cross-device hard link.

8.  Security and IANA Considerations

8.1.  Security Considerations

   The key security concern for DAFS is authenticating clients.  This
   issue is discussed in 3.1.1., "Security Model".

8.2.  IANA Considerations

   Like NFS version 4 (as specified in RFC 3010), DAFS includes the use
   of named attributes.

9.  Bibliography

   [Christianson]

      N. Christenson, T. Bosserman, D. Beckemeyer, "A Highly Scalable
      Electronic Mail Service Using Open Systems", First Usenix Sympo-
      sium on Internet Technologies and Systems, December 1997.

   [Dicecco]

      S. DiCecco, J. Williams, B. Terrell, J. Scott, C. Sapuntzakis, "VI
      / TCP (Internet VI)", http://www.ietf.org/internet-drafts/draft-
      dicecco-vitcp-01.txt

   [Fletcher]

      Fletcher, An Arithmetic Checksum for Serial Transmission, IEEE
      Transactions on Communications, Volume COM-30, No. 1, January
      1982, pp. 247-252.

   [IB]

      "InfiniBand(TM) Architecture Specification Release 1.0",
      InfiniBand Trade Association(SM).

   [Linn]

      J. Linn, "Generic Security Service Application Program Interface,
      Version 2, Update 1", IETF RFC 2743,
      http://www.ietf.org/rfc/rfc2743.txt

   [POSIX]

      IEEE Standard 1003.1 (POSIX.1)

   [RFC1813]

      Callaghan, B., Pawlowski, B. and P. Staubach, "NFS Version 3 Pro-
      tocol Specification", RFC 1813, June 1995.

   [RFC2277]

      Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP
      18, RFC 2277, January 1998.

   [Sklower]

      Sklower, Improving the Efficiency of the OSI Checksum Calculation,
      http://www.cs.berkeley.edu/~sklower/cksmosi.ps

   [Shepler]

      S. Shepler, C. Beame, B. Callaghan, M. Eisler, D. Noveck, D.
      Robinson, R. Thurlow, "NFS Version 4 protocol", IETF RFC 3010,
      http://www.ietf.org/rfc/rfc3010.txt

   [T11-FCVI]

      T11, NCITS working group, "dpANS - Fibre Channel - Virtual
      Interface Architecture Mapping", http://www.t11.org/index.html

   [VIA]

      "Virtual Interface Architecture Specification Version 1.0",
      December 16, 1997, Compaq/Intel/Microsoft.

   [VIDG]

      "Intel Virtual Interface (VI) Architecture Developer's Guide Revi-
      sion 1.0", Sept. 9, 1998, Intel Corporation.

   [WARP]

      "WARP Architectural Requirements Summary", J. Pink,
      http://www.ietf.org/internet-drafts/draft-jpink-warp-summary-
      00.txt

10.  Author Information and Acknowledgements

10.1.   Editor

        Mark Wittle
        Network Appliance
        627 Davis Drive, Suite 200
        Morrisville, NC  27560

        Phone: 919-993-5627
        Email: mwittle@netapp.com


10.2.  Authors

   This document is the result of a truly collaborative effort by many
   people from many companies within the DAFS Collaborative. Since July
   2000 when the DAFS Specification 0.5 was released, Mark Wittle has
   acted as the primary author and editor of the specification.

10.3.  Comments

   Comments on this document should be sent to

      dafs-discussions@groups.yahoo.com.


10.4.  Acknowledgements

   I'd like to thank all of the member companies of the DAFS Collabora-
   tive for supporting the effort of creating this specification.

   If I tried to list all of the people who contributed to the creation
   of this DAFS Specification, my list would surely be incomplete. So,
   simply, I'd like to thank the many individuals from Network Appliance
   and the DAFS Collaborative who helped create DAFS.

Appendix A.  DAFS Name Service

A.1.  Introduction

   DAFS provides a simple and flexible name space for the file and file-
   related objects (for example, directories, symlinks) that exist
   within a DAFS administrative domain. In addition, DAFS defines a name
   service structure for mapping the name space into specific locations
   in a distributed environment. This document describes the DAFS name
   space.

   DAFS provides a two-stage discovery process for connection
   establishment between DAFS clients and servers. First, the DAFS
   client queries the DAFS Name Service with a DAFS Name as input and
   gets back a set of DAFS Locations. A DAFS Location includes a DAT
   Location and a directory path. Second, the DAFS client queries the
   DAT Name Service (see Appendix C. "DAT Name Service") with the DAT
   Hostname from the DAT Location and the client's local channel
   adapter. The query returns the server's host channel adapter
   address(es). The server host channel adapter address and DAT
   Connection Qualifier are used by the client to request a DAT
   connection to the DAFS server for a DAFS Session.

   A DAFS server advertises its services by registering its DAFS
   Locations with the DAFS Name Service. There MAY be more than one
   DAFS server for the same DAFS file or file-related object.

A.2.  DAFS Name Space

   The DAFS name space allows DAFS file objects to be distributed within
   a collection of DAFS server systems. Each of the DAFS servers
   provides access to a subset of the file objects within the DAFS name
   space. The collection of DAFS servers that participate within a
   particular DAFS name space makes up the DAFS name space domain.

A.3.  DAFS Name

   A "DAFS Name" is a simple string. Each DAFS Name is associated with a
   set of "DAFS Locations." Each DAFS Name is unique within the DAFS
   name space; that is, a given DAFS Name refers to a single set of DAFS
   Locations within the domain. For instance, if a name service is used
   to store and look up DAFS Names and their Locations, then each DAFS
   Name can have at most one entry in the name service.

   Note: The runtime components that implement a DAFS name space are NOT
         REQUIRED to enforce the assertion of DAFS Name uniqueness
         (although they MAY). However, in order to provide reasonably
         expected results when file objects are accessed, DAFS servers
         MAY assume that the name space domain is administered under
         this policy. DAFS clients MAY assume that any failover, migra-
         tion, or replication feature provided by the DAFS server is
         also governed by the same policy.

         Depending on how implementations handle DAFS Names,
         restrictions on the use of common separator characters such as
         "/" MAY be needed.

   DAFS Names are associated with DAFS Locations. In this sense, the
   DAFS Name acts as the "key" to a database lookup, and the DAFS Loca-
   tion is the returned "value."

A.4.  DAFS Location

   The DAFS location specifies a DAFS server and is made up of three
   components:

   1) DAT Location

   2) DAFS Directory Path

   3) DAFS Version.

A.4.1.  DAT Location

   The DAT Location provides two pieces of information. One is the
   "server address" information needed by a DAFS client to establish a
   DAFS communication channel between client and server. The other is
   the transport semantics supported by the DAFS server host including
   transport attributes and optional semantics. The DAFS Location is
   filled in by the DAFS server.

   The DAT Location consists of the following four parts:

   1) DAT Transport Type

   2) DAT Hostname

   3) DAT Connection Qualifier

   4) Transport Specific Server Attributes.

   The Transport Type is one of a well-defined set of simple strings
   that specifies a particular DAT Transport Type. The Transport Type
   identifies the appropriate DAT context in which the DAT Hostname and
   DAT Connection Qualifier are to be interpreted.

   DAFS protocol version 1.0 defines two DAT transport types:

   1) VI: for Virtual Interface implementations (FC/VI and VI-TCP)

   2) IB-RC: for InfiniBand Reliable Connection based implementations.

           const char* DAFS_DAT_VI = "VI";
           const char* DAFS_DAT_IBRC = "IBRC";


   Note: The enumerated transport mappings listed in this section are
         not meant to be exhaustive or exclusive in any way. As
         additional DAT transports are identified and become available,
         it is expected that additional types will be added to this set.
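
   As an illustration of how a client might handle this open-ended set
   of transport types, the following C sketch maps a Transport Type
   string to an enumeration. The enum and the function name are
   hypothetical and are not part of the DAFS protocol; the sketch
   assumes that a client prefers to report an unrecognized string as
   "unknown" rather than treat it as an error.

```c
#include <stddef.h>
#include <string.h>

/* Transport type strings defined by DAFS protocol version 1.0. */
static const char *const DAFS_DAT_VI   = "VI";
static const char *const DAFS_DAT_IBRC = "IBRC";

/* Illustrative enum; these names are not part of the specification. */
enum dat_transport {
    DAT_TRANSPORT_UNKNOWN = 0,  /* a type this client does not know */
    DAT_TRANSPORT_VI,
    DAT_TRANSPORT_IBRC
};

/*
 * Map a Transport Type string to the enum.  Unrecognized strings map
 * to DAT_TRANSPORT_UNKNOWN because further types may be added to the
 * set.  The comparison is case-sensitive, in line with the
 * caseExactIA5Match rule used by the LDAP schema in A.7.
 */
enum dat_transport dat_transport_from_string(const char *type)
{
    if (type == NULL)
        return DAT_TRANSPORT_UNKNOWN;
    if (strcmp(type, DAFS_DAT_VI) == 0)
        return DAT_TRANSPORT_VI;
    if (strcmp(type, DAFS_DAT_IBRC) == 0)
        return DAT_TRANSPORT_IBRC;
    return DAT_TRANSPORT_UNKNOWN;
}
```

   A client that receives DAT_TRANSPORT_UNKNOWN for a Location can
   simply skip that Location and try the others returned for the same
   DAFS Name.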

   For each DAT transport type defined by the DAFS protocol there is an
   appendix that describes how the transport type supports DAT
   semantics. For VI, see Appendix D. "DAFS Mapping to VI Architecture"
   and for IB-RC, see Appendix E. "DAFS Mapping to InfiniBand Reliable
   Connection".

   The DAT Hostname is also a simple string that the client can pass to
   the appropriate transport-specific DAT Name Service Provider. The DAT
   Name Service Provider maps the DAT Hostname to one or more Channel
   Adapter Addresses that identify the DAFS server's channel adapter(s)
   for a set of remote Endpoints used in connection establishment. For
   a description of DAT Name Service characteristics and requirements,
   see Appendix C. "DAT Name Service".

   Note: Different hosts SHOULD have different DAT hostnames. However, a
         single host can have multiple hostnames.

   The DAT Connection Qualifier is also a simple string. It is the value
   that the DAFS client uses to specify a remote Endpoint on the target
   Channel Adapter associated with the DAFS server. This Endpoint will
   be used to create a DAT connection upon which a DAFS Session can be
   established. For a description of the DAT Connection Qualifier and
   its use in the DAT connection establishment process, see B.4., "Tran-
   sport Endpoints and Connections".

   Note: While the DAT Connection Qualifier is viewed by the DAFS name
         space as a simple string, internally it MAY have more intricate
         structure. The internal structure is determined by the DAT
         transport type. The definition of this internal structure is
         provided in the DAFS mapping for each DAT Transport type. For
         more information, see the appropriate appendices for the DAFS
         on VI and DAFS on IB mappings.

   The content of the Transport Specific Server Attributes field is
   defined for each DAT Transport Type in the appropriate appendix for
   the mapping of DAT to that Transport Type. These attributes give
   clients information about transport-specific attributes that are set
   up or supported by the server. Examples are the maximum transfer
   size, the reliability levels supported by the server's transport
   provider, and support for optional transport functionality.

A.4.2.  DAFS Directory Path

   The Directory Path is a list (array) of directory name components
   that specifies a hierarchical directory path provided by the DAFS
   server at the specified DAFS Location. The directory path can be
   accessed through the use of the DAFS_PROC_LOOKUP operation executed
   using the return value from DAFS_PROC_GETROOTHANDLE (that is, the
   rootfilehandle of the DAFS server) as the directory for the lookup
   operation.

   The Directory Path MAY be NULL, meaning that the directory path asso-
   ciated with the DAFS Name is the rootfilehandle returned by the
   DAFS_PROC_GETROOTHANDLE operation.
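
   The walk from the root file handle through the Directory Path can be
   sketched in C as follows. This is a minimal illustration under
   stated assumptions, not a protocol definition: fh_t, lookup_fn,
   resolve_directory_path and the toy_lookup name space are all
   hypothetical, and a real client would issue one DAFS_PROC_LOOKUP
   per component where the sketch calls the lookup callback.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a DAFS file handle; real handles are
 * opaque values returned by the server. */
typedef int fh_t;
#define FH_INVALID (-1)

/* Assumed signature of a wrapper that performs one LOOKUP of `name`
 * in directory `dir` and returns the resulting handle. */
typedef fh_t (*lookup_fn)(fh_t dir, const char *name);

/*
 * Walk the Directory Path one component at a time, starting from the
 * root file handle.  A NULL or empty path yields the root handle.
 */
fh_t resolve_directory_path(fh_t root, const char *const *path,
                            size_t ncomponents, lookup_fn lookup)
{
    fh_t cur = root;
    size_t i;

    if (path == NULL)
        return root;
    for (i = 0; i < ncomponents; i++) {
        cur = lookup(cur, path[i]);
        if (cur == FH_INVALID)
            return FH_INVALID;      /* component not found: stop */
    }
    return cur;
}

/* Toy in-memory name space, used only to exercise the walk. */
struct toy_entry { fh_t parent; const char *name; fh_t child; };

static const struct toy_entry toy_tree[] = {
    { 1, "dbms_dir", 2 },
    { 2, "logs",     3 }
};

fh_t toy_lookup(fh_t dir, const char *name)
{
    size_t i;

    for (i = 0; i < sizeof toy_tree / sizeof toy_tree[0]; i++) {
        if (toy_tree[i].parent == dir &&
            strcmp(toy_tree[i].name, name) == 0)
            return toy_tree[i].child;
    }
    return FH_INVALID;
}
```

   In this toy name space, handle 1 plays the role of the
   rootfilehandle, so resolving the single-component path "dbms_dir"
   from handle 1 yields handle 2.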

   Note: Although NOT REQUIRED, the Directory Path is expected to be the
         same for the same host regardless of which Transport Type chan-
         nel adapter is used to reach the host.

A.4.3.  DAFS Version

   The DAFS Version is a simple string that specifies the DAFS version
   number that is supported by the DAFS server. If a DAFS server
   supports multiple DAFS versions, then it needs to create a separate
   DAFS Location for each of them.
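
   To make the shape of a DAFS Location concrete, the following C
   sketch collects the three components described in A.4 into a
   structure and selects a Location by transport type and version. The
   structure layout, field names, and dafs_find_location are
   assumptions made for illustration; the protocol itself defines only
   the logical components.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative layout; field names are not protocol definitions. */
struct dat_location {
    const char *transport_type;        /* for example "VI" or "IBRC" */
    const char *hostname;              /* DAT Hostname */
    const char *connection_qualifier;  /* DAT Connection Qualifier */
    const char *attributes;            /* transport specific */
};

struct dafs_location {
    struct dat_location dat;
    const char *const *directory_path; /* NULL means the root handle */
    size_t path_components;
    const char *version;               /* one version per Location */
};

/*
 * Return the first Location that offers the wanted transport type and
 * DAFS version, or NULL if no Location matches.
 */
const struct dafs_location *
dafs_find_location(const struct dafs_location *locs, size_t n,
                   const char *transport_type, const char *version)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(locs[i].dat.transport_type, transport_type) == 0 &&
            strcmp(locs[i].version, version) == 0)
            return &locs[i];
    }
    return NULL;
}
```

   Because a server that supports several DAFS versions publishes one
   Location per version, a simple linear search over the Locations
   returned for a DAFS Name is enough to find a compatible one.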

A.5.  DAFS Names and Locations

   A DAFS Name MAY be associated with one or more DAFS Locations. Allow-
   ing multiple Locations provides the capability to associate different
   semantics with additional Locations. For instance, a DAFS Name with
   two Locations MAY indicate multiple transport paths to the Name that
   are capable of concurrent access, and this might be useful as a per-
   formance and/or availability enhancement. It might indicate some form
   of replication of resources. Multiple paths might indicate the
   existence of an active primary location and an inactive secondary (or
   backup) location. The one-to-many relationship of DAFS Name to
   Location defines the following structure:

           DAFS Name: {DAFS Location, DAFS Location, ... }


   The structure does not specify the relationship between the multiple
   locations. Currently, the DAFS client needs to determine the rela-
   tionship dynamically. Accordingly, the DAFS server MUST be prepared
   for clients that probe a set of locations, attempting to determine
   their status (for example, "active" or "inactive").
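
   One way a client might probe a set of locations is sketched below
   in C. The status enum, the probe_fn callback, and the toy_probe
   helper are hypothetical; how a client actually decides that a
   Location is "active" (for example, by attempting to establish a
   Session) is left to the implementation.

```c
#include <stddef.h>
#include <string.h>

/* Status a client might record per Location; names are illustrative. */
enum loc_status { LOC_UNKNOWN, LOC_ACTIVE, LOC_INACTIVE };

/* Assumed signature of a routine that tries to reach a Location and
 * returns nonzero if it responded. */
typedef int (*probe_fn)(const char *hostname, void *arg);

/*
 * Probe every candidate in order, record its status, and return the
 * index of the first active one, or -1 if none responded.
 */
int probe_locations(const char *const *hostnames, size_t n,
                    enum loc_status *status, probe_fn probe, void *arg)
{
    int first_active = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        status[i] = probe(hostnames[i], arg) ? LOC_ACTIVE
                                             : LOC_INACTIVE;
        if (status[i] == LOC_ACTIVE && first_active < 0)
            first_active = (int)i;
    }
    return first_active;
}

/* Toy probe for demonstration only: the host named by `arg` is
 * treated as unreachable, every other host as active. */
int toy_probe(const char *hostname, void *arg)
{
    return strcmp(hostname, (const char *)arg) != 0;
}
```

   Recording a status for every Location, rather than stopping at the
   first active one, lets the client keep a current picture of its
   alternatives for later failover decisions.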

   Rationale: The Name to Locations mappings provided in DAFS name space
              are potentially more static than the relationship between
              two locations. One goal of the DAFS name space design is
              to provide the client with quick failover capability, and
              this prohibits reliance on propagation of new name space
              information at failover time. Thus, the approach here is
              to enable the client to capture and maintain the fairly
              static picture of the possible Locations for each Name in
              the domain and then, in real time, to probe those
              Locations as it deems appropriate in response to various
              failure scenarios.

              However, since the mapping of Names to Locations MAY
              change and grow to include new entries over time, the DAFS
              client SHOULD have a mechanism to update those mappings.


   A DAFS server can listen on multiple Connection Qualifiers on the
   same host. Nevertheless, different Connection Qualifiers MUST be
   advertised as separate DAFS Locations.

   If a DAFS server host has multiple channel adapters of the same tran-
   sport type then the Connection Qualifier that is valid on one channel
   adapter of a transport type MUST also be valid on all channel
   adapters of that transport type. If a host has multiple channel
   adapters for the same DAT Transport Type, there will be a single DAFS
   Location Entry for all of them. By calling the transport-specific DAT
   Name Service with the DAT Hostname and the client's local channel
   adapter for that transport type, the DAFS client will get all
   channel adapter addresses for that host.

A.6.  Name Space Repository

   For simple configurations a DAFS client MAY wish to directly connect
   to a DAFS server that provides the name space subset of interest. For
   more complex environments, where there are multiple DAFS clients,
   DAFS servers, multiple connection paths, and support for failover and
   trunking, the "DAFS Name Service" component is used to discover DAFS
   servers that provide the name space subset. The DAFS Name Service
   maps DAFS Names to their DAFS Locations.

A.7.  LDAP Schema

   The DAFS Name Space provides the ability to define the relationship
   between DAFS named objects and their locations. This section defines
   the basic elements of a mapping of the DAFS Name Space into the
   Lightweight Directory Access Protocol (LDAP) [Wahl].

   The LDAP schema is extended to add the dafsSchema, which supports the
   following object class and attribute definitions:

   Object Classes

   o  dafsNameSpace: the DAFS name space

   o  dafsNameSpaceEntry: an entry mapping a DAFS name to one or more
      locations

   o  dafsLocationList: a list of one or more locations for a DAFS name

   o  dafsLocation: a DAFS location

   o  datLocation: the DAT transport-specific information for a loca-
      tion.

   Attributes

   o  dafsName: a DAFS name string

   o  datTransportType: a transport type, currently either "VI" or
      "IBRC"

   o  datTransportHostname: a transport-specific name for the host-
      channel location

   o  datTransportConnectionQualifier: more specific information for
      the transport address

   o  datTransportAttributes: transport specific server attributes

   o  dafsDirectoryPath: a list of one or more directory pathname com-
      ponents

   o  dafsProtocolVersion: a list of one or more supported protocol
      version numbers.

           ########################################################
           #
           # DAFS name service
           #
           # DAFS_NAMESERVICE
           #

           attribute ( DAFS_NAMESERVICE.1.0
              NAME 'dafsName'
              DESC 'DAFS Name'
              EQUALITY caseExactIA5Match
              SYNTAX  1.3.6.1.4.1.1466.115.121.1.15
              SINGLE-VALUE )

           attribute ( DAFS_NAMESERVICE.1.1
              NAME 'datTransportType'
              DESC 'Type of dafs transport address (VI, IBRC)'
              EQUALITY caseExactIA5Match
              SYNTAX  1.3.6.1.4.1.1466.115.121.1.15
              SINGLE-VALUE )

           attribute ( DAFS_NAMESERVICE.1.2
              NAME 'datTransportHostname'
              DESC 'TransportHostname for DAFS'
              EQUALITY caseExactIA5Match
              SYNTAX  1.3.6.1.4.1.1466.115.121.1.15
              SINGLE-VALUE )

           attribute ( DAFS_NAMESERVICE.1.3
              NAME 'datTransportConnectionQualifier'
              DESC 'TransportConnectionQualifier for DAFS'
              EQUALITY caseExactIA5Match
              SYNTAX  1.3.6.1.4.1.1466.115.121.1.15
              SINGLE-VALUE )

           attribute ( DAFS_NAMESERVICE.1.4
              NAME 'datTransportAttributes'
              DESC 'TransportAttributes for DAFS'
              EQUALITY caseExactIA5Match
              SYNTAX  1.3.6.1.4.1.1466.115.121.1.15
              SINGLE-VALUE )

           attribute ( DAFS_NAMESERVICE.1.5
              NAME 'dafsDirectoryPath'
              DESC 'Dafs Directory Path Name component'
              EQUALITY caseExactIA5Match
              SYNTAX  1.3.6.1.4.1.1466.115.121.1.15 )

           attribute ( DAFS_NAMESERVICE.1.6
              NAME 'dafsProtocolVersion'
              DESC 'Dafs Protocol Version number'
              EQUALITY caseExactIA5Match
              SYNTAX  1.3.6.1.4.1.1466.115.121.1.15 )

           objectClass ( DAFS_NAMESERVICE.2.1
              NAME 'dafsNameSpace'
              DESC 'DAFS Name Space object'
              SUP top
              STRUCTURAL
              MAY cn )

           objectClass ( DAFS_NAMESERVICE.2.2
              NAME 'dafsNameSpaceEntry'
              DESC 'DAFS Name Space entry'
              SUP dafsNameSpace
              AUXILIARY
              MUST ( dafsName $
                     dafsLocationList )
              MAY ( description ) )

           objectClass ( DAFS_NAMESERVICE.2.3
              NAME 'dafsLocationList'
              SUP dafsNameSpaceEntry
              AUXILIARY
              DESC 'List of locations for accessing a DAFS Name'
              MUST ( dafsLocation ) )

           objectClass ( DAFS_NAMESERVICE.2.4
              NAME 'dafsLocation'
              SUP dafsLocationList
              AUXILIARY
              DESC 'A location for accessing a DAFS Name'
              MUST ( datLocation $
                     dafsDirectoryPath $
                     dafsProtocolVersion ) )

           objectClass ( DAFS_NAMESERVICE.2.5
              NAME 'datLocation'
              SUP dafsLocation
              AUXILIARY
              DESC 'Location for access to DAT interface'
              MUST ( datTransportType $
                     datTransportHostname $
                     datTransportConnectionQualifier $
                     datTransportAttributes ) )

   The following illustrates how an LDAP search for the DAFS name
   "database17" might return information, in this case two transport
   addresses associated with the same DAFS server:

        Query:
           (&  (objectClass=dafsNameSpaceEntry)
               (dafsName=database17))

        Result:
           objectClass:                     top
           objectClass:                     dafsNameSpaceEntry
           dafsName:                        database17
           datTransportType:                VI
           datTransportHostname:            server28
           datTransportConnectionQualifier: dafs
           datTransportAttributes:
           dafsDirectoryPath:               dbms_dir
           dafsProtocolVersion:             1
           datTransportType:                IBRC
           datTransportHostname:            server28
           datTransportConnectionQualifier: dafs
           datTransportAttributes:
           dafsDirectoryPath:               dbms_dir
           dafsProtocolVersion:             1
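
   A client that builds such a query from a DAFS Name supplied at run
   time needs to escape characters that are special in LDAP search
   filters. The following C sketch shows one way to do so. The
   function name dafs_build_filter is hypothetical, and the sketch
   assumes that the dafsName attribute is the search key; the escaping
   follows the string representation of LDAP search filters (RFC 2254).

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "(&(objectClass=dafsNameSpaceEntry)(dafsName=<name>))" to out.
 * The characters '*', '(', ')' and '\' in name are escaped as a
 * backslash followed by two hex digits.  Returns 0 on success or -1
 * if out is too small.
 */
int dafs_build_filter(const char *name, char *out, size_t outlen)
{
    static const char prefix[] =
        "(&(objectClass=dafsNameSpaceEntry)(dafsName=";
    static const char suffix[] = "))";
    size_t pos = sizeof prefix - 1;
    const char *p;

    /* Room for prefix, suffix, and the terminating NUL at least. */
    if (outlen < sizeof prefix + sizeof suffix - 1)
        return -1;
    memcpy(out, prefix, pos);

    for (p = name; *p != '\0'; p++) {
        int special = (*p == '*' || *p == '(' ||
                       *p == ')' || *p == '\\');
        size_t need = special ? 3 : 1;

        /* Keep room for the suffix and its terminating NUL. */
        if (pos + need + sizeof suffix > outlen)
            return -1;
        if (special) {
            sprintf(out + pos, "\\%02x",
                    (unsigned int)(unsigned char)*p);
            pos += 3;
        } else {
            out[pos++] = *p;
        }
    }
    memcpy(out + pos, suffix, sizeof suffix);  /* includes the NUL */
    return 0;
}
```

   For the name "database17" the function produces the filter shown in
   the query above, written without the optional white space.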


A.8.  References

   [Wahl]

      M. Wahl, T. Howes, S. Kille, "Lightweight Directory Access
      Protocol (v3)", IETF RFC 2251, http://www.ietf.org/rfc/rfc2251.txt

Appendix B.  DAT Semantics

   New interconnect networks that provide direct access to remote memory
   are emerging. In addition to offering direct access to remote memory,
   these new interconnect networks also provide low latency and high
   throughput. Their transport protocols support remote memory read and
   write in addition to more traditional message transfer operations.
   Examples of these transports include Virtual Interface Architecture,
   InfiniBand Architecture, and the WARP protocol for the Internet.
   Traditional message transfer operations allow only the receiver of a
   message to specify the particular location on the destination node
   where the message payload will be deposited. Remote memory writes
   allow the operation initiator to specify the target memory location
   on the destination node. Remote memory reads allow the operation ini-
   tiator to specify both the remote memory location that is to be the
   source of a data "fetch" operation as well as the local destination
   for the fetched remote memory contents. The addition of remote memory
   semantics to the transport layer supports a new class of networked
   applications.
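
   The difference between these operation types can be summarized in a
   short C sketch. The enum, the descriptor layout, and the two helper
   functions are hypothetical and do not define any transport
   interface; they only restate which side names the remote memory
   location and where the data ends up.

```c
#include <stddef.h>
#include <stdint.h>

enum dto_type { DTO_SEND, DTO_RECEIVE, DTO_RDMA_READ, DTO_RDMA_WRITE };

/* Illustrative descriptor; not taken from any DAT interface. */
struct dto {
    enum dto_type type;
    void    *local_buf;     /* local source or destination buffer */
    size_t   length;        /* number of bytes to transfer */
    uint64_t rmr_context;   /* remote region, RDMA operations only */
    uint64_t remote_addr;   /* remote target address, RDMA only */
};

/*
 * Nonzero if the initiator of the operation names the remote memory
 * location.  For Send, the location is chosen by the receiver.
 */
int dto_initiator_names_remote(enum dto_type t)
{
    return t == DTO_RDMA_READ || t == DTO_RDMA_WRITE;
}

/*
 * Nonzero if the operation deposits data into the memory of the node
 * that posted it (Receive and RDMA Read); zero if the data is
 * deposited on the remote node (Send and RDMA Write).
 */
int dto_data_lands_locally(enum dto_type t)
{
    return t == DTO_RECEIVE || t == DTO_RDMA_READ;
}
```

   Only the two RDMA operations carry a remote address in the
   descriptor, which is why the rmr_context and remote_addr fields are
   unused for Send and Receive.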

   This appendix defines the semantics of a particular set of abstract
   transport capabilities. A transport whose semantics support these
   capabilities is called a Direct Access Transport (DAT). These
   semantics are intended to be mapped easily onto networks that
   support memory-to-memory operations, such as Virtual Interface
   Architecture and InfiniBand Architecture. This appendix does not
   define a specific transport layer interface, but does describe some
   functionality and concepts necessary to support the DAFS protocol.

B.1.  DAT Glossary

   Channel Adapter

      A Channel Adapter is a host-resident device that transfers
      messages between a Fabric and the host memory associated with a
      specific Endpoint.

   Channel Adapter Address

      The address of a Channel Adapter on the Fabric.

   Connection

      An association between a pair of Endpoints such that the data of
      data-transfer operation requests posted on either Endpoint
      arrives at the other Endpoint of the Connection.

   Connection Qualifier

      A value that enables a new connection request to be associated
      with the upper-level-protocol entity providing the service.

   DAT Consumer

      An application that requires Direct Access Transport services.

   DAT Provider

      Provider of the Transport services for a Direct Access applica-
      tion.

   Data Transfer Completion (DTC)

      Status of a completed data transfer operation.

   Data Transfer Operation (DTO)

      A data movement request submitted to a DAT Provider.

   Endpoint (EP)

      The local part of a Connection that supports posting data-transfer
      operation requests.

   Fabric

      A network with RDMA capabilities.

   Operation Type

      Send, Receive, RDMA Read or RDMA Write data transfer operations
      (DTO).

   Remote Direct Memory Access (RDMA)

      Access of local memory by the remote Endpoint. There are two RDMA
      operations: RDMA Read and RDMA Write.


   RDMA Memory Region Context (RMR Context)

      A representation for an arbitrary-sized, registered, contiguous
      virtual space that belongs to a Channel Adapter so that it can
      support Remote DMA operations on the Connection whose local End-
      point belongs to the Channel Adapter.


   RMR Target Address

      Specifies the memory address within a region of memory represented
      by an RDMA Memory Region Context. (The specification can be either
      by virtual address or offset from the start of the memory
      represented by the RMR Context.)

B.2.  DAT Model

   There are two significant interfaces to a Direct Access Transport
   service provider. One interface defines the boundary between the con-
   sumer of a set of transport services and the local transport provider
   of these services. In the DAT model, this would be the interface
   between the DAT Consumer and the DAT Provider. The other interface
   defines the set of interactions between a local and remote transport
   provider that enables the local and remote providers to offer a set
   of transport services between the local and remote transport consu-
   mers. In the DAT model, this would be the set of interactions between
   a local DAT Provider and a remote DAT Provider that are visible to
   the local DAT Consumer and/or remote DAT Consumer.

   This document defines the minimal set of necessary semantics for the
   interactions between DAT Providers that are visible to the local DAT
   Consumer. The transport-specific details of the DAT Provider to DAT
   Provider interactions are outside the scope of this document; they
   are expected to be provided as part of the specification of a
   particular transport protocol (e.g., VI/TCP, FC-VI, IB, and WARP).


   Furthermore, except as needed to characterize the semantics of the
   target set of abstract transport services between a local and remote
   DAT Consumer, the local interactions between the DAT Provider and the
   DAT Consumer are not defined in this document.

B.3.  DAT Provider

   There can be multiple DAT Providers on the same node. Each DAT Pro-
   vider controls resources and provides RDMA and message transfer ser-
   vices for one or more DAT Consumer processes. A DAT Provider controls
   Channel Adapters. A Channel Adapter is controlled by at most one DAT
   Provider. A DAT Provider can have multiple Channel Adapters. Each
   Channel Adapter can have multiple Endpoints. An Endpoint belongs to
   exactly one Channel Adapter. An Endpoint is the local part of a con-
   nection that supports the posting of a data transfer operation (DTO),
   including RDMA operations. A Connection is an association between a
   pair of Endpoints such that the data payload described by data
   transfer operations posted on either Endpoint arrives at the other
   Endpoint of the Connection. In order for an endpoint of a connection
   to support RDMA operations, the remote endpoint MUST have access
   rights to the local memory accessed by RDMA.

   The DAT Provider controls one or more Channel Adapters that provide
   access to the network Fabric. Each Channel Adapter is identified on
   the Fabric by a unique address. The assignment and maintenance of
   Channel Adapter addresses on the Fabric as well as name service for
   these addresses are outside the scope of DAT. For connection estab-
   lishment purposes, a DAT Consumer such as a DAFS client needs to have
   a mechanism to specify remote Channel Adapters (see Appendix C. "DAT
   Name Service"). The mechanisms by which a DAT Consumer discovers
   remote Channel Adapters and identifies them to the DAT Provider are
   outside the scope of DAT.
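
   The ownership rules in this section can be pictured with a small
   object model. The following is only an illustrative sketch: DAT does
   not define a programming interface, so every class, attribute, and
   method name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Endpoint:
    """Local part of a connection; belongs to exactly one Channel Adapter."""
    adapter: "ChannelAdapter"


@dataclass(eq=False)
class ChannelAdapter:
    """Identified on the Fabric by a unique address."""
    address: str
    provider: Optional["DatProvider"] = None  # at most one controlling Provider
    endpoints: List[Endpoint] = field(default_factory=list)

    def create_endpoint(self) -> Endpoint:
        ep = Endpoint(adapter=self)
        self.endpoints.append(ep)
        return ep


@dataclass(eq=False)
class DatProvider:
    """Controls any number of Channel Adapters on a node."""
    name: str
    adapters: List[ChannelAdapter] = field(default_factory=list)

    def attach(self, adapter: ChannelAdapter) -> None:
        # A Channel Adapter is controlled by at most one DAT Provider.
        if adapter.provider is not None and adapter.provider is not self:
            raise ValueError("Channel Adapter already has a DAT Provider")
        if adapter.provider is None:
            adapter.provider = self
            self.adapters.append(adapter)


# Two Providers on one node; the adapter address is a made-up value.
provider_a = DatProvider("provider-a")
provider_b = DatProvider("provider-b")
ca = ChannelAdapter(address="ca-0")
provider_a.attach(ca)
ep1, ep2 = ca.create_endpoint(), ca.create_endpoint()
```

   In this sketch, a later call to provider_b.attach(ca) raises
   ValueError, which mirrors the rule that a Channel Adapter has at most
   one DAT Provider.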

B.4.  Transport Endpoints and Connections

   DAT semantics require reliable data exchange on point-to-point con-
   nections. A point-to-point connection is an association, formed by a
   connection establishment protocol, between two transport endpoints.
   The transport MUST be capable of supporting more than one connection
   between a pair of DAT Providers.

   A DAT Endpoint is an object in the transport layer that upper layers
   use to create connections and exchange data. When it is part of a
   connection, an endpoint supports four data transfer operations: send,
   receive, RDMA write, and RDMA read. The data of a data transfer
   operation posted to one endpoint of a connection arrives at the other
   endpoint of the connection.

   The DAT Provider manages endpoint creation and destruction. A DAT
   Provider supports explicit connection endpoint creation by the DAT
   Consumer. The mechanism by which the DAT Consumer and the DAT Pro-
   vider interact to create and destroy endpoints is outside the scope
   of DAT. The properties of an unconnected endpoint are outside the
   scope of DAT. Each endpoint is associated with a single Channel
   Adapter for the lifetime of the connection, that is, for as long as
   the local endpoint is connected to a remote endpoint.

   An Endpoint is the interaction point for data transfer operation
   requests between a DAT Consumer and a DAT Provider. A DAT consumer
   submits data transfer operation requests to a local endpoint. A local
   endpoint executes data transfer requests only if it is a part of an
   established connection.

   DAT semantics require DAT Provider support for active-passive
   (client-server) connection establishment. Some DAT Providers MAY
   offer automatic connection endpoint creation for the client or for
   the server. Many DAT Providers support specification of the desig-
   nated server connection endpoint in a generic fashion only. The
   actual server connection endpoint MAY be dynamically created and/or
   allocated upon connection request. An endpoint can be a part of at
   most one connection at a time.

   A DAT Provider supports a request for connection establishment of an
   unconnected local DAT endpoint. The DAT Consumer specifies a local
   endpoint that it wants to connect, the Connection Qualifier, and the
   Channel Adapter address for a remote endpoint. See Appendix C. "DAT
   Name Service" for a discussion of the Name Service for Channel
   Adapter Addresses. The DAT Provider notifies the DAT Consumer of suc-
   cessful or unsuccessful connection establishment.

   During the connection establishment process, the active side speci-
   fies a Connection Qualifier that is used by the passive side DAT pro-
   vider to associate the incoming connection request with the appropri-
   ate listening process. The Connection Qualifier does not uniquely
   specify the endpoint of the passive side. The same Connection
   Qualifier can be reused by an active DAT Consumer to establish
   multiple connections between the same pair of hosts. In many cases, a
   single DAFS server will listen only on a single Connection Qualifier
   for all incoming connection requests for all Sessions for all DAFS
   clients.
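
   As a rough illustration of how a Connection Qualifier selects a
   listening entity rather than a particular endpoint, consider the
   following sketch. The class, its methods, the qualifier value, and
   the choice to refuse requests that have no listener are all
   assumptions made for the example; none of them is defined by DAT.

```python
from typing import Callable, Dict, List


class PassiveProvider:
    """Hypothetical passive-side DAT Provider keyed by Connection Qualifier."""

    def __init__(self) -> None:
        self._listeners: Dict[int, Callable[[str], None]] = {}

    def listen(self, qualifier: int, on_request: Callable[[str], None]) -> None:
        # One listening entity is registered per Connection Qualifier.
        self._listeners[qualifier] = on_request

    def incoming(self, qualifier: int, remote_address: str) -> bool:
        # The qualifier selects the listener, not a particular endpoint.
        listener = self._listeners.get(qualifier)
        if listener is None:
            return False  # assumption: no listener means the request is refused
        listener(remote_address)
        return True


accepted: List[str] = []
server = PassiveProvider()
DAFS_QUALIFIER = 7  # illustrative value only
server.listen(DAFS_QUALIFIER, accepted.append)

# The same qualifier is reused for two connections from the same client.
server.incoming(DAFS_QUALIFIER, "client-ca-0")
server.incoming(DAFS_QUALIFIER, "client-ca-0")
refused = not server.incoming(99, "client-ca-0")
```

   Both requests carrying DAFS_QUALIFIER reach the same listener, which
   corresponds to a DAFS server listening on a single Connection
   Qualifier for all of its clients.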

   Connection parameter negotiation, connection establishment, and
   other interactions between the DAT Providers on the two sides of the
   connection are internal to the Transport and are outside the scope of
   the DAT protocol. Other interactions between the DAT Provider and
   the DAT Consumer, such as setting connection attributes and the
   timeout for connection establishment, are local interactions and are
   also outside the scope of DAT. The passive DAT Provider notifies the
   DAT Consumer of a connection request based on the Connection
   Qualifier. The details of the interactions between a DAT Consumer and
   a DAT Provider (for example, providing an existing unconnected local
   endpoint or asking a Provider to create one on the fly) are outside
   the scope of DAT.

   A DAT Provider MUST offer a mechanism to enable the DAT Consumer on
   either Endpoint to break a connection. Upon receiving a request for a
   connection termination, the DAT Provider SHALL break the connection
   and SHALL complete all outstanding and in-progress data-transfer
   operations, with an error indication if they have not yet completed.
   The DAT Provider SHALL NOT process outstanding data transfer
   operations subsequent to receiving a request for connection
   termination. DAT does not define how the remote DAT Provider
   discovers that the connection has been broken. However, the remote
   DAT Provider SHALL report to its local DAT Consumer that the
   connection has been broken. For example, the remote DAT Provider can
   detect connection termination by its inability to deliver data after
   the connection has been terminated. For more specific details, see
   B.6., "DAT Data Transfer Operations and Connection Properties".

   If the DAT Provider exposes an error to the DAT Consumer (either due
   to a transport error for a transport that does not attempt
   transport-level error recovery, or due to an unrecoverable error),
   the DAT Provider MUST break the connection upon reporting the error
   and notify the DAT Consumer that the connection has been broken. When
   and how this notification takes place is outside the scope of DAT.
   After a single data transfer operation (DTO) has completed with an
   error status, all subsequently posted DTOs SHALL also be completed
   with an error status.
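
   The endpoint behavior described in this section can be summarized as
   a small state machine. This is a sketch under stated assumptions, not
   an interface defined by DAT: the class, the state names, and the
   string status values are hypothetical, and completions are recorded
   in a plain list.

```python
from enum import Enum
from typing import List, Tuple


class State(Enum):
    UNCONNECTED = "unconnected"
    CONNECTED = "connected"
    BROKEN = "broken"


class Endpoint:
    """Toy endpoint that records DTO completions as (dto, status) pairs."""

    def __init__(self) -> None:
        self.state = State.UNCONNECTED
        self.pending: List[str] = []
        self.completions: List[Tuple[str, str]] = []

    def connect(self) -> None:
        self.state = State.CONNECTED

    def post(self, dto: str) -> None:
        if self.state is State.UNCONNECTED:
            # DTOs execute only on an established connection.
            raise RuntimeError("endpoint is not part of a connection")
        if self.state is State.BROKEN:
            # After a failure, DTOs posted later also complete in error.
            self.completions.append((dto, "error"))
            return
        self.pending.append(dto)

    def complete_next(self) -> None:
        # The oldest pending DTO finishes normally.
        self.completions.append((self.pending.pop(0), "ok"))

    def break_connection(self) -> None:
        # Outstanding and in-progress DTOs complete with an error.
        self.state = State.BROKEN
        while self.pending:
            self.completions.append((self.pending.pop(0), "error"))


ep = Endpoint()
ep.connect()
ep.post("send-1")
ep.post("rdma-write-2")
ep.complete_next()      # send-1 finishes normally
ep.break_connection()   # rdma-write-2 is flushed with an error
ep.post("send-3")       # posted after the break
```

   After this sequence, ep.completions holds send-1 with status "ok",
   followed by rdma-write-2 and send-3 with status "error".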

B.5.  DAT Memory Semantics

   A DAT Provider MUST have the right to read memory that contains the
   source data for a data transfer operation (DTO) and to write to
   memory that is the destination location for the payload carried by a
   DTO. The registration of local memory with the DAT Provider in order
   to establish local access rights is a local interaction between the
   DAT Provider and the DAT Consumer and is outside the scope of DAT.
   Furthermore, the registration of local memory needed for remote
   memory accesses for RDMA operations is also outside the scope of DAT.

   The DAT Consumer requires the following property of the DAT Provider:
   an RDMA Memory Region Context (RMR Context), which is the outcome of
   memory registration, MUST be transferable to the remote side of the
   connection so that the remote side can initiate an RDMA operation on
   the memory specified by the RMR Context.

   An RMR Context is a representation of an arbitrary-sized, registered,
   contiguous virtual space that can be directly accessed by a Channel
   Adapter to support:

   o  only Remote DMA Read operations,

   o  only Remote DMA Write operations, or

   o  both Remote DMA Read and Remote DMA Write operations.

   The mechanism by which an RMR Context is created is outside the scope
   of DAT.
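
   The three access modes listed above can be modeled as a pair of
   permission flags. The sketch below is illustrative only; the type
   names, field names, and handle values are hypothetical, since the
   mechanism that creates an RMR Context is outside the scope of DAT.

```python
from dataclasses import dataclass
from enum import Flag, auto


class RdmaAccess(Flag):
    READ = auto()
    WRITE = auto()


@dataclass(frozen=True)
class RmrContext:
    """Hypothetical stand-in for a registered memory region."""
    handle: int          # opaque value produced by registration
    length: int          # size of the region in bytes
    access: RdmaAccess   # which remote operations are permitted

    def permits(self, op: RdmaAccess) -> bool:
        # Every requested flag must be present in the granted set.
        return (self.access & op) == op


read_only = RmrContext(handle=1, length=4096, access=RdmaAccess.READ)
write_only = RmrContext(handle=2, length=4096, access=RdmaAccess.WRITE)
both = RmrContext(handle=3, length=4096,
                  access=RdmaAccess.READ | RdmaAccess.WRITE)
```

   Here read_only.permits(RdmaAccess.WRITE) is False, while both
   permits either operation.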

   An RMR Context can be advertised to a remote DAT Consumer to allow
   the remote DAT Consumer to initiate an RDMA operation that targets
   the RMR Context. The mechanism by which the local DAT consumer adver-
   tises the RMR Context to the remote DAT Consumer is outside the scope
   of DAT. Nevertheless, DAT does assume the following properties of an
   RMR Context:

   o  An RMR Context has an association with a set of connections that
      support RDMA operations to that RMR Context via their local
      Endpoints. Note that multiple Endpoints on the same Channel
      Adapter MAY be able to use the same RMR Context and multiple RMR
      Contexts can be associated with the same Endpoint. Defining the
      mechanism by which the DAT Consumer and the DAT Provider interact
      to set up an association between an RMR Context and a set of
      Endpoints is outside the scope of DAT.

   o  It is not expected that all RMR Contexts valid on a given Endpoint
      on a Channel Adapter will be valid across all Endpoints on the
      Channel Adapter. Defining the mechanism by which the DAT Consumer
      and DAT Provider interact to limit the scope of an RMR Context to
      a given set of Endpoints is outside the scope of DAT.

   o  DAT does not require that the same RMR Context be usable by
      multiple Channel Adapters. However, DAT requires that an RMR
      Context be valid within the context of a set of Endpoints within a
      single Channel Adapter.

   o  The same memory can belong to multiple RMR Contexts within the
      same or different DAT Providers. The DAT Provider MUST allow the
      Consumer to create a new RMR Context that maps to the same
      physical memory.

   An RMR Context is specified as a 32-bit unsigned integer. An RMR
   Target Address specifies the memory address to be used for an RDMA
   operation. The RMR Target Address must be within a region of memory
   represented by an RMR Context. An RMR Target Address is specified as
   a 64-bit unsigned integer.
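
   To make the two integer widths concrete, the following sketch packs
   an RMR Context and an RMR Target Address into 12 bytes and checks
   that a target range lies within a region. The field order, the
   big-endian byte order, the function names, and the sample values are
   assumptions for illustration; this appendix fixes only the widths.

```python
import struct

# Assumed layout: 32-bit RMR Context, then 64-bit RMR Target Address,
# both unsigned and big-endian. Only the two widths come from the text.
RDMA_TARGET = struct.Struct(">IQ")


def pack_target(rmr_context: int, target_address: int) -> bytes:
    """Encode the pair; struct.error is raised if a value does not fit."""
    return RDMA_TARGET.pack(rmr_context, target_address)


def unpack_target(data: bytes):
    """Decode 12 bytes back into (rmr_context, target_address)."""
    return RDMA_TARGET.unpack(data)


def in_region(target_address: int, length: int,
              region_start: int, region_length: int) -> bool:
    """True if [target_address, target_address + length) lies in the region."""
    return (region_start <= target_address and
            target_address + length <= region_start + region_length)


wire = pack_target(0x0000BEEF, 0x00007F00DEAD0000)
```

   Packing a context value of 2**32 or larger raises struct.error, which
   reflects the 32-bit limit.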

B.6.  DAT Data Transfer Operations and Connection Properties

   There are four types of data transfer operations: send, recv, RDMA
   Write and RDMA Read. The wire protocol formats of the messages that
   underlie these operations are defined by specific transport
   protocols and are outside the scope of DAT. The initiator refers to
   the transport Endpoint of the connection whose consumer posted a
   given data transfer operation. The target refers to the transport
   Endpoint on the other end of the connection from the initiator. Each
   transport Endpoint can initiate data transfer operations and be the
   target of transport layer messages. Messages supported by the
   transport layer might be extremely large. The transport protocol is
   expected to segment messages into transport layer packets and
   reassemble these packets into messages at the target.

   An RDMA Write operation MUST contain an RMR Context and an RMR Target
   Address specifying the remote memory where the data is to be
   deposited. An RDMA Read operation MUST contain an RMR Context and an
   RMR Target Address specifying the remote memory from which the data
   is to be extracted.

   Delivery of data payloads for DAT operations MUST obey the following
   rules:

   o  All data transfer operations submitted to the DAT Provider will
      complete successfully in the absence of errors, with data
      delivered uncorrupted, in the order specified by the DAT delivery
      ordering rules.

   o  Corruption of the data delivered to the Consumer (the local
      Consumer for RDMA Read) is detected as an error and reported to
      the Consumer.

   o  Data loss (inability to deliver data to or from the remote End-
      point of the connection) SHALL be detected as an error and
      reported to the Consumer.

   o  Upon detection of an error, the connection SHALL be broken and all
      outstanding and in-progress data-transfer operations SHALL com-
      plete with an error.

   o  There is a one-to-one correspondence between send operations on
      one Endpoint of the connection and receive operations on the other
      Endpoint of the connection.

   o  There is no correspondence between RDMA operations on one endpoint
      of the connection and recv or send data transfer operations on the
      other endpoint of the connection.

   o  Data Transfer Operation completion means that the Consumer can
      reclaim resources associated with the operation including the
      memory that contains the data.

   o  Ordering rules:

      o  The data payload of a send operation and associated receive
         operation MUST be delivered without error into the receiver-
         specified memory buffer prior to the receive completion.


      o  Receive operations on a connection MUST be completed in the
         order of the posting of their corresponding sends.

      o  Each RDMA write operation posted on a connection prior to a
         send operation MUST have its data payload delivered to the tar-
         get memory region prior to the completion of the receive opera-
         tion matching the send.
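
   The delivery and ordering rules above can be illustrated with a toy
   model of one direction of a connection. All names below are
   hypothetical, and delivering strictly in posting order is an
   assumption made for simplicity; it is one way to satisfy the rules,
   not the only one.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class Connection:
    """Toy in-order model of one direction of a DAT connection."""

    def __init__(self) -> None:
        self.posted: Deque[Tuple[str, Optional[int], bytes]] = deque()
        self.target_memory: Dict[int, bytes] = {}
        self.receive_buffers: Deque[str] = deque()
        self.receive_completions: List[Tuple[str, bytes]] = []

    # Target side: receives are posted before the matching sends arrive.
    def post_receive(self, buffer_name: str) -> None:
        self.receive_buffers.append(buffer_name)

    # Initiator side.
    def post_rdma_write(self, address: int, payload: bytes) -> None:
        self.posted.append(("rdma_write", address, payload))

    def post_send(self, payload: bytes) -> None:
        self.posted.append(("send", None, payload))

    def deliver_all(self) -> None:
        # Processing in posting order is the simplest way to satisfy the
        # rules; a real transport may reorder as long as they still hold.
        while self.posted:
            kind, address, payload = self.posted.popleft()
            if kind == "rdma_write":
                # RDMA consumes no receive on the target.
                self.target_memory[address] = payload
            else:
                # Each send matches exactly one posted receive.
                buffer_name = self.receive_buffers.popleft()
                self.receive_completions.append((buffer_name, payload))


conn = Connection()
conn.post_receive("rx-0")
conn.post_receive("rx-1")
conn.post_rdma_write(0x1000, b"file data")
conn.post_send(b"reply header")
conn.post_send(b"second message")
conn.deliver_all()
```

   When the receive in rx-0 completes, the payload of the earlier RDMA
   write is already present in target_memory, and the two receives
   complete in the order in which their sends were posted.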

   DAT can have several send, recv, RDMA write, and RDMA read operations
   active simultaneously. Out of order packet delivery can lead to com-
   plex target implementations. However, if the transport layer does not
   restrict accesses to the memory for data transfer operations (DTO),
   and the transport protocol packets contain enough information, the
   transport can write messages to the target memory as soon as these
   messages arrive. Specific implementations of the transport layer are
   allowed to implement more stringent memory ordering restrictions.

   The DAT memory access ordering by DTOs does not define the result of
   an RDMA write to a memory location followed by an RDMA read from the
   same memory location. It is up to the DAT Consumer to enforce
   specific ordering by DTOs for accesses to local or remote memory.


Appendix C.  DAT Name Service

   The main purpose for the Name Service is to provide a Channel Adapter
   address that can be used to identify a Channel Adapter for a remote
   Endpoint for connection establishment. For connection establishment,
   the Channel Adapter address and the Connection Qualifier specify the
   remote side of a requested connection. A Channel Adapter Address is a
   unique identifier of a Channel Adapter on the (network) Fabric.

   The same DAT Provider can support multiple Channel Adapters and mul-
   tiple fabrics. A remote host might not be reachable through all local
   Channel Adapters. A Fabric might not fully connect all the hosts.
   Only some remote hosts and only some of the Channel Adapters on the
   reachable remote hosts can be accessed from any given local Channel
   Adapter. The DAT Provider MAY require the use of a specific Fabric
   and a specific local Channel Adapter to access a remote host.

   The Name Service MAY generate traffic between DAT Providers on any of
   the Fabrics that connect hosts, but is not required to. The DAT Name
   Service SHALL be able to support multiple Fabrics connecting hosts.

   Note: Some transports, like InfiniBand, support path specification
         and manipulation for a route between local and remote Channel
         Adapters. Others, like VI Architecture, do not. DAT chooses the
         least common denominator and does not describe interactions
         with paths. It describes Channel Adapter addresses and leaves
         paths and routing to the underlying transport that supports DAT
         semantics.

   DAFS has the following requirements for the DAT name service:

   1) The DAT Provider SHALL provide a way to enumerate all local Chan-
      nel Adapters and determine their names. This is needed so that the
      DAFS protocol (DAT Consumer) can open Channel Adapters to be used
      for communication.

   2) The DAT Provider SHALL provide a way to find the addresses of all
      Channel Adapters on a remote host, identified by host name, that
      are accessible from a specified local Channel Adapter.

   3) The name of a remote host SHALL be unique and the same across all
      Fabrics.

   Each host has a name. The DAT Name Service does not address the issue
   of assignment of a name to a host. How a DAT Consumer discovers the
   names of remote hosts is also outside the scope of DAT and DAFS (see
   Appendix A. "DAFS Name Service"). The only requirement on a host name
   is that the name SHALL be the same on all Fabrics and uniquely
   identify the host on all Fabrics. This ensures that the same name can
   be used on any of the local Channel Adapters to identify the same
   remote host independent of the network Fabric(s) that connect Channel
   Adapters on local and remote hosts. A host MAY have multiple names.
   It is up to the DAT Consumer to ensure that each name used by the DAT
   Consumer in the DAT Name Service adheres to the uniqueness rule on
   all fabrics.

   For a given local Channel Adapter and a given host identified by a
   hostname, the DAT Provider SHALL return all Channel Adapter
   address(es) of the host reachable from the local Channel Adapter. An
   interface that provides that functionality is outside the scope of
   DAT. The DAT Provider might support a single operation that returns
   all Channel Adapter addresses on a remote node. Alternatively, a DAT
   Consumer might need to call multiple times, with each call returning
   a single Channel Adapter Address. A DAT Consumer SHOULD be capable of
   generating a connectivity matrix between the local and remote hosts
   using the DAT Name Service semantics.
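
   A connectivity matrix of this kind could be built as sketched below.
   Because the lookup interface is outside the scope of DAT, the lookup
   callable, the adapter names, the host name, and the address strings
   are all hypothetical placeholders.

```python
from typing import Callable, Dict, List

# Hypothetical lookup: (local adapter name, remote host name) -> addresses.
Lookup = Callable[[str, str], List[str]]


def connectivity_matrix(local_adapters: List[str], remote_host: str,
                        lookup: Lookup) -> Dict[str, List[str]]:
    """Map each local Channel Adapter to the remote addresses it can reach."""
    return {ca: list(lookup(ca, remote_host)) for ca in local_adapters}


# Stand-in for a Provider's name service, with made-up names and addresses.
_TABLE = {
    ("ca0", "filer1"): ["fabricA:17", "fabricA:18"],
    ("ca1", "filer1"): ["fabricB:3"],
}


def fake_lookup(local_adapter: str, remote_host: str) -> List[str]:
    # An adapter with no route to the host yields an empty list.
    return _TABLE.get((local_adapter, remote_host), [])


matrix = connectivity_matrix(["ca0", "ca1", "ca2"], "filer1", fake_lookup)
```

   Each remote address is kept under the local Channel Adapter through
   which it was found, and ca2, which cannot reach the host, maps to an
   empty list.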

   The Channel Adapter address is not guaranteed to be globally unique.
   A remote Channel Adapter address is valid only for the local Channel
   Adapter through which it was found.

   Note: There is no requirement that the DAT Name service provide a
         host name based on a Channel Adapter Address. A DAT Provider is
         free to provide this optional functionality, but a DAT Consumer
         (for example, the DAFS protocol) SHALL NOT rely on this func-
         tionality.


Appendix D.   DAFS Mapping to VI Architecture

   This appendix provides a mapping of Direct Access Transport (DAT)
   semantics onto the Virtual Interface Architecture (VIA) and describes
   how the Direct Access File System (DAFS) makes use of this mapping.

D.1.  Terminology Mapping from DAT to VI

   DAT Channel Adapter (CA)

      VI Network Interface Controller (NIC)

      A NIC provides an electro-mechanical attachment of a computer to a
      network. Under program control, a NIC copies data from memory to a
      network medium (transmission) and from the medium to memory
      (reception), and implements a unique destination for messages
      traversing the network.

   DAT CA Address

      VI Host Address

      The logical network address of the VI NIC.

   DAT Connection

      VI Connection

      An association between a pair of VI endpoints such that the data
      of data transfer operation requests posted on either VI endpoint
      arrives at the other VI endpoint of the Connection.

   DAT Connection Qualifier

      VI Discriminator

      A value that allows a Connection Manager to associate an incoming
      Connection request with the entity providing the service.

   DAT Consumer

      VI Consumer

      An application that requires VI services.

   DAT Provider

      VI Provider


      Provider of the DAT services for a VI application.

   DAT DTC - Data Transfer Completion

      VI Descriptor

      Status of the completed data transfer operation.

   DAT DTO - Data Transfer Operation

      VI Descriptor

      A requested data movement submitted to a VI Provider.

   DAT Endpoint (EP)

      Virtual Interface Endpoint (VI Endpoint)

      The local part of a Connection that supports posting data transfer
      operation requests.

   DAT Fabric

      VI Network Fabric

      A network with RDMA capabilities.

   DAT Operation Type

      VI Operation Type

      Send, Receive, RDMA Read or RDMA Write DTOs.

   DAT RDMA

      VI RDMA

      Remote direct memory access: access of local memory by the remote
      VI. There are two RDMA operations: RDMA Read and RDMA Write.

   DAT RDMA Memory Region Context (RMR Context)

      VI VIP_MEM_HANDLE (Memory Handle)

      A programmatic construct that represents a process's authorization
      to specify a memory region to the VI NIC. A Memory Handle is a
      representation of an arbitrary-sized, registered, contiguous
      virtual space that is registered with a NIC so that it can support
      Remote DMA operations on the Connection whose local VI belongs to
      the NIC.

   DAT RMR Target Address

      VI Virtual Address

      The RMR Target Address specifies the memory address within a
      region of memory represented by an RDMA Memory Region Context.
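
   For quick reference, the terminology mapping in this section can be
   written as a lookup table. The dictionary name, the helper function,
   and the shortened key spellings are illustrative choices; the term
   pairs themselves are transcribed from the list above.

```python
# DAT term -> VI Architecture term, transcribed from the list above.
DAT_TO_VI = {
    "Channel Adapter": "Network Interface Controller (NIC)",
    "CA Address": "Host Address",
    "Connection": "Connection",
    "Connection Qualifier": "Discriminator",
    "Consumer": "Consumer",
    "Provider": "Provider",
    "Data Transfer Completion": "Descriptor",
    "Data Transfer Operation": "Descriptor",
    "Endpoint": "Virtual Interface Endpoint",
    "Fabric": "Network Fabric",
    "Operation Type": "Operation Type",
    "RDMA": "RDMA",
    "RMR Context": "VIP_MEM_HANDLE (Memory Handle)",
    "RMR Target Address": "Virtual Address",
}


def vi_term(dat_term: str) -> str:
    """Return the VI term for a DAT term, or raise KeyError if unmapped."""
    return DAT_TO_VI[dat_term]
```

   Note that both the Data Transfer Operation and the Data Transfer
   Completion map to the VI Descriptor.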

D.2.  Additional VI Terminology

   Several more VI terms used in this appendix need to be defined. The
   definitions are quoted from the Virtual Interface Architecture
   specification [VIArch], Chapter 1:

   Completion Queue (CQ)

      A queue containing information about completed Descriptors. Used
      to create a single point of completion notification for multiple
      queues.

   Immediate Data

      Data contained in a Descriptor that is sent along with the data to
      the remote node and placed in the remote node's pre-posted Receive
      Queue Descriptor.

   Memory Protection Tag

      A unique identifier generated by the VI Provider for the use by
      the VI Consumer. Memory Protection Tags are associated with VIs
      and Memory Regions to define the access permission the VI has to a
      memory region.

   Memory Region

      An arbitrary sized region of a process's virtual address space
      registered as communication memory such that it can be directly
      accessed by the VI NIC.

   Work Queue (WQ)

      A posted list of Descriptors being processed by a VI NIC. Every VI
      has two Work Queues: a send queue and a receive queue. The combi-
      nation of the Work Queue selected by the post operation and the
      operation type indicated by the Descriptor determine the exact
      type of data movement that the VI NIC will perform.


D.3.  DAT Requirements Mapping

   VI Architecture supports DAT semantics. The mapping of DAT
   requirements onto VI Architecture follows.

   1. DAT SHALL support connections that provide send-recv message
   transfers and RDMA Read and Write operations.

      Complies

   2. DAT SHALL support reliable connections, which provide the
   following features:

      "The VI Architecture supports three levels of communication
      reliability at the NIC level: Unreliable Delivery, Reliable
      Delivery and Reliable Reception... Support for Reliable Delivery
      and Reliable Reception is OPTIONAL." (Virtual Interface
      Architecture Specification, Chapter 2.5, Page 16) Both Reliable
      Delivery and Reliable Reception satisfy all of the DAT
      requirements. All DAFS servers and clients are REQUIRED to support
      Reliable Delivery and MAY optionally support Reliable Reception.

   3. All data transfer operations submitted to the DAT Provider will
   complete successfully in the absence of errors, with data delivered
   uncorrupted, in the order defined by DAT ordering rules (see below).

      "A reliable Delivery VI guarantees that all data submitted for
      transfer will arrive at its destination exactly once, intact, and
      in the order submitted, in the absence of errors." (Virtual Inter-
      face Architecture Specification, Chapter 2.5.2, Page 18)

   4. Corruption of the data delivered to the Consumer (local one for
   RDMA Read) is detected as an error and reported to the Consumer.

      Complies

   5. Data loss (inability to deliver data to the remote endpoint of
   the connection or, for RDMA Read, from the remote endpoint to the
   local one) SHALL be detected as an error and reported to the
   Consumer.

      Complies

   6. Upon detection of an error, the connection SHALL be broken and all
   outstanding and in progress data transfer operations SHALL complete
   with an error.

      Complies


   7. There is a one-to-one correspondence between send operations on
   one endpoint of the connection and recv operations on the other end-
   point of the connection.

      Complies

   8. There is no correspondence between RDMA operations on one endpoint
   of the connection and recv or send data transfer operation on the
   other endpoint of the connection.

      "No Descriptors on the remote node's receive queue are consumed by
      RDMA operations... The exception to this rule is that if Immediate
      Data is specified by the initiator of an RDMA Write request it
      will consume a Descriptor on the remote end when the data transfer
      is complete, thus allowing for synchronization." (Virtual Inter-
      face Architecture Specification, Chapter 2.3.1, Pages 14-15). DAFS
      does not use Immediate Data.

   9. Data Transfer Operation Completion means that the Consumer can
   reclaim resources associated with the operation including the memory
   that contains the data.

      Complies

   10. The data payload for the send operation matching a receive opera-
   tion MUST be delivered into the receiver indicated memory buffer
   without errors prior to the receive completion.

      Complies

   11. Receive operations on a connection MUST be completed in the order
   of posting of their corresponding sends.

      Complies

   12. Each RDMA write operation posted on a connection prior to a send
   operation MUST have its data payload delivered to the target memory
   region prior to the completion of the receive operation matching that
   send.

      Complies

   13. DAT SHALL support multiple connections between the same or dif-
   ferent pairs of nodes (client server pairs).

      Complies

   14. An RDMA Memory Region Context (RMR Context) SHALL support RDMA
   operations for the set of DAT connections that are associated with
   it. The association between a connection and an RMR Context is esta-
   blished by the local endpoint of the connection where the Memory
   Region resides.

      Complies

   15. The same RMR Context can be associated with multiple connections.

      Complies

   16. A connection can have multiple RMR Contexts associated with it.

      Complies

   17. The DAT Provider SHALL allow the DAT Consumer to create multiple
   RDMA Memory Region Contexts for the same memory.

      Complies

   18. DAT SHALL support connection management including the client-
   server connection establishment and the connection termination by
   either side of the connection.

      Complies

D.4.  VI & Connections

   There is a one-to-one correspondence between a DAT connection and a
   VI connection.

D.4.1.  VI Discriminators

   A VI Provider can use the same Discriminator to establish multiple
   connections between the same pair of hosts. Moreover, the DAFS Client
   should use the same remote Discriminator for all the DAFS communica-
   tion channels for all DAFS Sessions to the same DAFS Server. The
   Discriminator and the Host Address can be obtained from the DAFS and
   DAT Name Service as described in section D.6. "VI Data Transfer
   Operations".

D.4.2.  VI Connection Attributes

   In order for the VI connection to be established, the following three
   attributes of the DAFS client's and DAFS server's VI connection
   endpoints need to match:

   o  ReliabilityLevel

   o  MaxTransferSize

   o  QoS - should be set to 0

   VI Architecture optionally supports reliable delivery or reliable
   reception through the VI ReliabilityLevel attribute. Both reliable
   delivery and reliable reception support all of the reliable data
   exchange requirements of DAT. DAFS servers are REQUIRED to support
   Reliable Delivery and MAY optionally support Reliable Reception. A
   DAFS server advertises its support for Reliable Reception through the
   Transport Specific Server Attributes field of its DAT Location (see
   D.7. "Name Service Mapping for VI Architecture").

   The DAFS server SHALL support a VI MaxTransferSize up to the value it
   advertises in the Transport Specific Server Attributes field of its
   DAT Location (see D.7. "Name Service Mapping for VI Architecture").
   It is recommended that a DAFS client choose the largest value its own
   NIC can support, up to the DAFS server's advertised MaxTransferSize.
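
   As an illustration of the attribute rules above, the following C
   sketch shows one way a client could pick its MaxTransferSize and
   check that two endpoints agree on the three attributes. The enum,
   struct, and function names are hypothetical and are not VIPL
   definitions; real code would read and set these values through the
   VI Provider's attribute structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three VI attributes that must match. */
enum vi_reliability { VI_RELIABLE_DELIVERY, VI_RELIABLE_RECEPTION };

struct vi_conn_attrs {
    enum vi_reliability reliability_level;
    uint32_t            max_transfer_size;
    uint32_t            qos;               /* DAFS expects 0 */
};

/* Largest size the client NIC supports, capped by the server's
 * advertised MaxTransferSize. */
static uint32_t choose_max_transfer_size(uint32_t client_nic_max,
                                         uint32_t server_advertised_max)
{
    return client_nic_max < server_advertised_max ? client_nic_max
                                                  : server_advertised_max;
}

/* True when the two endpoints agree on all three attributes. */
static bool vi_conn_attrs_match(const struct vi_conn_attrs *a,
                                const struct vi_conn_attrs *b)
{
    return a->reliability_level == b->reliability_level &&
           a->max_transfer_size == b->max_transfer_size &&
           a->qos == b->qos;
}
```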

D.4.3.  VI Endpoint Attributes

   VI endpoints on the DAFS server are NOT REQUIRED to have the
   EnableRDMARead and EnableRDMAWrite attributes set for any of the DAFS
   channels. Moreover, it is recommended that they not be set, since
   DAFS clients do not post RDMA operations and DAFS servers do not
   advertise any Memory Handles.

   The DAFS server is REQUIRED to support RDMA Write and MAY support
   RDMA Read. A DAFS server advertises its support for RDMA Read through
   the Transport Specific Server Attributes field of its DAT Location
   (see D.7. "Name Service Mapping for VI Architecture").

   If the DAFS server indicates support for RDMA Read and the DAFS
   client would like to use operations that depend on server RDMA Reads,
   then the client MUST set the EnableRDMARead attribute to TRUE for its
   VI endpoints of the Operation Channel and of the RDMA-read Channel
   (if one is requested by the DAFS server). If the DAFS client would
   like to use operations that depend on server RDMA Writes, then the
   client MUST set the EnableRDMAWrite attribute to TRUE for its VI
   endpoints of Operation Channels. The DAFS client is NOT REQUIRED to
   set the EnableRDMAWrite and EnableRDMARead attributes to TRUE for VI
   endpoints of Back-control Channels.
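
   The client-side rules above can be summarized by a small decision
   function. This is a non-normative sketch: the types and the function
   client_rdma_enables() are illustrative names, not part of VIPL or
   DAFS, and the two "want" flags stand for the client's own decision
   to use operations that depend on server RDMA Reads or Writes.

```c
#include <stdbool.h>

/* Channel kinds named in this appendix; the enum itself is illustrative. */
enum dafs_channel { CH_OPERATION, CH_BACK_CONTROL, CH_RDMA_READ };

struct vi_rdma_enables {
    bool enable_rdma_read;   /* stands in for EnableRDMARead  */
    bool enable_rdma_write;  /* stands in for EnableRDMAWrite */
};

/* Compute the attribute values a client VI endpoint needs for a channel. */
static struct vi_rdma_enables
client_rdma_enables(enum dafs_channel ch,
                    bool server_supports_rdma_read,
                    bool want_server_rdma_read,
                    bool want_server_rdma_write)
{
    struct vi_rdma_enables e = { false, false };
    bool use_read = server_supports_rdma_read && want_server_rdma_read;

    switch (ch) {
    case CH_OPERATION:
        e.enable_rdma_read  = use_read;
        e.enable_rdma_write = want_server_rdma_write;
        break;
    case CH_RDMA_READ:
        e.enable_rdma_read  = use_read;
        break;
    case CH_BACK_CONTROL:
        break;               /* neither attribute is required */
    }
    return e;
}
```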

D.4.4.  DAFS Flow Control Initialization

   When a VI connection is established for the DAFS Session, the DAFS
   Server MUST already have one receive Descriptor posted to receive the
   initial DAFS connection request of the DAFS communication channel for
   the DAFS Session. For each VI connection for a DAFS Session
   (Operation Channel, Back-control Channel and RDMA-read Channel) the
   DAFS Server Provider SHALL post the receive Descriptor. The size of
   that pre-posted Descriptor buffer SHALL be 4 KB. The DAFS Client's
   connection request MUST fit into the 4 KB buffer.

D.4.5.  VI Disconnect

   A VI Consumer issues a Disconnect request to its VI Provider in order
   to disconnect a connected VI. The Disconnect request unilaterally
   aborts the connection. A Disconnect request results in the completion
   of all outstanding Descriptors on that VI endpoint with an error. A
   VI Provider detects that a VI is no longer connected and notifies the
   VI Consumer that it is no longer a part of a connection. Minimally,
   the VI Consumer will be notified upon the first data transfer opera-
   tion that follows the disconnect.

D.5.  VI Architecture Memory Semantics

   The VI Architecture provides memory registration functionality that
   allows DAFS clients to register memory for RDMA operations.

   DAFS requires that all Memory Regions whose Memory Handles will be
   used as RMR Contexts SHALL have their RDMA Read Enable attribute,
   RDMA Write Enable attribute, or both set according to the client's
   use of this memory.

   The client's Memory Handle that corresponds to the RMR Context can be
   passed over the Operation Channel to a remote server site. The Memory
   Protection Tag of the memory region of that Memory Handle and of the
   VI endpoint that is the endpoint of the DAFS channel that supports
   RDMA operations MUST be the same. The memory region registered SHALL
   be a contiguous virtual space as specified by DAT.

   The DAFS Client VIs that are the endpoints of the Operation Channel
   and the RDMA-read Channel of the same DAFS Session MUST share the
   same Memory Protection Tag (PTAG). This ensures that registered
   memory regions accessible by the Operation Channel of the DAFS
   Session are also accessible by the RDMA-read Channel of that DAFS
   Session.

D.6.  VI Data Transfer Operations

   The VI Architecture provides data transfer operations over a connec-
   tion that support DAT reliability and ordered delivery requirements.

   VI Architecture data transfer operations are represented by
   Descriptors. Descriptors are posted to the VI endpoints. Descriptors
   are composed of segments. There are three types of segments: control,
   address and data. The Control Segment contains control and status
   information as well as reserved fields that are used for queuing. An
   Address Segment follows the Control Segment, but only for RDMA
   operations. This segment contains remote buffer information for RDMA
   Read and RDMA Write operations. Remote buffer information consists of
   the remote Memory Handle (RMR Context for DAT) and the remote buffer
   virtual address (RMR Target Address for DAT). The Data Segment
   contains information about the local buffers of a send, receive, RDMA
   Read or RDMA Write operation. A Descriptor MAY contain multiple Data
   Segments.

   VI Architecture supports multiple outstanding Descriptors posted to
   the Work Queues of a VI endpoint. This allows multiple send, receive,
   RDMA Read, and RDMA Write operations to be active simultaneously. The
   DAT delivery rules require this support, and DAFS clients can take
   advantage of it by negotiating the OPNreq value with the server for
   Operation and Back-control Channels.

   DAFS does not use Immediate Data.

D.7.  Name Service Mapping for VI Architecture

   This section describes how the DAT and DAFS Name Service features are
   supported in VI Architecture.

   The DAT Channel Adapter Address is represented by VI Host Address.
   The DAT Name Service query is mapped into VipNSGetHostByName.

   The DAT Connection Qualifier is mapped onto a VI Discriminator. The
   DAT Connection Qualifier is a string while the VI Discriminator is
   not. A VI Network Address contains the VI Host Address, the
   Discriminator Length, and the Discriminator value. The maximum VI
   Discriminator length REQUIRED to be supported by all compliant
   implementations is 16 bytes. Hence, DAFS uses Discriminators of at
   most 16 bytes.

   The mapping scheme between a DAT Connection Qualifier and a VI
   Discriminator is defined as follows. The DAT Connection Qualifier
   used for the DAT Transport Type of DAFS_DAT_VI is a string of up to
   35 characters. The first two characters of the Connection Qualifier
   encode the length of the Discriminator in human-readable decimal.
   This means that the value 16 is represented by the string "16", and
   the value 1 is represented by the string "01". The next character is
   a space delimiter, provided to improve human readability of the DAT
   Connection Qualifier. The VI Discriminator is an array of octets.
   Each octet is mapped into 2 consecutive characters. The characters
   representing the Discriminator value start immediately after the
   delimiter.
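
   The mapping above might be implemented along the following lines.
   This is a sketch under two stated assumptions: the two-digit prefix
   counts Discriminator octets, and each octet is rendered as two
   uppercase hexadecimal digits. The text above fixes only "2
   consecutive characters" per octet, so the hexadecimal rendering and
   the function name are illustrative choices.

```c
#include <stddef.h>
#include <stdio.h>

#define DAFS_VI_MAX_DISCRIMINATOR_LEN 16
/* 2 length digits + 1 space + 2 characters per octet + NUL terminator */
#define DAFS_VI_QUALIFIER_BUF (2 + 1 + 2 * DAFS_VI_MAX_DISCRIMINATOR_LEN + 1)

/* Returns 0 on success, -1 if the Discriminator is too long or the
 * output buffer is too small. */
static int dafs_vi_encode_qualifier(const unsigned char *disc, size_t disc_len,
                                    char *out, size_t out_size)
{
    size_t i;

    if (disc_len > DAFS_VI_MAX_DISCRIMINATOR_LEN ||
        out_size < 3 + 2 * disc_len + 1)
        return -1;

    /* Two decimal digits of length, then the space delimiter. */
    snprintf(out, out_size, "%02u ", (unsigned)disc_len);

    /* ASSUMPTION: each octet becomes two uppercase hex digits. */
    for (i = 0; i < disc_len; i++)
        snprintf(out + 3 + 2 * i, out_size - 3 - 2 * i, "%02X",
                 (unsigned)disc[i]);
    return 0;
}
```

   With these assumptions, a 16-octet Discriminator yields the maximum
   35-character Connection Qualifier (2 + 1 + 32).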



   A DAFS server can listen on multiple Discriminators and hence on
   different DAT Connection Qualifiers. Each Connection Qualifier MUST
   be advertised as a different DAFS Location. All the DAT Connection
   Qualifiers of the same server host NIC share the DAT Hostname and VI
   Host Address. If there are multiple VI NICs on the host then the DAFS
   server listens on the same set of Discriminators on all these NICs.
   The DAT Hostname is the same for all of them. The query to the DAT
   Name Service provides the list of all VI NIC Addresses.

   The DAFS Server should specify in the Transport Specific Server
   Attributes parameter of its DAT Location what values of the parame-
   ters it supports. These parameters include:

   ReliableReceptionSupported

      Indicates that DAFS server VIs can support Reliable Reception.
      This parameter can be set if all DAFS server NICs can support
      Reliable Reception. The first 25 characters of the Transport
      Specific Server Attributes field specify
      ReliableReceptionSupported. "Reliable Reception TRUE " means that
      the server's VIs can support Reliable Reception, while
      "Reliable Reception FALSE " means that not all of the server's VIs
      can support Reliable Reception. If a DAFS client requests a
      connection on a local VI whose ReliabilityLevel attribute is set
      to Reliable Reception (in which case the client's NIC to which the
      VI belongs also supports Reliable Reception) and the DAFS server
      advertised that capability, then the DAFS server MUST set the
      ReliabilityLevel of the responding VI to Reliable Reception.

   MaxTransferSize

      Specifies the maximum transfer size the server VIs can accept. It
      is the minimum among the MaxTransferSize attributes of the VI NICs
      advertised by the DAFS server. The 16 characters starting from the
      26th character of the Transport Specific Server Attributes field
      specify the MaxTransferSize. The representation is a
      human-readable decimal number analogous to the Discriminator
      length encoding. If the MaxTransferSize of the requesting client's
      VI does not exceed the DAFS server's advertised MaxTransferSize,
      then the DAFS server MUST set the MaxTransferSize of the accepting
      VI equal to the client's.

   RDMAReadSupported

      Specifies RDMA Read support of the DAFS server. If the Transport
      Specific Server Attributes field characters starting from the 42nd
      are "RDMA Read TRUE " then the DAFS server supports RDMA Read and
      if "RDMA Read FALSE" then it does not.
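
   A client could decode the fixed-width fields described above roughly
   as follows. This sketch is not normative. It assumes that each
   keyword is padded with trailing spaces to its field width (25 and 15
   characters) and that the 16-character MaxTransferSize field holds
   decimal digits, optionally padded with spaces; the struct and
   function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RR_FIELD_LEN   25   /* characters 1..25:  ReliableReceptionSupported */
#define MTS_FIELD_LEN  16   /* characters 26..41: MaxTransferSize            */
#define RDMA_FIELD_LEN 15   /* characters 42..56: RDMAReadSupported          */
#define ATTRS_MIN_LEN  (RR_FIELD_LEN + MTS_FIELD_LEN + RDMA_FIELD_LEN)

struct dafs_vi_server_attrs {
    bool     reliable_reception;
    uint64_t max_transfer_size;
    bool     rdma_read;
};

/* Compare a fixed-width field with a keyword, ignoring trailing spaces. */
static bool field_is(const char *field, size_t width, const char *keyword)
{
    size_t klen = strlen(keyword);
    size_t i;

    if (klen > width || strncmp(field, keyword, klen) != 0)
        return false;
    for (i = klen; i < width; i++)
        if (field[i] != ' ')
            return false;
    return true;
}

/* Parse decimal digits with optional space padding on either side. */
static int parse_decimal_field(const char *field, size_t width, uint64_t *value)
{
    size_t i = 0, first_digit;
    uint64_t v = 0;

    while (i < width && field[i] == ' ')
        i++;
    first_digit = i;
    for (; i < width && field[i] >= '0' && field[i] <= '9'; i++)
        v = v * 10 + (uint64_t)(field[i] - '0');
    if (i == first_digit)
        return -1;                       /* no digits at all */
    for (; i < width; i++)
        if (field[i] != ' ')
            return -1;                   /* junk after the number */
    *value = v;
    return 0;
}

/* Returns 0 on success, -1 if the string is too short or malformed. */
static int dafs_vi_parse_server_attrs(const char *s,
                                      struct dafs_vi_server_attrs *out)
{
    const char *rdma;

    if (strlen(s) < (size_t)ATTRS_MIN_LEN)
        return -1;

    if (field_is(s, RR_FIELD_LEN, "Reliable Reception TRUE"))
        out->reliable_reception = true;
    else if (field_is(s, RR_FIELD_LEN, "Reliable Reception FALSE"))
        out->reliable_reception = false;
    else
        return -1;

    if (parse_decimal_field(s + RR_FIELD_LEN, MTS_FIELD_LEN,
                            &out->max_transfer_size) != 0)
        return -1;

    rdma = s + RR_FIELD_LEN + MTS_FIELD_LEN;
    if (field_is(rdma, RDMA_FIELD_LEN, "RDMA Read TRUE"))
        out->rdma_read = true;
    else if (field_is(rdma, RDMA_FIELD_LEN, "RDMA Read FALSE"))
        out->rdma_read = false;
    else
        return -1;
    return 0;
}
```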



D.8.  DAFS Client Discriminators

   This section discusses support for establishing the Operation,
   Back-control and RDMA-read communication channels within a DAFS
   Session. Recall that there is a one-to-one correspondence between a
   DAFS communication channel and a DAT connection, and there is a
   one-to-one correspondence between a DAT connection and a VI
   connection. Hence, there is a one-to-one correspondence between a
   DAFS communication channel and a VI connection.

   At the time that the DAFS communication channel is created, the DAFS
   server will need to be able to differentiate a client's connection
   requests for the various types of DAFS Session channels. For
   instance, when the server accepts a connection for an RDMA-read
   Channel, it will need to use a VI whose Send Work Queue is tied to
   the CQ for RDMA Read completions, rather than the CQ for send and
   RDMA Write completions.

   This can be achieved by encoding the channel type in the client's
   Discriminator field. When the VI Provider for the DAFS server
   delivers a client's connection request, the server can examine the
   address attribute of the client's requesting VI. The Discriminator
   portion of the client's VI address encodes the type of DAFS
   communication channel this VI is being used for. This provides the
   server with enough information to accept the connection using a VI
   endpoint with the appropriate properties.

        #define Dafs_Channel_Dafs_Operation    0x01
        #define Dafs_Channel_Back_Control      0x02
        #define Dafs_Channel_Rdma_Read         0x04

        typedef uint32            Dafs_Channel_Enum_Type;


        struct DAFS_client_address_discriminator
           {
           opaque32               discriminator_magic;
           opaque32               client_instance_differentiator;
           opaque32               client_session_differentiator;
           Dafs_Channel_Enum_Type channel_type;
           };


   Fields:

   discriminator_magic

      The magic sequence 0x44 0x41 0x46 0x53 ('D' 'A' 'F' 'S') is used
      to identify a client discriminator as belonging to DAFS. This can
      be used as a sanity check and aid to identification of DAFS-
      related connection packets on bus analyzers, etc.

   client_instance_differentiator

      An opaque value chosen by the client to allow the server to dif-
      ferentiate different clients using the same remote VI NIC. Dif-
      ferent clients using the same VI NIC MUST choose different
      client_instance_differentiators.

   client_session_differentiator

      An opaque value chosen by the client to ensure that different
      discriminators can be used for each connection to the server
      (analogous to a port number).

   channel_type

      Identifies the type of DAFS communication channel that the VI will
      be used for after it is successfully connected. This allows the
      server to accept the connection request using a local VI endpoint
      with attributes appropriate for the given channel type (for
      example, server VI endpoints for DAFS RDMA-read Channels might use
      different CQs than server VI endpoints for DAFS Operation
      Channels).

   Note: VIPL-1.0 compliant VI providers are NOT REQUIRED to support a
         discriminator larger than 16 bytes (See MaxDiscriminatorLen
         above). Therefore DAFS_client_address_discriminator has been
         defined so that it fits within this 16 byte limit.

   The client_session_differentiator has scope within a specific
   client_instance_differentiator. It MUST be the same for all VI
   connections used for DAFS communication channels within a specific
   DAFS Session, and different for different DAFS Sessions. The triple
   {client_instance_differentiator, client_session_differentiator,
   channel_type} forms a unique VI discriminator.

   Rationale: This allows the server to associate related VI connections
              as soon as the VI connection is requested (i.e. before the
              binding to server VI or CQ is made), which might be desir-
              able for efficiency in implementation.
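
   The following sketch packs the four fields into the 16 octets of a
   VI Discriminator. It is illustrative only: the function names are
   hypothetical, the opaque fields are copied octet for octet, and
   channel_type is assumed to be stored most significant octet first,
   which this appendix does not state.

```c
#include <stdint.h>
#include <string.h>

#define Dafs_Channel_Dafs_Operation 0x01
#define Dafs_Channel_Back_Control   0x02
#define Dafs_Channel_Rdma_Read      0x04

#define DAFS_DISCRIMINATOR_LEN 16   /* four 32-bit fields */

/* Store a 32-bit value most significant octet first (ASSUMED order). */
static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Fill out[] with magic, instance, session, and channel type, in the
 * field order of DAFS_client_address_discriminator. */
static void dafs_build_client_discriminator(
    unsigned char out[DAFS_DISCRIMINATOR_LEN],
    const unsigned char client_instance[4],
    const unsigned char client_session[4],
    uint32_t channel_type)
{
    static const unsigned char magic[4] = { 0x44, 0x41, 0x46, 0x53 };

    memcpy(out,     magic,           4);   /* 'D' 'A' 'F' 'S' */
    memcpy(out + 4, client_instance, 4);
    memcpy(out + 8, client_session,  4);
    put_be32(out + 12, channel_type);
}
```

   All channels of one DAFS Session would pass the same
   client_session value and differ only in channel_type.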

D.9.  Design Notes

D.9.1.  Connection Establishment

   The DAFS client is expected to issue a VipConnectRequest in which the
   remote address is filled in from the Name Service (see Section D.7.
   "Name Service Mapping for VI Architecture") and the local address
   consists of the local VI NIC address and the VI Discriminator defined
   in Section D.8. "DAFS Client Discriminators". The DAFS server is
   expected to perform a VipConnectWait in which the local address is
   the one defined by the DAFS Name Service and advertised by the
   server.

   If the server intends to accept the connection, it MUST prepare a VI
   for the connection. The server VI Consumer MAY either choose an
   existing unconnected VI or create a new VI with attributes it
   considers appropriate for this connection request. To accept the
   connection, the server VI Consumer issues a ConnectAccept request to
   its VI Provider, specifying the incoming connection ID as well as the
   local VI to be used. If the local VI's MaxTransferSize, QoS, and
   reliability attributes match those needed by the remote VI, the
   connection is established; otherwise the request completes with an
   error. The server VI Consumer can also issue a ConnectReject
   specifying the incoming connection ID.

D.9.2.  Memory Registration

   The same Memory Handle can be advertised over multiple connections,
   in order to support RDMA operations over these connections, as long
   as the local endpoints of these connections match the Memory
   Protection Tag of the registered memory specified by the Memory
   Handle. By sharing the same Memory Protection Tag among Memory
   Regions and a VI, multiple Memory Regions can be associated with that
   VI. Hence, the remote endpoint of a connection of that VI can perform
   RDMA operations on the memory of these Memory Regions. By
   manipulating Memory Protection Tags, VI Consumers can control which
   VI of a VI NIC is associated with which Memory Regions. Thus, a
   Memory Region does not have to be accessible by all VIs on the same
   VI NIC.
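
   The access rule described above can be modeled as a simple
   predicate. In this sketch the Memory Protection Tag is modeled as an
   integer and the struct and function names are hypothetical; an
   actual VI Provider performs this check itself, so the code only
   illustrates the rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative models only; these are not VIPL types. */
struct vi_model {
    uint32_t ptag;                 /* Memory Protection Tag of the VI */
};

struct mem_region_model {
    uint32_t ptag;                 /* Memory Protection Tag of the region */
    bool     rdma_read_enable;
    bool     rdma_write_enable;
};

enum rdma_kind { RDMA_KIND_READ, RDMA_KIND_WRITE };

/* A remote RDMA operation arriving on vi may touch mr only if the tags
 * match and the region enables that kind of access. */
static bool rdma_access_allowed(const struct vi_model *vi,
                                const struct mem_region_model *mr,
                                enum rdma_kind kind)
{
    if (vi->ptag != mr->ptag)
        return false;
    return kind == RDMA_KIND_READ ? mr->rdma_read_enable
                                  : mr->rdma_write_enable;
}
```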

D.9.3.  NIC Attributes

   Several of the VI NIC attributes affect the scalability of the DAFS
   client.

   MaxRegisterRegions

      The maximum number of memory regions that can be registered. Since
      all memory used by the DAFS client for both send/receive
      operations and RDMA operations has to be registered, the number of
      memory regions that can be registered affects the granularity of
      the DAFS client memory registration.

   MaxVI

      The maximum number of VI instances supported by this VI NIC.

   MaxDescriptorsPerQueue

      The maximum number of Descriptors per VI Work Queue supported by
      this VI Provider. OPNreq cannot exceed MaxDescriptorsPerQueue.

   MaxTransferSize

      The maximum transfer size supported by this VI NIC. The inline
      data transfer size cannot exceed this value. The DAFS server
      advertises, in the DAT Location of the DAFS Name Service, the NIC
      MaxTransferSize value that all its VI endpoints will support.
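
   The two limits above that bear directly on DAFS parameters can be
   checked as shown in the following sketch. The struct and helper
   names are hypothetical; a real client would obtain the limits from
   the NIC attributes reported by its VI Provider.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative subset of the VI NIC attributes discussed above. */
struct vi_nic_limits {
    uint32_t max_descriptors_per_queue;  /* MaxDescriptorsPerQueue */
    uint32_t max_transfer_size;          /* MaxTransferSize        */
};

/* Cap a desired OPNreq at the per-queue Descriptor limit. */
static uint32_t clamp_opnreq(uint32_t wanted, const struct vi_nic_limits *nic)
{
    return wanted > nic->max_descriptors_per_queue
               ? nic->max_descriptors_per_queue
               : wanted;
}

/* True when an inline payload of the given size fits in one transfer. */
static bool inline_size_ok(uint32_t inline_bytes,
                           const struct vi_nic_limits *nic)
{
    return inline_bytes <= nic->max_transfer_size;
}
```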

D.10.  References

   [VIArch]

      Virtual Interface Architecture Specification: Version 1.0,
      December 16, 1997, published by Compaq, Intel & Microsoft
      (http://www.viarch.org/html/Spec/vi_specification_version_10.htm).



Appendix E.  DAFS Mapping to InfiniBand Reliable Connection

   This appendix provides a mapping of Direct Access Transport (DAT)
   semantics onto the InfiniBand Architecture using Reliable Connection
   (RC) service, and how the Direct Access File System (DAFS) makes use
   of this mapping.

   All InfiniBand references are to InfiniBand Architecture Release 1.0a
   Volume 1 - General Specifications, released June 19, 2001.

E.1.  Terminology Mapping from DAT to InfiniBand

   The following table maps DAT terms to the corresponding InfiniBand
   concepts.

   DAT Channel Adapter (CA)

      IB Host Channel Adapter (HCA)

      A Channel Adapter that supports abstract functionality described
      by a "verbs" interface. IB also supports TCAs (Target Channel
      Adapters) for devices with simpler interfacing. A DAFS client MUST
      be an HCA. Technically, the DAFS server could be a TCA, but it
      also requires the ability to initiate RDMA reads and writes, Reli-
      able Connection support, and a Communications Manager (CM). These
      features are generally only found on HCAs.

   DAT CA Address

      IB GID (Global Identifier)

      Address of a Channel Adapter on a specific port. IB also uses LIDs
      (Local Identifiers) that identify a specific path through the
      current subnet to a port. However, this mapping does not expose
      LIDs to the DAFS Provider.

   DAT Connection

      IB Connection

      An association of a Queue Pair (QP) with exactly one other QP,
      such that messages transmitted by the send work queue of one QP
      are reliably delivered to the receive work queue of the other QP.
      As such, each QP is said to be connected to the opposite QP. In
      this mapping, the IB Reliable Connection (RC) Transport Service
      Type is used.

   DAT Connection Qualifier

      IB Service ID

      A value that enables a Connection Manager to associate an incoming
      Connection Request with the entity providing the service. IB
      Service IDs are analogous to TCP/UDP port numbers, but are 64-bit
      integers.

   DAT Provider

      IB Channel Interface

      The presentation of the channel to the Verbs Consumer as imple-
      mented through the combination of Host Channel Adapter, associated
      firmware, and device driver software.

   DAFS Provider

      IB Verbs Consumer

      A direct user of the functionality of a Host Channel Adapter.

   DAT DTO - Data Transfer Operation

      IB Work Queue Entry (WQE)

      WQEs are placed on the Work Queues by the implementation-specific
      APIs that implement the IB verbs. Semantics are standardized; APIs
      and formats are not.

   DAT DTC - Data Transfer Completion

      IB Queue Pair (QP)

      A QP consists of a Send Work Queue and a Receive Work Queue. Note
      that in the alternate RD mapping, the endpoint would be the
      end-to-end context instead.

   DAT EndPoint (EP)

      IB Queue Pair (QP)

      A QP consists of a Send Work Queue and a Receive Work Queue. Note:
      in the alternate RD mapping, the endpoint would be the end-to-end
      context instead.

   DAT Fabric

      IB Fabric

      A collection of IB subnets connected by routers.

   DAT RDMA

      IB RDMA

      Remote Direct Memory Access.

   RDMA Memory Region Context (RMR Context)

      IB R-Key (Remote Key)

      An R-Key is a reference to a Memory Region or to a window pointing
      within a Memory Region. For external interfacing the two types of
      reference handles are represented as a differentiated union. As
      with all IB operations, registering memory MUST be done separately
      for each HCA.

   DAT RMR Target Address

      IB Virtual Address

      Memory Registration provides mechanisms that enable Consumers to
      describe a set of virtually contiguous memory locations or a set
      of physically contiguous memory locations to the Channel
      Interface. This enables the HCA to access them as a virtually
      contiguous buffer using Virtual Addresses represented by a 64-bit
      integer.

E.2.  Additional InfiniBand Terminology

   There are several more IB terms that are used in this appendix that
   need to be defined. All InfiniBand references are to InfiniBand
   Architecture Release 1.0a Volume 1 - General Specifications, released
   June 19, 2001:

   Communications Manager (CM)

      The software, hardware, or combination of the two that supports
      communication management mechanisms and protocols.

   Completion Queue (CQ)

      A queue containing one or more Completion Queue Entries, that are
      Channel Interface internal representations of Work Completions. A
      CQ creates a single point of completion notification for multiple
      queues.

   Immediate Data

      Data contained in a Work Queue Element that is sent along with the
      payload to the remote Channel Adapter and placed in a Receive Work
      Completion.

   Memory Region

      A virtually contiguous area of arbitrary size within a Consumer's
      address space that has been registered, enabling HCA local access
      and optional remote access.

   Partition

      A collection of Channel Adapter ports that are allowed to communi-
      cate with one another. Ports MAY be members of multiple partitions
      simultaneously. Ports in different partitions are unaware of each
      other's presence (insofar as possible).

   Protection Domain

      A mechanism for associating QPs, Memory Windows, and Memory
      Regions.

   Work Queue (WQ)

      A send or receive queue. A send queue contains WQEs that describe
      data to be transmitted. A receive queue contains WQEs that
      describe where to place incoming data.

E.3.  DAT Requirements Mapping

   The following table describes the mapping of DAT requirements onto
   InfiniBand architecture that uses Reliable Connection (RC) services.

   1. DAT SHALL support a connection that provides send-recv message
   transfers and RDMA Read and Write operations.

      Complies.

   2. DAT SHALL support reliable connection which provides the following
   features:

      From 9.1 (Transport Layer Overview): "When a QP is created it is
      associated with one of the five transport service types. The tran-
      sport service describes the degree of reliability and to what and
      how the QP transfers data. The five transport service types are:
      1) Reliable Connection 2) Reliable Datagram 3) Unreliable Datagram
      4) Unreliable Connection 5) Raw IPv6 Datagram & Raw Ethertype
      Datagram."

      The Reliable Connection service type trivially satisfies the DAT
      requirements for reliable connection oriented operation. All DAFS
      servers and clients are REQUIRED to support Reliable Connection.

      Note that it is possible to define an alternate mapping for DAT's
      reliable connection oriented service onto Reliable Datagrams using
      End-to-end Connections to provide the necessary connection
      oriented aspects. Defining such a mapping has been deferred to a
      later time.

      This Appendix describes mapping DAT connections onto InfiniBand
      Reliable Connected QPs only.

   3. All data transfer operations submitted to the DAT Provider will
   complete successfully in the absence of errors, with data delivered
   uncorrupted, in the order defined by ordering rules.

      Complies.

   4. Corruption of the data delivered to the Consumer (local one for
   RDMA Read) is detected as an error and reported to the Consumer.

      Complies.

   5. Data loss (inability to deliver data to the remote endpoint of
   the connection (from remote to local one for RDMA Read)) SHALL be
   detected as an error and reported to the consumer.

      Complies.

   6. Upon detection of an error, the connection SHALL be broken and all
   outstanding and in progress data transfer operations SHALL complete
   with an error.

      Locally detected errors MAY be corrected through local interac-
      tion.

      A regulated number of retries MAY be executed before an error is
      declared to the consumer.

      Exception: If an endpoint (Queue Pair) is reset then outstanding
      data transfer operations (WQEs) are removed from the queues
      without notifying the consumer. The DAT Provider MUST refrain from
      resetting Queue Pairs itself, but cannot prevent other management
      software from doing so. (See Table 78, Section 11.2.3.2 Modify
      Queue Pair).

      Exception: If a queue pair is destroyed, outstanding work requests
      are "out of scope" for the channel interface. The IBA Consumer is
      responsible for clean up of the resources associated with work
      requests on destroyed work queues. See section 11.2.3.4 (Destroy
      Queue Pair) on page 495.

   7. There is a one-to-one correspondence between send operations on
   one endpoint of the connection and recv operations on the other end-
   point of the connection.

      Complies.

   8. There is no correspondence between RDMA operations on one endpoint
   of the connection and recv or send data transfer operations on the
   other endpoint of the connection.

      "Normally an RDMA operation does not consume a receive WQE at the
      destination, but there is one exception. That is for an RDMA Write
      operation which specifies immediate data. Immediate data is 32
      bits of information that is optionally provided in a SEND or RDMA
      WRITE instruction, transferred as part of the operation, but
      instead of writing the immediate data to memory, the data is
      treated as another piece of status information and returned as a
      special field of the RECEIVE CQE status. This means that an RDMA
      WRITE with immediate data will consume a RECEIVE WQE at the
      destination." (IB Spec, Chapter 3.2.1, Pages 68-69). DAFS does not
      use Immediate Data.

   9. Data Transfer Operation Completion means that the Consumer can
   reclaim resources associated with the operation including the memory
   that contains the data.

      Complies.

   10. Ordering Rule: The data payload for the send operation matching a
   receive operation MUST be delivered into the receiver indicated
   memory buffer without errors prior to the receive completion.

      Complies.

   11. Ordering Rule: Receive operations on a connection MUST be com-
   pleted in the order of posting of their corresponding sends.

      Complies.

   12. Ordering Rule: Each RDMA write operation posted on a connection
   prior to a send operation MUST have its data payload delivered to the
   target memory region prior to the completion of the receive operation
   matching that send.

      Complies.

   13. DAT SHALL support multiple connections between the same or
   different pairs of nodes.

      Complies.

   14. An RDMA Memory Region Context (RMR Context) SHALL support RDMA
   operations for the set of DAT connections that are associated with
   it. The association between a connection and an RMR Context is esta-
   blished by the local endpoint of the connection where the Memory
   Region resides.

      Complies.

   15. The same RMR Context can be associated with multiple connections.

      Complies.

   16. A connection can have multiple RMR Contexts associated with it.

      Complies.

   17. The DAT Provider SHALL allow the DAFS Provider to create multiple
   RDMA Memory Region Contexts referencing the same memory.

      Complies.

   18. DAT SHALL support connection management including the client-
   server connection establishment and the connection termination by
   either side of the connection.

      Termination of connections using the Communications Manager SHOULD
      only be done after terminating the associated DAFS Sessions.
      Because InfiniBand Communications Management is conducted via
      out-of-band Management Datagrams (MADs), it is impossible to
      guarantee a predictable, orderly shutdown of an active connection.

E.4.  IBA Model

   InfiniBand offers a wide range of capabilities. Many of them are sim-
   ply not needed to meet the DAT requirements. These include:

   o  IBA offers both active/passive (client/server) and active/active
      (peer-to-peer) connection models. The DAT to IB-RC mapping uses
      only the active/passive (client/server) model.

   o  IBA allows end-to-end flow control as an option on each half of a
      connection. Since DAFS already provides its own flow control there
      is no need to exercise this option.

   o  IBA specifies two types of channel adapters: host channel adapters
      (HCAs) and target channel adapters (TCAs). The DAT to IB-RC map-
      ping assumes that all DAFS participants have the necessary HCA
      capabilities.

   IBA supports many other connection types in addition to Reliable
   Connection. This appendix defines the DAFS mapping only to the
   Reliable Connection Transport Service Type. An alternate mapping
   using Reliable Datagrams, End-to-end Connections and Reliable
   Datagram Domains is feasible. While such a mapping would have
   desirable scalability features for servers, it would be more complex
   to specify and would impose unneeded burdens on clients that have no
   need to connect to either multiple servers or servers providing only
   RD connections. Given the general DAFS objective for low-overhead
   clients, this would not be a suitable default mapping. It MAY be
   added at a later date as an alternate mapping.

   The DAFS mapping to InfiniBand Reliable Connections requires use of
   normal RC verbs once the connection is established through the Com-
   munication Manager (CM). The CM is described in Volume 1, Chapter 12
   of the InfiniBand Architecture Specification.

E.5.  InfiniBand Architecture Transport Endpoints and Connections

   There is a one-to-one correspondence between a DAT Connection and a
   connected pair of QPs using the Reliable Connection Transport Service
   Type.

   Most of the behavior necessary for DAFS connection management is
   implemented by the IBA Communications Manager (CM). Per the IBA
   Spec (Vol 1, Chapter 12.1):

           Connections are managed over Queue Pairs other than those
           used for the connection, through the protocol described
           herein, between the Communication Managers (CMs) on each
           system. (See Figure 126) The CMs communicate using
           Management Datagrams (MADs), typically over the General
           Services Interface (GSI) on each system. This document
           [InfiniBand Architecture specification] defines CM external
           behaviors, but internal interfaces and implementations are
           outside the scope of the InfiniBand Architecture
           specification.

   In general, InfiniBand places the bulk of the requirements on connec-
   tion establishment upon the client side. Per the Architecture specif-
   ication Chapter 12.1:

           The requirements on participating CMs are not equal. The
           initiating CM is responsible for collecting or calculating
           most of the information necessary to establish the
           connection. Much of the raw information is available from
           Subnet Administration, but some adjustments MAY be
           desirable, depending on the application of the channel.

   The DAFS client is the initiator for establishment of all connec-
   tions. This includes the back-control channel, even though it will be
   acting on it as the "responder." Under the DAFS Session establishment
   procedures, the server requests additional connections for back-
   control and RDMA Read channels, but the client is responsible for
   establishing all connections.

   The Communications Manager MAY be shared with other services on the
   same host. It is not the intent of this mapping to create special
   requirements for the Communications Manager. A Communications Manager
   implemented to flexibly meet the capabilities described by the
   InfiniBand specification SHOULD be deployable without modification.

E.5.1.  Proxy Communications Managers

   IBA allows Proxy Communications Managers (for more information see
   12.10.7, Active Client to Passive Server with Redirector, page 590).
   This means that the Channel Adapter Address for the CM might not be
   the address of the requested server. Instead, servers with multiple
   Channel Adapters MAY elect to have all of their connection
   establishment done via a single CM. This central CM MAY elect to
   complete the connection request on any of its channel adapters. This
   feature MAY be used by a multiple-adapter DAFS server to load
   balance new connections.

E.5.2.  Partitions

   Per InfiniBand Architecture Specification Volume 1, Release 1.0, sec-
   tion 3.5.6:

           Partitioning enforces isolation among systems sharing an
           InfiniBand fabric. Partitioning is not related to
           boundaries established by subnets, switches, or routers.
           Rather a partition describes a set of endnodes within the
           fabric that can communicate.

           Each port of an endnode is a member of at least one
           partition and MAY be a member of multiple partitions. A
           partition manager assigns partition keys (P_Keys) to each
           channel adapter port. Each P_Key represents a partition.
           Each QP and EE context is assigned to a partition and uses
           that P_Key in all packets it sends and inspects the P_Key in
           all packets it receives. Reception of an Invalid P_Key
           causes the packet to be discarded.

           Switches and routers MAY optionally be used to enforce
           partitioning. In this case the partition manager programs
           the switch or router with P_Key information and when the
           switch or router detects a packet with an invalid P_Key, it
           discards the packet.

   DAFS does not require the use of Partitions, but can work with them
   when they are present. The only requirement is that Partitioning not
   interfere with client/server communications.

E.5.3.  DAFS Connection Establishment Requirements

E.5.3.1.  DAFS Client

E.5.3.1.1.  Connection Request

   The DAFS client is responsible for initiating establishment of all IB
   Reliable Connections by issuing a REQ (Request for Communication)
   message. The REQ message content REQUIRED by DAFS is defined as
   follows:

   Service ID

      A 64-bit big-endian integer whose value is derived from the DAT
      Connection Qualifier of the DAFS server (for more information, see
      E.8. "DAFS Name Service Mapping for InfiniBand Reliable
      Connection").

   Transport Service Type

      Reliable Connection.

   Primary Remote Port GID

      Specifies the requested Channel Adapter Address. The Remote Port
      GID is derived from the DAT Host Name (for more information, see
      E.8. "DAFS Name Service Mapping for InfiniBand Reliable Connec-
      tion"). When multiple virtual servers are supported on the same
      HCA, each SHOULD have its own virtual GID to differentiate the
      requests.

   Local QPN

      The local QP on which the DAFS client wants to establish a connec-
      tion.

   PrivateData

      The content of the PrivateData field is defined in E.9. "DAFS
      Client Connection Request PrivateData".

   The partition key for all connections of a DAFS session MUST be the
   same. All other fields of the DAFS client REQ message are defined by
   the standard IB rules. These include among others: Remote CM Response
   Timeouts, Alternative Remote Port GID, Primary and Alternative Local
   Port GID, Primary and Alternative Traffic Class, Primary and Alterna-
   tive Packet Rate, Primary and Alternative LIDs, and Local Communica-
   tion ID.
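
   As an illustration of the partition key rule above, the following
   sketch checks that every connection of one DAFS session carries the
   same P_Key before the REQ messages are issued. The structure and
   function names are hypothetical; they are not defined by DAFS, DAT,
   or [IB], and only the fields needed for the check are shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection REQ parameters (illustration only). */
struct dafs_req_params {
    uint64_t service_id;  /* derived from the DAT Connection Qualifier */
    uint32_t local_qpn;   /* local QP number chosen by the client */
    uint16_t p_key;       /* partition key to be carried in the REQ */
};

/* Returns 1 if all n connections of a session share one P_Key, else 0. */
int dafs_session_pkeys_match(const struct dafs_req_params *conns, size_t n)
{
    assert(conns != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (conns[i].p_key != conns[0].p_key)
            return 0;
    }
    return 1;
}
```

   A client could run such a check over the Operation, back-control,
   and RDMA Read channel parameters of a session before connecting.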

E.5.3.1.2.  Response Messages

   The CM on the DAFS client SHALL handle the following types of
   response to its connection request message:

   1) REP - Reply to Request for Communication. The Remote
      Communication ID from this message SHALL be used for Disconnect of
      this connection (see E.5.4. "Disconnect" for more information).

   2) MRA - Message Receipt Acknowledgement. The DAFS server CM cannot
      respond to the REQ message within the requested timeout. The MRA
      extends the timeout period for the original request.

   3) Redirecting REJ - One form of Rejection message can be used in
      conjunction with Proxy Communications Management. The connection
      is rejected as requested, but the CM supplies alternate values for
      the primary and alternate endpoints. The CM on the DAFS client
      MUST resubmit the connection request message with the supplied
      alternate values. The DAT Provider and/or CM MUST implement this
      process transparently to the DAFS Provider.

   4) Normal REJ - Connection request is rejected.

E.5.3.1.3.  Ready to Use Message

   Upon receiving a REP message from the DAFS server, the DAFS client
   can issue an RTU message using the same Local Communication ID as
   the REQ message and the Remote Communication ID from the REP
   message. The DAFS client does not use the PrivateData field of the
   RTU message.

   Prior to issuing the RTU message, the DAFS client SHALL ensure that
   RDMA Read and RDMA Write are enabled on the QP endpoint for the
   Operation channel. Prior to the RTU message the DAFS client SHALL
   ensure that RDMA Read is enabled on the QP endpoint of the RDMA Read
   channel (if created in response to the DAFS server request).

E.5.3.2.  DAFS Server

E.5.3.2.1.  Connection Request Message Receipt

   The contacted DAFS server MAY respond through its CM with one of
   four types of message:

   MRA - Message Receipt Acknowledgement

      MRAs are sent when the recipient of the message anticipates that
      it will not be able to respond within the time specified within
      the REQ message. An MRA avoids unnecessary retries. Frequently the
      server side needs to create and/or modify queue pairs before the
      connection is usable. The MRA enables it to hold off retries while
      it finishes this work. HCAs are NOT REQUIRED to be able to
      generate MRAs. Therefore a CM on such an HCA would have to finish
      its work promptly.

   Redirecting REJ

      Reject message is used to reject the connection as requested. This
      rejection specifies different primary and/or alternate endpoints.
      There is no requirement to reserve these resources at the time the
      redirecting REJ is issued.

   Normal REJ

      Reject message is used to reject the connection.

   REP - Reply

      Accepts the connection, specifying the local QPN for the DAFS
      server endpoint to be used for the requested connection.

      All other fields of the DAFS server REP message are defined by the
      standard IB rules.

   The server-side Channel Interface MUST allocate or create listener(s)
   to accept new connections. Operations Channel listeners MUST be ready
   to process the first receive for the DAFS Session establishment
   exchanges. The Channel Interface MAY need to use MRAs (if supported
   by the IB Channel Interface) until the listener is ready to begin the
   DAFS Session establishment.

   The DAFS server can either create QPs to be used by its CM or let the
   CM create QPs with appropriate parameters in response to a REQ mes-
   sage from the DAFS client.

   The DAFS server is NOT REQUIRED to have RDMA Read or RDMA Write enabled
   on the QP endpoints for any of the DAFS communication channels.

E.5.4.  Disconnect

   Per InfiniBand Architecture Specification Volume 1, Release 1.0,
   Chapter 12.10.8 - Communication Release, page 561:

           Communication release as illustrated in this section is
           ungraceful. Upon receipt of a Disconnection Request, each CM
           SHALL cause the affected QP to be placed into the error
           state, causing pending work requests to complete with the
           Flush error status.

           Consumers are free to define and execute a more graceful
           communication release protocol that allows for an orderly
           shutdown of communications. Any such protocol SHALL utilize
           the communication release protocol illustrated below after
           the termination of normal message processing.

   DAFS clients SHOULD terminate communications at the DAFS Protocol
   layer before requesting the release of the Connection. Since any
   server-side initiated termination is inherently ungraceful, there is
   no need for a DAFS layer disconnect, nor is there any method of doing
   so.

   Note that the Channel Interface MUST accept DREQ requests, even if
   the DAFS layer has failed to shut down properly. Attempts to use a
   disconnected endpoint will return an error.

E.5.5.  Automatic Path Migration

   This DAFS mapping does not require use of Automatic Path Migration
   (APM) capabilities of InfiniBand. Switching between DAFS servers is
   handled at the DAFS layer.

   However, in some deployments the use of Automatic Path Migration MAY
   be a requirement and/or a default behavior built into the Communica-
   tions Manager. In order to provide greater compatibility with other
   local services, the DAT Provider MAY utilize APM capabilities.
   However, it SHALL NOT require the DAFS Provider to interact with the
   migration process.

   APM is allowed to be used only to switch paths between the DAFS
   client and DAFS server. It MUST NOT be used to switch a DAFS client
   to an alternate DAFS server. Migration to fallback servers is the
   responsibility of the DAFS layer.

E.6.  IBA Memory Semantics

E.6.1.  Memory Regions and Memory Windows

   InfiniBand allows the Consumer to register virtual or physical memory
   with a specific HCA. This process returns an L-Key (local key) and an
   optional R-Key (remote key). The L-Key is used only in local interac-
   tions. The R-Key MUST be supplied in all on-the-wire references. The
   R-Key corresponds to a DAT RMR Context. Registering memory for
   local-only purposes is outside the scope of this mapping.

   A distinct verb (operation) exists to re-register the same memory
   region and receive an additional set of keys. This extra region can
   have different access attributes.

   IBA also supports Memory Windows, which allow access to specific por-
   tions of a Memory Region for a specific QP. The dynamic binding
   receives a new R-Key. The remote holder of an R-Key does not have to
   be aware of whether it refers to a Memory Region or a Memory Window.

   DAFS requires that all Memory Regions or Memory Windows whose R-Keys
   will be used as RMR Contexts SHALL have their Access Control set to
   Enable Remote Write Access or Enable Remote Read Access or both,
   according to the client's use of this memory and corresponding RMR
   Context.

   Note: Since DAFS does not use the RDMA Atomic operation, DAFS never
         requires that Access Control on a Memory Region or Memory Win-
         dow have Remote Atomic Operation Access Enabled.
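
   The Access Control rule above can be sketched as a mapping from the
   client's intended use of a buffer to the remote access it enables.
   The flag names below are hypothetical stand-ins for the IBA Access
   Control settings and are not taken from any verbs interface; the
   sketch assumes the server is the party that issues the RDMA
   operations against the client's memory.

```c
#include <assert.h>

/* Hypothetical names for the IBA remote Access Control settings. */
enum {
    ACC_REMOTE_READ   = 1 << 0,
    ACC_REMOTE_WRITE  = 1 << 1,
    ACC_REMOTE_ATOMIC = 1 << 2
};

/* Hypothetical description of how the client will use the memory. */
enum {
    USE_SERVER_WRITES = 1 << 0,  /* server places data by RDMA Write */
    USE_SERVER_READS  = 1 << 1   /* server fetches data by RDMA Read */
};

/* Returns the Access Control flags for an R-Key used as RMR Context. */
unsigned dafs_rmr_access(unsigned use)
{
    unsigned acc = 0;

    assert(use != 0);  /* an RMR Context needs some remote access */
    if (use & USE_SERVER_WRITES)
        acc |= ACC_REMOTE_WRITE;
    if (use & USE_SERVER_READS)
        acc |= ACC_REMOTE_READ;
    return acc;  /* ACC_REMOTE_ATOMIC is never set for DAFS */
}
```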

E.6.2.  Protection Domains

   Per InfiniBand Architecture Specification Volume 1, Release 1.0, sec-
   tion 3.5.5:

           Not only does memory registration allow the use of virtual
           memory addressing, but it also provides an increased level
           of protection against inadvertent and unauthorized access.

           Since a consumer might communicate with many different
           destinations but not wish to let all those destinations have
           the same access to its registered memory, IBA provides
           protection domains. Protection domains allow a consumer to
           control which set of its Memory Regions and Memory Windows
           can be accessed by which set of its QPs.

           Before a consumer allocates a QP or registers memory, it
           creates one or more protection domains. QPs are allocated
           to, and memory registered to, a protection domain. L_Keys
           and R_Keys for a particular memory domain are only valid on
           QPs created for the same protection domain.

   All resources supporting a specific DAFS Session MUST belong to a
   single Protection Domain. Specifically: The DAFS client's R-Key that
   corresponds to the RMR Context can be passed over the Operation Chan-
   nel to a remote server site. The Protection Domain of the Memory
   Region or Memory Window corresponding to the R-Key MUST match that of
   the QP that is the DAFS client's endpoint of the Operation Channel.
   Furthermore, if an RDMA Read Channel exists for the session, the QP
   that is the DAFS client's endpoint of the RDMA Read Channel MUST also
   be assigned to this same Protection Domain.

E.7.  IBA Data Transfer Operations

   Per InfiniBand Architecture Specification Volume 1, Release 1.0, sec-
   tion 9.4.1 on Send Operation:

           The SEND Operation is sometimes referred to as a Push
           operation or as having channel semantics. Both terms refer
           to how the SW client of the transport service views the
           movement of data. With a SEND operation the initiator of the
           data transfer pushes data to the remote QP. The initiator
           doesn't know where the data is going on the remote node. The
           remote node's Channel Adapter places the data into the next
           available receive buffer for that QP. On an HCA, the receive
           buffer is pointed to by the WQE at the head of the QP's
           receive queue.

   Per [IB] section 9.4.3 on RDMA Write Operation:

           The RDMA WRITE Operation is used by the requesting node to
           write into the virtual address space of a destination node.
           The message MAY be between zero and 2**31 bytes (inclusive)
           and is written to a contiguous range of the destination QP's
           virtual address space (not necessarily a contiguous range of
           physical memory).

   Per [IB] section 9.4.4 on RDMA Read Operation:

           RDMA READ Operations are similar to RDMA WRITE Operations.
           They allow the requesting node to read a virtually
           contiguous block of memory on a remote node. As with RDMA
           WRITEs, the responding node first allows the requesting node
           permission to access its memory. The responder passes to the
           requestor a virtual address, length, and R_Key to use in the
           RDMA READ request packet.

   Per [IB] section 10.7.2.2 on RDMA Operations:

           The target address of an RDMA request is the remote node's
           virtual address, a valid R_Key and length. The R_Key MUST be
           associated with either a Memory Region or a Memory Window
           containing that virtual address.

   In the above paragraph, a "target address" corresponds to a DAT "RMR
   Target Address" and an "R_Key" corresponds to a DAT "RMR Context."

   DAFS does not use Immediate Data, therefore the following Base Tran-
   sport Header (BTH) OpCodes are never used by DAFS clients or servers
   (See [IB] Section 9.2.1):

   o  SEND Only With Immediate

   o  SEND Last With Immediate

   o  RDMA WRITE Only With Immediate

   o  RDMA WRITE Last With Immediate

   Note that RDMA Read Packets never carry Immediate Data.

   DAFS does not specify a mechanism where Solicited Events can be used
   to control CQ event generation on remote endpoints. Therefore, in
   order to guarantee interoperability, DAFS clients and servers SHALL
   always set the Solicited Event (SE) bit to 0. See [IB] Section
   9.2.3.

   DAFS does not use the RDMA Atomic Operation.

E.8.  DAFS Name Service Mapping for InfiniBand Reliable Connection

   The DAFS Name Service has three transport-related fields:

   o  The Transport Type: "IBRC"

   o  The DAT Host Name.

   o  The DAT Connection Qualifier: a right-justified, zero-padded,
      16-digit hexadecimal printable ASCII string encoding a 64-bit
      InfiniBand Service ID.

   The last field, the Transport Specific Server Attributes field, is
   not used by the DAT to IB mapping.
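
   The DAT Connection Qualifier encoding described above can be
   sketched as a pair of conversion routines. The function names are
   hypothetical. This document does not state which letter case is
   used for the hexadecimal digits, so the sketch assumes lowercase on
   output and accepts either case on input.

```c
#include <assert.h>
#include <ctype.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Writes 16 hex digits plus a terminating NUL into out[17]. */
void dafs_service_id_to_qualifier(uint64_t service_id, char out[17])
{
    assert(out != NULL);
    snprintf(out, 17, "%016" PRIx64, service_id);
}

/* Returns 0 on success, -1 if q is not exactly 16 hex digits. */
int dafs_qualifier_to_service_id(const char *q, uint64_t *service_id)
{
    uint64_t v = 0;

    assert(q != NULL && service_id != NULL);
    if (strlen(q) != 16)
        return -1;
    for (int i = 0; i < 16; i++) {
        unsigned char c = (unsigned char)q[i];
        int digit;

        if (!isxdigit(c))
            return -1;
        digit = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
        v = (v << 4) | (uint64_t)digit;
    }
    *service_id = v;
    return 0;
}
```

   Under these assumptions, the Service ID 0x1234 is published as the
   Connection Qualifier "0000000000001234".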

   The DAFS client MUST translate the hostname to a GID using whatever
   host to address services it normally uses. This mapping would
   include, but not be limited to, use of IPv6 compatible name servers
   and administrative configuration of the host. The DAFS client can
   rely upon this mapping being relatively stable (as with DNS to IP
   mapping), and MAY make use of caching to avoid per-connection network
   traffic that delays completion of the new connection.

   DAFS Servers are encouraged to use a standard Service ID for the DAT
   operations channel. This number will be obtained from IBTA and/or
   IETF. If multiple distinct virtual DAFS Servers are available at the
   same physical host, multiple virtual GIDs SHOULD be used to
   differentiate them. However, alternate Service IDs MAY also be used.

   DAFS Clients SHALL NOT assume that the suggested Service ID is in
   use, and MUST use the Service ID provided via the DAFS Name Service.

   Note that multiple connections from the same client to the same
   server for the same type of DAFS Connection will request the same
   Service ID. However each request MUST provide a different client-side
   QPN, and will receive a different server-side QPN.

E.9.  DAFS Client Connection Request PrivateData

   The DAFS client uses the PrivateData field in the REQ message.

   Rationale: A DAFS server should have the ability to differentiate
              among connection establishment requests from different
              clients, client sessions, and session channel types.
              While this can be achieved in multiple ways (for example,
              Local Communication ID), it was decided to use private
              data to make the mappings for VI and IB more alike. This
              simplifies transitions between VI- and IB-based
              implementations.

   The first 16 bytes of the PrivateData field (128 bits, beginning at
   byte offset 140 of the REQ message) are defined as follows:

        #define Dafs_Channel_Dafs_Operation    0x01
        #define Dafs_Channel_Back_Control      0x02
        #define Dafs_Channel_Rdma_Read         0x04

        typedef uint32            Dafs_Channel_Enum_Type;


        struct DAFS_client_address_discriminator
           {
           opaque32               discriminator_magic;
           opaque32               client_instance_differentiator;
           opaque32               client_session_differentiator;
           Dafs_Channel_Enum_Type channel_type;
           };


   Fields:

   discriminator_magic

      The magic sequence 0x44 0x41 0x46 0x53 ('D' 'A' 'F' 'S') is used
      to identify a client discriminator as belonging to DAFS. This can
      be used as a sanity check and aid to identification of
      DAFS-related connection packets on bus analyzers, etc. This field
      determines the endianness of the remaining fields in the
      PrivateData.

   client_instance_differentiator

      An opaque value chosen by the client to allow the server to dif-
      ferentiate different clients using the same remote IB HCA port.
      Different clients using the same IB HCA port MUST choose different
      client_instance_differentiators.

   client_session_differentiator

      An opaque value chosen by the client to ensure that different
      sessions can be differentiated by the server.

   channel_type

      Identifies the type of DAFS channel requested by the DAFS client.
      This allows the DAFS server to accept the connection request using
      a local QP with attributes appropriate for the given channel type.
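
   The discriminator layout above can be sketched as a pair of
   pack/parse routines over the 16-byte PrivateData prefix. The
   function and structure names are hypothetical. This document says
   only that the magic field determines the endianness of the
   remaining fields; the sketch ASSUMES that the magic bytes appearing
   in the order 'D' 'A' 'F' 'S' indicate big-endian fields, and it
   rejects anything else rather than guessing at other byte orders.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DAFS_DISCRIMINATOR_LEN 16

static const uint8_t dafs_magic[4] = { 0x44, 0x41, 0x46, 0x53 }; /* DAFS */

/* Host-order copy of the fields that follow the magic sequence. */
struct dafs_discriminator {
    uint32_t client_instance;  /* client_instance_differentiator */
    uint32_t client_session;   /* client_session_differentiator */
    uint32_t channel_type;     /* 0x01, 0x02 or 0x04 */
};

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Client side: fill the first 16 bytes of the REQ PrivateData. */
void dafs_discriminator_pack(const struct dafs_discriminator *d,
                             uint8_t out[DAFS_DISCRIMINATOR_LEN])
{
    assert(d != NULL && out != NULL);
    memcpy(out, dafs_magic, 4);
    put_be32(out + 4, d->client_instance);
    put_be32(out + 8, d->client_session);
    put_be32(out + 12, d->channel_type);
}

/* Server side: returns 0 on success, -1 on bad magic or channel type. */
int dafs_discriminator_parse(const uint8_t in[DAFS_DISCRIMINATOR_LEN],
                             struct dafs_discriminator *d)
{
    assert(in != NULL && d != NULL);
    if (memcmp(in, dafs_magic, 4) != 0)
        return -1;  /* not DAFS, or a byte order this sketch ignores */
    d->client_instance = get_be32(in + 4);
    d->client_session = get_be32(in + 8);
    d->channel_type = get_be32(in + 12);
    if (d->channel_type != 0x01 && d->channel_type != 0x02 &&
        d->channel_type != 0x04)
        return -1;
    return 0;
}
```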


E.10.  References

   [IB]

      InfiniBand Architecture Release 1.0a, Volume 1 - General
      Specifications, released June 19, 2001.

Full Copyright Statement

   Copyright (C) The Internet Society (2000, 2001). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this docu-
   ment itself may not be modified in any way, such as by removing the
   copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of develop-
   ing Internet standards in which case the procedures for copyrights
   defined in the Internet Standards process must be followed, or as
   required to translate it into languages other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.