INTERNET-DRAFT                                                 M. Wittle
draft-wittle-dafs-00.txt                         Network Appliance, Inc.
Expires March 2002                                        September 2001


                    Direct Access File System (DAFS)

Status of this Memo

   This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2001). All Rights Reserved.

Key Words

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Abstract

   The Direct Access File System (DAFS) is a file access and management protocol designed for local file-sharing or clustered environments. It addresses two primary goals:

   o  Provide low-latency, high-throughput, and low-overhead data movement that takes advantage of modern memory-to-memory networking technology.

   o  Define a set of file management and file access operations for local file-sharing requirements.

Wittle                                                          [Page 1]

INTERNET-DRAFT         Direct Access File System          September 2001

Table of Contents

   Chapter 1. Introduction to the Direct Access File System Protocol
      1.1. New System Trends
         1.1.1. Local File-Sharing Architecture
      1.2. New Networking Technology Trends
         1.2.1. Direct Access Transport
      1.3. The DAFS Opportunity
   Chapter 2. DAFS Overview
      2.1. DAFS Goals
      2.2. Local File-Sharing Requirements
      2.3. Direct Access Transport
         2.3.1. DAT Glossary
         2.3.2. DAT Description
         2.3.3. DAT Requirements
         2.3.4. Physical Interconnect
      2.4. DAFS Protocol
         2.4.1. DAFS Deployment Models
         2.4.2. DAFS File Name Space
         2.4.3. DAFS Terminology
   Chapter 3. Communication Model
      3.1. Session Management
         3.1.1. Security Model
         3.1.2. Session Attributes
         3.1.3. Session Operations
         3.1.4. Sharing Sessions
      3.2. Message Handling
         3.2.1. DAT Data Transfer Operations
         3.2.2. DAT Error Reporting
         3.2.3. Mapping DAFS onto Memory-to-Memory Architectures
         3.2.4. Separate Communications Channel for RDMA Read Operations
         3.2.5. Checksums
         3.2.6. Message Flow Control
   Chapter 4. File System Operations
      4.1. Concepts and Structures
         4.1.1. DAFS and NFS Version 4
         4.1.2. Typographical Conventions
         4.1.3. Recurring Differences Between DAFS and NFS Version 4
         4.1.4. Objects Naming And Filehandles
         4.1.5. Named Attributes
      4.2. Data Transfer Operations
         4.2.1. Send-Receive
         4.2.2. RDMA Transfers
         4.2.3. Batch I/O Operations
         4.2.4. Server Caching Hints
      4.3. Request Chaining
         4.3.1. Chaining Restrictions
         4.3.2. Chaining Flags
         4.3.3. Chaining and Flow Control
         4.3.4. Chaining and Recovery
      4.4. Locking and Access Control
         4.4.1. Locking
         4.4.2. Shared Key Reservations
         4.4.3. Access Control Lists (ACLs)
         4.4.4. Fencing
      4.5. NFS-Derived Operations
   Chapter 5. Failure Recovery
      5.1. Exactly Once Semantics
      5.2. Server Response Cache
         5.2.1. Response Cache
         5.2.2. Response Cache Handling of OPNreq Decrease
         5.2.3. Handling Batch I/O Requests
         5.2.4. Server Response Cache in Stable Storage
         5.2.5. Use of the Server Response Cache
         5.2.6. Response Cache Operations
      5.3. Server Failover
         5.3.1. Changing failover_locations
   Chapter 6. Message Formats
      6.1. Message Headers and Common Structures
         6.1.1. Message Format
         6.1.2. Request Header
         6.1.3. Response Header
         6.1.4. Basic Types
         6.1.5. File Attributes
         6.1.6. File System Attributes
         6.1.7. Direct Operations
         6.1.8. Cache Hints
         6.1.9. Authentication
         6.1.10. Procedures
      6.2. Connection and Security Management
         6.2.1. DAFS_PROC_CLIENT_CONNECT
         6.2.2. DAFS_PROC_CLIENT_AUTH
         6.2.3. DAFS_PROC_SERVER_AUTH
         6.2.4. DAFS_PROC_CLIENT_CONNECT_AUTH
         6.2.5. DAFS_PROC_CONNECT_BIND
         6.2.6. DAFS_PROC_DISCONNECT
         6.2.7. DAFS_PROC_SECINFO
         6.2.8. DAFS_PROC_REGISTER_CRED
         6.2.9. DAFS_PROC_RELEASE_CRED
      6.3. Response Cache
         6.3.1. DAFS_PROC_CHECK_RESPONSE
         6.3.2. DAFS_PROC_FETCH_RESPONSE
         6.3.3. DAFS_PROC_DISCARD_RESPONSES
      6.4. Fencing Procedures
         6.4.1. DAFS_PROC_GET_FENCING_LIST
         6.4.2. DAFS_PROC_SET_FENCING_LIST
      6.5. File System Procedures
         6.5.1. DAFS_PROC_NULL
         6.5.2. DAFS_PROC_ACCESS
         6.5.3. DAFS_PROC_APPEND_INLINE
         6.5.4. DAFS_PROC_APPEND_DIRECT
         6.5.5. DAFS_PROC_BATCH_SUBMIT
         6.5.6. DAFS_PROC_CACHE_HINT
         6.5.7. DAFS_PROC_CLOSE
         6.5.8. DAFS_PROC_COMMIT
         6.5.9. DAFS_PROC_CREATE
         6.5.10. DAFS_PROC_DELEGPURGE
         6.5.11. DAFS_PROC_DELEGRETURN
         6.5.12. DAFS_PROC_GET_ROOT_HANDLE
         6.5.13. DAFS_PROC_GETATTR_INLINE
         6.5.14. DAFS_PROC_GETATTR_DIRECT
         6.5.15. DAFS_PROC_GET_FSATTR
         6.5.16. DAFS_PROC_HURRY_UP
         6.5.17. DAFS_PROC_LINK
         6.5.18. DAFS_PROC_LOCK
         6.5.19. DAFS_PROC_LOCKT
         6.5.20. DAFS_PROC_LOCKU
         6.5.21. DAFS_PROC_LOOKUP
         6.5.22. DAFS_PROC_LOOKUPP
         6.5.23. DAFS_PROC_NVERIFY
         6.5.24. DAFS_PROC_OPEN
         6.5.25. DAFS_PROC_OPENATTR
         6.5.26. DAFS_PROC_OPEN_DOWNGRADE
         6.5.27. DAFS_PROC_READ_INLINE
         6.5.28. DAFS_PROC_READ_DIRECT
         6.5.29. DAFS_PROC_READDIR_INLINE
         6.5.30. DAFS_PROC_READDIR_DIRECT
         6.5.31. DAFS_PROC_READLINK_INLINE
         6.5.32. DAFS_PROC_READLINK_DIRECT
         6.5.33. DAFS_PROC_REMOVE
         6.5.34. DAFS_PROC_RENAME
         6.5.35. DAFS_PROC_SETATTR_INLINE
         6.5.36. DAFS_PROC_SETATTR_DIRECT
         6.5.37. DAFS_PROC_VERIFY
         6.5.38. DAFS_PROC_WRITE_INLINE
         6.5.39. DAFS_PROC_WRITE_DIRECT
      6.6. Back-Control Directives
         6.6.1. DAFS_PROC_BC_NULL
         6.6.2. DAFS_PROC_BC_BATCH_COMPLETION
         6.6.3. DAFS_PROC_BC_GETATTR
         6.6.4. DAFS_PROC_BC_RECALL
   Chapter 7. Error Status Result Codes
   Chapter 8. Security and IANA Considerations
      8.1. Security Considerations
      8.2. IANA Considerations
   Chapter 9. Bibliography
   Chapter 10. Author Information and Acknowledgements
      10.1. Editor
      10.2. Authors
      10.3. Comments
      10.4. Acknowledgements
   Appendix A. DAFS Name Service
      A.1. Introduction
      A.2. DAFS Name Space
      A.3. DAFS Name
      A.4. DAFS Location
         A.4.1. DAT Location
         A.4.2. DAFS Directory Path
         A.4.3. DAFS Version
      A.5. DAFS Names and Locations
      A.6. Name Space Repository
      A.7. LDAP Schema
      A.8. References
   Appendix B. DAT Semantics
      B.1. DAT Glossary
      B.2. DAT Model
      B.3. DAT Provider
      B.4. Transport Endpoints and Connections
      B.5. DAT Memory Semantics
      B.6. DAT Data Transfer Operations and Connection Properties
   Appendix C. DAT Name Service
   Appendix D. DAFS Mapping to VI Architecture
      D.1. Terminology Mapping from DAT to VI
      D.2. Additional VI Terminology
      D.3. DAT Requirements Mapping
      D.4. VI & Connections
         D.4.1. VI Discriminators
         D.4.2. VI Connection Attributes
         D.4.3. VI Endpoint Attributes
         D.4.4. DAFS Flow Control Initialization
         D.4.5. VI Disconnect
      D.5. VI Architecture Memory Semantics
      D.6. VI Data Transfer Operations
      D.7. Name Service Mapping for VI Architecture
      D.8. DAFS Client Discriminators
      D.9. Design Notes
         D.9.1. Connection Establishment
         D.9.2. Memory Registration
         D.9.3. NIC Attributes
      D.10. References
   Appendix E. DAFS Mapping to InfiniBand Reliable Connection
      E.1. Terminology Mapping from DAT to InfiniBand
      E.2. Additional InfiniBand Terminology
      E.3. DAT Requirements Mapping
      E.4. IBA Model
      E.5. InfiniBand Architecture Transport Endpoints and Connections
         E.5.1. Proxy Communications Managers
         E.5.2. Partitions
         E.5.3. DAFS Connection Establishment Requirements
         E.5.4. Disconnect
         E.5.5. Automatic Path Migration
      E.6. IBA Memory Semantics
         E.6.1. Memory Regions and Memory Windows
         E.6.2. Protection Domains
      E.7. IBA Data Transfer Operations
      E.8. DAFS Name Service Mapping for InfiniBand Reliable Connection
      E.9. DAFS Client Connection Request PrivateData
      E.10. References
   Full Copyright Statements

1. Introduction to the Direct Access File System Protocol

   This chapter introduces the Direct Access File System (DAFS) protocol. It describes the technology trends that created the need for DAFS and how DAFS fulfills the need.

   The need for DAFS arose out of three trends. The first two trends involve the deployment of large systems. The third trend is related to new networking technologies.

1.1. New System Trends

   The first trend is the separation of storage systems from application servers. This separation enables storage to be managed and scaled independently from the applications, operating systems, and machine architectures that are attached to the storage system. Typical applications include databases and collaboration software.

   The storage can be accessed either through block access protocols (for example, SCSI) or file access protocols (for example, NFS, CIFS). However, many choose file access protocols because of the following benefits:

   o  Hide storage details

      File access protocols hide the details of the underlying storage system from application server software while enabling the file system resident in the file server to take advantage of storage system geometry.
      This keeps applications running smoothly without retuning after each change in storage capacity.

   o  Enable controlled data management

      File access protocols provide fine-grained data management. Data access permission, storage utilization, backup, and even disaster recovery can be controlled at the individual file and user level. Data management operations affect the specified application data only, rather than all the blocks in the volume.

   o  Offload application servers

      File access protocols can relieve application servers of running file system software and reduce application server I/O requirements by eliminating the need to transfer file system meta-data. This is sometimes application-dependent, because Network Attached Storage (NAS) protocols can add TCP/IP processing overhead, though in general the overhead is no worse than native file system overhead.

   The second trend is the rapid growth of the Internet. This growth requires service providers to develop architectures that are resilient to failure and can rapidly scale in both computing power and storage capacity. The resulting designs spread the service load across a set of application servers. The application servers can be large machines or relatively small ones. This architecture is resilient in that if one application server fails, another can take its place. The architecture is also scalable, because extra computing resources can be added by simply adding more application servers. Typical applications include email, news, web servers, geographical information systems, and clustered databases.

   Often these scalable designs also separate the storage from the application servers.
   Many service providers choose file access protocols for the storage-separation benefits discussed above and for one additional benefit:

   o  Simplify data sharing

      File access protocols enable data to be easily shared even among heterogeneous systems. The file access paradigm is the same whether processes are sharing file data on a single machine or on a distributed system. This provides all the application servers access to a common pool of data for load balancing. It also allows another application server to take over data previously accessed by a failed application server.

1.1.1. Local File-Sharing Architecture

   When used with file access protocols, the system architectures described above are referred to as local file-sharing architectures. In both cases, the system consists of a limited number of application servers, is typically geographically constrained (within a data center), and is under the control of a single set of administrators who are responsible for configuration and maintenance. The application servers are connected to storage over a dedicated, high-speed interconnection network. The application servers are relatively homogeneous in both hardware and software, and typically run a limited set of high-performance applications. In contrast, wide file-sharing comprises a large number of diverse machines that typically provide direct services to end users.

1.2. New Networking Technology Trends

   The third trend is the advent of standard memory-to-memory interconnection networks. These networks grew out of the research in tightly coupled distributed systems. The tightly coupled interconnection networks (sometimes called Cache Coherent Non-Uniform Memory Access, or CC-NUMA) were highly proprietary and were tightly integrated with particular computer architectures.
   Eventually, more loosely coupled versions (not cache coherent, i.e., Non-Uniform Memory Access or NUMA) that could use generic I/O interfaces, such as PCI, were developed.

   All these interconnect technologies support some form of remote memory access and are designed for low latency and high throughput. They are intended for use within data centers and are not designed to support high-latency, wide-area data transport. These are sometimes called System Area Networks. Examples of these networks include Virtual Interface Architecture [VIA], [VIDG], InfiniBand Architecture [IB], and the Warp protocol for the Internet [WARP].

1.2.1. Direct Access Transport

   This section describes the set of transport capabilities that the DAFS protocol depends on. These capabilities are referred to as the Direct Access Transport (DAT) and are defined in Appendix B, "DAT Semantics". The DAT semantics are the minimal set of transport capabilities that DAFS requires for high-performance implementations. The DAT semantics can be mapped onto networks that support memory-to-memory operations, such as Virtual Interface Architecture, InfiniBand Architecture, and WARP. DAT does not define a specific transport layer interface, but describes the functionality and concepts necessary to support the DAFS protocol.

   The DAT-based network provides two fundamental capabilities beyond those of traditional networking architecture. The first capability is called Remote Direct Memory Access (RDMA), which is the ability to move data directly, in either direction, between a local memory buffer and a specified memory buffer on a remote node. The second capability allows application software to directly access Channel Adapter hardware (a Channel Adapter is sometimes called a Network Interface Card or Network Adapter), bypassing the operating system.
   This operating system bypass capability enables application programs to directly address the hardware and to initiate I/O operations without operating system intervention. Channel Adapters that support DAT capability can be implemented on a variety of interconnection networks such as Fibre Channel, Ethernet, InfiniBand, and proprietary fabrics.

   The advantages of a Channel Adapter with DAT capability are the following features implemented on the Channel Adapter:

   o  Packet fragmentation and reassembly

   o  Reliable data delivery

   o  Multiplexing and demultiplexing data from different connections

   o  Checksum computations

   However, these advantages can be had with traditional networking by implementing transport protocols such as TCP in a Channel Adapter. The IETF WARP proposal defines a protocol that provides such capabilities for TCP and SCTP. The key advantages are that RDMA operations not only provide applications with a mechanism to separate bulk data transport from control information, but also provide a way to specify exactly where the data belongs. Traditional message transfer operations enable only the destination of a message to specify the location on the destination node where the message payload is deposited. Remote memory writes allow the operation initiator to specify the target memory location on the destination node. Remote memory read operations allow the operation initiator to specify both the remote memory location that is the source of a data "fetch" operation and the local destination where the fetched remote memory contents are to be deposited.

   The advantages are significant. Consider a typical network file access protocol like NFS. In NFS, each I/O operation or I/O operation reply, such as read or write, is described by a header. The header is inserted into the network byte stream followed by any user data.
   The receiver needs to parse the header and determine the appropriate place to put any user data that follows. For example, during a read operation the requester needs to parse the reply packet header to determine with which of the many possible outstanding read operations this data is associated. Then it needs to determine the destination data buffer associated with the request and copy the data into the buffer or otherwise cause the data to be copied there. RDMA operations allow the file server to directly place the data into a destination buffer specified in the request without any parsing or copying. These advantages combine to provide extremely low-overhead and low-latency messaging and bulk data transfer.

1.3. The DAFS Opportunity

   Local file-sharing architectures and memory-to-memory interconnection networks are ideally suited for one another. One of the main reasons local file-sharing architectures are deployed is to gain high throughput and high performance. A file access protocol using such a network enables extremely low-overhead access to shared data. The protocol offloads file system processing and meta-data I/O from the application servers and eliminates protocol processing overhead, while it preserves the advantages of file access. Using the virtual interface capability gives local file-sharing applications more control over the high-performance data path. Any performance issues can be addressed directly in application libraries without requiring OS patches.

   Lastly, note that current file access protocols were designed for a wide-area file-access environment. Local file-sharing architectures use the current protocols, but the applications sometimes have to work around specific deficiencies [Christianson] in the areas of file locking and fencing.
   A file access protocol specifically designed for local file-sharing environments can provide the necessary semantics to improve application performance in these situations and eliminate many complexities of current protocols that were designed to deal with widely varying latencies and unreliable networks.

2. DAFS Overview

   This chapter provides an overview of DAFS and describes the following topics:

   o  DAFS goals

   o  Local file-sharing requirements

   o  The Direct Access Transport

   o  The DAFS protocol

2.1. DAFS Goals

   The Direct Access File System (DAFS) is a file access and management protocol designed for local file-sharing or clustered environments. It addresses two primary goals:

   o  Provide low-latency, high-throughput, and low-overhead data movement that takes advantage of modern memory-to-memory networking technologies.

   o  Define a set of file management and file access operations for local file-sharing requirements.

   The DAFS protocol takes advantage of system area networks that provide Direct Access Transport (DAT) capabilities. The DAFS protocol defines file access operations that use remote memory-to-memory copy and other high-performance primitives provided by DAT.

   The current revision of DAFS borrows heavily from the IETF NFS Version 4 specification [Shepler] to provide a full set of file management operations. Although recent enhancements to NFS are often directed toward improvements for "wide-sharing" environments, a large number of NFS file operations define basic semantics fully appropriate for use in a local file-sharing architecture. In areas where DAFS is not intended to add significant value beyond existing systems, it seems best to build on that work, rather than duplicate the effort.
   We would like to explicitly acknowledge that many of the file operations defined by DAFS are either based on or directly result from work done by the authors of the NFS v4 specification and contributors to that IETF working group.

2.2. Local File-Sharing Requirements

   Local file-sharing has a number of unique requirements for file access protocols:

   o  Optimize for high-throughput, low-latency networks

      Local file-sharing architectures use high-throughput, low-latency networks. Current file access protocols are optimized for general-purpose internetworking, in which latencies can vary dramatically and packet processing is expensive. Memory-to-memory networks have low latencies and very low packet processing overhead.

   o  Optimize for high-throughput, low-overhead client implementations

      In the local file-sharing environment, client machines often run one application that is trying to achieve high throughput by having many operations pending on the storage system and by fully utilizing client CPU resources. Local file-sharing creates a different environment than wide sharing. Software running on the client machine in the wide-sharing environment is typically only trying to service a single user's requests. In that environment, the client software issues only a limited number of concurrent operations on the storage system, and client CPU resources are usually not fully utilized.

   o  Support for different operating system file access semantics

      The local file-sharing environment typically consists of similar client machines running the same base operating system. However, the base operating system can differ between different local file-sharing applications. At a minimum, DAFS needs to be able to support the file access semantics for UNIX and NT.

   o  High-speed consistent locking

      Many local file-sharing applications share data among clients.
      Typically, this is accomplished by locking and unlocking the shared data. Older NFS protocols have loose data consistency and require complete lock and unlock messages for each interval of shared access, resulting in relatively low performance. DAFS needs to enforce data consistency while data is locked, allow lock caching, and provide for fast transfer of cached locks between client machines.

   o  Client failure recovery

      When a client fails, it is unacceptable that other clients are locked out for long intervals and unable to access data that was in use by the failed client. At the same time, releasing the locks of a failed client needs to support orderly recovery so as not to compromise data integrity.

   o  File server reboot or network interruption recovery

      Applications SHOULD NOT necessarily fail when a file server reboots or the network suffers a temporary interruption.

   o  File server failover

      Applications SHOULD NOT necessarily fail when a file server fails over to an alternate file server that has a consistent copy of the data. The failover SHOULD be supported in the DAFS protocol and not rely on transport-level routing tricks.

   o  Fencing

      Clustered application servers maintain their own notion of nodes that are considered a part of the cluster. Nodes ejected from the cluster need to be prevented from accessing shared data.

   o  Online migration

      Local file-sharing architectures are intended to be highly available systems that go long periods without need for reboot. They are also intended to enable storage scalability. For this reason, it SHOULD be possible to move file systems (or finer units of data) among the available file servers without requiring application servers to reboot to connect to the new data location.

   o  Security

      Memory-to-memory networking, by its very nature, requires some trust between machines.
  (For example, exporting memory with only a small protection key in hardware.) However, DAFS SHOULD provide some level of user authentication to meet the general need for trust in this environment.

o Flow control

  Local file-sharing client machines are typically high-speed application servers. The file server needs to be able to throttle each client to ensure fairness and to avoid file server congestion.

o Enhanced locking

  Existing approaches to file and byte-range locking are not adequate for data sharing between active processes. If the entire cluster crashes while a lock is held (for example, due to a power failure), the file data may be inconsistent after the cluster reboots because the lock was broken. A better semantic would be to have a "stateful" lock that informed the first lock attempt after a lock is broken that a failure occurred. An even better semantic would be to roll back file changes to the state of the file at the time the broken lock was granted.

2.3. Direct Access Transport

General-purpose computer networks are designed for a very open and diverse environment. Computers of diverse types in many widely dispersed locations running a variety of software from different manufacturers and under the control of different administrative domains can communicate with each other. The communication can occur at any time using many different protocols that pass through several different network types and switches that can fragment and reorder packets.

The underlying network protocols are designed to deal with these situations, but with significant host cost in both CPU utilization and host memory requirements.
Some of the sources of this overhead are:

o Network packet fragmentation and reassembly
o Multiplexing and demultiplexing data from different connections
o Realignment of user data following transmission
o Checksum computations
o Buffer space allocations sized for largest transfer unit
o Operating system buffer copying and various overhead costs.

The Direct Access Transport semantic provides a set of standard facilities that address many of the deficiencies in standard networking transport protocols. Furthermore, DAT mapped onto memory-to-memory networks like FC-VI [T11-FCVI], VI/TCP [Dicecco], IB, and WARP offloads most of the solution into a Channel Adapter.

2.3.1. DAT Glossary

The following terms are useful in describing the Direct Access Transport.

Channel Adapter

  Channel Adapter is a host-resident device that transfers messages to and from host memory associated with a specific Endpoint and a Fabric.

Channel Adapter Address

  Channel Adapter Address is the address of a Channel Adapter on the Fabric.

Connection Qualifier

  The Connection Qualifier is a value that the Connection Manager uses to associate an incoming Connection request with the entity providing the service.

DAT Consumer

  DAT Consumer is an Upper Layer Protocol or application that requires Direct Access Transport services.

DAT Provider

  DAT Provider is the mechanism that provides the Transport services for a Direct Access application.

Data Transfer Completion (DTC)

  DTC is the status of a completed data transfer operation.

Data Transfer Operation (DTO)

  DTO is a requested data movement transfer submitted to a DAT Provider.

Endpoint

  Endpoint is the local part of a Connection that supports posting data transfer operation requests.

Fabric

  A Fabric is a network with RDMA capabilities.

Operation Type

  The Operation types in DAT are the Send, Receive, RDMA Read, and RDMA Write data transfer operations (DTO).

RDMA

  Remote Direct Memory Access is an operation involving access of local memory by the remote Endpoint. There are two RDMA operations: RDMA Read and RDMA Write.

RDMA Memory Region Context (RMR Context)

  RDMA Memory Region Context (RMR Context) is a representation for an arbitrary-sized, registered, contiguous virtual space that belongs to a Channel Adapter so it can support Remote DMA operations on the Connection whose local Endpoint belongs to the Channel Adapter.

RMR Target Address

  RMR Target Address specifies the memory address within a region of memory represented by RDMA Memory Region Context. (The specification can be either by virtual address or offset from the start of the memory represented by the RMR Context.)

2.3.2. DAT Description

DAT specifies a connection-oriented, peer-to-peer communication architecture. A pair of hosts that want to communicate need to first establish a connection between them through a pair of connected endpoints. The mechanisms for the connection establishment and for connection endpoint creation are specific to individual transports that provide DAT semantics. Each node can have many connections to the same remote node or to other nodes.

DAT is designed so that user processes can initiate data transfer with low overhead and without operating system intervention. There are three basic data transfer operations (DTO):

o Send

  The sender's DAT Provider forwards the payload of a send DTO into the memory of the receiver specified by a receive DTO on the other side of the connection. Upon the completion of the receive DTO corresponding to the send DTO, the remote DAT Consumer is notified.

o Remote DMA (RDMA) write

  The originator copies the payload of an RDMA Write DTO from the local memory to a remote memory on the remote node identified by an RMR Target Address and an RDMA Memory Region Context (RMR Context) created by the remote DAT Consumer of the connection.

o Remote DMA (RDMA) read

  The originator of an RDMA Read copies the payload of an RDMA Read DTO from a remote memory on the remote node identified by an RMR Target Address and an RDMA Memory Region Context (RMR Context) created by the remote DAT Consumer into local memory.

RDMA write and RDMA read provide remote memory access without receiver software intervention. They provide the basic bulk data transfer primitives. The send operation provides fast messaging. Send can transfer any amount of data, but is typically used for smaller messages that can contain an RMR Context and an RMR Target Address for the RDMA operations to use.

It is important to understand that DAT does not specify a physical implementation nor define an API. It merely specifies a common communication style and set of capabilities. Further information is available at:

o Appendix B. "DAT Semantics" defines DAT capabilities that DAFS requires.

o Appendix D. "DAFS Mapping to VI Architecture" defines a mapping of DAT onto Virtual Interface Architecture.

o Appendix E. "DAFS Mapping to InfiniBand Reliable Connection" provides a mapping of DAT onto InfiniBand HCA.

DAT functionality can eliminate most of the following networking protocol inefficiencies:

o Fragmentation and reassembly

  Channel Adapters typically perform all fragmentation and reassembly. In addition, the DAT RDMA operations are self-addressing and this enables the Channel Adapter to break up a large transfer operation into independent units with a length that is appropriate to the underlying packet size.

o Multiplexing and demultiplexing data from different connections

  Channel Adapters allow multiple DAT connections to be established. Data sent by different DAT connections are multiplexed and demultiplexed directly by the hardware. An RDMA Memory Region Target Address (RMR Target Address) is only valid within its RDMA Memory Region Context (RMR Context). Typically, RMR Contexts are only valid within the connection where they are used. The hardware automatically identifies the connection to which the RDMA data belongs and translates the RDMA address to the underlying physical address in that context.

o Realignment of user data following transmission

  A DAT-capable node can place data at specific addresses in a remote node. This lets senders separate control information from bulk data, and place bulk data in properly aligned locations. Receivers need not extract the actual user data from the stream of packets and copy it to properly aligned buffers.

o Checksum computation

  Networks that provide DAT semantics typically do not require an end-to-end software checksum, because the processing at intermediate switches is extremely simple and the data is protected by the underlying cell checksum that is checked by hardware.

o Buffer space allocations sized for largest transfer unit

  Networks that support DAT semantics can provide senders with immediate useful knowledge of the state of buffers in the receiver. Senders can also place large send DTO messages and small send DTO messages into appropriately sized buffers on the receiver. The receiver need not have a large number of maximally sized packet buffers reserved for the networking hardware just in case one or more client machines send to it. In addition, DAT-capable networks usually have higher level protocols that separate buffers on a per-sender basis, providing further segregation of buffer resource utilization.

o Operating system buffer copying and various overhead costs

  DAT allows applications to completely bypass operating systems for data transfer. In addition, DAT provides an efficient communication model permitting large I/O throughput at little CPU cost.

2.3.3. DAT Requirements

DAFS depends on the following network transport capabilities that are provided by DAT:

o DAT supports a connection that provides send-recv message transfers and RDMA Read and Write operations.

o DAT supports reliable connections that provide the following features:

  o All data transfer operations submitted to the DAT Provider complete successfully in the absence of errors, with data delivered uncorrupted, in the order defined by the ordering rules.

  o Corruption of the data delivered to the local Consumer is detected as an error and reported to the Consumer.

  o Data loss (inability to deliver data to the remote endpoint of the connection, or to the local endpoint for RDMA Read) is detected as an error and reported to the Consumer.

  o Upon detection of an error, the connection is broken and all outstanding and in-progress data transfer operations complete with an error.

  o There is a one-to-one correspondence between send operations on one endpoint of the connection and recv operations on the other endpoint of the connection.

  o There is no correspondence between RDMA operations on one endpoint of the connection and recv or send data transfer operations on the other endpoint of the connection.

  o Data Transfer Operation Completion means that the Consumer can reclaim resources associated with the operation, including the memory that contains the data.

  o Ordering rules:

    o The data payload for the send operation matching a receive operation is delivered into the receive-indicated memory buffer prior to the receive completion.

    o Receive operations on a connection are completed in the order of posting of their corresponding sends.

    o Each RDMA write operation posted on a connection prior to a send operation has its data payload delivered to the target memory region prior to the completion of the receive operation matching that send.

o DAT supports multiple connections between the same or different pairs of nodes (client-server pairs).

o An RDMA Memory Region Context (RMR Context) supports RDMA operations for the set of DAT connections that are associated with it. The association between a connection and an RMR Context is established by the local endpoint of the connection where the Memory Region is located.

o The same RMR Context can be associated with multiple connections. In addition, a connection can have multiple RMR Contexts associated with it.

o The DAT Provider allows the DAT Consumer to create multiple RDMA Memory Region Contexts from the same memory.

o DAT supports connection management including the client-server connection establishment and connection termination by either side of the connection.

For more information on DAT, see the following:

o The DAT transport layer semantics necessary to support the DAFS protocol are defined in Appendix B. "DAT Semantics".

o The mappings of DAT functionality on VI Architecture and InfiniBand memory-to-memory interconnection networks are provided in Appendix D. "DAFS Mapping to VI Architecture" and Appendix E. "DAFS Mapping to InfiniBand Reliable Connection".

2.3.4. Physical Interconnect

The DAFS architecture does not specify or mandate any specific physical interconnect technology. However, the media chosen SHOULD exhibit the following characteristics:

o The interconnect needs to support the transport requirements of protocols that feature remote memory-to-memory communications.

o The interconnect SHOULD be high speed and low latency.

o The interconnect needs to be highly reliable. Media errors and connection breaks SHOULD be rare.

Examples of physical interconnect Channel Adapters that provide the above features are FC-VI, VI/TCP, and IB HCA.

2.4. DAFS Protocol

The DAFS definition has two principal goals: to provide a high-performance file access solution by taking advantage of the remote memory-to-memory communication model, and to address the data-sharing needs of distributed local file-sharing applications. The main attributes of DAFS are:

o Client-server communication model

  DAFS uses a request-response message paradigm between the client and server for communication. This method is used both for file operations initiated by the client and for "back-control" directives initiated by the server.

o Session-based protocol that leverages underlying DAFS communication channels

  DAFS establishes Sessions between the client and server that are used to simplify authentication and manage ongoing aspects of the communication. The DAFS protocol leverages underlying communication channel primitives to allow control of errors on a communication channel basis, rather than for each DAFS message.

o Security

  The remote memory-to-memory communication architecture, by its nature, requires some trust between machines. The DAFS protocol authenticates clients to servers and servers to clients. It can also authenticate individual users within a client-server Session.

o Optimized for high-throughput and low-latency networks

  Local file-sharing client machines can be relatively high-bandwidth, multiprocessor systems that can generate many threads' worth of load on the file server. DAFS is optimized for high throughput and takes advantage of the low latency characteristics of the network. The protocol supports multiple outstanding operations within a single connection.

o Chaining

  DAFS allows a series of dependent operations to be submitted concurrently without waiting for intermediate results. The dependent operations can be pipelined without stalling.

o Flow control

  Local file-sharing client machines are typically high-speed servers. DAFS provides mechanisms for the file server to throttle clients independently to ensure fairness and to avoid file server congestion.

o Internationalization

  All user-accessible strings use internationalized representations.

o Multiple operating system file access semantics

  The local file-sharing environment is typically composed of similar client machines running the same base operating system. However, the base operating system may differ between different local file-sharing applications. DAFS supports the file access semantics for both UNIX and NT.

o High-speed consistent locking

  Many local file-sharing applications share data. Typically, this is accomplished by locking and unlocking the shared data. DAFS enforces data consistency while files are locked, and allows lock caching through delegation.

o Enhanced file locking

  DAFS provides "stateful" locks that inform the first lock attempt after a lock is broken that this event occurred. This addresses a problem with existing approaches that may leave file data inconsistent following a power-failure-induced crash of an entire cluster. DAFS also provides "rollback" locks that roll back file changes to the state of the file at the time the broken lock was granted.

o Atomic write append

  In addition to general support for enclosing file access operations within file locking semantics, DAFS provides an optimized mechanism for the most common case: atomically appending new data to the end of a file.

o Client failure recovery

  When a client fails, it is unacceptable that the failed client locks out other clients for long intervals from accessing data in use.
  DAFS uses lease-based locks to ensure timely file availability after client failure.

o File server reboot or network interruption recovery

  Applications can recover when a file server reboots or the network suffers a temporary interruption.

o File server failover

  Applications need not fail when a file server fails over to an alternate file server that has a consistent copy of the data. DAFS supports failover and does not rely on network routing tricks.

o Fencing

  Clustered application servers often maintain their own notion of nodes that are considered a part of the cluster. DAFS prevents client systems ejected from the cluster from accessing shared data.

2.4.1. DAFS Deployment Models

DAFS is a client-server distributed file access protocol. It can be implemented on any underlying network that supports DAT capabilities. The implementation mappings are especially interesting for DAFS client implementations. A number of implementations are possible. The next few paragraphs describe a few basic types:

1) Application uses DAFS via a user-level library that implements DAFS.

2) Application uses DAFS via a user-level library that implements DAFS, but also implements a transparency layer that hides the details of the DAFS implementation.

3) Application uses DAFS via a kernel-level DAFS file system implementation.

Implementation 1 implements client file access using the DAFS protocol in a user library. The DAFS client library exports the file access primitives directly as an API, perhaps exposing issues like memory registration. The application itself does not change to use the new API, but its OS adaptation library does. Many high-performance applications have such adaptation libraries to make the application code easier to port among different operating systems.

Implementation 2 uses the same DAFS client library as Implementation 1, but adds a transparency library so that the user I/O library can behave as if it were supported by the underlying OS using the standard OS interfaces.

Implementation 3 implements a DAFS client as a file system installed into the underlying kernel. The application accesses DAFS through the standard OS interfaces.

2.4.2. DAFS File Name Space

To access the DAFS namespace, when a DAFS client first comes up in a local sharing network, the client needs to enumerate a list of available servers. While it is possible that the client might have a preconfigured set of servers, it is desirable for a "clean" client to be able to join the network, presuming the servers are willing to provide it DAFS service. A number of name service mechanisms can be defined to provide this "bootstrap" name service. See Appendix A. "DAFS Name Service".

2.4.3. DAFS Terminology

This document uses the following terminology:

Back-control Channel

  The Back-control channel is a communication channel between a DAFS client and server that is an OPTIONAL part of a DAFS Session. The primary purpose for the Back-control channel is to allow the server to send unsolicited messages to the client.

Client-id-string

  The Client-id-string is a string that is selected by the client and is intended to uniquely identify that client. It is presented to the server when a Session is established and is opaque to the server. All Sessions created using the same client-id-string can be considered as being joined together representing the same client. This serves a similar purpose as the client-opaque-string in NFS Version 4.

Client-verifier

  The Client-verifier is a 64-bit quantity that identifies an instance of a client. The client-verifier is presented to the server at Session establishment along with the client-id-string.
  The server uses this to determine that all Sessions (which should have been disconnected) for the old instance of the client are gone and that all locks for that client need to be freed. This serves a similar purpose as the client verifier in NFS Version 4.

Client-id

  Client-id is a 64-bit quantity chosen by the server that is used as a shorthand identifier for the client-id-string, client-verifier pair supplied in the connection request for a Session. It provides a component of subsequent lock owner identifiers for the client, and is associated with credentials supplied for use by the client.

Communication channel

  A communication channel is a transport protocol level abstraction that provides a communication connection between two endpoints across a network. A DAFS communication channel provides the set of transport delivery and error handling requirements listed previously in 2.3.3., "DAT Requirements". DAFS Protocol Specification 1.0 makes an assumption that there is one-to-one correspondence between a DAFS communication channel and a DAT connection.

  Note: A many-to-one mapping of DAFS communication channels onto a DAT connection could be considered for addition later if this feature becomes a requirement, for example, for scalability purposes. At a minimum, it will require adding Session-id to the header of DAFS messages.

Operation Channel

  The Operation channel is a communication channel between a DAFS client and server that is a REQUIRED part of a DAFS Session. The primary purpose for the Operation channel is to allow the client to send operation request messages to the server.

RDMA-read Channel

  The RDMA-read channel is a communication channel between a DAFS client and server that is an OPTIONAL part of a DAFS Session. The primary purpose for the RDMA-read channel is to allow the server to originate RDMA read operations targeted to the client.

Response Cache

  The Response Cache is an OPTIONAL, server-maintained, Session-based cache that holds the results of recent state-modifying requests.

Session

  A DAFS Session is an abstraction that allows a DAFS client and server to create and manage a collection of communication channels for exchanging messages.

Session-id

  The Session-id is a 64-bit identifier used as a shorthand designation for a communication Session between a client and server. The server returns it to the client when a Session is successfully established. It serves both to identify Sessions for recovery purposes and as evidence that the client still exists. In this latter role, it does not have to be explicitly sent, because it is implied by any message sent to the server on that Session. Because it is used for recovery, including recovery that could involve service failover between multiple DAFS servers, it MUST be unique across any set of DAFS servers that share a failure recovery mechanism.

State-id

  The State-id is a 64-bit opaque quantity that is assigned by the server when a file is opened and serves as a shorthand representation of the lockowner that has the file opened. The state-id is passed as a compact lockowner representation in file lock and close requests and is valid for any Session associated with the client. In some respects, it is similar to the NFS Version 4 state-id. However, in DAFS the state-id does not have to be unique across server reboots, because the DAFS Session detects server reboots. It does not have to change on each locking request, because there are not going to be delayed transmissions pending in routers. Finally, it plays no role in lease renewal, because any message for a Session associated with that client suffices to renew leases.

3. Communication Model

This chapter provides the basic design of the DAFS communication model and describes

o Session management
o Message handling.

3.1. Session Management

DAFS communication is a session-based protocol that utilizes a request-response model of message exchange between client and server. A DAFS Session provides a common communication environment between the client and server. The session design incorporates a number of long-lived attributes including authentication and authorization, features related to segments of the file system name space, message flow control, and transport-level resource management. A Session MUST be established before DAFS file operations can be performed.

  Rationale: Session-based message transfer enables DAFS to take full advantage of a number of attributes of the local file-sharing environment. First, to make highly effective use of RDMA operations for the transfer of bulk data between client and server, most DAFS implementations might prefer to preallocate and advertise large transfer buffers on specific communication channels. DAFS makes use of the Session mechanism to assist in the management of these resources. Second, within the local file-sharing environment, the trust relationship between client and server permits the use of a set of shorthand credentials to be associated with the Session after initial authentication has been completed. Third, under a Session-based paradigm, message exchanges between client and server can be managed to provide recovery semantics following system or subcomponent failure.

DAFS Sessions have the following primary functions:

o Establishing and negotiating DAFS protocol options to be used during the Session

o Authenticating the client and server

o Linking lower-level connections (for example, DAFS communication channels) into a logical entity

o Providing context for credential management, message flow control management, DAFS file operations, and recovery operations.

3.1.1. Security Model

The DAFS security model is based on a trusted client-server relationship, as would be expected in a local file-sharing environment. To provide basic security and establish the trust relationship between client and server, authentication is performed as part of the initial communication for setting up a Session. This includes authenticating the client to server, and OPTIONALLY authenticating the server to client.

After initial authentication has succeeded, the trusted client can specify alternate sets of credentials when performing normal DAFS operations. Clients can preregister multiple credentials with the server (that is, server-side credentials caching) and obtain in return opaque cookies to be used in subsequent DAFS operations to identify individual credentials to the server. This avoids repeated transmission and analysis of the identical credentials that would otherwise appear on numerous requests in the local sharing environment. In addition, when a client (specified in the connection request by the Client-id-string field) authenticates multiple Sessions using the same authentication type and principal identifier, credentials registered on one Session are available to all of that client's Sessions, and are associated with the same opaque cookies on each of those Sessions.
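
The cookie-sharing rule in the preceding paragraph can be illustrated with a small model. This is a minimal sketch, not part of the protocol definition: the class and method names (CredentialCache, Session, register_cred, lookup_cred), the use of integers as cookies, and the representation of credentials as tuples are all assumptions made for illustration. The only behavior taken from the text is that credentials are keyed by the client-id-string, authentication type, and principal, so that every Session authenticated with the same triple sees the same cookie.

```python
import itertools


class CredentialCache:
    """Hypothetical server-side table of registered credentials."""

    def __init__(self):
        # Cookie values are arbitrary here; a real server picks its own.
        self._next_cookie = itertools.count(1)
        self._by_creds = {}   # (owner, credentials) -> cookie
        self._by_cookie = {}  # (owner, cookie) -> credentials

    def register(self, owner, credentials):
        """Return the cookie for these credentials, allocating one if new."""
        key = (owner, credentials)
        if key not in self._by_creds:
            cookie = next(self._next_cookie)
            self._by_creds[key] = cookie
            self._by_cookie[(owner, cookie)] = credentials
        return self._by_creds[key]

    def lookup(self, owner, cookie):
        """Translate a cookie back to credentials; KeyError if unknown."""
        return self._by_cookie[(owner, cookie)]


class Session:
    """Hypothetical Session that shares the cache with its sibling Sessions."""

    def __init__(self, cache, client_id_string, auth_type, principal):
        self._cache = cache
        # Sessions with the same triple are treated as the same client.
        self._owner = (client_id_string, auth_type, principal)

    def register_cred(self, credentials):
        return self._cache.register(self._owner, credentials)

    def lookup_cred(self, cookie):
        return self._cache.lookup(self._owner, cookie)
```

In this model, a cookie obtained on one Session resolves to the same credentials on a second Session created with the same client-id-string, authentication type, and principal, while a Session authenticated as a different principal cannot resolve it.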

  Rationale: Leveraging the connection-oriented nature of the transport protocol means that full credentials do not have to be passed and subsequently translated into an internal form by the server for each operation. Preauthentication is particularly valuable, because many local file-sharing applications use only one credential for the entire duration of the connection.

A DAFS server MAY also support untrusted clients, which cannot alter their credentials once the Session is established. For more information, see 3.1.1.1.2., "Untrusted Clients".

3.1.1.1. Authentication

Each DAFS client MUST authenticate itself to a DAFS server as part of Session initialization. This is distinct from any credential cookie used for individual DAFS operations.

The DAFS client and server can support one or more authentication mechanisms. The client can use a SECINFO operation to query which mechanism(s) are supported by the server.

After client authentication has succeeded, OPTIONAL server authentication can also be performed, to authenticate the server to the client.

3.1.1.1.1. GSS API Authentication

DAFS allows clients to authenticate themselves, and to represent other entities, including users and machines. During the connect phase, the client can authenticate its identity to the server, using DAFS_PROC_CLIENT_AUTH or DAFS_PROC_CLIENT_CONNECT_AUTH, and can request that the server authenticate its identity to the client, using DAFS_PROC_SERVER_AUTH. DAFS provides that this can be done with a high degree of security, by employing the General Security Service Application Program Interface (GSS API) in both the client and server. The DAFS protocol provides a GSS API flavor of authentication, which provides the mechanism needed to exchange the tokens generated by GSS API between the client and the server.

The GSS API [Linn] provides a generic wrapper for different security mechanisms.
The most widely used security mechanism currently supported by GSS API is Kerberos Version 5. The DAFS protocol provides sufficient support for authentication of clients under GSS API, regardless of the underlying security mechanism.

DAFS does not provide any facility for integrity or privacy services under GSS API at its protocol layer. This could be provided by a lower level network protocol used by DAFS. DAFS does not authenticate each and every packet exchanged in a DAFS Session. Rather, it authenticates the Session, and relies on the secure nature of the Session to prevent an interloper from interjecting rogue packets into the client or server.

The normal usage of the GSS API by the client and server is as follows: The client makes a call to GSS_Init_sec_context, specifying mutual authentication, but not requesting delegation, replay detection or out-of-sequence detection. This returns to the client a gss_token, which is an opaque carrier for authentication information that will be sent to the server as part of the client authentication.

Upon receiving the token in a DAFS_PROC_CLIENT_AUTH or DAFS_PROC_CLIENT_CONNECT_AUTH call, the server will make a call to GSS_Accept_sec_context. If successful, this will generate another gss_token, which needs to be returned to the client in the response to the procedure called.

The interaction between client and server can require multiple phases to complete authentication. If the major status code returned by GSS API to the GSS_Accept_sec_context call is GSS_S_CONTINUE_NEEDED, then the client needs to make a subsequent GSS_Init_sec_context call, with the gss_token returned by the server as input, and then make another DAFS_PROC_CLIENT_AUTH request, simply for the purpose of delivering the gss_token returned by that call to GSS_Init_sec_context to the server.
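
The multi-phase exchange described above can be sketched as a simple loop. The sketch below is illustrative only and makes several assumptions: it does not call a real GSS API binding, the classes MockClientContext and MockServerContext and the function authenticate_session are hypothetical stand-ins, and the tokens are placeholder byte strings. Only the control flow follows the text: each client request carries the token produced by GSS_Init_sec_context, and the exchange repeats while the server side reports GSS_S_CONTINUE_NEEDED.

```python
from enum import Enum


class Status(Enum):
    """Stand-ins for the GSS API major status codes named in the text."""
    COMPLETE = "GSS_S_COMPLETE"
    CONTINUE_NEEDED = "GSS_S_CONTINUE_NEEDED"


class MockClientContext:
    """Hypothetical client side; init_sec_context mimics GSS_Init_sec_context."""

    def __init__(self):
        self.server_tokens_consumed = 0

    def init_sec_context(self, input_token=None):
        # The first call has no input token; later calls consume the token
        # returned by the server in its previous response.
        if input_token is not None:
            self.server_tokens_consumed += 1
        return f"client-token-{self.server_tokens_consumed}".encode()


class MockServerContext:
    """Hypothetical server side; accept_sec_context mimics GSS_Accept_sec_context."""

    def __init__(self, tokens_required):
        # Number of client tokens this pretend mechanism needs to finish.
        self.tokens_required = tokens_required
        self.tokens_accepted = 0

    def accept_sec_context(self, client_token):
        self.tokens_accepted += 1
        done = self.tokens_accepted >= self.tokens_required
        status = Status.COMPLETE if done else Status.CONTINUE_NEEDED
        return status, f"server-token-{self.tokens_accepted}".encode()


def authenticate_session(client, server, max_requests=8):
    """Repeat the client authentication request until the context is complete.

    Each loop iteration models one DAFS_PROC_CLIENT_AUTH request and its
    response. Returns the number of requests that were needed.
    """
    server_token = None
    for request_count in range(1, max_requests + 1):
        client_token = client.init_sec_context(server_token)
        status, server_token = server.accept_sec_context(client_token)
        if status is Status.COMPLETE:
            # Let the client consume the final server token as well.
            client.init_sec_context(server_token)
            return request_count
    raise RuntimeError("authentication did not complete")
```

With a mock server that needs three client tokens, authenticate_session returns 3, which corresponds to one initial request plus two follow-up requests triggered by GSS_S_CONTINUE_NEEDED. The max_requests bound is an assumption added so that a misbehaving peer cannot keep the loop running indefinitely.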
Since the client specifies the mutual authentication flag in its GSS_Init_sec_context call, the client and server are mutually authenticated once the gss context is created, and there is no need for the client to make a DAFS_PROC_SERVER_AUTH call to authenticate the server. For this reason, the server response to DAFS_PROC_SERVER_AUTH with the AUTH_GSS security flavor is void.

The gss context created is associated with a DAFS Session. For that reason, there is no exchange of gss context handles between the client and server. The client and server MUST both destroy the gss context associated with a DAFS Session when tearing down that Session.

Depending on the type of authentication, re-authentication might be necessary periodically in order to renew the authentication. Before the authentication expires, the client SHOULD initiate a new sequence of DAFS authentication operations using either DAFS_PROC_AUTH (for the Operation Channel) or DAFS_PROC_BIND (for optional channels). This sequence proceeds in the same manner as the initial authentication sequence. Each channel MUST renew its authentication independently.

Rationale: The GSS integrity and privacy features are unnecessary for use in a local file-sharing environment, where they add processing overhead without extra security.

3.1.1.1.2. Untrusted Clients

A DAFS server MAY support untrusted clients. An untrusted client has its identification and credentials established at initial Session connection time. The DAFS server automatically registers the initial set of credentials supplied during Session creation as the credentials to be used for the life of the Session.

The following constraints are placed on an untrusted client:

   o The client SHALL NOT use the PROC_REGISTER_CRED or PROC_RELEASE_CRED requests.
   o The cred_handle for all requests MUST be zero.
   o The AUTH_NONE authentication method is disallowed.
   o The AUTH_DEFAULT, AUTH_GSS and AUTH_NAME authentication methods are allowed, if supported by the DAFS server.

Rationale: AUTH_NONE provides no accountability or identification, so no credentials are available to use for the registered defaults. Thus, it is disallowed.

3.1.1.2. Credential Registration and Caching

DAFS clients can pre-register a credential with the DAFS server for use during file operations, and in return, obtain an opaque credential cookie. The cookies are used by the client in subsequent DAFS operations instead of passing full user credentials. The number of credentials that can be cached by the server for a Session is specified during Session setup. After a client determines that a set of credentials is no longer needed, the client advises the server that the set of credentials can be released.

3.1.1.3. Client Identifiers

A DAFS client is identified by the client-id-string and client-verifier supplied in the connection request for a Session. The (client-id-string, client-verifier) pair is mapped to a shorthand client-id identifier that is subsequently identified with the client.

In determining the appropriate client-id to use for a given (client-id-string, client-verifier) pair, the server SHALL apply the following rules:

Case A: the server has no record of the client-id-string.
The server can treat this case as though the client has never connected to the server. Specifically, the server SHALL generate a new client-id and save the authentication mechanism and principal for checking against future uses of the client-id-string.

Case B: the server does have a record of the client-id-string, and there is an active client-id for it which was returned for the same (client-id-string, client-verifier) pair.
This case corresponds to a client instance reconnecting after a Session was disconnected due to a transport error, or a client instance connecting additional Sessions. In this case, so long as the client successfully authenticates using the same authentication mechanism and principal that caused the server to generate the client-id, the server SHALL return the same client-id as it has for the previous use of the (client-id-string, client-verifier) pair. If the client attempts to authenticate using either a different authentication mechanism or principal, then an appropriate error is returned.

Case C: the server does have a record of the client-id-string, and there is an active client-id for it which was returned for the same client-id-string, but a different client-verifier.
This case corresponds to a restart of a client instance (e.g., client reboot). In this case the client MUST authenticate using the same authentication mechanism and principal as was used for the previous instance of the client-id-string. If the client attempts to authenticate using either a different authentication mechanism or principal, then an appropriate error is returned. If the server finds that the principal matches the one previously registered for the client-id-string, then all locking state associated with the old client-id SHALL be immediately released by the server.

Note: When the client uses DAFS_PROC_CLIENT_CONNECT followed by DAFS_PROC_CLIENT_AUTH (rather than DAFS_PROC_CLIENT_CONNECT_AUTH) and case C above occurs, there is a window in which the server needs to generate a new client-id but cannot yet clean up the old state that goes along with the old client-id. The server MUST wait until the client successfully authenticates as the appropriate principal using the appropriate mechanism (as discussed in case C) before releasing locking state.

3.1.2. Session Attributes

A number of aspects of the client-to-server communication are negotiated at the time the DAFS Session is established.

3.1.2.1. Session Identifier

Upon successful completion of a client request to establish a DAFS Session, the DAFS server returns a unique Session identifier to the client. The Session identifier associates a series of DAFS operation messages, independent of the lower-level transport implementation used to exchange messages.

3.1.2.2. Session Options

A number of Session attributes are established during Session initialization. The client specifies requested values for the negotiated attributes in the Session request message, and the server returns the values for those attributes that will be used for the duration of the Session. Session options are:

Protocol Version
   The client requests a particular protocol version for the Session. If the server supports that version, it responds with the same version number. Otherwise, the server responds with an error indication and can also return a protocol version number that is supported.
Endianness
   The client requests a byte ordering (endianness) for the Session. The endianness of the initial connection request sent by the client can be either "little-endian" or "big-endian." The server determines the client's byte ordering by examining the protocol-defined static value in the message header. The server's response message to the connection request, and all subsequent messages exchanged on the Session, will use the endianness chosen by the client.
Maximum Number of Credentials
   The client requests that the server provide storage for a certain number of credentials. The server responds with the number that it will store. Later the client can use the DAFS_PROC_REGISTER_CRED operation to store credentials with the server and receive a "shorthand" credential identifier to be used in subsequent DAFS operations.
Maximum Request Size
   The client requests that the size of the buffers allocated on the server to receive send-receive style request messages sent from the client be set to this value. The server responds with the buffer size that will be allocated on the server for receiving those requests.
Maximum Response Size
   The client requests that the size of the buffers allocated on the client to receive send-receive style response messages sent from the server be set to this value. The server responds with the buffer size that MUST be allocated on the client for receiving those responses.
Use Back-control Channel
   The client requests that it be allowed to bind an additional DAFS communication channel to this Session to support the transmission of server-initiated "back-control" directive messages on a separate channel. The server responds with whether the Back-control Channel will be used.
Use RDMA Read Channel
   The client requests that it be allowed to bind an additional DAFS communication channel to this Session to support RDMA read operations. The server responds with whether the RDMA-read Channel will be used.
Use Checksums
   The client requests that all user data transferred in read and write operations be checksummed. The server responds with whether or not data will be checksummed.
Inline Write Header Size
   The client requests the offset (in bytes) from the start of a DAFS_PROC_WRITE_INLINE message where the user data being transferred is located. This value is the sum of the DAFS message and operation header sizes and the padding size. This provides improved data alignment following write transfers to the server. The server responds with the offset where it will expect the data.

3.1.2.3. Multiple Communication Channels

A DAFS Session includes at least one communication channel and can include up to two additional special-purpose channels.
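
As a rough illustration of how the negotiated options relate to the channels of a Session, the sketch below derives the set of channels from the two channel-related options. The class, field, and channel names are hypothetical and chosen for this example; they are not DAFS wire-format names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NegotiatedOptions:
    """Channel-related values as returned by the server (names are illustrative)."""
    use_back_control: bool
    use_rdma_read: bool


def session_channels(opts: NegotiatedOptions) -> List[str]:
    """List the channels a Session has once all permitted channels are bound."""
    channels = ["operation"]  # every Session has the Operation Channel
    if opts.use_back_control:
        channels.append("back-control")
    if opts.use_rdma_read:
        channels.append("rdma-read")
    return channels
```

A Session therefore has between one and three channels, depending on what the server accepted during negotiation.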
Note: All DAFS communication channels defined for a Session share the property that if an RMR Context can be used on one of them, it can be used on any of them for RDMA operations.

Rationale: DAFS is designed to support the key advantages of fast, low-overhead remote read and write operations. The basic message flow pattern needed to support those operations can be accomplished using only one transport channel. Adding features that require server-initiated messages introduces additional complexity. However, that complexity is introduced

   o only for those clients who require the feature
   o in a way that minimizes complexity and performance overhead for the more common I/O operations.

Thus, to support these optional features, DAFS optionally introduces extra complexity in the area of Session management in the form of establishing an additional communication channel.

3.1.2.3.1. DAFS Operation Channel

A DAFS Session includes at least one communication channel (for example, a DAT connection) for transporting DAFS operation messages between the client and server. The message flow consists of a DAFS operation request message sent from the client to the server, followed by a DAFS operation response message sent from the server to the client. The DAFS client initiates all request/response pairs on the DAFS Operation Channel.

3.1.2.3.2. Creation of Special Purpose Channels

The client creates all communication channels between the client and server. Connection and authentication of each channel normally consists of a connection request being sent from the client to the server, followed by a response message being sent from the server to the client. For some types of Session authentication, this initial paired message exchange MAY be followed by subsequent paired message exchanges that continue until the authentication process is complete.
The entire sequence of exchanges is single-threaded, requiring that the client and server each make one receive buffer available for the next message throughout the sequence. Once the connection and authentication sequence is complete, the message flow on the channel is dictated by its intended purpose.

If an optional, special-purpose channel is to be used, it MUST be bound to the Session after the initial connection and authentication message sequence has completed successfully on the Operation Channel, but before issuing any request on the Operation Channel that would require the use of the special-purpose channel.

Note: If the client intends to issue requests that require the use of an optional channel, then the client SHOULD create and bind those channels to the Session as soon as possible after completing the initial connection and authentication message sequence on the Operation Channel. Any delay in establishing these optional channels could increase the risk that a resource shortage on the server could cause an error in establishing the optional channel. However, the client is NOT REQUIRED to create these optional channels if it will not issue requests that would require their use.

The specific requests that, depending on the results of the Session option negotiation, can introduce a requirement for the use of an optional channel are:

   o Back-control Channel
      o Any operation that requests a delegation
      o DAFS_PROC_BATCH_SUBMIT
   o RDMA-read Channel
      o Any operation that requests the server to initiate an RDMA operation

3.1.2.3.3. Back-control Channel

A DAFS Session can include a separate communication channel for transporting DAFS back-control operation messages between the server and the client.
Following the initial connection and authentication sequence of messages, this second message flow consists of a DAFS back-control directive request message sent from the server to the client, followed by a DAFS response message sent from the client to the server.

The DAFS client creates this channel by initiating a request/response message pair to bind this channel to a previously established Session. The bind request can be followed by a sequence of client-initiated request/response message pairs needed to complete the channel authentication. Once the initial connection and authentication sequence is complete, the DAFS server initiates all subsequent request/response pairs on the Back-control Channel.

This channel is OPTIONAL in that the client creates it to support DAFS features that are OPTIONAL and require server-initiated request/response messages. For instance, the delegation feature requires the server to initiate delegation revocation message pairs with the client. This second channel provides for that requirement.

Rationale: Separating the Back-control Channel from the Operation Channel also separates the traffic on the two channels onto two separate DAT Connections, each with its own DAT Endpoint and Connection parameters. This provides independent flow control on each channel. In addition, it allows the client implementations to handle back-control requests in a separate flow of control, without additional parsing on the common data path for command responses.

3.1.2.3.4. RDMA-read Channel

A DAFS Session can include a communication channel to be used exclusively for RDMA read operations initiated by the DAFS server. A client that intends to issue requests that require the DAFS server to make use of RDMA read operations MUST specify that intent by indicating the "use_rdma_read" Session connection option. The server accepts or rejects the use of the RDMA-read Channel.
If accepted, then the server will use this communication channel to issue RDMA read operations when performing direct read operations from the client's memory.

If the DAFS server specifies the use of an RDMA-read Channel during the connection option negotiation, then the client is REQUIRED to create and make use of such a channel to issue any DAFS request that would use the RDMA-read Channel. Otherwise, the server MAY return an error to any DAFS request on the DAFS Operation Channel that would involve an RDMA read operation.

The DAFS client creates this channel by initiating a request/response message pair to bind this channel to a previously established Session. The bind request can be followed by a sequence of client-initiated request/response message pairs needed to complete the channel authentication. Once the initial connection and authentication sequence is complete, subsequent use of the channel is limited to server-initiated RDMA read operations.

3.1.2.3.4.1. Use of the RDMA-read Channel Optional for Server

The RDMA-read Channel is OPTIONAL for the server. If a client's connection request specifies the use of an RDMA-read Channel (meaning that it intends to issue requests that call for the use of that channel), the server MAY respond by accepting the use of an RDMA-read Channel, or MAY respond by rejecting the use of an RDMA-read Channel. If the server accepts the use of the RDMA-read Channel, then the client MUST create it; otherwise the client MUST NOT create it. If a server has rejected the use of an RDMA-read Channel, and the client attempts to create one anyway, the server MAY return an error response to the DAFS_PROC_CONNECT_BIND request.
If a client's connection request does not specify the use of an RDMA-read Channel (meaning that it does not intend to issue requests that call for the use of that channel), then the server MAY reject any DAFS request on the DAFS Operation Channel that would have involved the use of the RDMA-read Channel. This seems easier than dropping the Session or requiring that the RDMA-read Channel be set up within some length of time after the DAFS Operation Channel.

The RDMA-read Channel is NOT OPTIONAL for the client. Regardless of whether the client's connection request specifies the use of an RDMA-read Channel or not, if the server's response specifies an RDMA-read Channel, then the client MUST create the channel if the client intends to issue requests that require its use. Should the server specify the use of an RDMA-read Channel, and the client does not create it, then the server MAY reject any subsequent DAFS request on the DAFS Operation Channel that would have made use of the RDMA-read Channel.

3.1.2.3.4.2. Use of the RDMA-read Channel Not Optional for Client

To make the RDMA-read Channel effective, all clients that issue DAFS requests that use RDMA read operations MUST use the RDMA-read Channel if it is specified by the server during Session negotiation.

Rationale: VI does not currently define a "selective signaling" (per-request interrupt flag) capability, while the InfiniBand software transport interface does provide a selective signaling semantic. Hence, a portable application cannot rely on the transport layer providing such a capability. To remedy this VI deficiency, DAFS defines the RDMA-read Channel that a DAT Provider MAY be able to provide. This removes the need for that transport layer to provide a selective signaling capability.
Notice that if the DAT optionally provides selective signaling, then a DAT Consumer can use that capability directly, thus avoiding the need for an additional DAFS communication channel for RDMA reads.

Consider a simple VI Architecture-based implementation where each DAFS communication channel is mapped onto a separate VI connection. Consider what happens if the server wants to use the RDMA-read Channel, and all clients but one establish and make use of an RDMA-read Channel. Assume the single client without an RDMA-read Channel tries to perform a DAFS_PROC_WRITE_DIRECT or DAFS_PROC_BATCH_SUBMIT request. The only communication channels available to the server to post the RDMA read will be either the DAFS Operation Channel for the Session or the back-control directive channel for the Session. Both of these channels have their send work queues tied to a completion queue on which the DAFS server normally does not want to have interrupts enabled. Typically, this is a global send completion queue that is shared across all DAFS Sessions.

So, if the DAFS server is about to satisfy a DAFS_PROC_WRITE_DIRECT or DAFS_PROC_BATCH_SUBMIT request, the server posts the RDMA read to either the DAFS operation or back-control directive channel. To get timely notification for the RDMA read completion, the DAFS server will need to enable interrupts on the global send completion queue.

Thus, a single client could put the server into the position of needing to take an interrupt for every send or RDMA write completion. That would defeat the point of the RDMA-read Channel.

Implementations that support selective signaling need not use an RDMA-read Channel.

3.1.2.3.5. Direct I/O Channel Negotiation

Depending on the details of the server and client DAT transport support, it might be desirable for either the client or server to perform RDMA operations on the Operation Channel or on the RDMA-read Channel.
This functionality is negotiated as part of the Session.

1) Server supports DIRECT I/O on both the RDMA-read Channel and the Operation Channel.
   o If the client requests use_rdma_channel == TRUE, then the server replies with use_rdma_channel == TRUE. DIRECT I/O proceeds on the RDMA-read Channel.
   o If the client requests use_rdma_channel == FALSE, and later tries DIRECT I/O on the Operation Channel, the server performs the DIRECT I/O on the Operation Channel.

2) Server requires use of the RDMA-read Channel for DIRECT I/O.
   o If the client requests use_rdma_channel == TRUE, the server replies with use_rdma_channel == TRUE. DIRECT I/O proceeds on the RDMA-read Channel.
   o If the client requests use_rdma_channel == FALSE, the server replies with use_rdma_channel == TRUE. If the client later tries DIRECT I/O on the Operation Channel, the server returns a DAFSERR_ENOTSUPP error.

3) Server does not support the RDMA-read Channel, but supports DIRECT I/O on the Operation Channel.
   o If the client requests use_rdma_channel == TRUE, the server replies with use_rdma_channel == FALSE. If the client later tries DIRECT I/O on the Operation Channel, the server performs the DIRECT I/O on the Operation Channel.
   o If the client requests use_rdma_channel == FALSE, the scenario proceeds as in the previous situation.

4) Server does not support DIRECT I/O at all.
   o If the client requests use_rdma_channel == TRUE, the server replies with use_rdma_channel == FALSE. If the client later tries DIRECT I/O on the Operation Channel, the server returns an error.
   o If the client requests use_rdma_channel == FALSE, the scenario proceeds as in the previous situation.

3.1.2.3.6. Special Channel Setup Handling

The DAFS server might need to be able to differentiate a client's connection requests for various types of Session channels.
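
Looking back at the four negotiation cases listed in 3.1.2.3.5, the outcomes can be summarized as a small decision function. This is a sketch under stated assumptions: the enum and function names are invented for illustration, and the server's reply in case 1 when the client requests FALSE is assumed to be FALSE, since the text does not state it.

```python
from enum import Enum
from typing import Optional, Tuple


class ServerMode(Enum):
    """Hypothetical labels for the four server configurations."""
    BOTH = 1            # case 1: RDMA-read Channel and Operation Channel
    RDMA_READ_ONLY = 2  # case 2: RDMA-read Channel required
    OPERATION_ONLY = 3  # case 3: Operation Channel only
    NO_DIRECT_IO = 4    # case 4: no DIRECT I/O at all


def negotiate(mode: ServerMode, client_requests: bool) -> Tuple[bool, Optional[str]]:
    """Return (server's use_rdma_channel reply, channel carrying DIRECT I/O).

    The second element is None where the scenario in the text ends with the
    server returning an error for DIRECT I/O tried on the Operation Channel.
    """
    if mode is ServerMode.BOTH:
        # Reply for client_requests == False is an assumption of this sketch.
        return (client_requests, "rdma-read" if client_requests else "operation")
    if mode is ServerMode.RDMA_READ_ONLY:
        return (True, "rdma-read" if client_requests else None)
    if mode is ServerMode.OPERATION_ONLY:
        return (False, "operation")
    return (False, None)
```

For example, `negotiate(ServerMode.RDMA_READ_ONLY, False)` models the second bullet of case 2: the server replies TRUE, and DIRECT I/O attempted on the Operation Channel is refused.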
The implementation requirements are specific to the details of connection establishment for each particular transport. For more discussion see

   o Appendix B. "DAT Semantics" for details of DAT connection establishment
   o Appendix D. "DAFS Mapping to VI Architecture" for details of the DAT mapping to VI
   o Appendix E. "DAFS Mapping to InfiniBand Reliable Connection" for details of the DAT mapping to IB.

3.1.2.4. Session Response Cache

The DAFS server and client negotiate the use of a cache of operation results of recent state-modifying requests issued on each Session. State-modifying requests are those that change the state of a file or other file system object or a file lock on the server (see 4.3., "Request Chaining" for further description of state-modifying requests). The maximum number of outstanding requests allowed for the Session determines the number of entries in the cache. This value is negotiated during Session initialization. The Session identifier provides an identifier for accessing the Response Cache during recovery processing following a system failure.

Rationale: DAFS uses a session-based protocol with a fixed number of outstanding requests for each Session. This provides an upper bound on the total number of entries in the Response Cache. The Session-id provides a unique identifier for accessing the cache following a failure. Using these elements, the DAFS server can maintain the state necessary to ensure at-most-once semantics for state-modifying operations following a failure.

3.1.2.5. Session Persistence

A DAFS Session persists only as long as the DAFS Operation Channel exists. If the DAFS Operation Channel is lost, the client MUST establish a new Session and re-authenticate and possibly re-register all credentials before continuing with additional DAFS requests.
The client can use the new Session to transmit queries to the server about the state of DAFS requests that were outstanding at the time of disconnection. This information is available in the Response Cache associated with the old Session.

3.1.3. Session Operations

The following DAFS operations are provided for DAFS Session management.

DAFS_PROC_CLIENT_CONNECT
   Create a new Session using the current transport connection, negotiating basic protocol configuration for the Session.
DAFS_PROC_CLIENT_AUTH
   Authenticate the client to the server, establishing trust for the Session.
DAFS_PROC_SERVER_AUTH
   Authenticate the server to the client, establishing trust in the reverse direction for the Session.
DAFS_PROC_CLIENT_CONNECT_AUTH
   Create a new Session and authenticate the client to the server.
DAFS_PROC_CONNECT_BIND
   Bind a new transport connection to an existing Session.
DAFS_PROC_CLIENT_DISCONNECT
   Terminate a Session.
DAFS_PROC_SECINFO
   Enumerate server-supported security authentication methods.
DAFS_PROC_REGISTER_CRED
   Register credentials with the server for subsequent usage by the client.
DAFS_PROC_RELEASE_CRED
   Remove previously registered credentials that are no longer needed by the client.

The following DAFS operations provide recovery of Response Cache information for a previous Session.

DAFS_PROC_CHECK_RESPONSE
   Check a disconnected Session's Response Cache for the results of a request.
DAFS_PROC_FETCH_RESPONSE
   Fetch information from a disconnected Session's Response Cache.
DAFS_PROC_DISCARD_RESPONSES
   Discard the Response Cache information for a disconnected Session.

3.1.4. Sharing Sessions

DAFS does not specify how a single DAFS Session is used by applications.
However, it does provide a mechanism to facilitate the sharing of a single Session, as might be the case if a multithreaded application wants to multiplex threads onto a Session.

Each DAFS message contains a 64-bit field in the message header that the client is free to use to identify or tag a request. This tag is opaque to the server. The server returns the unmodified 64-bit number in the response, which the client can then use to efficiently match the response with the originator of the request. The DAFS protocol does not place any constraints on what this 64-bit tag contains.

3.2. Message Handling

The DAT provides a rich set of data transfer primitives. Efficient use of those primitives is affected by a number of software interface and hardware support attributes. The DAFS protocol defines traditional send/reply messages, as well as remote DMA-based operations. The request-response model uses data buffers transmitted in-line with the messages, whereas the bulk data transfer model uses "direct" buffers transmitted via RDMA independent of the messages. The RDMA model does not require the use of intermediate buffers within the file system or transport. The DAFS protocol defines a message flow control mechanism to help manage the various buffer resources.

3.2.1. DAT Data Transfer Operations

DAT provides a rich set of features for data transfer operations: RDMA writes, RDMA reads, and traditional send/receive. Data buffers for all data transfer operations consist of a scatter/gather list of one or more memory segments. Furthermore, the targeted applications of the local file-sharing environment and the DAFS protocol suggest a rich set of application requirements including async I/O and list I/O [POSIX].

3.2.1.1. RDMA

The RDMA write and RDMA read facilities enable one end of a communication channel to directly write or read into the address space of its peer.
The advantages of RDMA transfers over the traditional send/receive model are that data copies can be avoided and receive buffers need not all be allocated to hold the maximum transfer size of data.

3.2.1.2. Scatter/Gather

The initiating side of an RDMA data transfer can provide a set of buffers to use rather than requiring all data in the transfer to be contiguous in the virtual address space of the process. This can be used, for instance, to retrieve a message header into one buffer and a data payload into a data buffer.

3.2.1.3. RDMA Memory Registration

All memory that will be accessed by a DAT Channel Adapter in support of RDMA transfer operations needs to be registered with the Channel Adapter. Memory registration serves a number of purposes. First, it allows the operating system to pin the memory so it will be memory-resident during an I/O transfer. Second, it provides the Channel Adapter with the physical address mapping for the memory region. Finally, it associates the memory region with an RMR Context and set of protection attributes.

The RMR Context can be used to ensure that the memory region is accessible for Remote DMA only over the DAT connection(s) that are associated with that RMR Context. This restricts access of a particular memory region to particular hosts or applications. The protection attributes indicate whether RDMA read or write is allowable on the memory region.

As a result of memory registration, the DAT consumer is returned an RMR Context. This RMR Context MUST be provided to the Channel Adapter whenever it is asked to reference the memory region by an RDMA data transfer operation (DTO).

3.2.2. DAT Error Reporting

DAT provides guarantees for data delivery and data transfer operation (DTO) completions as stated in 2.3.3., "DAT Requirements". DAT makes some guarantees as to the type of errors it detects, but it makes no statements as to the timeliness of the reporting of these errors.
It is up to DAFS client and server implementations to address any deficiencies in the timely error detection/reporting features of any given DAT provider.

Some of the DAT and DAFS protocol tools that handle timely detection of errors include:

   o DAFS null operations for the main and Back-control Channels that could be used to ping a peer.
   o Ability of either the client or server to break a connection at any time.
   o Support for multiple connections between client and server.

The DAFS specification does not mandate any particular strategy for timely error detection. In fact, the level of error detection supported by the DAT provider will dictate the degree of error detection that DAFS implementations will need to perform.

Note: Once a DAT interaction between the DAT Provider and the DAT Consumer is defined, a timeout parameter for some synchronous data transfer operations (DTO) can also be used for controlling timely detection of errors.

Client and server implementations are free to implement any mechanism and enforce any timeliness constraints they see fit. Typically, request initiator clients on the main channel and servers on the back channel are responsible for error detection enhancements on their channel. Possible solutions include:

   o Use of a per-operation timeout.
   o Use of keep-alive messages (pings) using the NULL procedures. The sender could then restrict the use of timers to these messages only and not to every operation.

3.2.3. Mapping DAFS onto Memory-to-Memory Architectures

The key characteristics of the memory-to-memory architecture that impact the definition of the file access operations are the Remote DMA (RDMA) data transfer facilities. The RDMA write and RDMA read facilities enable one end of a communication channel to directly write or read into the memory of its peer.
The advantage of RDMA transfers over the traditional send/receive model is that data copies can be avoided.

DAFS assumes that there is a one-to-one correspondence between a DAFS communication channel and a DAT connection.

DAFS operations are defined that take advantage of remote memory access. For instance, in addition to read file and write file operations that include the data to be transferred in the response or request message, new read and write operations are defined that include the memory address of the client's destination/source buffer.

Given the RDMA and send/receive models supplied by DAT, DAFS requests and responses are divided into two categories:

o messages that transfer large variable-sized (usually greater than 1 KB) bulk user data

o messages that are bounded in size by the file access protocol.

3.2.3.1. Small Bounded-Size Transfers

The DAFS message exchange is mapped onto Direct Access Transport messages in the following manner:

1) The DAFS request message is placed in transport-registered memory. A preallocated send data transfer operation (DTO) buffer is initialized so that its data segment's buffer virtual address points to the DAFS message. More generally, the send DTO buffer can contain a "gather" list of virtual address pointers, paired with corresponding buffer lengths, that describe a (potentially virtually non-contiguous) series of memory buffers containing the DAFS message. The message is sent using the transport-specific send interface.

2) On the receiving end, a server is REQUIRED to have a preallocated, registered transport-level receive data transfer operation (DTO) buffer ready to accept a request message. The DTO buffer's data segment(s) form a scatter list of (virtual address, length) pairs that describe a preallocated virtual buffer that meets the agreed upon maximum message size.
It is up to the server implementation to detect message reception through polling or blocking calls to the transport's receive interface using a transport-specific API.

3) The server builds a response and sends it in a manner similar to step 1. Note that as part of the flow control agreement, the client MUST post as many receive descriptor buffers as there are outstanding requests.

4) The client receives the response in a manner similar to step 2. As in step 2, the DAFS architecture does not mandate a mechanism for detecting the arrival of responses. It is therefore possible for a single client thread to asynchronously deliver requests to a server and collect responses at a later time.

3.2.3.2. Bulk Data Transfers

A small number of operations might require the transfer of large and variable-sized user data frames. Typical RPC-based distributed file systems encode and transmit bulk data in the same fashion as bounded-size requests. The sequence of operations is the same as described earlier in 3.2.3.1., "Small Bounded-Size Transfers" and the data packets usually have a fixed size header followed immediately (inline to the header) by bulk data.

The problem with this encoding of messages is that the bulk data never lands at the desired memory location on the destination. This implies that a data copy needs to be performed to place the bulk data at the intended destination. Using RDMA operations, it is possible, with modifications to the packet encoding, to place the data directly where it is desired. The packet encoding changes so that the bulk data does not follow the header, but rather the header contains a memory reference to where the bulk data can be found. All RDMA operations are initiated by the DAFS server.

3.2.3.2.1.
Client to Server Bulk Data Transfer

In cases where the bulk data flow is from the client to the server (as in DAFS write), a single DAFS operation maps to the following messages:

1) The client sends to the server a message that contains the DAFS header plus a list of triples that describe where the bulk data resides. Sending this request does not differ from the first step in the bounded-size request processing, in that it is also a single send descriptor buffer with the data segment pointing to the header. However, in this message the header contains memory address information for the bulk data buffer.

2) The server receives the request as in the bounded-size case. The request, however, is not complete, because the bulk data is missing. The next steps handle the bulk data transfer.

3) The server decodes the request and posts an RDMA read data transfer operation (DTO) using the addressing information contained in the request. Note that the client is not involved, because the RDMA operation does not contain any immediate data, does not require the use of any of the client's receive buffers, and is not matched against any receive DTO submitted by the client. Also note that now that the server has the file information, it is possible for the server to place the contents of the bulk transfer directly into the buffer that the native file system requires. (However, this is an implementation issue that is not mandated by the DAFS architecture.) Finally, note that depending on the server implementation and the optional channels negotiated for the Session, the RDMA operation MAY be posted on either the RDMA-read Channel or the Operation Channel.

4) The server sends the response (bounded in size) as described in the bounded-size response case.

The client receives the response in a manner similar to the bounded-size case.

3.2.3.2.2.
Server to Client Bulk Data Transfer

In cases where the bulk data flow is from the server to the client (as in a DAFS read operation), a single DAFS operation maps to the following transport-level messages:

1) The client sends the DAFS header, containing the memory information for the bulk data buffer, the same as for step 1 for client to server bulk data transfers. The address encoded in the request refers to the client location where data to be transferred from the server to the client will be placed.

2) The server receives the DAFS request as in step 2 of client to server bulk data transfers.

3) The server posts an RDMA write data transfer operation (DTO) to move the bulk data directly to the address advertised in the request header. (As stated in 2.3.3., "DAT Requirements" the server's RDMA write DTO is not matched against any receive DTO pre-submitted by the client.)

4) The server sends the response just as in step 3 for small bounded-size transfers.

5) The client receives the response in a manner similar to step 4 of the small bounded-size case. However, the client is aware that the response does not include the bulk data and that it can be found at the location that the client specified in the request.

The DAFS file access protocol defines bulk transfer operations that parallel the traditional RPC model. The bulk transfer operations take advantage of DAT RDMA capabilities, while the small bounded-size operations use the traditional send/receive model. DAFS defines standard APPEND_INLINE, READ_INLINE, WRITE_INLINE, and READDIR_INLINE operations as well as APPEND_DIRECT, READ_DIRECT, WRITE_DIRECT, and READDIR_DIRECT operations. DAFS defines two versions, inline and direct, for any operation where the data size might be large due to variable length fields.
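The choice between the inline and direct variants can be illustrated with a minimal sketch. It assumes a simple size-based client policy, which DAFS does not mandate; the names MessageLimits, header_size, choose_read_op, and choose_write_op are hypothetical and do not appear in the protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageLimits:
    """Hypothetical holder for the message sizes agreed for a Session."""
    max_request: int   # largest request message the server will accept, in bytes
    max_response: int  # largest response message the client will accept, in bytes
    header_size: int   # assumed fixed per-message header overhead, in bytes


def choose_read_op(limits: MessageLimits, nbytes: int) -> str:
    """Use the inline read only if header plus data fit in one response."""
    if limits.header_size + nbytes <= limits.max_response:
        return "READ_INLINE"
    # Otherwise the server moves the data with an RDMA write.
    return "READ_DIRECT"


def choose_write_op(limits: MessageLimits, nbytes: int) -> str:
    """Use the inline write only if header plus data fit in one request."""
    if limits.header_size + nbytes <= limits.max_request:
        return "WRITE_INLINE"
    # Otherwise the server fetches the data with an RDMA read.
    return "WRITE_DIRECT"
```

Under this assumed policy, a client with 1024-byte message limits and a 64-byte header would read up to 960 bytes inline and use the direct variant for anything larger. An implementation is free to prefer the direct variant for smaller transfers as well.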
Rationale: To promote pipelining of DAFS messages through send and receive data transfer operations (DTO), the client and server need to agree to pre-submit multiple receive DTOs. However, DAFS messages are variable length, and the order of requests is unpredictable. Therefore, maintaining variable-length send DTOs would require additional messages (or additional synchronized communication channels) in order for the receiver to allocate and pre-submit a receive DTO for an appropriately sized DAT buffer. A simple solution to this problem is to preallocate the maximally-sized DAT buffers capable of receiving the largest DAFS message. This wastes space, because most DAFS messages are smaller than 1 KB, and only a few messages with variable-sized fields (for example, file attributes and long file names) might become much larger than 1 KB.

Thus, DAFS essentially implements two buffer sizes: small and large. Small buffers are used for normal send-receive traffic. Anything that will not fit in a small buffer has its bulk data portion transferred directly between client and server buffers using RDMA.

The benefits of this approach are primarily:

o the performance benefits of simplifying the negotiation of buffer sizes through the use of uniformly sized DAT buffers for send/receive data transfer operations (DTO)

o the ease of implementation that comes from managing fixed-size buffers.

However, during Session establishment, DAFS leaves open the option for the client to negotiate the buffer size. There might be an implementation or application where operation latency is very critical and memory space is very cheap. In this case, the additional memory costs of large buffers can be traded off against the reduced latency of the DAFS Send-Receive model of communication versus the DAFS RDMA model.

3.2.4.
Separate Communications Channel for RDMA Read Operations

DAFS defines, optionally, a separate communication channel, commonly mapped onto a separate DAT connection, that can be used specifically for RDMA read operations issued by the server. This option might provide a significant reduction in the latency of DAFS_PROC_WRITE_DIRECT operations.

Rationale: The DAFS server should rarely need to block waiting for the send data transfer operation (DTO) of a DAFS response message to complete. One implementation would be for the server to specify that all completion notifications for send DTOs on all DAFS Sessions be associated with a single Completion Queue (CQ). The server could then periodically poll this CQ to harvest send completions in a group, rather than taking an interrupt on each individual send completion.

On the other hand, to provide good response time for a DAFS_PROC_WRITE_DIRECT request, the server should be able to receive immediate notification (that is, a hardware interrupt) when the RDMA read completes. A server implementation could use a single CQ that ties together the following:

o the completion notifications for receive DTOs of the DAT connections for the DAFS Operation Channels

o the completion notifications for receive DTOs of the DAT connections for the back-control directives channels

o the completion notifications for the receive DTOs of the DAT connections for the RDMA-read Channels.

The server can then have a single worker that blocks on this CQ when idle. With this setup, under moderate load the server receives timely interrupts for RDMA read completions, but does not have to take interrupts for the (potentially numerous) ordinary send completions that indicate that DAFS response resources can be reclaimed.
Note: Even with this scheme, per data transfer operation (DTO) interrupt control might be highly desirable for the following reasons:

o DAT RDMA read does not support remote gather. Thus, if the DAFS_PROC_WRITE_DIRECT specifies N+1 virtually noncontiguous client buffers, the server will need to post N+1 separate RDMA read operations. The server only cares when the final RDMA read completes. Unfortunately, without per data transfer operation (DTO) interrupt control, the server can be interrupted when each of the N+1 RDMA read operations completes, rather than just when the desired final RDMA read operation completes.

o Adding a third communication channel per DAFS Session is potentially expensive, because communication channels are a limited resource.

3.2.4.1. Error Detection on the RDMA-read Channel

A problem could arise when the RDMA-read Channel is employed, because it is difficult to detect and recover from errors on that channel. The reason is that the client needs to create it, but only the server knows whether the channel is functioning correctly.

The client creates the communication channel by establishing a DAT connection, creating RMR Contexts for RDMA read, associating RMR Contexts with the DAT connection, connecting it to the server, and binding it to the DAFS Session. Following the initial connection and authentication message exchanges, the client does not post descriptor buffers to this communication channel, since it will be used solely by the client's transport to satisfy the server's RDMA read operations.

On the other hand, once the initial connection and authentication sequence is complete, the server will not submit any receive data transfer operations (DTO) to the DAT connection for this communication channel, because the client will not send on the channel.
In fact, that is the motivation: traffic on the channel is never interleaved, so that completions can be efficiently handled. The only sends to the channel are for server RDMA read operations, and these occur for non-inline client transfers only. It is only when the server has an opportunity to perform such a transfer that the status of the channel will be checked.

The client software will never see an error from the channel. When a failure condition occurs on the RDMA-read Channel, the server can disconnect the client's Session connection, forcing the client to recover as if a network partition or server restart had been encountered.

3.2.5. Checksums

The DAFS protocol defines the OPTIONAL use of checksums on all messages exchanged on a Session. During Session creation, the client can specify this option. If specified, two separate checksums will be computed:

message_checksum
A checksum for the message, including headers and any inline data

direct_checksum
If there is an RDMA transfer associated with the operation, a checksum for the bulk data transferred via direct RDMA.

The message_checksum is transmitted in the message header for both request and response messages. It is computed by the message sender, and inserted into the message header. If the request includes an RDMA operation, the direct_checksum is computed for the RDMA data buffer and inserted into the operation header of the request or response that accompanies the RDMA operation. The message receiver verifies the checksums on receipt of the message. If the server detects a checksum failure, a checksum error status is returned. If the client detects a checksum failure, it SHOULD take appropriate action.

The use of checksums is negotiated during Session creation.
The client requests the use of checksums in the connection request message, and the server replies with an acknowledgement that checksums will be used in the connection response message. Neither the client's connection request nor the server's connection reply message includes a checksum. However, all subsequent messages for the Session, including messages transferred on optional channels, will include a checksum.

The checksum is computed on the message with the DAFS message headers in the endian byte ordering specified for the Session (i.e., DAFS network byte order). As an input to the checksum algorithm, the value of the checksum field itself is "zero". The checksum value computed is based on the ones-complement Fletcher-32 checksum [Fletcher], [Sklower] using a checksum computation modulus of 65535.

o S1 and S2 are 16-bit quantities. The checksum is computed on the data 2 bytes at a time, treating pairs of contiguous data bytes as a single 16-bit data word. The resulting values of S1 and S2 are placed in the 32-bit DAFS checksum field, each as a 16-bit quantity.

o S1 is given the initial value 0x0101. Starting from that initial value, S1 is computed to be the ones-complement sum of the data taken 2 bytes at a time (as described in the preceding paragraph) with a modulo function applied by subtracting 65535 whenever the value of S1 becomes larger than 65535. If the length of the data is an odd number of bytes, then S1 is computed as if an additional byte containing the value "zero" had been appended to the end of the data. The zero pad byte is not included in the transmission of the data.

o S2 is given the initial value of 0x0000. Starting from that initial value, S2 is computed to be the 16-bit sum of the data multiplied by the position of the data from the end of the packet. No multiplication is actually necessary in the algorithm.
The multiplication effect results from the way the sum is accumulated. S2 accumulates values of S1 after S1 is updated; so a given 16-bit data word appears multiple times in S2. The number of times each 16-bit data word appears in S2 depends on its position from the end of the packet. The same 65535 modulo function is applied to S2 as it is computed.

Rationale: This optional approach to checksumming provides a "last check" on the hardware and software implementations involved in providing DAFS, without imposing a performance penalty. Better fast than slow, but more importantly, better safe than sorry. The initial value for S1 is chosen as a non-zero checksum seed in order to detect an invalid all-zero block, while providing byte order independence to keep the checksum algorithm simple.

3.2.6. Message Flow Control

3.2.6.1. Requesters and Responders

The DAFS protocol defines the role of client and server. The client is the party that initiates the Session and submits file-level requests (DAFS operations) to the server. The server waits for connection requests from clients and processes the file-level requests.

All DAFS communication is of a 'request-response' nature. For the DAFS operations that form the bulk of the communication, the client is the requester submitting the DAFS operation, and the server is the responder, processing the DAFS operation and sending the response.

But the DAFS protocol also defines a set of "back-control" directives (for example, delegation revocation, asynchronous notification of operation completion) that the server can send to the client. For these directives, the server takes the role of the requester and the client is the responder.

3.2.6.2. Flow Control Requirements

DAT includes no provision for flow control.
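As an illustration of the checksum computation described in 3.2.5., the following is a minimal sketch of the S1 and S2 accumulation. It assumes that each pair of bytes is combined into a 16-bit word in big-endian order by default, and it returns S1 and S2 separately because the placement of the two values within the 32-bit checksum field is not restated here. The function name dafs_checksum_sums and the byteorder parameter are hypothetical.

```python
MODULUS = 65535  # checksum computation modulus given in 3.2.5.


def dafs_checksum_sums(data: bytes, byteorder: str = "big") -> tuple:
    """Return (S1, S2) for data, following the prose description in 3.2.5.

    Assumption: byte pairs form 16-bit words in `byteorder` (big-endian by
    default). The caller is expected to have set the checksum field of the
    message to zero before calling.
    """
    s1 = 0x0101  # non-zero seed
    s2 = 0x0000
    if len(data) % 2:
        # Odd length: compute as if one zero byte were appended.
        # The pad byte is never transmitted.
        data = bytes(data) + b"\x00"
    for i in range(0, len(data), 2):
        word = int.from_bytes(data[i:i + 2], byteorder)
        s1 += word
        if s1 > MODULUS:   # subtract 65535 whenever the sum exceeds 65535
            s1 -= MODULUS
        s2 += s1           # S2 accumulates S1 after S1 is updated
        if s2 > MODULUS:
            s2 -= MODULUS
    return s1, s2
```

For the four bytes 00 01 00 02, the words are 1 and 2, so S1 runs 0x0101 = 257, 258, 260 and S2 runs 0, 258, 518. An all-zero block of four bytes yields S1 = 257 and S2 = 514, which shows how the non-zero seed keeps an all-zero block from producing an all-zero checksum.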
Under DAT (see 2.3.3., "DAT Requirements"), the communicating DAFS parties "A" and "B" MUST guarantee, through some mechanism external to DAT itself, that:

o "A" will not attempt to send data via a send data transfer operation (DTO) when "B" does not have a receive DTO pre-submitted and waiting to receive the data.

o "A" will not attempt to send more data than the size of the buffer of the receive data transfer operation (DTO) submitted on "B".

The flow control mechanism does not require out-of-band communication of "send credits," nor are buffers disassociated from application-level operations. A requester knows how many requests it is allowed to have outstanding at any given time, and it never has to wait for send credits separately from waiting for application-level operations to complete. Finally, the flow control mechanism enables the server to provide congestion control at the server through back pressure on requesters to reduce the rate of incoming requests.

3.2.6.3. Overview-Use of Communication Channel Facilities

As explained earlier in 3.1.2.3., "Multiple Communication Channels", a DAFS Session consists of one or more communication channels mapped onto one or more DAT connections. The following list describes the types of communication channels:

o A REQUIRED DAFS Operation Channel over which the client sends DAFS operation requests and the server sends responses.

o An optional Back-control Channel over which the server sends directive requests and the client sends responses. The client can decline the use of all DAFS features that necessitate use of this channel, in which case the client need not create and manage this channel.

o An optional RDMA-read Channel over which the server issues RDMA read operations. This communication channel is an adjunct used to provide a separate channel for issuing RDMA read operations and is not involved in message flow control.
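The two guarantees listed in 3.2.6.2. can be pictured with a toy model of one party's receive side. This is only a sketch: the class and method names are hypothetical, and raising an exception stands in for whatever a real DAT provider does when a send arrives without a suitable receive DTO, which is not described here.

```python
import collections


class FlowControlError(Exception):
    """Raised in this toy model when a send violates a flow control guarantee."""


class ReceiveQueue:
    """Toy model of the receive DTOs that party "B" has pre-submitted."""

    def __init__(self):
        self._posted = collections.deque()  # buffer sizes, in posting order

    def post_receive(self, size: int) -> None:
        """Pre-submit one receive DTO with a buffer of `size` bytes."""
        self._posted.append(size)

    def deliver(self, message: bytes) -> int:
        """Model a send from "A" arriving; returns the consumed buffer size."""
        if not self._posted:
            # First guarantee: a receive DTO must already be waiting.
            raise FlowControlError("no receive DTO pre-submitted")
        size = self._posted.popleft()
        if len(message) > size:
            # Second guarantee: the message must fit the receive buffer.
            raise FlowControlError("message larger than receive buffer")
        return size
```

Each arriving message consumes exactly one posted receive, so a sender that tracks how many receives its peer has posted, and how large they are, never triggers either error. The flow control protocol in the sections that follow gives the requester that knowledge without separate credit messages.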
As a part of channel creation, DAFS establishes a flow control protocol for the channel. However, at least one successful message exchange is necessary in order to establish the flow control protocol. The first message exchange between a DAFS client and server is governed by these two rules:

1) The DAFS server listens for an incoming connection on the DAT transport by posting a buffer where a connection request is to be stored. The client's initial connection request operation MAY be:

o DAFS_PROC_CLIENT_CONNECT,

o DAFS_PROC_CONNECT_AND_AUTH,

o or, for an optional channel, DAFS_PROC_CLIENT_BIND.

The server MUST post a buffer at least 4 KB in size to receive this request. This provides space for authentication data for the channel that MAY be included with the connection request.

2) The DAFS client MUST be prepared to receive a reply to the initial connection request. Synchronization requirements regarding the posting of a buffer to receive the DAFS connection reply are transport dependent. However, the client MUST post a buffer of at least 4 KB in size to receive this reply. This provides space for authentication data for the channel that MAY be included with the connection reply.

This initial exchange of a DAFS connection request and response message contains the flow control negotiation parameters that will govern the subsequent packet exchange on the channel. The following flow control values are negotiated:

OPNreq
Number of DAFS operations (or back-control directives) that the requester can submit simultaneously on the channel. Depending on which channel, the requester might be the client or the server. For the DAFS Operation Channel, OPNreq is a limit on the client; for the Back-control Channel, it is a limit on the server. The value can be dynamically renegotiated throughout the lifetime of the DAFS Session. For more information, see 3.2.6.4.2., "Maximum Number of Simultaneous Outstanding Requests". OPNreq MUST be >= 1 at all times.
OPSZreq
The maximum size of a single DAFS operation request or back-control directive request.

OPSZresp
The maximum size of a single DAFS operation response or back-control response.

The discussion that follows refers to "requester" and "responder." For the DAFS Operation Channel, the requester is the client issuing DAFS operations and the responder is the server. For the back-control directive channel, the requester is the server and the responder is the client. Nreq, SZreq, and SZresp refer to the values (negotiated or static) appropriate for the direction under consideration.

On a given channel, the requester and responder use the following protocol to satisfy the flow control requirements described earlier in 3.2.6.2., "Flow Control Requirements":

o The requester submits requests using send descriptors no larger than SZreq.

o The responder responds using send descriptors no larger than SZresp. If the response to a submitted request will not fit in a buffer of size SZresp, an error indication is returned instead.

o When no requests are outstanding, the responder guarantees that at least Nreq receive buffers of size >= SZreq are posted.

o While processing some number of requests M <= Nreq on a given channel, the responder need only be prepared to receive (Nreq - M) more requests. So upon receiving a request, the responder need not immediately post another receive buffer. But before sending a response, the responder MUST post another receive buffer (unless Nreq is being reduced). For more information, see 3.2.6.4.2., "Maximum Number of Simultaneous Outstanding Requests". This demonstrates the requester's recognition that, upon receipt of the response, only (M - 1) requests are simultaneously outstanding.

o The requester guarantees that it will never have more than Nreq requests outstanding.
When exactly Nreq requests are outstanding, the requester MUST delay submitting the next request until it receives a response to a previously submitted request. When the response is received, the requester might or might not be able to immediately submit the next request. For more information, see 3.2.6.4.2., "Maximum Number of Simultaneous Outstanding Requests".

3.2.6.4. Flow Control Specifics

Flow control between a requester and responder requires that the two parties agree on two types of information:

o Maximum size of request and response. This governs the size of the receive buffers posted by the requester and responder.

o Maximum number of requests that are allowed to be simultaneously outstanding. This governs the number of receive descriptor buffers that the responder maintains on its transport-level receive work queue.

3.2.6.4.1. Maximum Request/Response Sizes

The DAFS protocol allows the maximum size values OPSZreq and OPSZresp to be negotiated on a per-Session basis when the Session is created. After the DAFS Session has been established, these values remain in effect for the lifetime of the Session. The protocol allows the values to be negotiated on a per-Session basis to permit the client and server control over the following:

o maximum amount of data in the WRITE_INLINE request (impacts OPSZreq of the DAFS Operation Channel)

o maximum amount of data that can be returned in a READ_INLINE response (impacts OPSZresp of the DAFS Operation Channel)

o maximum number of entries in a NOTIFY_BATCH_COMPLETE back-control directive message (impacts OPSZreq of the back-control directive channel).

3.2.6.4.2. Maximum Number of Simultaneous Outstanding Requests

Unlike the size values, the DAFS protocol considers the maximum number of simultaneous outstanding requests for each channel, OPNreq, to be dynamic.
The protocol provides the following capabilities:

o Requester can ask for an increase or decrease in Nreq in any request packet. This allows a client to request additional server-side resources (for example, additional outstanding receive data transfer operations) during periods of heavy DAFS activity. This also allows a client to "be a good citizen" and yield resources during times of reduced activity.

o Responder can increase or decrease Nreq in any response packet. This allows the server to reduce the amount of server-side resources (for example, reduce outstanding receive data transfer operations) it dedicates to a single client Session. This might be necessary to accept additional incoming connections on a given NIC, or to throttle back the rate of incoming DAFS operations from a single overly active client. It also allows the server to restore resources to an active client when they become available.

Note: This value MUST always be >= 1; if Nreq were 0, there would be no mechanism for the client to request or the server to grant an increased value.

The mechanism by which the DAFS protocol supports dynamic negotiation of Nreq is described as follows:

At any given time, the responder has dedicated resources for two classes of requests on a given Session:

o a set H of requests currently being handled (request received, response not yet sent)

o a set P of requests, not yet received, for which receive data transfer operations (DTO) are currently submitted.

The union of these two sets constitutes the complete set of requests that the responder is currently equipped to handle. Thus:

Nreq = num_elements(set H) + num_elements(set P)

The header of every request contains a value Desired_Nreq. This lets the requester request an increase or decrease in the value of Nreq.
But the responder is under no obligation to honor the requester's desire regarding Nreq; the responder is the sole owner of the value of Nreq. The requester cannot assume any action by the responder until notified via the Target_Nreq field in a received response (see the following paragraph). Furthermore, the requester MUST be prepared for a change in the Nreq value when processing any response, whether a change was asked for or not. There is no reason for the requester to distinguish between solicited and unsolicited changes in Nreq.

The responder can change the value of Nreq at any time, either in response to a change request or on its own, and notifies the requester of the current value of Nreq using the Target_Nreq field in the header of every response. But note that the Target_Nreq value communicated in every response is the Nreq value that the responder is aiming for, and is NOT necessarily the same as the Nreq value in effect at the time of the response.

Each request that arrives at the responder consumes one outstanding receive data transfer operation (DTO); this reduces num_elements(set P) by 1 and increases num_elements(set H) by 1. The responder does not need to submit another receive DTO at this point, but instead can delay submitting another receive DTO until it is ready to send the response.

Prior to sending any response, the responder can take one of three courses of action, depending on whether it wants to maintain, increase, or reduce the value of Nreq:

o If the responder wants to maintain the value of Nreq at its current value (Target_Nreq = Nreq), it submits one receive data transfer operation (DTO) (possibly reusing the same memory buffer of the receive DTO used for the request just completed).

o If the responder wants to increase the value of Nreq by M (Target_Nreq = Nreq + M), it submits M + 1 receive data transfer operations (DTOs).
o If the responder wants to decrease the value of Nreq by M (Target_Nreq = Nreq - M), it simply declines to submit any receive data transfer operation (DTO) prior to sending the response. When the response is sent, num_elements(set H) is reduced by 1 without a corresponding increase in num_elements(set P), thus reducing Nreq by 1.

When the requester receives a response, it recalculates the value of Nreq using the following formula:

Nreq = MAX(previous_Nreq - 1, Target_Nreq)

where Target_Nreq is the value contained in the response header. Thus:

o Increases in the value of Nreq take place immediately; this is possible because the responder can submit receive data transfer operations (DTOs) at any time.

o Decreases in the value of Nreq take place gradually; the value can be reduced by only 1 on each processed request. This is necessary because the DAT does not provide a way to cancel outstanding and in-progress data transfer operations (DTO). The only ways to retrieve an outstanding DTO are for it to be consumed by an incoming request, or for the DAT connection to be terminated.

The response contains the responder's target value for Nreq rather than the current value. This allows the responder to explicitly notify the requester that the responder needs to reclaim resources associated with (current Nreq - Target_Nreq) additional requests. In this situation, the requester SHOULD take necessary action to allow the responder to reclaim those resources in a timely manner.

Thus, although Nreq can be decreased by at most one on each response, Target_Nreq can be reduced by M. This provides the server a way to tell the client that it intends to reduce Nreq by one after each of the next M requests, and acts as a hint to the client that, if the client does not have M requests queued to issue, the client SHOULD issue M null requests to allow the server to reduce Nreq.
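The requester-side recalculation can be traced with a short sketch. The function names are hypothetical, and the range check simply restates the rule that Nreq MUST always be >= 1.

```python
def recalc_nreq(previous_nreq: int, target_nreq: int) -> int:
    """Apply Nreq = MAX(previous_Nreq - 1, Target_Nreq) for one response."""
    if previous_nreq < 1 or target_nreq < 1:
        raise ValueError("Nreq and Target_Nreq must both be >= 1")
    return max(previous_nreq - 1, target_nreq)


def responses_until_target(nreq: int, target_nreq: int) -> int:
    """Count the responses needed before Nreq settles at Target_Nreq."""
    steps = 0
    while nreq > target_nreq:
        nreq = recalc_nreq(nreq, target_nreq)
        steps += 1
    return steps
```

With Nreq at 4, a response carrying Target_Nreq = 6 raises Nreq to 6 at once, while a response carrying Target_Nreq = 1 lowers it only to 3; two further responses are needed to reach 1. In that second case, a client with nothing queued would issue NULL requests so that the server can complete the reduction.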
Failure to do so can result in connection failure.

Note: Consider the case in which a server needs to reclaim resources associated with a particular DAFS Session. The server notifies the requester of a reduced Target_OPNreq value. The client receives and recognizes the change, but does not have another request to issue. In this case the server might be forced to terminate the Session in order to reclaim (all) resources that had been dedicated to that Session. To avoid this situation, a client that is notified of a reduced Target_OPNreq SHOULD issue NULL DAFS operation requests to consume receive descriptors on the server, allowing the server to achieve its resource reclamation.

4. File System Operations

DAFS file system operations can be divided into different areas:

o Concepts and structures

  Key objects being managed include file names, filehandles, and access credentials. DAFS shares a heritage with NFS, but differs in some important ways.

o Data transfer

  Key DAFS operations focus on the efficient transfer of data between client and server. Typically these operations take advantage of RDMA capability.

o Request chaining

  DAFS chaining is similar in concept to NFS version 4 compound operations, but is tailored to the DAFS environment.

o Locking and access control

  DAFS provides operations to support sharing files with a robust failure recovery framework.

o Standard file system support

  A number of DAFS operations are intended to be functionally equivalent to NFS Version 4 operations.

4.1. Concepts and Structures

4.1.1. DAFS and NFS Version 4

DAFS concepts and procedures are described in the sections that follow. Some of these are based on the NFS Version 4 protocol. The discussion of these DAFS procedures includes quoted remarks from the NFS Version 4 specification.

4.1.2. Typographical Conventions

Some DAFS procedure descriptions contain references to the NFS Version 4 protocol as described in the Internet Society's RFC 3010 document. These references appear inside quotation marks. At the end of each quotation appears a reference to the RFC document. The abbreviated references look like this: (RFC 3010, pp. xxx-yyy), where xxx-yyy refers to the pages where the quoted text is found.

Whenever DAFS differs in terminology from the quoted NFS Version 4 text, the DAFS equivalent term appears inside square brackets []. Such simple substitution includes but is not limited to procedure names and error codes.

4.1.3. Recurring Differences Between DAFS and NFS Version 4

4.1.3.1. Filehandles in Compound vs. Chaining

Most file system actions operate on a file object. An NFS Version 4 or a DAFS procedure requires a filehandle that specifies the file object to act upon. In the NFS Version 4 protocol, the filehandle is obtained from the COMPOUND operation's current filehandle. For DAFS operations, the filehandle is obtained from the arguments in the procedure supplied by the client, unless the operation is chained and the DAFS_CHF_FH flag is set in the DAFS header. In the chained case, the filehandle is the one saved by the previous operation in the chain.

When a DAFS operation completes successfully, the filehandle used by the operation becomes available for use by the operation that follows in the chain. This is not the case, however, if the operation generates a new filehandle, as DAFS_PROC_LOOKUP does. In this case, the newly generated filehandle becomes available for use by the operation that follows in the chain.

4.1.3.2. Credentials

NFS Version 4 requests are enclosed in an RPC request. The RPC header for every operation contains a set of credentials that identifies the user requesting file service. The DAFS protocol has a procedure to register user credentials.
This procedure returns a credentials handle. Subsequent DAFS requests need only include the credentials handle obtained via the credentials registration.

4.1.3.3. Attribute Bitmaps

In DAFS, the attribute data structure used in procedures such as DAFS_PROC_GETATTR has two bitmaps. The included bitmap determines the attribute fields that are present in the attributes packet, that is, the fields for which memory is allocated. The valid bitmap represents the attributes with actual values that the server was able to return. The valid bitmap is a subset of the included bitmap. Therefore, it is possible to have attribute fields present as indicated in the included map, but with no valid values.

In contrast, the NFS Version 4 attributes contain one bitmap only. The remaining attribute structure consists of fields with valid attribute values.

4.1.4. Objects Naming And Filehandles

The DAFS name space is structured similarly to the NFS Version 4 name space. This section includes some text quoted from the NFS Version 4 specification that describes the name space. It also includes text from the same NFS Version 4 specification that describes the properties of filehandles.

"7. NFS Server Name Space

7.1 Server Exports

On a UNIX server the name space describes all the files reachable by pathnames under the root directory or "/". On a Windows NT server the name space constitutes all the files on disks named by mapped disk letters. NFS server administrators rarely make the entire server's file system name space available to NFS clients. More often portions of the name space are made available via an 'export' feature." (RFC 3010, p. 47)

Text is omitted regarding use of the mount protocol in previous versions of the NFS protocol.

"7.2 Browsing Exports

The NFS version 4 protocol provides a root filehandle that clients can use to obtain filehandles for these exports via a multi-component LOOKUP.
A common user experience is to use a graphical user interface (perhaps a file 'Open' dialog window) to find a file via progressive browsing through a directory tree. The client must be able to move from one export to another export via single-component, progressive LOOKUP operations." (RFC 3010, p. 48)

In DAFS, the root filehandle is obtained via a DAFS_PROC_GETROOTHANDLE procedure call. Text about previous versions of the NFS protocol and the use of MOUNT capabilities has not been quoted here.

"7.3 Server Pseudo-Filesystem

NFS version 4 servers avoid this name space inconsistency by presenting all the exports within the framework of a single server name space. An NFS version 4 client uses LOOKUP and READDIR operations to browse seamlessly from one export to another. Portions of the server name space that are not exported are bridged via a 'pseudo file system' that provides a view of exported directories only. A pseudo file system has a unique fsid and behaves like a normal, read only file system.

Based on the construction of the server's name space, it is possible that multiple pseudo file systems may exist. For example,

      /a         pseudo file system
      /a/b       real file system
      /a/b/c     pseudo file system
      /a/b/c/d   real file system

Each of the pseudo file systems are consider[ed] separate entities and therefore will have a unique fsid." (RFC 3010, p. 48)

DAFS file systems do not have an fsid as described for the NFS Version 4 case in the quoted text above. Instead, DAFS file systems have a unique FSHandle. This FSHandle is obtained via a DAFS_PROC_GETFSATTR procedure call. The FSHandle is also a visible part of the DAFS filehandle that a DAFS client can consult to determine when a new file system has been reached during a pathname traversal.

"7.4 Multiple Roots

The DOS and Windows operating environments are sometimes described as having 'multiple roots'.
File systems are commonly represented as disk letters. MacOS represents file systems as top level names. NFS version 4 servers for these platforms can construct a pseudo file system above these root names so that disk letters or volume names are simply directory names in the pseudo root tree.

7.5 Filehandle Volatility

The nature of the server's pseudo file system is that it is a logical representation of file system(s) available from the server. Therefore, the pseudo file system is most likely constructed dynamically when the server is first instantiated. It is expected that the pseudo file system may not have an on disk counterpart from which persistent filehandles could be constructed. Even though it is preferable that the server provide persistent filehandles for the pseudo file system, the NFS client should expect that pseudo file system filehandles are volatile. This can be confirmed by checking the associated 'fh_expire_type' attribute for those filehandles in question. If the filehandles are volatile, the NFS client must be prepared to recover a filehandle value (e.g. with a multi-component LOOKUP) when receiving an error of NFS4ERR_FHEXPIRED.

7.6 Exported Root

If the server's root file system is exported, one might conclude that a pseudo-file system is not needed. This would be wrong. Assume the following file systems on a server:

      /       disk1  (exported)
      /a      disk2  (not exported)
      /a/b    disk3  (exported)

Because disk2 is not exported, disk3 cannot be reached with simple LOOKUPs. The server must bridge the gap with a pseudo-file system.

7.7 Mount Point Crossing

The server file system environment may be constructed in such a way that one file system contains a directory which is 'covered' or mounted upon by a second file system.
For example:

      /a/b      (file system 1)
      /a/b/c/d  (file system 2)

The pseudo file system for this server may be constructed to look like:

      /         (place holder/not exported)
      /a/b      (file system 1)
      /a/b/c/d  (file system 2)

It is the server's responsibility to present the pseudo file system that is complete to the client. If the client sends a lookup request for the path '/a/b/c/d', the server's response is the filehandle of the file system '/a/b/c/d'. In previous versions of the NFS protocol, the server would respond with the directory '/a/b/c/d' within the file system '/a/b'.

The NFS client will be able to determine if it crosses a server mount point by a change in the value of the 'fsid' attribute.

7.8 Security Policy and Name Space Representation

The application of the server's security policy needs to be carefully considered by the implementor. One may choose to limit the viewability of portions of the pseudo file system based on the server's perception of the client's ability to authenticate itself properly. However, with the support of multiple security mechanisms and the ability to negotiate the appropriate use of these mechanisms, the server is unable to properly determine if a client will be able to authenticate itself. If, based on its policies, the server chooses to limit the contents of the pseudo file system, the server may effectively hide file systems from a client that may otherwise have legitimate access." (RFC 3010, pp. 49-50)

"4. Filehandles

The filehandle in the NFS protocol is a per server unique identifier for a file system object. The contents of the filehandle are opaque to the client. Therefore, the server is responsible for translating the filehandle to an internal representation of the file system object.
Since the filehandle is the client's reference to an object and the client may cache this reference, the server SHOULD not reuse a filehandle for another file system object. If the server needs to reuse a filehandle value, the time elapsed before reuse SHOULD be large enough such that it is unlikely the client has a cached copy of the reused filehandle value. Note that a client may cache a filehandle for a very long time. For example, a client may cache NFS data to local storage as a method to expand its effective cache size and as a means to survive client restarts. Therefore, the lifetime of a cached filehandle may be extended." (RFC 3010, p. 23)

DAFS filehandles are mostly opaque to the client. They contain a client-visible FSHandle field as well as an opaque fileid field.

"4.1 Obtaining The First Filehandle

The operations of the NFS protocol are defined in terms of one or more filehandles. Therefore, the client needs a filehandle to initiate communication with the server." (RFC 3010, p. 24)

References to the use of the mount protocol in previous versions of the NFS protocol have been removed. The DAFS protocol defines a special filehandle, called the Root Filehandle, that is used to initiate this communication.

"4.1.1 Root Filehandle

The first of the special filehandles is the ROOT filehandle. The ROOT filehandle is the 'conceptual' root of the file system name space at the NFS server." (RFC 3010, p. 24)

The client obtains the ROOT filehandle with the DAFS_PROC_GETROOTHANDLE operation. The DAFS client uses this root filehandle to traverse the file name space provided by the server. See "7. NFS Server Name Space" from the NFS Version 4 specification as quoted above, and the DAFS notes on name space issues also found in this section, for a description of the name space presented by a DAFS server.
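
The traversal described above can be illustrated with a minimal sketch in Python. It is not part of the protocol definition. The FileHandle class, the walk function, and the toy in-memory server are hypothetical names introduced only for illustration; the lookup callable stands in for issuing a single-component DAFS_PROC_LOOKUP, whose real argument and result encoding is not shown here. The sketch assumes only what the text states: a filehandle carries a client-visible FSHandle and an opaque fileid, and the client may test FSHandle values for equality to notice that a traversal has entered a different file system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FileHandle:
    fs_handle: bytes  # client-visible; the client may only test it for equality
    file_id: bytes    # opaque to the client


# Hypothetical stand-in for issuing DAFS_PROC_LOOKUP on one path component.
Lookup = Callable[[FileHandle, str], FileHandle]


def walk(root: FileHandle, path: str, lookup: Lookup) -> Tuple[FileHandle, List[str]]:
    """Resolve path one component at a time, starting from the root filehandle.

    Returns the final filehandle and the components at which the FSHandle
    changed, i.e. where the traversal entered a different file system.
    """
    current = root
    crossings: List[str] = []
    for name in [c for c in path.split("/") if c]:
        nxt = lookup(current, name)
        if nxt.fs_handle != current.fs_handle:  # equality test only
            crossings.append(name)
        current = nxt
    return current, crossings


def make_toy_server() -> Tuple[FileHandle, Lookup]:
    """Toy in-memory name space: / and /a on one file system, /a/b on another."""
    root = FileHandle(b"fs0", b"root")
    a = FileHandle(b"fs0", b"a")
    b = FileHandle(b"fs1", b"b")
    c = FileHandle(b"fs1", b"c")
    table: Dict[Tuple[FileHandle, str], FileHandle] = {
        (root, "a"): a, (a, "b"): b, (b, "c"): c,
    }

    def lookup(parent: FileHandle, name: str) -> FileHandle:
        return table[(parent, name)]

    return root, lookup
```

Calling walk(root, "/a/b/c", lookup) against the toy server reports a single crossing at component "b". A real client would obtain the root filehandle from DAFS_PROC_GETROOTHANDLE rather than constructing it locally.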
The NFS Version 4 specification description of the public filehandle is omitted here as this filehandle concept is not part of the DAFS protocol.

"4.2 Filehandle Types

In the NFS version 2 and 3 protocols, there was one type of filehandle with a single set of semantics. The NFS version 4 protocol introduces a new type of filehandle in an attempt to accommodate certain server environments. The first type of filehandle is 'persistent'. The semantics of a persistent filehandle are the same as the filehandles of the NFS version 2 and 3 protocols. The second type of filehandle is the 'volatile' filehandle.

The volatile filehandle type is being introduced to address server functionality or implementation issues which make correct implementation of a persistent filehandle infeasible. Some server environments do not provide a file system level invariant that can be used to construct a persistent filehandle. The underlying server file system may not provide the invariant or the server's file system programming interfaces may not provide access to the needed invariant. Volatile filehandles may ease the implementation of server functionality such as hierarchical storage management or file system reorganization or migration. However, the volatile filehandle increases the implementation burden for the client. However this increased burden is deemed acceptable based on the overall gains achieved by the protocol.

Since the client will need to handle persistent and volatile filehandle differently, a file attribute is defined which may be used by the client to determine the filehandle types being returned by the server." (RFC 3010, p. 25)

Disregard the reference to file system migration in the previous paragraph: the DAFS protocol does not support migration.

"4.2.1 General Properties of a Filehandle

The filehandle contains all the information the server needs to distinguish an individual file.
To the client, the filehandle is opaque. The client stores filehandles for use in a later request and can compare two filehandles from the same server for equality by doing a byte-by-byte comparison. However, the client MUST NOT otherwise interpret the contents of filehandles. If two filehandles from the same server are equal, they MUST refer to the same file. If they are not equal, the client may use information provided by the server, in the form of file attributes, to determine whether they denote the same files or different files. The client would do this as necessary for client side caching.

Servers SHOULD try to maintain a one-to-one correspondence between filehandles and files but this is not required. Clients MUST use filehandle comparisons only to improve performance, not for correct behavior. All clients need to be prepared for situations in which it cannot be determined whether two filehandles denote the same object and in such cases, avoid making invalid assumptions which might cause incorrect behavior." (RFC 3010, pp. 25-26)

DAFS filehandles are mostly opaque to the client. They contain a client-visible FSHandle field as well as an opaque fileid field. The opaque fileid field shares the same properties as the NFS filehandle properties described in the quoted text above. Although the FSHandle is a client-visible field within the DAFS filehandle, the FSHandle itself is opaque to the client. In other words, a DAFS client MUST NOT interpret the contents of the FSHandle field, except for testing it for equality to determine if two file objects reside within the same server's file system.

"As an example, in the case that two different path names when traversed at the server terminate at the same file system object, the server SHOULD return the same filehandle for each path.
This can occur if a hard link is used to create two file names which refer to the same underlying file object and associated data. For example, if paths /a/b/c and /a/d/c refer to the same file, the server SHOULD return the same filehandle for both path names traversals.

4.2.2 Persistent Filehandle

A persistent filehandle is defined as having a fixed value for the lifetime of the file system object to which it refers. Once the server creates the filehandle for a file system object, the server MUST accept the same filehandle for the object for the lifetime of the object. If the server restarts or reboots the NFS server must honor the same filehandle value as it did in the server's previous instantiation." (RFC 3010, p. 26)

Reference to file system migration has been removed.

"The persistent filehandle will be become stale or invalid when the file system object is removed. When the server is presented with a persistent filehandle that refers to a deleted object, it MUST return an error of NFS4ERR_STALE. A filehandle may become stale when the file system containing the object is no longer available. The file system may become unavailable if it exists on removable media and the media is no longer available at the server or the file system in whole has been destroyed or the file system has simply been removed from the server's name space (i.e. unmounted in a Unix environment).

4.2.3 Volatile Filehandle

A volatile filehandle does not share the same longevity characteristics of a persistent filehandle. The server may determine that a volatile filehandle is no longer valid at many different points in time. If the server can definitively determine that a volatile filehandle refers to an object that has been removed, the server should return DAFSERR_STALE to the client (as is the case for persistent filehandles).
In all other cases where the server determines that a volatile filehandle can no longer be used, it should return an error of NFS4ERR_FHEXPIRED.

The mandatory attribute 'fh_expire_type' is used by the client to determine what type of filehandle the server is providing for a particular file system. This attribute is a bitmask with the following values:

   FH4_PERSISTENT
      The value of FH4_PERSISTENT is used to indicate a persistent filehandle, which is valid until the object is removed from the file system. The server will not return NFS4ERR_FHEXPIRED for this filehandle. FH4_PERSISTENT is defined as a value in which none of the bits specified below are set.

   FH4_NOEXPIRE_WITH_OPEN
      The filehandle will not expire while client has the file open. If this bit is set, then the values FH4_VOLATILE_ANY or FH4_VOL_RENAME do not impact expiration while the file is open. Once the file is closed or if the FH4_NOEXPIRE_WITH_OPEN bit is false, the rest of the volatile related bits apply.

   FH4_VOLATILE_ANY
      The filehandle may expire at any time and will expire during system migration and rename.

   FH4_VOL_RENAME
      The filehandle may expire due to a rename. This includes a rename by the requesting client or a rename by another client. May only be set if FH4_VOLATILE_ANY is not set.

Servers which provide volatile filehandles should deny a RENAME or REMOVE that would affect an OPEN file or any of the components leading to the OPEN file. In addition, the server should deny all RENAME or REMOVE requests during the grace or lease period upon server restart.

The reader may be wondering why there are three FH4_VOL* bits and why FH4_VOLATILE_ANY is exclusive of FH4_VOL_MIGRATION and FH4_VOL_RENAME.
If the a filehandle is normally persistent but cannot persist across a file set migration, then the presence of the FH4_VOL_MIGRATION or FH4_VOL_RENAME tells the client that it can treat the file handle as persistent for purposes of maintaining a file name to file handle cache, except for the specific event described by the bit. However, FH4_VOLATILE_ANY tells the client that it should not maintain such a cache for unopened files. A server MUST not present FH4_VOLATILE_ANY with FH4_VOL_RENAME as this will lead to confusion. FH4_VOLATILE_ANY implies that the file handle will expire upon migration or rename, in addition to other events." (RFC 3010, pp. 26-27)

The description for FH4_VOL_MIGRATION has been removed. For readability purposes, references to this flag were kept in the above paragraph. Disregard these references.

"4.2.4 One Method of Constructing a Volatile Filehandle

As mentioned, in some instances a filehandle is stale (no longer valid; perhaps because the file was removed from the server) or it is expired (the underlying file is valid but since the filehandle is volatile, it may have expired). Thus the server needs to be able to return NFS4ERR_STALE in the former case and NFS4ERR_FHEXPIRED in the latter case. This can be done by careful construction of the volatile filehandle. One possible implementation follows.

A volatile filehandle, while opaque to the client could contain:

   [volatile bit = 1 | server boot time | slot | generation number]

   * slot is an index in the server volatile filehandle table

   * generation number is the generation number for the table entry/slot

If the server boot time is less than the current server boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return NFS4ERR_BADHANDLE. If the generation number does not match, return NFS4ERR_FHEXPIRED.

When the server reboots, the table is gone (it is volatile).
If volatile bit is 0, then it is a persistent filehandle with a different structure following it.

4.3 Client Recovery From Filehandle Expiration

If possible, the client SHOULD recover from the receipt of an NFS4ERR_FHEXPIRED error. The client must take on additional responsibility so that it may prepare itself to recover from the expiration of a volatile filehandle. If the server returns persistent filehandles, the client does not need these additional steps.

For volatile filehandles, most commonly the client will need to store the component names leading up to and including the file system object in question. With these names, the client should be able to recover by finding a filehandle in the name space that is still available or by starting at the root of the server's file system name space.

If the expired filehandle refers to an object that has been removed from the file system, obviously the client will not be able to recover from the expired filehandle.

It is also possible that the expired filehandle refers to a file that has been renamed. If the file was renamed by another client, again it is possible that the original client will not be able to recover. However, in the case that the client itself is renaming the file and the file is open, it is possible that the client may be able to recover. The client can determine the new path name based on the processing of the rename request. The client can then regenerate the new filehandle based on the new path name. The client could also use the compound operation mechanism to construct a set of operations like:

      RENAME A B
      LOOKUP B
      . . ." (RFC 3010, pp. 28-29)

The DAFS protocol does not support the COMPOUND procedure. Instead, a client can issue the rename and lookup within a DAFS chain. The GETFH call that follows the lookup in the original example has also been removed from the quote, since DAFS does not support or require this call.
The DAFS_PROC_LOOKUP operation returns the filehandle for B.

4.1.5. Named Attributes

The DAFS protocol supports three classes of file object attributes: mandatory, recommended, and named. Mandatory and recommended attributes are discussed in 6.1.5., "File Attributes" and 6.1.6., "File System Attributes". The named attributes model is borrowed from the NFS Version 4 specification and its description follows in the quote below:

"These attributes are not supported by direct encoding in the NFS Version 4 protocol but are accessed by string names rather than numbers and correspond to an uninterpreted stream of bytes which are stored with the file system object. The name space for these attributes may be accessed by using the OPENATTR operation. The OPENATTR operation returns a filehandle for a virtual "attribute directory" and further perusal of the name space may be done using READDIR and LOOKUP operations on this filehandle. Named attributes may then be examined or changed by normal READ and WRITE and CREATE operations on the filehandles returned from READDIR and LOOKUP. Named attributes may have attributes.

It is recommended that servers support arbitrary named attributes. A client should not depend on the ability to store any named attributes in the server's file system. If a server does support named attributes, a client which is also able to handle them should be able to copy a file's data and meta-data with complete transparency from one location to another; this would imply that names allowed for regular directory entries are valid for named attribute names as well.

Names of attributes will not be controlled by this document or other IETF standards track documents. See the section 'IANA Considerations' for further discussion." (RFC 3010, p. 31)

The reference to the "IANA Considerations" section as it pertains to the named attributes follows:

"The NFS version 4 protocol provides for the association of named attributes to files. The name space identifiers for these attributes are defined as string names. The protocol does not define the specific assignment of the name space for these file attributes; the application developer or system vendor is allowed to define the attribute, its semantics, and the associated name. Even though this name space will not be specifically controlled to prevent collisions, the application developer or system vendor is strongly encouraged to provide the name assignment and associated semantics for attributes via an Informational RFC. This will provide for interoperability where common interests exist." (RFC 3010, p. 174)

4.2. Data Transfer Operations

4.2.1. Send-Receive

4.2.1.1. Inline Bulk Data Transfer

A traditional send-receive model is provided for small transfers and for environments in which remote DMA operations are not desired. This is termed INLINE data transfer because the data is sent inline with the write request or read response. Typical I/O operations specified using DAFS INLINE operations are:

o DAFS_PROC_READ_INLINE

o DAFS_PROC_WRITE_INLINE

o DAFS_PROC_APPEND_INLINE

Transport-level scatter/gather facilities can be used by applications to help avoid data copies. A negotiated padding of write request headers enables the server to use scatter/gather to receive WRITE_INLINE data directly into its buffers.

DAFS provides an OPTIONAL Session attribute to govern the use of padding between the end of a DAFS message header and the data being transferred by the DAFS_PROC_WRITE_INLINE operation. The OPTIONAL Session attribute "inline_write_header_size" specifies the number of bytes used to pad inline write headers to a more convenient offset from the beginning of the message.
The number of bytes specified by inline_write_header_size is the combined length of the DAFS message header, the write operation message header, and all padding bytes up to the start of the inline data.

The DAFS client can write inline data with two different alignments. If the client is writing a data buffer that begins following the negotiated padding length, then the client sets the "padded_write" flag for that DAFS_PROC_WRITE_INLINE operation and includes the padding bytes. Otherwise, the client clears the "padded_write" flag and sends the data buffer with no padding so that it immediately follows the operation header.

Rationale: While some applications want to use the RDMA operations, other applications might not. Aligning the data portion of a data transport message provides a non-RDMA mechanism that can be used to effect zero-copy file read and write operations.

Note: The mechanism is the scatter-gather capability used in conjunction with the padding of headers. A server DAT connection for the DAFS operational channel that is to be used to receive file write operations would submit receive data transport operations describing a buffer with two (or more) memory chunks: one pointing to a header buffer of length "inline_write_header_size", and one (or more) pointing to a data buffer. When a client wants to perform inline writes of a large buffer or set of buffers, it can negotiate the inline_write_header_size option, pad the header to this size, and set the padded_inline_write flag for such DAFS_PROC_WRITE_INLINE requests. This causes the data payload to land in the server's data buffer.

The reason for inserting padding into the DAFS_PROC_WRITE_INLINE operation is to help the server create well-aligned data buffers. However, since these buffers are used to receive all requests, segmented buffers might be inconvenient for other, unpadded requests.
For this reason, the value chosen for inline_write_header_size might need to be at least as large as the value chosen for max_request_size so that unpadded requests fit within the first segment of the server's receive buffer.

File read operations can use this technique if they are for synchronous reads for a single client. Because the DAFS_PROC_READ_INLINE operation has a fixed-length response header, the client can post a single receive for the request with one segment identifying the response header and the next segment(s) identifying the user data buffer(s).

4.2.1.2. Inline Append

In addition to inline bulk data movement, the DAFS_PROC_APPEND_INLINE operation provides features specific to appending data to the end of an existing file. The DAFS append operations determine the current file size and write the data into the file as a single atomic step. This prevents concurrent append access by multiple clients from overwriting each other's data.

4.2.1.3. Inline Meta-Data Transfer

Most DAFS operations that do not include bulk data require only a small send and receive buffer size. However, there are a few operations that include variable-sized fields that benefit from RDMA when the amount of data being transferred is large. By providing two variants of these operations, DAFS reduces the buffer space requirement of the protocol by allowing the standard inline buffers to be smaller (large transfers can use the RDMA-based operation variant).

These operations are called INLINE operations because the data is sent inline with the DAFS message header. A number of traditional functions have been implemented using DAFS INLINE operations:

o DAFS_PROC_GETATTR_INLINE

o DAFS_PROC_READDIR_INLINE

o DAFS_PROC_READLINK_INLINE

o DAFS_PROC_SETATTR_INLINE

4.2.2. RDMA Transfers

4.2.2.1. Memory Registration

The DAT requires that host memory that the transport Channel Adapter will access for RDMA operations be registered with the Channel Adapter before it is used. DAFS does not specify how that memory registration is done.

4.2.2.2. Direct Bulk Data Transfer

Transport level RDMA features are exported to DAFS users via DIRECT versions of read and write. These operations pass RMR Contexts and RMR Target Addresses in their request messages rather than INLINE data. It is the responsibility of the DAFS user to manage the memory registrations appropriately. DAFS defines the following RDMA-based data transfer operations:

o DAFS_PROC_READ_DIRECT

o DAFS_PROC_WRITE_DIRECT

o DAFS_PROC_APPEND_DIRECT

The underlying transport-level RDMA I/O operations to support these DAFS requests are issued by the server. An application file read translates into a send of a DAFS_PROC_READ_DIRECT request to the server. The server performs an RDMA write of the requested data directly to the client's indicated buffer, and then follows it with a send of a response message.

A file write operation translates into a send of a DAFS_PROC_WRITE_DIRECT request to the server containing the DAT RMR Context and RMR Target Addresses of the client's buffer. The server then performs an RDMA read to read that buffer's contents into its chosen destination buffer. The server ends the operation with a send of a response message to the client.

4.2.2.3. Direct Meta-Data Transfer

For operations that transfer variable-length data fields that can be significantly larger than the base message size, DAFS includes a DIRECT variant of the operation. The data is sent via an RDMA operation, separate from the DAFS message header.
These DAFS DIRECT operation variants are:

   o  DAFS_PROC_GETATTR_DIRECT
   o  DAFS_PROC_READDIR_DIRECT
   o  DAFS_PROC_READLINK_DIRECT
   o  DAFS_PROC_SETATTR_DIRECT

4.2.3. Batch I/O Operations

Most DAFS requests are received by the server and acted upon immediately. When the request is complete, the server returns the results to the client. Batch I/O operations introduce a new model of interaction between the client and server. In this model, the client requests one or more I/O operations and informs the server that the reads or writes can be performed, and their completion reported, asynchronously to the request. The server can take advantage of this asynchronous batch processing to optimize both the completion of the RDMA operations and the commitment of user data to stable storage.

The results of the I/O operations are sent to the client as a request-response callback on the Back-control Channel of the DAFS Session. The batch I/O completion message can contain the results of one or more previously issued batch I/O requests.

Note: The client is responsible for the synchronization of batch I/O operations with other operations on the file. Batch I/Os interact with DAFS locks in the same manner that synchronous I/Os do.

The batch time window argument provides a hint to the server about the client's throughput requirements so that the server can optimize I/O gathering mechanisms to better support the throughput requirement. Although the batch I/O operation does not provide a guarantee that the write operation will be completed within the batch window, the server is expected to give batch requests that have reached the window the same priority as a normal synchronous I/O operation.

Batch I/Os give a client the ability to go beyond the synchronous request-response model provided by the normal DAFS flow control mechanism. It is possible for a client to overwhelm a server with asynchronous batch I/O requests. Therefore, servers can either perform an asynchronous batch I/O in the standard synchronous fashion or return an error (such as EWHOACOWBOY) to notify a client that the server is congested and that the client needs to slow down its generation of batch requests. The EWHOACOWBOY error instructs the client to refrain from posting additional batch I/O requests until it has received a batch write completion message.

Rationale: The batch write operation provides a high bandwidth mechanism for transferring a large batch of I/O requests where the application's latency requirement is based on completion of the entire batch rather than completion of each individual request. By batching request completions within a completion notification message, the mechanism supports high bandwidth streams of I/O requests.

4.2.4. Server Caching Hints

DAFS cache hints provide a way to supply information to the server regarding which file data the client would like the server to cache on writeback, and which file data the client would like the server to prefetch into the server's cache.

The hints include two types of information. First, the client can supply information about the client's predicted access pattern for the file if it is known. These hints can provide a general hint to inform the server's caching policies for the file. Second, specific byte range cache "weighting" hints are provided, indicating predictions about the client's intentions regarding future read and write file access to the byte range.

DAFS allows the client to send cache hints to the server as a part of normal read, write, and append requests, or in separate cache hints messages.

Rationale: The purpose of cache hints is to convey information the application knows about future data use to the server.
This should help the server to intelligently schedule its I/Os and let the application treat the server's memory as a second level cache. The assumption is that the server ages data in its cache. Hints are intended to aid the assignment of weights to aid server cache management.

Hints can be provided with any read/write request or a separate request can be made to update a cache hint. Cache hint requests can also be part of a batch I/O request. The goal for the cache hints is to provide the server with all possible information so that it can maximize its caching efficiency. However, hints can always be safely ignored (at a possible performance penalty), and hints are not persistent across server failures. The fact that these are only hints allows the server to benefit if it can, without additional resource commitment. The DAFS protocol does not dictate what actions a server SHOULD take upon reception of cache hints or even that the server needs to take any actions.

4.3. Request Chaining

To enable the server to process multiple dependent requests without incurring a round-trip delay between each request, the DAFS protocol implements request chaining. The motivation for chaining is similar to that for the COMPOUND feature of NFS Version 4. Even with the relatively low communication latencies expected in the DAFS environment, there are considerable benefits from pipelining multiple requests so that multiple dependent requests do not incur a latency equal to the sum of the queuing, round-trip, and processing latencies for each of the requests.

Chaining differs from NFS Version 4 COMPOUND in that each dependent request continues to retain its own separate identity (for flow-control and Response Cache purposes). This enables better utilization of memory by providing tighter bounds on request and response buffer sizes and limits the amount of information being stored in the Response Cache.

Chaining is defined for requests made by the client on the Operation Channel. It cannot be used for requests made by the server on the Back-control Channel.

When multiple requests are chained, the server MUST execute them in the order they were sent by the client. A request cannot be started until the previous request has completed. If a previous dependent request encountered an error, subsequent requests are aborted with a DAFSERR_CHAIN_BROKEN error. The client can use chaining flags, described in 4.3.2., "Chaining Flags", to specify that certain information be passed between requests in a chain.

4.3.1. Chaining Restrictions

For the purposes of chaining (and for the Response Cache as discussed in 5.2.1., "Response Cache"), all requests are divided into five categories:

   o  Special requests
   o  Bulk fetch requests
   o  Simple requests
   o  FS-state-modifying requests
   o  Lock-state-modifying requests.

Special requests are used in the setup and maintenance of DAFS Sessions. Such requests cannot be chained to any other request. The special operations are:

   o  DAFS_PROC_CLIENT_CONNECT
   o  DAFS_PROC_CLIENT_AUTH
   o  DAFS_PROC_SERVER_AUTH
   o  DAFS_PROC_CLIENT_CONNECT_AUTH
   o  DAFS_PROC_CONNECT_BIND
   o  DAFS_PROC_DISCONNECT
   o  DAFS_PROC_SECINFO
   o  DAFS_PROC_REGISTER_CRED
   o  DAFS_PROC_RELEASE_CRED
   o  DAFS_PROC_GET_FENCING_LIST
   o  DAFS_PROC_SET_FENCING_LIST
   o  DAFS_PROC_DISCARD_RESPONSES.

Bulk fetch requests retrieve a significant amount of file system data, but do not change the file system state (with the possible exception of file access times). Such requests can be chained, but they MUST NOT be followed by an fs-state-modifying request in the chain. This allows a request chain to be reissued after server failure without requiring the server to save very large amounts of response data in stable storage. The bulk fetch requests are:

   o  DAFS_PROC_GETATTR_DIRECT
   o  DAFS_PROC_GETATTR_INLINE
   o  DAFS_PROC_GET_FSATTR
   o  DAFS_PROC_READLINK_DIRECT
   o  DAFS_PROC_READLINK_INLINE
   o  DAFS_PROC_READ_DIRECT
   o  DAFS_PROC_READ_INLINE
   o  DAFS_PROC_READDIR_DIRECT
   o  DAFS_PROC_READDIR_INLINE.

Simple requests do not modify file system data and return only a relatively small quantity of data. These can be chained together with other requests, but when they precede an fs-state-modifying request, they SHOULD be marked with a special flag in the request header that indicates to the server that they are to be saved in the Response Cache. This allows a request chain that includes simple requests followed by state-modifying requests to be completed properly after server failure. The simple requests are:

   o  DAFS_PROC_ACCESS
   o  DAFS_PROC_CACHE_HINT
   o  DAFS_PROC_CHECK_RESPONSE
   o  DAFS_PROC_FETCH_RESPONSE
   o  DAFS_PROC_GET_ROOTHANDLE
   o  DAFS_PROC_LOOKUP
   o  DAFS_PROC_LOOKUPP
   o  DAFS_PROC_NVERIFY
   o  DAFS_PROC_NULL
   o  DAFS_PROC_OPENATTR
   o  DAFS_PROC_VERIFY.

Fs-state-modifying requests modify file system data. These requests are always saved in the Response Cache (see 5.2.1., "Response Cache"). As mentioned previously, fs-state-modifying requests cannot appear in a chain following a bulk fetch request. If this occurs, the server SHOULD return DAFSERR_CHAIN_FORM when the fs-state-modifying request is encountered.
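The category restrictions described so far can be sketched in a few lines of Python. This is an illustration only, not part of the protocol: the Category enum, the CATEGORY table (which lists just a handful of procedures from this section), and check_chain_form() are hypothetical names. The sketch also assumes that "followed by" means anywhere later in the chain, not only the immediately next request.

```python
from enum import Enum, auto


class Category(Enum):
    SPECIAL = auto()
    BULK_FETCH = auto()
    SIMPLE = auto()
    FS_STATE_MODIFYING = auto()
    LOCK_STATE_MODIFYING = auto()


# Hypothetical lookup table: a small subset of the procedures in 4.3.1.
CATEGORY = {
    "DAFS_PROC_DISCONNECT": Category.SPECIAL,
    "DAFS_PROC_READ_INLINE": Category.BULK_FETCH,
    "DAFS_PROC_READDIR_INLINE": Category.BULK_FETCH,
    "DAFS_PROC_LOOKUP": Category.SIMPLE,
    "DAFS_PROC_WRITE_INLINE": Category.FS_STATE_MODIFYING,
    "DAFS_PROC_LOCK": Category.LOCK_STATE_MODIFYING,
    "DAFS_PROC_LOCKU": Category.LOCK_STATE_MODIFYING,
}


def check_chain_form(chain):
    """Return the index of the first request that violates the category
    rules (where DAFSERR_CHAIN_FORM would be expected), or None."""
    seen_bulk_fetch = False
    for index, proc in enumerate(chain):
        category = CATEGORY[proc]
        # Special requests cannot be chained to any other request.
        if category is Category.SPECIAL and len(chain) > 1:
            return index
        # An fs-state-modifying request cannot follow a bulk fetch request.
        if category is Category.FS_STATE_MODIFYING and seen_bulk_fetch:
            return index
        if category is Category.BULK_FETCH:
            seen_bulk_fetch = True
    return None
```

For example, a lookup followed by a write passes the check, while a read followed by a write is flagged at the write, which matches the point at which the text says the server reports the error.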
Fs-state-modifying requests are:

   o  DAFS_PROC_APPEND_DIRECT
   o  DAFS_PROC_APPEND_INLINE
   o  DAFS_PROC_BATCH_SUBMIT
   o  DAFS_PROC_CLOSE
   o  DAFS_PROC_COMMIT
   o  DAFS_PROC_CREATE
   o  DAFS_PROC_DELEGPURGE
   o  DAFS_PROC_DELEGRETURN
   o  DAFS_PROC_HURRY_UP
   o  DAFS_PROC_LINK
   o  DAFS_PROC_OPEN
   o  DAFS_PROC_REMOVE
   o  DAFS_PROC_RENAME
   o  DAFS_PROC_SETATTR_DIRECT
   o  DAFS_PROC_SETATTR_INLINE
   o  DAFS_PROC_WRITE_DIRECT
   o  DAFS_PROC_WRITE_INLINE.

Lock-state-modifying requests modify volatile locking state on the server while making no change to stable file system state. These requests are also saved in the Response Cache. Unlike fs-state-modifying requests, lock-state-modifying requests can follow bulk fetch requests in the same chain. Note that DAFS_PROC_OPEN, even though it primarily affects locking state, is an fs-state-modifying request because, with the create option, it has an effect on stable file system storage. In addition, due to the Delete-on-Last-Close semantics of DAFS_PROC_CLOSE, it is also classified as fs-state-modifying. The lock-state-modifying requests are:

   o  DAFS_PROC_OPEN_DOWNGRADE
   o  DAFS_PROC_LOCK
   o  DAFS_PROC_LOCKT
   o  DAFS_PROC_LOCKU.

When a chain contains bulk fetch requests that are followed by lock-state-modifying requests, special care is necessary to recover from a Session disconnect. Consider a chain consisting of a lock request followed by a number of read requests, and a final unlock request. Such a chain can be used to provide an atomic fetch of the contents of two noncontiguous regions of a file, without the possibility of an update occurring between the read requests. In the event of disconnection, if the Response Cache shows that the unlock request has not executed, then a partial chain can be issued, starting at the first read request for which no response was received in the old Session. On the other hand, if the unlock request does appear as executed in the Response Cache, the chain would have to be reissued in its entirety, because there could be a read for which a response was not received in the old Session. Because the read requests are not available in the Response Cache, if a response was not previously received, then reissuing those read requests is the only option. A consistent set of data could only be obtained by reissuing the chain as a whole.

4.3.2. Chaining Flags

Chaining is specified in the chaining flags field of the request header. This field is present for all requests, even those for which chaining cannot be done. The chaining flags are defined as follows:

DAFS_CHF_FORW (0x01)
   Indicates that there is a subsequent dependent request chained to this one. If this flag is not set, this is the last or only request in the current chain. If this flag is set and the current request is a special request, a DAFSERR_CHAIN_FORM error results.

DAFS_CHF_BACK (0x02)
   Indicates that there is a previous dependent request in this chain that was sent immediately preceding this one. If this flag is not set, this is the first or only request in the current chain. If the previous request specified DAFS_CHF_FORW and the current one does not have DAFS_CHF_BACK, a DAFSERR_CHAIN_FORM error is returned. This is also true in the converse case: the current request specifies DAFS_CHF_BACK and the previous request sent does not have DAFS_CHF_FORW set.

DAFS_CHF_SAVE (0x04)
   Indicates that the current request is a simple request that SHOULD be saved in the Response Cache because a state-modifying request follows it in the current chain. If the current request is not a simple request or if DAFS_CHF_FORW is not set, a DAFSERR_CHAIN_FORM error results.
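The pairing rules for DAFS_CHF_FORW, DAFS_CHF_BACK, and DAFS_CHF_SAVE can be sketched as a small check. This is a hedged illustration in Python rather than a definitive server implementation: check_flags() and its parameters are hypothetical, the caller is assumed to know whether the current request is a special or simple request, and prev_flags is None when no request precedes the current one.

```python
DAFS_CHF_FORW = 0x01
DAFS_CHF_BACK = 0x02
DAFS_CHF_SAVE = 0x04

CHAIN_FORM = "DAFSERR_CHAIN_FORM"


def check_flags(prev_flags, flags, is_special=False, is_simple=False):
    """Return CHAIN_FORM if `flags` breaks one of the rules stated for
    FORW, BACK, and SAVE; otherwise return None.

    prev_flags: chaining flags of the previously sent request, or None.
    """
    prev_forw = prev_flags is not None and bool(prev_flags & DAFS_CHF_FORW)
    back = bool(flags & DAFS_CHF_BACK)

    # A special request cannot have a request chained after it.
    if flags & DAFS_CHF_FORW and is_special:
        return CHAIN_FORM
    # FORW on the previous request and BACK on this one must agree.
    if prev_forw != back:
        return CHAIN_FORM
    # SAVE is valid only on a simple request that has a successor.
    if flags & DAFS_CHF_SAVE and not (is_simple and flags & DAFS_CHF_FORW):
        return CHAIN_FORM
    return None
```

A two-request chain would therefore carry DAFS_CHF_FORW on the first request and DAFS_CHF_BACK on the second; dropping either flag makes the pair malformed.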
DAFS_CHF_FH (0x08)
   Indicates that the filehandle for the current operation is to be the filehandle used for or generated by the previous operation. In this case, the filehandle specified in the request header is ignored. If DAFS_CHF_BACK is not set on this request, a DAFSERR_CHAIN_FORM error results.

DAFS_CHF_STATEID (0x10)
   Indicates that the State-id for the current operation is to be taken from the one used for or generated by the previous operation within the chain that used or generated a State-id, in preference to the one specified in the operation itself, which is then ignored. If DAFS_CHF_BACK is not set on this request, a DAFSERR_CHAIN_FORM error results.

4.3.3. Chaining and Flow Control

For flow-control purposes, each request within a chain is considered separately. Thus, because of flow-control restrictions, a client issuing a chain of requests might be unable to issue all the requests within the chain without waiting for some to finish. In the worst case, when OPNreq equals one, the client would wait for each request to finish before issuing a subsequent one, vitiating the latency reduction benefits of chaining.

If a client chooses to use chaining in a situation in which flow control prevents all the requests from being issued immediately, the client MUST ensure that requests that are not intended to be part of the current request chain are not issued concurrently. For example, if a Session is multiplexed among multiple threads, a chain of requests from one thread MUST NOT be interspersed with a request or a chain of requests from a second thread. If an unrelated request is issued while an uncompleted chain exists, this will generally result in a DAFSERR_CHAIN_FORM error.

A server can perform internal batching of chained responses (for example, to optimize CPU resources) by waiting for the end of the chain to trigger action. When doing so, the server SHOULD ensure that it does not cause a flow-control-constrained client using chaining to wait for an unduly long time, or forever. The server SHOULD never wait for an additional request when existing flow-control restrictions would prevent a client from sending that request.

4.3.4. Chaining and Recovery

In the event of disconnection, server reboot, or server failover, the client SHOULD recover cleanly so that the results of all state-modifying operations are correctly retrieved, the operations that were not executed before the failure are properly retried, and no state-modifying operation is erroneously performed more than once. In the case of disconnection and reconnection without server failure, the server MAY provide Response Cache information that enables the client to do this successfully. In addition, the server can maintain sufficient state within stable storage to enable the client to reconnect and recover when a server failure occurs.

When disconnection occurs, chaining-related information is immediately forgotten. Individual requests within the chain are individually recorded. State-fetching operations are not recorded, because they can be safely reissued. State-modifying operations are recorded in a Response Cache (optionally on stable storage). Traversal operations that preceded state-modifying operations within a chain are also saved in the Response Cache because they have been specially marked to indicate this condition using the DAFS_CHF_SAVE flag.

When request chaining is in effect, the client might have received responses for an initial subset of the requests in a chain. For those requests within the chain that have been issued, but for which a response has not been received, the client might need to determine which requests have in fact been executed. There are two cases to consider:

   o  If none of the unreplied-to requests are state-modifying operations, then all these requests can be reissued. A new chain SHOULD be issued that includes only the unreplied-to requests. In some cases, information that in the original chain was passed from a previous operation will be explicitly entered into the parameters for the new chain. This information is available, because all such information is available either from the original requests or the earlier responses for requests that have completed.

   o  If some of the unreplied-to requests are state-modifying requests, then all the preceding requests are available in the Response Cache if they have been executed, because they are either state-modifying requests or simple requests marked specially to be recorded in the Response Cache. Thus the client can determine, with the server's help, exactly which requests need to be reissued. As in the previous case, information that was passed from a previous operation in the chain might need to be explicitly entered into parameters for the new chain. This information is available, because the response for all previous requests is available from the original response or from the Response Cache.

4.4. Locking and Access Control

4.4.1. Locking

DAFS locking extends NFS locking with two new capabilities: PERSIST locks and AUTORECOVER locks. PERSIST locks survive client and server failures and become broken rather than revoked. AUTORECOVER locks have rollback semantics associated with them.

4.4.1.1. DAFS/NFS locking differences

DAFS locking is based on the NFS Version 4 locking model. This section presents the major differences between the DAFS and NFS Version 4 file locking semantics.

   o  Client-id Management

      The NFS Version 4 protocol has two operations to set the client-id: SETCLIENTID and SETCLIENTID_CONFIRM.
      In DAFS, the client-ids are established when a connection between the client and the server is established by DAFS_PROC_CLIENT_CONNECT or DAFS_PROC_CLIENT_CONNECT_AUTH. Given the requirements for a reliable transport layer, there is no need for a confirmation step when a client connects to the server and establishes a client-id.

      DAFS servers associate a DAFS session with the client-id generated when the session was connected. The DAFS locking requests do not explicitly include the client-id in the arguments as this information can be obtained from the session that received the request.

      In the event that a DAFS client receives a DAFSERR_STALE_CLIENTID or DAFSERR_STALE_STATEID, it obtains a new client-id by reconnecting to the DAFS server. These errors will most likely occur when the client has failed to renew its leases and the server has freed the client's locking state.

   o  State-id Management

      The NFS Version 4 protocol requires that the state-id be changed whenever the locking state it represents on the server changes. NFS operations such as LOCK and LOCKU return state-ids since these operations modify the lockowner's locking state on the server.

      DAFS servers set up locking state when a lockowner opens a file with the DAFS_PROC_OPEN procedure. Once established, state-ids do not change when locking state changes. Therefore, lock-state-modifying operations such as DAFS_PROC_LOCK and DAFS_PROC_LOCKU do not return state-ids.

   o  Leases and their renewal

      NFS Version 4 servers use leases to detect clients that have crashed. A server is allowed to free locking state from a client with expired leases. DAFS servers employ client leases in a similar manner.

      Leases are renewed by an NFS client when it issues a procedure with a valid state-id, such as LOCK, LOCKU, RENEW or WRITE, among others. DAFS clients, on the other hand, renew leases by issuing any DAFS procedure, including DAFS_PROC_NULL. The DAFS protocol does not require a special RENEW lease procedure like the NFS Version 4 protocol does. The DAFS server is able to renew leases when a DAFS request is received because the session-based communication model allows the server to quickly identify the client that originated the request.

      Note that the DAFS lease renewal mechanism has low overhead. An active client does not need to issue special requests to renew leases as the normal DAFS request traffic implicitly renews the client's leases. An inactive client need only send a DAFS_PROC_NULL procedure every lease expiration period to renew all its leases.

   o  Share Reservations

      DAFS locking supports share reservations as described in the NFS Version 4 specification. In addition, DAFS introduces the concept of Shared Key Reservations. See 4.4.2., "Shared Key Reservations" for a description of shared keys and how they relate to share reservations.

   o  Failure Recovery

      A DAFS client identifies itself when connecting to the server using a client id string and a verifier. When a client re-establishes a connection after a client failure, it uses a new verifier. The server releases any locking state it holds for a client whose verifier has changed. Unlike the NFS Version 4 protocol, a change in verifier in a DAFS connect request always results in the release of all locking state associated with the client represented by the client-id-string in the connect arguments.

      See 4.4.1.5., "Client Failure and Recovery" and 4.4.1.6., "Server Failure and Recovery" for further discussion of DAFS locking state recovery after client and server crashes.

   o  Migration and Replication

      The DAFS protocol does not support migration or replication of file systems. The sections of the NFS Version 4 specification that describe locking functionality when files are replicated or migrated are omitted from the quoted text below.

4.4.1.2. NFS Version 4 Locking

Chapter 8 of RFC 3010 describes the NFS Version 4 locking and portions of this chapter are included below. The quoted text applies to DAFS locking except as noted in 4.4.1.1., "DAFS/NFS locking differences".

      "8. File Locking and Share Reservations

      Integrating locking into the NFS [DAFS] protocol necessarily causes it to be state-full. With the inclusion of 'share' file locks the protocol becomes substantially more dependent on state than the traditional combination of NFS and NLM [XNFS]. There are three components to making this state manageable:

      o  Clear division between client and server
      o  Ability to reliably detect inconsistency in state between client and server
      o  Simple and robust recovery mechanisms

      In this model, the server owns the state information. The client communicates its view of this state to the server as needed. The client is also able to detect inconsistent state before modifying a file.

      To support Win32 'share' locks it is necessary to atomically OPEN or CREATE files. Having a separate share/unshare operation would not allow correct implementation of the Win32 OpenFile API. In order to correctly implement share semantics, the previous NFS protocol mechanisms used when a file is opened or created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFS version 4 [DAFS] protocol has an OPEN [DAFS_PROC_OPEN] operation that subsumes the functionality of LOOKUP, CREATE, and ACCESS. However, because many operations require a filehandle, the traditional LOOKUP [DAFS_PROC_LOOKUP] is preserved to map a file name to filehandle without establishing state on the server. The policy of granting access or modifying files is managed by the server based on the client's state. These mechanisms can implement policy ranging from advisory only locking to full mandatory locking.

      8.1. Locking

      It is assumed that manipulating a lock is rare when compared to READ and WRITE operations. It is also assumed that crashes and network partitions are relatively rare. Therefore it is important that the READ and WRITE operations have a lightweight mechanism to indicate if they possess a held lock. A lock request contains the heavyweight information required to establish a lock and uniquely define the lock owner.

      The following sections describe the transition from the heavy weight information to the eventual stateid used for most client and server locking and lease interactions.

      8.1.1. Client ID

      For each LOCK request, the client must identify itself to the server. This is done in such a way as to allow for correct lock identification and crash recovery. Client identification is accomplished with two values.

      o  A verifier that is used to detect client reboots.
      o  A variable length opaque array to uniquely define a client. For an operating system this may be a fully qualified host name or IP address. For a user level NFS [DAFS] client it may additionally contain a process id or other unique sequence." (RFC 3010, pp. 51-52)

The DAFS protocol defines a client_id structure. See 4.4.1.1., "DAFS/NFS locking differences" for a description of DAFS client_id management.

      "It is possible through the mis-configuration of a client or the existence of a rogue client that two clients end up using the same nfs_client_id [client-id-string]" (RFC 3010, p. 52)

The DAFS protocol client_id negotiation is similar to that of NFS; however, there is no confirmation step. See 4.4.1.1., "DAFS/NFS locking differences" for a description of DAFS client_id management.

      "The following describes the two scenarios of negotiation.
      1  Client has never connected to the server

         In this case the client generates an nfs_client_id [client-id-string] and unless another client has the same nfs_client_id.id [client-id-string] field, the server accepts the request. The server also records the principal (or principal to uid mapping) from the credential in the RPC request that contains the nfs_client_id negotiation request (SETCLIENTID operation) [DAFS Connect operation]." (RFC 3010, p. 52)

      "2  Client is re-connecting to the server after a client reboot

         In this case, the client still generates an nfs_client_id [client-id-string] but the nfs_client_id.id [client-id-string] field will be the same as the nfs_client_id.id [client-id-string] generated prior to reboot. If the server finds that the principal/uid is equal to the previously 'registered' nfs_client_id.id [client-id-string], then locks associated with the old nfs_client_id [client-id-string] are immediately released. If the principal/uid is not equal, then this is a rogue client and the request is returned in error." (RFC 3010, pp. 52-53)

Since DAFS has no message retransmissions, there is no need for a confirmation step during the re-connection.

      "In both cases, upon success, NFS4_OK [DAFS_STATUS_OK] is returned. To help reduce the amount of data transferred on OPEN and LOCK, the server will also return a unique 64-bit clientid value that is a shorthand reference to the nfs_client_id [client-id-string] values presented by the client. From this point forward, the client will use the clientid to refer to itself.

      The clientid assigned by the server should be chosen so that it will not conflict with a clientid previously assigned by the server. This applies across server restarts or reboots. When a clientid is presented to a server and that clientid is not recognized, as would happen after a server reboot, the server will reject the request with the error NFS4ERR_STALE_CLIENTID [DAFSERR_STALE_CLIENTID]. When this happens, the client must obtain a new clientid by use of the SETCLIENTID [DAFS Connect] operation and then proceed to any other necessary recovery for the server reboot case (See the section 'Server Failure and Recovery').

      The client must also employ the SETCLIENTID operation when it receives a NFS4ERR_STALE_STATEID [DAFSERR_STALE_STATEID] error using a stateid derived from its current clientid, since this also indicates a server reboot which has invalidated the existing clientid (see the next section 'nfs_lockowner and stateid Definition' for details).

      8.1.2. Server Release of Clientid

      If the server determines that the client holds no associated state for its clientid, the server may choose to release the clientid. The server may make this choice for an inactive client so that resources are not consumed by those intermittently active clients. If the client contacts the server after this release, the server must ensure the client receives the appropriate error so that it will use the SETCLIENTID/SETCLIENTID_CONFIRM sequence [DAFS Connect operation] to establish a new identity. It should be clear that the server must be very hesitant to release a clientid since the resulting work on the client to recover from such an event will be the same burden as if the server had failed and restarted. Typically a server would not release a clientid unless there had been no activity from that client for many minutes.

      8.1.3. nfs_lockowner and stateid Definition

      When requesting a lock, the client must present to the server the clientid and an identifier for the owner of the requested lock. These two fields are referred to as the nfs_lockowner [owner] and the definition of those fields are:

      o  A clientid returned by the server as part of the client's use of the SETCLIENTID operation.
      o  A variable length opaque array used to uniquely define the owner of a lock managed by the client. This may be a thread id, process id, or other unique value." (RFC 3010, p. 54)

Because of the DAT transport requirement for in-order delivery, DAFS does not maintain sequence information for lock state. Therefore, stateids are not returned by successful lock operations. However, DAFS does issue stateids when a file is opened to implement consistency between subsequent lock and I/O operations.

      "The stateid is used as a shorthand reference to the nfs_lockowner, since the server will be maintaining the correspondence between them. The server is free to form the stateid in any manner that it chooses as long as it is able to recognize invalid and out-of-date stateids. This requirement includes those stateids generated by earlier instances of the server. From this, the client can be properly notified of a server restart. This notification will occur when the client presents a stateid to the server from a previous instantiation.

      The server must be able to distinguish the following situations and return the error as specified:

      o  The stateid was generated by an earlier server instance (i.e. before a server reboot). The error NFS4ERR_STALE_STATEID [DAFSERR_STALE_STATEID] should be returned.
      o  The stateid was generated by the current server instance but the stateid no longer designates the current locking state for the lockowner-file pair in question (i.e. one or more locking operations has occurred). The error NFS4ERR_OLD_STATEID [DAFSERR_OLD_STATEID] should be returned.
This error condition will only occur when the client issues a locking request which changes a stateid while an I/O request that uses that stateid is outstanding.
o The stateid was generated by the current server instance but the stateid does not designate a locking state for any active lockowner-file pair. The error NFS4ERR_BAD_STATEID [DAFSERR_BAD_STATEID] should be returned. This error condition will occur when there has been a logic error on the part of the client or server. This should not happen.

One mechanism that may be used to satisfy these requirements is for the server to divide stateids into three fields:

o A server verifier which uniquely designates a particular server instantiation.
o An index into a table of locking-state structures.
o A sequence value which is incremented for each stateid that is associated with the same index into the locking-state table.

By matching the incoming stateid and its field values with the state held at the server, the server is able to easily determine if a stateid is valid for its current instantiation and state. If the stateid is not valid, the appropriate error can be supplied to the client.

8.1.4. Use of the stateid

All READ and WRITE operations contain a stateid. If the nfs_lockowner performs a READ or WRITE on a range of bytes within a locked range, the stateid (previously returned by the server) must be used to indicate that the appropriate lock (record or share) is held." (RFC 3010, pp. 53-55)

DAFS defines the special stateid value of zero for use when issuing DAFS_PROC_SETATTR_INLINE and DAFS_PROC_SETATTR_DIRECT, for all operations that do not change the file size.

"An explicit lock may not be granted while a READ or WRITE operation with conflicting implicit locking is being performed." (RFC 3010, p. 55)

DAFS does not define explicit lock sequencing because of the in-order and at-most-once requirements that it places on the DAT transport.

"8.1.7. Releasing nfs_lockowner State

When a particular nfs_lockowner [owner] no longer holds open or file locking state at the server, the server may choose to release the sequence number state associated with the nfs_lockowner. The server may make this choice based on lease expiration, for the reclamation of server memory, or other implementation specific details. In any event, the server is able to do this safely only when the nfs_lockowner [owner] no longer is being utilized by the client. The server may choose to hold the nfs_lockowner [owner] state in the event that retransmitted requests are received. However, the period to hold this state is implementation specific." (RFC 3010, p. 57)

DAFS does not define special handling for message retransmissions.

"8.2. Lock Ranges

The protocol allows a lock owner to request a lock with one byte range and then either upgrade or unlock a sub-range of the initial lock. It is expected that this will be an uncommon type of request. In any case, servers or server file systems may not be able to support sub-range lock semantics. In the event that a server receives a locking request that represents a sub-range of current locking state for the lock owner, the server is allowed to return the error NFS4ERR_LOCK_RANGE [DAFSERR_LOCK_RANGE] to signify that it does not support sub-range lock operations. Therefore, the client should be prepared to receive this error and, if appropriate, report the error to the requesting application.

The client is discouraged from combining multiple independent locking ranges that happen to be adjacent into a single request since the server may not support sub-range requests and for reasons related to the recovery of file locking state in the event of server failure.
As discussed in the section "Server Failure and Recovery" below, the server may employ certain optimizations during recovery that work effectively only when the client's behavior during lock recovery is similar to the client's locking behavior prior to server failure.

8.3. Blocking Locks

Some clients require the support of blocking locks. The NFS version 4 [DAFS] protocol must not rely on a callback mechanism and therefore is unable to notify a client when a previously denied lock has been granted. Clients have no choice but to continually poll for the lock. This presents a fairness problem. Two new lock types are added, READW and WRITEW, and are used to indicate to the server that the client is requesting a blocking lock. The server should maintain an ordered list of pending blocking locks. When the conflicting lock is released, the server may wait the lease period for the first waiting client to re-request the lock. After the lease period expires the next waiting client request is allowed the lock. Clients are required to poll at an interval sufficiently small that it is likely to acquire the lock in a timely manner. The server is not required to maintain a list of pending blocked locks as it is used to increase fairness and not correct operation. Because of the unordered nature of crash recovery, storing of lock state to stable storage would be required to guarantee ordered granting of blocking locks.

Servers may also note the lock types and delay returning denial of the request to allow extra time for a conflicting lock to be released, allowing a successful return. In this way, clients can avoid the burden of needlessly frequent polling for blocking locks. The server should take care in the length of delay in the event the client retransmits the request.

8.4. Lease Renewal

The purpose of a lease is to allow a server to remove stale locks that are held by a client that has crashed or is otherwise unreachable. It is not a mechanism for cache consistency and lease renewals may not be denied if the lease interval has not expired." (RFC 3010, pp. 57-58)

Any DAFS message received by the server from a client acts to renew the client's current leases.

"This approach allows for low overhead lease renewal which scales well. In the typical case no extra RPC calls are required for lease renewal and in the worst case one RPC is required every lease period (i.e. a RENEW [NULL] operation). The number of locks held by the client is not a factor since all state for the client is involved with the lease renewal action.

Since all operations that create a new lease also renew existing leases, the server must maintain a common lease expiration time for all valid leases for a given client. This lease time can then be easily updated upon implicit lease renewal actions.

8.5. Crash Recovery

The important requirement in crash recovery is that both the client and the server know when the other has failed. Additionally, it is required that a client sees a consistent view of data across server restarts or reboots. All READ and WRITE operations that may have been queued within the client or network buffers must wait until the client has successfully recovered the locks protecting the READ and WRITE operations.

8.5.1. Client Failure and Recovery

In the event that a client fails, the server may recover the client's locks when the associated leases have expired. Conflicting locks from another client may only be granted after this lease expiration. If the client is able to restart or reinitialize within the lease period the client may be forced to wait the remainder of the lease period before obtaining new locks.
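
The lease rules above (any message renews the lease, one common expiration time per client, conflicting locks only after expiration) can be illustrated with a short sketch. It is not part of DAFS or RFC 3010: the class name LeaseTable, its method names, and the 30-second lease period are hypothetical and chosen only for illustration.

```python
class LeaseTable:
    """Hypothetical server-side lease bookkeeping (illustration only)."""

    def __init__(self, lease_period):
        self.lease_period = lease_period   # seconds; the value is an assumption
        self.expires_at = {}               # clientid -> common expiration time

    def on_message(self, clientid, now):
        # Any message received from the client renews all of its leases,
        # so a single expiration time per client is sufficient.
        self.expires_at[clientid] = now + self.lease_period

    def lease_expired(self, clientid, now):
        # An unknown client holds no lease, so it is treated as expired.
        return now >= self.expires_at.get(clientid, float("-inf"))

    def may_grant_conflicting_lock(self, holder, now):
        # Conflicting locks from another client may only be granted
        # after the current holder's lease has expired.
        return self.lease_expired(holder, now)


table = LeaseTable(lease_period=30.0)
table.on_message("client-A", now=0.0)     # first contact starts the lease
table.on_message("client-A", now=20.0)    # any later message renews it
print(table.may_grant_conflicting_lock("client-A", now=40.0))  # False
print(table.may_grant_conflicting_lock("client-A", now=50.0))  # True
```

Keeping a single timestamp per client, rather than one per lock, is what makes renewal cost independent of the number of locks held.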
To minimize client delay upon restart, lock requests are associated with an instance of the client by a client supplied verifier. This verifier is part of the initial SETCLIENTID [DAFS Connect] call made by the client. The server returns a clientid as a result of the SETCLIENTID [DAFS Connect] operation." (RFC 3010, p. 59)

DAFS does not require a confirmation step when the client receives a client_id as the result of a successful DAFS Connect request.

"The clientid in combination with an opaque owner field is then used by the client to identify the lock owner for OPEN. This chain of associations is then used to identify all locks for a particular client.

Since the verifier will be changed by the client upon each initialization, the server can compare a new verifier to the verifier associated with currently held locks and determine that they do not match. This signifies the client's new instantiation and subsequent loss of locking state. As a result, the server is free to release all locks held which are associated with the old clientid which was derived from the old verifier.

For secure environments, a change in the verifier must only cause the release of locks associated with the authenticated requester. This is required to prevent a rogue entity from freeing otherwise valid locks.

Note that the verifier must have the same uniqueness properties of the verifier for the COMMIT operation.

8.5.2. Server Failure and Recovery

If the server loses locking state (usually as a result of a restart or reboot), it must allow clients time to discover this fact and re-establish the lost locking state. The client must be able to re-establish the locking state without having the server deny valid requests because the server has granted conflicting access to another client.
Likewise, if there is the possibility that clients have not yet re-established their locking state for a file, the server must disallow READ and WRITE operations for that file. The duration of this recovery period is equal to the duration of the lease period.

A client can determine that server failure (and thus loss of locking state) has occurred, when it receives one of two errors. The NFS4ERR_STALE_STATEID [DAFSERR_STALE_STATEID] error indicates a stateid invalidated by a reboot or restart. The NFS4ERR_STALE_CLIENTID [DAFSERR_STALE_CLIENTID] error indicates a clientid invalidated by reboot or restart. When either of these are received, the client must establish a new clientid (See the section 'Client ID') and re-establish the locking state as discussed below.

The period of special handling of locking and READs and WRITEs, equal in duration to the lease period, is referred to as the 'grace period'. During the grace period, clients recover locks and the associated state by reclaim-type locking requests (i.e. LOCK requests with reclaim set to true and OPEN operations with a claim type of CLAIM_PREVIOUS). During the grace period, the server must reject READ and WRITE operations and non-reclaim locking requests (i.e. other LOCK and OPEN operations) with an error of NFS4ERR_GRACE.

If the server can reliably determine that granting a non-reclaim request will not conflict with reclamation of locks by other clients, the NFS4ERR_GRACE error does not have to be returned and the non-reclaim client request can be serviced. For the server to be able to service READ and WRITE operations during the grace period, it must again be able to guarantee that no possible conflict could arise between an impending reclaim locking request and the READ or WRITE operation. If the server is unable to offer that guarantee, the NFS4ERR_GRACE error must be returned to the client.
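
The grace-period rules above reduce to a small decision. The sketch below is an illustration only, not text from DAFS or RFC 3010: the names Status, in_grace_period, and grace_period_check are hypothetical, and the provably_no_conflict flag stands in for whatever mechanism a server uses to guarantee that a request cannot conflict with a later reclaim.

```python
from enum import Enum


class Status(Enum):
    OK = "OK"
    GRACE = "DAFSERR_GRACE"


def in_grace_period(now, restart_time, lease_period):
    # The grace period starts at server restart and lasts one lease period.
    return restart_time <= now < restart_time + lease_period


def grace_period_check(is_reclaim, provably_no_conflict=False):
    """Decide how to treat a request that arrives DURING the grace period.

    Requests arriving outside the grace period follow other rules and
    are not modeled here.
    """
    if is_reclaim:
        # Reclaim-type LOCK and OPEN requests are how clients recover state.
        return Status.OK
    if provably_no_conflict:
        # The server can guarantee that no later reclaim will conflict.
        return Status.OK
    # READ, WRITE, and non-reclaim locking requests are rejected.
    return Status.GRACE


if in_grace_period(now=5.0, restart_time=0.0, lease_period=30.0):
    print(grace_period_check(is_reclaim=False).value)  # DAFSERR_GRACE
    print(grace_period_check(is_reclaim=True).value)   # OK
```

Defaulting provably_no_conflict to False makes rejection the fallback, which matches the requirement that the error be returned whenever the guarantee cannot be offered.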
For a server to provide simple, valid handling during the grace period, the easiest method is to simply reject all non-reclaim locking requests and READ and WRITE operations by returning the NFS4ERR_GRACE [DAFSERR_GRACE] error. However, a server may keep information about granted locks in stable storage. With this information, the server could determine if a regular lock or READ or WRITE operation can be safely processed.

For example, if a count of locks on a given file is available in stable storage, the server can track reclaimed locks for the file and when all reclaims have been processed, non-reclaim locking requests may be processed. This way the server can ensure that non-reclaim locking requests will not conflict with potential reclaim requests. With respect to I/O requests, if the server is able to determine that there are no outstanding reclaim requests for a file by information from stable storage or another similar mechanism, the processing of I/O requests could proceed normally for the file.

To reiterate, for a server that allows non-reclaim lock and I/O requests to be processed during the grace period, it MUST determine that no lock subsequently reclaimed will be rejected and that no lock subsequently reclaimed would have prevented any I/O operation processed during the grace period.

Clients should be prepared for the return of NFS4ERR_GRACE [DAFSERR_GRACE] errors for non-reclaim lock and I/O requests. In this case the client should employ a retry mechanism for the request. A delay (on the order of several seconds) between retries should be used to avoid overwhelming the server. Further discussion of the general issue is included in [Floyd]. The client must account for the server that is able to perform I/O and non-reclaim locking requests within the grace period as well as those that can not do so.
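
A client-side retry loop for the grace error might look like the sketch below. It is an illustration only: the function name retry_on_grace, the 5-second default delay, and the attempt limit are assumptions, since the text asks only for a delay on the order of several seconds. The send_request argument is a hypothetical callable that issues the request and returns its status string.

```python
import time

GRACE = "DAFSERR_GRACE"


def retry_on_grace(send_request, delay_seconds=5.0, max_attempts=10,
                   sleep=time.sleep):
    """Call send_request() until it returns something other than GRACE.

    Returns the final status, which is still GRACE if every attempt
    was rejected.
    """
    status = GRACE
    for attempt in range(max_attempts):
        status = send_request()
        if status != GRACE:
            return status
        if attempt < max_attempts - 1:
            # Pause between retries so the recovering server is not flooded.
            sleep(delay_seconds)
    return status


# Usage with a stand-in server that leaves its grace period on the third try.
replies = iter([GRACE, GRACE, "OK"])
delays = []
result = retry_on_grace(lambda: next(replies), sleep=delays.append)
print(result, delays)  # OK [5.0, 5.0]
```

The sleep function is passed in so that the loop can be exercised without real waiting. The same loop works unchanged against a server that services the request during the grace period, since it then returns on the first attempt.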
A reclaim-type locking request outside the server's grace period can only succeed if the server can guarantee that no conflicting lock or I/O request has been granted since reboot or restart.

8.5.3. Network Partitions and Recovery

If the duration of a network partition is greater than the lease period provided by the server, the server will have not received a lease renewal from the client. If this occurs, the server may free all locks held for the client. As a result, all stateids held by the client will become invalid or stale. Once the client is able to reach the server after such a network partition, all I/O submitted by the client with the now invalid stateids will fail with the server returning the error NFS4ERR_EXPIRED [DAFSERR_EXPIRED]. Once this error is received, the client will suitably notify the application that held the lock.

As a courtesy to the client or as an optimization, the server may continue to hold locks on behalf of a client for which recent communication has extended beyond the lease period. If the server receives a lock or I/O request that conflicts with one of these courtesy locks, the server must free the courtesy lock and grant the new request.

If the server continues to hold locks beyond the expiration of a client's lease, the server MUST employ a method of recording this fact in its stable storage. Conflicting locks requests from another client may be serviced after the lease expiration. There are various scenarios involving server failure after such an event that require the storage of these lease expirations or network partitions. One scenario is as follows:

A client holds a lock at the server and encounters a network partition and is unable to renew the associated lease. A second client obtains a conflicting lock and then frees the lock. After the unlock request by the second client, the server reboots or reinitializes.
Once the server recovers, the network partition heals and the original client attempts to reclaim the original lock.

In this scenario and without any state information, the server will allow the reclaim and the client will be in an inconsistent state because the server or the client has no knowledge of the conflicting lock.

The server may choose to store this lease expiration or network partitioning state in a way that will only identify the client as a whole. Note that this may potentially lead to lock reclaims being denied unnecessarily because of a mix of conflicting and non-conflicting locks. The server may also choose to store information about each lock that has an expired lease with an associated conflicting lock. The choice of the amount and type of state information that is stored is left to the implementor. In any case, the server must have enough state information to enable correct recovery from multiple partitions and multiple server failures." (RFC 3010, pp. 59-63)

DAFS does not require explicit handling of lock request timeouts.

"8.7. Server Revocation of Locks

At any point, the server can revoke locks held by a client and the client must be prepared for this event. When the client detects that its locks have been or may have been revoked, the client is responsible for validating the state information between itself and the server. Validating locking state for the client means that it must verify or reclaim state for each lock currently held.

The first instance of lock revocation is upon server reboot or re-initialization. In this instance the client will receive an error (NFS4ERR_STALE_STATEID or NFS4ERR_STALE_CLIENTID) [DAFSERR_STALE_STATEID or DAFSERR_STALE_CLIENTID] and the client will proceed with normal crash recovery as described in the previous section.

The second lock revocation event is the inability to renew the lease period.
While this is considered a rare or unusual event, the client must be prepared to recover. Both the server and client will be able to detect the failure to renew the lease and are capable of recovering without data corruption. For the server, it tracks the last renewal event serviced for the client and knows when the lease will expire. Similarly, the client must track operations which will renew the lease period. Using the time that each such request was sent and the time that the corresponding reply was received, the client should bound the time that the corresponding renewal could have occurred on the server and thus determine if it is possible that a lease period expiration could have occurred.

The third lock revocation event can occur as a result of administrative intervention within the lease period. While this is considered a rare event, it is possible that the server's administrator has decided to release or revoke a particular lock held by the client. As a result of revocation, the client will receive an error of NFS4ERR_EXPIRED [DAFSERR_EXPIRED] and the error is received within the lease period for the lock. In this instance the client may assume that only the nfs_lockowner's locks have been lost. The client notifies the lock holder appropriately. The client may not assume the lease period has been renewed as a result of failed operation.

When the client determines the lease period may have expired, the client must mark all locks held for the associated lease as 'unvalidated'. This means the client has been unable to re-establish or confirm the appropriate lock state with the server. As described in the previous section on crash recovery, there are scenarios in which the server may grant conflicting locks after the lease period has expired for a client.
When it is possible that the lease period has expired, the client must validate each lock currently held to ensure that a conflicting lock has not been granted. The client may accomplish this task by issuing an I/O request, either a pending I/O or a zero-length read, specifying the stateid associated with the lock in question. If the response to the request is success, the client has validated all of the locks governed by that stateid and re-established the appropriate state between itself and the server. If the I/O request is not successful, then one or more of the locks associated with the stateid was revoked by the server and the client must notify the owner.

8.8. Share Reservations

A share reservation is a mechanism to control access to a file. It is a separate and independent mechanism from record locking. When a client opens a file, it issues an OPEN operation to the server specifying the type of access required (READ, WRITE, or BOTH) and the type of access to deny others (deny NONE, READ, WRITE, or BOTH). If the OPEN fails the client will fail the application's open request.

Pseudo-code definition of the semantics:

   if ((request.access & file_state.deny) ||
       (request.deny & file_state.access))
           return (NFS4ERR_DENIED) [DAFSERR_DENIED]

The constants used for the OPEN and OPEN_DOWNGRADE operations for the access and deny fields are as follows:

   const OPEN4_SHARE_ACCESS_READ   = 0x00000001;
   const OPEN4_SHARE_ACCESS_WRITE  = 0x00000002;
   const OPEN4_SHARE_ACCESS_BOTH   = 0x00000003;

   const OPEN4_SHARE_DENY_NONE     = 0x00000000;
   const OPEN4_SHARE_DENY_READ     = 0x00000001;
   const OPEN4_SHARE_DENY_WRITE    = 0x00000002;
   const OPEN4_SHARE_DENY_BOTH     = 0x00000003;

8.9. OPEN/CLOSE Operations

To provide correct share semantics, a client MUST use the OPEN operation to obtain the initial filehandle and indicate the desired access and what if any access to deny.
Even if the client intends to use a stateid of all 0's or all 1's, it must still obtain the filehandle for the regular file with the OPEN operation so the appropriate share semantics can be applied. For clients that do not have a deny mode built into their open programming interfaces, deny equal to NONE should be used.

The OPEN operation with the CREATE flag, also subsumes the CREATE operation for regular files as used in previous versions of the NFS protocol. This allows a create with a share to be done atomically.

The CLOSE operation removes all share locks held by the nfs_lockowner on that file. If record locks are held, the client SHOULD release all locks before issuing a CLOSE. The server MAY free all outstanding locks on CLOSE but some servers may not support the CLOSE of a file that still has record locks held. The server MUST return failure if any locks would exist after the CLOSE.

The LOOKUP operation will return a filehandle without establishing any lock state on the server. Without a valid stateid, the server will assume the client has the least access. For example, a file opened with deny READ/WRITE cannot be accessed using a filehandle obtained through LOOKUP because it would not have a valid stateid (i.e. using a stateid of all bits 0 or all bits 1).

8.10. Open Upgrade and Downgrade

When an OPEN is done for a file and the lockowner for which the open is being done already has the file open, the result is to upgrade the open file status maintained on the server to include the access and deny bits specified by the new OPEN as well as those for the existing OPEN. The result is that there is one open file, as far as the protocol is concerned, and it includes the union of the access and deny bits for all of the OPEN requests completed. Only a single CLOSE will be done to reset the effects of both OPEN's.
Note that the client, when issuing the OPEN, may not know that the same file is in fact being opened. The above only applies if both OPEN's result in the OPEN'ed object being designated by the same filehandle.

When the server chooses to export multiple filehandles corresponding to the same file object and returns different filehandles on two different OPEN's of the same file object, the server MUST NOT 'OR' together the access and deny bits and coalesce the two open files. Instead the server must maintain separate OPEN's with separate stateid's and will require separate CLOSE's to free them.

When multiple open files on the client are merged into a single open file object on the server, the close of one of the open files (on the client) may necessitate change of the access and deny status of the open file on the server. This is because the union of the access and deny bits for the remaining open's may be smaller (i.e. a proper subset) than previously. The OPEN_DOWNGRADE operation is used to make the necessary change and the client should use it to update the server so that share reservation requests by other clients are handled properly.

8.11. Short and Long Leases

When determining the time period for the server lease, the usual lease tradeoffs apply. Short leases are good for fast server recovery at a cost of increased RENEW [DAFS_PROC_NULL] or READ (with zero length) requests. Longer leases are certainly kinder and gentler to large internet servers trying to handle very large numbers of clients. The number of RENEW [DAFS_PROC_NULL] requests drop in proportion to the lease time. The disadvantages of long leases are slower recovery after server failure (server must wait for leases to expire and grace period before granting new lock requests) and increased file contention (if client fails to transmit an unlock request then server must wait for lease expiration before granting new locks).
Long leases are usable if the server is able to store lease state in non-volatile memory. Upon recovery, the server can reconstruct the lease state from its non-volatile memory and continue operation with its clients and therefore long leases are not an issue.

8.12. Clocks and Calculating Lease Expiration

To avoid the need for synchronized clocks, lease times are granted by the server as a time delta. However, there is a requirement that the client and server clocks do not drift excessively over the duration of the lock. There is also the issue of propagation delay across the network which could easily be several hundred milliseconds as well as the possibility that requests will be lost and need to be retransmitted.

To take propagation delay into account, the client should subtract it from lease times (e.g. if the client estimates the one-way propagation delay as 200 msec, then it can assume that the lease is already 200 msec old when it gets it). In addition, it will take another 200 msec to get a response back to the server. So the client must send a lock renewal or write data back to the server 400 msec before the lease would expire." (RFC 3010, pp. 63-67)

4.4.1.3. PERSIST Locks

PERSIST locks are provided so that if a lock-protected sequence of I/O operations is interrupted, the protecting lock is not made available again until the lockholder (or a cooperating client) has an opportunity to repair any inconsistencies in the data that resulted from the interruption. The events which might cause such an interruption are a client failure, a network partition which results in lock lease expiration and revocation, and server failure when the lock cannot be reclaimed successfully, or some combination of these events (for example, power loss to an entire cluster of DAFS clients and servers).
The model of PERSIST locks is that, rather than being released following certain failure events, the locks instead become "broken". The state of a broken lock MUST survive client failures, server failures, and network partitions. Where a normal lock would have become subject to revocation, PERSIST locks enter a state of being breakable. The specific conditions for a lock to become breakable are either a client lease expiration or a server restart grace period expiration. When a lock is breakable, any conflicting lock request causes the lock to be broken. The lock also becomes broken when a client re-initialization (e.g., reboot) occurs, regardless of whether it was breakable at the time of the client re-initialization.

When a PERSIST lock is broken, it behaves much like a normal lock relative to read and write operations. Read and write requests will receive one of the status values:

o DAFS_ERR_LOCKED: if a null State-id (all zeroes) was specified.
o DAFS_ERR_STALE_STATEID: if a non-null State-id was specified.

It is only when clients either try to acquire a lock using the DAFS_PROC_LOCK operation or inquire about a lock using the DAFS_PROC_LOCKT operation that they see it is broken via the status DAFS_ERR_LOCK_BROKEN.

To release a broken PERSIST lock, a client issues a DAFS_PROC_LOCK request with the REPAIR option. The client application would then presumably perform some recovery action to repair the data contained within the locked region and then could release the lock with DAFS_PROC_LOCKU.

A lock is specified to be PERSIST by setting the PERSIST option on the DAFS_PROC_LOCK request.

4.4.1.4. Auto-Recovery Locks

AUTORECOVER locks provide a limited UNDO or rollback recovery service. In the absence of failures, an AUTORECOVER lock behaves exactly like a normal NFS Version 4 lock.
In failure conditions, however, the server guarantees that any modifications made to a file that were done under the protection of an AUTORECOVER lock are undone before the lock is released and made available to other clients. Failure conditions are server restart grace period expirations, client lease expirations, and client re-initializations.

This recovery is limited in the sense that there is no atomicity to actions performed on different files or even different byte regions of the same file if those regions were protected with different locks. If the failure occurs during the release of locks, those DAFS_PROC_LOCKU requests that completed will have no recovery associated with them, whereas the locks that have not yet become released will have recovery actions performed.

A lock is specified to be an AUTORECOVER lock by setting the AUTORECOVER option on the DAFS_PROC_LOCK request. If a lock is requested with both AUTORECOVER and PERSIST options, the rollback associated with the lock is delayed until the lock becomes broken. Before the lock becomes broken the client can reclaim the lock and either continue on or forcibly roll back. A new lock type, ABORT_T, is defined to forcibly roll back an AUTORECOVER lock.

AUTORECOVER locks are an OPTIONAL feature. An implementation can restrict AUTORECOVER locks to locks that cover the entire valid range for byte-range locks (i.e., from 0 to 2^64-1 bytes), thus preventing multiple simultaneous AUTORECOVER locks on a single file. If an implementation does not support AUTORECOVER locks, or it supports only AUTORECOVER locks that cover the entire valid byte range and an AUTORECOVER lock is attempted specifying a smaller file range, then the error status DAFSERR_NOTSUPP is returned.

4.4.1.5. Client Failure and Recovery

Client failures are seen by the server in the following two ways:

o as lease expirations
o as Sessions with a new client verifier indicating a client re-initialization

The effect of lease expirations on PERSIST locks is to put them in a breakable state. When the lock is in a breakable state, the client still has an opportunity to renew a lease if it does so before any conflicting DAFS_PROC_LOCK request is serviced. This allows the client to continue operation following a network partition and recovery.

If a conflicting lock is requested before the lease is renewed, the lock becomes broken. The lock request causing the break as well as any subsequent conflicting lock requests will receive the status DAFS_ERR_LOCK_BROKEN. Repair of the lock requires a client to request the lock with the REPAIR option and to then release it.

The effect of lease expiration can be summarized by lock type:

o Normal Lock: Revoke the lock. This action can take place immediately or the server can defer it until there is a conflicting request by another client.
o PERSIST Lock: Make the lock breakable. After a lock is made breakable it is made broken when a conflicting lock request occurs.
o AUTORECOVER Lock: Rollback and revoke the lock. After a lock is made breakable it is made broken when a conflicting lock request occurs.
o PERSIST - AUTORECOVER Lock: Make the lock breakable. After a lock is made breakable it is made broken when a conflicting lock request occurs.

The effect of a client re-initialization on PERSIST locks is to put them into the broken state. Conflicting lock requests will receive the status DAFSERR_LOCK_BROKEN. Repair of the lock requires a client to obtain the lock with the REPAIR option, and then release it.

The effect of client re-initialization can be summarized by lock type:

o Normal Lock: Release the lock.
o PERSIST Lock: Make the lock broken.
o AUTORECOVER Lock: Rollback and revoke the lock.
o PERSIST - AUTORECOVER Lock: Make the lock broken.

4.4.1.6. Server Failure and Recovery

Clients can detect server failures when they establish a new Session after a previous Session with that server has been disconnected. The client presents the same client-id-string and client-verifier that it used to establish the previous Session, but the server returns a different client-id. If the server had not re-initialized, it would return the same client-id as the client had used to establish the previous Session.

Immediately following a server restart, the server enters a "grace period" equal in length to the lease period. During this time read, write, and lock requests other than Reclaim lock requests return the error DAFSERR_GRACE, unless the server can determine that all valid locks for the file have already been reclaimed.

Locking behavior during the grace period is the same for all locks regardless of whether they are normal, PERSIST, or AUTORECOVER locks. During the grace period, clients can reclaim locks using the DAFS_PROC_LOCK operation with the reclaim option. If an AUTORECOVER lock is reclaimed during the grace period, any modifications made to the file while it was protected by the lock will be reflected in the file. Note that modifications made using asynchronous requests (i.e., unstable DAFS_PROC_WRITE_INLINE and DAFS_PROC_WRITE_DIRECT operations that have not been committed yet, and modifications made with the DAFS_PROC_BATCH_SUBMIT operation that have not received completion notification yet) might not be reflected in the file. If a DAFS_PROC_COMMIT had been done, the file will reflect the write operation. The client has the option of explicitly rolling back any changes by issuing a DAFS_PROC_LOCK request with the ABORT_T lock type.

After the grace period expires, non-PERSIST locks that were not reclaimed are made available to all clients.
Any such locks that were AUTORECOVER locks will have their associated modifications undone before any conflicting lock is granted. The server MAY allow reclaim of locks to occur after the grace period has ended, but only if it can be sure that no conflicting locks have been granted and released since the grace period ended.

PERSIST locks that are not reclaimed during the grace period enter the breakable state. The lock remains in a breakable state until the first conflicting lock request arrives at the server. At that time the lock becomes broken. If it is an AUTORECOVER lock, the rollback will be performed at this time. The lock remains broken until a client attempts to reacquire the lock with the REPAIR option and subsequently releases the lock.

If a PERSIST lock becomes breakable at the end of the grace period, the client can still reclaim it, as long as the server is sure that no intervening conflicting lock has been granted (that is, the lock was not repaired and then made breakable because of a different PERSIST lock).

4.4.2. Shared Key Reservations

DAFS extends NFS share reservations with one new capability: SHARE KEY reservations. SHARE KEY reservations enable a set of cooperating clients (identified by a single shared KEY) to simultaneously access a file while at the same time denying access to cooperating clients that are not members of the original set (identified by a different KEY than the original one).

SHARE KEY reservations are provided to aid a clustered application to detect rogue instances of the application that are trying to perform conflicting access to a file. Such rogue access is now a common source of corruption in clustered applications. SHARE KEY reservations allow a clustered application to have all components of a cluster instance share a SHARE KEY reservation.
Thus multiple clients participating in the cluster instance can access the file, but when a client participating in a different cluster instance tries to access the file, access is denied.

SHARE KEY reservation checking is in addition to ordinary NFSv4-style share reservation checking. Pseudocode definition of the semantics:

   // Do the checking for NFSv4-style open semantics
   if ((request.access & file_state.deny) ||
       (request.deny & file_state.access)) {
       return DAFSERR_DENIED;
   }

   // Do special SHARE KEY handling, if appropriate
   if (request.share_key_type) {
       if (file_state.share_key_count == 0) {
           // No SHARE KEY reservation is held yet: record the key
           file_state.key = request.key;
       } else if (request.key != file_state.key) {
           // A different cluster instance already holds the file
           return DAFSERR_KEY_MISMATCH;
       }
       file_state.share_key_count++;
   }

   // Request will succeed. Update the remaining state.
   file_state.access |= request.access;
   file_state.deny |= request.deny;

The CLOSE operation decrements the file_state.share_key_count for any SHARE KEY locks held by the dafs_lockowner on that file. Similarly, file_state.access and file_state.deny are updated so that they reflect share reservations held by other dafs_lockowners on that file.

4.4.3. Access Control Lists (ACLs)

The access control lists in DAFS and NFS version 4 are the same. The NFS Version 4 description of ACLs is quoted below. For the most part, the quote applies to DAFS. The exceptions are the naming of data structures and constants that have changed to adhere to the DAFS protocol naming conventions. See 6.1.4., "Basic Types" for the DAFS-equivalent names of data structures and constants.

"The NFS [DAFS] ACL attribute is an array of access control entries (ACE). There are various access control entry types. The server is able to communicate which ACE types are supported by returning the appropriate value within the aclsupport attribute.
The types of ACEs are defined as follows:

Type    Description

ALLOW   Explicitly grants the access defined in acemask4 to the
        file or directory

DENY    Explicitly denies the access defined in acemask4 to the
        file or directory.

AUDIT   LOG (system dependent) any access attempt to a file or
        directory which uses any of the access methods specified
        in acemask4.

ALARM   Generate a system ALARM (system dependent) when any
        access attempt is made to a file or directory for the
        access methods specified in acemask4.

The NFS ACE attribute is defined as follows:

   typedef uint32_t acetype4;
   typedef uint32_t aceflag4;
   typedef uint32_t acemask4;

   struct nfsace4 {
       acetype4 type;
       aceflag4 flag;
       acemask4 access_mask;
       utf8string who;
   };

To determine if an ACCESS or OPEN request succeeds each nfsace4 entry is processed in order by the server. Only ACEs which have a 'who' that matches the requester are considered. Each ACE is processed until all of the bits of the requester's access have been ALLOWED. Once a bit (see below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer considered in the processing of later ACEs. If an ACCESS_DENIED_ACE is encountered where the requester's mode still has unALLOWED bits in common with the 'access_mask' of the ACE, the request is denied.

The bitmask constants used to represent the above definitions within the aclsupport attribute are as follows:

   const ACL4_SUPPORT_ALLOW_ACL = 0x00000001;
   const ACL4_SUPPORT_DENY_ACL  = 0x00000002;
   const ACL4_SUPPORT_AUDIT_ACL = 0x00000004;
   const ACL4_SUPPORT_ALARM_ACL = 0x00000008;

5.9.1. ACE type

The semantics of the "type" field follow the descriptions provided above.
The bitmask constants used for the type field are as follows:

   const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000;
   const ACE4_ACCESS_DENIED_ACE_TYPE  = 0x00000001;
   const ACE4_SYSTEM_AUDIT_ACE_TYPE   = 0x00000002;
   const ACE4_SYSTEM_ALARM_ACE_TYPE   = 0x00000003;

5.9.2. ACE flag

The "flag" field contains values based on the following descriptions.

ACE4_FILE_INHERIT_ACE

   Can be placed on a directory and indicates that this ACE should be added to each new non-directory file created.

ACE4_DIRECTORY_INHERIT_ACE

   Can be placed on a directory and indicates that this ACE should be added to each new directory created.

ACE4_INHERIT_ONLY_ACE

   Can be placed on a directory but does not apply to the directory, only to newly created files/directories as specified by the above two flags.

ACE4_NO_PROPAGATE_INHERIT_ACE

   Can be placed on a directory. Normally when a new directory is created and an ACE exists on the parent directory which is marked ACL4_DIRECTORY_INHERIT_ACE, two ACEs are placed on the new directory. One for the directory itself and one which is an inheritable ACE for newly created directories. This flag tells the server to not place an ACE on the newly created directory which is inheritable by subdirectories of the created directory.

ACE4_SUCCESSFUL_ACCESS_ACE_FLAG
ACL4_FAILED_ACCESS_ACE_FLAG

   Both indicate for AUDIT and ALARM which state to log the event. On every ACCESS or OPEN call which occurs on a file or directory which has an ACL that is of type ACE4_SYSTEM_AUDIT_ACE_TYPE or ACE4_SYSTEM_ALARM_ACE_TYPE, the attempted access is compared to the ace4mask of these ACLs. If the access is a subset of ace4mask and the identifier match, an AUDIT trail or an ALARM is generated. By default this happens regardless of the success or failure of the ACCESS or OPEN call.

   The flag ACE4_SUCCESSFUL_ACCESS_ACE_FLAG only produces the AUDIT or ALARM if the ACCESS or OPEN call is successful.
   The ACE4_FAILED_ACCESS_ACE_FLAG causes the ALARM or AUDIT if the ACCESS or OPEN call fails.

ACE4_IDENTIFIER_GROUP

   Indicates that the "who" refers to a GROUP as defined under Unix.

The bitmask constants used for the flag field are as follows:

   const ACE4_FILE_INHERIT_ACE           = 0x00000001;
   const ACE4_DIRECTORY_INHERIT_ACE      = 0x00000002;
   const ACE4_NO_PROPAGATE_INHERIT_ACE   = 0x00000004;
   const ACE4_INHERIT_ONLY_ACE           = 0x00000008;
   const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010;
   const ACE4_FAILED_ACCESS_ACE_FLAG     = 0x00000020;
   const ACE4_IDENTIFIER_GROUP           = 0x00000040;

5.9.3. ACE Access Mask

The access_mask field contains values based on the following:

Access              Description

READ_DATA           Permission to read the data of the file
LIST_DIRECTORY      Permission to list the contents of a directory
WRITE_DATA          Permission to modify the file's data
ADD_FILE            Permission to add a new file to a directory
APPEND_DATA         Permission to append data to a file
ADD_SUBDIRECTORY    Permission to create a subdirectory to a
                    directory
READ_NAMED_ATTRS    Permission to read the named attributes of a
                    file
WRITE_NAMED_ATTRS   Permission to write the named attributes of a
                    file
EXECUTE             Permission to execute a file
DELETE_CHILD        Permission to delete a file or directory within
                    a directory
READ_ATTRIBUTES     The ability to read basic attributes (non-acls)
                    of a file
WRITE_ATTRIBUTES    Permission to change basic attributes
                    (non-acls) of a file
DELETE              Permission to Delete the file
READ_ACL            Permission to Read the ACL
WRITE_ACL           Permission to Write the ACL
WRITE_OWNER         Permission to change the owner
SYNCHRONIZE         Permission to access file locally at the server
                    with synchronous reads and writes

The bitmask constants used for the access mask field are as follows:

   const ACE4_READ_DATA         = 0x00000001;
   const ACE4_LIST_DIRECTORY    = 0x00000001;
   const ACE4_WRITE_DATA        = 0x00000002;
   const ACE4_ADD_FILE          = 0x00000002;
   const ACE4_APPEND_DATA       = 0x00000004;
   const ACE4_ADD_SUBDIRECTORY  = 0x00000004;
   const ACE4_READ_NAMED_ATTRS  = 0x00000008;
   const ACE4_WRITE_NAMED_ATTRS = 0x00000010;
   const ACE4_EXECUTE           = 0x00000020;
   const ACE4_DELETE_CHILD      = 0x00000040;
   const ACE4_READ_ATTRIBUTES   = 0x00000080;
   const ACE4_WRITE_ATTRIBUTES  = 0x00000100;
   const ACE4_DELETE            = 0x00010000;
   const ACE4_READ_ACL          = 0x00020000;
   const ACE4_WRITE_ACL         = 0x00040000;
   const ACE4_WRITE_OWNER       = 0x00080000;
   const ACE4_SYNCHRONIZE       = 0x00100000;

5.9.4. ACE who

There are several special identifiers ("who") which need to be understood universally. Some of these identifiers cannot be understood when an NFS client accesses the server, but have meaning when a local process accesses the file. The ability to display and modify these permissions is permitted over NFS.

Who               Description

"OWNER"           The owner of the file.
"GROUP"           The group associated with the file.
"EVERYONE"        The world.
"INTERACTIVE"     Accessed from an interactive terminal.
"NETWORK"         Accessed via the network.
"DIALUP"          Accessed as a dialup user to the server.
"BATCH"           Accessed from a batch job.
"ANONYMOUS"       Accessed without any authentication.
"AUTHENTICATED"   Any authenticated user (opposite of ANONYMOUS)
"SERVICE"         Access from a system service.

To avoid conflict, these special identifiers are distinguish by an appended "@" and should appear in the form "xxxx@" (note: no domain name after the "@"). For example: ANONYMOUS@." (RFC 3010, pp. 40-44)

4.4.4. Fencing

Cluster systems with shared resources need to "fence off" access to shared resources by nodes when those nodes lose membership in the cluster quorum. This is done primarily to prevent a failing or misbehaving node from improperly accessing the shared resource, and more specifically to "drain" all outstanding I/O requests to the resource from the evicted node.
Draining is necessary to permit other nodes in the quorum to repair any damage done to the resource by the evicted node. To perform this repair (that is, recovery) the recovering node needs to know that no more I/Os will be executed by the evicted node.

Fencing has also been described as "client access revocation." This is an accurate description in the file server environment. The problem can be broken down into the following subproblems:

o Access revocation - Preventing further access to the resource by the node.
o Draining - Providing indication of when all outstanding I/O requests of the node have completed or been cancelled.
o Authorization and access control - Assuring that agents issuing fence operations are authorized to do so.
o Concurrency control - Avoiding race conditions that could result in incorrect or hung system states.

The DAFS Fencing mechanism is described below, in terms of

o subjects (i.e., dafs clients),
o objects (e.g., file systems),
o permissions (i.e., allow or deny), and
o operations (e.g., get and set permissions)

4.4.4.1. Fencing Subjects

A Fencing subject is the active entity whose access to a file or set of files is being controlled. A dafs client can associate a "fence_id_string" with a Session to the DAFS server by specifying it in the fence_id_string field of the DAFS_PROC_CLIENT_CONNECT operation (this is a new argument field added to that operation). The fence_id_string is similar in concept to the existing DAFS Client-id-string argument, but does not overload Fencing semantics onto the Client-id.

Rationale: DAFS Fencing is intended to address access control between a set of cooperating DAFS clients. The clients in the set need to

o each use a unique Fence_id_string, and
o make the set of Fence_id_strings in use known to some central authority (e.g., cluster manager) for administering the Fencing mechanism.
Since this level of cooperation is needed, Fencing is not meant to protect against malicious attacks. Being "spoof-proof" is NOT REQUIRED.

4.4.4.2. Fencing Object

A Fencing Object is defined by the DAFS filehandle and object_flag argument fields in the Fencing administrative operations. The object_flag specifies whether the Object being Fenced is the file associated with the filehandle, or the file system specified by the FShandle part of the filehandle.

Note: For the case of Fencing a fs_handle, it is up to the underlying DAFS server side file system to export fs_handles to the DAFS server in a way that the implementation-specific unit of storage (e.g., file system) that is associated with the fs_handle can be described well enough so that users who want to use the Fencing feature can place the set of files that need to be fenced as a unit into the underlying file system appropriately.

4.4.4.3. Fencing Permissions

Fencing permissions are defined by a "Fencing_list" of Fence_id_strings. The list designates the Fence_id_strings, and thus the DAFS clients, that are allowed access (vis-a-vis Fencing) to the Object (defined by the filehandle and object_flag). The Fencing_list is stored persistently by the DAFS server. A null Fencing_list is a special case that means that all dafs clients are allowed access to the Object. A non-null Fencing_list means that all DAFS clients with connections that specify a Fence_id_string in the Fencing_list can access the Object.

4.4.4.4. Fencing Operations

The Fencing operations used to manage the Fencing_list, and to cause Fencing access controls to be in effect, are DAFS_PROC_SET_FENCING_LIST and DAFS_PROC_GET_FENCING_LIST. The ability to set the Fencing_list for a filehandle object is reserved to the owner of the object, or a trusted Client. The ability to set the Fencing_list for a file system is reserved to trusted Clients.
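
The Fencing_list semantics described in 4.4.4.3 can be illustrated with a minimal sketch in C. The struct layout and the function name fence_allows are hypothetical and are not defined by DAFS. The sketch also assumes that a Session without a Fence_id_string is treated as not being a member of a non-null Fencing_list, which this document does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of a Fencing_list; the protocol does not
 * define a concrete layout. */
struct fencing_list {
    const char *const *ids;  /* Fence_id_strings that are allowed access */
    size_t count;            /* 0 models the null Fencing_list */
};

/* Returns true when a Session carrying fence_id may access the Object
 * guarded by list. This covers Fencing only; share reservations, ACLs,
 * and locks are checked separately. */
static bool fence_allows(const struct fencing_list *list, const char *fence_id)
{
    assert(list != NULL);
    if (list->count == 0)
        return true;    /* null list: all dafs clients are allowed */
    if (fence_id == NULL)
        return false;   /* assumption: no Fence_id_string means not listed */
    for (size_t i = 0; i < list->count; i++) {
        if (strcmp(list->ids[i], fence_id) == 0)
            return true;    /* Fence_id_string is in the Fencing_list */
    }
    return false;           /* non-null list that does not name this client */
}
```

A server that fences individual Objects would run a check of this kind on each request, after mapping the request's Session to its Fence_id_string and the request's filehandle to the fenced Object.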
The set operation atomically updates the Fencing_list, adding or removing Fence_id_strings from the existing list, or overwriting the existing list, as specified by the argument flags. A side-effect of the set operation when invoked with the deny flag is to

1) drain (i.e., abort or complete) any in-progress operations received on a Session with the just-denied Fence_id_string. The server MUST enforce the denial of access implied by the new Fencing_list on all subsequent requests on a Session that has the just-denied Fence_id_string. This requires determining that the request is associated with a denied Fence_id_string (e.g., determining that the request's Session has a denied Fence_id_string), and matching the filehandle in the request to Objects that are Fenced.

2) if the Object fenced includes all DAFS file objects (directories, files, symlinks, etc.) provided by the DAFS Server, then all existing Sessions associated with the just-denied Fence_id_string can be closed in error. Subsequent attempts to create a Session that contains the just-denied Fence_id_string can be returned in error.

Rationale: These two effects of Fencing provide a range of capability. First, if the Object to be Fenced includes all Objects provided by the DAFS server, the runtime checks needed to implement Fencing are reduced and performance is enhanced. Second, by defining Fencing to include only some of the Objects provided by the DAFS server, multiple sets of cooperating dafs clients (e.g., application clusters) can be supported on the same DAFS server with some cost in runtime performance.

Note: Fencing an Object does not destroy other file state (e.g., locks) associated with the Client. This state is controlled by lease expiration.

4.5. NFS-Derived Operations

The NFS Version 4 file system specification, RFC 3010, identifies a large set of file operations that are common to many file system environments.
Most of these file operations are common to any file system and are not specific to either a wide-sharing or local file-sharing environment. A number of these common file operations do not involve bulk data movement between client and server, and the semantics are not significantly enhanced through the use of memory-to-memory architectures. For this reason, the DAFS file system currently incorporates these operation semantics as defined in NFS Version 4. Although the operation semantics are the same for these operations, the message packet format and other aspects of the communication between client and server are specific to DAFS and incompatible with NFS Version 4.

Specifically, DAFS incorporates the following operational semantics from NFS Version 4 as specified in the corresponding NFS operation:

o DAFS_PROC_NULL
o DAFS_PROC_ACCESS
o DAFS_PROC_CLOSE
o DAFS_PROC_COMMIT
o DAFS_PROC_CREATE
o DAFS_PROC_DELEGPURGE
o DAFS_PROC_DELEGRETURN
o DAFS_PROC_LINK
o DAFS_PROC_LOOKUP
o DAFS_PROC_LOOKUPP
o DAFS_PROC_NVERIFY
o DAFS_PROC_OPEN
o DAFS_PROC_OPENATTR
o DAFS_PROC_OPEN_DOWNGRADE
o DAFS_PROC_REMOVE
o DAFS_PROC_RENAME
o DAFS_PROC_RENEW
o DAFS_PROC_VERIFY
o DAFS_PROC_BC_NULL
o DAFS_PROC_BC_RECALL.

5. Failure Recovery

This chapter describes failure recovery in the following topics:

o Exactly-Once semantics
o Server Response Cache
o Server Failover.

5.1. Exactly Once Semantics

DAFS supports "exactly once" semantics in the face of connection and server failures. Building upon the characteristics of DAT message delivery (see 2.3.3., "DAT Requirements"), DAFS makes an important assumption: DAFS requests are not repeatedly reissued until a response is received. During a DAFS Session, the server will not receive multiple copies of a request sent by the client.
This means that the server does not need to check each request to see if it is a spurious repetition of a request performed earlier. Further, because there are no retransmissions, the server will not erroneously execute any request twice because of an overflow of a time-based reply cache.

It is possible for a DAFS communication channel to fail without an indication to the client. It is expected that clients will implement timeouts. Typically the timeouts will specify long values, simply to detect failed DAFS communication channels. In these error cases, the client will destroy the existing channel and create a new one.

In the event of abnormal disconnection, whether because of a timeout, or some other error, DAFS defines a Response Cache that enables the client to determine which requests, issued before the disconnection, were executed and which were not executed. The client is able to reissue only those requests not executed previously. This mechanism prevents a request from being executed more than once. Requests that do not modify file system state are not included in the Response Cache because these can be reissued harmlessly.

5.2. Server Response Cache

5.2.1. Response Cache

As an option, negotiated when each Session is created, DAFS servers maintain a Response Cache that stores the results of requests that are not guaranteed to have reached the issuing client. If the use of the Response Cache is negotiated and agreed to during Session creation, then the server MUST store these results for all state-modifying file system requests (see 4.3.1., "Chaining Restrictions" for a list) and for chained requests marked with the DAFS_CHF_SAVE flag (see 4.3.2., "Chaining Flags"). The server can store results for other requests but is NOT REQUIRED to because such requests can be re-executed harmlessly.

Rationale: The Response Cache is an optional DAFS server behavior.
Its use is negotiated when a Session is created. A client requests a Response Cache in order to improve its service guarantees following a failure in the client, server, or network. Use of the Response Cache could introduce some loss of performance for the Session, particularly since the content of the Response Cache needs to survive server failure. The server is generally expected to follow the wishes of the client in respect to Response Cache use. However, as with other Session options, the server MAY decline a request for Response Cache use, or MAY always maintain a Response Cache regardless of client requests.

For each request that a client issued but for which it did not receive a response, the Response Cache enables the client to determine whether or not the request was executed by the server, and the results of the response. The Response Cache excludes requests that can be safely reissued, and is maintained after disconnection only when the Session is disconnected abnormally due to an error. Thus, "exactly once" semantics can be maintained across disconnection and server failure.

When a new Session is established, information from the Response Cache of the old Session can be used to determine the set of requests that were in progress. This includes requests that were in transit to the server, requests being executed by the server, and requests whose response was in transit to the client. This enables the client and server to agree on the state of the file system so that no request will be executed more than once (with the exception of requests that can be reissued harmlessly).

Rationale: The session orientation of DAFS combined with the reliable delivery semantics of DAT enable DAFS to reissued requests and responses. By using the limits flow-control places on the size of the set of outstanding requests, the client and the server bound the set of requests whose state needs to be determined following a failure.
The client can interrogate the server and resolve ambiguity. Then the client and server can proceed from a known file state.

The number of responses that need to be stored in the Response Cache is at most OPNreq, because the flow control algorithm limits the number of client requests per channel that can be in progress at the same time to OPNreq. However, the server does not know which particular Response Cache entries can be reused when a new client request is received. Therefore, the DAFS protocol partitions the Response Cache into OPNreq "streams" that a client can use to submit requests to the server. A given stream can have only one request in progress at a time. Note that this is a way of restating the flow control algorithm explained earlier in 3.2.6., "Message Flow Control". A Response Cache entry is associated with each stream and it contains the most recent state-modifying request (or saved chained request) issued for that stream.

Each request is identified by a 32-bit transaction identifier that consists of a 16-bit stream ID and a 16-bit sequence number. The sequence number is incremented for each request sent on the given stream. Wrap around of the 16-bit sequence number does not pose any special difficulties because the transaction ID is used only to resolve uncertainties about which requests have been processed.

5.2.2. Response Cache Handling of OPNreq Decrease

When OPNreq is decreased, the highest numbered streams expire. For example, if the old OPNreq value was N, the highest numbered stream would be stream N-1. If OPNreq is decreased by 1, then the stream numbered N-1 expires. Requests already issued by the client that use the newly expired stream id will be completed normally, but the stream id cannot be used for subsequent requests.
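
The transaction identifier and stream numbering described in 5.2.1 and above can be sketched in C as follows. This is an illustration under stated assumptions, not the wire encoding: placing the stream ID in the high 16 bits is an assumption made here, and the helper names (xid_pack, xid_stream, xid_seq, stream_next_xid, stream_is_valid) and struct stream are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the stream ID occupies the high 16 bits and the sequence
 * number the low 16 bits; the bit placement is not specified here. */
static uint32_t xid_pack(uint16_t stream_id, uint16_t seq)
{
    return ((uint32_t)stream_id << 16) | (uint32_t)seq;
}

static uint16_t xid_stream(uint32_t xid) { return (uint16_t)(xid >> 16); }
static uint16_t xid_seq(uint32_t xid)    { return (uint16_t)(xid & 0xFFFFu); }

/* Hypothetical client-side state for one stream. */
struct stream {
    uint16_t id;        /* stream number, 0 .. OPNreq-1 */
    uint16_t next_seq;  /* sequence number for the next request */
};

/* Returns the transaction ID for the next request on the stream and
 * advances the sequence number, which wraps from 0xFFFF back to 0. */
static uint32_t stream_next_xid(struct stream *s)
{
    assert(s != NULL);
    uint32_t xid = xid_pack(s->id, s->next_seq);
    s->next_seq = (uint16_t)(s->next_seq + 1u);
    return xid;
}

/* With OPNreq equal to opn_req, valid streams are 0 .. opn_req-1, so
 * decreasing OPNreq by one expires the highest numbered stream. */
static int stream_is_valid(uint16_t stream_id, uint16_t opn_req)
{
    return stream_id < opn_req;
}
```

Because only one request can be in progress on a stream at a time, the wrapped sequence number still identifies the most recent request on that stream without ambiguity.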
If there is currently a request outstanding that specifies the newly expired stream id, then in order to keep the total number of outstanding requests below Nreq, the client is REQUIRED to refrain from issuing new requests on some other valid stream id, until the outstanding request has completed. Note that OPNreq can be decreased by at most "one" at a time. For more information, see 3.2.6.4., "Flow Control Specifics".

However, the number of Response Cache entries cannot be immediately decreased, because no entry can be deleted from the Response Cache until the server verifies that the client has received the associated response. If the old OPNreq value was N, then the Response Cache entry associated with stream N-1 might contain an entry that cannot be deleted immediately. The reason is that the client might have transmitted a new request for this stream before receiving notification of OPNreq being decreased. The new request that the client might have sent on the stream might need its response stored in the Response Cache, pending confirmation of its receipt by the client.

When the client receives the response that contains the new OPNreq value, it will stop using the expired stream. The next request that the client sends will contain the new OPNreq value, and this serves to acknowledge both the new value for OPNreq and the fact that one or more streams have expired.

5.2.2.1. Freeing Entries in a Stream

The server relies on DAT connection ordering rules to determine when the client has acknowledged a new OPNreq value. The response message that contained the new OPNreq also contained a particular stream id. When the server subsequently receives a request that also specifies that particular stream id, the server is assured that the client has received the response that contained the new OPNreq value. The Response Cache entry for the expired stream can now be deleted.

5.2.2.2. Freeing Entries in the Highest Numbered Stream

If the new value of OPNreq is sent to the client in a response message that contains a stream id that is itself about to expire (N-1 in the example from the previous section), then the server cannot use receipt of a new request with that stream id as an indication that the client received the new value. The reason is that the client will no longer use that stream id, and therefore the server will not receive subsequent messages that specify that stream id. In this case, acknowledgement of the new OPNreq value is based on the receipt of a request that the server can confirm was sent by the client after the client received the response containing the new value of OPNreq.

If the server makes no further changes to the value of OPNreq, the server can confirm receipt of the new value when it receives a request from the client that contains the new value. The Response Cache entry for the expired stream can now be deleted.

When the server has received OPNreq requests from the client after having sent the response containing the new value of OPNreq, the server is assured that the client has received the response. The Response Cache entry for the expired stream can now be deleted.

It could be that the next response sent by the server also decreases OPNreq. In the worst case, OPNreq-1 responses could be sent, each on the highest stream id valid at the time it is sent. In this case, when the server has received OPNreq requests from the client after having sent the first response containing the new value of OPNreq, the server is assured that the client has received all of the responses. Typically the delay in confirming receipt of the new OPNreq value will be shorter than the worst case, and the Response Cache entries for the expired streams can be released.

5.2.3. Handling Batch I/O Requests

The case of the DAFS_PROC_BATCH_SUBMIT operation needs special consideration. If a disconnection occurs before the final DAFS_PROC_BC_BATCH_COMPLETE message is sent to indicate that all I/O operations are complete, the Response Cache will not contain an entry for the batch submission message even though some of the individual I/O requests might have completed. Clients will need to reissue the DAFS_PROC_BATCH_SUBMIT operation in such circumstances. If other clients were accessing the same areas as the batch I/O requests, the original sequence of operations will be altered. In such cases, the semantics of the repeated I/O operations MAY be different from a single occurrence. It is up to the client using batch I/O requests to use them in circumstances where such semantic differences are acceptable.

5.2.4. Server Response Cache in Stable Storage

If the use of the Response Cache is negotiated and agreed to during Session creation, then, in order to provide exactly once semantics across server failures, the DAFS server MUST keep its Response Cache in stable storage. The server MUST NOT place an entry in the Response Cache if the corresponding operation is not reflected in the file system. Conversely, if the operation is reflected in the file system state, the corresponding entry MUST appear in the Response Cache. A mismatch between the file system state and the Response Cache could result in an operation being performed more than once or not performed at all.

Note: Ensuring agreement between the file system data and the Response Cache involves recording operation parameters for fs-state-modifying-requests in low-latency stable storage (for example, nonvolatile RAM) before performing the operation. Following a failure, the server consults the saved information and uses it to formulate the Response Cache as it will appear to the client when new Sessions are established.
In some cases, the server can determine whether the requested operation
was completed by examining the file system. In other cases (for
example, write operations), the operation can be repeated as part of
server reboot but before allowing any other user access to the file
system. In all of these cases, the server needs to deny access to the
modified file system data by other requests before marking the current
request complete.

5.2.5. Use of the Server Response Cache

Assuming that use of the Response Cache was agreed to during Session
establishment, then following a disconnection, the server MUST save the
Response Cache information so that it can be used by the client upon
reconnection. As part of disconnection processing, the server MUST
ensure that no request issued before disconnection is still being
executed and that the Response Cache entries associated with the
disconnected Session can no longer be modified. The information in the
Response Cache MUST be saved until the client reinitializes, reconnects
to the server, and queries the Response Cache for the previous Session,
or for an implementation-defined period. The reconnection
identification can use the same client verifier or, following a client
reboot, can use a different client verifier.

The client obtains information from the saved Response Cache by
specifying the Session-ID of the disconnected Session together with the
transaction ID. The DAFS_PROC_CHECK_RESPONSE request determines whether
such an entry exists for the specified request. The
DAFS_PROC_FETCH_RESPONSE request retrieves the response. The response
returned to DAFS_PROC_FETCH_RESPONSE is the same as the response that
would have been returned for the original request.

Because the server MUST execute operations in a chain in order, all
Response Cache entries for chained requests will be ordered. (Note,
however, that the actual responses can be delivered in a different
order.)
If the client queries the Response Cache during replay and finds that
the last operation in a chain has been completed successfully, then all
other operations in that chain were also completed successfully.
Nonchained operations can complete in any order.

After response information is obtained for all requests whose results
the client needs to verify, the client issues a
DAFS_PROC_DISCARD_RESPONSES request and then proceeds with the rest of
the necessary recovery. This will include reestablishing any necessary
cached credentials for the new Session. Note that such credentials are
NOT REQUIRED to access the Response Cache because no file system
requests are executed at that time; only the results of previously
executed requests are obtained.

When reconnection occurs because of a server reboot or failover, locks
SHOULD be reclaimed before issuing any new requests. In this context,
new requests include any previously issued requests that were not found
in the Response Cache, because they either were not executed or could
be reexecuted safely.

After any necessary recovery is done, the client can reissue requests
that were not found in the Response Cache.

5.2.6. Response Cache Operations

The following DAFS operations are provided for DAFS Response Cache
access and management.

DAFS_PROC_CHECK_RESPONSE

   Check a disconnected Session's Response Cache for the results of a
   request.

DAFS_PROC_FETCH_RESPONSE

   Fetch information from a disconnected Session's Response Cache.

DAFS_PROC_DISCARD_RESPONSES

   Discard the Response Cache information for a disconnected Session.

5.3. Server Failover

Optionally, a file system's failover_locations attribute can be used to
specify an alternate location to be used to obtain access to the file
system in the event of server failure.
Clients can retrieve the failover_locations attribute when they cross
into a new file system to determine if alternate locations exist. The
file system handle returned by lookup, lookup parent, and open requests
SHOULD be examined to see if it contains a file system handle
previously unknown to the client. At this point, the file system
attribute failover_locations SHOULD be retrieved to determine the
proper place to perform failover for that file system.

If a disconnection occurs, clients will normally attempt to reconnect
to the server. If this fails, the alternate locations can be used.
These MAY be the same for all file systems or there MAY be different
alternate servers for different locations.

After a functioning alternate server is found for a given file system,
recovery is similar to a server reboot. One difference is that the
client might already have existing Sessions with some or all of the
alternate servers specified by the failover_locations attribute. In any
case, the client SHOULD obtain Response Cache information for each
request that was in flight at the time of disconnection. Because the
alternate server can be different for different file systems, the
Response Cache information for each in-flight message MAY need to be
obtained from different servers. Servers MUST ensure that the Response
Cache information is propagated to the appropriate alternate server for
the file system being accessed by the request.

After the Response Cache information is obtained, recovery proceeds as
it does in other server failure cases, including establishing cached
credentials and reclamation of client locks. The client needs to be
prepared to perform these activities on multiple servers if some of its
locks are located on file systems that have failed over to different
alternate servers.

5.3.1. Changing failover_locations

When the value of failover_locations changes, any responses to requests
from clients who have not fetched the new value will set a special
flag, DAFS_SPCOND_FAILOVER, in the special condition flag field of the
response header. The client SHOULD notice the special condition flag
and retrieve the failover_locations attribute for all file systems so
that the client will fail over to the correct location in the event of
server failure. After the client has interrogated failover_locations
for file systems where the value has changed, the DAFS_SPCOND_FAILOVER
flag is reset until a subsequent value change causes it to be set
again.

Note: There might be significant advantage in introducing a way for the
client to obtain failover_location information in a more efficient
manner. For instance, the client might look up a dataset name in a name
server or distributed directory and get a list of potential servers.
See Appendix A. "DAFS Name Service" for more information.

6. Message Formats

This chapter describes the format of requests and responses in the DAFS
protocol. Some highlights of the layout of DAFS messages include:

o DAFS requests and their responses are encoded as individual message
  packets. Limiting each packet to one DAFS operation places a
  reasonable upper bound on the size of the buffer for the DAT receive
  data transfer operation (DTO) that is preallocated and submitted on
  the DAT connections underlying the DAFS communication channels.

o Message formats are organized to consolidate fixed-length fields at
  the beginning of messages. Rearranging fields in this manner fits
  with the fixed/variable-sized segregated encoding.

o Operations that make use of RDMA capabilities define an inline
  portion of the message along with the format of the data that is
  transferred in a "direct" buffer using remote DMA.

o All DAFS messages start with either a request or response header.
  Immediately following the request header is a procedure-specific
  component that contains the arguments of the request. Similarly,
  immediately following the response header in each response message is
  a procedure-specific component that contains the information that
  forms the result.

In the remaining portions of this section, the discussion focuses on
the format of DAFS message headers, the Session connection management
procedures, the Response Cache management operations, the
client-initiated file system requests, and the server-initiated DAFS
back-control directives. For each procedure, the functionality of the
procedure is defined and the format of the argument and result portions
of the request and response messages is given.

Note that for each procedure, the identities of the requester and
responder (for example, DAFS client or server) are implied. For most
procedures, the DAFS client is the requester and the DAFS server is the
responder. But for the back-control procedures (such as
DAFS_PROC_BC_NULL and DAFS_PROC_BC_GETATTR), the DAFS server is the
requester and the DAFS client is the responder.

6.1. Message Headers and Common Structures

6.1.1. Message Format

6.1.1.1. Overall Packet Format

All defined message formats have the following attributes:

o All multi-byte fields use the byte ordering negotiated for the DAFS
  Session.

o The offsets of all 2-byte fields MUST be 2-byte aligned; the offsets
  of all 4-byte fields MUST be 4-byte aligned; the offsets of all
  8-byte fields MUST be 8-byte aligned; and variable-sized fields MUST
  be padded to ensure the proper alignment of the next field. All
  messages MUST be 8-byte aligned.

o Because UTF-8 encoding is used for all string fields, multiple octets
  could be needed to encode a single character. See the following
  discussion on variable-sized fields.
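The natural-alignment rule above reduces to rounding an offset up to a
multiple of the field size. The following Python sketch is illustrative
only; the helper names align_up and pad_before are hypothetical and do
not appear in the DAFS specification.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment.

    alignment is assumed to be a power of two (2, 4, or 8 here).
    """
    return (offset + alignment - 1) & ~(alignment - 1)


def pad_before(offset: int, next_field_size: int) -> int:
    """Pad bytes needed at offset so that a field of next_field_size
    bytes (2, 4, or 8) starts on its natural boundary."""
    return align_up(offset, next_field_size) - offset
```

For instance, a variable-sized field that ends at byte 13 would need
pad_before(13, 4) pad bytes before a following 4-byte field.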
DAFS packets delivered inline using send and receive data transfer
operation (DTO) buffers are laid out in two sections: the first
contains all fields that are fixed in size; the second, herein referred
to as the heap, contains fields that are variable in size. Each
variable-sized field contains an entry in the fixed-size section
consisting of a 4-byte offset into the heap where the actual
variable-sized data is encoded. Some operations, such as read and write
inline, place the count in the fixed portion of the packet: this allows
the variable-sized data itself to be aligned at natural buffer
boundaries if the fixed-size portion is padded accordingly.

DAFS defines data packets that contain variable-sized fields inside the
format of a variable-sized field. This nesting is handled by
recursively encoding each variable-sized field: its fixed contents are
encoded and any nested variable-sized field is encoded using an offset
into the next available location in the heap, which will contain the
length and variable-sized contents as before.

In addition to inline data, some operations define data transfers using
RDMA functionality. Each such operation defines the format of the
information transferred via RDMA.

Encoding of unions is done in a manner similar to what most C compilers
do: unions are encoded in a memory chunk large enough to hold the
largest arm of the union. Union arms that are smaller than the
allocated chunk are padded to fill in the unused portion of the union's
memory. Encoding unions in this fashion makes them fixed in size
regardless of which set of data is included. Fixed-sized unions ease
computations of fixed offsets within packets as shown.

There is also a need for union-like constructs (called joins) whose
encoding attempts to minimize memory usage.
This encoding does not use padding for arms that are shorter than the
maximum size, thus using only as much memory as is needed. The downside
of this encoding is the inability to compute fixed offsets after one of
these constructs appears in a DAFS packet; that is, definitions
containing joins appear in the variable sections of structure
definitions.

6.1.1.2. Data Definition Language

DAFS packet formats are described in a C-like language. The main
differences are as follows:

o Counted arrays with an upper bound are described using angle
  brackets. For example, an array of maximum size N 32-bit numbers
  would be described as

     int32 Sample_CountedArray<N>;

  This definition would translate to a leading 32-bit unsigned number
  of entries field followed by the array itself. A counted array with
  zero entries is defined to be an array with zero in the number of
  entries field, followed by no array entries.

o A counted array with no upper bound is also defined using angle
  brackets, but the maximum size is omitted, for example:

     int32 Sample_VarCountedArray<>;

  The encoding is the same as that used for the counted arrays
  discussed previously.

o In contrast to counted arrays, the DAFS definition language also
  supports the more traditional arrays (as in C-traditional). These
  arrays are defined using the standard square brackets and their
  encoding omits the leading 32-bit number of entries. A sample
  definition:

     int32 Array[N];

  An unbounded instance of a traditional array is specified by omitting
  the maximum size:

     int32 Array[];

o Unions and joins explicitly define the discriminator by using
  switch-style syntax. Unlike switch statements, there is no need to
  add 'break' statements at the end of each case. An empty case
  statement can be used to share the format with the next nonempty
  case.

     [union | join] switch (discriminator) {
         case A:
             .
         case B:
             .
             .
         default;
     }

o Determining whether a field is encoded in the initial fixed-size
  fields section or the subsequent variable-sized fields section
  requires understanding whether the field is variable length itself or
  contains variable length fields. Inline comments are used to note
  which fields are variable length in each message packet. These fields
  will be encoded in the message packet with an offset value in the
  fixed-size section pointing to the start of the (variable-size) value
  of the field in the variable-size section. This variable-size section
  of the message is called the "heap."

  The order in which variable-size fields are listed in a structure
  definition is not necessarily the order in which they will be encoded
  for transmission. This order is determined by the offsets into the
  "heap" found in the fixed sections of the packet. Moreover, not all
  variable portions of a structure are necessarily included in all
  packets. Take the case of a union that contains variable fields in
  different arms: only those variable fields in the selected arm will
  appear in the transmitted packet.

o In addition, comments are used to note when an RDMA buffer is
  referenced through the use of a direct_op_buffer field in the
  message. These comments are labeled "DIRECT:". Although direct
  sections are noted via comments sequentially following the structure
  definitions, the actual memory buffers involved in the transfer will
  rarely be laid out right after the inline data. Moreover, the
  transfer of the contents of the buffer will occur as separate
  transport operations.

6.1.1.3. Alignment of Variable Length Fields

There are two items to be considered when dealing with a variable-sized
field:

1) Encoding of a variable-sized field.

2) Calculation of offsets.

6.1.1.3.1. Encoding of a variable sized field

There are three types of variable-sized fields:

1) counted arrays

   Encoding of a counted array is as follows:

   o Counted arrays always begin on an 8-byte boundary. Encoding of
     counted arrays remains the same irrespective of where it appears
     on the heap.

   o The individual components of a counted array are naturally
     aligned.

   o Counted arrays begin with a 4-byte count.

   o Next is a 4-byte pad, if the elements of the counted array are
     8-byte aligned.

   o Counted arrays are not padded to any natural boundary at the end.
     Any padding is dictated by the alignment requirements of the next
     item in the heap.

2) joins

   Encoding of a join is straightforward.

3) pathnames

   A pathname is a counted array of utf8strings, which themselves are
   variable-sized. They are not encoded recursively. A pathname always
   begins on an 8-byte boundary just like any other counted array and
   consists of a 4-byte count followed by a 4-byte pad and the
   utf8strings themselves. As before, a utf8string is not padded to any
   natural boundary at the end and any padding that is necessary is
   dictated by the alignment that is REQUIRED for the next item.

6.1.1.3.2. Calculation of offsets

Any procedure request or result will have:

1) Header

2) Procedure-specific fixed section

3) Heap

Hereafter the procedure-specific fixed section is called the "fixed
section."

Offset fields specify the number of bytes from the beginning of the
innermost scope that contains the offset field up to the beginning of
the variable length field pointed to.

Definition of a scope:

1) Fixed section creates the outer scope.

2) Any variable sized field (a counted array or a join) creates an
   inner scope.

3) The above two constructs (fixed section and variable sized field)
   are the only ones that can define a scope.
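As an illustration of the counted-array rules in 6.1.1.3.1 and of an
offset measured from the start of the fixed section, consider the
following Python sketch. It is not part of the protocol definition. The
function names, the all-4-byte fixed section, and the choice to place
the offset field last are hypothetical; the sketch also assumes that
the fixed section itself starts on an 8-byte boundary.

```python
import struct


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment (power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def encode_counted_array(values, elem_format, byte_order="<"):
    """Encode a counted array: 4-byte count, a 4-byte pad when the
    elements are 8 bytes wide, then the elements. No trailing pad."""
    elem_size = struct.calcsize(byte_order + elem_format)
    out = struct.pack(byte_order + "I", len(values))
    if elem_size == 8:
        out += b"\x00" * 4          # keep 8-byte elements naturally aligned
    for value in values:
        out += struct.pack(byte_order + elem_format, value)
    return out


def build_body(fixed_fields, array_bytes, byte_order="<"):
    """Hypothetical body: 4-byte fixed fields, then one 4-byte heap
    offset, then the heap holding a single counted array.

    The offset counts bytes from the start of the fixed section (the
    outer scope) to the start of the counted array."""
    fixed_len = 4 * (len(fixed_fields) + 1)
    heap_start = align_up(fixed_len, 8)   # counted arrays start 8-byte aligned
    fixed = struct.pack(byte_order + "I" * len(fixed_fields), *fixed_fields)
    fixed += struct.pack(byte_order + "I", heap_start)
    fixed += b"\x00" * (heap_start - fixed_len)
    return fixed + array_bytes
```

With two fixed fields the fixed section occupies 12 bytes, so the
counted array lands at offset 16 after 4 bytes of padding, and that
value is what the offset field carries.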
Some examples:

1) Fixed section contains an offset to a counted array of RDMA buffers:

   o The offset is the number of bytes from the beginning of the fixed
     section up to the beginning of the counted array of RDMA buffers
     that it points to.

2) Fixed section contains an offset to a counted array of read/write
   requests and each read/write request contains an offset to a counted
   array of file chunks.

   o The offset in the fixed section is the number of bytes from the
     beginning of the fixed section up to the beginning of the counted
     array of read/write requests that it points to.

   o The offset in the read/write request is the number of bytes from
     the beginning of the counted array of read/write requests up to
     the beginning of the counted array of file chunks that it points
     to.

3) Fixed section contains an offset to a counted array of directory
   entries; each directory entry contains an offset to a join of
   attributes; and each join of attributes contains an offset to an
   owner name.

   o The offset in the fixed section is the number of bytes from the
     beginning of the fixed section up to the beginning of the counted
     array of directory entries that it points to.

   o The offset in the directory entry is the number of bytes from the
     beginning of the counted array of directory entries up to the
     beginning of the attributes join that it points to.

   o The offset in the attributes join is the number of bytes from the
     beginning of the attributes join up to the beginning of the owner
     name.

6.1.1.4. Basic Data Types

The basic building blocks of DAFS messages are:

o dafs_int8, dafs_int16, dafs_int32, dafs_int64: 1-, 2-, 4-, and 8-byte
  signed quantities.

o dafs_uint8, dafs_uint16, dafs_uint32, dafs_uint64: 1-, 2-, 4-, and
  8-byte unsigned quantities.

o dafs_opaque8, dafs_opaque16, dafs_opaque32, dafs_opaque64: 1-, 2-,
  4-, and 8-byte opaque containers. These containers are truly opaque
  and do not require byte swapping by a host with an endian encoding
  different from the encoding agreed upon for use in the DAFS Session.

o typedef dafs_uint8 dafs_boolean; TRUE is defined as 1, FALSE as 0.

o Enumeration types are encoded as dafs_uint32.

o typedef dafs_uint8 dafs_utf8string<>; This defines a variable-sized
  UTF-8 string. The dafs_uint32 array count field precedes the array
  entries and is the number of octets in the array, not the number of
  characters.

o typedef dafs_uint32 dafs_var_offset_type; Heap offset to the
  beginning of a variable-sized field. The offset is in bytes.

6.1.1.5. Endianness

All DAFS messages exchanged between clients and servers MUST adhere to
the endianness requirement agreed upon by both parties when a DAFS
Session is established. This includes client-initiated requests as well
as server-initiated back-control directives.

There is a simple rule for negotiating the endianness of data encoding:
the client chooses. After the encoding format is chosen, neither party
can change it during the lifetime of the Session. Moreover, because it
is possible for the server to cache replies to a client's recent
state-modifying requests for recovery purposes, it is strongly
encouraged that clients request the same endianness when establishing
Sessions to be used to reissue requests of a previous failed Session.
In practice this is not expected to be an issue, because clients are
expected to always prefer one encoding over the other.

Rationale: Unlike other network-based file access protocols, DAFS can
enable extremely low overhead client access. The server also benefits
from a reduction in protocol processing, but still has work to do to
simply service the request. Because of this asymmetry in processing
overhead, DAFS biases in favor of low-overhead clients by letting the
client specify the endianness of the in-memory data structures.
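To make the octet-count rule for dafs_utf8string and the effect of the
Session byte order concrete, here is a minimal Python sketch. The
function names and the "little"/"big" labels are hypothetical, and the
sketch ignores heap alignment and padding entirely.

```python
import struct

# Map an assumed Session byte-order label to a struct prefix.
BYTE_ORDER_PREFIX = {"little": "<", "big": ">"}


def encode_utf8string(text, byte_order):
    """Encode a dafs_utf8string as a dafs_uint32 octet count followed by
    the UTF-8 octets. The count is in octets, not characters."""
    octets = text.encode("utf-8")
    count = struct.pack(BYTE_ORDER_PREFIX[byte_order] + "I", len(octets))
    return count + octets


def decode_utf8string(data, byte_order):
    """Decode the count using the Session byte order; the octets
    themselves are not byte-swapped."""
    (count,) = struct.unpack_from(BYTE_ORDER_PREFIX[byte_order] + "I", data, 0)
    return data[4:4 + count].decode("utf-8")
```

A three-character name containing one two-octet character therefore
carries a count of 4, and only the count field changes between the two
byte orders.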
6.1.1.6. Internationalization Support

The encoding/representation of strings brings up the issue of
internationalization support in the protocol. DAFS requires the use of
UTF-8 encoding for strings, including file names. The following
paragraph is taken from the NFS Version 4 specification and applies to
the DAFS protocol as well:

   "The primary issue in which NFS needs to deal with
   internationalization, or I18n, is with respect to file names and
   other strings as used within the protocol. The choice of string
   representation must allow reasonable name/string access to clients
   which use various languages. The UTF-8 encoding of the UCS as
   defined by [ISO10646] allows for this type of access and follows the
   policy described in 'IETF Policy on Character Sets and Languages',
   [RFC2277]." (RFC 3010, p. 91)

6.1.2. Request Header

All DAFS requests begin with the following header:

   struct DAFS_Request_Header {
       dafs_uint32            header_magic;
       dafs_uint32            protocol_version;
       dafs_uint16            desired_nreq;
       dafs_uint16            chain_flags;
       dafs_uint16            stream_id;
       dafs_uint16            seq_number;
       dafs_opaque64          analyzer;
       dafs_checksum_type     message_checksum;
       dafs_cred_handle_type  cred_handle;
       dafs_uint32            procedure;
       dafs_uint32            request_len;
   };

Fields:

header_magic

   The magic sequence 0x44 0x41 0x46 0x53 ('D' 'A' 'F' 'S') is used to
   mark each message header. It also determines the endianness of the
   message. The first transmitted message on a Session determines the
   endianness for a Session, and all subsequent messages MUST use the
   same endianness. This can be used as a sanity check and to aid
   identification of DAFS-related packets on bus analyzers, etc.

protocol_version

   The protocol version used by the client for the connection is
   specified in the message header. Once a Session is created, all
   messages exchanged on the Session MUST specify the same protocol
   version. If the server does not accept this version of the protocol,
   but does accept another version of the DAFS protocol, it will
   respond to the client with a message header containing the DAFS
   header_magic field, and a protocol version the server will accept.
   The client can then retry the DAFS connect operation with that
   protocol version, or any other that it wishes to try. The client can
   also determine the protocol version supported by the server through
   the use of the DAFS name service.

desired_nreq

   Flow control field as described in 3.2.6., "Message Flow Control".

chain_flags

   Chaining flags for this request as described in 4.3., "Request
   Chaining".

stream_id

   Slot number portion of the transaction ID for this request. The
   responder grants to the requester some number of requests, OPNreq,
   which can be simultaneously outstanding at any given time. The
   client manages this pool of requests as a collection of mail slots
   or mailboxes, and guarantees not to submit a request to a specific
   slot before receiving the result from a request previously submitted
   to that slot. Stream_id MUST be between 0 and OPNreq - 1. See
   5.2.1., "Response Cache" for more information.

seq_number

   The rest of the transaction ID for this request. This is a sequence
   number that is incremented for each request sent on a given slot.
   The combined value of stream_id and seq_number serves to identify
   requests in the Response Cache for the purposes of failure recovery.
   For more information, see 5.2.1., "Response Cache".

analyzer

   The server can make no assumption about the value of this field, but
   MUST simply return it in the response message for this request.

   Note: The analyzer field is opaque to the DAFS server. It is simply
   returned in each response message associated with a given request.
   The client MAY store in this field any information that it deems
   helpful when the response is received.
message_checksum

   Room for an optional checksum, if the use of one has been
   negotiated, as described in 3.2.5., "Checksums".

cred_handle

   Credentials to be used for authorizing the request as described in
   3.1.1., "Security Model".

procedure

   Procedure number for this request, as defined below.

request_len

   Length in bytes of the entire request, including the header.

6.1.3. Response Header

All DAFS responses begin with the following header:

   struct DAFS_Response_Header {
       dafs_uint32         header_magic;
       dafs_uint32         protocol_version;
       dafs_uint16         target_nreq;
       dafs_uint16         spec_cond;
       dafs_uint16         stream_id;
       dafs_uint16         seq_number;
       dafs_opaque64       analyzer;
       dafs_checksum_type  message_checksum;
       dafs_uint32         status;
       dafs_uint32         response_len;
       dafs_uint32         reserved;
   };

Fields:

header_magic

   The magic sequence 0x44 0x41 0x46 0x52 ('D' 'A' 'F' 'R') is used to
   mark each reply message header. This can be used as a sanity check
   and to aid in the identification of DAFS-related packets on bus
   analyzers, etc.

protocol_version

   Protocol version of the packet. For messages on a successfully
   created Session, this is the same as the protocol_version that was
   received in the request. In a reply to a connection request that has
   failed because of a protocol mismatch, this field contains the next
   lowest numbered protocol version supported by the server, or zero if
   no such alternative is supported.

target_nreq

   Flow control field as described in 3.2.6., "Message Flow Control".

spec_cond

   Flags for special conditions that the client SHOULD note and act
   upon. The server SHOULD set all undefined flags to zero. The flags
   currently defined follow:

   o DAFS_SPCOND_LMOVED (0x0001): Indicates that a lease for this
     client has been migrated to a new server and that the client
     SHOULD renew its leases at that new location.

   o DAFS_SPCOND_FAILOVER (0x0002): Indicates that a value for the
     failover_locations attribute previously fetched by the client has
     changed and that the client SHOULD fetch new values for all file
     systems.

stream_id

   Field value copied from the request header to which this response
   pertains.

seq_number

   Field value copied from the request header to which this response
   pertains.

analyzer

   The server can make no assumption about the value of this field, but
   MUST simply copy it from the request message for this response.

message_checksum

   Room for an optional checksum, if the use of one has been
   negotiated, as described in 3.2.5., "Checksums".

status

   Result code for the operation.

response_len

   Length in bytes of the entire response, including the header.

reserved

   Reserved for future use. This reserved field forces the message
   header to a multiple of 8 bytes, ensuring that operation headers
   will be aligned on 8-byte boundaries.

6.1.4. Basic Types

This section defines some basic DAFS types that will be used when
describing packet formats for DAFS requests and responses.

   typedef dafs_uint64    dafs_attr_bitmap_type;
   typedef dafs_opaque64  dafs_session_id_type;
   typedef dafs_opaque64  dafs_client_id_type;
   typedef dafs_opaque64  dafs_state_id_type;
   typedef dafs_opaque32  dafs_cred_handle_type;
   typedef dafs_uint32    dafs_status_type;

   typedef struct dafs_checksum {
       dafs_uint16 S2;
       dafs_uint16 S1;
   } dafs_checksum_type;

   typedef dafs_uint32          dafs_memhandle_type;
   typedef dafs_memhandle_type  dafs_rmr_context_type;
   typedef dafs_uint64          dafs_rmr_target_address_type;

An RMR Context identifies a virtually contiguous buffer that other
systems can read from or write to using RDMA capabilities.
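The response header layout in 6.1.3 can be exercised with a short
Python sketch. This is a hedged illustration rather than a reference
encoder: the function names are hypothetical, the magic octets are
assumed to appear on the wire in the order listed, the two checksum
halves are packed as plain 16-bit fields, and the analyzer is treated
as eight raw bytes that are never byte-swapped.

```python
import struct

DAFS_SPCOND_LMOVED = 0x0001
DAFS_SPCOND_FAILOVER = 0x0002

# Field order follows struct DAFS_Response_Header: magic, version,
# target_nreq, spec_cond, stream_id, seq_number, analyzer,
# checksum S2, checksum S1, status, response_len, reserved.
_LAYOUT = "4sIHHHH8sHHIII"
_NAMES = ("header_magic", "protocol_version", "target_nreq", "spec_cond",
          "stream_id", "seq_number", "analyzer", "checksum_s2",
          "checksum_s1", "status", "response_len", "reserved")


def pack_response_header(byte_order, protocol_version, target_nreq,
                         spec_cond, stream_id, seq_number, analyzer,
                         status, response_len):
    """Pack a response header; byte_order is '<' or '>' as negotiated."""
    return struct.pack(byte_order + _LAYOUT, b"DAFR", protocol_version,
                       target_nreq, spec_cond, stream_id, seq_number,
                       analyzer, 0, 0, status, response_len, 0)


def unpack_response_header(byte_order, data):
    """Return the header fields as a dict keyed by field name."""
    values = struct.unpack_from(byte_order + _LAYOUT, data, 0)
    return dict(zip(_NAMES, values))


def special_conditions(spec_cond):
    """List the names of the defined flags set in spec_cond."""
    flags = []
    if spec_cond & DAFS_SPCOND_LMOVED:
        flags.append("DAFS_SPCOND_LMOVED")
    if spec_cond & DAFS_SPCOND_FAILOVER:
        flags.append("DAFS_SPCOND_FAILOVER")
    return flags
```

Under these assumptions the packed header is 40 bytes long, which is
consistent with the statement that the reserved field brings the header
to a multiple of 8 bytes.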
   typedef dafs_opaque64           dafs_FSHandle_type[2];
   typedef dafs_opaque64           dafs_verifier_type;
   typedef dafs_utf8string         dafs_component_type;      /* heap */
   typedef dafs_component_type<>   dafs_pathname_type;       /* heap */
   typedef dafs_opaque8<>          dafs_lockowner_type;      /* heap */
   typedef dafs_opaque8<>          dafs_client_string_type;  /* heap */
   typedef dafs_utf8string         dafs_fencing_id_type;     /* heap */
   typedef dafs_fencing_id_type<>  dafs_fence_array_type;

   typedef struct dafs_filehandle {
       dafs_FSHandle_type fshandle;
       dafs_opaque64      fileid[6];
   } dafs_filehandle_type;

   typedef struct dafs_fs_location {
       dafs_utf8string    server;     /* heap */
       dafs_pathname_type root_path;  /* heap */
   } dafs_fs_location_type;

   typedef struct dafs_fs_locations {
       dafs_pathname_type    fs_root;      /* heap */
       dafs_fs_location_type locations<>;  /* heap */
   };

   typedef struct dafs_ace_type {
       dafs_uint32     type;
       dafs_uint32     flag;
       dafs_uint32     access_mask;
       dafs_utf8string who;  /* heap */
   } dafs_ace_type;

The bitmask values for the type field above:

   #define DAFS_ACE_ACCESS_ALLOWED_ACE_TYPE     0x00000000
   #define DAFS_ACE_ACCESS_DENIED_ACE_TYPE      0x00000001
   #define DAFS_ACE_SYSTEM_AUDIT_ACE_TYPE       0x00000002
   #define DAFS_ACE_SYSTEM_ALARM_ACE_TYPE       0x00000004

The bitmask values for the flag field above:

   #define DAFS_ACE_FILE_INHERIT_ACE            0x00000001
   #define DAFS_ACE_DIRECTORY_INHERIT_ACE       0x00000002
   #define DAFS_ACE_NO_PROPAGATE_INHERIT_ACE    0x00000004
   #define DAFS_ACE_INHERIT_ONLY_ACE            0x00000008
   #define DAFS_ACE_SUCCESSFUL_ACCESS_ACE_FLAG  0x00000010
   #define DAFS_ACE_FAILED_ACCESS_ACE_FLAG      0x00000020
   #define DAFS_ACE_IDENTIFIER_GROUP            0x00000040

The values for the access_mask field above:

   #define DAFS_ACE_READ_DATA                   0x00000001
   #define DAFS_ACE_LIST_DIRECTORY              0x00000001
   #define DAFS_ACE_WRITE_DATA                  0x00000002
   #define DAFS_ACE_ADD_FILE                    0x00000002
   #define DAFS_ACE_APPEND_DATA                 0x00000004
   #define DAFS_ACE_ADD_SUBDIRECTORY            0x00000004
   #define DAFS_ACE_READ_NAMED_ATTRS            0x00000008
   #define DAFS_ACE_WRITE_NAMED_ATTRS           0x00000010
   #define DAFS_ACE_EXECUTE                     0x00000020
   #define DAFS_ACE_DELETE_CHILD                0x00000040
   #define DAFS_ACE_READ_ATTRIBUTES             0x00000080
   #define DAFS_ACE_WRITE_ATTRIBUTES            0x00000100
   #define DAFS_ACE_DELETE                      0x00010000
   #define DAFS_ACE_READ_ACL                    0x00020000
   #define DAFS_ACE_WRITE_ACL                   0x00040000
   #define DAFS_ACE_WRITE_OWNER                 0x00080000
   #define DAFS_ACE_SYNCHRONIZE                 0x00100000

   typedef struct dafs_specdata {
       dafs_uint64 specdata1;
       dafs_uint64 specdata2;
   } dafs_specdata_type;

   typedef struct dafs_change_info {
       dafs_uint64 before;
       dafs_uint64 after;
       dafs_uint32 atomic;
       dafs_uint32 pad;
   } dafs_change_info_type;

Servers set the atomic field to TRUE if they can modify a file object
and obtain the before and after times atomically.

   typedef struct dafs_time_type {
       dafs_int64  seconds;
       dafs_uint32 nseconds;
   } dafs_time_type;

A positive value in the seconds field refers to times after the 0-hour
January 1, 1970 UTC (Coordinated Universal Time). Negative seconds
refer to times before the 0-hour January 1, 1970 UTC.

   enum dafs_timeset_how {
       SET_TO_SERVER_TIME = 1,
       SET_TO_CLIENT_TIME = 2
   };

   typedef struct dafs_settime {
       dafs_time_type        client_time;  /* CLIENT TIME */
       enum dafs_timeset_how how;
   } dafs_settime_type;

6.1.5. File Attributes

The DAFS attributes format definition provides for selective support
and retrieval of a subset of attributes and for extending the number of
supported attributes in a straightforward manner. The DAFS definition
separates attributes into two subsets: file attributes and file system
attributes.

File attributes that are labeled mandatory MUST be supported by all
DAFS server implementations. A DAFS server MAY support non-mandatory
attributes and a DAFS client MUST NOT rely on a server implementing any
of these attributes.
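The access_mask values listed in 6.1.4 are single-bit flags, several of
which share a value between a file meaning and a directory meaning. The
Python sketch below shows one way a client might name the bits set in a
mask. It only restates the defined constants; the helper names are
hypothetical and no ACL evaluation semantics are implied.

```python
# Bit values copied from the access_mask definitions in 6.1.4. Where a
# file meaning and a directory meaning share a bit, both names appear.
ACCESS_MASK_BITS = {
    0x00000001: "DAFS_ACE_READ_DATA / DAFS_ACE_LIST_DIRECTORY",
    0x00000002: "DAFS_ACE_WRITE_DATA / DAFS_ACE_ADD_FILE",
    0x00000004: "DAFS_ACE_APPEND_DATA / DAFS_ACE_ADD_SUBDIRECTORY",
    0x00000008: "DAFS_ACE_READ_NAMED_ATTRS",
    0x00000010: "DAFS_ACE_WRITE_NAMED_ATTRS",
    0x00000020: "DAFS_ACE_EXECUTE",
    0x00000040: "DAFS_ACE_DELETE_CHILD",
    0x00000080: "DAFS_ACE_READ_ATTRIBUTES",
    0x00000100: "DAFS_ACE_WRITE_ATTRIBUTES",
    0x00010000: "DAFS_ACE_DELETE",
    0x00020000: "DAFS_ACE_READ_ACL",
    0x00040000: "DAFS_ACE_WRITE_ACL",
    0x00080000: "DAFS_ACE_WRITE_OWNER",
    0x00100000: "DAFS_ACE_SYNCHRONIZE",
}


def describe_access_mask(mask):
    """Return the names of the defined bits set in mask, lowest bit first."""
    return [name for bit, name in sorted(ACCESS_MASK_BITS.items())
            if mask & bit]


def undefined_bits(mask):
    """Return the bits of mask that have no definition in the list above."""
    known = 0
    for bit in ACCESS_MASK_BITS:
        known |= bit
    return mask & ~known
```

Because DAFS_ACE_READ_DATA and DAFS_ACE_LIST_DIRECTORY are the same
bit, the object type determines which meaning applies.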
The goal of the attribute encoding proposed here is to consume
bandwidth only for the attributes requested. When fetching attributes,
a client can request a subset of all possible attributes. Note that by
virtue of using joins, the size of the file attributes structure is
variable and, therefore, attributes are placed in the variable section
of packet definitions.

The server is REQUIRED to return a packet formatted to contain all
requested attributes. If the server does not support or is unable to
supply a requested attribute, it MUST include the space for that
attribute in the response and indicate that it is invalid in the valid
attributes field of the response. The "blank" fields for
requested-but-unsupported attributes are included to enable the client
to quickly locate its requested attributes using precomputed offsets
based on the sizes of the attribute fields it requested.

The DAFS file attributes definitions follow. Attribute 1 is represented
by the least significant bit in the attributes bit map, and each
subsequent attribute is represented by the next more significant bit.
The server MUST support those labeled "mandatory."

DAFS_FATTR_NAMED_ATTR (1)
   Boolean that indicates server support of named attributes.

DAFS_FATTR_ARCHIVE (2)
   Boolean that indicates whether a file has been archived (backed-up)
   since the time of last modification. This attribute is writable by
   the client.

DAFS_FATTR_HIDDEN (3)
   Determines if file is hidden for Win32 purposes. This attribute is
   writable by the client.

DAFS_FATTR_SYSTEM (4)
   Whether file object is a system file object for Win32 purposes. This
   attribute is writable by the client.

DAFS_FATTR_OBJECT_TYPE (5)
   File object type. Support for this file attribute is mandatory. See
   6.5.9., "DAFS_PROC_CREATE" for supported types.

DAFS_FATTR_MODE (6)
   Access mode bits - Unix style. This attribute is writable by the
   client.
DAFS_FATTR_NUM_LINKS (7)
   Number of links pointing to file object.

DAFS_FATTR_CHANGE (8)
   Server-generated value that changes any time the file object
   changes. A server MAY use the modification time if the granularity
   is appropriate. Support for this file attribute is mandatory.

DAFS_FATTR_OBJECT_SIZE (9)
   Size, in bytes, of file object. This attribute is writable by the
   client. Support for this file attribute is mandatory.

DAFS_FATTR_FILE_ID (10)
   Unique id for this file object. Id is unique within the same
   FSHandle space. Support for this file attribute is mandatory.

DAFS_FATTR_SPACE_USED (11)
   File system bytes allocated to this file object.

DAFS_FATTR_TIME_ACCESS (12)
   The last time this object was accessed.

DAFS_FATTR_TIME_ACCESS_SET (13)
   Set the time last accessed to this value. Client write-only
   attribute.

DAFS_FATTR_TIME_BACKUP (14)
   Last time this object was backed-up. This attribute is writable by
   the client.

DAFS_FATTR_TIME_CREATE (15)
   Creation time of this object. This is not the Unix-style c-time.
   This attribute is writable by the client.

DAFS_FATTR_TIME_DELTA (16)
   Time granularity supported by server.

DAFS_FATTR_TIME_METADATA (17)
   Last time this object's metadata changed. This attribute is writable
   by the client.

DAFS_FATTR_TIME_MODIFY (18)
   Last time this object's contents were modified.

DAFS_FATTR_TIME_MODIFY_SET (19)
   Set the time last modified to this value. Client write-only
   attribute.

DAFS_FATTR_RAW_DEV (20)
   Identifier for raw devices.

DAFS_FATTR_FILEHANDLE (21)
   Filehandle for this object. Support for this file attribute is
   mandatory.

DAFS_FATTR_ACL (22)
   Access control list for this object. This attribute is writable by
   the client.

DAFS_FATTR_MIME_TYPE (23)
   MIME body type/subtype. This attribute is writable by the client.

DAFS_FATTR_OWNER (24)
   The owner of this object.
   This attribute is writable by the client.

DAFS_FATTR_OWNER_GROUP (25)
   This object's owner's group. This attribute is writable by the
   client.

The "bitset" pseudo-function is used to define the specification of the
file attributes structure. Bitset(x, y) is TRUE if the attribute bit
"y" is set in the attribute bitmap "x".

   typedef struct file_attr {
       attr_bitmap_type included;
       attr_bitmap_type valid;
       join switch (bitset(included, DAFS_FATTR_NAMED_ATTR)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } named_attributes;
       join switch (bitset(included, DAFS_FATTR_ARCHIVE)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } archive;
       join switch (bitset(included, DAFS_FATTR_HIDDEN)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } hidden;
       join switch (bitset(included, DAFS_FATTR_SYSTEM)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } system;
       join switch (bitset(included, DAFS_FATTR_OBJECT_TYPE)) {
           case TRUE:  dafs_uint32 contents;
           case FALSE: void;
       } object_type;
       join switch (bitset(included, DAFS_FATTR_MODE)) {
           case TRUE:  dafs_uint32 contents;
           case FALSE: void;
       } mode;
       join switch (bitset(included, DAFS_FATTR_NUM_LINKS)) {
           case TRUE:  dafs_uint32 contents;
           case FALSE: void;
       } num_links;
       join switch (bitset(included, DAFS_FATTR_CHANGE)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } change;
       join switch (bitset(included, DAFS_FATTR_OBJECT_SIZE)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } object_size;
       join switch (bitset(included, DAFS_FATTR_FILE_ID)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } file_id;
       join switch (bitset(included, DAFS_FATTR_SPACE_USED)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } space_used;
       join switch (bitset(included, DAFS_FATTR_TIME_ACCESS)) {
           case TRUE:  dafs_time_type
                       time;
           case FALSE: void;
       } time_access;
       join switch (bitset(included, DAFS_FATTR_TIME_ACCESS_SET)) {
           case TRUE:  dafs_settime_type settime;
           case FALSE: void;
       } time_access_set;
       join switch (bitset(included, DAFS_FATTR_TIME_BACKUP)) {
           case TRUE:  dafs_time_type time;
           case FALSE: void;
       } time_backup;
       join switch (bitset(included, DAFS_FATTR_TIME_CREATE)) {
           case TRUE:  dafs_time_type time;
           case FALSE: void;
       } time_create;
       join switch (bitset(included, DAFS_FATTR_TIME_DELTA)) {
           case TRUE:  dafs_time_type time;
           case FALSE: void;
       } time_delta;
       join switch (bitset(included, DAFS_FATTR_TIME_METADATA)) {
           case TRUE:  dafs_time_type time;
           case FALSE: void;
       } time_metadata;
       join switch (bitset(included, DAFS_FATTR_TIME_MODIFY)) {
           case TRUE:  dafs_time_type time;
           case FALSE: void;
       } time_modify;
       join switch (bitset(included, DAFS_FATTR_TIME_MODIFY_SET)) {
           case TRUE:  dafs_settime_type settime;
           case FALSE: void;
       } time_modify_set;
       join switch (bitset(included, DAFS_FATTR_RAW_DEV)) {
           case TRUE:  dafs_specdata_type contents;
           case FALSE: void;
       } specdata;
       join switch (bitset(included, DAFS_FATTR_FILEHANDLE)) {
           case TRUE:  dafs_filehandle_type contents;
           case FALSE: void;
       } filehandle;
       join switch (bitset(included, DAFS_FATTR_ACL)) {
           case TRUE:  dafs_ace_type acl<>;
           case FALSE: void;
       } acl;
       join switch (bitset(included, DAFS_FATTR_MIME_TYPE)) {
           case TRUE:  dafs_utf8string mimetype;     /* heap */
           case FALSE: void;
       } mime_type;
       join switch (bitset(included, DAFS_FATTR_OWNER)) {
           case TRUE:  dafs_utf8string owner;        /* heap */
           case FALSE: void;
       } owner;
       join switch (bitset(included, DAFS_FATTR_OWNER_GROUP)) {
           case TRUE:  dafs_utf8string owner_group;  /* heap */
           case FALSE: void;
       } owner_group;
   } dafs_file_attr_type;

6.1.6. File System Attributes

The DAFS file system attributes definitions follow.
The same encoding described for DAFS file attributes applies to file
system attributes.

DAFS_FSATTR_LINK_SUPPORT (1)
   Denotes server support for hard links on this file system. Support
   for this file system attribute is mandatory.

DAFS_FSATTR_SYMLINK_SUPPORT (2)
   Denotes server support for symbolic links on this file system.
   Support for this file system attribute is mandatory.

DAFS_FSATTR_CAN_SET_TIME (3)
   Denotes server support for setting the access/modify times of a
   file object.

DAFS_FSATTR_CASE_INSENSITIVE (4)
   If TRUE, file names on a server are case insensitive. FALSE
   otherwise.

DAFS_FSATTR_CASE_PRESERVING (5)
   If TRUE, server preserves file name case. FALSE otherwise.

DAFS_FSATTR_CHOWN_RESTRICTED (6)
   Denotes whether the server restricts setting of the owner and
   owner group file attributes by non-privileged users.

DAFS_FSATTR_HOMOGENEOUS (7)
   TRUE if all objects in a file system have the same file system
   attribute values. FALSE otherwise.

DAFS_FSATTR_NO_TRUNC (8)
   Whether a server truncates, or rejects with an error, file names
   that exceed the server's maximum supported length.

DAFS_FSATTR_UNIQUE_HANDLE (9)
   Whether server guarantees that a file object is always represented
   by the same unique handle.

DAFS_FSATTR_LEASE_TIME (10)
   The locking lease time. This value is in seconds. Since any lease
   renewal message renews all of the client's leases on the receiving
   DAFS server, this value SHOULD be the same for all file systems
   provided by a single DAFS server. Support for this file system
   attribute is mandatory.

DAFS_FSATTR_RD_ATTR_ERROR (11)
   Error a server returns when a failure to obtain attributes during a
   DAFS_PROC_READDIR is encountered. Support for this file system
   attribute is mandatory.

DAFS_FSATTR_ACL_SUPPORT (12)
   ACL types supported by the server.

DAFS_FSATTR_MAX_LINK (13)
   Maximum number of hard links to a file object allowed.
DAFS_FSATTR_MAX_NAME (14)
   Maximum number of characters allowed in a file object's name.

DAFS_FSATTR_SUPPORTED_FATTR (15)
   Bitmap representing the supported file attributes in this server.
   Support for this file system attribute is mandatory.

DAFS_FSATTR_SUPPORTED_FSATTR (16)
   Bitmap representing the supported file system attributes in this
   server. Support for this file system attribute is mandatory.

DAFS_FSATTR_FILES_AVAILABLE (17)
   Number of files available on this file system for use by the user
   issuing this request.

DAFS_FSATTR_FILES_FREE (18)
   Number of free files in this file system.

DAFS_FSATTR_FILES_TOTAL (19)
   Total number of files in this file system.

DAFS_FSATTR_MAX_FILE_SIZE (20)
   File system's maximum file size, in bytes.

DAFS_FSATTR_MAX_READ (21)
   Maximum read size allowed for objects in this file system.

DAFS_FSATTR_MAX_WRITE (22)
   Maximum write size allowed for objects in this file system.

DAFS_FSATTR_QUOTA_HARD (23)
   Available space in bytes that MAY be allocated to this file object
   before allocations are refused. This space is not specifically
   reserved for this object and MAY be allocated to other objects in
   this file system that by some rule belong to a common set.

DAFS_FSATTR_QUOTA_SOFT (24)
   Available space in bytes that MAY be allocated to this file object
   before a warning is issued. This space is not specifically reserved
   for this object and MAY be allocated to other objects in this file
   system that by some rule belong to a common set.

DAFS_FSATTR_QUOTA_USED (25)
   Disk space, in bytes, used by this file object and possibly others
   in a set that share the space reported in DAFS_FSATTR_QUOTA_HARD.

DAFS_FSATTR_SPACE_AVAILABLE (26)
   Amount of space, in bytes, available in this file system.

DAFS_FSATTR_SPACE_FREE (27)
   Number of bytes in this file system that are free.
DAFS_FSATTR_SPACE_TOTAL (28)
   File system's total size, in bytes.

DAFS_FSATTR_FSHANDLE (29)
   The FSHandle associated with this file system. Support for this
   file system attribute is mandatory.

DAFS_FSATTR_FAILOVER_LOCATIONS (30)
   List of alternate server locations that might serve this file
   system in the event of a server failure.

DAFS_FSATTR_MAX_APPEND (32)
   The maximum number of bytes that can be atomically appended to the
   end of a file. Any DAFS_PROC_APPEND_INLINE or
   DAFS_PROC_APPEND_DIRECT operation that specifies more bytes than
   DAFS_FSATTR_MAX_APPEND is returned in error without any data being
   added to the file. DAFS_FSATTR_MAX_APPEND MUST be set to 65536 or
   greater. Support for this file system attribute is mandatory.

DAFS_FSATTR_PREF_IO_SIZE (33)
   Server's preferred I/O size for this file system, in bytes.

DAFS_FSATTR_FH_EXPIRE_TYPE (34)
   Volatility of filehandles in this file system.

The bitmap values for the DAFS_FSATTR_ACL_SUPPORT attribute are:

   #define DAFS_ACL_SUPPORT_ALLOW       0x00000001
   #define DAFS_ACL_SUPPORT_DENY        0x00000002
   #define DAFS_ACL_SUPPORT_AUDIT       0x00000004
   #define DAFS_ACL_SUPPORT_ALARM       0x00000008

The bitmap values for the DAFS_FSATTR_FH_EXPIRE_TYPE attribute are:

   #define DAFS_FH_PERSISTENT           0x00000000
   #define DAFS_FH_NOEXPIRE_WITH_OPEN   0x00000001
   #define DAFS_FH_VOLATILE_ANY         0x00000002
   #define DAFS_FH_VOL_RENAME           0x00000008

The filesys_attr_type structure definition follows.
   typedef struct filesys_attr {
       attr_bitmap_type included;
       attr_bitmap_type valid;
       join switch (bitset(included, DAFS_FSATTR_LINK_SUPPORT)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } link_support;
       join switch (bitset(included, DAFS_FSATTR_SYMLINK_SUPPORT)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } symlink_support;
       join switch (bitset(included, DAFS_FSATTR_CAN_SET_TIME)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } can_set_time;
       join switch (bitset(included, DAFS_FSATTR_CASE_INSENSITIVE)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } case_insensitive;
       join switch (bitset(included, DAFS_FSATTR_CASE_PRESERVING)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } case_preserving;
       join switch (bitset(included, DAFS_FSATTR_CHOWN_RESTRICTED)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } chown_restricted;
       join switch (bitset(included, DAFS_FSATTR_HOMOGENEOUS)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } homogeneous;
       join switch (bitset(included, DAFS_FSATTR_NO_TRUNC)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } no_trunc;
       join switch (bitset(included, DAFS_FSATTR_UNIQUE_HANDLE)) {
           case TRUE:  dafs_boolean contents;
           case FALSE: void;
       } unique_handle;
       join switch (bitset(included, DAFS_FSATTR_LEASE_TIME)) {
           case TRUE:  dafs_uint32 contents;
           case FALSE: void;
       } lease_time;
       join switch (bitset(included, DAFS_FSATTR_RD_ATTR_ERROR)) {
           case TRUE:  dafs_uint32 contents;
           case FALSE: void;
       } rd_attr_error;
       join switch (bitset(included, DAFS_FSATTR_ACL_SUPPORT)) {
           case TRUE:  dafs_uint32 contents;
           case FALSE: void;
       } acl_support;
       join switch (bitset(included, DAFS_FSATTR_MAX_LINK)) {
           case TRUE:  dafs_uint32 contents;
           case FALSE: void;
       } max_link;
       join switch (bitset(included,
                           DAFS_FSATTR_MAX_NAME)) {
           case TRUE:  dafs_uint32 contents;
           case FALSE: void;
       } max_name;
       join switch (bitset(included, DAFS_FSATTR_SUPPORTED_FATTR)) {
           case TRUE:  dafs_attr_bitmap_type contents;
           case FALSE: void;
       } supported_fattr;
       join switch (bitset(included, DAFS_FSATTR_SUPPORTED_FSATTR)) {
           case TRUE:  dafs_attr_bitmap_type contents;
           case FALSE: void;
       } supported_fsattr;
       join switch (bitset(included, DAFS_FSATTR_FILES_AVAILABLE)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } files_available;
       join switch (bitset(included, DAFS_FSATTR_FILES_FREE)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } files_free;
       join switch (bitset(included, DAFS_FSATTR_FILES_TOTAL)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } files_total;
       join switch (bitset(included, DAFS_FSATTR_MAX_FILE_SIZE)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } max_file_size;
       join switch (bitset(included, DAFS_FSATTR_MAX_READ)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } max_read;
       join switch (bitset(included, DAFS_FSATTR_MAX_WRITE)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } max_write;
       join switch (bitset(included, DAFS_FSATTR_QUOTA_HARD)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } quota_hard;
       join switch (bitset(included, DAFS_FSATTR_QUOTA_SOFT)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } quota_soft;
       join switch (bitset(included, DAFS_FSATTR_QUOTA_USED)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } quota_used;
       join switch (bitset(included, DAFS_FSATTR_SPACE_AVAILABLE)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } space_available;
       join switch (bitset(included, DAFS_FSATTR_SPACE_FREE)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } space_free;
       join switch (bitset(included, DAFS_FSATTR_SPACE_TOTAL)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       }
         space_total;
       join switch (bitset(included, DAFS_FSATTR_FSHANDLE)) {
           case TRUE:  dafs_FSHandle_type contents;
           case FALSE: void;
       } fshandle;
       join switch (bitset(included, DAFS_FSATTR_FAILOVER_LOCATIONS)) {
           case TRUE:  dafs_fs_locations_type failoverlocations<>;
           case FALSE: void;
       } failover_locations;
       join switch (bitset(included, DAFS_FSATTR_MAX_APPEND)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } max_append;
       join switch (bitset(included, DAFS_FSATTR_PREF_IO_SIZE)) {
           case TRUE:  dafs_uint64 contents;
           case FALSE: void;
       } pref_io_size;
       join switch (bitset(included, DAFS_FSATTR_FH_EXPIRE_TYPE)) {
           case TRUE:  dafs_uint32 contents;
           case FALSE: void;
       } fh_expire_type;
   } dafs_filesys_attr_type;

6.1.7. Direct Operations

The DAFS direct read and write operations use the following common
structure.

   typedef struct dafs_direct_op_buffer {
       dafs_uint64         buffer_address;
       dafs_uint32         buffer_byte_count;
       dafs_memhandle_type buffer_handle;
   } dafs_direct_op_buffer_type;

   typedef dafs_direct_op_buffer_type<> dafs_dob_array_type;

   typedef struct dafs_file_chunk {
       dafs_uint64          offset;
       dafs_uint32          byte_count;
       dafs_cache_hint_type cache_hint;
   } dafs_file_chunk_type;

   typedef dafs_file_chunk_type<> dafs_chunk_array_type;

   typedef struct dafs_read_write_request {
       dafs_filehandle_type  filehandle;
       dafs_uint64           request_id;       /* per session */
       dafs_state_id_type    state_id;
       dafs_rw_flag          rw_flag;
       dafs_chunk_array_type chunks;
       dafs_dob_array_type   request_buffers;
       dafs_checksum_type    direct_checksum;
   } dafs_rw_request_type;

   typedef dafs_rw_request_type<> dafs_rw_request_array_type;

   typedef struct dafs_completion_notification {
       dafs_uint64        request_id;
       dafs_status_type   status;
       dafs_uint32        bytes_transferred;
       dafs_checksum_type direct_checksum;
       dafs_uint32        pad;
   } dafs_completion_notification_type;

   typedef dafs_completion_notification_type<>
       dafs_completion_array_type;

6.1.8. Cache Hints

The DAFS I/O and cache hints operations define the following common
structures.

   typedef dafs_uint32 dafs_access_pattern_type;

   #define DAFS_CACHE_HINT_NORMAL      0
   #define DAFS_CACHE_HINT_RANDOM      1
   #define DAFS_CACHE_HINT_SEQUENTIAL  2
   #define DAFS_CACHE_HINT_WILLNEED    3
   #define DAFS_CACHE_HINT_DONTNEED    4

   typedef dafs_uint32 dafs_cache_hint_type;

   #define dafs_prefetch     0x01
   #define dafs_readhint_1   0x02
   #define dafs_readhint_2   0x04
   #define dafs_readhint_3   0x06
   #define dafs_readhint_4   0x08
   #define dafs_readhint_5   0x0A
   #define dafs_readhint_6   0x0C
   #define dafs_readhint_7   0x0E
   #define dafs_writehint_1  0x10
   #define dafs_writehint_2  0x20
   #define dafs_writehint_3  0x30
   #define dafs_writehint_4  0x40
   #define dafs_writehint_5  0x50
   #define dafs_writehint_6  0x60
   #define dafs_writehint_7  0x70

6.1.9. Authentication

The DAFS connection and authentication operations define the following
common structures.

   enum dafs_auth_type {
       DAFS_AUTH_NONE    = 0,
       DAFS_AUTH_TEXT    = 1,
       DAFS_AUTH_GSS     = 2,
       DAFS_AUTH_DEFAULT = 3
   };

   struct dafs_auth_text {
       dafs_utf8string auth_id;
       dafs_utf8string auth_password;
   };

   enum dafs_gss_procedure {
       DAFS_GSS_OP_INIT          = 1,
       DAFS_GSS_OP_CONTINUE_INIT = 2
   };

   enum dafs_gss_service {   /* GSS service used */
       DAFS_GSS_SVC_AUTH      = 1,
       DAFS_GSS_SVC_INTEGRITY = 2,
       DAFS_GSS_SVC_PRIVACY   = 3
   };

   typedef struct {
       enum dafs_gss_procedure procedure;
       enum dafs_gss_service   service;
       dafs_opaque8            token<>;  /* heap, gss context */
   } dafs_auth_gss;

   typedef union switch (enum dafs_auth_type auth_type) {
       case DAFS_AUTH_NONE:
           void;
       case DAFS_AUTH_TEXT:
           dafs_auth_text auth_text;
       case DAFS_AUTH_GSS:
           dafs_auth_gss auth_gss;
       case DAFS_AUTH_DEFAULT:
           void;
   } dafs_auth_req;

   typedef struct {
       dafs_uint32  gss_major;
       dafs_uint32  gss_minor;
       dafs_opaque8 gss_token<>;
                                 /* continue context */
   } dafs_auth_gss_res;

   typedef union switch (enum dafs_auth_type auth_type) {
       case DAFS_AUTH_NONE:
       case DAFS_AUTH_TEXT:
       case DAFS_AUTH_DEFAULT:
           void;
       case DAFS_AUTH_GSS:
           dafs_auth_gss_res auth_gss_res;
   } dafs_auth_res;

6.1.10. Procedures

DAFS defines the following operations with the associated procedure
numbers.

   #define DAFS_PROC_CLIENT_AUTH           100
   #define DAFS_PROC_CLIENT_CONNECT        101
   #define DAFS_PROC_CLIENT_CONNECT_AUTH   102
   #define DAFS_PROC_CONNECT_BIND          103
   #define DAFS_PROC_DISCONNECT            104
   #define DAFS_PROC_REGISTER_CRED         105
   #define DAFS_PROC_RELEASE_CRED          106
   #define DAFS_PROC_SECINFO               108
   #define DAFS_PROC_SERVER_AUTH           109
   #define DAFS_PROC_CHECK_RESPONSE        110
   #define DAFS_PROC_FETCH_RESPONSE        111
   #define DAFS_PROC_DISCARD_RESPONSES     112
   #define DAFS_PROC_ACCESS                113
   #define DAFS_PROC_CACHE_HINT            114
   #define DAFS_PROC_CLOSE                 115
   #define DAFS_PROC_COMMIT                116
   #define DAFS_PROC_CREATE                117
   #define DAFS_PROC_DELEGPURGE            118
   #define DAFS_PROC_DELEGRETURN           119
   #define DAFS_PROC_GET_FSATTR            122
   #define DAFS_PROC_GET_ROOT_HANDLE       123
   #define DAFS_PROC_GETATTR_INLINE        124
   #define DAFS_PROC_GETATTR_DIRECT        125
   #define DAFS_PROC_LINK                  126
   #define DAFS_PROC_LOCK                  127
   #define DAFS_PROC_LOCKT                 128
   #define DAFS_PROC_LOCKU                 129
   #define DAFS_PROC_LOOKUP                130
   #define DAFS_PROC_LOOKUPP               131
   #define DAFS_PROC_NULL                  132
   #define DAFS_PROC_NVERIFY               133
   #define DAFS_PROC_OPEN                  134
   #define DAFS_PROC_OPEN_DOWNGRADE        135
   #define DAFS_PROC_OPENATTR              136
   #define DAFS_PROC_READ_INLINE           137
   #define DAFS_PROC_READ_DIRECT           138
   #define DAFS_PROC_READDIR_INLINE        139
   #define DAFS_PROC_READDIR_DIRECT        140
   #define DAFS_PROC_READLINK_INLINE       141
   #define DAFS_PROC_READLINK_DIRECT       142
   #define DAFS_PROC_REMOVE                143
   #define DAFS_PROC_RENAME                144
   #define DAFS_PROC_SETATTR_INLINE        145
   #define DAFS_PROC_SETATTR_DIRECT        146
   #define DAFS_PROC_VERIFY                147
   #define DAFS_PROC_BATCH_SUBMIT          148
   #define DAFS_PROC_WRITE_INLINE          149
   #define DAFS_PROC_WRITE_DIRECT          150
   #define DAFS_PROC_BC_GETATTR            151
   #define DAFS_PROC_BC_NULL               152
   #define DAFS_PROC_BC_RECALL             153
   #define DAFS_PROC_BC_BATCH_COMPLETION   155
   #define DAFS_PROC_APPEND_INLINE         156
   #define DAFS_PROC_APPEND_DIRECT         157
   #define DAFS_PROC_GET_FENCING_LIST      158
   #define DAFS_PROC_SET_FENCING_LIST      159
   #define DAFS_PROC_HURRY_UP              160

6.2. Connection and Security Management

This section gives the message packet definitions for the connection
and security operations. A description of the features and rationale
associated with these operations can be found in 3.1.3., "Session
Operations".

6.2.1. DAFS_PROC_CLIENT_CONNECT

SUMMARY

Begins a DAFS Session from a client to a server, including negotiation
of basic protocol information and parameters for use of the DAFS
Operation Channel associated with the Session.

ARGUMENTS

   struct DAFS_Client_Connect_Args {
       dafs_uint32          use_checksums;
       dafs_uint32          use_response_cache;
       dafs_uint32          max_credentials;
       dafs_uint32          max_request_size;
       dafs_uint32          max_response_size;
       dafs_uint32          max_requests;
       dafs_uint32          inline_write_header_size;
       dafs_uint32          use_back_control_channel;
       dafs_uint32          use_rdma_read_channel;
       dafs_utf8string      fence_id_string;
       dafs_var_offset_type client_id_string;
       dafs_verifier_type   client_verifier;
   };

RESULTS

   struct DAFS_Client_Connect_Res {
       dafs_session_id_type session_id;
       dafs_client_id_type  client_id;
       dafs_uint32          use_checksums;
       dafs_uint32          use_response_cache;
       dafs_uint32          max_credentials;
       dafs_uint32          max_request_size;
       dafs_uint32          max_response_size;
       dafs_uint32          max_requests;
       dafs_uint32          inline_write_header_size;
       dafs_uint32          use_back_control_channel;
       dafs_uint32          use_rdma_read_channel;
   };

DESCRIPTION

Establish a DAFS Session from client to server, including establishment
of initial protocol settings.
The DAFS_PROC_CLIENT_CONNECT operation, the
DAFS_PROC_CLIENT_CONNECT_AUTH operation, or the DAFS_PROC_CONNECT_BIND
operation MUST be the first operation sent on a newly created
communication channel.

After successfully completing this operation, the DAFS_PROC_SECINFO
operation can be issued to discover which authentication mechanisms the
server supports. Then, the DAFS_PROC_CLIENT_AUTH operation can be used
to authenticate the client to the server. (The
DAFS_PROC_CLIENT_CONNECT_AUTH operation is available to combine the
connection and authentication operations in a single step if it is not
necessary to discover the server's authentication support mechanisms.)

The connection request specifies a client-id-string and client-verifier
so that the server can identify the Session as being associated with a
particular client instantiation.

The returned session_id field can be used to identify the Session and
the Response Cache associated with the Session in the event of
connection failure.

The returned Client-id is associated with the shared set of credentials
registered on any Session created using this Client-id. Also, the
Client-id can be used when interpreting lock information returned by
DAFS_PROC_LOCKT operations.

A client specifies desired options as part of the Session and Operation
Channel, and the server responds with the values to be used for the
Session and the Operation Channel.

o  Checksum data transferred in DAFS. If use_checksums is TRUE, then
   checksums will be generated and checked. If the client sets
   use_checksums to TRUE, then the server MUST return use_checksums set
   to TRUE.

o  Whether the server is maintaining a Response Cache for the Session.
   If use_response_cache is TRUE, then the server maintains the
   Response Cache.

o  Maximum number of credentials that can be associated with this
   client. This is set during the first Session that a client creates.
   The value is ignored on subsequent connection requests issued by the
   same client. Following the establishment of the first Session for a
   client, subsequent connection response messages will specify the
   same maximum credential value as the initial response. (Specify 0 to
   use server default.)

o  Maximum operation request size on this channel, in bytes (specify 0
   to use server default).

o  Maximum operation response size on this channel, in bytes (specify 0
   to use server default).

o  Maximum number of operation requests outstanding on this channel
   (specify 0 to use server default).

o  inline_write_header_size gives the padding amount, in bytes, between
   the end of a DAFS message header and the data being transferred by
   DAFS_PROC_WRITE_INLINE operations (specify 0 to use server default).

o  Specify whether a separate channel can be bound to the current
   Session for use in processing back-control messages from server to
   client. If use_back_control_channel is TRUE, then the client can
   establish an additional transport connection for that purpose (see
   DAFS_PROC_CONNECT_BIND operation).

o  Specify whether a separate channel can be bound to the current
   Session for use in processing RDMA read operations from server to
   client. If use_rdma_read_channel is TRUE, then the client can
   establish an additional transport connection for that purpose (see
   DAFS_PROC_CONNECT_BIND operation).

The DAFS server MAY accept the options as requested, or MAY respond
with different values. The server-determined option values are returned
as part of the response message. It is the client's responsibility to
check whether the server options are acceptable to it. As an example,
the client can request a maximum of 65536 outstanding requests, but the
server can respond with a limit of 512 due to resource restrictions.

The server returns DAFSERR_ILLEGAL_PROT if the protocol version
requested by the client is not supported.
The client SHOULD retry with a lower protocol version number.
Protocol_version contains a suggested protocol version to use that
would be supported. DAFSERR_ILLEGAL_STATE indicates that the protocol
has already been negotiated.

ERRORS

   DAFSERR_STATUS_OK
   DAFSERR_CHAIN_FORM
   DAFSERR_ILLEGAL_PROT
   DAFSERR_ILLEGAL_STATE
   DAFSERR_INVAL

6.2.2. DAFS_PROC_CLIENT_AUTH

SUMMARY

Authenticates the client to the server.

ARGUMENTS

   struct DAFS_Client_Auth_Args {
       dafs_auth_req auth_req;
   };

RESULTS

   struct DAFS_Client_Auth_Res {
       dafs_auth_res auth_res;
       dafs_boolean  trusted;
   };

DESCRIPTION

Establishes the primary DAFS authentication from client to server for a
DAFS Session.

Authentication is performed after the DAFS Session is created, but
prior to other DAFS operations, with the exception of the
DAFS_PROC_SECINFO operation, which can be used to determine which
security authentication mechanisms can be used with the server.

A response message containing a DAFSERR_NOT_AUTHENTICATED error will be
returned in response to any other request messages received when the
DAFS Session has not yet been authenticated.

DAFS servers MUST support at least one of the following authentication
methods:

DAFS_AUTH_NONE
   Authentication is not required. The client is trusted to provide
   credentials as needed.

DAFS_AUTH_TEXT
   Session is authenticated using an auth_id and clear text password.

DAFS_AUTH_GSS
   Session is authenticated using the GSS framework (see RFC 2743).

DAFS_AUTH_DEFAULT
   MAY be used for untrusted clients and indicates that the default
   credentials are to be used for the DAFS Session.

A DAFS server providing maximum security needs to support
DAFS_AUTH_GSS. If the DAFS server supports DAFS_AUTH_GSS, it MUST
identify itself in GSS-API via a GSS_C_NT_HOSTBASED_SERVICE name type.
GSS_C_NT_HOSTBASED_SERVICE names are of the form service@hostname.
For DAFS, the "service" element is "dafs". Implementations of security
mechanisms will convert dafs@hostname to various forms. For Kerberos
V4, the following form is RECOMMENDED: dafs/hostname. This would be a
server principal in the Kerberos Key Distribution Center database.

If the client desires mutual authentication with the DAFS server, it
SHOULD set the mutual_req_flag in its call to GSS_Init_sec_context (see
RFC 2743). All GSS mechanisms used by DAFS MUST support mutual
authentication between client and server.

The response from the server includes the trusted field, which
specifies whether the client is trusted to issue subsequent
DAFS_PROC_REGISTER_CRED operations.

If the server fails to authenticate the client due to incorrect
authentication information, it returns DAFSERR_NOT_AUTHORIZED.

ERRORS

   DAFSERR_STATUS_OK
   DAFSERR_CHAIN_FORM
   DAFSERR_INVAL
   DAFSERR_NOT_AUTHORIZED
   DAFSERR_NOTSUPP

6.2.3. DAFS_PROC_SERVER_AUTH

SUMMARY

Authenticates a server to a client.

ARGUMENTS

   struct DAFS_Server_Auth_Args {
       dafs_auth_type auth_type;
   };

RESULTS

   struct DAFS_Server_Auth_Res {
       union switch (auth_type) {
           case DAFS_AUTH_NONE:
               void;
           case DAFS_AUTH_TEXT:
               dafs_auth_text auth_text;
           case DAFS_AUTH_GSS:
               void;
           case DAFS_AUTH_DEFAULT:
               void;
       } auth_data;
   };

DESCRIPTION

Authenticate the DAFS server to the DAFS client. This operation is
OPTIONAL and can be used only after the DAFS client has been
authenticated to the server.

The same authentication methods are used for server authentication as
for client authentication. For information on authentication methods,
see 6.2.2., "DAFS_PROC_CLIENT_AUTH".
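To make the two name forms from 6.2.2 concrete, the sketch below rewrites a host-based service name such as dafs@hostname into the dafs/hostname principal form recommended there. It is illustrative only: a real client would normally hand the service@hostname string to the GSS-API name import routine and let the security mechanism perform its own mapping. The function name hostbased_to_principal and the example host name used in the note below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rewrite "service@hostname" as "service/hostname" into "out".
 * Returns 0 on success, or -1 if the input has no '@', has an empty
 * service or hostname part, or does not fit in "outlen" bytes. */
static int hostbased_to_principal(const char *name, char *out, size_t outlen)
{
    assert(name != NULL && out != NULL);

    const char *at = strchr(name, '@');
    if (at == NULL || at == name || at[1] == '\0')
        return -1;

    size_t len = strlen(name);
    if (len + 1 > outlen)          /* +1 for the terminating NUL */
        return -1;

    memcpy(out, name, len + 1);    /* copy including the NUL */
    out[at - name] = '/';          /* replace the first '@' with '/' */
    return 0;
}
```

For example, the input "dafs@filer1.example.com" yields "dafs/filer1.example.com", while an input without an '@' is rejected.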
   In the case where the client has authenticated using
   DAFS_AUTH_GSS, further authentication of the server to the client
   is NOT REQUIRED, as the client and server MAY mutually
   authenticate during the initial client authentication to the
   server. The client accomplishes this by setting the
   mutual_req_flag on its call to GSS_Init_sec_context, and by
   checking the result in the mutual_state returned. All GSS
   implementations used by DAFS MUST support mutual authentication.

   If the server fails to authenticate the client, it returns
   DAFSERR_NOT_AUTHORIZED.

ERRORS

   DAFSERR_STATUS_OK
   DAFSERR_CHAIN_FORM
   DAFSERR_INVAL
   DAFSERR_NOT_AUTHORIZED
   DAFSERR_NOTSUPP

6.2.4.  DAFS_PROC_CLIENT_CONNECT_AUTH

SUMMARY

   Creates a DAFS Session and authenticates the client in a single
   step.

ARGUMENTS

   struct DAFS_Client_Connect_Auth_Args {
      dafs_uint32            use_checksums;
      dafs_uint32            use_response_cache;
      dafs_uint32            max_credentials;
      dafs_uint32            max_request_size;
      dafs_uint32            max_response_size;
      dafs_uint32            max_requests;
      dafs_uint32            inline_write_header_size;
      dafs_uint32            use_back_control_channel;
      dafs_uint32            use_rdma_read_channel;
      dafs_utf8string        fence_id_string;
      dafs_var_offset_type   client_id_string;
      dafs_verifier_type     client_verifier;
      dafs_auth_req          auth_req;
   };

RESULTS

   struct DAFS_Client_Connect_Auth_Res {
      dafs_session_id_type   session_id;
      dafs_client_id_type    client_id;
      dafs_uint32            use_checksums;
      dafs_uint32            use_response_cache;
      dafs_uint32            max_credentials;
      dafs_uint32            max_request_size;
      dafs_uint32            max_response_size;
      dafs_uint32            max_requests;
      dafs_uint32            inline_write_header_size;
      dafs_uint32            use_back_control_channel;
      dafs_uint32            use_rdma_read_channel;
      dafs_auth_res          auth_res;
      dafs_boolean           trusted;
   };

DESCRIPTION

   This operation combines DAFS_PROC_CLIENT_CONNECT and
   DAFS_PROC_CLIENT_AUTH into a
   single operation, creating a DAFS Session and authenticating it in
   a single step. It is appropriate to use when the authentication
   method is already known. See 6.2.1., "DAFS_PROC_CLIENT_CONNECT"
   and 6.2.2., "DAFS_PROC_CLIENT_AUTH" for more information on
   individual fields.

   A return status of DAFSERR_ILLEGAL_PROT indicates that the
   protocol version requested by the client is not supported. The
   client SHOULD retry with a lower protocol version number.
   Alt_protocol_version contains a suggested protocol version to use
   that would be supported.

   If the protocol has already been negotiated, the server returns
   DAFSERR_ILLEGAL_STATE.

ERRORS

   DAFSERR_STATUS_OK
   DAFSERR_CHAIN_FORM
   DAFSERR_ILLEGAL_PROT
   DAFSERR_ILLEGAL_STATE
   DAFSERR_INVAL
   DAFSERR_NOT_AUTHENTICATED
   DAFSERR_NOTSUPP

6.2.5.  DAFS_PROC_CONNECT_BIND

SUMMARY

   Binds a new communication channel to an existing DAFS Session.
   Used to bind the Back-control Channel or RDMA-read channel to an
   existing Session and negotiate the parameters associated with that
   channel.

ARGUMENTS

   const DAFS_BACK_CONTROL_CHANNEL = 1;
   const DAFS_RDMA_READ_CHANNEL    = 2;

   struct DAFS_Connect_Bind_Args {
      dafs_session_id_type   session_id;
      dafs_uint16            channel_use;
      dafs_uint32            max_request_size;
      dafs_uint32            max_response_size;
      dafs_uint32            max_requests;
      dafs_auth_req          auth_req;
   };

RESULTS

   struct DAFS_Connect_Bind_Res {
      dafs_uint32     max_request_size;
      dafs_uint32     max_response_size;
      dafs_uint32     max_requests;
      dafs_auth_res   auth_res;
      dafs_boolean    trusted;
   };

DESCRIPTION

   Binds a new communication channel to an existing DAFS Session. A
   new binding can be created by the client if multiple channels are
   permitted for a single DAFS Session and an additional channel is
   being added to the DAFS Session.
   A client specifies desired options as part of the DAFS channel,
   and the server responds with the values to be used for the channel
   requested.

   o  Maximum operation request size on this channel, in bytes
      (specify 0 to use server default)

   o  Maximum operation response size on this channel, in bytes
      (specify 0 to use server default)

   o  Maximum operation requests outstanding on this channel (specify
      0 to use server default)

   The DAFS server MAY accept the options as requested, or MAY
   respond with different values. The server-determined option values
   are returned as part of the response message. It is the client's
   responsibility to check and verify whether the server options are
   acceptable to it. As an example, the client (or server, depending
   on the message flow on the particular channel) can request a
   maximum request outstanding limit of 65536, but the response MAY
   be a limit of 512 due to resource restrictions.

   Only DAFS Sessions initially created with one of the
   use_xxx_channel attributes are permitted to have subsequent DAFS
   communication channels bound to them. When a client binds an
   additional DAFS communication channel mapped onto a separate
   channel to a DAFS Session, it specifies whether this DAFS
   communication channel is to be used for RDMA read or for
   back-control directives.

   A client binding to an existing DAFS Session MUST authenticate
   itself successfully using the same identification as the original
   DAFS communication channel. (For example, if an initial DAFS
   Session was created and authenticated as user O, then a subsequent
   client cannot attempt to bind to the Session and authenticate
   itself as user S.)

   The response from the server includes the trusted field, which
   specifies whether the client is trusted to issue subsequent
   DAFS_PROC_REGISTER_CRED operations.
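
   The option exchange described above can be sketched as follows.
   This is a minimal illustration, not part of the protocol: the
   ChannelOptions type, the negotiate() policy (apply the server
   default when 0 is requested, then clamp to a server limit), and
   the acceptable() check are all assumptions made for this example.
   A real server is free to choose its values by any policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelOptions:
    max_request_size: int   # bytes; 0 in a request means "use server default"
    max_response_size: int  # bytes; 0 in a request means "use server default"
    max_requests: int       # outstanding requests; 0 means "use server default"


def negotiate(requested, defaults, limits):
    """Hypothetical server policy: apply defaults for 0, then clamp to limits."""
    def pick(req, default, limit):
        return min(default if req == 0 else req, limit)

    return ChannelOptions(
        pick(requested.max_request_size, defaults.max_request_size,
             limits.max_request_size),
        pick(requested.max_response_size, defaults.max_response_size,
             limits.max_response_size),
        pick(requested.max_requests, defaults.max_requests,
             limits.max_requests),
    )


def acceptable(granted, minimum):
    """Client-side check that every server-chosen value meets the client's floor."""
    return (granted.max_request_size >= minimum.max_request_size
            and granted.max_response_size >= minimum.max_response_size
            and granted.max_requests >= minimum.max_requests)
```

   With a server limit of 512 outstanding requests, a client that
   asks for 65536 is granted 512 and then decides for itself whether
   that value is workable.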
ERRORS

   DAFSERR_STATUS_OK
   DAFSERR_CHAIN_FORM
   DAFSERR_ILLEGAL_STATE
   DAFSERR_INVAL
   DAFSERR_NOT_AUTHORIZED
   DAFSERR_NOTSUPP
   DAFSERR_UNKNOWN_SESSION

6.2.6.  DAFS_PROC_DISCONNECT

SUMMARY

   Terminates a DAFS Session.

ARGUMENTS

   None.

RESULTS

   None.

DESCRIPTION

   Disconnects a DAFS Session and all communication channels that
   have been associated with it using DAFS_PROC_CONNECT_BIND. This
   operation can be used at any time after a Session has been
   connected.

ERRORS

   DAFSERR_STATUS_OK
   DAFSERR_CHAIN_FORM
   DAFSERR_INVAL
   DAFSERR_NOTSUPP

6.2.7.  DAFS_PROC_SECINFO

SUMMARY

   Queries the server for available security information.

ARGUMENTS

   None.

RESULTS

   struct authtype {
      dafs_uint32   auth_type;
      union switch (auth_type) {
         case DAFS_AUTH_NONE:
         case DAFS_AUTH_TEXT:
            void;
         case DAFS_AUTH_GSS:
            dafs_opaque8            oid<>;
            dafs_uint32             qop;
            enum dafs_gss_service   gss_service;
         case DAFS_AUTH_DEFAULT:
            void;
      } authinfo;
   };

   struct DAFS_Secinfo_Res {
      struct authtype   secinfo<>;   /* heap */
   };

DESCRIPTION

   Determines which authentication methods the server supports. This
   is typically used after DAFS_PROC_CLIENT_CONNECT and before
   DAFS_PROC_CLIENT_AUTH to determine which authentication mechanisms
   can be used.

   The result returns an array of auth_type (DAFS_AUTH_TEXT etc.),
   each with associated authentication-method specific information:

   o  For auth_type DAFS_AUTH_TEXT or DAFS_AUTH_NONE, empty
      associated information is returned.

   o  For auth_type DAFS_AUTH_GSS, the GSS object identifier, the
      supported service, and supported qop values for that service
      are returned. The object identifier is encoded as a variable
      length array of opaque bytes.
      It is NOT REQUIRED to encode the length field of the oid as
      part of the opaque data of the oid field. Rather, the length
      field of the oid is encoded as the length field of the opaque
      array, and the data bytes of the oid are transferred as the
      entire contents of the opaque8 byte array.

ERRORS

   DAFSERR_STATUS_OK
   DAFSERR_CHAIN_FORM
   DAFSERR_NOTSUPP

6.2.8.  DAFS_PROC_REGISTER_CRED

SUMMARY

   Registers credentials.

ARGUMENTS

   const DAFS_CRED_NAME    = 0;
   const DAFS_CRED_ID      = 1;
   const DAFS_CRED_GSS     = 2;
   const DAFS_CRED_DEFAULT = 3;

   struct DAFS_Register_Cred_Args {
      dafs_uint32   cred_type;
      union switch (cred_type) {
         case DAFS_CRED_ID:
            dafs_int32        uid;
            dafs_int32        gid;
            dafs_int32        groups<>;   /* heap */
         case DAFS_CRED_NAME:
            dafs_utf8string   name;       /* heap */
         case DAFS_CRED_GSS:
            dafs_opaque8      name<>;     /* heap */
         case DAFS_CRED_DEFAULT:
            void;
      } cred_data;
   };

RESULTS

   struct DAFS_Register_Cred_Res {
      dafs_cred_handle_type   cred_handle;
   };

DESCRIPTION

   This operation is used to advise a server that a set of
   credentials is active for the client associated with the Session.
   For each valid set of credentials specified, the server returns a
   client-unique credential handle that can be used for subsequent
   DAFS messages. The new set of credentials can be used during
   subsequent DAFS operations on any Session that is used for
   communicating between this client instance and the current server
   instance.

   Note that accessing credentials is distinct from authentication.
   The client has already performed appropriate authentication, and
   is trusted by the server to request proxy credentials as necessary
   to perform DAFS operations.

   The server can support enhanced credentials, such as multiple
   protocol support or id mappings. Because of this, the client only
   needs to provide the identification component of the credential
   set (for example, DAFS_CRED_NAME).
   The client MAY also want to provide more traditional UNIX-style
   credentials to the server as a hint of the resulting credential
   set (DAFS_CRED_ID). The client MAY also request that a
   server-defined default set of credentials be used
   (DAFS_CRED_DEFAULT).

   The server is REQUIRED to keep credentials for as long as it is
   REQUIRED to keep a Client-id. If there are no currently connected
   Sessions, and the lease time of any locks has expired, it is
   permissible for the server to release all client state, including
   any credentials associated with the client. A client that fails to
   reconnect quickly enough to avoid the release of previous client
   state can detect this case by noticing that the returned Client-id
   has changed, prompting it to re-register its credentials.

   The following types of credentials can be specified:

   o  DAFS_CRED_NAME: The username identifying the individual or
      entity to be associated with the credentials.

   o  DAFS_CRED_ID: The numeric uid/gid to be used in identifying the
      credentials.

   o  DAFS_CRED_GSS: A GSS-based set of credentials. This is
      specified as a mechanism name, i.e., a GSS mechanism-specific
      format name, as returned by GSS_Inquire_context and
      GSS_Display_name, for example.

   o  DAFS_CRED_DEFAULT: Set of default credentials supplied by the
      server.

   The server MUST support at least one credential type.

   The number of credentials that a client can register is negotiated
   during the client's initial Session creation. See 6.2.1.,
   "DAFS_PROC_CLIENT_CONNECT" for more information. If the client
   attempts to register more credentials than were negotiated, the
   server will return an error. The client can use
   DAFS_PROC_RELEASE_CRED to release an existing credential.

   The server returns DAFSERR_INVAL if there are too many groups in
   the groups list.
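
   As an illustration of the client-side bookkeeping implied above, a
   client might track its registered handles against the negotiated
   limit and discard them when the returned Client-id changes. The
   sketch below is an assumption-laden example only: the
   CredentialCache class and its method names are hypothetical, and
   the actual DAFS_PROC_REGISTER_CRED exchange with the server is not
   modeled.

```python
class CredentialCache:
    """Client-side record of registered credentials (illustrative only)."""

    def __init__(self, client_id, max_credentials):
        self.client_id = client_id
        self.max_credentials = max_credentials  # negotiated at Session creation
        self.handles = {}                       # credential -> cred_handle

    def can_register(self):
        """True while the negotiated credential limit has not been reached."""
        return len(self.handles) < self.max_credentials

    def record(self, credential, cred_handle):
        """Remember a handle returned by the server for a credential."""
        if credential not in self.handles and not self.can_register():
            raise RuntimeError("negotiated credential limit reached")
        self.handles[credential] = cred_handle

    def release(self, credential):
        """Forget a credential after it has been released on the server."""
        return self.handles.pop(credential, None)

    def on_reconnect(self, returned_client_id):
        """Return the credentials that must be registered again.

        A changed Client-id means the server released the old client
        state, so every cached handle is stale.
        """
        if returned_client_id == self.client_id:
            return []
        stale = list(self.handles)
        self.handles.clear()
        self.client_id = returned_client_id
        return stale
```

   Keeping the check local lets the client release an unused
   credential before registering a new one, rather than discovering
   the limit from a server error.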
ERRORS

   DAFSERR_STATUS_OK
   DAFSERR_CHAIN_FORM
   DAFSERR_INVAL
   DAFSERR_NOT_AUTHORIZED
   DAFSERR_NOTSUPP

6.2.9.  DAFS_PROC_RELEASE_CRED

SUMMARY

   Releases registered credentials.

ARGUMENTS

   struct DAFS_Release_Cred_Args {
      dafs_cred_handle_type   cred_handle;
   };

RESULTS

   None.

DESCRIPTION

   The RELEASE_CRED message is used to advise a server that a set of
   credentials (as obtained from the REGISTER_CRED operation) is no
   longer REQUIRED for the client. The cred_handle specifies which
   credential handle is to be released.

ERRORS

   DAFSERR_STATUS_OK
   DAFSERR_CHAIN_FORM
   DAFSERR_NOT_AUTHORIZED
   DAFSERR_NOTSUPP

6.3.  Response Cache

   This section gives the message packet definitions for the
   operations that provide Response Cache recovery following a
   failure. A description of the features and rationale associated
   with these operations can be found in 5.2., "Server Response
   Cache".

6.3.1.  DAFS_PROC_CHECK_RESPONSE

SUMMARY

   Determines the availability of cached results from a previously
   issued fs-state-modifying operation.

ARGUMENTS

   struct DAFS_Check_Response_Args {
      dafs_session_id_type   session_id;
      dafs_uint16            xid_stream;
      dafs_uint16            xid_seq;
      dafs_uint32            procedure;
   };

RESULTS

   None.

DESCRIPTION

   CHECK_RESPONSE is used to determine the availability of a Response
   Cache entry when recovering from disconnection, server reboot, or
   server failover. The client uses it so that it can determine which
   of its in-flight requests have actually been executed.

   The server returns DAFS_STATUS_OK when a Response Cache entry is
   present, indicating that the specified request was performed and
   has response information that can be fetched using the
   DAFS_PROC_FETCH_RESPONSE procedure.

   DAFSERR_NO_XID_MATCH indicates that the request identified by the
   xid_stream and xid_seq did not have an entry in the cache.
   In general, this is an indication that the request was not
   executed before the session represented by the session_id failed.

   The server might not have knowledge of the session represented by
   the session_id. In this case, DAFSERR_UNKNOWN_SESSION will be
   returned. Assuming the client submitted a valid session_id, this
   status is returned because the server lost state and it does not
   maintain its response caches in stable storage.

ERRORS

   DAFSERR_STATUS_OK
   DAFSERR_CHAIN_BROKEN
   DAFSERR_NO_XID_MATCH
   DAFSERR_UNKNOWN_SESSION

6.3.2.  DAFS_PROC_FETCH_RESPONSE

SUMMARY

   Retrieves the cached results from fs-state-modifying operations
   and chained operations.

ARGUMENTS

   struct DAFS_Fetch_Response_Args {
      dafs_session_id_type   session_id;
      dafs_uint16            xid_stream;
      dafs_uint16            xid_seq;
      dafs_uint32            procedure;
   };

RESULTS

   See "Implementation" below.

DESCRIPTION

   The result from DAFS_PROC_FETCH_RESPONSE is the result from the
   original request. The header (xid, analyzer, and other fields) is
   taken from the DAFS_PROC_FETCH_RESPONSE request. The results
   proper are taken from the response in the server's Response Cache.

   If the response is not found, the result is zero-length. This
   should not happen, because the caller SHOULD use
   DAFS_PROC_CHECK_RESPONSE to find out if the results it is
   intending to fetch have been cached by the server. Otherwise, the
   status returned is the status from the fetched response, not the
   status of the fetch operation itself.

IMPLEMENTATION

   A recommendation is to make these operations chainable. Making the
   operations chainable ensures that a CHECK_RESPONSE, FETCH_RESPONSE
   chain can be issued to check for the response and fetch it, if the
   response is in the server's Response Cache.

ERRORS

   See "Description" above.

6.3.3.  DAFS_PROC_DISCARD_RESPONSES

SUMMARY

   Tells the server that the client is finished with the Response
   Cache for a disconnected Session.

ARGUMENTS

   struct DAFS_Discard_Responses_Args {
      dafs_session_id_type   session_id;
   };

RESULTS

   None.

DESCRIPTION

   This operation lets the server remove the Response Cache for the
   specified disconnected Session. If the client does not do this,
   the Response Cache will be maintained until client
   reinitialization.

IMPLEMENTATION

ERRORS

   DAFSERR_STATUS_OK
   DAFSERR_CHAIN_FORM
   DAFSERR_NOT_AUTHORIZED
   DAFSERR_UNKNOWN_SESSION

6.4.  Fencing Procedures

   This section describes the fencing operations used to manage the
   Fencing_list and put special fencing access controls into effect.

6.4.1.  DAFS_PROC_GET_FENCING_LIST

SUMMARY

   Returns the Fencing_Id_List for a file or file system.

ARGUMENTS

   struct DAFS_Get_Fencing_List_Args {
      dafs_filehandle_type   filehandle;   /* file or fshandle */
   };

RESULTS

   struct DAFS_Get_Fencing_List_Res {
      dafs_fence_array_type   fence_list;   /* heap */
   };

DESCRIPTION

   The get fencing list procedure returns the current fencing list
   for the file or file system specified by the filehandle argument.

   The ability to get the Fencing_list for a filehandle object is
   reserved to the owner of the object, or a trusted Client.

IMPLEMENTATION

ERRORS

   DAFSERR_CHAIN_FORM
   DAFSERR_INVAL
   DAFSERR_NOT_AUTHORIZED

6.4.2.  DAFS_PROC_SET_FENCING_LIST

SUMMARY

   Sets the Fencing_Id_List for a file or file system.

ARGUMENTS

   enum object_type {
      FILE       = 1,
      FILESYSTEM = 2
   };

   enum access_type {
      ALLOW = 1,
      DENY  = 2
   };

   enum update_type {
      OVERWRITE = 1,
      APPEND    = 2,
      REMOVE    = 3
   };

   struct DAFS_Set_Fencing_List_Args {
      dafs_filehandle_type    filehandle;   /* file/fshandle */
      object_type             object;
      access_type             access;       /* allow or deny */
      update_type             update;       /* overwrite, append,
                                               or remove */
      dafs_fence_array_type   fence_list;
   };

RESULTS

   None.
DESCRIPTION

   The set operation atomically updates the Fencing_list, adding or
   removing Fence_id_strings from the existing list, or overwriting
   the existing list, as specified by the argument flags.

   The filehandle specifies the file or file system that the fencing
   list will be associated with. The object argument indicates
   whether the file or file system is the object to be fenced. The
   fencing list field is the list of fence_id_strings that is to be
   added or removed from the fencing list. The access field specifies
   whether the fencing list specifies a list of fence_id_strings that
   are to be allowed access to the fenced object, or denied access to
   that object. The update field specifies whether the fencing list
   argument modifies an existing fencing list for the object, or, in
   the case of allowing access, that it overwrites the existing list.

   A side effect of the set operation when invoked with access = DENY
   is to:

   1) Drain (i.e., abort or complete) any in-progress operations
      received on a DAFS Session with the just-denied
      Fence_id_string. All subsequent requests on the Session that
      has the associated just-denied Fence_id_string MUST enforce the
      denial of access implied by the new Fencing_list. This requires
      determining that the request is associated with a denied
      Fence_id_string (e.g., determining that the request's Session
      has a denied Fence_id_string), and matching the filehandle in
      the request to Objects that are Fenced.

   2) If the Object fenced includes all DAFS file objects
      (directories, files, symlinks, etc.) provided by the DAFS
      Server, then all existing DAFS Sessions associated with the
      just-denied Fence_id_string can be closed in error. Subsequent
      attempts to create a Session that contains the just-denied
      Fence_id_string can be returned in error.
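
   The update rules above can be sketched as set operations. The
   example below is illustrative only: the function names are
   hypothetical, the numeric constants mirror the update_type and
   access_type enums, and the is_permitted() check assumes that an
   ALLOW list admits only the listed Fence_id_strings while a DENY
   list excludes only the listed ones. Draining in-progress
   operations and closing Sessions are not modeled.

```python
OVERWRITE, APPEND, REMOVE = 1, 2, 3   # mirrors enum update_type
ALLOW, DENY = 1, 2                    # mirrors enum access_type


def update_fencing_list(current, update, fence_ids):
    """Return the new fencing list as a frozenset (one atomic replacement)."""
    ids = frozenset(fence_ids)
    if update == OVERWRITE:
        return ids
    if update == APPEND:
        return frozenset(current) | ids
    if update == REMOVE:
        return frozenset(current) - ids
    raise ValueError("unknown update_type: %r" % (update,))


def is_permitted(fence_id, access, fencing_list):
    """Assumed enforcement rule for a request carrying fence_id."""
    if access == ALLOW:
        return fence_id in fencing_list
    if access == DENY:
        return fence_id not in fencing_list
    raise ValueError("unknown access_type: %r" % (access,))
```

   Building a fresh immutable set for every update is one simple way
   to make the change appear atomic to concurrent request checks.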
   The ability to set the Fencing_list for a filehandle object is
   reserved to the owner of the object, or a trusted Client. The
   ability to set the Fencing_list for a file system is reserved to
   trusted Clients.

IMPLEMENTATION

   It is server implementation specific how to handle a
   Set_Fencing_List operation on a filehandle that already has a
   Fencing_list defined for the associated dafs_FShandle, or a
   Set_Fencing_List operation on a dafs_FShandle that already has a
   Fencing_list defined for the filehandle.

ERRORS

   DAFSERR_CHAIN_FORM
   DAFSERR_INVAL
   DAFSERR_NOT_AUTHORIZED

6.5.  File System Procedures

   This section describes the individual operations, along with the
   formats of the arguments portion of the request and the results
   portion of the response. All operations in this section are
   initiated by clients and are processed and responded to by
   servers.

6.5.1.  DAFS_PROC_NULL

SUMMARY

   No operation.

ARGUMENTS

   None.

RESULTS

   None.

DESCRIPTION

      "Standard NULL procedure. Void [no] arguments, void [no]
      results. This procedure has no functionality associated with
      it. Because of this, it is sometimes used to measure the
      overhead of processing a service request. Therefore, the
      server should ensure that no unnecessary work is done in
      servicing this procedure." (RFC 3010, p. 102)

   One other potential use of this procedure is as a means for a
   requester to receive flow control information from the server in a
   timely fashion during an otherwise quiet period. See 3.2.6.,
   "Message Flow Control" for a discussion.

   A side effect of a DAFS_PROC_NULL request is the renewal of a
   client's lock leases.

6.5.2.  DAFS_PROC_ACCESS

SUMMARY

   Checks an object's access rights.
ARGUMENTS

   const ACCESS_READ    = 0x00000001;
   const ACCESS_LOOKUP  = 0x00000002;
   const ACCESS_MODIFY  = 0x00000004;
   const ACCESS_EXTEND  = 0x00000008;
   const ACCESS_DELETE  = 0x00000010;
   const ACCESS_EXECUTE = 0x00000020;

   struct DAFS_Access_Args {
      dafs_filehandle_type   filehandle;
      dafs_uint32            access;
   };

RESULTS

   struct DAFS_Access_Res {
      dafs_uint32   supported;
      dafs_uint32   access;
   };

DESCRIPTION

      "ACCESS [DAFS_PROC_ACCESS] determines the access rights that a
      user, as identified by the credentials in the RPC request, has
      with respect to the file system object specified by the
      current filehandle." (RFC 3010, p. 105)

   DAFS does not use a separate RPC section of the message to
   transmit credentials information. See 4.1.3.2., "Credentials" for
   an explanation of DAFS credential handles.

   In addition, DAFS explicitly exchanges filehandle information in
   argument and result message fields. Further, DAFS provides
   server-based implicit transmission of a "current_filehandle"
   between chained operations. See 4.1.3.1., "Filehandles in Compound
   vs. Chaining" for differences in filehandle management in DAFS
   procedures.

      "The client encodes the set of access rights that are to be
      checked in the bit mask 'access.' The server checks the
      permissions encoded in the bit mask. If a status of NFS4_OK
      [DAFS_STATUS_OK] is returned, two bit masks are included in
      the response. The first, 'supported', represents the access
      rights for which the server can verify reliably. The second,
      'access,' represents the access rights available to the user
      for the filehandle provided. On success, the current
      filehandle retains its value.

      Note that the supported field will contain only as many values
      as was originally sent in the arguments.
      For example, if the client sends an ACCESS [DAFS_PROC_ACCESS]
      operation with only the ACCESS4_READ [ACCESS_READ] value set
      and the server supports this value, the server will return
      only ACCESS4_READ [ACCESS_READ] even if it could have reliably
      checked other values.

      The results of this operation are necessarily advisory in
      nature. A return status of NFS4_OK [DAFS_STATUS_OK] and the
      appropriate bit set in the bit mask does not imply that such
      access will be allowed to the file system object in the
      future. This is because access rights can be revoked by the
      server at any time.

      The following access permissions may be requested:

      ACCESS4_READ [ACCESS_READ]:
         Read data from file or read a directory.

      ACCESS4_LOOKUP [ACCESS_LOOKUP]:
         Look up a name in a directory (no meaning for non-directory
         objects).

      ACCESS4_MODIFY [ACCESS_MODIFY]:
         Rewrite existing file data or modify existing directory
         entries.

      ACCESS4_EXTEND [ACCESS_EXTEND]:
         Write new data or add directory entries.

      ACCESS4_DELETE [ACCESS_DELETE]:
         Delete an existing directory entry (no meaning for
         non-directory objects).

      ACCESS4_EXECUTE [ACCESS_EXECUTE]:
         Execute file (no meaning for a directory)."
      (RFC 3010, pp. 105-106)

IMPLEMENTATION

      "For the NFS version 4 [and DAFS] protocol, the use of the
      ACCESS [DAFS_PROC_ACCESS] procedure when opening a regular
      file is deprecated in favor of using OPEN [DAFS_PROC_OPEN].

      In general, it is not sufficient for the client to attempt to
      deduce access permissions by inspecting the uid, gid, and mode
      fields in the file attributes or by attempting to interpret
      the contents of the ACL attribute. This is because the server
      may perform uid or gid mapping or enforce additional access
      control restrictions. It is also possible that the server may
      not be in the same ID space as the client. In these cases (and
      perhaps others), the client can not reliably perform an access
      check with only current file attributes.
      In the NFS version 2 protocol, the only reliable way to
      determine whether an operation was allowed was to try it and
      see if it succeeded or failed. Using the ACCESS
      [DAFS_PROC_ACCESS] procedure in the NFS version 4 [and DAFS]
      protocol, the client can ask the server to indicate whether or
      not one or more classes of operations are permitted. The
      ACCESS [DAFS_PROC_ACCESS] operation is provided to allow
      clients to check before doing a series of operations which
      will result in an access failure. The OPEN [DAFS_PROC_OPEN]
      operation provides a point where the server can verify access
      to the file object and method to return that information to
      the client. The ACCESS [DAFS_PROC_ACCESS] operation is still
      useful for directory operations or for use in the case the
      UNIX API 'access' is used on the client.

      The information returned by the server in response to an
      ACCESS [DAFS_PROC_ACCESS] call is not permanent. It was
      correct at the exact time that the server performed the
      checks, but not necessarily afterwards. The server can revoke
      access permission at any time.

      The client should use the effective credentials of the user to
      build the authentication information in the ACCESS
      [DAFS_PROC_ACCESS] request used to determine access rights."
      (RFC 3010, pp. 106-107)

   In DAFS, a client needs to register the user's effective
   credentials and include the credentials handle thus obtained in
   the DAFS header of the ACCESS operation.

      "It is the effective user and group credentials that are used
      in subsequent read and write operations.

      "Many implementations do not directly support the
      ACCESS4_DELETE [ACCESS_DELETE] permission. Operating systems
      like UNIX will ignore the ACCESS4_DELETE [ACCESS_DELETE] bit
      if set on an access request on a non-directory object.
      In these systems, delete permission on a file is determined by
      the access permissions on the directory in which the file
      resides, instead of being determined by the permissions of the
      file itself. Therefore, the mask returned enumerating which
      access rights can be determined will have the ACCESS4_DELETE
      [ACCESS_DELETE] value set to 0. This indicates to the client
      that the server was unable to check that particular access
      right. The ACCESS4_DELETE [ACCESS_DELETE] bit in the access
      mask returned will then be ignored by the client."
      (RFC 3010, pp. 106-107)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_BROKEN
   DAFSERR_CHAIN_FORM
   DAFSERR_DELAY
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE

6.5.3.  DAFS_PROC_APPEND_INLINE

SUMMARY

   Appends data to the end of a file. The data to be written is part
   of the request packet and is passed inline.

ARGUMENTS

   enum dafs_append_stable_how {
      DATA_SYNC = 1,
      FILE_SYNC = 2
   };

   struct DAFS_Append_Inline_Args {
      dafs_filehandle_type     filehandle;
      dafs_state_id_type       state_id;
      dafs_append_stable_how   stable_how;
      dafs_uint32              byte_count;
      dafs_uint32              write_padded;
      dafs_cache_hint_type     cache_hint;
      dafs_opaque8             data[byte_count];
   };

RESULTS

   struct DAFS_Append_Inline_Res {
      dafs_uint64              offset;
      dafs_verifier_type       verifier;
      dafs_append_stable_how   committed;
   };

DESCRIPTION

   Using the append inline procedure, a client requests that the
   server atomically append the data at the end of the file. The
   request is atomic with respect to:

   1) the end-of-file. The server ensures that the determination of
      the current end-of-file file offset and appending the data at
      that offset are atomic with respect to other write operations
      on the file.

   2) data to be written.
      The server ensures that either all of the data is appended to
      the file, or none of the data is appended to the file.

   Append operations MUST specify either Data_Sync or File_Sync
   stability.

   If the append request specifies a byte count larger than the
   DAFS_FSATTR_MAX_APPEND file system attribute associated with the
   file, the server MAY return the error value DAFSERR_WRITE_TOOBIG
   and not append any of the data.

IMPLEMENTATION

   The append operation is REQUIRED to support both atomicities
   described. The server MAY need to provide local buffer space for
   up to DAFS_FSATTR_MAX_APPEND bytes of data, and MAY need to adjust
   the resulting file size in order to eliminate any effects of a
   partially completed append operation.

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_BAD_STATEID
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_DENIED
   DAFSERR_DQUOT
   DAFSERR_EXPIRED
   DAFSERR_FBIG
   DAFSERR_FHEXPIRED
   DAFSERR_GRACE
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_LEASE_MOVED
   DAFSERR_LOCKED
   DAFSERR_MOVED
   DAFSERR_NOSPC
   DAFSERR_OLD_STATEID
   DAFSERR_RESOURCE
   DAFSERR_ROFS
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_STATEID
   DAFSERR_WRITE_TOOBIG

6.5.4.  DAFS_PROC_APPEND_DIRECT

SUMMARY

   Initiates an append to a file using data retrieved via RDMA read
   directly from client memory buffers.
ARGUMENTS

   enum dafs_append_stable_how {
      DATA_SYNC = 1,
      FILE_SYNC = 2
   };

   struct DAFS_Append_Direct_Args {
      dafs_filehandle_type     filehandle;
      dafs_state_id_type       state_id;
      dafs_append_stable_how   stable_how;
      dafs_uint32              byte_count;
      dafs_cache_hint_type     cache_hint;
      dafs_checksum_type       direct_checksum;
      dafs_direct_op_buffer    write_data_buffer;
   };

   /* DIRECT: opaque write_data_buffer[byte_count]; */

RESULTS

   struct DAFS_Append_Direct_Res {
      dafs_uint64              offset;
      dafs_verifier_type       verifier;
      dafs_append_stable_how   committed;
   };

DESCRIPTION

   See DAFS_PROC_APPEND_INLINE for a description.

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_BAD_STATEID
   DAFSERR_CHAIN_BROKEN
   DAFSERR_CHAIN_FORM
   DAFSERR_DELAY
   DAFSERR_DENIED
   DAFSERR_DQUOT
   DAFSERR_EXPIRED
   DAFSERR_FBIG
   DAFSERR_FHEXPIRED
   DAFSERR_GRACE
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_LEASE_MOVED
   DAFSERR_LOCKED
   DAFSERR_MOVED
   DAFSERR_NOSPC
   DAFSERR_OLD_STATEID
   DAFSERR_RDMA-READ_CHANNEL_UNUSABLE
   DAFSERR_RESOURCE
   DAFSERR_ROFS
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_STATEID
   DAFSERR_WRITE_TOOBIG

6.5.5.  DAFS_PROC_BATCH_SUBMIT

SUMMARY

   Submits a batch of I/O requests to the server.

ARGUMENTS

   typedef enum {
      DAFS_READ_OP  = 1,
      DAFS_WRITE_OP = 2
   } dafs_rw_flag;

   struct DAFS_Batch_Submit_Args {
      dafs_rw_request_array_type   requests;   /* heap */
      dafs_uint32                  usec_window;
      dafs_uint32                  num_completions;
      dafs_boolean                 synchronous;
   };

RESULTS

   struct DAFS_Batch_Submit_Res {
      dafs_completion_array_type   completions;   /* heap */
   };

DESCRIPTION

   The DAFS_PROC_BATCH_SUBMIT operation is used to initiate a set of
   I/O requests from and to regular files. This can be used as both a
   synchronous list-io mechanism as well as an asynchronous I/O
   request.
   In synchronous mode, the server completes all I/O requests, then
   reports each of their status results in a single reply message to
   the DAFS client. In asynchronous mode, the server reports results
   over the Back Channel.

   Each I/O request is described by the "dafs_read_write_request"
   structure, which includes a request_id, unique to the DAFS
   Session. The request_id allows a client to match completion
   notifications to previously issued requests. It also includes a
   filehandle, a read/write flag, a state id, and a checksum field.

   Lastly, each dafs_read_write_request contains two lists, one of
   file chunks and another of memory buffers. The number of elements
   in those two lists does not need to match, nor does the size of
   the individual file regions need to match the size of the
   individual memory buffers, but the total number of bytes specified
   in the file region list MUST match the total number of bytes
   specified in the memory list. These two lists represent a combined
   scatter-gather list, either moving data from the file to the
   specified addresses in the case of a read, or from the addresses
   to a file in the case of a write.

   The server can decompose the two lists into a single list of
   transfers of memory to file regions, and is free to perform any
   optimizations. The decomposed list can be thought of as the list
   of transfers defined by dividing all the data to be moved for a
   given filehandle at every point that separates one memory buffer
   from another or one file region from another. This list can easily
   be generated by running a cursor through each of the file list and
   memory buffer list provided. The condensed description using two
   lists, however, makes certain optimizations trivial. Two such
   examples are the case of I/O from a single file region into
   multiple memory buffers, or from a single memory buffer into
   discontiguous segments of a file.
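
   The cursor-based decomposition described above can be sketched as
   follows. This is a minimal illustration under stated assumptions:
   file regions are modeled as (file_offset, length) pairs and memory
   buffers as (address, length) pairs, which are simplifications of
   the structures carried in a dafs_read_write_request, and the
   function name decompose is hypothetical.

```python
def decompose(file_regions, memory_buffers):
    """Split two scatter-gather lists into (file_offset, address, length) transfers.

    A new transfer starts at every boundary between file regions or
    between memory buffers.
    """
    total_file = sum(length for _, length in file_regions)
    total_mem = sum(length for _, length in memory_buffers)
    if total_file != total_mem:
        raise ValueError("file region bytes must equal memory buffer bytes")

    transfers = []
    fi = mi = 0          # cursors into the two lists
    f_used = m_used = 0  # bytes already consumed from the current entries
    while fi < len(file_regions) and mi < len(memory_buffers):
        f_off, f_len = file_regions[fi]
        m_addr, m_len = memory_buffers[mi]
        n = min(f_len - f_used, m_len - m_used)
        if n > 0:
            transfers.append((f_off + f_used, m_addr + m_used, n))
        f_used += n
        m_used += n
        if f_used == f_len:   # current file region exhausted
            fi += 1
            f_used = 0
        if m_used == m_len:   # current memory buffer exhausted
            mi += 1
            m_used = 0
    return transfers
```

   For example, a single 100-byte file region at offset 0 paired with
   a 40-byte buffer and a 60-byte buffer yields two transfers, one of
   40 bytes at file offset 0 and one of 60 bytes at file offset 40.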
If the "synchronous" flag is set in the DAFS_Batch_Submit_Args structure, then the server treats the batch operation as a list of synchronous I/O requests to be performed immediately. When all the requests complete (either successfully or in error), the server sends a single reply on the Operation Channel containing the status of each I/O request. In this case, the server ignores the values of "usec_window" and "num_completions".

Otherwise, if "synchronous" is not set, then the message becomes an asynchronous batch submit. In the asynchronous case, the client MAY set the "usec_window" parameter as a hint to the server indicating how quickly the client would like to have the batch requests satisfied. The client can also set "num_completions" as a hint to tell the server how many completions it would like reported at once. Note that the server SHOULD attempt to respect these hints, but is NOT REQUIRED to do so. That is, a client MUST be prepared to receive a different number of completions than it requested, or to wait longer than desired for a given completion. The server sends a reply message on the Operation Channel acknowledging the batch request. Then, as each request is completed, the server sends DAFS_PROC_BC_BATCH_COMPLETION messages to the client over the Back-control Channel.

The server is not obligated to complete the requests in the batch in any specific order or with any atomicity guarantees. Also, if a client submits multiple asynchronous batch operations, the server MAY coalesce them and report completions from multiple batches in the same back-channel completion notification or future request poll.

The "num_completions" parameter is global to a DAFS Session.
If more than one batch of asynchronous I/O requests is in progress, the server will respect the most recently received "num_completions" parameter, and MAY combine requests from different DAFS_PROC_BATCH_SUBMIT messages into the same DAFS_PROC_BC_BATCH_COMPLETION message.

The client can use DAFS_PROC_BATCH_SUBMIT to change the current value of "num_completions" without submitting any additional requests. To do so, the client sets "num_completions" to the desired size, clears the synchronous flag, and issues the DAFS_PROC_BATCH_SUBMIT message with a zero-length I/O request list. Conversely, a client can submit more I/O requests in a batch without changing the Session's current "num_completions" parameter by setting "num_completions" to 0 in the new batch message.

If the client and server have negotiated the use of data checksums, each batch I/O request is independently checksummed, using the DAFS checksum. For write requests, the client fills in the checksum field of the dafs_read_write_request structure, which the server verifies. For read requests, the server fills in the checksum field of the corresponding dafs_completion_notification structure, which the client verifies upon receipt of the data.

IMPLEMENTATION

In general, the DAFS_PROC_BATCH_SUBMIT response message SHOULD NOT be sent to the client until the server is able to allocate the resources necessary to store/queue the list of requests. This delays the return of Operation Channel flow control credits somewhat, but it automatically causes the client to slow down when the server experiences transient loads. It allows the server to respond rapidly on any Session as long as it has resources, but it need not dedicate resources for large BATCH_SUBMIT requests until they are requested. In some sense, this is no different than any other 'normal' message: the response isn't sent until the message is processed.
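The Session-wide handling of "num_completions" described in this section can be illustrated with a small sketch of server-side bookkeeping. Everything here is hypothetical: the class name, the method name, and in particular the initial value of num_completions (which this section does not specify) are assumptions chosen for the example.

```python
class SessionBatchState:
    """Hypothetical per-Session state for asynchronous batch submits."""

    def __init__(self, initial_num_completions=1):
        # Assumption: the text does not define an initial value.
        self.num_completions = initial_num_completions
        self.pending = []  # request_ids not yet reported as complete

    def batch_submit(self, request_ids, num_completions, synchronous=False):
        if synchronous:
            # Synchronous batches ignore usec_window and num_completions
            # and are answered in a single reply, so nothing is queued here.
            return
        if num_completions != 0:
            # The most recently received non-zero value applies to the
            # whole Session, including requests from earlier batches.
            self.num_completions = num_completions
        # A zero-length list changes only num_completions.
        self.pending.extend(request_ids)
```

A client that wants to change only the completion batching would call the equivalent of batch_submit([], 8); one that wants to add requests without disturbing the current setting would pass 0 for num_completions.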
ERRORS

DAFSERR_ACCES
DAFSERR_BADHANDLE
DAFSERR_BAD_STATEID
DAFSERR_CHAIN_FORM
DAFSERR_CHAIN_BROKEN
DAFSERR_DELAY
DAFSERR_DENIED
DAFSERR_DQUOT
DAFSERR_EXPIRED
DAFSERR_FBIG
DAFSERR_FHEXPIRED
DAFSERR_GRACE
DAFSERR_INVAL
DAFSERR_IO
DAFSERR_LEASE_MOVED
DAFSERR_LOCKED
DAFSERR_MOVED
DAFSERR_NOSPC
DAFSERR_OLD_STATEID
DAFSERR_RDMA-READ_CHANNEL_UNUSABLE
DAFSERR_RESOURCE
DAFSERR_ROFS
DAFSERR_SERVERFAULT
DAFSERR_STALE
DAFSERR_STALE_STATEID
DAFSERR_WHOA_COWBOY
DAFSERR_WRITE_TOOBIG

6.5.6. DAFS_PROC_CACHE_HINT

SUMMARY

Provides the server with cache management hints.

ARGUMENTS

struct DAFS_Cache_Hint_Args {
    dafs_filehandle_type filehandle;
    dafs_uint64 offset;
    dafs_uint32 count;
    dafs_access_pattern_type access_pattern;
    dafs_cache_hint_type cache_hint;
};

RESULTS

None.

DESCRIPTION

The access_pattern indicates the client's predicted general access pattern for the filehandle, if the pattern is known.

The cache_hint is a bit-field argument containing the cache hint for the specified byte range of the file. The first bit indicates whether a prefetch of this data is to be performed. The following 3 bits indicate the likelihood of a read in the near future. The 3 bits following those indicate the likelihood that a write will be performed on this data in the near future. Detailed definitions for each of the cache_hint bits are given below.

Cache_hint set to NULL is a special value that means that the hint is not being used or that the client does not currently know what cache weighting to assign to the byte range. In the NULL case the server SHOULD assume default values (prefetch 0, dafs_readhint_4, dafs_writehint_4).

dafs_prefetch
Dafs_prefetch indicates that the given data range is to be prefetched. If this is set and accompanies a read or write request then it is ignored.
It is intended to be used when the DAFS_PROC_CACHE_HINT function is called. If the prefetch bit is set and both the readhint and writehint are negative (i.e., readhint < 0x8 and writehint < 0x40) then the request SHOULD be ignored. If they are not negative and the server is not otherwise busy, it is recommended that the requested data be prefetched into the server's cache. If prefetching is not supported by the server then the error DAFSERR_PREFETCH_NOT_SUPPORTED is returned (provided that the hint is not accompanying a read or write request, in which case read/write errors are to be reported).

dafs_readhint_1
The client is confident that it will not read the data again in the near future.

dafs_readhint_2
The client believes there is a good chance that it will not read the data again in the near future.

dafs_readhint_3
The client believes there is a better than even chance that it will not read the data again in the near future.

dafs_readhint_4
The client does not know whether it will read the data again or not (default value).

dafs_readhint_5
The client believes there is a better than even chance that it will read this data again in the near future.

dafs_readhint_6
The client believes there is a good chance that it will read this data again in the near future.

dafs_readhint_7
The client believes there is an excellent chance that it will read the data again in the near future.

dafs_writehint_1
The client is confident that it will not write the data again in the near future.

dafs_writehint_2
The client believes there is a good chance that it will not write the data again in the near future.

dafs_writehint_3
The client believes there is a better than even chance that it will not write the data again in the near future.

dafs_writehint_4
The client does not know whether it will write the data again or not (default value).
dafs_writehint_5
The client believes there is a better than even chance that it will write this data again in the near future.

dafs_writehint_6
The client believes there is a good chance that it will write this data again in the near future.

dafs_writehint_7
The client believes there is an excellent chance that it will write the data again in the near future.

IMPLEMENTATION

The DAFS protocol does not dictate what actions a server SHOULD take upon reception of cache hints or even that the server needs to take any actions at all. The protocol does not dictate how long a server SHOULD make use of a hint issued by the client.

ERRORS

DAFSERR_CHAIN_BROKEN
DAFSERR_CHAIN_FORM
DAFSERR_PREFETCH_NOT_SUPPORTED

6.5.7. DAFS_PROC_CLOSE

SUMMARY

Closes a file.

ARGUMENTS

struct DAFS_Close_Args {
    dafs_filehandle_type filehandle;
    dafs_state_id_type state_id;
};

RESULTS

None.

DESCRIPTION

"The CLOSE [DAFS_PROC_CLOSE] operation releases share reservations for the file as specified by the current filehandle. The share reservations and other state information released at the server as a result of this CLOSE [DAFS_PROC_CLOSE] is only associated with the supplied stateid [state_id]. The sequence id provides for the correct ordering." (RFC 3010, p. 108)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

"State associated with other OPENs [DAFS_PROC_OPENs] is not affected.

If record locks are held, the client SHOULD release all locks before issuing a CLOSE [DAFS_PROC_CLOSE]. The server MAY free all outstanding locks on CLOSE [DAFS_PROC_CLOSE] but some servers may not support the CLOSE [DAFS_PROC_CLOSE] of a file that still has record locks held. The server MUST return failure if any locks would exist after the CLOSE [DAFS_PROC_CLOSE]." (RFC 3010, p. 108)

The DAFS_PROC_CLOSE operation provides "Delete On Last Close" semantics.
Once a file has been opened, the DAFS Server MUST continue to provide access to the file to the Clients that have the file open, even after the file has been removed, up until the number of Clients that have the file open has dropped to zero. However, once the file has been removed, subsequent lookup and open operations will fail.

IMPLEMENTATION

ERRORS

DAFSERR_BADHANDLE
DAFSERR_BAD_STATEID
DAFSERR_CHAIN_FORM
DAFSERR_CHAIN_BROKEN
DAFSERR_DELAY
DAFSERR_EXPIRED
DAFSERR_FHEXPIRED
DAFSERR_GRACE
DAFSERR_INVAL
DAFSERR_ISDIR
DAFSERR_LEASE_MOVED
DAFSERR_MOVED
DAFSERR_OLD_STATEID
DAFSERR_RESOURCE
DAFSERR_SERVERFAULT
DAFSERR_STALE
DAFSERR_STALE_STATEID

6.5.8. DAFS_PROC_COMMIT

SUMMARY

Commits data cached on the server as the result of previous asynchronous write requests.

ARGUMENTS

struct DAFS_Commit_Args {
    dafs_filehandle_type filehandle;
    dafs_uint64 offset;
    dafs_uint32 count;
};

RESULTS

struct DAFS_Commit_Res {
    dafs_verifier_type writeverf;
};

DESCRIPTION

"The COMMIT [DAFS_PROC_COMMIT] operation forces or flushes data to stable storage for the file specified by the current file handle." (RFC 3010, p. 109)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

"The flushed data is that which was previously written with a WRITE [DAFS_PROC_WRITE_INLINE or DAFS_PROC_WRITE_DIRECT] operation which had the stable field set to UNSTABLE4 [UNSTABLE].

The offset specifies the position within the file where the flush is to begin. An offset value of 0 (zero) means to flush data starting at the beginning of the file. The count specifies the number of bytes of data to flush. If count is 0 (zero), a flush from offset to the end of the file is done.
The server returns a write verifier upon successful completion of the COMMIT [DAFS_PROC_COMMIT]. The write verifier is used by the client to determine if the server has restarted or rebooted between the initial WRITE(s) [DAFS_PROC_WRITE_INLINEs or DAFS_PROC_WRITE_DIRECTs] and the COMMIT [DAFS_PROC_COMMIT]. The client does this by comparing the write verifier returned from the initial writes and the verifier returned by the COMMIT [DAFS_PROC_COMMIT] procedure. The server must vary the value of the write verifier at each server event or instantiation that may lead to a loss of uncommitted data. Most commonly this occurs when the server is rebooted; however, other events at the server may result in uncommitted data loss as well." (RFC 3010, pp. 109-110)

IMPLEMENTATION

"The COMMIT [DAFS_PROC_COMMIT] procedure is similar in operation and semantics to the POSIX fsync(2) system call that synchronizes a file's state with the disk (file data and metadata is flushed to disk or stable storage). COMMIT [DAFS_PROC_COMMIT] performs the same operation for a client, flushing any unsynchronized data and metadata on the server to the server's disk or stable storage for the specified file. Like fsync(2), it may be that there is some modified data or no modified data to synchronize. The data may have been synchronized by the server's normal periodic buffer synchronization activity. COMMIT [DAFS_PROC_COMMIT] should return NFS4_OK [DAFS_STATUS_OK], unless there has been an unexpected error.

COMMIT [DAFS_PROC_COMMIT] differs from fsync(2) in that it is possible for the client to flush a range of the file (most likely triggered by a buffer-reclamation scheme on the client before [the] file has been completely written).

The server implementation of COMMIT [DAFS_PROC_COMMIT] is reasonably simple.
If the server receives a full file COMMIT [DAFS_PROC_COMMIT] request, that is starting at offset 0 and count 0, it should do the equivalent of fsync()'ing the file. Otherwise, it should arrange to have the cached data in the range specified by offset and count to be flushed to stable storage. In both cases, any metadata associated with the file must be flushed to stable storage before returning. It is not an error for there to be nothing to flush on the server. This means that the data and metadata that needed to be flushed have already been flushed or lost during the last server failure.

The client implementation of COMMIT [DAFS_PROC_COMMIT] is a little more complex. There are two reasons for wanting to commit a client buffer to stable storage. The first is that the client wants to reuse a buffer. In this case, the offset and count of the buffer are sent to the server in the COMMIT [DAFS_PROC_COMMIT] request. The server then flushes any cached data based on the offset and count, and flushes any metadata associated with the file. It then returns the status of the flush and the write verifier. The other reason for the client to generate a COMMIT [DAFS_PROC_COMMIT] is for a full file flush, such as may be done at close. In this case, the client would gather all of the buffers for this file that contain uncommitted data, do the COMMIT [DAFS_PROC_COMMIT] operation with an offset of 0 and count of 0, and then free all of those buffers. Any other dirty buffers would be sent to the server in the normal fashion.

After a buffer is written by the client with the stable parameter set to UNSTABLE4 [UNSTABLE], the buffer must be considered as modified by the client until the buffer has either been flushed via a COMMIT [DAFS_PROC_COMMIT] operation or written via a WRITE operation [one of DAFS WRITE operations] with stable parameter set to FILE_SYNC4 or DATA_SYNC4 [FILE_SYNC or DATA_SYNC].
This is done to prevent the buffer from being freed and reused before the data can be flushed to stable storage on the server.

When a response is returned from either a WRITE [one of DAFS WRITE operation flavors] or a COMMIT [DAFS_PROC_COMMIT] operation and it contains a write verifier that is different than previously returned by the server, the client will need to retransmit all of the buffers containing uncommitted cached data to the server. How this is to be done is up to the implementor. If there is only one buffer of interest, then it should probably be sent back over in a WRITE request [one of DAFS WRITE operation flavors] with the appropriate stable parameter. If there is more than one buffer, it might be worthwhile retransmitting all of the buffers in WRITE requests [one of DAFS WRITE request flavors] with the stable parameter set to UNSTABLE4 [UNSTABLE] and then retransmitting the COMMIT [DAFS_PROC_COMMIT] operation to flush all of the data on the server to stable storage. The timing of these retransmissions is left to the implementor.

The above description applies to page-cache-based systems as well as buffer-cache-based systems. In those systems, the virtual memory system will need to be modified instead of the buffer cache." (RFC 3010, pp. 110-111)

ERRORS

DAFSERR_ACCES
DAFSERR_BADHANDLE
DAFSERR_CHAIN_FORM
DAFSERR_CHAIN_BROKEN
DAFSERR_FHEXPIRED
DAFSERR_INVAL
DAFSERR_IO
DAFSERR_ISDIR
DAFSERR_LOCKED
DAFSERR_MOVED
DAFSERR_RESOURCE
DAFSERR_ROFS
DAFSERR_SERVERFAULT
DAFSERR_STALE

6.5.9. DAFS_PROC_CREATE

SUMMARY

Creates a non-regular file object.
ARGUMENTS

const DAFS_TYPE_INVALID = 0;
const DAFS_TYPE_DIR = 2;
const DAFS_TYPE_BLK = 3;
const DAFS_TYPE_CHR = 4;
const DAFS_TYPE_LNK = 5;
const DAFS_TYPE_SOCK = 6;
const DAFS_TYPE_FIFO = 7;
const DAFS_TYPE_ATTRDIR = 8;
const DAFS_TYPE_NAMEDATTR = 9;

struct DAFS_Create_Args {
    dafs_filehandle_type filehandle;
    dafs_component_type component; /* heap */
    dafs_uint32 obj_type;
    union switch (obj_type) {
        case DAFS_TYPE_LNK:
            dafs_utf8string_type linkdata; /* heap */
        case DAFS_TYPE_BLK:
        case DAFS_TYPE_CHR:
            dafs_specdata_type specdata;
        case DAFS_TYPE_SOCK:
        case DAFS_TYPE_FIFO:
        case DAFS_TYPE_DIR:
            void;
    } create_type;
    dafs_file_attr attr; /* heap */
};

RESULTS

struct DAFS_Create_Res {
    dafs_filehandle_type filehandle;
    dafs_change_info_type change_info;
};

DESCRIPTION

"The CREATE [DAFS_PROC_CREATE] operation creates a non-regular file object in a directory with a given name. The OPEN [DAFS_PROC_OPEN] procedure MUST be used to create a regular file.

The objname [component] specifies the name for the new object. If the objname [component] has a length of 0 (zero), the error NFS4ERR_INVAL [DAFSERR_INVAL] will be returned. The objtype [obj_type] determines the type of object to be created: directory, symlink, etc.

If an object of the same name already exists in the directory, the server will return the error NFS4ERR_EXIST [DAFSERR_EXIST].

For the directory where the new file object was created, the server returns change_info4 [dafs_change_info] information in cinfo [change_info]. With the atomic field of the change_info4 [dafs_change_info] struct, the server will indicate if the before and after change attributes were obtained atomically with respect to the file object creation.

If the objname has a length of 0 (zero), or if objname does not obey the UTF-8 definition, the error NFS4ERR_INVAL [DAFSERR_INVAL] will be returned.

The current filehandle is replaced by that of the new object." (RFC 3010, p.
113)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

DAFS_PROC_CREATE allows a client to specify attributes to be set at the time an object of type DAFS_TYPE_DIR is created. Setting attributes for other object types is not permitted and will result in the server returning DAFSERR_INVAL. Setting the attributes when creating a directory saves the client the need to issue a SETATTR call after the CREATE.

IMPLEMENTATION

ERRORS

DAFSERR_ACCES
DAFSERR_BADHANDLE
DAFSERR_BADTYPE
DAFSERR_CHAIN_FORM
DAFSERR_CHAIN_BROKEN
DAFSERR_DQUOT
DAFSERR_EXIST
DAFSERR_FHEXPIRED
DAFSERR_INVAL
DAFSERR_IO
DAFSERR_MOVED
DAFSERR_NAMETOOLONG
DAFSERR_NOSPC
DAFSERR_NOTDIR
DAFSERR_NOTSUPP
DAFSERR_RESOURCE
DAFSERR_ROFS
DAFSERR_SERVERFAULT
DAFSERR_STALE

6.5.10. DAFS_PROC_DELEGPURGE

SUMMARY

Purges delegations awaiting recovery.

ARGUMENTS

None.

RESULTS

None.

DESCRIPTION

"Purges all of the delegations awaiting recovery for a given client. This is useful for clients which do not commit delegation information to stable storage to indicate that conflicting requests need not be delayed by the server awaiting recovery of delegation information.

This operation should be used by clients that record delegation information on stable storage on the client. In this case, DELEGPURGE [DAFS_PROC_DELEGPURGE] should be issued immediately after doing delegation recovery on all delegations know[n] to the client. Doing so will notify the server that no additional delegations for the client will be recovered allowing it to free resources, and avoid delaying other clients who make requests that conflict with the unrecovered delegations. The set of delegations known to the server and the client may be different.
The reason for this is that a client may fail after making a request which resulted in delegation but before it received the results and committed them to the client's stable storage." (RFC 3010, p. 114)

DAFS_PROC_DELEGPURGE takes no arguments. Delegations are purged for the client associated with the session on which the DELEGPURGE request arrives.

ERRORS

DAFSERR_CHAIN_FORM
DAFSERR_CHAIN_BROKEN
DAFSERR_RESOURCE
DAFSERR_SERVERFAULT
DAFSERR_STALE_CLIENTID

6.5.11. DAFS_PROC_DELEGRETURN

SUMMARY

Returns a delegation.

ARGUMENTS

struct DAFS_DelegReturn_Args {
    dafs_state_id_type state_id;
};

RESULTS

None.

DESCRIPTION

"Returns the delegation represented by the given stateid [state_id]." (RFC 3010, p. 115)

ERRORS

DAFSERR_BAD_STATEID
DAFSERR_OLD_STATEID
DAFSERR_RESOURCE
DAFSERR_SERVERFAULT
DAFSERR_STALE_STATEID

6.5.12. DAFS_PROC_GET_ROOT_HANDLE

SUMMARY

Retrieves the root filehandle from a server.

ARGUMENTS

None.

RESULTS

struct DAFS_Get_Root_Handle_Args {
    dafs_filehandle_type root_handle;
};

DESCRIPTION

This procedure returns the root of the server file name space. See 4.1.4., "Objects Naming And Filehandles" for a detailed description of the server name space in the DAFS protocol.

ERRORS

DAFSERR_CHAIN_FORM
DAFSERR_CHAIN_BROKEN

6.5.13. DAFS_PROC_GETATTR_INLINE

SUMMARY

Gets attributes of an object.

ARGUMENTS

struct DAFS_GetAttr_Inline_Args {
    dafs_filehandle_type filehandle;
    dafs_attr_bitmap_type attr_request_bitmap;
};

RESULTS

struct DAFS_GetAttrInline_Res {
    dafs_file_attr_type obj_attributes;
};

DESCRIPTION

"The GETATTR [DAFS_PROC_GETATTR_INLINE] operation will obtain attributes for the file system object specified by the current filehandle." (RFC 3010, p.
116)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

"The client sets a bit in the bitmap [attr_request_bitmap] argument for each attribute value that it would like the server to return. The server returns an attribute bitmap that indicates the attribute values for which it was able to return, followed by the attribute values ordered lowest attribute number first." (RFC 3010, p. 116)

See 4.1.3.3., "Attribute Bitmaps" for an explanation of DAFS encoding of attributes.

"The server must return a value for each attribute that the client requests if the attribute is supported by the server. If the server does not support an attribute or cannot approximate a useful value then it must not return the attribute value and must not set the attribute bit in the result bitmap.

The server must return an error if it supports an attribute but cannot obtain its value. In that case no attribute values will be returned.

All servers must support the mandatory attributes as specified in the section 'File Attributes'." (RFC 3010, p. 116)

See 6.1.5., "File Attributes" for a list of DAFS' mandatory file attributes.

"On success, the current filehandle retains its value." (RFC 3010, p. 116)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

IMPLEMENTATION

The server MUST return an object attributes structure with all the attribute fields that the client requested in the attr_request_bitmap parameter. If a server does not support or cannot provide a requested attribute, it MUST still include space for the attribute but mark its contents as invalid by not setting the corresponding bit in the valid bitmap of the returned obj_attributes.
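The relationship between the requested bitmap and the valid bitmap described above can be sketched as follows. This is a simplified illustration, not the DAFS encoding (which is defined in 4.1.3.3.): the function name, the use of a plain integer as a bitmap, and the dictionary of attribute slots keyed by bit number are all assumptions made for the example.

```python
def build_obj_attributes(attr_request_bitmap, available):
    """Return (valid_bitmap, slots) for a hypothetical GETATTR reply.

    attr_request_bitmap: int with one bit set per requested attribute.
    available: dict mapping bit number -> value for every attribute the
               server supports and can currently provide.
    """
    valid_bitmap = 0
    slots = {}
    bit = 0
    remaining = attr_request_bitmap
    while remaining:
        if remaining & 1:
            if bit in available:
                slots[bit] = available[bit]
                valid_bitmap |= 1 << bit  # contents are valid
            else:
                # Space is still reserved for the attribute, but its
                # valid bit stays clear so the client ignores it.
                slots[bit] = None
        remaining >>= 1
        bit += 1
    return valid_bitmap, slots
```

A client would then consult the valid bitmap before reading any slot, treating a clear bit as "attribute not supported or not available" rather than as an error.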
ERRORS

DAFSERR_ACCES
DAFSERR_BADHANDLE
DAFSERR_CHAIN_FORM
DAFSERR_CHAIN_BROKEN
DAFSERR_DELAY
DAFSERR_FHEXPIRED
DAFSERR_INVAL
DAFSERR_IO
DAFSERR_MOVED
DAFSERR_RESOURCE
DAFSERR_SERVERFAULT
DAFSERR_STALE

6.5.14. DAFS_PROC_GETATTR_DIRECT

SUMMARY

Gets attributes of an object.

ARGUMENTS

struct DAFS_GetAttr_Direct_Args {
    dafs_filehandle_type filehandle;
    dafs_attr_bitmap_type attr_request_bitmap;
    dafs_direct_op_buffer data_buffer;
};

RESULTS

struct DAFS_GetAttr_Direct_Res {
    dafs_checksum_type direct_checksum;
};

/* Server copies the object attributes directly into the data_buffer passed by the client in the arguments structure */

/* DIRECT: dafs_file_attr_type obj_attributes; */

DESCRIPTION

"The GETATTR [DAFS_PROC_GETATTR_DIRECT] operation will obtain attributes for the file system object specified by the current filehandle." (RFC 3010, p. 116)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

"The client sets a bit in the bitmap [attr_request_bitmap] argument for each attribute value that it would like the server to return. The server returns an attribute bitmap that indicates the attribute values for which it was able to return, followed by the attribute values ordered lowest attribute number first." (RFC 3010, p. 116)

See 4.1.3.3., "Attribute Bitmaps" for an explanation of DAFS encoding of attributes.

"The server must return a value for each attribute that the client requests if the attribute is supported by the server. If the server does not support an attribute or cannot approximate a useful value then it must not return the attribute value and must not set the attribute bit in the result bitmap.

The server must return an error if it supports an attribute but cannot obtain its value.
In that case no attribute values will be returned.

All servers must support the mandatory attributes as specified in the section 'File Attributes'." (RFC 3010, p. 116)

See 6.1.5., "File Attributes" for a list of DAFS' mandatory file attributes.

"On success, the current filehandle retains its value." (RFC 3010, p. 116)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

IMPLEMENTATION

The server MUST return an object attributes structure with all the attribute fields that the client requested in the attr_request_bitmap parameter. If a server does not support or cannot provide a requested attribute, it MUST still include space for the attribute but mark its contents as invalid by not setting the corresponding bit in the valid bitmap of the returned obj_attributes.

ERRORS

DAFSERR_ACCES
DAFSERR_BADHANDLE
DAFSERR_CHAIN_FORM
DAFSERR_CHAIN_BROKEN
DAFSERR_DELAY
DAFSERR_FHEXPIRED
DAFSERR_INVAL
DAFSERR_IO
DAFSERR_MOVED
DAFSERR_RESOURCE
DAFSERR_SERVERFAULT
DAFSERR_STALE

6.5.15. DAFS_PROC_GET_FSATTR

SUMMARY

Gets the attributes of the file system where a filehandle resides.

ARGUMENTS

struct DAFS_Get_FSAttr_Args {
    dafs_filehandle_type filehandle;
    dafs_attr_bitmap_type attr_request_bitmap;
};

RESULTS

struct DAFS_Get_FSAttr_Res {
    dafs_filesys_attr_type obj_attributes;
};

DESCRIPTION

"The GETATTR [DAFS_PROC_GET_FSATTR] operation will obtain attributes for the file system object specified by the current filehandle." (RFC 3010, p. 116)

This DAFS operation returns attributes for the file system to which this file object belongs. See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

"The client sets a bit in the bitmap [attr_request_bitmap] argument for each attribute value that it would like the server to return.
The server returns an attribute bitmap that indicates the attribute values for which it was able to return, followed by the attribute values ordered lowest attribute number first." (RFC 3010, p. 116)

See 4.1.3.3., "Attribute Bitmaps" for an explanation of DAFS encoding of attributes.

"The server must return a value for each attribute that the client requests if the attribute is supported by the server. If the server does not support an attribute or cannot approximate a useful value then it must not return the attribute value and must not set the attribute bit in the result bitmap.

The server must return an error if it supports an attribute but cannot obtain its value. In that case no attribute values will be returned.

All servers must support the mandatory attributes as specified in the section 'File Attributes'." (RFC 3010, p. 116)

See 6.1.6., "File System Attributes" for a list of DAFS' mandatory file system attributes.

"On success, the current filehandle retains its value." (RFC 3010, p. 116)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

IMPLEMENTATION

The server MUST return an object attributes structure with all the attribute fields that the client requested in the attr_request_bitmap parameter. If a server does not support or cannot provide a requested attribute, it MUST still include space for the attribute but mark its contents as invalid by not setting the corresponding bit in the valid bitmap of the returned obj_attributes.

ERRORS

DAFSERR_ACCES
DAFSERR_BADHANDLE
DAFSERR_CHAIN_FORM
DAFSERR_CHAIN_BROKEN
DAFSERR_DELAY
DAFSERR_FHEXPIRED
DAFSERR_INVAL
DAFSERR_IO
DAFSERR_MOVED
DAFSERR_RESOURCE
DAFSERR_SERVERFAULT
DAFSERR_STALE

6.5.16.
DAFS_PROC_HURRY_UP

SUMMARY

Sets to zero the usec_window of an I/O request that was previously submitted to the server by DAFS_PROC_BATCH_SUBMIT.

ARGUMENTS

struct DAFS_Hurry_Up_Args {
    dafs_uint64 request_id; /* per session */
};

RESULTS

None.

DESCRIPTION

The DAFS_PROC_HURRY_UP operation is used to inform the server that the client would like the server to "hurry up" processing of a single outstanding dafs_read_write_request from an asynchronous DAFS_PROC_BATCH_SUBMIT request (that is, a request whose receipt the server has acknowledged by responding to the DAFS_PROC_BATCH_SUBMIT, but which it has not yet completed by identifying it in a DAFS_PROC_BC_BATCH_COMPLETION). Specifically, this is intended for the case when the original request was submitted with a non-zero usec_window (indicating that the client would tolerate an unusually long latency), but circumstances have changed on the client such that the client would like the operation to be treated in an "ordinary" fashion rather than a "latency-insensitive" fashion.

The request_id, unique to the DAFS Session, allows the server to identify the correct outstanding dafs_read_write_request. If the server does not have an outstanding dafs_read_write_request on this Session with a matching request_id (either because the client never submitted such a request, because the client has submitted the request but the server has not yet acknowledged receipt of it, or because the server has completed it and reported it in a DAFS_PROC_BC_BATCH_COMPLETION), the server SHALL return the error DAFSERR_BATCH_REQUEST_NOT_FOUND.

Note: this operation is intended only to instruct the server to alter the processing priority associated with the indicated dafs_read_write_request. Generally, a server implementation would move the request from a "high latency operation queue" to an ordinary operation queue.
   The server is recommended to complete the DAFS_PROC_HURRY_UP operation as soon as it has altered the processing priority of the indicated dafs_read_write_request. There is no expectation that the server complete the indicated dafs_read_write_request before completing the DAFS_PROC_HURRY_UP operation.

   The usec_window parameter to DAFS_PROC_BATCH_SUBMIT is a hint that the server is free to ignore. If the server ignores the usec_window parameter, then the server MAY return DAFSERR_NOTSUPP to DAFS_PROC_HURRY_UP. However, if the server does interpret the usec_window parameter to DAFS_PROC_BATCH_SUBMIT, the server SHALL NOT return DAFSERR_NOTSUPP.

ERRORS

   DAFSERR_BATCH_REQUEST_NOT_FOUND
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_NOTSUPP

6.5.17. DAFS_PROC_LINK

SUMMARY

   Creates a hard link to a file.

ARGUMENTS

   struct DAFS_Link_Args {
      dafs_filehandle_type source_object;
      dafs_filehandle_type destination_object;
      dafs_utf8string      newname;   /* heap */
   };

RESULTS

   struct DAFS_Link_Res {
      dafs_change_info_type change_info;
   };

DESCRIPTION

   "The LINK [DAFS_PROC_LINK] operation creates an additional newname for the file represented by the saved filehandle, as set by the SAVEFH operation, in the directory represented by the current filehandle." (RFC 3010, p. 118)

   When the DAFS_PROC_LINK operation occurs within a DAFS operation chain (see 4.3.2., "Chaining Flags", for a description of chaining), the DAFS chain current_filehandle specifies the target directory, and the source object and newname are taken from the message arguments.

   "The existing file and the target directory must reside within the same file system on the server. On success, the current filehandle will continue to be the target directory." (RFC 3010, p. 118)

   The target directory is also passed along in a DAFS chain. See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

   "For the target directory, the server returns change_info4 [dafs_change_info_type] information in cinfo [change_info]. With the atomic field of the change_info4 [dafs_change_info_type] struct, the server will indicate if the before and after change attributes were obtained atomically with respect to the link creation.

   If the newname has a length of 0 (zero), or if newname does not obey the UTF-8 definition, the error NFS4ERR_INVAL [DAFSERR_INVAL] will be returned." (RFC 3010, pp. 118-119)

IMPLEMENTATION

   "Changes to any property of the 'hard' linked files are reflected in all of the linked files. When a link is made to a file, the attributes for the file should have a value for numlinks [num_links] that is one greater than the value before the LINK operation.

   The comments under RENAME [DAFS_PROC_RENAME] regarding object and target residing on the same file system apply here as well. The comments regarding the target name applies as well.

   Note that symbolic links are created with the CREATE [DAFS_PROC_CREATE] operation." (RFC 3010, p. 119)

ERRORS

   DAFSERR_ACCES
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_DQUOT
   DAFSERR_EXIST
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_ISDIR
   DAFSERR_MLINK
   DAFSERR_MOVED
   DAFSERR_NAMETOOLONG
   DAFSERR_NOSPC
   DAFSERR_NOTDIR
   DAFSERR_NOTSUPP
   DAFSERR_RESOURCE
   DAFSERR_ROFS
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_XDEV

6.5.18. DAFS_PROC_LOCK

SUMMARY

   Creates a lock.

ARGUMENTS

   #define RECLAIM      1
   #define PERSIST      2
   #define AUTORECOVERY 4

   enum dafs_lock_type {
      READ_LT   = 1,
      WRITE_LT  = 2,
      READW_LT  = 3,   /* blocking read */
      WRITEW_LT = 4,   /* blocking write */
      ABORT_LT  = 5    /* rollback the lock */
   };

   struct DAFS_Lock_Args {
      dafs_filehandle_type filehandle;
      enum dafs_lock_type  lock_type;
      dafs_uint32          options;    /* Bitmap */
      dafs_state_id_type   state_id;
      dafs_uint64          offset;
      dafs_uint64          length;
   };

RESULTS

   struct DAFS_Lock_Res {
      union switch (status) {
         case DAFS_STATUS_OK:
            void;
         case DAFSERR_DENIED:
         case DAFSERR_LOCK_BROKEN:
            dafs_client_id      owner_clientid;
            dafs_lockowner_type owner;   /* heap */
            dafs_uint64         offset;
            dafs_uint64         length;
            enum dafs_lock_type lock_type;
         default:
            void;
      } lock_res;
   };

DESCRIPTION

   "The LOCK [DAFS_PROC_LOCK] operation requests a record lock for the byte range specified by the offset and length parameters. The lock type is also specified to be one of the nfs4_lock [dafs_lock_type] types. If this is a reclaim request, the reclaim parameter will be TRUE." (RFC 3010, p. 120)

   In DAFS, reclaim operations are specified by setting the RECLAIM bit in the options argument field. If this is a persistent lock, the options parameter will include PERSIST. If this is an auto recovery lock, the options parameter will include AUTORECOVERY. The options parameter value is formed by OR'ing together the desired options. If a server does not support a locking option, it returns DAFSERR_NOTSUPP.

   "Bytes in a file may be locked even if those bytes are not currently allocated to the file. To lock the file from a specific offset through the end-of-file (no matter how long the file actually is) use a length field with all bits set to 1 (one). To lock the entire file, use an offset of 0 (zero) and a length with all bits set to 1. A length of 0 is reserved and should not be used.

   In the case that the lock is denied, the owner, offset, and length of a conflicting lock are returned." (RFC 3010, p. 120)

   DAFS also returns the client-id for the client that owns the conflicting lock.

   "On success, the current filehandle retains its value." (RFC 3010, p. 120)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

IMPLEMENTATION

   "If the server is unable to determine the exact offset and length of the conflicting lock, the same offset and length that were provided in the arguments should be returned in the denied results. The File Locking section contains a full description of this and the other file locking operations." (RFC 3010, p. 120)

   See 4.4., "Locking and Access Control", for a full description of this and the other file locking operations.

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_BAD_STATEID
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_DENIED
   DAFSERR_EXPIRED
   DAFSERR_FHEXPIRED
   DAFSERR_GRACE
   DAFSERR_INVAL
   DAFSERR_ISDIR
   DAFSERR_LEASE_MOVED
   DAFSERR_LOCK_BROKEN
   DAFSERR_LOCK_RANGE
   DAFSERR_MOVED
   DAFSERR_NOTSUPP
   DAFSERR_OLD_STATEID
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_CLIENTID
   DAFSERR_STALE_STATEID

6.5.19. DAFS_PROC_LOCKT

SUMMARY

   Tests for a file lock.
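
   As a rough illustration of the kind of byte-range test involved, the following C sketch checks whether a tested range conflicts with a held lock. It uses the range conventions given for DAFS_PROC_LOCK (a length with all bits set means "through end-of-file"; a length of 0 is reserved). Everything else is an assumption of the sketch: the sketch_ names are hypothetical, lock owners are ignored, two read locks are taken not to conflict, and a range whose end would wrap past the largest 64-bit offset is simply treated as invalid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A length with all bits set to 1 means "through end-of-file". */
#define SKETCH_LEN_TO_EOF UINT64_MAX

/* Hypothetical in-memory form of a lock range. */
struct sketch_range {
    uint64_t offset;
    uint64_t length;
    bool     is_write;   /* true for WRITE_LT-style locks */
};

/* A length of 0 is reserved; ranges that wrap around are rejected
 * (the rejection of wrapping ranges is an assumption of this sketch). */
static bool sketch_range_valid(const struct sketch_range *r)
{
    if (r->length == 0)
        return false;
    if (r->length == SKETCH_LEN_TO_EOF)
        return true;
    return r->length - 1 <= UINT64_MAX - r->offset;
}

/* Last byte offset covered by a valid range. */
static uint64_t sketch_last_byte(const struct sketch_range *r)
{
    if (r->length == SKETCH_LEN_TO_EOF)
        return UINT64_MAX;
    return r->offset + (r->length - 1);
}

/* Two valid ranges conflict if they share at least one byte and at
 * least one of them is a write lock. */
static bool sketch_conflicts(const struct sketch_range *held,
                             const struct sketch_range *wanted)
{
    bool overlap = held->offset <= sketch_last_byte(wanted) &&
                   wanted->offset <= sketch_last_byte(held);
    return overlap && (held->is_write || wanted->is_write);
}
```

   A server following this sketch would report the held range in the denied results when sketch_conflicts() is true, and nothing beyond a success status otherwise.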

ARGUMENTS

   struct DAFS_LockT_Args {
      dafs_filehandle_type filehandle;
      enum dafs_lock_type  lock_type;
      dafs_lockowner_type  owner;   /* heap */
      dafs_uint64          offset;
      dafs_uint64          length;
   };

RESULTS

   struct DAFS_LockT_Res {
      union switch (status) {
         case DAFSERR_DENIED:
         case DAFSERR_LOCK_BROKEN:
            dafs_client_id      owner_clientid;
            dafs_lockowner_type owner;   /* heap */
            dafs_uint64         offset;
            dafs_uint64         length;
            enum dafs_lock_type lock_type;
         case DAFS_STATUS_OK:
         default:
            void;
      } results;
   };

DESCRIPTION

   "The LOCKT [DAFS_PROC_LOCKT] operation tests the lock as specified in the arguments. If a conflicting lock exists, the owner, offset, and length of the conflicting lock are returned; if no lock is held, nothing other than NFS4_OK [DAFS_STATUS_OK] is returned." (RFC 3010, p. 121)

   DAFS also returns the client-id for the client that owns the conflicting lock.

   "On success, the current filehandle retains its value." (RFC 3010, p. 121)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

IMPLEMENTATION

   "If the server is unable to determine the exact offset and length of the conflicting lock, the same offset and length that were provided in the arguments should be returned in the denied results. The File Locking section contains further discussion of the file locking mechanisms." (RFC 3010, pp. 121-122)

   See 4.4., "Locking and Access Control", for further discussion of the file locking mechanisms.

   "LOCKT [DAFS_PROC_LOCKT] uses nfs_lockowner4 [dafs_lockowner_type] instead of a stateid4 [dafs_state_id_type], as LOCK [DAFS_PROC_LOCK] does, to identify the owner so that the client does not have to open the file to test for the existence of a lock." (RFC 3010, p. 122)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_DENIED
   DAFSERR_FHEXPIRED
   DAFSERR_GRACE
   DAFSERR_INVAL
   DAFSERR_ISDIR
   DAFSERR_LEASE_MOVED
   DAFSERR_LOCK_BROKEN
   DAFSERR_LOCK_RANGE
   DAFSERR_MOVED
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_CLIENTID

6.5.20. DAFS_PROC_LOCKU

SUMMARY

   Unlocks a file.

ARGUMENTS

   struct DAFS_LOCKU_Arg {
      dafs_filehandle_type filehandle;
      enum dafs_lock_type  lock_type;
      dafs_state_id_type   state_id;
      dafs_uint64          offset;
      dafs_uint64          length;
   };

RESULTS

   None.

DESCRIPTION

   "The LOCKU [DAFS_PROC_LOCKU] operation unlocks the record lock specified by the parameters. On success, the current filehandle retains its value." (RFC 3010, p. 123)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

IMPLEMENTATION

   See 4.4., "Locking and Access Control", for a full description of this and the other file locking procedures.

ERRORS

   DAFSERR_ACCES
   DAFSERR_BAD_STATEID
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_EXPIRED
   DAFSERR_FHEXPIRED
   DAFSERR_GRACE
   DAFSERR_INVAL
   DAFSERR_LOCK_RANGE
   DAFSERR_LEASE_MOVED
   DAFSERR_MOVED
   DAFSERR_OLD_STATEID
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_CLIENTID
   DAFSERR_STALE_STATEID

6.5.21. DAFS_PROC_LOOKUP

SUMMARY

   Looks up a file object given its name.

ARGUMENTS

   struct DAFS_Lookup_Args {
      dafs_filehandle_type directory;
      dafs_pathname_type   path;   /* heap */
   };

RESULTS

   struct DAFS_Lookup_Res {
      dafs_filehandle_type filehandle;
      dafs_uint32          component_count;
   };

DESCRIPTION

   "This operation LOOKUPs or finds a file system object starting from the directory specified by the current filehandle." (RFC 3010, p. 124)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

   "LOOKUP [DAFS_PROC_LOOKUP] evaluates the pathname contained in the array of names and obtains a new current filehandle from the final name. All but the final name in the list must be the names of directories.

   If the pathname cannot be evaluated either because a component does not exist or because the client does not have permission to evaluate a component of the path, then an error will be returned and the current filehandle will be unchanged.

   If the path is a zero length array, if any component does not obey the UTF-8 definition, or if any component in the path is of zero length, the error NFS4ERR_INVAL [DAFSERR_INVAL] will be returned." (RFC 3010, p. 124)

   If a DAFS_PROC_LOOKUP request contains multiple pathname segments, the response packet specifies the number of pathname components successfully looked up in the component_count result field, and the filehandle of the last component successfully looked up. A component_count result value of 0 indicates that the lookup of the first component failed and that the content of the returned filehandle is invalid. A value greater than 0 indicates that one or more component lookups succeeded. If the server encounters an error before all pathname components are looked up, the error status returned is either DAFSERR_NO_PARTIAL_INFO, meaning that the component_count field and filehandle are both invalid, or the error status that applies to the component on which the operation failed.

IMPLEMENTATION

   "NFS version 4 [and DAFS] servers depart from the semantics of previous NFS versions in allowing LOOKUP [DAFS_PROC_LOOKUP] requests to cross mountpoints on the server. The client can detect a mountpoint crossing by comparing the fsid attribute of the directory with the fsid attribute of the directory looked up.
   If the fsids [dafs_FS_Handles] are different then the new directory is a server mountpoint. Unix clients that detect a mountpoint crossing will need to mount the server's filesystem. This needs to be done to maintain the file object identity checking mechanisms common to Unix clients.

   Servers that limit NFS [DAFS] access to 'shares' or 'exported' filesystems should provide a pseudo-filesystem into which the exported filesystems can be integrated, so that clients can browse the server's name space. The clients view of a pseudo filesystem will be limited to paths that lead to exported filesystems.

   Note: previous versions of the protocol assigned special semantics to the names '.' and '..'. NFS version 4 [DAFS] assigns no special semantics to these names. The LOOKUPP [DAFS_PROC_LOOKUPP] operator must be used to lookup a parent directory.

   Note that this procedure does not follow symbolic links. The client is responsible for all parsing of filenames including filenames that are modified by symbolic links encountered during the lookup process.

   If the current file handle supplied is not a directory but a symbolic link, the error NFS4ERR_SYMLINK [DAFSERR_SYMLINK] is returned as the error. For all other non-directory file types, the error NFS4ERR_NOTDIR [DAFSERR_NOTDIR] is returned." (RFC 3010, p. 125)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_NAMETOOLONG
   DAFSERR_NOENT
   DAFSERR_NOTDIR
   DAFSERR_NO_PARTIAL_INFO
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_SYMLINK

6.5.22. DAFS_PROC_LOOKUPP

SUMMARY

   Looks up parent directory.

ARGUMENTS

   struct DAFS_Lookupp_Args {
      dafs_filehandle_type filehandle;
   };

RESULTS

   struct DAFS_Lookupp_Res {
      dafs_filehandle_type filehandle;
   };

DESCRIPTION

   "The current filehandle is assumed to refer to a regular directory or a named attribute directory." (RFC 3010, p. 126)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

   "LOOKUPP [DAFS_PROC_LOOKUPP] assigns the filehandle for its parent directory to be the current filehandle. If there is no parent directory an NFS4ERR_ENOENT [DAFSERR_NOENT] error must be returned. Therefore, NFS4ERR_ENOENT [DAFSERR_NOENT] will be returned by the server when the current filehandle is at the root or top of the server's file tree." (RFC 3010, p. 126)

IMPLEMENTATION

   "As for LOOKUP [DAFS_PROC_LOOKUP], LOOKUPP [DAFS_PROC_LOOKUPP] will also cross mountpoints.

   If the current filehandle is not a directory or named attribute directory, the error NFS4ERR_NOTDIR [DAFSERR_NOTDIR] is returned." (RFC 3010, p. 126)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_NOENT
   DAFSERR_NOTDIR
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE

6.5.23. DAFS_PROC_NVERIFY

SUMMARY

   Verifies difference in attributes.

ARGUMENTS

   struct DAFS_Nverify_Args {
      dafs_filehandle_type filehandle;
      dafs_file_attr_type  obj_attributes;
   };

RESULTS

   None.

DESCRIPTION

   "This operation is used to prefix a sequence of operations to be performed if one or more attributes have changed on some filesystem object. If all the attributes match then the error NFS4ERR_SAME [DAFSERR_SAME] must be returned.

   On success, the current filehandle retains its value." (RFC 3010, p. 127)

   See 4.1.3.1., "Filehandles in Compound vs.
   Chaining" for differences in filehandle management in DAFS procedures.

IMPLEMENTATION

   "This operation is useful as a cache validation operator. If the object to which the attributes belong has changed then the following operations may obtain new data associated with that object." (RFC 3010, p. 127)

   "In the case that a recommended attribute is specified in the NVERIFY [DAFS_PROC_NVERIFY] operation and the server does not support that attribute for the file system object, the error NFS4ERR_NOTSUPP [DAFSERR_NOTSUPP] is returned to the client." (RFC 3010, p. 127)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_NOTSUPP
   DAFSERR_RESOURCE
   DAFSERR_SAME
   DAFSERR_SERVERFAULT
   DAFSERR_STALE

6.5.24. DAFS_PROC_OPEN

SUMMARY

   Opens a regular file.

ARGUMENTS

   enum createmode {
      UNCHECKED = 0,
      GUARDED   = 1,
      EXCLUSIVE = 2
   };

   enum opentype {
      OPEN_NOCREATE = 0,
      OPEN_CREATE   = 1
   };

   enum open_claim_type {
      CLAIM_NULL            = 0,
      CLAIM_PREVIOUS        = 1,
      CLAIM_DELEGATE_CUR    = 2,
      CLAIM_DELEGATE_PREV   = 3,
      CLAIM_CREATE_UNLINKED = 4
   };

   enum open_delete_disp {
      DELETE_DONT_CARE = 0,
      DELETE_DENY      = 1
   };

   enum limit_by {
      DAFS_LIMIT_SIZE   = 1,
      DAFS_LIMIT_BLOCKS = 2
   };

   enum open_delegation_type {
      OPEN_DELEGATE_NONE  = 0,
      OPEN_DELEGATE_READ  = 1,
      OPEN_DELEGATE_WRITE = 2
   };

   const DAFS_OPEN_SHARE_ACCESS_READ  = 0x00000001;
   const DAFS_OPEN_SHARE_ACCESS_WRITE = 0x00000002;
   const DAFS_OPEN_SHARE_ACCESS_BOTH  = 0x00000003;

   const DAFS_OPEN_SHARE_DENY_NONE    = 0x00000000;
   const DAFS_OPEN_SHARE_DENY_READ    = 0x00000001;
   const DAFS_OPEN_SHARE_DENY_WRITE   = 0x00000002;
   const DAFS_OPEN_SHARE_DENY_BOTH    = 0x00000003;

   const DAFS_OPEN_SHARE_KEY_NONE     = 0x00000000;
   const DAFS_OPEN_SHARE_KEY_BOTH     = 0x00000003;

   struct DAFS_Open_Args {
      enum open_claim_type claim_type;
      union switch (claim_type) {
         case CLAIM_NULL:
            dafs_filehandle_type dir_handle;
            dafs_pathname_type   claimnull_pathname;      /* heap */
         case CLAIM_PREVIOUS:
            dafs_filehandle_type filehandle;
            dafs_uint32          delegate_type;
         case CLAIM_DELEGATE_CUR:
            dafs_filehandle_type dir_handle;
            dafs_pathname_type   claimdelcur_pathname;    /* heap */
            dafs_state_id_type   claimdelcur_stateid;
         case CLAIM_DELEGATE_PREV:
            dafs_filehandle_type dir_handle;
            dafs_pathname_type   claimdelprev_pathname;   /* heap */
         case CLAIM_CREATE_UNLINKED:
            dafs_filehandle_type dir_handle;
      } open_claim;
      enum opentype open_type;
      union switch (open_type) {
         case OPEN_CREATE:
            enum createmode createmode;
            union createhow switch (createmode) {
               case UNCHECKED:
               case GUARDED:
                  dafs_file_attr_type createattrs;
               case EXCLUSIVE:
                  dafs_verifier_type  create_verifier;
            };
         default:
            void;
      } openflag;
      enum open_delete_disp delete_disp;
      dafs_lockowner_type   owner;   /* heap */
      dafs_uint32           share_access;
      dafs_uint32           share_deny;
      dafs_uint32           share_key_type;
      dafs_uint32           pad;
      dafs_uint64           share_key;
   };

RESULTS

   const OPEN_RESULT_MLOCK = 0x00000001;

   struct DAFS_Open_Res {
      dafs_filehandle_type      filehandle;
      dafs_state_id_type        state_id;
      dafs_change_info_type     change_info;
      dafs_uint32               component_count;
      dafs_uint32               result_flags;
      enum open_delegation_type delegation_type;
      union switch (delegation_type) {
         case OPEN_DELEGATE_NONE:
            void;
         case OPEN_DELEGATE_READ:
            dafs_state_id_type state_id;
            dafs_uint32        recall;
            dafs_uint32        permissions_acetype;
            dafs_uint32        permissions_aceflag;
            dafs_uint32        permissions_acemask;
            dafs_utf8string    readdel_who;    /* heap */
         case OPEN_DELEGATE_WRITE:
            dafs_state_id_type state_id;
            dafs_uint32        recall;
            enum limit_by      limitby;
            union space_limit switch (limitby) {
               case DAFS_LIMIT_SIZE:
                  dafs_uint64 filesize;
               case DAFS_LIMIT_BLOCKS:
                  dafs_uint32 num_blocks;
                  dafs_uint32 bytes_per_block;
            };
            dafs_uint32        permissions_acetype;
            dafs_uint32        permissions_aceflag;
            dafs_uint32        permissions_acemask;
            dafs_utf8string    writedel_who;   /* heap */
      } opendelegation;
   };

   "WARNING TO CLIENT IMPLEMENTORS

   OPEN [DAFS_PROC_OPEN] resembles LOOKUP [DAFS_PROC_LOOKUP] in that it generates a filehandle for the client to use. Unlike LOOKUP [DAFS_PROC_LOOKUP] though, OPEN [DAFS_PROC_OPEN] creates server state on the filehandle. In normal circumstances, the client can only release this state with a CLOSE [DAFS_PROC_CLOSE] operation. CLOSE [DAFS_PROC_CLOSE] uses the current filehandle to determine which file to close." (RFC 3010, p. 132)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

   "Simply waiting for the lease on the file to expire is insufficient because the server may maintain the state indefinitely as long as another client does not attempt to make a conflicting access to the same file." (RFC 3010, p. 132)

DESCRIPTION

   "The OPEN [DAFS_PROC_OPEN] operation creates and/or opens a regular file in a directory with the provided name. If the file does not exist at the server and creation is desired, specification of the method of creation is provided by the openhow [open_type] parameter. The client has the choice of three creation methods: UNCHECKED, GUARDED, or EXCLUSIVE.

   UNCHECKED means that the file should be created if a file of that name does not exist and encountering an existing regular file of that name is not an error. For this type of create, createattrs specifies the initial set of attributes for the file. The set of attributes may includes any writable attribute valid for regular files.
   When an UNCHECKED create encounters an existing file, the attributes specified by createattrs is not used, except that when an object_size of zero is specified, the existing file is truncated. If GUARDED is specified, the server checks for the presence of a duplicate object by name before performing the create. If a duplicate exists, an error of NFS4ERR_EXIST [DAFSERR_EXIST] is returned as the status. If the object does not exist, the request is performed as described for UNCHECKED.

   EXCLUSIVE specifies that the server is to follow exclusive creation semantics, using the verifier [create_verifier] to ensure exclusive creation of the target. The server should check for the presence of a duplicate object by name. If the object does not exist, the server creates the object and stores the verifier with the object. If the object does exist and the stored verifier matches the client provided verifier, the server uses the existing object as the newly created object. If the stored verifier does not match, then an error of NFS4ERR_EXIST [DAFSERR_EXIST] is returned. No attributes may be provided in this case, since the server may use an attribute of the target object to store the verifier.

   For the target directory, the server returns change_info4 [dafs_change_info_type] information in cinfo [change_info]. With the atomic field of the change_info4 [dafs_change_info_type] struct, the server will indicate if the before and after change attributes were obtained atomically with respect to the link creation.

   Upon successful creation, the current filehandle is replaced by that of the new object." (RFC 3010, pp. 132-133)

   See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

   "The OPEN [DAFS_PROC_OPEN] procedure provides for DOS SHARE capability with the use of the access and deny fields of the OPEN [DAFS_PROC_OPEN] arguments.
   The client specifies at OPEN [DAFS_PROC_OPEN] the required access and deny modes. For clients that do not directly support SHAREs (i.e. Unix), the expected deny value is DENY_NONE [DAFS_OPEN_SHARE_DENY_NONE]. In the case that there is a existing SHARE reservation that conflicts with the OPEN [DAFS_PROC_OPEN] request, the server returns the error NFS4ERR_DENIED [DAFSERR_DENIED]. For a complete SHARE request, the client must provide values for the owner and seqid fields for the OPEN [DAFS_PROC_OPEN] argument. For additional discussion of SHARE semantics see the section on "Share Reservations"." (RFC 3010, p. 133)

   The DAFS locking model does not require the use of sequence ids. Therefore, the DAFS_PROC_OPEN arguments structure does not contain one.

   "In the case that the client is recovering state from a server failure, the reclaim [claim_type] field of the OPEN [DAFS_PROC_OPEN] argument is used to signify that the request is meant to reclaim state previously held.

   The 'claim [claim_type]' field of the OPEN [DAFS_PROC_OPEN] argument is used to specify the file to be opened and the state information which the client claims to possess. There are four basic claim types which cover the various situations for an OPEN [DAFS_PROC_OPEN]. They are as follows:

   CLAIM_NULL
      For the client, this is a new OPEN [DAFS_PROC_OPEN] request and there is no previous state associate[d] with the file for the client.

   CLAIM_PREVIOUS
      The client is claiming basic OPEN [DAFS_PROC_OPEN] state for a file that was held previous to a server reboot. Generally used when a server is returning persistent file handles; the client may not have the file name to reclaim the OPEN [DAFS_PROC_OPEN].

   CLAIM_DELEGATE_CUR
      The client is claiming a delegation for OPEN [DAFS_PROC_OPEN] as granted by the server. Generally this is done as part of recalling a delegation.

   CLAIM_DELEGATE_PREV
      The client is claiming a delegation granted to a previous client instance; used after the client reboots.

   For OPEN [DAFS_PROC_OPEN] requests whose claim type is other than CLAIM_PREVIOUS (i.e. requests other than those devoted to reclaiming opens after a server reboot) that reach the server during its grace or lease expiration period, the server returns an error of NFS4ERR_GRACE [DAFSERR_GRACE].

   For any OPEN [DAFS_PROC_OPEN] request, the server may return an open delegation, which allows further opens and closes to be handled locally on the client as described in the section Open Delegation. Note that delegation [delegation_type] is up to the server to decide. The client should never assume that delegation [delegation_type] will or will not be granted in a particular instance. It should always be prepared for either case. A partial exception is the reclaim (CLAIM_PREVIOUS) case, in which a delegation type is claimed. In this case, delegation will always be granted, although the server may specify an immediate recall in the delegation structure.

   The rflags [result_flags] returned by a successful OPEN [DAFS_PROC_OPEN] allow the server to return information governing how the open file is to be handled. OPEN4_RESULT_MLOCK [OPEN_RESULT_MLOCK] indicates to the caller that mandatory locking is in effect for this file and the client should act appropriately with regard to data cached on the client. OPEN4_RESULT_CONFIRM indicates that the client MUST execute an OPEN_CONFIRM operation before using the open file." (RFC 3010, pp. 133-134)

   There is no need for confirming a DAFS_PROC_OPEN. Consequently, this flag does not exist in the DAFS protocol and there is no corresponding open confirm operation.
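
   Two of the rules above are simple enough to sketch in C: the grace-period check on the claim type and the test of the result flags. The enum values and the OPEN_RESULT_MLOCK constant are copied from the ARGUMENTS and RESULTS of this procedure; the sketch_ function names and the in_grace_period input are hypothetical. The sketch applies the quoted rule literally, so every claim type other than CLAIM_PREVIOUS, including the DAFS-specific CLAIM_CREATE_UNLINKED, is refused during the grace period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as listed in the ARGUMENTS of DAFS_PROC_OPEN. */
enum open_claim_type {
    CLAIM_NULL            = 0,
    CLAIM_PREVIOUS        = 1,
    CLAIM_DELEGATE_CUR    = 2,
    CLAIM_DELEGATE_PREV   = 3,
    CLAIM_CREATE_UNLINKED = 4
};

/* Value as listed in the RESULTS of DAFS_PROC_OPEN. */
#define OPEN_RESULT_MLOCK 0x00000001u

/* Server side: true if an open with this claim type would be refused
 * with DAFSERR_GRACE. Only CLAIM_PREVIOUS passes during the grace period. */
static bool sketch_refuse_for_grace(enum open_claim_type claim,
                                    bool in_grace_period)
{
    return in_grace_period && claim != CLAIM_PREVIOUS;
}

/* Client side: true if the server reported that mandatory locking is
 * in effect for the opened file. */
static bool sketch_mandatory_locking(uint32_t result_flags)
{
    return (result_flags & OPEN_RESULT_MLOCK) != 0;
}
```

   A client that sees sketch_mandatory_locking() return true would then decide how to treat locally cached data for that file; that policy is outside the scope of this sketch.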

   "If the file [claimnull_pathname, claimdelcur_pathname, or claimdelprev_pathname] is a zero length array, if any component does not obey the UTF-8 definition, or if any component in the path is of zero length, the error NFS4ERR_INVAL [DAFSERR_INVAL] will be returned.

   When an OPEN [DAFS_PROC_OPEN] is done and the specified lockowner [owner] already has the resulting filehandle open, the result is to 'OR' together the new share and deny status together with the existing status. In this case, only a single CLOSE [DAFS_PROC_CLOSE] need be done, even though multiple OPEN's [DAFS_PROC_OPEN] were completed." (RFC 3010, pp. 132-134)

   If the OPEN claim_type is CLAIM_CREATE_UNLINKED, then DAFS_PROC_OPEN creates an unlinked regular file in the file system in which the specified directory is located. A subsequent link request can be used to link the file in that directory, or in any other directory which could link to the file if it already had a link in the specified directory.

   The OPEN procedure provides for Share Key Reservations with the use of the share_key_type and share_key fields of the OPEN arguments. The client specifies at OPEN the share_key_type, and if the share_key_type is not DAFS_OPEN_SHARE_KEY_NONE, the client also specifies the target SHARE KEY. For clients that wish to bypass SHARE KEY verification (i.e. all legacy clients), the expected share_key_type value is DAFS_OPEN_SHARE_KEY_NONE. If there is an existing SHARE KEY reservation that conflicts with the OPEN request, the server returns the error DAFSERR_KEY_MISMATCH.

   See DAFS_PROC_LOOKUP for a description of the component_count result field.

IMPLEMENTATION

   "The OPEN [DAFS_PROC_OPEN] procedure contains support for EXCLUSIVE create. The mechanism is similar to the support in NFS version 3 [RFC1813]. As in NFS version 3, this mechanism provides reliable exclusive creation. Exclusive create is invoked when the how [createmode] parameter is EXCLUSIVE.
   In this case, the client provides a verifier [create_verifier] that can reasonably be expected to be unique. A combination of a client identifier, perhaps the client network address, and a unique number generated by the client, perhaps the RPC transaction identifier, may be appropriate." (RFC 3010, p. 135)

   DAFS does not use RPC. Clients could use the equivalent stream_id/seq_number information that is already generated for the DAFS header.

   "If the object does not exist, the server creates the object and stores the verifier [create_verifier] in stable storage. For file systems that do not provide a mechanism for the storage of arbitrary file attributes, the server may use one or more elements of the object meta-data to store the verifier [create_verifier]. The verifier [create_verifier] must be stored in stable storage to prevent erroneous failure on retransmission of the request. It is assumed that an exclusive create is being performed because exclusive semantics are critical to the application. Because of the expected usage, exclusive CREATE does not rely solely on the normally volatile duplicate request cache for storage of the verifier." (RFC 3010, p. 135)

   DAFS clients MAY rely on the persistent response cache for exclusive-create semantics, if use of the Response Cache has been agreed to for the Session. DAFS servers, however, MUST handle create verifiers as described here, regardless of whether the server implements persistent response caches.

   "The duplicate request cache in volatile storage does not survive a crash and may actually flush on a long network partition, opening failure windows. In the UNIX local file system environment, the expected storage location for the verifier on creation is the meta-data (time stamps) of the object. For this reason, an exclusive object create may not include initial attributes because the server would have nowhere to store the verifier.
If the server can not support these exclusive create semantics, possibly because of the requirement to commit the verifier to stable storage, it should fail the OPEN [DAFS_PROC_OPEN] request with the error, NFS4ERR_NOTSUPP [DAFSERR_NOTSUPP].

During an exclusive CREATE request, if the object already exists, the server reconstructs the object's verifier and compares it with the verifier [create_verifier] in the request. If they match, the server treats the request as a success. The request is presumed to be a duplicate of an earlier, successful request for which the reply was lost and that the server duplicate request cache mechanism did not detect. If the verifiers do not match, the request is rejected with the status, NFS4ERR_EXIST [DAFSERR_EXIST].

Once the client has performed a successful exclusive create, it must issue a SETATTR [DAFS_PROC_SETATTR_INLINE or DAFS_PROC_SETATTR_DIRECT] to set the correct object attributes. Until it does so, it should not rely upon any of the object attributes, since the server implementation may need to overload object meta-data to store the verifier. The subsequent SETATTR must not occur in the same COMPOUND request as the OPEN. This separation will guarantee that the exclusive create mechanism will continue to function properly in the face of retransmission of the request." (RFC 3010, pp. 135-136)

The setattr and open-exclusive MAY be part of the same DAFS chain. However, a DAFS client SHOULD be aware that this could cause problems if a server crashes and does not keep persistent response cache information. In that event, the completion status of the exclusive create and of the setattr might be unknown to the client, and the client might not be able to determine accurately whether the file was created exclusively.

"Use of the GUARDED attribute does not provide exactly-once semantics.
In particular, if a reply is lost and the server does not detect the retransmission of the request, the procedure can fail with NFS4ERR_EXIST [DAFSERR_EXIST], even though the create was performed successfully." (RFC 3010, p. 136)

DAFS clients do not retransmit a request on an active session. This type of error would occur if a client issues the request on a new session and either the response cache (volatile) has been lost, or the client does not properly check the cache.

"For SHARE reservations, the client must specify a value for access that is one of READ, WRITE, or BOTH [DAFS_OPEN_SHARE_ACCESS_READ, DAFS_OPEN_SHARE_ACCESS_WRITE or DAFS_OPEN_SHARE_ACCESS_BOTH]. For deny, the client must specify one of NONE, READ, WRITE, or BOTH [DAFS_OPEN_SHARE_DENY_NONE, DAFS_OPEN_SHARE_DENY_READ, DAFS_OPEN_SHARE_DENY_WRITE, or DAFS_OPEN_SHARE_DENY_BOTH]. If the client fails to do this, the server must return NFS4ERR_INVAL [DAFSERR_INVAL].

If the final component provided to OPEN [DAFS_PROC_OPEN] is a symbolic link, the error NFS4ERR_SYMLINK [DAFSERR_SYMLINK] will be returned to the client. If an intermediate component of the pathname provided to OPEN is a symbolic link, the error NFS4ERR_NOTDIR [DAFSERR_NOTDIR] will be returned to the client." (RFC 3010, pp. 135-136)

For SHARE KEY reservations the client specifies a value for share_key_type that is one of SHARE_KEY_NONE or SHARE_KEY_BOTH. If the client fails to do this, the server returns DAFSERR_INVAL. If the server cannot support SHARE KEY semantics and the share_key_type is not SHARE_KEY_NONE, the server fails the OPEN request and returns the error DAFSERR_NOTSUPP.

The open_delete_disp flags specify a disposition by which subsequent remove requests are handled with respect to the open file. See 6.5.33., "DAFS_PROC_REMOVE" for more on file removal semantics.
An open operation that specifies a delete disposition that is not fully supported by the server results in DAFSERR_DENYDISP_NOTSUPP status.

Note: If the server supports multiple protocols, then requesting a disposition of DELETE_DENY MAY result in the server returning either this error or DAFSERR_STATUS_OK.

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_BROKEN
   DAFSERR_CHAIN_FORM
   DAFSERR_DELAY
   DAFSERR_DENYDISP_NOTSUPP
   DAFSERR_EXIST
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_ISDIR
   DAFSERR_MOVED
   DAFSERR_NOENT
   DAFSERR_NOTDIR
   DAFSERR_NOTSUPP
   DAFSERR_NO_PARTIAL_INFO
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_SYMLINK

6.5.25. DAFS_PROC_OPENATTR

SUMMARY

Opens the named attribute directory.

ARGUMENTS

   struct DAFS_OpenAttr_Args {
       dafs_filehandle_type filehandle;
   };

RESULTS

   struct DAFS_OpenAttr_Res {
       dafs_filehandle_type filehandle;
   };

DESCRIPTION

"The OPENATTR [DAFS_PROC_OPENATTR] operation is used to obtain the filehandle of the named attribute directory associated with the current filehandle." (RFC 3010, p. 137)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS procedures.

"The result of the OPENATTR [DAFS_PROC_OPENATTR] will be a filehandle to an object of type NF4ATTRDIR [DAFS_TYPE_ATTRDIR]. From this filehandle, READDIR [DAFS_PROC_READDIR] and LOOKUP [DAFS_PROC_LOOKUP] procedures can be used to obtain filehandles for the various named attributes associated with the original file system object. Filehandles returned within the named attribute directory will have a type of NF4NAMEDATTR [DAFS_TYPE_NAMEDATTR]." (RFC 3010, p.
137)

IMPLEMENTATION

"If the server does not support named attributes for the current filehandle, an error of NFS4ERR_NOTSUPP [DAFSERR_NOTSUPP] will be returned to the client." (RFC 3010, p. 137)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_NOENT
   DAFSERR_NOTSUPP
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE

6.5.26. DAFS_PROC_OPEN_DOWNGRADE

SUMMARY

Reduces open file access rights.

ARGUMENTS

   struct DAFS_Open_Downgrade_Args {
       dafs_filehandle_type filehandle;
       dafs_state_id_type state_id;
       dafs_uint32 share_access;
       dafs_uint32 share_deny;
       dafs_uint32 share_key_type;
       dafs_uint32 pad;
       dafs_uint64 share_key;
   };

RESULTS

None.

DESCRIPTION

"This operation is used to adjust the access and deny bits for a given open. This is necessary when a given lockowner opens the same file multiple times with different access and deny flags. In this situation, a close of one of the open's may change the appropriate access and deny flags to remove bits associated with open's no longer in effect.

The access and deny bits specified in this operation replace the current ones for the specified open file. If either the access or the deny mode specified includes bits not in effect for the open, the error NFS4ERR_INVAL [DAFSERR_INVAL] should be returned. Since access and deny bits are subsets of those already granted, it is not possible for this request to be denied because of conflicting share reservations.

On success, the current filehandle retains its value." (RFC 3010, p. 141)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

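The subset rule for the access and deny bits can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not part of the protocol definition: the bit values, the ex_open_state structure, the ex_status codes and the function ex_open_downgrade are all hypothetical names invented for the example, since this section does not give numeric values for the share constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this example. */
enum {
    EX_SHARE_ACCESS_READ  = 0x1,
    EX_SHARE_ACCESS_WRITE = 0x2,
    EX_SHARE_DENY_READ    = 0x1,
    EX_SHARE_DENY_WRITE   = 0x2
};

/* Hypothetical status codes standing in for success and DAFSERR_INVAL. */
enum ex_status { EX_OK = 0, EX_INVAL = 1 };

/* Hypothetical per-open state kept by a server. */
struct ex_open_state {
    uint32_t share_access;
    uint32_t share_deny;
};

/*
 * Replace the access and deny bits of an open with the requested ones.
 * The request is rejected if it names any bit that is not currently in
 * effect for the open, so a downgrade can only remove bits.
 */
static enum ex_status ex_open_downgrade(struct ex_open_state *open,
                                        uint32_t new_access,
                                        uint32_t new_deny)
{
    assert(open != NULL);

    if ((new_access & ~open->share_access) != 0 ||
        (new_deny & ~open->share_deny) != 0) {
        return EX_INVAL;            /* state is left unchanged */
    }

    open->share_access = new_access;
    open->share_deny = new_deny;
    return EX_OK;
}
```

Because the new bits are always a subset of the old ones, the sketch never needs to consult the share reservations held by other opens, which matches the observation that the request cannot fail on a share conflict.
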
This operation is also used to release the SHARE KEY reservation held by a given open. This is necessary when a given lockowner wishes to exit the group of lockowners with SHARE KEY access to the file. If share_key_type is SHARE_KEY_NONE and the lockowner previously held a SHARE KEY reservation on the file, then that lockowner's SHARE KEY reservation is released (and hence file_state.share_key_count is decremented). If share_key_type is not SHARE_KEY_NONE, and both share_key_type and share_key do not match the current open, then the error DAFSERR_INVAL is returned. Since this definition only permits SHARE KEY reservations to be released, and not acquired, it is not possible for this request to be denied because of conflicting share_key reservations.

ERRORS

   DAFSERR_BADHANDLE
   DAFSERR_BAD_STATEID
   DAFSERR_CHAIN_BROKEN
   DAFSERR_CHAIN_FORM
   DAFSERR_EXPIRED
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_MOVED
   DAFSERR_OLD_STATEID
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_STATEID

6.5.27. DAFS_PROC_READ_INLINE

SUMMARY

Reads data from a file. The data transfer is done inline using memory pointed to by the read descriptor buffers.

ARGUMENTS

   struct DAFS_Read_Inline_Args {
       dafs_filehandle_type filehandle;
       dafs_state_id_type state_id;
       dafs_uint64 offset;
       dafs_uint32 byte_count;
       dafs_cache_hint_type cache_hint;
   };

RESULTS

   struct DAFS_Read_Inline_Res {
       dafs_uint32 eof;
       dafs_uint32 bytes_read;
       dafs_opaque8 read_data[byte_count];
       /* Split count & data for alignment */
   };

DESCRIPTION

"The READ [DAFS_PROC_READ_INLINE] operation reads data from the regular file identified by the current filehandle." (RFC 3010, p. 144)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS procedures.

"The client provides an offset of where the READ [DAFS_PROC_READ_INLINE] is to start and a count [byte_count] of how many bytes are to be read.
An offset of 0 (zero) means to read data starting at the beginning of the file. If offset is greater than or equal to the size of the file, the status, NFS4_OK [DAFS_STATUS_OK], is returned with a data length [bytes_read] set to 0 (zero) and eof is set to TRUE. The READ [DAFS_PROC_READ_INLINE] is subject to access permissions checking.

If the client specifies a count [byte_count] value of 0 (zero), the READ [DAFS_PROC_READ_INLINE] succeeds and returns 0 (zero) bytes of data again subject to access permissions checking. The server may choose to return fewer bytes than specified by the client. The client needs to check for this condition and handle the condition appropriately.

The stateid [state_id] value for a READ [DAFS_PROC_READ_INLINE] request represents a value returned from a previous record lock or share reservation request. Used by the server to verify that the associated lock is still valid and to update lease timeouts for the client." (RFC 3010, pp. 144-145)

In DAFS, leases are updated by any DAFS procedure, including DAFS_PROC_NULL. DAFS servers use information associated with the session of the incoming request to determine which client's leases to renew.

"If the read ended at the end-of-file (formally, in a correctly formed READ [DAFS_PROC_READ_INLINE] request, if offset + count [byte_count] is equal to the size of the file), or the read request extends beyond the size of the file (if offset + count [byte_count] is greater than the size of the file), eof is returned as TRUE; otherwise it is FALSE. A successful READ [DAFS_PROC_READ_INLINE] of an empty file will always return eof as TRUE.

On success, the current filehandle retains its value." (RFC 3010, pp. 144-145)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS requests.

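The rules above for bytes_read and eof can be worked through in a few lines. The following is a sketch only, assuming a server that always transfers every byte available within the requested range; a real server may return fewer bytes. The names ex_read_result and ex_read_extent are hypothetical and do not appear in the protocol.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the two result fields discussed in the text. */
struct ex_read_result {
    uint32_t eof;         /* 1 for TRUE, 0 for FALSE */
    uint32_t bytes_read;
};

/*
 * Compute bytes_read and eof for a read of byte_count bytes at offset in
 * a file of file_size bytes, assuming the full available range is
 * returned. The comparison is made against (file_size - offset) rather
 * than (offset + byte_count) so that the 64-bit sum cannot overflow.
 */
static struct ex_read_result ex_read_extent(uint64_t file_size,
                                            uint64_t offset,
                                            uint32_t byte_count)
{
    struct ex_read_result r = { 0, 0 };

    if (offset >= file_size) {
        r.eof = 1;                  /* nothing to read, including empty files */
        return r;
    }

    uint64_t avail = file_size - offset;    /* at least 1 here */
    if ((uint64_t)byte_count >= avail) {
        r.bytes_read = (uint32_t)avail;     /* fits: avail <= byte_count */
        r.eof = 1;                          /* read reaches or passes the end */
    } else {
        r.bytes_read = byte_count;          /* read stops short of the end */
    }

    assert(r.bytes_read <= byte_count);
    return r;
}
```

For a 100-byte file, a request at offset 90 for 50 bytes yields 10 bytes with eof TRUE, while a zero-byte request at offset 10 yields 0 bytes with eof FALSE, since offset + count is still less than the file size.
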
IMPLEMENTATION

"It is possible for the server to return fewer than count [byte_count] bytes of data. If the server returns less than the count requested and eof set to FALSE, the client should issue another READ [DAFS_PROC_READ_INLINE] to get the remaining data. A server may return less data than requested under several circumstances. The file may have been truncated by another client or perhaps on the server itself, changing the file size from what the requesting client believes to be the case. This would reduce the actual amount of data available to the client. It is possible that the server may back off the transfer size and reduce the read request return. Server resource exhaustion may also occur necessitating a smaller read return.

If the file is locked the server will return an NFS4ERR_LOCKED [DAFSERR_LOCKED] error. Since the lock may be of short duration, the client may choose to retransmit the READ [DAFS_PROC_READ_INLINE] request (with exponential backoff) until the operation succeeds." (RFC 3010, p. 145)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_BAD_STATEID
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_DENIED
   DAFSERR_EXPIRED
   DAFSERR_FHEXPIRED
   DAFSERR_GRACE
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_LOCKED
   DAFSERR_LEASE_MOVED
   DAFSERR_MOVED
   DAFSERR_NXIO
   DAFSERR_OLD_STATEID
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_STATEID

6.5.28. DAFS_PROC_READ_DIRECT

SUMMARY

Reads from file and returns data using RDMA write directly into client memory buffers.

ARGUMENTS

   struct DAFS_Read_Direct_Args {
       dafs_filehandle_type filehandle;
       dafs_state_id_type state_id;
       dafs_uint64 offset;
       dafs_uint32 byte_count;
       dafs_cache_hint_type cache_hint;
       dafs_dob_array_type read_data_buffers;
   };

RESULTS

   struct DAFS_Read_Direct_Res {
       dafs_uint32 eof;
       dafs_uint32 bytes_read;
       dafs_checksum_type direct_checksum;
   };
   /* Data placed in read_data_buffers advertised by client */
   /* DIRECT: dafs_opaque8 readdata[bytes_read]; */

DESCRIPTION

"The READ [DAFS_PROC_READ_DIRECT] operation reads data from the regular file identified by the current filehandle." (RFC 3010, p. 144)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS procedures.

"The client provides an offset of where the READ [DAFS_PROC_READ_DIRECT] is to start and a count [byte_count] of how many bytes are to be read. An offset of 0 (zero) means to read data starting at the beginning of the file. If offset is greater than or equal to the size of the file, the status, NFS4_OK [DAFS_STATUS_OK], is returned with a data length [bytes_read] set to 0 (zero) and eof is set to TRUE. The READ [DAFS_PROC_READ_DIRECT] is subject to access permissions checking.

If the client specifies a count [byte_count] value of 0 (zero), the READ [DAFS_PROC_READ_DIRECT] succeeds and returns 0 (zero) bytes of data again subject to access permissions checking. The server may choose to return fewer bytes than specified by the client. The client needs to check for this condition and handle the condition appropriately.

The stateid [state_id] value for a READ [DAFS_PROC_READ_DIRECT] request represents a value returned from a previous record lock or share reservation request. Used by the server to verify that the associated lock is still valid and to update lease timeouts for the client." (RFC 3010, pp. 144-145)

In DAFS, leases are updated by any DAFS procedure, including DAFS_PROC_NULL.
DAFS servers use information associated with the session of the incoming request to determine which client's leases to renew.

"If the read ended at the end-of-file (formally, in a correctly formed READ [DAFS_PROC_READ_DIRECT] request, if offset + count [byte_count] is equal to the size of the file), or the read request extends beyond the size of the file (if offset + count [byte_count] is greater than the size of the file), eof is returned as TRUE; otherwise it is FALSE. A successful READ [DAFS_PROC_READ_DIRECT] of an empty file will always return eof as TRUE.

On success, the current filehandle retains its value." (RFC 3010, pp. 144-145)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS requests.

The file data is written directly into the specified client memory buffers using RDMA write.

IMPLEMENTATION

"It is possible for the server to return fewer than count [byte_count] bytes of data. If the server returns less than the count requested and eof set to FALSE, the client should issue another READ [DAFS_PROC_READ_DIRECT] to get the remaining data. A server may return less data than requested under several circumstances. The file may have been truncated by another client or perhaps on the server itself, changing the file size from what the requesting client believes to be the case. This would reduce the actual amount of data available to the client. It is possible that the server may back off the transfer size and reduce the read request return. Server resource exhaustion may also occur necessitating a smaller read return.

If the file is locked the server will return an NFS4ERR_LOCKED [DAFSERR_LOCKED] error. Since the lock may be of short duration, the client may choose to retransmit the READ [DAFS_PROC_READ_DIRECT] request (with exponential backoff) until the operation succeeds." (RFC 3010, p.
145)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_BAD_STATEID
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_DENIED
   DAFSERR_EXPIRED
   DAFSERR_FHEXPIRED
   DAFSERR_GRACE
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_LOCKED
   DAFSERR_LEASE_MOVED
   DAFSERR_MOVED
   DAFSERR_NXIO
   DAFSERR_OLD_STATEID
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_STATEID

6.5.29. DAFS_PROC_READDIR_INLINE

SUMMARY

Reads the contents of a directory. The data transfer is done inline, that is, using memory pointed to by the read descriptor buffers.

ARGUMENTS

   struct DAFS_Readdir_Inline_args {
       dafs_filehandle_type filehandle;
       dafs_uint64 cookie;
       dafs_verifier_type cookieverf;
       dafs_uint32 dircount;
       dafs_uint32 maxcount;
       dafs_attr_bitmap_type attr_request_bitmap;
   };

RESULTS

   struct direntry {
       dafs_uint64 cookie;
       dafs_file_attr_type attrs;          /* heap */
       dafs_var_offset_type name_offset;   /* heap */
   };

   struct DAFS_Readdir_Inline_Res {
       dafs_verifier_type cookieverf;
       dafs_uint32 eof;
       dafs_struct_direntry entries<>;     /* heap */
   };

DESCRIPTION

"The READDIR [DAFS_PROC_READDIR_INLINE] operation retrieves a variable number of entries from a file system directory and returns client requested attributes for each entry along with information to allow the client to request additional directory entries in a subsequent READDIR [DAFS_PROC_READDIR_INLINE]." (RFC 3010, p. 147)

A client is free to use the cookies and cookie verifiers obtained by previous DAFS readdir operations, regardless of whether the operations were done INLINE or DIRECT. Keep this in mind when reading the remainder of the description of this DAFS procedure.

"The arguments contain a cookie value that represents where the READDIR [DAFS_PROC_READDIR_INLINE] should start within the directory.
A value of 0 (zero) for the cookie is used to start reading at the beginning of the directory. For subsequent READDIR [DAFS_PROC_READDIR_INLINE] requests, the client specifies a cookie value that is provided by the server on a previous READDIR [DAFS_PROC_READDIR_INLINE] request.

The cookieverf value should be set to 0 (zero) when the cookie value is 0 (zero) (first directory read). On subsequent requests, it should be a cookieverf as returned by the server. The cookieverf must match that returned by the READDIR [DAFS_PROC_READDIR_INLINE] in which the cookie was acquired.

The dircount portion of the argument is a hint of the maximum number of bytes of directory information that should be returned. This value represents the length of the names of the directory entries and the cookie value for these entries. This length represents the XDR [DAFS] encoding of the data (names and cookies) and not the length in the native format of the server. The server may return less data.

The maxcount value of the argument is the maximum number of bytes for the result. This maximum size represents all of the data being returned and includes the XDR [DAFS encoding] overhead. The server may return less data. If the server is unable to return a single directory entry within the maxcount limit, the error NFS4ERR_READDIR_NOSPC [DAFSERR_READDIR_NOSPC] will be returned to the client.

Finally, attrbits [attr_request_bitmap] represents the list of attributes to be returned for each directory entry supplied by the server.

On successful return, the server's response will provide a list of directory entries. Each of these entries contains the name of the directory entry, a cookie value for that entry, and the associated attributes as requested." (RFC 3010, p. 147)

See 4.1.3.3., "Attribute Bitmaps" for a discussion on attribute encoding in DAFS.

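The cookie and cookieverf continuation rules above can be sketched from the client side. This is a minimal illustration, not a DAFS implementation: ex_server_readdir is a hypothetical in-memory stand-in for a server, and the 64-bit verifier, the cookie numbering, the verifier value and the two-entries-per-reply limit are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_NENTRIES      5
#define EX_MAX_PER_REPLY 2              /* stands in for a small maxcount */
#define EX_FIRST_COOKIE  3              /* arbitrary numbering for this mock */
#define EX_VERF          0x5eed5eedULL  /* arbitrary verifier value */

struct ex_direntry { uint64_t cookie; const char *name; };

struct ex_readdir_res {
    uint64_t cookieverf;
    uint32_t eof;
    size_t count;
    struct ex_direntry entries[EX_MAX_PER_REPLY];
};

static const char *ex_names[EX_NENTRIES] = {
    "alpha", "beta", "gamma", "delta", "epsilon"
};

/* Mock server: cookie 0 starts at the beginning; any other cookie must
 * be one it handed out and must come with the matching verifier. */
static int ex_server_readdir(uint64_t cookie, uint64_t cookieverf,
                             struct ex_readdir_res *res)
{
    size_t start = 0;

    if (cookie != 0) {
        if (cookieverf != EX_VERF || cookie < EX_FIRST_COOKIE ||
            cookie >= EX_FIRST_COOKIE + EX_NENTRIES)
            return -1;                          /* bad cookie or verifier */
        start = (size_t)(cookie - EX_FIRST_COOKIE) + 1;
    }

    res->count = 0;
    while (res->count < EX_MAX_PER_REPLY && start + res->count < EX_NENTRIES) {
        size_t i = start + res->count;
        res->entries[res->count].cookie = (uint64_t)i + EX_FIRST_COOKIE;
        res->entries[res->count].name = ex_names[i];
        res->count++;
    }
    res->cookieverf = EX_VERF;
    res->eof = (start + res->count >= EX_NENTRIES);
    return 0;
}

/* Client loop: start with cookie 0 and verifier 0, then resume from the
 * cookie of the last entry seen, passing back the returned verifier. */
static int ex_list_directory(const char **out, size_t cap, size_t *n_out)
{
    uint64_t cookie = 0, verf = 0;
    size_t n = 0;

    assert(out != NULL && n_out != NULL);
    for (;;) {
        struct ex_readdir_res res;

        if (ex_server_readdir(cookie, verf, &res) != 0)
            return -1;
        for (size_t i = 0; i < res.count; i++) {
            if (n == cap)
                return -1;                      /* caller's array is full */
            out[n++] = res.entries[i].name;
            cookie = res.entries[i].cookie;
        }
        verf = res.cookieverf;
        if (res.eof)
            break;
        if (res.count == 0)
            return -1;                          /* no progress, give up */
    }
    *n_out = n;
    return 0;
}
```

The client treats the cookie as an opaque token: it never computes a cookie itself, it only echoes the one attached to the last entry it received.
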
"The cookie value is only meaningful to the server and is used as a 'bookmark' for the directory entry. As mentioned, this cookie is used by the client for subse- quent READDIR [DAFS_PROC_READDIR_INLINE] operations so that it may continue reading a directory. The cookie is similar in concept to a READ offset but should not be interpreted as such by the client. Ideally, the cookie value should not change if the directory is modified since the client may be caching these values. In some cases, the server may encounter an error while obtaining the attributes for a directory entry. Instead of returning an error for the entire READDIR [DAFS_PROC_READDIR_INLINE] operation, the server can instead return the attribute 'fattr4_rdattr_error [DAFS_FATTR_RDATTR_ERROR]'. With this, the server is able to communicate the failure to the client and not fail the entire operation in the instance of what might be a transient failure. Obviously, the client must request the fattr4_rdattr_error [DAFS_FATTR_RDATTR_ERROR] attribute for this method to work properly. If the client does not request the attribute, the server has no choice but to return failure for the entire READDIR [DAFS_PROC_READDIR_INLINE] operation. For some file system environments, the directory entries '.' and '..' have special meaning and in other environ- ments, they may not. If the server supports these spe- cial entries within a directory, they should not be returned to the client as part of the READDIR [DAFS_PROC_READDIR_INLINE] response. To enable some client environments, the cookie values of 0, 1, and 2 are to be considered reserved. Note that the Unix client will use these values when combining the server's response and local representations to enable a fully formed Unix directory presentation to the application. 
For READDIR [DAFS_PROC_READDIR_INLINE] arguments, cookie values of 1 and 2 should not be used and for READDIR [DAFS_PROC_READDIR_INLINE] results cookie values of 0, 1, and 2 should not be returned.

On success, the current filehandle retains its value." (RFC 3010, pp. 147-148)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS procedures.

IMPLEMENTATION

"The server's file system directory representations can differ greatly. A client's programming interfaces may also be bound to the local operating environment in a way that does not translate well into the NFS [DAFS] protocol. Therefore the use of the dircount and maxcount fields are provided to allow the client the ability to provide guidelines to the server. If the client is aggressive about attribute collection during a READDIR [DAFS_PROC_READDIR_INLINE], the server has an idea of how to limit the encoded response. The dircount field provides a hint on the number of entries based solely on the names of the directory entries. Since it is a hint, it may be possible that a dircount value is zero. In this case, the server is free to ignore the dircount value and return directory information based on the specified maxcount value.

The cookieverf may be used by the server to help manage cookie values that may become stale. It should be a rare occurrence that a server is unable to continue properly reading a directory with the provided cookie/cookieverf pair. The server should make every effort to avoid this condition since the application at the client may not be able to properly handle this type of failure.

The use of the cookieverf will also protect the client from using READDIR [DAFS_PROC_READDIR_INLINE] cookie values that may be stale.
For example, if the file system has been migrated, the server may or may not be able to use the same cookie values to service READDIR [DAFS_PROC_READDIR_INLINE] as the previous server used. With the client providing the cookieverf, the server is able to provide the appropriate response to the client. This prevents the case where the server may accept a cookie value but the underlying directory has changed and the response is invalid from the client's context of its previous READDIR [DAFS_PROC_READDIR_INLINE].

Since some servers will not be returning '.' and '..' entries as has been done with previous versions of the NFS protocol, the client that requires these entries be present in READDIR [DAFS_PROC_READDIR_INLINE] responses must fabricate them." (RFC 3010, pp. 148-149)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_BAD_COOKIE
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_NOTDIR
   DAFSERR_NOTSUPP
   DAFSERR_READDIR_NOSPC
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_TOOSMALL

6.5.30. DAFS_PROC_READDIR_DIRECT

SUMMARY

Reads directory and returns data using RDMA write directly into client memory buffers.

ARGUMENTS

   struct DAFS_Readdir_Direct_args {
       dafs_filehandle_type filehandle;
       dafs_uint64 cookie;
       dafs_verifier_type cookieverf;
       dafs_uint32 dircount;
       dafs_uint32 maxcount;
       dafs_attr_bitmap_type attr_request_bitmap;
       dafs_direct_op_buffer readdir_data_buffers<>;
   };

RESULTS

   struct DAFS_Readdir_Direct_Res {
       dafs_verifier_type cookieverf;
       dafs_uint32 eof;
       dafs_checksum_type direct_checksum;
   };
   /* Readdir entries are returned in readdir_data_buffers specified in the arguments */
   /* DIRECT: struct direntries entries<>; */

DESCRIPTION

"The READDIR [DAFS_PROC_READDIR_DIRECT] operation retrieves a variable number of entries from a file system directory and returns client requested attributes for each entry along with information to allow the client to request additional directory entries in a subsequent READDIR [DAFS_PROC_READDIR_DIRECT]." (RFC 3010, p. 147)

A client is free to use the cookies and cookie verifiers obtained by previous DAFS readdir operations, regardless of whether the operations were done INLINE or DIRECT. Keep this in mind when reading the remainder of the description of this DAFS procedure.

"The arguments contain a cookie value that represents where the READDIR [DAFS_PROC_READDIR_DIRECT] should start within the directory. A value of 0 (zero) for the cookie is used to start reading at the beginning of the directory. For subsequent READDIR [DAFS_PROC_READDIR_DIRECT] requests, the client specifies a cookie value that is provided by the server on a previous READDIR [DAFS_PROC_READDIR_DIRECT] request.

The cookieverf value should be set to 0 (zero) when the cookie value is 0 (zero) (first directory read). On subsequent requests, it should be a cookieverf as returned by the server. The cookieverf must match that returned by the READDIR [DAFS_PROC_READDIR_DIRECT] in which the cookie was acquired.
The dircount portion of the argument is a hint of the maximum number of bytes of directory information that should be returned. This value represents the length of the names of the directory entries and the cookie value for these entries. This length represents the XDR [DAFS] encoding of the data (names and cookies) and not the length in the native format of the server. The server may return less data.

The maxcount value of the argument is the maximum number of bytes for the result. This maximum size represents all of the data being returned and includes the XDR [DAFS encoding] overhead. The server may return less data. If the server is unable to return a single directory entry within the maxcount limit, the error NFS4ERR_READDIR_NOSPC [DAFSERR_READDIR_NOSPC] will be returned to the client.

Finally, attrbits [attr_request_bitmap] represents the list of attributes to be returned for each directory entry supplied by the server.

On successful return, the server's response will provide a list of directory entries. Each of these entries contains the name of the directory entry, a cookie value for that entry, and the associated attributes as requested." (RFC 3010, p. 147)

See 4.1.3.3., "Attribute Bitmaps" for a discussion on attribute encoding in DAFS.

"The cookie value is only meaningful to the server and is used as a 'bookmark' for the directory entry. As mentioned, this cookie is used by the client for subsequent READDIR [DAFS_PROC_READDIR_DIRECT] operations so that it may continue reading a directory. The cookie is similar in concept to a READ offset but should not be interpreted as such by the client. Ideally, the cookie value should not change if the directory is modified since the client may be caching these values.

In some cases, the server may encounter an error while obtaining the attributes for a directory entry.
Instead of returning an error for the entire READDIR [DAFS_PROC_READDIR_DIRECT] operation, the server can instead return the attribute 'fattr4_rdattr_error [DAFS_FATTR_RDATTR_ERROR]'. With this, the server is able to communicate the failure to the client and not fail the entire operation in the instance of what might be a transient failure. Obviously, the client must request the fattr4_rdattr_error [DAFS_FATTR_RDATTR_ERROR] attribute for this method to work properly. If the client does not request the attribute, the server has no choice but to return failure for the entire READDIR [DAFS_PROC_READDIR_DIRECT] operation.

For some file system environments, the directory entries '.' and '..' have special meaning and in other environments, they may not. If the server supports these special entries within a directory, they should not be returned to the client as part of the READDIR [DAFS_PROC_READDIR_DIRECT] response. To enable some client environments, the cookie values of 0, 1, and 2 are to be considered reserved. Note that the Unix client will use these values when combining the server's response and local representations to enable a fully formed Unix directory presentation to the application.

For READDIR [DAFS_PROC_READDIR_DIRECT] arguments, cookie values of 1 and 2 should not be used and for READDIR [DAFS_PROC_READDIR_DIRECT] results cookie values of 0, 1, and 2 should not be returned.

On success, the current filehandle retains its value." (RFC 3010, pp. 147-148)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS procedures.

IMPLEMENTATION

"The server's file system directory representations can differ greatly. A client's programming interfaces may also be bound to the local operating environment in a way that does not translate well into the NFS [DAFS] protocol.
Therefore the use of the dircount and maxcount fields are provided to allow the client the ability to provide guidelines to the server. If the client is aggressive about attribute collection during a READDIR [DAFS_PROC_READDIR_DIRECT], the server has an idea of how to limit the encoded response. The dircount field provides a hint on the number of entries based solely on the names of the directory entries. Since it is a hint, it may be possible that a dircount value is zero. In this case, the server is free to ignore the dircount value and return directory information based on the specified maxcount value.

The cookieverf may be used by the server to help manage cookie values that may become stale. It should be a rare occurrence that a server is unable to continue properly reading a directory with the provided cookie/cookieverf pair. The server should make every effort to avoid this condition since the application at the client may not be able to properly handle this type of failure.

The use of the cookieverf will also protect the client from using READDIR [DAFS_PROC_READDIR_DIRECT] cookie values that may be stale. For example, if the file system has been migrated, the server may or may not be able to use the same cookie values to service READDIR [DAFS_PROC_READDIR_DIRECT] as the previous server used. With the client providing the cookieverf, the server is able to provide the appropriate response to the client. This prevents the case where the server may accept a cookie value but the underlying directory has changed and the response is invalid from the client's context of its previous READDIR [DAFS_PROC_READDIR_DIRECT].

Since some servers will not be returning '.' and '..' entries as has been done with previous versions of the NFS protocol, the client that requires these entries be present in READDIR [DAFS_PROC_READDIR_DIRECT] responses must fabricate them." (RFC 3010, pp.
148-149)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_BAD_COOKIE
   DAFSERR_CHAIN_BROKEN
   DAFSERR_CHAIN_FORM
   DAFSERR_DELAY
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_NOTDIR
   DAFSERR_NOTSUPP
   DAFSERR_READDIR_NOSPC
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_TOOSMALL

6.5.31. DAFS_PROC_READLINK_INLINE

SUMMARY

   Reads the contents of a symbolic link. Contents of the link are returned inline.

ARGUMENTS

   struct DAFS_Readlink_Inline_Args {
       dafs_filehandle_type filehandle;
   };

RESULTS

   struct DAFS_Readlink_Inline_Res {
       utf8string_type link; /* heap */
   };

DESCRIPTION

"READLINK [DAFS_PROC_READLINK_INLINE] reads the data associated with a symbolic link. The data is a UTF-8 string that is opaque to the server. That is, whether created by an NFS [DAFS] client or created locally on the server, the data in a symbolic link is not interpreted when created, but is simply stored.

On success, the current filehandle retains its value." (RFC 3010, p. 150)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS procedures.

IMPLEMENTATION

"A symbolic link is nominally a pointer to another file. The data is not necessarily interpreted by the server, just stored in the file. It is possible for a client implementation to store a path name that is not meaningful to the server operating system in a symbolic link. A READLINK [DAFS_PROC_READLINK_INLINE] operation returns the data to the client for interpretation. If different implementations want to share access to symbolic links, then they must agree on the interpretation of the data in the symbolic link.

The READLINK [DAFS_PROC_READLINK_INLINE] operation is only allowed on objects of type NF4LNK [DAFS_TYPE_LNK].
The server should return the error, NFS4ERR_INVAL [DAFSERR_INVAL], if the object is not of type, NF4LNK [DAFS_TYPE_LNK]." (RFC 3010, p. 150)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_NOTSUPP
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE

6.5.32. DAFS_PROC_READLINK_DIRECT

SUMMARY

   Reads the contents of a symbolic link. Contents of the link are returned via RDMA operations to the buffer specified by the client.

ARGUMENTS

   struct DAFS_Readlink_Direct_Args {
       dafs_filehandle_type filehandle;
       dafs_direct_op_buffer buffer;
   };

RESULTS

   struct DAFS_Readlink_Direct_Res {
       dafs_checksum_type direct_checksum;
   };

   /* Contents of link are returned in buffer described in the arguments packet */
   /* DIRECT: dafs_utf8string_type linkcontents; */

DESCRIPTION

"READLINK [DAFS_PROC_READLINK_DIRECT] reads the data associated with a symbolic link. The data is a UTF-8 string that is opaque to the server. That is, whether created by an NFS [DAFS] client or created locally on the server, the data in a symbolic link is not interpreted when created, but is simply stored.

On success, the current filehandle retains its value." (RFC 3010, p. 150)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS procedures.

IMPLEMENTATION

"A symbolic link is nominally a pointer to another file. The data is not necessarily interpreted by the server, just stored in the file. It is possible for a client implementation to store a path name that is not meaningful to the server operating system in a symbolic link. A READLINK [DAFS_PROC_READLINK_DIRECT] operation returns the data to the client for interpretation.
If different implementations want to share access to symbolic links, then they must agree on the interpretation of the data in the symbolic link.

The READLINK [DAFS_PROC_READLINK_DIRECT] operation is only allowed on objects of type NF4LNK [DAFS_TYPE_LNK]. The server should return the error, NFS4ERR_INVAL [DAFSERR_INVAL], if the object is not of type, NF4LNK [DAFS_TYPE_LNK]." (RFC 3010, p. 150)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_NOTSUPP
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE

6.5.33. DAFS_PROC_REMOVE

SUMMARY

   Removes a file object.

ARGUMENTS

   enum removemode {
       UNCHECKED_REMOVE = 0,
       CHECK_OPEN = 1
   };

   struct DAFS_Remove_Args {
       dafs_filehandle_type filehandle;
       dafs_component_type target; /* heap */
       enum removemode remove_mode;
   };

RESULTS

   struct DAFS_Remove_Res {
       dafs_change_info_type change_info;
   };

DESCRIPTION

"The REMOVE [DAFS_PROC_REMOVE] operation removes (deletes) a directory entry named by filename from the directory corresponding to the current filehandle." (RFC 3010, p. 151)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS procedures.

"If the entry in the directory was the last reference to the corresponding file system object, the object may be destroyed." (RFC 3010, p. 152)

Note the DAFS exceptions for open files described later in this DESCRIPTION section.

"For the directory where the filename was removed, the server returns change_info4 [dafs_change_info_type] information in cinfo [change_info]. With the atomic field of the change_info4 [dafs_change_info_type] struct, the server will indicate if the before and after change attributes were obtained atomically with respect to the removal.
If the target has a length of 0 (zero), or if target does not obey the UTF-8 definition, the error NFS4ERR_INVAL [DAFSERR_INVAL] will be returned.

On success, the current filehandle retains its value." (RFC 3010, pp. 151-152)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS procedures.

DAFS_PROC_REMOVE has the ability to guard against the removal of files that are currently open.

If remove_mode is set to UNCHECKED_REMOVE, then the server will attempt to delete the file, regardless of any outstanding open references. The request will fail with DAFSERR_DENYDISP_CONFLICT if the file is currently open with a DELETE_DENY disposition.

If remove_mode is set to CHECK_OPEN, then the server will attempt to delete the file only if the file is not currently open. If it is, the request fails with DAFSERR_DENYDISP_CONFLICT. The request will fail with DAFSERR_DENYDISP_NOTSUPP if the server is unable to guarantee this behavior for DAFS clients. [Note: the remove request with CHECK_OPEN MAY or MAY NOT succeed if the server is unable to detect open references from other protocols.]

The DAFS_PROC_REMOVE operation provides "Delete On Last Close" semantics. Once a file has been opened, the DAFS Server MUST continue to provide access to the file to the Clients that have the file open, even after the file has been removed, up until the number of Clients that have the file open has dropped to zero. However, once the file has been removed, subsequent lookup and open operations will fail.

IMPLEMENTATION

"NFS versions 2 and 3 required a different operator RMDIR for directory removal. NFS version 4 [DAFS] REMOVE [DAFS_PROC_REMOVE] can be used to delete any directory entry independent of its file type.

The concept of last reference is server specific.
However, if the numlinks field in the previous attributes of the object had the value 1, the client should not rely on referring to the object via a file handle. Likewise, the client should not rely on the resources (disk space, directory entry, and so on) formerly associated with the object becoming immediately available. Thus, if a client needs to be able to continue to access a file after using REMOVE to remove it, the client should take steps to make sure that the file will still be accessible. The usual mechanism used is to RENAME the file from its old name to a new hidden name." (RFC 3010, p. 152)

DAFS supports delete-on-last-close. A client does not have to rename a file in order to protect its access to an open file that is being removed. The rename MAY be necessary if the client wants to prevent deletion of a file that is NOT open but for which the client holds a filehandle obtained via a lookup operation.

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_DENYDISP_CONFLICT
   DAFSERR_DENYDISP_NOTSUPP
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_NAMETOOLONG
   DAFSERR_NOENT
   DAFSERR_NOTDIR
   DAFSERR_NOTEMPTY
   DAFSERR_NOTSUPP
   DAFSERR_RESOURCE
   DAFSERR_ROFS
   DAFSERR_SERVERFAULT
   DAFSERR_STALE

6.5.34. DAFS_PROC_RENAME

SUMMARY

   Renames a directory entry.
ARGUMENTS

   struct DAFS_Rename_Args {
       dafs_filehandle_type sourcedir;
       dafs_filehandle_type targetdir;
       dafs_component_type oldname; /* heap */
       dafs_component_type newname; /* heap */
   };

RESULTS

   struct DAFS_Rename_Res {
       dafs_change_info_type source;
       dafs_change_info_type target;
   };

DESCRIPTION

"The RENAME [DAFS_PROC_RENAME] operation renames the object identified by oldname in the source directory corresponding to the saved filehandle, as set by the SAVEFH operation, to newname in the target directory corresponding to the current filehandle." (RFC 3010, p. 153)

When the DAFS_PROC_RENAME operation occurs within a DAFS operation chain (see 4.3.2., "Chaining Flags", for a description of chaining), the DAFS chain current_filehandle specifies the target directory, and the source directory, oldname, and newname are taken from the message arguments.

"The operation is required to be atomic to the client. Source and target directories must reside on the same file system on the server. On success, the current filehandle will continue to be the target directory." (RFC 3010, p. 153)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in management of filehandles in DAFS procedures.

"If the target directory already contains an entry with the name, newname, the source object must be compatible with the target: either both are non-directories or both are directories and the target must be empty. If compatible, the existing target is removed before the rename occurs. If they are not compatible or if the target is a directory but not empty, the server will return the error, NFS4ERR_EXIST [DAFSERR_EXIST].

If oldname and newname both refer to the same file (they might be hard links of each other), then RENAME [DAFS_PROC_RENAME] should perform no action and return success.
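As an illustration only (not part of the quoted RFC 3010 text), the target-compatibility rules above can be sketched as a small decision function. The types and names below (demo_entry, demo_rename_check, the file_id field) are hypothetical and exist only for this sketch; a real server would derive the same facts from its own file system metadata.

```c
#include <stdbool.h>

/* Hypothetical outcome codes for this sketch only. */
enum demo_rename_action {
    DEMO_RENAME_PROCEED,        /* no entry named newname: plain rename */
    DEMO_RENAME_REPLACE_TARGET, /* compatible target is removed first */
    DEMO_RENAME_NOOP,           /* same file: succeed without action */
    DEMO_RENAME_ERR_EXIST       /* stands in for DAFSERR_EXIST */
};

/* Hypothetical summary of a directory entry. */
struct demo_entry {
    bool exists;           /* false if no such entry */
    bool is_dir;           /* entry is a directory */
    bool is_empty;         /* meaningful only when is_dir is true */
    unsigned long file_id; /* assumed unique id of the underlying object */
};

/* Apply the compatibility rules to a source entry and an existing
 * (or absent) target entry. */
static enum demo_rename_action
demo_rename_check(const struct demo_entry *src, const struct demo_entry *dst)
{
    if (!dst->exists)
        return DEMO_RENAME_PROCEED;
    if (src->file_id == dst->file_id)
        return DEMO_RENAME_NOOP;          /* e.g. hard links of each other */
    if (src->is_dir != dst->is_dir)
        return DEMO_RENAME_ERR_EXIST;     /* directory vs. non-directory */
    if (dst->is_dir && !dst->is_empty)
        return DEMO_RENAME_ERR_EXIST;     /* target directory not empty */
    return DEMO_RENAME_REPLACE_TARGET;
}
```

The same-file test is placed before the type checks so that two names for one object always yield the no-action result.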
For both directories involved in the RENAME [DAFS_PROC_RENAME] , the server returns change_info4 [dafs_change_info_type] information. With the atomic field of the change_info4 [dafs_change_info_type] struct, the server will indicate if the before and after change attributes were obtained atomically with respect to the rename. If the oldname or newname has a length of 0 (zero), or if oldname or newname does not obey the UTF-8 defini- tion, the error NFS4ERR_INVAL [DAFSERR_INVAL] will be returned." (RFC 3010, pp. 153-154) The DAFS_PROC_RENAME operation provides "Delete On Last Close" seman- tics. Once a file has been opened, the DAFS Server MUST continue to provide access to the file to the Clients that have the file open, even after the file has been renamed, up until the number of Clients that have the file open has dropped to zero. However, once the file has been renamed, subsequent lookup and open operations will fail. IMPLEMENTATION "The RENAME [DAFS_PROC_RENAME] operation must be atomic to the client. The statement 'source and target direc- tories must reside on the same file system on the server' means that the fsid fields in the attributes for the directories are the same. If they reside on dif- ferent file systems, the error, NFS4ERR_XDEV [DAFSERR_XDEV], is returned. A filehandle may or may not become stale or expire on a Wittle [Page 306] INTERNET-DRAFT Direct Access File System September 2001 rename. However, server implementors are strongly encouraged to attempt to keep file handles from becoming stale or expiring in this fashion. On some servers, the filenames, '.' and '..', are ille- gal as either oldname or newname. In addition, neither oldname nor newname can be an alias for the source directory. These servers will return the error, NFS4ERR_INVAL [DAFSERR_INVAL], in these cases." (RFC 3010, p. 
154)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_DQUOT
   DAFSERR_EXIST
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_ISDIR
   DAFSERR_MOVED
   DAFSERR_NAMETOOLONG
   DAFSERR_NOENT
   DAFSERR_NOSPC
   DAFSERR_NOTDIR
   DAFSERR_NOTEMPTY
   DAFSERR_NOTSUPP
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_XDEV

6.5.35. DAFS_PROC_SETATTR_INLINE

SUMMARY

   Sets the attributes of a file object.

ARGUMENTS

   struct DAFS_Setattr_Inline_Args {
       dafs_filehandle_type filehandle;
       dafs_state_id_type state_id;
       dafs_file_attr_type obj_attributes;
   };

RESULTS

   struct DAFS_Setattr_Inline_Res {
       dafs_attr_bitmap_type attr_request_bitmap;
   };

DESCRIPTION

"The SETATTR [DAFS_PROC_SETATTR_INLINE] operation changes one or more of the attributes of a file system object. The new attributes are specified with a bitmap and the attributes that follow the bitmap in bit order.

The stateid [state_id] is necessary for SETATTRs [DAFS_PROC_SETATTR_INLINEs] that change the size of a file (modify the attribute object_size). This stateid represents a record lock, share reservation, or delegation which must be valid for the SETATTR [DAFS_PROC_SETATTR_INLINE] to modify the file data. A valid stateid would always be specified. When the file size is not changed, the special stateid consisting of all bits 0 (zero) should be used.

On either success or failure of the operation, the server will return the attrsset [attr_request_bitmap] bitmask to represent what (if any) attributes were successfully set." (RFC 3010, p. 160)

See 4.1.3.3., "Attribute Bitmaps" for a description of file attribute encoding in DAFS.

"On success, the current filehandle retains its value." (RFC 3010, p. 160)

See 4.1.3.1., "Filehandles in Compound vs.
Chaining" for differences in filehandle management in DAFS procedures. IMPLEMENTATION "The file size attribute is used to request changes to the size of a file. A value of 0 (zero) causes the file to be truncated, a value less than the current size of the file causes data from new size to the end of the file to be discarded, and a size greater than the current size of the file causes logically zeroed data bytes to be added to the end of the file. Servers are free to implement this using holes or actual zero data bytes. Clients should not make any assumptions regarding a server's implementation of this feature, beyond that the bytes returned will be zeroed. Servers must support extending the file size via SETATTR [DAFS_PROC_SETATTR_INLINE]. SETATTR [DAFS_PROC_SETATTR_INLINE] is not guaranteed atomic. A failed SETATTR [DAFS_PROC_SETATTR_INLINE] may partially change a file's attributes. Changing the size of a file with SETATTR [DAFS_PROC_SETATTR_INLINE] indirectly changes the time_modify. A client must account for this as size changes can result in data deletion. If server and client times differ, programs that compare client time to file times can break. A time maintenance protocol should be used to limit client/server time skew. If the server cannot successfully set all the attributes it must return an NFS4ERR_INVAL [DAFSERR_INVAL] error. If the server can only support 32 bit offsets and sizes, a SETATTR [DAFS_PROC_SETATTR_INLINE] request to set the size of a file to larger than can be represented in 32 bits will be rejected with this same error." (RFC 3010, pp. 
160-161)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_BAD_STATEID
   DAFSERR_CHAIN_BROKEN
   DAFSERR_CHAIN_FORM
   DAFSERR_DELAY
   DAFSERR_DENIED
   DAFSERR_DQUOT
   DAFSERR_EXPIRED
   DAFSERR_FBIG
   DAFSERR_FHEXPIRED
   DAFSERR_GRACE
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_NOSPC
   DAFSERR_NOTSUPP
   DAFSERR_OLD_STATEID
   DAFSERR_PERM
   DAFSERR_RESOURCE
   DAFSERR_ROFS
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_STATEID

6.5.36. DAFS_PROC_SETATTR_DIRECT

SUMMARY

   Sets the attributes of a file object.

ARGUMENTS

   struct DAFS_Setattr_Direct_Args {
       dafs_filehandle_type filehandle;
       dafs_state_id_type state_id;
       dafs_checksum_type direct_checksum;
       dafs_direct_op_buffer obj_attributes;
   };

   /* DIRECT: file_attr_type obj_attributes; */

RESULTS

   struct DAFS_Setattr_Direct_Res {
       dafs_attr_bitmap_type attr_request_bitmap;
   };

DESCRIPTION

"The SETATTR [DAFS_PROC_SETATTR_DIRECT] operation changes one or more of the attributes of a file system object. The new attributes are specified with a bitmap and the attributes that follow the bitmap in bit order.

The stateid [state_id] is necessary for SETATTRs [DAFS_PROC_SETATTR_DIRECTs] that change the size of a file (modify the attribute object_size). This stateid represents a record lock, share reservation, or delegation which must be valid for the SETATTR [DAFS_PROC_SETATTR_DIRECT] to modify the file data. A valid stateid would always be specified. When the file size is not changed, the special stateid consisting of all bits 0 (zero) should be used.

On either success or failure of the operation, the server will return the attrsset [attr_request_bitmap] bitmask to represent what (if any) attributes were successfully set." (RFC 3010, p. 160)

See 4.1.3.3., "Attribute Bitmaps" for a description of file attribute encoding in DAFS.
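Because SETATTR may apply only some of the requested attributes, a client will typically compare the returned attr_request_bitmap against the bitmap it sent. The sketch below assumes, purely for illustration, that a bitmap is a fixed array of 32-bit words with one bit per attribute; the actual dafs_attr_bitmap_type encoding is the one given in 4.1.3.3. and may differ. The names demo_bitmap and demo_unset_attrs are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: two 32-bit words, one bit per attribute.
 * This is an assumption for the sketch, not the DAFS wire format. */
#define DEMO_BITMAP_WORDS 2

typedef struct {
    uint32_t words[DEMO_BITMAP_WORDS];
} demo_bitmap;

/* Store in *missing every bit that was requested but is absent from the
 * bitmap the server returned, and return how many such bits there are. */
static unsigned demo_unset_attrs(const demo_bitmap *requested,
                                 const demo_bitmap *returned,
                                 demo_bitmap *missing)
{
    unsigned count = 0;
    for (size_t i = 0; i < DEMO_BITMAP_WORDS; i++) {
        uint32_t m = requested->words[i] & ~returned->words[i];
        missing->words[i] = m;
        while (m != 0) {      /* clear the lowest set bit each pass */
            m &= m - 1;
            count++;
        }
    }
    return count;
}
```

A result of zero means every requested attribute was reported as set; a nonzero result tells the client which attributes to retry or report.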
"On success, the current filehandle retains its value." (RFC 3010, p. 160) See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures. IMPLEMENTATION "The file size attribute is used to request changes to the size of a file. A value of 0 (zero) causes the file to be truncated, a value less than the current size of the file causes data from new size to the end of the file to be discarded, and a size greater than the current size of the file causes logically zeroed data bytes to be added to the end of the file. Servers are free to implement this using holes or actual zero data bytes. Clients should not make any assumptions regarding a server's implementation of this feature, beyond that the bytes returned will be zeroed. Servers must support extending the file size via SETATTR [DAFS_PROC_SETATTR_DIRECT]. SETATTR [DAFS_PROC_SETATTR_DIRECT] is not guaranteed atomic. A failed SETATTR [DAFS_PROC_SETATTR_DIRECT] may partially change a file's attributes. Changing the size of a file with SETATTR [DAFS_PROC_SETATTR_DIRECT] indirectly changes the time_modify. A client must account for this as size changes can result in data deletion. If server and client times differ, programs that compare client time to file times can break. A time maintenance protocol should be used to limit client/server time skew. If the server cannot successfully set all the attributes it must return an NFS4ERR_INVAL [DAFSERR_INVAL] error. If the server can only support 32 bit offsets and sizes, a SETATTR [DAFS_PROC_SETATTR_DIRECT] request to set the size of a file to larger than can be represented in 32 bits will be rejected with this same error." (RFC 3010, Wittle [Page 313] INTERNET-DRAFT Direct Access File System September 2001 pp. 
160-161)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_BAD_STATEID
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_DENIED
   DAFSERR_DQUOT
   DAFSERR_EXPIRED
   DAFSERR_FBIG
   DAFSERR_FHEXPIRED
   DAFSERR_GRACE
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_MOVED
   DAFSERR_NOSPC
   DAFSERR_NOTSUPP
   DAFSERR_OLD_STATEID
   DAFSERR_PERM
   DAFSERR_RESOURCE
   DAFSERR_ROFS
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_STATEID

6.5.37. DAFS_PROC_VERIFY

SUMMARY

   Verifies equality of attributes.

ARGUMENTS

   struct DAFS_Verify_Args {
       dafs_filehandle_type filehandle;
       dafs_file_attr_type obj_attributes;
   };

RESULTS

   None.

DESCRIPTION

"The VERIFY [DAFS_PROC_VERIFY] operation is used to verify that attributes have a value assumed by the client before proceeding with following operations in the compound request." (RFC 3010, p. 165)

DAFS_PROC_VERIFY can be used in a similar fashion inside a DAFS chain.

"If any of the attributes do not match then the error NFS4ERR_NOT_SAME [DAFSERR_NOT_SAME] must be returned. The current filehandle retains its value after successful completion of the operation." (RFC 3010, p. 165)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

IMPLEMENTATION

"In the case that a recommended attribute is specified in the VERIFY [DAFS_PROC_VERIFY] operation and the server does not support that attribute for the file system object, the error NFS4ERR_NOTSUPP [DAFSERR_NOTSUPP] is returned to the client." (RFC 3010, p.
165)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_FHEXPIRED
   DAFSERR_INVAL
   DAFSERR_MOVED
   DAFSERR_NOTSUPP
   DAFSERR_NOT_SAME
   DAFSERR_RESOURCE
   DAFSERR_SERVERFAULT
   DAFSERR_STALE

6.5.38. DAFS_PROC_WRITE_INLINE

SUMMARY

   Writes data to a file. The data to be written is part of the request packet and is passed inline.

ARGUMENTS

   enum stable_how {
       UNSTABLE = 0,
       DATA_SYNC = 1,
       FILE_SYNC = 2
   };

   struct DAFS_Write_Inline_Args {
       dafs_filehandle_type filehandle;
       dafs_state_id_type state_id;
       dafs_uint64 offset;
       dafs_uint32 byte_count;
       stable_how stable_how;
       dafs_uint32 write_padded;
       dafs_cache_hint_type cache_hint;
       dafs_opaque8 data[byte_count]; /* heap or padded */
   };

RESULTS

   struct DAFS_Write_Inline_Res {
       dafs_uint32 count;
       stable_how committed;
       dafs_verifier_type verifier;
   };

DESCRIPTION

"The WRITE [DAFS_PROC_WRITE_INLINE] operation is used to write data to a regular file. The target file is specified by the current filehandle." (RFC 3010, p. 167)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

"The offset specifies the offset where the data should be written. An offset of 0 (zero) specifies that the write should start at the beginning of the file. The count [byte_count] represents the number of bytes of data that are to be written. If the count [byte_count] is 0 (zero), the WRITE [DAFS_PROC_WRITE_INLINE] will succeed and return a count of 0 (zero) subject to permissions checking. The server may choose to write fewer bytes than requested by the client.

Part of the write request is a specification of how the write is to be performed. The client specifies with the stable parameter the method of how the data is to be processed by the server.
If stable is FILE_SYNC4 [FILE_SYNC], the server must commit the data written plus all file system metadata to stable storage before returning results. This corresponds to the NFS version 2 protocol semantics. Any other behavior constitutes a protocol violation. If stable is DATA_SYNC4 [DATA_SYNC], then the server must commit all of the data to stable storage and enough of the metadata to retrieve the data before returning. The server implementor is free to implement DATA_SYNC4 [DATA_SYNC] in the same fashion as FILE_SYNC4 [FILE_SYNC], but with a possible performance drop. If stable is UNSTABLE4 [UNSTABLE], the server is free to commit any part of the data and the metadata to stable storage, including all or none, before returning a reply to the client. There is no guarantee whether or when any uncommitted data will sub- sequently be committed to stable storage. The only guarantees made by the server are that it will not des- troy any data without changing the value of verf and that it will not commit the data and metadata at a level less than that requested by the client. The stateid returned from a previous record lock or share reservation request is provided as part of the argument. The stateid is used by the server to verify that the associated lock is still valid and to update lease timeouts for the client." (RFC 3010, p. 167) DAFS servers renew leases whenever any DAFS request (including NULL) is received from a client. Leases are renewed based on the client Wittle [Page 319] INTERNET-DRAFT Direct Access File System September 2001 associated with the session on which the request is received. "Upon successful completion, the following results are returned. The count result is the number of bytes of data written to the file. The server may write fewer bytes than requested. If so, the actual number of bytes written starting at location, offset, is returned. The server also returns an indication of the level of commitment of the data and metadata via committed. 
If the server committed all data and metadata to stable storage, committed should be set to FILE_SYNC4 [FILE_SYNC]. If the level of commitment was at least as strong as DATA_SYNC4 [DATA_SYNC], then committed should be set to DATA_SYNC4 [DATA_SYNC]. Otherwise, committed must be returned as UNSTABLE4 [UNSTABLE]. If stable was FILE_SYNC4 [FILE_SYNC], then committed must also be FILE_SYNC4 [FILE_SYNC]: anything else constitutes a protocol violation. If stable was DATA_SYNC4 [DATA_SYNC], then committed may be FILE_SYNC4 or DATA_SYNC4 [FILE_SYNC or DATA_SYNC]: anything else constitutes a protocol violation. If stable was UNSTABLE4 [UNSTABLE], then committed may be either FILE_SYNC4, DATA_SYNC4, or UNSTABLE4 [FILE_SYNC, DATA_SYNC, or UNSTABLE].

The final portion of the result is the write verifier, verf [verifier]. The write verifier is a cookie that the client can use to determine whether the server has changed state between a call to WRITE [DAFS_PROC_WRITE_INLINE] and a subsequent call to either WRITE [DAFS_PROC_WRITE_INLINE] or COMMIT [DAFS_PROC_COMMIT]." (RFC 3010, pp. 167-168)

DAFS servers use the same write verifiers during a single DAFS server instance, whether the write operation is done INLINE or DIRECT. The client can then apply the same verifier tests regardless of the data transfer method chosen (inline or direct).

"This cookie must be consistent during a single instance of the NFS version 4 [DAFS] protocol service and must be unique between instances of the NFS version 4 [DAFS] protocol server, where uncommitted data may be lost.
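As an illustration only (not part of the quoted RFC 3010 text), the two client-side checks implied above can be sketched as follows: whether a returned committed level is permitted for the stable level that was requested, and whether the write verifier changed between two replies. The enum values are those given in ARGUMENTS. The 8-byte verifier layout is an assumption borrowed from NFS version 4; dafs_verifier_type itself is defined elsewhere in this document. The demo_ names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* Assumed size for the sketch; not taken from the DAFS definition. */
#define DEMO_VERIFIER_SIZE 8

typedef struct {
    uint8_t bytes[DEMO_VERIFIER_SIZE];
} demo_verifier;

/* The server may commit at a stronger level than requested, never a
 * weaker one. With the numbering above this is a simple ordering test. */
static bool demo_committed_is_valid(enum stable_how requested,
                                    enum stable_how committed)
{
    return committed >= requested && committed <= FILE_SYNC;
}

/* A verifier that differs from the one saved with an earlier WRITE means
 * the server instance changed, so uncommitted data may have been lost. */
static bool demo_verifier_changed(const demo_verifier *saved,
                                  const demo_verifier *latest)
{
    return memcmp(saved->bytes, latest->bytes, DEMO_VERIFIER_SIZE) != 0;
}
```

Since DAFS uses one verifier for both INLINE and DIRECT writes, a client can keep a single saved verifier per server instance for both transfer methods.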
If a client writes data to the server with the stable argument set to UNSTABLE4 [UNSTABLE] and the reply yields a committed response of DATA_SYNC4 or UNSTABLE4 [DATA_SYNC or UNSTABLE], the client will follow up some time in the future with a COMMIT [DAFS_PROC_COMMIT] Wittle [Page 320] INTERNET-DRAFT Direct Access File System September 2001 operation to synchronize outstanding asynchronous data and metadata with the server's stable storage, barring client error. It is possible that due to client crash or other error that a subsequent COMMIT [DAFS_PROC_COMMIT] will not be received by the server. On success, the current filehandle retains its value." (RFC 3010, pp. 167-168) See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures. IMPLEMENTATION "It is possible for the server to write fewer than count [byte_count] bytes of data. In this case, the server should not return an error unless no data was written at all. If the server writes less than count [byte_count] bytes, the client should issue another WRITE [DAFS_PROC_WRITE_INLINE] to write the remaining data. It is assumed that the act of writing data to a file will cause the time_modified of the file to be updated. However, the time_modified of the file should not be changed unless the contents of the file are changed. Thus, a WRITE [DAFS_PROC_WRITE_INLINE] request with count [byte_count] set to 0 should not cause the time_modified of the file to be updated. The definition of stable storage has been historically a point of contention. The following expected properties of stable storage may help in resolving design issues in the implementation. Stable storage is persistent storage that survives: 1. Repeated power failures. 2. Hardware failures (of any board, power supply, etc.). 3. Repeated software crashes, including reboot cycle. This definition does not address failure of the stable storage module itself. 
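As an illustration only (not part of the quoted RFC 3010 text), the partial-write rule earlier in this section can be sketched as a client-side loop that reissues WRITE for the remaining bytes. The callback type demo_write_fn stands in for whatever routine actually sends one DAFS_PROC_WRITE_INLINE request; it, demo_write_all, and the in-memory demo_fake_write server are hypothetical names introduced for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hook: issue one WRITE, store the returned count in
 * *written, and return 0 on success or nonzero on error. */
typedef int (*demo_write_fn)(void *ctx, uint64_t offset,
                             const uint8_t *data, uint32_t byte_count,
                             uint32_t *written);

/* Keep writing until all byte_count bytes are accepted or an error occurs. */
static int demo_write_all(demo_write_fn do_write, void *ctx, uint64_t offset,
                          const uint8_t *data, uint32_t byte_count)
{
    while (byte_count > 0) {
        uint32_t written = 0;
        int err = do_write(ctx, offset, data, byte_count, &written);
        if (err != 0)
            return err;
        if (written == 0 || written > byte_count)
            return -1;   /* no progress or bogus count: stop, do not spin */
        offset += written;
        data += written;
        byte_count -= written;
    }
    return 0;
}

/* In-memory stand-in for a server that accepts at most 3 bytes per WRITE. */
struct demo_fake_file {
    uint8_t data[64];
    unsigned calls;
};

static int demo_fake_write(void *ctx, uint64_t offset, const uint8_t *data,
                           uint32_t byte_count, uint32_t *written)
{
    struct demo_fake_file *f = ctx;
    uint32_t n = byte_count < 3 ? byte_count : 3;
    if (offset + n > sizeof f->data)
        return -1;
    memcpy(f->data + offset, data, n);
    f->calls++;
    *written = n;
    return 0;
}
```

Calling demo_write_all(demo_fake_write, &file, 0, buf, 10) against the stand-in server takes four WRITE calls (3, 3, 3, and 1 bytes).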
The verifier is defined to allow a client to detect dif- ferent instances of an NFS version 4 [DAFS] protocol server over which cached, uncommitted data may be lost. Wittle [Page 321] INTERNET-DRAFT Direct Access File System September 2001 In the most likely case, the verifier allows the client to detect server reboots. This information is required so that the client can safely determine whether the server could have lost cached data. If the server fails unexpectedly and the client has uncommitted data from previous WRITE [DAFS_PROC_WRITE_INLINE] requests (done with the stable argument set to UNSTABLE4 [UNSTABLE] and in which the result committed was returned as UNSTABLE4 [UNSTABLE] as well) it may not have flushed cached data to stable storage. The burden of recovery is on the client and the client will need to retransmit the data to the server. A suggested verifier would be to use the time that the server was booted or the time the server was last started (if restarting the server without a reboot results in lost buffers). The committed field in the results allows the client to do more effective caching. If the server is committing all WRITE requests to stable storage, then it should return with committed set to FILE_SYNC4 [FILE_SYNC], regardless of the value of the stable field in the argu- ments. A server that uses an NVRAM accelerator may choose to implement this policy. The client can use this to increase the effectiveness of the cache by dis- carding cached data that has already been committed on the server. Some implementations may return NFS4ERR_NOSPC [DAFSERR_NOSPC] instead of NFS4ERR_DQUOT [DAFSERR_DQUOT] when a user's quota is exceeded." (RFC 3010, pp. 
168-169)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_BAD_STATEID
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_DENIED
   DAFSERR_DQUOT
   DAFSERR_EXPIRED
   DAFSERR_FBIG
   DAFSERR_FHEXPIRED
   DAFSERR_GRACE
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_LEASE_MOVED
   DAFSERR_LOCKED
   DAFSERR_MOVED
   DAFSERR_NOSPC
   DAFSERR_OLD_STATEID
   DAFSERR_RESOURCE
   DAFSERR_ROFS
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_STATEID
   DAFSERR_WRITE_TOOBIG

6.5.39. DAFS_PROC_WRITE_DIRECT

SUMMARY

   Initiates a write to a file using data retrieved via RDMA read directly from client memory buffers.

ARGUMENTS

   enum stable_how {
       UNSTABLE = 0,
       DATA_SYNC = 1,
       FILE_SYNC = 2
   };

   struct DAFS_Write_Direct_Args {
       dafs_filehandle_type filehandle;
       dafs_state_id_type state_id;
       dafs_uint64 offset;
       dafs_uint32 byte_count;
       stable_how stable_how;
       dafs_cache_hint_type cache_hint;
       dafs_checksum_type direct_checksum;
       dafs_direct_op_buffer write_data_buffers<>;
   };

   /* DIRECT: opaque writedata[buffer_byte_count]; */

RESULTS

   struct DAFS_Write_Direct_Res {
       dafs_uint32 count;
       stable_how committed;
       dafs_verifier_type verifier;
   };

DESCRIPTION

"The WRITE [DAFS_PROC_WRITE_DIRECT] operation is used to write data to a regular file. The target file is specified by the current filehandle." (RFC 3010, p. 167)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

"The offset specifies the offset where the data should be written. An offset of 0 (zero) specifies that the write should start at the beginning of the file. The count [byte_count] represents the number of bytes of data that are to be written. If the count [byte_count] is 0 (zero), the WRITE [DAFS_PROC_WRITE_DIRECT] will succeed and return a count of 0 (zero) subject to permissions checking.
The server may choose to write fewer bytes than requested by the client.

Part of the write request is a specification of how the write is to be performed. The client specifies with the stable parameter the method of how the data is to be processed by the server. If stable is FILE_SYNC4 [FILE_SYNC], the server must commit the data written plus all file system metadata to stable storage before returning results. This corresponds to the NFS version 2 protocol semantics. Any other behavior constitutes a protocol violation. If stable is DATA_SYNC4 [DATA_SYNC], then the server must commit all of the data to stable storage and enough of the metadata to retrieve the data before returning. The server implementor is free to implement DATA_SYNC4 [DATA_SYNC] in the same fashion as FILE_SYNC4 [FILE_SYNC], but with a possible performance drop. If stable is UNSTABLE4 [UNSTABLE], the server is free to commit any part of the data and the metadata to stable storage, including all or none, before returning a reply to the client. There is no guarantee whether or when any uncommitted data will subsequently be committed to stable storage. The only guarantees made by the server are that it will not destroy any data without changing the value of verf and that it will not commit the data and metadata at a level less than that requested by the client.

The stateid returned from a previous record lock or share reservation request is provided as part of the argument. The stateid is used by the server to verify that the associated lock is still valid and to update lease timeouts for the client." (RFC 3010, p. 167)

DAFS servers renew leases whenever any DAFS request (including NULL) is received from a client. Leases are renewed based on the client associated with the session on which the request is received.

"Upon successful completion, the following results are returned.
The count result is the number of bytes of data written to the file. The server may write fewer bytes than requested. If so, the actual number of bytes written starting at location, offset, is returned.

The server also returns an indication of the level of commitment of the data and metadata via committed. If the server committed all data and metadata to stable storage, committed should be set to FILE_SYNC4 [FILE_SYNC]. If the level of commitment was at least as strong as DATA_SYNC4 [DATA_SYNC], then committed should be set to DATA_SYNC4 [DATA_SYNC]. Otherwise, committed must be returned as UNSTABLE4 [UNSTABLE]. If stable was FILE4_SYNC [FILE_SYNC], then committed must also be FILE_SYNC4 [FILE_SYNC]: anything else constitutes a protocol violation. If stable was DATA_SYNC4 [DATA_SYNC], then committed may be FILE_SYNC4 or DATA_SYNC4 [FILE_SYNC or DATA_SYNC]: anything else constitutes a protocol violation. If stable was UNSTABLE4 [UNSTABLE], then committed may be either FILE_SYNC4, DATA_SYNC4, or UNSTABLE4 [FILE_SYNC, DATA_SYNC, or UNSTABLE].

The final portion of the result is the write verifier, verf [verifier]. The write verifier is a cookie that the client can use to determine whether the server has changed state between a call to WRITE [DAFS_PROC_WRITE_DIRECT] and a subsequent call to either WRITE [DAFS_PROC_WRITE_DIRECT] or COMMIT [DAFS_PROC_COMMIT]." (RFC 3010, pp. 167-168)

DAFS servers use the same write verifiers during a single DAFS server instance, whether the write operation is done INLINE or DIRECT. The client can then apply the same verifier tests regardless of the data transfer method chosen (inline or direct).

"This cookie must be consistent during a single instance of the NFS version 4 [DAFS] protocol service and must be unique between instances of the NFS version 4 [DAFS] protocol server, where uncommitted data may be lost.
If a client writes data to the server with the stable argument set to UNSTABLE4 [UNSTABLE] and the reply yields a committed response of DATA_SYNC4 or UNSTABLE4 [DATA_SYNC or UNSTABLE], the client will follow up some time in the future with a COMMIT [DAFS_PROC_COMMIT] operation to synchronize outstanding asynchronous data and metadata with the server's stable storage, barring client error. It is possible that due to client crash or other error that a subsequent COMMIT [DAFS_PROC_COMMIT] will not be received by the server.

On success, the current filehandle retains its value." (RFC 3010, pp. 167-168)

See 4.1.3.1., "Filehandles in Compound vs. Chaining" for differences in filehandle management in DAFS procedures.

IMPLEMENTATION

"It is possible for the server to write fewer than count [byte_count] bytes of data. In this case, the server should not return an error unless no data was written at all. If the server writes less than count [byte_count] bytes, the client should issue another WRITE [DAFS_PROC_WRITE_DIRECT] to write the remaining data.

It is assumed that the act of writing data to a file will cause the time_modified of the file to be updated. However, the time_modified of the file should not be changed unless the contents of the file are changed. Thus, a WRITE [DAFS_PROC_WRITE_DIRECT] request with count [byte_count] set to 0 should not cause the time_modified of the file to be updated.

The definition of stable storage has been historically a point of contention. The following expected properties of stable storage may help in resolving design issues in the implementation. Stable storage is persistent storage that survives:

   1. Repeated power failures.
   2. Hardware failures (of any board, power supply, etc.).
   3. Repeated software crashes, including reboot cycle.

This definition does not address failure of the stable storage module itself.
The verifier is defined to allow a client to detect different instances of an NFS version 4 [DAFS] protocol server over which cached, uncommitted data may be lost. In the most likely case, the verifier allows the client to detect server reboots. This information is required so that the client can safely determine whether the server could have lost cached data. If the server fails unexpectedly and the client has uncommitted data from previous WRITE [DAFS_PROC_WRITE_DIRECT] requests (done with the stable argument set to UNSTABLE4 [UNSTABLE] and in which the result committed was returned as UNSTABLE4 [UNSTABLE] as well) it may not have flushed cached data to stable storage. The burden of recovery is on the client and the client will need to retransmit the data to the server.

A suggested verifier would be to use the time that the server was booted or the time the server was last started (if restarting the server without a reboot results in lost buffers).

The committed field in the results allows the client to do more effective caching. If the server is committing all WRITE requests to stable storage, then it should return with committed set to FILE_SYNC4 [FILE_SYNC], regardless of the value of the stable field in the arguments. A server that uses an NVRAM accelerator may choose to implement this policy. The client can use this to increase the effectiveness of the cache by discarding cached data that has already been committed on the server.

Some implementations may return NFS4ERR_NOSPC [DAFSERR_NOSPC] instead of NFS4ERR_DQUOT [DAFSERR_DQUOT] when a user's quota is exceeded." (RFC 3010, pp.
168-169)

ERRORS

   DAFSERR_ACCES
   DAFSERR_BADHANDLE
   DAFSERR_BAD_STATEID
   DAFSERR_CHAIN_FORM
   DAFSERR_CHAIN_BROKEN
   DAFSERR_DELAY
   DAFSERR_DENIED
   DAFSERR_DQUOT
   DAFSERR_EXPIRED
   DAFSERR_FBIG
   DAFSERR_FHEXPIRED
   DAFSERR_GRACE
   DAFSERR_INVAL
   DAFSERR_IO
   DAFSERR_LEASE_MOVED
   DAFSERR_LOCKED
   DAFSERR_MOVED
   DAFSERR_NOSPC
   DAFSERR_OLD_STATEID
   DAFSERR_RESOURCE
   DAFSERR_ROFS
   DAFSERR_SERVERFAULT
   DAFSERR_STALE
   DAFSERR_STALE_STATEID
   DAFSERR_WRITE_TOOBIG

6.6. Back-Control Directives

This section describes the individual operations that a server can submit to the client, along with the formats of the arguments portion of the request and the results portion of the response. In this section, the requests go from server to client and the replies from client to server.

6.6.1. DAFS_PROC_BC_NULL

SUMMARY

No operation.

ARGUMENTS

None.

RESULTS

None.

DESCRIPTION

"Standard NULL procedure. Void argument, void response. Even though there is no direct functionality associated with this procedure, the server will use CB_NULL [DAFS_PROC_BC_NULL] to confirm the existence of a path for RPCs from server to client." (RFC 3010, p. 102)

DAFS does not use RPC. A server determines whether a path to the client exists when a session contains a bound back channel.

ERRORS

None.

6.6.2. DAFS_PROC_BC_BATCH_COMPLETION

SUMMARY

Notifies the client of completed batch I/O operations.

ARGUMENTS

   struct DAFS_Batch_Submit_Res {
      dafs_completion_notification_type completions<>; /* heap */
   };

RESULTS

None.

DESCRIPTION

The DAFS_PROC_BC_BATCH_COMPLETION back-control directive is used by the server to notify the client that one or more outstanding batch I/O requests have completed.
For each completed I/O request, the server returns the batch request ID from the original client I/O request, its completion status, the number of bytes transferred, and in the case of reads, an optional checksum. For write requests, or all requests if checksumming is not enabled, the server sets the checksum field to 0. Note that the server MAY return short reads or writes, in which case the status is successful but the byte count varies from the original request.

The returned completions MAY have been initiated by any number of prior batch operations; they need not come from a single batch submission. If the client has set "num_completions" in a DAFS_PROC_BATCH_SUBMIT operation, the server SHOULD return completions in batches of the specified size, but it is NOT REQUIRED to return that number and MAY return any number of completions at once (subject to the negotiated maximum message size). In particular, if a server has completed all outstanding batch I/O requests, it SHOULD NOT delay reporting those completions over the back channel, regardless of the client's desired batch completion size.

ERRORS

6.6.3. DAFS_PROC_BC_GETATTR

SUMMARY

Requests the attributes of a file that has been delegated to a client.

ARGUMENTS

   struct DAFS_BC_Getattr_Args {
      dafs_filehandle_type filehandle;
      dafs_attr_bitmap_type attr_request_bitmap;
   };

RESULTS

   struct DAFS_BC_Getattr_Res {
      dafs_file_attr_type obj_attributes; /* heap */
   };

DESCRIPTION

"The CB_GETATTR [DAFS_PROC_BC_GETATTR] operation is used to obtain the attributes modified by an open delegate to allow the server to respond to GETATTR [DAFS_PROC_GETATTR_INLINE and DAFS_PROC_GETATTR_DIRECT] requests for a file which is the subject of an open delegation.

If the handle specified is not one for which the client holds a write open delegation, an NFS4ERR_BADHANDLE [DAFSERR_BADHANDLE] error is returned." (RFC 3010, p.
173)

IMPLEMENTATION

"The client returns attrbits and the associated attribute values only for attributes that it may change (change, time_modify, object_size)." (RFC 3010, p. 173)

See 4.1.3.3., "Attribute Bitmaps" for a description of attribute encoding in DAFS.

ERRORS

   DAFSERR_BADHANDLE
   DAFSERR_RESOURCE

6.6.4. DAFS_PROC_BC_RECALL

SUMMARY

Recalls an open delegation.

ARGUMENTS

   struct DAFS_BC_Recall_Args {
      dafs_filehandle_type filehandle;
      dafs_state_id_type state_id;
      dafs_uint32 truncate;
   };

RESULTS

None.

DESCRIPTION

"The CB_RECALL [DAFS_PROC_BC_RECALL] operation is used to begin the process of recalling an open delegation and returning it to the server.

The truncate flag is used to optimize recall for a file which is about to be truncated to zero. When it is set, the client is freed of obligation to propagate modified data for the file to the server, since this data is irrelevant.

If the handle specified is not one for which the client holds an open delegation, an NFS4ERR_BADHANDLE [DAFSERR_BADHANDLE] error is returned.

If the stateid [state_id] specified is not one corresponding to an open delegation for the file specified by the filehandle, an NFS4ERR_BAD_STATEID [DAFSERR_BAD_STATEID] is returned." (RFC 3010, pp. 173-174)

IMPLEMENTATION

"The client should reply to the callback immediately. Replying does not complete the recall. The recall is not complete until the delegation is returned using a DELEGRETURN [DAFS_PROC_DELEGRETURN]." (RFC 3010, p. 174)

ERRORS

   DAFSERR_BADHANDLE
   DAFSERR_BAD_STATEID
   DAFSERR_RESOURCE

7. Error Status Result Codes

If a DAFS operation request fails, an error status will be entered into the reply message header status field.
The following is a list of error codes and their numeric values:

   const DAFS_STATUS_OK = 0;
   const DAFSERR_PERM = 1;
   const DAFSERR_NOENT = 2;
   const DAFSERR_IO = 5;
   const DAFSERR_NXIO = 6;
   const DAFSERR_ACCES = 13;
   const DAFSERR_EXIST = 17;
   const DAFSERR_XDEV = 18;
   const DAFSERR_NODEV = 19;
   const DAFSERR_NOTDIR = 20;
   const DAFSERR_ISDIR = 21;
   const DAFSERR_INVAL = 22;
   const DAFSERR_FBIG = 27;
   const DAFSERR_NOSPC = 28;
   const DAFSERR_ROFS = 30;
   const DAFSERR_MLINK = 31;
   const DAFSERR_NAMETOOLONG = 63;
   const DAFSERR_NOTEMPTY = 66;
   const DAFSERR_DQUOT = 69;
   const DAFSERR_STALE = 70;
   const DAFSERR_BADHANDLE = 10001;
   const DAFSERR_BAD_COOKIE = 10003;
   const DAFSERR_NOTSUPP = 10004;
   const DAFSERR_TOOSMALL = 10005;
   const DAFSERR_SERVERFAULT = 10006;
   const DAFSERR_BADTYPE = 10007;
   const DAFSERR_DELAY = 10008;
   const DAFSERR_SAME = 10009;
   const DAFSERR_DENIED = 10010;
   const DAFSERR_EXPIRED = 10011;
   const DAFSERR_LOCKED = 10012;
   const DAFSERR_GRACE = 10013;
   const DAFSERR_FHEXPIRED = 10014;
   const DAFSERR_SHARE_DENIED = 10015;
   const DAFSERR_WRONGSEC = 10016;
   const DAFSERR_CLID_INUSE = 10017;
   const DAFSERR_RESOURCE = 10018;
   const DAFSERR_MOVED = 10019;
   const DAFSERR_NOFILEHANDLE = 10020;
   const DAFSERR_MINOR_VERS_MISMATCH = 10021;
   const DAFSERR_STALE_CLIENTID = 10022;
   const DAFSERR_STALE_STATEID = 10023;
   const DAFSERR_OLD_STATEID = 10024;
   const DAFSERR_BAD_STATEID = 10025;
   const DAFSERR_BAD_SEQID = 10026;
   const DAFSERR_NOT_SAME = 10027;
   const DAFSERR_LOCK_RANGE = 10028;
   const DAFSERR_SYMLINK = 10029;
   const DAFSERR_READDIR_NOSPC = 10030;
   const DAFSERR_LEASE_MOVED = 10031;
   const DAFSERR_ILLEGAL_PROT = 15002;
   const DAFSERR_ILLEGAL_STATE = 15003;
   const DAFSERR_UNKNOWN_SESSION = 15004;
   const DAFSERR_NOXID_MATCH = 15005;
   const DAFSERR_NOT_AUTHORIZED = 15006;
   const DAFSERR_NOT_FOUND = 15007;
   const DAFSERR_RDMA_READ_CHANNEL_UNUSABLE = 15008;
   const DAFSERR_CHAIN_FORM = 15009;
   const DAFSERR_CHAIN_BROKEN = 15010;
   const DAFSERR_GSS_CONTINUE_INIT = 15011;
   const DAFSERR_BAD_SESSION = 15012;
   const DAFSERR_NO_CREDS = 15013;
   const DAFSERR_CRHAND_CONFLICT = 15014;
   const DAFSERR_DENYDISP_CONFLICT = 15015;
   const DAFSERR_DENYDISP_NOTSUPP = 15016;
   const DAFSERR_KEY_MISMATCH = 15017;
   const DAFSERR_WRITE_TOOBIG = 15018;
   const DAFSERR_BACK_CHANNEL_UNUSABLE = 15019;
   const DAFSERR_CHKSUM = 15020;

The following list is the name and description for each DAFS error.

DAFS_STATUS_OK
   Indicates the operation completed successfully.

DAFSERR_ACCES
   Permission denied. The caller does not have the correct permission to perform the requested operation. Contrast this with DAFSERR_PERM, which restricts itself to owner or privileged user permission failures.

DAFSERR_BADHANDLE
   Illegal DAFS file handle. The file handle failed internal consistency checks.

DAFSERR_BADTYPE
   An attempt was made to create an object of a type not supported by the server.

DAFSERR_BAD_COOKIE
   READDIR cookie is stale.

DAFSERR_BAD_STATEID
   A State-id generated by the current server instance, but which does not designate any locking state (either current or superseded) for a current lockowner-file pair, was used.

DAFSERR_BATCH_REQUEST_NOT_FOUND
   The server does not have an outstanding asynchronous dafs_read_write_request on a session when it receives a hurry up request.

DAFSERR_CHAIN_BROKEN
   A chained operation was received after a previous chain operation failed, breaking the current DAFS chain.

DAFSERR_CHAIN_FORM
   Error returned when a chained operation does not adhere to the chaining rules stated in the chaining section of this spec.

DAFSERR_CHKSUM
   A checksum mismatch error occurred.

DAFSERR_CLID_INUSE
   The client id is already in use by another client.

DAFSERR_DELAY
   The server initiated the request, but was not able to complete it in a timely fashion.
   The client SHOULD wait and then retry the request. For example, this error is returned from a server that supports hierarchical storage and receives a request to process a file that has been migrated. In this case, the server SHOULD start the immigration process and respond to the client with this error. This error MAY also occur when a necessary delegation recall makes processing a request in a timely fashion impossible.

DAFSERR_DENIED
   An attempt to lock a file is denied. Since this MAY be a temporary condition, the client is encouraged to retry the lock request until the lock is accepted.

DAFSERR_DENYDISP_CONFLICT
   This error is returned from a REMOVE request because of a conflict in dispositions regarding the removal of open files.

DAFSERR_DENYDISP_NOTSUPP
   This error is returned from an OPEN or REMOVE request when the server is unable to support the requested remove dispositions.

DAFSERR_DQUOT
   Resource (quota) hard limit exceeded. The user's resource limit on the server has been exceeded.

DAFSERR_EXIST
   File exists. The file specified already exists.

DAFSERR_EXPIRED
   A lease that is being used in the current procedure has expired.

DAFSERR_FBIG
   File is too large. The operation would have caused a file to grow beyond the server's limit.

DAFSERR_FHEXPIRED
   The file handle provided is volatile and has expired at the server.

DAFSERR_GRACE
   The server is in its recovery or grace period.

DAFSERR_GSS_CONTINUE_INIT
   The reply message is an intermediate result of a multi-step sequence of GSS authentication message exchanges.

DAFSERR_ILLEGAL_PROT
   Protocol version is invalid for the client. The client SHOULD retry with a lower protocol version number. Protocol_version contains a suggested protocol version that is supported.

DAFSERR_ILLEGAL_STATE
   The protocol has already been negotiated.
DAFSERR_INVAL
   Invalid argument or unsupported argument for an operation. Two examples are attempting a READLINK on an object other than a symbolic link or attempting to SETATTR a time field on a server that does not support this operation.

DAFSERR_IO
   I/O error. A hard error (for example, a disk error) occurred while processing the requested operation.

DAFSERR_ISDIR
   Is a directory. The caller specified a directory in a non-directory operation.

DAFSERR_KEY_MISMATCH
   Attempt to obtain a KEY SHARE reservation is denied because a KEY SHARE reservation already exists with a different key.

DAFSERR_LEASE_MOVED
   A lease being renewed is associated with a file system that has been migrated to a new server.

DAFSERR_LOCKED
   A read or write operation was attempted on a locked file.

DAFSERR_LOCK_BROKEN
   An attempt to lock or a lock test was done on a persistent lock that has been marked as broken.

DAFSERR_LOCK_RANGE
   A lock request is operating on a sub-range of a current lock for the lock owner, and the server does not support this type of request.

DAFSERR_MLINK
   Too many hard links.

DAFSERR_MOVED
   The file system that contains the current filehandle object has been relocated or migrated to another server. The client MAY obtain the new file system location by obtaining the "fs_locations" attribute for the current filehandle.

DAFSERR_NAMETOOLONG
   The filename in an operation was too long.

DAFSERR_NODEV
   No such device.

DAFSERR_NOENT
   No such file or directory. The file or directory name specified does not exist.

DAFSERR_NOFILEHANDLE
   The specified file handle value is invalid.

DAFSERR_NOSPC
   No space left on device. The operation would have caused the server's file system to exceed its limit.

DAFSERR_NOTDIR
   Not a directory. The caller specified a non-directory in a directory operation.

DAFSERR_NOTEMPTY
   An attempt was made to remove a directory that was not empty.

DAFSERR_NOTSUPP
   Operation is not supported.
DAFSERR_NOT_AUTHORIZED
   Authorization failed.

DAFSERR_NOT_FOUND
   The specific file does not exist.

DAFSERR_NOT_SAME
   This error is returned by the VERIFY operation to signify that the attributes compared were not the same as the attributes provided in the client's request.

DAFSERR_NOXID_MATCH
   The specified request has no Response Cache entry. Generally this indicates that the request was not executed before the Session was disconnected.

DAFSERR_NXIO
   I/O error. No such device or address.

DAFSERR_OLD_STATEID
   A State-id that designates the locking state for a lockowner-file at an earlier time was used.

DAFSERR_PERM
   Not owner. The operation was not allowed because the caller is either not a privileged user (root) or not the owner of the target of the operation.

DAFSERR_PREFETCH_NOT_SUPPORTED
   The server does not support the prefetch cache hint.

DAFSERR_RDMA_READ_CHANNEL_UNUSABLE
   The request requires the use of the RDMA-read Channel, but that channel has not been established.

DAFSERR_READDIR_NOSPC
   The encoded response to a READDIR request exceeds the size limit set by the initial request.

DAFSERR_RESOURCE
   For the processing of a member of a chained set of operations, the server MAY exhaust available resources and cannot continue processing procedures within the chain.

DAFSERR_ROFS
   Read-only file system. A modifying operation was attempted on a read-only file system.

DAFSERR_SAME
   This error is returned by the NVERIFY operation to signify that the attributes compared were the same as the attributes provided in the client's request.

DAFSERR_SERVERFAULT
   An error occurred on the server that does not map to any of the legal DAFS protocol error values. The client SHOULD translate this into an appropriate error. UNIX clients MAY choose to translate this to EIO.
DAFSERR_SHARE_DENIED
   An attempt to OPEN a file with a share reservation has failed because of a share conflict.

DAFSERR_STALE
   Invalid file handle. The file handle given as an argument was invalid. The file referred to by that file handle no longer exists or access to it has been revoked.

DAFSERR_STALE_CLIENTID
   The Client-id specified as an argument is not recognized by the server.

DAFSERR_STALE_STATEID
   A State-id generated by an earlier server instance was used.

DAFSERR_SYMLINK
   The current file handle provided for a LOOKUP is not a directory but a symbolic link. The final component of the OPEN path is a symbolic link.

DAFSERR_TOOSMALL
   Buffer size is too small.

DAFSERR_UNKNOWN_SESSION
   The Session-id specified in the request is not known to the server.

DAFSERR_VERS_MISMATCH
   The DAFS server does not support the specified version.

DAFSERR_WRITE_TOOBIG
   A request to write data to a file exceeds the maximum allowed I/O size for the target server.

DAFSERR_WRONGSEC
   The security mechanism being used by the client for the procedure does not match the server's security policy. The client SHOULD change the security mechanism being used and retry the operation.

DAFSERR_XDEV
   Attempted to perform a cross-device hard link.

8. Security and IANA Considerations

8.1. Security Considerations

The key security concern for DAFS is authenticating clients. This issue is discussed in 3.1.1., "Security Model".

8.2. IANA Considerations

Like NFS version 4 (as specified in RFC 3010), DAFS includes the use of named attributes.

9. Bibliography

[Christianson]
   N. Christenson, T. Bosserman, D. Beckemeyer, "A Highly Scalable Electronic Mail Service Using Open Systems", First Usenix Symposium on Internet Technologies and Systems, December 1997.

[Dicecco]
   S. DiCecco, J. Williams, B. Terrell, J. Scott, C. Sapuntzakis, "VI / TCP (Internet VI)", http://www.ietf.org/internet-drafts/draft-dicecco-vitcp-01.txt

[Fletcher]
   Fletcher, An Arithmetic Checksum for Serial Transmission, IEEE Transactions on Communications, Volume COM-30, No. 1, January 1982, pp. 247-252.

[IB]
   "InfiniBand(TM) Architecture Specification Release 1.0", InfiniBand Trade Association(SM).

[Linn]
   J. Linn, "Generic Security Service Application Program Interface, Version 2, Update 1", IETF RFC 2743, http://www.ietf.org/rfc/rfc2743.txt

[POSIX]
   IEEE Standard 1003.1 (POSIX.1)

[RFC1813]
   Callaghan, B., Pawlowski, B. and P. Staubach, "NFS Version 3 Protocol Specification", RFC 1813, June 1995.

[RFC2277]
   Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP 18, RFC 2277, January 1998.

[Sklower]
   Sklower, Improving the Efficiency of the OSI Checksum Calculation, http://www.cs.berkeley.edu/~sklower/cksmosi.ps

[Shepler]
   S. Shepler, C. Beame, B. Callaghan, M. Eisler, D. Noveck, D. Robinson, R. Thurlow, "NFS Version 4 protocol", IETF RFC 3010, http://www.ietf.org/rfc/rfc3010.txt

[T11-FCVI]
   T11, NCITS working group, "dpANS - Fibre Channel - Virtual Interface Architecture Mapping", http://www.t11.org.index.html

[VIA]
   "Virtual Interface Architecture Specification Version 1.0", December 16, 1997, Compaq/Intel/Microsoft.

[VIDG]
   "Intel Virtual Interface (VI) Architecture Developer's Guide Revision 1.0", Sept. 9, 1998, Intel Corporation.

[WARP]
   "WARP Architectural Requirements Summary", J. Pink, http://www.ietf.org/internet-drafts/draft-jpink-warp-summary-00.txt

10. Author Information and Acknowledgements

10.1. Editor

   Mark Wittle
   Network Appliance
   627 Davis Drive, Suite 200
   Morrisville, NC 27560
   Phone: 919-993-5627
   Email: mwittle@netapp.com

10.2. Authors

This document is the result of a truly collaborative effort by many people from many companies within the DAFS Collaborative. Since July 2000, when the DAFS Specification 0.5 was released, Mark Wittle has acted as the primary author and editor of the specification.

10.3. Comments

Comments on this document should be sent to dafs-discussions@groups.yahoo.com.

10.4. Acknowledgements

I'd like to thank all of the member companies of the DAFS Collaborative for supporting the effort of creating this specification. If I tried to list all of the people who contributed to the creation of this DAFS Specification, my list would surely be incomplete. So, simply, I'd like to thank the many individuals from Network Appliance and the DAFS Collaborative who helped create DAFS.

Appendix A. DAFS Name Service

A.1. Introduction

DAFS provides a simple and flexible name space for the file and file-related objects (for example, directories, symlinks) that exist within a DAFS administrative domain. In addition, DAFS defines a name service structure for mapping the name space into specific locations in a distributed environment. This appendix describes the DAFS name space.

DAFS provides a two-stage discovery process for connection establishment between DAFS clients and servers. First, the DAFS client queries the DAFS Name Service with a DAFS Name as its input and gets back a set of DAFS locations. A DAFS location consists of a DAT location and a directory path. Second, the DAFS client queries the DAT Name Service (see Appendix C. "DAT Name Service") with the DAT hostname from the DAT Location and a client's local channel adapter. The query returns the server's host channel adapter address(es). The server host channel adapter address and DAT connection qualifier are used by the client to request a DAT connection to the DAFS server for a DAFS Session.
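The two-stage discovery process can be sketched in C as follows. This is only an illustration under stated assumptions: DAFS does not define a C interface for either name service, so every type and function name here (dafs_ns_entry, dat_ns_entry, dafs_connect_target, dat_resolve, dafs_discover) is hypothetical, both name services are modeled as small in-memory tables, and the directory path is shown as a single string for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical DAFS Name Service record: one row per DAFS Location
 * associated with a DAFS Name. Not defined by the DAFS specification. */
typedef struct {
    const char *dafs_name;       /* lookup key */
    const char *dat_hostname;    /* taken from the DAT Location */
    const char *conn_qualifier;  /* taken from the DAT Location */
    const char *directory_path;
} dafs_ns_entry;

/* Hypothetical DAT Name Service record. */
typedef struct {
    const char *dat_hostname;
    const char *local_adapter;   /* the client's local channel adapter */
    const char *server_address;  /* server host channel adapter address */
} dat_ns_entry;

/* What the client needs in order to request a DAT connection. */
typedef struct {
    const char *server_address;
    const char *conn_qualifier;
    const char *directory_path;
} dafs_connect_target;

/* Stage 2: map a DAT hostname and local adapter to a server address. */
static const char *dat_resolve(const dat_ns_entry *dat, size_t n,
                               const char *hostname,
                               const char *local_adapter)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(dat[i].dat_hostname, hostname) == 0 &&
            strcmp(dat[i].local_adapter, local_adapter) == 0)
            return dat[i].server_address;
    }
    return NULL;
}

/* Runs stage 1 (DAFS Name -> Locations) and stage 2 for each Location.
 * Returns 0 and fills *out for the first Location whose DAT hostname
 * resolves through the given local adapter; returns -1 otherwise. */
int dafs_discover(const dafs_ns_entry *ns, size_t ns_n,
                  const dat_ns_entry *dat, size_t dat_n,
                  const char *dafs_name, const char *local_adapter,
                  dafs_connect_target *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < ns_n; i++) {
        if (strcmp(ns[i].dafs_name, dafs_name) != 0)
            continue;                       /* stage 1: not this Name */
        const char *addr = dat_resolve(dat, dat_n, ns[i].dat_hostname,
                                       local_adapter);
        if (addr == NULL)
            continue;                       /* stage 2 failed, try next */
        out->server_address = addr;
        out->conn_qualifier = ns[i].conn_qualifier;
        out->directory_path = ns[i].directory_path;
        return 0;
    }
    return -1;
}
```

A caller would pass the resulting server address and connection qualifier to its DAT provider to request the connection; that step is transport specific and is not shown.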
A DAFS server advertises its services by populating the DAFS Name Service. There MAY be more than one DAFS server for the same DAFS file or file-related object.

A.2. DAFS Name Space

The DAFS name space allows DAFS file objects to be distributed within a collection of DAFS server systems. Each of the DAFS servers provides access to a subset of the file objects within the DAFS name space. The collection of DAFS servers that participate within a particular DAFS name space make up the DAFS name space domain.

A.3. DAFS Name

A "DAFS Name" is a simple string. Each DAFS Name is associated with a set of "DAFS Locations." Each DAFS Name is unique within the DAFS name space; that is, a given DAFS Name refers to a single set of DAFS Locations within the domain. For instance, if a name service is used to store and look up DAFS Names and their Locations, then each DAFS Name can have at most one entry in the name service.

   Note: The runtime components that implement a DAFS name space are NOT REQUIRED to enforce the assertion of DAFS Name uniqueness (although they MAY). However, in order to provide reasonably-expected results when file objects are accessed, DAFS servers MAY assume that the name space domain is administered under this policy. DAFS clients MAY assume that any failover, migration, or replication feature provided by the DAFS server is also governed by the same policy.

   Depending on how the implementations handle DAFS Names, restrictions on the use of common separator characters like "/" MAY be needed.

DAFS Names are associated with DAFS Locations. In this sense, the DAFS Name acts as the "key" to a database lookup, and the DAFS Location is the returned "value."

A.4. DAFS Location

The DAFS location specifies a DAFS server and is made up of three components:

   1) DAT Location
   2) DAFS Directory Path
   3) DAFS Version.

A.4.1. DAT Location

The DAT Location provides two pieces of information.
One is the "server address" information needed by a DAFS client to establish a DAFS communication channel between client and server. The other is the transport semantics supported by the DAFS server host including transport attributes and optional semantics. The DAFS Location is filled in by the DAFS server.

The DAT Location consists of the following four parts:

   1) DAT Transport Type
   2) DAT Hostname
   3) DAT Connection Qualifier
   4) Transport Specific Server Attributes.

The Transport Type is one of a well-defined set of simple strings that specifies a particular DAT Transport type. The Transport type is expected to be used to identify the appropriate DAT context in which the DAT Hostname and DAT Connection Qualifier are to be interpreted.

DAFS protocol version-1.0 defines two DAT transport types:

   1) VI - for Virtual Interface for FC/VI and VI-TCP
   2) IB-RC - for InfiniBand Reliable Connection based implementations.

   const char* DAFS_DAT_VI = "VI";
   const char* DAFS_DAT_IBRC = "IBRC";

   Note: the enumerated transport mappings listed in this section are not meant to be exhaustive or exclusive in any way. As additional DAT transports are identified and become available, it is expected that additional types will be added to this set.

For each DAT transport type defined by the DAFS protocol there is an appendix that describes how the transport type supports DAT semantics. For VI, see Appendix D. "DAFS Mapping to VI Architecture" and for IB-RC, see Appendix E. "DAFS Mapping to InfiniBand Reliable Connection".

The DAT Hostname is also a simple string that the client can pass to the appropriate transport-specific DAT Name Service Provider. The DAT Name Service Provider maps the DAT Hostname to a Channel Adapter Address or set of Channel Adapter Addresses that identifies the DAFS server's channel adapter(s) for a set of remote endpoints for connection establishment. For more information, see Appendix C. "DAT Name Service" for a description of DAT Name Service characteristics and requirements.

   Note: Different hosts SHOULD have different DAT hostnames. However, a single host can have multiple hostnames.

The DAT Connection Qualifier is also a simple string. It is the value that the DAFS client uses to specify a remote Endpoint on the target Channel Adapter associated with the DAFS server. This Endpoint will be used to create a DAT connection upon which a DAFS Session can be established. For a description of the DAT Connection Qualifier and its use in the DAT connection establishment process, see B.4., "Transport Endpoints and Connections".

   Note: While the DAT Connection Qualifier is viewed by the DAFS name space as a simple string, internally it MAY have more intricate structure. The internal structure is determined by the DAT transport type. The definition of this internal structure is provided in the DAFS mapping for each DAT Transport type. For more information, see the appropriate appendices for the DAFS on VI and DAFS on IB mappings.

The content of the Transport Specific Server Attributes field is defined for each DAT Transport Type in the appropriate appendix for the mapping of DAT to that Transport Type. These attributes provide clients information about transport specific attributes that are set up or supported by the server. Examples are the maximum transfer size, reliability levels supported by the server's transport provider, and support of optional transport functionality.

A.4.2. DAFS Directory Path

The Directory Path is a list (array) of directory name components that specifies a hierarchical directory path provided by the DAFS server at the specified DAFS Location.
The directory path can be accessed through the use of the DAFS_PROC_LOOKUP operation, executed using the return value from DAFS_PROC_GETROOTHANDLE (that is, the rootfilehandle of the DAFS server) as the directory for the lookup operation.

The Directory Path MAY be NULL, meaning that the directory path associated with the DAFS Name is the rootfilehandle returned by the DAFS_PROC_GETROOTHANDLE operation.

   Note: Although NOT REQUIRED, the Directory Path is expected to be the same for the same host regardless of which Transport Type channel adapter is used to reach the host.

A.4.3. DAFS Version

The DAFS Version is a simple string that specifies the DAFS version number that is supported by the DAFS server. If a DAFS server supports multiple DAFS versions, it needs to create a separate DAFS Location for each of them.

A.5. DAFS Names and Locations

A DAFS Name MAY be associated with one or more DAFS Locations. Allowing multiple Locations provides the capability to associate different semantics with additional Locations. For instance, a DAFS Name with two Locations MAY indicate multiple transport paths to the Name that are capable of concurrent access, which might be useful as a performance and/or availability enhancement. It might indicate some form of replication of resources. Multiple paths might also indicate the existence of an active primary location and an inactive secondary (or backup) location. The structure defined by the one-to-many relationship of DAFS Name to Location is:

   DAFS Name: {DAFS Location, DAFS Location, ... }

The structure does not specify the relationship between the multiple locations. Currently, the DAFS client needs to determine the relationship dynamically. Accordingly, the DAFS server MUST be prepared for clients that probe a set of locations in an attempt to determine their status (for example, "active" or "inactive").
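As an illustration only, the one-to-many relationship and the client-side probing described above could be represented as in the following C sketch. The structure and function names are hypothetical and are not defined by DAFS; the probe callback stands in for whatever connection attempt a real client would make, and the sample probe simply matches on transport type so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one DAFS Location (names are assumptions). */
struct dafs_location {
    const char *transport_type;  /* "VI" or "IBRC" */
    const char *hostname;        /* DAT Hostname */
    const char *conn_qualifier;  /* DAT Connection Qualifier */
    const char *directory_path;  /* NULL stands for the root file handle */
    const char *version;         /* DAFS Version string */
};

/* DAFS Name: {DAFS Location, DAFS Location, ... } */
struct dafs_name_entry {
    const char *name;
    const struct dafs_location *locations;
    size_t n_locations;
};

/* Hypothetical probe: returns nonzero if the location appears active. */
typedef int (*dafs_probe_fn)(const struct dafs_location *loc, void *ctx);

/* Return the first location the probe reports as active, or NULL. */
const struct dafs_location *
dafs_first_active(const struct dafs_name_entry *entry,
                  dafs_probe_fn probe, void *ctx)
{
    assert(entry != NULL && probe != NULL);
    for (size_t i = 0; i < entry->n_locations; i++) {
        if (probe(&entry->locations[i], ctx))
            return &entry->locations[i];
    }
    return NULL;
}

/* Stand-in probe for the example: "active" means the transport type
 * equals the string passed in ctx. A real client would attempt a
 * connection instead. */
int dafs_probe_transport_is(const struct dafs_location *loc, void *ctx)
{
    return strcmp(loc->transport_type, (const char *)ctx) == 0;
}
```

A client following this approach would keep the list of Locations for each Name and re-run the probe after a failure, rather than waiting for new name space information to propagate.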
   Rationale: The Name to Locations mappings provided in the DAFS name space are potentially more static than the relationship between two locations. One goal of the DAFS name space design is to provide the client with quick failover capability, and this precludes reliance on propagation of new name space information at failover time. Thus, the approach here is to enable the client to capture and maintain the fairly static picture of the possible Locations for each Name in the domain and then, in real time, to probe those Locations as it deems appropriate in response to various failure scenarios. However, since the mapping of Names to Locations MAY change and grow to include new entries over time, the DAFS client SHOULD have a mechanism to update those mappings.

A DAFS server can listen on multiple Connection Qualifiers on the same host. Nevertheless, different Connection Qualifiers MUST be advertised as separate DAFS Locations.

If a DAFS server host has multiple channel adapters of the same transport type, then a Connection Qualifier that is valid on one channel adapter of that transport type MUST also be valid on all channel adapters of that transport type.

If a host has multiple channel adapters for the same DAT Transport Type, there will be a single DAFS Location Entry for all of them. By calling the transport-specific DAT Name Service with the DAT Hostname and the client's local channel adapter for that transport type, the DAFS client obtains all channel adapter addresses for that host.

A.6. Name Space Repository

For simple configurations a DAFS client MAY wish to directly connect to a DAFS server that provides the name space subset of interest.
For more complex environments, where there are multiple DAFS clients, multiple DAFS servers, multiple connection paths, and support for failover and trunking, the "DAFS Name Service" component is used to discover DAFS servers that provide the name space subset. The DAFS Name Service maps DAFS Names to lists of DAFS Locations.

A.7. LDAP Schema

The DAFS Name Space provides the ability to define the relationship between DAFS named objects and their locations. This section defines the basic elements of a mapping of the DAFS Name Space into the Lightweight Directory Access Protocol (LDAP) [Wahl]. The LDAP schema is extended to add the dafsSchema, which supports the following object class and attribute definitions:

Object Classes

   o  dafsNameSpace: the DAFS name space

   o  dafsNameSpaceEntry: an entry mapping a DAFS name to one or more locations

   o  dafsLocationList: a list of one or more locations for a DAFS name

   o  dafsLocation: a DAFS location

   o  datLocation: the DAT transport-specific information for a location

Attributes

   o  dafsName: a DAFS name string

   o  datTransportType: a transport type, currently either "VI" or "IBRC"

   o  datTransportHostname: a transport-specific name for the host-channel location

   o  datTransportConnectionQualifier: more specific information for the transport address

   o  datTransportAttributes: transport-specific server attributes

   o  dafsDirectoryPath: a list of one or more directory pathname components

   o  dafsProtocolVersion: a list of one or more supported protocol version numbers
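To make the use of these names concrete, the following C sketch builds an LDAP search filter that selects the dafsNameSpaceEntry for a given DAFS name by matching on the dafsName attribute. The function name and its error convention are assumptions made for this illustration and are not part of DAFS or of any LDAP library. For simplicity the sketch rejects names containing characters that would need escaping in an LDAP filter instead of escaping them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "(&(objectClass=dafsNameSpaceEntry)(dafsName=<name>))" in buf.
 * Returns the filter length, or -1 if the name is empty, contains a
 * character that is special in LDAP filters, or does not fit in buf.
 * Hypothetical helper; not defined by DAFS. */
int dafs_build_filter(const char *name, char *buf, size_t cap)
{
    assert(name != NULL && buf != NULL);

    /* '*', '(', ')' and '\' have special meaning in search filters. */
    if (name[0] == '\0' || strpbrk(name, "*()\\") != NULL)
        return -1;

    int n = snprintf(buf, cap,
                     "(&(objectClass=dafsNameSpaceEntry)(dafsName=%s))",
                     name);
    if (n < 0 || (size_t)n >= cap)
        return -1;  /* output error or truncated */
    return n;
}
```

The resulting string could then be handed to whatever LDAP client interface the implementation uses; that interface is outside the scope of this sketch.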
   ########################################################
   #
   # DAFS name service
   #
   # DAFS_NAMESERVICE
   #

   attribute ( DAFS_NAMESERVICE.1.0 NAME 'dafsName'
           DESC 'DAFS Name'
           EQUALITY caseExactIA5Match
           SYNTAX 1.3.6.1.4.1.1466.115.121.1.15 SINGLE-VALUE )

   attribute ( DAFS_NAMESERVICE.1.1 NAME 'datTransportType'
           DESC 'Type of dafs transport address (VI, IBRC)'
           EQUALITY caseExactIA5Match
           SYNTAX 1.3.6.1.4.1.1466.115.121.1.15 SINGLE-VALUE )

   attribute ( DAFS_NAMESERVICE.1.2 NAME 'datTransportHostname'
           DESC 'TransportHostname for DAFS'
           EQUALITY caseExactIA5Match
           SYNTAX 1.3.6.1.4.1.1466.115.121.1.15 SINGLE-VALUE )

   attribute ( DAFS_NAMESERVICE.1.3 NAME 'datTransportConnectionQualifier'
           DESC 'TransportConnectionQualifier for DAFS'
           EQUALITY caseExactIA5Match
           SYNTAX 1.3.6.1.4.1.1466.115.121.1.15 SINGLE-VALUE )

   attribute ( DAFS_NAMESERVICE.1.4 NAME 'datTransportAttributes'
           DESC 'TransportAttributes for DAFS'
           EQUALITY caseExactIA5Match
           SYNTAX 1.3.6.1.4.1.1466.115.121.1.15 SINGLE-VALUE )

   attribute ( DAFS_NAMESERVICE.1.5 NAME 'dafsDirectoryPath'
           DESC 'Dafs Directory Path Name component'
           EQUALITY caseExactIA5Match
           SYNTAX 1.3.6.1.4.1.1466.115.121.1.15 )

   attribute ( DAFS_NAMESERVICE.1.6 NAME 'dafsProtocolVersion'
           DESC 'Dafs Protocol Version number'
           EQUALITY caseExactIA5Match
           SYNTAX 1.3.6.1.4.1.1466.115.121.1.15 )

   objectClass ( DAFS_NAMESERVICE.2.1 NAME 'dafsNameSpace'
           DESC 'DAFS Name Space object'
           SUP top STRUCTURAL
           MAY cn )

   objectClass ( DAFS_NAMESERVICE.2.2 NAME 'dafsNameSpaceEntry'
           DESC 'DAFS Name Space entry'
           SUP dafsNameSpace AUXILIARY
           MUST ( dafsName $ dafsLocationList )
           MAY ( description ) )

   objectClass ( DAFS_NAMESERVICE.2.3 NAME 'dafsLocationList'
           SUP dafsNameSpaceEntry AUXILIARY
           DESC 'List of locations for accessing a DAFS Name'
           MUST ( dafsLocation ) )

   objectClass ( DAFS_NAMESERVICE.2.4 NAME 'dafsLocation'
           SUP dafsLocationList AUXILIARY
           DESC 'A location for accessing a DAFS Name'
           MUST ( datLocation $ dafsDirectoryPath $ dafsProtocolVersion ) )

   objectClass ( DAFS_NAMESERVICE.2.5 NAME 'datLocation'
           SUP dafsLocation AUXILIARY
           DESC 'Location for access to DAT interface'
           MUST ( datTransportType $ datTransportHostname $
                  datTransportConnectionQualifier $
                  datTransportAttributes ) )

The following illustrates how an LDAP search for the DAFS name "database17" might return information, in this case two transport addresses associated with the same DAFS server:

   Query:

      (& (objectClass=dafsNameSpaceEntry) (dafsName=database17))

   Result:

      objectClass: top
      objectClass: dafsNameSpaceEntry
      dafsName: database17

      datTransportType: VI
      datTransportHostname: server28
      datTransportConnectionQualifier: dafs
      datTransportAttributes:
      dafsDirectoryPath: dbms_dir
      dafsProtocolVersion: 1

      datTransportType: IBRC
      datTransportHostname: server28
      datTransportConnectionQualifier: dafs
      datTransportAttributes:
      dafsDirectoryPath: dbms_dir
      dafsProtocolVersion: 1

A.8. References

   [Wahl] M. Wahl, T. Howes, S. Kille, "Lightweight Directory Access Protocol (v3)", IETF RFC 2251, http://www.ietf.org/rfc/rfc2251.txt

Appendix B. DAT Semantics

New interconnect networks that provide direct access to remote memory are emerging. In addition to offering direct access to remote memory, these new interconnect networks also provide low latency and high throughput. Their transport protocols support remote memory read and write in addition to more traditional message transfer operations. Examples of these transports include Virtual Interface Architecture, InfiniBand Architecture, and the WARP protocol for the Internet.

Traditional message transfer operations allow only the receiver of a message to specify the particular location on the destination node where the message payload will be deposited.
Remote memory writes allow the operation initiator to specify the target memory location on the destination node. Remote memory reads allow the operation initiator to specify both the remote memory location that is to be the source of a data "fetch" operation and the local destination for the fetched remote memory contents. The addition of remote memory semantics to the transport layer supports a new class of networked applications.

This appendix defines the semantics of a particular set of abstract transport capabilities. A transport whose semantics support these capabilities is called a Direct Access Transport (DAT). These semantics are intended to be mapped easily onto networks that support memory-to-memory operations, such as Virtual Interface Architecture and InfiniBand Architecture. This appendix does not define a specific transport layer interface, but does describe some functionality and concepts necessary to support the DAFS protocol.

B.1. DAT Glossary

Channel Adapter

   A host-resident device that transfers messages between host memory associated with a specific Endpoint and a Fabric.

Channel Adapter Address

   The address of a Channel Adapter on the Fabric.

Connection

   An association between a pair of Endpoints such that the data of data-transfer operation requests posted on either Endpoint arrives at the other Endpoint of the Connection.

Connection Qualifier

   A value that enables a new connection request to be associated with the upper-level-protocol entity providing the service.

DAT Consumer

   An application that requires Direct Access Transport services.

DAT Provider

   Provider of the Transport services for a Direct Access application.

Data Transfer Completion (DTC)

   Status of a completed data transfer operation.

Data Transfer Operation (DTO)

   A requested data movement submitted to a DAT Provider.
Endpoint (EP)

   The local part of a Connection that supports posting data-transfer operation requests.

Fabric

   A network with RDMA capabilities.

Operation Type

   Send, Receive, RDMA Read, or RDMA Write data transfer operations (DTO).

Remote Direct Memory Access (RDMA)

   Access of local memory by the remote Endpoint. There are two RDMA operations: RDMA Read and RDMA Write.

RDMA Memory Region Context (RMR Context)

   A representation for an arbitrary-sized, registered, contiguous virtual space that belongs to a Channel Adapter so that it can support Remote DMA operations on the Connection whose local Endpoint belongs to the Channel Adapter.

RMR Target Address

   Specifies the memory address within a region of memory represented by an RDMA Memory Region Context. (The specification can be either by virtual address or by offset from the start of the memory represented by the RMR Context.)

B.2. DAT Model

There are two significant interfaces to a Direct Access Transport service provider. One interface defines the boundary between the consumer of a set of transport services and the local transport provider of these services. In the DAT model, this would be the interface between the DAT Consumer and the DAT Provider. The other interface defines the set of interactions between a local and remote transport provider that enables the local and remote providers to offer a set of transport services between the local and remote transport consumers. In the DAT model, this would be the set of interactions between a local DAT Provider and a remote DAT Provider that are visible to the local DAT Consumer and/or remote DAT Consumer.

This document defines the minimal set of necessary semantics for the interaction between DAT Providers that are visible to the local DAT Consumer.
The transport-protocol-specific details of the DAT Provider to DAT Provider interactions for a specific transport are outside the scope of this document. These lower-level details are expected to be provided as part of the specification of a particular transport protocol (e.g., VI/TCP, FC-VI, IB, and WARP). Furthermore, except as needed to characterize the semantics of the target set of abstract transport services between a local and remote DAT Consumer, the local interactions between the DAT Provider and the DAT Consumer are not defined in this document.

B.3. DAT Provider

There can be multiple DAT Providers on the same node. Each DAT Provider controls resources and provides RDMA and message transfer services for one or more DAT Consumer processes. A DAT Provider controls Channel Adapters. A Channel Adapter is controlled by at most one DAT Provider. A DAT Provider can have multiple Channel Adapters. Each Channel Adapter can have multiple Endpoints. An Endpoint belongs to exactly one Channel Adapter. An Endpoint is the local part of a connection that supports the posting of a data transfer operation (DTO), including RDMA operations. A Connection is an association between a pair of Endpoints such that the data payload described by data transfer operations posted on either Endpoint arrives at the other Endpoint of the Connection. In order for an endpoint of a connection to support RDMA operations, the remote endpoint MUST have access rights to the local memory accessed by RDMA.

The DAT Provider controls one or more Channel Adapters that provide access to the network Fabric. Each Channel Adapter is identified on the Fabric by a unique address. The assignment and maintenance of Channel Adapter addresses on the Fabric, as well as name service for these addresses, are outside the scope of DAT.
For connection establishment purposes, a DAT Consumer such as a DAFS client needs to have a mechanism to specify remote Channel Adapters (see Appendix C. "DAT Name Service"). The mechanisms by which a DAT Consumer discovers remote Channel Adapters and identifies them to the DAT Provider are outside the scope of DAT.

B.4. Transport Endpoints and Connections

DAT semantics require reliable data exchange on point-to-point connections. A point-to-point connection is an association, formed by a connection establishment protocol, between two transport endpoints. The transport MUST be capable of supporting more than one connection between a pair of DAT Providers.

A DAT Endpoint is an object in the transport layer that upper layers use to create connections and exchange data. When it is part of a connection, an endpoint supports four data transfer operations: send, receive, RDMA write, and RDMA read. The data of a data transfer operation posted to an endpoint of a connection arrives at the other endpoint of the connection.

The DAT Provider manages endpoint creation and destruction. A DAT Provider supports explicit connection endpoint creation by the DAT Consumer. The mechanism by which the DAT Consumer and the DAT Provider interact to create and destroy endpoints is outside the scope of DAT. The properties of an unconnected endpoint are outside the scope of DAT. Each endpoint is associated with a single Channel Adapter for the lifetime of the connection, that is, for as long as the local endpoint is connected to a remote endpoint.

An Endpoint is the interaction point for data transfer operation requests between a DAT Consumer and a DAT Provider. A DAT Consumer submits data transfer operation requests to a local endpoint. A local endpoint executes data transfer requests only if it is a part of an established connection.
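The rule that an endpoint executes data transfer requests only while it is part of an established connection can be pictured with the following C sketch. All names are hypothetical; DAT deliberately does not define a programming interface. The sketch simply rejects a request on an unconnected endpoint. That is only one possible behavior, since the properties of an unconnected endpoint are outside the scope of DAT, and a provider could instead hold such requests until the connection is established.

```c
#include <assert.h>
#include <stddef.h>

/* The four data transfer operations an endpoint supports. */
enum dat_op { DAT_SEND, DAT_RECV, DAT_RDMA_WRITE, DAT_RDMA_READ };

enum dat_ep_state { DAT_EP_UNCONNECTED, DAT_EP_CONNECTED };

/* Hypothetical endpoint record (not a DAT-defined structure). */
struct dat_endpoint {
    int channel_adapter;      /* the single Channel Adapter it belongs to */
    enum dat_ep_state state;  /* part of an established connection or not */
    size_t posted;            /* number of requests accepted so far */
};

/* Accept a data transfer request only on a connected endpoint.
 * Returns 0 if accepted, -1 if the endpoint is not connected. */
int dat_ep_post(struct dat_endpoint *ep, enum dat_op op)
{
    assert(ep != NULL);
    (void)op;  /* the operation type does not affect this check */
    if (ep->state != DAT_EP_CONNECTED)
        return -1;
    ep->posted++;
    return 0;
}
```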
DAT semantics require that a DAT Provider support active-passive (client-server) connection establishment. Some DAT Providers MAY offer automatic connection endpoint creation for the client or for the server. Many DAT Providers support specification of the designated server connection endpoint in a generic fashion only. The actual server connection endpoint MAY be dynamically created and/or allocated upon connection request. An endpoint can be a part of at most one connection at a time.

A DAT Provider supports a request for connection establishment of an unconnected local DAT endpoint. The DAT Consumer specifies the local endpoint that it wants to connect, the Connection Qualifier, and the Channel Adapter address for a remote endpoint. See Appendix C. "DAT Name Service" for a discussion of the Name Service for Channel Adapter Addresses. The DAT Provider notifies the DAT Consumer of successful or unsuccessful connection establishment.

During the connection establishment process, the active side specifies a Connection Qualifier that is used by the passive-side DAT Provider to associate the incoming connection request with the appropriate listening process. The Connection Qualifier does not uniquely specify the endpoint of the passive side. The same Connection Qualifier can be reused by an active DAT Consumer for establishing multiple connections between the same pair of hosts. In many cases, a single DAFS server will listen only on a single Connection Qualifier for all incoming connection requests for all Sessions for all DAFS clients.

The connection parameters negotiation, connection establishment, and other interactions between the DAT Providers of the two sides of the connection are internal to the Transport and are outside the scope of the DAT protocol.
Other interactions between the DAT Provider and Consumer, like attributes of the connection and the timeout for connection establishment, are local interactions and are also outside the scope of DAT.

The passive DAT Provider supports notifying the DAT Consumer of a connection request based on the Connection Qualifier.

The details of the interactions between a DAT Consumer and DAT Provider, for example, providing an existing unconnected local endpoint or asking a Provider to create one on the fly, are outside the scope of DAT.

A DAT Provider MUST offer a mechanism to enable the DAT Consumer on either Endpoint to break a connection. Upon receiving a request for connection termination, the DAT Provider SHALL break the connection and SHALL complete all outstanding and in-progress data-transfer operations, with an error indication if they have not yet completed. The DAT Provider SHALL NOT process outstanding data transfer operations subsequent to receiving a request for connection termination.

DAT does not define how the remote DAT Provider discovers that the connection has been broken. However, the remote DAT Provider SHALL report to its local DAT Consumer that the connection has been broken. For example, the remote DAT Provider can detect connection termination by the inability to deliver data subsequent to the connection termination. For more specific details, see B.6., "DAT Data Transfer Operations and Connection Properties".

If the DAT Provider exposes an error to the DAT Consumer (either due to a transport error for a transport that does not attempt transport-level error recovery, or due to an unrecoverable error), the DAT Provider MUST break the connection upon reporting the error and notify the DAT Consumer that the connection has been broken. When and how this notification takes place is outside the scope of DAT.
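The connection termination rule above can be sketched in C as follows. The names and the array-of-statuses representation are assumptions made for this illustration and are not defined by DAT. On a termination request, every data transfer operation that has not yet completed is completed with an error status, and operations that had already completed keep their status.

```c
#include <assert.h>
#include <stddef.h>

enum dto_status { DTO_PENDING, DTO_OK, DTO_ERROR };

/* Hypothetical per-connection bookkeeping (not a DAT-defined structure). */
struct dat_connection {
    enum dto_status *dtos;  /* status of each posted DTO, in posting order */
    size_t n_dtos;
    int broken;             /* nonzero once the connection is broken */
};

/* Break the connection: complete every DTO that has not yet completed
 * with an error status. Returns the number of DTOs flushed this way. */
size_t dat_conn_break(struct dat_connection *c)
{
    assert(c != NULL);
    size_t flushed = 0;
    for (size_t i = 0; i < c->n_dtos; i++) {
        if (c->dtos[i] == DTO_PENDING) {
            c->dtos[i] = DTO_ERROR;  /* completed, with error indication */
            flushed++;
        }
    }
    c->broken = 1;
    return flushed;
}
```

How the remote side learns of the break is not modeled here, because DAT leaves that to the transport.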
After a single data transfer operation (DTO) has completed with an error status, all subsequently posted DTOs SHALL also be completed with an error status.

B.5. DAT Memory Semantics

A DAT Provider MUST have the right to read memory that contains the source data for a data transfer operation (DTO) and to write to memory that is the destination location for the payload carried by a DTO. The registration of local memory with the DAT Provider in order to establish local access rights is a local interaction between the DAT Provider and the DAT Consumer and is outside the scope of DAT.

Furthermore, the registration of local memory needed for remote memory accesses for RDMA operations is also outside the scope of DAT. The DAT Consumer requires of the DAT Provider the following property: an RDMA Memory Region Context (RMR Context), which is the outcome of memory registration, MUST be passable to the remote side of the connection so that the remote side can initiate an RDMA operation on the memory specified by the RMR Context.

An RMR Context is a representation of an arbitrary-sized, registered, contiguous virtual space that can be directly accessed by a Channel Adapter to support:

   o  only Remote DMA Read operations,

   o  only Remote DMA Write operations, or

   o  both Remote DMA Read and Remote DMA Write operations.

The mechanism by which an RMR Context is created is outside the scope of DAT.

An RMR Context can be advertised to a remote DAT Consumer to allow the remote DAT Consumer to initiate an RDMA operation that targets the RMR Context. The mechanism by which the local DAT Consumer advertises the RMR Context to the remote DAT Consumer is outside the scope of DAT.

Nevertheless, DAT does assume the following properties of an RMR Context:

   o  An RMR Context has an association with a set of connections that support RDMA operations to that RMR Context via its local Endpoint.
      Note that multiple Endpoints on the same Channel Adapter MAY be able to use the same RMR Context, and multiple RMR Contexts can be associated with the same Endpoint. Defining the mechanism by which the DAT Consumer and the DAT Provider interact to set up an association between an RMR Context and a set of Endpoints is outside the scope of DAT.

   o  It is not expected that all RMR Contexts valid on a given Endpoint on a Channel Adapter will be valid across all Endpoints on the Channel Adapter. Defining the mechanism by which the DAT Consumer and DAT Provider interact to limit the scope of an RMR Context to a given set of Endpoints is outside the scope of DAT.

   o  DAT does NOT REQUIRE that the same RMR Context can be used by multiple Channel Adapters. However, DAT requires that an RMR Context be valid within the context of a set of Endpoints within a single Channel Adapter.

   o  The same memory can belong to multiple RMR Contexts within the same or different DAT Providers. The DAT Provider MUST allow the Consumer to create a new RMR Context that maps to the same physical memory.

An RMR Context is specified as a 32-bit unsigned integer. An RMR Target Address specifies the memory address to be used for an RDMA operation. The RMR Target Address must be within a region of memory represented by an RMR Context. An RMR Target Address is specified as a 64-bit unsigned integer.

B.6. DAT Data Transfer Operations and Connection Properties

There are four types of data transfer operations: send, recv, RDMA Write, and RDMA Read. The wire protocol formats of the messages that underlie these operations are defined by specific transport protocols and are outside the scope of DAT.

The initiator refers to the transport Endpoint of the connection whose consumer posted a given data transfer operation. The target refers to the transport Endpoint on the other end of the connection from the initiator.
Each transport Endpoint can initiate data transfer operations and be the target of transport layer messages. Messages supported by the transport layer might be extremely large. The transport protocol is expected to segment messages into transport layer packets and reassemble these packets into messages at the target.

An RDMA Write operation MUST contain an RMR Context and an RMR Target Address specifying the remote memory where the data is to be deposited. An RDMA Read operation MUST contain an RMR Context and an RMR Target Address specifying the remote memory from which the data is to be extracted.

Delivery of data payloads for DAT operations MUST obey the following rules:

   o  All data transfer operations submitted to the DAT Provider will complete successfully in the absence of errors, with data delivered uncorrupted, in the order specified by the DAT delivery ordering rules.

   o  Corruption of the data delivered to the Consumer (the local Consumer for RDMA Read) is detected as an error and reported to the Consumer.

   o  Data loss (inability to deliver data to or from the remote Endpoint of the connection) SHALL be detected as an error and reported to the Consumer.

   o  Upon detection of an error, the connection SHALL be broken and all outstanding and in-progress data-transfer operations SHALL complete with an error.

   o  There is a one-to-one correspondence between send operations on one Endpoint of the connection and receive operations on the other Endpoint of the connection.

   o  There is no correspondence between RDMA operations on one endpoint of the connection and recv or send data transfer operations on the other endpoint of the connection.

   o  Data Transfer Operation completion means that the Consumer can reclaim resources associated with the operation, including the memory that contains the data.
   o  Ordering rules:

      o  The data payload of a send operation and associated receive operation MUST be delivered without error into the receiver-specified memory buffer prior to the receive completion.

      o  Receive operations on a connection MUST be completed in the order of the posting of their corresponding sends.

      o  Each RDMA write operation posted on a connection prior to a send operation MUST have its data payload delivered to the target memory region prior to the completion of the receive operation matching the send.

DAT can have several send, recv, RDMA write, and RDMA read operations active simultaneously. Out-of-order packet delivery can lead to complex target implementations. However, if the transport layer does not restrict accesses to the memory for data transfer operations (DTO), and the transport protocol packets contain enough information, the transport can write messages to the target memory as soon as these messages arrive. Specific implementations of the transport layer are allowed to implement more stringent memory ordering restrictions.

The DAT memory access ordering by DTOs does not define the result of an RDMA write to a memory location followed by an RDMA read from the same memory location. It is up to the DAT Consumer to enforce specific ordering by DTOs for accesses to local or remote memory.

Appendix C. DAT Name Service

The main purpose of the Name Service is to provide a Channel Adapter address that can be used to identify a Channel Adapter for a remote Endpoint for connection establishment. For connection establishment, the Channel Adapter address and the Connection Qualifier specify the remote side of a requested connection. A Channel Adapter Address is a unique identifier of a Channel Adapter on the (network) Fabric.
The same DAT Provider can support multiple Channel Adapters and multiple fabrics. A remote host might not be reachable through all local Channel Adapters. A Fabric might not fully connect all the hosts. Only some remote hosts and only some of the Channel Adapters on the reachable remote hosts can be accessed from any given local Channel Adapter. The DAT Provider MAY require the use of a specific Fabric and a specific local Channel Adapter to access a remote host.

The Name Service MAY or MAY NOT generate traffic between DAT Providers on any of the Fabrics that connect hosts. The DAT Name Service SHALL be able to support multiple Fabrics connecting hosts.

   Note: Some transports, like InfiniBand, support path specification and manipulation for a route between local and remote Channel Adapters. Others, like VI Architecture, do not. DAT chooses the least common denominator and does not describe interactions with paths. It describes Channel Adapter addresses and leaves paths and routing to the underlying transport that supports DAT semantics.

DAFS has the following requirements for the DAT name service:

   1) The DAT Provider SHALL provide a way to enumerate all local Channel Adapters and determine their names. This is needed so that the DAFS protocol (DAT Consumer) can open Channel Adapters to be used for communication.

   2) The DAT Provider SHALL provide a way to find the addresses of all Channel Adapters on a remote host, identified by a host name, that are accessible from a specified local Channel Adapter.

   3) The name of a remote host SHALL be unique and the same across all Fabrics.

Each host has a name. The DAT Name Service does not address the issue of assignment of a name to a host. How a DAT Consumer discovers the names of remote hosts is also outside the scope of DAT and DAFS (see Appendix A. "DAFS Name Service").
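Requirement 2) above can be illustrated with a small C sketch. The table layout, the function name, and the address strings are all hypothetical; DAT does not define the lookup interface, and a real provider would query the fabric instead of scanning a static table. The point shown is that a lookup is always made relative to one local Channel Adapter, and that it may yield zero, one, or several remote Channel Adapter addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of a hypothetical, statically configured name service table. */
struct dat_ns_row {
    int local_ca;             /* index of a local Channel Adapter */
    const char *hostname;     /* name of the remote host */
    const char *remote_addr;  /* remote CA address reachable via local_ca */
};

/* Collect the addresses of all Channel Adapters of `hostname` that are
 * reachable from `local_ca`. Up to out_cap addresses are stored in out
 * (out may be NULL to count only). Returns the total number found. */
size_t dat_ns_lookup(const struct dat_ns_row *rows, size_t n_rows,
                     int local_ca, const char *hostname,
                     const char **out, size_t out_cap)
{
    assert(rows != NULL || n_rows == 0);
    assert(hostname != NULL);
    size_t found = 0;
    for (size_t i = 0; i < n_rows; i++) {
        if (rows[i].local_ca == local_ca &&
            strcmp(rows[i].hostname, hostname) == 0) {
            if (out != NULL && found < out_cap)
                out[found] = rows[i].remote_addr;
            found++;
        }
    }
    return found;
}
```

Calling such a lookup once per local Channel Adapter, as enumerated under requirement 1), gives a DAT Consumer a per-adapter view of which remote Channel Adapters it can reach.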
The only requirement on a host name is that the name SHALL be the same on all Fabrics and uniquely Wittle [Page 368] INTERNET-DRAFT Direct Access File System September 2001 identify the host on all Fabrics. This ensures that the same name can be used on any of the local Channel Adapter to identify the same remote host independent of the network Fabric(s) that connect Channel Adapters on local and remote hosts. A host MAY have multiple names. It is up to the DAT Consumer to ensure that each name used by the DAT Consumer in the DAT Name Service adheres to the uniqueness rule on all fabrics. For a given local Channel Adapter and a given host identified by a hostname the DAT Provider SHALL return all Channel Adapter address(es) of the host reachable from the local Channel Adapter. An interface that provides that functionality is outside the scope of DAT. The DAT Provider might support a single operation which returns all Channel Adapter addresses on a remote node. Or, a DAT Consumer might need to call multiple times with each call returning a single Channel Adapter Address. A DAT Consumer SHOULD be capable of generat- ing a connectivity matrix between the local and remote hosts using the DAT Name Service semantic. The Channel Adapter address is not guaranteed to be globally unique. A remote Channel Adapter address is valid only for the local Channel Adapter through which it was found. Note: There is no requirement that the DAT Name service provide a host name based on a Channel Adapter Address. A DAT Provider is free to provide this optional functionality, but a DAT Consumer (for example, the DAFS protocol) SHALL NOT rely on this func- tionality. Wittle [Page 369] INTERNET-DRAFT Direct Access File System September 2001 Appendix D. DAFS Mapping to VI Architecture This appendix provides a mapping of Direct Access Transport (DAT) semantics onto the Virtual Interface Architecture (VIA), and how the Direct Access File System (DAFS) makes use of this mapping. D.1. 
Terminology Mapping from DAT to VI

DAT Channel Adapter (CA)
VI Network Interface Controller (NIC)

   A NIC provides an electro-mechanical attachment of a computer to a network. Under program control, a NIC copies data from memory to a network medium (transmission) and from the medium to memory (reception), and implements a unique destination for messages traversing the network.

DAT CA Address
VI Host Address

   The logical network address of the VI NIC.

DAT Connection
VI Connection

   An association between a pair of VI endpoints such that data of posted data transfer operation requests of either VI endpoint arrive at the other VI endpoint of the Connection.

DAT Connection Qualifier
VI Discriminator

   A value that allows a Connection Manager to associate an incoming Connection request with the entity providing the service.

DAT Consumer
VI Consumer

   An application that requires VI services.

DAT Provider
VI Provider

   Provider of the DAT services for a VI application.

DAT DTC - Data Transfer Completion
VI Descriptor

   Status of the completed data transfer operation.

DAT DTO - Data Transfer Operation
VI Descriptor

   Requested data movement transfer submitted to a VI Provider.

DAT Endpoint (EP)
Virtual Interface Endpoint (VI Endpoint)

   The local part of a Connection that supports posting data transfer operation requests.

DAT Fabric
VI Network Fabric

   A network with RDMA capabilities.

DAT Operation Type
VI Operation Type

   Send, Receive, RDMA Read or RDMA Write DTOs.

DAT RDMA
VI RDMA

   Remote direct memory access: access of local memory by the remote VI. There are two RDMA operations, RDMA Read and RDMA Write.

DAT RDMA Memory Region Context (RMR Context)
VI VIP_MEM_HANDLE (Memory Handle)

   A programmatic construct that represents a process's authorization to specify a memory region to the VI NIC.
   A Memory Handle represents an arbitrarily sized, contiguous virtual space that is registered with a NIC so that it can support Remote DMA operations on the Connection whose local VI belongs to the NIC.

DAT RMR Target Address
VI Virtual Address

   RMR Target Address specifies the memory address within a region of memory represented by an RDMA Memory Region Context.

D.2. Additional VI Terminology

Several more VI terms are used in this appendix and need to be defined. The definitions are quoted from the Virtual Interface Architecture [VIArch] Chapter 1:

Completion Queue (CQ)

   A queue containing information about completed Descriptors. Used to create a single point of completion notification for multiple queues.

Immediate Data

   Data contained in a Descriptor that is sent along with the data to the remote node and placed in the remote node's pre-posted Receive Queue Descriptor.

Memory Protection Tag

   A unique identifier generated by the VI Provider for the use by the VI Consumer. Memory Protection Tags are associated with VIs and Memory Regions to define the access permission the VI has to a memory region.

Memory Region

   An arbitrary sized region of a process's virtual address space registered as communication memory such that it can be directly accessed by the VI NIC.

Work Queue (WQ)

   A posted list of Descriptors being processed by a VI NIC. Every VI has two Work Queues: a send queue and a receive queue. The combination of the Work Queue selected by the post operation and the operation type indicated by the Descriptor determine the exact type of data movement that the VI NIC will perform.

D.3. DAT Requirements Mapping

VI Architecture supports DAT semantics. The mapping of DAT Requirements onto VI Architecture follows.

1.
DAT SHALL support a connection that provides send-recv message transfers and RDMA Read and Write operations.

   Complies

2. DAT SHALL support a reliable connection which provides the following features:

   "The VI Architecture supports three levels of communication reliability at the NIC level: Unreliable Delivery, Reliable Delivery and Reliable Reception... Support for Reliable Delivery and Reliable Reception is OPTIONAL." (Virtual Interface Architecture Specification, Chapter 2.5, Page 16)

   Both Reliable Delivery and Reliable Reception satisfy all of the DAT requirements. All DAFS servers and clients are REQUIRED to support Reliable Delivery and MAY optionally support Reliable Reception.

3. All data transfer operations submitted to the DAT Provider will complete successfully in the absence of errors, with data delivered uncorrupted, in the order defined by DAT ordering rules (see below).

   "A Reliable Delivery VI guarantees that all data submitted for transfer will arrive at its destination exactly once, intact, and in the order submitted, in the absence of errors." (Virtual Interface Architecture Specification, Chapter 2.5.2, Page 18)

4. Corruption of the data delivered to the Consumer (local one for RDMA Read) is detected as an error and reported to the Consumer.

   Complies

5. Data loss (inability to deliver data to the remote endpoint of the connection (from remote to local one for RDMA Read)) SHALL be detected as an error and reported to the Consumer.

   Complies

6. Upon detection of an error, the connection SHALL be broken and all outstanding and in progress data transfer operations SHALL complete with an error.

   Complies

7. There is a one-to-one correspondence between send operations on one endpoint of the connection and recv operations on the other endpoint of the connection.

   Complies

8.
There is no correspondence between RDMA operations on one endpoint of the connection and recv or send data transfer operations on the other endpoint of the connection.

   "No Descriptors on the remote node's receive queue are consumed by RDMA operations... The exception to this rule is that if Immediate Data is specified by the initiator of an RDMA Write request it will consume a Descriptor on the remote end when the data transfer is complete, thus allowing for synchronization." (Virtual Interface Architecture Specification, Chapter 2.3.1, Pages 14-15).

   DAFS does not use Immediate Data.

9. Data Transfer Operation Completion means that the Consumer can reclaim resources associated with the operation including the memory that contains the data.

   Complies

10. The data payload for the send operation matching a receive operation MUST be delivered into the receiver indicated memory buffer without errors prior to the receive completion.

   Complies

11. Receive operations on a connection MUST be completed in the order of posting of their corresponding sends.

   Complies

12. Each RDMA write operation posted on a connection prior to a send operation MUST have its data payload delivered to the target memory region prior to the completion of the receive operation matching that send.

   Complies

13. DAT SHALL support multiple connections between the same or different pairs of nodes (client server pairs).

   Complies

14. An RDMA Memory Region Context (RMR Context) SHALL support RDMA operations for the set of DAT connections that are associated with it. The association between a connection and an RMR Context is established by the local endpoint of the connection where the Memory Region resides.

   Complies

15. The same RMR Context can be associated with multiple connections.

   Complies

16. A connection can have multiple RMR Contexts associated with it.

   Complies

17.
The DAT Provider SHALL allow the DAT Consumer to create multiple RDMA Memory Region Contexts referencing the same memory.

   Complies

18. DAT SHALL support connection management including the client-server connection establishment and the connection termination by either side of the connection.

   Complies

D.4. VI & Connections

There is a one-to-one correspondence between a DAT connection and a VI connection.

D.4.1. VI Discriminators

A VI Provider can use the same Discriminator to establish multiple connections between the same pair of hosts. Moreover, the DAFS Client should use the same remote Discriminator for all the DAFS communication channels for all DAFS Sessions to the same DAFS Server. The Discriminator and the Host Address can be obtained from the DAFS and DAT Name Service as described in section D.7. "Name Service Mapping for VI Architecture".

D.4.2. VI Connection Attributes

In order for the VI connection to be established, the following three attributes of the VI connection endpoints on the DAFS client and server need to match:

o ReliabilityLevel

o MaxTransferSize

o QoS - should be set to 0

VI Architecture optionally supports reliable delivery or reliable reception through the VI ReliabilityLevel attribute. Both reliable delivery and reliable reception support all of the reliable data exchange requirements of DAT. DAFS servers are REQUIRED to support Reliable Delivery and MAY optionally support Reliable Reception. A DAFS server advertises its support for Reliable Reception through the Transport Specific Server Attributes field of its DAT Location (see D.7. "Name Service Mapping for VI Architecture").

The DAFS server SHALL support a VI MaxTransferSize up to the value it advertises in its Transport Specific Server Attributes field of its DAT Location (see D.7. "Name Service Mapping for VI Architecture").
It is recommended that a DAFS client choose the largest value its own NIC can support, up to the DAFS server's advertised MaxTransferSize.

D.4.3. VI Endpoint Attributes

VI endpoints on the DAFS server are NOT REQUIRED to have the EnableRDMARead and EnableRDMAWrite attributes set for any of the DAFS channels. Moreover, it is recommended that they not be set, since DAFS clients do not post RDMA operations and DAFS servers do not advertise any Memory Handles.

The DAFS server is REQUIRED to support RDMA Write and MAY support RDMA Read. A DAFS server advertises its support for RDMA Read through the Transport Specific Server Attributes field of its DAT Location (see D.7. "Name Service Mapping for VI Architecture").

If the DAFS server indicates support for RDMA Read and the DAFS client would like to use operations that depend on server RDMA Reads, then the client MUST set the EnableRDMARead attribute to TRUE for its VI endpoints of the Operation Channel and of the RDMA-read Channel (if requested by the DAFS server). If the DAFS client would like to use operations that depend on server RDMA Writes, then the client MUST set the EnableRDMAWrite attribute to TRUE for its VI endpoints of Operation Channels. The DAFS client is NOT REQUIRED to set the EnableRDMAWrite and EnableRDMARead attributes to TRUE for VI endpoints for Back-control Channels.

D.4.4. DAFS Flow Control Initialization

When a VI connection is established for the DAFS Session, the DAFS Server MUST already have one receive Descriptor posted to receive the initial DAFS connection request of the DAFS communication channel for the DAFS Session. For each VI connection for a DAFS Session (Operation Channel, Back-control Channel and RDMA-read Channel) the DAFS Server Provider SHALL post the receive Descriptor. The size of that pre-posted Descriptor buffer SHALL be 4-KB. The DAFS Client's connection request MUST fit into the 4-KB buffer.

D.4.5.
VI Disconnect

A VI Consumer issues a Disconnect request to its VI Provider in order to disconnect a connected VI. The Disconnect request unilaterally aborts the connection. A Disconnect request results in the completion of all outstanding Descriptors on that VI endpoint with an error.

A VI Provider detects that a VI is no longer connected and notifies the VI Consumer that it is no longer a part of a connection. Minimally, the VI Consumer will be notified upon the first data transfer operation that follows the disconnect.

D.5. VI Architecture Memory Semantics

The VI Architecture provides memory registration functionality that allows DAFS clients to register memory for RDMA operations.

DAFS requires that all Memory Regions whose Memory Handles will be used as RMR Contexts SHALL have their RDMA Read Enable attribute, RDMA Write Enable attribute, or both set according to the client's use of this memory. The client's Memory Handle that corresponds to the RMR Context can be passed over the Operation Channel to a remote server site. The Memory Protection Tag of the memory region of that Memory Handle and the Memory Protection Tag of the VI endpoint of the DAFS channel that supports RDMA operations MUST be the same.

The memory region registered SHALL be a contiguous virtual space as specified by DAT.

The DAFS Client VIs that are the endpoints of DAFS communication channels for the Operation and RDMA-read Channels of the same DAFS Session MUST share the same Memory Protection Tag (PTAG). This ensures that registered memory regions accessible by the Operation Channel of the DAFS Session are also accessible by the RDMA-read Channel of that DAFS Session.

D.6. VI Data Transfer Operations

The VI Architecture provides data transfer operations over a connection that support DAT reliability and ordered delivery requirements.

VI Architecture data transfer operations are represented by Descriptors. Descriptors are posted to the VI endpoints. Descriptors are composed of segments.
There are three types of segments: control, address and data. The Control Segment contains control and status information as well as reserved fields that are used for queuing. An Address Segment follows the Control Segment, but only for RDMA operations. This segment contains remote buffer information for RDMA Read and RDMA Write operations. Remote buffer information consists of the remote Memory Handle (RMR Context for DAT) and the remote buffer virtual address (RMR Target Address for DAT). The Data Segment contains information about the local buffers of a send, receive, RDMA Read or RDMA Write operation. A Descriptor MAY contain multiple Data Segments.

VI Architecture supports multiple outstanding Descriptors posted to the Work Queues of a VI endpoint. This allows multiple send, receive, RDMA Read and RDMA Write operations to be active simultaneously. The DAT delivery rules require support for this, and DAFS clients can take advantage of this capability by negotiating the OPNreq value with the server for Operation and Back-control Channels.

DAFS does not use Immediate Data.

D.7. Name Service Mapping for VI Architecture

This section describes how the DAT and DAFS Name Service features are supported in VI Architecture.

The DAT Channel Adapter Address is represented by the VI Host Address. The DAT Name Service query is mapped into VipNSGetHostbyName.

The DAT Connection Qualifier is mapped onto a VI Discriminator. The DAT Connection Qualifier is a string while the VI Discriminator is not. The VI Network Address contains the VI Host Address, the Discriminator Length, and the Discriminator value. The maximum VI Discriminator length REQUIRED to be supported by all compliant implementations is 16 bytes. Hence, DAFS uses Discriminators of 16 bytes or fewer. The mapping scheme between a DAT Connection Qualifier and a VI Discriminator is defined as follows.
The DAT Connection Qualifier used for the DAT Transport Type of DAFS_DAT_VI is a string of up to 35 characters. The first two characters of the Connection Qualifier encode the actual length of the Discriminator as a human-readable decimal number. This means that the value 16 is represented by the string "16", and the value 1 is represented by the string "01". The next character is a space delimiter, provided to improve human readability of the DAT Connection Qualifier. The VI Discriminator is an array of octets. Each octet is mapped into 2 consecutive characters. The characters representing the Discriminator value start immediately after the delimiter.

A DAFS server can listen on multiple Discriminators and hence on different DAT Connection Qualifiers. Each Connection Qualifier MUST be advertised as a different DAFS Location. All the DAT Connection Qualifiers of the same server host NIC share the DAT Hostname and VI Host Address.

If there are multiple VI NICs on the host, then the DAFS server listens on the same set of Discriminators on all these NICs. The DAT Hostname is the same for all of them. The query to the DAT Name Service provides the list of all VI NIC Addresses.

The DAFS Server should specify in the Transport Specific Server Attributes parameter of its DAT Location what values of the parameters it supports. These parameters include:

ReliableReceptionSupported

   Indicates that DAFS server VIs can support Reliable Reception. This parameter can be set if all DAFS server NICs can support Reliable Reception. The first 25 characters of the Transport Specific Server Attributes field specify ReliableReceptionSupported. "Reliable Reception TRUE " means that the server's VIs can support Reliable Reception, while "Reliable Reception FALSE " means that not all of the server's VIs can support Reliable Reception.
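As an illustration of the Connection Qualifier mapping scheme defined earlier in this section, the following C sketch encodes a VI Discriminator as a DAT Connection Qualifier string. The function name dafs_vi_encode_qualifier and the two constants are hypothetical and are not defined by DAFS or VIPL. The text states only that each octet maps to two characters; rendering each octet as two uppercase hexadecimal digits is an assumption made here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define DAFS_VI_MAX_DISCRIMINATOR_LEN 16  /* limit stated in this section */
#define DAFS_VI_MAX_QUALIFIER_LEN     35  /* 2 digits + space + 2 * 16    */

/*
 * Hypothetical helper: writes the Connection Qualifier for the
 * Discriminator `disc` (of `len` octets) into `out` as a
 * NUL-terminated string.
 * Returns the string length, or -1 if `len` exceeds 16 octets or
 * `out` is too small.
 * ASSUMPTION: each octet is rendered as two hexadecimal digits.
 */
static int dafs_vi_encode_qualifier(const unsigned char *disc, size_t len,
                                    char *out, size_t out_size)
{
    size_t i, pos;

    assert(out != NULL);
    assert(disc != NULL || len == 0);

    if (len > DAFS_VI_MAX_DISCRIMINATOR_LEN)
        return -1;
    if (out_size < 3 + 2 * len + 1)
        return -1;

    /* Two decimal digits of Discriminator length, then the delimiter. */
    snprintf(out, out_size, "%02u ", (unsigned)len);
    pos = 3;

    /* Two characters per octet, starting right after the delimiter. */
    for (i = 0; i < len; i++, pos += 2)
        snprintf(out + pos, out_size - pos, "%02X", (unsigned)disc[i]);

    return (int)strlen(out);
}
```

Under these assumptions, the 4-octet Discriminator 0x44 0x41 0x46 0x53 is encoded as "04 44414653", and a full 16-octet Discriminator yields exactly 35 characters.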
   If a DAFS client requests a connection on a local VI whose ReliabilityLevel attribute is set to Reliable Reception (in which case the client NIC to which the VI belongs also supports Reliable Reception) and the DAFS server advertised that capability, then the DAFS server MUST set the ReliabilityLevel of the responding VI to Reliable Reception.

MaxTransferSize

   Specifies the maximum transfer size the server VIs can accept. It is the minimum among the MaxTransferSize attributes of the VI NICs advertised by the DAFS server. The 16 characters starting from the 26th character of the Transport Specific Server Attributes field specify the MaxTransferSize. The representation is a human-readable decimal number, analogous to the Discriminator length encoding. If the MaxTransferSize of the requesting client's VI does not exceed the MaxTransferSize advertised by the DAFS server, then the DAFS server MUST set the MaxTransferSize of the accepting VI equal to the client's.

RDMAReadSupported

   Specifies RDMA Read support of the DAFS server. If the Transport Specific Server Attributes field characters starting from the 42nd are "RDMA Read TRUE " then the DAFS server supports RDMA Read, and if they are "RDMA Read FALSE" then it does not.

D.8. DAFS Client Discriminators

This section discusses support for establishing the Operation, Back-control and RDMA-read communication channels within a DAFS Session.

Recall that there is a one-to-one correspondence between a DAFS communication channel and a DAT connection, and there is a one-to-one correspondence between a DAT connection and a VI connection. Hence, there is a one-to-one correspondence between a DAFS communication channel and a VI connection.

At the time that the DAFS communication channel is created, the DAFS server will need to be able to differentiate a client's connection request for various types of DAFS Session channels.
For instance, when the server accepts a connection for an RDMA-read Channel, it will need to use a VI whose Send Work Queue is tied to the CQ for RDMA Read completions, rather than the CQ for send and RDMA Write completions. This can be achieved by encoding the channel type in the client's Discriminator field. Then, when the VI Provider for the DAFS server delivers a client's connection request, the server can use the address attribute of the remote client's requesting VI. The client's VI address encodes, in the discriminator section of the address, the type of DAFS communication channel this VI is being used for. This provides the server with enough information to accept the connection using a VI endpoint with the appropriate properties.

   #define Dafs_Channel_Dafs_Operation 0x01
   #define Dafs_Channel_Back_Control   0x02
   #define Dafs_Channel_Rdma_Read      0x04

   typedef uint32 Dafs_Channel_Enum_Type;

   struct DAFS_client_address_discriminator {
       opaque32 discriminator_magic;
       opaque32 client_instance_differentiator;
       opaque32 client_session_differentiator;
       Dafs_Channel_Enum_Type channel_type;
   };

Fields:

discriminator_magic

   The magic sequence 0x44 0x41 0x46 0x53 ('D' 'A' 'F' 'S') is used to identify a client discriminator as belonging to DAFS. This can be used as a sanity check and as an aid to identification of DAFS-related connection packets on bus analyzers, etc.

client_instance_differentiator

   An opaque value chosen by the client to allow the server to differentiate different clients using the same remote VI NIC. Different clients using the same VI NIC MUST choose different client_instance_differentiators.

client_session_differentiator

   An opaque value chosen by the client to ensure that different discriminators can be used for each connection to the server. [Analogous to port number]

channel_type

   Identifies the type of DAFS communication channel that the VI will be used for after it is successfully connected.
   This allows the server to accept the connection request using a local VI endpoint with attributes appropriate for the given channel type (for example, server VI endpoints for DAFS RDMA-read Channels might use different CQs than server VI endpoints for DAFS Operation Channels).

Note: VIPL-1.0 compliant VI providers are NOT REQUIRED to support a discriminator larger than 16 bytes (See MaxDiscriminatorLen above). Therefore DAFS_client_address_discriminator has been defined so that it fits within this 16 byte limit.

The client_session_differentiator has scope within a specific client_instance_differentiator. It MUST be the same for all VI connections used for DAFS communication channels within a specific DAFS Session, and different for different DAFS Sessions. The triple {client_instance_differentiator, client_session_differentiator, channel_type} forms a unique VI discriminator.

Rationale: This allows the server to associate related VI connections as soon as the VI connection is requested (i.e. before the binding to server VI or CQ is made), which might be desirable for efficiency in implementation.

D.9. Design Notes

D.9.1. Connection Establishment

The DAFS client is expected to issue a VipConnectRequest in which the remote address information is filled in from the Name Service (see Section D.7. "Name Service Mapping for VI Architecture") and the local address consists of the local VI NIC and the VI Discriminator as defined in Section D.8. "DAFS Client Discriminators". The DAFS server is expected to perform VipConnectWait, where the local address is defined by the DAFS Name Service and advertised by the server.

If the server intends to accept the connection, it MUST prepare a VI for the connection. The server VI Consumer MAY either choose an existing unconnected VI or it MAY create a new VI with attributes it considers appropriate for this connection request.
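To choose a VI with appropriate attributes, the server first needs the channel type carried in the client's Discriminator (Section D.8). The following C sketch packs and unpacks the 16-byte DAFS_client_address_discriminator. The helper names and the decoded struct are hypothetical and are not part of DAFS or VIPL. Treating the opaque32 fields as 32-bit integers stored in big-endian order is an assumption, since this appendix does not state a byte order for them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define Dafs_Channel_Dafs_Operation 0x01
#define Dafs_Channel_Back_Control   0x02
#define Dafs_Channel_Rdma_Read      0x04

#define DAFS_DISCRIMINATOR_LEN 16   /* four 32-bit fields */

/* Magic sequence 'D' 'A' 'F' 'S' from Section D.8. */
static const unsigned char dafs_magic[4] = { 0x44, 0x41, 0x46, 0x53 };

/* Hypothetical decoded view of DAFS_client_address_discriminator. */
struct client_discriminator {
    uint32_t client_instance;
    uint32_t client_session;
    uint32_t channel_type;
};

/* ASSUMPTION: big-endian field encoding; the text does not specify it. */
static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Client side: build the 16-byte local Discriminator. */
static void discriminator_pack(const struct client_discriminator *d,
                               unsigned char out[DAFS_DISCRIMINATOR_LEN])
{
    assert(d != NULL && out != NULL);
    memcpy(out, dafs_magic, sizeof dafs_magic);
    put_be32(out + 4,  d->client_instance);
    put_be32(out + 8,  d->client_session);
    put_be32(out + 12, d->channel_type);
}

/*
 * Server side: validate and decode a client's Discriminator.
 * Returns 0 on success, or -1 if the length, the magic sequence,
 * or the channel type is not one defined in Section D.8.
 */
static int discriminator_unpack(const unsigned char *in, size_t len,
                                struct client_discriminator *d)
{
    assert(in != NULL && d != NULL);
    if (len != DAFS_DISCRIMINATOR_LEN ||
        memcmp(in, dafs_magic, sizeof dafs_magic) != 0)
        return -1;

    d->client_instance = get_be32(in + 4);
    d->client_session  = get_be32(in + 8);
    d->channel_type    = get_be32(in + 12);

    switch (d->channel_type) {
    case Dafs_Channel_Dafs_Operation:
    case Dafs_Channel_Back_Control:
    case Dafs_Channel_Rdma_Read:
        return 0;
    default:
        return -1;
    }
}
```

A server could call discriminator_unpack on the Discriminator of the requesting VI's address and then, for example, select a VI bound to its RDMA Read completion queue when channel_type is Dafs_Channel_Rdma_Read.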
To accept the connection, the server VI Consumer issues a ConnectAccept request to its VI Provider, specifying the incoming connection ID as well as the local VI to be used. If the local VI's MaxTransferSize, QoS, and reliability attributes match those needed by the remote VI, the connection is established; otherwise it completes with an error. The server VI Consumer can also issue a ConnectReject specifying the incoming connection ID.

D.9.2. Memory Registration

The same Memory Handle can be advertised over multiple connections as long as the local endpoints of these connections match the Memory Protection Tag of the registered memory specified by the Memory Handle, in order to support RDMA operations over these connections. By sharing the same Memory Protection Tag among Memory Regions and a VI, multiple Memory Regions can be associated with that VI. Hence, the remote endpoint of a connection of that VI can perform RDMA operations on the memory of these Memory Regions. By manipulating Memory Protection Tags, VI Consumers can control which VI of a VI NIC is associated with which Memory Regions. Thus, a Memory Region does not have to be accessible by all VIs on the same VI NIC.

D.9.3. NIC Attributes

Several of the VI NIC attributes affect the scalability of the DAFS client.

MaxRegisterRegions

   The maximum number of memory regions that can be registered. Since all memory used by the DAFS client for both send/receive operations and RDMA operations has to be registered, the number of memory regions that can be registered affects the granularity of the DAFS client memory registration.

MaxVI

   The maximum number of VI instances supported by this VI NIC.

MaxDescriptorsPerQueue

   The maximum Descriptors per VI Work Queue supported by this VI Provider. OPNreq cannot exceed MaxDescriptorsPerQueue.

MaxTransferSize

   The maximum transfer size supported by this VI NIC.
   The inline data transfer size cannot exceed this value. The DAFS server advertises the NIC MaxTransferSize value that all its VI endpoints will support in the DAT Location of the DAFS Name Service.

D.10. References

[VIArch] Virtual Interface Architecture Specification: Version 1.0, December 16, 1997, published by Compaq, Intel & Microsoft (http://www.viarch.org/html/Spec/vi_specification_version_10.htm).

Appendix E. DAFS Mapping to InfiniBand Reliable Connection

This appendix provides a mapping of Direct Access Transport (DAT) semantics onto the InfiniBand Architecture using Reliable Connection (RC) service, and describes how the Direct Access File System (DAFS) makes use of this mapping.

All InfiniBand references are to InfiniBand Architecture Release 1.0a Volume 1 - General Specifications, released June 19, 2001.

E.1. Terminology Mapping from DAT to InfiniBand

The following table summarizes DAT terms and the corresponding InfiniBand concepts.

DAT Channel Adapter (CA)
IB Host Channel Adapter (HCA)

   A Channel Adapter that supports abstract functionality described by a "verbs" interface. IB also supports TCAs (Target Channel Adapters) for devices with simpler interfacing. A DAFS client MUST be an HCA. Technically, the DAFS server could be a TCA, but it also requires the ability to initiate RDMA reads and writes, Reliable Connection support, and a Communications Manager (CM). These features are generally only found on HCAs.

DAT CA Address
IB GID (Global Identifier)

   Address of a Channel Adapter on a specific port. IB also uses LIDs (Local Identifiers) that identify a specific path through the current subnet to a port. However, this mapping does not expose LIDs to the DAFS Provider.
DAT Connection
IB Connection

   An association of a Queue Pair (QP) with exactly one other QP, such that messages transmitted by the send work queue of one QP are reliably delivered to the receive work queue of the other QP. As such, each QP is said to be connected to the opposite QP. In this mapping, the IB Reliable Connection (RC) Transport Service Type is used.

DAT Connection Qualifier
IB Service ID

   A value that enables a Connection Manager to associate an incoming Connection Request with the entity providing the service. IB Service IDs are analogous to TCP/UDP port numbers, but are 64-bit integers.

DAT Provider
IB Channel Interface

   The presentation of the channel to the Verbs Consumer as implemented through the combination of Host Channel Adapter, associated firmware, and device driver software.

DAFS Provider
IB Verbs Consumer

   A direct user of the functionality of a Host Channel Adapter.

DAT DTO - Data Transfer Operation
IB Work Queue Entry (WQE)

   WQEs are placed on the Work Queues by the implementation-specific APIs that implement the IB verbs. Semantics are standardized; APIs and formats are not.

DAT DTC - Data Transfer Completion
IB Queue Pair (QP)

   A QP consists of Send and Receive Work Queues. Note that in the alternate RD mapping, the endpoint would be the end-to-end context instead.

DAT EndPoint (EP)
IB Queue Pair (QP)

   A QP consists of Send and Receive Work Queues. Note: in the alternate RD mapping, the endpoint would be the end-to-end context instead.

DAT Fabric
IB Fabric

   A collection of IB subnets connected by routers.

DAT RDMA
IB RDMA

   Remote Direct Memory Access.

DAT RDMA Memory Region Context (RMR Context)
IB R-Key (Remote Key)

   An R-Key is a reference to a Memory Region or to a window pointing within a Memory Region.
   For external interfacing the two types of reference handles are represented as a differentiated union. As with all IB operations, registering memory MUST be done separately for each HCA.

DAT RMR Target Address
IB Virtual Address

   Memory Registration provides mechanisms that enable Consumers to describe a set of virtually contiguous memory locations or a set of physically contiguous memory locations to the Channel Interface. This enables the HCA to access them in a virtually contiguous buffer using Virtual Addresses represented by a 64 bit integer.

E.2. Additional InfiniBand Terminology

Several more IB terms are used in this appendix and need to be defined. All InfiniBand references are to InfiniBand Architecture Release 1.0a Volume 1 - General Specifications, released June 19, 2001:

Communications Manager (CM)

   The software, hardware, or combination of the two that supports communication management mechanisms and protocols.

Completion Queue (CQ)

   A queue containing one or more Completion Queue Entries, which are Channel Interface internal representations of Work Completions. A CQ creates a single point of completion notification for multiple queues.

Immediate Data

   Data contained in a Work Queue Element that is sent along with the payload to the remote Channel Adapter and placed in a Receive Work Completion.

Memory Region

   A virtually contiguous area of arbitrary size within a Consumer's address space that has been registered, enabling HCA local access and optional remote access.

Partition

   A collection of Channel Adapter ports that are allowed to communicate with one another. Ports MAY be members of multiple partitions simultaneously. Ports in different partitions are unaware of each other's presence (insofar as possible).

Protection Domain

   A mechanism for associating QPs, Memory Windows, and Memory Regions.

Work Queue (WQ)

   A send or receive queue.
   A send queue contains WQEs that describe data to be transmitted. A receive queue contains WQEs that describe where to place incoming data.

E.3. DAT Requirements Mapping

The following table describes the mapping of DAT requirements onto the InfiniBand architecture using Reliable Connection (RC) services.

1. DAT SHALL support a connection that provides send-recv message transfers and RDMA Read and Write operations.

   Complies.

2. DAT SHALL support a reliable connection which provides the following features:

   From 9.1 (Transport Layer Overview): "When a QP is created it is associated with one of the five transport service types. The transport service describes the degree of reliability and to what and how the QP transfers data. The five transport service types are:

   1) Reliable Connection
   2) Reliable Datagram
   3) Unreliable Datagram
   4) Unreliable Connection
   5) Raw IPv6 Datagram & Raw Ethertype Datagram."

   The Reliable Connection service type trivially satisfies the DAT requirements for reliable connection oriented operation. All DAFS servers and clients are REQUIRED to support Reliable Connection.

   Note that it is possible to define an alternate mapping for DAT's reliable connection oriented service onto Reliable Datagrams using End-to-end Connections to provide the necessary connection oriented aspects. Defining such a mapping has been deferred to a later time. This Appendix describes mapping DAT connections onto InfiniBand Reliable Connected QPs only.

3. All data transfer operations submitted to the DAT Provider will complete successfully in the absence of errors, with data delivered uncorrupted, in the order defined by ordering rules.

   Complies.

4. Corruption of the data delivered to the Consumer (local one for RDMA Read) is detected as an error and reported to the Consumer.

   Complies.

5.
Data loss (inability to deliver a data to the remote endpoint of the connection (from remote to local one for RDMA Read)) SHALL be detected as an error and reported to the consumer. Complies. 6. Upon detection of an error, the connection SHALL be broken and all outstanding and in progress data transfer operations SHALL complete with an error. Locally detected errors MAY be corrected through local interac- tion. A regulated number of retries MAY be executed before an error is declared to the consumer. Exception: If an endpoint (Queue Pair) is reset then outstanding data transfer operations (WQEs) are removed from the queues without notifying the consumer. The DAT Provider MUST refrain from resetting Queue Pairs itself, but cannot prevent other management software from doing so. (See Table 78, Section 11.2.3.2 Modify Wittle [Page 388] INTERNET-DRAFT Direct Access File System September 2001 Queue Pair). Exception: If a queue pair is destroyed, outstanding work requests are "out of scope" for the channel interface. The IBA Consumer is responsible for clean up of the resources associated with work requests on destroyed work queues. See section 11.2.3.4 (Destroy Queue Pair) on page 495. 7. There is a one-to-one correspondence between send operations on one endpoint of the connection and recv operations on the other end- point of the connection. Complies. 8. There is no correspondence between RMDA operations on one endpoint of the connection and recv or send data transfer operations on the other endpoint of the connection. "Normally an RDMA operation does not consume a receive WQE at the destination, but there is one exception. That is for an RDMA Write operation which specifies immediate data. 
       Immediate data is 32 bits of information that is optionally
       provided in a SEND or RDMA WRITE instruction, transferred as
       part of the operation, but instead of writing the immediate
       data to memory, the data is treated as another piece of status
       information and returned as a special field of the RECEIVE CQE
       status. This means that an RDMA WRITE with immediate data will
       consume a RECEIVE WQE at the destination." (IB Spec, Chapter
       3.2.1, Pages 68-69).

    DAFS does not use Immediate Data.

9.  Data Transfer Operation Completion means that the Consumer can
    reclaim resources associated with the operation including the
    memory that contains the data.

    Complies.

10. Ordering Rule: The data payload for the send operation matching a
    receive operation MUST be delivered into the receiver indicated
    memory buffer without errors prior to the receive completion.

    Complies.

11. Ordering Rule: Receive operations on a connection MUST be
    completed in the order of posting of their corresponding sends.

    Complies.

12. Ordering Rule: Each RDMA write operation posted on a connection
    prior to a send operation MUST have its data payload delivered to
    the target memory region prior to the completion of the receive
    operation matching that send.

    Complies.

13. DAT SHALL support multiple connections between the same or
    different pair of nodes.

    Complies.

14. An RDMA Memory Region Context (RMR Context) SHALL support RDMA
    operations for the set of DAT connections that are associated with
    it. The association between a connection and an RMR Context is
    established by the local endpoint of the connection where the
    Memory Region resides.

    Complies.

15. The same RMR Context can be associated with multiple connections.

    Complies.

16. A connection can have multiple RMR Contexts associated with it.

    Complies.

17. The DAT Provider SHALL allow the DAFS Provider to create multiple
    RDMA Memory Region Contexts referencing the same memory.
    Complies.

18. DAT SHALL support connection management including the
    client-server connection establishment and the connection
    termination by either side of the connection.

    Termination of connections using the Communications Manager SHOULD
    only be done after terminating the associated DAFS Sessions.
    Because InfiniBand Communications Management is conducted via
    out-of-band Management Datagrams (MADs), it is impossible to
    guarantee a predictable orderly shutdown of an active connection.

E.4. IBA Model

InfiniBand offers a wide range of capabilities. Many of them are
simply not needed to meet the DAT requirements. These include:

o  IBA offers both active/passive (client/server) and active/active
   (peer-to-peer) connection models. The DAT to IB-RC mapping uses
   only the active/passive (client/server) model.

o  IBA allows end-to-end flow control as an option on each half of a
   connection. Since DAFS already provides its own flow control there
   is no need to exercise this option.

o  IBA specifies two types of channel adapters: host channel adapters
   (HCAs) and target channel adapters (TCAs). The DAT to IB-RC mapping
   assumes that all DAFS participants have the necessary HCA
   capabilities.

IBA supports many other connection types in addition to Reliable
Connection. This appendix defines the DAFS mapping only to the
Reliable Connection Transport Type.

An alternate mapping using Reliable Datagrams, End-to-end Connections
and Reliable Datagram Domains is feasible. While such a mapping would
have desirable scalability features for servers, it would be more
complex to specify and would impose unneeded burdens on clients that
had no need to connect to either multiple servers or servers providing
only RD connections. Given the general DAFS objective for low-overhead
clients, this would not be a suitable default mapping. It MAY be added
at a later date as an alternate mapping.
The DAFS mapping to InfiniBand Reliable Connections requires use of
normal RC verbs once the connection is established through the
Communication Manager (CM). The CM is described in Volume 1, Chapter
12 of the InfiniBand Architecture Specification.

E.5. InfiniBand Architecture Transport Endpoints and Connections

There is a one-to-one correspondence between a DAT Connection and a
connected pair of QPs using the Reliable Connection Transport Service
Type.

Most of the behavior necessary for DAFS connection management is
implemented by the IBA Communications Manager (CM). As per the IBA
Spec (Vol 1, Chapter 12.1):

   Connections are managed over Queue Pairs other than those used for
   the connection, through the protocol described herein, between the
   Communication Managers (CMs) on each system. (See Figure 126) The
   CMs communicate using Management Datagrams (MADs), typically over
   the General Services Interface (GSI) on each system. This document
   [InfiniBand Architecture specification] defines CM external
   behaviors, but internal interfaces and implementations are outside
   the scope of the InfiniBand Architecture specification.

In general, InfiniBand places the bulk of the requirements on
connection establishment upon the client side. Per the Architecture
specification Chapter 12.1:

   The requirements on participating CMs are not equal. The initiating
   CM is responsible for collecting or calculating most of the
   information necessary to establish the connection. Much of the raw
   information is available from Subnet Administration, but some
   adjustments MAY be desirable, depending on the application of the
   channel.

The DAFS client is the initiator for establishment of all connections.
This includes the back-control channel, even though it will be acting
on it as the "responder."
Under the DAFS Session establishment procedures, the server requests
additional connections for back-control and RDMA Read channels, but
the client is responsible for establishing all connections.

The Communications Manager MAY be shared with other services on the
same host. It is not the intent of this mapping to create special
requirements for the Communications Manager. A Communications Manager
implemented to flexibly meet the capabilities described by the
InfiniBand specification SHOULD be deployable without modification.

E.5.1. Proxy Communications Managers

IBA allows Proxy Communications Managers (for more information see
12.10.7, Active Client to Passive Server with Redirector, page 590).
This means that the Channel Adapter Address for the CM might not be
the address of the requested server. Instead, servers with multiple
Channel Adapters MAY elect to have all of their connection
establishment done via a single CM. This central CM MAY elect to
complete the connection request on any of its channel adapters. This
feature MAY be used by a multiple-adapter DAFS server to load balance
new connections.

E.5.2. Partitions

Per InfiniBand Architecture Specification Volume 1, Release 1.0,
section 3.5.6:

   Partitioning enforces isolation among systems sharing an InfiniBand
   fabric. Partitioning is not related to boundaries established by
   subnets, switches, or routers. Rather a partition describes a set
   of endnodes within the fabric that can communicate.

   Each port of an endnode is a member of at least one partition and
   MAY be a member of multiple partitions. A partition manager assigns
   partition keys (P_Keys) to each channel adapter port. Each P_Key
   represents a partition. Each QP and EE context is assigned to a
   partition and uses that P_Key in all packets it sends and inspects
   the P_Key in all packets it receives. Reception of an Invalid P_Key
   causes the packet to be discarded.
   Switches and routers MAY optionally be used to enforce
   partitioning. In this case the partition manager programs the
   switch or router with P_Key information and when the switch or
   router detects a packet with an invalid P_Key, it discards the
   packet.

DAFS does not require the use of Partitions, but can work with them
when they are present. The only requirement is that Partitioning not
interfere with client/server communications.

E.5.3. DAFS Connection Establishment Requirements

E.5.3.1. DAFS Client

E.5.3.1.1. Connection Request

The DAFS client is responsible for initiating establishment of all IB
Reliable Connections by issuing a REQ (request for connection)
message. The content that DAFS REQUIRES in the REQ message is defined
as follows:

Service ID
   A 64-bit big-endian integer whose value is derived from the DAT
   Connection Qualifier of the DAFS server (for more information, see
   E.8. "DAFS Name Service Mapping for InfiniBand Reliable
   Connection").

Transport Service Type
   Reliable Connection.

Primary Remote Port GID
   Specifies the requested Channel Adapter Address. The Remote Port
   GID is derived from the DAT Host Name (for more information, see
   E.8. "DAFS Name Service Mapping for InfiniBand Reliable
   Connection"). When multiple virtual servers are supported on the
   same HCA, each SHOULD have its own virtual GID to differentiate the
   requests.

Local QPN
   The local QP on which the DAFS client wants to establish a
   connection.

PrivateData
   The content of the PrivateData field is defined in E.9. "DAFS
   Client Connection Request PrivateData".

The partition key for all connections of a DAFS session MUST be the
same. All other fields of the DAFS client REQ message are defined by
the standard IB rules.
These include among others: Remote CM Response Timeouts, Alternative
Remote Port GID, Primary and Alternative Local Port GID, Primary and
Alternative Traffic Class, Primary and Alternative Packet Rate,
Primary and Alternative LIDs, and Local Communication ID.

E.5.3.1.2. Response Messages

The CM on the DAFS client SHALL handle the following types of response
to its connection request message:

1) REP - Reply to Request for Communication. The Remote Communication
   ID from this message SHALL be used for Disconnect of this
   connection (see E.5.4. "Disconnect" for more information).

2) MRA - Message Receipt Acknowledgement means that the DAFS server CM
   cannot respond to the REQ message within the requested timeout.
   The MRA extends the timeout period for the original request.

3) Redirecting REJ - One form of Rejection message can be used in
   conjunction with Proxy Communications Management. The connection is
   rejected as requested, but the CM supplies alternate values for the
   primary and alternate endpoints. The CM on the DAFS client MUST
   resubmit the connection request message with the supplied alternate
   values. The DAT Provider and/or CM MUST implement this process
   transparently to the DAFS Provider.

4) Normal REJ - Connection request is rejected.

E.5.3.1.3. Ready to Use Message

Upon receiving a REP message from the DAFS server, the DAFS client can
issue an RTU message using the same Local Communication ID as the REQ
message and the Remote Communication ID from the REP message. The DAFS
client does not use the PrivateData field of the RTU message.

Prior to issuing the RTU message, the DAFS client SHALL ensure that
RDMA Read and RDMA Write are enabled on the QP endpoint for the
Operation channel. Prior to the RTU message the DAFS client SHALL
ensure that RDMA Read is enabled on the QP endpoint of the RDMA Read
channel (if created in response to the DAFS server request).

E.5.3.2. DAFS Server

E.5.3.2.1. Connection Request Message Receipt

The contacted DAFS server MAY respond through its CM with one of four
different types of message.

MRA - Message Receipt Acknowledgement
   MRAs are sent when the recipient of the message anticipates that it
   will not be able to respond within the time specified within the
   REQ message. It avoids unnecessary retries. Frequently the server
   side needs to create and/or modify queue pairs before the
   connection is usable. The MRA enables it to hold off retries while
   it finishes this work. HCAs are NOT REQUIRED to be able to generate
   MRAs. Therefore a CM on such an HCA would have to finish its work
   promptly.

Redirecting REJ
   Reject message is used to reject the connection as requested. This
   rejection specifies different primary and/or alternate endpoints.
   There is no requirement to reserve these resources at the time the
   Redirecting REJ is issued.

Normal REJ
   Reject message is used to reject the connection.

REP - Reply
   Accepts the connection, specifying the local QPN for the DAFS
   server endpoint to be used for the requested connection. All other
   fields of the DAFS server REP message are defined by the standard
   IB rules.

The server-side Channel Interface MUST allocate or create listener(s)
to accept new connections. Operations Channel listeners MUST be ready
to process the first receive for the DAFS Session establishment
exchanges. The Channel Interface MAY need to use MRAs (if supported by
the IB Channel Interface) until the listener is ready to begin the
DAFS Session establishment.

The DAFS server can either create QPs to be used by its CM or let the
CM create QPs with appropriate parameters in response to a REQ message
from the DAFS client. The DAFS server is NOT REQUIRED to have RDMA
Read or RDMA Write enabled on the QP endpoints for any of the DAFS
communication channels.

E.5.4. Disconnect

Per InfiniBand Architecture Specification Volume 1, Release 1.0,
Chapter 12.10.8 - Communication Release, page 561:

   Communication release as illustrated in this section is ungraceful.
   Upon receipt of a Disconnection Request, each CM SHALL cause the
   affected QP to be placed into the error state, causing pending work
   requests to complete with the Flush error status.

   Consumers are free to define and execute a more graceful
   communication release protocol that allows for an orderly shutdown
   of communications. Any such protocol SHALL utilize the
   communication release protocol illustrated below after the
   termination of normal message processing.

DAFS clients SHOULD terminate communications at the DAFS Protocol
layer before requesting the release of the Connection. Since any
server-side initiated termination is inherently ungraceful, there is
no need for a DAFS layer disconnect, nor is there any method of doing
so.

Note that the Channel Interface MUST accept DREQ requests, even if the
DAFS layer has failed to properly shut down. Attempts to use a
disconnected endpoint will return an error.

E.5.5. Automatic Path Migration

This DAFS mapping does not require use of the Automatic Path Migration
(APM) capabilities of InfiniBand. Switching between DAFS servers is
handled at the DAFS layer.

However, in some deployments the use of Automatic Path Migration MAY
be a requirement and/or a default behavior built into the
Communications Manager. In order to provide greater compatibility with
other local services, the DAT Provider MAY utilize APM capabilities.
However, it SHALL NOT require the DAFS Provider to interact with the
migration process.

APM is allowed to be used only to switch paths between the DAFS client
and DAFS server. It MUST NOT be used to switch a DAFS client to an
alternate DAFS server. Migration to fallback servers is the
responsibility of the DAFS layer.

E.6. IBA Memory Semantics

E.6.1. Memory Regions and Memory Windows

InfiniBand allows the Consumer to register virtual or physical memory
with a specific HCA. This process returns an L-Key (local key) and an
optional R-Key (remote key). The L-Key is used only in local
interactions. The R-Key MUST be supplied in all on-the-wire
references. The R-Key corresponds to a DAT RMR Context. Registering
memory for local-only purposes is outside the scope of this mapping.

A distinct verb (operation) exists to re-register the same memory
region and receive an additional set of keys. This extra region can
have different access attributes.

IBA also supports Memory Windows, which allow access to specific
portions of a Memory Region for a specific QP. The dynamic binding
receives a new R-Key. The remote holder of an R-Key does not have to
be aware of whether it refers to a Memory Region or a Memory Window.

DAFS requires that all Memory Regions or Memory Windows whose R-Keys
will be used as RMR Contexts SHALL have their Access Control set to
Enable Remote Write Access or Enable Remote Read Access or both,
according to the client's use of this memory and corresponding RMR
Context.

Note: Since DAFS does not use the RDMA Atomic operation, DAFS never
requires that Access Control on a Memory Region or Memory Window have
Remote Atomic Operation Access Enabled.

E.6.2. Protection Domains

Per InfiniBand Architecture Specification Volume 1, Release 1.0,
section 3.5.5:

   Not only does memory registration allow the use of virtual memory
   addressing, but it also provides an increased level of protection
   against inadvertent and unauthorized access. Since a consumer might
   communicate with many different destinations but not wish to let
   all those destinations have the same access to its registered
   memory, IBA provides protection domains.
   Protection domains allow a consumer to control which set of its
   Memory Regions and Memory Windows can be accessed by which set of
   its QPs.

   Before a consumer allocates a QP or registers memory, it creates
   one or more protection domains. QPs are allocated to, and memory
   registered to, a protection domain. L_Keys and R_Keys for a
   particular memory domain are only valid on QPs created for the same
   protection domain.

All resources supporting a specific DAFS Session MUST belong to a
single Protection Domain. Specifically: The DAFS client's R-Key that
corresponds to the RMR Context can be passed over the Operation
Channel to a remote server site. The Protection Domain of the Memory
Region or Memory Window corresponding to the R-Key MUST match that of
the QP that is the DAFS client's endpoint of the Operation Channel.
Furthermore, if an RDMA Read Channel exists for the session, the QP
that is the DAFS client's endpoint of the RDMA Read Channel MUST also
be assigned to this same Protection Domain.

E.7. IBA Data Transfer Operations

Per InfiniBand Architecture Specification Volume 1, Release 1.0,
section 9.4.1 on Send Operation:

   The SEND Operation is sometimes referred to as a Push operation or
   as having channel semantics. Both terms refer to how the SW client
   of the transport service views the movement of data. With a SEND
   operation the initiator of the data transfer pushes data to the
   remote QP. The initiator doesn't know where the data is going on
   the remote node. The remote node's Channel Adapter places the data
   into the next available receive buffer for that QP. On an HCA, the
   receive buffer is pointed to by the WQE at the head of the QP's
   receive queue.

Per [IB] section 9.4.3 on RDMA Write Operation:

   The RDMA WRITE Operation is used by the requesting node to write
   into the virtual address space of a destination node.
   The message MAY be between zero and 2**31 bytes (inclusive) and is
   written to a contiguous range of the destination QP's virtual
   address space (not necessarily a contiguous range of physical
   memory).

Per [IB] section 9.4.4 on RDMA Read Operation:

   RDMA READ Operations are similar to RDMA WRITE Operations. They
   allow the requesting node to read a virtually contiguous block of
   memory on a remote node. As with RDMA WRITEs, the responding node
   first allows the requesting node permission to access its memory.
   The responder passes to the requestor a virtual address, length,
   and R_Key to use in the RDMA READ request packet.

Per [IB] section 10.7.2.2 on RDMA Operations:

   The target address of an RDMA request is the remote node's virtual
   address, a valid R_Key and length. The R_Key MUST be associated
   with either a Memory Region or a Memory Window containing that
   virtual address.

In the above paragraph, a "target address" corresponds to a DAT "RMR
Target Address" and an "R_Key" corresponds to a DAT "RMR Context."

DAFS does not use Immediate Data, therefore the following Base
Transport Header (BTH) OpCodes are never used by DAFS clients or
servers (See [IB] Section 9.2.1):

o  SEND Only With Immediate

o  SEND Last With Immediate

o  RDMA Only With Immediate

o  RDMA Last With Immediate

Note that RDMA Read Packets never carry Immediate Data.

DAFS does not specify a mechanism where Solicited Events can be used
to control CQ event generation on remote endpoints. Therefore, in
order to guarantee interoperability, DAFS clients and servers SHALL
always set the Solicited Event (SE) bit to 0. See [IB] Section 9.2.3.

DAFS does not use the RDMA Atomic Operation.

E.8. DAFS Name Service Mapping for InfiniBand Reliable Connection

The DAFS Name Service has three transport-related fields:

o  The Transport Type: "IBRC"

o  The DAT Host Name.
o  The DAT Connection Qualifier: a right-justified, zero-padded,
   16-digit hexadecimal printable ASCII string encoding an InfiniBand
   sixty-four (64) bit Service ID.

The last field, Transport Specific Server attributes, is not used by
the DAT to IB mapping.

The DAFS client MUST translate the hostname to a GID using whatever
host to address services it normally uses. This mapping would include,
but not be limited to, use of IPv6 compatible name servers and
administrative configuration of the host. The DAFS client can rely
upon this mapping being relatively stable (as with DNS to IP mapping),
and MAY make use of caching to avoid per-connection network traffic
that delays completion of the new connection.

DAFS Servers are encouraged to use a standard Service ID for the DAT
operations channel. This number will be obtained from IBTA and/or
IETF. If multiple distinct virtual DAFS Servers are available at the
same physical host, multiple virtual GIDs SHOULD be used to
differentiate them. However, alternate Service IDs MAY also be used.

DAFS Clients SHALL NOT assume that the suggested Service ID is in use,
and MUST use the Service ID provided via the DAFS Name Service.

Note that multiple connections from the same client to the same server
for the same type of DAFS Connection will request the same Service ID.
However, each request MUST provide a different client-side QPN, and
will receive a different server-side QPN.

E.9. DAFS Client Connection Request PrivateData

The DAFS client uses the PrivateData field in the REQ message.

Rationale: A DAFS server should have the ability to differentiate
among connection establishment requests from different clients, client
sessions, and session channel types. While this can be achieved in
multiple ways (for example, Local Communication ID), it was decided to
use private data to make the VI and IB mappings more alike. This
simplifies transitions between VI- and IB-based implementations.
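As an illustration of the Connection Qualifier encoding described in
E.8, the following is a minimal C sketch of converting between the
16-digit hexadecimal string and a 64-bit Service ID. It is not part of
the mapping. The function names are hypothetical, and two points the
text leaves open are treated as assumptions: both upper- and lowercase
hex digits are accepted on input, and lowercase is produced on output.
The Service ID is handled as a host-order integer; placing it in the
REQ message in big-endian form is assumed to be done elsewhere.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* A Connection Qualifier for the "IBRC" transport is exactly
 * 16 hexadecimal digits encoding a 64-bit Service ID (E.8). */
#define DAFS_IB_QUALIFIER_DIGITS 16

/* Hypothetical helper: write the Service ID as a zero-padded,
 * 16-digit hex string. buf must hold at least 17 bytes. */
void dafs_ib_format_qualifier(uint64_t service_id, char buf[17])
{
    snprintf(buf, 17, "%016llx", (unsigned long long)service_id);
}

/* Hypothetical helper: parse a Connection Qualifier into a
 * host-order Service ID. Returns 0 on success, or -1 if the
 * string is not exactly 16 hexadecimal digits. */
int dafs_ib_parse_qualifier(const char *qualifier, uint64_t *service_id)
{
    uint64_t value = 0;
    size_t i;

    if (qualifier == NULL ||
        strlen(qualifier) != DAFS_IB_QUALIFIER_DIGITS)
        return -1;

    for (i = 0; i < DAFS_IB_QUALIFIER_DIGITS; i++) {
        char c = qualifier[i];
        unsigned digit;

        if (c >= '0' && c <= '9')
            digit = (unsigned)(c - '0');
        else if (c >= 'a' && c <= 'f')
            digit = (unsigned)(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F')
            digit = (unsigned)(c - 'A') + 10;
        else
            return -1;          /* not a hex digit */

        value = (value << 4) | digit;
    }

    *service_id = value;
    return 0;
}
```

A client would call dafs_ib_parse_qualifier() on the qualifier string
returned by the DAFS Name Service and pass the result to its CM as the
Service ID of the REQ message. Rejecting short or malformed strings,
rather than padding them, follows from the fixed 16-digit format.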
The 16 bytes starting from the start of the PrivateData (bits 140-267)
are defined as follows:

   #define Dafs_Channel_Dafs_Operation  0x01
   #define Dafs_Channel_Back_Control    0x02
   #define Dafs_Channel_Rdma_Read       0x04
   typedef uint32 Dafs_Channel_Enum_Type;

   struct DAFS_client_address_discriminator {
       opaque32                discriminator_magic;
       opaque32                client_instance_differentiator;
       opaque32                client_session_differentiator;
       Dafs_Channel_Enum_Type  channel_type;
   };

Fields:

discriminator_magic
   The magic sequence 0x44 0x41 0x46 0x53 ('D' 'A' 'F' 'S') is used to
   identify a client discriminator as belonging to DAFS. This can be
   used as a sanity check and aid to identification of DAFS-related
   connection packets on bus analyzers, etc. This field determines the
   endianness of the remaining fields in the PrivateData.

client_instance_differentiator
   An opaque value chosen by the client to allow the server to
   differentiate different clients using the same remote IB HCA port.
   Different clients using the same IB HCA port MUST choose different
   client_instance_differentiators.

client_session_differentiator
   An opaque value chosen by the client to ensure that different
   sessions can be differentiated by the server.

channel_type
   Identifies the type of DAFS channel requested by the DAFS client.
   This allows the DAFS server to accept the connection request using
   a local QP with attributes appropriate for the given channel type.

E.10. References

[IB] InfiniBand Architecture Release 1.0a Volume 1 - General
     Specifications, released June 19, 2001.

Full Copyright Statement

Copyright (C) The Internet Society (2000, 2001). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published and
distributed, in whole or in part, without restriction of any kind,
provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights defined
in the Internet Standards process must be followed, or as required to
translate it into languages other than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT
NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN
WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.