INTERNET-DRAFT                                           D. Ray de Silva
Intended status: Standard                                      Sri Lanka
File: draft-ray-foretrunc-00                               2 August 2011
Expires: 3 February 2012

       Implementing Block-Level Enhancements for POSIX Filesystems:
           Fore-truncation, Ensparsement, and Re-Contiguization

      Copyright (c) 2011 IETF Trust and the persons identified as the
      document authors. All rights reserved.

      This document is subject to BCP 78 and the IETF Trust's Legal
      Provisions Relating to IETF Documents
      (http://trustee.ietf.org/license-info) in effect on the date of
      publication of this document. Please review these documents
      carefully, as they describe your rights and restrictions with
      respect to this document.

      This Internet-Draft is submitted in full conformance with the
      provisions of BCP 78 and BCP 79.
      
      Internet-Drafts are working documents of the Internet Engineering
      Task Force (IETF), its areas, and its working groups.  Note that
      other groups may also distribute working documents as Internet-
      Drafts.

      Internet-Drafts are draft documents valid for a maximum of six
      months and may be updated, replaced, or obsoleted by other
      documents at any time.  It is inappropriate to use Internet-
      Drafts as reference material or to cite them other than as
      "work in progress."

      The list of current Internet-Drafts can be accessed at
      http://www.ietf.org/1id-abstracts.html

      The list of Internet-Draft Shadow Directories can be accessed at
      http://www.ietf.org/shadow.html

Abstract

   This memo describes a method for the truncation, ensparsement and
   block-reallocation of POSIX files for the purpose of improving data
   access and reducing churn.  This specification is primarily useful
   for managing large log files and pre-allocated database files in
   enterprise environments.

Overview and Rationale

   Log files are a necessary curse in the computing world.  Logs must
   be maintained for technical, business, and legal reasons.  Logs
   need not, however, be kept forever on expensive rotating media;
   only a working subset is required on-line.  The issue is that logs
   grow from the end, but must be truncated from the front.  The few
   systems that allow truncation only permit it at the end.

   Some systems rely on sparse files, i.e. those in which not all
   logically sequential blocks are actually allocated.  The net effect
   is that any query on a non-existent block simply returns a block of
   zeroes, without any disk storage overhead (except during backups,
   where some archival methods faithfully copy the non-existent blocks
   and magically enlarge the storage requirement).  It is useful to be
   able to actually drop blocks that are unused, such that sparsity
   can be applied to a file after it was originally written.

   Much is made of the lack of necessity for defragmentation of files
   in POSIX filesystems, but the argumentation is not completely
   convincing. In many cases, poorly allocated files, especially in
   very large files such as log files and pre-allocated database,
   make even a sequential pass through the file akin to a random set
   of seeks.  The ability to enhance the contiguity of large files,
   if possible at reasonable I/O cost, would be a very worthwhile
   addition.

Assumptions on File Allocation Model

   Although technically possible on any filesystem, the methods herein
   are addressed to the POSIX model, in which the following block
   allocation model is followed (or faithfully faked, which amounts to
   the same thing):

      1. The name of the file is independent of the data, and is stored
         elsewhere. The actual identifier of the file is a (usually)
         32-bit identifier termed an Index Node (Inode), unique to the
         filesystem but only if qualified with the filesystem id. 

      2. The Inode contains, either within itself or in extension
         blocks, the allocation table for file space.  The file block
         numbers appear in logical sequential order such that, i.e., if
         4K blocks are being allocated, then byte number 40960001 will
         be allocated in the 10001st block in the list.  This actual
         data block may exist anywhere within the confines of the
         filesystem.

      3. The block size may vary from 512, the original standard, to any
         power of 2, with 4096 bytes being the current favourite.  Some
         SSDs allocate data blocks of 64K each, but it is probably not
         in the interest of the rest of the community to assume such a
         large unit as a default size.  However, varying sizes need to
         be allowed for.

Discussion

   The following additional operations are proposed.

      A. Drop Block Range.  This operation, applied to a file, would
         result in the file freeing the referred blocks.  The effect
         on users would be that any requests to those blocks would now
         simply return a block of zeroes.

         1. Implementation difficulty: Trivial.
               Can be implemented on any known system almost
               immediately.

         2. Operation permitted while file is in use: No.
               Users may have cached blocks or have buffered sequential
               reads.

         3. Operation safe to use in event of catastrophic failure: Yes.
               The only downside would be the failure to write-back the
               freeing of blocks, and this would be a safe failure.

      B. Re-Arrange Block Allocation Order.  This would permit the
         moving of an already-allocated block or blocks to the location
         of a free'd block or blocks.  The Inode allocation table size
         remains unchanged, the location of some block numbers simply
         moves from one place in the virtual array to another (which
         were originally zeroes).

         1. Implementation difficulty: Moderate.
               Can be implemented on most systems easily.

         2. Operation permitted while file is in use: No.
               Users may have cached blocks or have buffered sequential
               reads.

         3. Operation safe to use in event of catastrophic failure: No.
               Metadata updates may cause a loss of the whole file. The
               safe implementation would be to allocate a new inode,
               copy the changed allocation table, and then switch
               inodes. This procedure can be logged for replay/undo.

      C. Optimize Block Allocation.  This could be a nearly
         parameterless request, since the filesystem manager must make
         most of the decisions.  The following algorithm is proposed:
            i. Get the current free list, add to it all blocks assigned
               to the current file, and sort the list.
           ii. Prepare a list of optimal exchange-pairs by eliminating
               blocks already in correct order.
          iii. Grab the free blocks required and lock them.
           iv. Apply your favourite defrag algorithm: move to free
               blocks first, exchange nearest blocks first, whatever.

         1. Implementation difficulty: non-trivial.
               Requires careful design and will use considerable memory
               for operation, in addition to high I/O bandwidth.

         2. Operation permitted while file is in use: Yes.
               The file logical order is unchanged, and buffered data
               remains valid.

         3. Operation safe to use in event of catastrophic failure:
               Yes. Provided that write-before-metadata-update is
               followed, no corruption should occur except to the free
               list, which can be easily rebuilt.

Implementation

   1. User-level implementation
         One operation, that of ensparsement, is actually already
         implemented de-facto in the Sandforce 1200/1500 series of SSD
         controllers.  In these systems, the incoming write data
         effectively generates a hash which is compared with existing
         signatures on the media; if there is a duplicate, the data is
         simply not written.  Consider the effect of overwriting an
         existing file with blocks of zeroes: after the first few blocks
         are written, the next blocks are simply free'd and re-used.
         This feature operates across all files in the filesystem, so
         that if any file has blocks of zeroes, only one such block
         will ever be written.

   2. Application-level implementation
         It is envisaged that this feature would be added to the
         (already cluttered!) fcntl functions.

         One issue that would have to be addressed is the earlier
         mentioned fact that different systems have different block
         sizes.  To enable the implementation to recover smoothly from
         parameter errors, it is suggested that one set of error returns
         be "Incorrect Block Size Specified", in which case the error
         return would be the actual block size available.  Where there
         are multiple block sizes in simultaneous use, such as in some
         tail-packing schemes, the largest size is defaulted.

         To avoid having to specify a long list of block numbers, the
         parameters should permit a block range (i.e. block 0 to 10000)
         to be specified.  It is also possible to let the system decide
         in which order to operate on the blocks, for greatest
         efficiency.

Resulting Features

   The goals set at the beginning of this paper will be achieved by
   using the three operations specified.

   1. Fore-truncation of log files: Operation A applied to the beginning
      of the file, followed by Operation B applied to the entire file
      with offset to the location of the surviving data.  This would
      move all data logically up a number of blocks, and set the new
      EOF.

   2. Applying sparsity to already-written files: Operation A by itself.

   3. De-fragmentation: Operation C by itself.  Various degrees of
      re-contiguization are possible: a) only re-arrange blocks which
      are in reversed order, b) find free space to create large
      contiguous extents, or 3) full-fleged optimization.



Author's Address

   D. Renuk de Silva
   Renux Private Limited
   Ratmalana WP 10370
   Sri Lanka

   Phone: +9478 622-7093

   EMail: renuk007@users.sourceforge.net