Network Services Research Lab David Korn and Kiem-Phong Vo AT&T Labs Submission: June 29, 2001 Expiration: December 29, 2001 The VCDIFF Generic Differencing and Compression Data Format Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This memo describes a general and efficient data format suitable for encoding compressed and/or differencing data so that they can be easily transported among computers. Table of Contents: 1. EXECUTIVE SUMMARY ............................................ 1 2. ALGORITHM CONVENTIONS ........................................ 3 3. DELTA INSTRUCTIONS ........................................... 4 4. VCDIFF ENCODING FORMAT ....................................... 5 5. INSTRUCTION CODE TABLES ...................................... 14 6. FURTHER ISSUES ............................................... 16 7. SUMMARY ...................................................... 17 8. ACKNOWLEDGEMENTS ............................................. 17 9. SECURITY CONSIDERATIONS ...................................... 17 10. SOURCE CODE AVAILABILITY ..................................... 18 11. REFERENCES ................................................... 18 12. AUTHOR'S ADDRESS ............................................. 18 13. APPENDIX: SECTION LAY-OUT IN A DELTA FILE..................... 19 14. EXPIRATION DATE .............................................. 19 1. EXECUTIVE SUMMARY Compression and differencing techniques can greatly improve storage and transmission of files and file versions. Since files are often transported across machines with distinct architectures and performance characteristics, such data should be encoded in a form that is portable and can be decoded with little or no knowledge of the encoders. This document describes Vcdiff, a compact portable encoding format designed for these purposes. RFC 4 VCDIFF Generic Differencing and Compression Data Format Data differencing is the process of computing a compact and invertible encoding of a "target file" given a "source file". Data compression is similar but without the use of source data. The UNIX utilities diff, compress, and gzip are well-known examples of data differencing and compression tools. For data differencing, the computed encoding is called a "delta file", and, for data compression, it is called a "compressed file". Delta and compressed files are good for storage and transmission because they are often smaller than the originals. Data differencing and data compression are traditionally treated as distinct types of data processing. However, as shown in the Vdelta technique by Korn and Vo [1], compression can be thought of as a special case of differencing in which the source data is empty. The basic idea is to unify the string parsing scheme used in the Lempel-Ziv'77 style compressors [2], and the block-move technique of Tichy [3]. Loosely speaking, this works as follows: a. Concatenate source and target data. b. Parse the data from left to right as in LZ'77 but make sure that a parsed segment starts target data. c. Start to output when reaching target data. Parsing is based on string matching algorithms such as suffix trees [4] or hashing with different time and space performance characteristics. Vdelta uses a fast string matching algorithm that requires less memory than other techniques [5]. However, even with this algorithm, the memory requirement can still be prohibitive for large files. A common way to deal with memory limitation is to partition an input file into chunks called "windows" and process them separately. Here, except for unpublished work by Vo, little has been done on designing effective windowing schemes. Current techniques, including Vdelta, simply use windows of the same size with corresponding addresses across source and target files. String matching and windowing algorithms have large influence on the compression rate of delta and compressed files. However, it is desirable to have a portable encoding format that is independent of such algorithms. This enables construction of client-server applications in which a server may serve clients with unknown computing characteristics. Unfortunately, all current differencing and compressing tools, including Vdelta, fall short in this resspect. Their storage formats are closely intertwined with the implemented string matching and windowing algorithms. The encoding format Vcdiff proposed here addresses the above issues. Vcdiff achieves the below characteristics: Output compactness: The basic encoding format compactly represents compressed or delta files. Applications can further extend the basic encoding format with "secondary encoders" (e.g., a Huffman or arithmetic encoder) to achieve more compression. Data portability: The basic encoding format is free from machine byte order and word size issues. This allows data to be encoded on one machine and decoded on a different machine with different architecture. Algorithm genericity: -2- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 The decoding algorithm is independent from string matching and windowing algorithms. This allows competition among implementations of the encoder while keeping the same decoder. Decoding efficiency: Except for secondary encoder issues, the decoding algorithm runs in time proportional to the size of the target file and uses space proportional to the maximal window size. The Vcdiff data format and the algorithms for decoding data shall be described next. Since Vcdiff treats compression as a special case of differencing, we shall use the term "delta file" to indicate the compressed output for both cases. 2. ALGORITHM CONVENTIONS Algorithms to encode and decode the Vcdiff data format will be presented in the ANSI-C language. To simplify the presentation, we shall generally omit error checking and resource allocation and deallocation. In addition, the below conventions will be observed: a. Tokens with all upper cases letters will be C macros. b. Variables with capitalized names are global. c. Code fragments will be presented with line numbers to be used in the subsequent COMMENTS sections. Data will be encoded in byte units. For portability, control data generated by Vcdiff shall be limited to the lower eight bits of a byte even on machines with larger bytes. The bits in a byte are named from right to left so that the first bit has the lowest value, 1<<0 or 1, and the eighth bit has value 1<<7 or 128. To facilitate the algorithms, we shall assume a few types and functions for I/O on streams. To be definite, we shall use interfaces similar to that provided in the Sfio library [6]. Below are descriptions of some of these types and functions. Others can be found in reference [6]. uchar_t: This is the type "unsigned char". When coded in a delta file, this takes a single byte. uint_t: This is some unsigned integer type that is at least as large as an "unsigned int" and should be large enough to hold file offsets. In general, unsigned integers are coded in a portable variable-sized format to be described in Section 4.2. Sfio_t: This is the type of a stream. Sfio_t* sfstropen(uchar_t* buf, int n): This is not an Sfio function but it can be easily implemented on top of the Sfio primitive sfnew(). sfstropen() creates a stream from a given buffer with a given size. We shall assume that such a stream is both readable and writable. As with Sfio, a stream opened for writing will extend its buffer as necessary to accommodate output data. -3- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 3. DELTA INSTRUCTIONS A target file is partitioned into non-overlapping sections or windows to be processed separately. A target window T of length t may be compared against some source data segment S of length s. Such a source data segment may come from some earlier part of the target file or it may come from the source file, if there is one. It is assumed that there is sufficient memory so that both T and S can be processed in main memory. For string processing, we treat S and T as substrings of a superstring U formed by concatenating T and S like this: S[0]S[1]...S[s-1]T[0]T[1]...T[t-1] The index of a byte in S or T is referred to by its location in U. For example, the index of T[i] is s+i. The instructions to encode and direct the reconstruction of a target window are called delta instructions. There are three types: ADD: This instruction has two arguments, a size s and a sequence of s bytes to be copied. COPY: This instruction has two arguments, a size s and an address p in the string U. The arguments specify the substring of U that must be copied. We shall assert that such a substring must be entirely contained in either S or T. RUN: This instruction has two arguments, a size s and a byte b that will be repeated s times. Below are example source and target strings and the delta instructions that encode the target string in terms of the source string. a b c d e f g h i j k l m n o p a b c d w x y z e f g h e f g h e f g h e f g h z z z z COPY 4, 0 ADD 4, w x y z COPY 4, 4 COPY 12, 24 RUN 4, z Thus, the first letter 'a' in the target string will be at location 16 in the superstring. Note that the fourth instruction, "COPY 12, 24", copies data from T itself since address 24 is position 8 in T. In addition, some part of the data to be copied is reconstructed along with the copying. This allows efficient encoding of periodic sequences, i.e., sequences with regularly repeated subsequences. The RUN instruction is a compact way to encode a sequence repeating the same byte even though such a sequence can be thought of as a periodic sequence with period 1. -4- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 4. VCDIFF ENCODING FORMAT A large target file is partitioned into non-overlapping sections called "target windows". Window sizes should be chosen so that each window can be completely processed in memory during both encoding and decoding. A target window may be compared against some source data segment or none. For compression, such a source segment would be from some earlier part of the target file while, for differencing, it is often from the source file. 4.1 Delta File Layout A Vcdiff delta file is organized into sections as follows: Header Window 1 Window 2 ... The header of a delta file includes magic bytes to identify file type and information concerning data processing beyond the basic encoding format. The window sections encode the target windows. 4.1.1 The Header Section Each delta file starts with a header section organized as below. Note the convention that square-brackets enclose optional items. Header1 - uchar_t Header2 - uchar_t Header3 - uchar_t Header4 - uchar_t Indicator - uchar_t [Secondary compressor - uchar_t] [Length of instruction code table data - uint_t] [Instruction code table data] The first four Header bytes are defined below. The first three bytes are the ASCII characters 'V', 'C' and 'D' or-ed with the eighth bit. The fourth byte is currently set to zero. In the future, it may be used to indicate the version of Vcdiff. #define VCD_HEADER1 (0x56 | (1<<7)) /* 'V' | (1<<7) */ #define VCD_HEADER2 (0x43 | (1<<7)) /* 'C' | (1<<7) */ #define VCD_HEADER3 (0x44 | (1<<7)) /* 'D' | (1<<7) */ #define VCD_HEADER4 (0) /* version */ The Indicator byte shows if there are any initialization data required to aid in the reconstruction of data in the Window sections. This byte is composed from some subset of the following bits: #define VCD_DECOMPRESS (1<<0) #define VCD_CODETABLE (1<<1) The bit VCD_DECOMPRESS indicates that a secondary compressor may have -5- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 been used to further compress certain parts of the delta data as described in Section 4.1.3. In that case, the index of the decompressor is given in a subsequent byte. The Length of instruction code table and the Instruction code table data sections are in the delta file only if the bit VCD_CODETABLE is on. The processing of this table will be described in Section 5. 4.1.2 The Format of a Window Each window is organized as follows: Indicator - uchar_t [Source segment size - uint_t] [Source segment position - uint_t] Length of the delta encoding - uint_t The delta encoding Below are the detail of the various items: Indicator: This byte should be zero or one of the below values: #define VCD_SRCWINDOW (1<<0) #define VCD_TARWINDOW (1<<1) The value VCD_SRCWINDOW indicates that there is a segment of data from the source file used for differencing against the current window. Likewise, VCD_TARWINDOW indicates a similar segment of data from the target file. In these cases, encoded next are two integers to indicate respectively the size and position of the data segment in the relevant file. If this byte is zero, the window was compressed without using a source segment. Length of the delta encoding: This is a variable-sized integer that tells the length of the delta encoding data for this window. The delta encoding: This contains the data representing the delta encoding. 4.1.3 The Delta Encoding The delta encoding of a window is organized as follows: Length of target window - uint_t Indicator - uchar_t Length of ADD and RUN data section - uint_t Length of instruction section - uint_t Length of COPY address section - uint_t Data section for ADD and RUN instructions Instructions section Addresses section for COPY instructions Length of target window: This is a variable-sized integer indicating the size of the target window. The decoder can use this value to allocate the necessary memory to decode the window. -6- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 Indicator: This byte is composed from some subset of the below bits: #define VCD_DATACOMP (1<<0) #define VCD_INSTCOMP (1<<1) #define VCD_ADDRCOMP (1<<2) The delta encoding is separated into three parts: the unmatched data accompanying the ADD and RUN instructions, the coding of the instructions, and the encoded addresses of the COPY instructions. If the bit VCD_DECOMPRESS (Section 4.1.1) was on, each of these sections may have been compressed using the given secondary compressor. The above bits indicate these cases. Length of ADD and RUN data section: This is the length of the section of data storing the unmatched data accompanying the ADD and RUN instructions (which may have been compressed using a secondary compressor as discussed). Length of instruction section: This is similar to the above but for the delta instructions. Length of COPY address section: This is similar to the above but for the addresses of the COPY instructions. Data section for ADD and RUN instructions: This section contains the unmatched data for the ADD and RUN instructions (which may have been further compressed). Instructions section: This section contains the instructions. Addresses section for COPY instructions: This section contains the addresses of the COPY instructions. 4.1.4 Processing a Delta File Below is the basic algorithm to decode a delta file: 1. sfgetc(Delta); 2. sfgetc(Delta); 3. sfgetc(Delta); 4. sfgetc(Delta); 5. Compressor = -1; 6. Codetable = Dflttable; 7. if((indi = sfgetc(Delta)) != 0) 8. { if(indi & VCD_DECOMPRESS) 9. Compressor = sfgetc(Delta); 10. if(indi & VCD_CODETABLE) 11. { n_tbl = sfgetu(Delta); 12. tbl = getdata(Delta, n_tbl, -1); 13. Codetable = gettable(tbl, n_tbl); 14. } 15. } 16. for(;;) -7- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 17. { if((indi = sfgetc(Delta)) < 0) 18. break; 19. if(indi != 0) 20. { n_src = sfgetu(Delta); 21. p_src = sfgetu(Delta); 22. if(indi == VCD_TARWINDOW) 23. src = getdata(Target, n_src, p_src); 24. else src = getdata(Source, n_src, p_src); 25. } 26. else n_src = 0; 27. n_del = sfgetu(Delta); 28. del = getdata(Delta, n_del, -1); 29. n_tar = win_inflate(&tar, src, n_src, del, n_del); 30. sfwrite(Target, tar, n_tar); 31. } COMMENTS. 1-4: These lines read the four magic header bytes that indicate the type of this file. 5-15: These lines first initialize the indices of secondary encoders and code table to default values. Then, the initialization data, if any, is processed to reset these variables. The function getdata() reads from the given stream the segment of data at the given position or the current position (if this argument is -1). The function gettable() will be described later. 16-31: These lines read window data and decode them. 19-26: These lines determine if there is a source segment of data and read it from the indicated file. 27-30: These lines read the delta encoding data, get processing space, call win_inflate() to decode target data, then write the results to the target file. Next is the function to recompute a target window: 1. int win_inflate(uchar_t** tarp, 2. uchar_t* src, int n_src, 3. uchar_t* del, int n_del) 4. { int n_tar, n_data, n_inst, n_addr; 5. uchar_t ctrl, *tar, *data, *inst, *addr; 6. Sfio_t *delf, *dataf, *instf, *addrf; 7. delf = sfstropen(del, n_del); 8. n_tar = sfgetu(delf); 9. ctrl = sfgetc(delf); 10. n_data = sfgetu(del); 11. n_inst = sfgetu(del); 12. n_addr = sfgetu(del); 13. tar = malloc(n_tar); 14. data = getdata(delf, n_data, -1); 15. inst = getdata(delf, n_inst, -1); 16. addr = getdata(delf, n_addr, -1); 17. if(ctrl&VCD_DATACOMP) 18. n_data = decompress(Compressor, data, n_data, &data); -8- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 19. if(ctrl&VCD_INSTCOMP) 20. n_inst = decompress(Compressor, inst, n_inst, &inst); 21. if(ctrl&VCD_ADDRCOMP) 22. n_addr = decompress(Compressor, addr, n_addr, &addr); 23. dataf = sfstropen(data, n_data); 24. instf = sfstropen(inst, n_inst); 25. addrf = sfstropen(addr, n_addr); 26. win_decode(tar, n_tar, src, n_src, dataf, instf, addrf); 27. *tarp = tar; 28. return n_tar; 29. } COMMENTS. 7: This line creates a stream to read the delta encoding data. 8-16: These lines read the sizes of the various datasets, then allocate memory and read in data as necessary. 17-22: These lines decompress the above data if necessary. The function decompress() invokes the respective decompressor. It returns the size of the decompressed data while the decompressed data itself is returned via the last argument. See Section 6 for proposed values for Compressor. 23-26: These lines call win_decode() (Section 4.4) to actually reconstruct target data. 27-28: These lines return the reconstructed target data. 4.2 Encoding Integers Using a Variable-Sized Format Vcdiff encodes integer values using the variable-size format introduced in the Sfio library [6] for encoding unsigned values. The code presented below is not quite correct with respect to type handling as in Sfio but they show the ideas. The encoding treats an integer as a number in base 128 so that each digit can be coded in one byte. The eighth bit of such a byte is the continuation bit. It is 1 if there are more digits to come or 0 if it is the last digit. +---------------------+ | 11100000 | 00111001 | +---------------------+ byte 0 byte 1 Table 1 Table 1 shows the variable sized encoding of 12345. The bytes in the encoding are presented in binary to make clear the use of the continuation bit. Omitting the continuation bit, the encoding of 12345 uses two bytes with values 96 and 57. -9- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 Below are the algorithms: 1. int sfputu(Sfio_t* f, uint_t v) 2. { uchar_t c[2*sizeof(uint_t)], *s; 3. s = &c[sizeof(c)-1]; 4. *s = v & 127; 5. while((v >>= 7) != 0) 6. *--s = (v & 127) | (1<<7); 7. sfwrite(f, s, (&c[sizeof(c)]) - s); 8. return &c[sizeof(c)]-s; 9. } 10. uint_t sfgetu(Sfio_t* f) 11. { uint_t b, v; 12. for(v = 0;; ) 13. { b = sfgetc(f); 14. v = (v << 7) | (b & 127); 15. if(!(b & 128) ) 16. return v; 17. } 18. } COMMENTS. 1: This line declares the formal arguments to sfputu(), a stream f to store the encoding and the value v to be encoded. 2: This line declares an array large enough to store the encoding. 4-6: These lines extract digits in base 128 and store them in the array. Note that the right-most digit is extracted first and does not carry a continuation bit. 7-8: These lines write the encoding out to the given stream f and return the length of that encoding. 10-18: These lines define the decoding algorithm. 4.3 Delta Instruction Coding Format Delta instructions are encoded as control bytes with associated data. Each control byte is an index into an instruction code table of 256 entries. Below are the relevant data structures: 1. typedef struct _vcdinst_s 2. { uchar_t type; /* instruction type */ 3. uchar_t size; /* >0 if size is coded here */ 4. uchar_t mode; /* address mode for COPYs */ 5. } Vcdinst_t; 6. typedef struct _vcdcode_s 7. { Vcdinst_t inst1; /* first instruction */ 8. Vcdinst_t inst2; /* second instruction */ 9. } Vcdcode_t; 10. typedef struct _vcdtable_s 11. { uchar_t s_same; /* s_same*256 addresses */ 12. uchar_t s_near; /* s_near addresses */ 13. Vcdcode_t code[256]; /* the instruction codes */ -10- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 14. } Vcdtable_t; COMMENTS. 1-5: An instruction is defined by its type, the size of the associated data and, in the case of a COPY, the mode of how the address is encoded. 6-9: As shown, a code in the range [0-255] can encode up to two instructions. The default Vcdiff code table uses this feature to merge adjacent ADD and COPY instructions with small sizes. 10-14: These lines define the encoding table. Addresses may be encoded against two optional caches, same and near, to be described later. Below are the values identifying the instructions: #define VCD_NOOP 0 /* not an instruction */ #define VCD_ADD 1 /* an ADD instruction */ #define VCD_RUN 2 /* a RUN instruction */ #define VCD_COPY 3 /* a COPY instruction */ Each instruction has a size s of the data involved. If the field Vcdinst_t.size is non-zero, it is the value for s. Otherwise, s is encoded next in the instruction dataset as a variable-sized integer. If the instruction is a COPY, the copy address will follow next in the instruction dataset. Its encoding depends on some addressing scheme to be discussed next. A COPY address is encoded using different modes as indicated by the values of the field Vcdinst_t.mode. Vcdiff allows this field to have values in the range [0-15]. The first two values are: #define VCD_SELF 0 #define VCD_HERE 1 If Vcdtable_t.s_near is positive, the next s_near values indicate addresses coded using the "near" cache. Then, if Vcdtable_t.s_same is positive, the next s_same values indicate addresses coded using the "same" cache. Let "addr" be the address of a COPY instruction and "here" the current location in the target data. During decoding a delta encoding data stream, the address modes have the below meanings: VCD_SELF: addr is encoded separately in the next variable-sized integer. VCD_HERE: Let i be the next variable-sized integer. Then, addr is here-i. Near cache: The "near" cache keeps s_near addresses. Let m be the mode of the instruction and i the next variable-sized integer. In addition, let n be m - (VCD_HERE+1). Then addr is near[m] + i. Same cache: The "same" cache keeps s_same*256 addresses. Let m be the mode (Vcdinst_t.mode) of the instruction and b the next byte. In addition, let n be m - (VCD_HERE+1+s_near). Then, addr is same[m*256 + b]. -11- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 Below are the algorithms to maintain address caches. 1. typedef struct _cache_s 2. { int* near; /* array of size s_near */ 3. int s_near; 4. int n; /* the circular index for near */ 5. int* same; /* array of size s_same*256 */ 6. int s_same; 7. } Cache_t; 8. Cache_t* cache_open(int s_near, int s_same) 9. { int i; 10. Cache_t* ka = malloc(sizeof(Cache_t)); 11. ka->near = malloc(s_near*256*sizeof(int)); 12. ka->same = malloc(s_same*sizeof(int)); 13. ka->s_near = s_near; 14. ka->s_same = s_same; 15. for(i = 0; i < ka->s_near; ++i) 16. ka->near[i] = 0; 17. for(i = 0; i < ka->s_same*256; ++i) 18. ka->same[i] = 0; 19. ka->n = 0; 20. } 21. cache_update(Cache_t* ka, uint_t addr) 22. { if(ka->s_same > 0) 23. ka->same[addr % (ka->s_same*256)] = addr; 24. if(ka->s_near > 0) 25. { ka->near[ka->n] = addr; 26. if((ka->n += 1) >= ka->s_near) 27. ka->n = 0; 28. } 29. } COMMENTS. 1-7: These lines define the caches. 8-20: These lines create a cache with addresses initialized to zero. 21-29: These lines update the caches given an address. Note that the "near" cache is updated in a round-robin manner. Below is the function to encode a COPY address: 1. int addr_encode(Cache_t* ka, int addr, int here, int* mode) 2. { int i, d, bestd, bestm; 3. bestd = addr; bestm = VCD_SELF; 4. if(sfulen(d = here-addr) < sfulen(bestd)) 5. { bestd = d; bestm = VCD_HERE; } 6. for(i = 0; i < ka->s_near; ++i) 7. { if((d = addr - ka->near[i] < 0) 8. continue; 9. if(sfulen(d) < sfulen(bestd)) 10. { bestd = d; bestm = VCD_HERE+1 + i; } -12- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 11. } 12. if(sfulen(bestd) > 1 && ka->s_same > 0 && 13. ks->same[d = addr%(ka->s_same*256)] == addr) 14. { bestd = d; bestm = (VCD_HERE+1) + ka->s_near + d/256; } 15. cache_update(ka,addr); 16. *mode = bestm; 17. return bestd; 18. } COMMENTS. 1: This lines declare the formal arguments. "addr" is the address to be encoded. "here" is the current location in the target data. "mode" is used to return the address mode. The encoded address itself is returned by the function. 3-11: "addr" is encoded as one of VCD_SELF, VCD_HERE or some "near" mode depending on which addressing scheme uses the least space. The sfulen() function computes the number of bytes required to encode a given integer. 12-14: If the coded address based on the above schemes still use more than one byte, the "same" cache is used to see if "addr" can be encoded in a single byte. 15: This line updates the address caches. Below is the function to decode a COPY address: 1. int addr_decode(Cache_t* ka, int here, int type, Sfio_t* addrf) 2. { int addr, m; 3. if(type == VCD_SELF) 4. addr = sfgetu(addrf); 5. else if(type == VCD_HERE) 6. addr = here - sfgetu(addrf); 7. else if((m = type - (VCD_HERE+1)) >= 0 && m < ka->s_near) 8. addr = ka->near[m] + sfgetu(addrf); 9. else 10. { m = type - (VCD_HERE+1 + ka->s_near); 11. addr = ka->same[m*256 + sfgetc(addrf)]; 12. } 13. cache_update(ka, addr); 14. return addr; 15. } 4.4 Decoding A Target Window The algorithm to decode a target window is as follows: 1. win_decode(uchar_t* tar, int n_tar, uchar_t* src, int n_src, 2. Sfio_t* dataf, Sfio_t* instf, Sfio_t* addrf) 3. { uint_t here, size, addr, i; 4. Vcdinst_t *inst; 5. Vcdcode_t *code, *table = Codetable->code; 6. Vcdcache_t *ka = cache_open(Codetable->s_near,Codetable->s_same); 7. for(here = 0; here < n_tar; ) -13- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 8. { code = &table[sfgetc(instf)]; 9. for(i = 1; i <= 2; ++i) 10. { inst = i == 1 ? &code->inst1 : &code->inst2; 11. if(inst->type == VCD_NOOP) 12. continue; 13. if((size = inst->size) == 0) 14. size = sfgetu(instf); 15. if(inst->type == VCD_ADD) 16. { for(; size > 0; --size) 17. tar[here++] = sfgetc(dataf); 18. } 19. else if(inst->type == VCD_RUN) 20. { int c = sfgetc(dataf); 21. for(; size > 0; --size) 22. tar[here++] = c; 23. } 24. else if(inst->type == VCD_COPY) 25. { uchar_t* from; 26. addr = addr_decode(&ka,here,inst->mode,addrf); 27. from = addr < nsrc ? src+addr : tar+addr-nsrc; 28. for(; size > 0; --size) 29. tar[here++] = *from++; 30. } 31. } 32. } 33. } COMMENTS. 5: "table" is initialized to be the instruction code table to be used. 6: This line initializes the address caches. 8: This line reads a control byte from the instruction dataset and gets the corresponding code table to decode instructions. 9-32: These lines process the given pair of delta instructions. Note that the data for VCD_ADD and VCD_RUN are read from the raw dataset. 5. INSTRUCTION CODE TABLES Delta instructions are encoded based on some instruction code table. Vcdiff provides a default instruction code table. As noted in Section 4.1.1, an application can change this table in the Header section of a delta file. 5.1 The Default Instruction Code Table A COPY instruction with small data size may use more encoding space than the data itself. Thus, it is good to require COPY instructions to have some minimum sizes. In turn, a code table can take advantage of such a requirement to enhance compactness. Vcdiff provides a default instruction code table with the below characteristics: s_near: the near cache has 4 entries. s_same: the same cache has 3*256 entries. COPY: the minimum size for a COPY is 4. Thus, there are 9 address modes for a COPY instruction. The first two are VCD_SELF and VCD_HERE. Following these are the 4 modes for addresses coded against the near cache. The last three modes are for addresses coded against the same cache. -14- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 Table 2 below summarizes the code in the default table. The first 6 columns describe the pairs of instructions per code and the last column shows the index range in the code table for these pairs. The single RUN instruction shown on the first row always encodes the size separately. The second row shows the single ADD instructions. The ADD instruction with size 0 (i.e., its size is coded separately as a variable-sized integer) has the code index 1. ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and their sizes will not be separately encoded. The last row shows instruction pairs where the first instruction is a COPY while the second is an ADD. In this case, only a COPY instruction of size 4 immediately followed by an ADD instruction of size 1 would be coded as a pair. The code indices for such pairs range from 247 to 255. TYPE SIZE MODE TYPE SIZE MODE INDEX --------------------------------------------------------------- RUN 0 - NOOP - - 0 ADD 0, [1,17] - NOOP - - [1,18] COPY 0, [4,18] 0 NOOP - - [19,34] COPY 0, [4,18] 1 NOOP - - [35,50] COPY 0, [4,18] 2 NOOP - - [51,66] COPY 0, [4,18] 3 NOOP - - [67,82] COPY 0, [4,18] 4 NOOP - - [83,98] COPY 0, [4,18] 5 NOOP - - [99,114] COPY 0, [4,18] 6 NOOP - - [115,130] COPY 0, [4,18] 7 NOOP - - [131,146] COPY 0, [4,18] 8 NOOP - - [147,162] ADD [1-4] - COPY [4-6] 0 [163,174] ADD [1-4] - COPY [4-6] 1 [175,186] ADD [1-4] - COPY [4-6] 2 [187,198] ADD [1-4] - COPY [4-6] 3 [199,210] ADD [1-4] - COPY [4-6] 4 [211,222] ADD [1-4] - COPY [4-6] 5 [223,234] ADD [1-4] - COPY 4 6 [235,238] ADD [1-4] - COPY 4 7 [239,242] ADD [1-4] - COPY 4 8 [243,246] COPY 4 [0-8] ADD 1 - [247,255] --------------------------------------------------------------- Table 2 5.2 Instruction Code Table Encoding The format of an instruction code table is as follows: Size of near cache - uchar_t Size of same cache - uchar_t Compressed table data Since each instruction is 3 bytes, an instruction code table can be represented by a string of length 2*3*256 or 1536 bytes. For compact storage, this string is compared against the string of the default table to generate an encoding in the same Vcdiff format. Two functions tbl2str() and str2tbl() are used for converting between a code table and its code string. Below is the description of tbl2str(). str2tbl() is just as straightforward so its description will be omitted. Note that bytes of the same type are grouped together to induce more -15- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 matching and increase compression. 1. tbl2str(Vcdcode_t* tab, uchar_t data[6*256]) 2. { int i, n; 3. n = 0; 4. for(i = 0; i < 256; ++i) 5. data[n++] = tab[i].inst1.type; 6. for(i = 0; i < 256; ++i) 7. data[n++] = tab[i].inst2.type; 8. for(i = 0; i < 256; ++i) 9. data[n++] = tab[i].inst1.size; 10. for(i = 0; i < 256; ++i) 11. data[n++] = tab[i].inst2.size; 12. for(i = 0; i < 256; ++i) 13. data[n++] = tab[i].inst1.mode; 14. for(i = 0; i < 256; ++i) 15. data[n++] = tab[i].inst2.mode; 16. } Below is the algorithm to reconstruct an instruction code table: 1. Vcdtable_t* gettable(uchar_t* data, int n_data) 2. { uchar_t src[6*256], *tar; 3. int n_tar; 3. Vcdtable_t *table; 4. table = malloc(sizeof(Vcdtable_t)); 5. table->s_near = *data++; 6. table->s_same = *data++; 7. tbl2str(Dflttable, src); 8. n_tar = win_inflate(&tar, src, 6*256, data, n_data-2); 9. table->code = str2tbl(tar); 10. } COMMENTS. 4-6: These lines allocate memory for the new code table and initialize the sizes of the near and same caches using the given data. 7: This line constructs the string corresponding to the default instruction code table. 8-9: These lines construct the new table. 6. FURTHER ISSUES There are two issues not yet addressed: Secondary compressors: As discussed in Section 4.1.1, certain sections in the delta encoding of a window may be further compressed by a secondary compressor. In our experience, the basic Vcdiff format is adequate for most purposes. So ordinarily, secondary compressors are not needed. However, such compressors can be occasionally useful for further compressing the unmatched data in RUN and ADD instructions when Vcdiff is used only for compression. -16- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 We reserve the following values for the indices of a few common compressors: #define VC_HUFFMAN 1 /* Huffman encoding */ #define VC_ARITH 2 /* arithmetic encoding */ #define VC_SPLAY 3 /* splay tree encoding */ The formats of the compressed data via such compressors or any compressors that may be defined in the future are left open to their implementations. Large file system vs. small file system: As discussed in Section 4.1.2, a target window in a large file may be compared against some source window in another file or in the same file (from some earlier part). In that case, the file offset of the source window is specified as a variable-sized integer in the delta encoding. There is a possibility that the encoding was computed on a system supporting much larger files than that in a system where the data may be decoded (e.g., 64-bit file systems vs. 32-bit file systems). In that case, some target data may not be recoverable. This state is detectable when such a large integer is encountered during decoding. 7. SUMMARY We have described a general and portable encoding format for compression and differencing. The format is compact. For compression only, using the same LZ-77 string parsing strategy and without any secondary compressors, the typical compression rate is better than Unix compress and close to gzip. For differencing, the same data format is better than all known methods. Ignoring secondary compressor issues, the decoding algorithms run in linear time and require working space proportional to window sizes. 8. ACKNOWLEDGEMENTS Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff who provided much encouragement to publicize Vcdiff. 9. SECURITY CONSIDERATIONS Vcdiff only provides a format to encode compressed and differenced data. It does not address any issues concerning how such data are, in fact, stored in a given file system or the run-time memory of a computer system. Therefore, we do not anticipate any security issues with respect to Vcdiff. -17- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 10. SOURCE CODE AVAILABILITY Vcdiff is implemented as a data transforming method in Phong Vo's Vcodex library. AT&T Corp. has made the source code for Vcodex available for anyone to use to transmit data via HTTP1.1 Delta Encoding. The source code and according license is accessible at the below URL: http://www.research.att.com/sw/tools 11. REFERENCES [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression, Practical Reusable Unix Software, Editor B. Krishnamurthy, John Wiley & Sons, Inc., 1995. [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, 23(3):337-343, May 1977. [3] W. Tichy, The String-to-String Correction Problem with Block Moves, ACM Transactions on Computer Systems, 2(4):309-321, November 1984. [4] E.M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, Journal of the ACM, 23:262-272, 1976. [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms, IEEE Software Configuration and Maintenance Workshop, 1996. [6] G.S. Fowler, D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, Accepted for publication in Software Practice & Experience, 1999. 12. AUTHOR'S ADDRESS Kiem-Phong Vo (main contact) AT&T Labs, Room D223 180 Park Avenue Florham Park, NJ 07932 Phone: 973-360-8630 Email: kpv@research.att.com David G. Korn AT&T Labs, Room D237 180 Park Avenue Florham Park, NJ 07932 Phone: 973-360-8602 Email: dgk@research.att.com -18- RFC 4 VCDIFF Generic Differencing and Compression Data Format June 2001 13. APPENDIX: SECTION LAY-OUT IN A DELTA FILE For reference convenience, the lay-out of the various sections in a delta file are summarized below. An indented section means further refinement of the item above it. Header (0x56 | (1<<7)) - uchar_t (0x43 | (1<<7)) - uchar_t (0x44 | (1<<7)) - uchar_t (0) - uchar_t Indicator - uchar_t [Secondary compressor - uchar_t] [Length of instruction code table data - uint_t] [Instruction code table data] Size of near cache - uchar_t Size of same cache - uchar_t Compressed table data Window1 Indicator - uchar_t [Source segment size - uint_t] [Source segment position - uint_t] Length of the delta encoding - uint_t The delta encoding Length of target window - uint_t Indicator - uchar_t Length of ADD and RUN data section - uint_t Length of instructions section - uint_t Length of COPY addresses section - uint_t Data section for ADD and RUN instructions Instructions section Addresses section for COPY instructions Window2 ... 14. EXPIRATION DATE December 29, 2001 -19-