INTERNET-DRAFT R. deSilva Intended status: Standard File: draft-ray-radx-00 25 February 2018 Expires: 25 August 2018 Low Overhead Compression For Short String Data Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html ABSTRACT Compression methods primarily rely on detecting patterns of redundancy in text strings, which means that a minimum of around 800 bytes of data are required before compression can occur. Most data fields in databases are far smaller than that, and therefore dictionary-based compression is not efficient. The method proposed begins by making assumptions about the data and accepting some compromises; which allows compression to begin from the very first byte. The discussion proves that this can result in 1) an average 30% compression, 2) a maximum 50% compression, and 3) no expansion at all (0%) of the input data. As an added benefit, all processing can be performed using bit shifting and masking, which drastically lowers the cost of compression; no division or multiplication operations are necessary. UTF-8 text is handled gracefully. OVERVIEW AND BACKGROUND The origin of RadX is the attempt by Digital Equipment Corporation in the 1970's to pack more text into the limited memory space available to computers in that era, particularly as DEC was supporting different classes of machines with different word sizes - some, like the PDP/8 series, were byte machines, but some used 32 or even 36 bits. DEC engineers basically used radix40 as the basis for a compressed character set which included the alphabet, the numerals, the space, and three other special characters (which varied by machine). They called it "RAD50", because in that era DEC were into octal arithmetic and every- thing was in three-bit groups. It worked - they packed three letters into every 16 bits (40 x 40 x 40 = 64000), but the glaring problem was that it didn't work for anything but filenames - the character set was too small. There were also performance issues. Using non-binary bases for character sets is not kind to ALUs; lots of division, modulus and multiplication are involved. Those who designed UTF-8 to represent Unicode in a flexible and fool- proof way had a similar problem. They decided on an approach in which all transformations could use bit-shifting instead of divide and multi- ply. We have done the same, which implies, of course, sticking to binary bases and radices. DISCUSSION The best compression methods seek out redundancy in text - but in order to do that, the analyzer needs a sufficiently large sample, which some experts have determined needs to exceed 800 bytes for random text. The most elegant solutions generate dictionaries on-the-fly, requiring only one pass through the data. If this method does not serve, then the best alternative is to make some assumptions about the data before it is ever seen, perhaps even restrict the universe of permitted values, and then dive right into a compression procedure from the very first byte. UTF-8 solved many encoding problems - backward compatibility with ASCII being the most important, but no less important is the ability to recover from a bad byte or sequence error, to skip illegal segments which may have come from some other Unicode encoding scheme, and so on. But the recognition that an algorithm cannot be arrogantly wasteful of CPU cycles is what has made it possible to utilize it without having to worry about the cost. IMPLEMENTATION RadX is a useful combination of Radix-32/64 compression. The data is primarily stored in Radix-32, which allows bit-shifting instead of divi- ding/multiplying to greatly simplify computation. This is combined with a check for double characters, which can be encoded effectively in 4 bits. Other ASCII characters are stored in 6 bits; UTF-8 characters are reduced to bare Unicode and stored in 24 bits maximum. A series of zero bytes is used for padding. As the input string is scanned from left to right, all disallowed Uni- code ranges are ignored. Control characters are treated as spaces, and all spaces consolidated. If the character is UTF-8, the Unicode page number is simply substituted for pages 00 to 0F. For codes from U+0080 to U+0FFF, only the last 8 bits need to be added, so an extra byte is used. For U+1000 onwards, two extra bytes are encoded. For safety, certain bit patterns are detected and skipped. If the character is not UTF-8, then it is simple ASCII - and Radix-32 compression is attempted. A special subset of 32 characters is defined as: " .0123456789ABCDEFGHILMNOPRSTUWY" These are space, period, the numerals, and the alphabet with the excep- tion of J, K, Q, V, X, and Z. The table indexing automatically upper- cases alphas and substitutes some punctuation. If doubled codes are found, Radix-64 encoding is used with a "double" flag; if three Radix- 32 encoding are found, that set is used; else a single Radix-64 byte is created. The bit allocation table below shows the final encodings. Bit allocation ============== 0 n xxxxxx Radix-64 ASCII, n=0 if doubled, 1 if single 0 000 0000 +2 bytes Unicode U+1000 to U+D7FF (or padding) 0 011 1xxx +1 byte Unicode U+0080 to U+07FF 0 111 xxxx +1 byte Unicode U+0800 to U+0FFF 0 111 1110 +1 byte Unicode repeats, diff. in last digit 0 111 1111 Identical repeat of entire last sequence 1 xxxxx yyyyy zzzzz Radix-32 triplet The great advantages of this system over DEC RAD50 are that 1) the computational overhead is low, since there is no need to use division and modulus repeatedly for each character - simple bit-shift is suffi- cient, and 2) there are no unmappable characters. The Internet requires UTF-8 support, and with RadX we can have our cake and eat it too; we can compress ASCII efficiently and yet not get fazed when UTF-8 unexpectedly pops up (or is introduced on purpose). In fact, quite a lot of UTF-8 is also compressed - Indic languages, for instance, such as Hindi and Tamil are also compressed by 30%. The following ranges are mapped: U+0080 to U+024F, and U+0300 to U+D7FF. Codes outside those ranges are deleted. Codes from U+0080 to U+0FFF are mapped using one extra byte; from U+1000 onwards, two extra bytes are required. This still represents a savings, or at least no penalty: A. A UTF-8 character is found which is near or identical to the previous one. It is now stored in 8 bits. Compression = 3:1 to 2:1 (66% to 50%) B. Double ASCII character pair found, map to Radix-64. 2 bytes are stored in 8 bits. Compression = 2:1 (50%) C. 3 characters found which map into Radix-32. 3 bytes are stored in 16 bits. Compression = 3:2 (33%) D. A UTF-8 character in the range U+0800 to U+0FFF is found; it occupies 3 byte in UTF-8. It is now stored in 16 bits. Compression = 3:2 (33%) E. An ASCII character is found, and mapped to Radix-64. 1 byte is stored in 8 bits. Compression = 1:1 (0%) F. A UTF-8 character is found in the range U+0080 to U+07FF; it occupies 2 bytes and is stored in 16 bits. Compression = 2:2 (0%) G. A UTF-8 character in the range U+1000 to U+D7FF occupies 3 bytes and is encoded in 24 bits. Compression = 3:3 (0%) That takes care of all the possible characters in ASCII and UTF-8. Any dangerous Unicode sequences are deleted as shown below. All the encodable sequences are first checked to see if they match the previous Unicode character; if they do, the single-byte Unicode repeat code is used. Bit Transformation Sequence (in processing order) ================================================= 0 ??????? -> indexed substitution table zzzzzz -> 00 zzzzzz [Note A] xxxxx yyyyy zzzzz -> 1 xxxxx yy yyy zzzzz [Note B] zzzzzz -> 01 zzzzzz [Note C] 10 ?????? -> delete [orphan continuation byte] 1100 000 -> delete [Note D] 1100 0 x yy 10 yy zzzz -> 0011 100x yyyy zzzz [Note E] 1100 1000 10 yy zzzz -> 0011 1010 00 yy zzzz [Note F] 1100 1001 10 00 zzzz -> 0011 1010 0100 zzzz [Note G] 1100 10 ?? -> delete [Note H] 110 xxx yy 10 yy zzzz -> 0011 1xxx yyyy zzzz [Note J] 1110 0000 10 0????? -> delete [illegal padding] 1110 0000 10 xxxx yy 10 yy zzzz -> 0111 xxxx yyyy zzzz [Note K] 1110 111 ? -> delete [U+E000 - U+FFFF] 1110 aaaa 10 xxxx yy 10 yy zzzz -> 0000 0000 aaaa xxxx yyyy zzzz [No.L] ???? ???? -> delete [U+10000 and above] Notes: [A] ASCII doubled char in Radix-64 [B] 3 ASCII chars in Radix-32 [C] ASCII single char in Radix-64 [D] U+007F or less - hidden ASCII [E] U+0080 to U+01FF [F] U+0200 to U+023F [G] U+0240 to U+024F [H] U+0250 - U+02FF - custom chars [J] U+0300 - U+07FF [K] U+0800 - U+0FFF [L] U+1000 to U+DFFF The substitution table below allows the alphabetic characters in Radix- 64 to be visually recognized as such in data files, so that there is some ability to verify directly. The Radix-32 set is mostly aligned with it to simplify debugging. In Radix-64, only codes 1 to 55 are used for ASCII; 0 and 56 to 63 are reserved for Unicode encoding prefixes. Radix-64 and Radix-32 Character Encoding Table ============================================== Char Rad32 Rad64 Char Rad32 Rad64 Char Rad32 Rad64 ==== ===== ===== ==== ===== ===== ==== ===== ===== 0 32 @ 55 ` 39 ! 10 33 A 1 1 a 1 1 " 34 B 2 2 b 2 2 # 35 C 3 3 c 3 3 $ 36 D 4 4 d 4 4 % 37 E 5 5 e 5 5 & 38 F 6 6 f 6 6 ' 39 G 7 7 g 7 7 ( 27 H 8 8 h 8 8 ) 29 I 9 9 i 9 9 * 40 J 10 j 10 + 41 K 11 k 11 , 42 L 12 12 l 12 12 - 43 M 13 13 m 13 13 . 10 10 N 14 14 n 14 14 / 28 O 15 15 o 15 15 0 11 44 P 16 16 p 16 16 1 17 45 Q 17 q 17 2 22 46 R 18 18 r 18 18 3 24 47 S 19 19 s 19 19 4 26 48 T 20 20 t 20 20 5 27 49 U 21 21 u 21 21 6 28 50 V 22 v 22 7 29 51 W 23 23 w 23 23 8 30 52 X 24 x 24 9 31 53 Y 25 25 y 25 25 : 54 Z 26 z 26 ; 54 [ 27 { 27 < 27 \ 28 | 28 = 43 ] 29 } 29 > 29 ^ 30 ~ 30 ? 10 33 _ 31 The reverse process is even simpler. A series of zeroes flags the end of the text, so processing can stop once that is encountered. Every other possible combination represents a compression mode that can be decoded and expanded. RadX To UTF-8 Regeneration Sequence (in processing order) ========================================================= 0000 0000 0000 ???? ???? ???? -> delete [space padding] 0000 0000 aaaa xxxx yyyy zzzz -> 1110 aaaa 10 xxxx yy 10 yy zzzz 0011 0xxx yyyy zzzz -> 110 xxx yy 10 yy zzzz 0011 1110 zzzz ZZZZ -> rpt prev UTF-8, subst last digit 0011 1111 -> repeat previous sequence 0011 1xxx yyyy zzzz -> 1110 0000 10 1xxx yy 10 yy zzzz 00 zzzzzz -> rrrrrrrr rrrrrrrr [reverse Radix-64] 01 zzzzzz -> rrrrrrrr [reverse Radix-64] 1 xxxxx yy yyy zzzzz -> rrrrrrrr rrrrrrrr rrrrrrrr [reverse Radix-32] The fourth sequence above is used when the current UTF-8 and also the next one, are almost the same as the last but differ in the final hex digit. The fifth sequence above is used when the entire current encoding is identical to the last. Special processing is done for word beginnings which are lower case - the preceding Radix-64 space character's "double" bit is set to indicate that the next letter should NOT be uppercased (0x20 vs. 0x60). (Double spaces are impossible, since they are eliminated before anything else.) The sequence 0x800? cannot normally occur, since that would represent the Radix-32 sequence for two spaces in a row; so this allows for a set of 32 special codes which have not been allocated, any may be used for future extension. USAGE EXAMPLE And finally, here is a worst-case scenario with the famous full-alpha- bet sentence. T H E Q U I C K B R O W N ..20+8+5... 32 17 ..21+9+3... 11 ..0+2+18... ..15+23+14.. F O X J U M P S O V E R ..0+6+15... 24 32 10 ..21+13+16.. ..19+0+15.. 22 ..5+18+0 T H E L A Z Y D O G . ..20+8+5... ...0+12+1... 26 ..25+0+4... ..15+7+10.. The three-number groups represent a two-byte condensation in Rad-32, and the lone numbers represent a fallback to Rad-64 using one byte. The original 44-byte sentence is reduced to 32, or a 27% reduction even without any doubled characters. This letter frequency assumption works for all European languages, with a slight loss for German, Italian and Polish. RESULTING FEATURES The result of applying the above procedures to text strings gives the following advantages: 1. Achieves a maximum compression of 50%. 2. Achieves an average compression of 33%. 3. Achieves a minimum compression of 0% (i.e., no expansion at all). 4. Compresses strings as short as 15 bytes. 5. Compresses UTF-8 in European and Indic ranges. 6. Does not compress or expand other Unicode ranges. 7. Uses only comparison and bit-shifting for CPU efficiency. Security Considerations None. IANA Considerations None. Author's Address D. Renuk de Silva 72 First Lane Ratmalana WP 10370 Sri Lanka Phone: +9478 622-7093 EMail: raydesilva@hotmail.com