INTERNET-DRAFT Adam M. Costello draft-ietf-idn-altdude-00.txt 2001-Mar-19 Expires 2001-Sep-19 AltDUDE version 0.0.2 Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html Distribution of this document is unlimited. Please send comments to the author at amc@cs.berkeley.edu, or to the idn working group at idn@ops.ietf.org. A non-paginated (and possibly newer) version of this specification may be available at http://www.cs.berkeley.edu/~amc/charset/altdude Abstract DUDE [DUDE01] by Mark Welter and Brian Spolarich is an ASCII-Compatible Encoding (ACE) of Unicode strings, and AltDUDE is a slight variation on it that is conceptually simpler. AltDUDE is a reversible map from a sequence of unsigned integers (intended to be Unicode code points) to a sequence of letters (A-Z, a-z), digits (0-9), and hyphen-minus (-), henceforth called LDH characters. Such a map might be useful for internationalized domain names [IDN], because host name labels are currently restricted to LDH characters by [RFC952] and [RFC1123]. Besides domain names, there might also be other contexts where it is useful to transform Unicode [UNICODE] code points (or any unsigned integers that exhibit locality) into "safe" (delimiter-free) ASCII characters. (If other contexts consider hyphen-minus to be unsafe, it can trivially be eliminated, or replaced by a different character, like underscore.) Contents Differences from DUDE Features Name Overview Base-32 characters Encoding procedure Decoding procedure Signature Case sensitivity models Comparison with other ACEs Example strings Security considerations Credits References Author Example implementation Differences from DUDE AltDUDE differs from DUDE in four respects: 1) DUDE computes the XOR of each integer and the previous in order to decide how many bits of each integer to encode, whereas AltDUDE encodes the XOR itself, so there is no need for a mask. 2) DUDE makes the first quintet of each sequence different from the rest, while AltDUDE makes the last quintet different, so it's easier for the decoder to detect the end of the sequence. 3) AltDUDE uses a base-32 map that avoids 0, 1, o, and l, to help humans avoid transcription errors. 4) AltDUDE uses 96 rather than 0 as the initial value of the previous code point. For domain names, this makes a few encodings one character shorter and makes none longer. Features Uniqueness: Every sequence of integers maps to at most one LDH string. Completeness: Every sequence of integers maps to an LDH string. Restrictions on which integers are allowed, and on sequence length, may be imposed by higher layers. Efficient encoding: The ratio of encoded size to original size is small. This is important in the context of domain names because [RFC1034] restricts the length of a domain label to 63 characters. Simplicity: The encoding and decoding algorithms are reasonably simple to implement. The goals of efficiency and simplicity are at odds; AltDUDE places greater emphasis on simplicity. Case-preservation: If a Unicode string has been case-folded prior to encoding, it is possible to record the case information in the case of the letters in the encoding, allowing a mixed-case Unicode string to be recovered if desired, but a case-insensitive comparison of two encoded strings is equivalent to a case-insensitive comparison of the Unicode strings. This feature is optional; see section "Case sensitivity models". Name AltDUDE is a working name that should be changed if it is adopted. Rather than waste good names on experimental proposals, let's wait until one proposal is chosen, then assign it a good name. Suggestions (assuming the primary use is in domain names): DUDE (if the DUDE authors wish to adopt this algorithm) UniHost UTF-D ("D" for "domain names") UTF-33 (there are 33 characters in the output repertoire) Overview AltDUDE encodes unsigned integers as characters, although implementations will of course need to represent the output characters somehow, usually as bytes or other code units. When AltDUDE is used to encode Unicode characters, the integers are the corresponding Unicode code points, not UTF-16 surrogates. Each integer is represented by an integral number of characters in the encoded string. There is no intermediate bit string or octet string. Integers with value 45 are represented by hyphen-minus characters (45 is the Unicode code point for hyphen-minus). Each non-hyphen-minus character in the encoded string represents five bits (a "quintet"). A sequence of quintets represents the bitwise XOR between each non-45 integer and the previous one. The exception for 45 and hyphen-minus is useful for domain names, but could be dropped in other contexts, or replaced by a different exception. Base-32 characters "a" = 0 = 0x00 = 00000 "s" = 16 = 0x10 = 10000 "b" = 1 = 0x01 = 00001 "t" = 17 = 0x11 = 10001 "c" = 2 = 0x02 = 00010 "u" = 18 = 0x12 = 10010 "d" = 3 = 0x03 = 00011 "v" = 19 = 0x13 = 10011 "e" = 4 = 0x04 = 00100 "w" = 20 = 0x14 = 10100 "f" = 5 = 0x05 = 00101 "x" = 21 = 0x15 = 10101 "g" = 6 = 0x06 = 00110 "y" = 22 = 0x16 = 10110 "h" = 7 = 0x07 = 00111 "z" = 23 = 0x17 = 10111 "i" = 8 = 0x08 = 01000 "2" = 24 = 0x18 = 11000 "j" = 9 = 0x09 = 01001 "3" = 25 = 0x19 = 11001 "k" = 10 = 0x0A = 01010 "4" = 26 = 0x1A = 11010 "m" = 11 = 0x0B = 01011 "5" = 27 = 0x1B = 11011 "n" = 12 = 0x0C = 01100 "6" = 28 = 0x1C = 11100 "p" = 13 = 0x0D = 01101 "7" = 29 = 0x1D = 11101 "q" = 14 = 0x0E = 01110 "8" = 30 = 0x1E = 11110 "r" = 15 = 0x0F = 01111 "9" = 31 = 0x1F = 11111 The digits "0" and "1" and the letters "o" and "l" are not used, to avoid transcription errors. All decoders must recognize both the uppercase and lowercase forms of the base-32 characters. The case may or may not convey information, as described in section "Case sensitivity models". Encoding procedure All ordering of nybbles and quintets is big-endian (most significant first). A nybble is 4 bits. XOR is bitwise exclusive or. let prev = 96 for each input integer n (in order) do begin if n == 45 then output hyphen minus else begin let diff = prev XOR n extract the least significant nybbles of diff, as few as are sufficient to hold all the nonzero bits (but at least one) prepend 0 to the last nybble and 1 to the rest output base-32 characters corresponding to the quintets let prev = n end end The encoder must either correctly handle all integer values that can be represented in the type of its input, or it must check whether the input contains values that it cannot handle and return an error if so. Under no circumstances may it produce incorrect output. Decoding procedure let prev = 96 while the input string is not exhausted do begin if the next character is hyphen-minus then output 45 else begin input characters and convert them to quintets until encountering a quintet beginning with 0 fail upon encountering a non-base-32 character or end-of-input strip the first bit of each quintet concatenate the resulting nybbles to form diff let prev = prev XOR diff output prev end end encode the output sequence and compare it to the input string fail if they are not equal The comparison at the end must be case-insensitive if ACEs are always compared case-insensitively (which is true of domain names), case-sensitive otherwise. See also section "Case sensitivity models". This check is necessary to guarantee the uniqueness property (there cannot be two distinct encoded strings representing the same sequence of integers). This check also frees the decoder from having to check for overflow while decoding the base-32 characters. Signature The issue of how to distinguish ACE strings from unencoded strings is largely orthogonal to the encoding scheme itself, and is therefore not specified here. In the context of domain name labels, a standard prefix and/or suffix (chosen to be unlikely to occur naturally) would presumably be attached to ACE labels. (In that case, it would probably be good to forbid the encoding of Unicode strings that appear to match the signature, to avoid confusing humans about whether they are looking at a Unicode string or an ACE string.) In order to use AltDUDE in domain names, the choice of signature must be mindful of the requirement in [RFC952] that labels never begin or end with hyphen-minus. The raw encoded string will begin or end with a hyphen-minus iff the Unicode string does. If the Unicode strings are forbidden from beginning or ending with hyphen-minus (which seems prudent anyway), then there is no problem. Otherwise, the signature must consist of both a prefix and a suffix. It appears that "---" is extremely rare in domain names; among the four-character prefixes of all the second-level domains under .com, .net, and .org, "---" never appears at all. Therefore, perhaps the signature should be of the form ?--- (prefix) or ---? (suffix), where ? could be "u" for Unicode, or "i" for internationalized, or "a" for ACE, or maybe "q" or "z" because they are rare. Case sensitivity models The higher layer must choose one of the following four models. Models suitable for domain names: * Case-insensitive: Before a string is encoded, all its non-LDH characters must be case-folded so that any strings differing only in case become the same string (for example, strings could be forced to lowercase). Folding LDH characters is optional. The case of base-32 characters and literal-mode characters is arbitrary and not significant. Comparisons between encoded strings must be case-insensitive. The original case of non-LDH characters cannot be recovered from the encoded string. * Case-preserving: The case of the Unicode characters is not considered significant, but it can be preserved and recovered, just like in non-internationalized host names. Before a string is encoded, all its non-LDH characters must be case-folded as in the previous model. LDH characters are naturally able to retain their case attributes because they are encoded literally. The case attribute of a non-LDH character is recorded in the last of the base-32 characters that represent it, which is guaranteed to be a letter rather than a digit. If the base-32 character is uppercase, it means the Unicode character is caseless or should be forced to uppercase after being decoded (which is a no-op if the case folding already forces to uppercase). If the base-32 character is lowercase, it means the Unicode character is caseless or should be forced to lowercase after being decoded (which is a no-op if the case folding already forces to lowercase). The case of the other base-32 characters in a multi-quintet encoding is arbitrary and not significant. Only uppercase and lowercase attributes can be recorded, not titlecase. Comparisons between encoded strings must be case-insensitive, and are equivalent to case-insensitive comparisons between the Unicode strings. The intended mixed-case Unicode string can be recovered as long as the encoded characters are unaltered, but altering the case of the encoded characters is not harmful--it merely alters the case of the Unicode characters, and such a change is not considered significant. In this model, the input to the encoder and the output of the decoder can be the unfolded Unicode string (in which case the encoder and decoder are responsible for performing the case folding and recovery), or can be the folded Unicode string accompanied by separate case information (in which case the higher layer is responsible for performing the case folding and recovery). Whichever layer performs the case recovery must first verify that the Unicode string is properly folded, to guarantee the uniqueness of the encoding. It is not very difficult to extend the nameprep algorithm [NAMEPREP03] to remember case information. The case-insensitive and case-preserving models are interoperable. If a domain name passes from a case-preserving entity to a case-insensitive entity, the case information will be lost, but the domain name will still be equivalent. This phenomenon already occurs with non-internationalized domain names. Models unsuitable for domain names, but possibly useful in other contexts: * Case-sensitive: Unicode strings may contain both uppercase and lowercase characters, which are not folded. Base-32 characters must be lowercase. Comparisons between encoded strings must be case-sensitive. * Case-flexible: Like case-preserving, except that the choice of whether the case of the Unicode characters is considered significant is deferred. Therefore, base-32 characters must be lowercase, except for those used to indicate uppercase Unicode characters. Comparisons between encoded strings may be case-sensitive or case-insensitive, and such comparisons are equivalent to the corresponding comparisons between the Unicode strings. Comparison with other ACEs The differences between AltDUDE and DUDE were given in section "Differences from DUDE". For a comparison between DUDE and other ACEs, please see the AMC-ACE-O specification [AMCACEO00]. Example strings The first several examples are all translations of the sentence "Why can't they just speak in ?" (courtesy of Michael Kaplan's "provincial" page [PROVINCIAL]). Word breaks and punctuation have been removed, as is often done in domain names. (A) Arabic (Egyptian): U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644 U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F AltDUDE: yueqpcycrcyjhbpznpitjycxf (B) Chinese (simplified): U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587 AltDUDE: w85gvk7g9k2iwf6x9j6x7ju54k (C) Czech: Proprostnemluvesky = U+010D = U+011B = U+00ED AltDUDE: tActptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (D) Hebrew: U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8 U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2 U+05D1 U+05E8 U+05D9 U+05EA AltDUDE: x5nckajvjpvnpenqpcvjvbevrvdvjvbvd (E) Hindi: U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939 U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947 U+0939 U+0948 U+0902 (Devanagari) AltDUDE: 3wrtgmzjxnuqgthyfymygxfxiycyewjuktbzjwcuqyhzjkupvbydzq\ zbwk (F) Japanese: U+306A U+305C U+307F U+3093 U+306A U+65E5 U+672C U+8A9E U+3092 U+8A71 U+3057 U+3066 U+304F U+308C U+306A U+3044 U+306E U+304B (kanji and hiragana) AltDUDE: vsskvgud8n9jxx2ru6j875c54sn548d54ugvbuj6d8guqukuf (G) Korean: U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774 U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74 U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C (Hangul syllables) AltDUDE: 6txiy79ny53nz79a8wizwwnzzuavyizv3atuuiz2vby27jz66iz8si\ tusauiyz5i23az96iz6ze3xaz2td96ry3si (H) Russian: U+041F U+043E U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440 U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A U+0438 (Cyrillic) AltDUDE: wxRbzjzcjzrzfdmdffigpnnzqrpzpbzqdcazmc (I) Spanish: PorqunopuedensimplementehablarenEspaol = U+00E9 = U+00F1 AltDUDE: tAtrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmMtgdtb\ 3a3qd (J) Taiwanese: U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587 AltDUDE: w85gt86huuudv69c7szp7s5a6w4h6w2hu54k (K) Vietnamese: Taisaohokhngthchi\ noitingVit = U+0323 = U+00F4 = U+00EA = U+0309 = U+0301 AltDUDE: tEtfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\ yitptp2dv8mvyrjtBtr2dv6jvxh The next several examples are all names of Japanese music artists, song titles, and TV programs, just because the author happens to have them handy (but Japanese is useful for providing examples of single-row text, two-row text, ideographic text, and various mixtures thereof). (L) 3B (Japanese TV program title) = U+5E74 (kanji) = U+7D44 (kanji) = U+91D1 U+516B U+5148 U+751F (kanji) AltDUDE: xdx8whx8tGz7ug863f6s5kuduwxh (M) -with-SUPER-MONKEYS (Japanese music group name) = U+5B89 U+5BA4 U+5948 U+7F8E U+6075 (kanji) AltDUDE: x58jupu8nuy6gt99m-yssctqtptn-tMGFtFtH-tRCBFQtNK (N) Hello-Another-Way- (Japanese song title) = U+305D U+308C U+305E U+308C U+306E (hiragana) = U+5834 U+6240 (kanji) AltDUDE: Ipjad-Qrbtmtnpth-Ftgti-vsue7b7c7c8cy2xkv4ze (O) 2 (Japanese TV program title) = U+3072 U+3068 U+3064 (hiragana) = U+5C4B U+6839 (kanji) = U+306E (hiragana) = U+4E0B (kanji) AltDUDE: vstctkny6urvwzcx2xhz8yfw8vj (P) MajiKoi5 (Japanese song title) = U+3067 (hiragana) = U+3059 U+308B (hiragana) = U+79D2 U+524D (kanji) AltDUDE: PnmdvssqvssNegvsva7cvs5qz38hu53r (Q) de (Japanese song title) = U+30D1 U+30D5 U+30A3 U+30FC (katakana) = U+30EB U+30F3 U+30D0 (katakana) AltDUDE: vs5bezgxrvs3ibvs2qtiud (R) (Japanese song title) = U+305D U+306E (hiragana) = U+30B9 U+30D4 U+30FC U+30C9 (katakana) = U+3067 (hiragana) AltDUDE: vsvpvd7hypuivf4q The last example is an ASCII string that breaks not only the existing rules for host name labels but also the rules proposed in [NAMEPREP03] for internationalized domain names. (S) -> $1.00 <- AltDUDE: -xqtqetftrtqatatn- Security considerations Users expect each domain name in DNS to be controlled by a single authority. If a Unicode string intended for use as a domain label could map to multiple ACE labels, then an internationalized domain name could map to multiple ACE domain names, each controlled by a different authority, some of which could be spoofs that hijack service requests intended for another. Therefore AltDUDE is designed so that each Unicode string has a unique encoding. However, there can still be multiple Unicode representations of the "same" text, for various definitions of "same". This problem is addressed to some extent by the Unicode standard under the topic of canonicalization, but some text strings may be misleading or ambiguous to humans when used as domain names, such as strings containing dots, slashes, at-signs, etc. These issues are being further studied under the topic of "nameprep" [NAMEPREP03]. Credits AltDUDE reuses a number of preexisting techniques. The basic encoding of integers to nybbles to quintets to base-32 comes from UTF-5 [UTF5], and the particular variant used here comes from AMC-ACE-M [AMCACEM00]. The idea of avoiding 0, 1, o, and l in base-32 strings was taken from SFS [SFS]. From DUDE (of which the latest version is [DUDE01]) comes the idea of encoding differences between successive integers. The idea of using the alphabetic case of base-32 characters to record the desired case of the Unicode characters was suggested by this author, but in DUDE it was first applied it to the UTF-5-style encoding. References [AMCACEM00] Adam Costello, "AMC-ACE-M version 0.1.0", 2001-Feb-12, draft-ietf-idn-amc-ace-m-00. [AMCACEO00] Adam Costello, "AMC-ACE-O version 0.0.3", 2001-Mar-19, draft-ietf-idn-amc-ace-o-00. [DUDE01] Mark Welter, Brian Spolarich, "DUDE: Differential Unicode Domain Encoding", 2001-Mar-02, draft-ietf-idn-dude-01. [IDN] Internationalized Domain Names (IETF working group), http://www.i-d-n.net/, idn@ops.ietf.org. [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation of Internationalized Host Names", 2001-Feb-24, draft-ietf-idn-nameprep-03. [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page", http://www.trigeminal.com/samples/provincial.html. [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host Table Specification", 1985-Oct, RFC 952. [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", 1987-Nov, RFC 1034. [RFC1123] Internet Engineering Task Force, R. Braden (editor), "Requirements for Internet Hosts -- Application and Support", 1989-Oct, RFC 1123. [SFS] David Mazieres et al, "Self-certifying File System", http://www.fs.net/. [UNICODE] The Unicode Consortium, "The Unicode Standard", http://www.unicode.org/unicode/standard/standard.html. [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*. Author Adam M. Costello http://www.cs.berkeley.edu/~amc/ See also the authors of DUDE [DUDE01]. Example implementation /******************************************/ /* altdude.c 0.0.2 (2001-Mar-19-Sun) */ /* Adam M. Costello */ /******************************************/ /* This is ANSI C code (C89) implementing AltDUDE */ /* (draft-ietf-idn-altdude-00), a simplified variant */ /* of DUDE (draft-ietf-idn-dude-01). */ /************************************************************/ /* Public interface (would normally go in its own .h file): */ #include enum altdude_status { altdude_success, altdude_invalid_input, altdude_output_too_big }; enum case_sensitivity { case_sensitive, case_insensitive }; #if UINT_MAX >= 0x1FFFFF typedef unsigned int u_code_point; #else typedef unsigned long u_code_point; #endif enum altdude_status altdude_encode( unsigned int input_length, const u_code_point *input, const unsigned char *uppercase_flags, unsigned int *output_size, char *output ); /* altdude_encode() converts Unicode to AltDUDE (without any */ /* signature). The input must be represented as an array */ /* of Unicode code points (not code units; surrogate pairs */ /* are not allowed), and the output will be represented as */ /* null-terminated ASCII. The input_length is the number of code */ /* points in the input. The output_size is an in/out argument: */ /* the caller must pass in the maximum number of characters */ /* that may be output (including the terminating null), and on */ /* successful return it will contain the number of characters */ /* actually output (including the terminating null, so it will be */ /* one more than strlen() would return, which is why it is called */ /* output_size rather than output_length). The uppercase_flags */ /* array must hold input_length boolean values, where nonzero */ /* means the corresponding Unicode character should be forced */ /* to uppercase after being decoded, and zero means it is */ /* caseless or should be forced to lowercase. Alternatively, */ /* uppercase_flags may be a null pointer, which is equivalent */ /* to all zeros. The encoder always outputs lower case base-32 */ /* characters except when nonzero values of uppercase_flags */ /* require otherwise. The return value may be any of the */ /* altdude_status values defined above; if not altdude_success, */ /* then output_size and output may contain garbage. On success, */ /* the encoder will never need to write an output_size greater */ /* than input_length*k+1 if all the input code points are less */ /* than 1 << (4*k), because of how the encoding is defined. */ enum altdude_status altdude_decode( enum case_sensitivity case_sensitivity, char *scratch_space, const char *input, unsigned int *output_length, u_code_point *output, unsigned char *uppercase_flags ); /* altdude_decode() converts AltDUDE (without any signature) to */ /* Unicode. The input must be represented as null-terminated */ /* ASCII, and the output will be represented as an array of */ /* Unicode code points. The case_sensitivity argument influences */ /* the check on the well-formedness of the input string; it */ /* must be case_sensitive if case-sensitive comparisons are */ /* allowed on encoded strings, case_insensitive otherwise. */ /* The scratch_space must point to space at least as large */ /* as the input, which will get overwritten (this allows the */ /* decoder to avoid calling malloc()). The output_length is */ /* an in/out argument: the caller must pass in the maximum */ /* number of code points that may be output, and on successful */ /* return it will contain the actual number of code points */ /* output. The uppercase_flags array must have room for at least */ /* output_length values, or it may be a null pointer if the case */ /* information is not needed. A nonzero flag indicates that the */ /* corresponding Unicode character should be forced to uppercase */ /* by the caller, while zero means it is caseless or should be */ /* forced to lowercase. The return value may be any of the */ /* altdude_status values defined above; if not altdude_success, */ /* then output_length, output, and uppercase_flags may contain */ /* garbage. On success, the decoder will never need to write */ /* an output_length greater than the length of the input (not */ /* counting the null terminator), because of how the encoding is */ /* defined. */ /**********************************************************/ /* Implementation (would normally go in its own .c file): */ #include /* Character utilities: */ /* is_AtoZ(c) returns 1 if c is an */ /* uppercase ASCII letter, zero otherwise. */ static unsigned char is_AtoZ(char c) { return c >= 65 && c <= 90; } /* base32[n] is the lowercase base-32 character representing */ /* the number n from the range 0 to 31. Note that we cannot */ /* use string literals for ASCII characters because an ANSI C */ /* compiler does not necessarily use ASCII. */ static const char base32[] = { 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, /* a-k */ 109, 110, /* m-n */ 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, /* p-z */ 50, 51, 52, 53, 54, 55, 56, 57 /* 2-9 */ }; /* base32_decode(c) returns the value of a base-32 character, in the */ /* range 0 to 31, or the constant base32_invalid if c is not a valid */ /* base-32 character. */ enum { base32_invalid = 32 }; static unsigned int base32_decode(char c) { if (c < 50) return base32_invalid; if (c <= 57) return c - 26; if (c < 97) c += 32; if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid; return c - 97 - (c > 108) - (c > 111); } /* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */ /* are equal, 1 otherwise. If case_sensitivity is case_insensitive, */ /* then ASCII A-Z are considered equal to a-z respectively. */ static int unequal( enum case_sensitivity case_sensitivity, const char *s1, const char *s2 ) { char c1, c2; if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0; for (;;) { c1 = *s1; c2 = *s2; if (c1 >= 65 && c1 <= 90) c1 += 32; if (c2 >= 65 && c2 <= 90) c2 += 32; if (c1 != c2) return 1; if (c1 == 0) return 0; ++s1, ++s2; } } /* altdude_initial_code_point is the initial value of the */ /* "previous" code point, before the first code point. */ static const u_code_point altdude_initial_code_point = 96; /* Encoder: */ enum altdude_status altdude_encode( unsigned int input_length, const u_code_point *input, const unsigned char *uppercase_flags, unsigned int *output_size, char *output ) { unsigned int next_in, next_out, max_out, n, out; u_code_point prev, codept, diff, tmp; char shift; prev = altdude_initial_code_point; max_out = *output_size; next_out = 0; for (next_in = 0; next_in < input_length; ++next_in) { codept = input[next_in]; if (codept == 45) { /* hyphen-minus stands for itself */ if (max_out - next_out < 1) return altdude_output_too_big; output[next_out++] = 45; continue; } shift = uppercase_flags && uppercase_flags[next_in] ? 32 : 0; /* shift will determine the case of the last base-32 digit */ diff = prev ^ codept; for (tmp = diff >> 4, n = 1; tmp != 0; ++n, tmp >>= 4); /* n is the number of base-32 digits */ if (max_out - next_out < n) return altdude_output_too_big; /* Computing the base-32 digits in reverse order is easiest. */ /* Only the last base-32 digit has the high bit clear. */ out = next_out + n - 1; output[out] = base32[diff & 0xF] - shift; while (out > next_out) { diff >>= 4; output[--out] = base32[0x10 | (diff & 0xF)]; } next_out += n; prev = codept; } /* null terminator: */ if (max_out - next_out < 1) return altdude_output_too_big; output[next_out++] = 0; *output_size = next_out; return altdude_success; } /* Decoder: */ enum altdude_status altdude_decode( enum case_sensitivity case_sensitivity, char *scratch_space, const char *input, unsigned int *output_length, u_code_point *output, unsigned char *uppercase_flags ) { u_code_point prev, q, diff; const char *in; char c; unsigned int next_out, max_out, input_size, scratch_size; enum altdude_status status; prev = altdude_initial_code_point; max_out = *output_length; next_out = 0; in = input; for (c = *in; c != 0; ) { if (max_out - next_out < 1) return altdude_output_too_big; if (c == 45) { /* hyphen-minus stands for itself */ output[next_out] = 45; if (uppercase_flags) uppercase_flags[next_out] = 0; ++next_out; c = *++in; continue; } /* Base-32 sequence: */ diff = 0; do { q = base32_decode(c); if (q == base32_invalid) return altdude_invalid_input; diff = (diff << 4) | (q & 0xF); c = *++in; } while (q >> 4 == 1); /* case of last digit determines uppercase flag: */ if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(in[-1]); prev = output[next_out++] = prev ^ diff; } /* Re-encode the output and compare to the input: */ input_size = in - input + 1; scratch_size = input_size; status = altdude_encode(next_out, output, uppercase_flags, &scratch_size, scratch_space); if (status != altdude_success || scratch_size != input_size || unequal(case_sensitivity, scratch_space, input) ) return altdude_invalid_input; *output_length = next_out; return altdude_success; } /******************************************************************/ /* Wrapper for testing (would normally go in a separate .c file): */ #include #include #include #include /* For testing, we'll just set some compile-time limits rather than */ /* use malloc(), and set a compile-time option rather than using a */ /* command-line option. */ enum { unicode_max_length = 256, ace_max_size = 256, test_case_sensitivity = case_insensitive /* suitable for host names */ }; static void usage(char **argv) { fprintf(stderr, "%s -e reads big-endian UTF-32 and writes AltDUDE ASCII.\n" "%s -d reads AltDUDE ASCII and writes big-endian UTF-32.\n" "UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n" , argv[0], argv[0]); exit(EXIT_FAILURE); } static void fail(const char *msg) { fputs(msg,stderr); exit(EXIT_FAILURE); } static const char too_big[] = "input or output is too large, recompile with larger limits\n"; static const char invalid_input[] = "invalid input\n"; static const char io_error[] = "I/O error\n"; int main(int argc, char **argv) { enum altdude_status status; int r; if (argc != 2) usage(argv); if (argv[1][0] != '-') usage(argv); if (argv[1][2] != '\0') usage(argv); if (argv[1][1] == 'e') { u_code_point input[unicode_max_length]; unsigned char uppercase_flags[unicode_max_length]; char output[ace_max_size]; unsigned int input_length, output_size; int c0, c1, c2, c3; /* Read the UTF-32 input string: */ input_length = 0; for (;;) { c0 = getchar(); c1 = getchar(); c2 = getchar(); c3 = getchar(); if (ferror(stdin)) fail(io_error); if (c1 == EOF || c2 == EOF || c3 == EOF) { if (c0 != EOF) fail("input not a multiple of 4 bytes\n"); break; } if (input_length == unicode_max_length) fail(too_big); if ((c0 != 0 && c0 != 0x80) || c1 < 0 || c1 > 0x10 || c2 < 0 || c2 > 0xFF || c3 < 0 || c3 > 0xFF ) { fail(invalid_input); } input[input_length] = ((u_code_point) c1 << 16) | ((u_code_point) c2 << 8) | (u_code_point) c3; uppercase_flags[input_length] = (c0 >> 7); ++input_length; } /* Encode, and output the result: */ output_size = ace_max_size; status = altdude_encode(input_length, input, uppercase_flags, &output_size, output); if (status == altdude_invalid_input) fail(invalid_input); if (status == altdude_output_too_big) fail(too_big); assert(status == altdude_success); r = fputs(output,stdout); if (r == EOF) fail(io_error); return EXIT_SUCCESS; } if (argv[1][1] == 'd') { char input[ace_max_size], scratch[ace_max_size]; u_code_point output[unicode_max_length], codept; unsigned char uppercase_flags[unicode_max_length]; unsigned int output_length, i; /* Read the AltDUDE ASCII input string: */ fgets(input, ace_max_size, stdin); if (ferror(stdin)) fail(io_error); if (!feof(stdin)) fail(too_big); /* Decode, and output the result: */ output_length = unicode_max_length; status = altdude_decode(test_case_sensitivity, scratch, input, &output_length, output, uppercase_flags); if (status == altdude_invalid_input) fail(invalid_input); if (status == altdude_output_too_big) fail(too_big); assert(status == altdude_success); for (i = 0; i < output_length; ++i) { r = putchar(uppercase_flags[i] ? 0x80 : 0); if (r == EOF) fail(io_error); codept = output[i]; r = putchar(codept >> 16); if (r == EOF) fail(io_error); r = putchar((codept >> 8) & 0xFF); if (r == EOF) fail(io_error); r = putchar(codept & 0xFF); if (r == EOF) fail(io_error); } return EXIT_SUCCESS; } usage(argv); return EXIT_SUCCESS; /* not reached, but quiets compiler warning */ } INTERNET-DRAFT expires 2001-Sep-19