Network Working Group                                      Keld Simonsen
INTERNET-DRAFT                                   Danish Unix Users Group
                                              Philippe-Andre Prindeville
                                                           Telecom Paris
                                                         4th August 1991


                          Mnemonic Text Format

Status of the Memo

This memo specifies an encoding format that permits the exchange of
textual messages consisting of characters from a wide range of
international scripts, including Latin, Cyrillic, Greek, Arabic,
Hebrew, Katakana, Hiragana and some special characters.

The encoding is designed so that originating and receiving end-user
equipment have a good chance of communicating intelligibly even when
their capabilities differ, because a readable, unambiguous mnemonic
fallback is defined.

Although this memo specifies ways of converting between character
sets, character sets other than ASCII or its proper subset ICS are not
allowed for general Internet use with the specifications in this memo.

This memo supplements and conforms to the set defined in [1] and in
[2], where the format specified in this memo is known as
Text/Quoted-Readable. The memo uses the definitions of characters and
coded character sets specified in [3].

Distribution of this memo is unlimited. This draft document will be
submitted to the RFC editor for adoption as a protocol specification.
Please send any comments to Keld Simonsen.

1. Introduction

As the Internet grows in size, the number and scope of its users
increase. TCP/IP has become a player in the pan-European networking
arena [4], and the number of other networks that the Internet connects
to is also mounting. In short, the Internet is becoming
internationalized.

With this expanding circle of users, the breadth of their needs
similarly increases. One of the most popular services of the Internet,
electronic mail (email), is also perhaps one of the least adequate to
meet this new demand.
Solutions for addressing and gatewaying have been devised and
implemented, but email is still constrained to messages composed in
the 7-bit ASCII graphical character set. For non-Anglophones, this is
simply not adequate.

This memo defines techniques to meet this international demand.

For the remainder of this document, we shall take the subset of ASCII
characters that have ISO, EBCDIC, and Teletext equivalents, and refer
to it as ICS (Invariant Character Set, equivalent to invariant ISO 646
[5]). We regard this as the minimal universal character set. Its
contents are given in [3] as the ISO_646.basic:1983 character set.

2. Considerations

When approaching the problem, we identified a few major
considerations. The solution:

- must render reasonable results on an ICS terminal;

  Not all users will have access to resources that can display the
  complete set (indeed, few are expected to have the full set
  available); still others will continue to use the ubiquitous ICS
  terminal. In such instances, this encoding must yield acceptable
  results.

- must be extensible, to incorporate future insights;

  Work continues on the definition and cataloging of national
  character sets. One fairly extensive list, ISO DIS 10646 [6], is
  being compiled at the time of this writing. Symbols will probably be
  added in the future; at such time, they should be accommodated.
  Therefore, expandability is needed.

- must work with existing MTAs and UAs;

  System software is costly and difficult to install; further, current
  mail addressing techniques offer little or impractical control of
  the routing of messages. As a result, mail may be carried by
  obsolete Message Transfer Agents (MTAs). Further, message encoding
  is a presentation-level service, and is best dealt with by the User
  Agent (UA). User Agent software may also be difficult to change, so
  a solution must be able to work with existing UAs.

- must align with Internet methodology;

  As mentioned in the first point above, the user, implementor, or
  system administrator may not have access to adequate
  encoding/decoding or rendering facilities. He may be obliged to view
  or enter/manipulate encoded text by hand. In order to support this,
  an encoding format should be simple and intuitive.

- should interoperate with a broad range of systems;

  The current networking environment contains many diverse types of
  systems with varied interchange formats (e.g. BITNET, X.400, UUCP).
  To interoperate with the greatest number of them, exchange must be
  based on the most common assumptions: a limited character set,
  limited line lengths, etc.

- should be simple and unambiguous;

  Any solution that is to have widespread acceptance must be simple
  and unambiguous; indeed, the latter frequently precludes the former.

3. Message Format

As in [2], the message exists as a series of parts, each part being a
group of lines containing characters in the character set employed.
Each part may or may not be encoded using this format; we concern
ourselves in this document solely with those that are. Within the
relevant parts of the message, ordinary text may have occurrences of
the following sequence: an intro character, followed by a string of
characters that represent a character mnemonic, as described below.
Similarly, the intro character may be doubled, indicating a single
occurrence of the respective symbol in decoded format.

This message format uses the headers "Content-Type:" and
"Content-Charset:" from [2] and is signified with the new header
"Mnemonic-Intro:" and the two optional headers "Orig-Content-Charset:"
and "Orig-Mnemonic-Intro:". Case is ignored in these headers. A
description of these headers and their fields follows.

3.1. Content-Type: Text

If the Content-Type header is present, its first header field must be
"Text" according to this specification. The Content-Type header can
also be omitted, as Text is the default content type. The mnemonic
message format is intended to be read by the end user, possibly
without further intervention.

3.2. Content-Charset: charset

The charset is given as one of the coded character set names in [3]
and is the encoding used for the transport. For general use on the
Internet, only "ASCII" and "ICS" are allowed. ASCII is the recommended
character set. ICS is very robust for traversing gateways, but it
causes trouble for (amongst other things) source code in several
programming languages. The use of other character sets is limited to
cases where the communicating parties have agreed on it.

When such an agreement has been achieved, or when a User Agent is
operating in a character set other than the transport character set,
conversion of the message body part is done according to the tables in
[3]: characters occurring in both encodings are simply translated, and
characters that do not exist in the receiving coded character set are
represented by the intro character of the receiving coded character
set followed by the mnemonic from [3], as described under the intro
character (Section 3.3). The characters forming the mnemonic are
translated into the receiving code, which must have these characters
present. An undefined character in the originating coded character set
is transformed into a question mark character. The Content-Charset:
and Mnemonic-Intro: headers are changed accordingly to reflect such a
conversion. Conversion is not allowed if the content type is not
"Text" (which is the default if the Content-Type header is missing).
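
The conversion rule above, together with the doubling and underline
rules given for the intro character in Section 3.3, can be sketched in
Python as follows. This is a minimal illustration under stated
assumptions, not part of the specification: the names encode, decode,
MNEMONICS and CHARACTERS are hypothetical; the mnemonic table holds
placeholder entries rather than values quoted from [3]; short
mnemonics are assumed to be exactly two characters long; a character
with no known mnemonic is treated as undefined and becomes a question
mark; and an intro value of 0 is not handled.

```python
# Placeholder mnemonic table for illustration only (NOT quoted from [3]).
MNEMONICS = {
    "\u00e9": "e'",    # e with acute
    "\u00e4": "a:",    # a with diaeresis
    "\u00f8": "o/",    # o with stroke
    "\u20ac": "euro",  # hypothetical mnemonic longer than two characters
}
CHARACTERS = {mnemonic: char for char, mnemonic in MNEMONICS.items()}


def encode(text, intro=38, transport="ascii"):
    """Convert text to the transport charset using intro + mnemonic."""
    mark = chr(intro)
    out = []
    for ch in text:
        if ch == mark:
            out.append(mark * 2)          # doubled intro = one literal intro
            continue
        try:
            ch.encode(transport)          # is ch present in the transport set?
            out.append(ch)
        except UnicodeEncodeError:
            mnemonic = MNEMONICS.get(ch)
            if mnemonic is None:
                out.append("?")           # assumption: unknown means undefined
            elif len(mnemonic) > 2:
                out.append(mark + "_" + mnemonic + "_")
            else:
                out.append(mark + mnemonic)
    return "".join(out)


def decode(data, intro=38):
    """Reverse encode(); undefined mnemonics are left exactly as coded."""
    mark = chr(intro)
    out = []
    i = 0
    while i < len(data):
        if data[i] != mark:
            out.append(data[i])
            i += 1
            continue
        rest = data[i + 1:]
        if rest.startswith(mark):         # doubled intro
            out.append(mark)
            i += 2
            continue
        if rest.startswith("_") and "_" in rest[1:]:
            name = rest[1:rest.index("_", 1)]
            length = len(name) + 3        # intro, two underlines, mnemonic
        else:
            name = rest[:2]               # assumption: two-character mnemonic
            length = 1 + len(name)
        out.append(CHARACTERS.get(name, data[i:i + length]))
        i += length
    return "".join(out)


wire = encode("Sm\u00f8rrebr\u00f8d & caf\u00e9")
print(wire)                                            # Sm&o/rrebr&o/d && caf&e'
print(decode(wire) == "Sm\u00f8rrebr\u00f8d & caf\u00e9")  # True
```

With intro=29 the same text is carried with the Group Separator in
place of the ampersand, and the literal "&" no longer needs doubling.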

An example of changing headers is the following:

The UA runs in an 8-bit character set:

   Content-Type: Text
   Content-Charset: ISO_8859-1
   Mnemonic-Intro: 29
   Orig-Content-Charset: ISO_8859-1
   Orig-Mnemonic-Intro: 29

The MTA converts it before sending it to the recipient:

   Content-Type: Text
   Content-Charset: ASCII
   Mnemonic-Intro: 38
   Orig-Content-Charset: ISO_8859-1
   Orig-Mnemonic-Intro: 29

3.3. Mnemonic-Intro: intro

The intro character is given as the decimal value of the intro
character in the transport character set. The recommended value is 38
for the ampersand (&) character in ASCII. Another common value is 29
for the control character Group Separator, which may be convenient in
some environments since it leaves ordinary text unchanged.

The intro character is used for introducing character mnemonics from
[3] when a character is not present in the mail transport character
set (as defined by "charset"). Character mnemonics longer than two
characters are surrounded by the underline character. The intro
character is doubled to represent one occurrence of itself. Characters
in the mail transport character set are normally just represented with
their encoding, but may also be represented by the intro character and
the mnemonic encoding.

If the intro character is specified as 0 (zero), it is omitted in the
transport. This gives more readable content, but makes the conversion
irreversible and introduces a loss of information.

3.4. Orig-Content-Charset: orig-charset

The orig-charset is given as the original character set name. This may
be set by the sending User Agent before converting the message into a
character set suitable for transport. If no orig-charset is specified,
the charset character set is used.

3.5. Orig-Mnemonic-Intro: orig-intro

The orig-intro character is given as the original intro character as
used by the originating User Agent.
The orig-charset and orig-intro may be used to recreate the message in
its original encoding. If no orig-intro character is specified, the
intro character is used. If the orig-intro character is specified as 0
(zero), no intro character was used in the original content. On the
other hand, having a non-zero orig-intro value allows the user to
generate characters that are not available in the orig-charset.

3.6. Compatibility

If an application conforming to this specification interoperates with
other versions of this specification and encounters mnemonics that are
undefined in this specification, it shall leave the mnemonic as it is
coded. This provides for upward compatibility.

3.7. Examples of headers

   Content-Charset: ASCII
   Mnemonic-Intro: 38

4. Acknowledgements

This memo was inspired by [1], [7], and [8], as well as by
conversations with Justin Bur of l'E'cole Polytechnique Federale de
Lausanne (EPFL) and people active within l'Association Franc,ais
d'Utilisateurs Unix (AFUU), and les Re'seaux Associe's pour la
Recherche Europeen (RARE).

5. References

[1] D. Robinson, R. Ullman, "Encoding Header Field for Internet
    Messages", RFC 1154, April 1990.

[2] N. Borenstein, N. Freed, "Mechanism for Specifying and Describing
    the Format of Internet Message Bodies", Internet draft, June 1991.

[3] K. Simonsen, "Character Mnemonics & Character Sets", Internet
    draft, August 1991.

[4] R. Blokzijl, "RIPE: IP coordination in Europe", in Computer
    Networks and ISDN Systems, Nos. 3-5, November 1990.

[5] ISO 646:1983, "Seven Bit Code for Information Interchange".

[6] ISO DIS 10646, "Universal Character Set Code (UCS)", ISO/IEC
    JTC1/SC2/WG3 N666, November 1990.

[7] M. Sirbu, "Content-Type Header Field for Internet Messages",
    RFC 1049, March 1988.

[8] J.W. van Wingen, "Networks and Coded Character Sets", in Computer
    Networks and ISDN Systems, Nos. 3-5, November 1990.