Network Working Group                                      Keld Simonsen
INTERNET-DRAFT                                   Danish Unix Users Group
                                              Philippe-Andre Prindeville
                                                           Telecom Paris
                                                         4th August 1991


                          Mnemonic Text Format


Status of the Memo

This memo specifies an encoding format that permits the exchange of tex-
tual messages consisting of characters from a wide range of interna-
tional scripts, including Latin, Cyrillic, Greek, Arabic, Hebrew, Kata-
kana, Hiragana and some special characters.

The encoding is designed so that originating and receiving end-user
equipment have a good chance of communicating intelligibly even when
their capabilities differ, because a readable, unambiguous mnemonic
fallback is defined. Although this memo specifies ways of doing
character set conversions, character sets other than ASCII or its
proper subset ICS are not allowed for general Internet use with the
specifications in this memo.

The memo supplements and is conformant with the set defined in [1], and
in [2], where the format specified in this memo is known as
Text/Quoted-Readable.  The memo uses the definitions about characters
and coded character sets as specified in [3].

Distribution of this memo is unlimited.  This draft document will be
submitted to the RFC editor for adoption as a protocol specification.
Please send any comments to Keld Simonsen <Keld.Simonsen@dkuug.dk>.

1.  Introduction

As the Internet grows in size, the number and scope of users increases.
TCP/IP has become a player in the pan-European networking arena [4], and
the number of other networks that the Internet connects to is also
mounting.  In short, the Internet is becoming Internationalized.  With
this expanding circle of users, the breadth of their needs similarly
increases.

One of the most popular services of the Internet, electronic mail
(email), is also perhaps one of the least adequate to meet this new
demand.  Solutions for addressing and gatewaying have been conceived and
implemented, but email still bears the constraint that messages be
composed of the 7-bit ASCII graphical character set.  For non-Anglophones,
this is simply not adequate.  This memo defines techniques to meet this
international demand.

For the remainder of this document, we shall take the subset of ASCII
characters that have ISO, EBCDIC, and Teletext equivalents, and refer to


Simonsen & Prindeville                                          [Page 1]


INTERNET-DRAFT            Mnemonic Text Format               August 1991


it as ICS (Invariant Character Set, equivalent to invariant ISO 646
[5]).  We regard this as the minimal universal character set.  Its con-
tents are given in [3] as the ISO_646.basic:1983 character set.

2.  Considerations

When approaching the problem, we identified a few major considerations.
The solution:

- must render reasonable results on an ICS terminal;
  Not all users will have access to resources that can display the com-
  plete set (indeed, few are expected to have the full set available);
  still others will continue to use the ubiquitous ICS terminal.  In
  such instances, this encoding must yield acceptable results.

- must be extensible, to incorporate future insights;
  Work continues on the definition and cataloging of national character
  sets.  One fairly extensive list, ISO DIS 10646 [6], is being compiled
  at the time of this writing.  Symbols will probably be added in the
  future: at such time, they should be accommodated.  Therefore,
  expandability is needed.

- must work with existing MTAs and UAs;
  System software is costly and difficult to install; further, current
  mail addressing techniques offer little or impractical control of the
  routing of messages.  As a result, mail may be carried by obsolete
  Message Transfer Agents (MTAs).  Further, message encoding is a
  presentation-level service, and is best dealt with by the User Agent
  (UA).  User Agent software may also be difficult to change, so a solu-
  tion must be able to work with existing UAs.

- must align with internet methodology;
  As mentioned in the first point above, the user, implementor, or sys-
  tem administrator may not have access to adequate encoding/decoding or
  rendering facilities.  He may be obliged to view or enter/manipulate
  encoded text by hand.  In order to support this, an encoding format
  should be simple and intuitive.

- should interoperate with a broad range of systems;
  The current networking environment contains many diverse types of sys-
  tems with varied interchange formats (e.g.  BITNET, X.400, UUCP). To
  interoperate with the greatest number of them, exchange must be based
  on the most common assumptions: a limited character set, limited line
  lengths, etc.

- should be simple and unambiguous;
  Any solution that is to have widespread acceptance must be simple and
  unambiguous; indeed, the latter frequently precludes the former.

3.  Message Format

As in [2], the message exists as a series of parts, each part being a
group of lines containing characters in the character set employed.
Each part may or may not be encoded using this format; we concern our-
selves in this document solely with those that are.  Within the relevant
parts of the message, ordinary text may have occurrences of the follow-
ing sequence: an intro character, followed by a string of characters
that represent a character mnemonic, as described below.  Similarly, the
intro character may be doubled, indicating a single occurrence of the
respective symbol in decoded format.

This message format uses the headers "Content-Type:" and
"Content-Charset:" from [2] and is signified with the new header
"Mnemonic-Intro:" and the two optional headers "Orig-Content-Charset:"
and "Orig-Mnemonic-Intro:". Case is ignored in these headers.  A
description of these headers and their fields follows.

3.1.  Content-Type: Text

If the Content-Type header is present, it must have the first header
field as "Text" according to this specification. The Content-Type header
can also be omitted, as Text is the default content type.  The mnemonic
message format is intended to be read by the end user possibly without
further intervention.

3.2.  Content-Charset: charset

The charset is given as one of the coded character set names in [3] and
is the encoding used for the transport.  For general use on the Inter-
net, only "ASCII" and "ICS" are allowed.  ASCII is the recommended char-
acter set, while ICS will be very robust for traversing gateways, but it
will cause trouble for (amongst other things) source code for several
programming languages.  The use of other character sets is limited to
agreement between the communicating parties. When such an agreement has
been reached, or when a User Agent is operating in a character set other
than the transport character set, conversion of the message body part
is done according to the tables in [3]: characters occurring in both
encodings are simply transformed, and characters not existing in the
receiving coded character set are represented by the intro character of
the receiving coded character set plus the mnemonic from [3], as
described under the intro character.  The characters forming the
mnemonic are translated into the receiving code, which must have these
characters present.  An undefined character in the originating coded
character set is transformed into a question mark character.  The
Content-Type:-header is changed accordingly to reflect such conversion.
Conversion is not allowed if the content-type is not "Text" (which is
the default if the Content-Type header is missing).

An example of changing headers is the following: The UA runs in an 8-bit
character set:

      Content-Type: Text
      Content-Charset: ISO_8859-1
      Mnemonic-Intro: 29
      Orig-Content-Charset: ISO_8859-1
      Orig-Mnemonic-Intro: 29


The MTA converts it before sending it to the recipient:

      Content-Type: Text
      Content-Charset: ASCII
      Mnemonic-Intro: 38
      Orig-Content-Charset: ISO_8859-1
      Orig-Mnemonic-Intro: 29
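
The header change above can be sketched in code.  The following is a
minimal illustration only, not a normative implementation.  The names
convert_to_ascii and MNEMONICS are hypothetical, the three-entry table
is a tiny stand-in for the tables in [3], and the body is assumed to be
already decoded into a character string.  It is further assumed that no
mnemonic contains the ampersand character.

```python
# Tiny illustrative excerpt; the authoritative mnemonic table is in [3].
MNEMONICS = {"\u00e7": "c,", "\u00e9": "e'", "\u00e6": "ae"}


def convert_to_ascii(headers, body):
    """Return (new_headers, new_body) for transport as ASCII, intro 38."""
    old_intro = chr(int(headers["Mnemonic-Intro"]))
    new_intro = chr(38)  # ampersand

    def literal(ch):
        # Re-express one ordinary character for the ASCII transport.
        if ch == new_intro:
            return new_intro * 2      # doubled intro stands for one "&"
        if ord(ch) < 128:
            return ch                 # present in both sets: unchanged
        mnemonic = MNEMONICS.get(ch)
        if mnemonic is None:
            return "?"                # assumption: not in this excerpt
        return new_intro + mnemonic

    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != old_intro:
            out.append(literal(ch))
            i += 1
        elif body[i + 1:i + 2] == old_intro:
            out.append(literal(old_intro))  # doubled old intro: one literal
            i += 2
        else:
            out.append(new_intro)     # the mnemonic that follows is kept
            i += 1

    new_headers = dict(headers)
    # Record the original values only if the sender has not already.
    new_headers.setdefault("Orig-Content-Charset", headers["Content-Charset"])
    new_headers.setdefault("Orig-Mnemonic-Intro", headers["Mnemonic-Intro"])
    new_headers["Content-Charset"] = "ASCII"
    new_headers["Mnemonic-Intro"] = "38"
    return new_headers, "".join(out)
```

Applied to the first set of headers shown above, this sketch yields the
second set, and every ampersand in the original text is doubled so that
it cannot be mistaken for the new intro character.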


3.3.  Mnemonic-Intro: Intro

The intro character is given as the decimal value of the intro character
in the transport character set. The recommended value is 38 for the
ampersand (&) character in ASCII. Another common value is 29 for the
control character Group Separator, which may be convenient when
operating in some environments, as ordinary text is then not changed.
The intro character is used for introducing character mnemonics from
[3] when a character is not present in the mail transport character set
(as defined by "charset").  Character mnemonics longer than two
characters are surrounded by the underline character. The intro
character is doubled to represent one occurrence of itself.  Characters
in the mail transport character set are normally just represented with
their encoding, but may also be represented by the intro character and
the mnemonic encoding.  If the intro character is specified as 0 (zero),
it is omitted in the transport, giving more readable content, but
eliminating the possibility of reversibility and introducing an
information loss.
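
As an illustration of these rules, the following sketch encodes a
string for an ASCII transport.  It is not part of the specification.
The function name encode and the table MNEMONICS are hypothetical; the
two-character entries are a small stand-in for the table in [3], and
the long mnemonic "smile" is invented purely to show the underline
rule.  Replacing a character that has no entry with a question mark is
an assumption of this sketch.

```python
# Illustrative excerpt only; the authoritative table is in [3].
MNEMONICS = {
    "\u00e9": "e'",     # e with acute accent
    "\u00e7": "c,",     # c with cedilla
    "\u00e6": "ae",     # ae ligature
    "\u263a": "smile",  # hypothetical long mnemonic, not taken from [3]
}


def encode(text, intro=38):
    """Encode text for ASCII transport using the given intro value."""
    intro_ch = chr(intro) if intro else ""  # intro 0: omit the intro
    out = []
    for ch in text:
        if intro and ch == intro_ch:
            out.append(intro_ch * 2)        # doubled intro = itself
        elif ord(ch) < 128:
            out.append(ch)                  # present in the transport set
        else:
            mnemonic = MNEMONICS.get(ch)
            if mnemonic is None:
                out.append("?")             # assumption: no entry known
                continue
            if len(mnemonic) > 2:
                mnemonic = "_" + mnemonic + "_"  # long form in underlines
            out.append(intro_ch + mnemonic)
    return "".join(out)


print(encode("caf\u00e9 & th\u00e9"))  # intro 38: reversible
print(encode("caf\u00e9", intro=0))    # intro 0: readable but lossy
```

With intro 0 the output cannot be told apart from text that really
contained the letter "e" followed by an apostrophe, which is the
information loss described above.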

3.4.  Orig-Content-Charset: orig-charset

The orig-charset is given as the original character set name.  This may
be set by the sending User Agent before converting the message into a
character set suitable for transport.  If no orig-charset is specified,
the charset character set is used.

3.5.  Orig-Mnemonic-Intro: orig-intro

The orig-intro character is given as the original intro character as
used by the originating User Agent. The orig-charset and orig-intro may
be used to recreate the message in its original encoding.  If no orig-
intro character is specified, the intro character is used.  If the
orig-intro character is specified as 0 (zero), no intro character was
used in the original content.  On the other hand, having a non-zero
orig-intro value allows the user to generate characters that are not
available in the orig-charset.
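
The defaulting rules of sections 3.1 through 3.5 can be summarized in a
short sketch.  This is an illustration under assumptions, not a
definitive parser: the function name parse_mnemonic_headers and the keys
of the returned dictionary are hypothetical, header folding is not
handled, and the sketch assumes that "Content-Charset:" and
"Mnemonic-Intro:" are both present.

```python
def parse_mnemonic_headers(lines):
    """Collect the headers of this memo and apply their defaults."""
    fields = {}
    for line in lines:
        name, sep, value = line.partition(":")
        if sep:
            # Case is ignored in the header names.
            fields[name.strip().lower()] = value.strip()

    charset = fields["content-charset"]
    intro = int(fields["mnemonic-intro"])
    return {
        # Text is the default content type when the header is omitted.
        "content_type": fields.get("content-type", "Text"),
        "charset": charset,
        "intro": intro,
        # Without Orig- headers, the transport values are used.
        "orig_charset": fields.get("orig-content-charset", charset),
        "orig_intro": int(fields.get("orig-mnemonic-intro", intro)),
    }
```

A receiving User Agent could use the resulting orig_charset and
orig_intro values to decide how to recreate the message in its original
encoding.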

3.6.  Compatibility

If an application conforming to this specification interoperates with
other versions of this specification and encounters mnemonics that are
undefined in this specification, it shall leave the mnemonic as it is
coded. This provides for upward compatibility.
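
A decoder that follows this rule might look like the sketch below.  It
is offered as an illustration only.  The names decode and KNOWN are
hypothetical, the table is a three-entry stand-in for the one in [3],
and the sketch assumes that no two-character mnemonic begins with the
underline character.

```python
# Illustrative excerpt only; the authoritative table is in [3].
KNOWN = {"e'": "\u00e9", "c,": "\u00e7", "ae": "\u00e6"}


def decode(text, intro=38):
    """Decode mnemonics; unknown mnemonics are left exactly as coded."""
    intro_ch = chr(intro)
    out = []
    i = 0
    while i < len(text):
        if text[i] != intro_ch:
            out.append(text[i])          # ordinary character
            i += 1
        elif text[i + 1:i + 2] == intro_ch:
            out.append(intro_ch)         # doubled intro = itself
            i += 2
        else:
            if text[i + 1:i + 2] == "_":
                # Long mnemonic: runs up to the closing underline.
                close = text.find("_", i + 2)
                end = close + 1 if close != -1 else i + 1
                name = text[i + 2:close] if close != -1 else ""
            else:
                # Short mnemonic: the next two characters.
                end = min(i + 3, len(text))
                name = text[i + 1:end]
            if name in KNOWN:
                out.append(KNOWN[name])
            else:
                out.append(text[i:end])  # undefined: leave as coded
            i = end
    return "".join(out)
```

Because an undefined mnemonic passes through untouched, a later
application that knows the mnemonic can still decode it.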


3.7.  Examples of headers

Content-Charset: ASCII
Mnemonic-Intro: 38

4.  Acknowledgements

This memo was inspired by [1],[7], and [8], as well as by conversations
with Justin Bur of l'E'cole Polytechnique Federale de Lausanne (EPFL)
and people active within l'Association Franc,ais d'Utilisateurs Unix
(AFUU), and les Re'seaux Associe's pour la Recherche Europeen (RARE).

5.  REFERENCES


[1]
   D. Robinson, R. Ullman, "Encoding Header Field for Internet Mes-
   sages," RFC 1154, April 1990.


[2]
   Nathaniel Borenstein, Ned Freed, "Mechanism for Specifying and
   Describing the Format of Internet Message Bodies", Internet draft,
   June 1991.


[3]
   Keld Simonsen, "Character Mnemonics & Character Sets", Internet
   draft, August 1991.


[4]
   R. Blokzijl, "RIPE: IP coordination in Europe" in Computer Networks
   and ISDN Systems, Nos. 3-5, November 1990.


[5]
   ISO 646:1983 "Seven Bit Code for Information Interchange".


[6]
   ISO DIS 10646 "Universal Character Set Code (UCS)", ISO/IEC
   JTC1/SC2/WG3 N666, November 1990.


[7]
   M. Sirbu, "Content-Type Header Field for Internet Messages," RFC
   1049, March 1988.


[8]
   J.W. van Wingen, "Networks and Coded Character Sets" in Computer Net-
   works and ISDN Systems, Nos. 3-5, November 1990.

