draft            X.400 use of extended character sets           Apr 92


                   X.400 use of extended character sets

                       Fri Nov  6 15:13:56 MET 1992


                         Harald Tveit Alvestrand
                               SINTEF DELAB
                    Harald.Alvestrand@delab.sintef.no


    Status of this Memo

    This draft document is being circulated for comment.

    If consensus is reached it may be submitted to the RFC editor as a
    Proposed Standard protocol specification, for use in X.400 in the
    Internet.

    Please send comments to the author, or to the RARE WG-MSG list
    <wg-msg@rare.nl>.

    The following text is required by the Internet-draft rules:

    This document is an Internet Draft.  Internet Drafts are working
    documents of the Internet Engineering Task Force (IETF), its
    Areas, and its Working Groups. Note that other groups may also
    distribute working documents as Internet Drafts.

    Internet Drafts are draft documents valid for a maximum of six
    months. Internet Drafts may be updated, replaced, or obsoleted by
    other documents at any time.  It is not appropriate to use
    Internet Drafts as reference material or to cite them other than
    as a "working draft" or "work in progress."

    Please check the I-D abstract listing contained in each Internet
    Draft directory to learn the current status of this or any other
    Internet Draft.


Alvestrand                 Expires May 6 93                   [Page 1]


    1.  Introduction

    Since 1988, X.400 has had the capacity for carrying a large number
    of different character sets in a message by using the body part
    "GeneralText" defined by ISO/IEC 10021-7.

    Since 1992, the Internet also has the means of passing around
    messages containing multiple character sets, by using the
    mechanism defined in RFC-MIME.

    This document defines a suggested method of using "GeneralText" in
    order to harmonize as much as possible the usage of this body
    part.


    2.  General principles

    2.1.  Goals

    The target of this memo is to define a way of using existing
    standards to achieve:


    (1)  in the short term, a standard for sending E-mail in the
         European languages (Latin letters with European accents,
         Greek and Cyrillic)

    (2)  in the medium term, extending this to cover the Hebrew and
         Arabic character sets

    (3)  in the long term, opening up true international E-mail by
         allowing the full character set specified in ISO-10646 to be
         used.


    The author believes that this document gives a specification that
    can easily accommodate the use of any character set in the ISO
    registry, and, by giving guidance rules for choosing character
    sets, will help interworking.



    2.2.  Families of character sets


    2.2.1.  ISO 6937/T.61

    ISO 6937 is a code technique used and recommended in T.51 and
    T.101 (Teletex and Videotex service) and in X.500, providing a
    repertoire of 333 characters from the Latin script by use of non-
    spacing diacritical marks. It corresponds closely to CCITT
    recommendation T.61.

    The problem with that technique is that the character stream comes
    in two modes, i.e., some characters are coded with one byte and some
    with two (composite characters). This makes information processing
    systems such as an E-mail UA or GW more complex.

    It is also not extensible to other languages like Korean or
    Chinese, or even Greek, without invoking the character set
    switching techniques of ISO 2022.


    2.2.2.  ISO 8859

    ISO 8859 defines a set of character sets, each suitable for use in
    some group of languages. Each character in ISO 8859 is coded in a
    single byte.

    There are currently 9 parts of ISO 8859, plus a "supplementary"
    set, registered as ISO IR 154. All languages using single-byte
    characters can be written in one or another of the ISO 8859 sets.
    There are sets covering Greek, Hebrew and Arabic.

    All the ISO 8859 sets include US ASCII as a subset. All use 8
    bits.

    ISO 8859 is regarded by many as a solution; for instance, the X
    windows system now comes with ISO-8859-1 as the "standard"
    character set, with the possibility of specifying others. But
    since the same applications often do not support character set
    switching within text, it is problematic to use these in a truly
    multilingual environment.  (Also, most fonts claiming to be "ISO-
    8859-1" in X11R5 are actually 7-bit fonts. The implied lie is very
    unfortunate.)


    It turns out to work fine, however, if the second language is
    English, since this can be written in all ISO 8859 sets.

    Parts 3 and 4 have not seen wide acceptance, and it is
    expected that they will be discarded. They should therefore not be
    used.


    2.2.3.  ISO 10646

    At the moment of writing, ISO 10646 has just been accepted as an
    International Standard. It is basically a 32-bit character set,
    with all of the currently used characters being numbered by the
    first 16 bits, leaving some room for expansion.

    It is not possible to use ISO 10646 as a normal character set,
    because it does not conform to the rules for usage of byte values
    set down in ISO 2022 and other places; it uses the "control space"
    for (parts of) graphic character codes.

    There are a number of ways to encode ISO 10646 characters "on the
    wire". There are methods within the ISO 2022 standard to switch to
    these, either as "other coding system without return" or as "other
    coding system with return" (that is, you can go back from it to
    the one you came from using an ISO 2022 escape sequence).

    The following registrations have been made:


    ISO 10646 UCS-2 Level 1 has been registered with ESC 2/5 2/15 4/0.
    ISO 10646 UCS-4 Level 1 has been registered with ESC 2/5 2/15 4/1.

    The following are applied for:

    Reg# Escape sequence  Standard/Sponsor   Description
    174  ESC 2/5 2/15 F   ISO/IEC 10646      UCS-2, Level 2
    175  ESC 2/5 2/15 F   ISO/IEC 10646      UCS-4, Level 2
    176  ESC 2/5 2/15 F   ISO/IEC 10646      UCS-2, Level 3
    177  ESC 2/5 2/15 F   ISO/IEC 10646      UCS-4, Level 3
    178  ESC 2/5 F        ISO/IEC 10646      UTF-1

    << NOTE: The registration numbers for UCS-2 level 1 and UCS-4
    level 1 are not known. Neither are the assigned final characters
    for the other sets. Information requested!>>


    This character set will become very important in the future, but
    at the moment, few systems are able to support this directly.

    The GeneralText body part can be used for carrying any of these
    character sets.
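
    The escape sequences in this memo are written in the column/row
    notation of ISO 2022, where "2/5" means column 2, row 5 of the
    code table. As an illustration only, the following Python sketch
    converts that notation to the bytes that would appear on the wire.
    The function names are hypothetical and are not taken from any
    standard.

```python
def position_to_byte(position: str) -> int:
    """Convert a column/row code position such as "2/15" to a byte value."""
    column, row = (int(part) for part in position.split("/"))
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError(f"bad code position: {position!r}")
    return column * 16 + row


def escape_sequence(notation: str) -> bytes:
    """Convert e.g. "ESC 2/5 2/15 4/0" to the corresponding bytes."""
    out = bytearray()
    for token in notation.split():
        out.append(0x1B if token == "ESC" else position_to_byte(token))
    return bytes(out)


print(escape_sequence("ESC 2/5 2/15 4/0").hex(" "))  # 1b 25 2f 40
```

    The same helper works for the designating sequences used later in
    this memo, for example "ESC 2/8 4/2".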


    2.3.  Body parts that can be used in X.400

    At the moment, no established way of transferring a full set of
    characters in X.400-based E-mail exists.  In the future, it is
    likely that a new body part, based on ISO 10646, will be
    available; it is, however, dangerous to try to specify this body
    part before ISO 10646 is final.

    In the short term, the deployed and available body parts are:


    (1)  IA5Text

    (2)  For X.400/84: ISO6937Text and Teletex

    (3)  For X.400/88: GeneralText

    IA5Text is the method of choice for E-mail that contains only
    characters from IA5 (equivalent to ASCII).

    The ISO6937Text body part is defined in the ISO DIS documents
    corresponding to X.400(84) [MOTIS-86]; these never became a
    standard, so they are now quite difficult to find. In principle
    it is limited to text that can be represented in ISO 6937. Since
    ISO 6937 refers to the ISO 2022 method of changing character
    sets, it is theoretically possible to use any ISO registered
    character set, but there is no facility for announcing the
    character sets used. This complicates interworking with equipment
    that does not support the same character sets.

    It is still, however, the only body part suitable for carrying
    non-paginated text in non-basic character sets in X.400(84).

    Teletex, which is identical in all versions of the X.400 standard,
    has the same problem of implicit ISO6937, but has the added
    problem that it also specifies a page format, with, for instance,
    a left margin of 5 character positions. This is often not
    desirable.

    The details of Teletex are specified in recommendation T.51 and
    its relatives.

    GeneralText is defined in ISO/IEC 10021-7, the part of [MOTIS] that
    corresponds to CCITT recommendation [X.420]. It is an Extended
    body part, so no modification to CCITT implementations is needed
    to carry it.

    GeneralText is suitable for interchange, since it has proper
    announcement facilities. It can use any number of character sets,
    and announces them both in the Encoded Information Types of the
    X.400 envelope and the parameters of the body part.

    We recommend this body part for carrying unformatted text in
    X.400/88.


    3.  GUIDELINES FOR THE GENERATION OF GENERALTEXT


    3.1.  Formal definition of GeneralText

    A GeneralText message is a byte stream that contains characters
    and character switching sequences according to [ISO 2022].

    The X.400 ASN.1 definition of the GeneralText body part is:


    general-text-body-part EXTENDED-BODY-PART-TYPE
        PARAMETERS GeneralTextParameters IDENTIFIED BY id-ep-general-text
        DATA       GeneralTextData
        ::= id-et-general-text

    GeneralTextParameters ::= SET OF CharacterSetRegistration

    CharacterSetRegistration ::= INTEGER (1..32767)

    GeneralTextData ::= GeneralString


    The definition is from ISO/IEC 10021-7 [MOTIS], Annex I, with
    modifications made in the MHS Implementors' Guide, version 8,
    chapter 3.6.3, bullet F130. It does not appear in the CCITT
    version of the standards.
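
    For readers less familiar with ASN.1, the shape of the body part
    can be modelled roughly as below. This is a sketch of the data
    model only, not an ASN.1 encoder; the class name is hypothetical,
    and the sample content assumes the ISO 8859-1 designations given
    later in this memo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneralTextBodyPart:
    # GeneralTextParameters ::= SET OF CharacterSetRegistration
    parameters: frozenset
    # GeneralTextData ::= GeneralString, kept here as the raw byte stream
    data: bytes

    def __post_init__(self):
        for registration in self.parameters:
            # CharacterSetRegistration ::= INTEGER (1..32767)
            if not isinstance(registration, int) or not 1 <= registration <= 32767:
                raise ValueError(f"bad registration number: {registration!r}")


part = GeneralTextBodyPart(
    parameters=frozenset({6, 100}),  # ISO-IR-6 and ISO-IR-100
    data=b"\x1b(B\x1b-A" + "bl\u00e5b\u00e6r".encode("iso8859_1"),
)
```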


    3.2.  Brief description of ISO 2022 character set switching

    There are 4 graphic character sets active at any time in a
    GeneralText message, called G0, G1, G2 and G3. In addition, there
    are 2 control character sets, called C0 and C1.

    At any moment, one of the sets G0-G3 is active in code positions
    2/1 to 7/14, and another is active in code positions 10/0 to
    15/15. The setting is achieved by so-called "locking shift"
    sequences.

    (Formally, code positions 2/0 and 7/15 are reserved for "space"
    and "DEL" respectively, and only 94-character character sets can
    be used in G0. In practice, this restriction is sometimes ignored.)

    Single characters from the non-active sets may be invoked by the
    use of "single shift" sequences.

    The control character sets always occupy the code positions 0/0 to
    1/15 (C0) and 8/0 to 9/15 (C1).

    The character sets currently active as G0-G3 and C0-C1 may be
    changed using "character set designating sequences".

    At the beginning of a GeneralText message, one must always assume
    that set 2 (IA5) is active as G0, shifted into the lower half,
    that set 1 (standard) is active as C0, and that no G1-G3 or C1 set
    is invoked. This is specified in the definition of "GeneralString"
    in [X.209], the definition of ASN.1 encoding (section 23.5.2).

    If this is not a suitable initial state, a message must always
    start with the necessary announcers and escape sequences to
    designate and invoke the character sets that are actually used.
    The character sets in use may be changed later in the message by
    use of escape sequences.

    The parameters of a GeneralText message always list all the
    character sets used, by quoting their ISO reference numbers.

    It is impossible to use a character set not registered with ISO in
    a GeneralText message.

    It is also impossible to decide on the true meaning of a byte in a
    GeneralText message without scanning the whole message for shift
    and escape sequences.
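
    The need to scan for escape sequences can be illustrated with a
    deliberately small Python sketch. It is an assumption-laden
    simplification: it understands only the single-byte designations
    ESC 2/8 F (G0), ESC 2/9 F and ESC 2/13 F (G1), ignores shift
    functions, and assumes G0 occupies the lower half and G1 the upper
    half. The function name and the labels it produces are
    hypothetical.

```python
ESC = 0x1B
# Intermediate byte of a designating sequence -> element it designates.
# Only these three single-byte forms are handled in this sketch.
DESIGNATES = {0x28: "G0", 0x29: "G1", 0x2D: "G1"}


def label_bytes(data: bytes):
    """Return (element, designation, byte) for every non-escape byte."""
    designated = {"G0": "initial", "G1": None}
    labelled = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            if i + 2 >= len(data) or data[i + 1] not in DESIGNATES:
                raise ValueError(f"unsupported escape sequence at offset {i}")
            element = DESIGNATES[data[i + 1]]
            # Remember the designation as "intermediate final" in hex.
            designated[element] = f"{data[i + 1]:02X} {data[i + 2]:02X}"
            i += 3
            continue
        if 0x20 <= byte <= 0x7F:
            labelled.append(("G0", designated["G0"], byte))
        elif byte >= 0xA0:
            labelled.append(("G1", designated["G1"], byte))
        else:
            labelled.append(("C0" if byte < 0x20 else "C1", None, byte))
        i += 1
    return labelled


latin1 = label_bytes(b"\x1b(B\x1b-A\xe5")
greek = label_bytes(b"\x1b(B\x1b-F\xe5")
print(latin1)  # the 0xE5 byte is tagged with designation "2D 41"
print(greek)   # the same byte value is tagged with designation "2D 46"
```

    The two sample inputs end in the same byte value, but the label
    attached to it differs, because the preceding escape sequences
    differ.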


    3.3.  How to use the character sets

    RECOMMENDATION:

    When the text to be rendered is representable in one of the
    character sets of ISO-8859, the G0 set should be set to ISO 646
    International Reference Version (1991), also called US-ASCII,
    ISO-IR-6.

    The older character set ISO-IR-2, ISO 646 IRV(1983), should NOT be
    used.  This means that the escape sequence ESC 2/8 4/2
    (designating ASCII as G0) should always occur at the beginning of
    the message.

    The G1 set should be set to the relevant ISO-8859 part. G2 and G3
    are not used.

    This corresponds to the first level of ISO 4873 usage.

    For the currently defined parts of ISO 8859, the character set
    designations are (relative to ISO 8859:1987):

    Part    ISO IR name             Escape sequence Remarks
                                    for G1 use

    1       ISO-IR-100              Esc 2D 41       West Europe
    2       ISO-IR-101              Esc 2D 42       East European
    3       ISO-IR-109              Esc 2D 43
    4       ISO-IR-110              Esc 2D 44
    5       ISO-IR-144              Esc 2D 4C       Cyrillic
    6       ISO-IR-127              Esc 2D 47       Arabic
    7       ISO-IR-126              Esc 2D 46       Greek
    8       ISO-IR-138              Esc 2D 48       Hebrew
    9       ISO-IR-148              Esc 2D 4D       Baltic, Turkish

    NOTE: The use of ISO 8859-3 and ISO 8859-4 is NOT recommended if
    other possibilities exist.


    The G1 set should be permanently shifted into the upper half of
    the code page.
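
    As a sketch of how a generator might follow this recommendation,
    the Python fragment below builds the byte stream and the list of
    registration numbers for text that fits one ISO 8859 part. The
    final bytes are copied from the table above; parts 3 and 4 are
    left out because they are not recommended. The function name is
    hypothetical, the coding of LS1R as ESC 7/14 is taken from ISO
    2022 rather than from this memo, and it is an assumption that
    Python's "iso8859_N" codecs match the registered sets.

```python
# Final bytes of the G1 designation ESC 2D <F>, from the table in this memo.
G1_FINAL = {1: 0x41, 2: 0x42, 5: 0x4C, 6: 0x47, 7: 0x46, 8: 0x48, 9: 0x4D}
# ISO registration numbers for the same parts.
REGISTRATION = {1: 100, 2: 101, 5: 144, 6: 127, 7: 126, 8: 138, 9: 148}

ASCII_AS_G0 = b"\x1b\x28\x42"  # ESC 2/8 4/2
LS1R = b"\x1b\x7e"             # ESC 7/14 (assumed coding, from ISO 2022)


def general_text_8859(text: str, part: int):
    """Return (registration numbers, byte stream) for one ISO 8859 part."""
    preamble = ASCII_AS_G0 + bytes([0x1B, 0x2D, G1_FINAL[part]]) + LS1R
    body = text.encode(f"iso8859_{part}")  # raises if the text does not fit
    return {6, REGISTRATION[part]}, preamble + body


registrations, stream = general_text_8859("bl\u00e5b\u00e6r", 1)
print(sorted(registrations), stream)
```

    Because the encode call raises UnicodeEncodeError when a character
    is outside the chosen part, a caller can use the failure as a
    signal to fall back to the rules for text that is not
    representable in a single ISO 8859 set.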

    When the text is not representable in one of the ISO-8859
    character sets, the following rules may be applied:


    (1)  If any Latin characters are used, keep IA5 as the G0 set.

    (2)  If a mainstream character set is used (Greek, Cyrillic,
         Hebrew, Arabic), designate this as the G1 character set, and
         permanently shift this into the upper half of the code page
         (LS1R).
         EXCEPTION: The Japanese community has a long tradition of
         switching between the Japanese 16-bit character set ISO-IR-87
         and USASCII as the G0 set. See [RFC-2022-JP] for details. If
         ISO-IR-87 is used, that technique should be used instead of
         the one recommended here.

    (3)  If occasional extensions to a character set that is basically
         Latin occur (like accents, national variants and so on), and
         these are available in a single character set, designate the
         relevant set as G2 and use single shift (SS2) to invoke
         characters from this character set.

         The ISO 8859 supplementary set, ISO-IR-154, is recommended
         for this purpose.

         This corresponds to the ISO 4873 "second level" application.

    (4)  If two non-Latin character sets are used, the second should
         be designated as G3, and shifted into the upper half of the
         code page by the use of Locking Shift 3 Right (LS3R).

         This corresponds to the ISO 4873 "third level" application.


    (5)  Character sets with floating accents, like ISO 6937, should be
         avoided where possible.

    (6)  The shifts changing the lower half of the code table (SI/SO,
         LS2 and LS3) should NOT be used.
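
    The Japanese exception in rule (2) can be observed with Python's
    built-in "iso2022_jp" codec. It is an assumption of this sketch
    that the codec matches the technique described in [RFC-2022-JP];
    the function name is hypothetical.

```python
def to_2022_jp(text: str) -> bytes:
    """Encode text by switching the G0 set between ASCII and ISO-IR-87."""
    return text.encode("iso2022_jp")


encoded = to_2022_jp("ABC \u65e5\u672c\u8a9e")
print(encoded)
```

    The output stays within 7 bits: the two-byte set is designated as
    G0 with ESC 2/4 4/2 before the Japanese characters, and ASCII is
    designated again with ESC 2/8 4/2 at the end.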


    RATIONALE: Keeping the G0 set reserved for ASCII will ensure that
    text in ASCII always has the same bit representation.

    The use of the upper code page for other scripts ensures that both
    text in these languages and text of this type mixed with English
    can be represented without the use of shift sequences.

    If the language and/or content of a text is completely unknown,
    chapter 5 gives an algorithm that may be used to decide upon the
    character sets. This might, for instance, be suitable for use at
    automatic mail gateways.

    NOTE: At the time of this writing, few applications that use ISO
    4873 level 2 and level 3 encoding exist. It has been estimated
    that implementing them in an application that already uses a rich
    repertoire of characters is a matter of programmer-days, not
    programmer-months, but this has not been proven.


    4.  GUIDELINES FOR THE RENDERING OF GENERALTEXT

    As a basic rule, one should NOT assume that any of the rules above
    are followed.

    A user agent capable of rendering GeneralText should:


    (1)  ALWAYS be able to identify and render characters in IA5, no
         matter how they are designated and invoked.

    (2)  ALWAYS be able to identify and render characters in the
         "native" character sets, no matter how they are designated
         and invoked.

    (3)  ALWAYS indicate the presence of characters that cannot be
         adequately represented on the current output device.

    (4)  NEVER render a character in an unknown or unrepresentable
         character set by displaying the character in the same bit
         position in the native character set.

    (5)  PREFERABLY be able to identify and render characters that are
         the same as characters in the "native" character sets, even
         though they are designated and invoked as part of other
         character sets.  This applies in particular to the
         "invariant" part of ISO 8859, parts 1 through 6.

    (6)  PREFERABLY be able to combine the floating accents of ISO
         6937 with their base characters for suitable rendering using
         the capabilities of the current output device.

    (7)  PREFERABLY be able to display text both in a mode using
         fallbacks for nonrenderable characters and in a mode
         designating nonrenderable characters as such.

    (8)  PREFERABLY be able to save the content of a GeneralText
         message to a file or other suitable media, saving all
         character set information, for later processing by other
         means.  It is not illegal to render the character set
         information into a different format; however, it should be
         noted that it is easy to lose vital information if the format
         chosen for representing character sets does not offer the
         possibility of referencing all character sets in the ISO
         registry of character sets.

    These requirements also apply to gateways that transform the
    message into some other format, for example a gateway that
    transforms a message into MIME using [RFC-2022-JP] for the
    purpose.
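
    Rules (3) and (4) above can be sketched as follows. The table of
    "native" sets is a hypothetical example for a user agent that can
    display ISO 8859-1 and ISO 8859-7; the use of U+FFFD as the
    "cannot be rendered" marker is likewise an assumption of the
    sketch, not a requirement of this memo.

```python
# Designations (intermediate and final byte, in hex) that this sketch
# can render, mapped to Python codec names. Hypothetical selection.
NATIVE = {"2D 41": "iso8859_1", "2D 46": "iso8859_7"}
UNRENDERABLE = "\ufffd"


def render_g1(byte: int, designation: str) -> str:
    """Render one upper-half byte, or mark it as unrenderable."""
    codec = NATIVE.get(designation)
    if codec is None:
        # Rule (4): never reuse the same bit position of a native set.
        return UNRENDERABLE
    try:
        return bytes([byte]).decode(codec)
    except UnicodeDecodeError:
        return UNRENDERABLE  # unassigned position in a known set


print(render_g1(0xE5, "2D 41"), render_g1(0xE5, "2D 46"), render_g1(0xE5, "2D 4C"))
```

    The third call uses the Cyrillic designation, which is not in the
    hypothetical native table, so the byte is flagged instead of being
    shown as the ISO 8859-1 character in the same position.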


    5.  RECOMMENDATION FOR SELECTION OF CHARACTER SETS


    5.1.  Algorithm for selection of character sets

    When one has text in which characters from several character sets
    occur, and wants to process this into a GeneralText document, it
    is often hard to guess which character sets to select.

    The following paragraphs give an algorithm that can be started at
    the beginning of a message and that, at the end of it, returns a
    set of character sets that can be used as G0..G3 character sets, OR
    an indication that the task is impossible.

    VARIABLES:


    UsedSets
         The set of character sets that MUST be used for this message

    UsableSets
         The set of character sets that MAY be used for this message.
         Each set also contains a counter for each character position.

    PossibleSets
         The set of all the character sets known to be usable in the
         destination format.

    ALGORITHM:

    1)   Add IA5 (ISO-IR-6) to the UsedSets (as G0).

    2)   Get the next character of the text.  If the text is
         completely analyzed, go to FINISHED

    3)   If it is in the UsedSets, go to 2).

    4)   Find the set of character sets from PossibleSets in which the
         character occurs. If it does not occur in any, report
         failure.

    5)   If it is in a single character set in PossibleSets only, add
         this set to UsedSets, and go to 2).

    6)   If it is in more than one character set, add these to
         UsableSets (if not already present), and increment the
         counter for that character in all the sets. Go to 2).

    FINISHED)

    1)   (FINAL SELECTION) Remove any character set in UsedSets from
         UsableSets.

         Zero the counters for any character in UsableSets that also
         occurs in UsedSets.
         WHILE (more characters left)
           Select one character set and move it from UsableSets to UsedSets.
           Zero the counters for all characters in this set in the other
           UsableSets.
         END WHILE
         This step can be "tuned" any way you want, for instance by
         choosing the character sets most likely to be understood at
         the destination first, choosing the character sets covering
         the most characters first, avoiding multi-byte character sets
         as long as possible, or any other scheme suitable for the
         application.
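
    The steps above can be sketched in Python as shown below. This is
    one possible reading of the algorithm, not a reference
    implementation. The names are hypothetical, the PossibleSets table
    is a small sample built from Python's codecs (assumed to match the
    registered sets), a single counter per character replaces the
    per-set counters, and the final selection is "tuned" to pick the
    set covering the most remaining characters first.

```python
from collections import Counter


def repertoire(codec: str) -> frozenset:
    """Characters a single-byte codec defines in positions 0xA0-0xFF."""
    chars = set()
    for value in range(0xA0, 0x100):
        try:
            chars.add(bytes([value]).decode(codec))
        except UnicodeDecodeError:
            pass  # unassigned position
    return frozenset(chars)


IA5 = frozenset(chr(c) for c in range(0x20, 0x7F))
# PossibleSets: a hypothetical sample, listed in order of preference.
POSSIBLE = {
    "ISO-IR-100": repertoire("iso8859_1"),
    "ISO-IR-101": repertoire("iso8859_2"),
    "ISO-IR-144": repertoire("iso8859_5"),
    "ISO-IR-148": repertoire("iso8859_9"),
}


def select_sets(text: str, possible=POSSIBLE):
    """Return the names of the sets to use, or None on failure."""
    used = {"ISO-IR-6": IA5}                     # step 1
    counters = Counter()                         # characters found in several sets
    for ch in text:                              # step 2
        if ord(ch) < 0x20:
            continue                             # C0 controls need no G set
        if any(ch in chars for chars in used.values()):
            continue                             # step 3
        homes = [name for name, chars in possible.items() if ch in chars]
        if not homes:
            return None                          # step 4: failure
        if len(homes) == 1:
            used[homes[0]] = possible[homes[0]]  # step 5
        else:
            counters[ch] += 1                    # step 6
    # FINAL SELECTION: forget characters that are already covered.
    for ch in list(counters):
        if any(ch in chars for chars in used.values()):
            del counters[ch]
    while counters:
        # Pick the unused set that covers the most remaining occurrences.
        best = max(
            (name for name in possible if name not in used),
            key=lambda name: sum(n for c, n in counters.items() if c in possible[name]),
        )
        used[best] = possible[best]
        for ch in [c for c in counters if c in possible[best]]:
            del counters[ch]
    return list(used)


print(select_sets("bl\u00e5b\u00e6r"))
```

    With the sample table, the letters in the example occur in both
    ISO-IR-100 and ISO-IR-148, so the choice is made in the final
    selection step; ties go to the set listed first.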

    5.2.  WHAT TO DO ON FAILURE

    Failure will occur in this scheme if a character is found that is
    not in the PossibleSets. It may then be handled in one of the
    following ways:

    (1)  Replace the character with the SUB control character

    (2)  Replace the character with Keld Simonsen Mnemonics. This is a
         reversible transformation as long as the recipient is aware
         that it has been used, but requires passing out-of-band
         information to indicate this.

    (3)  Replace the lost characters with any suitable fallback or
         mnemonic scheme intended for human understanding

    (4)  Bounce the message/refuse the conversion/give up.

    The action to be taken may be different based on the percentage of
    "lost" characters.

    If the message has "controls" like "conversion with loss
    prohibited", only the last possibility may be used.
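
    A minimal sketch of options (1) and (4), assuming a single target
    character set and a hypothetical "loss_prohibited" flag standing
    in for the message control, might look like this in Python:

```python
SUB = b"\x1a"  # the SUB control character


def convert_with_fallback(text: str, codec: str, loss_prohibited: bool = False):
    """Return (converted bytes, number of characters replaced by SUB)."""
    out = []
    lost = 0
    for ch in text:
        try:
            out.append(ch.encode(codec))
        except UnicodeEncodeError:
            if loss_prohibited:
                # Option (4): refuse the conversion.
                raise ValueError(f"cannot convert {ch!r} without loss")
            out.append(SUB)  # option (1)
            lost += 1
    return b"".join(out), lost


converted, lost = convert_with_fallback("na\u00efve \u017e", "iso8859_1")
print(converted, lost)
```

    The count of lost characters is returned so that a caller can
    compare it with the length of the text and choose a different
    action when the percentage is too high.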


    5.3.  RECOMMENDED CHARACTER SETS

    There are 2 steps in the algorithm above that are left for local
    judgement:

    (1)  Selection of the sets to appear in PossibleSets.

    (2)  The algorithm for deciding which character set to select in
         the FINAL SELECTION step.

    In the context of generating X.400 GeneralText messages, the
    following is recommended:


    Sets in PossibleSets:
    ISO-IR-6        Esc 28 42 (G0)  US-ASCII, IA5, ISO646
    ISO-IR-100      Esc 2D 41 (G1)  ISO-8859-1      West Europe
    ISO-IR-101      Esc 2D 42 (G1)  ISO-8859-2      Central/Eastern Europe
    ISO-IR-144      Esc 2D 4C (G1)  ISO-8859-5      Cyrillic
    ISO-IR-127      Esc 2D 47 (G1)  ISO-8859-6      Arabic
    ISO-IR-126      Esc 2D 46 (G1)  ISO-8859-7      Greek
    ISO-IR-138      Esc 2D 48 (G1)  ISO-8859-8      Hebrew
    ISO-IR-148      Esc 2D 4D (G1)  ISO-8859-9      Baltic/Nordic/Turkish

    The following multi-byte character sets are recommended:

    ISO-IR-87 (Japanese JIS C6226-1983)     Esc 24 29 42 (G1)
    ISO-IR-149 (Korean KS C 5601-1989)      Esc 24 29 43 (G1)
    ISO-IR-58 (Chinese GB 2312-80)          Esc 24 29 41 (G1)

    It is a STRONG recommendation that character sets not listed
    above, which do not add any new characters to the total set of
    characters given by the character sets above, should NOT be used
    in X.400 interchange.

    ISO-IR-87 is the Japanese character set that is allowed in a
    Teletex string, such as the subject field.

    NOTE: ISO-IR-87 has been "superseded" by ISO-IR-168, which allows
    two extra Kanji characters. Any application that handles ISO-IR-87
    should also be able to handle ISO-IR-168.

    Algorithm for selecting character sets:

    Start at the top of the list above, and add each set only if it is
    needed.


    5.4.  Selecting a character set based on language

    If the most common language of the environment in which it is used
    is known, the following character sets are recommended.

    The table of Latin-script languages is based on work by Johan van
    Wingen <BUTPAA@rulmvs.leidenuniv.nl>. The others are best
    guesses by the author.

    The tables of character sets prepared by Keld Jorn Simonsen
    <keld@dkuug.dk> (RFC-KELD) were invaluable in matching the data on
    languages to the data on character sets.

    Again, these are intended for guidance, not enforcement; there is
    considerable prestige attached to such recommendations in other
    contexts, and it is therefore likely that each language group will
    make appropriate decisions on this subject. The table below is
    intended as a compilation of existing knowledge, again on the
    principle that it is better to say something than to say nothing.

    The language codes (for those languages that have codes) come from
    ISO 639.

    NOTE: ISO 639 is a very incomplete list of the world's languages
    (perhaps 10 or 20 % according to some experts), and is undergoing
    revision. The only reason for using it is that it is the only
    ISO-standardized shorthand notation for languages available at the
    moment.

    Language                  1   2   3   4   5
    ------------------------------------------------------------
    sq Albanian               X   X   X
    eu Basque                 X       X
    br Breton                 X
    hr Croatian                   X
    cs Czech                      X
    da Danish                 X
    eo Esperanto                          X
    fo Faeroese               X
    fi Finnish                X   X   X
    fy Frisian                X
    ?? Gaelic                 X
    gl Galician               X       X
    de German                 X       X
    hu Hungarian                  X
    is Icelandic              X
    ga Irish                  X   X   X
    it Italian                X
    no Norwegian              X       X
    pl Polish                     X
    pt Portuguese             X
    ?? Rhaetian               X
    ro Romanian                   X
    sk Slovak                     X
    sl Slovenian                  X
    ?? Sorbian                    X
    es Spanish                X       X
    sv Swedish                X       X
    tr Turkish                        X

    Explanation of character set codes
    ----------------------------------------
     1: ISO_8859-1:1987
     2: ISO_8859-2:1987
     3: ISO_8859-9:1989
     4: ISO_8859-supp
     5: ISO_8859-2:1987 and ISO_8859-supp


    Other languages for which appropriate character sets are known are
    listed in the table below.

    Language        Character set

    ar Arabic       ISO-8859-6
    be Byelorussian ISO-8859-5
    bg Bulgarian    ISO-8859-5
    el Greek        ISO-8859-7
    en English      USASCII
    fa Persian      ISO-8859-6
    iw Hebrew       ISO-8859-8
    ja Japanese     ISO-IR-87 (Japanese JIS C6226-1983)
    ko Korean       ISO-IR-149 (Korean KS C 5601-1989)
    la Latin        USASCII
    lo Laotian      ISO-IR-166
    ru Russian      ISO-8859-5
    sw Swahili      USASCII
    th Thai         ISO-IR-166
    uk Ukrainian    ISO-8859-5
    ur Urdu         ISO-8859-6
    vo Volapuk      ISO-8859-1
    zh Chinese      ISO-IR-58 (Chinese GB 2312-80)

    Additional entries in this table are welcome!

    Some languages have only one or a few characters missing. These
    are listed below.


    Language        Character set           Missing

       Sami         ISO-8859-9              C with caron
                                            D with stroke
                                            I with diaeresis
                                            N with acute
                                            Eng
                                            S with caron
                                            T with stroke
                                            Z with caron
    kl Greenlandic  ISO-8859-1              I with tilde
                                            K with cedilla
                                            U with tilde
    cy Welsh        ISO-8859-1              W with acute
                                            W with grave
                                            W with diaeresis
                                            Y with grave
                                            Y with circumflex
    nl Dutch        ISO-8859-1              Ligature IJ
    af Afrikaans    ISO-8859-1              N preceded by apostrophe
    fr French       ISO-8859-1              Ligature OE
    ca Catalan      ISO-8859-1              L with middle dot

    According to comments received, the "problem characters" for
    Dutch, Afrikaans, French, Greenlandic and Catalan are not in
    common use, or may be avoided by use of alternate spelling (like
    using "ij" instead of the "Ligature IJ").

    For French, Dutch, Catalan and Afrikaans, the character set ISO
    6937-2, which uses floating diacritical marks, contains all
    required characters.
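
    Entries in the "Missing" column can be checked against sample text
    with a few lines of Python. The sketch assumes that Python's
    "iso8859_1" codec matches ISO 8859-1; the function name is
    hypothetical.

```python
def missing_characters(sample: str, codec: str):
    """List, in order of first appearance, the characters the codec lacks."""
    missing = []
    for ch in sample:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# French sample containing the ligature OE in both cases.
print(missing_characters("\u0152uvre, c\u0153ur", "iso8859_1"))
```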

    The following languages can (to the author's limited knowledge) be
    written with the current ISO 10646 standard, but with no other
    registered character sets:


    Language               Country(ies)             Script(s)

    aa Afar                 Somalia, Ethiopia, Djibouti     Latin
    ab Abkhazian            Georgia                         Cyrillic
    am Amharic              Ethiopia                        Ethiopic
    as Assamese             India, Nepal                    Bengali
    ay Aymara               Bolivia, Peru, Chile            Latin
    az Azerbaijani          SNC, Iran, Iraq, Turkey         Cyrillic, Arabic
    ba Bashkir              SNC                             Cyrillic
    bh Bihari               India                           Gujarati (or Kaithi)
    bi Bislama              Vanuatu, New Caledonia          Latin
    bn Bengali              India                           Bengali
    co Corsican             France                          Latin
    fj Fiji                 Fiji                            Latin
    gd Scots Gaelic         UK                              Latin
    gn Guarani              Paraguay                        Latin
    gu Gujarati             India                           Gujarati
    ha Hausa                Nigeria, Niger, Chad, Sudan,... Latin
    hi Hindi                India                           Devanagari
    hy Armenian             Armenia                         Armenian
    ia Interlingua          None (Artificial Language)      Latin
    ie Interlingue          None (Artificial Language)      Latin
    ik Inupiak              USA, Canada                     Latin, Cree
    in Indonesian           Indonesia                       Latin
    ji Yiddish              Germany, USA, SNC, Israel       Hebrew
    jw Javanese             Indonesia, Malaysia             Latin, Javanese
    ka Georgian             Georgia                         Georgian
    kk Kazakh               SNC, Afghanistan                Cyrillic, Arabic
    km Cambodian            Cambodia                        Khmer
    kn Kannada              India                           Kannada
    ks Kashmiri             India, Pakistan                 Arabic
    ku Kurdish              SNC, Turkey, Iraq, Iran         Cyrillic, Arabic
    ky Kirghiz              SNC, China, Afghanistan         Cyrillic, Arabic
    ln Lingala              CAR, Congo, Zaire               Latin
    mg Malagasy             Madagascar, Comoro Islands      Latin, Arabic
    mi Maori                New Zealand                     Latin
    mk Macedonian           Greece, Yugoslavia              Greek, Cyrillic
    ml Malayalam            India                           Malayalam
    mn Mongolian            Mongolia                        Cyrillic, Mongolian
    mo Moldavian            Romania                         Latin
    mr Marathi              India                           Devanagari
    ms Malay                Malaysia, Thailand              Latin
    my Burmese              Myanmar                         Burmese
    na Nauru                Nauru                           Latin
    ne Nepali               Nepal                           Devanagari
    oc Occitan              France                          Latin
    or Oriya                India                           Oriya
    pa Punjabi              India                           Gurmukhi
    ps Pashto (Western)     Afghanistan, Iran               Arabic
    qu Quechua              Peru                            Latin
    rm Rhaeto               Switzerland                     Latin
    rn Kirundi              Burundi, Uganda                 Latin
    rw Kinyarwanda          Rwanda, Uganda, Zaire           Latin
    sa Sanskrit             India                           Devanagari
    sd Sindhi               Pakistan, India, Afghanistan    Arabic, Gurmukhi
    sg Sangho               Central African Republic        Latin
    si Singhalese           Sri Lanka                       Sinhalese
    sm Samoan               Samoa, USA, New Zealand         Latin
    sn Shona                Zimbabwe, Zambia, Mozambique    Latin
    so Somali               Somalia, Ethiopia, Djibouti     Latin
    sr Serbian              former Yugoslavia               Cyrillic
    ss Siswati              S. Africa, Swaziland            Latin
    st Sesotho              S. Africa, Lesotho              Latin
    su Sundanese            Indonesia                       Latin
    ta Tamil                India, Malaysia                 Tamil
    te Telugu               India                           Telugu
    tg Tajik                Tajikistan                      Arabic
    ti Tigrinya             Ethiopia                        Latin, Ethiopic
    tk Turkmen              SNC, Iran, Afghanistan          Cyrillic, Arabic
    tl Tagalog              Philippines                     Latin
    tn Setswana             S. Africa, Botswana, Namibia    Latin
    to Tonga (3)            Mozambique                      Latin
    ts Tsonga               Mozambique, Swaziland           Latin
    tt Tatar                SNC                             Cyrillic
    tw Twi (Ewe)            Ghana                           Latin
    uz Uzbek (Southern)     Afghanistan, Turkey             Arabic
    vi Vietnamese           Vietnam, Cambodia, China        Latin
    wo Wolof                Senegal, Mauritania             Latin
    xh Xhosa                S. Africa                       Latin
    yo Yoruba               Nigeria, Togo, Benin            Latin
    zu Zulu                 S. Africa, Lesotho, Malawi      Latin


    The information about languages in ISO 10646 was kindly supplied
    by Glenn Adams <glenn@metis.com>.

    Languages for which the author does NOT know any proper character
    set include:


    bo Tibetan
    dz Bhutani
    et Estonian
    lt Lithuanian
    lv Latvian, Lettish
    mt Maltese
    sh Serbo-Croatian


    6.  REFERENCES


    [ISO 4873]
         <<title coming>> 1991 revision.  Replaces ISO 2022

    [ISO 8859]

    [ISO 6937]

    [ISO 639]

    [X.209]
         CCITT Recommendation X.209(1988): Specification of Basic
         Encoding Rules for Abstract Syntax Notation One (ASN.1).
         Technically aligned with ISO 8825 and ISO 8825/AD 1.

    [ISO 10646]

    [RFC-2022-JP]

    [RFC-KELD]


    7.  Missing items

    This section is intended as a memory aid for the author, and
    should be empty by the time the RFC is published.

    (1)  Get exact escape sequence information for ISO 10646

    (2)  Full titles in the references section

    (3)  Consider number of lines when listing extra chars in
         languages in cleartext

    (4)  Check Sami character set with Sami school

    (5)  Locate (Norwegian) editor of revision for ISO 639 and get
         language codes for Sorbian and Sami, if possible

    (6)  Add MOTIS properly to reference list

    (7)  Add Johan van Wingen's E-mail address

    (8)  Number and reference entry for RFC-KELD

    (9)  Check for references to/copies of Johan van Wingen's work

