Network Working Group                                          C. Newman
Internet Draft: Multi-Lingual String Format                     Innosoft
Document: draft-ietf-acap-mlsf-00.txt                           May 1997
                                                   Expires in six months


                   Multi-Lingual String Format (MLSF)


Status of this memo

     This document is an Internet Draft.  Internet Drafts are working
     documents of the Internet Engineering Task Force (IETF), its Areas,
     and its Working Groups.  Note that other groups may also distribute
     working documents as Internet Drafts.

     Internet Drafts are draft documents valid for a maximum of six
     months.  Internet Drafts may be updated, replaced, or obsoleted by
     other documents at any time.  It is not appropriate to use Internet
     Drafts as reference material or to cite them other than as a
     "working draft" or "work in progress".

     To learn the current status of any Internet-Draft, please check the
     1id-abstracts.txt listing contained in the Internet-Drafts Shadow
     Directories on ds.internic.net, nic.nordu.net, ftp.isi.edu, or
     munnari.oz.au.

     A revised version of this draft document will be submitted to the
     RFC editor as a Proposed Standard for the Internet Community.
     Discussion and suggestions for improvement are requested.  This
     document will expire six months after publication.  Distribution of
     this draft is unlimited.


Abstract

     While UTF-8 [UTF8] solves most internationalization (I18N)
     problems, it fails to solve multilingualization (M17N) problems.
     The two basic problems with UTF-8 are that CJK unification fails
     to recognize glyph style differences between Chinese, Japanese
     and Korean, and that it is impossible to read UTF-8 text aloud to
     a blind person without knowing the language.

     Encoding language tagging in the coded character set itself can
     unnecessarily complicate processing which doesn't need language
     tags.  Encoding the language tagging at the application protocol
     level will add unnecessary complexity to every application protocol
     which needs multi-lingual support.  In addition, such higher level


Newman                                                          [Page 1]

Internet Draft        Multi-Lingual String Format               May 1997


     language support may fail to deal with mixed language strings and
     strings which have alternate representations in different
     languages.

     This specification uses unused octet sequences in UTF-8 as a
     framework to build a new encoding called MLSF (Multi-Lingual String
     Format) which supports mixed language strings and alternative
     language strings.  The goal is to make language tags easy to strip
     when unnecessary, easy to support when necessary, and to preserve
     the good searching characteristics of UTF-8 as much as possible.


1. Conventions used in this document

     The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY"
     in this document are to be interpreted as defined in "Key words for
     use in RFCs to Indicate Requirement Levels" [KEYWORDS].


2. MLSF simple form

     MLSF uses "Tags for the Identification of Languages" [LANG-TAGS] as
     the basis for language identification.

     Language tags are encoded by mapping them to upper-case, then
     adding hexadecimal A0 to each octet.  The result is broken up into
     groups of five octets followed by a final group of five or fewer
     octets.  Each group is prefixed by a UTF-8-style length count with
     the low bits set to 0.  See Appendix D for sample source code to
     perform this conversion.
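
     As an illustration of the encoding just described, the following
     C sketch encodes a tag with explicit bounds checking on the
     output buffer.  The name mlsf_encode_tag and its dstlen parameter
     are assumptions made for this example and are not defined by this
     specification; the input is assumed to be a valid RFC 1766 tag.
     Under these rules the tag "en" becomes the octets E0 E5 EE, and
     "en-US" becomes FC E5 EE CD F5 F3.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of this specification): encode an
 * RFC 1766 language tag as MLSF octets.
 * Returns the number of octets written, or 0 if dst is too small.
 * No terminating NUL is written.
 */
size_t mlsf_encode_tag(unsigned char *dst, size_t dstlen, const char *tag)
{
    /* UTF-8-style lead octets for groups of 1..5 tag characters,
     * with the low (payload) bits set to 0 */
    static const unsigned char lead[5] = { 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t remaining = strlen(tag);
    size_t out = 0;

    while (remaining > 0) {
        /* groups of five, then a final group of five or fewer */
        size_t group = remaining < 5 ? remaining : 5;
        size_t i;

        if (out + 1 + group > dstlen) return 0;
        dst[out++] = lead[group - 1];
        for (i = 0; i < group; ++i) {
            /* map to upper case, then add hexadecimal A0 */
            dst[out++] = (unsigned char)
                (toupper((unsigned char) *tag++) + 0xA0);
        }
        remaining -= group;
    }
    return out;
}
```

     A tag longer than five characters, such as "no-nynorsk", yields
     two groups: a five-character group led by FC followed by a second
     five-character group, again led by FC.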

     MLSF simple form is UTF-8 with embedded MLSF language tags.  An
     important observation is that a UTF-8 interpreter which silently
     ignores illegal characters will successfully process MLSF simple
     form strings.  MLSF simple form is defined by the MLSF-SIMPLE rule
     in section 7.  A quoted version of MLSF simple form is defined by
     the MLSF-SIMPLE-QUOTED rule.


3. MLSF alternative form

     An MLSF alternative form string may contain alternative
     representations of the same text in different primary languages.
     The octet with hexadecimal value FE is used to introduce a new
     alternative.  This MUST be followed by an MLSF language tag for
     the primary language of the alternative.

     The component of the MLSF string prior to the first FE octet is
     considered the "preferred" representation for the string.  This is
     the version which will be displayed by MLSF clients which choose
     not to support alternative representations.  The preferred
     representation MAY be prefixed by an MLSF language tag.

     MLSF alternative form is defined by the MLSF-ALT rule in section
     7.  A quoted version of MLSF alternative form is defined by the
     MLSF-ALT-QUOTED rule.
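
     For illustration, the sample string in the C sketch below carries
     a preferred English representation ("Hello") and a French
     alternative ("Bonjour"), and the function locates an alternative
     by scanning for FE octets.  The names mlsf_sample and
     mlsf_nth_alternative are assumptions made for this example and
     are not defined by this specification.

```c
#include <stddef.h>

/* "EN" tag (E0 E5 EE), "Hello", FE, "FR" tag (E0 E6 F2), "Bonjour".
 * The literal is split so that no hex escape absorbs a following
 * character such as 'B'. */
const unsigned char mlsf_sample[] =
    "\xE0\xE5\xEE" "Hello" "\xFE" "\xE0\xE6\xF2" "Bonjour";

/* Hypothetical helper: return a pointer to alternative number n in a
 * NUL terminated MLSF alternative form string, where 0 is the
 * preferred representation.  Returns NULL if the string has fewer
 * alternatives.  The result points at the alternative's language
 * tag when one is present.
 */
const unsigned char *mlsf_nth_alternative(const unsigned char *s,
                                          unsigned int n)
{
    while (n > 0) {
        /* FE never occurs inside UTF-8 text or an MLSF language tag */
        while (*s != '\0' && *s != 0xFE) {
            ++s;
        }
        if (*s == '\0') return NULL;
        ++s;                    /* step over the FE octet */
        --n;
    }
    return s;
}
```

     Because the FE octet cannot appear in UTF-8 text or in an encoded
     language tag, a plain octet scan is sufficient to split the
     alternatives.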


4. Minimal Support: downconverting MLSF to UTF-8

     Minimal support for MLSF requires the ability to downconvert MLSF
     to UTF-8.  This is a simple procedure which selects the preferred
     alternative and strips all language tags.  Sample code is included
     in Appendix B.  All UTF-8 strings which do not contain a 0 octet
     are also MLSF strings.
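
     The downconversion procedure can also be sketched without a
     lookup table.  The following C fragment is one possible
     illustration; the names seq_len and mlsf_strip are assumptions
     made for this example, and the input is assumed to be a NUL
     terminated MLSF string with dst at least as large as src.

```c
#include <stddef.h>

/* Octet count implied by a UTF-8-style lead octet; 0 if the octet
 * cannot start a sequence (continuation octet, FE or FF). */
static size_t seq_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if (c < 0xC0) return 0;
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    if (c < 0xF8) return 4;
    if (c < 0xFC) return 5;
    if (c < 0xFE) return 6;
    return 0;
}

/* Hypothetical helper: copy the preferred representation of src to
 * dst with all language tags removed.  Returns the result length. */
size_t mlsf_strip(unsigned char *dst, const unsigned char *src)
{
    size_t out = 0;

    while (*src != '\0') {
        size_t len = seq_len(*src);
        int is_tag;

        /* FE ends the preferred alternative; also stop on bad input */
        if (len == 0) break;
        /* a language tag has a second octet above hexadecimal C0 */
        is_tag = (len > 1 && src[1] > 0xC0);
        while (len > 0 && *src != '\0') {
            if (!is_tag) dst[out++] = *src;
            ++src;
            --len;
        }
    }
    dst[out] = '\0';
    return out;
}
```

     Applied to a string holding an "EN" tag, "Hello", an FE octet, an
     "FR" tag and "Bonjour", this sketch produces "Hello".  A plain
     UTF-8 string without a 0 octet passes through unchanged.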


5. MLSF MIME character sets

     The character set label "XXXX-simple" has been registered to
     indicate the use of MLSF simple form.  The character set label
     "XXXX-alt" has been registered to indicate the use of MLSF
     alternative form.

     MLSF may be used in conjunction with MIME header [MIME-HDR]
     encoding to permit language tagging and alternative representations
     in header fields.

     For single language MIME body parts, the UTF-8 character set with
     an appropriate Content-Language [LANG-TAGS] header SHOULD be used
     instead of MLSF.


6. Security Considerations

     Multi-Lingual String Format is not believed to have any security
     considerations beyond those for simple US-ASCII strings.  In
     particular, unfiltered display of certain US-ASCII control
     characters by a terminal emulator may result in modifying the
     behavior of the terminal emulator (e.g. by redefining function
     keys) such that security can be breached.  Programs which display
     text to a potentially insecure terminal emulator channel are
     encouraged to remove control characters to avoid these problems.
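
     A minimal C sketch of such a filter follows.  The name
     strip_ascii_controls is an assumption made for this example, as
     is the choice to remove every US-ASCII control character
     including TAB, CR and LF; an application may prefer to keep some
     of these.  Filtering octet by octet is safe because every octet
     of a multi-octet UTF-8 sequence or MLSF language tag has its high
     bit set.

```c
#include <stddef.h>

/* Hypothetical helper: remove US-ASCII control characters
 * (hexadecimal 01..1F and 7F) from a NUL terminated string, in place.
 * Returns the length of the filtered string.
 */
size_t strip_ascii_controls(unsigned char *s)
{
    unsigned char *dst = s;
    const unsigned char *src;

    for (src = s; *src != '\0'; ++src) {
        if (*src >= 0x20 && *src != 0x7F) {
            *dst = *src;
            ++dst;
        }
    }
    *dst = '\0';

    return (size_t) (dst - s);
}
```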


7. Formal Grammar

     This section defines the formal grammar for MLSF using Augmented
     BNF [ABNF] notation.

     MLSF-ALT           = [[MLSF-LANG-TAG] MLSF-COMPONENT
                           *(MLSF-ALTERNATE MLSF-COMPONENT)]

     MLSF-ALT-QUOTED    = <"> [[MLSF-LANG-TAG] MLSF-COMPONENT-Q
                           *(MLSF-ALTERNATE MLSF-COMPONENT-Q)] <">

     MLSF-ALTERNATE     = %xFE MLSF-LANG-TAG

     MLSF-COMPONENT     = UTF8-NON-NUL *([MLSF-LANG-TAG] UTF8-NON-NUL)

     MLSF-COMPONENT-Q   = UTF8-QUOTED *([MLSF-LANG-TAG] UTF8-QUOTED)

     MLSF-LANG-TAG      = *MLSF-LANG-5 (MLSF-LANG-1 / MLSF-LANG-2 /
                          MLSF-LANG-3 / MLSF-LANG-4 / MLSF-LANG-5)
                          ;; Encoded version of Language-Tag from RFC 1766
                          ;; characters converted to uppercase, with
                          ;; A0 added and broken into MLSF-LANG components

     MLSF-LANG-CONT     = %xCD / %xE1..FA

     MLSF-LANG-1        = %xC0 MLSF-LANG-CONT

     MLSF-LANG-2        = %xE0 2MLSF-LANG-CONT

     MLSF-LANG-3        = %xF0 3MLSF-LANG-CONT

     MLSF-LANG-4        = %xF8 4MLSF-LANG-CONT

     MLSF-LANG-5        = %xFC 5MLSF-LANG-CONT

     MLSF-SIMPLE        = [[MLSF-LANG-TAG] MLSF-COMPONENT]

     MLSF-SIMPLE-QUOTED = <"> [[MLSF-LANG-TAG] MLSF-COMPONENT-Q] <">

     QUOTED             = "\" QUOTED-SPECIAL

     QUOTED-SPECIAL     = "\" / <">

     US-ASCII-SAFE      = %x01..09 / %x0B..0C / %x0E..21
                          / %x23..5B / %x5D..7F
                          ;; US-ASCII except QUOTED-SPECIALs, CR, LF, NUL

     UTF8-NON-NUL       = UTF8-SAFE / CR / LF / QUOTED-SPECIAL


     UTF8-QUOTED        = UTF8-SAFE / QUOTED

     UTF8-SAFE          = US-ASCII-SAFE / UTF8-1 / UTF8-2 / UTF8-3
                          / UTF8-4 / UTF8-5

     UTF8-CONT          = %x80..BF

     UTF8-1             = %xC0..DF UTF8-CONT

     UTF8-2             = %xE0..EF 2UTF8-CONT

     UTF8-3             = %xF0..F7 3UTF8-CONT

     UTF8-4             = %xF8..FB 4UTF8-CONT

     UTF8-5             = %xFC..FD 5UTF8-CONT
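
     As an illustration of the MLSF-LANG-TAG rule, the following C
     sketch recognizes an encoded language tag at the start of an
     octet buffer.  The names is_lang_cont and mlsf_lang_tag_len are
     assumptions made for this example and are not defined by this
     specification.

```c
#include <stddef.h>

/* MLSF-LANG-CONT = %xCD / %xE1..FA */
static int is_lang_cont(unsigned char c)
{
    return c == 0xCD || (c >= 0xE1 && c <= 0xFA);
}

/* Hypothetical helper: if the n octets at s begin with a sequence
 * matching MLSF-LANG-TAG, return its length in octets; otherwise
 * return 0.
 */
size_t mlsf_lang_tag_len(const unsigned char *s, size_t n)
{
    size_t pos = 0;

    for (;;) {
        size_t group, i;

        if (pos >= n) return pos;
        switch (s[pos]) {
        case 0xC0: group = 1; break;    /* MLSF-LANG-1 */
        case 0xE0: group = 2; break;    /* MLSF-LANG-2 */
        case 0xF0: group = 3; break;    /* MLSF-LANG-3 */
        case 0xF8: group = 4; break;    /* MLSF-LANG-4 */
        case 0xFC: group = 5; break;    /* MLSF-LANG-5 */
        default:   return pos;          /* not a tag lead octet */
        }
        if (group > n - pos - 1) return pos;    /* truncated group */
        for (i = 1; i <= group; ++i) {
            if (!is_lang_cont(s[pos + i])) return pos;
        }
        pos += 1 + group;
        /* a group shorter than five octets ends the tag */
        if (group < 5) return pos;
    }
}
```

     The lead octets C0, E0, F0, F8 and FC also begin ordinary UTF-8
     sequences, so the sketch accepts a group only when every
     following octet is an MLSF-LANG-CONT octet rather than a
     UTF8-CONT octet.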


8. References

     [ABNF] Crocker, D., "Augmented BNF for Syntax Specifications:
     ABNF", Work in progress: draft-ietf-drums-abnf-xx.txt

     [KEYWORDS] Bradner, "Key words for use in RFCs to Indicate
     Requirement Levels", RFC 2119, Harvard University, March 1997.

         <ftp://ds.internic.net/rfc/rfc2119.txt>

     [LANG-TAGS] Alvestrand, H., "Tags for the Identification of
     Languages", RFC 1766.

         <ftp://ds.internic.net/rfc/rfc1766.txt>

     [MIME-HDR] Moore, "MIME (Multipurpose Internet Mail Extensions)
     Part Three: Message Header Extensions for Non-ASCII Text", RFC
     2047, University of Tennessee, November 1996.

         <ftp://ds.internic.net/rfc/rfc2047.txt>

     [MIME-IMB] Freed, Borenstein, "Multipurpose Internet Mail
     Extensions (MIME) Part One: Format of Internet Message Bodies", RFC
     2045, Innosoft, First Virtual, November 1996.

         <ftp://ds.internic.net/rfc/rfc2045.txt>


     [UTF8] Yergeau, F., "UTF-8, a transformation format of Unicode and
     ISO 10646", RFC 2044, Alis Technologies, October 1996.

         <ftp://ds.internic.net/rfc/rfc2044.txt>


9. Acknowledgements

     Special thanks to Mark Crispin for the idea of using unused UTF-8
     codes for this purpose.  Thanks are also due to participants of
     the ACAP WG mailing list who helped review this proposal.


10. Author's Address

     Chris Newman
     Innosoft International, Inc.
     1050 East Garvey Ave. South
     West Covina, CA 91790 USA

     Email: chris.newman@innosoft.com


Appendix A.  Client advice

     A simple UTF-8 client is likely to find the source code in Appendix
     B useful.  A simple Latin-1 based client is likely to find the
     source code in Appendix C useful.

     A more sophisticated client will allow the user to select a
     preferred language and use something like the source code in
     Appendix E to find the best alternative in an MLSF string.  Such
     clients should also be aware that sometimes the client's preferred
     language is misconfigured, and the user may wish to have the last
     few messages repeated after they have changed languages.  For this
     reason, such a client may wish to cache the last few MLSF strings
     displayed to the user.


Appendix B.  Sample code to convert to UTF-8

Here is sample C source code to convert from MLSF to UTF-8.

#include <stdio.h>
#include <ctype.h>
#include <string.h>     /* strlen(), used in Appendix D */

/* a UTF8 lookup table */
#define BAD 0x80
#define SEP 0x40
#define EXT 0x20
static unsigned char utlen[256] = {
        /* 0x00 */ BAD,   1,   1,   1,   1,   1,   1,   1,
        /* 0x08 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x10 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x18 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x20 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x28 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x30 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x38 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x40 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x48 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x50 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x58 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x60 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x68 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x70 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x78 */   1,   1,   1,   1,   1,   1,   1,   1,
        /* 0x80 */ EXT, EXT, EXT, EXT, EXT, EXT, EXT, EXT,
        /* 0x88 */ EXT, EXT, EXT, EXT, EXT, EXT, EXT, EXT,
        /* 0x90 */ EXT, EXT, EXT, EXT, EXT, EXT, EXT, EXT,
        /* 0x98 */ EXT, EXT, EXT, EXT, EXT, EXT, EXT, EXT,
        /* 0xA0 */ EXT, EXT, EXT, EXT, EXT, EXT, EXT, EXT,
        /* 0xA8 */ EXT, EXT, EXT, EXT, EXT, EXT, EXT, EXT,
        /* 0xB0 */ EXT, EXT, EXT, EXT, EXT, EXT, EXT, EXT,
        /* 0xB8 */ EXT, EXT, EXT, EXT, EXT, EXT, EXT, EXT,
        /* 0xC0 */   2,   2,   2,   2,   2,   2,   2,   2,
        /* 0xC8 */   2,   2,   2,   2,   2,   2,   2,   2,
        /* 0xD0 */   2,   2,   2,   2,   2,   2,   2,   2,
        /* 0xD8 */   2,   2,   2,   2,   2,   2,   2,   2,
        /* 0xE0 */   3,   3,   3,   3,   3,   3,   3,   3,
        /* 0xE8 */   3,   3,   3,   3,   3,   3,   3,   3,
        /* 0xF0 */   4,   4,   4,   4,   4,   4,   4,   4,
        /* 0xF8 */   5,   5,   5,   5,   6,   6, SEP, BAD
};


/* Down conversion from NUL terminated MLSF string to UTF-8.
 *  this strips the language tags and only keeps the preferred
 *  representation.
 * It returns the length of the final string.
 * The destination string will not be longer than the source string.
 *  dst and src may be the same for in-place conversion.
 */
int MLSFtoUTF8(unsigned char *dst, unsigned char *src)
{
    unsigned char *start = dst;
    int len;

    for (;;) {
        len = utlen[*src];
        if (len > 6) break;
        /* skip language tags */
        if (len > 1 && src[1] > 0xC0U) {
            while (len && *src != '\0') {
                ++src;
                --len;
            }
            continue;
        }
        /* copy UTF8 character */
        while (len && *src != '\0') {
            *dst = *src;
            ++dst;
            ++src;
            --len;
        }
    }
    *dst = '\0';

    return (dst - start);
}


Appendix C. Sample code to convert to Latin-1

/* Down conversion from NUL terminated MLSF string to 8859-1
 * The destination string will not be longer than the source string.
 *  fillc is used to fill untranslatable characters,
 *  if fillc is NUL, untranslatable characters are ignored.
 * returns 0 if source only contained latin-1, returns -1 otherwise.
 */
int MLSFtoLatin1(unsigned char *dst, unsigned char *src, int fillc)
{
    int len, result = 0;

    for (;;) {
        len = utlen[*src];
        /* copy US-ASCII */
        if (len == 1) {
            *dst = *src;
            ++dst;
            ++src;
            continue;
        }
        /* stop at illegal character or end of string */
        if (len > 6) break;
        /* skip non-latin1 glyphs and language tags */
        if (*src > 0xC3U || src[1] > 0xC0U) {
            if (src[1] <= 0xC0U) {
                /* non-latin1 glyph found */
                result = -1;
                if (fillc) {
                    *dst = fillc;
                    ++dst;
                }
            }
            while (len && *src != '\0') {
                ++src;
                --len;
            }
            continue;
        }
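        /* stop at a truncated or malformed two-octet sequence */
        if ((src[1] & 0xC0U) != 0x80U) {
            result = -1;
            break;
        }
        /* copy latin 1 character */
        *dst = ((src[0] & 0x03) << 6) | (src[1] & 0x3F);
        ++dst;
        src += 2;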
    }
    *dst = '\0';

    return (result);
}


Appendix D. Sample code for encoding/decoding language tags

/* encode a language tag
 *  the destination must have a size of at least (counting terminating
 *  NUL):
 *        (6 * strlen(src) + 9) / 5
 *  returns the length of the destination.
 */
int MLSFlangencode(unsigned char *dst, unsigned char *src)
{
    static unsigned char prefix[] = { 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    unsigned char *start = dst;
    int len;                    /* source length */
    int complen;                /* component length */
    int i;

    for (len = (int) strlen((char *) src); len > 0; len -= complen) {
        /* find maximal component length */
        complen = len;
        if (len >= 5) {
            complen = 5;
        }
        /* look up component prefix */
        *dst = prefix[complen - 1];
        ++dst;
        /* copy and map characters in component */
        for (i = 0; i < complen; ++i) {
            *dst = (islower(*src) ? toupper(*src) : *src) + 0xA0U;
            ++dst;
            ++src;
        }
    }
    *dst = '\0';

    return (dst - start);
}


/* decode a language tag
 *  the destination will not be longer than the source
 *  dst and src may be the same for in-place conversion
 * returns the length of the destination
 */
int MLSFlangdecode(unsigned char *dst, unsigned char *src)
{
    unsigned char *start = dst;
    int complen;

    while (src[0] >= 0xC0U && src[1] > 0xC0U) {
        for (complen = utlen[*src++]; complen > 1; --complen) {
            *dst = *src - 0xA0U;
            ++dst;
            ++src;
        }
    }
    *dst = '\0';

    return (dst - start);
}


Appendix E. Sample code for selecting the "best" alternative

/* select the "best" language match from an MLSF string
 *  assume input language tag has been converted to upper case
 *  assume decoded language tags in string won't exceed 255 characters
 *  "best" is calculated by matching RFC 1766 language tag components
 * returns a pointer to the start of best matching component
 */
unsigned char *MLSFselect(unsigned char *str, unsigned char *tag)
{
    unsigned char ltag[256];
    unsigned char *best, *match1, *match2;
    int bestlen, mlen;

    /* start with match on preferred alternative */
    best = str;
    bestlen = 0;

    /* skip test if no language tag */
    if (tag != NULL && *tag != '\0') {
        do {
            /* get language tag for this component */
            MLSFlangdecode(ltag, str);


            /* calculate match length of language tags */
            match1 = ltag;
            match2 = tag;
            mlen = 0;
            while (*match1 != '\0' && *match1 == *match2) {
                ++match1, ++match2;
                /* save length of partial match */
                if (*match2 == '-'
                    && (*match1 == '-' || *match1 == '\0')) {
                    mlen = match1 - ltag;
                }
            }

            /* finish on exact match */
            if (*match2 == '\0'
                && (*match1 == '-' || *match1 == '\0')) {
                best = str;
                break;
            }

            /* remember best match */
            if (mlen > bestlen) {
                best = str;
                bestlen = mlen;
            }

            /* skip to next MLSF component */
            while (*str != '\0' && *str++ != 0xFEU)
                ;
        } while (*str != '\0');
    }

    return (best);
}

