Network Working Group                                        M. Blanchet
Internet-Draft                                                  Viagenie
Obsoletes: 3454 (if approved)                               July 5, 2010
Intended status: Standards Track 
Expires: January 6, 2011 


Precis Framework: Handling Internationalized Strings in Protocols
draft-blanchet-precis-framework-00.txt

Abstract

Using Unicode codepoints in protocol strings requires preparation of the string. This document describes the Precis Protocol Framework that prepares various classes of strings used in protocol elements. A protocol specification chooses a class of strings and then implements the corresponding preparation steps described in this document. This document is based on the IDNAbis approach. It obsoletes the Stringprep algorithm.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

This Internet-Draft will expire on January 6, 2011.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English.



Table of Contents

1.  Introduction
2.  String Classes
3.  Domain U-Label, A-Label and Name
4.  Email Addresses
5.  Restricted Identifier
6.  Less-Restrictive Identifier
7.  Normalization Form and Case Folding
8.  Codepoint Properties
9.  Category definitions Used to Calculate Derived Property Value
    9.1.  LetterDigits (A)
    9.2.  Unstable (B)
    9.3.  IgnorableProperties (C)
    9.4.  IgnorableBlocks (D)
    9.5.  LDH (E)
    9.6.  Exceptions (F)
    9.7.  BackwardCompatible (G)
    9.8.  JoinControl (H)
    9.9.  OldHangulJamo (I)
    9.10.  Unassigned (J)
10.  Calculation of the Derived Property
11.  Codepoints
12.  IANA Considerations
    12.1.  IDNA derived property value registry
    12.2.  IDNA Context Registry
        12.2.1.  Template for context registry
13.  Security Considerations
14.  Discussion home for this draft
15.  Acknowledgements
Appendix A.  Contextual Rules Registry
Appendix A.1.  ZERO WIDTH NON-JOINER
Appendix A.2.  ZERO WIDTH JOINER
Appendix A.3.  MIDDLE DOT
Appendix A.4.  GREEK LOWER NUMERAL SIGN (KERAIA)
Appendix A.5.  HEBREW PUNCTUATION GERESH
Appendix A.6.  HEBREW PUNCTUATION GERSHAYIM
Appendix A.7.  KATAKANA MIDDLE DOT
Appendix A.8.  ARABIC-INDIC DIGITS
Appendix A.9.  EXTENDED ARABIC-INDIC DIGITS
Appendix B.  Codepoints 0x0000 - 0x10FFFF
Appendix B.1.  Codepoints in Unicode Character Database (UCD) format
16.  Informative References
§  Author's Address





1.  Introduction

[draft-ietf-blanchet-newprep-problem-statement] describes the rationale behind updating Stringprep [RFC3454] to a new framework.

Current Stringprep profiles and their corresponding protocol specifications share similar classes of strings. This framework is based on the assumption that the use of internationalized strings in most protocols can be grouped into a few string classes. By defining a few string classes and their corresponding preparation algorithms instead of specific profiles for each protocol, and by drawing heavily on the IDNAbis tables [IDNABISTABLES], this framework could help implementors by sharing common code for all string classes, including domain labels and names.

EDITOR NOTE: The current version of this document copies a lot of normative text from draft-ietf-idnabis-tables. The editor would much prefer to reference that text rather than copy it, but the text is copied here at least for the purpose of discussion. Moreover, the idnabis-tables draft refers to IDN labels in many places, which may make a normative reference problematic. To be looked at as we go.




2.  String Classes

The following classes of strings are identified:




3.  Domain U-Label, A-Label and Name

TBD: define the class.

For these string classes, implement [IDNA2008].




4.  Email Addresses

TBD: define the class by instantiating and referring to EAI and SMTP.

For this class of strings, implement [EAI]?




5.  Restricted Identifier

This class of strings, named RI in this document, corresponds to an identifier which contains language-type characters, no spacing characters, no "@", no "punctuation", no display characters. The normative description of this class is in the corresponding mapping tables.

In section XX below, allowed Unicode codepoints for this string class are identified as PVALID or RI_PVALID. Disallowed codepoints are identified as DISALLOWED or RI_DISALLOWED.




6.  Less-Restrictive Identifier

This class of strings, named LRI in this document, corresponds to an identifier which contains language-type characters, no spacing characters, no "@", but contains various "punctuation" and display characters. The normative description of this class is in the corresponding mapping tables.

In section XX below, allowed Unicode codepoints for this string class are identified as PVALID or LRI_PVALID. Disallowed codepoints are identified as DISALLOWED or LRI_DISALLOWED.




7.  Normalization Form and Case Folding

TBD: discuss NFC vs NFKC, case folding.




8.  Codepoint Properties

This document reviews and classifies the collections of code points in the Unicode character set by examining various properties of the code points. It then defines an algorithm for determining a derived property value. It specifies a procedure, and not a table, of code points so that the algorithm can be used to determine code point sets independent of the version of Unicode that is in use.

This document is not intended to specify precisely how these property values are to be applied in protocol strings. That information should be defined in the protocol specification that instantiates a string class of this document.

The value of the property is to be interpreted as follows.

The mechanisms described here allow determination of the value of the property for future versions of Unicode (including characters added after Unicode 5.2). Changes in Unicode properties that do not affect the outcome of this process do not affect this framework. For example, a character can have its Unicode General_Category value (see [Unicode52]) change from So to Sm, or from Lo to Ll, without affecting the algorithm results. Moreover, even if a change were to affect the results, the BackwardCompatible list (Section 9.7) can be adjusted to ensure the stability of the results.

Some code points need to be allowed in exceptional circumstances, but should be excluded in all other cases; these rules are also described in other documents. The most notable of these are the Join Control characters, U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH NON-JOINER. Both of them have the derived property value CONTEXTJ. A character with the derived property value CONTEXTJ or CONTEXTO (CONTEXTUAL RULE REQUIRED) is not to be used unless an appropriate rule has been established and the context of the character is consistent with that rule. It is invalid to either register a string containing these characters or even to look one up unless such a contextual rule is found and satisfied. Please see Appendix A (Contextual Rules Registry) for more information.




9.  Category definitions Used to Calculate Derived Property Value

The derived property obtains its value based on a two-step procedure. First, characters are placed in one or more character categories based on either core properties defined by the Unicode Standard or by treating the codepoint as an exception and addressing the codepoint by its codepoint value. These categories are not mutually exclusive.

In the second step, set operations are used with these categories to determine the values for a string-class-specific property. Those operations are specified in Section 10 (Calculation of the Derived Property).

Unicode property names and property value names may have short abbreviations, such as gc for the General_Category property, and Ll for the Lowercase_Letter property value of the gc property.

In the following specification of categories, the operation which returns the value of a particular Unicode character property for a code point is designated by using the formal name of that property (from PropertyAliases.txt) followed by '(cp)'. For example, the value of the General_Category property for a code point is indicated by General_Category(cp).




9.1.  LetterDigits (A)

A: General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}

This rule identifies characters commonly used in mnemonics and often informally described as "language characters".

For more information, see section 4.5 of [Unicode5].
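As a rough illustration, the test in rule A could be sketched with the General_Category lookup in Python's standard unicodedata module. The function name is_letter_digit is hypothetical, and the result reflects whatever Unicode version ships with the interpreter, which is not necessarily Unicode 5.2.

```python
import unicodedata

# General_Category values admitted by rule A.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def is_letter_digit(cp: int) -> bool:
    """Return True if code point cp falls in category A (LetterDigits)."""
    return unicodedata.category(chr(cp)) in LETTER_DIGIT_CATEGORIES
```

For instance, is_letter_digit(0x0061) holds for LATIN SMALL LETTER A, while is_letter_digit(0x0040) does not hold for COMMERCIAL AT.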

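The categories used in this rule are:

Ll - Lowercase_Letter
Lu - Uppercase_Letter
Lo - Other_Letter
Nd - Decimal_Number
Lm - Modifier_Letter
Mn - Nonspacing_Mark
Mc - Spacing_Mark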




9.2.  Unstable (B)

B: toNFKC(toCaseFold(toNFKC(cp))) != cp

This category is used to group the characters that are not stable under NFKC normalization and casefolding. In general, these code points are not suitable for use in any string class.

The toCaseFold() operation is defined in Section 3.13 of [Unicode5].

The toNFKC() operation returns the code point in normalization form KC. For more information, see Section 5 of [TR15].
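A minimal sketch of rule B in Python follows. It assumes that str.casefold() is an acceptable stand-in for toCaseFold() and that unicodedata.normalize("NFKC", ...) stands in for toNFKC(); the function name is_unstable is hypothetical, and the outcome depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

def is_unstable(cp: int) -> bool:
    """Return True if cp changes under toNFKC(toCaseFold(toNFKC(cp)))."""
    ch = chr(cp)
    # Inner toNFKC followed by case folding (assumed: str.casefold()).
    folded = unicodedata.normalize("NFKC", ch).casefold()
    # Outer toNFKC, then compare with the original code point.
    return unicodedata.normalize("NFKC", folded) != ch
```

Under this sketch, uppercase letters and compatibility characters such as U+2160 ROMAN NUMERAL ONE are unstable, and so is U+00DF, which case-folds to "ss"; this is consistent with U+00DF being listed as an exception in Section 9.6.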




9.3.  IgnorableProperties (C)

C: Default_Ignorable_Code_Point(cp) = True or
   White_Space(cp) = True or
   Noncharacter_Code_Point(cp) = True

This category is used to group code points that are not recommended for use in identifiers. In general, these code points are not suitable for identifiers.

The definition for Default_Ignorable_Code_Point can be found in DerivedCoreProperties.txt and is at the time of Unicode 5.2:

Other_Default_Ignorable_Code_Point + Cf (Format characters)
+ Variation_Selector - White_Space - FFF9..FFFB (Annotation
Characters) - 0600..0603, 06DD, 070F (exceptional Cf characters
that should be visible)



9.4.  IgnorableBlocks (D)

D: Block(cp) is in {Combining Diacritical Marks for Symbols,
                    Musical Symbols, Ancient Greek Musical Notation}

This category is used to identify code points that are not useful in mnemonics but may be useful for some string classes.

The definition of blocks can be found in Blocks.txt.




9.5.  LDH (E)

E: cp is in {002D, 0030..0039, 0061..007A}

This category is used in the second step to preserve the traditional "hostname" (LDH) characters ('-', 0-9 and a-z). In general, these code points are suitable for use for identifiers.
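Because category E is a fixed set, it can be written out directly. The following Python sketch does so; the names LDH and is_ldh are illustrative only.

```python
# Category E: '-' (002D), '0'..'9' (0030..0039), 'a'..'z' (0061..007A).
LDH = {0x002D} | set(range(0x0030, 0x003A)) | set(range(0x0061, 0x007B))

def is_ldh(cp: int) -> bool:
    """Return True if code point cp is one of the traditional LDH characters."""
    return cp in LDH
```

Note that uppercase ASCII letters are deliberately absent from the set.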




9.6.  Exceptions (F)

F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
             0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668,
             0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6,
             06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 0F0B, 3007,
             302E, 302F, 3031, 3032, 3033, 3034, 3035, 303B,
             30FB}

This category explicitly lists code points for which the category cannot be assigned using only the core property values that exist in the Unicode standard. The values are according to the table below:

PVALID -- Would otherwise have been DISALLOWED

00DF; PVALID     # LATIN SMALL LETTER SHARP S
03C2; PVALID     # GREEK SMALL LETTER FINAL SIGMA
06FD; PVALID     # ARABIC SIGN SINDHI AMPERSAND
06FE; PVALID     # ARABIC SIGN SINDHI POSTPOSITION MEN
0F0B; PVALID     # TIBETAN MARK INTERSYLLABIC TSHEG
3007; PVALID     # IDEOGRAPHIC NUMBER ZERO

CONTEXTO -- Would otherwise have been DISALLOWED

00B7; CONTEXTO   # MIDDLE DOT
0375; CONTEXTO   # GREEK LOWER NUMERAL SIGN (KERAIA)
05F3; CONTEXTO   # HEBREW PUNCTUATION GERESH
05F4; CONTEXTO   # HEBREW PUNCTUATION GERSHAYIM
30FB; CONTEXTO   # KATAKANA MIDDLE DOT

CONTEXTO -- Would otherwise have been PVALID

0660; CONTEXTO   # ARABIC-INDIC DIGIT ZERO
0661; CONTEXTO   # ARABIC-INDIC DIGIT ONE
0662; CONTEXTO   # ARABIC-INDIC DIGIT TWO
0663; CONTEXTO   # ARABIC-INDIC DIGIT THREE
0664; CONTEXTO   # ARABIC-INDIC DIGIT FOUR
0665; CONTEXTO   # ARABIC-INDIC DIGIT FIVE
0666; CONTEXTO   # ARABIC-INDIC DIGIT SIX
0667; CONTEXTO   # ARABIC-INDIC DIGIT SEVEN
0668; CONTEXTO   # ARABIC-INDIC DIGIT EIGHT
0669; CONTEXTO   # ARABIC-INDIC DIGIT NINE
06F0; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT ZERO
06F1; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT ONE
06F2; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT TWO
06F3; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT THREE
06F4; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT FOUR
06F5; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT FIVE
06F6; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT SIX
06F7; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT SEVEN
06F8; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT EIGHT
06F9; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT NINE

DISALLOWED -- Would otherwise have been PVALID

0640; DISALLOWED # ARABIC TATWEEL
07FA; DISALLOWED # NKO LAJANYALAN
302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
3031; DISALLOWED # VERTICAL KANA REPEAT MARK
3032; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK
3033; DISALLOWED # VERTICAL KANA REPEAT MARK UPPER HALF
3034; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HA
3035; DISALLOWED # VERTICAL KANA REPEAT MARK LOWER HALF
303B; DISALLOWED # VERTICAL IDEOGRAPHIC ITERATION MARK
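An implementation might load the table above into a lookup structure rather than hard-coding it. The sketch below parses lines in the "code point; value # comment" layout used above into a Python dictionary. The function name parse_exceptions and the three-line SAMPLE excerpt are illustrative assumptions, not part of this specification.

```python
def parse_exceptions(text: str) -> dict:
    """Map each listed code point (int) to its derived property value (str)."""
    table = {}
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()   # drop the trailing comment
        if ";" not in entry:
            continue                            # skip blank lines and group headings
        cp_hex, value = (part.strip() for part in entry.split(";", 1))
        table[int(cp_hex, 16)] = value
    return table

# Excerpt only; a real table would carry every entry of category F.
SAMPLE = """\
PVALID -- Would otherwise have been DISALLOWED

00DF; PVALID     # LATIN SMALL LETTER SHARP S
00B7; CONTEXTO   # MIDDLE DOT
0640; DISALLOWED # ARABIC TATWEEL
"""
EXCEPTIONS_SAMPLE = parse_exceptions(SAMPLE)
```

The resulting dictionary serves both uses of the name Exceptions in Section 10: membership (cp in table) and value lookup (table[cp]).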



9.7.  BackwardCompatible (G)

G: cp is in {}

This category includes the code points whose property values in versions of Unicode after 5.2 have changed in such a way that the derived property value would no longer be PVALID or DISALLOWED. If changes are made to future versions of Unicode so that code points might change property value from PVALID or DISALLOWED, then this table can be updated to keep special exception values so that the property values for code points stay stable.




9.8.  JoinControl (H)

H: Join_Control(cp) = True

This category consists of Join Control characters, which are not in LetterDigits (Section 9.1) but are still required in strings under some circumstances.




9.9.  OldHangulJamo (I)

I: Hangul_Syllable_Type(cp) is in {L, V, T}

This category consists of all conjoining Hangul Jamo (Leading Jamo, Vowel Jamo, and Trailing Jamo).

Elimination of conjoining Hangul Jamos from the set of PVALID characters results in restricting the set of Korean PVALID characters just to preformed, modern Hangul syllable characters. Old Hangul syllables, which must be spelled with sequences of conjoining Hangul Jamos, are not PVALID for string classes.




9.10.  Unassigned (J)

J: General_Category(cp) is in {Cn} and
   Noncharacter_Code_Point(cp) = False

This category consists of code points in the Unicode character set that are not (yet) assigned. It should be noted that Unicode distinguishes between 'unassigned code points' and 'unassigned characters'. The unassigned code points are all but (Cn - Noncharacters), while the unassigned *characters* are all but (Cn + Cs).
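Rule J could be sketched in Python as follows. The function names are hypothetical. Since Python's unicodedata module does not expose Noncharacter_Code_Point, the sketch computes it from the fixed noncharacter set (U+FDD0..U+FDEF plus the last two code points of each plane). The General_Category lookup reflects the Unicode version bundled with the interpreter.

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    """Noncharacter_Code_Point: U+FDD0..U+FDEF, or U+xxFFFE / U+xxFFFF in any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_unassigned(cp: int) -> bool:
    """Category J: General_Category is Cn and cp is not a noncharacter."""
    return unicodedata.category(chr(cp)) == "Cn" and not is_noncharacter(cp)
```

The noncharacter exclusion matters because noncharacters also carry General_Category Cn; without it they would be reported as UNASSIGNED rather than being caught by IgnorableProperties (C).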




10.  Calculation of the Derived Property

Possible values of the property are:

The algorithm to calculate the value of the derived property is as follows. If the name of a rule (such as Exceptions) is used, that implies the set of codepoints that the rule defines, while the same name used as a function call (such as Exceptions(cp)) implies the value that cp has in the Exceptions table.

If .cp. .in. Exceptions Then Exceptions(cp);
Else If .cp. .in. BackwardCompatible Then BackwardCompatible(cp);
Else If .cp. .in. Unassigned Then UNASSIGNED;
Else If .cp. .in. LDH Then PVALID;
Else If .cp. .in. JoinControl Then CONTEXTJ;
Else If .cp. .in. Unstable Then DISALLOWED;
Else If .cp. .in. IgnorableProperties Then DISALLOWED;
Else If .cp. .in. IgnorableBlocks Then LRI_PVALID;
Else If .cp. .in. OldHangulJamo Then DISALLOWED;
Else If .cp. .in. LetterDigits Then PVALID;
Else DISALLOWED;
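The cascade above is an ordered first-match lookup, which the following Python sketch tries to make explicit. All names are hypothetical. The demo predicates are assumptions: those marked as placeholders always return False because the underlying properties (Default_Ignorable_Code_Point, White_Space, Block, Hangul_Syllable_Type) are not exposed by Python's standard library, and the exceptions table is only an excerpt.

```python
import unicodedata

# (category name, resulting value), in the order of the cascade above.
RULES = [
    ("Unassigned", "UNASSIGNED"), ("LDH", "PVALID"),
    ("JoinControl", "CONTEXTJ"), ("Unstable", "DISALLOWED"),
    ("IgnorableProperties", "DISALLOWED"), ("IgnorableBlocks", "LRI_PVALID"),
    ("OldHangulJamo", "DISALLOWED"), ("LetterDigits", "PVALID"),
]

def derived_property(cp, predicates, exceptions, backward_compatible=None):
    """Return the derived property value of code point cp.

    predicates maps every category name in RULES to a function cp -> bool;
    exceptions and backward_compatible map code points to explicit values.
    """
    if cp in exceptions:
        return exceptions[cp]
    if backward_compatible and cp in backward_compatible:
        return backward_compatible[cp]
    for name, value in RULES:
        if predicates[name](cp):
            return value              # first matching category wins
    return "DISALLOWED"

def _unassigned(cp):
    nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    return unicodedata.category(chr(cp)) == "Cn" and not nonchar

def _ldh(cp):
    return cp == 0x2D or 0x30 <= cp <= 0x39 or 0x61 <= cp <= 0x7A

def _unstable(cp):
    ch = chr(cp)
    folded = unicodedata.normalize("NFKC", ch).casefold()
    return unicodedata.normalize("NFKC", folded) != ch

def _letter_digits(cp):
    return unicodedata.category(chr(cp)) in {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def _placeholder(cp):
    return False                      # property not available in the stdlib

DEMO_PREDICATES = {
    "Unassigned": _unassigned, "LDH": _ldh,
    "JoinControl": lambda cp: cp in (0x200C, 0x200D),
    "Unstable": _unstable,
    "IgnorableProperties": _placeholder,   # placeholder
    "IgnorableBlocks": _placeholder,       # placeholder
    "OldHangulJamo": _placeholder,         # placeholder
    "LetterDigits": _letter_digits,
}
DEMO_EXCEPTIONS = {0x00DF: "PVALID", 0x00B7: "CONTEXTO", 0x0640: "DISALLOWED"}
```

Because of the placeholders, this demo gives incorrect answers for code points that belong to IgnorableProperties, IgnorableBlocks or OldHangulJamo; a complete implementation would derive those predicates from the Unicode Character Database files.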




11.  Codepoints

The Categories and Rules defined in Section 9 (Category definitions Used to Calculate Derived Property Value) and Section 10 (Calculation of the Derived Property) apply to all Unicode code points. The table in Appendix B (Codepoints 0x0000 - 0x10FFFF) shows, for illustrative purposes, the consequences of the categories and classification rules, and the resulting property values.

The list of code points that can be found in Appendix B (Codepoints 0x0000 - 0x10FFFF) is non-normative. Section 9 (Category definitions Used to Calculate Derived Property Value) and Section 10 (Calculation of the Derived Property) are normative.




12.  IANA Considerations




12.1.  IDNA derived property value registry

IANA is to create a registry with the derived properties for the versions of Unicode that are released after (and including) version 5.2. The derived property value is to be calculated in cooperation with a designated expert [RFC5226] according to the specifications in Section 9 (Category definitions Used to Calculate Derived Property Value) and Section 10 (Calculation of the Derived Property) and not by copying the non-normative table found in Appendix B (Codepoints 0x0000 - 0x10FFFF).

If, during this process (creation of the table of derived property values followed by a designated expert review), either non-backward-compatible changes to the table of derived properties are discovered or other problems arise during the creation of the table, that is to be flagged to the IESG. Changes to the rules (as specified in Section 9 (Category definitions Used to Calculate Derived Property Value) and Section 10 (Calculation of the Derived Property)), including BackwardCompatible (Section 9.7) (a set that is empty at the release of this document), require IETF Review, as described in [RFC5226].




12.2.  IDNA Context Registry

For characters that are defined in the IDNA derived property value registry (Section 12.1) as CONTEXTO or CONTEXTJ, and that therefore require a contextual rule, IANA will create and maintain a list of approved contextual rules. Additions or changes to these rules require IETF Review, as described in [RFC5226].

A table from which that registry can be initialized, and some further discussion, appear in Appendix A (Contextual Rules Registry).




12.2.1.  Template for context registry

The following information is to be given when a new rule is created.

Name: Unique name of the rule

Code point: Rule should be applied when this codepoint exists in a label

Overview: Description in plain English of what the rule verifies

Lookup: Should rule be applied at time of lookup?

Rule Set: The set of rules, as described in




13.  Security Considerations

TBD




14.  Discussion home for this draft

This document is discussed on the precis@ietf.org mailing list. (This section is to be removed when published as an RFC.)




15.  Acknowledgements

The author of this document would like to acknowledge the comments and contributions of the following people: ...

Since this document copies a lot of text and the algorithms from the IDNAbis tables, all authors of and contributors to the idnabis work are gratefully acknowledged.




Appendix A.  Contextual Rules Registry

As discussed in Section 12.2 (IDNA Context Registry), IANA will maintain a registry of rules that define the contexts in which particular PROTOCOL-VALID characters, those associated with a requirement for Contextual Information, are permitted. These rules are expressed as tests on the label in which the characters appear (all, or any part of, the label may be tested).

The grammatical rules are expressed in pseudo code. The conventions used for that pseudo code are explained here.

Each rule is constructed as a Boolean expression that evaluates to either True or False. A simple "True;" or "False;" rule sets the default result value for the rule set. Subsequent conditional rules that evaluate to True or False may re-set the result value.

A special value "Undefined" is used to deal with any error conditions, such as an attempt to test a character before the start of a label or after the end of a label. If any term of a rule evaluates to Undefined, further evaluation of the rule immediately terminates, as the result value of the rule will itself be Undefined.

cp represents the codepoint to be tested.
FirstChar is a special term which denotes the first codepoint in a string.
LastChar is a special term which denotes the last codepoint in a string.
.eq. represents the equality relation.
A .eq. B evaluates to True if A equals B.
.is. represents checking position in a string.
A .is. B evaluates to True if A and B have same position in the same string.
.ne. represents the non-equality relation.
A .ne. B evaluates to True if A is not equal to B.
.in. represents the set inclusion relation.
A .in. B evaluates to True if A is a member of the set B.


A functional notation, Function_Name(cp), is used to express either string positions within a string, Boolean character property tests of a codepoint, or a regular expression match. When such function names refer to Boolean character property tests, the function names use the exact Unicode character property name for the property in question, and "cp" is evaluated as the Unicode value of the codepoint to be tested, rather than as its position in the string. When such function names refer to string positions within a string, "cp" is evaluated as its position in the string.

RegExpMatch(X) takes as its parameter X a schematic regular expression consisting of a mix of Unicode character property values and literal Unicode codepoints.

Script(cp) returns the value of the Unicode Script property, as defined in Scripts.txt in the Unicode Character Database.

Canonical_Combining_Class(cp) returns the value of the Unicode Canonical_Combining_Class property, as defined in UnicodeData.txt in the Unicode Character Database.

Before(cp) returns the codepoint of the character immediately preceding cp in logical order in the string. Before(FirstChar) evaluates to Undefined.

After(cp) returns the codepoint of the character immediately following cp in logical order in the string. After(LastChar) evaluates to Undefined.

Note that "Before" and "After" do not refer to the visual display order of the character in a string, which may be reversed or otherwise modified by the bidirectional algorithm for strings including characters from scripts written right-to-left. Instead, 'Before' and 'After' refer to the network order of the character in the string.

The clauses "Then True" and "Then False" imply exit from the pseudo-code routine with the corresponding result.

Repeated evaluation for all characters in a string makes use of the special construct:

For All Characters:
Expression;
End For;

This construct requires repeated evaluation of "Expression" for each codepoint in the string, starting from FirstChar and proceeding to LastChar.

The different fields in the rules are to be interpreted as follows:

Code point:
The codepoint, or codepoints, that this rule is to be applied to. Normally, this implies that if any of the codepoints in a string is as defined, then the rules should be applied. If the rule set evaluates to True, the codepoint is OK as used; if it evaluates to False, it is not OK.
Overview:
A description of the goal with the rule, in plain English.
Lookup:
True if application of this rule is recommended at lookup time; False otherwise.
Rule Set:
The rule set itself, as described above.




Appendix A.1.  ZERO WIDTH NON-JOINER

Code point:
U+200C
Overview:
This may occur in a formally cursive script (such as Arabic) in a context where it breaks a cursive connection as required for orthographic rules, as in the Persian language, for example. It also may occur in Indic scripts in a consonant conjunct context (immediately following a virama), to control required display of such conjuncts.
Lookup:
True
Rule Set:
False;
If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
(Joining_Type:T)*(Joining_Type:{R,D})) Then True;



Appendix A.2.  ZERO WIDTH JOINER

Code point:
U+200D
Overview:
This may occur in Indic scripts in a consonant conjunct context (immediately following a virama), to control required display of such conjuncts.
Lookup:
True
Rule Set:
False;
If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
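The rule set above could be sketched in Python with unicodedata.combining(), which returns the Canonical_Combining_Class as an integer (the UCD value named Virama is 9). The function name zwj_rule_ok is hypothetical, and treating an Undefined result as a failure is an assumption of this sketch, not something this appendix prescribes.

```python
import unicodedata

VIRAMA_CCC = 9    # Canonical_Combining_Class value named "Virama" in the UCD
ZWJ = "\u200d"

def zwj_rule_ok(s: str, i: int) -> bool:
    """Evaluate the rule set for the ZERO WIDTH JOINER at index i of s."""
    if s[i] != ZWJ:
        raise ValueError("rule applies only to U+200D")
    if i == 0:
        # Before(FirstChar) is Undefined; assumed here to mean "not permitted".
        return False
    return unicodedata.combining(s[i - 1]) == VIRAMA_CCC
```

For example, the rule is satisfied for a ZWJ that follows U+094D DEVANAGARI SIGN VIRAMA, and not for one that follows a Latin letter.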



Appendix A.3.  MIDDLE DOT

Code point:
U+00B7
Overview:
Between 'l' (U+006C) characters only, used to permit the Catalan character ela geminada to be expressed
Lookup:
False
Rule Set:
False;
If Before(cp) .eq. U+006C And
After(cp) .eq. U+006C Then True;
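A minimal Python sketch of this rule set follows, showing how both Before(cp) and After(cp) can be Undefined at the string boundaries. The function name middle_dot_ok is hypothetical, and mapping Undefined to a failed check is an assumption of the sketch.

```python
MIDDLE_DOT = "\u00b7"

def middle_dot_ok(s: str, i: int) -> bool:
    """Evaluate the rule set for the MIDDLE DOT at index i of s."""
    if s[i] != MIDDLE_DOT:
        raise ValueError("rule applies only to U+00B7")
    # Before/After are Undefined at the string boundaries (None here).
    before = s[i - 1] if i > 0 else None
    after = s[i + 1] if i + 1 < len(s) else None
    if before is None or after is None:
        return False      # assumed handling of Undefined
    return before == "l" and after == "l"
```

With this sketch, the Catalan word "col·lecció" passes at index 3, while a middle dot at the end of a string or next to any other letter does not.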



Appendix A.4.  GREEK LOWER NUMERAL SIGN (KERAIA)

Code point:
U+0375
Overview:
The script of the following character MUST be Greek.
Lookup:
False
Rule Set:
False;
If Script(After(cp)) .eq. Greek Then True;



Appendix A.5.  HEBREW PUNCTUATION GERESH

Code point:
U+05F3
Overview:
The script of the preceding character MUST be Hebrew.
Lookup:
False
Rule Set:
False;
If Script(Before(cp)) .eq. Hebrew Then True;



Appendix A.6.  HEBREW PUNCTUATION GERSHAYIM

Code point:
U+05F4
Overview:
The script of the preceding character MUST be Hebrew.
Lookup:
False
Rule Set:
False;
If Script(Before(cp)) .eq. Hebrew Then True;



Appendix A.7.  KATAKANA MIDDLE DOT

Code point:
U+30FB
Overview:
Note that the Script of Katakana Middle Dot is not any of "Hiragana", "Katakana" or "Han". The effect of this rule is to require at least one character in the label to be in one of those scripts.
Lookup:
False
Rule Set:
False;
For All Characters:
If Script(cp) .in. {Hiragana, Katakana, Han} Then True;
End For;



Appendix A.8.  ARABIC-INDIC DIGITS

Code point:
0660..0669
Overview:
Cannot be mixed with Extended Arabic-Indic Digits.
Lookup:
False
Rule Set:
True;
For All Characters:
If cp .in. 06F0..06F9 Then False;
End For;




Appendix A.9.  EXTENDED ARABIC-INDIC DIGITS

Code point:
06F0..06F9
Overview:
Cannot be mixed with Arabic-Indic Digits.
Lookup:
False
Rule Set:
True;
For All Characters:
If cp .in. 0660..0669 Then False;
End For;
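The two digit rules (Appendix A.8 and this one) are mirror images that use the "For All Characters" construct: the default result is True and any character from the other digit range flips it to False. A possible Python sketch, with hypothetical function names, is:

```python
ARABIC_INDIC = range(0x0660, 0x066A)            # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)   # U+06F0..U+06F9

def arabic_indic_digits_ok(s: str) -> bool:
    """Appendix A.8: no Extended Arabic-Indic digit may appear anywhere in s."""
    return not any(ord(ch) in EXTENDED_ARABIC_INDIC for ch in s)

def extended_arabic_indic_digits_ok(s: str) -> bool:
    """Appendix A.9: no Arabic-Indic digit may appear anywhere in s."""
    return not any(ord(ch) in ARABIC_INDIC for ch in s)
```

A caller would apply the first function when a string contains a code point in 0660..0669 and the second when it contains one in 06F0..06F9; a string mixing both ranges fails both checks.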



Appendix B.  Codepoints 0x0000 - 0x10FFFF

If one applies the rules in Section 10 (Calculation of the Derived Property) to the code points 0x0000 to 0x10FFFF of Unicode 5.2, the result is as follows.

This list is non-normative and is included only for illustrative purposes. Specifically, what is displayed in the third column is not the formal name of the codepoint (as defined in section 4.8 of [Unicode52]). Differences exist, for example, for the codepoints that have the codepoint value as part of the name (example: CJK UNIFIED IDEOGRAPH-4E00) and in the naming of Hangul syllables. For many codepoints, what you see is the official name.




Appendix B.1.  Codepoints in Unicode Character Database (UCD) format

0000..10FFFF; TBD!



16. Informative References

[RFC3454] Hoffman, P. and M. Blanchet, “Preparation of Internationalized Strings ("stringprep"),” RFC 3454, December 2002.
[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, “Internationalizing Domain Names in Applications (IDNA),” RFC 3490, March 2003.
[RFC3491] Hoffman, P. and M. Blanchet, “Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN),” RFC 3491, March 2003.
[RFC3492] Costello, A., “Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA),” RFC 3492, March 2003.
[RFC3722] Bakke, M., “String Profile for Internet Small Computer Systems Interface (iSCSI) Names,” RFC 3722, April 2004.
[RFC3920] Saint-Andre, P., Ed., “Extensible Messaging and Presence Protocol (XMPP): Core,” RFC 3920, October 2004.
[RFC4011] Waldbusser, S., Saperia, J., and T. Hongal, “Policy Based Management MIB,” RFC 4011, March 2005.
[RFC4013] Zeilenga, K., “SASLprep: Stringprep Profile for User Names and Passwords,” RFC 4013, February 2005.
[RFC4505] Zeilenga, K., “Anonymous Simple Authentication and Security Layer (SASL) Mechanism,” RFC 4505, June 2006.
[RFC4518] Zeilenga, K., “Lightweight Directory Access Protocol (LDAP): Internationalized String Preparation,” RFC 4518, June 2006.
[RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, “Review and Recommendations for Internationalized Domain Names (IDNs),” RFC 4690, September 2006.



Author's Address

  Marc Blanchet
  Viagenie
  2600 boul. Laurier, suite 625
  Quebec, QC G1V 4W1
  Canada
Email:  Marc.Blanchet@viagenie.ca
URI:  http://www.viagenie.ca