INTERNET-DRAFT                                 Editor: Kurt D. Zeilenga
Intended Category: Standard Track              OpenLDAP Foundation
Expires in six months                          5 August 2002

Internationalized String Matching Rules for X.500

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

This document is intended to be published as a Standard Track RFC. Distribution of this memo is unlimited. Technical discussion of this document will take place on the IETF LDAP Revision Working Group mailing list . Please send editorial comments directly to the author .

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.''

The list of current Internet-Drafts can be accessed at . The list of Internet-Draft Shadow Directories can be accessed at .

Copyright 2002, The Internet Society. All Rights Reserved.

Please see the Copyright section near the end of this document for more information.

Abstract

The existing X.500 Directory Service technical specifications do not precisely define how string matching is to be performed. This has led to a number of interoperability problems. This document provides string preparation profiles for string-based syntaxes and matching rules for standard string syntaxes and matching rules defined in X.520.

Zeilenga X.500 Intl. String Matching [Page 1]
INTERNET-DRAFT draft-zeilenga-ldapbis-strmatch-00 5 August 2002

This document is intended to be submitted to the ITU for publication as an amendment to X.520 and published as an Informational RFC.

Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119].

1. Background and Intended Usage

The "X.520: Selected attribute types" [X.520] provides (amongst other things) value syntaxes and matching rules for comparing values commonly used in the Directory [X.500]. These specifications are inadequate for strings composed of characters from the Universal Character Set (UCS) [ISO10646] (a superset of Unicode [UNICODE]).

The Case Ignore Match, for example, is simply defined as being a case insensitive comparison where insignificant spaces are ignored. For printableString, there is only one space character and case mapping is bijective, hence this definition is sufficient. However, for UCS-based string types such as universalString, it is not sufficient.

For example, a case insensitive matching implementation which folds lower case characters to upper case would yield different results than an implementation which used upper to lower case folding. Or one implementation may view space as referring to SPACE (U+0020), a second implementation may view any character with the space separator (Zs) property as a space, and another implementation may view any character with the whitespace (WS) category as a space.

The lack of precise specification for string matching has led to significant interoperability problems. To address these problems, this document updates X.520 [X.520] with a detailed specification of string syntax and matching rule requirements.

The matching rule algorithms described in this document are based upon the "stringprep" approach [STRPREP]. In "stringprep", presented and stored values are first prepared for comparison so that a character-by-character comparison yields the "correct" result.

The algorithm used here is a refinement of the "stringprep" approach.
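
As an illustration of the case folding divergence noted earlier in this section, the following Python sketch compares the same pair of strings under the two folding directions. It is not part of this specification: the function names are hypothetical, and Python's built-in str.upper() and str.lower() merely stand in for whatever case mappings an implementation might choose.

```python
def match_fold_upper(presented: str, stored: str) -> bool:
    """Case insensitive match that folds both values to upper case."""
    return presented.upper() == stored.upper()


def match_fold_lower(presented: str, stored: str) -> bool:
    """Case insensitive match that folds both values to lower case."""
    return presented.lower() == stored.lower()


# "stra\u00dfe" contains LATIN SMALL LETTER SHARP S (U+00DF).
presented, stored = "STRASSE", "stra\u00dfe"

print(match_fold_upper(presented, stored))  # True
print(match_fold_lower(presented, stored))  # False
```

Folding to upper case reports a match because U+00DF upper-cases to "SS", while folding to lower case leaves U+00DF unchanged and reports no match. A single, precisely specified preparation step removes this ambiguity before the character-by-character comparison.
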
Design considerations are discussed in Section 2. The algorithm adds a preparation step that transcodes all input values to Unicode before proceeding with the preparation steps outlined in [STRPREP]. Additionally, the "stringprep" normalization step also includes removal of spaces insignificant to X.500 string matching. As described, preparation of strings involves the following steps:

    1) Transcode
    2) Map
    3) Normalize
    4) Prohibit
    5) Check bidi

These steps are described in Section 3.

Section 4 replaces section 6.1 of X.520 [X.520]. It updates select string matching rules.

Section 5 replaces portions of section 6.2 of X.520 [X.520]. It updates select syntax-based matching rules.

2. Design Considerations

The X.500 string matching rule specification provided in Section 3 is designed to leverage the "stringprep" framework [STRPREP] for comparing strings. As noted above, a transcoding step was added as X.500 supports matching between strings of different character sets.

2.1. Transcode

As mappings between character sets, such as T.61 and UCS, are not bijective, all strings are first transcoded to a common character set. UCS was the logical choice as all other character sets (used in X.500) can be transcoded to it without information loss. None of the other character sets offer this property.

2.2. Map

TBD

2.3. Normalize

Normalization is performed to ensure comparison is always done between canonical-equivalent strings. As directory strings are often used as identifiers, we selected Form KC (compatibility composed) as it allows a greater number of strings to be treated as equivalent. Unfortunately, this choice is not best for all applications. Additional matching rules which use different string preparation algorithms may be introduced in the future to better support these applications. In particular, matching rules which use Form C (composed) normalization instead of Form KC would also be generally useful.

2.4. Prohibit

TBD

2.5. Check bidi

TBD

3. String Preparation

The following 5-step process SHALL be applied to each presented and attribute value in preparation for string matching rule evaluation.

    1) Transcode
    2) Map
    3) Normalize
    4) Prohibit
    5) Check bidi

Failure in any step SHALL cause the assertion to be Undefined.

3.1. Transcode

Each non-Unicode string value is to be transcoded to Unicode.

TeletexString values SHALL be transcoded to Unicode as described in [T61-UCS].

PrintableString values SHALL be transcoded directly to Unicode.

UniversalString, UTF8String, and bmpString values need not be transcoded as they are Unicode-based strings (which, in the case of bmpString, is restricted to a subset of Unicode).

If the implementation is unable or unwilling to perform the transcoding as described above, or the transcoding fails, this step fails and the assertion SHALL be evaluated to Undefined.

Implementations SHOULD support transcoding for all stored non-Unicode strings which may be matched against using the rules defined in Section 3.

The transcoded string is the output string.

3.2. Map

SOFT HYPHEN (U+00AD) and MONGOLIAN TODO SOFT HYPHEN (U+1806) code points are to be mapped to nothing. COMBINING GRAPHEME JOINER (U+034F) and VARIATION SELECTORs (U+180B-180D, FE00-FE0F) code points are also to be mapped to nothing. The OBJECT REPLACEMENT CHARACTER (U+FFFC) is mapped to nothing.
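
The map-to-nothing rules above can be sketched in Python as follows. This is a minimal, non-normative illustration: the names MAP_TO_NOTHING and map_to_nothing are hypothetical, the variation selector ranges are assumed to be U+180B-180D and U+FE00-FE0F, and the remaining mappings of this step (controls, separators, case folding) are not covered.

```python
# Code points mapped to nothing by the first paragraph of the Map step.
# Assumption: variation selectors are U+180B-180D and U+FE00-FE0F.
MAP_TO_NOTHING = (
    [
        0x00AD,  # SOFT HYPHEN
        0x1806,  # MONGOLIAN TODO SOFT HYPHEN
        0x034F,  # COMBINING GRAPHEME JOINER
        0xFFFC,  # OBJECT REPLACEMENT CHARACTER
    ]
    + list(range(0x180B, 0x180D + 1))  # Mongolian free variation selectors
    + list(range(0xFE00, 0xFE0F + 1))  # variation selectors 1-16
)

# str.translate() deletes every code point whose table entry is None.
_DELETE_TABLE = {cp: None for cp in MAP_TO_NOTHING}


def map_to_nothing(value: str) -> str:
    """Return value with the listed code points removed."""
    return value.translate(_DELETE_TABLE)


print(map_to_nothing("co\u00adop") == "coop")  # True
```

A lookup table keyed by code point keeps the rule set declarative, so an implementation could extend it with the other Map rules without changing the function.
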

For non-numeric matching rules, CHARACTER TABULATION (U+0009), LINE FEED (LF) (U+000A), LINE TABULATION (U+000B), FORM FEED (FF) (U+000C), CARRIAGE RETURN (CR) (U+000D), and NEXT LINE (NEL) (U+0085) are to be mapped to SPACE (U+0020). For numeric matching rules, these are mapped to nothing.

All other code points of Control class are to be mapped to nothing.

ZERO WIDTH SPACE (U+200B) is to be mapped to nothing. For non-numeric matching rules, all other code points with Separator (space, line, or page) property are to be mapped to SPACE (U+0020). For numeric matching rules, all Separator characters are mapped to nothing.

For case ignore, numeric, and stored prefix string matching rules, characters are to be case folded per B.2 of [STRPREP].

3.3. Normalize

In this step, the input string SHALL be normalized by first normalizing it to Unicode Form KC (compatibility composed) as described in [UAX15] and then removing insignificant SPACE (U+0020) code points (spaces). Note - numerous code points are mapped to SPACE (U+0020) by the preceding step.

The following spaces are regarded as not significant:

    - leading spaces (i.e. those preceding the first character that is not a space);

    - trailing spaces (i.e. those following the last character that is not a space);

    - multiple consecutive spaces (these are taken as equivalent to a single space character).

(A string consisting entirely of spaces is equivalent to a string containing exactly one space.)

For example, removal of spaces from the Form KC string:

    "  foo  bar  "

would result in the output string:

    "foo bar"

and the Form KC string:

    "   "

would result in the output string:

    " "

3.4. Prohibit

All Unassigned, Private Use, and non-character code points are prohibited. Surrogate codes (U+D800-DFFF) are prohibited. The REPLACEMENT CHARACTER (U+FFFD) code is prohibited.
The first code point of a string is prohibited from being a combining character.

The step fails if the input string contains any prohibited code point. The output string is the input string.

3.5. Check bidi

There are no bidirectional restrictions. The output string is the input string.

4. String Matching Rules

In the matching rules specified in this section, all presented and stored string values SHALL be prepared for matching as described in Section 3. String preparation produces strings suitable for character-by-character matching.

4.1. Case Exact / Ignore Match

The Case Exact Match rule compares for equality a presented string with an attribute value of type DirectoryString or one of the data types appearing in the choice type DirectoryString, e.g. UTF8String.

    caseExactMatch MATCHING-RULE ::= {
        SYNTAX DirectoryString {ub-match}
        ID     id-mr-caseExactMatch }

The Case Ignore Match rule compares for equality a presented string with an attribute value of type DirectoryString or one of the data types appearing in the choice type DirectoryString, e.g. UTF8String, without regard to the case (upper or lower) of the strings (e.g. "Dundee" and "DUNDEE" match). The rule is identical to the caseExactMatch rule except upper case characters are folded to lower case during string preparation as discussed in 3.2.

    caseIgnoreMatch MATCHING-RULE ::= {
        SYNTAX DirectoryString {ub-match}
        ID     id-mr-caseIgnoreMatch }

Both rules return TRUE if the prepared strings are the same length and corresponding characters are identical.

4.2. Case Exact / Ignore Ordering Match

The Case Exact Ordering Match rule compares the collation order of a presented string with an attribute value of type DirectoryString or one of the data types appearing in the choice type DirectoryString, e.g. UTF8String.

    caseExactOrderingMatch MATCHING-RULE ::= {
        SYNTAX DirectoryString {ub-match}
        ID     id-mr-caseExactOrderingMatch }

The Case Ignore Ordering Match rule compares the collation order of a presented string with an attribute value of type DirectoryString or one of the data types appearing in the choice type DirectoryString, e.g. UTF8String, without regard to the case (upper or lower) of the strings. The rule is identical to the caseExactOrderingMatch rule except upper case characters are folded to lower case during string preparation as discussed in 3.2.

    caseIgnoreOrderingMatch MATCHING-RULE ::= {
        SYNTAX DirectoryString {ub-match}
        ID     id-mr-caseIgnoreOrderingMatch }

Both rules return TRUE if the attribute value is "less" or appears earlier than the presented value, when the prepared strings are compared using Unicode code point collation order.

4.3. Case Exact / Ignore Substrings Match

The Case Exact Substrings Match rule determines whether a presented value is a substring of an attribute value of type DirectoryString or one of the data types appearing in the choice type DirectoryString, e.g. UTF8String.

    caseExactSubstringsMatch MATCHING-RULE ::= {
        SYNTAX SubstringAssertion -- only the PrintableString choice
        ID     id-mr-caseExactSubstringsMatch }

    SubstringAssertion ::= SEQUENCE OF CHOICE {
        initial [0] DirectoryString {ub-match},
        any     [1] DirectoryString {ub-match},
        final   [2] DirectoryString {ub-match},
        control     Attribute }
            -- Used to specify interpretation of the following items
        -- at most one initial and one final component

The Case Ignore Substrings Match rule determines whether a presented value is a substring of an attribute value of type DirectoryString or one of the data types appearing in the choice type DirectoryString, e.g. UTF8String, without regard to the case (upper or lower) of the strings.
The rule is identical to the caseExactSubstringsMatch rule except upper case characters are folded to lower case during string preparation as discussed in 3.2.

    caseIgnoreSubstringsMatch MATCHING-RULE ::= {
        SYNTAX SubstringAssertion
        ID     id-mr-caseIgnoreSubstringsMatch }

Both rules return TRUE if there is a partitioning of the prepared attribute value (into portions) such that:

    - the specified substrings (initial, any, final) match different portions of the value in the order of the strings sequence;

    - initial, if present, matches the first portion of the value;

    - any, if present, matches some arbitrary portion of the value;

    - final, if present, matches the last portion of the value;

    - control is not used for the caseExactSubstringsMatch, caseIgnoreSubstringsMatch, telephoneNumberSubstringsMatch, or any other form of substring match for which only initial, any, or final elements are used in the matching algorithm; if a control element is encountered, it is ignored. The control element is only used for matching rules that explicitly specify its use in the matching algorithm. Such a matching rule may also redefine the semantics of the initial, any and final substrings.

    NOTE - The generalWordMatch matching rule is an example of such a matching rule.

There shall be at most one initial, and at most one final in strings. If initial is present, it shall be the first element of strings. If final is present, it shall be the last element of strings. There shall be zero or more any in strings.

For a component of substrings to match a portion of the attribute value, corresponding characters must be identical (including all combining characters in the combining character sequences).

4.4. Numeric String Match

The Numeric String Match rule compares for equality a presented numeric string with an attribute value of type NumericString.

    numericStringMatch MATCHING-RULE ::= {
        SYNTAX NumericString
        ID     id-mr-numericStringMatch }

The rule is identical to the caseIgnoreMatch rule (case is irrelevant as characters are numeric) except that all space characters are removed during string preparation.

4.5. Numeric String Ordering Match

The Numeric String Ordering Match rule compares the collation order of a presented string with an attribute value of type NumericString.

    numericStringOrderingMatch MATCHING-RULE ::= {
        SYNTAX NumericString
        ID     id-mr-numericStringOrderingMatch }

The rule is identical to the caseIgnoreOrderingMatch rule (case is irrelevant as characters are numeric) except that all space characters are removed during string preparation.

4.6. Numeric String Substrings Match

The Numeric String Substrings Match rule determines whether a presented value is a substring of an attribute value of type NumericString.

    numericStringSubstringsMatch MATCHING-RULE ::= {
        SYNTAX SubstringAssertion
        ID     id-mr-numericStringSubstringsMatch }

The rule is identical to the caseIgnoreSubstringsMatch rule (case is irrelevant as characters are numeric) except that all space characters are removed during string preparation.

4.7. Case Ignore List Match

The Case Ignore List Match rule compares for equality a presented sequence of strings with an attribute value which is a sequence of DirectoryStrings, without regard to the case (upper or lower) of the strings.

    caseIgnoreListMatch MATCHING-RULE ::= {
        SYNTAX CaseIgnoreList
        ID     id-mr-caseIgnoreListMatch }

    CaseIgnoreList ::= SEQUENCE OF DirectoryString {ub-match}

The rule returns TRUE if and only if the number of strings in each is the same, and corresponding strings match. The latter matching is as for the caseIgnoreMatch matching rule.

4.8. Case Ignore List Substrings Match

The Case Ignore List Substrings Match rule compares a presented substring with an attribute value which is a sequence of DirectoryStrings, but where the case (upper or lower) is not significant for comparison purposes.

    caseIgnoreListSubstringsMatch MATCHING-RULE ::= {
        SYNTAX SubstringAssertion
        ID     id-mr-caseIgnoreListSubstringsMatch }

A presented value matches a stored value if and only if the presented value matches the string formed by concatenating the strings of the stored value. This matching is done according to the caseIgnoreSubstringsMatch rule; however, none of the initial, any, or final values of the presented value are considered to match a substring of the concatenated string which spans more than one of the strings of the stored value.

4.9. Stored Prefix Match

The Stored Prefix Match rule determines whether an attribute value, whose syntax is DirectoryString, is a prefix (i.e. initial substring) of the presented value, without regard to the case (upper or lower) of the strings.

    NOTE - It can be used, for example, to compare values in the Directory which are telephone area codes with a purported value which is a telephone number.

    storedPrefixMatch MATCHING-RULE ::= {
        SYNTAX DirectoryString {ub-match}
        ID     id-mr-storedPrefixMatch }

The rule returns TRUE if the attribute value is an initial substring of the presented value with corresponding characters identical except possibly with regard to case.

5. Other changes to X.520

This document makes the following changes to X.520:

The section 6.2.8 sentence:

    The rules for matching are identical to those for caseIgnoreMatch, except that all space and "-" characters are skipped during the comparison.

is replaced with:

    The rules for matching are identical to those for numericStringMatch, except that HYPHEN-MINUS (U+002D), ARMENIAN HYPHEN (U+058A), HYPHEN (U+2010), NON-BREAKING HYPHEN (U+2011), MINUS SIGN (U+2212), SMALL HYPHEN-MINUS (U+FE63), and FULLWIDTH HYPHEN-MINUS (U+FF0D) are to be mapped to nothing during the string preparation map step.

The section 6.2.9 sentence:

    The rules for matching are identical to those for caseExactSubstringsMatch, except that all space and "-" characters are skipped during the comparison.

is replaced with:

    The rules for matching are identical to those for caseExactSubstringsMatch, except that HYPHEN-MINUS (U+002D), ARMENIAN HYPHEN (U+058A), HYPHEN (U+2010), NON-BREAKING HYPHEN (U+2011), MINUS SIGN (U+2212), SMALL HYPHEN-MINUS (U+FE63), and FULLWIDTH HYPHEN-MINUS (U+FF0D) are to be mapped to nothing during the string preparation map step.

6. Security Considerations

TBD

7. Acknowledgments

The approach used in this document is based upon design principles and algorithms described in "Preparation of Internationalized Strings ('stringprep')" [STRPREP] by Paul Hoffman and Marc Blanchet. Additional guidance was drawn from Unicode Technical Standards, Technical Reports, and Notes.

Sections 3.3 and 4 of this document are derived from Section 6.1 of [X.520].

8. Author's Address

Kurt Zeilenga
E-mail:

9. References

9.1. Normative References

[RFC2119] S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", BCP 14 (also RFC 2119), March 1997.

[X.520] International Telecommunication Union, "The Directory: Selected Attribute Types", X.520, 2000.

[STRPREP] P. Hoffman, M. Blanchet, "Preparation of Internationalized Strings ('stringprep')", draft-hoffman-stringprep-xx.txt (a work in progress).

[ISO10646] Universal Multiple-Octet Coded Character Set (UCS) - Architecture and Basic Multilingual Plane, ISO/IEC 10646-1 : 1993.

[UAX15] M. Davis, M. Duerst, "Unicode Standard Annex #15: Unicode Normalization Forms, Version 3.2.0". .

[UNICODE] The Unicode Consortium, "The Unicode Standard, Version 3.2.0" is defined by "The Unicode Standard, Version 3.0" (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the "Unicode Standard Annex #27: Unicode 3.1" (http://www.unicode.org/reports/tr27/) and by the "Unicode Standard Annex #28: Unicode 3.2" (http://www.unicode.org/reports/tr28/).

9.2. Informative References

[X.500] International Telecommunication Union, "The Directory: Overview of Concepts, Models and Service", X.500, 2000.

Copyright 2002, The Internet Society. All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE AUTHORS, THE INTERNET SOCIETY, AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.