Sieve Working Group                                         K. Murchison
Internet-Draft                                Carnegie Mellon University
Expires: September 24, 2010                                     N. Freed
                                                      Oracle Corporation
                                                          March 23, 2010


          Sieve Email Filtering: Regular Expression Extension
                     draft-ietf-sieve-regex-01.txt

Abstract

   This document describes the "regex" extension to the Sieve email
   filtering language.  In some cases, it is desirable to have a string
   matching mechanism which is more powerful than a simple exact match,
   a substring match or a glob-style wildcard match.  The regular
   expression matching mechanism defined in this draft provides users
   with much more powerful string matching capabilities.

Change History (to be removed prior to publication as an RFC)

   Changes from draft-murchison-sieve-regex-08:

   o  Updated to XML source.

   o  Documented interaction with variables.

   Changes from draft-ietf-sieve-regex-00:

   o  Various cleanup and updates.

   o  Added trial text specifying comparator interactions.

Open Issues (to be removed prior to publication as an RFC)

   o  The major open issue with this draft is what to do, if anything,
      about localization/internationalization.  Are [IEEE.1003-2.1992]
      collating sequences and character equivalents sufficient?  Should
      we reference the Unicode technical specification?  Should we punt
      and publish the document as experimental?

   o  Is the current approach to comparator integration the right one to
      use?

   o  Should we allow shorthands such as \\b (word boundary) and \\w
      (word character)?

Murchison & Freed      Expires September 24, 2010               [Page 1]

Internet-Draft            Sieve Regex Extension               March 2010

   o  Should we allow backreferences (useful for matching double words,
      etc.)?

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 24, 2010.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the BSD License.

1.  Introduction

   Sieve [RFC5228] is a language for filtering email messages at or
   around the time of final delivery.  It is designed to be
   implementable on either a mail client or mail server.

   The Sieve base specification defines so-called match types for tests:
   is, contains, and matches.  An "is" test requires an exact match, a
   "contains" test provides a substring match, and "matches" provides
   glob-style wildcards.  This document describes an extension to the
   Sieve language that provides a new match type for regular expression
   comparisons.

2.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   The terms used to describe the various components of the Sieve
   language are taken from Section 1.1 of [RFC5228].

3.  Capability Identifier

   The capability string associated with the extension defined in this
   document is "regex".

4.  Regex Match Type

   When the regex extension is available, commands that support matching
   may take the optional tagged argument ":regex" to specify that a
   regular expression match should be performed.  The ":regex" match
   type is subject to the same rules and restrictions as the standard
   match types defined in [RFC5228].

   The "MATCH-TYPE" syntax element defined in [RFC5228] is augmented
   here as follows:

      MATCH-TYPE  =/  ":regex"

5.  Interaction with Sieve comparators

   In order to provide for matches between character sets and case
   insensitivity, Sieve uses the comparators defined in the Internet
   Application Protocol Collation Registry [RFC5228].  The comparator
   used by a given test is specified by the :comparator argument.

   The interaction between collators and the match types defined in the
   Sieve base specification is straightforward.  However, the nature of
   regular expressions does not lend itself to this usage for the :regex
   match type.

   A component of the definition of many collators is a normalization
   operation.  For example, the "i;octet" comparator employs an identity
   normalization, whereas the "i;ascii-casemap" comparator normalizes
   all lower case ASCII characters to upper case.

   The :regex match type only uses the normalization component of the
   associated comparator.  This normalization operation is applied to
   the key-list argument to the test; the result of that normalization
   becomes the target of the regular expression comparison.

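   The normalize-then-match behavior can be illustrated with a short
   Python sketch.  It is not part of this specification: the names
   NORMALIZERS and regex_match are hypothetical, Python's "re" module
   stands in for a POSIX ERE engine (the two syntaxes differ in places),
   and the sketch assumes that normalization is applied to the string
   under test while the pattern is used exactly as written.

```python
import re


def _ascii_upper(text):
    """Map ASCII a-z to A-Z and leave every other character untouched."""
    return "".join(
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch for ch in text
    )


# Hypothetical table: only the two comparators named in the text.
NORMALIZERS = {
    "i;octet": lambda text: text,        # identity normalization
    "i;ascii-casemap": _ascii_upper,     # lower case ASCII -> upper case
}


def regex_match(value, pattern, comparator):
    """Normalize the tested value, then run the regular expression on it.

    Assumption: the pattern itself is never normalized.
    """
    target = NORMALIZERS[comparator](value)
    return re.search(pattern, target) is not None
```

   Under these assumptions, regex_match("Hello", "^HELLO$",
   "i;ascii-casemap") succeeds because the value is upper-cased before
   matching, while the same call with "i;octet" fails because the value
   is compared as written.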
   The comparator has no effect on the regular expression pattern or the
   underlying comparison operation.

   It is an error to specify a comparator that has no associated
   normalization operation in conjunction with a :regex match type.

6.  Regular expression comparisons

   Implementations MUST support extended regular expressions (EREs) as
   defined by [IEEE.1003-2.1992].  Any regular expression not defined by
   [IEEE.1003-2.1992], as well as [IEEE.1003-2.1992] basic regular
   expressions, word boundaries, and backreferences, are not supported
   by this extension.  Implementations SHOULD reject regular expressions
   that are unsupported by this specification as a syntax error.

   The following tables provide a brief summary of the regular
   expressions that MUST be supported.  These tables are presented here
   only as a guideline.  [IEEE.1003-2.1992] should be used as the
   definitive reference.

   +------------+------------------------------------------------------+
   | Expression | Meaning                                              |
   +------------+------------------------------------------------------+
   | .          | Match any single character except newline.           |
   | [ ]        | Bracket expression.  Match any one of the enclosed   |
   |            | characters.  A hyphen (-) indicates a range of       |
   |            | consecutive characters.                              |
   | [^ ]       | Negated bracket expression.  Match any one character |
   |            | NOT in the enclosed list.  A hyphen (-) indicates a  |
   |            | range of consecutive characters.                     |
   | \\         | Escape the following special character (match the    |
   |            | literal character).  Undefined for other characters. |
   |            | NOTE: Unlike [IEEE.1003-2.1992], a double-backslash  |
   |            | is required as per section 2.4.2 of [RFC5228].       |
   +------------+------------------------------------------------------+

                Table 1: Items to match a single character

   +------------+------------------------------------------------------+
   | Expression | Meaning                                              |
   +------------+------------------------------------------------------+
   | [: :]      | Character class (alnum, alpha, blank, cntrl, digit,  |
   |            | graph, lower, print, punct, space, upper, xdigit).   |
   | [= =]      | Character equivalents.                               |
   | [. .]      | Collating sequence.                                  |
   +------------+------------------------------------------------------+

   Table 2: Items to be used within a bracket expression (localization)

   +------------+------------------------------------------------------+
   | Expression | Meaning                                              |
   +------------+------------------------------------------------------+
   | ?          | Match zero or one instances.                         |
   | *          | Match zero or more instances.                        |
   | +          | Match one or more instances.                         |
   | {n,m}      | Match any number of instances between n and m        |
   |            | (inclusive).  {n} matches exactly n instances.  {n,} |
   |            | matches n or more instances.                         |
   +------------+------------------------------------------------------+

   Table 3: Quantifiers - Items to count the preceding regular expression

        +------------+--------------------------------------------+
        | Expression | Meaning                                    |
        +------------+--------------------------------------------+
        | ^          | Match the beginning of the line or string. |
        | $          | Match the end of the line or string.       |
        +------------+--------------------------------------------+

               Table 4: Anchoring - Items to match positions

   +------------+------------------------------------------------------+
   | Expression | Meaning                                              |
   +------------+------------------------------------------------------+
   | |          | Alternation.  Match either of the separated regular  |
   |            | expressions.                                         |
   | ( )        | Group the enclosed regular expression(s).            |
   +------------+------------------------------------------------------+

                         Table 5: Other constructs

7.  Interaction with Sieve Variables

   This extension is compatible with, and may be used in conjunction
   with, the Sieve Variables extension [RFC5229].

7.1.  Match variables

   A Sieve interpreter that supports both "regex" and "variables" MUST
   set "match variables" (as defined by [RFC5229] section 3.2) whenever
   the ":regex" match type is used.  The list of match variables will
   contain the strings corresponding to the group operators in the
   regular expression.  The groups are ordered by the position of the
   opening parenthesis, from left to right.  Note that in regular
   expressions, expansions match as much as possible (greedy matching).

   Example:

      require ["fileinto", "regex", "variables"];

      if header :regex "List-ID" "<(.*)@" {
          fileinto "lists.${1}";
          stop;
      }

      # Imagine the header
      # Subject: [acme-users] [fwd] version 1.0 is out
      if header :regex "Subject" "^\\[(.*)\\] (.*)$" {
          # ${1} will hold "acme-users] [fwd"
          # ${2} will hold "version 1.0 is out"
          stop;
      }

7.2.  Set modifier :quoteregex

   A Sieve interpreter that supports both "regex" and "variables" MUST
   support the optional tagged argument ":quoteregex" for use with the
   "set" action.  The ":quoteregex" modifier is subject to the same
   rules and restrictions as the standard modifiers defined in [RFC5229]
   section 4.

   For convenience, the "MODIFIER" syntax element defined in [RFC5229]
   is augmented here as follows:

      MODIFIER  =/  ":quoteregex"

   This modifier adds the necessary quoting to ensure that the expanded
   text will only match a literal occurrence if used as a parameter to
   :regex.  Every character with special meaning (".", "*", "?", etc.)
   is prefixed with "\" in the expansion.  This modifier has a
   precedence value of 20 when used with other modifiers.

8.  Examples

   Example:

      require "regex";

      # Try to catch unsolicited email.
      if anyof (
          # if a message is not to me (with optional +detail),
          not address :regex ["to", "cc", "bcc"]
              "me(\\+.*)?@company\\.com",

          # or the subject is all uppercase (no lowercase)
          header :regex :comparator "i;octet" "subject"
              "^[^[:lower:]]+$" ) {

          discard;    # junk it
      }

9.  IANA Considerations

   The following template specifies the IANA registration of the "regex"
   Sieve extension specified in this document:

   To: iana@iana.org
   Subject: Registration of new Sieve extension

   Capability name: regex
   Capability keyword: regex
   Capability arguments: N/A
   Standards Track/IESG-approved experimental RFC number: this RFC
   Person and email address to contact for further information:

      Kenneth Murchison
      E-Mail: murch@andrew.cmu.edu

   This information should be added to the list of Sieve extensions
   given on http://www.iana.org/assignments/sieve-extensions.

10.  Security Considerations

   General Sieve security considerations are discussed in [RFC5228].
   All of the issues described there also apply to regular expression
   matching.

   It is easy to construct problematic regular expressions that are
   computationally infeasible to evaluate.  Execution of a Sieve script
   that employs a potentially problematic regular expression, such as
   "(.*)*", may cause problems ranging from degradation of performance
   to outright denial of service.  Moreover, determining the
   computational complexity associated with evaluating a given regular
   expression is in general an intractable problem.

   For this reason, all implementations MUST take appropriate steps to
   limit the impact of runaway regular expression evaluation.
   Implementations MAY restrict the regular expressions users are
   allowed to specify.
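   One way an implementation might restrict user-supplied patterns is a
   screening step that runs before the pattern is compiled.  The Python
   sketch below is illustrative only and is not defined by this
   specification: the function name, the length limit, and the
   nested-quantifier heuristic are all assumptions, and the heuristic is
   deliberately crude (it neither catches every costly pattern nor
   avoids every false positive).

```python
import re

# Hypothetical limit; a real deployment would choose its own value.
MAX_PATTERN_LENGTH = 256

# Crude heuristic: a parenthesized group that contains "*" or "+" and is
# itself followed by "*", "+", or "{", as in "(.*)*" or "(a+)+".
_NESTED_QUANTIFIER = re.compile(r"\([^()]*[*+][^()]*\)[*+{]")


def is_pattern_allowed(pattern):
    """Return False for patterns this sketch treats as too risky."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        return False
    return _NESTED_QUANTIFIER.search(pattern) is None
```

   With these assumptions, the pattern "(.*)*" mentioned above is
   refused, while a pattern without a quantified group that contains
   another quantifier, such as "^[^[:lower:]]+$", is accepted.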
   Implementations that do not impose such restrictions SHOULD provide a
   means to abort evaluation of tests using the :regex match type if the
   operation is taking too long.

11.  Normative References

   [IEEE.1003-2.1992]
              Institute of Electrical and Electronics Engineers,
              "Information Technology - Portable Operating System
              Interface (POSIX) - Part 2: Shell and Utilities (Vol. 1)",
              IEEE Standard 1003.2, 1992.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC5228]  Guenther, P. and T. Showalter, "Sieve: An Email Filtering
              Language", RFC 5228, January 2008.

   [RFC5229]  Homme, K., "Sieve Email Filtering: Variables Extension",
              RFC 5229, January 2008.

Appendix A.  Acknowledgments

   Most of the text documenting the interaction with Sieve variables was
   taken from an early draft of Kjetil Homme's Sieve variables
   specification.

   Thanks to Tim Showalter, Alexey Melnikov, Tony Hansen, Phil Pennock,
   and Jutta Degener for their help with this document.

Authors' Addresses

   Kenneth Murchison
   Carnegie Mellon University
   5000 Forbes Avenue
   Cyert Hall 285
   Pittsburgh, PA  15213
   US

   Phone: +1 412 268 2638
   Email: murch@andrew.cmu.edu


   Ned Freed
   Oracle Corporation
   800 Royal Oaks
   Monrovia, CA  91016-6347
   USA

   Phone: +1 909 457 4293
   Email: ned.freed@mrochek.com