INTERNET-DRAFT                                            Bill Yerazunis
draft-yerazunis-spamfilt-inoculation-02.txt           Jonathan Zdziarski
spamfilt group                                             February 2004
Expires August 2004


             A MIME Encoding for Spam Inoculation Messages

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   Distribution of this memo is unlimited.

Abstract

   This document describes in detail a method for encapsulating an
   email message or text sample for the purpose of training (or
   "inoculating") a mail filter.  The sample messages or text (the
   "payload") provide the contextual information necessary for the
   filter to reject ("spam") or accept ("non-spam") the message being
   inoculated, or messages similar in design.

   RFC 2045 defines the MIME format.  This document expands on this by
   adding an "inoculation" MIME subtype, and also adds additional
   header fields necessary to the functionality being provided.

   This message format is designed to enable mail filters of different
   designs to communicate inoculations with one another using the MIME
   subtype introduced.

1. Introduction

   Analytical anti-spam tools are all subject to the same inherent
   problem, which is that spam is dynamic; it evolves.  This constant
   mutation guarantees a marginal error rate in all such anti-spam
   tools, making it difficult to achieve perfect accuracy.
   The premise behind inoculation is to distribute these new mutations
   to other users so that an entire group may benefit from one user's
   misfortune.

   Because many different anti-spam tools are available today, a
   standard for sharing an inoculation of either spam or nonspam must
   be created to both encourage and enable the widespread acceptance of
   this practice.

   This memo describes several components that combine to create the
   message format for sharing an inoculation payload.  In particular,
   it describes:

   1. The inoculation subtype, which identifies that the message being
      received is an inoculation and should be treated accordingly.

   2. An Inoculation-Sender field, which identifies the sender of the
      inoculation, and provides an identity the recipient can query
      locally for authentication information (such as a shared secret,
      public key, et cetera).

   3. An Inoculation-Type field, which specifies the type of
      inoculation payload being sent (spam or nonspam), to instruct
      the filter how to proceed with importing the inoculation payload.

   4. An Inoculation-Authentication field, which specifies the method
      of authentication provided (if any) to verify that the
      inoculation is from a trusted user.

   5. Extended authentication message components, such as a public key
      signature, which may be present depending on the authentication
      mechanism used.

   6. The inoculation payload, which is the actual information
      provided to seed the filter tool.

   This memo expands on RFCs 2822 and 2045, which outline the relevant
   standards for the Internet Mail format (email), and depends upon the
   standards outlined in RFC 2015 for PGP signed data.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC 2119].

2. Notations, Conventions, and Generic Grammar

   Many of the mechanisms specified in this memo are described formally
   in RFC 2822 and RFC 2045.  Implementors will need to be familiar
   with this notation in order to understand this specification, and
   are referred to RFC 2822 and RFC 2045 for a complete explanation.

   The term "message", when not further qualified, means either the
   (complete or "top-level") message being transferred on a network, or
   a message encapsulated in a body of type "message".

3. The Inoculation Subtype

   The inoculation subtype specifies the nature of the message body to
   be a complete message, spam or nonspam, presented for inoculation to
   the recipient's filter agent.  The media type identifies the payload
   being sent.  In the augmented BNF notation of RFC 2822, the
   message/inoculation MIME type is represented in the Content-Type
   header field defined as follows:

     content := "Content-Type" ":" type "/" subtype
                *(";" auth-parameter)
                ; case-insensitive matching of type and subtype

     type := "message" / "text" / "multipart"
             ; All values case-insensitive

     subtype := "inoculation"
                ; case-insensitive

     auth-parameter := auth-attribute "=" value

     auth-attribute := token
                       ; case-insensitive

     value := token / quoted-string

   The three initial pre-defined media types are detailed in the bulk
   of this memo.  They are:

     message   -- complete message.  Defines the inoculation as a
                  complete message (spam or nonspam) with its own
                  message structure in compliance with RFC 2822.

     text      -- miscellaneous text.  Defines the inoculation as a
                  string of related text without any specific
                  structure.

     multipart -- an inoculation consisting of multiple parts of
                  independent data types.

   RATIONALE: A filter may process the analysis of the inoculation
   payload differently depending on the type of information being sent.
   In order to ensure the most effective use of the inoculation
   payload, each inoculation must provide this basic information about
   itself to avoid ambiguity.

   It should be noted that the list of Content-Type values given here
   may be augmented in time, via the mechanisms described above, and
   that the set of types is expected to grow substantially.

   When a mail reader encounters mail with a subtype of 'inoculation'
   and an unknown type value, it should generally treat it as
   equivalent to "text/inoculation", as described in this memo.

4. The Inoculation-Sender Field

   The Inoculation-Sender field identifies the sender to the recipient
   using a common identity shared between the two (for example, an
   email address or user name).  The sender identity is necessary for
   authentication of the inoculation, because it provides a reference
   to the correct secret, public key, or other authentication
   information.

   This field has not been defined by any previous standard.  The
   field's value is a single token specifying the sender's identity, as
   shown below.  Formally:

     sender := "Inoculation-Sender" ":" token

     token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
                 or tspecials>

     tspecials := "(" / ")" / "<" / ">" / "," / ";" /
                  "\" / <"> / "/" / "[" / "]"
                  ; Must be in quoted-string,
                  ; to use within parameter values

   The values used are case insensitive.  That is, BOB and bob are both
   the same sender.

   Identities should be specific enough to avoid any potential
   collisions with other users.  A single user should have a single
   identity among the other users they are sharing inoculations with.
   For this reason, the sender field can support an email address or
   fingerprint identity.

   The Inoculation-Sender field is a required field and must be present
   in all inoculation messages.  If the message is a
   multipart/inoculation media type, the Inoculation-Sender field
   should follow the rules below:

   1. If all parts of the message are being sent by the same sender,
      the Inoculation-Sender field may appear only once in the
      message's top-level headers, or individually in each part of the
      message.

   2. If the message consists of parts being sent by different
      senders, the Inoculation-Sender field must not appear in the
      message's top-level headers, but must appear individually in
      each part of the message.

   RATIONALE: The sender's identity may not always match the "From"
   field of the message.  A field specific to the sender's
   identification is necessary so that senders remain free to change
   their email address, name, or other such data they may use in the
   "From" field to identify themselves casually.

5. The Inoculation-Type Field

   The Inoculation-Type field identifies the type of inoculation being
   sent.  The two inoculation types presently supported are "spam" and
   "nonspam".  It is necessary to specify the type of inoculation in
   order to direct the appropriate method of learning chosen by the
   filter.

   This field has not been defined by any previous standard.  The
   field's value is a single token specifying the type of inoculation,
   as shown below.  Formally:

     type := "Inoculation-Type" ":" attribute

     attribute := "spam" / "nonspam" / x-token
                  ; all values case insensitive

     x-token := <The two characters "X-" or "x-" followed, with
                 no intervening white space, by any token>

     token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
                 or tspecials>

     tspecials := "(" / ")" / "<" / ">" / "@" /
                  "," / ";" / ":" / "\" / <"> /
                  "/" / "[" / "]" / "?" / "="
                  ; Must be in quoted-string,
                  ; to use within parameter values

   These values are not case sensitive.  That is, SPAM and spam and
   SpAm are all equivalent.

   A "spam" inoculation type must be accompanied by a message that is
   deemed to be spam by the sender, and a "nonspam" inoculation type
   must be accompanied by a message that is deemed to be innocent by
   the sender.

   The Inoculation-Type field is a required field and must be present
   in all inoculation messages.
   If the message is a multipart/inoculation media type, an
   Inoculation-Type field should be present in the headers of each part
   of the message.

   Implementors may, if necessary, define new Inoculation-Type values,
   but must use an X-token, which is a name prefixed by "X-" to
   indicate its non-standard status, e.g.:

     Inoculation-Type: x-my-new-type

   However, the creation of new Inoculation-Type values is strongly
   discouraged, as it seems likely to hinder interoperability with
   little potential benefit.

6. The Inoculation-Authentication Field

   The Inoculation-Authentication field specifies the type of
   authentication being used to authenticate the sender's identity.
   Authentication is necessary to ensure that the sender is not a
   malicious party attempting to reprogram the recipient's filter
   (something a spammer, for example, may attempt to do with mass
   inoculation mailings).  The defined authentication methods provide a
   means of authenticating both the sender and the message, to ensure
   that the message has not been modified in transit.

   This field has not been defined by any previous standard.  The
   field's value is a single token specifying the type of
   authentication mechanism used, as shown below.  Formally:

     type := "Inoculation-Authentication" ":" auth-type
             *(";" auth-parameter)

     auth-type := "none" / "md5" / "signed" / x-token
                  ; all values case insensitive

     x-token := <The two characters "X-" or "x-" followed, with
                 no intervening white space, by any token>

     token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
                 or tspecials>

     tspecials := "(" / ")" / "<" / ">" / "@" /
                  "," / ";" / ":" / "\" / <"> /
                  "/" / "[" / "]" / "?" / "="
                  ; Must be in quoted-string,
                  ; to use within parameter values

   These values are not case sensitive.  That is, md5, MD5, and Md5 are
   all equivalent.

   The Inoculation-Authentication field is a required field and must be
   present in all inoculation messages.  If the message uses a
   multipart/inoculation media type, an Inoculation-Authentication
   field must be provided for each part of the message.
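
   As an illustration of the Inoculation-Authentication grammar above,
   the following is a minimal Python sketch of how a recipient might
   split the field value into an auth-type and its parameters.  The
   names parse_auth_field, is_token, and KNOWN_AUTH_TYPES are
   hypothetical and do not come from this memo.  The sketch assumes
   that "token" means printable US-ASCII excluding SPACE, CTLs, and
   tspecials (as in RFC 2045), and it does not handle a ";" inside a
   quoted-string.

```python
# Sketch only: names below are illustrative, not defined by this memo.
TSPECIALS = '()<>@,;:\\"/[]?='
# Assumption: a token is printable US-ASCII minus SPACE, CTLs, tspecials.
TOKEN_CHARS = {chr(c) for c in range(0x21, 0x7F)} - set(TSPECIALS)
KNOWN_AUTH_TYPES = {"none", "md5", "signed"}


def is_token(text):
    """Return True if text is a non-empty run of token characters."""
    return bool(text) and all(ch in TOKEN_CHARS for ch in text)


def parse_auth_field(value):
    """Split an Inoculation-Authentication value into (auth_type, params).

    Simplification: parameters are split on ';', so a ';' inside a
    quoted-string is not supported.
    """
    parts = [part.strip() for part in value.split(";")]
    auth_type = parts[0].lower()  # values are case insensitive
    is_x_token = (auth_type.startswith("x-") and len(auth_type) > 2
                  and is_token(auth_type))
    if auth_type not in KNOWN_AUTH_TYPES and not is_x_token:
        raise ValueError("unknown auth-type: %r" % parts[0])

    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing ';'
        name, sep, raw = part.partition("=")
        name, raw = name.strip().lower(), raw.strip()
        if not sep or not is_token(name):
            raise ValueError("malformed auth-parameter: %r" % part)
        if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
            raw = raw[1:-1]  # strip the quoted-string delimiters
        elif not is_token(raw):
            raise ValueError("malformed parameter value: %r" % part)
        params[name] = raw
    return auth_type, params
```

   For example, parse_auth_field('md5; checksum="abc123"') yields the
   pair ("md5", {"checksum": "abc123"}), while an unregistered value
   without the "X-" prefix raises ValueError.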

   Implementors may, if necessary, define new
   Inoculation-Authentication values, but must use an X-token, which is
   a name prefixed by "X-" to indicate its non-standard status, e.g.:

     Inoculation-Authentication: x-my-new-mechanism

   However, the creation of new Inoculation-Authentication values is
   strongly discouraged, as it seems likely to hinder interoperability
   with little potential benefit.

6.1 The "none" Authentication Mechanism

   The "none" authentication mechanism identifies the message as having
   no means of authentication.  The decision is left up to the
   recipient as to whether to accept or reject an inoculation with no
   means of authentication.

6.2 The "md5" Authentication Mechanism

   The "md5" authentication mechanism identifies the message as using
   an MD5 checksum in conjunction with a shared secret to authenticate
   the sender and the inoculation payload.  The formal grammar for the
   Inoculation-Authentication field for md5 is as follows:

     auth-type := "md5" ";" "checksum" "=" checksum

     checksum := 1*

   The checksum provided should be both generated by the sender and
   authenticated by the recipient using the MD5 algorithm in the
   following manner:

   1. The sender and recipient have agreed on a shared secret, or
      verification code, to authenticate using this mechanism.

   2. The recipient will, based on the sender identified in the
      Inoculation-Sender field, look up the sender's shared secret.

   3. An MD5 checksum is generated by combining the shared secret + a
      newline character + the inoculation payload.

   4. If the checksum generated by the recipient matches the checksum
      provided by the sender, the inoculation is authenticated.

   If the inoculation message has a media type of
   multipart/inoculation, a separate authentication checksum must be
   provided for every part of the message using the inoculation media
   subtype.
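
   The checksum procedure above can be sketched in Python as follows.
   This is an illustration under stated assumptions rather than a
   normative implementation: the function names are hypothetical, the
   "newline character" is taken to be a single LF, the shared secret
   is encoded as UTF-8, and the checksum is compared as
   case-insensitive hexadecimal text.  None of these details is fixed
   by this memo.

```python
import hashlib
import hmac


def inoculation_checksum(shared_secret, payload):
    """Return the hex MD5 of: shared secret + newline + payload.

    shared_secret is a str; payload is the raw payload as bytes.
    Assumptions: the newline is a single LF and the secret is UTF-8.
    """
    material = shared_secret.encode("utf-8") + b"\n" + payload
    return hashlib.md5(material).hexdigest()


def verify_inoculation(shared_secret, payload, claimed_checksum):
    """Return True if the claimed checksum matches the computed one."""
    expected = inoculation_checksum(shared_secret, payload)
    # Assumption: the checksum is hex text, compared case-insensitively.
    claimed = claimed_checksum.strip().lower()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode("ascii"),
                               claimed.encode("utf-8"))
```

   A recipient would look up the secret for the identity named in the
   Inoculation-Sender field, call verify_inoculation with the payload
   bytes and the checksum parameter from the
   Inoculation-Authentication field, and proceed with the inoculation
   only if the result is True.  For a multipart/inoculation message
   the same check would be repeated for each part.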

6.3 The "signed" Authentication Mechanism

   The "signed" authentication mechanism identifies the message as
   using a public-key signature to authenticate the sender and the
   inoculation payload.

   In order to use the signed authentication mechanism, the media type
   must be set to multipart/inoculation; however, signed authentication
   limits each inoculation message to only a single inoculation
   payload.  This is necessary because the public-key signature itself
   will use a separate part of the message.

   A separate part of the message containing the public-key signature
   for the inoculation payload must be provided.  Authentication of the
   inoculation payload should be performed using the standard outlined
   in RFC 2015.

7. The Inoculation Payload

   The inoculation payload is the only component provided in the body
   of an inoculation message or message component, and represents all
   of the data specific to the payload itself.  Depending on the media
   type of the inoculation, the payload may contain different
   information, as covered below.

   When processing the inoculation payload, special care should be
   taken to compare the 'Content-Length' as specified in the message
   with the actual content's length to ensure that the entire message
   has been received.

7.1. The 'message' Payload

   If the media type specified for the payload is message, the
   inoculation payload must consist of a complete message including
   message headers as outlined in RFC 2822.  An RFC 2045-compliant
   message incorporating MIME may also be provided, provided that the
   boundaries specified in the message do not conflict with the
   boundaries used in the top-level message.

7.2. The 'text' Payload

   Inoculation payloads with a media type of 'text' should be treated
   as plain text.
   This media type should be used when the headers for the inoculation
   payload are unavailable or nonexistent, or if the payload does not
   conform to the Internet Message standard outlined in RFC 2822.

7.3. The 'multipart' Payload

   No payload is provided or assumed when the media type for the
   top-level message is multipart.  Instead, the individual components
   of the message must be examined according to the standards set in
   RFC 2045.  Each message component must provide its own specific
   media type, which must be either 'message' or 'text' when specifying
   a media subtype of inoculation.

8.0 Message Examples

   This section provides some examples of the message format.
   Depending on the format of this draft, the message's whitespace and
   structure may have been changed, causing the checksums in the
   examples to fail.

   Assume the recipient (spamsucks@myhouse.com) is willing to accept
   inoculations of antispam from the sender
   (jonathan@nuclearelephant.com).  They have previously agreed on the
   shared authentication secret of 'beware the jabberwock'.

   The steps involved in receiving and processing this inoculation are
   as follows.

   0. The recipient's inoculation-aware spam tool notes that this is
      an inoculation-type message.

   1. The recipient spam tool parses the headers to find that the
      claimed sender is jonathan@nuclearelephant.com, and the claimed
      inoculation type is spam.

   2. The recipient spam tool checks the local set of authorized
      inoculators, and finds that jonathan@nuclearelephant.com is
      permitted to inoculate spam.

   3. The recipient spam tool looks up jonathan@nuclearelephant.com,
      and finds that the corresponding authentication shared secret is
      the string 'beware the jabberwock'.

   4. The recipient spam tool tests to confirm that this is not a
      multipart inoculation, and that the payload is the entire data
      text area.

   5. The recipient spam tool forms the authentication text by
      concatenating the authentication shared secret, a newline, and
      the full data text area (omitting the obligatory newline-newline
      after the last header line), continuing to end-of-file on the
      email text or to the length of the content specified in the
      'Content-Length' field, if present.

   6. The recipient spam tool calculates the MD5 checksum of this
      authentication text.

   7. The recipient spam tool compares the calculated checksum (from
      step 6) with the claimed checksum found in the message header.
      If the checksums do not match, no automatic inoculation is done,
      and the MTA may either notify the user of an attempted
      inoculation failure, or may simply drop the message and exit
      with nonerror status.  It is recommended that this behavior be
      user-configurable.

   8. Having validated the authenticity of the sender / checksum /
      payload tuple, the payload (and only the payload) is forwarded
      to the proper user-configured spam filtering program's learning
      interface, including the information that the payload was
      "spam".

   Please note also that, should the message contain a 'From ' header,
   a space must precede the line in order to comply with RFC 821.  This
   space should be part of the inoculation payload, and stripped out by
   the recipient's spam tool.  No 'From ' lines are used in the
   examples below.

8.1 message/inoculation example

   To: Everyone on my list
   From: Jonathan A. Zdziarski
   Subject: This is a test inoculation
   Inoculation-Authentication: md5;
      checksum="dcdac94fab6ded79f33b0134d665d02f"
   Inoculation-Type: spam
   Inoculation-Sender: jonathan@nuclearelephant.com
   Content-Type: message/inoculation
   Content-Length: 169

   From: Bob Denver
   Subject: This is a spam
   To: You

   This is a test innoculation. The checksum is correct, however. -Bill Yerazunis

8.2 text/inoculation example

   From: Jonathan A. Zdziarski
   To: Everyone on my list
   Subject: This is a test inoculation
   Inoculation-Authentication: md5;
      checksum="d5c883bce00de5391fbd8f7d17fb56a4"
   Inoculation-Type: spam
   Inoculation-Sender: jonathan@nuclearelephant.com
   Content-Type: text/inoculation
   Content-Length: 84

   This is a test innoculation. The checksum is correct, however. -Bill Yerazunis

8.3 multipart/inoculation example

   To: Everyone on my list
   From: Jonathan A. Zdziarski
   Subject: This is a test inoculation
   Inoculation-Sender: jonathan@nuclearelephant.com
   Content-Type: multipart/inoculation; boundary="--NextPart-010203"

   ----NextPart-010203
   Inoculation-Authentication: md5;
      checksum="c3a47b29744062288cbd5c305897eaa9"
   Inoculation-Type: spam
   Content-Type: message/inoculation
   Content-Length: 169

   From: Bob Denver
   Subject: This is a spam
   To: You

   This is a test innoculation. The checksum is correct, however. -Bill Yerazunis

   ----NextPart-010203
   Inoculation-Authentication: md5;
      checksum="d5c883bce00de5391fbd8f7d17fb56a4"
   Inoculation-Type: spam
   Content-Type: text/inoculation
   Content-Length: 84

   This is a test innoculation. The checksum is correct, however. -Bill Yerazunis

   ----NextPart-010203--

Acknowledgements

   Many thanks to Brian Burton for his input and comments on this
   document.

References

   [RFC2822] - Internet Message Format

   [RFC2045] - Multipurpose Internet Mail Extensions (MIME) Part One:
               Format of Internet Message Bodies

Authors' Addresses

   Please send all comments to one of the authors listed below.

   Bill Yerazunis
   Mitsubishi Electric
   201 Broadway
   Cambridge, MA 02139
   USA

   Phone: +1 617 621 7530
   Email: wsy@merl.com

   Jonathan A. Zdziarski
   3069 Heritage Rd.
   Milledgeville, GA 31061
   USA

   Phone: +1 478 452 8187
   Email: jonathan@nuclearelephant.com

Full Copyright Statement

   Copyright (C) The Regents of the Anti-Spam Community (2003).  All
   Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the copyright holder or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE AUTHORS DISCLAIM ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.