MADMAN Working Group
INTERNET-DRAFT
draft-ietf-madman-alarmmib-01.txt

Gordon B. Jones [gbjones@mitre.org]
MITRE
Niraj Jain [njain@us.oracle.com]
Oracle Corporation
Glenn Mansfield [glenn@aic.co.jp]
AIC Systems Laboratory

August 1996

Mail and Directory Alarms

Status of this Memo

This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress."

To learn the current status of any Internet-Draft, please check the 1id-abstracts.txt listing contained in the Internet-Drafts Shadow Directories on ds.internic.net, nic.nordu.net, ftp.nisc.sri.com, or munnari.oz.au.

Abstract

This document defines alarms for Mail and Directory usage. It is to be used in conjunction with the Mail and Directory Management (MADMAN) RFCs.

Expires: January 31, 1997

1. The SNMPv2 Network Management Framework

The major components of the SNMPv2 Network Management framework are described in the documents listed below.

   o  RFC 1902 [1] defines the Structure of Management Information (SMI), the mechanisms used for describing and naming objects for the purpose of management.

   o  STD 17, RFC 1213 [2] defines MIB-II, the core set of managed objects (MO) for the Internet suite of protocols.

   o  RFC 1905 [3] defines the protocol used for network access to managed objects.

The framework is adaptable/extensible by defining new MIBs to suit the requirements of specific applications/protocols/situations.

Managed objects are accessed via a virtual information store, the MIB.
Objects in the MIB are defined using the subset of Abstract Syntax Notation One (ASN.1) defined in the SMI. In particular, each object type is named by an OBJECT IDENTIFIER, which is an administratively assigned name. The object type together with an object instance serves to uniquely identify a specific instantiation of the object. For human convenience, often a textual string, termed the descriptor, is used to refer to the object type.

2. The Need for Alarms in Messaging

Alarms are notifications of abnormalities associated with an MTA or a message processed by an MTA. Alarms are generated by a Management Console. Two facilities aid the Management Console in the generation of alarms.

The first facility is the trap, which is an unsolicited event initiated by the Management Agent and directed to the Management Console. Traps generated by an agent may optionally convey the values of MIB variables inside them. The Management Console interprets the traps and generates alarms as it determines appropriate.

The second facility consists of variables that can be polled by the Management Console. These variables include the existing MIB variables defined in the other MADMAN RFCs (Network Services Monitoring MIB, Directory Services Monitoring MIB, Mail Monitoring MIB), plus more variables defined herein specifically to augment support for alarm generation. If the Management Console detects a variable value which indicates that a threshold has been reached, or some other worrisome trend or event has occurred, it generates an alarm as it determines appropriate.

It is expected that when an abnormality occurs, a trap will be generated indicating the specific cause of the problem. If the trap is lost or discarded by the network, the console may still detect the abnormality on its next regular polling cycle through inspection of the MIB variables.
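As a rough illustration of how a console might combine the two facilities, the following Python sketch raises an alarm from either a trap or a polled value, and suppresses a second alarm for a fault that is already active. The class Console, its method names, the fault label "queue-saturation", and the threshold value are all hypothetical and are not defined by this memo; the polled value merely stands in for a variable such as mtaGroupStoredVolume.

```python
from dataclasses import dataclass, field


@dataclass
class Console:
    """Hypothetical Management Console; not an API defined by this memo."""

    volume_threshold: int                        # assumed local threshold
    active: set = field(default_factory=set)     # faults currently alarmed
    alarms: list = field(default_factory=list)   # alarms raised so far

    def _raise(self, source: str, fault: str, via: str) -> None:
        # Raise at most one alarm per (source, fault) while it stays active.
        key = (source, fault)
        if key in self.active:
            return
        self.active.add(key)
        self.alarms.append((source, fault, via))

    def on_trap(self, source: str, fault: str) -> None:
        # Event-driven path: the agent reported the fault unsolicited.
        self._raise(source, fault, "trap")

    def on_poll(self, source: str, stored_volume: int) -> None:
        # Polling-driven path: compare the polled value to the threshold.
        if stored_volume >= self.volume_threshold:
            self._raise(source, "queue-saturation", "poll")
        else:
            # Condition cleared, so a later recurrence alarms again.
            self.active.discard((source, "queue-saturation"))
```

If the trap for a fault arrives first, the next poll finds the fault already active and adds nothing; if the trap is lost, the poll raises the alarm instead.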
This combination of mechanisms provides a flexible alarm functionality that is either event-driven, polling-driven, or both.

It is understood that traps are an unreliable mechanism. However, traps may enhance the effects of polling-based alarms. This is because traps can provide a more immediate discovery of a problem than polling alone can, which may be important within some operational environments. For example, when component availability is required to exceed 99%, a polling cycle consisting of fifteen-minute intervals to detect if a component is operational may fail this requirement. A polling cycle more frequent than fifteen minutes might saturate the network with SNMP traffic. When a fifteen-minute polling cycle with 99% reliability is combined with an event-driven mechanism that is itself 99% reliable, the probability that a given component failure goes undetected, if both event-driven and polled, becomes one one-hundredth of one percent (0.01 x 0.01 = 0.0001), assuming the two mechanisms fail independently. This scenario is also applicable to the case of message throughput requirements, where the detection of queue saturation may be both event-driven and polling-driven.

Alarms denote cases where outstanding intervention is required. Implementations that result in a bombardment of superfluous traps should be avoided (some fault conditions may lend themselves to this). Traps should not be issued repetitively to signify one basic fault condition.

The setting of threshold conditions and the evaluation of other composite information is the responsibility of the console, or is a local implementation matter within the agent. The selection of destinations for SNMP traps by the SNMP agents or applications is also a local matter.

3. MIB Data to Support Alarms

The following material is a definition of the traps and MIB variables defined specifically to support alarm functionality. The MADMAN variables used to support alarms are defined in RFCs 19??, 19??, and 19??.
The usage of these traps and MIB variables to fulfill specific requirements is defined in a later section.

3.1 Traps to Support Alarms

Two forms of specific traps are defined to support alarms. The first, called mADAlarm, denotes an MTA- or DSA-related failure, and the second, messageAlarm, denotes a message-related failure in an MTA.

mADAlarm

   This trap is generated by the agent in an unsolicited fashion to signify that a failure has occurred within the MTA or DSA. Examples of such failures may include one MTA's inability to contact another MTA, or the detection of message queue saturation. The mADAlarm trap may convey a number of values, including the name of the MTA or DSA reporting the problem, the name of the remote MTA or DSA purportedly causing the problem, and variables describing the problem itself.

messageAlarm

   This trap is generated by the agent in an unsolicited fashion to signify that a non-recoverable failure has occurred in processing a message due to some sort of structural flaw in the message itself or in its addressing. Examples may include cases where a message cannot be delivered, nondelivered, or redirected, or the case where a messaging loop was detected. The messageAlarm trap may convey a number of values, including the name of the MTA that processed the message, and variables describing the problem itself.

3.2 MIB Variables to Support Alarms

A new table is defined in the MIB to supply supplementary fault-related information to support alarm generation. When a failure occurs, the identities of the applications responsible are retained in the MIB, along with the ID of the message most recently involved in a failure. Through polling, any changes in the values of these variables can signify a recent failure. The following sections describe each variable in the MIB.
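The polling approach described above can be sketched as follows. This is a minimal Python illustration, assuming the console keeps the previous sample of numMessagesFailed and lastMessageIdFailure for each MTA; the function names and the dictionary layout are hypothetical. Because numMessagesFailed is a Counter32, the difference is taken modulo 2^32 so that a counter wrap is not mistaken for a decrease. A reinitialized MTA resets the counter, which this sketch does not try to detect.

```python
COUNTER32_MODULUS = 2 ** 32  # a Counter32 wraps to zero after 2^32 - 1


def new_failures(prev_count: int, curr_count: int) -> int:
    """Failures counted between two polls, allowing for one counter wrap."""
    return (curr_count - prev_count) % COUNTER32_MODULUS


def check_poll(prev: dict, curr: dict):
    """Compare two polled samples of the alarm variables for one MTA.

    Each sample is a dict with the keys 'numMessagesFailed' and
    'lastMessageIdFailure' (a hypothetical layout for this sketch).
    Returns None when nothing failed since the previous poll.
    """
    delta = new_failures(prev["numMessagesFailed"], curr["numMessagesFailed"])
    if delta == 0:
        return None
    return {
        "new_failures": delta,
        "last_message_id": curr["lastMessageIdFailure"],
        # More than one failure per cycle: only the most recent message ID
        # is available, so the queues need to be inspected for the others.
        "multiple": delta > 1,
    }
```

A result with "multiple" set corresponds to the case where several messages failed within one polling interval and only the last identifier is retained.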
lastMessageIdFailure

   This is the identifier of the most recent message that was the cause of a message-related failure. A message-related failure is defined to be a non-recoverable error in the processing of a message. In the event of multiple message failures, it is a clue to the administrator or application to inspect the message queues to determine which messages are defective.

numMessagesFailed

   This is the total number of messages that have failed processing since the messaging application was last initialized. This variable may be used in conjunction with lastMessageIdFailure to detect multiple message failures within a single unit of time.

lastFailureMtaGroupName

   When an error involving a neighboring MTA occurs, this variable holds the mtaGroupName (from the MADMAN mtaGroupTable) of the MTA most recently involved in a failure.

lastFailureMtaApplName

   This variable holds the applName (from the MADMAN applTable) of the MTA that most recently reported a failure.

4. SNMP Format for Alarms

Alarms are supported under SNMP using traps and additional MIB variables. An additional table called mADAlarmTable is defined here. Elements of the existing MADMAN tables and proposed extensions are also utilized for alarm purposes.

It is expected that traps will be implemented under SNMP v1, but that the grammatical constructs used to define them are taken from SNMP v2. Page 31 of RFC 1157 shows how trap Protocol Data Units (PDUs) are formed in SNMP v1. We would add two enterprise-specific traps (generic-trap type 6) whose specific-trap values are set to either mADAlarm (specific-trap 0) or messageAlarm (specific-trap 1). The enterprise field of the trap would contain the OID "experimental ??" designating the MADMAN alarm MIB (mADAlarmMIB). The values of variables and their corresponding OBJECT IDENTIFIERs are conveyed within the VarBindList.
These variables are obtained from either the mADAlarmTable or tables found in the other MADMAN RFCs.

MADMAN-ALARM-MIB DEFINITIONS ::= BEGIN

IMPORTS
    MODULE-IDENTITY, OBJECT-TYPE, NOTIFICATION-TYPE,
    experimental, Counter32, Gauge32
        FROM SNMPv2-SMI
    DisplayString, TEXTUAL-CONVENTION
        FROM SNMPv2-TC
    applOperStatus, applName
        FROM APPLICATION-MIB
    mtaGroupName, mtaGroupInboundRejectionReason,
    mtaGroupStoredVolume, mtaLoopsDetected,
    mtaGroupLoopsDetected,
    mtaGroupOutboundConnectFailureReason
        FROM MTA-MIB;

mADAlarmMIB MODULE-IDENTITY
    LAST-UPDATED "9608230000Z"
    ORGANIZATION "IETF Mail and Directory Management Working Group"
    CONTACT-INFO
        "        Glenn Mansfield
         Postal: AIC Systems Laboratory
                 6-6-3, Minami Yoshinari
                 Aoba-ku, Sendai,
                 Japan 989-32.
         Tel:    +81-22-279-3310
         Fax:    +81-22-279-3640
         E-mail: glenn@aic.co.jp"
    DESCRIPTION
        "The MIB module describing alarms for MADMAN"
    ::= { experimental 73 }

mADAlarmTable OBJECT-TYPE
    SYNTAX SEQUENCE OF MADAlarmEntry
    ACCESS not-accessible
    STATUS mandatory
    DESCRIPTION
        "The table holding alarm information for an
         individual MTA or DSA."
    ::= { mADAlarmMIB 1 }

mADAlarmEntry OBJECT-TYPE
    SYNTAX MADAlarmEntry
    ACCESS not-accessible
    STATUS mandatory
    DESCRIPTION
        "The alarm entry associated with each MTA or DSA."
    ::= { mADAlarmTable 1 }

MADAlarmEntry ::= SEQUENCE {
    lastMessageIdFailure
        DisplayString,
    numMessagesFailed
        Counter32,
    lastFailureMtaGroupName
        DisplayString,
    lastFailureMtaApplName
        DisplayString
}

lastMessageIdFailure OBJECT-TYPE
    SYNTAX DisplayString
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION
        "This is the message ID of the last message to either
         loop or have an unrecoverable error while processing"
    ::= {mADAlarmEntry 1}

numMessagesFailed OBJECT-TYPE
    SYNTAX Counter32
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION
        "This is the number of messages that have had an
         unrecoverable error while processing since MTA
         initialization"
    ::= {mADAlarmEntry 2}

lastFailureMtaGroupName OBJECT-TYPE
    SYNTAX DisplayString
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION
        "This is the group name of the last MTA group to have
         a connectivity failure"
    ::= {mADAlarmEntry 3}

lastFailureMtaApplName OBJECT-TYPE
    SYNTAX DisplayString
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION
        "This is the application name of the last MTA to have
         a connectivity failure"
    ::= {mADAlarmEntry 4}

mADAlarmNotifications OBJECT IDENTIFIER ::= { mADAlarmMIB 2 }

mADAlarm NOTIFICATION-TYPE
    OBJECTS {applOperStatus, applName, mtaGroupName,
             mtaGroupOutboundConnectFailureReason,
             mtaGroupStoredVolume}
    -- these OBJECTS are the things that an mADAlarm may convey
    ::= {mADAlarmNotifications 1}

messageAlarm NOTIFICATION-TYPE
    OBJECTS {lastMessageIdFailure, numMessagesFailed }
    ::= {mADAlarmNotifications 2}

mADAlarmConformance OBJECT IDENTIFIER ::= {mADAlarmMIB 3}
mADAlarmGroup       OBJECT IDENTIFIER ::= {mADAlarmConformance 1}
mADAlarmCompliances OBJECT IDENTIFIER ::= {mADAlarmConformance 2}

mADAlarmTrapCompliance MODULE-COMPLIANCE
    STATUS current
    DESCRIPTION
        "The most basic level of compliance for MAD SNMPv2
         entities that implement MAD alarms."
    MODULE
        MANDATORY-GROUPS {mADAlarmTrapGroup}
    ::= {mADAlarmCompliances 1}

mADAlarmVariableCompliance MODULE-COMPLIANCE
    STATUS current
    DESCRIPTION
        "The compliance statement for MAD SNMPv2 entities that
         implement MIB variables to support alarms for MTAs."
    MODULE
        MANDATORY-GROUPS {mADAlarmVariableGroup}
    ::= {mADAlarmCompliances 2}

mADAlarmTrapGroup OBJECT-GROUP
    OBJECTS {mADAlarm, messageAlarm}
    STATUS current
    DESCRIPTION
        "Two Traps providing the basic level of support for
         alarms for MTAs."
    ::= {mADAlarmGroup 1}

mADAlarmVariableGroup OBJECT-GROUP
    OBJECTS {lastMessageIdFailure, numMessagesFailed,
             lastFailureMtaGroupName, lastFailureMtaApplName}
    STATUS current
    DESCRIPTION
        "A collection of objects providing support for alarms
         for MTAs that includes some other alarm-specific MIB
         variables"
    ::= {mADAlarmGroup 2}

END

5. Scenarios

The following scenarios provide examples of how the mADAlarm and messageAlarm are used in various fault conditions.

5.1 Connectivity Failure

When an MTA or DSA detects that another MTA or DSA cannot be contacted, a mADAlarm is sent. The mADAlarm contains the applName of the MTA reporting the problem, the mtaGroupName for the MTA that cannot be contacted, and the mtaGroupOutboundConnectFailureReason. In the case of a more general connectivity failure, such as the general unavailability of the network element, the mADAlarm contains only the variable mtaGroupOutboundConnectFailureReason.

Care should be taken to report these conditions only in the case of permanent failure, since intermittent failures are more frequent and might result in too many traps being generated. For example, when an MTA cannot connect to another MTA in order to deliver a message, the MTA delivering the message usually retries the delivery attempt for a specified duration or for a specified number of tries. If the retry limit is exceeded, a case that should not occur, the message is returned.
In this case, a trap would be sent when the retry limit is exceeded, but would not be sent for each individual retry.

5.2 MTA or DSA Down

This condition signifies that the MTA or DSA is not operational (but should be) or has not recently registered with the management system. This condition is reported with an mADAlarm containing the values of applOperStatus and applName from the MADMAN Application Monitoring MIB. Support for this feature is optional, since an MTA or DSA that has crashed cannot report that fact to an agent, and since off-the-shelf agents cannot be expected to monitor the aliveness of applications by themselves.

5.3 Messaging Loop Detection

This condition may signify that a particular message has been detected, received, and sent multiple times, perhaps exceeding a locally established threshold value. The condition is reported with a messageAlarm trap, where the trap contains the applName of the MTA reporting the problem, and optionally the values of lastMessageIdFailure, mtaLoopsDetected, and mtaGroupLoopsDetected.

5.4 Message Processing Failure

When an MTA encounters certain non-recoverable errors processing a message (e.g., a "dead" message that cannot be delivered, nondelivered, or redirected), a messageAlarm is generated. The messageAlarm contains the applName of the MTA reporting the failure, and optionally the lastMessageIdFailure, which identifies the most recent message that failed, and numMessagesFailed, which aids in detecting multiple message failures. If other messages had failed processing prior to the immediate condition being reported and after the most recent polling cycle, the identities of these messages may be detected manually.

5.5 Queue Error

When an MTA or agent detects that a queue is full or is approaching saturation, a mADAlarm is sent. The applName of the MTA reporting the problem is conveyed within the variable bindings list of the mADAlarm.
The mADAlarm also contains the values of the MIB variables mtaGroupName and mtaGroupStoredVolume (both from the mtaGroupTable).

5.6 Security Error

When an MTA or agent detects a security error such as an authentication failure (e.g. when an MTA or DSA fails to authenticate itself to another), a mADAlarm is sent. The applName of the MTA reporting the problem is conveyed within the variable bindings list of the mADAlarm. The mADAlarm also contains the values of the MIB variables mtaGroupInboundRejectionReason (stating an authentication failure) and the mtaGroupName.

When an MTA or agent detects a security error such as a data integrity violation (e.g. while processing a message), a messageAlarm is sent. The applName of the MTA reporting the problem is conveyed within the variable bindings list of the messageAlarm. The messageAlarm also contains the values of the MIB variables mtaGroupInboundRejectionReason (stating an integrity violation) and the mtaGroupName.

6. Acknowledgements

This draft is the product of discussions and deliberations carried out in the following groups:

   ietf-madman-wg   ietf-madman@innosoft.com

This draft also incorporates the intellectual contributions of

   Bruce Greenblatt
   Sue Lebeck
   Roger Mizumori
   Edward Owens

7. References

[1] Case, J., McCloghrie, K., Rose, M., and S. Waldbusser, "Structure of Management Information for version 2 of the Simple Network Management Protocol (SNMPv2)", RFC 1902, SNMP Research, Inc., Hughes LAN Systems, Dover Beach Consulting, Inc., Carnegie Mellon University, February 1996.

[2] McCloghrie, K., and M. Rose, Editors, "Management Information Base for Network Management of TCP/IP-based internets: MIB-II", STD 17, RFC 1213, Hughes LAN Systems, Performance Systems International, March 1991.
[3] Case, J., McCloghrie, K., Rose, M., and S. Waldbusser, "Protocol Operations for version 2 of the Simple Network Management Protocol (SNMPv2)", RFC 1905, SNMP Research, Inc., Hughes LAN Systems, Dover Beach Consulting, Inc., Carnegie Mellon University, February 1996.

[4] Freed, N., and S. Kille, "Network Services Monitoring MIB", RFC 1565, Innosoft, ISODE Consortium, January 1994.

[5] Freed, N., and S. Kille, "Mail Monitoring MIB", RFC 1566, Innosoft, ISODE Consortium, January 1994.

[6] Mansfield, G., and S. Kille, "X.500 Directory Monitoring MIB", RFC 1567, AIC Systems Lab, ISODE Consortium, January 1994.

Security Considerations

Security issues are not discussed in this memo.

Authors' Addresses

Glenn Mansfield
AIC Systems Laboratories
6-6-3 Minami Yoshinari
Aoba-ku, Sendai 989-32
Japan
Phone: +81-22-279-3310
E-Mail: glenn@aic.co.jp

Gordon B. Jones
MITRE Corporation
1820 Dolley Madison Blvd.
McLean, VA 22102-3481
Phone: (703) 883-76701
E-Mail: gbjones@mitre.org

Niraj Jain
Oracle Corporation
500 Oracle Parkway
Redwood Shores
California 94065
Phone: (415) 506-2581
E-Mail: njain@us.oracle.com