INTERNET-DRAFT Yixian Yang Expires: April 2006 Ming Cao Jian Li Beijing University of Posts and Telecom. October 2005 An Algorithm for Intrusion Prevention System based on Fuzzy Connectedness Clustering draft-yang-ipsfcc-algorithm-00.txt Intellectual Property Rights (IPR) statement: By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Status of this Memo By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Copyright Notice Copyright (C) The Internet Society (2005). All Rights Reserved. Abstract In this document, an algorithm for intrusion prevention systems based on fuzzy connectedness clustering was presented in order to measure the similarity. Fuzzy connectedness that was first applied in image segmentation was extended to clustering algorithm, and used as the similarity metric between objects to be clustered. By a little expert knowledge, starting with at least one seed data in each cluster, the similarity between each data in dataset and each seed data in each Yixian Yang, et al. Expires April, 2006 [Page 1] INTERNET-DRAFT algorithm for IPSFCC October, 2005 cluster was measured. According to the highest fuzzy connectedness of each data, clustering was finished. This algorithm was applied in intrusion prevention systems as a tool of detecting intrusion, overcoming some shortcomings of previous clustering algorithm. Table of Contents Status of This Memo ......................................... 1 Abstract .................................................... 1 1. Introduction ............................................. 3 2. Glossary ................................................. 4 3. IPS based on Data Mining ................................. 5 3.1 Related Work .......................................... 5 3.2 Clustering Algorithms Applied in IPS .................. 6 4. FCC Algorithm Proposed ................................... 11 4.1 Background ............................................. 11 4.2 Fuzzy Connectedness ................................... 11 4.3 Methodology ........................................... 12 4.3.1 Definition ......................................... 12 4.3.2 Overview of FCC Algorithm .......................... 13 4.3.3 Description of FCC Algorithm ....................... 13 4.3.4 Parameters ......................................... 14 4.4 Application in IPS .................................... 15 4.4.1 Application Flow ................................... 15 4.4.2 Explanation about Detail ........................... 16 4.4.3 System Evaluation and Results....................... 17 5 Conclusions ........................................... 18 6. Acknowledgements.......................................... 19 7. Informative References.................................... 19 8. Authors' Addresses........................................ 19 Full Copyright Statement .................................... 20 Yixian Yang, et al. Expires April, 2006 [Page 2] INTERNET-DRAFT algorithm for IPSFCC October, 2005 1. Introduction Human attention has become a precious resource. So, we must find ways to automatically analyze the data, to automatically classify it, to automatically summarize it, to automatically discover and characterize trends in it, and to automatically flag anomalies. In this document, we introduce a new clustering algorithm, FCC, for intrusion detection based on the concept of fuzzy connectedness. This concept was introduced by Rosenfeld in 1979 and used with success in image segmentation; here we extend this approach to clustering and demonstrate its effectiveness in intrusion prevention. Starting with a single or a few seed points in each cluster, all the data points are dynamically assigned to the cluster that has the highest fuzzy connectedness value (strongest connection). With an efficient heuristic algorithm, the time complexity of the clustering process is O( NlogN), where N is the number of data points. The value of fuzzy connectedness is calculated using both the Euclidean distance and the statistical properties of clusters. This unsupervised learning method allows the discovery of clusters of any shape. Application of the method in intrusion detection demonstrates that it can detect not only known intrusion types, but also their variants. Experimental results on the KDD-99 intrusion detection data set show the efficiency and accuracy of this method. Yixian Yang, et al. Expires April, 2006 [Page 3] INTERNET-DRAFT algorithm for IPSFCC October, 2005 2. Glossary This document uses terminology that is defined in [DSARCH]. There is also current work-in-progress on this terminology in the IETF and some of the definitions provided here are taken from that work. Some of the terms from these other references are defined again here in order to provide additional detail, along with some new terms specific to this document. IPS devices deployed in carrier, corporate, and small-to-medium business environments stop upper and lower layer network-based attacks that other network security devices are not designed to handle. Network IPS devices are placed inline normally either in front of or behind the network firewall. There may be multiple related deployments depending on your network's size, complexity, and number of entry points. Clustering builds models from data without predefined classes. Data instances are grouped together based on a similarity scheme defined by the clustering system. With the help of one or several evaluation techniques, it is up to us to decide the meaning of the formed clusters. Yixian Yang, et al. Expires April, 2006 [Page 4] INTERNET-DRAFT algorithm for IPSFCC October, 2005 3. IDS based on Data Mining 3.1 Related Work Data mining based intrusion prevention systems can be roughly categorized into two major groups: misuse detection and anomaly detection. In misuse detection, a model is trained with labeled data to recognize the patterns of ''normal'' visits and ''intrusion'' attempts. Unlike the traditional knowledge base method, signatures of different types of intrusions are learnt automatically, and they are much more powerful than manually defined signatures in recording the subtle characteristics. Misuse detection has been shown to be very successful in detecting previously known attacks. However, since the misuse model is highly dependent on the labeled data used in the training stage, its capabilities of detecting new intrusion types is limited. Different from misuse detection, anomaly detection first establishes a model of normal system behaviors, and anomaly events are then distinguished based on this model. The implicit assumption is that any intrusive activity will be anomalous. Anomaly detection is able to detect newly emerging attacks (if only the assumption still holds), but it also has some drawbacks. It may fail to detect some known attacks if the behaviors of them are not significantly different from what is considered to be normal. Moreover, the false alarm rate tends to be high when the data of some normal system behaviors are not involved in the training phase. With the help of data mining approaches, these difficulties can be easily overcome. Briefly, the data mining techniques have provided the following benefits: + Improved variants detection. This is especially true for anomaly detection. Not limited to pre-defined signatures, the concern with variants is not as much as before, since any deviation from a unique (normal) signature will be treated as intrusion, including those previously unknown variants of intrusions. + Controlled false alarms. Even though these are false positives, with a learning process to identify recurring sequences of false alarms, it is possible for us to filter those normal system activities and keep the rate of false alarms at an acceptable level. + Reduced false dismissals. With data mining techniques, patterns (or signatures) of normal activities and abnormal events (intrusions) can be created automatically. It is also possible to introduce new types of attacks through an incremental learning process. As a result, more and more attacks can be detected correctly. This leads to a reduced number of false dismissals. Yixian Yang, et al. Expires April, 2006 [Page 5] INTERNET-DRAFT algorithm for IPSFCC October, 2005 + Improved efficiency. One very attractive feature of data mining techniques is the ability to extract most meaningful information out of large amounts of data. After a step of feature extraction and/or feature selection, the learning process can be done much more efficiently. A wide variety of data mining techniques have been applied to intrusion prevention, which include decision trees and neural networks. With pre-labeled (normal or intrusion) network traffic data, classification-based intrusion detection models can be trained and proved to be very useful. Actually, however, it is not always possible to have enough training data to guarantee the effectiveness of the classifiers. At the same time, temporal changes in network intrusion patterns and characteristics also tend to invalidate the usability of trained models. These challenges lead to the emerging interests of researchers in applying unsupervised or clustering techniques to intrusion detection. 3.2 Clustering Algorithms Applied in IPS Clustering is a well known and studied problem. It has been studied in many fields including statistics, machine learning, databases, and visualization. Basic methods for clustering include the Linkage bases and K-means techniques. K-means makes several passes through the training data and on each pass shifts cluster centers to the mean of the data points assigned to that cluster. It then re-assigned data point to the nearest prototype, and continues interacting in this manner until no significant changes in cluster center positions occur. The K-means method generally produces a more accurate clustering than linkage based methods, but it has a greater time complexity and this becomes an extremely important factor in network intrusion detection due to very large dataset sizes. Although some optimizations of K-means for very large datasets exist, they still do not perform sufficiently fast for datasets with high dimensionality. Image that you are given a set of data objects for analysis where, unlike in classification, the class label of each object is not known. Clustering is the process of grouping the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects. Often, distance measures are used. Clustering has its rots in many areas, including data mining, statistics, biology and machine learning. Yixian Yang, et al. Expires April, 2006 [Page 6] INTERNET-DRAFT algorithm for IPSFCC October, 2005 The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group in many applications. Cluster analysis is an important human activity. Early in childhood, one learns how to distinguish between cats and dogs, or between animals and plants, by continuously improving subconscious clustering schemes. Cluster analysis has been widely used in numerous applications, including pattern recognition, data analysis, image processing, and market research. By clustering, one can identify dense and sparse regions and, therefore, discover overall distribution patterns and interesting correlations among data attributes. "What are some typical applications of clustering?" In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also help in the identification of areas of similar land use in an earth observation database, and in the identification of groups of automobile insurance policy holders with a high average claim cost, as well as the identification of groups of houses in a city according to house type, value, and geographical location. It can also be used to help classify documents on the Web for information discovery. As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization and classification, which would then operate on the detected clusters. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. Yixian Yang, et al. Expires April, 2006 [Page 7] INTERNET-DRAFT algorithm for IPSFCC October, 2005 In machine learning, clustering is an example of unsupervised learning. Unlike classicality, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In conceptual clustering, a group of objects forms a class only if it is describable by a concept. This differs from conventional clustering, which measures similarity based on geometric distance. Conceptual clustering consists of two components: (1) it discovers the appropriate classes, and (2) it forms descriptions for each class, as in classification. The guideline of striving for high interclass similarity and low interclass similarity still applies. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases. Clustering is a challenging field of research where its potential applications pose their own special requirements. The following are typical requirements of clustering in data mining: + Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed. + Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based(numerical) data. However, applications may require clustering other types of data, such as binary, categorical(nominal), and ordinal data, or mixtures of these data types. + Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape. + Mining requirements for domain knowledge input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis(such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often hard to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but also makes the quality of clustering difficult to control. Yixian Yang, et al. Expires April, 2006 [Page 8] INTERNET-DRAFT algorithm for IPSFCC October, 2005 + Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality. + Insensitivity to the order of input records: Some clustering algorithms are sensitive to the order of input data; for example, the same set of data, when presented with different orderings to such an algorithm, may generate dramatically different clusters. It is important to develop algorithms that are insensitive to the order of input. + High dimensionality: A database or a data warehouse can contain several dimension or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. It is challenging to cluster data objects in high-dimensional space, especially considering that such data can be very sparse and highly skewed. + Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic cash-dispensing machines(i.e., ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and customer requirements per region. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints. + Interpretability and usability: Users expect clustering results to to be interpretable, comprehensible, and usable. That is, clustering may need to be tied up with specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering methods. Recently, several unsupervised methods have been proposed. Portnoy et al. introduced an algorithm to detect both known and new intrusion types without the need to label the training data. This method, however, has its limitations. It is based on the assumption that ''the normal instances constitute an overwhelmingly large portion (>98%)''. Also, it requires a predefined parameter of clustering width which is not always easy to find. Y-means, introduced by Guan et al., overcomes the shortcomings of the traditional K-means algorithm and can automatically decide the number of clusters through splitting and merging clusters. However, like other centroid-based clustering algorithms, Y-means can only deal with clusters with a spherical shape and one still needs to define a threshold for the "confident area". Yixian Yang, et al. Expires April, 2006 [Page 9] INTERNET-DRAFT algorithm for IPSFCC October, 2005 To deal with these issues, we introduce a new clustering algorithm, the Fuzzy-Connectedness Clustering (FCC) for intrusion detection based on the concept of fuzzy connectedness. With little prior knowledge, the proposed method allows the discovery of clusters of any shape and can detect not only the known intrusion types, but also their variants. Yixian Yang, et al. Expires April, 2006 [Page 10] INTERNET-DRAFT algorithm for IPSFCC October, 2005 4. FCC Algorithm 4.1 Background In data mining, clustering is the most important unsupervised learning process used to find the structures or patterns in a collection of unlabeled data. During the learning process, the similarities between pairs of data instances are considered and an optimized output is found which maximizes the similarities within the same group and the dissimilarities between different groups. While there is no absolute criterion to judge the optimality of the clustering result, the choice of similarity measures can surely lead to totally different groupings. In many clustering algorithms, the spatial distance (usually Euclidean distance) between different instances is used to measure the similarity. This has proved to be successful in many applications. It is intuitive that similar objects are closer together while objects from different groups are far from each other. This intuition, however, is not always true in more complicated applications, especially when the shape of data clusters is not limited to be spherical. Figure 1 shows an example in which the popular k-means clustering algorithm can not achieve satisfactory result. To deal with more complex situations, it should be more suitable to take both local neighborhood and global information into consideration when deciding the membership of each data instance. In this document, we propose a new clustering algorithm which uses fuzzy connectedness as the similarity metric. 4.2 Fuzzy Connectedness The concept of fuzzy connectedness was first introduced by Rosenfeld in 1979. It can be utilized to deal with the spatial vagueness when the spatial entities of data do not have homogeneous interiors and sharply boundaries. Fuzzy connectedness has been successfully applied to image processing. In an image segmentation framework introduced by Udupa et al., fuzzy affinities are assigned to the target data object during the classification. The affinity between a pair of pixels is defined as a weighted function of their adjacencies in the coordinate space, the intensity space, and the intensity gradient space. Yixian Yang, et al. Expires April, 2006 [Page 11] INTERNET-DRAFT algorithm for IPSFCC October, 2005 In a recent work, Herman and Carvalho generalized the concept of fuzzy connectedness. For a finite set of data points, the strength of the link from one point to another is defined as the reciprocal of the distance between them and it can be automatically defined based on statistical properties of the links within regions of points identified as belonging to the same groups. A chain is a sequence of ordered points and the strength of a chain is its weakest link. Taking fuzzy connectedness as the similarity metric, the task of a clustering process is to find the strongest path of every data point and assign the point to the group that path leads to. 4.3 Methodology 4.3.1 Definition Basic definitions that we use in this document are presented below. A fuzzy set A in a space of points X={x} is a class of events with a continuum of grades of membership and is characterized by a membership function 米_A(x) with associates with each point in X a real number in the interval [0, 1] with the value of 米_A(x) at x representing the grade of membership of x in A. Formally, a fuzzy set A with its finite number of supports x1, x2, ..., xn is defined as a collection of ordered pairs: A={(米_A(xi),xi), xi, i=1, 2, ..., n} where the supports of A is the subset of X defined as S(A)={x, x﹋X and 米A(x)>0} 米i, the grade of membership of xi in A, denotes the degree to which an event xi may be a membership of A or belong to A. This characteristic function in fact, can be viewed as a weighting coefficient which reflects the ambiguity in a set, and as it approaches unity, the grade of membership of an event in A becomes higher. Definition 1: For a positive integer K, a K-FCC of a set D (of points) is a function which maps each point c﹋D into a (K+1)-dimensional vector 考_c = {考_c0, 考_c1, 考_c2, ... , 考_cK}, such that 考_ci﹋[0, 1], and for at least one k, 1 ≒ k ≒ K, 考_c0 = 考_ck; for 1 ≒ m ≒ K and m ≧ k, 考_cm = 0 or 考_cm = 考_ck. Definition 2: A fuzzy affinity on D is a function:耳: D2 ↙ [0, 1]. 耳(c, d) is the strength of the link (c,d). Definition 3: A chain in U﹋D from c_0 to c_N is a sequence C_(0,N ) = (c_0, c_1, c_2, ... , c_N) in U and the strength of the chain 耳_U(C_(0, N)) = min(耳_U(c_n-1, c_n)), 1 ≒ n ≒ N. Yixian Yang, et al. Expires April, 2006 [Page 12] INTERNET-DRAFT algorithm for IPSFCC October, 2005 4.3.2 Overview of FCC Algorithm Given the above definitions, a clustering process is to find the mapping of K每FCC which is deduced from the calculation of fuzzy affinities between pairs of points. In this process, the definition of the fuzzy affinity function 耳 is very essential; one can include the statistical information of each cluster to make the results more reliable. When only the Euclidean distance between pairs of points is considered, it is not possible to separate the two clusters correctly. In the proposed method, in order to initialize the clustering process, we need a little prior knowledge. This is in the form of a few labeled data points in each possible cluster. This requirement, however, is usually easy to satisfy in intrusion prevention applications. 4.3.3 Description of FCC Algorithm Every data collected is regarded as a vector of characteristic, because FCC algorithm will be applied in IPS. So, we call an element in the data as a vector here. Input: Data D, number of clusters K, seed points S(U_k), 1 ≒ k ≒ K; Output: K-FCC matrix 考={考_ck | c﹋D, 0 ≒ k ≒ K} and label matrix label(c) for c﹋D. Step1: for every c﹋D Step2: for k = 1:K Step3: 考_ck = 0 Step4: Heap = 耳 Step5: for k = 1:K Step6: U_k = S(U_k) ﹍ {nearest neighbors of S(U_k)} Step7: Heap = Heap ﹍ U_k Step8: for every c ﹋ U_k 考_ck = 1 Step9: Max_link = 1 Step10: while Max_link > 0 Step11: for k = 1:K Step12: while |U_k | > 0 Step13: d = U_k(1); U_k = U_k - d Yixian Yang, et al. Expires April, 2006 [Page 13] INTERNET-DRAFT algorithm for IPSFCC October, 2005 Step14: NN = {nearest neighbors of d} Step15: C = { c﹋NN | 考_ck < min(Max_link, 考_k(d, c) & 考_c0 ≒ Max_link} Step16: while |C| > 0 Step17: c = C(1); C = C - c Step18: l = min(Max_link, 耳_k(d, c)) Step19: if Max_link = l and 考_ck < Max_link then U_k = U_k ﹍{c} Step20: if 考_c0 < l then Step21: if 考_c0 = 0 then Heap = Heap ﹍{c} Step22: for n = 1:K 考_cn = 0 Step23: if 考_c0 ≒ l then 考_c0 = 考_ck = l Step24: Heap = Heap - {c|c ﹋ Heap and 考_c0 = Max_link} Step25: Max_link = Max(考_c0), c ﹋ Heap Step26: for every c﹋D Step27: label(c) = k with 考_ck = 考_c0 4.3.4 Parameters In the FCC algorithm, two parameters are involved: (a) the number of seed points and (b) the number of neighbors in the neighborhood. The first parameter is decided using prior knowledge about the data. The minimum value is 1, which means that each cluster has at least one instance at the very beginning. Usually, the more seed points we have, the more information about the cluster will be given, and this will possibly provide more accurate results. However, depending on the distributions of different clusters, a small number of seed points may also give satisfactory results. Yixian Yang, et al. Expires April, 2006 [Page 14] INTERNET-DRAFT algorithm for IPSFCC October, 2005 The number of neighbors used to define the neighborhood is also important. As discussed earlier, when calculating the fuzzy connectedness of an instance to other instances in a certain cluster, both the local information about the distance between two instances and the statistical information about that cluster are involved. An instance x is fuzzy connected to another instance y only if y falls in the neighborhood of x. If the number of neighbors is too small, only a few instances tightly close to x are considered reachable from x, and this may separate x from the cluster it really belongs to; when this number is too large, the importance of local connectivity will be reduced and it is possible for two instances far from each other to be considered as fuzzy connected. This may bring two or more clusters close to each other forming a single cluster, which is also not desirable. So, there should be a trade off among the effects of varying the two parameters. With different configurations of the parameters, while Detection Rate stays high and False Alarm Rate stays low, there are some changes in the numbers. When the number of nearest neighbors increases from 2 to 4, both Detection Rate and False Alarm Rate are slightly improved. The reason is that there are more choices for an instance to get connected and this helps it to find the correct cluster. Consequently, fewer attacks will be mislabeled as normal and fewer normal instances will be treated as intrusion. However, there is not always improvement with the increase of the number of nearest neighbors considered. 4.4 Application in IPS 4.4.1 Application Flow +--------+ +-----------+ +------------------+ |Database|-------->|Abstraction|-------->|Vector Description| +--------+ +-----------+ +------------------+ | | V +-----------+ +------------------+ | FCC |<--------|Measure Similarity| +-----------+ +------------------+ Figure 1: Application Flow in IPS Yixian Yang, et al. Expires April, 2006 [Page 15] INTERNET-DRAFT algorithm for IPSFCC October, 2005 4.4.2 Explanation about Detail The algorithm starts with an initialization process (steps 1-9). For every data point c﹋D , its fuzzy mapping vector is initialized with zeros. As shown later in the algorithm, the values in the mappings vectors are used to decide the memberships of the data points. All the seed points are put into corresponding clusters and certain items in their mapping vectors are set to 1. A heap data structure is used to temporarily contain the unlabelled data points. It is set to empty at the beginning. Before the main loop (10-27) of the algorithm, only the seed points S(Uk) and their nearest neighbors are labeled. Their linkage to their cluster is considered the strongest and the value is set to one. Starting with these points, all the clusters try to expand. The strength of the linkage of each unlabeled point to a cluster is the strength of a chain (path) that connects this point to one seed point in that cluster. After all the fuzzy affinities are calculated, we get the K-FCC for the dataset D. The label of each data point c﹋D is finally decided by its mapping vector 考 c . It is assigned to cluster k if c k 考 is the largest element of the vector. In the FCC algorithm, a priority heap is used for the operations of insertion into (step 21) and removal (step 24) from the priority queue. With the definition of nearest neighborhood of an instance, there are no more than a fixed number of nearest neighbors that need to be processed in step 15. If there are N instances in D and n is used to denote the number of nearest neighbors, the computational complexity of FCC is O(N(logN + nK)). There are two issues that are very important to this algorithm: one is related to the definition of neighborhood (NN) of a data instance, the other is how to calculate the fuzzy affinity between each pair of instances. For the first issue, the neighborhood is defined in Euclidean space. A small number of neighbors around the target point with the shortest Euclidean distances are defined to be its nearest neighbors. With an indexing scheme such as an R- tree, it is very easy to find the neighborhood in a short time. In order to calculate the fuzzy affinity of an instance c and another instance k d ﹋U (target cluster), besides their spatial adjacency, we also involve the statistical information about the cluster Uk. The distribution information of a certain variable (e.g., Euclidean distance between nearest neighbors) of Uk is stored and kept up to date. When a new instance comes, its probability of falling into Uk is calculated using the distribution information and the Euclidean distance between the new instance and its nearest neighbor d in Uk. Let E(c, d) be the Euclidean distance between c and d, and Mk and Sk be the mean and standard deviation of the Euclidean distance between every pair of instances already in Uk. Yixian Yang, et al. Expires April, 2006 [Page 16] INTERNET-DRAFT algorithm for IPSFCC October, 2005 4.4.3 System Evaluation and Results To evaluate our system we are interested in two major indicators of performance: the detection rate and the false positive rate. The detection rate is defined as the number of intrusion instances detected by the system divided by the total number of intrusion instances present in the test set. The false positive rate is defined as the total number of normal instances that were (incorrectly) classified as intrusions divided by the total number of normal instances. These are good indicators of performance, since they measure what percentage of intrusions the system is able to detect and how many incorrect classifications it makes in the process. We calculate these values over the labeled data to measure performance. Yixian Yang, et al. Expires April, 2006 [Page 17] INTERNET-DRAFT algorithm for IPSFCC October, 2005 5 Conclusions In this document, we introduce a new clustering algorithm, FCC, for intrusion detection. This algorithm uses the concept of fuzzy connectedness to calculate the similarities among different data instances. Starting with a single or a few seed points in each cluster, all the data points are dynamically assigned to the cluster that has the highest fuzzy connectedness value (strongest connection). In calculating the fuzzy connectedness, both local distance information and statistical properties of clusters are considered. Comparing with the frequently utilized Euclidean distance, the new similarity metric is more robust, and there is no restriction on the shape of clusters that can be discovered. FCC has also some other advantages over the other previously proposed clustering algorithms: no predefined ※cluster width§, no threshold for ※confidence area§, etc. Although the FCC algorithm requires a parameter K which denotes the number of clusters, most of the time it is not necessary to know an exact value for it. In many cases where we just care about whether an instance is an intrusion and do not need to know about the exact intrusion type, the parameter K can be fixed as a constant, i.e., K=2. With an efficient heuristic algorithm, the time complexity of the clustering process is O(NlogN+nK), where N is the number of data points and n, K are constants. Moreover, this unsupervised learning method can detect not only the known intrusion types, but also their variants. The two other parameters in FCC, the number of seed points and the number of neighbors, do not seem to affect greatly the performance of the algorithm. Experimental results on a subset of KDD-99 dataset showed the stability of efficiency and accuracy of the FCC algorithm. With different settings, the Detection Rate stayed always above 94% while the False Alarm Rate was below 4%. For future work, we need to find out if there is a general accepted definition of fuzzy affinity. With different distribution variable involved in the calculation, the values of fuzzy affinity may be different and it may lead to different clustering results. Even though the one we employ in equation (1) gives good results on the KDD-99 data set, it is not necessary that it will be appropriate for other datasets. Yixian Yang, et al. Expires April, 2006 [Page 18] INTERNET-DRAFT algorithm for IPSFCC October, 2005 6. Acknowledgement The authors wish to thank Jian Li. 7. Informative References [1] Leonid Portnoy, Eleazar Eskin and Sal Stolfo, ''Intrusion Detection with Unlabeled Data Using Clustering'', Proceedings of ACM CSS Workshop on Data Mining Applied to Security(DMSA-2001),Philadelphia, PA: November 5-8, 2001. [2] Wenke Lee, ''A Data Mining Framework for Constructing Features and Models for Intrusion Detection Systems'', PhD Thesis, Columbia University, 2000. [3] Pierre Hansen and Nenad Mladenoviu, ''J-Means: A New Local Search Heuristic for Minimum Sum of Squares Clustering'', Pattern Recognition , 2001, 34(2): 405-413 [4] Yu Guan, Ail A, Ghorbani and Nabil Belacel, ''Y-Means: A Clustering Method for Intrusion Detection'', Proceedings of Canadian Conference on Electrical and Computer Engineering, Montreal, Quebec, Canada. May 4-7, 2003. [5] Azriel Rosenfeld, ''Fuzzy Digital Topology'', Information and Control, 1979, Vol.40, 76-87. [6] Rebecca Gurley Bace, ''Intrusion Detection'', Macmillan Technical Publishing, 2000. [7] Pan JJ, Yang XN, Wang GZ. ''An Image Segmentation and its Algorithm based on Fuzzy Connetedness'', Journal of Software, 2005, 16(1): 67-76. 8. Authors' Addresses Yixian Yang Information Security Center, Beijing University of posts and telecom.(BUPT), Beijing, China,100876 Phone:8610-62283366 Email:yxyang@bupt.edu.cn Yixian Yang, et al. Expires April, 2006 [Page 19] INTERNET-DRAFT algorithm for IPSFCC October, 2005 Full Copyright Statement Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Intellectual Property The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf- ipr@ietf.org. Yixian Yang, et al. Expires April, 2006 [Page 20]