Internet DRAFT - draft-yang-ipsfcc-algorithm

draft-yang-ipsfcc-algorithm






INTERNET-DRAFT                                   Yixian Yang
Expires: April 2006                              Ming Cao
                                                 Jian Li
                        	                 Beijing University of
                                                 Posts and Telecom.
                                                 October 2005


                  An Algorithm for Intrusion Prevention System
                    based on Fuzzy Connectedness Clustering
                      draft-yang-ipsfcc-algorithm-00.txt



Intellectual Property Rights (IPR) statement:
By submitting this Internet-Draft, each author represents that any 
applicable patent or other IPR claims of which he or she is aware 
have been or will be disclosed, and any of which he or she becomes 
aware will be disclosed, in accordance with Section 6 of BCP 79.

Status of this Memo
   By submitting this Internet-Draft, I certify that any applicable
   patent or other IPR claims of which I am aware have been disclosed,
   and any of which I become aware will be disclosed, in accordance with
   RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.
   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.
   
Copyright Notice

   Copyright (C) The Internet Society (2005).  All Rights Reserved.

Abstract

   In this document, an algorithm for intrusion prevention systems based
   on fuzzy connectedness clustering was presented in order to measure
   the similarity. Fuzzy connectedness that was first applied in image
   segmentation was extended to clustering algorithm, and used as the
   similarity metric between objects to be clustered. By a little expert
   knowledge, starting with at least one seed data in each cluster, the
   similarity between each data in dataset and each seed data in each
Yixian Yang, et al.         Expires April, 2006               [Page 1]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005

   cluster was measured. According to the highest fuzzy connectedness
   of each data, clustering was finished. This algorithm was applied in



   intrusion prevention systems as a tool of detecting intrusion,
   overcoming some shortcomings of previous clustering algorithm.





Table of Contents

  Status of This Memo .........................................   1

  Abstract ....................................................   1

  1. Introduction .............................................   3

  2. Glossary .................................................   4

  3. IPS based on Data Mining .................................   5
    3.1 Related Work ..........................................   5
    3.2 Clustering Algorithms Applied in IPS ..................   6
       
  4. FCC Algorithm Proposed ...................................   11
    4.1 Background .............................................  11
    4.2 Fuzzy Connectedness ...................................   11
    4.3 Methodology ...........................................   12
     4.3.1 Definition .........................................   12
     4.3.2 Overview of FCC Algorithm ..........................   13
     4.3.3 Description of FCC Algorithm .......................   13
     4.3.4 Parameters .........................................   14
    4.4 Application in IPS ....................................   15
     4.4.1 Application Flow ...................................   15
     4.4.2 Explanation about Detail ...........................   16
     4.4.3 System Evaluation and Results.......................   17
    
  5 Conclusions     ...........................................   18
   
  6. Acknowledgements..........................................   19

  7. Informative References....................................   19

  8. Authors' Addresses........................................   19
 
  Full Copyright Statement ....................................   20







Yixian Yang, et al.         Expires April, 2006               [Page 2]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


1. Introduction

   Human attention has become a precious resource.  So, we must find
   ways to automatically analyze the data, to automatically classify
   it, to automatically summarize it, to automatically discover and
   characterize trends in it, and to automatically flag anomalies.
 
   In this document, we introduce a new clustering algorithm, FCC, for
   intrusion detection based on the concept of fuzzy connectedness.
   This concept was introduced by Rosenfeld in 1979 and used with
   success in image segmentation; here we extend this approach to
   clustering and demonstrate its effectiveness in intrusion prevention.
 
   Starting with a single or a few seed points in each cluster, all the
   data points are dynamically assigned to the cluster that has the
   highest fuzzy connectedness value (strongest connection). With an
   efficient heuristic algorithm, the time complexity of the clustering
   process is O( NlogN), where N is the number of data points. The value
   of fuzzy connectedness is calculated using both the Euclidean
   distance and the statistical properties of clusters.
 
   This unsupervised learning method allows the discovery of clusters
   of any shape. Application of the method in intrusion detection
   demonstrates that it can detect not only known intrusion types, but
   also their variants. Experimental results on the KDD-99 intrusion
   detection data set show the efficiency and accuracy of this method.




























Yixian Yang, et al.         Expires April, 2006               [Page 3]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


2. Glossary

   This document uses terminology that is defined in [DSARCH].  There
   is also current work-in-progress on this terminology in the IETF
   and some of the definitions provided here are taken from that work.
   Some of the terms from these other references are defined again
   here in order to provide additional detail, along with some new
   terms specific to this document.

   IPS          devices deployed in carrier, corporate, and
                small-to-medium business environments stop upper and
                lower layer network-based attacks that other network
                security devices are not designed to handle. Network
                IPS devices are placed inline normally either in front
                of or behind the network firewall. There may be
                multiple related deployments depending on your
                network's size, complexity, and number of entry points.

   Clustering   builds models from data without predefined classes.
                Data instances are grouped together based on a
                similarity scheme defined by the clustering system.
                With the help of one or several evaluation techniques,
                it is up to us to decide the meaning of the formed
                clusters.
  
 
              
              


























Yixian Yang, et al.         Expires April, 2006               [Page 4]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


3. IDS based on Data Mining
   
3.1 Related Work

   Data mining based intrusion prevention systems can be roughly
   categorized into two major groups: misuse detection and anomaly
   detection.  In misuse detection, a model is trained with labeled
   data to recognize the patterns of ''normal'' visits and
   ''intrusion'' attempts.  Unlike the traditional knowledge base
   method, signatures of different types of intrusions are learnt
   automatically, and they are much more powerful than manually
   defined signatures in recording the subtle characteristics.
 
   Misuse detection has been shown to be very successful in detecting
   previously known attacks.  However, since the misuse model is highly
   dependent on the labeled data used in the training stage, its
   capabilities of detecting new intrusion types is limited.
 
   Different from misuse detection, anomaly detection first
   establishes a model of normal system behaviors, and anomaly events
   are then distinguished based on this model.  The implicit assumption
   is that any intrusive activity will be anomalous.  Anomaly detection
   is able to detect newly emerging attacks (if only the assumption
   still holds), but it also has some drawbacks.  It may fail to detect
   some known attacks if the behaviors of them are not significantly
   different from what is considered to be normal.  Moreover, the false
   alarm rate tends to be high when the data of some normal system
   behaviors are not involved in the training phase.

   With the help of data mining approaches, these difficulties can be
   easily overcome.  Briefly, the data mining techniques have provided
   the following benefits:

   +  Improved variants detection. This is especially true for anomaly
      detection. Not limited to pre-defined signatures, the concern
      with variants is not as much as before, since any deviation from
      a unique (normal) signature will be treated as intrusion,
      including those previously unknown variants of intrusions.
 
   +  Controlled false alarms. Even though these are false positives,
      with a learning process to identify recurring sequences of false
      alarms, it is possible for us to filter those normal system
      activities and keep the rate of false alarms at an acceptable
      level.
 
   +  Reduced false dismissals. With data mining techniques, patterns
      (or signatures) of normal activities and abnormal events
      (intrusions) can be created automatically. It is also possible
      to introduce new types of attacks through an incremental learning
      process. As a result, more and more attacks can be detected
      correctly. This leads to a reduced number of false dismissals.
 
 

Yixian Yang, et al.         Expires April, 2006               [Page 5]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


   +  Improved efficiency. One very attractive feature of data mining
      techniques is the ability to extract most meaningful information
      out of large amounts of data. After a step of feature extraction
      and/or feature selection, the learning process can be done much
      more efficiently.

   A wide variety of data mining techniques have been applied to
   intrusion prevention, which include decision trees and neural
   networks.  With pre-labeled (normal or intrusion) network traffic
   data, classification-based intrusion detection models can be trained
   and proved to be very useful. Actually, however, it is not always
   possible to have enough training data to guarantee the effectiveness
   of the classifiers.  At the same time, temporal changes in network
   intrusion patterns and characteristics also tend to invalidate the
   usability of trained models. These challenges lead to the emerging
   interests of researchers in applying unsupervised or clustering
   techniques to intrusion detection.
 
3.2 Clustering Algorithms Applied in IPS
 
   Clustering is a well known and studied problem. It has been studied
   in many fields including statistics, machine learning, databases, and
   visualization.  Basic methods for clustering include the Linkage
   bases and K-means techniques. K-means makes several passes through
   the training data and on each pass shifts cluster centers to the mean
   of the data points assigned to that cluster. It then re-assigned
   data point to the nearest prototype, and continues interacting in
   this manner until no significant changes in cluster center positions
   occur.  The K-means method generally produces a more accurate
   clustering than linkage based methods, but it has a greater time
   complexity and this becomes an extremely important factor in network
   intrusion detection due to very large dataset sizes.  Although some
   optimizations of K-means for very large datasets exist, they still
   do not perform sufficiently fast for datasets with high
   dimensionality.
 
   Image that you are given a set of data objects for analysis
   where, unlike in classification, the class label of each object
   is not known.  Clustering is the process of grouping the data
   into classes or clusters so that objects within a cluster have
   high similarity in comparison to one another, but are very
   dissimilar to objects in other clusters. Dissimilarities are
   assessed based on the attribute values describing the objects.
   Often, distance measures are used. Clustering has its rots in
   many areas, including data mining, statistics, biology and
   machine learning.

 
 
 
 
 
 

Yixian Yang, et al.         Expires April, 2006               [Page 6]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


   The process of grouping a set of physical or abstract objects into
   classes of similar objects is called clustering.  A cluster is
   collection of data objects that are similar to one another within
   the same cluster and are dissimilar to the objects in other
   clusters.  A cluster of data objects can be treated collectively
   as one group in many applications.
 
   Cluster analysis is an important human activity.  Early in childhood,
   one learns how to distinguish between cats and dogs, or between
   animals and plants, by continuously improving subconscious
   clustering schemes.  Cluster analysis has been widely used in
   numerous applications, including pattern recognition, data
   analysis, image processing, and market research.  By clustering,
   one can identify dense and sparse regions and, therefore, discover
   overall distribution patterns and interesting correlations among
   data attributes.
 
   "What are some typical applications of clustering?"  In business,
   clustering can help marketers discover distinct groups in their
   customer bases and characterize customer groups based on
   purchasing patterns.  In biology, it can be used to derive plant
   and animal taxonomies, categorize genes with similar functionality,
   and gain insight into structures inherent in populations.
 
   Clustering may also help in the identification of areas of similar
   land use in an earth observation database, and in the identification
   of groups of automobile insurance policy holders with a high average
   claim cost, as well as the identification of groups of houses in a
   city according to house type, value, and geographical location.  It
   can also be used to help classify documents on the Web for
   information discovery.  As a data mining function, cluster analysis
   can be used as a stand-alone tool to gain insight into the
   distribution of data, to observe the characteristics of each
   cluster, and to focus on a particular set of clusters for further
   analysis.  Alternatively, it may serve as a preprocessing step for
   other algorithms, such as characterization and classification,
   which would then operate on the detected clusters.
 
   Owing to the huge amounts of data collected in databases, cluster
   analysis has recently become a highly active topic in data mining
   research.
 
 
 
 
 
 
 
 
 
 
 
 

Yixian Yang, et al.         Expires April, 2006               [Page 7]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005
 

   In machine learning, clustering is an example of unsupervised
   learning. Unlike classicality, clustering and unsupervised learning
   do not rely on predefined classes and class-labeled training
   examples.  For this reason, clustering is a form of learning by
   observation, rather than learning by examples.  In conceptual
   clustering, a group of objects forms a class only if it is
   describable by a concept.  This differs from conventional clustering,
   which measures similarity based on geometric distance. Conceptual
   clustering consists of two components: (1) it discovers the
   appropriate classes, and (2) it forms descriptions for each class,
   as in classification.  The guideline of striving for high interclass
   similarity and low interclass similarity still applies.
 
   In data mining, efforts have focused on finding methods for efficient
   and effective cluster analysis in large databases.  Active themes of
   research focus on the scalability of clustering methods, the
   effectiveness of methods for clustering complex shapes and types of
   data, high-dimensional clustering techniques, and methods for
   clustering mixed numerical and categorical data in large databases.
 
   Clustering is a challenging field of research where its potential
   applications pose their own special requirements.  The following
   are typical requirements of clustering in data mining:

   +  Scalability: Many clustering algorithms work well on small data
      sets containing fewer than 200 data objects; however, a large
      database may contain millions of objects.  Clustering on a sample
      of a given large data set may lead to biased results.  Highly
      scalable clustering algorithms are needed.

   +  Ability to deal with different types of attributes: Many
      algorithms are designed to cluster interval-based(numerical)
      data.  However, applications may require clustering other types
      of data, such as binary, categorical(nominal), and ordinal data,
      or mixtures of these data types.
 
   +  Discovery of clusters with arbitrary shape: Many clustering
      algorithms determine clusters based on Euclidean or Manhattan
      distance measures.  Algorithms based on such distance measures
      tend to find spherical clusters with similar size and density.
      However, a cluster could be of any shape. It is important to
      develop algorithms that can detect clusters of arbitrary shape.
 
   +  Mining requirements for domain knowledge input parameters: Many
      clustering algorithms require users to input certain parameters
      in cluster analysis(such as the number of desired clusters).
      The clustering results can be quite sensitive to input parameters.
      Parameters are often hard to determine, especially for data sets
      containing high-dimensional objects.  This not only burdens users,
      but also makes the quality of clustering difficult to control.
 
 
 

Yixian Yang, et al.         Expires April, 2006               [Page 8]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005
  

   +  Ability to deal with noisy data: Most real-world databases contain
      outliers or missing, unknown, or erroneous data.  Some clustering
      algorithms are sensitive to such data and may lead to clusters of
      poor quality.
 
   +  Insensitivity to the order of input records: Some clustering
      algorithms are sensitive to the order of input data; for example,
      the same set of data, when presented with different orderings to
      such an algorithm, may generate dramatically different clusters.
      It is important to develop algorithms that are insensitive to the
      order of input.
 
   +  High dimensionality: A database or a data warehouse can contain
      several dimension or attributes.  Many clustering algorithms are
      good at handling low-dimensional data, involving only two to three
      dimensions.  Human eyes are good at judging the quality of
      clustering for up to three dimensions.  It is challenging to
      cluster data objects in high-dimensional space, especially
      considering that such data can be very sparse and highly skewed.
 
   +  Constraint-based clustering: Real-world applications may need to
      perform clustering under various kinds of constraints.  Suppose
      that your job is to choose the locations for a given number of new
      automatic cash-dispensing machines(i.e., ATMs) in a city.  To
      decide upon this, you may cluster households while considering
      constraints such as the city's rivers and highway networks, and
      customer requirements per region.  A challenging task is to find
      groups of data with good clustering behavior that satisfy
      specified constraints.
 
   +  Interpretability and usability: Users expect clustering results to
      to be interpretable, comprehensible, and usable.  That is,
      clustering  may need to be tied up with specific semantic
      interpretations and applications.  It is important to study how
      an application goal may influence the selection of clustering
      methods.
 
   Recently, several unsupervised methods have been proposed. Portnoy
   et al. introduced an algorithm to detect both known and new
   intrusion types without the need to label the training data. This
   method, however, has its limitations. It is based on the assumption
   that ''the normal instances constitute an overwhelmingly large
   portion (>98%)''.  Also, it requires a predefined parameter of
   clustering width which is not always easy to find. Y-means,
   introduced by Guan et al., overcomes the shortcomings of the
   traditional K-means algorithm and can automatically decide the
   number of clusters through splitting and merging clusters. However,
   like other centroid-based clustering algorithms, Y-means can only
   deal with clusters with a spherical shape and one still needs to
   define a threshold for the "confident area".

 


Yixian Yang, et al.         Expires April, 2006               [Page 9]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


   To deal with these issues, we introduce a new clustering algorithm,
   the Fuzzy-Connectedness Clustering (FCC) for intrusion detection
   based on the concept of fuzzy connectedness.  With little prior
   knowledge, the proposed method allows the discovery of clusters
   of any shape and can detect not only the known intrusion types,
   but also their variants.
 















































Yixian Yang, et al.         Expires April, 2006              [Page 10]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005

4. FCC Algorithm

4.1 Background

   In data mining, clustering is the most important unsupervised
   learning process used to find the structures or patterns in a
   collection of unlabeled data.  During the learning process, the
   similarities between pairs of data instances are considered and an
   optimized output is found which maximizes the similarities within
   the same group and the dissimilarities between different groups.
 
   While there is no absolute criterion to judge the optimality of the
   clustering result, the choice of similarity measures can surely lead
   to totally different groupings.

   In many clustering algorithms, the spatial distance (usually
   Euclidean distance) between different instances is used to measure
   the similarity.  This has proved to be successful in many
   applications.  It is intuitive that similar objects are closer
   together while objects from different groups are far from each other.
 
   This intuition, however, is not always true in more complicated
   applications, especially when the shape of data clusters is not
   limited to be spherical. Figure 1 shows an example in which the
   popular k-means clustering algorithm can not achieve satisfactory
   result.

   To deal with more complex situations, it should be more suitable to
   take both local neighborhood and global information into
   consideration when deciding the membership of each data instance.
   In this document, we propose a new clustering algorithm which uses
   fuzzy connectedness as the similarity metric.

4.2 Fuzzy Connectedness
       
   The concept of fuzzy connectedness was first introduced by
   Rosenfeld in 1979.  It can be utilized to deal with the spatial
   vagueness when the spatial entities of data do not have homogeneous
   interiors and sharply boundaries.  Fuzzy connectedness has been
   successfully applied to image processing.  In an image segmentation
   framework introduced by Udupa et al., fuzzy affinities are assigned
   to the target data object during the classification.  The affinity
   between a pair of pixels is defined as a weighted function of their
   adjacencies in the coordinate space, the intensity space, and the
   intensity gradient space.










Yixian Yang, et al.         Expires April, 2006              [Page 11]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


   In a recent work, Herman and Carvalho generalized the concept of
   fuzzy connectedness.  For a finite set of data points, the strength
   of the link from one point to another is defined as the reciprocal
   of the distance between them and it can be automatically defined
   based on statistical properties of the links within regions of
   points identified as belonging to the same groups.  A chain is a
   sequence of ordered points and the strength of a chain is its
   weakest link.  Taking fuzzy connectedness as the similarity metric,
   the task of a clustering process is to find the strongest path of
   every data point and assign the point to the group that path leads
   to.
 
4.3 Methodology

4.3.1 Definition

   Basic definitions that we use in this document are presented below.

   A fuzzy set A in a space of points X={x} is a class of events with
   a continuum of grades of membership and is characterized by a
   membership function 米_A(x) with associates with each point in X a
   real number in the interval [0, 1] with the value of 米_A(x) at x
   representing the grade of membership of x in A.  Formally, a fuzzy
   set A with its finite number of supports x1, x2, ..., xn is defined
   as a collection of ordered pairs:
  
                      A={(米_A(xi),xi), xi, i=1, 2, ..., n}
  
   where the supports of A is the subset of X defined as
  
                      S(A)={x, x﹋X and 米A(x)>0}
  
   米i, the grade of membership of xi in A, denotes the degree to which
   an event xi may be a membership of A or belong to A.  This
   characteristic function in fact, can be viewed as a weighting
   coefficient which reflects the ambiguity in a set, and as it
   approaches unity, the grade of membership of an event in A becomes
   higher.

   Definition 1: For a positive integer K, a K-FCC of a set D (of
   points) is a function which maps each point c﹋D into a
   (K+1)-dimensional vector 考_c = {考_c0, 考_c1, 考_c2, ... , 考_cK},
   such that 考_ci﹋[0, 1], and for at least one k, 1 ≒ k ≒ K,
   考_c0 = 考_ck; for 1 ≒ m ≒ K and m ≧ k, 考_cm = 0 or 考_cm =
   考_ck.

   Definition 2: A fuzzy affinity on D is a function:耳: D2 ↙ [0, 1].
   耳(c, d) is the strength of the link (c,d).
 
   Definition 3: A chain in U﹋D from c_0 to c_N is a sequence
   C_(0,N ) = (c_0, c_1, c_2, ... , c_N) in U and the strength of the
   chain 耳_U(C_(0, N)) = min(耳_U(c_n-1, c_n)), 1 ≒ n ≒ N.


Yixian Yang, et al.         Expires April, 2006              [Page 12]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


4.3.2 Overview of FCC Algorithm

   Given the above definitions, a clustering process is to find the
   mapping of K每FCC which is deduced from the calculation of fuzzy
   affinities between pairs of points.  In this process, the definition
   of the fuzzy affinity function 耳 is very essential; one can include
   the statistical information of each cluster to make the results more
   reliable.  When only the Euclidean distance between pairs of points
   is considered, it is not possible to separate the two clusters
   correctly.
 
   In the proposed method, in order to initialize the clustering
   process, we need a little prior knowledge.  This is in the form of a
   few labeled data points in each possible cluster.  This requirement,
   however, is usually easy to satisfy in intrusion prevention
   applications.
 
4.3.3 Description of FCC Algorithm

   Every data collected is regarded as a vector of characteristic,
   because FCC algorithm will be applied in IPS.  So, we call an
   element in the data as a vector here.

   Input: Data D, number of clusters K, seed points S(U_k), 1 ≒ k ≒ K;
 
   Output: K-FCC matrix 考={考_ck | c﹋D, 0 ≒ k ≒ K} and label matrix
           label(c) for c﹋D.

   Step1: for every c﹋D

   Step2: for k = 1:K

   Step3: 考_ck = 0

   Step4: Heap = 耳

   Step5: for k = 1:K

   Step6: U_k = S(U_k) ﹍ {nearest neighbors of S(U_k)}

   Step7: Heap = Heap ﹍ U_k

   Step8: for every c ﹋ U_k 考_ck = 1

   Step9: Max_link = 1

   Step10: while Max_link > 0

   Step11: for k = 1:K

   Step12: while |U_k | > 0

   Step13: d = U_k(1); U_k = U_k - d
 
Yixian Yang, et al.         Expires April, 2006              [Page 13]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


   Step14: NN = {nearest neighbors of d}

   Step15: C = { c﹋NN | 考_ck < min(Max_link, 考_k(d, c)
           &  考_c0 ≒ Max_link}
 
   Step16: while |C| > 0
 
   Step17: c = C(1); C = C - c
 
   Step18: l = min(Max_link, 耳_k(d, c))
 
   Step19: if Max_link = l and 考_ck < Max_link then U_k = U_k ﹍{c}
 
   Step20: if  考_c0 < l then
 
   Step21: if 考_c0 = 0 then Heap = Heap ﹍{c}
 
   Step22: for n = 1:K   考_cn = 0
 
   Step23: if 考_c0 ≒ l then 考_c0 = 考_ck = l
 
   Step24: Heap = Heap - {c|c ﹋ Heap and 考_c0 = Max_link}
 
   Step25: Max_link = Max(考_c0), c ﹋ Heap
 
   Step26: for every c﹋D
 
   Step27: label(c) = k with 考_ck = 考_c0

     
4.3.4 Parameters

   In the FCC algorithm, two parameters are involved: (a) the number
   of seed points and (b) the number of neighbors in the neighborhood.
   The first parameter is decided using prior knowledge about the
   data.  The minimum value is 1, which means that each cluster has
   at least one instance at the very beginning. Usually, the more
   seed points we have, the more information about the cluster will
   be given, and this will possibly provide more accurate results.
   However, depending on the distributions of different clusters, a
   small number of seed points may also give satisfactory results.
 












Yixian Yang, et al.         Expires April, 2006              [Page 14]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


   The number of neighbors used to define the neighborhood is also
   important.  As discussed earlier, when calculating the fuzzy
   connectedness of an instance to other instances in a certain
   cluster, both the local information about the distance between two
   instances and the statistical information about that cluster are
   involved.  An instance x is fuzzy connected to another instance y
   only if y falls in the neighborhood of x. If the number of
   neighbors is too small, only a few instances tightly close to x are
   considered reachable from x, and this may separate x from the
   cluster it really belongs to; when this number is too large, the
   importance of local connectivity will be reduced and it is possible
   for two instances far from each other to be considered as fuzzy
   connected. This may bring two or more clusters close to each other
   forming a single cluster, which is also not desirable. So, there
   should be a trade off among the effects of varying the two
   parameters.
 
   With different configurations of the parameters, while Detection
   Rate stays high and False Alarm Rate stays low, there are some
   changes in the numbers. When the number of nearest neighbors
   increases from 2 to 4, both Detection Rate and False Alarm Rate are
   slightly improved. The reason is that there are more choices for an
   instance to get connected and this helps it to find the correct
   cluster. Consequently, fewer attacks will be mislabeled as normal
   and fewer normal instances will be treated as intrusion.  However,
   there is not always improvement with the increase of the
   number of nearest neighbors considered.  

4.4 Application in IPS

4.4.1 Application Flow

          +--------+         +-----------+         +------------------+
          |Database|-------->|Abstraction|-------->|Vector Description|
          +--------+         +-----------+         +------------------+
                                                              |
                                                              |
                                                              V
                             +-----------+         +------------------+
                             |    FCC    |<--------|Measure Similarity|
                             +-----------+         +------------------+  
                                                                                                        
                   
                          Figure 1:  Application Flow in IPS
                        
                        








Yixian Yang, et al.         Expires April, 2006              [Page 15]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


 4.4.2 Explanation about Detail   

   The algorithm starts with an initialization process (steps 1-9).
   For every data point c﹋D , its fuzzy mapping vector is initialized
   with zeros. As shown later in the algorithm, the values in the
   mappings vectors are used to decide the memberships of the data
   points. All the seed points are put into corresponding clusters
   and certain items in their mapping vectors are set to 1.  A heap
   data structure is used to temporarily contain the unlabelled data
   points. It is set to empty at the beginning. Before the main loop
   (10-27) of the algorithm, only the seed points S(Uk) and their
   nearest neighbors are labeled. Their linkage to their cluster is
   considered the strongest and the value is set to one. Starting
   with these points, all the clusters try to expand. The strength of
   the linkage of each unlabeled point to a cluster is the strength
   of a chain (path) that connects this point to one seed point in
   that cluster. After all the fuzzy affinities are calculated, we
   get the K-FCC for the dataset D. The label of each data point c﹋D
   is finally decided by its mapping vector 考 c . It is assigned to
   cluster k if c k 考 is the largest element of the vector.

   In the FCC algorithm, a priority heap is used for the operations of
   insertion into (step 21) and removal (step 24) from the priority
   queue. With the definition of nearest neighborhood of an instance,
   there are no more than a fixed number of nearest neighbors that
   need to be processed in step 15. If there are N instances in D
   and n is used to denote the number of nearest neighbors, the
   computational complexity of FCC is O(N(logN + nK)).

   There are two issues that are very important to this algorithm: one
   is related to the definition of neighborhood (NN) of a data
   instance, the other is how to calculate the fuzzy affinity between
   each pair of instances. For the first issue, the neighborhood is
   defined in Euclidean space. A small number of neighbors around the
   target point with the shortest Euclidean distances are defined to
   be its nearest neighbors. With an indexing scheme such as an R-
   tree, it is very easy to find the neighborhood in a short time.

   In order to calculate the fuzzy affinity of an instance c and
   another instance k d ﹋U (target cluster), besides their spatial
   adjacency, we also involve the statistical information about the
   cluster Uk.  The distribution information of a certain variable
   (e.g., Euclidean distance between nearest neighbors) of Uk is
   stored and kept up to date.  When a new instance comes, its
   probability of falling into Uk is calculated using the
   distribution information and the Euclidean distance between the
   new instance and its nearest neighbor d in Uk.  Let E(c, d) be the
   Euclidean distance between c and d, and Mk and Sk be the mean and
   standard deviation of the Euclidean distance between every pair of
   instances already in Uk.                     




Yixian Yang, et al.         Expires April, 2006              [Page 16]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


4.4.3 System Evaluation and Results

   To evaluate our system we are interested in two major indicators of
   performance: the detection rate and the false positive rate.  The
   detection rate is defined as the number of intrusion instances
   detected by the system divided by the total number of intrusion
   instances present in the test set.  The false positive rate is
   defined as the total number of normal instances that were
   (incorrectly) classified as intrusions divided by the total number
   of normal instances.  These are good indicators of performance, since
   they measure what percentage of intrusions the system is able to
   detect and how many incorrect classifications it makes in the
   process.  We calculate these values over the labeled data to measure
   performance.








































Yixian Yang, et al.         Expires April, 2006              [Page 17]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


5 Conclusions

   In this document, we introduce a new clustering algorithm, FCC, for
   intrusion detection.  This algorithm uses the concept of fuzzy
   connectedness to calculate the similarities among different data
   instances.  Starting with a single or a few seed points in each
   cluster, all the data points are dynamically assigned to the
   cluster that has the highest fuzzy connectedness value (strongest
   connection).  In calculating the fuzzy connectedness, both local
   distance information and statistical properties of clusters are
   considered.  Comparing with the frequently utilized Euclidean
   distance, the new similarity metric is more robust, and there is
   no restriction on the shape of clusters that can be discovered.
  
   FCC has also some other advantages over the other previously
   proposed clustering algorithms: no predefined ※cluster width§, no
   threshold for ※confidence area§, etc.  Although the FCC algorithm
   requires a parameter K which denotes the number of clusters, most
   of the time it is not necessary to know an exact value for it.
 
   In many cases where we just care about whether an instance is an
   intrusion and do not need to know about the exact intrusion type,
   the parameter K can be fixed as a constant, i.e., K=2. With an
   efficient heuristic algorithm, the time complexity of the
   clustering process is O(NlogN+nK), where N is the number of data
   points and n, K are constants. Moreover, this unsupervised learning
   method can detect not only the known intrusion types, but also
   their variants. The two other parameters in FCC, the number of seed
   points and the number of neighbors, do not seem to affect greatly
   the performance of the algorithm. Experimental results on a subset
   of KDD-99 dataset showed the stability of efficiency and accuracy
   of the FCC algorithm. With different settings, the Detection Rate
   stayed always above 94% while the False Alarm Rate was below 4%.

   For future work, we need to find out if there is a general accepted
   definition of fuzzy affinity. With different distribution variable
   involved in the calculation, the values of fuzzy affinity may be
   different and it may lead to different clustering results. Even
   though the one we employ in equation (1) gives good results on
   the KDD-99 data set, it is not necessary that it will be appropriate
   for other datasets.
          












Yixian Yang, et al.         Expires April, 2006              [Page 18]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005
      

6. Acknowledgement

   The authors wish to thank Jian Li.

7. Informative References

   [1]  Leonid Portnoy, Eleazar Eskin and Sal Stolfo, ''Intrusion
        Detection with Unlabeled Data Using Clustering'',
        Proceedings of ACM CSS Workshop on Data Mining Applied
        to Security(DMSA-2001),Philadelphia, PA: November 5-8, 2001.
    
   [2]  Wenke Lee, ''A Data Mining Framework for Constructing
        Features and Models for Intrusion Detection Systems'',
        PhD Thesis, Columbia University, 2000.
    
   [3]	Pierre Hansen and Nenad Mladenoviu, ''J-Means: A New
        Local Search Heuristic for Minimum Sum of Squares
        Clustering'', Pattern Recognition , 2001, 34(2): 405-413
    
   [4]	Yu Guan, Ail A, Ghorbani and Nabil Belacel, ''Y-Means: A
        Clustering Method for Intrusion Detection'', Proceedings
        of Canadian Conference on Electrical and Computer
        Engineering, Montreal, Quebec, Canada. May 4-7, 2003.
    
   [5]	Azriel Rosenfeld, ''Fuzzy Digital Topology'', Information
        and Control, 1979, Vol.40, 76-87.
             
   [6]  Rebecca Gurley Bace, ''Intrusion Detection'', Macmillan
        Technical Publishing, 2000.
 
   [7]  Pan JJ, Yang XN, Wang GZ. ''An Image Segmentation and
        its Algorithm based on Fuzzy Connetedness'', Journal of
        Software, 2005, 16(1): 67-76.
      
8. Authors' Addresses

   Yixian Yang
   Information Security Center,
   Beijing University of posts and telecom.(BUPT),
   Beijing, China,100876
   Phone:8610-62283366
   Email:yxyang@bupt.edu.cn
 











Yixian Yang, et al.         Expires April, 2006              [Page 19]
INTERNET-DRAFT               algorithm for IPSFCC        October, 2005


Full Copyright Statement

   Copyright (C) The Internet Society (2005).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at ietf-
   ipr@ietf.org.
 
 
 
 
 
 



    
 
 



Yixian Yang, et al.         Expires April, 2006              [Page 20]