With advances in computer technology, reliance on the internet has increased drastically. This exposes systems to intruders who gain access remotely through network vulnerabilities. CSIS reported more than one significant cyber incident every month between 2006 and 2018 [20]. The statistics in [23] showed a constant worldwide increase in cyber security incidents every year between 2009 and 2015. Bhutan is an underdeveloped country that is still catching up with trends in Information Technology. However, it is not spared from the challenges of providing secure services online. With many official and personal tasks and transactions now carried out over the internet, such as banking applications and the online security clearance system, it is even harder for cyber security personnel to manage and monitor security breaches. This ultimately increases the challenge of securing information and maintaining the Confidentiality, Integrity and Availability of information.

To maintain this security triad and to prevent systems from being exposed to attackers, we need tools that can monitor the network, detect suspicious activity and alert the network administrator. Network administrators generally use a firewall to block suspicious or unauthorised access through the network. However, a firewall cannot filter out all intruders. Therefore, an Intrusion Detection System (IDS) is used to detect unauthorised network intruders. The IDS monitors and identifies attempted unauthorised access and then alerts the network administrator to the suspicious behaviour on the network.

One type of IDS analyses historical network data and uses machine learning techniques to create the detection model. Machine learning techniques used in IDS address two key tasks: supervised and unsupervised learning [2, 7, 19]. For this project, we studied machine learning techniques for intrusion detection in the supervised learning setting.

The methods we used for intrusion detection showed a high accuracy rate. In this paper, Random Forest, J48 and Naive Bayes classifiers are implemented using Random Projection as the feature selection and reduction method. These models performed binary as well as multiclass classification. Among these models, Naive Bayes performed poorly. We therefore applied PCA as a dimensionality and feature reduction method with the Naive Bayes classifier and compared the result with that obtained from Random Projection with Naive Bayes. While evaluating the models, we learned that these classification models do not take the class imbalance problem into consideration. To address this, we modified the J48 algorithm by adding weights to the calculation of entropy. The modified algorithm takes the imbalanced distribution of the data into account in order to produce less biased output. However, it does not show a significant improvement in accuracy, false positive (FP) and false negative (FN) rates. In our experiments, we first evaluated the results in terms of the normal and anomaly classes. We then evaluated them in terms of the different classes of attack, namely DoS, Probe, U2R and R2L.

The next chapter describes the background and related work, followed by Chapter 3, which presents the proposed ideas and methodology. Chapter 4 explains the experimental results. Chapter 5 comprises concluding remarks and future work, followed by the references, appendices and biography.

CHAPTER II
BACKGROUND AND RELATED WORK

As mentioned in the introduction, there are two key tasks in machine learning: supervised learning and unsupervised learning.

Supervised learning is a technique in which the model is trained on data samples whose classification has been correctly assigned. To build the model, the algorithm uses pre-labelled instances as training examples. Some well-known supervised learning techniques are tree-based algorithms such as J48 and Random Forest, and statistical techniques such as Naive Bayes [2, 19]. These models perform both binary and multiclass classification. Binary classification involves classifying the data into two classes: whether it is an intrusion or not (Positive or Negative). The models therefore predict the probability of the target variable being Positive or Negative. When classifying different types of attack in intrusion detection, the classes include DoS, Probe, U2R and R2L. This is referred to as multiclass classification, as it involves more than two classes. One can solve the multiclass classification problem as multiple two-class classification problems. Specifically, one can create one classifier for each type, such as a DoS model that classifies whether an access is of DoS or non-DoS type, and a normal model that classifies whether the access is normal or not. The most likely answer across all classifiers is then used as the predicted label. One can also solve the problem with a classifier that can model multiclass data directly, such as Naive Bayes or a Neural Network.
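The decomposition of a multiclass problem into several two-class problems can be sketched in Python as follows. This is only an illustration: the scoring rules, the threshold values and the single feature "count" are hypothetical stand-ins for trained binary classifiers, not models used in this work.

```python
from typing import Callable, Dict, Tuple

Record = Dict[str, float]
Scorer = Callable[[Record], float]


def predict_one_vs_rest(record: Record, scorers: Dict[str, Scorer]) -> Tuple[str, Dict[str, float]]:
    """Ask every binary classifier for a score and keep the most confident label."""
    scores = {label: scorer(record) for label, scorer in scorers.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical binary scorers, one per class (each answers "is it this class or not?").
toy_scorers: Dict[str, Scorer] = {
    "normal": lambda r: 0.9 if r["count"] < 10 else 0.1,
    "DoS": lambda r: 0.8 if r["count"] >= 100 else 0.05,
    "Probe": lambda r: 0.6 if 10 <= r["count"] < 100 else 0.05,
}

label, scores = predict_one_vs_rest({"count": 250.0}, toy_scorers)
```

In this toy setting the record with a very high connection count receives its highest score from the DoS scorer, so DoS becomes the predicted label.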

The unsupervised learning task does not require input instances to be pre-labelled [2]. A common analysis applied in the unsupervised learning task is to find correlations between data points. One can then use a clustering technique, such as K-Means clustering or DBSCAN, to group similar data points into clusters. One can also find strong correlations between features and use them to construct association rules with the Apriori algorithm. For the rest of this chapter, we discuss machine learning techniques for the supervised learning task.

Anna L. Buczak and Erhan Guven [7] present a literature review of data mining and machine learning models for intrusion detection. According to their survey, “the methods that are the most effective for cyber applications have not been established. Due to the richness and the complexity of the methods, it is impossible to make one recommendation for each method based on the type of attack the system is supposed to detect. There are several other criteria that need to be considered for determining the effectiveness of the methods”. The common quantitative measurements are accuracy, complexity, the time taken to classify a sample, and the final classification solution of each machine learning or data mining method. Another aspect of machine learning for intrusion detection highlighted in their review is the importance of the datasets used to train the models. They also observed a shortage of labelled data, an area that deserves more investment. Another problem that makes the models more difficult to use in the cyber domain is how often the model needs to be retrained. The authors recommend further research on methods that could be used for both misuse and anomaly detection.

2.1 Machine Learning Techniques and Datasets

To address intrusion detection challenges over the network, supervised learning techniques such as Random Forest, SVM, J48 and Naive Bayes are used with various feature selection and reduction methods. In [1], the authors used Random Forest and SVM along with Random Projection. Their results showed an accuracy rate of 100% for Random Forest with Random Projection, which the authors took as evidence that it is a better predictive model than SVM with Random Projection. The J48 decision tree and the Bayesian algorithm were applied to the intrusion detection task in [2]. The experimental results showed that the J48 algorithm performed well for intrusion detection, with an accuracy rate of almost 100%. The authors of [3] proposed Naive Bayes as the classifier and PCA as the dimensionality and feature reduction method. Their experimental results are reported in terms of accuracy rate, detection rate and False Positive Rate. The accuracy rate of the model trained with feature selection was 98.53%, which is higher than the 80.14% obtained by the model trained with the full set of 41 features.

Feature selection and reduction methods are generally used for dimensionality reduction. Dimensionality reduction methods identify and remove irrelevant attributes that do not contribute to the model’s prediction. For this task, Random Projection, GainRatio and PCA were used with various classifiers by S.R. Johnson & A. Jain [1], U. Bashir & M. Chachoo [2] and S.M. Almansob & S. Lomte [3].

From this recent research, we selected the models that performed best. That is, we used Random Forest, J48 and Naive Bayes classifiers and evaluated their performance in detecting intrusions.

“Since 1999, KDD’99 dataset [5] was most widely used for the evaluation of anomaly detection methods. This data set is prepared by Stolfo et al. [31] and is built based on the data captured in DARPA’98 IDS evaluation program. DARPA’98 is about 4 gigabytes of compressed raw (binary) tcpdump data of 7 weeks of network traffic. It contains 5 million connection records, each with about 100 bytes. However, the drawback of these datasets is that it consists of redundant records both in training and test data which makes the output biased and inaccurate.” To obtain unbiased results, the NSL-KDD dataset is used for running the classification models. An analysis of the NSL-KDD dataset is given in [4]. The NSL-KDD dataset is an improved version of the KDD’99 dataset. There are no redundant records in the train and test sets, and missing values have also been removed. The training dataset consists of different types of attack. The attack types present in the training data are the known attacks, whereas the additional attacks in the test data are new, unknown attacks. The NSL-KDD dataset is publicly available so that researchers can analyse and compare the performance of various algorithms for detecting attacks over the network. This dataset is freely downloadable from [6] and [7].

Other datasets are also used for experimental purposes in machine learning. The number of published studies reported in [21] shows that KDD99 is the de facto dataset for intrusion detection and machine learning research. This dataset was created in 1999, hence the name KDD99. Between 2010 and 2015, KDD99 was used in 133 articles, the highest number for any dataset, followed by NSL-KDD with 22 articles. We nevertheless used the NSL-KDD dataset because its data are filtered, i.e. redundant records and missing values have been removed. The researchers in [8] used other datasets, such as DNS data, NetFlow, attack signatures and database logs. However, these datasets are not available for download.

2.2 Imbalanced Class

When the instances are not equally distributed among the classes, the data is said to have imbalanced classes. With such a distribution, classifiers tend to produce biased and incorrect results. The problem can be tackled at the input level or at the classifier level. At the input level, sampling approaches are used to balance the number of samples in each class. The undersampling approach randomly eliminates some of the instances from the majority class. As a result, the overall number of records in the training set is significantly reduced. Since we are dealing with high-dimensional data, there is a significant saving in memory as well [32]. However, valuable information could be lost when we remove instances that would have helped the classifier build an accurate model. The oversampling approach randomly replicates samples of the minority class. In oversampling, no information from the original training set is lost, since all instances from both the minority and majority classes are preserved. The downside is that the size of the training set increases greatly, which increases the training time and the amount of memory required to hold the training set.
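The two input-level sampling approaches can be illustrated with the following minimal Python sketch. It is not the procedure used in this work (which opts for a classifier-level solution); the function names and the toy labels are hypothetical.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect the rows of each class in a dictionary keyed by label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly drop instances until every class is as small as the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(members) for members in groups.values())
    return [(row, label) for label, members in groups.items()
            for row in rng.sample(members, target)]


def random_oversample(rows, labels, seed=0):
    """Randomly replicate instances until every class is as large as the largest one."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(members) for members in groups.values())
    balanced = []
    for label, members in groups.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend((row, label) for row in members + extra)
    return balanced


# Toy data: 8 majority-class records and 2 minority-class records.
toy_rows = list(range(10))
toy_labels = ["normal"] * 8 + ["U2R"] * 2

under = Counter(label for _, label in random_undersample(toy_rows, toy_labels))
over = Counter(label for _, label in random_oversample(toy_rows, toy_labels))
```

Undersampling leaves 2 records per class (4 in total), while oversampling yields 8 records per class (16 in total), which mirrors the trade-off between information loss and training-set growth described above.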

At the classifier level, the solution is usually obtained by modifying the classifier algorithm. The advantage of this approach is that it avoids pre-processing steps on the dataset. The authors of [30] introduced a technique that handles the imbalanced class problem by modifying the J48 algorithm to include weights in the calculation of entropy. Their experiment showed an increase in accuracy rate from 70.2% to 75.2%. Sensitivity also increased from 35.7% to 71.4%, while Specificity decreased from 82.4% to 76.5%. In this work, we also opted for a solution at the classifier level for handling imbalanced data. We compared all these models and evaluated their performance using the open-source tool Weka 3.8. The authors of [1] and [2] also used Weka, while [3] used MATLAB for pre-processing, testing and evaluating the results.

CHAPTER III
PROPOSED IDEA AND METHODOLOGY

The main idea of our proposed method is to evaluate the performance of machine learning models for intrusion detection. The overview of our proposed method is shown in Figure 3.1. The first step is to load the dataset (NSL-KDD) into Weka, followed by pre-processing of the data. It is important to reduce the dimension of the data and select the appropriate attributes from the dataset. Hence, we used two dimensionality and feature reduction methods, Random Projection and PCA. After that, the machine learning models Random Forest, J48 decision tree and Naive Bayes were applied to classify the data. Finally, we evaluated the output in terms of the normal class and the types of attack, namely DoS, Probe, U2R and R2L. We used the open-source tools Weka 3.8 and R Studio 1.1.414 to evaluate the classifiers.

[Figure 3.1 shows a flow diagram with the stages NSL-KDD Dataset, Pre-Processing, Dimensionality & Feature Reduction, Machine Learning Models and Evaluation of the result, with the output classes Normal, DoS, Probe, U2R and R2L.]

Figure 3.1: Overview of the proposed method

3.1 Dataset

As mentioned in the previous chapter, we used the NSL-KDD dataset for our studies. The dataset consists of 42 attributes: 41 features plus a final attribute (label) that indicates whether the record is normal activity or an attack. The total number of instances is 148517. The instances are grouped into the normal and attack classes. The four categories of attack present in the dataset are DoS, Probe, U2R and R2L. The number of instances in each category is listed in Tables 3.1 and 3.2. The full list of features of the NSL-KDD dataset and their descriptions is presented in Table 3.3. Only the Protocol_type, Service and Flag features are symbolic. The other features are of continuous type, although many of them can be considered categorical. For example, the Is_guest_login feature has only two possible values, 0 and 1, where 0 indicates that the current login is not a guest login and 1 indicates that it is.

Table 3.1: Number of instances in Normal and Attack class

Dataset    Normal   Attack   Total
Training   67343    58630    125973
Testing    9711     12833    22544
Total                        148517

Table 3.2: Number of instances in Normal and Different Attack Categories

Dataset    Normal   DoS     Probe   U2R   R2L    Total
Training   67343    45927   11656   52    995    125973
Testing    9711     7456    2421    200   2756   22544
Total                                            148517

Table 3.3: List of Features of NSL-KDD dataset and its Description

3.2 Data Pre-processing

The first step is to pre-process the data for binary and multiclass classification. Pre-processing involves cleaning the data: removing redundant records and missing values, detecting outliers, and converting the data to numeric form or normalising it. We used RStudio 1.1.4 to check for outliers, redundant records and missing values. The dataset, which is freely downloadable from [6] and [7], is already split into a training set and a test set.
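As a rough illustration of the cleaning steps described above, the following Python sketch removes duplicate records, drops records with missing values and applies min-max normalisation. It is not the RStudio procedure used in this work; the function names and the toy rows are hypothetical.

```python
def remove_redundant_and_missing(rows):
    """Keep the first copy of each record and skip records with a missing value (None)."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row):
            continue  # record has a missing value
        key = tuple(row)
        if key in seen:
            continue  # redundant (duplicate) record
        seen.add(key)
        cleaned.append(list(row))
    return cleaned


def min_max_normalise(rows):
    """Rescale every numeric column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


toy_rows = [[0, 10], [0, 10], [5, None], [10, 30]]
cleaned_rows = remove_redundant_and_missing(toy_rows)
normalised_rows = min_max_normalise(cleaned_rows)
```

Here the duplicate record and the record with a missing value are discarded, leaving two records that are then rescaled column by column.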

3.3 Dimensionality and Feature Reduction Method

Dimensionality and feature reduction methods are used to choose a subset of appropriate features for building the model. Their advantage is that they reduce the dimension of the data by projecting it into a low-dimensional space and remove uninformative features. Each of the methods used is described in the following sections.

3.3.1 Random Projection

Random Projection is a feature selection and reduction method that reduces the dimensionality of the data by reducing the number of random variables. As per [1, 25], “In random projection, the original d-dimensional data is projected to a k-dimensional (k << d) subspace through the origin, using a random k × d matrix R.” Random Projection is commonly used because it is reported to produce fewer inaccurate results than other methods. The dimensions of the random projection matrix are chosen so as to approximately preserve the distance between any two samples of the original dataset.

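$$X^{RP}_{k \times N} = R_{k \times d}\, X_{d \times N}$$

The random matrix R is generated from the following sparse distribution:

$$r_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6 \\ 0 & \text{with probability } 2/3 \\ -1 & \text{with probability } 1/6 \end{cases}$$

$$X^{RP}_{k \times N} = \frac{1}{\sqrt{k}}\, R_{k \times d}\, X_{d \times N}$$

With this probability distribution, only one third of the data is processed, because on average two thirds of the entries of R are zero. This makes it more memory efficient and allows faster computation of the projected data. Therefore, we selected the Sparse1 distribution, as shown in Figure 3.2.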
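The sparse projection can be sketched in plain Python as follows. This is an illustrative reading of the distribution above, not Weka's implementation; the function names are hypothetical, and the sizes k = 10 and d = 41 and the seed 42 simply mirror the values mentioned in this chapter.

```python
import math
import random


def sparse_random_matrix(k, d, seed=42):
    """Draw a k x d matrix with entries sqrt(3) * {+1, 0, -1} (probabilities 1/6, 2/3, 1/6)."""
    rng = random.Random(seed)
    scale = math.sqrt(3)
    values = [scale, 0.0, -scale]
    weights = [1 / 6, 2 / 3, 1 / 6]
    return [rng.choices(values, weights=weights, k=d) for _ in range(k)]


def project(matrix, record):
    """Project one d-dimensional record to k dimensions, scaled by 1 / sqrt(k)."""
    k = len(matrix)
    # Zero entries are skipped, so on average only a third of the features are touched.
    return [
        sum(r * x for r, x in zip(row, record) if r != 0.0) / math.sqrt(k)
        for row in matrix
    ]


R = sparse_random_matrix(k=10, d=41)
projected = project(R, [1.0] * 41)
```

Each call to project maps a 41-feature record to 10 values; applying it to every record gives the reduced dataset.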

In Weka, applying Random Projection reduced the data to 11 attributes, labelled K1, K2, …, K11 [1], where K11 is the class label. We used the default parameters for Random Projection, as shown in Figure 3.2. The parameter “percent” sets the percentage of the original dimensions to which the data is reduced. Since we set the parameter “numberOfAttributes” to 10, “percent” was ignored and set to 0. We set the parameter “replaceMissingValues” to False, as our dataset has no missing values. The last parameter, “seed”, was set to 42 for generating the random matrix. These are the main parameters used in Random Projection.

Figure 3.2: Configuration of Random Projection

3.3.2 Principal Component Analysis (PCA)

The authors of [3] defined Principal Component Analysis as a “dimensionality and feature reduction method that generates features which are linear combination of the initial features”. PCA maps each data point in a d-dimensional space to a much lower k-dimensional subspace. The set of k new dimensions generated are called the Principal Components (PC). Each principal component is directed towards maximum variance, excluding the variance already accounted for in all its preceding components. Consequently, the first component covers the maximum variance and each component that follows covers less variance. PCs are calculated using the eigenvalues and eigenvectors of the data covariance matrix. Each eigenvalue is a coefficient attached to an eigenvector. Eigenvalues and eigenvectors are computed by solving the following equation.

$$(A - \lambda I)\,V = 0$$

where A is the covariance matrix of the data, I is the identity matrix, and λ and V are the eigenvalue and eigenvector computed from this equation. Because the eigenvector V must be non-zero, the equation has a solution only when the determinant |A − λI| is zero, which yields the eigenvalues. The eigenvector V gives a direction in which the data varies, and the eigenvalue λ is a number that tells us how much variance there is in the data in that direction. Multiplying V by A only scales it by λ and leaves its direction unchanged. The eigenvector with the highest eigenvalue becomes the first principal component of the data.

The Principal Components are represented as follows.

$$PC_i = a_1 X_1 + a_2 X_2 + \dots + a_d X_d$$

where,

PC_i = Principal Component ‘i’,

X_j = original feature ‘j’, and

a_j = numerical coefficient for X_j.
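To make the role of the eigenvalue and eigenvector concrete, the sketch below computes the covariance matrix of a tiny dataset and approximates its first principal component by power iteration. Power iteration is our choice for a dependency-free illustration and is not necessarily how Weka computes the components; the function names and the toy data are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)] for a in range(d)]


def first_principal_component(cov, iterations=200):
    """Approximate the largest eigenvalue and its unit eigenvector by power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return eigenvalue, v


# Two perfectly correlated features: all variance lies along the diagonal direction.
toy_data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
cov = covariance_matrix(toy_data)
eigenvalue, direction = first_principal_component(cov)
```

For this toy data the first component has equal coefficients on both features (a_1 = a_2), and its eigenvalue equals the total variance of the data, so a single component would suffice.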

In Weka, PCA computes the principal components of the attributes and ranks them in descending order of the variance they cover. Components with variance higher than 1.1 were selected for further processing, as they covered the most variance. Weka also ranked the attributes that contributed the most to the principal components. The default parameters used for PCA and the attributes selected for further processing are shown in Figure 3.3 and Table 3.4, respectively.

Figure 3.3: Configuration of PCA

The main parameters used are described below [35]:

centerData: With this parameter set to True, PCA was computed from the covariance matrix.

maximumAttributeNames: The maximum number of attributes to include in transformed attribute names which was set to 10 by default.

transformBackToOriginal: Transform through the PC (principal component) space and back to the original space. This parameter was set to False as we do not require using the original space.

varianceCovered: Retain enough PC attributes to account for this proportion of variance. We set it to .95 so as to retain 95% of the variance.
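The effect of the varianceCovered parameter can be sketched as follows: components are added in descending order of eigenvalue until the requested proportion of the total variance is reached. This is a simplified illustration under that assumption; Weka's exact selection rule may differ, and the function name and eigenvalues are hypothetical.

```python
def components_to_keep(eigenvalues, variance_covered=0.95):
    """Return how many leading components are needed to cover the requested variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= variance_covered:
            return count
    return len(ordered)


toy_eigenvalues = [6.0, 2.0, 1.0, 0.6, 0.4]
kept = components_to_keep(toy_eigenvalues, variance_covered=0.95)
```

With these toy eigenvalues the first three components cover 90% of the variance and the fourth raises the total to 96%, so four components are retained.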

Table 3.4: List of selected features from PCA and its Description

3.4 Applying classification models

After the dimensionality and feature reduction methods are applied, different machine learning techniques are used to classify the data using the Weka tool.

3.4.1 J48 Decision Tree

J48 is a machine learning model that predicts the value of a target variable for a new sample based on the values of various attributes of the available data. As explained in [2], new data is labelled according to the existing observations in the training dataset. J48 is the Java implementation in Weka of the C4.5 decision tree algorithm. The main idea of a decision tree is to build a predictive model that is mapped into a tree structure. The steps involved in growing a tree are listed in [2]:

i. Place the best attribute of the dataset at the root of the tree.

ii. Split the training set into subsets.

iii. Repeat steps i and ii on each subset until leaf nodes are found in all the branches of the tree.

To illustrate how a decision tree is grown, consider the example shown in Figure 3.4. Our objective is to determine whether a connection is an intrusion or not. We use three attributes: Protocol, Service and Flag. Protocol is set as the root node of the tree. When the protocol is UDP, the connection is classified as an intrusion, as all 4 instances belong to the class Yes. A node is considered pure when all its instances belong to a single class, which is the case here. Because the node is pure, splitting ends at this point. For the next protocol, ICMP, 3 instances are intrusions and 2 are not, i.e. the node is impure. Hence, we select the next attribute for splitting, which is Flag. The resulting node is pure, as all three instances with flag SF belong to the class Yes, so we stop at this point. This shows that when the protocol is ICMP and the flag is SF, the connection is an intrusion. A similar procedure follows for the protocol TCP. This example is a simple illustration of growing a decision tree.

[Figure 3.4 shows a decision tree for the question “Intrusion or Not?”. The root node Protocol has the branches TCP (2 Yes / 3 No), UDP (4 Yes / 0 No) and ICMP (3 Yes / 2 No). The impure branches are split further on Flag and Service (ssh, telnet); the Flag node for SF contains 3 Yes and 0 No.]

Figure 3.4: Decision Tree

However, if the dataset consists of many attributes, it is difficult to decide on the best split. For this purpose, the Gain Ratio is used for splitting in the decision tree. The J48 decision tree uses the Gain Ratio to select the attribute that best partitions a dataset [34]. “It will consist of potential information generated by splitting the training data set into v partitions, corresponding to v outcomes on the attribute. The attribute with the maximum gain ratio is selected as the splitting attribute. Let D be a set consisting of d data samples with distinct classes”. To find the information gain ratio, we first compute the entropy:

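$$E(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where m is the number of classes and p_i is the probability that an arbitrary sample belongs to class C_i, i.e. the probability of getting the ith value when randomly selecting one from the set.

Information gain is then computed as:

$$Gain(D, S) = E(D) - \sum_{i=1}^{n_s} P(D_i)\, E(D_i)$$

where D_1, …, D_{n_s} are the partitions produced by splitting D on attribute S and P(D_i) is the proportion of samples that fall into partition D_i.

The GainRatio is defined as:

$$GainRatio(D, S) = \frac{Gain(D, S)}{-\sum_{i=1}^{n_s} P(D_i) \log_2 P(D_i)}$$

that is,

$$GainRatio = \frac{\text{Information Gain}}{\text{Split Info}}$$

The advantages of using J48 classifier are: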

i. Insensitive to outliers.

ii. Overcome scale difference between parameters.

iii. Handles missing values.

Disadvantages of using J48 classifier are:

i. Tends to overfit the training data.

ii. Prone to sampling errors.
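The entropy, information gain and gain ratio used by J48 for choosing a split can be worked through in a short Python sketch. It uses the class counts of the Protocol split from the decision tree example (TCP: 2 Yes / 3 No, UDP: 4 Yes / 0 No, ICMP: 3 Yes / 2 No); the function names are hypothetical and the code is an illustration rather than Weka's implementation.

```python
import math


def entropy(counts):
    """Entropy (in bits) of a distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gain_ratio(partitions):
    """Return (information gain, split info, gain ratio) for a candidate split.

    `partitions` holds one tuple of per-class counts for each branch of the split.
    """
    class_totals = [sum(column) for column in zip(*partitions)]
    n = sum(class_totals)
    remainder = sum(sum(p) / n * entropy(p) for p in partitions)
    gain = entropy(class_totals) - remainder
    split_info = entropy([sum(p) for p in partitions])
    return gain, split_info, gain / split_info


# (Yes, No) counts for the branches TCP, UDP and ICMP of the Protocol attribute.
protocol_split = [(2, 3), (4, 0), (3, 2)]
gain, split_info, ratio = gain_ratio(protocol_split)
```

For this split the entropy of the whole set (9 Yes, 5 No) is about 0.940 bits, the information gain is about 0.247, the split info about 1.577 and the gain ratio about 0.156. The same computation would be repeated for Service and Flag, and the attribute with the highest gain ratio would be chosen.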

The configuration used for the J48 classifier is shown in Figure 3.5. The parameters are described below, following [35]:

batchSize: The preferred number of instances to process if batch prediction is being performed. We used the value 100.

binarySplits: Whether to use binary splits on nominal attributes when building the trees. This was set to False.

collapseTree: Set to True, so that parts of the tree that do not reduce the training error are removed.

confidenceFactor: The confidence factor used for pruning; lower values incur more pruning. We used a confidenceFactor of 0.25.

Debug: If set to True, the classifier outputs additional information to the console, which we did not need for our experiment.

doNotCheckCapabilities: If set, classifier capabilities are not checked before the classifier is built, which can reduce runtime. We set it to False, so capabilities were checked.

doNotMakeSplitPointActualValue: If true, the split point is not relocated to an actual data value. We set it to False for our experiment.

minNumObj: The minimum number of instances per leaf, i.e. 2.

numDecimalPlaces: The number of decimal places to be used for the output of numbers in the model. The default value used was 2.

numFolds: Determines the amount of data used for reduced-error pruning. The default used was 3; one fold was used for pruning, the rest for growing the tree.

reducedErrorPruning: Whether reduced-error pruning was used. The default was set to False.

saveInstanceData: Whether to save the training data for visualization. The default was set to False.

seed: Used seed 1 for randomizing the data.

subtreeRaising: Set to True so as to consider the subtree raising operation when pruning.

Unpruned: Set to False because pruning was performed.

Figure 3.5: Configuration of J48 classifier

3.4.2 Modified J48 Algorithm for Imbalance Class Problem

The J48 algorithm explained in Section 3.4.1 is modified to handle the imbalanced class problem. As described in Chapter 2, the imbalanced class problem can be handled at the input level or at the classifier level. For our experiment, we opted for the latter. The solution at the classifier level is obtained by modifying the J48 algorithm as described in [30]. The J48 decision tree uses the Gain Ratio, which is based on entropy, to choose the next split. The problem with entropy is that imbalanced data results in low entropy and hence a high false negative rate. To tackle this problem, weights are used when computing the entropy.

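The weight of each class c is first computed using (1):

$$w_c = \frac{1}{m \cdot p_c} \qquad (1)$$

where m denotes the number of classes and p_c the proportion of class c.

Without weights, the entropy is computed using (2):

$$E(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \qquad (2)$$

With weights, the weighted entropy (3) becomes:

$$E_w(D) = -\sum_{i=1}^{m} w_i p_i \log_2(w_i p_i) \qquad (3)$$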

where D is the dataset and w_i is the weight of each class, taken from W(D). The authors of [30] used a gastric cancer dataset and artificial datasets generated from it. In our experiment, we used the NSL-KDD dataset to verify whether the modified algorithm is also applicable and efficient for intrusion detection.
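The weighted entropy can be illustrated with the following Python sketch. It rests on two assumptions that should be checked against [30]: that the weight in (1) reads as one over the number of classes times the class proportion, and that the weights are computed once on the whole dataset and then applied to the class proportions of each node. The function names and the counts are hypothetical.

```python
import math


def class_weights(dataset_counts):
    """Weight of each class, assumed to be 1 / (m * p_c) on the whole dataset.

    Every class is assumed to occur at least once, otherwise the weight is undefined.
    """
    m = len(dataset_counts)
    total = sum(dataset_counts)
    return [total / (m * count) for count in dataset_counts]


def weighted_entropy(node_counts, weights):
    """Entropy of a node in which each class proportion p_i is replaced by w_i * p_i."""
    total = sum(node_counts)
    result = 0.0
    for count, weight in zip(node_counts, weights):
        if count:
            wp = weight * count / total
            result -= wp * math.log2(wp)
    return result


balanced_weights = class_weights([50, 50])    # both classes equally frequent
imbalanced_weights = class_weights([90, 10])  # minority class gets the larger weight
root_entropy = weighted_entropy([90, 10], imbalanced_weights)
```

Under these assumptions a balanced dataset yields weights of 1, so the weighted entropy reduces to the ordinary entropy, while for the imbalanced dataset the minority class receives a weight of 5 and the weighted entropy at the root equals 1 bit instead of the low unweighted value.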

3.4.3 Random Forest

Random Forest is an ensemble classification method used to improve the accuracy of the model [1]. It is a collection of many decision trees, which helps it yield a low classification error compared with other methods. It produces N trees, and each tree in the forest votes for either the normal or the malicious class. Each tree is grown on a random bootstrap sample, i.e. a subset of the training data. In a bootstrap sample, the samples are selected randomly with replacement, so each subset may contain duplicate samples and some samples from the original data may not appear in it. In addition, the model searches for the best feature among a random subset of the features, so only that subset is considered by the algorithm when splitting a node. This randomness is the reason it is called Random Forest. The steps involved in creating a random forest, as stated in [1], are as follows.

i. Create n bootstrap samples from the original data.

ii. Grow an unpruned tree for each bootstrap sample.

iii. At each node, use a random set of m predictors for splitting, with $m = \sqrt{k}$, where k is the total number of predictors (attributes). The node is split on the best variable among the m. This speeds up the tree growing process.

iv. Predict new data by aggregating the predictions of the n trees (majority vote).
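The three randomised ingredients of these steps (bootstrap sampling, the random subset of m predictors and the majority vote) can be sketched in Python as follows. The sketch deliberately leaves out the tree-growing itself; the function names are hypothetical, and rounding the square root of k down to an integer is our assumption.

```python
import math
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) records with replacement, so duplicates and omissions can occur."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(k, rng):
    """Pick m = floor(sqrt(k)) distinct predictor indices out of k."""
    m = max(1, int(math.sqrt(k)))
    return rng.sample(range(k), m)


def majority_vote(predictions):
    """Aggregate the predictions of the individual trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(1)
toy_rows = list(range(10))
sample = bootstrap_sample(toy_rows, rng)
features = random_feature_subset(41, rng)
forest_prediction = majority_vote(["attack", "normal", "attack"])
```

With the 41 features of NSL-KDD, each split would consider 6 randomly chosen predictors, and a record that two of three trees label as an attack is classified as an attack.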

The advantages of using Random Forest are:

Handles missing values.

Overcomes the problem of overfitting.

Handles large datasets without variable removal.

Does not require scaling the data.

Disadvantages of using Random Forest are:

Time consuming.

If the data is too large, it is difficult to interpret the relationships in the data.

The parameter settings used for the Random Forest classifier are shown in Figure 3.6. The parameters used are described below [35]:

bagSizePercent: Size of each bag, as a percentage of the training set size. i.e. 100.

batchSize: The preferred number of instances to process, i.e. 100.

breakTiesRandomly: Break ties randomly when several attributes look equally good. The default was set to False.

calcOutOfBag: Whether the out-of-bag (OOB) error is calculated. This was set to False as we don’t require computing OOB.

computeAttributeImportance: Set to False as we do not compute attribute importance via mean impurity decrease.

Debug: If set to True, the classifier outputs additional information to the console, which we did not need.

doNotMakeSplitPointActualValue: If true, the split point is not relocated to an actual data value. We set it to False for our experiment.

maxDepth: The maximum depth of the tree, 0 for unlimited.

numDecimalPlaces: The number of decimal places to be used for the output of numbers in the model. The default value used was 2.

numExecutionSlots: The number of execution slots to use for constructing the ensemble which was set to 1 for our experiment.

numIterations: The number of iterations to be performed, i.e. 100.

outputOutOfBagComplexityStatistics: Whether to output complexity-based statistics when out-of-bag evaluation is performed. Since we don’t use OOB evaluation, this was set to False.

printClassifiers: Print the individual classifiers in the output. Set to False.

seed: The random number seed to be used for randomizing the data which we used as 1.

Figure 3.6: Configuration of Random Forest

3.4.4 Naive Bayes

Naive Bayes is a classification algorithm based on Bayes’ Theorem. Bayes’ Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. The Naive Bayes classifier operates on a strong independence assumption: given the class, the probability of one attribute does not affect the probability of another. Naive Bayes deals with both continuous and discrete features and can also handle missing features. As shown in [3], given a set of features A = (A1, A2, A3, …, An) and a set of classes C = (C1, C2, …, Cn), we can apply the Naive Bayes model as follows. Specifically, the probability of the class C given the features {Ai} is computed as:

$$P(C \mid A_1, A_2, \dots, A_n) = \frac{\left[\prod_{i=1}^{n} P(A_i \mid C)\right] P(C)}{P(A_1, A_2, \dots, A_n)}$$

i.e.

$$\text{Posterior Probability} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$

To estimate P(Ai|Ck) for discrete attributes, we compute the likelihood of each attribute value when the class is Ck. For example, suppose that, out of 5 instances of the class Yes, the observed values of A1, A2 and A3 occur 2, 2 and 1 times respectively. The likelihood is then computed as (2/5) * (2/5) * (1/5). However, if a value of an attribute never occurs with a class Ck, the likelihood becomes zero; this can be avoided by adding one to the count of every possible value of the attribute. Specifically, we add 1 to the counts of A1, A2 and A3, so that they become 3, 3 and 2 out of 8 respectively. Similarly, we compute the likelihood of each attribute given the class No. Finally, after computing the posterior probability for each class, the class with the highest probability is taken as the most likely class.
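The add-one correction can be reproduced with a few lines of Python. The sketch treats A1, A2 and A3 as the three possible values of one attribute observed within the class Yes, which is our reading of the counts "3, 3 and 2 out of 8" above; the function name and the second set of counts are hypothetical.

```python
from fractions import Fraction


def smoothed_likelihoods(value_counts):
    """Add-one (Laplace) estimate of P(value | class) from the counts within one class."""
    total = sum(value_counts.values()) + len(value_counts)
    return {value: Fraction(count + 1, total) for value, count in value_counts.items()}


# Counts of the three values among the 5 instances of class Yes.
counts_given_yes = {"A1": 2, "A2": 2, "A3": 1}
likelihoods = smoothed_likelihoods(counts_given_yes)

# A value that never occurs with the class still receives a non-zero probability.
counts_with_zero = {"tcp": 3, "udp": 2, "icmp": 0}
likelihoods_with_zero = smoothed_likelihoods(counts_with_zero)
```

The first call gives 3/8, 3/8 and 2/8, matching the figures in the text, and in the second case the unseen value receives 1/8 instead of zero, so a single missing combination no longer forces the whole product to zero.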

To estimate the probability for data containing a continuous attribute, we first segment the data by class and then find the mean and variance of Ai for each class. Let $\mu_k$ be the mean and $\sigma_k^2$ the variance of the values of Ai associated with class Ck. For a sample value v, the probability density of v given the class Ck, P(A = v | Ck), is computed as:

$$P(A = v \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\; e^{-\frac{(v - \mu_k)^2}{2\sigma_k^2}}$$

The advantages of using Naive Bayes are:

Handles missing values.

Does not overfit data.

Disadvantages of using Naive Bayes are:

Cannot represent complex behaviour.

Need smaller dataset.
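For continuous attributes, the normal-distribution likelihood used by Naive Bayes can be sketched in Python as follows. The function names and the sample values are hypothetical, and the use of the sample variance (dividing by n - 1) is our assumption.

```python
import math


def mean_and_variance(values):
    """Mean and sample variance of the attribute values observed for one class."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, variance


def gaussian_likelihood(v, mean, variance):
    """Normal density of the value v for a class with the given mean and variance."""
    return math.exp(-((v - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


# Hypothetical values of one continuous attribute for the instances of one class.
values_for_class = [1.0, 2.0, 3.0, 4.0]
mu, var = mean_and_variance(values_for_class)
density = gaussian_likelihood(2.5, mu, var)
```

The density is largest at the class mean and falls off for values further away, so a new sample contributes a higher likelihood to the class whose typical values it resembles.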

In Weka, the default configuration was chosen for Naive Bayes, as shown in Figure 3.7. The parameters used are described below [35]:

batchSize: The preferred number of instances to process, i.e. 100.

Debug: If set to True, the classifier outputs additional information to the console, which we did not need.

displayModelInOldFormat: Whether to use the old format for model output. Set to False.

doNotCheckCapabilities: If set, classifier capabilities are not checked before the classifier is built, which can reduce runtime. We set it to False, so capabilities were checked.

numDecimalPlaces: The number of decimal places to be used for the output of numbers in the model. The default value used was 2.

useKernelEstimator: Whether to use a kernel estimator for numeric attributes rather than a normal distribution. Set to False, as this was not required.

useSupervisedDiscretization: Whether to use supervised discretization to convert numeric attributes to nominal ones. Set to False.