Naive Bayes is one of many machine-learning algorithms. Despite its name, it is an efficient technique used for classification problems, and if used correctly, it can be very powerful. This article will first describe the concepts of machine learning, its working principles, taxonomy, and types of problems in order to then fully introduce the Naive Bayes method, examine its pros and cons and extensions, and make conclusions.
- Machine Learning
- Working Principles and Concepts
- Feature Space
- Hypothesis Function
- Classification Problems
- Naive Bayes
- Inductive Bias
- Mathematical Model and Training
- Preprocessing Data
- Missing Features
- Zero Probability Problem
- Underflow Issues
- Advantages and Disadvantages
- Handling Continuous Values
- Bayesian Networks
Machine learning is a fancy term used for the concept of software that learns automatically how to solve a problem or execute a task, and which becomes more and more accurate over time. Why use machine learning instead of explicit (classic) programming? Because some problems may be very difficult to implement manually, and because we need the system to be flexible enough to adapt itself to new cases without human intervention. Examples of problems that are commonly tackled by machine learning are: detection of fraudulent credit card operations, face recognition, voice recognition, self-driving cars, robotic planning, diagnosis of diseases, relevance of search engines results, etc.
Working Principles and Concepts
All machine-learning algorithms work on the same principle: they receive some input data, known as training data, and then they build a mathematical model from the data and use the mathematical model to execute their tasks. The kind of input data, stages of preprocessing needed, flexibility of the mathematical model, and type of output determine which algorithm better fits the problem at hand.
The idea of training data lies in the concept of feature space. It is an n-dimensional vector with all the properties that the input data can have, such as height, width, weight, color, etc. and the values that are allowed for these properties (e.g. discrete? continuous? enumerated?). For example height: 100 cm, width: 30 cm, weight: 40 kg, color: blue.
For a certain piece of data to be fed to a machine-learning algorithm, it must first fit in the feature space supported by the algorithm. Otherwise, the raw data cannot be used. Instead, there is a need to transform the data by a preprocessing step.
The hypothesis function is what the algorithm is trying to “learn.” In terms of feature space, let v be the feature vector, o the output and f an unknown relationship such that f(v) = o holds for all feature space. Then, a machine-learning algorithm will try to approximate the unknown f with a function h (hypothesis) such that h(v) = 0.
A hypothesis function can be simple, such as a linear function, or complex, such as a multimodal function. Some algorithms can only handle simple hypothesis functions while others can handle more complex ones. If a problem has a complex relationship with its features and it’s tackled with an algorithm that only supports a simple hypothesis, then the error margin will be large regardless of the training set.
The hypothesis set is the set of all possible hypotheses for a particular algorithm. The process of learning can be viewed as picking the best hypothesis from the hypothesis set that best fits the training data.
There is a wide range of different machine-learning algorithms, each one with its strengths and weaknesses. For this reason, they are usually classified in groups based on the type of task that they are trying to solve.
We can consider three big groups: supervised, unsupervised and reinforcement.
- Supervised algorithms are those that require that all the training data be labeled with the classes that need to be predicted (i.e. preclassified).
- Unsupervised algorithms are those that do not require that the training data be labeled, so the algorithm itself will classify the training set.
- Finally, the last group, reinforcement learning is used in environments where there is a reward for the actions performed by an agent and those rewards are not immediate, but delayed until the completion of the task.
Classification problems are a type of supervised learning. The task at hand is as follows: Given a feature vector, find the class to which it belongs (or at least, the most probable class). The number of classes determines whether it is either a binary classifier (only two classes) or a multilabel classifier (more than two classes).
The description of the task may sound simple, but there can be a lot of issues that need to be addressed before using one algorithm or another.
Below is a nonexhaustive list of the possible issues:
- Obtaining labeled data: Since it is necessary for all the input data to be labeled, this can be a problem in some cases.
- Missing data: In both the training stage and the classification stage there could be data missing.
- Noise: In both the training stage and the classification stage, there could be data that do not have the correct values.
- Mislabeled data: In the training set, a feature vector can be erroneously mislabeled.
- Dimensionality: Each feature vector can contain a lot of data —more than is necessary— blurring the frontiers between classes and making predictions harder.
- Training time: For some algorithms constructing the mathematical model can be time-consuming if their training set is too big.
- Over-fitting: A lot of training with the same dataset can be counterproductive for the algorithm to make generalizations in unseen cases.
- Accuracy: training with a poor or small dataset can lead to inaccuracies in the classification of unseen cases.
- Online vs. offline:Online is when training can be done incrementally. Offline means that the training can be done only when the entire dataset is available at once. Some datasets can be very large to fit in the memory, so in these cases algorithms with online training are preferred.
- Patterns changing over time: The pattern that is being learnt may vary over time. In this case, the accuracy will decrease and it is necessary to update the model with a fresh data set.
Most of these issues are critical to evaluate which algorithm is better for the task that is going to be resolved. Some algorithms may have issues with missing or noisy data and cannot be used, while others may not have these problems. So it’s important to understand the type of data and characteristics of the problem in order to then choose an appropriate algorithm.
Naive Bayes is a classification algorithm that uses Bayes’ theorem as its working principle. Bayes states that:
Where P(A|B) is called posterior, P(B|A) is called likelihood, P(A) is called prior and P(B) is called evidence.
When faced with the problem of classifying, we are interested in finding class Ci that maximizes:
P(Ci | feature1=value1, feature2=value2, feature3=value3…)
That is, finding the class Ci that is more probable among the others given the feature vector to classify. This is a decision rule called maximum a posteriori, so we need to calculate the posterior probability.
Since we don’t have the posterior probability, we need to use Bayes theorem to find it. We use the likelihood P(feature1=value1, feature2=value2, feature3=value3… | Ci), and the prior P(Ci). Note that we don’t need the evidence P(feature1=value1, feature2=value2, feature3=value3…) because it is a constant when we only change classes. We are only interested in the class that maximizes the probability, not the probability itself.
Putting it all together, we need to calculate:
Since machine-learning algorithms are required to generalize unseen cases, they all have some type of inductive bias associated with their learning. Naive Bayes assumes that all features are conditionally independent from each other (this is the reason why it’s called naive). It is important that this assumption holds true because if there are some correlations between any features, then when we are calculating the likelihood we are going to overweigh the correlated features and the results will be biased.
Using the assumption of conditionally independent features, we can calculate the joint probability required for the likelihood as follow:
P(feature1=value1, feature2=value2, feature3=value3, … | Ci) = P(feature1=value1| Ci).P(feature2=value2| Ci).P(feature3=value3| Ci) …
This will lead to the following formula:
Putting it together:
Mathematical Model and Training
The mathematical model used in Naive Bayes is based on data aggregation. In the training stage, the algorithm is fed with feature vectors and their corresponding classes. The value for each feature is aggregated along with its class, and the class value is aggregated too. Thus, we can compute the probability P(feature=value | Ci) as the frequency of the value appearing in the dataset relative to the other values for the same feature. Likewise, P(Ci) can be easily computed by aggregating all classes and taking the relative frequency.
This leads to a very compact model when data is kept only once and all the queries that are needed to be satisfied in the classification step can be done quickly. Moreover, training can be done online when each feature vector arrives, and instantly be ready for a new classification with the updated model. This is an advantage over other algorithms where the training stage must be done all at once when all the training data is available. In some cases, this can be a time-consuming operation.
In some cases, it may be necessary to preprocess data in order to improve the accuracy of the classification. Another reason to preprocess data is to enforce the algorithm’s precondition that states that all features must be independent with no correlations between them.
In its simplest form, Naive Bayes only works with discrete values (and this can be a disadvantage), so in order to work with continuous data, you must perform a discretization of the possible values. This discretization must be in accordance to the range of values allowed by the feature. It is not worth dividing with too fine granularity, nor doing the opposite, as this can lead to bad performance.
Another preprocessing that can be done is to project the feature space to a lower dimension. The features that are going to be projected must be chosen carefully to avoid deleting those that are relevant for the classification and leaving the ones that only add noise or make prediction harder.
When dealing with a feature vector, we can potentially have missing features both in the training set and later in the classification stage. Some machine-learning algorithms may have trouble dealing with such cases, as is the case of decision trees when the missing feature is a splitting node in the tree.
There are several strategies that can be followed in order to cope with this scenario. One strategy can be to ignore the whole feature vector completely when some of its features are missing. However, this can be done only in the training stage and not in the classification stage. Another possibility is to use a default value for each missing feature. Choosing a good default value is crucial to maximize accuracy. It is possible to use the training set to obtain the most common value for each feature.
A third strategy we can use, but only with Naive Bayes, is to ignore the missing features, but still use the remaining values for the training and classification. This is possible because of the independence assumption implicit in Naive Bayes when calculating the joint probability of the likelihood. Ignoring the missing feature is equivalent to making the same weighted contribution for all classes, so it is a good strategy to follow.
A Zero Probability Problem
A zero probability problem arises when we have a feature with a value that was never seen in the training set. This is a very special case we need to take care of because of the method we use to calculate the likelihood. Since we use the product of all features’ relative frequency, when we have a feature that we haven’t seen before, the associated frequency will be zero. Thus, the final product will be zero as well and the comparison will make no sense.
To cope with this problem it’s better to ignore the feature, but care must be taken to do this for all classes that are going to be tested and not only those for which the value is zero. Otherwise, the results will be biased.
Another option is to simply add one to the count of each value when calculating the probabilities. By doing so, we guarantee that all values will be always greater than zero.
When dealing with small probabilities and a lot of features that are multiplied together, we can easily get underflow issues as a result of limited floating point precision. This problem must be kept in mind because it can raise classification errors if it is not considered.
To overcome this, a common technique that is frequently used is to map the posterior probability to log space. Since the logarithm function is monotonically increasing, it keeps the order of two ordered values, and at the same time, it helps to reduce the range of values that are going to be compared. We can use the logarithm property ln(a*b) = ln(a) + ln(b) to make the computations to avoid underflow issues:
Since we do not need the posterior probability at all (only the class that maximizes it), we do not need to undo the mapping.
Advantages and Disadvantages
Naive Bayes has many advantages and disadvantages. Its advantages are its good resistance to missing and noisy data in both the training set and classification. It has a relatively easy implementation, and its speed is very fast. Training time has order O(N) with the dataset and the classification of one instance has order O(1) when the model has been constructed. As mentioned earlier, training can be done online and interleaved with classification. The mathematical model is relatively compact and only grows when the training set includes unseen values. As its hypothesis function is linear, it is nearly immune to overfitting because it cannot get stuck in local minimum. Convergence speed is another advantage since, in general, it took less training data to converge and start running than other methods.
On the other hand, there are some disadvantages that are worth mentioning. As mentioned before, its hypothesis function is linear, thus it’s very simplistic in most scenarios. This means that there are situations that even with a lot of training data, the classification will have a great margin of error due to the algorithm not being capable of representing a more complex hypothesis. The conditional independence assumption is a disadvantage only if it does not hold true because it can give biased results. This can be mitigated if we project the feature space to a lower dimension where the correlated features are removed, but this is not always an easy task because there can be hidden relationships between variables.
Another disadvantage is when we are trying to justify, or understand why a feature vector was classified with a certain class. In other algorithms, it can be clear why an instance was classified with a certain label (i.e. height > 180 and weight > 90), but with Naive Bayes we don’t get that information. Instead we only get the class with the higher posterior probability which says nothing in terms of its features. That is, we don’t get any information about the structure of the data being analyzed.
Regarding the feature’s domain, the biggest limitation is that the algorithm can only handle discrete and finite values in both the input values and the output class. However, it is possible to handle continuous values in the input using several strategies.
Handling Continuous Values
To overcome the limitation of discrete values, one possible strategy that we mentioned early was to divide the feature’s range into discrete values. We must be careful not to divide it too much because a lot of different values for the same feature will only confuse the algorithm. On the other hand, a few values will make the feature useless as it will not provide differences for each class.
There is a better strategy to handle continuous values, and it’s called Gaussian Naive Bayes. In Gaussian Naive Bayes, the feature’s values are assumed to follow the Gaussian distribution. The mathematical model is changed according, and instead of having the relative frequencies for the different values, we now only need the mean and the variance for each one. This way we can compute P(feature = x) using Gauss’s density function as follow:
Where σ is the standard deviation, and μ is the mean.
Having a method to compute the mean and standard deviation online is crucial if we want to maintain that desired property of the Naive Bayes algorithm. Instead of having at once all values to computing the mean, and later the variance (or standard deviation) we can proceed as follow:
Where N is the sample size and v is the new value to be added. And for the standard deviation:
This means that for each feature in each class, we need to store: the number of samples N, the current mean μ, and the variance σ². Then when a new value v arrives, update the mean and variance according to the formulas above.
It is worth mentioning that not all continuous data follow the normal distribution, it can follow for example the exponential distribution or any other distribution. In the special case that the data follows a uniform distribution, then it’s safe to remove that feature because it will not contribute to any class.
Thanks to the independence assumption, it’s possible to use together classical Naive Bayes with Gaussian Naive Bayes or other form of distribution for continuous data, each one targeting its corresponding features and use all the gathered probabilities for the final result. This is indeed the ideal case for achieving optimum accuracy, but it requires that all features be carefully inspected in order to determine their statistical properties.
Bayesian Networks are improvements over Naive Bayes because they can handle conditional dependence between features. They model dependencies using graphs, and thus they can use all the available information and they aren’t constrained to a conditional independent projection of the data, making them more accurate.
Bayesian Network example. http://www.intechopen.com/books/brain-computer-interface-systems-recent-progress-and-future-prospects/equivalent-current-dipole-source-localization-based-bcis-with-motor-imagery
Since all dependencies must be explicitly declared, and the corresponding probabilities for each case must be obtained, Bayesian Networks are harder to implement, and need more data to learn those dependencies and probabilities. If the hypothesis function is too complex, it will be necessary a lot of training data, otherwise it will be incomplete and the classification accuracy will be bad.
One of the greatest advantages over Naive Bayes is that with Bayesian Network it is possible to extract knowledge from the model. Unlike Naive Bayes, in Bayesian Networks, we can get the conditional dependence graph as part of the data mining process, and use it to understand the structure of the data itself. Moreover, this can be more important than the classification task in some problems.
Naive Bayes is a powerful method for supervised learning. Its simplicity and speed makes it a good candidate despite its conditional independence assumption and other limitations. Because it does not require too much data to start running, this makes a good starting point for a proof of concept, at least for a quick initial validation, then add posterior refinements if required.
“Naive Bayes Text Classification”. 2016.Nlp.Stanford.Edu. Accessed August 28 2016. http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html.
” The Inductive Biases of Various Machine Learning Algorithms – Laura Diane Hamilton “. 2014.Lauradhamilton.Com. Accessed August 28 2016. http://www.lauradhamilton.com/inductive-biases-various-machine-learning-algorithms.
“Notes on Maximum Likelihood, Maximum A Posteriori and Naive Bayes – Zigang Xiao”. 2014.Blog.Ivansiu.Com. Accessed August 28 2016. http://blog.ivansiu.com/blog/2014/09/24/notes-on-maximum-likelihood/.
“Yudkowsky – Bayes’ Theorem”. 2016.Yudkowsky.Net. Accessed August 28 2016. http://www.yudkowsky.net/rational/bayes.
Ben-Gal I., Bayesian Networks, in Ruggeri F., Faltin F. & Kenett R., Encyclopedia of Statistics in Quality & Reliability, Wiley & Sons (2007).