# Demystifying Bayes’ Decision Rule

Bayesian decision theory is a fundamental statistical approach to solving classification problems. It relies on prior probabilities and class-conditional densities to reach a decision. It is often stated that the Bayes classifier is *optimal* among all classifiers, i.e. that it achieves the least *error*. In simple terms, it aims to minimize the probability of error in classification problems.

One might wonder: if Bayes’ classifier is in fact the best classifier for any classification problem, why do we need other classification methods at all? And how do we define “best”? To answer these questions we need to get a little rigorous and see how Bayes’ decision rule is built up from first principles.

Since I don’t want generalizations to obscure the central points, I will consider the case of only two classes and one feature. Let Ω ⊆ ℜᵈ be our sample space, the set of all possible features in a d-dimensional Euclidean space; d is the number of features, and in our case *d = 1*. Let that feature be denoted by x ∈ ℜ. We are only concerned with two classes for now, so we partition our space into two sets Ω₁ and Ω₂. Since these two sets form a partition of the sample space, they have the following properties:

Ω₁ ∪ Ω₂ = Ω and Ω₁ ∩ Ω₂ = ∅

We write D(Ω₁, Ω₂) for the decision rule that assigns x to class 1 if x ∈ Ω₁ and to class 2 if x ∈ Ω₂. It is important to note that the decision *D* is not symmetric according to this definition: D(Ω₁, Ω₂) and D(Ω₂, Ω₁) are different rules. Before things get out of hand, let’s give some context to our abstractions.

Suppose a room contains 30% males and 70% females, and we have no other information. If we pick a random person from the room, there is a 70% chance that it is a female rather than a male. If we want to design a classifier based on this information alone, the most obvious choice is to always assign the class with the higher probability. So far so good, but this classifier gives only one output: it always classifies the person as female. It will therefore be wrong 30% of the time, since that is the probability of picking a male. To formalize the discussion, let’s call the two classes ω₁ and ω₂. We have information about *P*(ω₁) and *P*(ω₂); in our case *P*(ω₁) = 0.3 and *P*(ω₂) = 0.7. These probabilities of the individual classes are known as prior probabilities, and *P*(ω₁) + *P*(ω₂) = 1. A classifier based on this information alone always assigns a sample to class ω₂, so its probability of error equals *P*(ω₁): an error occurs exactly when the sample is in ω₁ but is classified as ω₂.
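The prior-only classifier above can be sketched in a few lines of Python; the simulation is purely illustrative:

```python
# Prior-only classifier for the room example: with no feature observed,
# always predict the class with the larger prior probability.
import random

random.seed(0)
priors = {"male": 0.3, "female": 0.7}

def classify_by_prior():
    # The decision ignores the individual sample entirely.
    return max(priors, key=priors.get)

# Simulate drawing people from the room and measure the error rate;
# it should come out near P(male) = 0.3.
n = 100_000
truth = ["male" if random.random() < priors["male"] else "female" for _ in range(n)]
error_rate = sum(classify_by_prior() != t for t in truth) / n
```

No matter how many samples we draw, this classifier cannot do better than the larger prior allows.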

If in addition to the class priors we also know the class-conditional distributions over some variable, we can *probably* improve our decision. In our case, suppose we also know the probability density function of height for males and for females (*p*(height|male) and *p*(height|female)). Height becomes our only feature *x*. Class-conditional densities are thus represented by *p*(x|ω₁) and *p*(x|ω₂), i.e., given a class, what is the probability density of observing the feature. (We generally use an upper case *P*(⋅) to denote a probability mass function and a lower case *p*(⋅) to denote a probability density function.)
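To make the class conditionals concrete, here is a minimal Python sketch. The Gaussian shapes and the parameters of the two height densities are my own illustrative assumptions, not values given in the text:

```python
# Illustrative class-conditional densities for the height example,
# assumed Gaussian: male ~ N(175, 7^2) cm, female ~ N(162, 6^2) cm.
import math

def gauss_pdf(x, mu, sigma):
    """Probability density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

p_height_given_male   = lambda x: gauss_pdf(x, 175.0, 7.0)   # p(x | w1)
p_height_given_female = lambda x: gauss_pdf(x, 162.0, 6.0)   # p(x | w2)
```

Each density integrates to 1 over its domain, as any probability density must.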

Let us return to our problem and randomly pick a person from the room, but this time we are also given the person’s height. How should we classify the person now? The question is how to use this additional information to design a *better* classifier; it all boils down to reducing the probability of error. Intuitively, what we need is P(ωᵢ|x) to make a decision, i.e., the probability that the person is male or female given a height. It turns out that this is in fact the best way to deal with the problem. Let’s see *how*!

We construct what I call an error table:

| | Classified as ω₁ | Classified as ω₂ |
|---|---|---|
| **Actually in ω₁** | no error | error |
| **Actually in ω₂** | error | no error |

If *x* is actually in class 1 and is also classified as class 1 by our decision rule, no error is incurred; similarly for class 2. But if *x* is in class 1 and classified as class 2, or vice versa, an error occurs. (How many error terms will there be if we have *n* classes?) We have already introduced the partition of our sample space and the decision D(Ω₁, Ω₂).

There is a more intuitive way to think about the probability of error. In the earlier case, when we had no information about the class conditionals, the probability of error was P(ω₁) (assuming P(ω₂) > P(ω₁)). Now that we have more information about our classes, we need to update the probability of error. In the image below, the two class-conditional distributions are shown. Say we have partitioned our domain into Ω₁ and Ω₂ for our decision, i.e. x ∈ Ω₁ ⇒ decide ω₁, and likewise for class 2. Integrating p(x|ω₁), weighted by the prior P(ω₁), over all x lying in the partition Ω₂ gives the probability of error when the sample is classified as class 2 but actually belongs to class 1:

P(x ∈ Ω₂, ω₁) = ∫_{Ω₂} p(x|ω₁) P(ω₁) dx

Fig: Class conditional densities for two classes

Similarly, the probability of error when the sample is classified as class 1 but actually belongs to class 2 is

P(x ∈ Ω₁, ω₂) = ∫_{Ω₁} p(x|ω₂) P(ω₂) dx

and the total probability of error is the sum of the two terms:

P(error) = ∫_{Ω₂} p(x|ω₁) P(ω₁) dx + ∫_{Ω₁} p(x|ω₂) P(ω₂) dx

If the definitions of the error terms are clear, the rest is just a straightforward calculation. Note that at this point we are concerned only with the probability of misclassification, not with the *cost* of misclassification.
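The two error integrals can be approximated numerically with a simple Riemann sum. The Gaussian height densities and the 30/70 priors below are illustrative assumptions, not data from the text:

```python
# Numerically evaluate the two error terms for assumed Gaussian
# class conditionals: male ~ N(175, 7^2), female ~ N(162, 6^2) (cm).
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

P1, P2 = 0.3, 0.7                           # priors: male, female
p1 = lambda x: gauss_pdf(x, 175.0, 7.0)     # p(x | w1)
p2 = lambda x: gauss_pdf(x, 162.0, 6.0)     # p(x | w2)

def in_omega1(x):
    # Put x in Omega_1 (decide male) where p1*P1 exceeds p2*P2.
    return p1(x) * P1 > p2(x) * P2

# Riemann sum over a wide grid: 100 cm .. 200 cm in 0.01 cm steps.
dx = 0.01
xs = [100 + i * dx for i in range(10000)]
err_male   = sum(p1(x) * P1 * dx for x in xs if not in_omega1(x))  # male called female
err_female = sum(p2(x) * P2 * dx for x in xs if in_omega1(x))      # female called male
total_error = err_male + err_female
```

Shifting the boundary between Ω₁ and Ω₂ trades one error term against the other; the next section shows which split minimizes their sum.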

Since each error integral can be rewritten using ∫_{Ω₁} p(x|ωᵢ) dx + ∫_{Ω₂} p(x|ωᵢ) dx = 1, the total error can be put in the symmetric form

P(error) = ½ + ½ ∫_{Ω₁} [p(x|ω₂)P(ω₂) − p(x|ω₁)P(ω₁)] dx + ½ ∫_{Ω₂} [p(x|ω₁)P(ω₁) − p(x|ω₂)P(ω₂)] dx

A factor of half has no effect on the minimization problem. To minimize this error we need to minimize both integrals on the right-hand side of the equation: if each integral runs only over regions where its integrand is negative, the error achieved is the minimum.

Partitioning our sample space again into the sets

c₁ = {x : p(x|ω₁)P(ω₁) > p(x|ω₂)P(ω₂)},
c₂ = {x : p(x|ω₁)P(ω₁) = p(x|ω₂)P(ω₂)},
c₃ = {x : p(x|ω₁)P(ω₁) < p(x|ω₂)P(ω₂)}

helps us find the optimum partitions Ω₁ᵒ and Ω₂ᵒ. To minimize the first integral, over the domain Ω₁, we should integrate where x ∈ c₁, and to minimize the integral over the domain Ω₂ we should integrate where x ∈ c₃. This gives us

Ω₁ᵒ = c₁ ∪ c₂ and

Ω₂ᵒ = c₃

c₂ can be taken in union with either set, since adding it does not affect the minimization problem. According to our decision D(Ω₁ᵒ, Ω₂ᵒ), x ∈ c₁ ∪ c₂ ⇒ x ∈ ω₁, i.e. we decide ω₁ whenever p(x|ω₁)P(ω₁) ≥ p(x|ω₂)P(ω₂). Using some basic rules of conditional probability,

P(ω₁|x) = p(x|ω₁)P(ω₁) / p(x) and P(ω₂|x) = p(x|ω₂)P(ω₂) / p(x)

where P(ω₁|x) and P(ω₂|x) are known as posterior probabilities and p(x) = p(x|ω₁)P(ω₁) + p(x|ω₂)P(ω₂) acts as the normalization factor in both fractions. This brings us to the usual textbook version of Bayes’ decision rule:

Decide ω₁ if P(ω₁|x) > P(ω₂|x); otherwise decide ω₂.
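The posterior computation can be sketched directly; again, the Gaussian class conditionals and their parameters are assumed purely for illustration:

```python
# Posterior probabilities via Bayes' rule for one observed height.
# Class conditionals are assumed Gaussian (illustrative parameters).
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posteriors(x, priors=(0.3, 0.7), params=((175.0, 7.0), (162.0, 6.0))):
    # Joint p(x, w_i) = p(x|w_i) P(w_i); p(x) is their sum (the normalizer).
    joints = [gauss_pdf(x, mu, s) * pr for pr, (mu, s) in zip(priors, params)]
    px = sum(joints)
    return [j / px for j in joints]

post = posteriors(170.0)                     # [P(male|170), P(female|170)]
decision = "male" if post[0] > post[1] else "female"
```

Because p(x) divides both terms, comparing posteriors is equivalent to comparing the joints p(x|ωᵢ)P(ωᵢ), which is often cheaper in practice.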

To reach this decision rule we did not impose any conditions on the sample space or make any assumptions. Thus, no other partitioning can yield a smaller probability of error.

If Bayes’ rule is as good as we claim, why don’t we solve every classification problem with it? The answer lies in the details. A classification problem usually comes with a training set and its associated labels: the pairs {Xᵢ, yᵢ} constitute the training set, where Xᵢ ∈ ℜᵈ and yᵢ ∈ ℜ. We can easily estimate the prior probabilities of each class from the labels, but the conditional densities p(X|ωᵢ) are usually not known. There is a huge literature on density estimation and on making assumptions about the class conditionals, but that is beyond our scope. It can also become very difficult to calculate the error as the number of features increases, since we then need to evaluate the error integrals over complicated d-dimensional domains to assess the classifier.
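The prior estimate, at least, is just a frequency count over the training labels; the label vector below is hypothetical:

```python
# Estimating class priors from training labels is straightforward;
# the hard part in practice is estimating the densities p(x|w_i).
from collections import Counter

y = ["male", "female", "female", "female", "male",
     "female", "female", "male", "female", "female"]   # hypothetical labels

counts = Counter(y)
estimated_priors = {c: counts[c] / len(y) for c in counts}
```

With these priors in hand, everything else in the Bayes classifier hinges on how well the class-conditional densities can be modeled or estimated.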

If you want to delve into more detail, similar arguments are presented in Fukunaga’s excellent book on pattern recognition. There is also a blog post by Brandon Rohrer that deals with more practical aspects of Bayesian inference.