Gaussian Discriminant Analysis
Generative Learning Algorithm vs Discriminative Learning Algorithm
Algorithms that try to learn a mapping from the input space 𝔁 to the output labels 𝔂 (such as logistic regression, linear regression, etc.) are called Discriminative Learning Algorithms (DLA). In other words, a discriminative learning algorithm tries to learn 𝔂 given 𝔁; mathematically, it models p(y | x). On the other hand, algorithms that try to learn the input (𝔁) given the output (𝔂), i.e. p(x | y), are called Generative Learning Algorithms (GLA). Naive Bayes and Gaussian discriminant analysis are examples of GLAs. While a DLA tries to find a decision boundary based on the input data, a GLA tries to fit a Gaussian to each output class.
Multivariate Gaussian Distribution
The Gaussian Discriminant Analysis model assumes that p(x | y) is distributed according to a multivariate normal distribution, which is parameterized by a mean vector 𝜇 ∈ ℝⁿ and a covariance matrix Σ ∈ ℝⁿ ˣ ⁿ. Here, n is the number of input features. The density function for the multivariate Gaussian is:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
The mean vector and covariance matrix will determine the shape of the density function. The density function’s shape for 𝜇 = [0, 0] and Σ = I is shown below. You can play with the parameters in this notebook.
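As a quick sanity check, here is a minimal sketch that evaluates this density using scipy's multivariate_normal (the library choice is mine, not from the original post; any implementation of the formula above behaves the same way):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Standard bivariate normal: mu = [0, 0], sigma = identity
mu = np.zeros(2)
sigma = np.eye(2)
dist = multivariate_normal(mean=mu, cov=sigma)

print(dist.pdf([0.0, 0.0]))  # density peaks at the mean: 1 / (2*pi) ~ 0.159
print(dist.pdf([1.0, 1.0]))  # and falls off away from it: ~ 0.0585
```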
Gaussian Discriminant Analysis (GDA) model
GDA is a good fit when the problem is a classification problem and the input variables are continuous and approximately Gaussian-distributed. Now let's make a flower classifier using the iris dataset. We will apply the GDA model, which models p(x | y) using a multivariate normal distribution. The iris dataset has 3 labels/classes: Setosa, Versicolor, and Virginica. For mathematical modeling, we will denote Setosa as class 0, Versicolor as class 1, and Virginica as class 2. The iris dataset has 4 input features (n = 4).
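A minimal setup sketch (the split ratio and random seed here are my own illustrative choices, not values from the original notebook):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # X: (150, 4) feature matrix, y: labels 0/1/2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```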
During the training process, we first calculate the class probability for each class. The class probability indicates how often that individual class is present in the training set. For example, if we have 100 training examples and the class ‘Versicolor’ appears 13 times, then the class probability for class ‘Versicolor’ will be

$$\phi_1 = \frac{13}{100} = 0.13$$
Generally,

$$\phi_k = \frac{1}{m} \sum_{i=1}^{m} 1\{y^{(i)} = k\}$$

Here m is the number of training examples, k is a specific class, 𝜙_𝑘 represents the class probability of class k, and 1{·} is the indicator function, which is 1 when its argument is true and 0 otherwise.
Now we will fit a Gaussian to each of the classes. For example, if we fit a Gaussian to all the training data for class 0, we get its density function as:

$$p(x \mid y = 0) = \frac{1}{(2\pi)^{n/2} |\Sigma_0|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) \right)$$

Here 𝜇_0 is the mean and Σ_0 is the covariance matrix of all the training examples of class 0. More generally, for class k we get the density function:

$$p(x \mid y = k) = \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)$$
This density function is parameterized by the mean (𝜇) and covariance matrix (Σ), so we need to find the mean and covariance matrix for every class. We can find 𝜇_k and Σ_k for class k using the following equations:

$$\mu_k = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = k\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = k\}}$$

$$\Sigma_k = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = k\}\, (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^T}{\sum_{i=1}^{m} 1\{y^{(i)} = k\}}$$
Before going into the prediction let’s implement the theory that we have discussed so far.
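The original notebook's code isn't reproduced here, so the following is a minimal sketch of what such a fit function could look like, computing 𝜙_k, 𝜇_k, and Σ_k exactly as derived above (the function and variable names are my own):

```python
import numpy as np

def fit(X, y):
    """Estimate the GDA parameters phi_k, mu_k, and sigma_k for each class k."""
    classes = np.unique(y)
    m = X.shape[0]
    phi, mu, sigma = {}, {}, {}
    for k in classes:
        X_k = X[y == k]                       # all training examples of class k
        phi[k] = X_k.shape[0] / m             # class probability phi_k
        mu[k] = X_k.mean(axis=0)              # mean vector mu_k
        # maximum-likelihood covariance (bias=True divides by the class count)
        sigma[k] = np.cov(X_k, rowvar=False, bias=True)
    return phi, mu, sigma

phi, mu, sigma = fit(X_train, y_train)
```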
Making predictions
Now let's assume we have some data x_test that we want to classify. First, we compute how well x_test matches each of the classes, using the Gaussian distribution (𝜇_k, Σ_k) that we estimated for each class during training. This gives us the likelihood of x_test under a specific class. For example, the following equation gives us the likelihood of x_test under class 0:

$$p(x_{test} \mid y = 0) = \frac{1}{(2\pi)^{n/2} |\Sigma_0|^{1/2}} \exp\left( -\frac{1}{2} (x_{test} - \mu_0)^T \Sigma_0^{-1} (x_{test} - \mu_0) \right)$$
We will calculate p(y = k | x_test) for each class k, where k ∈ {0, 1, 2}. During training, we also calculated the class probability (𝜙_k) of each class. By Bayes' rule, we multiply the class probability by the Gaussian likelihood to get a score proportional to the posterior probability of each class:

$$p(y = k \mid x_{test}) \propto p(x_{test} \mid y = k)\, \phi_k$$
Whichever class has the highest posterior probability, we take as x_test's class; that is, we pick the k that maximizes p(y = k | x_test). These probabilities can be very small numbers, so for numerical stability it is better to maximize the log of this quantity:

$$\hat{y} = \arg\max_k \left[ \log p(x_{test} \mid y = k) + \log \phi_k \right]$$
Let's say, for the different classes, we got the following values:
Here, class Versicolor (1) got the highest value, so the prediction for x_test will be Versicolor. Now let's implement our prediction function.
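Again as a sketch (this reconstruction is mine, not the post's original code; scipy's logpdf gives the log of the Gaussian density directly):

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict(X, phi, mu, sigma):
    """For each row of X, pick the class k maximizing
    log p(x | y = k) + log phi_k."""
    classes = sorted(phi)
    # one column of log-posterior scores per class
    log_scores = np.column_stack([
        multivariate_normal.logpdf(X, mean=mu[k], cov=sigma[k]) + np.log(phi[k])
        for k in classes
    ])
    return np.array(classes)[np.argmax(log_scores, axis=1)]

y_pred = predict(X_test, phi, mu, sigma)
```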
Test Model
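To reproduce the comparison below, a sketch like the following would work. Since our model fits a separate covariance matrix per class, scikit-learn's QuadraticDiscriminantAnalysis is the natural counterpart; the choice of f1 averaging is my assumption, as the post doesn't state which one it used:

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import f1_score

print("f1 score of our model:", f1_score(y_test, y_pred, average="weighted"))

# scikit-learn counterpart: QDA also fits one Gaussian per class
skl_pred = QuadraticDiscriminantAnalysis().fit(X_train, y_train).predict(X_test)
print("f1 score of scikit-learn model is:", f1_score(y_test, skl_pred, average="weighted"))
```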
f1 score of our model: 0.9473684210526315
f1 score of scikit-learn model is: 0.9473684210526315
Our model gives the same f1 score as the scikit-learn model. That’s pretty good!!
Conclusion
One of the biggest advantages of the GDA model is that it doesn’t have any hyperparameters. This model works really well if the input data follow a Gaussian distribution. If all the classes share the same covariance matrix, the model is called Linear Discriminant Analysis (LDA), and if each class has its own covariance matrix, the model is called Quadratic Discriminant Analysis (QDA).
Both logistic regression and GDA are classification algorithms, and they share an interesting relationship. If we view the quantity p(y = 1 | x; 𝜙, 𝜇_0, 𝜇_1, Σ) as a function of x, we get the logistic/sigmoid function. So, when would we prefer one model over the other? GDA makes stronger modeling assumptions, so it is more data-efficient when those assumptions are at least approximately correct, while logistic regression is more robust when they are not.
To learn more about generative learning algorithms, you can go over these notes by Professor Andrew Ng.
Here is the notebook version of this blog post: https://github.com/gmortuza/machine-learning-scratch/blob/master/machine_learning/bayesian/gaussian_discriminative_analysis/Gaussian%20Discriminative%20analysis.ipynb
You can also play with this notebook in Google Colab.
Thanks for reading…