Logistic regression
Unveiling the Mechanics of Logistic Regression
Building upon the foundations of linear regression, we delve into the mechanics and mathematical underpinnings of logistic regression.
The underlying concept of logistic regression revolves around capturing the probabilistic relationship between input features and the binary output.
This relationship is elegantly established through the logistic or sigmoid function, which smoothly maps the input space to a range between 0 and 1. This function allows us to interpret the output as the probability of belonging to a particular class.
Imagine a scenario where we are presented with a set of input features, such as the age and income of individuals, and our goal is to predict whether they are likely to purchase a particular product. Logistic regression steps up to the challenge by estimating the parameters (weights and biases) that maximize the likelihood of the observed data given the model's predictions.
To get the optimal parameters, we shall employ the method of gradient descent once more, leveraging the gradients of the loss function with respect to the model's parameters in order to iteratively update and refine them.
Classification problems
In regression, the objective is to estimate a continuous numerical value or a set of values within a specific range. The focus lies on modeling the relationship between input features and the corresponding output, leveraging algorithms capable of approximating this mapping with suitable accuracy. On the other hand, in classification, the goal is to assign input instances to predefined classes or categories. This entails designing algorithms that can learn decision boundaries or decision rules to correctly classify new, unseen instances based on their feature representation.
Logistic regression
I will not go deeply into the mathematics of logistic regression, nor into its probabilistic interpretation, since that is not the main focus of this work.
Logistic regression is a popular algorithm for classification tasks. The goal is to model the probability of an instance belonging to a certain class, given its input features. The logistic regression model employs the sigmoid function, denoted as \(\sigma(z)\), to map the linear combination of the input features and corresponding weights, represented by \(z\), to a probability value between 0 and 1.
The logistic function is defined as:
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]
where \(e\) is the base of the natural logarithm. In logistic regression, the linear combination \(z\) is calculated as the dot product of the feature vector \(\mathbf{x}\) and weight vector \(\mathbf{w}\), along with an additional bias term \(b\):
\[ z = \mathbf{w}^T \mathbf{x} + b \]
The probability of the instance belonging to the positive class is then given by:
\[ P(y = 1 | \mathbf{x}) = \sigma(z) \]
where \(y\) denotes the binary class label (0 or 1). By optimizing the parameters \(\mathbf{w}\) and \(b\) using techniques like maximum likelihood estimation or gradient descent, logistic regression finds the optimal decision boundary that separates the two classes based on the given feature space.
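To make these formulas concrete, here is a minimal NumPy sketch of the model described above; the feature vector, weights, and bias are made-up illustrative values, not parameters estimated from any data.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """P(y = 1 | x): probability of the positive class."""
    z = np.dot(w, x) + b          # linear combination z = w^T x + b
    return sigmoid(z)

# Illustrative (made-up) feature vector, weights, and bias
x = np.array([0.5, -1.2])
w = np.array([1.5, 0.8])
b = -0.3
print(predict_proba(x, w, b))     # a probability between 0 and 1
```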
One-dimensional case
To establish a connection with our previous work on linear regression, let us delve into a simple example involving one-dimensional data. Here, we shall consider a scenario where we once again encounter two parameters, denoted as \(w\) and \(b\). This elementary setting makes it easy to visualize both the data and the model's behavior.
In this interactive visualization, we have two groups, represented by red and blue. By utilizing the provided sliders, users can manipulate the underlying distributions of these groups. It is important to note that in real-world scenarios, we don't have direct access to these distributions, and therefore, we rely on samples from these distributions. This functionality is facilitated through the corresponding button, allowing users to generate samples from the manipulated distributions.
The decision boundary is typically set at 0.5 because it represents an equal chance of belonging to either class.
When the predicted probability from the sigmoid function is above 0.5, it implies that the model believes the instance is more likely to belong to class 1. Therefore, any instance with a predicted probability above 0.5 is classified as class 1.
Conversely, when the predicted probability is below 0.5, it suggests that the model believes the instance is more likely to belong to class 0. Hence, any instance with a predicted probability below 0.5 is classified as class 0.
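In code, this decision rule is simply a threshold on the predicted probability (a small sketch reusing the predict_proba helper defined above; the function name is ours, not a standard API):

```python
def classify(x, w, b, threshold=0.5):
    """Class 1 if the predicted probability is at least the threshold, else class 0."""
    return int(predict_proba(x, w, b) >= threshold)
```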
In the next visualization, we can observe the sigmoid function alongside the decision boundary. Similar to what we did in the regression problem, let's start by interacting with the sliders and finding the optimal values that will result in a good discrimination between the groups.
Namely, the discrimination should ensure that all the red dots, and only the red dots, are classified as zero, specifically on the left side of the boundary (the purple line), while the blue dots should have predicted probabilities above 0.5, placing them on the right side of the purple line.
Now let's introduce the parameter space, just as we did for linear regression. This mirrors what we did in the previous article on regression, so refer back there for more details.
Try playing with the red point in the parameter space to find good parameters.
Training
The loss function for logistic regression with 1-dimensional data is defined as: \[ L(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\sigma(wx_i + b)) + (1-y_i) \log(1 - \sigma(wx_i + b)) \right] \] where \(N\) is the total number of data points, \(x_i\) is the input feature for the \(i\)-th data point, and \(y_i\) is the corresponding binary label (0 or 1).
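As a rough sketch, this loss can be computed directly from the formula above, reusing the sigmoid helper from earlier; the small eps term is only there to avoid taking the log of zero.

```python
def loss(w, b, x, y, eps=1e-12):
    """Binary cross-entropy loss for 1-D logistic regression.

    x and y are NumPy arrays of inputs and binary labels (0 or 1);
    eps guards against log(0).
    """
    p = sigmoid(w * x + b)   # predicted probabilities for every data point
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```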
Again, I won't go into the math behind this loss, since it is not relevant here; instead, I will provide links at the end for those who want to dig deeper.
During the training of logistic regression, the model iteratively adjusts its parameters to minimize the difference between predicted probabilities and the actual binary labels in the training data. This is achieved using gradient descent, as we saw in the last article: the gradients of the loss function with respect to the parameters are used to update the parameter values, gradually improving the model's ability to classify new instances. See the previous article for more information.
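For reference, a single such update for the 1-D model might look like the following sketch; the gradient expressions come from differentiating the loss above, and the learning rate lr is an arbitrary illustrative choice, not the one used in the visualization.

```python
def training_step(w, b, x, y, lr=0.1):
    """One gradient-descent update of w and b for the 1-D model."""
    p = sigmoid(w * x + b)           # current predicted probabilities
    grad_w = np.mean((p - y) * x)    # dL/dw
    grad_b = np.mean(p - y)          # dL/db
    return w - lr * grad_w, b - lr * grad_b
```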
Let's see how it works in the visualization.
In the following visualization, the loss function is plotted in the parameter space. Additionally, there is a button labeled "Take Training Step." By clicking on it, a small step will be taken in the opposite direction of the gradient, resulting in a decrease in the loss. To further train the model, click the button multiple times.
You can also experiment by adjusting the parameters of the distribution to make the groups less distinguishable. Observe how this affects the shape of the loss function and its minimum point.
2D case
Now let's move to the 2D case, where we have data points in two dimensions and we want to classify them.
It looks like this:
Now in 2d case, where the input features are represented as \(x_1\) and \(x_2\), and we have three parameters \(w_1\), \(w_2\), and \(b\), the logistic regression formula can be expressed as follows: \[P(y = 1 | \mathbf{x}) = \frac{1}{1 + e^{-(w_1x_1 + w_2x_2 + b)}}\] This expression represents the probability of the instance belonging to the positive class given the input features \(x_1\) and \(x_2\), with the corresponding weights \(w_1\), \(w_2\), and bias term \(b\).
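Written as code, the 2-D model is a direct extension of the 1-D sketch from earlier (again reusing the sigmoid helper; this is an illustrative snippet, not the code behind the visualization):

```python
def predict_proba_2d(x1, x2, w1, w2, b):
    """P(y = 1 | x1, x2) for the two-feature model."""
    return sigmoid(w1 * x1 + w2 * x2 + b)
```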
Let's see it visually. In the next visualization, the logistic regression is visualized above the input space.
Note that before, we used the 3D space as the parameter space; here, it is not the parameter space but the input space, with the logistic function plotted above it.
You can play with sliders that control the parameters for the logistic regression.
Training
Now, since we have three parameters (\(w_1\), \(w_2\), \(b\)), visualizing the loss function above the parameter space as we did before is not feasible.
However, this is not a problem because the mathematical calculations still work effectively. In the next visualization of training logistic regression, you will see a button labeled "Step" in the top left window. Pressing it will initiate one iteration of gradient descent, which will occur behind the scenes.
You can also press the calculation button and then slide the corresponding slider to see the calculations and the underlying mathematical operations in action.
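For readers who prefer code to buttons, here is a minimal vectorized training loop for the 2-D case; it is a sketch under the same assumptions as before (NumPy, arbitrary learning rate and step count), not the code driving the visualization.

```python
def train_2d(X, y, lr=0.1, steps=1000):
    """Fit 2-D logistic regression by gradient descent.

    X has shape (N, 2) and y has shape (N,) with entries 0 or 1.
    """
    w = np.zeros(2)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)            # predicted probabilities, shape (N,)
        grad_w = X.T @ (p - y) / len(y)   # dL/dw1, dL/dw2
        grad_b = np.mean(p - y)           # dL/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```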
The most important thing to observe is that logistic regression assumes a linear decision boundary. It assumes that there is a line that can separate the points, and its objective is to find that line. However, if the data is not linearly separable, the algorithm may fail to find an accurate decision boundary. Let's explore an example to illustrate this.
In the next visualization, we have an example of data that contains two groups which are not linearly separable. Try training the logistic regression model and observe that it does not make a good distinction between the two groups.
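As a rough sketch of what such data can look like, the classic XOR pattern below is one example; it reuses the train_2d helper from the previous sketch, and the random seed and sample size are arbitrary.

```python
# XOR-style data: class 1 when x1 and x2 have the same sign, class 0 otherwise.
# No single line w1*x1 + w2*x2 + b = 0 can separate the two groups.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

w, b = train_2d(X, y)
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")   # typically close to chance level
```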
Conclusion
Building upon the foundations of linear regression, we explored the mechanics and mathematical underpinnings of logistic regression. Logistic regression captures the probabilistic relationship between input features and a binary output using the sigmoid function. While logistic regression assumes a linear decision boundary, it may fail to accurately classify data that is not linearly separable. In the visualizations, we witnessed the training process and observed the limitations of logistic regression when applied to non-linearly separable data. In the next article, we will explore how neural networks provide a good solution to this problem, enabling us to make non-linear decisions by creating representations of the data.