Simple and Complete Tutorial on Logistic Regression

Here are the steps we are going to take to understand logistic regression

  1. Basic understanding of logistic regression
  2. Some concepts that you need to understand that will aid in understanding logistic regression
  3. Basic explanation of logistic regression
  4. Understanding the maximum likelihood
  5. Bias variance tradeoff in logistic regression
  6. Regularization in logistic regression
  7. How to evaluate a logistic regression function


First, we need to understand why we need logistic regression, or generalized linear models more broadly, instead of linear regression.

1) The output of a linear regression model can be any real number from negative to positive infinity, whereas a categorical variable can only take a limited number of discrete values.

2) The error terms are not normally distributed when the output variable is discrete (for example, 0 or 1), so the assumptions of linear regression are not met.

Generalized Linear Models

In linear regression, we use a linear combination of the independent variables to make a prediction.

In a generalized linear model, we relate the linear combination of the independent variables to the expected value of the predicted variable through a link function.

The dependent variable Y does not need to be normally distributed, but it typically follows a distribution from the exponential family (e.g. binomial, Poisson, multinomial, normal, ...).

For logistic regression we use the logit (the log of the odds) as the link function:

\ln\left(\frac{P(Y)}{1 - P(Y)}\right) = b_0 + b_1 x_1 + \dots + b_n x_n

Rearranging the equation, the probability of Y given x is given by the sigmoid function:

P(Y)= \frac{1}{1+e^{-(b_0 + b_1 x_1 + \dots + b_n x_n)}}

Here is the sigmoid curve with the threshold at 0.5
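To make the relationship between the logit and the sigmoid concrete, here is a minimal sketch in NumPy. The function names `sigmoid` and `logit` and the sample values of z are illustrative choices, not part of the original tutorial.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log odds of a probability p; the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

# Illustrative values of the linear combination b_0 + b_1 x_1 + ...
z = np.array([-5.0, 0.0, 5.0])
p = sigmoid(z)

print(p)         # probabilities below, at, and above the 0.5 threshold
print(logit(p))  # recovers z up to floating point error
```

A linear combination of 0 lands exactly on the 0.5 threshold, which is why the sign of the linear combination decides the predicted class at that threshold.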

When To Use Logistic Regression

If the number of features in your dataset is large relative to the number of training examples (for example, m > 10,000 features with n = 1,000 training examples), it is preferred to use logistic regression over other classification algorithms such as SVM.

On the other hand, if the number of training examples is very large in comparison to the number of features, you should do some feature engineering to create additional features before using logistic regression.

Step By Step Explanation of How Logistic Regression Works

  1. CORE PRINCIPLE - The core principle behind solving logistic regression is maximum likelihood estimation.

The parameters of the sigmoid function are chosen so that they maximize the likelihood that the model produced the observed data.

2) Sigmoid function gives p(y given x) - In logistic regression, the probability of the output variable given the input variables x1, x2, x3, ..., xn is given by the sigmoid function:

P(Y|X)= \frac{1}{1+e^{-(b_0 + b_1 x_1 + \dots + b_n x_n)}}

To simplify, let’s denote the sigmoid function by f(x).

Let’s take Y to be binary, so it can take a value of either 0 or 1. For a given x we then get:

P(Y=1|X)= f(x)

Because the probabilities of all possible outcomes at a given point must sum to 1, we can calculate the probability of Y being equal to 0:

P(Y=0|X)= 1-f(x)

The reason why we choose a sigmoid function for logistic regression is that it’s an S-shaped curve that can take any real-valued number and map it between 0 and 1.

3) Maximum Likelihood Estimation - The sigmoid function gives the probability of a single point given x, but to fit the model we need the probability of observing all of the data together.

Because the data examples are independent of each other, we obtain the joint probability by multiplying the probabilities of the individual data points.

This follows from the multiplication rule of probability for independent events.


At a given input point x_i the probability is given by

P(Y=y_i|X=x_i)= f(b_0 +b_1 x_i)^{y_i}(1-f(b_0 +b_1 x_i))^{1-y_i}

If y_i=0, the first factor on the right-hand side becomes equal to 1, and if y_i=1 the second factor becomes equal to 1.

The final likelihood function is the combined probability of the outputs at all input points:

\prod_{i=1}^{n}P(Y=y_i|X=x_i)= \prod_{i=1}^{n}f(b_0 +b_1 x_i)^{y_i}(1-f(b_0 +b_1 x_i))^{1-y_i}

If we take the log of the likelihood, the product becomes a sum of n log probabilities:

Log\ Likelihood= \sum_{i=1}^{n}y_i \ln(f(b_0 +b_1 x_i))+(1-y_i)\ln(1-f(b_0 +b_1 x_i))

The lower the sum of the log probabilities, the more likely it is that your hypothesis is wrong.
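As a rough numerical check of the step from the product to the sum, the sketch below computes both forms on a tiny data set. The data values and the parameters b0 and b1 are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: one input variable and a binary label.
x = np.array([0.5, 1.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1])
b0, b1 = -3.0, 1.5  # arbitrary parameter values chosen for illustration

p = sigmoid(b0 + b1 * x)                # f(b0 + b1 * x_i) for each point
per_point = p**y * (1 - p)**(1 - y)     # probability of each observed label
likelihood = np.prod(per_point)         # joint probability as a product
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(likelihood, np.exp(log_likelihood))  # the two values agree
```

Working with the sum of logs also avoids the numerical underflow that a product of many small probabilities would cause on larger data sets.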

4) Find the values of the parameters - Now that we have the log-likelihood, we can try to maximize it.

To make this mathematically simpler, we multiply it by -1 and minimize the result instead.

Loss= \sum_{i=1}^{n}-y_i \ln(f(b_0 +b_1 x_i))-(1-y_i)\ln(1-f(b_0 +b_1 x_i))

This gets us to the cross-entropy function, which is the cost function for logistic regression.

Linear regression uses mean squared error as its cost function. If this is used for logistic regression, it becomes a non-convex function of the parameters (theta). Gradient descent is only guaranteed to converge to the global minimum if the function is convex.

The cross-entropy loss, in contrast, is a convex function of the parameters.

5) Calculate the values of the parameters using gradient descent - There is no closed-form solution for minimizing the cross-entropy error in logistic regression, so we use an iterative method to find the values of the parameters.

In gradient descent, we repeatedly move the parameters a small step in the direction opposite to the gradient of the loss.

\begin{equation} \begin{gather} b_0=b_0 -\alpha \frac{dL}{db_0}\\ b_1=b_1 -\alpha \frac{dL}{db_1} \end{gather} \end{equation}

Here L is the loss function and alpha is the learning rate, which controls how fast we converge to the solution.

Now let’s try to calculate the gradient of the loss function with respect to the parameters of logistic regression.

  1. Using the Chain rule to solve the differential

To compute the gradient of the loss function, we have to use the chain rule, as the loss is a composition of the cross-entropy function, the sigmoid function and the linear combination of the input variables.

Starting from the log-likelihood:

\begin{equation} \begin{gather} \sum_{i=1}^{n}y_i \ln(f(b_0 +b_1 x_i))+(1-y_i)\ln(1-f(b_0 +b_1 x_i)) \end{gather} \end{equation}

we write the loss as:

\begin{equation} \begin{gather} L(a,y_i)=\sum_{i=1}^{n}-y_i \ln(a)-(1-y_i)\ln(1-a) \\ \text{where } a \text{ is the sigmoid function} \\ a= \frac{1}{{1+e^{-z}}} \\ z=b_0+ b_1x_i \end{gather} \end{equation}

Hence we can calculate the gradient as:

\frac{dL}{{db_0}} = \frac{dL}{{da}} . \frac{da}{{dz}} . \frac{dz}{{db_0}}

Here are the individual differentials:

 \frac{dL}{{da}}= -\frac{y}{{a}} + \frac{1-y}{{1-a}}

 \frac{da}{{dz}}= a(1-a)

 \frac{dz}{{db_0}}= 1, \quad \frac{dz}{{db_1}}= x_i



If you have any doubts on how I ended up with these differentials you can check out the differentials for the log function and the sigmoid function.

Multiplying the three together we get the gradient of the coefficients.

\frac{dL}{{db_0}} = (-\frac{y}{{a}}+ \frac{1-y}{{1-a}}). (a(1-a)) . 1 =(a-y)

\frac{dL}{{db_1}} = (-\frac{y}{{a}}+ \frac{1-y}{{1-a}}). (a(1-a)) . x_i =(a-y)x_i
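One way to gain confidence in these results is to compare them with a numerical derivative. The sketch below does this at a single point using central differences; the values of b0, b1, x and y are arbitrary, and the helper name `loss` is an illustrative choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(b0, b1, x, y):
    """Cross-entropy loss at a single point (x, y)."""
    a = sigmoid(b0 + b1 * x)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

# Arbitrary illustrative values.
b0, b1, x, y = 0.3, -0.7, 2.0, 1.0
a = sigmoid(b0 + b1 * x)

# Gradients from the chain rule.
analytic_b0 = a - y
analytic_b1 = (a - y) * x

# Central finite differences as an independent check.
eps = 1e-6
numeric_b0 = (loss(b0 + eps, b1, x, y) - loss(b0 - eps, b1, x, y)) / (2 * eps)
numeric_b1 = (loss(b0, b1 + eps, x, y) - loss(b0, b1 - eps, x, y)) / (2 * eps)

print(analytic_b0, numeric_b0)
print(analytic_b1, numeric_b1)
```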

This is the gradient at a single point; we can find the total gradient by summing over all the data examples.

Similarly, we can calculate the gradient for all coefficients of the linear component of logistic regression.
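Putting the update rule and the gradients together, a minimal batch gradient descent loop might look like the sketch below. It assumes a single input variable, uses the mean rather than the sum of the per-point gradients (which only rescales the learning rate), and the data, `alpha` and `n_iter` values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, alpha=0.1, n_iter=5000):
    """Batch gradient descent for a one-variable logistic regression."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        a = sigmoid(b0 + b1 * x)
        grad_b0 = np.mean(a - y)        # average of (a - y) over all points
        grad_b1 = np.mean((a - y) * x)  # average of (a - y) * x_i
        b0 -= alpha * grad_b0
        b1 -= alpha * grad_b1
    return b0, b1

# Hypothetical toy data whose classes overlap, so the optimum is finite.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

b0, b1 = fit_logistic(x, y)
print(b0, b1)  # b1 comes out positive: larger x makes y = 1 more likely
```

In practice a library solver would be used instead; the loop is only meant to show how the gradient (a - y) x_i drives the parameter updates.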

Assumptions in Logistic Regression

Logistic regression relaxes some of the assumptions of linear regression, but some assumptions still apply.

  1. Binary logistic regression requires the dependent variable to be binary, and ordinal logistic regression requires the dependent variable to be ordinal.
  2. Logistic regression requires observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.
  3. Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.
  4. Logistic regression assumes linearity between the independent variables and the log odds. Although this analysis does not require the dependent and independent variables to be related linearly, it requires that the independent variables are linearly related to the log odds.
  5. Logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10).
  6. You can get theoretical guarantees on model performance for logistic regression without any assumptions on the distribution of the variables. However, when logistic regression is trained using stochastic gradient descent (SGD), it does benefit from normalizing the variables to have the same standard deviation; otherwise, SGD can take much longer to train.
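The sample size guideline in the list above is simple arithmetic, and can be sketched as a small helper. The function name `minimum_sample_size` and its parameter names are hypothetical, chosen only for this illustration.

```python
def minimum_sample_size(n_predictors, p_least_frequent, cases_per_predictor=10):
    """Rule-of-thumb sample size: cases per predictor * predictors / rarest outcome rate."""
    return cases_per_predictor * n_predictors / p_least_frequent

# The example from the text: 5 independent variables, least frequent outcome at 0.10.
print(minimum_sample_size(5, 0.10))  # approximately 500
```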

There should be no outliers in the data, which can be assessed by converting the continuous predictors to standardized scores.


Just as linear regression has R2, logistic regression has a pseudo R2 measure.

Pseudo R2

McFadden’s pseudo-R squared

Logistic regression models are fitted using the method of maximum likelihood – i.e. the parameter estimates are those values which maximize the likelihood of the data which have been observed. McFadden’s R squared measure is defined as

R^2_{McFadden} = 1 - \frac{\ln(L_c)}{\ln(L_{null})}

where Lc denotes the (maximized) likelihood value from the current fitted model, and Lnull denotes the corresponding value but for the null model – the model with only an intercept and no covariates.

Pseudo rho-squared can be interpreted like R2, but don’t expect it to be as big. Values from 0.2 to 0.4 indicate (in McFadden’s words) excellent model fit.

Pseudo R squared tries to show how much of the variation in the data is explained by the model. If the model predicts all the data points correctly, the likelihood will be equal to 1, hence pseudo R squared will be 1, as log(1) is equal to 0. Similarly, if the model fits no better than the null model, the value will be equal to 0.

scikit-learn does not report pseudo R squared directly, but it can be computed from the log-likelihoods of the fitted and null models, and statsmodels reports it for Logit models as prsquared. You can also calculate it in R.
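As an illustration, McFadden’s measure, 1 - ln(Lc)/ln(Lnull), can be computed by hand from predicted probabilities. The sketch below uses made-up labels and fitted probabilities, and the helper names are hypothetical; the null model is taken to predict the base rate of the positive class for every point, which is the maximum likelihood fit of an intercept-only model.

```python
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log-likelihood of labels y under predicted probabilities p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def mcfadden_r2(y, p_model):
    """McFadden's pseudo R squared: 1 - ln(Lc) / ln(Lnull)."""
    p_null = np.full(len(y), y.mean())  # intercept-only model predicts the base rate
    return 1.0 - log_likelihood(y, p_model) / log_likelihood(y, p_null)

# Hypothetical labels and fitted probabilities.
y = np.array([0, 0, 1, 0, 1, 1])
p_model = np.array([0.1, 0.3, 0.7, 0.4, 0.8, 0.9])

r2 = mcfadden_r2(y, p_model)
print(r2)                                # between 0 and 1 for this model
print(mcfadden_r2(y, np.full(6, 0.5)))   # 0: no better than the null model
```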


Likelihood Ratio Test

A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is performed using the likelihood ratio test, which compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors.

The formula for the lr test statistic is:

lr = -2 ln(L(m1)/L(m2)) = 2(ll(m2)-ll(m1))

Here L(m*) denotes the likelihood of the respective model (either model 1 or model 2), and ll(m*) the natural log of the model’s final likelihood (i.e., the log likelihood). m1 is the more restrictive model (with fewer variables), and m2 is the less restrictive model (with more variables).

The resulting test statistic follows a chi-squared distribution, with degrees of freedom equal to the number of parameters that are constrained (for example, 2 if two variables are removed from the model).

If the test statistic is significant, we conclude that the less restrictive model fits the data better than the more restrictive model.

The likelihood ratio test can be performed in R using the lrtest() function from the lmtest package or using the anova() function in base R.

anova(mod_fit_one, mod_fit_two, test = "Chisq")

lrtest(mod_fit_one, mod_fit_two)
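For readers working in Python, the same test can be sketched with SciPy once the two log-likelihoods are known. The function name `likelihood_ratio_test` and the log-likelihood values below are hypothetical; only `scipy.stats.chi2.sf` is a real library call.

```python
from scipy.stats import chi2

def likelihood_ratio_test(ll_restricted, ll_full, df):
    """Return the LR statistic 2 * (ll(m2) - ll(m1)) and its chi-squared p-value."""
    lr = 2.0 * (ll_full - ll_restricted)
    p_value = chi2.sf(lr, df)  # upper tail probability of the chi-squared distribution
    return lr, p_value

# Hypothetical log-likelihoods: m1 is restricted, m2 adds two variables.
lr, p_value = likelihood_ratio_test(ll_restricted=-120.5, ll_full=-114.2, df=2)
print(lr, p_value)  # a small p-value favours the less restrictive model
```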

The test checks for an improvement in model fit when variables that are currently omitted are added to the model, which is a useful way to develop an intuition about how this test works.

In the first case, where there is a significant difference between theta and theta null, we can reject the null hypothesis that the constrained parameter can be removed from the maximum likelihood estimation.

The value of the Wald statistic is very high in this case and not close to 0. For a chi-squared test with 1 degree of freedom to be significant, the statistic needs to be far from 0, so that the p-value is close to 0.

In the second case, the variance of the estimate is very high. So the value of the Wald statistic is close to 0, and we may not be able to reject the null hypothesis.

ROC Curve

The receiver operating characteristic (ROC) curve is a measure of classifier performance.

Using the proportion of positive data points that are correctly considered as positive (the true positive rate) and the proportion of negative data points that are mistakenly considered as positive (the false positive rate), we generate a graphic that shows the trade-off between the rate of correct positive predictions and the rate of incorrect positive predictions.

Ultimately, we’re concerned about the area under the ROC curve, or AUROC. That metric is 0.50 for a model that does no better than random guessing and 1.00 for a perfect model, and values above 0.80 indicate that the model does a good job in discriminating between the two categories which comprise our target variable.
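A minimal sketch of computing the ROC curve and AUROC with scikit-learn is shown below. It uses a synthetic data set from `make_classification` as a stand-in, so the resulting numbers say nothing about any real problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# One (false positive rate, true positive rate) pair per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(auc)
```

Note that the curve is built from the predicted probabilities, not from the hard class labels, because each threshold on the probability gives one point on the curve.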


K-Fold Cross Validation

When evaluating a model, we often want to assess how well it predicts the target variable on different subsets of the data. One technique for doing this is k-fold cross-validation, which partitions the data into k equally sized segments (called ‘folds’).

One fold is held out for validation while the other k-1 folds are used to train the model, which is then used to predict the target variable in the held-out fold. This process is repeated k times, with the performance of each model in predicting the hold-out set being tracked using a performance metric such as accuracy. The most common variation of cross validation is 10-fold cross-validation.
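With scikit-learn, 10-fold cross-validation can be sketched in a few lines using `cross_val_score`. The data here is synthetic and only serves to make the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Train on 9 folds and score accuracy on the held-out fold, 10 times over.
scores = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a sense of how stable the model’s performance is.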

Here are some important traits of Logistic Regression

  1. The output of the logistic regression is a probability.
  2. Logistic regression has a soft threshold. The sigmoid function gives an output between 0 and 1, so rather than using 0.5 as the threshold that divides the classes, we can also choose a different threshold.
  3. In some cases you just need a probability as the end result, for example, when predicting heart attacks.
  4. The negative log of the probability that the model assigns to the observed outputs is the cross-entropy loss.

Logistic regression is a probabilistic model: once trained, you can interpret its predictions as conditional probabilities.
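The sketch below shows how the predicted probabilities can be read out with `predict_proba` and turned into classes with a threshold other than 0.5. The data is synthetic and the 0.2 threshold is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)[:, 1]        # P(y = 1 | x) for each row
default_pred = (proba >= 0.5).astype(int)   # the usual 0.5 threshold
cautious_pred = (proba >= 0.2).astype(int)  # lower threshold flags more positives

print(default_pred.sum(), cautious_pred.sum())
```

A lower threshold trades more false positives for fewer missed positives, which can be the right choice when missing a positive case is costly.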


Wald test

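The Wald test is similar to the likelihood ratio test, but it does not require you to calculate the likelihood of a second model.

The following equation gives the value of the Wald statistic.

W = (\hat{\theta} - \theta_0)^T [Var(\hat{\theta})]^{-1} (\hat{\theta} - \theta_0)

Here \hat{\theta} is the maximum likelihood estimator and \theta_0 is the value of the parameters under the null hypothesis.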

The resulting test statistic follows a chi-squared distribution, with degrees of freedom equal to the number of parameters that are constrained (for example, 2 if two variables are removed from the model).

If the p-value is less than 0.05, we reject the null hypothesis.

For the simple case where just one parameter is constrained, the statistic reduces to the following equation.

W = \frac{(\hat{\theta} - \theta_0)^2}{Var(\hat{\theta})}
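For a single parameter, the Wald statistic is the squared distance between the estimate and its null value, measured in standard errors. The sketch below illustrates the two situations described earlier: the same estimate with a small and with a large standard error. The function name `wald_test` and all numbers are hypothetical.

```python
from scipy.stats import chi2

def wald_test(theta_hat, std_error, theta_null=0.0):
    """One-parameter Wald statistic and its chi-squared (1 degree of freedom) p-value."""
    w = ((theta_hat - theta_null) / std_error) ** 2
    return w, chi2.sf(w, df=1)

# Same coefficient estimate, different uncertainty.
w_sharp, p_sharp = wald_test(theta_hat=1.2, std_error=0.3)  # low variance
w_flat, p_flat = wald_test(theta_hat=1.2, std_error=1.5)    # high variance

print(w_sharp, p_sharp)  # large statistic, small p-value: reject the null
print(w_flat, p_flat)    # statistic near 0, large p-value: cannot reject
```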

Regularization in Logistic Regression

As with linear regression, we can apply regularization to logistic regression.

A logistic model with many coefficients and higher-order variables can easily overfit the data and end up with a very complicated solution.

Therefore, in the same way as linear regression, we can have L2 regularization for logistic regression.

We add the sum of the squared weights of the model, scaled by a regularization parameter lambda, to the loss function.

This discourages large weights and helps stop the model from overfitting.

Here is the equation of the cost function with L2 regularization.

Loss=\frac{1}{{n}} \sum_{i=1}^{n}\left[-y_i \ln(f(b_0 +b_1 x_i))-(1-y_i)\ln(1-f(b_0 +b_1 x_i))\right] +\frac{\lambda}{{2n}} \sum_{j=1}^{p}b_j^2

Here p is the number of coefficients being penalized. This is the loss function we will use to calculate the gradients for the regularized model.
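A minimal NumPy sketch of an L2-regularized loss and its gradient is given below. It assumes the common convention of leaving the intercept b0 unpenalized, and the data, parameter values and the name `l2_loss_and_grad` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_loss_and_grad(b0, b, X, y, lam):
    """Mean cross-entropy plus (lam / 2n) * sum(b_j^2), and its gradient."""
    n = len(y)
    a = sigmoid(b0 + X @ b)
    cross_entropy = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
    penalty = lam / (2 * n) * np.sum(b ** 2)
    grad_b0 = np.mean(a - y)                  # assumption: intercept is not penalized
    grad_b = X.T @ (a - y) / n + lam / n * b  # the penalty adds (lam / n) * b_j
    return cross_entropy + penalty, grad_b0, grad_b

# Hypothetical toy data with two features.
X = np.array([[0.5, 1.0], [1.5, -0.5], [2.0, 0.3], [3.0, 1.2]])
y = np.array([0, 0, 1, 1])
b0, b = 0.1, np.array([0.8, -0.4])

loss_plain, _, _ = l2_loss_and_grad(b0, b, X, y, lam=0.0)
loss_reg, grad_b0, grad_b = l2_loss_and_grad(b0, b, X, y, lam=1.0)
print(loss_plain, loss_reg)  # the penalty makes the regularized loss larger
```

The extra gradient term (lam / n) * b_j pulls each weight towards zero on every update, which is how the penalty limits the size of the coefficients.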


Feature Selection In Logistic Regression


Recursive Feature Elimination

Recursive Feature Elimination (RFE) is based on the idea of repeatedly constructing a model, choosing either the best or worst performing feature, setting that feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted. The goal of RFE is to select features by recursively considering smaller and smaller sets of features.
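As a rough illustration, scikit-learn’s `RFE` class can wrap a logistic regression and keep a chosen number of features. The data below is synthetic, and keeping 3 features is an arbitrary choice for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 8 features, 3 of them informative (an assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Repeatedly fit the model and drop the weakest feature until 3 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

print(selector.support_)  # boolean mask of the selected features
print(selector.ranking_)  # 1 for selected features, higher for earlier eliminations
```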


Practical Example

Let’s first start with visualizing the dataset

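import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler

%matplotlib inline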



The data frame has 3 main features:

  • band_1: flattened 75*75 horizontal radar frequency information.
  • band_2: flattened 75*75 vertical radar frequency information.
  • inc_angle: incidence angle.

There is also a binary response vector called ‘is_iceberg’.

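First, we need to drop the data rows where the value of inc_angle is na.

data['inc_angle'] = data['inc_angle'].replace('na', np.nan)  # mark missing angles as NaN
data = data.dropna(subset=['inc_angle'])  # drop the rows with a missing angle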

scikit-learn’s logistic regression expects numerical input as a NumPy array, so we create a NumPy array that includes all features. As shown in the code, we first create a separate array for each feature, which lets us build different combinations of input variables for the model.

# Horizontal (band_1) and vertical (band_2) radar bands, one flattened image per row
X_HH_train = np.array([np.array(band).astype(np.float32) for band in data.band_1])
X_HH_train1 = np.array([np.array(band).astype(np.float32) for band in data.band_2])
# Incidence angle as a single column
X_HH_train2 = np.array([[np.array(angle).astype(np.float32) for angle in data.inc_angle]]).T
# Combine all features column-wise
train_1 = np.concatenate((X_HH_train,X_HH_train1,X_HH_train2),axis=1)


For this problem, standardizing each feature by its variance is difficult, as the variances of the features are very small and close to zero. So we use MaxAbsScaler, which scales each variable to the range [-1, +1] by dividing by its maximum absolute value, without shifting or centering the data. The fitted scaler also keeps the scaling learned on the train set, which can later be applied to the test set.

scaler = MaxAbsScaler()
X_train_maxabs = scaler.fit_transform(train_1)


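After this step, we can fit the model on the data. First, we divide the dataset into a training and a test data set.

# The split settings are an assumption; the original split code is not shown
X_train, X_test, y_train, y_test = train_test_split(X_train_maxabs, data.is_iceberg, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
predicted_classes = model.predict(X_test)
accuracy = accuracy_score(y_test, predicted_classes)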

We get an accuracy of around 73% with this model. Now, we can see if tuning the regularization improves this performance. By default, regularization is applied in logistic regression with C=1, where C is the inverse of the regularization strength. With cross-validation we can search for a better value of C.

clf = LogisticRegressionCV(cv=5, random_state=0).fit(X_train, y_train)
predicted_classes = clf.predict(X_test)
accuracy = accuracy_score(y_test,predicted_classes)

Using cross-validation, we find the optimum value of C to be 0.04641589. With this, we get a model with a higher accuracy of 81.5%.
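The reported value of C is not arbitrary. By default, LogisticRegressionCV searches a grid of 10 values spaced logarithmically between 1e-4 and 1e4, and the small sketch below reproduces that grid with NumPy to show that 0.04641589 is one of its points. The variable name `candidate_Cs` is an illustrative choice.

```python
import numpy as np

# The default grid searched by LogisticRegressionCV (Cs=10).
candidate_Cs = np.logspace(-4, 4, 10)
print(candidate_Cs)

# The fourth grid point matches the value selected by cross-validation.
print(candidate_Cs[3])
```

If the best value sits at the edge of the grid, or a finer search is wanted, a custom list of values can be passed through the `Cs` parameter.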

We can check the performance of this Logistic regression using the ROC curve.


from sklearn import metrics


We get an area under the ROC curve of 0.81, which means that our model is doing a good job.


About the author


Mastering Data Engineering/ Data science one project at a time. I have worked and developed multiple startups before, and this blog is for my journey as a startup in itself where I iterate and learn.

