A logistic regression is a model that is appropriate when the dependent variable is binary, e.g. 0/1, True/False, or Yes/No. Logistic regression belongs to the family of regression analysis methods and can therefore be used as a predictive analytics model.

In this walkthrough, we will cover the following:

Types of Questions

The questions you can answer with a logistic regression are very different from those of other regression models, because the dependent variable is binary. As such, most of the questions are Yes/No or probability-based questions.

Here are a few examples:

• Do body weight, calorie intake, and age have an influence on heart attacks? Yes/No
• How does the probability of an individual getting lung cancer change for every additional pound of weight and every additional cigarette they smoke? Probability
• Do customer satisfaction, brand perception and price perception influence purchase decision? Yes/No
• What’s the probability of passing an exam given how much a student studied and how many hours of sleep they got? Probability
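The probability-style questions above work because logistic regression passes a weighted sum of the inputs through the logistic (sigmoid) function, which maps any real number to a value between 0 and 1. The following is a minimal sketch of that idea; the intercept and coefficients are made-up illustrative values, not estimates from any dataset, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and coefficients (illustrative only)
intercept = -4.0
coef_hours_studied = 0.8
coef_hours_slept = 0.3

def pass_probability(hours_studied, hours_slept):
    # Linear predictor (the log-odds), then squash it into a probability
    z = (intercept
         + coef_hours_studied * hours_studied
         + coef_hours_slept * hours_slept)
    return sigmoid(z)

print(pass_probability(2, 6))  # less study time and sleep
print(pass_probability(6, 8))  # more study time and sleep
```

With positive coefficients, more hours of study or sleep can only raise the predicted probability, but never above 1.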

Model Building in Scikit-Learn

In this example, we will be building a prediction model for diabetes using the Pima Indian Diabetes dataset (the dataset can be downloaded here).

When following along, you should be using a notebook like Jupyter.

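import pandas as pd

column_names = ['pregnant', 'glucose', 'bp',
                'skin', 'insulin', 'bmi', 'pedigree',
                'age', 'label']

# Load the dataset. 'diabetes.csv' is a placeholder path (the original
# file location is not given here); point it at your downloaded copy.
df = pd.read_csv('diabetes.csv',
                 names=column_names)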

# Print out first 5 rows:
df.head()

Feature Engineering and Selection

This dataset is already pretty clean, and features have already been engineered. All that is left for us to do is split the dataset into two variables: one for the independent variables (the features) and another for the dependent variable.

feature_cols = ['pregnant', 'insulin', 'bmi',
                'age', 'glucose', 'bp', 'pedigree']

X = df[feature_cols] # Independent Variables / Features
y = df.label  # Dependent Variable

Split Dataset into Training and Testing Sets

In order for us to properly test the performance of the model later, we need to remove a subset of our data. This subset will later be used to test the model.

We are going to split the dataset using train_test_split() from the sklearn.model_selection module.

# Import Needed Module
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=0)

We have now broken our dataset into two sets: one containing 75% of the data, and another containing the remaining 25%.
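As a quick check of what the split produces, the sketch below runs train_test_split() on placeholder arrays rather than the real dataset. The 768-row size is an assumption chosen to match the commonly distributed version of this dataset, and the stratify argument is an optional extra (not something this walkthrough uses) that keeps the class proportions similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real features and labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 7))
y_demo = rng.integers(0, 2, size=768)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,
    random_state=0,
    stratify=y_demo)  # optional: preserve the 0/1 ratio in both subsets

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of class 1 in each subset
```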

Building the Model

Before we get started, we need to import the LogisticRegression class and create an instance of it.

from sklearn.linear_model import LogisticRegression

# create and instantiate the object
lr = LogisticRegression(solver='lbfgs')  # 'lbfgs' is the default solver

Now, we are going to fit our model on the training set. Once the model is fitted, we are going to apply it to our test dataset so we can see how it performs on data it has not seen during training.

# Fit the Model with Data
lr.fit(X_train, y_train)

# Predict values using model on test data
y_pred = lr.predict(X_test)
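Note that predict() returns hard class labels. For a binary problem, these correspond to thresholding the predicted probability of the positive class at 0.5, which the sketch below illustrates on synthetic data from make_classification() (a stand-in, not the diabetes dataset). The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=0)

clf = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

labels = clf.predict(X_demo)             # hard 0/1 predictions
probs = clf.predict_proba(X_demo)[:, 1]  # P(class == 1) for each row

# Thresholding the probabilities at 0.5 reproduces predict()
manual_labels = (probs > 0.5).astype(int)
print(np.array_equal(labels, manual_labels))
```

Keeping the probabilities around is useful later, because metrics such as the ROC curve need scores rather than hard labels.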

Interpreting Coefficients

Logistic regression is a generalized linear model (GLM) and is one of the simpler classifiers out there. Because the model is linear in the log-odds, we can summarize each coefficient with statements such as “the odds of a man having a heart attack increase by X% for every additional 100 calories he consumes”.

Even so, the raw coefficients of a logistic regression are surprisingly difficult to interpret, because they are expressed on the log-odds scale rather than as probabilities.

The most intuitive way to interpret the model's coefficients is to exponentiate them and read them as odds ratios, that is, as multiplicative changes in the odds of the outcome.

Interpreting Coefficients From Our Model

To get the coefficients from our model, we need to get the output from the following:

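import math

# Pair each feature with its (coefficient, odds ratio), where odds ratio = e^coefficient
odds = [math.exp(x) for x in lr.coef_[0]]
coef = dict(zip(feature_cols, zip(lr.coef_[0], odds)))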

print(coef)

Here we are building a dictionary so it is easier to see which coefficient belongs to which feature. Additionally, we are computing the odds ratio for each coefficient. The odds ratio is easier to understand in most circumstances and is obtained through the following function:

$$odds = e^{coefficient}$$
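
To see where this comes from, recall that the model is linear in the log-odds. A short derivation follows, writing $p$ for the predicted probability, $\beta_0$ for the intercept, and $\beta_j$ for the coefficient of feature $x_j$ (notation introduced here for illustration):

$$\log\frac{p}{1-p} = \beta_0 + \sum_j \beta_j x_j \quad\Longrightarrow\quad \frac{p}{1-p} = e^{\beta_0} \prod_j e^{\beta_j x_j}$$

Increasing $x_j$ by one unit while holding the other features fixed therefore multiplies the odds $p/(1-p)$ by $e^{\beta_j}$.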

In the printed result, the first numerical value for each feature is the coefficient and the second is its odds ratio. Keep in mind that our dependent variable, what we are trying to classify, is whether or not an individual has diabetes. Here are a couple of examples of how you can interpret the odds ratios:

• Age: For each additional year of age, the odds of having diabetes are multiplied by 1.007, holding the other features constant.
• BMI: For every one-unit increase in BMI, the odds of having diabetes are multiplied by 1.11, holding the other features constant.
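
An odds ratio is not a change in probability; how much the probability moves depends on where it starts. The sketch below applies the BMI odds ratio of 1.11 quoted above to a few baseline probabilities, which are arbitrary values chosen for illustration. The helper name apply_odds_ratio is hypothetical.

```python
def apply_odds_ratio(p, odds_ratio):
    """Return the new probability after multiplying the odds of p by odds_ratio."""
    odds = p / (1.0 - p)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

bmi_odds_ratio = 1.11  # value quoted in the text above

for baseline in (0.10, 0.50, 0.90):  # hypothetical starting probabilities
    updated = apply_odds_ratio(baseline, bmi_odds_ratio)
    print(f"{baseline:.2f} -> {updated:.3f}")
```

The same odds ratio shifts a mid-range probability more than one that is already close to 0 or 1.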

Model Performance

The performance of the model can be checked using a confusion matrix and a ROC curve with its AUC (area under the curve). Let’s start with the confusion matrix.

Confusion Matrix and Classification Report

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

This yields the confusion matrix. The result tells us that 118 + 35 = 153 of our predictions are correct, while 13 + 26 = 39 are incorrect. In other words, 153 of 192 predictions (79.7%, or roughly 80%) were correct.
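
The accuracy figure can be reproduced directly from the four counts. In the sketch below, the counts come from the text, but their placement in the off-diagonal cells is an assumption made for illustration (scikit-learn puts true labels on rows and predicted labels on columns). Accuracy does not depend on which off-diagonal cell holds which count.

```python
import numpy as np

# Diagonal = correct predictions; off-diagonal placement is assumed
cm_demo = np.array([[118, 13],
                    [26, 35]])

correct = np.trace(cm_demo)  # 118 + 35
total = cm_demo.sum()        # all test predictions
accuracy = correct / total

print(correct, total, round(accuracy, 4))
```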

In addition to the confusion matrix, we can calculate ratios (just like the accuracy figure above) to gain more insight into the performance of the model. The classification report lists precision, recall, and F-measure, along with the support (the number of true samples in each class). Here are the definitions from scikit-learn:

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.

The F-beta score weights the recall more than the precision by a factor of beta. beta = 1.0 means recall and precision are equally important.
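
These definitions can be checked by hand on a small example. The labels below are made up for illustration, giving 3 true positives, 1 false positive, and 1 false negative; the manual ratios are then compared against scikit-learn's precision_score(), recall_score(), and f1_score().

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels (not from the diabetes dataset)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # F-beta with beta = 1

print(precision, recall, f1)
print(precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```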

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

ROC Curve

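The ROC curve, or Receiver Operating Characteristic curve, is another common tool that is used when dealing with binary classifiers. It plots the true positive rate against the false positive rate as the classification threshold varies.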

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

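# Calculate ROC values from the predicted probability of the positive class;
# hard 0/1 predictions would reduce the curve to a single threshold
y_prob = lr.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_prob)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)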

# Plot ROC Curve
plt.figure(figsize=(10,8))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')  # display the label that carries the AUC value
plt.show()
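
Each point on the ROC curve corresponds to one classification threshold. The sketch below uses made-up labels and scores to compute the true and false positive rates at a single threshold by hand. It also illustrates why the AUC is best computed from scores or probabilities rather than from hard 0/1 predictions, which discard the ranking information.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data (illustrative only)
y_true_demo = np.array([0, 0, 1, 1, 0, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.7])

threshold = 0.5
hard_preds = (scores_demo >= threshold).astype(int)

# Rates at this one threshold
n_pos = np.sum(y_true_demo == 1)
n_neg = np.sum(y_true_demo == 0)
tpr_at_t = np.sum((hard_preds == 1) & (y_true_demo == 1)) / n_pos
fpr_at_t = np.sum((hard_preds == 1) & (y_true_demo == 0)) / n_neg
print(tpr_at_t, fpr_at_t)

# AUC from continuous scores vs. from thresholded predictions
auc_scores = roc_auc_score(y_true_demo, scores_demo)
auc_hard = roc_auc_score(y_true_demo, hard_preds)
print(auc_scores, auc_hard)
```

On this toy data the two AUC values differ, because thresholding collapses all scores on the same side of 0.5 into ties.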