Weight of Evidence and Information Value in Python from scratch

Gaurab Das
Nov 30, 2020 · 7 min read

Co-authored with Pankaj Kalania

Summary of learnings from this blog post:

  1. Definition and intuition behind IV and WoE
  2. Method and mathematical explanation (with code from scratch in Python)
  3. Benefits
  4. Drawbacks
  5. FAQs
  6. GitHub repository
  7. References

Definition and intuition behind IV and WoE:

Weight of Evidence and Information Value can help us understand the relationship between dependent and independent variables. They are widely used to evaluate the predictive power of an independent variable. Information Value gives us the strength of a predictor, i.e., how strongly or weakly it will be able to predict a class. It is a measure of the separation of binary outcomes such as good and bad customers: a bad customer is one who has defaulted on a loan, and a good customer is one who has paid back on time.

The concept of IV and WoE was first introduced in the 1950s to screen variables for binary classification problems in credit scoring and probability of default, but over the years it has become an accepted technique in other domains too, such as modelling response to marketing campaigns.
It is one of the good practices for variable selection, and WoE is sometimes used for variable transformation as well.

Why do we need feature selection in a model?

  1. Fewer features mean reduced computational cost
  2. We lose explainability with high-dimensional data because of all the noise
  3. Curse of dimensionality - data sparsity and distance concentration
  4. Garbage in, garbage out - whatever we put into the model, the outcome is a result of the same. Basically, we don’t want to include unwanted features in our data

Method and Mathematical Explanation:

We recode the variables into discrete categories (bins) and calculate a unique WoE and IV for each category with respect to the distribution of the dependent variable, i.e., the percentage of events and non-events. A monotonic event rate across the bins suggests a roughly linear relationship between the independent variable and the (log odds of the) dependent variable.

Formula:

For each bin i of a variable,

    WoE_i = ln( %non-events_i / %events_i )

    IV = Σ_i ( %non-events_i − %events_i ) × WoE_i

where %events_i = (events in bin i) / (total events) and %non-events_i = (non-events in bin i) / (total non-events).
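As a quick worked example with made-up numbers: suppose a bin contains 200 of the 1,000 non-events in the data (20%) and 50 of the 500 events (10%). Then its WoE = ln(0.20 / 0.10) = ln 2 ≈ 0.69, and the bin contributes (0.20 − 0.10) × 0.69 ≈ 0.07 to the variable’s total IV.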

In our case, an event is a bad customer and a non-event is a good customer; please refer to the data below to get a better understanding of goods and bads.

We read the data using pandas, and here’s how it looks:

view first 10 rows
some information about the dataset
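A minimal sketch of that loading-and-inspection step (the file name here is a placeholder, not the authors' actual file):

    import pandas as pd

    # load the dataset (file name is illustrative)
    df = pd.read_csv("customer_data.csv")

    # view the first 10 rows
    print(df.head(10))

    # column types, non-null counts and memory usage
    df.info()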

There are 1,000 observations and 8 features. All of our columns are float except the “state” column, which is categorical.
“bad customer” is our target variable and the rest are predictor variables. We have deliberately selected a dataset with a few columns containing missing values, so that it’s easy to show how WoE can be handy for missing values too.

The next step is to bin the variables.

Binning method:

  1. Equi-spaced bins with at least 5% of the total observations in each bin.
  2. To ensure at least a 5% sample in each bin, a maximum of 20 bins is set.
  3. The event rate for each bin should be monotonically increasing or monotonically decreasing. If a monotonic trend is not observed, some of the bins can be merged to achieve monotonicity.
  4. Separate bins are created for missing values.

Here’s a snippet of bin creation:

creating bins for continuous variables
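Since the original snippet is an image, here is a sketch of one way to implement this step (not necessarily the authors' exact code). It uses pandas' qcut, which gives equal-populated rather than strictly equal-width bins, but is the simpler way to guarantee the roughly-5%-per-bin rule with max_bins = 20; missing values get their own bin. The column name in the usage comment is illustrative:

    import pandas as pd

    def create_bins(series, max_bins=20):
        """Cut a continuous variable into at most `max_bins` quantile bins,
        keeping missing values in a separate 'MISSING' bin."""
        # quantile bins give each bin a roughly equal share of observations,
        # so max_bins=20 means about 5% of the data per bin
        bins = pd.qcut(series, q=max_bins, duplicates="drop").astype(str)
        bins.loc[series.isna()] = "MISSING"
        return bins

    # usage (column name is illustrative):
    # df["age_bin"] = create_bins(df["age"])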

We need monotonic bins.
We calculate the mean of the target variable in each bin to check the monotonicity of the bins: the mean should either increase or decrease across the bins. If the series is not monotonic, we reduce the maximum number of bins by 1 and check for monotonicity again, continuing until the series becomes monotonic. If the maximum number of bins drops to 2, we forcefully create bins using the min, mean and max values as edges.

function to check monotonicity
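The original function is also an image; a sketch of the logic described above might look like this (pandas' is_monotonic_increasing / is_monotonic_decreasing does the actual test, and the quantile binning is the same assumption as in the previous sketch):

    import pandas as pd

    def monotonic_bins(series, target, max_bins=20):
        """Reduce the number of bins until the per-bin event rate
        (mean of the binary target) is monotonic."""
        for n in range(max_bins, 2, -1):
            bins = pd.qcut(series, q=n, duplicates="drop")
            event_rate = target.groupby(bins).mean()
            if (event_rate.is_monotonic_increasing
                    or event_rate.is_monotonic_decreasing):
                return bins
        # fallback described in the post: force bins from min, mean and max
        edges = [series.min(), series.mean(), series.max()]
        return pd.cut(series, bins=edges, include_lowest=True, duplicates="drop")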

The next step is to calculate the WoE and IV.

WOE and IV calculation
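The calculation snippet is an image as well; here is a sketch implementing the formula given above, with a small epsilon added so that bins containing only one class do not produce log(0):

    import numpy as np
    import pandas as pd

    def woe_iv(binned, target):
        """Compute per-bin WoE and the total IV for one variable.
        `binned` is the binned feature; `target` is binary
        (1 = event / bad customer, 0 = non-event / good customer)."""
        tab = pd.DataFrame({"bin": binned, "target": target})
        grp = tab.groupby("bin")["target"].agg(events="sum", total="count")
        grp["non_events"] = grp["total"] - grp["events"]
        # distribution of events / non-events across bins
        grp["pct_events"] = grp["events"] / grp["events"].sum()
        grp["pct_non_events"] = grp["non_events"] / grp["non_events"].sum()
        eps = 1e-6  # guards against log(0) in bins with only one class
        grp["woe"] = np.log((grp["pct_non_events"] + eps)
                            / (grp["pct_events"] + eps))
        grp["iv"] = (grp["pct_non_events"] - grp["pct_events"]) * grp["woe"]
        return grp, grp["iv"].sum()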

A sample output.

Let’s try to interpret this.
“sample class” holds our bins, and we can see that the event rate increases monotonically up to the last bin, with a separate bin for the missing values. We have calculated the WoE and IV for each bin using the formula above, and finally summed the per-bin IVs to get the variable’s total IV.

Final IV output

IV values

We have sorted the dataframe in descending order of IV values.
“number of missed payments” has the highest IV at 0.22, and the other variables follow.

How to interpret IV values?

The table below shows the ranges of IV values and their meaning, i.e., the predictive strength:

Information Value      Predictive Power
< 0.02                 Not useful for prediction
0.02 to 0.1            Weak predictor
0.1 to 0.3             Medium predictor
0.3 to 0.5             Strong predictor
> 0.5                  Suspicious, too good to be true

We can see that out of all 7 variables, “number of missed payments” (IV=0.22) and “number of bank visits” (IV=0.16) are good predictors of a “bad customer” for this dataset.

Benefits:

  1. Outlier treatment via WoE encoding of the feature.
  2. Missing-value treatment via WoE encoding of the feature.
  3. The WoE transformation puts variables on a comparable scale.
  4. WoE encoding works for categorical and binary variables.
  5. WoE encoding establishes a linear relationship with the target variable.
  6. The transformation is based on a logarithmic scale, which keeps it well aligned with the log odds in logistic regression.
  7. Column reduction: for a categorical variable with 20 categories, one-hot encoding produces 20 mostly-zero columns, whereas WoE encoding needs just one column (see the sketch after this list).
  8. Variables with a large number of discrete values relative to the total population can be combined into fewer categories (based on the type of information, the dependent variable, counts, etc.), and the WoE values can then be used to express the information in the variable.
  9. The (univariate) effect of each category on the dependent variable can be compared directly across categories and across variables because WoE is a standardized value (for example, you can compare the WoE of married people to the WoE of manual workers).
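To illustrate benefit 7, here is a sketch of WoE encoding for the “state” column, reusing the woe_iv helper sketched earlier: one numeric column instead of one indicator column per category.

    # compute one WoE value per category of "state"
    woe_table, _ = woe_iv(df["state"], df["bad customer"])

    # replace the categorical column with a single numeric WoE column
    df["state_woe"] = df["state"].map(woe_table["woe"])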

Drawbacks:

  1. WoE encoding loses some information because of binning.
  2. It says nothing about the inter-relationships among the independent variables.
  3. The effect of a variable can change substantially depending on how the bins are created. For example, consider a variable “job tenure” with 30% of its values missing: if the missing bin has an event rate of 65%, that single bin will have a large impact on the variable’s IV.

FAQs:

  1. Is it a feature selection or a feature reduction technique?
    Ans: It’s a feature selection technique, not a feature reduction technique. PCA, LDA, etc. are feature reduction techniques.
  2. Is it recommended to use WoE- and IV-selected variables in machine learning models such as Random Forest, XGBoost, etc.?
    Ans: It is widely used, but it is not always an optimal feature selection method for such models, since they also handle non-linear relationships well. So, depending on the business requirement, one might miss out on some information.
  3. Is it recommended to use WoE and IV for non-finance problems?
    Ans: It was introduced for credit scoring and probability-of-default problems, but over the years it has become an accepted technique in other domains too, such as modelling response to marketing campaigns.
  4. Does it tell us about the inter-relationship between the independent variables?
    Ans: This analysis only gives us the relationship between one independent variable at a time and the target. It does not explain the relationships between the independent variables.
  5. Does WoE standardize all variables?
    Ans: It rescales the variables but doesn’t standardize them, as standardization means mean = 0 with standard deviation = 1.
  6. Why do the bins have to be monotonic?
    Ans: The idea here is to find a linear trend between the dependent and the independent variables.
  7. Is there a need to check variable trends separately for features that have monotonic WoE values?
    Ans: No. If the WoE is already monotonic, the feature will have either a linearly increasing or a linearly decreasing trend.
  8. Can we do segmentation using WoE and IV as well, or is it limited to feature selection?
    Ans: Yes, it is used to identify segments in a dataset too.
  9. Can we prepare IV and WoE for continuous dependent features?
    Ans: Yes. The applications of WoE have evolved over the years, and with a modification of the formula it can be done for a continuous dependent variable too (see reference 2 below).
  10. Why are features with IV > 0.5 suspicious?
    Ans: We call them suspicious because such a variable might be related to the target variable to a great extent (possibly through data leakage), so we should investigate it properly and use it with discretion.

GitHub repository

View the entire code in Pankaj Kalania’s repository below:
WOE and IV from scratch in Python

References

  1. Listen Data - WOE & IV explained
  2. IV for continuous dependent variable

