Direct Marketing Optimization using Mobile Data

A classification approach to forecasting growth in user subscriptions

Authors: Moorissa Tjokro, Jager Hartman

Motivation

A banking institution ran a direct marketing campaign based on phone calls. Often, more than one contact with the same client was required to determine whether the product (a bank term deposit) would be subscribed. To reduce this effort, we predict whether a client will subscribe to the term deposit based on the available information.

Dataset

We use data.csv which contains the following fields:

  1. age (numeric)
  2. job : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
  3. marital_status : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
  4. education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
  5. credit_default: has credit in default? (categorical: "no","yes","unknown")
  6. housing: has housing loan? (categorical: "no","yes","unknown")
  7. loan: has personal loan? (categorical: "no","yes","unknown")
  8. contact: contact communication type (categorical: "cellular","telephone")
  9. month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
  10. day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
  11. duration: last contact duration, in seconds (numeric).
  12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13. prev_days: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. prev_contacts: number of contacts performed before this campaign and for this client (numeric)
  15. prev_outcomes: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
  16. emp_var_rate: employment variation rate - quarterly indicator (numeric)
  17. cons_price_idx: consumer price index - monthly indicator (numeric)
  18. cons_conf_idx: consumer confidence index - monthly indicator (numeric)
  19. euribor3m: euribor 3 month rate - daily indicator (numeric)
  20. nr_employed: number of employees - quarterly indicator (numeric)
  21. subscribed (target variable): has the client subscribed a term deposit? (binary: "yes","no")

Technique Overview

Our original approach was to build a simple poor man's stacking ensemble from fairly simple models. We first tried stacking Naive Bayes, random forest, ExtraTreesClassifier, SVM, and logistic regression together. These models performed well in cross-validation and on the test set, with ROC-AUC scores around 0.78-0.8; however, on submission to Kaggle the score dropped to 0.76. Gaussian Naive Bayes was one of the strongest individual performers but added nothing to the ensemble. As an aside, we also tried NearestCentroid and KNN but had issues with consistent predict_proba calls, so we dropped those models as well.

*Note that all model hyperparameters were tuned using GridSearchCV with cv=5. Not all grid searches are included in the notebook because of their execution time.
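The tuning loop, sketched minimally with sklearn's GridSearchCV; the toy data, estimator, and parameter grid here are illustrative placeholders, not the exact grids from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# toy stand-in for the bank data; each real model was tuned the same way
X, y = make_classification(n_samples=500, random_state=0)

# cv=5 as noted above; scoring="roc_auc" matches the metric we report
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # illustrative grid
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```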

Moving forward, we dropped the SVM altogether due to time constraints and dropped Naive Bayes since it was behaving strangely. That left gradient boosting, AdaBoost, easy ensembles, random forests, and logistic regression with feature selection. The ExtraTrees classifier was left out since the random forests are more powerful and pick up on similar trends. AdaBoost overfit too much and dominated the stacking ensemble. Gradient boosting also overfit quite a lot, but it seemed to improve the Kaggle scores. Cross-validation and the test set showed ROC-AUCs around 0.8-0.82 for the ensembles that included the gradient-boosted trees. This left us with gradient boosting plus different implementations of logistic regression, such as AdaBoost, easy ensembles, and other sampling techniques.
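The poor man's stacking itself is a soft-voting ensemble; a minimal sketch with sklearn's VotingClassifier, using placeholder estimators and hyperparameters rather than our tuned ones:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# imbalanced toy data, mimicking the ~9:1 class ratio of the bank set
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

# soft voting averages predict_proba outputs, which is why models lacking a
# consistent predict_proba (e.g. NearestCentroid) had to be dropped
stack = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("lr", LogisticRegression(solver="liblinear")),
    ],
    voting="soft",
)
scores = cross_val_score(stack, X, y, cv=5, scoring="roc_auc")
print(round(scores.mean(), 3))
```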

Feature selection applied to all of the data ahead of the voting classifier for poor man's stacking showed little improvement, owing to the tree models we were using. Instead, applying an RFE(RandomForest()) selector before logistic regression alone performed best.
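A sketch of that combination: an RFE selector wrapped around a random forest, feeding logistic regression. The toy data and n_features_to_select are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)

# RFE ranks features by the forest's feature_importances_, recursively
# dropping the weakest; logistic regression then fits on the survivors
pipe = make_pipeline(
    StandardScaler(),
    RFE(RandomForestClassifier(n_estimators=50, random_state=0),
        n_features_to_select=10),
    LogisticRegression(solver="liblinear"),
)
pipe.fit(X, y)
```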

Standard scaling and min-max scaling performed similarly, with standard scaling holding a slight edge. This ran contrary to our expectation, since the binary dummy variables already lie between 0 and 1, and the standard scaler shifts the 0's to negative values.
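The comparison can be reproduced in miniature by swapping each scaler into the same pipeline; the toy data here stands in for the bank features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

# same model, same folds; only the scaling step changes
results = {}
for scaler in (StandardScaler(), MinMaxScaler()):
    pipe = make_pipeline(scaler, LogisticRegression(solver="liblinear"))
    auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    results[type(scaler).__name__] = auc
print(results)
```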

Again contrary to our expectation, SMOTE and simply omitting any imbalance-handling technique both performed much better than RandomUnderSampler. Using SMOTE(ratio = 0.5) followed by RandomUnderSampler gave a compromise between the performance of SMOTE alone and RandomUnderSampler alone, though it showed no improvement over SMOTE alone. This was gauged on logistic regression and the poor man's stacking classifier.

Another approach we tried was to build an easy ensemble out of the voting classifiers. It overfit, though not as badly as AdaBoost, and the results stayed consistent regardless of the number of classifiers used. A further analysis of the effect of the number of classifiers on the easy ensemble is included at the very end, in the analysis of resampling techniques.
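The easy-ensemble idea, sketched by hand with logistic regression as the base model (our actual version used voting classifiers as the base; sizes and counts here are placeholders):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

def easy_ensemble_proba(base, X, y, n_estimators=10, seed=0):
    """Fit `base` on n_estimators balanced subsets (all minority samples plus
    an equal-sized random draw of the majority) and average predict_proba."""
    rng = np.random.RandomState(seed)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    probas = []
    for _ in range(n_estimators):
        sample = np.concatenate(
            [minority, rng.choice(majority, size=len(minority), replace=False)]
        )
        clf = clone(base).fit(X[sample], y[sample])
        probas.append(clf.predict_proba(X)[:, 1])
    return np.mean(probas, axis=0)

scores = easy_ensemble_proba(LogisticRegression(solver="liblinear"), X, y)
```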

Technique Summary

0. Import Libraries and Load Data

0.1. Load data and convert unknowns to nulls
0.2. Categorize features based on types
0.3. Define response variable
0.4. Split data into training and test set

1. Exploration and Preparation

1.1. Identify possible associations between dependent and independent variables
        * Includes a scatterplot matrix, density plots, and histograms.
1.2. Convert yes/no values to 1/0
1.3. Create dummies for categorical variables
1.4. Create binary prev_days indicator
1.5. Impute missing values
1.6. Select significant variables
1.7. Deal with imbalanced data
        * Includes oversampling and undersampling techniques.

2. Classification Models

2.1. Logistic Regression
2.2. Linear SVM
2.3. Kernelized SVM (RBF)
2.4. Naive Bayes
2.5. Stochastic Gradient Descent Classifier
2.6. Nearest Centroid Classifier
2.7. Logistic Regression with Resampled Ensemble
2.8. Logistic Regression with RFE
2.9. Logistic Regression with RFE Lasso
2.10. Model Evaluation (ROC-AUC & F-1 Scores)

3. Tree-based Models

3.1. Decision Tree
3.2. Random Forest
3.3. Bagging
3.4. Gradient Boosting
3.5. Adaboost
3.6. Extra Tree Classifier
3.7. Model Evaluation (ROC-AUC & F-1 Scores)

4. Ensemble Models

4.1. Poor Man's Stacking using Gradient Boosted Classifier and an Easy Ensemble of Logistic Regressions
4.2. Poor Man's Stacking using Random Forest Classifier and Easy Ensemble of Logistic Regressions
4.3. Model Evaluation

5. Resampling Techniques

5.1. Sampling Transformation and Analysis
5.2. Sampling Techniques Evaluation on Poor-man's Stacking Algorithm
5.3. Easy Ensembles
5.4. AdaBoost Resampled Ensembles

Step 0 - Import Libraries and Load Data

This is the basic step where we load the data and create train and test sets for internal validation.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
plt.rcParams["figure.dpi"] = 100
np.set_printoptions(precision=3, suppress=True)

0.1. Load data and convert unknowns to nulls

Unknown values in the dataset seem to be clean and consistent, encoded as unknown. In this case, we can convert them to null values while importing the data.

In [31]:
data = pd.read_csv('data/data.csv', delimiter = ',', na_values='unknown')
data.head()
Out[31]:
age job marital_status education credit_default housing loan contact month day_of_week ... campaign prev_days prev_contacts prev_outcomes emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed subscribed
0 41.0 blue-collar married basic.9y no yes no cellular apr mon ... 2.0 999 0 nonexistent -1.695118 92.698705 -46.727552 1.345160 5097.0 no
1 46.0 entrepreneur married NaN no no no cellular may wed ... 2.0 999 0 nonexistent -1.767159 92.914878 -46.313088 1.314499 5100.0 no
2 56.0 unemployed married basic.9y no yes yes cellular nov fri ... 1.0 999 0 nonexistent -0.100365 93.423076 -41.904559 4.003471 5193.0 no
3 89.0 retired divorced basic.4y no yes no cellular may wed ... 4.0 999 0 nonexistent -1.771314 93.672814 -46.045500 1.261668 5100.0 no
4 34.0 entrepreneur married university.degree NaN yes no cellular jul thu ... 8.0 999 0 nonexistent 1.458103 94.296285 -42.455877 5.152077 5233.0 no

5 rows × 21 columns

0.2. Categorize features based on types

Based on observations, we categorize the independent variables by type below. Note that subscribed is not among them because it is the response variable.

In [3]:
data.dtypes
Out[3]:
age               float64
job                object
marital_status     object
education          object
credit_default     object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration          float64
campaign          float64
prev_days           int64
prev_contacts       int64
prev_outcomes      object
emp_var_rate      float64
cons_price_idx    float64
cons_conf_idx     float64
euribor3m         float64
nr_employed       float64
subscribed         object
dtype: object
  • Variable types: Categorical and Continuous
In [32]:
categorical = ['job', 'marital_status', 'education',
                    'credit_default', 'housing', 'loan',
                    'contact', 'month', 'day_of_week',
                    'prev_outcomes']

#Removed Duration
continuous  = ['age', 'campaign', 'prev_days',
                    'prev_contacts', 'nr_employed',
                    'emp_var_rate', 'cons_price_idx', 
                    'cons_conf_idx', 'euribor3m']

print("Total number of categorical predictors:", len(data[categorical].columns))
print("All categorical data as object:", (data[categorical].dtypes == 'object').all(), '\n')
print("Total number of continuous predictors:", len(data[continuous].columns))
print("All continuous data as float64 or int64:", data[continuous].dtypes.astype(str).isin(['float64', 'int64']).all())
Total number of categorical predictors: 10
All categorical data as object: True 

Total number of continuous predictors: 9
All continuous data as float64 or int64: True

0.3. Define response variable

Since our goal is to predict whether someone will subscribe to the term deposit based on the given information, we define the subscribed variable as our response variable.

In [5]:
data.subscribed.value_counts()
Out[5]:
no     29238
yes     3712
Name: subscribed, dtype: int64

Note the imbalance here between no and yes. We would like to recode no as 0 and yes as 1 so the classes are easier to work with during modeling, but let's explore this more in the next step.

Also, it's good to see that there are no unknown values in the target, so we don't need to drop any datapoints or rows.

0.4. Split data into training and test set

Note below that we are also dropping duration variable since it's prohibited in the assignment.

In [33]:
from sklearn.model_selection import train_test_split

subscribed = data.subscribed
data_ = data.drop(["duration", "subscribed"], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(data_, subscribed == "yes", random_state=0, stratify=subscribed)

print("Size for X_train:", X_train.shape)
print("Size for X_test:", X_test.shape)
print("Size for y_train:", y_train.shape)
print("Size for y_test:", y_test.shape)
Size for X_train: (24712, 19)
Size for X_test: (8238, 19)
Size for y_train: (24712,)
Size for y_test: (8238,)

Step 1 - Exploration and Preparation

In this step, we look into the data and try to understand it before modeling. This understanding leads to some basic data preparation steps that are common across the two model sets required.

1.1. Identify possible associations between dependent and independent variables

Scatterplot matrix

We use a scatterplot matrix to explore relationships between the independent variables and the dependent subscribed variable.

In [7]:
pd.plotting.scatter_matrix(X_train[continuous], c=y_train, alpha=.2, figsize=(10, 10));

A few observations we see from the scatter matrix above:

  • the majority of clients are in their 30s (age).
  • negative correlation between nr_employed and prev_contacts: clients associated with a lower number of employees tend to have received more contacts before this campaign
  • younger clients (age) tend to be associated with a higher number of employees (nr_employed)

Density Plots

We also use density plots below to visualize the distribution of those who subscribed and those who did not (y-axis) for each continuous variable (x-axis). A Gaussian kernel density estimate is used to draw inferences about the population of subscribers vs. non-subscribers.

In [8]:
from scipy.stats import gaussian_kde

def density_calc(array, N = 500, bw = 0.2):
    """
    Parameters
    ----------
    array   :       array-like, data to be plotted as density
    N       :       int, number of points to use to generate density curve
    bw      :       float, corresponds to bandwidth, smaller results in skinnier bands.  
                    larger value results in wider bands
    Outputs
    -------
    x       :      array created from np.linspace
    density :      points of the density curve created with scipy.stats.gaussian_kde
    """
    density = gaussian_kde(array)
    x = np.linspace(np.min(array), np.max(array), N)
    density.covariance_factor = lambda: bw
    density._compute_covariance()
    return x, density(x)

def plot_density(values, bw = 0.2, N = 500):
    """
    Parameters
    ----------
    values       :       list, corresponds to columns in data table to plot
    bw           :       float, bandwidth parameter to be given to density_calc
    N            :       int, number of points used to generate density in density_calc
    Output
    ------
    Array of plots for density functions
    """
    df = data.copy() 
    fig = plt.figure(figsize=(2*len(values),10))
    for i,value in enumerate(values):
        axes = fig.add_subplot(2, len(values)//2+1, i+1)
        
        x, y = density_calc(df[str(value)][df["subscribed"] == "yes"], bw = bw)
        axes.plot(x, y, 'r', label='subscribed')

        x, y = density_calc(df[str(value)][df["subscribed"] == "no"], bw = bw)
        axes.plot(x, y, 'b', label='did not subscribe')
        axes.legend()
        axes.set_title(value)
        axes.title.set_fontsize(20)
    plt.show()
    
plot_density(continuous, bw=1)