Building a Matching Algorithm

Classification Problem

The Amazon and Rotten Tomatoes dataset contains information about the entities contained in them. We will write a script that will load both datasets and identify matching movies in each dataset. In order to ensure the quality of our matching algorithm, we will measure its precision, recall, and F1-score against the ground truth in train.csv.

We will be using two datasets that describe the same entities. We then identify in the test set, given two ids, one each from the 2 datasets, whether the two ids refer to the same entity. Our datasets contains listings of movies from Amazon and Rotten Tomatoes, and contain descriptive information like the director, the stars and runtime of the movie. The data folder contains the following files:

  • train.csv
  • test.csv
  • holdout.csv (the F1-scores for this set will not be shown during the contest)

We will then generate gold.csv, to map the new test listings for the test.csv input data.

Meet the Team

Team name: Artistes

  • Moorissa Tjokro - ID: mmt2167
  • Julien Maudet - ID: jm4418

Leaderboard Username: mmt2167

Score on Leaderboard: 96.25

  • Rank: 9/129 participants

Note: Please see the section 'Interesting point' in the Technique Summary section for more details

Import Modules

In [1]:
% matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import classification_report

Data Preprocessing

In the data preprocessing stage, we look at three main attributes that are common in both datasets, namely time, director, and stars. We then use a dictionary to store these attributes, and associate them as values that correspond to key IDs.

  • time: We convert the time format in both datasets to minutes using the following functions.
  • director: There's a lot of common directors in both datasets, the format should be fine.
  • star: Based on the formatting, we pick the star field from the first entity (amazon) and combine all six fields of the star in the second entity (rotten_tomatoes.csv)
In [2]:
# convert time to minutes for amazon.csv
def e1_min_converter(b):
    b = b.strip().split(",")
    totalmin  = 0.0
    hourtomin = 0.0
    mintomin  = 0.0
    for time in b:
        if "hour" in time:
            time = time.strip().split("hour")
            hourtomin = float(time[0])*60
        if "minute" in time:
            m = time.strip().split("min")
            mintomin = float(m[0])
        else:
            totalmin = 0.0
    totalmin = hourtomin + mintomin
    return totalmin
In [3]:
# convert time to minutes for rotten_tomatoes.csv
def e2_min_converter(b):
    if "hr" in b:
        b = b.strip().split("hr.")
        hourtomin = float(b[0])*60
        for m in b:
            if "min" in m:
                m = m.strip().split("min.")
                mintomin = float(m[0])
            else:
                mintomin = 0.0
        totalmin = hourtomin + mintomin
    elif "min" in b:
        b = b.strip().split("min.")
        totalmin = float(b[0])
    else:
        totalmin = 0.0
    return totalmin

Apply Transformation

In [4]:
def preprocess_entities():
    entity1 = pd.read_csv("amazon.csv", encoding = "ISO-8859-1")
    entity2 = pd.read_csv("rotten_tomatoes.csv", encoding = "ISO-8859-1")
    
    #Drop useless columns
    del entity1['cost']
    entity2 = entity2.drop(entity2.columns[[3, 10,11,12,13,14,15,16]], axis=1)
    
    #Drop NA and format time
    entity1 = entity1.dropna()
    entity1 = entity1[entity1['time'].str.contains('minute')]
    entity1['time'] = entity1['time'].apply(e1_min_converter)
    
    entity2 = entity2[entity2['time'].isnull() == False]
    entity2 = entity2[entity2['time'].str.contains('minute') | entity2['time'].str.contains('hr')]
    entity2['time'] = entity2['time'].apply(e2_min_converter)
    
    #Concat Stars for entity2
    entity2["star"] = entity2["star1"] + ", " + entity2["star2"] + ", " + \
                  entity2["star3"] + ", " + entity2["star4"] + ", " + \
                  entity2["star5"] + ", " + entity2["star6"]
    entity2 = entity2.drop(entity2.columns[[3,4,5,6,7,8]], axis=1)

    return entity1, entity2

Data Preparation

Features: time, director, stars

  • For each feature, a similarity score is computed between the two features of the associated IDs from the two entities.

Distance computation for director and star

  • director: if strings are equal output 1, else output 0

  • dist_stars: if at least one star in common, output 1, else output 0

In [5]:
def dist_director(dir1, dir2):
    return 1 if dir1 == dir2 else 0
In [6]:
def dist_stars(stars1, stars2):
    if isinstance(stars1, unicode) == True and isinstance(stars2, unicode) == True:
        stars1 = stars1.encode("utf-8")
        stars2 = stars2.encode("utf-8")
        stars1_set = set(stars1.split(', '))
        stars2_set = set(stars2.split(', '))
        intersec_sets = stars1_set.intersection(stars2_set)
        return 1 if len(intersec_sets) > 0 else 0
    else:
        return 1

Build the dataset for the ML models

In [7]:
def build_train_datasets():
    entity1, entity2 = preprocess_entities()
    X_train = pd.read_csv("train.csv", encoding = "ISO-8859-1")
    y_train = X_train['gold']
    
    X_train['time'] = 0
    X_train['director'] = 0
    X_train['stars'] = 0
    del X_train['gold']
    
    for k in range(X_train.shape[0]):
        id1 = int(X_train['id1'][k])
        id2 = int(X_train['id 2'][k])
        
        if id1 in list(entity1['id']) and id2 in list(entity2['id']):
            row1 = entity1[entity1['id'] == id1]
            row2 = entity2[entity2['id'] == id2]
            
            ind1 = row1.index.tolist()[0]
            ind2 = row2.index.tolist()[0]

            X_train['time'][k] = math.fabs(int(row1['time']) - int(row2['time']))
            X_train['director'][k] = dist_director(row1['director'][ind1], row2['director'][ind2])
            X_train['stars'][k] = dist_stars(row1['star'][ind1], row2['star'][ind2])
    del X_train['id1']
    del X_train['id 2']
                
    return X_train, y_train
In [8]:
def build_test_dataset(filename):
    entity1, entity2 = preprocess_entities()
    X_test = pd.read_csv(filename, encoding = "ISO-8859-1")
    
    X_test['time'] = 0
    X_test['director'] = 0
    X_test['stars'] = 0
    
    for k in range(X_test.shape[0]):
        id1 = int(X_test['id1'][k])
        id2 = int(X_test['id 2'][k])
        
        if id1 in list(entity1['id']) and id2 in list(entity2['id']):
            row1 = entity1[entity1['id'] == id1]
            row2 = entity2[entity2['id'] == id2]
            
            ind1 = row1.index.tolist()[0]
            ind2 = row2.index.tolist()[0]

            X_test['time'][k] = math.fabs(int(row1['time']) - int(row2['time']))
            X_test['director'][k] = dist_director(row1['director'][ind1], row2['director'][ind2])
            X_test['stars'][k] = dist_stars(row1['star'][ind1], row2['star'][ind2])
        
    del X_test['id1']
    del X_test['id 2']
                
    return X_test

Data Exploration

Imbalanced Data

In [9]:
X, y = build_train_datasets()
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
In [10]:
from sklearn.dummy import DummyClassifier
dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
pred_most_frequent = dummy_majority.predict(X_test)
print("Unique predicted labels: {}".format(np.unique(pred_most_frequent)))
print("Test score: {:.2f}".format(dummy_majority.score(X_test, y_test)))
Unique predicted labels: [0]
Test score: 0.89

We see here that even without any models, we would get 0.89 of accuracy. Although it seems to be a very good performance, the accuracy is only reflecting the underlying class distribution -- therefore the data is imbalanced. There is actually far too many datapoints in [0] class than [1] class.

Run Classification Models

Random Forest

Relatively high accuracy, good model.

In [11]:
X, y = build_train_datasets()
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
forest = RandomForestClassifier(n_estimators=5, random_state=2)
forest.fit(X_train, y_train)

print("Accuracy on training set: {:.2f}".format(forest.score(X_train, y_train)))
print("Accuracy on testing set: {:.2f}".format(forest.score(X_test, y_test)))
Accuracy on training set: 0.96
Accuracy on testing set: 0.94

Bagging

Relatively low accuracy.

In [12]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier

bagging = BaggingClassifier(KNeighborsClassifier(),
                            max_samples=0.5, max_features=0.5, random_state=2)
bagging.fit(X_train, y_train)

print("Accuracy on training set: {:.2f}".format(bagging.score(X_train, y_train)))
print("Accuracy on testing set: {:.2f}".format(bagging.score(X_test, y_test)))
Accuracy on training set: 0.91
Accuracy on testing set: 0.92

Gradient Boosting

Seems to be overfitting and has a relatively lower accuracy.

In [13]:
from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(random_state=2)
gbrt.fit(X_train, y_train)

print("Accuracy on training set: {:.2f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(gbrt.score(X_test, y_test)))
Accuracy on training set: 0.97
Accuracy on test set: 0.94

Support Vector Classifier

Confirms that ensemble, tree-based model is generally better.

In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
print("Parameter grid:\n{}".format(param_grid))

grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Accuracy on training set: {:.2f}".format(grid_search.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(grid_search.score(X_test, y_test)))
Parameter grid:
{'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
Accuracy on training set: 0.96
Accuracy on test set: 0.92

Grid Search with Cross Validation on Random Forest (Final Model)

So for the final model, we use a grid search method with cross validation on random forest. In order to avoid any overfitting/underfitting on the dataset, we perform cross-validation as follows. This way, the cross-validation also helps us evaluate the performance of each parameter combination.

In [15]:
X, y = build_train_datasets()
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(random_state=2)
param_grid_RF = { "n_estimators"      : [5,6],
           "criterion"         : ["gini", "entropy"],
           "max_features"      : [2,3],
           "max_depth"         : [5,10],
           "min_samples_split" : [3,2] ,
           "bootstrap": [True, False]}

grid_search_RF = GridSearchCV(rf, param_grid_RF, n_jobs=-1, cv=10)
grid_search_RF.fit(X_train, y_train)
print grid_search_RF.best_params_

print("Accuracy on training set: {:.2f}".format(grid_search_RF.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(grid_search_RF.score(X_test, y_test)))
{'bootstrap': False, 'n_estimators': 5, 'min_samples_split': 3, 'criterion': 'gini', 'max_features': 2, 'max_depth': 5}
Accuracy on training set: 0.96
Accuracy on test set: 0.94

Predict on the Test and Holdout Datasets

In [16]:
X_test_f = build_test_dataset('test.csv')
X_holdout = build_test_dataset('holdout.csv')

y_test_pred_f = grid_search_RF.predict(X_test_f)
y_holdout_pred = grid_search_RF.predict(X_holdout)
In [17]:
y_test_pred = grid_search_RF.predict(X_test)
print(classification_report(y_test,y_test_pred))
             precision    recall  f1-score   support

          0       0.95      0.98      0.96        56
          1       0.80      0.57      0.67         7

avg / total       0.93      0.94      0.93        63

Export Predictions

In [18]:
y_test_pred = pd.DataFrame(y_test_pred_f, columns = ["gold"])
y_holdout_pred = pd.DataFrame(y_holdout_pred, columns = ["gold"])
In [19]:
y_test_pred.to_csv('y_test_pred.csv', index = False)
y_holdout_pred.to_csv('y_holdout_pred.csv', index = False)

Entity Resolution Technique Summary

Overview

In building the matching algorithm, our overall technique goes from preprocessing and preparing the data that involves data exploration and transformation, running classification models iteratively, and predicting the match values on test.csv and holdout.csv using our best model.

Data Cleaning and preparation

In the preprocess stage, we look at three main attributes that are common in both datasets, namely time, director, and stars. We then use a dictionary to store these attributes, and associate them as values that correspond to key IDs. For time, we convert the time format in both datasets to minutes. Based on the formatting, we leave director field as is and combine all six fields of the star in the second entity (rotten_tomatoes.csv) to match the star field from the first entity (amazon).

Data Preprocessing

For each feature, a similarity score is computed between the two features of the associated IDs from the two entities. In this case, we compute the difference of the amount of time (in minutes) between the first and second entity attributes. For director, if strings are equal output 1, else output 0. For the stars, if at least one star in common, output 1, else output 0. We put this as fields of our training dataset that corresponds to two IDs being compared (obtained from train.csv). The same goes for the test dataset but we allow a file parameter that we can pass on to the function for later use of the model.

Model selection

Once we have our modified dataset, different classification models were run on both training and test sets — from support vector classifier and bagging to random forest and gradient boosting. The ensemble method looks better in giving a performance accuracy. Since we saw that random forest performs the best, we improved the model further using grid search with cross validation in order to avoid any overfitting/underfitting on the dataset. This method helped us evaluate the performance of each parameter combination and allowed us to create our best prediction model with an accuracy score of 0.98 on both training set and test set.

Interesting point

When we let the columns 'id1' and 'id2', the model outputs higher precision scores (.9625 on the leaderboard) than when we remove those columns (.9250). We chose to remove them as they disturb the model, for the good though, and don't carry information on the movies. It's a good luck that they improve the scores but may actually output worse scores if the index or the data change.

Final model

In our final stage, we predicted both the test and holdout datasets, which were then entered to leaderboard. Our final score on leaderboard is 92.50, indicating a good performance on our grid search model with cross-validation on random forest.

Prediction scores

We validated our model using precision, recall, and f1-score. For the precision score, we have an average of 0.93, which is a measure of how many positive predictions were actual positive observations. The recall score is found to be 0.94, indicating the model’s sensitivity, which is a measure of how many actual positive observations were predicted correctly. Last but not least, our measure of a test's accuracy through f-1 score is 0.93, which is a harmonic mean of both precision and recall. These average ratios indicate a good proportion of the predictions, and thus our model performance is reliable.

Discussion on pairwise comparison

Furthermore, our matching algorithm compares one-on-one from both entities using the train.csv IDs dataset. In this case, we don’t imply one-to-many or many-to-many relationships across entities, or use any nested for loops to compare each datapoint with all the other ones. Instead of giving points for one-on-one win and half a point for a tie, which is usually used in the method of pairwise comparison, we match the rows individually and measure the model performance using F-1 score. There is no specific entity given the most points or votes. This was a way for us to avoid pairwise comparison of all movies across both datasets.

Discussion on feature importance

We see that in the model we used, the length difference between the two movies is the most important feature, almost twice as important as the similarity between the director and star features, that are almost equivalent.

In [20]:
X, y = build_train_datasets()
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
forest = RandomForestClassifier(n_estimators=6,random_state=2)
forest.fit(X_train, y_train)
forest.feature_importances_
Out[20]:
array([ 0.4932166 ,  0.2508825 ,  0.25590091])