Mining Traffic Complaints for Safer Streets in Boston

A text-classification approach for predicting complaint types and modeling semantic categories

Authors: Moorissa Tjokro, Jager Hartman

Objective

The objective of this project is to perform text classification on a dataset of complaints about traffic conditions submitted to the city of Boston. The dataset we use can be found here: https://data.boston.gov/dataset/vision-zero-entry

To reach this objective, there are two main steps we will perform:

  1. Predict the type of complaint (“REQUESTTYPE”) from the complaint text.
  2. Strategize a better categorization of the data into semantic categories.

Technique Summary

Step 1: Data Preparation & Preprocessing

Basic data import and preprocessing techniques, with the following additional steps:

  • Clean up the target label REQUESTTYPE
  • Remove duplicates
  • Visualize class distributions
  • Split into training and test sets

Step 2: Bag-of-Word Models

Run a baseline multi-class classification model using a bag-of-words approach, followed by:

  • Several bag-of-words model attempts with tuned parameters
  • Macro f1-score report
  • Confusion matrix
  • Interpretation of modeling mistakes

Step 3: N-Gram-based Models

Improve the model using more complex text features, including n-grams, character n-grams and possibly domain-specific features.

  • N-gram count vectorizer
  • Tfidf transformation
  • Normalization with lemmatization
  • Normalization with stemming

Step 4: Result Visualizations

Visualize results of the tuned model:

  • Classification results
  • Tuned model visualization
  • Confusion matrix
  • Important feature plots (see the sketch after this list)
  • Example mistakes
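A minimal sketch of an important-feature plot, assuming a fitted CountVectorizer + LogisticRegression pipeline named pipe as built in Step 2 (the class row 0 is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

coef = pipe.named_steps['logisticregression'].coef_[0]            # weights for one class
names = np.array(pipe.named_steps['countvectorizer'].get_feature_names())

top = np.argsort(coef)[-15:]                                      # the 15 largest positive weights
plt.barh(range(len(top)), coef[top], color="orange")
plt.yticks(range(len(top)), names[top])
plt.xlabel("coefficient weight")
plt.show()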

Step 5: Clustering Techniques

Apply the following techniques to the dataset:

  • LDA
  • NMF
  • K-Means

Then find clusters or topics that match well with some of the ground-truth labels, and use ARI to compare the methods and visualize topics and clusters. A rough sketch follows.
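A rough sketch of this step, assuming a recent scikit-learn and the X / y_true variables constructed in Step 3 of the main analysis (all parameters are illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X_tfidf = TfidfVectorizer(min_df=5, stop_words="english").fit_transform(X)
X_counts = CountVectorizer(min_df=5, stop_words="english").fit_transform(X)  # LDA expects raw counts

km = KMeans(n_clusters=11, random_state=0).fit(X_tfidf)
nmf_doc_topics = NMF(n_components=11, random_state=0).fit_transform(X_tfidf)
lda_doc_topics = LatentDirichletAllocation(n_components=11, random_state=0).fit_transform(X_counts)

# compare each method's hard assignments against the ground-truth labels
print("K-Means ARI:", adjusted_rand_score(y_true, km.labels_))
print("NMF ARI:", adjusted_rand_score(y_true, nmf_doc_topics.argmax(axis=1)))
print("LDA ARI:", adjusted_rand_score(y_true, lda_doc_topics.argmax(axis=1)))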

Step 6: Improved Classification Techniques

Here are some steps to improve the text model:

  • Improve the class definition for REQUESTTYPE using the results of the clustering and of the previous classification model.
  • Re-assign labels using either the clustering results or keywords found during data exploration (a hypothetical sketch follows this list).
  • Deal with the large “other” category by applying topic modeling and clustering techniques.
  • Report accuracy using the macro-average f1 score.
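A hypothetical sketch of keyword-based relabeling; the keyword-to-label map below is invented for illustration (in practice it would come from data exploration) and it assumes the consolidated df['code'] labels from Step 3:

keyword_to_label = {'bike': 'bike facilities dont exist or need improvement',
                    'speed': 'speeding',
                    'signal': 'wait for walk signal',
                    'sidewalk': 'ramps or sidewalks'}

def relabel(comment, current_label):
    # move an 'other' complaint to a concrete class when a keyword matches
    if current_label != 'other':
        return current_label
    for keyword, label in keyword_to_label.items():
        if keyword in comment.lower():
            return label
    return current_label

df['code'] = [relabel(c, l) for c, l in zip(df['COMMENTS'], df['code'])]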

Step 7: Doc2Vec Models

Use a word embedding representation like word2vec for Step 3 and/or Step 6 (a minimal sketch follows the list):

  • Read and preprocess text
  • Train model
  • Assess model
  • Test model
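A minimal Doc2Vec sketch, assuming a gensim 3.x-style API and the X_train split from Step 1:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tag each training comment with its position so its vector can be looked up later
docs = [TaggedDocument(words=text.lower().split(), tags=[i])
        for i, text in enumerate(X_train)]
model = Doc2Vec(docs, vector_size=50, window=5, min_count=2, epochs=20)

# embed an unseen complaint and retrieve the most similar training complaints
vec = model.infer_vector("people speed down this street".split())
print(model.docvecs.most_similar([vec], topn=3))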

Step 8: Visualizing Complaints in Boston Map

Visualize the geographic distribution of complaints (a short sketch follows the list) using:

  • Scatterplot
  • Heatmap
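A short matplotlib sketch using the X (longitude) and Y (latitude) columns already in df; the bin count is arbitrary:

plt.scatter(df['X'], df['Y'], s=2, alpha=0.3, color='orange')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Complaint locations in Boston')
plt.show()

# a 2D histogram works as a simple density heatmap
plt.hist2d(df['X'], df['Y'], bins=100, cmap='hot')
plt.colorbar(label='complaint count')
plt.show()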

Main Analysis

Step 1: Data Preparation & Preprocessing

Steps: load the data, visualize the class distribution, and clean up the target labels; some categories have been arbitrarily split and need to be consolidated.

Load data

In [1]:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import time
#import shapely
#import geopandas as gpd
from scipy import ndimage
from pandas.io.gbq import read_gbq
plt.rcParams["figure.dpi"] = 100
pylab.rcParams['figure.figsize'] = 8, 6
np.set_printoptions(precision=3, suppress=True)
In [2]:
df = pd.read_csv("Vision_Zero_Entry.csv")
print(df.shape)
df.head()
(8528, 11)
Out[2]:
X Y OBJECTID GLOBALID REQUESTID REQUESTTYPE REQUESTDATE STATUS STREETSEGID COMMENTS USERTYPE
0 -71.058698 42.343489 14807 NaN 14807.0 bike facilities don't exist or need improvement 2016-01-19T14:43:50.000Z Unassigned 0 Broadway Bridge is wide & off highway ramps. V... bikes
1 -71.054144 42.354168 14808 NaN 14808.0 of something that is not listed here 2016-01-19T14:48:45.000Z Unassigned 0 This intersection is dangerous. Cars don't fol... bikes
2 -71.099480 42.339384 14809 NaN 14809.0 people don't yield while going straight 2016-01-19T14:57:03.000Z Unassigned 0 It's terrifying to walk over here. It seems li... walks
3 -71.066565 42.349365 14810 NaN 14810.0 it’s hard to see / low visibility 2016-01-19T15:36:25.000Z Unassigned 0 cars coming around the corner of this wide one... walks
4 -71.114414 42.301993 14811 NaN 14811.0 people don't yield while turning 2016-01-19T21:26:54.000Z Unassigned 0 as you come off the bike path, it's unclear ho... bikes
In [3]:
# check data types
df.dtypes
Out[3]:
X              float64
Y              float64
OBJECTID         int64
GLOBALID       float64
REQUESTID      float64
REQUESTTYPE     object
REQUESTDATE     object
STATUS          object
STREETSEGID      int64
COMMENTS        object
USERTYPE        object
dtype: object

Clean up the target label REQUESTTYPE

In [4]:
# original version:
np.unique(df['REQUESTTYPE'])
Out[4]:
array([ '" src="images/01 - Not enough time to cross.png"></span>&nbsp;there\'s not enough time to cross the street',
       '" src="images/02 - Wait is too long.png"></span>&nbsp;the wait for the "Walk" signal is too long',
       '" src="images/06 - Speeding.png"></span>&nbsp;people speed',
       '" src="images/10 - Hard to see.png"></span>&nbsp;it’s hard to see / low visibility',
       '" src="images/11 - Sidewalk issue.png"></span>&nbsp;sidewalks/ramps don\'t exist or need improvement',
       '" src="images/12 - Bike facility issue.png"></span>&nbsp;the roadway surface needs improvement',
       '" src="images/14 - Other issue.png"></span>&nbsp;of something that is not listed here',
       "bike facilities don't exist or need improvement",
       "it's too far / too many lanes to cross",
       'it’s hard for people to see each other',
       'it’s hard to see / low visibility',
       'of something that is not listed here',
       'people are not given enough time to cross the street',
       'people cross away from the crosswalks',
       "people don't yield while going straight",
       "people don't yield while turning",
       'people double park their vehicles',
       'people have to cross too many lanes / too far',
       'people have to wait too long for the "Walk" signal',
       'people run red lights / stop signs', 'people speed',
       "sidewalks/ramps don't exist or need improvement",
       'the roadway surface needs improvement',
       'the roadway surface needs maintenance',
       'the wait for the "Walk" signal is too long',
       'there are no bike facilities or they need maintenance',
       'there are no sidewalks or they need maintenance',
       "there's not enough time to cross the street"], dtype=object)
In [5]:
# clean version:
for ind, word in enumerate(df['REQUESTTYPE']):
    if 'src="images' in word:
        # use .loc to avoid pandas' SettingWithCopyWarning from chained indexing
        df.loc[ind, 'REQUESTTYPE'] = word[word.index('&nbsp;') + 6:]

for word in np.unique(df['REQUESTTYPE']):
    print(word)
bike facilities don't exist or need improvement
it's too far / too many lanes to cross
it’s hard for people to see each other
it’s hard to see / low visibility
of something that is not listed here
people are not given enough time to cross the street
people cross away from the crosswalks
people don't yield while going straight
people don't yield while turning
people double park their vehicles
people have to cross too many lanes / too far
people have to wait too long for the "Walk" signal
people run red lights / stop signs
people speed
sidewalks/ramps don't exist or need improvement
the roadway surface needs improvement
the roadway surface needs maintenance
the wait for the "Walk" signal is too long
there are no bike facilities or they need maintenance
there are no sidewalks or they need maintenance
there's not enough time to cross the street

Remove duplicates

In removing duplicates, we look at all the data, paying careful attention to the REQUESTTYPE, COMMENTS, and USERTYPE columns.

In [6]:
df['is_duplicated'] = df.duplicated(['REQUESTTYPE','COMMENTS','USERTYPE'])
df = df.loc[df['is_duplicated'] == False]
df = df.drop(["is_duplicated"], axis = 1)
df.shape
Out[6]:
(6519, 11)

Visualize class distributions

In [7]:
from collections import Counter

cnt = Counter()
for word in df['REQUESTTYPE'].values:  # .values instead of the deprecated .as_matrix()
    cnt[word] += 1
cnt
Out[7]:
Counter({"bike facilities don't exist or need improvement": 699,
         "it's too far / too many lanes to cross": 86,
         'it’s hard for people to see each other': 27,
         'it’s hard to see / low visibility': 387,
         'of something that is not listed here': 1404,
         'people are not given enough time to cross the street': 10,
         'people cross away from the crosswalks': 259,
         "people don't yield while going straight": 261,
         "people don't yield while turning": 455,
         'people double park their vehicles': 427,
         'people have to cross too many lanes / too far': 27,
         'people have to wait too long for the "Walk" signal': 31,
         'people run red lights / stop signs': 653,
         'people speed': 744,
         "sidewalks/ramps don't exist or need improvement": 306,
         'the roadway surface needs improvement': 218,
         'the roadway surface needs maintenance': 35,
         'the wait for the "Walk" signal is too long': 202,
         'there are no bike facilities or they need maintenance': 124,
         'there are no sidewalks or they need maintenance': 39,
         "there's not enough time to cross the street": 125})
In [8]:
labels, values = zip(*Counter(df['REQUESTTYPE']).items())

inds = np.argsort(values)
sorted_values = []
sorted_labels = []
for i in inds:
    sorted_values.append(values[i])
    sorted_labels.append(labels[i])

labels = tuple(sorted_labels)
values = tuple(sorted_values)
indexes = np.arange(len(labels))
width = 1

plt.barh(indexes, values, width, color = "orange", edgecolor='white')
plt.yticks(indexes + width -1, labels)
plt.show()

Split into training and test sets

In [20]:
df1 = df.copy()
df1 = df1[~df1['COMMENTS'].isnull()]
y1 = df1['REQUESTTYPE']
X1 = df1['COMMENTS']
In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X1, y1)

Step 2: Bag-of-Word Models

Steps: Run a baseline multi-class classification model using a bag-of-word approach, report macro f1-score (should be above .5), visualize the confusion matrix, and interpret the mistakes made by the model.

In [22]:
# Scikit import
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import scale, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
import re
In [23]:
vect = CountVectorizer(token_pattern = r"\b\w\w+\b")
vect.fit(df['REQUESTTYPE'])
blacklist = vect.get_feature_names()
print(blacklist)
['are', 'away', 'bike', 'cross', 'crosswalks', 'don', 'double', 'each', 'enough', 'exist', 'facilities', 'far', 'for', 'from', 'given', 'going', 'hard', 'have', 'here', 'improvement', 'is', 'it', 'lanes', 'lights', 'listed', 'long', 'low', 'maintenance', 'many', 'need', 'needs', 'no', 'not', 'of', 'or', 'other', 'park', 'people', 'ramps', 'red', 'roadway', 'run', 'see', 'sidewalks', 'signal', 'signs', 'something', 'speed', 'stop', 'straight', 'street', 'surface', 'that', 'the', 'their', 'there', 'they', 'time', 'to', 'too', 'turning', 'vehicles', 'visibility', 'wait', 'walk', 'while', 'yield']
In [24]:
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("Vocabulary content:\n {}".format(vect.vocabulary_))
Vocabulary size: 67
Vocabulary content:
 {'bike': 2, 'facilities': 10, 'don': 5, 'exist': 9, 'or': 34, 'need': 29, 'improvement': 19, 'of': 33, 'something': 46, 'that': 52, 'is': 20, 'not': 32, 'listed': 24, 'here': 18, 'people': 37, 'yield': 66, 'while': 65, 'going': 15, 'straight': 49, 'it': 21, 'hard': 16, 'to': 58, 'see': 42, 'low': 26, 'visibility': 62, 'turning': 60, 'double': 6, 'park': 36, 'their': 54, 'vehicles': 61, 'the': 53, 'wait': 63, 'for': 12, 'walk': 64, 'signal': 44, 'too': 59, 'long': 25, 'sidewalks': 43, 'ramps': 38, 'speed': 47, 'cross': 3, 'away': 1, 'from': 13, 'crosswalks': 4, 'there': 55, 'enough': 8, 'time': 57, 'street': 50, 'far': 11, 'many': 28, 'lanes': 22, 'run': 41, 'red': 39, 'lights': 23, 'stop': 48, 'signs': 45, 'roadway': 40, 'surface': 51, 'needs': 30, 'are': 0, 'no': 31, 'they': 56, 'maintenance': 27, 'have': 17, 'each': 7, 'other': 35, 'given': 14}
In [25]:
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("All features:\n{}".format(feature_names[:]))
Number of features: 67
All features:
['are', 'away', 'bike', 'cross', 'crosswalks', 'don', 'double', 'each', 'enough', 'exist', 'facilities', 'far', 'for', 'from', 'given', 'going', 'hard', 'have', 'here', 'improvement', 'is', 'it', 'lanes', 'lights', 'listed', 'long', 'low', 'maintenance', 'many', 'need', 'needs', 'no', 'not', 'of', 'or', 'other', 'park', 'people', 'ramps', 'red', 'roadway', 'run', 'see', 'sidewalks', 'signal', 'signs', 'something', 'speed', 'stop', 'straight', 'street', 'surface', 'that', 'the', 'their', 'there', 'they', 'time', 'to', 'too', 'turning', 'vehicles', 'visibility', 'wait', 'walk', 'while', 'yield']

Vocabulary size and the number of feature names are consistent.

In [26]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

"""
Best classifiers as per the internet for text classification

BernoulliNB
GaussianNB
MultinomialNB

LinearSVC
PolynomialSVC
RbfSVC
NuSVC 

TfidfTransformer used in pipeline
"""
def standard_approach(X, y,
                      vectorizer=CountVectorizer(token_pattern=r"\b\w\w+\b"),
                      classifier=LogisticRegression(),
                      scaling=None):
    """Cross-validate a vectorizer/classifier pipeline and report test accuracy.

    Note: cross-validation uses the X/y arguments, while the final score is the
    accuracy on the global X_test/y_test split created above.
    """
    if scaling is not None:
        pipe = make_pipeline(vectorizer, scaling, classifier)
    else:
        pipe = make_pipeline(vectorizer, classifier)

    scores = cross_val_score(pipe, X, y, cv=5, scoring='f1_macro')
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    test_acc = np.mean(y_pred == y_test)  # plain test-set accuracy

    return scores, np.mean(scores), test_acc
In [27]:
print('First 5 lines of X_train:\n\n',X_train[:5],'\n')
print('Unique labels in y_train:\n\n', np.unique(y_train))
First 5 lines of X_train:

 3631    Drivers run the lights and give us even less t...
4008    There is a stretch of road about 75 feet with ...
448     You think there would be a stop sign here but ...
2517    The transition from the bike lane on the far r...
6119    When coming inbound on Boylston and turning ri...
Name: COMMENTS, dtype: object 

Unique labels in y_train:

 ["bike facilities don't exist or need improvement"
 "it's too far / too many lanes to cross"
 'it’s hard for people to see each other'
 'it’s hard to see / low visibility' 'of something that is not listed here'
 'people are not given enough time to cross the street'
 'people cross away from the crosswalks'
 "people don't yield while going straight"
 "people don't yield while turning" 'people double park their vehicles'
 'people have to cross too many lanes / too far'
 'people have to wait too long for the "Walk" signal'
 'people run red lights / stop signs' 'people speed'
 "sidewalks/ramps don't exist or need improvement"
 'the roadway surface needs improvement'
 'the roadway surface needs maintenance'
 'the wait for the "Walk" signal is too long'
 'there are no bike facilities or they need maintenance'
 'there are no sidewalks or they need maintenance'
 "there's not enough time to cross the street"]
In [41]:
names = ['BernoulliNB', 'MultinomialNB', 'LogisticRegression', 'LinearSVC']
classifiers = [BernoulliNB(), MultinomialNB(), LogisticRegression(), LinearSVC()]

print('Using macro F1 for CV; reporting test accuracy:\n')
# without scaling for now
for i in range(len(names)):
    scores, m, t = standard_approach(X_train, y_train, classifier=classifiers[i])
#     print("Mean CV macro F1 of %s with Standard Approach:" % names[i], m)
    print("Test accuracy of %s with Standard Approach:" % names[i], t)
Using macro F1 for CV; reporting test accuracy:

(UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples; repeated warnings omitted)
Test accuracy of BernoulliNB with Standard Approach: 0.404334365325
Test accuracy of MultinomialNB with Standard Approach: 0.483591331269
Test accuracy of LogisticRegression with Standard Approach: 0.525696594427
Test accuracy of LinearSVC with Standard Approach: 0.480495356037

Based on the baseline models above, our highest-performing model is logistic regression; we take a closer look at its scores below.

F-1, Precision, and Recall Scores

In [34]:
from sklearn.metrics import classification_report

pipe = make_pipeline(CountVectorizer(token_pattern = r"\b\w\w+\b"),
                     LogisticRegression())

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print("Test set predictions:\n {}\n".format(y_pred))
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))
print(classification_report(y_test, y_pred))

# reindexed map of the 11 consolidated classes, referenced when interpreting results below
complaints_map_reindexed = {
                0 : 'J-walking',
                1 : 'bike facilities dont exist or need improvement',
                2 : 'double parking',
                3 : 'failure to yield',
                4 : 'low visibility',
                5 : 'other',
                6 : 'ramps or sidewalks',
                7 : 'roadway improvement',
                8 : 'speeding',
                9 : 'too far or too many lanes to cross or not enough time',
                10 : 'wait for walk signal'
                }
Test set predictions:
 ['people speed' "bike facilities don't exist or need improvement"
 'of something that is not listed here' ..., 'people speed'
 'of something that is not listed here' 'people speed']

Test set score: 0.53
                                                       precision    recall  f1-score   support

      bike facilities don't exist or need improvement       0.51      0.67      0.58       167
               it's too far / too many lanes to cross       0.33      0.05      0.08        21
               it’s hard for people to see each other       0.00      0.00      0.00         6
                    it’s hard to see / low visibility       0.66      0.62      0.64       101
                 of something that is not listed here       0.47      0.53      0.50       356
 people are not given enough time to cross the street       0.00      0.00      0.00         1
                people cross away from the crosswalks       0.40      0.38      0.39        65
              people don't yield while going straight       0.33      0.27      0.30        62
                     people don't yield while turning       0.38      0.39      0.39        97
                    people double park their vehicles       0.73      0.72      0.73       100
        people have to cross too many lanes / too far       0.00      0.00      0.00         6
   people have to wait too long for the "Walk" signal       0.00      0.00      0.00         3
                   people run red lights / stop signs       0.60      0.74      0.66       152
                                         people speed       0.62      0.65      0.64       198
      sidewalks/ramps don't exist or need improvement       0.45      0.47      0.46        73
                the roadway surface needs improvement       0.61      0.44      0.51        61
                the roadway surface needs maintenance       0.00      0.00      0.00        11
           the wait for the "Walk" signal is too long       0.44      0.42      0.43        45
there are no bike facilities or they need maintenance       0.50      0.05      0.09        40
      there are no sidewalks or they need maintenance       0.00      0.00      0.00         9
          there's not enough time to cross the street       0.53      0.20      0.29        41

                                          avg / total       0.51      0.53      0.51      1615

/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

Confusion Matrix

In [37]:
from sklearn.metrics import confusion_matrix

conf_arr = confusion_matrix(y_test, y_pred)
In [43]:
plt.rcParams["figure.dpi"] = 100
pylab.rcParams['figure.figsize'] = 12, 12

# normalize each row of the confusion matrix so rows sum to one (for the color scale)
norm_conf = []
for row in conf_arr:
    total = row.sum()
    norm_conf.append([float(v) / float(total) for v in row])

fig = plt.figure()
plt.clf()
ax = fig.add_subplot(111)
ax.set_aspect(1)
res = ax.imshow(np.array(norm_conf), cmap="Blues", interpolation='nearest')

width, height = conf_arr.shape
for x in range(width):
    for y in range(height):
        ax.annotate(str(conf_arr[x][y]), xy=(y, x),
                    horizontalalignment='center',
                    verticalalignment='center')

cb = fig.colorbar(res)
plt.xticks(range(width))
plt.yticks(range(height))
plt.show()

complaints_map_origin = {
        0 : 'bike facilities dont exist or need improvement',
        1 : 'its too far too many lanes to cross',
        2 : 'its hard for people to see each other',
        3 : 'its hard to see or low visibility',
        4 : 'of something that is not listed here',
        5 : 'people are not given enough time to cross the street',
        6 : 'people cross away from the crosswalks',
        7 : 'people dont yield while going straight',
        8 : 'people dont yield while turning',
        9 : 'people double park their vehicles',
        10 : 'people have to cross too many lanes or too far',
        11 : 'people have to wait too long for the Walk signal',
        12 : 'people run red lights or stop signs',
        13 : 'people speed',
        14 : 'sidewalks or ramps dont exist or need improvement',
        15 : 'the roadway surface needs improvement',
        16 : 'the roadway surface needs maintenance',
        17 : 'the wait for the "Walk" signal is too long',
        18 : 'no bike facilities or they need maintenance',
        19 : 'no sidewalks or they need maintenance',
        20 : 'not enough time to cross the street'
        }
        
print('Complaints Index Map:')
for i in complaints_map_origin:
    print(i,":",complaints_map_origin[i])
Complaints Index Map:
0 : bike facilities dont exist or need improvement
1 : its too far too many lanes to cross
2 : its hard for people to see each other
3 : its hard to see or low visibility
4 : of something that is not listed here
5 : people are not given enough time to cross the street
6 : people cross away from the crosswalks
7 : people dont yield while going straight
8 : people dont yield while turning
9 : people double park their vehicles
10 : people have to cross too many lanes or too far
11 : people have to wait too long for the Walk signal
12 : people run red lights or stop signs
13 : people speed
14 : sidewalks or ramps dont exist or need improvement
15 : the roadway surface needs improvement
16 : the roadway surface needs maintenance
17 : the wait for the "Walk" signal is too long
18 : no bike facilities or they need maintenance
19 : no sidewalks or they need maintenance
20 : not enough time to cross the street

With a final test accuracy of 0.53, the confusion matrix above reveals several systematic mistakes, which drag the macro scores down. For instance, 26 complaints of class 0 (bike facilities don't exist or need improvement) are misclassified as class 4 (of something that is not listed here), and 22 complaints of class 18 (there are no bike facilities or they need maintenance) are misclassified as class 0. Several classes also show zeros on the diagonal, meaning they are never predicted correctly, i.e., false negatives.

There are also misclassifications out of the large 'other' category: 13 of its complaints are predicted as sidewalk/ramp issues and 28 as bike-facility complaints. Mistakes likewise cluster around crossing-related complaints, where 12 are misclassified as failure to yield and 5 as walk-signal complaints.
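To surface these mistakes more systematically, here is a small sketch that ranks the largest off-diagonal entries of conf_arr against the index map above:

# rank the largest off-diagonal (misclassification) counts in the confusion matrix
errors = sorted(((conf_arr[i, j], i, j)
                 for i in range(conf_arr.shape[0])
                 for j in range(conf_arr.shape[1]) if i != j),
                reverse=True)
for count, true_class, pred_class in errors[:5]:
    print(count, ':', complaints_map_origin[true_class], '->', complaints_map_origin[pred_class])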

Step 3: N-Gram-based Models

Steps: Improve the model using more complex text features, including n-grams, character n-grams and possibly domain-specific features.

In [47]:
# clean version:

plt.rcParams["figure.dpi"] = 100
pylab.rcParams['figure.figsize'] = 8, 3

type_to_key = [0, 1, 2, 2, 3, 1, 4, 5, 5, 6, 1, 7, 5, 8, 9, 10, 10, 7, 0, 9, 1]
complaints_map = {
                0 : 'bike facilities dont exist or need improvement',
                1 : 'too far or too many lanes to cross or not enough time',
                2 : 'low visibility',
                3 : 'other',
                4 : 'J-walking',
                5 : 'failure to yield',
                6 : 'double parking',
                7 : 'wait for walk signal',
                8 : 'speeding',
                9 : 'ramps or sidewalks',
                10 : 'roadway improvement'
                }

df = df[~df['COMMENTS'].isnull()].copy()  # .copy() prevents SettingWithCopyWarning below
df['code'] = df['REQUESTTYPE']
df['y_true'] = df['REQUESTTYPE']


for i, line in enumerate(np.unique(df['REQUESTTYPE'])):
    # use .loc to avoid chained-indexing assignment
    df.loc[df['REQUESTTYPE'] == line, 'code'] = complaints_map[type_to_key[i]]
    df.loc[df['REQUESTTYPE'] == line, 'y_true'] = type_to_key[i]


y = df['code']
X = df['COMMENTS']
y_true = list(df['y_true'])


labels, values = zip(*Counter(y).items())
inds = np.argsort(values)
sorted_values = []
sorted_labels = []

for i in inds:
    sorted_values.append(values[i])
    sorted_labels.append(labels[i])

labels = tuple(sorted_labels)
values = tuple(sorted_values)
indexes = np.arange(len(labels))
width = 1

plt.barh(indexes, values, width, color = "orange", edgecolor='white')
plt.yticks(indexes + width -1, labels)
plt.show()

Note: we just grouped the original 21 categories down to 11 by merging those with the same meaning.

N-Grams Count Vectorizer

In [48]:
# 1-gram

vectorizer = CountVectorizer(token_pattern = r"\b\w\w+\b", ngram_range=(1, 1)).fit(X_train)
print("Vocabulary size: {}".format(len(vectorizer.vocabulary_)))

pipe = make_pipeline(vectorizer, LogisticRegression())
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring = 'f1_macro')
print('Training set CV scores:',str(scores),'with mean of', str(np.mean(scores)))

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))

# print("Vocabulary:\n{}".format(cv.get_feature_names()))
Vocabulary size: 5759
Training set CV scores: [ 0.363  0.338  0.366  0.353  0.368] with mean of 0.357593470486
Test set score: 0.53

As expected, using only unigrams does not change our baseline result, since CountVectorizer already defaults to ngram_range=(1, 1).

In [46]:
# 2-gram

vectorizer = CountVectorizer(token_pattern = r"\b\w\w+\b", ngram_range=(2, 2)).fit(X_train)
print("Vocabulary size: {}".format(len(vectorizer.vocabulary_)))

pipe = make_pipeline(vectorizer, LogisticRegression())
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring = 'f1_macro')
print('Training set CV scores:',str(scores),'with mean of', str(np.mean(scores)))

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))

# print("Vocabulary:\n{}".format(cv.get_feature_names()))
Vocabulary size: 43874
Training set CV scores: [ 0.32   0.31   0.347  0.306  0.35 ] with mean of 0.326527079487
Test set score: 0.51
In [218]:
# 3-gram

vectorizer = CountVectorizer(token_pattern = r"\b\w\w+\b", stop_words="english", ngram_range=(3, 3)).fit(X_train)
print("Vocabulary size: {}".format(len(vectorizer.vocabulary_)))

pipe = make_pipeline(vectorizer, LogisticRegression())
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring = 'f1_macro')
print('Training set CV scores:',str(scores),'with mean of', str(np.mean(scores)))

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))

# print("Vocabulary:\n{}".format(cv.get_feature_names()))
Vocabulary size: 48935
Training set CV scores: [ 0.207  0.222  0.223  0.212  0.21 ] with mean of 0.214877233219
Test set score: 0.39

The higher the n value we use for pure n-grams, the worse the scores become. This makes sense: single words are usually discriminative enough, while exact two- or three-word sequences are much rarer and harder to match across documents, as the short sketch below illustrates. We'll then combine unigrams, bigrams, and trigrams in a single vectorizer.
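For intuition, here is a quick sketch (a single toy complaint, purely illustrative) showing what each n-gram range extracts:

from sklearn.feature_extraction.text import CountVectorizer

toy_doc = ["people run red lights at this corner"]
for ngram_range in [(1, 1), (2, 2), (3, 3)]:
    cv = CountVectorizer(token_pattern=r"\b\w\w+\b", ngram_range=ngram_range).fit(toy_doc)
    print(ngram_range, cv.get_feature_names())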

In [231]:
# using unigrams, bigrams, and trigrams

vectorizer = CountVectorizer(token_pattern = r"\b\w\w+\b", stop_words="english", ngram_range=(1, 3)).fit(X_train)
print("Vocabulary size: {}".format(len(vectorizer.vocabulary_)))

pipe = make_pipeline(vectorizer, LogisticRegression())
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring = 'f1_macro')
print('Training set CV scores:',str(scores),'with mean of', str(np.mean(scores)))

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))
Vocabulary size: 90007
Training set CV scores: [ 0.587  0.587  0.588  0.563  0.589] with mean of 0.582732602352
Test set score: 0.62

Using the combination of n = 1, 2, and 3 gives us the highest F1 score so far. However, the grid search below takes a long time to run because of the relatively large grid and the inclusion of trigrams.

To improve our model further, we now use a grid search to find the best parameters, keeping logistic regression since it handles sparse features well.

Tfidf Transformation

We apply the term frequency–inverse document frequency (tf–idf) transformation to rescale features by how informative we expect them to be. It gives high weight to a term that appears often in a particular document but in few documents overall, since such a term is likely to be very descriptive of that document's content.

  • tf(t, d) := frequency of term t in document d
  • df(d, t) := number of documents containing term t
  • $n_d$ := total number of documents

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

$$\mathrm{idf}(t) = \log\frac{1 + n_d}{1 + \mathrm{df}(d, t)} + 1$$

TfidfTransformer takes the sparse matrix produced by CountVectorizer and rescales it, while TfidfVectorizer takes the raw text and performs both the bag-of-words feature extraction and the tf–idf transformation in one step.
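As a minimal toy illustration of the transformation (a hypothetical three-document corpus, not the project data):

from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["cars speed here", "cars double park here", "speed speed speed"]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy).toarray()
print(tfidf.get_feature_names())   # ['cars', 'double', 'here', 'park', 'speed']
print(weights.round(2))            # 'speed' dominates the third document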

In [232]:
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
import mglearn

standard_approach(X_train, y_train, scaling=TfidfTransformer())
Out[232]:
(array([ 0.5  ,  0.535,  0.554,  0.532,  0.549]),
 0.53402657111420437,
 0.61981424148606812)

We will tune the regularization parameter C of LogisticRegression via cross-validation to improve our score.

In [233]:
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(TfidfVectorizer(min_df=5, token_pattern = r"\b\w\w+\b", stop_words="english"),
                     LogisticRegression())
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
"tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring = 'f1_macro')
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters:\n{}".format(grid.best_params_))

y_pred = grid.predict(X_test)
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))
Best cross-validation score: 0.58
Best parameters:
{'logisticregression__C': 10, 'tfidfvectorizer__ngram_range': (1, 2)}
Test set score: 0.60

That did not improve our score. Let's try the (default) token pattern without stop-word filtering to see whether it helps, again tuning the regularization parameter C of LogisticRegression via cross-validation.

In [234]:
pipe = make_pipeline(TfidfVectorizer(min_df=5, token_pattern = r"\b\w\w+\b"), LogisticRegression())
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
"tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}

grid_m2 = GridSearchCV(pipe, param_grid, cv=5, scoring = 'f1_macro')
grid_m2.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid_m2.best_score_))
print("Best parameters:\n{}".format(grid_m2.best_params_))

y_pred_m2 = grid_m2.predict(X_test)
print("Test set score: {:.2f}".format(np.mean(y_pred_m2 == y_test)))
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
Best cross-validation score: 0.59
Best parameters:
{'logisticregression__C': 10, 'tfidfvectorizer__ngram_range': (1, 3)}
Test set score: 0.62

Tuning with grid search selects a different parameter setting, but this time we improve our score by 2% using a simple TfidfVectorizer without our previous token_pattern or stop words.
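
For reference, here is a minimal sketch of the kind of pipeline and grid that produces the output above; the exact grid values are an assumption, and the step names follow scikit-learn's make_pipeline naming convention.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())

# assumed grid; the reported best was C=10 with ngram_range=(1, 3)
param_grid = {'logisticregression__C': [0.1, 1, 10],
              'tfidfvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='f1_macro')
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters:")
print(grid.best_params_)
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))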

Normalization

We now want to apply a more advanced way of extracting a normal form of each word and see whether this improves our model. We will first load spaCy's English-language models, instantiate NLTK's Porter stemmer, and define functions that let us compare lemmatization in spaCy with stemming in NLTK, as follows.

Lemmatization

In [227]:
# using Lemmatization

import re
import spacy
import nltk

# regexp matching CountVectorizer's default token pattern
regexp = re.compile('(?u)\\b\\w\\w+\\b')

# load spaCy's English model and replace its tokenizer with our regexp-based one
# (tokens_from_list and the entity/parse keywords below are spaCy 1.x API)
en_nlp = spacy.load('en')
old_tokenizer = en_nlp.tokenizer
en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(regexp.findall(string))

def lemmatization(document):
    # run spaCy without the entity recognizer or parser and return the lemmas
    doc_spacy = en_nlp(document, entity=False, parse=False)
    return [token.lemma_ for token in doc_spacy]

vectorizer = TfidfVectorizer(min_df=5, tokenizer=lemmatization,
                             ngram_range=(1, 2)).fit(X_train)
print("Vocabulary size: {}".format(len(vectorizer.vocabulary_)))

pipe = make_pipeline(vectorizer, LogisticRegression(C=10))
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1_macro')
print('Training set CV scores:', str(scores), 'with mean of', str(np.mean(scores)))

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))
Vocabulary size: 2
/Users/moorissatjokro/anaconda/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
(warning repeated for each cross-validation fold)
Training set CV scores: [ 0.059  0.061  0.052  0.064  0.05 ] with mean of 0.057174219462
Test set score: 0.24

A vocabulary of only 2 terms is far too small to learn from, which suggests the custom tokenizer did not interact well with the vectorizer here and explains the near-zero CV scores. We will check whether stemming fares better.
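
One quick way to diagnose this would be to inspect what the lemmatization function actually returns on a sample document; the complaint text below is made up for illustration.

# hypothetical sample complaint, just to inspect the lemmatizer's output
print(lemmatization("Cars are speeding through the crosswalk every morning"))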

Stemming

In [228]:
# using Stemming
stemmer = nltk.stem.PorterStemmer()

def stemming(document):
    # tokenize with spaCy, then stem each normalized, lowercased token
    doc_spacy = en_nlp(document)
    return [stemmer.stem(token.norm_.lower()) for token in doc_spacy]

vectorizer = TfidfVectorizer(min_df=5, tokenizer=stemming,
                             ngram_range=(1, 2)).fit(X_train)
print("Vocabulary size: {}".format(len(vectorizer.vocabulary_)))

pipe = make_pipeline(vectorizer, LogisticRegression(C=10))
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring = 'f1_macro')
print('Training set CV scores:',str(scores),'with mean of', str(np.mean(scores)))

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))
Vocabulary size: 5045
Training set CV scores: [ 0.598  0.584  0.59   0.575  0.629] with mean of 0.59495713456
Test set score: 0.63

We just saw that stemming outperforms lemmatization here, but the model still does not improve over the plain TfidfVectorizer without normalization. So in this case we will proceed with our best model for the next step: TfidfVectorizer with min_df of 5 and ngram_range of (1, 3), and logistic regression with C=10, with a best cross-validation score of 0.59 and a test score of 0.63.

Step 4: Result Visualizations

Tasks: Visualize results of the tuned model (classification results, confusion matrix, important features, example mistakes).

Classification Results

In [133]:
print("Test set predictions:\n {}\n".format(y_pred_m2))
print("Test set score: {:.2f}".format(np.mean(y_pred_m2 == y_test)))
print(classification_report(y_test, y_pred_m2))
Test set predictions:
 ['speeding' 'roadway improvement' 'low visibility' ..., 'double parking'
 'failure to yield' 'speeding']

Test set score: 0.63
                                                       precision    recall  f1-score   support

                                            J-walking       0.61      0.36      0.45        61
       bike facilities dont exist or need improvement       0.71      0.75      0.73       216
                                       double parking       0.84      0.70      0.76       108
                                     failure to yield       0.66      0.71      0.69       343
                                       low visibility       0.62      0.57      0.59       106
                                                other       0.46      0.55      0.50       324
                                   ramps or sidewalks       0.66      0.64      0.65        83
                                  roadway improvement       0.79      0.49      0.60        76
                                             speeding       0.67      0.70      0.68       182
too far or too many lanes to cross or not enough time       0.52      0.31      0.39        51
                                 wait for walk signal       0.73      0.62      0.67        65

                                          avg / total       0.64      0.63      0.63      1615

We see here that our precision of 0.64 is the metric to use when the goal is to limit false positives, that is, TP/(TP+FP). Precision is relatively good in our model, which tells us the model does not produce many false positives. Our recall score of 0.63 measures how well we identify all positive samples and avoid false negatives, that is, TP/(TP+FN). Finally, our f1-score of 0.63 is the harmonic mean of precision and recall, summarizing both aspects of the model's performance in a single number.
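
As a toy illustration of these formulas (with made-up binary labels, not our data):

from sklearn.metrics import precision_score, recall_score, f1_score

# hypothetical labels: 3 TP, 1 FN, 1 FP, 1 TN
y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0]

print(precision_score(y_true_toy, y_pred_toy))  # TP/(TP+FP) = 3/4 = 0.75
print(recall_score(y_true_toy, y_pred_toy))     # TP/(TP+FN) = 3/4 = 0.75
print(f1_score(y_true_toy, y_pred_toy))         # harmonic mean, also 0.75 here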

Tuned Parameters

In [141]:
scores = grid_m2.cv_results_['mean_test_score'].reshape(-1, 3).T

heatmap = mglearn.tools.heatmap(
    scores, xlabel="C", ylabel="ngram_range", cmap="viridis", fmt="%.3f",
    xticklabels=param_grid['logisticregression__C'],
    yticklabels=param_grid['tfidfvectorizer__ngram_range'])
plt.colorbar(heatmap)
plt.show()

The heatmap above is consistent with our result of 0.63, obtained with (1, 3) as the ngram range and 10 as the C value for the logistic regression: that cell is highlighted with the brightest color (highest value) in the heatmap.

Confusion matrix

In [168]:
plt.rcParams["figure.dpi"] = 100
pylab.rcParams['figure.figsize'] = 8, 8

# normalize each row of the confusion matrix so color reflects
# the fraction of each true class (rows sum to 1)
norm_conf = conf_arr / conf_arr.sum(axis=1, keepdims=True)

fig = plt.figure()
plt.clf()
ax = fig.add_subplot(111)
ax.set_aspect(1)
res = ax.imshow(norm_conf, cmap="Blues", interpolation='nearest')

# annotate each cell with its raw count
width, height = conf_arr.shape
for x in range(width):
    for y in range(height):
        ax.annotate(str(conf_arr[x][y]), xy=(y, x),
                    horizontalalignment='center',
                    verticalalignment='center')

cb = fig.colorbar(res)
plt.xticks(range(width))
plt.yticks(range(height))
plt.show()

print('Complaints Index Map:')
for i in complaints_map_reindexed:
    print(i,":",complaints_map_reindexed[i])
Complaints Index Map:
0 : J-walking
1 : bike facilities dont exist or need improvement
2 : double parking
3 : failure to yield
4 : low visibility
5 : other
6 : ramps or sidewalks
7 : roadway improvement
8 : speeding
9 : too far or too many lanes to cross or not enough time
10 : wait for walk signal

The model has an f1-score of 63%, which already tells us that we are doing reasonably well. The confusion matrix above provides some more detail: as in the binary case, each row corresponds to a true label, and each column corresponds to a predicted label.
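
For completeness, the conf_arr used in the plot was presumably computed with scikit-learn's confusion_matrix; a minimal sketch, assuming the labels are taken in sorted order so that they match the index map printed above:

from sklearn.metrics import confusion_matrix

# rows are true labels, columns are predicted labels;
# by default sklearn sorts the labels, matching the index map above
conf_arr = confusion_matrix(y_test, y_pred_m2)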

Important features

Based on parameters above, we can visualize our important features below.

In [144]:
# pipe_m2 = make_pipeline(TfidfVectorizer(min_df=5, ngram_range=(1, 3)), LogisticRegression(C=10))

vect_model2 = TfidfVectorizer(min_df=5, ngram_range=(1, 3))

X_train_model2 = vect_model2.fit_transform(X_train)
lr_model2 = LogisticRegression(C=10).fit(X_train_model2, y_train)
X_test_model2 = vect_model2.transform(X_test)

print("Model 2 score: {:.2f}".format(lr_model2.score(X_test_model2, y_test)))
Model 2 score: 0.63
In [149]:
coef_model2 = lr_model2.coef_.T  # shape (n_features, n_classes)
feature_names_model2 = np.array(vect_model2.get_feature_names())
print(coef_model2.shape, feature_names_model2.shape)
(6809, 11) (6809,)
In [150]:
# sample of the first 100 feature names:
feature_names_model2[:100]
Out[150]:
array(['10', '10 foot', '10 seconds', '100', '15', '16', '18', '1st', '20',
       '24', '25', '30', '35', '40', '40 mph', '40mph', '50', '50 mph',
       '500', '60', '90', '93', 'able', 'able to', 'about', 'about it',
       'about the', 'above', 'abruptly', 'absolutely', 'accelerate',
       'access', 'access the', 'access to', 'accessible', 'accident',
       'accident waiting', 'accident waiting to', 'accidents',
       'accidents at', 'accidents at this', 'accidents here',
       'accidents with', 'accommodate', 'across', 'across from',
       'across lanes', 'across the', 'across the bike',
       'across the street', 'across this', 'across washington',
       'activated', 'active', 'actually', 'ada', 'adams', 'adams park',
       'adams st', 'add', 'added', 'adding', 'addition', 'additional',
       'addressed', 'adhere', 'adjacent', 'adults', 'afraid', 'after',
       'after it', 'after the', 'after the light', 'afternoon',
       'afternoons', 'again', 'against', 'against the',
       'against the light', 'against traffic', 'aggressive',
       'aggressively', 'ago', 'ahead', 'ahead of', 'albany', 'albany st',
       'albany street', 'alh', 'aligned', 'alignment', 'all', 'all along',
       'all day', 'all directions', 'all four', 'all hours', 'all lanes',
       'all of', 'all of the'], 
      dtype='<U28')
In [151]:
print(np.min(coef_model2))
-7.04982052164

Across all 11 topics, using the 30 most and 30 least favorable features:

In [148]:
def plot_important_features(coef, feature_names, top_n=20, ax=None):
    if ax is None:
        ax = plt.gca()

    inds_max = np.argsort(np.max(coef, axis=1)).astype(int) # increasing order
    inds_min = np.argsort(np.min(coef, axis=1)).astype(int)
    
    low = inds_min[:top_n]
    high = inds_max[-top_n:]
    
    important_coef = np.hstack([np.min(coef, axis=1)[low],
              np.max(coef, axis=1)[high]])
    
    important = np.hstack([low, high])
    
    myrange = range(len(important))
    
    ax.bar(myrange, important_coef)
    ax.set_xticks(myrange)
    ax.set_xticklabels(feature_names[important], rotation=70, ha="right")
    
plt.figure(figsize=(15, 5))
plot_important_features(coef_model2, feature_names_model2, top_n=30)

The plot above shows the important features pooled across all 11 topics. However, the word 'bikes' appears at both ends, meaning it has a strongly positive weight for one topic and a strongly negative weight for another. Let's look at the features for each topic separately to fix this issue.

For each of the 11 topics, using the 10 most and 10 least favorable features:

In [173]:
def plot_features(coef, feature_names, top_n=20, ax=None):
    if ax is None:
        ax = plt.gca()

    # indices sorted by coefficient value: most negative first, most positive last
    inds = np.argsort(coef)
    low = inds[:top_n]
    high = inds[-top_n:]

    important_coef = np.hstack([low, high])
    myrange = range(len(important_coef))

    ax.bar(myrange, coef[important_coef])
    ax.set_xticks(myrange)
    ax.set_xticklabels(feature_names[important_coef], rotation=70, ha="right")


# one subplot per topic: 11 topics in a 4x3 grid, leaving the last axis empty
fig, axes = plt.subplots(11 // 3 + 1, 3, figsize=(18, 20))
for i, ax in enumerate(axes.flatten()):
    if i > 10:
        continue
    ax.set_title("Topic %s" % i)
    plot_features(coef_model2[:, i], feature_names_model2, top_n=10, ax=ax)
plt.tight_layout()
plt.show()

print('Complaints Index Map:')
for i in complaints_map_reindexed:
    print(i,":",complaints_map_reindexed[i])