This is a class project that predicts whether an IMDb review is positive or negative using logistic regression, and selects the best of three regularized variants (Ridge, Lasso, and Elastic-net) based on AUC.
Dataset Summary
The dataset contains 25,000 movie reviews from IMDb, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as "only consider the top 10,000 most common words, but eliminate the top 10 most common words". In this project we set the number of words to 2500, which means the model has 2500 features.
For example, below is a positive review from the dataset (the top 10 most common words are replaced by question marks):
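The filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `filter_review` and `to_features`, the out-of-vocabulary marker `0`, and the choice of a binary (present/absent) feature vector are hypothetical and are not taken from the project code.

```python
def filter_review(sequence, num_words, skip_top=0, oov=0):
    """Keep word ranks in (skip_top, num_words]; replace the rest with `oov`.

    Assumes rank 1 is the most frequent word, as in the dataset description.
    """
    return [r if skip_top < r <= num_words else oov for r in sequence]


def to_features(sequence, num_words, oov=0):
    """Turn a filtered review into a vector with one slot per kept word rank.

    Slot r - 1 is set to 1 when the word with rank r appears in the review
    (an assumed binary encoding; counts would work the same way).
    """
    features = [0] * num_words
    for r in sequence:
        if r != oov:
            features[r - 1] = 1
    return features


# A made-up review given as word ranks, for illustration only.
review = [3, 1, 2500, 12, 4000, 12]
filtered = filter_review(review, num_words=2500, skip_top=2)
vector = to_features(filtered, num_words=2500)
```

With `num_words=2500`, every review becomes a vector of length 2500, which matches the 2500 features mentioned above.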
? portrays ? day ? day ? reality ? ? on ? ? ? ? old west outstanding acting by both ? actors ? doesn't even feel like ? movie you feel like you're there animal ? should ? many scenes are obviously not just realistic they are real.
How it works
Logistic Regression is a machine learning algorithm for classification problems. It is a predictive analysis algorithm based on the concept of probability. Like Linear Regression, it computes a linear combination of the features, but it maps that value to a probability and uses a different cost function (the log loss rather than the squared error).
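As a rough sketch of what such a cost function can look like for the Ridge, Lasso, and Elastic-net variants, the code below adds a penalty term to the logistic log loss. The parameterization `lam * (alpha * L1 + (1 - alpha) / 2 * L2)` is one common convention and is an assumption here, as are the function names; the project may use a different library or scaling.

```python
import math


def log_loss(weights, bias, X, y):
    """Average logistic loss for labels y in {0, 1}."""
    total = 0.0
    for row, label in zip(X, y):
        z = bias + sum(w * x for w, x in zip(weights, row))
        # log(1 + exp(z)) - label * z, written to avoid overflow for large |z|
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - label * z
    return total / len(y)


def penalty(weights, lam, alpha):
    """alpha = 0 gives Ridge, alpha = 1 gives Lasso, values between give Elastic-net."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)


def cost(weights, bias, X, y, lam, alpha):
    """Penalized cost: data fit plus regularization (the bias is not penalized)."""
    return log_loss(weights, bias, X, y) + penalty(weights, lam, alpha)
```

The three models share the same `log_loss` term; only the `penalty` term changes with `alpha`.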
Introduction to Logistic Regression
The three algorithms in this project use different cost functions: they differ in the penalty term added to the loss. I will use the histogram plot produced by the Ridge model to illustrate how logistic regression performs classification.
The output of Logistic Regression is between 0 and 1. However, as the plot above shows, the values on the x-axis are not in that range: they are the values before the final step, in which Logistic Regression applies the logistic (sigmoid) transformation, built on the exponential function, to map each value into the range (0, 1). (Detail formula). The histogram shows the reviews in three groups in different colors, split around 0. Suppose the classification threshold is 0.5; it is not always 0.5, since the choice depends on whether the dataset is balanced. If the value of a review is larger than 0, the Logistic Regression output after the transformation is larger than 0.5, and we classify the review as positive. Conversely, if the value is less than 0, the output is less than 0.5, and we classify the review as negative. The bars around 0, where the Logistic Regression output is near 0.5, correspond to reviews that are hard for the model to classify.
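The link between the x-axis value and the predicted class can be sketched as follows. The names `sigmoid` and `classify` are illustrative, and the default threshold of 0.5 is the assumption discussed above.

```python
import math


def sigmoid(z):
    """Map any real value z into the range (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow when z is very negative
    e = math.exp(z)
    return e / (1.0 + e)


def classify(z, threshold=0.5):
    """Label a review from its value before the sigmoid transformation."""
    return "positive" if sigmoid(z) > threshold else "negative"


for z in (-2.0, -0.1, 0.1, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Because `sigmoid(0)` is exactly 0.5, thresholding the output at 0.5 is the same as splitting the x-axis values at 0. A different threshold moves that split point away from 0.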
Model Selection
We use 10-fold cross-validation to tune the parameter of each model, then select the model with the best performance based on its AUC value.
The ROC curve plots show that all three models perform well on the test dataset, and it is hard to identify the best model from the AUC values alone. We will discuss more criteria for selecting a suitable model, for example running time and random sampling error, in the Taiwan Bankruptcy project.
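A minimal sketch of the 10-fold procedure is shown below. It only handles the splitting and the parameter search; `evaluate` is a hypothetical callback that would fit a model with a given parameter on the training indices and return a validation score such as AUC. None of these names come from the project code.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def tune(params, evaluate, n_samples, k=10):
    """Return the candidate parameter with the highest mean validation score."""
    splits = list(k_fold_indices(n_samples, k))

    def mean_score(param):
        return sum(evaluate(param, tr, va) for tr, va in splits) / len(splits)

    return max(params, key=mean_score)
```

Each review is used for validation exactly once and for training in the other nine folds, so the mean score uses all of the data without scoring a model on reviews it was trained on.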
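For reference, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below is a simple quadratic-time version meant for illustration, not necessarily the routine used in the project.

```python
def auc(labels, scores):
    """Area under the ROC curve for labels in {0, 1}; ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores, for illustration only.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 1.0 means every positive review is ranked above every negative one, and 0.5 is what random scores would give. This is why values that are all close to 1 make the three models hard to tell apart.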
Top 5 words of reviews
Even though the performances are similar, the words each model considers most important for classifying sentiment differ slightly.
Ridge
From positive reviews:
gem noir captures wonderfully refreshing
From negative reviews:
worst unfunny disappointment lousy waste
Lasso
From positive reviews:
7 refreshing wonderfully captures noir
From negative reviews:
worst waste poorly badly lousy
Elastic-net
From positive reviews:
prince refreshing captures wonderfully noir
From negative reviews:
poorly lousy worst disappointment badly
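One plausible way to produce word lists like the ones above is to rank the vocabulary by the fitted coefficients, taking the largest positive values for positive reviews and the most negative values for negative reviews. This is an assumption about the method, and the function name, vocabulary, and coefficient values below are made up for illustration.

```python
def top_words(coefficients, vocabulary, n=5):
    """Return (positive_words, negative_words), strongest first."""
    ranked = sorted(zip(coefficients, vocabulary))
    negative = [word for coef, word in ranked[:n] if coef < 0]
    positive = [word for coef, word in ranked[::-1][:n] if coef > 0]
    return positive, negative


# Placeholder vocabulary and coefficients, not results from the project.
vocabulary = ["word_a", "word_b", "word_c", "word_d"]
coefficients = [0.5, -1.2, 0.0, 2.0]
positive, negative = top_words(coefficients, vocabulary, n=2)
```

Words with a coefficient of exactly zero are left out of both lists, which matters for Lasso and Elastic-net because they can set coefficients to zero.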
Summary
For now, we cannot decide which algorithm performs better than the other two by considering the ROC curves alone: their AUC values all look very good on both the training and testing datasets. However, when you consider model complexity and running time, Ridge regression usually takes less time but produces a more complex model, Lasso regression takes longer but produces a less complex model, and Elastic-net regression falls between the two. In the Taiwan Bankruptcy project, we will go deeper into this trade-off and discuss in which situations time matters more than model complexity, or the opposite. What we can say is that Logistic Regression is well suited to predicting the sentiment of IMDb movie reviews. The top 5 words show the importance of features: when a model classifies a review, those words contribute the most to the positive or negative prediction.
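If model complexity is taken to mean the number of words the model actually uses (an assumption, since the text does not define it), it can be measured by counting the non-zero coefficients. The helper name and the tolerance below are illustrative.

```python
def count_nonzero(coefficients, tol=1e-8):
    """Number of features whose coefficient is not (numerically) zero."""
    return sum(1 for c in coefficients if abs(c) > tol)


# Made-up coefficient vectors, for illustration only.
dense = [0.4, -0.1, 0.02, -0.3]   # every feature carries some weight
sparse = [0.5, 0.0, 0.0, -0.4]    # some features dropped entirely
print(count_nonzero(dense), count_nonzero(sparse))  # 4 2
```

Under this measure, a penalty that drives coefficients to exactly zero gives a smaller count, which is consistent with the comparison of Ridge and Lasso above.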