Research : Movie Success Prediction Using ML

Narayana Darapaneni Director - AIML Great Learning/Northwestern University Illinois, USA
Sujana Entoori Mentor - AIML Great Learning Bangalore, India
S V Vybhav Student - AIML Great Learning Bangalore, India
Christopher Bellarmine Student - AIML Great Learning Bangalore, India
Abir Kumar Student - AIML Great Learning Bangalore, India
Koushik Mondal Student - AIML Great Learning Bangalore, India
Anwesh Reddy Paduri Research Assistant - AIML Great Learning Mumbai, India
Abstract

Movies continue to be a major source of entertainment in every country. However, the industry incurs heavy losses when a movie does not perform at the box office. Our solution predicts the success of a movie by performing predictive analysis on its attributes: the movie crew (including director, producer and music director), the movie plot (storyline), box-office revenue, and audience and critic reviews and ratings. In this paper, machine learning approaches including Random Forest, Decision Tree, K-Nearest Neighbours (KNN), natural language processing (NLP), XGBoost Classifier and Deep Neural Network were studied in detail and implemented on the IMDB dataset to predict the success of movies. Based on the results, the XGBoost Classifier gave the best accuracy.

I. INTRODUCTION

Movie making is a billion-dollar industry, and movies are a big source of entertainment in any country [1]. Filmmaking involves a good story, screenwriting, casting, direction, sound recording and many other activities. The movie industry produces hundreds of movies every year in genres such as animation, war, comedy, thriller and horror. Cinema has a distinct way of inspiring each one of us, and we learn a lot through this medium. A comedy helps us forget our sorrows, a sci-fi movie helps us think big, a biopic inspires us to achieve our dreams, and so on. A whole family can watch a movie together, since cinema has so many enjoyable elements. A great movie is a crowd-puller that the entire family enjoys together, and this translates into revenue (from the purchase of tickets and snacks, and further shopping once the movie is over).

Hence, the movie industry contributes significantly to any country’s economy. When a movie is a blockbuster hit, the profits are huge; when a movie flops, the losses are huge too. Both outcomes have a direct impact, positive and negative respectively. Many online platforms keep track of movies, such as Rotten Tomatoes, Metacritic and the Internet Movie Database (IMDb), and provide information about directors and budgets as well as user ratings and comments. IMDb, the number one consumer site for movies, contains information about films, television and other programs, including financial information, biographies, user ratings, reviews, cast, crew, directors and summaries. It maintains a database of approximately 83 million registered users and 10.4 million personalities, with 6.5 million movie and episode titles [2] [12].

“Hollywood is the land of hunch and the wild guess” [3] [5]. Thousands of movies are released every year. In 2018, the global box office was worth $41.7 billion [13]. When box office and home entertainment revenue are included, the global film industry was worth $136 billion in 2018 [11] [14]. Hollywood is the world's oldest national film industry and remains the largest in terms of box office gross revenue. Indian cinema is the largest national film industry in terms of the number of films produced, with 1,813 feature films produced annually as of 2018. There is a great deal of uncertainty about whether a movie will do business. A lot of research has been done on predicting the success of movies. The majority of past research focused on user ratings, for which the data came mostly from social media platforms such as YouTube and Twitter [3] [15]. Movie attributes such as crew, release dates, production houses and storyline are a valuable source of information and can contribute significantly to predicting the success of a movie. A lot of data about these attributes is available on the web from sources such as IMDB, which makes this a significant use case for data mining and machine learning, since successful prediction of a movie's performance is highly relevant to this multi-billion-dollar industry. It will help producers and directors make movies that better match the audience’s preferences [3] [6].

II. LITERATURE OVERVIEW

In August 2016, Muhammad Hassan Latif and Hammad Afzal [1] published a paper in IJCSNS on predicting the rating of a movie using machine learning. In their work, they rated the budget of the movie on a scale from 1 to 9. The prediction classes in their approach were Terrible, Poor, Average and Excellent. The lowest accuracy for the neural network was 79.07% and the highest was 84.34%.

Nikhil Apte, Mats Forssell and Anahita Sidhwa [2] used several machine learning techniques to predict box office revenue. They considered only movies released after 1 January 1990, as the data before that date was incomplete. The final dataset consisted of 2510 records. They used linear regression and weighted linear regression, with hold-out cross validation for estimating test error.

In a more recent work, Rijul Dhir and Anand Raj [16] tried to predict how successful a movie will be prior to its arrival at the box office, rather than relying on the opinions of critics and others. Their research provides an efficient approach to predicting the IMDB score on the IMDB Movie Dataset. They also tried to unveil the important factors influencing the IMDB score. Random forest yielded the best accuracy.

In another approach, social media interactions were used for movie success prediction, with the volume of tweets and interactions as the source of the dataset. Sitaram Asur and Bernardo A. Huberman [3] conducted a study on predicting future outcomes using social media platforms, showing that social media can be used to predict the box office revenue of a movie. The study used Twitter as the source of the data. A data set of 2.89 million tweets from 1.2 million users was used to create a linear regression model, which obtained an accuracy of 98%.

Karl Persson [4] at the University of Skövde compared the predictive performance of random forests with that of support vector machines. He achieved a success rate of 84% with random forest and 86% with support vector machines, validating his results with 10-fold cross validation. Similar work has been presented in [5], where social media data, including Twitter and YouTube comments, are used for the same purpose. Another approach [6] predicts the popularity of a movie from its articles on Wikipedia.

That research shows that these articles can be used to predict future outcomes. It also uses financial data of movies from Box Office Mojo, together with Pearson’s correlation coefficient and linear regression. A different approach predicts the opening weekend revenue. It takes movie information such as actors, director, genre and release date from Metacritic, and financial data such as budget and opening-week gross revenue from The Numbers. Mean absolute error, Pearson’s correlation coefficient and linear regression are employed [7-10].

TABLE 1: SUMMARY OF PREVIOUS WORK

Study | Validation | Prediction | Success Rate
Predicting Movie Box Office Gross | 20% withheld from data set | Movie Revenue | 65%
Prediction of Movies Popularity Using Machine Learning Techniques | 10-fold cross validation | Movie Rating | 80%
Predicting Movie Ratings: A Comparative Study on Random Forests and Support Vector Machines | 10-fold cross validation | Movie Rating | 83%
Predicting the Future with Social Media | Cross validation | Box office revenue | 98%

III. METHODOLOGY

The proposed methodology is illustrated in Figure 1. It contains the following steps:

  • Data Gathering
  • Data Analysis
  • Data Cleansing
  • Data Formatting
  • EDA
  • Feature Selection
  • Assumptions
  • Feature Engineering
  • Model Selection
  • Model Validation
Figure 1 - Methodology Followed

IV. MATERIALS

We used two types of datasets: labeled and unlabeled. The unlabeled dataset contained several unique entries, chief among them the customer reviews in the IMDB movie reviews dataset. This dataset was used to pre-train the model to recognize English-language text using transfer learning.

The labeled dataset was collected manually and consisted of answers to a questionnaire about what constitutes a successful movie. We gathered data from well-known sites such as MovieLens, Rotten Tomatoes and IMDB. Each of these datasets was freely available online.

  • Movie Lens: https://grouplens.org/datasets/movielens/
  • Rapid API: https://rapidapi.com/collection/movie-apis
  • Kaggle: https://www.kaggle.com/tmdb/tmdb-movie-metadata

For our models we settled on the IMDB dataset, as it was the most suitable.

A. Data Cleansing and Formatting

The following steps were taken to process the data:

  • Removed unused columns such as id, imdb id, vote count, production company, keywords and homepage.
  • Removed duplicate rows, if any.
  • Handled the JSON-encoded fields in the dataset.
  • Discarded movies with zero budget or zero revenue, i.e. entries whose values were not recorded.
  • Converted the release date column to a date format.
  • Replaced zero with NaN in the runtime column.
  • Changed the format of the budget and revenue columns.
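The cleansing steps above can be sketched with pandas roughly as follows. This is a minimal illustration and not the authors' actual pipeline: the toy rows, the `UNUSED_COLUMNS` subset, the choice of `genres` as the JSON-encoded field, and the cast of budget and revenue to integers are all assumptions made for the example.

```python
import json

import numpy as np
import pandas as pd

# Toy rows standing in for the raw movie metadata; every value is made up.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "homepage": ["", "", "", ""],
    "title": ["A", "B", "B", "C"],
    "genres": [
        '[{"id": 35, "name": "Comedy"}]',
        '[{"id": 18, "name": "Drama"}]',
        '[{"id": 18, "name": "Drama"}]',
        "[]",
    ],
    "budget": [1000000, 2000000, 2000000, 0],
    "revenue": [5000000, 1500000, 1500000, 300000],
    "runtime": [95, 0, 0, 110],
    "release_date": ["2010-05-01", "2012-11-23", "2012-11-23", "2015-01-09"],
})

UNUSED_COLUMNS = ["id", "homepage"]  # assumed subset of the unused columns


def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Drop unused columns, then duplicate rows (while JSON fields are still strings).
    df = df.drop(columns=UNUSED_COLUMNS, errors="ignore").drop_duplicates()
    # Discard entries whose budget or revenue was never recorded.
    df = df[(df["budget"] > 0) & (df["revenue"] > 0)].copy()
    # Flatten the JSON-encoded genre field into a list of names.
    df["genres"] = df["genres"].apply(lambda s: [g["name"] for g in json.loads(s)])
    # Parse release dates and mark missing runtimes as NaN.
    df["release_date"] = pd.to_datetime(df["release_date"])
    df["runtime"] = df["runtime"].replace(0, np.nan)
    # Assumed "format change": store monetary columns as 64-bit integers.
    df[["budget", "revenue"]] = df[["budget", "revenue"]].astype("int64")
    return df.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Duplicates are dropped before the JSON field is parsed because pandas cannot hash list-valued cells when comparing rows.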

B. Data Analysis, EDA and Feature Engineering

The data about a movie contains a lot of textual information, such as the cast (actors), crew (director) and production houses. Our intention was to use this textual data, as it has high significance in determining movie success. Combining all this data, we need to build the target variable (where a movie is either a success or a failure).

Following are the ways we derived the target variable from the textual data:

First, we created seven features from the categorical textual data (from a machine learning perspective). The features (director, actor, production house) were encoded by assigning each a weightage, for example:

Weightage for director = number of successful movies by the director / total number of movies directed.

We then derived the target variable by considering three aspects of the movie data. We graded a movie as a success only when its popularity rating is above 7 and it is a commercial success in terms of its budget to gross income ratio. Finally, we dropped all the textual data that had been transformed in the steps above.
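The target rule and the director weightage described above can be sketched in plain Python. This is a hedged illustration only: the toy records are invented, and `MIN_REVENUE_RATIO` (revenue must exceed budget) is an assumed reading of "commercial success", since the text does not give the exact ratio.

```python
from collections import defaultdict

# Toy records; names and numbers are illustrative only.
movies = [
    {"director": "Director A", "rating": 7.8, "budget": 50.0, "revenue": 180.0},
    {"director": "Director A", "rating": 6.1, "budget": 40.0, "revenue": 90.0},
    {"director": "Director A", "rating": 7.4, "budget": 30.0, "revenue": 20.0},
    {"director": "Director B", "rating": 8.2, "budget": 10.0, "revenue": 60.0},
]

RATING_THRESHOLD = 7.0   # from the text: rating above 7
MIN_REVENUE_RATIO = 1.0  # assumption: revenue must exceed budget


def is_success(movie):
    """Target variable: well rated AND commercially successful."""
    if movie["budget"] <= 0:
        return False
    ratio = movie["revenue"] / movie["budget"]
    return movie["rating"] > RATING_THRESHOLD and ratio > MIN_REVENUE_RATIO


def director_weightage(records):
    """Successful movies by a director divided by all movies they directed."""
    totals, wins = defaultdict(int), defaultdict(int)
    for movie in records:
        totals[movie["director"]] += 1
        wins[movie["director"]] += int(is_success(movie))
    return {name: wins[name] / totals[name] for name in totals}


weights = director_weightage(movies)
print(weights)
```

The same ratio could be computed for actors and production houses by grouping on a different key.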

We applied principal component analysis (PCA) to understand the feature importance and obtained the results below.

Figure 2 - Feature Importance Map
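As a rough illustration of how PCA can hint at which features carry the most variance, the sketch below computes explained-variance ratios and component loadings with NumPy. The synthetic data and the feature names are assumptions for the example; it does not reproduce the figure.

```python
import numpy as np


def pca_summary(X):
    """Return explained-variance ratios and component loadings (rows)."""
    X = np.asarray(X, dtype=float)
    # Standardise each feature so that scale does not dominate the components.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, singular_values, components = np.linalg.svd(Z, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return explained, components


# Synthetic data: the second column is almost a multiple of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base,
    2.0 * base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])
feature_names = ["budget", "revenue", "runtime"]  # hypothetical labels

explained, components = pca_summary(X)
for ratio, loading in zip(explained, components):
    print(f"{ratio:.2f}", dict(zip(feature_names, np.round(loading, 2))))
```

Features with large absolute loadings on the leading components contribute most to the variance those components capture.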

Based on our EDA we found Comedy to be the most successful genre.

Figure 3 - Top Genres Map

Similarly, we found that the most successful movies are those featuring Samuel L. Jackson and Morgan Freeman.

Figure 4 - Top Successful Actor

C. Data Filtering

The following rules were applied, based on the analysis of the data, so that biases are not introduced during the machine learning process:

  • The minimum number of successful movies should be 5 for actors and directors.
  • The minimum number of movies produced by any production house should be 50.
  • Where the above criteria are not met, a low weightage is assigned uniformly.

This approach eliminated the problem of a director who directed only one movie that became a huge success being ranked above a director who directed 10 movies, of which 6 were huge successes.
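A minimal sketch of this filtering rule is shown below. The two thresholds come from the text; `LOW_WEIGHTAGE` is a hypothetical value, since the text only says that a low weightage is assigned uniformly.

```python
MIN_SUCCESSES = 5     # minimum successful movies for actors and directors
MIN_PRODUCED = 50     # minimum movies produced by a production house
LOW_WEIGHTAGE = 0.05  # hypothetical value for the uniform "low weightage"


def weightage(successes, total, *, min_successes=0, min_total=0):
    """Success ratio, or the uniform low weightage when a threshold is missed."""
    if total <= 0 or successes < min_successes or total < min_total:
        return LOW_WEIGHTAGE
    return successes / total


# A director with a single hit no longer outranks one with 6 hits out of 10.
one_hit = weightage(1, 1, min_successes=MIN_SUCCESSES)
veteran = weightage(6, 10, min_successes=MIN_SUCCESSES)
small_studio = weightage(20, 30, min_total=MIN_PRODUCED)
print(one_hit, veteran, small_studio)
```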

V. RESULTS AND DISCUSSION

A. Base Model

We started by testing various models, keeping the existing ones as our baseline models and then trying neural network models to improve on the current outcomes as part of our project goal. Once data cleaning and feature engineering were done, we compared the accuracy of the models on the engineered features to check whether accuracy had improved. The table below shows the accuracy of each model at the beginning and the improvement achieved after feature engineering.

TABLE 3: MODEL ACCURACIES BEFORE AND AFTER TUNING

Model | Accuracy Before (%) | Accuracy After (%)
KNN | 82 | 82.04
Random Forest | 86 | 88.57
Decision Tree | 88.63 | 88.83
XGBoost | NA | 90.39
Gaussian Naive Bayes Model | 80.46 | 80.91
Simple Neural Network Model | 78.30 | 80.41

B. Natural Language Processing On the Movie Overview/Plot

Natural Language Processing can be applied to the overview, plot or storyline of a movie in order to classify an unlabeled movie plot. After preprocessing and tokenizing the text, GloVe embeddings are used to create the feature matrix. The embedding layer is fed the embedding matrix as its weights. The models are compiled using the Adam optimizer with binary cross entropy as the loss.

The first model performs text classification using a simple neural network. The Sequential model has an embedding layer and a Flatten layer, followed by a sigmoid activation layer. The test accuracy was 76.69%.

The second model performs text classification with a Convolutional Neural Network. Compared with the previous model, it adds a Conv1D layer with 128 neurons, stride 5 and ReLU activation after the embedding layer, followed by a GlobalMaxPooling1D layer. This model gave an accuracy of 79.44%.

The third model uses a type of Recurrent Neural Network (RNN), namely Long Short-Term Memory (LSTM). This Sequential model contains an embedding layer followed by an LSTM layer of 128 neurons and the sigmoid activation layer. This model gave an accuracy of 79.45%.
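The three text models can be sketched in Keras roughly as below. This is an illustrative reconstruction under stated assumptions, not the authors' code: `VOCAB_SIZE`, `EMBED_DIM` and `MAX_LEN` are hypothetical sizes, a random matrix stands in for the GloVe embedding matrix, and the "5" of the convolutional layer is interpreted here as the kernel size.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 1000, 50, 40  # hypothetical sizes

# Random stand-in for the GloVe matrix: one row per vocabulary index.
embedding_matrix = np.random.default_rng(0).normal(size=(VOCAB_SIZE, EMBED_DIM))


def build_model(kind):
    """Build the 'simple', 'cnn' or 'lstm' text classifier."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(MAX_LEN,)))
    # Frozen embedding layer initialised from the pre-trained matrix.
    model.add(layers.Embedding(
        VOCAB_SIZE,
        EMBED_DIM,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False,
    ))
    if kind == "simple":
        model.add(layers.Flatten())
    elif kind == "cnn":
        # Assumption: 128 filters with a kernel size of 5.
        model.add(layers.Conv1D(128, 5, activation="relu"))
        model.add(layers.GlobalMaxPooling1D())
    elif kind == "lstm":
        model.add(layers.LSTM(128))
    else:
        raise ValueError(f"unknown model kind: {kind}")
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


models = {kind: build_model(kind) for kind in ("simple", "cnn", "lstm")}
```

Each model takes a batch of padded token-index sequences of length `MAX_LEN` and outputs one success probability per sequence.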

C. XGBoost Classifier

XGBoost, which stands for eXtreme Gradient Boosting, is a boosting algorithm based on gradient-boosted decision trees. It applies stronger regularization to reduce overfitting, which is one of its differences from standard gradient boosting. We observed that the XGBoost model outperforms all the other models. The model performs best on the data after feature-importance selection, whereas on the PCA-transformed data the accuracy drops to 80.86%.
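For reference, the regularization mentioned above enters through the objective that XGBoost minimizes. The form below follows the standard XGBoost formulation rather than anything specific to this study: l is the training loss, f_k is the k-th tree, T is its number of leaves, w is its vector of leaf weights, and γ and λ are the regularization hyper-parameters.

```latex
\mathcal{L} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2}
```

The penalty Ω discourages trees with many leaves or large leaf weights, which is how the model limits overfitting.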

The final accuracy was 90.39% on the main dataset. After feature-importance selection the accuracy was 90.81%, and on the PCA dataset it was 80.86%.

D. Neural Network Model Using TensorFlow and Tensor Board

In this project, a Deep Neural Network classification model was constructed using TensorFlow. The model contains two dense layers and a dropout layer, followed by the output layer. After testing both the SGD and Adam optimizers, we observed that Adam showed better results overall.

Initially the model showed an accuracy of only 77.8%. With the help of individual feature importance and/or PCA, we were able to raise the accuracy to 79.7%. We then proceeded with hyper-parameter tuning, using TensorBoard, to further increase the accuracy of the model. TensorBoard is TensorFlow’s visualization toolkit for tracking and visualizing metrics such as loss and accuracy. It also visualizes the model graph (ops and layers), which is what we used to perform hyper-parameter tuning.

To perform hyper-parameter tuning, the number of neurons in the two layers and the dropout rate were evaluated for different values and observed in the TensorBoard graph. With the help of the TensorBoard model flow graph, we concluded that 64 neurons in the first layer and 4 neurons in the second, followed by a dropout of 0.1 with the Adam optimizer, yielded the highest accuracy of 81.56%.
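The search over layer sizes and dropout can be pictured as an exhaustive grid search. The sketch below is generic and makes several assumptions: the candidate values in `grid` are hypothetical, and `toy_score` is an arbitrary stand-in that would be replaced by a function that trains the network and returns its validation accuracy.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid`; return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values for the two dense layers and the dropout rate.
grid = {
    "units_1": [16, 32, 64],
    "units_2": [4, 8, 16],
    "dropout": [0.1, 0.3, 0.5],
}


def toy_score(units_1, units_2, dropout):
    # Arbitrary stand-in, NOT a real accuracy: swap in a function that
    # trains the model with these settings and returns validation accuracy.
    return -(abs(units_1 - 32) + abs(units_2 - 8)) - dropout


best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

With three values per hyper-parameter the search evaluates 27 configurations, so each evaluation should be kept cheap or the grid kept small.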

TABLE 4 – DEEP LEARNING MODEL RESULTS

Model | Accuracy (%)
Sentiment Analysis using Deep Neural Networks
Simple Base Model | 73.41
Reduced Model | 74.26
Regularized Model | 68.75
Dropout Model | 73.20
Sentiment Analysis (Only Considering Overview column data)
Text Classification using Simple Neural Network | 78.07
Text Classification using Convolutional Neural Network | 79.44
Text Classification using Recurrent Neural Network (LSTM) | 79.44

VI. CONCLUSION

The proposed research aims to predict the success of movies. We used machine learning approaches for our experimentation, aiming to improve on previous research. After performing classification, we found that our best results were achieved with XGBoost, at around 90% accuracy. In our analysis, the number of user reviews, gross income and budget were the significant features. In addition, the following observations can be derived:

Samuel Jackson, Robert De Niro, Morgan Freeman, and Bruce Willis have the best success ratio. Comedy, Action and Drama are the most watched and liked genres.

REFERENCES

  1. Muhammad Hassan Latif and Hammad Afzal. Prediction of movies popularity using machine learning techniques, 2016. http://paper.ijcsns.org/07_book/201608/20160820.pdf
  2. Nikhil Apte, Mats Forssell, and Anahita Sidhwa. Predicting movie revenue. CS229, Stanford University, 2011.
  3. Sitaram Asur and Bernardo A Huberman. Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 1, pages 492–499. IEEE, 2010.
  4. Karl Persson. Predicting movie ratings: A comparative study on random forests and support vector machines, 2015
  5. Andrei Oghina, Mathias Breuss, Manos Tsagkias, and Maarten de Rijke, “Predicting IMDB Movie Ratings Using Social Media”, Advances in Information Retrieval , Volume 7224, 2012, pp 503-507
  6. Mestyán M, Yasseri T, Kertész J (2013): “Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data”. PLoS
  7. Mahesh Joshi, Dipanjan Das, Kevin Gimpel, Noah A. Smith: “Movie Reviews and Revenues: An Experiment in Text Regression”, The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Pages 293-296
  8. Wenbin Zhang, Steven Skiena: “Improving Movie Gross Prediction Through News Analysis”, Department of Computer Science, Stony Brook University, 2009 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology – Workshops, Pages 301-304.
  9. Nithin VR, Pranav M, Sarath Babu PB, Lijiya A: “Predicting movie success based on IMDB data”, International Journal of Data Mining Algorithm, Vol. 3, Issue 2, 2014, pp. 34-36, DOI: 10.20894/IJBI.105.003.002.004, ISSN: 2278-2397
  10. Khalid Ibnal Asad, Tanvir Ahmed, Md. Saiedur Rahman: “Movie Popularity Classification based on Inherent Movie Attributes using C4.5, PART and Correlation Coefficient”, IEEE/OSA/IAPR International Conference on Informatics, Electronics & Vision, Pages 747-752
  11. Darin Im and Minh Thao Nguyen: “PREDICTING BOXOFFICE SUCCESS OF MOVIES IN THE U.S. MARKET “, CS 229, Fall 2011
  12. https://en.wikipedia.org/wiki/IMDb#:~:text=As%20of%20January%202020%2C%20IMDb,as%2083%20million%20registered%20users.
  13. McNary, Dave (3 January 2019). "2018 Worldwide Box Office Hits Record as Disney Dominates". Variety. Retrieved 22 January 2019.
  14. "Global Movie Production & Distribution Industry: Industry Market Research Report". IBISWorld. August 2018. Retrieved 22 January 2019.
  15. Movie Success Prediction using Historical and Current Data Mining September 2019 International Journal of Computer Applications 178(47):1-5 DOI: 10.5120/ijca2019919415.
  16. Movie Success Prediction using Machine Learning Algorithms and their Comparison May, 2019 DOI: 10.1109/ICSCCC.2018.8703320 https://ieeexplore.ieee.org/abstract/document/8703320