Research: Stock Price Prediction Using Sentiment Analysis and Deep Learning for Indian Markets

NARAYANA DARAPANENI Director - AIML Great Learning/Northwestern University Illinois, USA
HIMANK SHARMA Student- AIML Great Learning Bangalore, India
MILIND MANJREKAR Student - AIML Great Learning Bangalore, India
NUTAN HINDLEKAR Student - AIML Great Learning Bangalore, India
PRANALI BHAGAT Student - AIML Great Learning Bangalore, India
USHA AIYER Student - AIML Great Learning Bangalore, India
YOGESH AGARWAL Student - AIML Great Learning Bangalore, India
Anwesh Reddy Paduri Research Assistant - AIML Great Learning Mumbai, India
Abstract

Stock market prediction has been an active area of research for a considerable period. The arrival of computing, followed by machine learning, has accelerated research and opened new avenues. In this study, we aimed to predict future stock price movement using historical prices together with sentiment data. Two models were used. The first was an LSTM with historical prices as the independent variable. The second was a Random Forest model whose main inputs were sentiment scores produced by a Sentiment Intensity Analyzer; macroeconomic parameters such as gold prices, oil prices, the USD exchange rate, and Indian government securities yields were added to improve its accuracy. Prices of four stocks, namely Reliance, HDFC Bank, TCS, and SBI, were predicted using these two models, and the results were evaluated using the RMSE metric.

I. Introduction

The objective of this exercise was to predict future stock prices using machine learning and other artificial intelligence techniques. The exercise started with a comprehensive review of the available literature in this domain. Research papers and online sources tackling this problem were reviewed; a brief list is included in the references.

1.1 Literature Review

Early research on stock market prediction was primarily based on the Random Walk theory and the Efficient Market Hypothesis (EMH). Several studies, including those by Gallagher, Kavussanos, and Butler, demonstrated that stock market prices do not strictly follow a random walk and can be predicted to a certain extent.

Another hypothesis currently under investigation is whether early indicators extracted from online sources such as blogs and Twitter feeds can be used to predict changes in economic and commercial indicators. Similar analyses have been conducted in other domains. For example, Gruhl et al. showed a correlation between online chat activity and book sales, while blog sentiment analysis was used by Mishne and Glance to predict movie sales.

Schumaker et al. investigated the relationship between breaking financial news and stock price changes. One of the most influential studies in stock market prediction was conducted by Bollen et al. (2011), who examined the correlation between public mood and the Dow Jones Industrial Average (DJIA). Public moods such as happiness, calmness, and anxiety were derived from Twitter feeds.

Chen and Lazer derived investment strategies by observing and classifying Twitter feeds. Bing et al. analyzed Twitter data and concluded that stock predictability varies across industries. Zhang et al. identified a strong negative correlation between negative moods on social networks and the DJIA index.

Pagolu et al. demonstrated a strong relationship between fluctuations in company stock prices and public emotions expressed on Twitter. Instead of using standard word embedding models, their work focused on developing a sentiment analyzer that categorized tweets into positive, negative, and neutral classes.

Mittal et al. attempted to build a portfolio management tool using Twitter sentiment analysis. Their model, tested on DJIA data, employed a greedy strategy that incorporated sentiment feedback to predict buy or sell decisions one day in advance.

Chen et al. used an LSTM-based model to predict stock direction in the Chinese Stock Exchange and compared its performance with a random estimation model, confirming the superior accuracy of LSTM. Tekin et al. analyzed data from 25 leading companies and applied various forecasting models, identifying Random Forest as a highly relevant technique.

Malandri et al. employed LSTM, multilayer perceptron (MLP), and Random Forest classifiers in a portfolio allocation model. A study based on NYSE data suggested that LSTM achieved better experimental results.

Kilimci et al. proposed an efficient word embedding and deep learning–based framework for forecasting stock market direction using Twitter and financial news data for the Turkish Stock Exchange (BIST 100). Their approach combined word embedding techniques such as Word2Vec, FastText, and GloVe with deep learning models including CNN, RNN, and LSTM.

Their results showed that the combination of the Word2Vec embedding model with LSTM achieved the highest average accuracy across nine selected stocks when Twitter data was used as the primary information source.

1.2 Data Sourcing, Pre-processing, and EDA

The exercise began with stock-related information available in the public domain. Yahoo Finance was used as the primary source for stock market data. The dataset contained standard data points commonly used in stock analysis, including Open, High, Low, Close prices, Adjusted Close, and Trading Volume. Historical data from January 2007 onwards was utilized as part of the exploratory data analysis (EDA).

Fig. 1. Historical Performance of Reliance, HDFC Bank, TCS and SBI Stocks

The process of model building led to domain exploration beyond the scope covered in the literature survey. Various macroeconomic, global economic, and fundamental parameters were studied across domains such as finance, economy, trade, and core industry indicators. The objective was to finalize a set of parameters that could significantly influence stock prices.

The final macroeconomic parameters selected for the study included gold prices, Brent crude oil prices, government securities yields, and the USD–INR exchange rate. Gold prices were considered due to their typically negative relationship with market returns. Brent crude oil prices were used as a proxy for fuel costs, which have a broad impact on economic indicators. Government bond yields were included as rising yields exert pressure on economic activity and market returns. Exchange rate fluctuations were incorporated due to their influence on multiple macroeconomic variables and their role in explaining stock price movements.


Figure 2 illustrates the correlation between the selected macroeconomic parameters and stock prices.
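A correlation check of this kind can be sketched as follows. This is a minimal illustration, not the study's computation: the column names and all values are made up, and the choice of Pearson correlation is an assumption.

```python
import pandas as pd

# Hypothetical daily values; column names and numbers are illustrative only.
data = pd.DataFrame(
    {
        "close": [2000.0, 2010.0, 1995.0, 2030.0, 2050.0],
        "gold": [48000.0, 47900.0, 48100.0, 47700.0, 47500.0],
        "brent": [70.0, 70.5, 69.8, 71.2, 72.0],
        "gsec_yield": [6.00, 6.02, 6.05, 6.01, 5.98],
        "usd_inr": [73.0, 73.1, 73.4, 72.9, 72.7],
    }
)

# Pearson correlation of every macro column with the closing price.
corr_with_close = data.corr()["close"].drop("close")
print(corr_with_close.sort_values())
```

A strongly negative or positive value suggests a parameter worth testing as a model feature, although correlation on price levels alone does not establish predictive power.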

Another key component of the study was the application of sentiment analysis. Initially, Twitter data was intended to serve as the primary input for sentiment analysis. However, changes in Twitter’s data access policies posed significant challenges in sourcing tweet data.

As an alternative, a manual data collection approach was adopted to source news headlines from publicly available platforms such as BSE, India Today, Reuters, News18, Hindustan Times, Mint, and Global Filings. News data spanning roughly two years, from 1 June 2019 to 28 June 2021, was compiled for the analysis.

The collected data was available on a daily basis and was organized into an Excel file, with each row corresponding to a single news headline. Since the data was sourced from news websites, standard text pre-processing steps were applied, including stop-word removal, elimination of special characters, and other common cleaning techniques to prepare the data for sentiment analysis.

The Sentiment Intensity Analyzer produced four sentiment scores: Positive, Negative, Neutral, and Compound. While stock prices were analyzed and predicted on a daily basis, multiple news items often corresponded to a single trading day.

To address this, all news items associated with a given date were concatenated into a single text input for the sentiment analyzer. The resulting daily sentiment scores were then combined with historical closing prices and the selected macro parameters to generate predictions using the Random Forest model.
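The daily concatenation step can be sketched as below. The rows and the scoring function are hypothetical: `toy_polarity_scores` is a tiny word-count stand-in that only mimics the four output keys, and it is not the analyzer used in the study. A real analyzer such as NLTK's VADER `SentimentIntensityAnalyzer`, whose `polarity_scores` method returns `neg`, `neu`, `pos` and `compound`, could be substituted; whether the study used that exact implementation is an assumption.

```python
from collections import defaultdict

# Hypothetical rows (date, headline); the study stored one headline per Excel row.
rows = [
    ("2021-06-24", "Company reports strong profit growth"),
    ("2021-06-25", "Regulator opens probe into lender"),
    ("2021-06-24", "Shares gain after upbeat outlook"),
]

def concatenate_daily_headlines(rows):
    """Join all headlines that share a date into one text per day."""
    by_date = defaultdict(list)
    for date, headline in rows:
        by_date[date].append(headline.strip())
    return {date: " ".join(items) for date, items in sorted(by_date.items())}

# Toy stand-in for a real analyzer; NOT the scoring used in the study.
POSITIVE = {"strong", "gain", "upbeat", "growth"}
NEGATIVE = {"probe", "loss", "weak"}

def toy_polarity_scores(text):
    """Return word-share scores under the same four keys a real analyzer would use."""
    words = text.lower().split()
    total = max(len(words), 1)
    pos = sum(w in POSITIVE for w in words) / total
    neg = sum(w in NEGATIVE for w in words) / total
    return {"pos": pos, "neg": neg, "neu": 1.0 - pos - neg, "compound": pos - neg}

daily_text = concatenate_daily_headlines(rows)
daily_scores = {date: toy_polarity_scores(text) for date, text in daily_text.items()}
for date, scores in daily_scores.items():
    print(date, scores)
```

Each date ends up with exactly one row of four scores, which is the shape needed to join the sentiment features with daily closing prices and macro parameters.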

II. Step-by-Step Walkthrough of the Solution

The data pre-processing and exploratory data analysis (EDA) phase was followed by the application of multiple machine learning algorithms to achieve acceptable error levels. Several models, including Linear Regression, K-Nearest Neighbors (KNN), Random Forest Regressor, Prophet, and ARIMA, were evaluated as part of the study. A brief description of each algorithm along with the parameters considered is presented below.

a) Linear Regression: Linear Regression is a fundamental approach for modeling the relationship between a scalar response variable and one or more explanatory variables. In this context, stock price and time period were used as system parameters, making the model broadly applicable for initial prediction experiments.

b) K-Nearest Neighbors (KNN): KNN is an instance-based learning algorithm used for pattern recognition. Since the method relies on distances between data points, data normalization was essential to improve prediction accuracy. The Euclidean distance metric was employed, and neighbor values ranging from 2 to 9 were tested. Grid search was used to determine the optimal value of k, and five-fold cross-validation was performed for hyperparameter tuning.

c) Autoregressive Integrated Moving Average (ARIMA): Autoregressive models use lagged values of the dependent variable as regressors. The ARIMA model converts non-stationary time series data into stationary data through differencing prior to forecasting. The configuration used an autoregressive order of 0, a differencing order of 1, and a moving average order of 1. Seasonal components were defined with an AR order of 2, a differencing order of 1, a moving average order of 0, and a periodicity of 12. The ‘lbfgs’ optimization method was used to evaluate multiple parameter combinations.

d) Prophet: Prophet is a forecasting procedure based on an additive model that captures non-linear trends with yearly, weekly, and daily seasonality, along with holiday effects. The model is particularly effective for time series data exhibiting strong seasonal patterns and sufficient historical depth.

e) Prophet Data Structure: A linear growth curve was used to predict daily stock prices. The Prophet framework required a dataframe with two columns: “ds” to represent the datetime series and “y” to store the corresponding stock price values.

f) Random Forest Regressor: Random Forest Regressor is an ensemble learning method that fits multiple decision trees on various sub-samples of the dataset. The final prediction is obtained through averaging, which helps improve predictive accuracy and reduces the risk of overfitting.

g) Long Short-Term Memory (LSTM): LSTM, including Bidirectional LSTM variants, was applied to predict stock prices using historical closing price data. Extensive hyperparameter tuning was conducted to achieve optimal performance.
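The KNN tuning described in item (b) can be sketched with scikit-learn as follows. This is an illustration under stated assumptions: the data is synthetic, the single time-index feature is a placeholder because the KNN inputs are not listed in the text, and min-max scaling and RMSE scoring are choices made here rather than confirmed details of the study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: a time index as the only feature, a noisy trend as target.
X = np.arange(200, dtype=float).reshape(-1, 1)
y = 100.0 + 0.5 * X.ravel() + rng.normal(0.0, 2.0, size=200)

# Normalization happens inside the pipeline so each CV fold is scaled on its own training part.
pipeline = Pipeline(
    [
        ("scale", MinMaxScaler()),
        ("knn", KNeighborsRegressor(metric="euclidean")),
    ]
)

# k from 2 to 9 with five-fold cross-validation, as described in the text.
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": list(range(2, 10))},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Plain five-fold splitting ignores time order; for price series, an ordered scheme such as scikit-learn's `TimeSeriesSplit` may be worth comparing.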

As indicated in the comparative results table, LSTM consistently outperformed the other models evaluated in the study. Consequently, LSTM was selected for predicting stock prices of additional companies, including HDFC Bank, SBI, and TCS.

[Table: comparative results of the models evaluated]

Sentiment analysis using news headlines was conducted as the next phase of the exercise. Polarity scores for daily news, including Positive, Negative, Neutral, and Compound values, were computed using a Sentiment Intensity Analyzer. Initial results showed higher RMSE values than expected, indicating limited predictive effectiveness.

Further model refinement was carried out by incorporating additional macroeconomic parameters such as gold prices, Brent crude prices, government securities yields, and the USD–INR exchange rate. The inclusion of these features led to a notable improvement in prediction accuracy, with RMSE values becoming comparable to those achieved by the LSTM model.

Based on these observations, the final solution for sentiment-based prediction employed the Random Forest Regressor augmented with selected macroeconomic parameters.

III. Model Evaluation
3.1 LSTM Model

The purpose of this study was to devise trading strategies based on stock price predictions, so regression analysis was used to estimate future stock prices. LSTM was the most successful in price prediction among the models we tried. LSTM, or Long Short-Term Memory, is a recurrent neural network from the family of deep learning algorithms; its architecture uses feedback connections.

It has an advantage over traditional feedforward neural networks because it can process entire sequences of data. Its architecture comprises a cell, an input gate, an output gate, and a forget gate. Data pre-processing is an important step for LSTM: as with most models, scaling the input data is advisable.

Since LSTM uses a sequence as the basis for predicting a single value, a matrix needs to be created from the date-wise training dataset. The training data fed into the LSTM is a multi-dimensional array in which each instance pairs a value of the dependent variable with its corresponding independent variable, which in our case is an array of historical closing prices. The number of days covered by each such array is referred to as the sliding window.
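The construction of this matrix can be sketched in NumPy as below. The function names and the price series are hypothetical, and min-max scaling is one common choice assumed here; the text does not state which scaler was used.

```python
import numpy as np

def min_max_scale(values):
    """Scale values to [0, 1]; also return the bounds needed to undo the scaling."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), lo, hi

def make_windows(prices, window):
    """Return (X, y): X[i] holds `window` consecutive prices, y[i] is the next price."""
    prices = np.asarray(prices, dtype=float)
    if len(prices) <= window:
        raise ValueError("need more prices than the window length")
    X = np.stack([prices[i:i + window] for i in range(len(prices) - window)])
    y = prices[window:]
    # Recurrent layers usually expect input shaped (samples, timesteps, features).
    return X[..., np.newaxis], y

closes = np.linspace(100.0, 130.0, 100)  # hypothetical closing prices
scaled, lo, hi = min_max_scale(closes)
X, y = make_windows(scaled, window=60)
print(X.shape, y.shape)  # (40, 60, 1) (40,)
```

In practice the scaling bounds would normally be computed on the training portion only, so that validation prices do not leak into the inputs.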

Window lengths ranging from 5 to 250 days were tried to find the best fit for the model under consideration. As part of model building, several variations of the model were tried, including the addition of Dense and Dropout layers. Hyperparameter tuning was carried out by comparing errors across different runs. Batch normalization was also attempted but did not yield any significant improvement in results.

Besides parameter tuning, a bidirectional variation of LSTM was also attempted to achieve better results. As a result of the entire model-building exercise, a sliding historical window of 60 days gave the best results among the range covered.

Two LSTM layers with 128 and 64 neurons respectively, followed by two dense layers of 25 and 1 neuron, constituted the final model that delivered the best performance among various model variations.

Since this is a regression model, classification metrics such as accuracy percentage could not be used. Therefore, RMSE was used as the metric for evaluating the models being tested.
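For reference, the standard definition of RMSE is given below. The symbols are notation introduced here: $y_i$ is the actual closing price on day $i$, $\hat{y}_i$ is the predicted price, and $n$ is the number of validation days.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

RMSE is expressed in the same unit as the price, so values are not directly comparable between stocks trading at very different price levels.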

3.2 Random Forest – Sentiment Analysis

The aim of this study was to use sentiment analysis for the prediction of stock prices. One limitation of the LSTM model as built here was that it used a single input parameter. Since LSTM could not be used for sentiment analysis, the exercise was divided into two major parts: daily sentiment collection and analysis, and model building for prediction.

As mentioned earlier, the first part consisted of manually sourcing data from various public domain websites. Preprocessing of data was carried out using standard libraries to improve data quality. Since there were multiple news items for a single day, all items were concatenated to arrive at combined daily news data.

The Sentiment Intensity Analyzer was used to generate sentiment polarity, which produces four values corresponding to the input text: Positive, Negative, Neutral, and Compound sentiment. These four parameters were considered independent features for the sentiment analysis component.

Several standard regression models were tested, and after multiple iterations, the Random Forest Regressor was identified as the most suitable model. Initial runs using close price and sentiment features alone produced poor results, with variations of 20–30% in predicted values.

Based on these outcomes, domain exploration was conducted to identify additional external features. Various permutations were evaluated, and four macroeconomic features (G-Sec yield, Brent crude price, gold prices, and the USD exchange rate) were incorporated into the model.

These macroeconomic parameters proved to be significant in improving prediction accuracy. As this is also a regression model, accuracy percentage was not applicable, and RMSE was used as the evaluation metric.
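The feature layout described above can be sketched with scikit-learn as follows. Everything here is an assumption made for illustration: the data is synthetic, the relationship between features and target is invented, and the hyperparameters (`n_estimators=200`) and the 80/20 chronological split are not taken from the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n_days = 300

# Synthetic stand-ins for the feature groups named in the text.
polarity = rng.uniform(0.0, 1.0, size=(n_days, 3))   # positive, negative, neutral
compound = rng.uniform(-1.0, 1.0, size=(n_days, 1))  # compound score
macro = rng.normal(0.0, 1.0, size=(n_days, 4))       # gold, Brent, G-Sec yield, USD-INR
close = 1000.0 + np.cumsum(rng.normal(0.0, 5.0, size=n_days))

# Invented target: next-day price nudged by sentiment and one macro feature.
next_close = close + 3.0 * compound[:, 0] + 2.0 * macro[:, 0] + rng.normal(0.0, 1.0, n_days)

X = np.column_stack([close, polarity, compound, macro])
split = int(n_days * 0.8)  # chronological split, no shuffling

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:split], next_close[:split])

pred = model.predict(X[split:])
rmse = float(np.sqrt(np.mean((next_close[split:] - pred) ** 2)))
print(f"hold-out RMSE: {rmse:.2f}")
```

Comparing the hold-out RMSE with and without the four macro columns is one way to check whether the added features help, which is the comparison the text reports.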

Both models, namely LSTM and Random Forest, were used to predict future stock prices of four stocks for 28th June as part of the study. The predicted values from both models, along with the actual stock prices for the day, are presented in the table below.

[Table: predicted values from LSTM and Random Forest and actual stock prices for 28th June]
IV. Visualizations

LSTM was applied on the closing prices of four stocks, namely Reliance, HDFC Bank, TCS, and SBI. Model data including training, validation, and predicted values have been depicted in the graphs below. The blue line represents training data, the orange line represents validation data, and the green line represents the predicted closing price for each stock.

The dataset contained 3,478 data points covering a span of approximately 15 years. Of these, 3,305 data points were used for training and the remaining 5% for validation. The RMSE values for Reliance, HDFC Bank, TCS, and SBI were 38, 33, 59, and 7 respectively.
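The split sizes can be checked with a few lines of arithmetic. Rounding the training size up is an assumption made here because it reproduces the reported figure of 3,305; the text does not state the rounding rule.

```python
import math

total_points = 3478
train_fraction = 0.95

# Rounding up is assumed; it matches the reported training size.
train_size = math.ceil(total_points * train_fraction)
validation_size = total_points - train_size

print(train_size, validation_size)  # 3305 173
print(f"{validation_size / total_points:.2%}")  # 4.97%
```

For a time series the split is chronological: the first `train_size` points form the training set and the most recent `validation_size` points form the validation set.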

The LSTM model error was significantly lower than the error values obtained from earlier models such as Linear Regression, ARIMA, and k-nearest neighbor methods.

[Figure: LSTM training, validation, and predicted closing prices for Reliance, HDFC Bank, TCS, and SBI]

Further analysis was conducted to study the impact of daily news sentiment and external macroeconomic factors such as gold prices, G-Sec yields, Brent crude prices, and the INR–USD exchange rate on stock movements using Random Forest regression. The resultant outputs of the model have been represented graphically below.

[Figure: Random Forest regression outputs for the four stocks]

While the LSTM model outperformed Random Forest regression overall, the inclusion of additional features improved the predictive performance of the Random Forest model. TCS was an exception, where the RMSE value of 139 was significantly higher than that of the LSTM model.

One possible reason for this discrepancy could be the insufficient availability of valid news data for sentiment analysis in the case of TCS.

V. Results / Implications

The broad purpose of this exercise was to arrive at trading strategies that could support real-world application of the developed models. However, the study could not reach that level due to several constraints and limitations, as described earlier.

Two regressors, namely LSTM and Random Forest, were used to predict the next-day stock price, with RMSE considered as the primary evaluation metric. Since this was a regression exercise, predictions with specific confidence levels could not be achieved. Therefore, an intuitive analysis of RMSE values was performed to assess the appropriateness of the predicted results.


The results obtained from the models indicate that the Mean Absolute Percentage Error (MAPE) ranged from 1.36% to 1.81% for the LSTM model, while for sentiment analysis using the Random Forest model, the MAPE ranged from 1.25% to 3.76%.
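The difference between RMSE and MAPE can be illustrated with a short sketch. The prices below are hypothetical and chosen only to show that the same relative error yields very different RMSE values at different price levels, while MAPE stays the same.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the prices."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

# Hypothetical prices: two stocks at different price levels, each predicted 1% too high.
low_priced = np.array([400.0, 410.0, 420.0])
high_priced = np.array([3200.0, 3250.0, 3300.0])

for name, actual in [("low-priced", low_priced), ("high-priced", high_priced)]:
    predicted = actual * 1.01
    print(name, round(rmse(actual, predicted), 2), round(mape(actual, predicted), 2))
```

This is why a percentage-based metric is a useful companion to RMSE when comparing model performance across stocks.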

HDFC Bank was the only stock for which sentiment analysis performed better than the LSTM model. SBI emerged as the best-performing stock in both models, with sentiment analysis and LSTM yielding similar levels of performance.

Based on these observations, a 95% confidence level can be considered an approximate fit to explain the overall working of the models. However, the exercise did not succeed in formulating a concrete trading strategy.

An attempt was made to use the models to forecast future price trends rather than predict a single-day price. The results of trend forecasting were not satisfactory, and significant changes would be required to achieve meaningful outcomes in future work.

VI. Limitations / Closing Reflections

This exercise represents a research effort into a new approach by the authors, and therefore certain limitations were encountered due to time constraints, technical challenges, and other factors. Some of these limitations were identified during the course of the study and are briefly discussed in the following paragraphs.

One of the primary data-related challenges concerned the sourcing of news data. News was collected from various websites using relevant filters, which at times resulted in the inclusion of news items that were not directly related to the company under consideration. This could lead to distortion, where strong sentiment associated with an unrelated entity influences the sentiment score for a given stock.

One possible way to address this limitation is through manual annotation of news data. Additionally, word vector–based techniques could help improve data quality in such cases by better capturing contextual relevance.

Another limitation, involving both data and methodology, relates to the computation of daily sentiment. As described earlier, the approach involved concatenating all news items related to a single day and calculating a combined sentiment score. This method may dilute the impact of a strong positive or negative sentiment from a single news item due to the presence of multiple neutral items, resulting in an overall neutral sentiment for the day.

Alternative approaches that calculate sentiment for individual news items and then apply a suitable aggregation technique to derive daily sentiment could potentially yield better results.
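The dilution effect and one possible alternative can be sketched as follows. The per-headline compound scores are hypothetical, a simple mean stands in for the diluting effect of combining many items, and the "strongest score" rule is only one candidate aggregation that was not evaluated in this study.

```python
# Hypothetical per-headline compound scores for one trading day:
# one strongly negative item among several near-neutral ones.
compound_scores = [-0.85, 0.02, 0.00, 0.05, -0.01]

def mean_score(scores):
    """Simple average; a strong item is diluted by neutral ones."""
    return sum(scores) / len(scores)

def strongest_score(scores):
    """Keep the score with the largest magnitude, preserving its sign."""
    return max(scores, key=abs)

print(round(mean_score(compound_scores), 3))  # -0.158
print(strongest_score(compound_scores))       # -0.85
```

Other aggregations, such as a magnitude-weighted mean or counting items above a threshold, fit the same pattern: score each headline first, then reduce the day's scores to one value.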

From a model implementation perspective, one limitation encountered was the use of Random Forest Regressor for sentiment analysis. While LSTM demonstrated superior performance compared to Random Forest when predicting based on closing prices, attempts to implement a multivariate LSTM incorporating both sentiment data and closing prices did not yield successful results during this exercise.

REFERENCES

  1. Bing, L., Chan, K. C. C., & Ou, C. Public sentiment analysis in Twitter data for prediction of a company’s stock price movements. 2014 IEEE 11th International Conference on E-Business Engineering. IEEE. (2014).
  2. Bollen, J., Mao, H., & Zeng, X. Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1–8. (2011).
  3. Butler, K. C., & Malaikah, S. J. Efficiency and inefficiency in thinly traded stock markets: Kuwait and Saudi Arabia. Journal of Banking & Finance, 16(1), 197–210. (1992).
  4. Chen, R., & Lazer, M. Sentiment analysis of Twitter feeds for the prediction of stock market movement. CS229, pp. 15. (2011).
  5. Dogan, E., & Kaya, B. Deep learning based sentiment analysis and text summarization in social networks. 2019 International Artificial Intelligence and Data Processing Symposium (IDAP). IEEE. (2019).
  6. Fama, E. F. The behavior of stock-market prices. The Journal of Business, 38(1), 34. (1965).
  7. Gallagher, L. A., & Taylor, M. P. Permanent and temporary components of stock prices: Evidence from assessing macroeconomic shocks. Southern Economic Journal, 69(2), 345. (2002).
  8. Gruhl, D., Guha, R., Kumar, R., Novak, J., & Tomkins, A. The predictive power of online chatter. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD ’05). ACM Press. (2005).
  9. Briggs, J. Sentiment analysis for stock price prediction. Towards Data Science. https://towardsdatascience.com/sentiment-analysis-for-stock-price-prediction-in-python-bed40c65d178 (Last accessed: 2020).
  10. Kavussanos, M. G., & Dockery, E. A multivariate test for stock market efficiency: The case of ASE. Applied Financial Economics, 11(5), 573–579. (2001).
  11. Kilimci, Z. H., & Akyokus, S. The analysis of text categorization represented with word embeddings using homogeneous classifiers. 2019 IEEE INISTA. IEEE. (2019).
  12. Li, X., Wu, P., & Wang, W. Incorporating stock prices and news sentiments for stock market prediction: A case of Hong Kong. Information Processing & Management, 57(5), 102212. (2020).
  13. Malandri, L., Xing, F. Z., Orsenigo, C., Vercellis, C., & Cambria, E. Public mood–driven asset allocation: The importance of financial sentiment in portfolio management. Cognitive Computation, 10(6), 1167–1176. (2018).
  14. Mikolov, T., Chen, K., Corrado, G., & Dean, J. Efficient estimation of word representations in vector space. http://arxiv.org/abs/1301.3781 (2013).
  15. Mishne, G., & Glance, N. Predicting movie sales from blogger sentiment. AAAI. (2006).
  16. Pagolu, V. S., Challa, K. N. R., Panda, G., & Majhi, B. Sentiment analysis of Twitter data for predicting stock market movements. http://arxiv.org/abs/1610.09225 (2016).
  17. Picasso, A., Merello, S., Ma, Y., Oneto, L., & Cambria, E. Technical analysis and sentiment embeddings for market trend prediction. Expert Systems with Applications, 135, 60–70. (2019).
  18. Schumaker, R. P., & Chen, H. Textual analysis of stock market prediction using breaking financial news: The AZFin text system. ACM Transactions on Information Systems, 27(2), 1–19. (2009).
  19. Tekin, S., & Canakoglu, E. Prediction of stock returns in Istanbul stock exchange using machine learning methods. 2018 IEEE SIU. IEEE. (2018).
  20. Zhang, L. Sentiment analysis on Twitter with stock price and significant keyword correlation, pp. 130. (2013).
