ML -- Reddits Analysis

EXECUTIVE SUMMARY

To better understand each post and its sentiment, a topic was assigned to each post, and it is evident that some topics tend to be more positive than others. Because the topics overlap on words like "elderly" and "crow", the words unique to each topic explain more about it. The average Pos-Neg Ratio of all posts is 1.64, while the ratio of Topic 4 (more about accusation) is only 1.13, compared to 3.94 for Topic 3 (more about improvement). The topic is therefore an important feature for determining the sentiment of a post, as the following analysis shows.

Summary Table of Topics (Average Pos-Neg Ratio: 1.64)
Topic	Keywords	Pos-Neg Ratio
0	senator	2.05
1	short, das	1.61
2	integrate, conspiracy	1.46
3	improvement	3.94
4	officer, hillary, accusation	1.13

Apart from the topic, whether a post is related to racial equality or to government is also a determinant of its sentiment. Since political factors are always an essential driver of the stock market, these subreddit posts may have an impact on stock prices. According to the time series plot of the Nasdaq Index, the predictions are fairly close to the actual closing prices. The best model uses the previous 60 days of historical data to predict the next day's price.

Topic Modeling - LDA

Different Topics, Different Keywords, Different Sentiments

LDA, one of the best-known topic modeling methods, was used here to assign a topic to each post. Some topics tend to be more positive than others. Some keywords appear in more than one topic, so the words unique to each topic explain more about it. The average Pos-Neg Ratio of all posts is 1.64, while the ratio of Topic 4 (more about accusation) is only 1.13, compared to 3.94 for Topic 3 (more about improvement).
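The per-topic ratios above can be reproduced from labelled posts with a few lines of Python. This is a minimal sketch, assuming the Pos-Neg Ratio is the number of positive posts divided by the number of negative posts within a topic (the report does not spell out the definition). The function name `pos_neg_ratio_by_topic` and the toy data are illustrative and do not come from the project.

```python
from collections import defaultdict


def pos_neg_ratio_by_topic(posts):
    """Return {topic: positive_count / negative_count}.

    `posts` is an iterable of (topic, sentiment) pairs. Sentiments other
    than "positive" and "negative" (for example "neutral") are ignored.
    Assumption: the ratio is a simple count ratio within each topic.
    """
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for topic, sentiment in posts:
        if sentiment in ("positive", "negative"):
            counts[topic][sentiment] += 1

    ratios = {}
    for topic, c in counts.items():
        # A topic without negative posts has no finite ratio.
        if c["negative"]:
            ratios[topic] = c["positive"] / c["negative"]
        else:
            ratios[topic] = float("inf")
    return ratios


# Toy data for illustration only, not taken from the project.
toy_posts = [
    (3, "positive"), (3, "positive"), (3, "positive"), (3, "negative"),
    (4, "positive"), (4, "negative"), (4, "neutral"),
]
ratios = pos_neg_ratio_by_topic(toy_posts)
print(ratios)
```

A ratio above the overall average marks a topic that leans positive, and a ratio near 1 marks a topic with roughly as many negative posts as positive ones.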

Normally Distributed on Each Dimension

After assigning topics to all the posts, it is necessary to examine the distribution of the topics to see whether the model works well. Therefore t-SNE, a tool for visualizing high-dimensional data, was applied to the TF-IDF results to reduce the data to 2 dimensions, and the points were colored by topic. According to the plot, each topic is approximately normally distributed along these two dimensions, which suggests that the LDA model separates the topics successfully.

Sentiment Prediction
Based on the Features of Post

Complex Model Does Not Make a Big Difference

To find the relationship between a post's sentiment and its other features, three different types of model were built, and three different hyperparameter sets were tried for each type.

Before building these models, the data were restricted to the first half of 2020, since we want to find the influence of the pandemic, and only some important features were kept, including author_flair_text, score, topic, covid, etc. The training, validation and test sets were then sampled from the whole data set in the proportions 0.8, 0.18 and 0.02.
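A three-way random split in these proportions can be sketched as follows. This is an illustration in plain Python, not the project's own code. The function name `three_way_split` and the seed value are assumptions made for the example.

```python
import random


def three_way_split(rows, fractions=(0.8, 0.18, 0.02), seed=42):
    """Shuffle `rows` and cut them into train, validation and test lists.

    The seed is an arbitrary choice that makes the split reproducible.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = int(round(n * fractions[0]))
    n_val = int(round(n * fractions[1]))

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 800 180 20
```

With only 2% of the rows held out, the test metrics rest on a small sample, so small differences between models should be read with caution.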

According to the evaluation metrics, the more complex models did not make a big difference, but they did require much more training time; hence the simpler model is the better choice here.

Besides, the AUC scores, which are close to 0.5, show that these models did not actually predict well on the test set. This is understandable, since the sentiment of a post is usually directly related to the text itself rather than to the topic or post score.
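To see why an AUC near 0.5 signals a weak model, recall that AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison. It is illustrative only: the function `roc_auc` and the toy scores are not from the project, and the quadratic loop is meant for small examples rather than full data sets.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
constant = roc_auc(labels, [0.6, 0.6, 0.6, 0.6])  # scores carry no information
perfect = roc_auc(labels, [0.9, 0.8, 0.2, 0.1])   # every positive outranks every negative
print(constant, perfect)  # 0.5 1.0
```

A model whose scores do not separate the classes lands at 0.5, which is why values around 0.53 indicate that the features add little ranking power.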

Hyperparameter Tuning
Random Forest
numTrees	maxDepth	Accuracy	F1-Score	Precision	Recall	AUC
50	5	0.6502	0.5567	0.6416	0.6502	0.5320
100	5	0.6503	0.5563	0.6425	0.6503	0.5318
100	10	0.6507	0.5564	0.6444	0.6507	0.5321

Gradient-Boosted Trees
maxIter	maxDepth	Accuracy	F1-Score	Precision	Recall	AUC
10	5	0.6505	0.5563	0.6434	0.6505	0.5319
20	5	0.6509	0.5611	0.6404	0.6509	0.5344
20	10	0.6511	0.5702	0.6341	0.6511	0.5391

Logistic Regression
maxIter	threshold	Accuracy	F1-Score	Precision	Recall	AUC
100	0.5	0.6504	0.5589	0.6400	0.6504	0.5331
200	0.5	0.6504	0.5589	0.6400	0.6504	0.5331
200	0.45	0.6496	0.5668	0.6308	0.6496	0.5366

Racial Equality, Government and Topic

To identify the determinants of a post's sentiment, bar plots of feature importance were made for the simplest model of each of the three types. They reveal that whether the post is related to racial equality, whether it is related to government, and its topic are the three most important determinants of a post's sentiment.

Hyperparameter Tuning
Moving Average (Baseline Model)
Lookback	Training R-Squared	Training RMSE	Test RMSE	Test MAPE
5	0.9849	216.0944	921.2071	0.0623
15	0.9636	335.6070	901.6648	0.0608
30	0.9262	480.6264	839.9877	0.0557

Support Vector Regression
Kernel	Training R-Squared	Training RMSE	Test RMSE	Test MAPE
poly	0.6860	985.9986	2969.8434	0.2078
linear	0.5547	1174.2439	2869.1457	0.2008
rbf	0.8528	675.0404	3389.8888	0.2218

Long Short-Term Memory (LSTM)
Lookback	Training R-Squared	Training RMSE	Test RMSE	Test MAPE
15	0.9792	253.6997	453.3420	0.0305
30	0.9804	247.7742	476.2668	0.0311
60	0.9881	176.6279	344.5321	0.0227
90	0.9485	308.5264	667.7971	0.0467
120	0.9606	230.3190	462.4384	0.0306

Stock Price Prediction
Based on the Daily Sentiment of Subreddit

LSTM Can Help to Defeat the Average

The stock market fluctuates all the time because of many factors, making it more difficult to predict than other data. In this project, SVR and LSTM models were built to predict the Nasdaq Index from the daily sentiment of the subreddit.

Feature engineering is the first step in building predictive models. The daily number of posts, pos-neg ratio, average post score, average negative post score and average positive post score were calculated from the large Reddit dataset. Daily new COVID cases and new deaths were also included as features. Since stock prices are time series data, they cannot be randomly split into training and test sets. Consequently, the data were split at the date 04/01/2021.
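The daily aggregation and the date-based split described above might look like the following sketch. Several details are assumptions made for illustration: the field names, the helper names `daily_features` and `split_by_date`, reading 04/01/2021 as April 1, 2021, and placing the cutoff day itself in the test set. The COVID case and death counts would be joined on the same date key and are left out here.

```python
from collections import defaultdict
from datetime import date


def daily_features(posts):
    """Aggregate (day, sentiment, score) posts into one feature row per day."""
    by_day = defaultdict(list)
    for day, sentiment, score in posts:
        by_day[day].append((sentiment, score))

    rows = {}
    for day, items in sorted(by_day.items()):
        pos = [s for label, s in items if label == "positive"]
        neg = [s for label, s in items if label == "negative"]
        rows[day] = {
            "n_posts": len(items),
            "pos_neg_ratio": len(pos) / len(neg) if neg else float("inf"),
            "avg_score": sum(s for _, s in items) / len(items),
            "avg_pos_score": sum(pos) / len(pos) if pos else 0.0,
            "avg_neg_score": sum(neg) / len(neg) if neg else 0.0,
        }
    return rows


def split_by_date(rows, cutoff):
    """Keep time order: days before `cutoff` train, the rest test."""
    train = {d: r for d, r in rows.items() if d < cutoff}
    test = {d: r for d, r in rows.items() if d >= cutoff}
    return train, test


# Toy posts for illustration only.
toy = [
    (date(2021, 3, 31), "positive", 10),
    (date(2021, 3, 31), "negative", 4),
    (date(2021, 4, 1), "positive", 6),
]
rows = daily_features(toy)
train_rows, test_rows = split_by_date(rows, date(2021, 4, 1))
print(rows[date(2021, 3, 31)])
print(len(train_rows), len(test_rows))
```

Splitting on a date rather than at random keeps every test day strictly after the training period, so the model is never evaluated on days it could have seen the future of.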

In LSTM, a lookback period was predefined so that the model learns not only from the information on a single day but also from patterns in the past trend. According to the evaluation metrics, LSTM performs better than the Moving Average, especially with a 60-day lookback, while SVR performs rather poorly on the test set.
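The lookback idea and the baseline can be sketched in plain Python. Each training example is a window of the previous `lookback` values and its target is the next day's value. The moving-average baseline predicts the mean of the window, and RMSE and MAPE score the result. The helper names and the toy prices are illustrative, and MAPE is returned as a fraction, which is an assumption based on the scale of the values reported in the tables. The LSTM itself is not shown, since it would consume the same windows through a deep learning library.

```python
import math


def make_windows(series, lookback):
    """Pair each run of `lookback` consecutive values with the next value."""
    windows, targets = [], []
    for i in range(lookback, len(series)):
        windows.append(series[i - lookback:i])
        targets.append(series[i])
    return windows, targets


def moving_average_forecast(windows):
    """Baseline: predict the next value as the mean of the window."""
    return [sum(w) / len(w) for w in windows]


def rmse(actual, predicted):
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.05 means 5%)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)


# Toy closing prices for illustration only.
closes = [100.0, 102.0, 104.0, 106.0, 108.0]
windows, targets = make_windows(closes, lookback=3)
preds = moving_average_forecast(windows)
print(windows)   # [[100.0, 102.0, 104.0], [102.0, 104.0, 106.0]]
print(targets)   # [106.0, 108.0]
print(preds)     # [102.0, 104.0]
print(rmse(targets, preds), mape(targets, preds))
```

On a steadily rising series the moving average always lags behind the next value, which is one reason a model that learns the trend inside the window can beat it.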

LSTM with 60 Days' Lag

The results show that LSTM can beat the moving average, and among all the lookback choices, a 60-day lag provides enough information for the model to learn the patterns and trends. As the prediction moves forward and gets farther from the last training date, the predicted price also gets farther from the true price.

Summary

Some topics tend to be more positive than others. The average Pos-Neg Ratio of all posts is 1.64, while the ratio of Topic 4 (more about accusation) is only 1.13, compared to 3.94 for Topic 3 (more about improvement).
The sentiment of a post is usually directly related to the text itself rather than to the topic or post score.
Whether the post is related to racial equality, whether it is related to government, and its topic are the three most important determinants of a post's sentiment.
LSTM can beat the Moving Average, especially with a 60-day lookback.

Resources

Jupyter Notebook

What's next?

The next section is Conclusions, where we will summarize all the findings so far.