Reddit Project ML Script

Reading the dataset

1. Latent Dirichlet allocation

1.1. Calculate TF-IDF
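
The notebook's own TF-IDF code is not reproduced in this outline. As a minimal sketch of what the step computes, assuming each post is already tokenized into a list of words, the hypothetical helper `tf_idf` below uses the plain definition: term frequency times log(N / document frequency). Libraries typically use smoothed variants, so exact values may differ from the project's output.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per document (hypothetical helper).

    Uses the unsmoothed textbook formula:
        tf  = count of term in doc / number of tokens in doc
        idf = log(number of docs / number of docs containing term)
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up toy posts, not project data.
posts = [["stock", "market", "stock"], ["market", "news"]]
for row in tf_idf(posts):
    print(row)
```

A term that appears in every document ("market" here) gets weight 0, which is the point of the IDF factor: ubiquitous words carry little topic information.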

1.2. Train an LDA Model

1.3. Use t-SNE to Plot the Distribution of Topics

Figure 1. The Distribution of Post Topics Using LDA, Sample Size = 1400 (0.0002)

1.4. Wordcloud of Each Topic

2. The Relation Between Sentiment and Author Flair Type and Other Dummy Variables

2.1. Data Subset

2.2. Split the Data

2.3. Random Forest

2.3.1. Build a Pipeline and Train a Random Forest Model

2.3.2. Model Evaluation and Comparison

2.3.3. Feature Importance

2.3.4. Save Model

2.4. Gradient-Boosted Trees

2.4.1. Build a Pipeline and Train a GBT Model

2.4.2. Model Evaluation and Comparison

2.4.3. Feature Importance

2.4.4. Save Model

2.5. Logistic Regression

2.5.1. Build a Pipeline and Train a Logistic Regression Model

2.5.2. Model Evaluation and Comparison

2.5.3. Feature Importance

2.5.4. Save Model

3. The Relation Between Nasdaq Index, Reddit Sentiments and COVID Cases

3.1. Data Preparation

3.1.1. Join External Data

3.1.2. Missing Data Imputation

3.1.3. Reddit Data Aggregation

3.1.4. Create New Features

3.2. Package Installation and Import

3.3. Moving Average Estimation

3.3.1. Define a Moving Average (MA) Function
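
The notebook's MA function is not shown in this outline. One plausible shape for it, given as a sketch only, is a one-step-ahead forecast where each prediction is the mean of the preceding `window` observations. The name `moving_average_forecast` and the `window` parameter are assumptions made for illustration.

```python
from typing import List, Optional


def moving_average_forecast(series: List[float], window: int) -> List[Optional[float]]:
    """One-step-ahead moving average forecast (hypothetical helper).

    The forecast at position t is the mean of the `window` values strictly
    before t, so no future information leaks into the prediction.
    Positions without enough history are returned as None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    forecasts: List[Optional[float]] = []
    for t in range(len(series)):
        if t < window:
            forecasts.append(None)  # not enough history yet
        else:
            forecasts.append(sum(series[t - window:t]) / window)
    return forecasts


# Made-up index values, not project data.
prices = [10.0, 12.0, 14.0, 16.0, 18.0]
print(moving_average_forecast(prices, window=2))
# [None, None, 11.0, 13.0, 15.0]
```

Using only values before position t keeps the baseline comparable with models such as SVR and LSTM that also predict from past data.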

3.3.2. Evaluate on Training Set

3.3.3. Evaluate on Test Set

3.3.4. Evaluation Metrics
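
The outline does not say which metrics the notebook reports. As an assumption, the sketch below shows three that are common for price forecasts: RMSE, MAE, and MAPE. The function names are illustrative, not taken from the project.

```python
import math
from typing import Sequence


def _check(actual: Sequence[float], predicted: Sequence[float]) -> None:
    # Both sequences must be non-empty and aligned element by element.
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error; penalizes large misses more heavily."""
    _check(actual, predicted)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error, in the same units as the series."""
    _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error; undefined if any actual value is 0."""
    _check(actual, predicted)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values, not project results.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```

Reporting the same metrics for the MA, SVR, and LSTM sections makes the three models directly comparable on the test set.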

3.4. SVR

3.4.1. Define an SVR Function

3.4.2. Compare Different Models - Training Set

3.4.3. Compare Different Models - Test Set

3.4.4. Evaluation Metrics

3.5. LSTM

3.5.1. Define an LSTM Function with Lookback
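
The LSTM code itself is not shown in this outline. The part that the lookback period controls can be sketched independently of any deep learning library: the series is cut into overlapping windows of `lookback` past values, each paired with the next value as the target. The helper name `make_lookback_windows` is an assumption for illustration.

```python
from typing import List, Sequence, Tuple


def make_lookback_windows(
    series: Sequence[float], lookback: int
) -> Tuple[List[List[float]], List[float]]:
    """Turn a 1-D series into supervised (window, next value) pairs.

    Hypothetical helper: X[i] holds `lookback` consecutive values and
    y[i] is the value that immediately follows them.
    """
    if lookback < 1:
        raise ValueError("lookback must be at least 1")

    X: List[List[float]] = []
    y: List[float] = []
    for i in range(len(series) - lookback):
        X.append(list(series[i:i + lookback]))
        y.append(series[i + lookback])
    return X, y


# Made-up values, not project data.
X, y = make_lookback_windows([1.0, 2.0, 3.0, 4.0, 5.0], lookback=2)
print(X)  # [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
print(y)  # [3.0, 4.0, 5.0]
```

A longer lookback gives the model more context per sample but yields fewer training pairs, which is the trade-off behind comparing several lookback periods.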

3.5.2. Compare Different Lookback Periods - Training Set

3.5.3. Compare Different Lookback Periods - Test Set

3.5.4. Evaluation Metrics