Reddit Project NLP Script

Reading the dataset

1. Text Checks / Analysis

Answer:

Most of the posts are short: more than 50 percent contain fewer than 20 words. To identify the most common and important words, the text is cleaned first.
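The length check above can be sketched as follows, using a small synthetic sample in place of the real Reddit dataframe (the column name "post" is an assumption):

```python
import pandas as pd

# Synthetic stand-in for the Reddit dataset; the real data has many more rows.
posts = pd.DataFrame({"post": [
    "Short post about politics",
    "The government should listen to people on both the left and the right",
    "auth vs lib, a never ending debate",
]})

# Word count per post, then the share of posts under 20 words.
posts["n_words"] = posts["post"].str.split().str.len()
share_short = (posts["n_words"] < 20).mean()
print(f"{share_short:.0%} of posts have fewer than 20 words")
```

On the real dataset this share is what motivates the "more than 50 percent" claim above.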

2. Clean the Text

Answer:

To clean the data, a pipeline was constructed: a regex first removes the URLs embedded in some posts, then non-word characters are stripped, stop words are removed, and the remaining tokens are normalized. A Stemmer was tried initially, but it showed an unwanted pattern of cutting the trailing e off words, so a Lemmatizer was used instead to reduce each word to its base form.
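The URL-stripping, non-word, and stop-word steps can be sketched as below, using sklearn's built-in English stop list; the lemmatization step (e.g. NLTK's WordNetLemmatizer) is omitted here because it requires downloading the WordNet corpus:

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # 1. URLs go first
NON_WORD_RE = re.compile(r"[^a-z\s]")           # 2. then non-word characters

def clean_post(text: str) -> str:
    text = URL_RE.sub(" ", text.lower())
    text = NON_WORD_RE.sub(" ", text)
    # 3. drop stop words and stray single characters
    tokens = [t for t in text.split()
              if len(t) > 1 and t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)

print(clean_post("Check https://example.com: the government's new policy!"))
```

In the actual pipeline a lemmatizer runs on the surviving tokens as a final step.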

3. Identify Most Common Words

Answer:

From the results of CountVectorizer and TF-IDF, the most common and important words are people, government, auth/lib, and right/left. Therefore, dummy variables for government, eco_left / eco_right / eco_centralist, and auth / lib are created.

3.1 CountVectorizer

3.2 TF-IDF

3.3 Create Dummy Variables
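The dummy variables can be built with keyword flags on the cleaned text; the patterns and column names below are illustrative:

```python
import pandas as pd

posts = pd.DataFrame({"clean_text": [
    "government policy people",
    "eco left auth views",
    "lib right free market",
]})

# Flag posts that mention each keyword group (patterns are illustrative).
posts["government"] = posts["clean_text"].str.contains("government").astype(int)
posts["auth"] = posts["clean_text"].str.contains(r"\bauth\b").astype(int)
posts["lib"] = posts["clean_text"].str.contains(r"\blib\b").astype(int)
print(posts[["government", "auth", "lib"]])
```

The eco_left / eco_right / eco_centralist dummies follow the same pattern with their own keyword lists.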

4. Sentiment Model

Answer:

Reddit posts resemble Twitter tweets in several ways, so the pretrained deep learning model sentimentdl_use_twitter was used to build a pipeline for sentiment analysis of the posts.

5. Visualizations

  1. Contour Plot: a higher positive-sentiment rate tends to be associated with a lower Nasdaq close price.
  2. Radar Plot: posts from centralist authors usually have a higher positive-sentiment rate.
  3. Boxenplot: positive posts tend to have higher post scores.

5.1 Contour Plot

5.2 Radar Plot

5.3 Boxenplot

6. Summary Tables

  1. After the outbreak of COVID, the proportion of negative posts started to increase.
  2. COVID- and Racial Equality-related posts tend to have a larger proportion of negative sentiment, while Economics- and Gender Equality-related posts are more likely to be positive.
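A sentiment-proportion table like the one behind these observations can be built with a normalized crosstab; the topic labels below are synthetic stand-ins:

```python
import pandas as pd

df = pd.DataFrame({
    "topic": ["COVID", "COVID", "Economics", "Economics", "Racial Equality"],
    "sentiment": ["negative", "negative", "positive", "negative", "negative"],
})

# Share of each sentiment within each topic (rows sum to 1).
summary = pd.crosstab(df["topic"], df["sentiment"], normalize="index")
print(summary)
```

The same pattern with a month column instead of topic yields the over-time table for the first observation.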

7. Save the Data

Answer:

The data containing the clean text, sentiment labels, and new dummy variables is saved to an S3 bucket.
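With s3fs installed, pandas can write directly to an S3 URI; the sketch below uses a local path as a stand-in, and the bucket name in the comment is illustrative:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({
    "clean_text": ["government policy people"],
    "sentiment": ["positive"],
    "government": [1],
})

# With s3fs installed, this writes straight to S3, e.g.:
#   df.to_csv("s3://my-bucket/reddit_clean.csv", index=False)
# (bucket name is illustrative). A local path is used here as a stand-in.
out_path = os.path.join(tempfile.gettempdir(), "reddit_clean.csv")
df.to_csv(out_path, index=False)
print(pd.read_csv(out_path).shape)
```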