Reddit Project EDA Script

Yongrui Chen

yc910@georgetown.edu

Reading the entire dataset

Make sure your SparkSession is active:

1. Basic Info about the Data of #PoliticalCompassMemes

Report on the basic info about your dataset. What are the interesting columns? What is the schema? How many rows do you have? etc. etc.

Answer:

The data has 12904319 rows and 51 columns in total. The most interesting columns are author, author_created_utc, author_premium, author_flair_text, author_fullname, body, collapsed, collapsed_reason, controversiality, created_utc, distinguished, gilded, no_follow, quarantined, removal_reason, score, send_replies, top_awarded_type, total_awards_received.

Most of the columns are either string type or boolean type. The string columns are descriptive variables, while the boolean columns are features of the authors and the posts.

2. Basic Data Quality Checks

Answer:

In the dataset, 8 columns have NaN values, 5 of which have nearly or more than 50% NaN values, so these 5 columns were removed. Rows that contain NaN values were also dropped.
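The missing-value step can be sketched in plain Python on a toy list of rows. This only illustrates the logic and is not the Spark code used for the report. The 0.5 cutoff, the order of the two steps, and the helper names (missing_fraction, drop_sparse_columns, drop_incomplete_rows) are assumptions made for the example.

```python
# Toy rows standing in for Reddit comments; None marks a missing value.
rows = [
    {"author": "a1", "body": "first comment", "removal_reason": None, "score": 3},
    {"author": "a2", "body": None, "removal_reason": None, "score": 1},
    {"author": "a3", "body": "third comment", "removal_reason": None, "score": None},
    {"author": "a4", "body": "fourth comment", "removal_reason": "spam", "score": 7},
]


def missing_fraction(rows, column):
    """Share of rows whose value in `column` is None."""
    return sum(r[column] is None for r in rows) / len(rows)


def drop_sparse_columns(rows, threshold=0.5):
    """Remove columns whose missing share is at or above `threshold` (assumed cutoff)."""
    keep = [c for c in rows[0] if missing_fraction(rows, c) < threshold]
    return [{c: r[c] for c in keep} for r in rows]


def drop_incomplete_rows(rows):
    """Remove rows that still contain a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


# Assumed order: drop sparse columns first, then drop incomplete rows.
cleaned = drop_incomplete_rows(drop_sparse_columns(rows))
```

Dropping the sparse columns first keeps rows that are missing values only in those columns.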

To check the length of the comments, each post body was split into a list of words; the average comment length is 22 words. Posts longer than 5 words and shorter than 100 words were then kept, leaving 7753637 rows in the data.
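A minimal plain-Python sketch of the length check, assuming whitespace splitting and strict bounds on both sides. The sample comments and the helper word_count are made up for illustration.

```python
# Hypothetical comment bodies of very different lengths.
comments = [
    "lol",
    "this one has exactly six words",
    "word " * 150,
]


def word_count(body):
    """Number of whitespace-separated tokens in a comment body."""
    return len(body.split())


lengths = [word_count(c) for c in comments]
average_length = sum(lengths) / len(lengths)

# Keep comments strictly longer than 5 and strictly shorter than 100 words.
kept = [c for c in comments if 5 < word_count(c) < 100]
```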

3. Transformations -- Create New Variables and Convert Data Types

Answer:

In Q2, the post length was already created as a new variable, len_body. In this part, date, month, and score_group were also created, and all boolean variables were converted to int type.
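The transformations can be sketched as follows in plain Python. It assumes created_utc is a Unix timestamp in seconds. The score_group cutoffs are purely illustrative because the report does not state the actual bins, and the transform helper is hypothetical.

```python
from datetime import datetime, timezone


def score_group(score):
    """Bucket a score; these cutoffs are illustrative, not the report's."""
    if score < 0:
        return "negative"
    if score < 10:
        return "low"
    return "high"


def transform(row):
    """Add date, month, and score_group, and cast boolean fields to int."""
    created = datetime.fromtimestamp(row["created_utc"], tz=timezone.utc)
    out = {k: int(v) if isinstance(v, bool) else v for k, v in row.items()}
    out["date"] = created.strftime("%Y-%m-%d")
    out["month"] = created.strftime("%Y-%m")
    out["score_group"] = score_group(row["score"])
    return out


# Hypothetical row; 1577836800 is 2020-01-01 00:00:00 UTC.
row = {"created_utc": 1577836800, "score": 12, "author_premium": True, "no_follow": False}
result = transform(row)
```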

4. Exploratory Data Analysis

Answer:

To conduct the EDA, a small sample (0.1%) was first drawn from the original dataset, and five plots were made: the distribution of the length of the post body, a boxplot of post length for premium and non-premium authors, a line plot of the number of posts per month, a KDE plot of post length against score, and a radar plot of the average score by author flair.

Findings:

  1. Most of the posts (over 60%) are shorter than 20 words.
  2. There is no significant difference in post length between premium and non-premium authors.
  3. After the start of COVID (2020-01), the number of posts per month began to increase quickly.
  4. Post length and post score are unlikely to have a significant relationship.
  5. Posts from right-related author_flair types tend to have higher scores.

Figure 1. The Distribution of the Length of the Post Body (Sample Size = 7881 (0.1%))

Figure 2. Boxplot of "Post Length" for Premium or Non-premium Authors

Figure 3. Lineplot of the Trend of Number of Posts Every Month

Figure 4. KDE Plot Between the Post Length and Score

Figure 5. Radar Plot of the Average Score of Different Authors' Flairs

5. Summary Table

Answer:

  1. Summary table of average score, number of posts, and average post length by month.
  2. Summary table of number of authors, number of premium users, number of posts, average post length, and average score by author flair type.
  3. Summary table of number of posts, average post length, average score, average controversiality, number of send_replies, number of no_follow, and average gilded by score group.
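A grouped summary such as the first table can be sketched in plain Python as below. The toy rows and the helper summarize_by are assumptions for illustration. The same function would work for any grouping key present in the rows, for example score_group.

```python
from collections import defaultdict

# Hypothetical rows with the columns needed for the monthly summary.
rows = [
    {"month": "2020-01", "score": 4, "len_body": 10},
    {"month": "2020-01", "score": 8, "len_body": 20},
    {"month": "2020-02", "score": 3, "len_body": 30},
]


def summarize_by(rows, key):
    """Count rows and average score and len_body for each value of `key`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    summary = {}
    for value, members in sorted(groups.items()):
        n = len(members)
        summary[value] = {
            "n_posts": n,
            "avg_score": sum(m["score"] for m in members) / n,
            "avg_len_body": sum(m["len_body"] for m in members) / n,
        }
    return summary


by_month = summarize_by(rows, "month")
```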

6. Create Dummy Variables

Answer:

Dummy variables for covid, election, economics, finance, gender, and race were created using regex to evaluate whether a post mentions each topic, and summary tables were made based on these dummy variables.
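A minimal sketch of regex-based dummy variables, shown for three of the six topics. The keyword lists are hypothetical because the report does not list the actual patterns, and topic_dummies is a helper invented for this example.

```python
import re

# Hypothetical keyword patterns; the report's actual regexes are not given.
TOPIC_PATTERNS = {
    "covid": re.compile(r"\b(covid|coronavirus|pandemic)\b", re.IGNORECASE),
    "election": re.compile(r"\b(election|vote|ballot)\b", re.IGNORECASE),
    "economics": re.compile(r"\b(economy|economics|inflation)\b", re.IGNORECASE),
}


def topic_dummies(body):
    """Return a 0/1 flag per topic, 1 if the body matches that topic's pattern."""
    return {topic: int(bool(p.search(body))) for topic, p in TOPIC_PATTERNS.items()}


flags = topic_dummies("The election happened during the COVID pandemic.")
```

Word boundaries (`\b`) keep a short keyword from matching inside a longer, unrelated word.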

7. Join External Data

Answer:

Here, COVID data and daily stock price data were collected and joined to the Reddit data. To explore the relationships between these variables, a correlation plot was made.

According to the plot, the number of posts is related to the stock price, and the number of new COVID cases is related to the number of posts.
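The join and correlation can be sketched in plain Python as follows. It assumes an inner join on date and Pearson correlation, neither of which the report specifies. The daily values are invented and the helpers inner_join and pearson are hypothetical.

```python
from math import sqrt

# Invented daily series keyed by date; not real Reddit or COVID figures.
posts_per_day = {"2020-03-01": 100, "2020-03-02": 150, "2020-03-03": 200, "2020-03-04": 90}
new_cases = {"2020-03-01": 10, "2020-03-02": 20, "2020-03-03": 30}


def inner_join(left, right):
    """Pair values for dates present in both series, ordered by date."""
    return [(d, left[d], right[d]) for d in sorted(left.keys() & right.keys())]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


joined = inner_join(posts_per_day, new_cases)
r = pearson([p for _, p, _ in joined], [c for _, _, c in joined])
```

An inner join drops dates that appear in only one series, so the correlation uses only days with both measurements.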

COVID Data

Stock Price Data

8. Summary

  1. Most of the posts (over 60%) are shorter than 20 words.
  2. There is no significant difference in post length between premium and non-premium authors.
  3. After the start of COVID (2020-01), the number of posts per month began to increase quickly.
  4. Post length and post score are unlikely to have a significant relationship.
  5. Posts from right-related author_flair types tend to have higher scores.
  6. According to the correlation plot, the number of posts is related to the stock price, and the number of new COVID cases is related to the number of posts.