Basic Info and Quality Check

The data has 12904319 rows and 51 columns in total. The most interesting columns are author, author_created_utc, author_premium, author_flair_text, author_fullname, body, collapsed, collapsed_reason, controversiality, created_utc, distinguished, gilded, no_follow, quarantined, removal_reason, score, send_replies, top_awarded_type, total_awards_received.

Most of the columns are either string or boolean type. The string columns are descriptive variables, while the boolean columns are features of the authors and the posts.

As the table below shows, 8 columns have NaN values, and 5 of them have nearly or more than 50% NaN values, so these 5 columns were removed. Rows that contain NaN values in the remaining columns were then dropped as well.
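The cleaning step above can be sketched in pandas. This is only an illustration under stated assumptions: the source does not say which tool was used, the helper name `drop_sparse_columns` is hypothetical, and the 40% cutoff is an assumption chosen so that the 41% column is dropped while the 13% column is kept.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_nan_share: float = 0.4) -> pd.DataFrame:
    """Drop columns whose NaN share exceeds max_nan_share, then drop rows with any NaN.

    The 0.4 default is an assumed cutoff, not a value stated in the report.
    """
    nan_share = df.isna().mean()  # fraction of NaN values per column
    keep = nan_share[nan_share <= max_nan_share].index
    return df[keep].dropna()


# Toy frame standing in for the real comment data (values are made up).
toy = pd.DataFrame({
    "author": ["a", "b", "c", "d"],
    "author_premium": [True, np.nan, False, False],
    "removal_reason": [np.nan, np.nan, np.nan, "legal"],
})

# Per-column NaN report, similar in spirit to the NaN_count table.
nan_report = pd.DataFrame({
    "NaN_count": toy.isna().sum(),
    "NaN_share": toy.isna().mean().round(2),
})

cleaned = drop_sparse_columns(toy)
```

In this toy example, removal_reason is 75% NaN and is dropped, while author_premium is 25% NaN and is kept, after which the one row that still holds a NaN is removed.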

To check the length of the comments, each post body was split into a list of words; the average length is 22 words. Only posts with more than 5 and no more than 100 words were kept (the minimum and maximum after filtering are 6 and 100). This leaves 7753637 rows in the data.
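A minimal sketch of the word-count filter, assuming whitespace splitting and inclusive bounds of 6 and 100 words (both read off the Post Length table rather than stated explicitly). The column name `post_length` and both helper names are hypothetical.

```python
import pandas as pd


def add_word_count(df: pd.DataFrame, text_col: str = "body") -> pd.DataFrame:
    """Add a post_length column holding the number of whitespace-separated words."""
    out = df.copy()
    out["post_length"] = out[text_col].str.split().str.len()
    return out


def filter_by_length(df: pd.DataFrame, min_words: int = 6, max_words: int = 100) -> pd.DataFrame:
    """Keep rows whose post_length lies in [min_words, max_words] (inclusive)."""
    return df[df["post_length"].between(min_words, max_words)]


# Toy comments (made up) with 2, 6, and 150 words.
toy = pd.DataFrame({
    "body": ["too short", "this comment has exactly six words", "word " * 150],
})

with_len = add_word_count(toy)
kept = filter_by_length(with_len)
```

Only the six-word comment survives the filter here; the two-word and 150-word comments are discarded.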

NaN counts before cleaning
Column              NaN_count        Column                 NaN_count
author              0                distinguished          9645420 (75%)
author_created_utc  5312781 (41%)    gilded                 0
author_premium      1406212 (11%)    no_follow              0
author_flair_text   1732898 (13%)    quarantined            0
author_fullname     1408996 (11%)    removal_reason         12904313 (100%)
body                0                score                  0
collapsed           0                send_replies           0
collapsed_reason    12662774 (98%)   top_awarded_type       12904319 (100%)
controversiality    0                total_awards_received  0
created_utc         0

NaN counts after cleaning
Column             NaN_count   Column                 NaN_count
author             0           created_utc            0
author_premium     0           gilded                 0
author_flair_text  0           no_follow              0
author_fullname    0           quarantined            0
body               0           score                  0
collapsed          0           send_replies           0
controversiality   0           total_awards_received  0

Post Length
          Before     After
Count     11171161   7753637
Mean      21.9518    22.1503
Std-Dev   43.9908    18.3004
Min       1          6
Max       5798       100

Exploratory Data Analysis



Mainly Short Posts

The subject of the subreddit is #PoliticalCompassMemes, which is about memes and pictures, so most of the posts (over 60%) are short, with fewer than 20 words.

Premium Status Does Not Make a Difference

Post length does not differ significantly between premium and non-premium authors.

Drastic Increase in Number of Posts During the Pandemic

After the COVID outbreak (2020-01), the number of posts per month started to increase drastically. However, this trend reversed after the end of 2020, and the monthly number of posts began to decline.

Post Scores Concentrate Near the Origin

Post length and post score do not appear to have a significant relationship. Most posts have a length of less than 20 words and a score close to 0.

Right-Leaning Flairs Gain Higher Post Scores

Posts from authors with a right-related author_flair type tend to have a higher score, regardless of whether the authors are Authoritarian or Libertarian.

Summary Tables

Summary table by month
Month    Average Post Score  Number of Posts  Average Post Length
2019-07  12.7695             9071             21.2943
2019-08  12.3930             13063            20.7286
2019-09  12.9884             16527            20.0411
2019-10  12.2727             34677            20.2321
2019-11  13.0255             51367            19.9685
2019-12  14.4096             92830            19.1897
2020-01  14.6238             153545           19.6620
2020-02  14.1243             166806           20.6618
2020-03  15.1630             256694           20.6983
2020-04  14.8942             378431           21.6775
2020-05  14.8320             396712           21.6781
2020-06  16.2224             463962           21.8864
2020-07  16.3415             518974           22.0940
2020-08  15.7214             541562           22.2788
2020-09  16.1279             475011           22.3779
2020-10  15.5739             518073           22.4891
2020-11  15.6497             541137           22.2870
2020-12  15.2423             512667           22.2646
2021-01  14.8004             562761           22.9125
2021-02  14.2504             405018           22.4867
2021-03  13.8542             432279           22.6493
2021-04  14.2599             403575           22.8247
2021-05  14.6110             431636           22.7662
2021-06  14.4265             377259           22.7624
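A monthly summary like the one above could be produced along these lines. This is a sketch, not the report's actual code: it assumes created_utc holds Unix epoch seconds, and it relies on a hypothetical `post_length` column holding the word count of each post.

```python
import pandas as pd


def monthly_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate score, post count, and post length per calendar month (UTC)."""
    # Assumption: created_utc is a Unix timestamp in seconds.
    month = pd.to_datetime(df["created_utc"], unit="s", utc=True).dt.strftime("%Y-%m")
    return (
        df.assign(month=month)
        .groupby("month")
        .agg(
            avg_score=("score", "mean"),
            n_posts=("score", "size"),
            avg_length=("post_length", "mean"),
        )
    )


# Toy rows (made up): two posts in January 2020, one in February 2020.
toy = pd.DataFrame({
    "created_utc": [1577836800, 1577923200, 1580515200],
    "score": [10, 20, 5],
    "post_length": [10, 30, 20],
})

summary = monthly_summary(toy)
```

Formatting the month as a "YYYY-MM" string keeps the index sortable and matches the labels used in the table.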

Summary table by author flair type
Author Flair Type    Number of Authors  Number of Premium Authors  Number of Posts  Average Post Length  Average Post Score
auth-authcenter      11119              361                        741710           23.1689              13.4810
authright-authright  13370              431                        624875           20.5316              17.0034
right-right          14897              627                        621542           21.8659              16.4658
left-left            20502              1091                       641302           23.9589              11.6959
authleft-authleft    9715               411                        415993           21.8329              13.1836
centrist-centrist    23233              1162                       858226           22.4528              15.3651
lib-libcenter        28843              1384                       924146           22.2499              15.3306
libright2-libright   5172               258                        288570           20.0923              17.2759
libright-libright    28530              1460                       1311434          21.6714              17.8269
centg-centrist       6902               284                        207879           20.7911              14.5259
libleft-libleft      38824              1864                       1110088          22.6661              12.9708
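The per-flair figures combine author-level counts with post-level averages. One way to compute them is sketched below, assuming author_premium is boolean, each (author, flair) pair is counted once, and a hypothetical `post_length` column exists. The function name and output column names are illustrative only.

```python
import pandas as pd


def flair_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise authors and posts per author_flair_text value."""
    # Count each (author, flair) pair once for the author-level columns.
    per_author = df.drop_duplicates(["author", "author_flair_text"])
    authors = per_author.groupby("author_flair_text").agg(
        n_authors=("author", "size"),
        n_premium_authors=("author_premium", "sum"),
    )
    # Post-level statistics use every row.
    posts = df.groupby("author_flair_text").agg(
        n_posts=("score", "size"),
        avg_length=("post_length", "mean"),
        avg_score=("score", "mean"),
    )
    return authors.join(posts)


# Toy posts (made up): author "a" writes two posts, "b" and "c" one each.
toy = pd.DataFrame({
    "author": ["a", "a", "b", "c"],
    "author_premium": [True, True, False, False],
    "author_flair_text": ["libright-libright"] * 3 + ["libleft-libleft"],
    "score": [20, 10, 30, 4],
    "post_length": [10, 20, 30, 40],
})

by_flair = flair_summary(toy)
```

Deduplicating before counting matters: without it, a prolific premium author would be counted once per post rather than once per author.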

Summary table by score group
Score Group    Number of Posts  Average Post Length  Average Score  Average Controversiality  Number of Send_replies  Number of No_follow  Average Gilded
high_positive  414067           20.4552              186.9529       0.0001                    413081                  41321                0.0011
low_negative   349852           25.8145              -6.0160        0.2036                    347914                  314203               0.0000
low_positive   6985752          22.0660              6.0239         0.0206                    6966040                 4015353              0.0000
high_negative  3966             24.3911              -81.1092       0.0030                    3931                    3563                 0.0010
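The report does not state how the four score groups were defined. The sketch below shows one possible binning, purely as an assumption: a symmetric cutoff of 50 separates "high" from "low", and a score of 0 is placed in low_positive. Both choices are hypothetical and may differ from the original analysis.

```python
import numpy as np
import pandas as pd


def score_group(score: pd.Series, cutoff: int = 50) -> pd.Series:
    """Label each score as high/low positive/negative using an assumed cutoff."""
    # Conditions are checked in order; the first match wins.
    conditions = [score < -cutoff, score < 0, score <= cutoff]
    labels = ["high_negative", "low_negative", "low_positive"]
    groups = np.select(conditions, labels, default="high_positive")
    return pd.Series(groups, index=score.index, name="score_group")


# Toy scores (made up).
toy = pd.DataFrame({"score": [-80, -6, 0, 6, 187]})
toy["score_group"] = score_group(toy["score"])

# Per-group counts and averages, in the spirit of the table above.
group_stats = toy.groupby("score_group").agg(
    n_posts=("score", "size"),
    avg_score=("score", "mean"),
)
```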

Create Dummy Variables

Regular expressions were used to check whether each post contains certain keywords, such as "COVID", "Pandemic", and "Election". Six dummy variables (Covid, Election, Economics, Finance, Gender Equality, and Racial Equality) were created from the matching results.

Dummy variable counts
Variable         Count of 1  Count of 0
Covid            34384       7719253
Election         44198       7709439
Economics        120068      7633569
Finance          29025       7724612
Gender Equality  59921       7693716
Racial Equality  271779      7481858
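Keyword dummies of this kind can be built with pandas string matching. The sketch below covers only two of the six topics, and the patterns are hypothetical: the report names "COVID", "Pandemic", and "Election" as example keywords but does not give the full keyword lists.

```python
import re

import pandas as pd

# Hypothetical patterns; the real keyword lists are not given in the report.
TOPIC_PATTERNS = {
    "covid": r"\b(?:covid|pandemic)",
    "election": r"\belection",
}


def add_topic_dummies(df: pd.DataFrame, patterns: dict = TOPIC_PATTERNS,
                      text_col: str = "body") -> pd.DataFrame:
    """Add one 0/1 column per topic, set to 1 when the pattern matches the text."""
    out = df.copy()
    for name, pattern in patterns.items():
        matched = out[text_col].str.contains(pattern, flags=re.IGNORECASE, regex=True)
        out[name] = matched.astype(int)
    return out


# Toy comments (made up).
toy = pd.DataFrame({
    "body": [
        "COVID-19 changed everything",
        "The election results are in",
        "just a meme",
    ],
})

flagged = add_topic_dummies(toy)
covid_counts = flagged["covid"].value_counts()
```

The patterns use a leading word boundary and case-insensitive matching so that "COVID-19" and "elections" are caught, while the non-capturing group avoids the pandas warning about match groups.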

External Data




COVID case data from the CDC and Nasdaq index data were collected, and a regression plot was made for these variables. According to the plot, three pairs of variables appear to be correlated: the number of posts and the Nasdaq close price, the average post length and the Nasdaq close price, and the number of new COVID cases and the number of posts.
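Alongside a regression plot, the pairwise relationships can be checked numerically. The sketch below joins monthly post counts with external series on a shared month index and computes Pearson correlations. All values and the column names `nasdaq_close` and `new_cases` are made up for illustration and are not the report's data.

```python
import pandas as pd


def correlate_monthly(posts: pd.DataFrame, external: pd.DataFrame) -> pd.DataFrame:
    """Join two month-indexed frames and return their Pearson correlation matrix."""
    merged = posts.join(external, how="inner")  # keep months present in both
    return merged.corr(method="pearson")


months = ["2020-01", "2020-02", "2020-03", "2020-04"]

# Made-up monthly values, for illustration only.
posts = pd.DataFrame({"n_posts": [100, 200, 300, 400]}, index=months)
external = pd.DataFrame(
    {"nasdaq_close": [9000, 9500, 10200, 11000], "new_cases": [0, 10, 50, 40]},
    index=months,
)

corr = correlate_monthly(posts, external)
```

A high coefficient here only indicates that two series move together over time. Series that both trend upward over the same period can correlate strongly without one driving the other.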

Summary

  1. Most of the posts (over 60%) are shorter than 20 words.
  2. There is no significant difference in post length between premium and non-premium authors.
  3. After the COVID outbreak (2020-01), the number of posts per month started to increase quickly.
  4. Post length and post score are not likely to have a significant relationship.
  5. Posts from authors with a right-related author flair type tend to have a higher score.
  6. According to Figure 6, the number of posts is related to the stock price, and the number of new COVID cases is related to the number of posts.

What's next?

The next section is Natural Language Processing, where we will dive deeper into the posts and process the data.