Can We Detect Political Bias in Tweets?

Given the recent political context, various organizations are exploring the use of machine learning to fight political bias in text content, such as news articles and social media posts. While it’s easy to point to bias in the news sources themselves, it is becoming more common to detect bias in the content itself using tools such as Natural Language Processing (NLP), machine learning and deep learning. The website Bipartisan Press has been using machine learning models to classify its articles by political direction.

I wanted to conduct my own experiment to see how easily a machine learning or deep learning model could distinguish between left and right wing tweets. If the model performs well above chance, it could imply that speakers use a distinct rhetoric depending on their own political biases.

So… can we detect political bias in tweets?

Before we get into the modeling, here are a few caveats and limitations of this experiment:

  • Labeling: There are no “true” labels for tweets as right or left leaning. We have yet to develop a scale or methodology to objectively detect right or left leaning bias in text, so we can only use our best proxy. My labeling process was straightforward, albeit naïve: if the tweet came from a right leaning politician, it was labeled as the “R” class, and if it came from a left leaning politician, it was labeled as the “L” class. It’s important to note that not every tweet from a right leaning politician is right biased, and not every tweet from a left leaning politician is left biased. There’s significant variation in political beliefs and rhetoric, especially when it comes to tweets, so this labeling technique is not perfect and the results should be taken with a grain of salt.
  • Distribution of data: Because tweets were scraped from members of Congress over the last five years, politicians who tweeted frequently and had over a certain number of likes are overrepresented in the data. For example, Bernie Sanders had over 9,000 tweets in the dataset, while other, less popular members of Congress had only 50. This can skew the model results, especially if a politician who makes up a large portion of the data tends to use a particular rhetoric.

Now, let’s get into the data.

The Data

The dataset was scraped from Twitter using the Twint scraper tool (a workaround to the Twitter API). Over 400,000 tweets were scraped from more than 500 members of Congress and other politicians over a period of five years (January 2016 to February 2021). The classes were split roughly 75/25, with the majority of the tweets coming from Democrats (the “L” class). While the class imbalance isn’t extreme, it is something we’ll have to deal with when running the models. We’ll put the tweets and their labels (party) into a pandas DataFrame so they’re easy to work with.

Tweets Dataframe
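
As a rough illustration of the setup described above, here is a minimal sketch of loading labeled tweets into a DataFrame and checking both the class split and the per-account counts. The rows, the account names and the column names (`username`, `party`, `tweet`) are made up for the example and are not the actual scraped data.

```python
import pandas as pd

# Hypothetical toy rows standing in for the scraped tweets.
rows = [
    {"username": "politician_a", "party": "L", "tweet": "Healthcare is a right."},
    {"username": "politician_a", "party": "L", "tweet": "We must end gun violence."},
    {"username": "politician_b", "party": "L", "tweet": "Join our town hall today."},
    {"username": "politician_c", "party": "R", "tweet": "Securing the border matters."},
]
tweets_df = pd.DataFrame(rows)

# Share of each class: a quick check for class imbalance.
class_share = tweets_df["party"].value_counts(normalize=True)

# Tweets per account: a quick check for overrepresented politicians.
tweets_per_account = tweets_df["username"].value_counts()

print(class_share)
print(tweets_per_account)
```

On the real dataset, the same two checks would surface both the 75/25 split and the accounts that dominate the data.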

Exploring the Data

Just from creating a few visualizations and summary statistics, we can learn a lot about our data before the modeling even begins. While both right and left leaning politicians used some of the same terms frequently (“president”, “today”, “work”, “American”), the proportions in which they used them show some distinct differences. In the left wing word cloud, some of the words we can see are “healthcare” and “gun violence”, which seem more topic oriented.

Word Cloud Left Wing Tweets

On the right side, there is significant use of temporal words (“today”, “time”).

Word Cloud Right Wing Tweets

Both sides also mentioned the other side frequently, with one of the most common phrases on the left being “President Trump” and one of the most common on the right being “democrat”. When we look at the most relevant topics being discussed on each side (using Latent Dirichlet Allocation, or LDA, for topic modeling), we can see that on the left some topic clusters could be healthcare, work and taxes, and elections.

LDA Left Wing Tweets

Topic clusters developed from right wing tweets included national security, international relations, the economy and elections.

LDA Right Wing Tweets

Processing the Data using Natural Language Processing Techniques

So how do we go from text data to an input for machine learning models? Various NLP techniques convert text into a numerical format, most of them by representing each word or document as a vector. Some of the more traditional techniques include count vectorization and Term Frequency-Inverse Document Frequency (TF-IDF). Count vectorization represents each document as a vector the length of the total vocabulary, where each entry is the number of times the corresponding word occurs in that document. TF-IDF also represents documents as vectors the length of the total vocabulary, but it weights each word’s count in a document by the inverse of the number of documents that contain it. TF-IDF depends on the idea that words appearing in fewer documents carry more informational value.
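
To make the difference concrete, here is a small sketch of both vectorizers on three made-up sentences, using scikit-learn's `CountVectorizer` and `TfidfVectorizer` with default settings. The sentences are invented, and the use of scikit-learn is an assumption about tooling.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up "tweets".
docs = [
    "we need healthcare reform",
    "we need border security",
    "healthcare healthcare now",
]

# Count vectorization: one column per vocabulary word, entries are raw counts.
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(docs)

# TF-IDF: same shape, but counts are reweighted so that words found in
# fewer documents receive larger weights.
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(docs)

print(sorted(count_vec.vocabulary_))
print(count_matrix.toarray())
print(tfidf_matrix.toarray().round(2))
```

In the first sentence, “reform” appears in only one document while “we” appears in two, so TF-IDF gives “reform” the larger weight even though both occur once.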

In terms of more advanced techniques, there are word embeddings. Word embeddings are a learned numerical representation for words in the form of a vector. Words that are “close” to other words in the vector space are considered more similar, which allows the model to detect and interpret semantic relationships. It’s possible to train your own word embeddings based on your specific texts, or use pretrained word embeddings. We will apply both traditional methods as well as pretrained word embeddings to our data.
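
The idea of “closeness” in the vector space is usually measured with cosine similarity. The sketch below uses three hand-written, three-dimensional vectors purely for illustration. Real pretrained embeddings have far more dimensions and learned values.

```python
import numpy as np

# Hypothetical toy vectors; these are NOT real pretrained embeddings.
toy_vectors = {
    "senate": np.array([0.9, 0.1, 0.0]),
    "congress": np.array([0.8, 0.2, 0.1]),
    "pizza": np.array([0.0, 0.1, 0.9]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


related = cosine_similarity(toy_vectors["senate"], toy_vectors["congress"])
unrelated = cosine_similarity(toy_vectors["senate"], toy_vectors["pizza"])
print(related, unrelated)
```

With these toy values, “senate” and “congress” point in nearly the same direction, while “senate” and “pizza” are close to orthogonal.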


Before we jump into the modeling, let’s define some metrics. The metrics we will use are accuracy (the percentage of correctly classified tweets) and F1 score. F1 score is the harmonic mean of precision and recall, which penalize false positives and false negatives, respectively. Since false positives and false negatives carry equal weight in this context, we want to strike a balance between precision and recall, hence the focus on maximizing the F1 score.
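
Here is a small worked example of why accuracy alone can be misleading on imbalanced classes. It computes precision, recall and F1 per class in plain Python on made-up labels. Averaging the per-class F1 scores with equal weight (a “macro” average) is an assumption about how the average F1 in this post is computed.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 6 "L" tweets and 2 "R" tweets; one "R" tweet is misclassified.
y_true = ["L"] * 6 + ["R"] * 2
y_pred = ["L"] * 6 + ["L", "R"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
_, recall_r, f1_r = precision_recall_f1(y_true, y_pred, positive="R")
_, _, f1_l = precision_recall_f1(y_true, y_pred, positive="L")
macro_f1 = (f1_l + f1_r) / 2

print(f"accuracy={accuracy:.3f} recall_R={recall_r:.3f} macro_F1={macro_f1:.3f}")
```

Accuracy comes out at 0.875, yet only half of the “R” tweets are found, and the macro F1 of roughly 0.79 reflects that gap.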

In terms of models, we will run both deep learning models and classical machine learning models (because sometimes less is more). The models we will run are:

  • Logistic Regression
  • Random Forest
  • Naïve Bayes
  • Recurrent Neural Network (GRU and LSTM)

We will split our data 70/30 into training and testing sets so we can evaluate the models on unseen data.
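
A 70/30 split can be sketched with scikit-learn's `train_test_split`. The placeholder tweets are made up, and passing `stratify=labels` (which keeps the class ratio the same in both splits) is an assumption. The post does not say whether the original split was stratified.

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the same 75/25 class ratio as the real dataset.
tweets = [f"tweet {i}" for i in range(40)]
labels = ["L"] * 30 + ["R"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    tweets,
    labels,
    test_size=0.3,     # 30% held out for testing
    stratify=labels,   # assumption: preserve the L/R ratio in both splits
    random_state=42,   # fixed seed so the split is reproducible
)

print(len(X_train), len(X_test))
print(y_test.count("L"), y_test.count("R"))
```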

Logistic Regression

The Logistic Regression model with count vectorization performed fairly well on the dataset, giving an initial accuracy of around 80%. However, the recall for the Republican class was fairly low, implying that the model was struggling to identify the actual “R” tweets as “R”. This resulted in an average F1 score of 0.68. This could be due to class imbalance, which we will handle by applying SMOTE (Synthetic Minority Oversampling Technique) to the data. We will also use the TF-IDF vectorizer, as it tends to perform better. The results are as follows:

Logistic Regression

The average F1 score increased slightly, while the accuracy dropped to around 75%.
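
To show what the oversampling step does, here is a simplified NumPy sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority neighbors. The function name `smote_sketch` and the toy data are hypothetical, and this is not the implementation used for the results above. In practice a library such as imbalanced-learn is the usual choice.

```python
import numpy as np


def smote_sketch(X_minority, n_synthetic, k=2, seed=0):
    """Simplified SMOTE: interpolate between minority samples and their neighbors.

    Assumes X_minority is a float array with more than k rows.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X_minority.shape

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest minority neighbors of each sample.
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_synthetic, n_features))
    for i in range(n_synthetic):
        base = rng.integers(n_samples)
        neighbor = neighbors[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])
    return synthetic


# Toy minority-class feature vectors (for example, rows of a TF-IDF matrix).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_synthetic=6, k=2, seed=0)
print(synthetic)
```

Oversampling of this kind is normally applied to the training split only, after vectorization, so that the test set still reflects the real class balance.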

Random Forest

The Random Forest model initially had lower accuracy and F1 score (67% and 0.62, respectively). After tuning the model and using the TF-IDF vectorizer, it achieved an accuracy of 68% and an F1 score of 0.63:

Random Forest Classifier with a Max Depth of 20

We were also able to extract feature importances from the Random Forest model, showing us the words that were most relevant in determining a class:

Random Forest Feature Importances

Some of the most relevant words for the model included “Trump”, “democrat”, “must”, “health” and “need”, similar to the words that appeared in the word clouds. While the model is interpretable and gives us telling information about the differences in each party’s speech, it performs fairly poorly in terms of accuracy compared to the other models.

Naïve Bayes

Naïve Bayes initially achieved an accuracy of 68% and an F1 score of 0.60. After switching to the TF-IDF vectorizer, the model’s accuracy increased to 71% and its F1 score to 0.65:

Naïve Bayes Classifier

The model performed similarly to Random Forest, but still trailed behind the Logistic Regression model.
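
A TF-IDF plus Naïve Bayes classifier can be sketched as a scikit-learn pipeline. The training sentences are made up, and `MultinomialNB` is an assumption, since the post does not say which Naïve Bayes variant was used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data with three tweets per class.
train_docs = [
    "healthcare for every family",
    "end gun violence now",
    "protect healthcare coverage",
    "secure the border today",
    "strong border security",
    "support our military today",
]
train_labels = ["L", "L", "L", "R", "R", "R"]

# The pipeline vectorizes the text and then fits the classifier in one step.
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb_model.fit(train_docs, train_labels)

predictions = nb_model.predict(["healthcare coverage", "border security"])
print(list(predictions))
```

Wrapping the vectorizer and classifier together means the vocabulary is learned from the training data only, which avoids leaking information from the test set.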

Recurrent Neural Network

Our Recurrent Neural Network (RNN) will use pretrained word embeddings from the GloVe project of the Stanford NLP Group. RNNs process data sequentially, which makes them a natural approach for text, where meaning depends on word order. The two architectures we will use are Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), as these architectures help the model with long term dependencies (“memory” of words that are farther apart in a sentence and give semantic meaning to each other).

RNN Model Construction (GRU)

We will feed an embedding matrix into the embedding layer of the model. This represents the pretrained word embeddings that were mapped to the words in the document in a matrix format. We will also set trainable to false, so the model uses the pretrained “weights” of the word embeddings instead of learning the embeddings from the input text itself.
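
Building the embedding matrix can be sketched in NumPy as follows. The tiny `glove_vectors` dictionary stands in for the real GloVe file, and `word_index` stands in for the tokenizer's word-to-integer mapping. Both are hypothetical, and the four-dimensional vectors are far smaller than real GloVe vectors.

```python
import numpy as np

embedding_dim = 4  # toy size for the example

# Hypothetical stand-in for vectors loaded from a GloVe file.
glove_vectors = {
    "health": np.array([0.1, 0.2, 0.3, 0.4]),
    "border": np.array([0.5, 0.6, 0.7, 0.8]),
    "tax": np.array([0.9, 1.0, 1.1, 1.2]),
}

# Hypothetical tokenizer vocabulary; index 0 is reserved for padding.
word_index = {"health": 1, "border": 2, "covfefe": 3}

# Row i holds the pretrained vector for the word with index i.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, idx in word_index.items():
    vector = glove_vectors.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector  # words missing from GloVe stay all zeros

print(embedding_matrix)
```

This matrix is what gets passed to the embedding layer as its initial weights. With trainable set to false, those rows stay fixed during training.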

GRU Model Results

The original GRU model performed similarly to Logistic Regression, with an accuracy of 74% and an F1 score of 0.70. Let’s try the LSTM model and tune some parameters:

RNN Model Construction (LSTM)

LSTM Model Results

The model appears to have improved with the new parameters. The accuracy increased to 80% and the F1 score to 0.75.

The Winner: Recurrent Neural Network with LSTM

Conclusions and Recommendations

While the results appear to show a detectable difference between the tweets from right and left leaning politicians (80% accuracy), the model is far from perfect and could use further tuning. Aggregating the word embeddings into document embeddings could be a further step to improve the model’s accuracy. In a case like this where the data labeling is very subjective and many texts could go either way, it would be advisable to implement more classes. For example, having a third class that’s neutral, or more extreme classes (far right or far left) could help with misclassification. When deploying the model for real world use, tweets in the more extreme classes could be labeled as “Potentially Biased” so that the reader is aware that the content may not be totally objective.
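
One simple way to aggregate word embeddings into a document embedding, as suggested above, is to average the vectors of the words in a tweet. The sketch below uses made-up two-dimensional vectors. Mean pooling is only one possible aggregation and is an assumption here, not something tested in this experiment.

```python
import numpy as np


def document_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Hypothetical toy word vectors.
toy_vectors = {
    "health": np.array([1.0, 0.0]),
    "care": np.array([0.0, 1.0]),
}

doc_vector = document_embedding(["health", "care", "unknownword"], toy_vectors, dim=2)
empty_vector = document_embedding(["unknownword"], toy_vectors, dim=2)
print(doc_vector, empty_vector)
```

The resulting fixed-length vectors could then be fed to any of the classical models above in place of count or TF-IDF features.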

While there are already some tools to help us identify partisanship in our news sources, it’s worth further exploration to help bring awareness to the implicit biases in the content around us.

Sarah Wright