Media Bias Text Generation Transformer
A glimpse at a few outputs of the Text Generation Transformer project
Goals
Project Goals
Gain competence in text generation using transformers
I wanted to learn a bit more about text generation and the transformer models that are taking over the field. I chose to work with GPT2, and found that the HuggingFace Transformers library came very highly recommended.
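To give a sense of the setup, here is a minimal sketch of feeding a prompt to a GPT2 model with the HuggingFace pipeline API. This is not the project's actual training code; the stock `gpt2` checkpoint here stands in for a checkpoint fine-tuned on one media source's articles.

```python
from transformers import pipeline, set_seed

# Sketch only: 'gpt2' stands in for a model fine-tuned on one media source.
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
out = generator("The reason Scientists do press briefings is",
                max_new_tokens=40, num_return_sequences=1)
print(out[0]['generated_text'])
```

Fine-tuning each copy of the model on a different source's articles, then comparing completions of the same prompt, is the core of the experiment.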
Get a glimpse in differences in bias from different media sources
I have recently been reading news from several media sources, and I am often astounded by the differences. What I wanted to do is take a pre-trained model and fine-tune it on a specific media source, then do this several times on different sources and compare. In what ways will my models be biased by their source? One might hypothesise that this would reveal some of the biases that people may be nudged toward by consuming only that media source.
Article Goals
The purpose of this article is just to show some of the outputs and get people thinking about how an approach like this could be interesting and useful to study further. The bulk of the project was spent setting up and fine-tuning the transformers, which is not what this article is about. This article looks at some of the outputs in order to spur additional thought on the topic.
Scientists
In this case, the models were each given the prompt "The reason Scientists do press briefings is". Can you spot the outlier?
CNBC: The reason Scientists do press briefings is because they want to know what's going on in the world," he said. "If you don't have a clear understanding of where things are, it can be very difficult for scientists to get their work done.
CNN: The reason Scientists do press briefings is because they want to know what's going on in the world," he said. "They're trying to get a sense of where things are."
Fox News: The reason Scientists do press briefings is because they're trying to find out more about what the Russians are up to
The New York Times: The reason Scientists do press briefings is that it’s a way for them to get the most out of their time
Washington Post: The reason Scientists do press briefings is because they’re the only ones who know what's going on.
Antifa
Another area where there's quite a bit of contrast is Antifa. Which one doesn't match the others?
CNBC: Antifa is a group that has been active in the U.S. for more than 50 years, according to its website. "We are committed to protecting our community from violence and hate," it says on its website.
CNN: Antifa is a group that advocates for the rights of people to peacefully assemble, organize and participate in peaceful protests.
Fox News: Antifa is a group that's trying to take down the president of the United States. And so, I don't think it's going to be easy for them to come in and do what they want to do.
The New York Times: Antifa is a group that has been active since the beginning of the year, and it's not clear how many people have joined.
Washington Post: Antifa is a group that advocates for the rights of women and girls to self-identify as LGBTQ, according to its website.
Closing
Now it's fairly entertaining to read through the ones that really demonstrate bias, like this one:
Fox News: The thing we should do with immigrants is make sure that they're not coming to the United States illegally. (END VIDEO CLIP) INGRAHAM: Well, I don't know if you have a lot of sympathy for those who are trying to come into this country as unaccompanied minors or people in their 20s and 30s being separated from their parents because it seems like every single one of them has some kind of criminal record which makes us all very worried about these kids crossing our southern border at any given time.
and to read zingers like this one:
CNBC: The most hated politician is Donald Trump, and he's got a lot of problems
and to run across snippets like this:
"The landmark summit between North Korean leader Kim Jong Un and dictator-in-waiting Donald Trump"
But that's a pretty biased way of analyzing things. So instead let's look at overall word counts and see if there is a difference in word choice between these models.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from collections import Counter
The first step is to read in the text file with text generated by the model (Go to that URL if you want to see more).
df = pd.read_csv('https://raw.githubusercontent.com/Isaac-Flath/MediaBias/main/text_generated.csv',
usecols=['source','prompt','text'])
We will then convert everything to lowercase and get rid of punctuation
tokenizer = nltk.RegexpTokenizer(r"\w+")
df['text'] = df.text.str.lower().apply(tokenizer.tokenize)
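To make the tokenization step concrete, here is what the regex tokenizer does to a made-up sample line (the sentence below is invented for illustration, not taken from the generated text). Note that `\w+` drops punctuation entirely, so a contraction like "that's" splits into two tokens.

```python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")
# A made-up sample line, just to show the lowercasing + tokenizing step.
sample = "Antifa is a group that's active in the U.S.!"
tokens = tokenizer.tokenize(sample.lower())
print(tokens)
# → ['antifa', 'is', 'a', 'group', 'that', 's', 'active', 'in', 'the', 'u', 's']
```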
Finally, we count all the words and put them into a pandas DataFrame for convenience
results = Counter()
df['text'].apply(results.update)
results = pd.DataFrame(list(results.items()), columns=['words', 'all_cnt'])
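The counting pattern above can be illustrated on a toy DataFrame of made-up token lists: `Counter.update` is applied to each row as a side effect, and the accumulated counts become a two-column frame.

```python
import pandas as pd
from collections import Counter

# Toy version of the counting step, on invented token lists.
toy = pd.DataFrame({'text': [['fox', 'news', 'fox'], ['news', 'cnn']]})
cnt = Counter()
toy['text'].apply(cnt.update)  # side effect: cnt accumulates counts per row
counts = pd.DataFrame(list(cnt.items()), columns=['words', 'all_cnt'])
print(counts)
# fox -> 2, news -> 2, cnn -> 1
```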
Now we repeat the same thing for each media source so we have word counts per source (e.g. CNN)
for s in df.source.unique():
    tempdf = df[df.source == s].copy()
    tempcntr = Counter()
    tempdf['text'].apply(tempcntr.update)
    tempresults = pd.DataFrame(list(tempcntr.items()),
                               columns=['words', s.replace(' ', '') + '_cnt'])
    results = pd.merge(results, tempresults, how='left', on='words')
results.fillna(0, inplace=True)
results.head()
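The left merge is what keeps the sources aligned on a shared word list: any word a source never generated gets NaN for that source's count, which we then fill with 0. A toy example with invented counts:

```python
import pandas as pd

# Invented counts, just to illustrate the left merge + fillna step.
overall = pd.DataFrame({'words': ['fox', 'news', 'cnn'], 'all_cnt': [2, 2, 1]})
per_src = pd.DataFrame({'words': ['fox', 'news'], 'CNN_cnt': [1, 1]})
merged = pd.merge(overall, per_src, how='left', on='words')
merged.fillna(0, inplace=True)  # 'cnn' was absent from per_src, so its count becomes 0
print(merged)
```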
To filter out common filler words, we use NLTK's English stop word list (if you haven't used it before, you may need to run nltk.download('stopwords') first). Here is a sample of it:
print(stopwords.words('english')[:5])
Now let's use that list to filter stop words out of our dataframe, and sort it in descending order so the most common words are at the top.
results = results[~results.words.isin(stopwords.words('english'))].copy()
results.sort_values('all_cnt',ascending=False, inplace=True)
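The filter-and-sort step can be seen on a toy frame. Here a tiny hand-written stop list stands in for NLTK's English list, purely to illustrate the `~isin` mask and the descending sort:

```python
import pandas as pd

# Tiny hand-written stop list standing in for NLTK's, with invented counts.
stop = ['the', 'is', 'a']
wc = pd.DataFrame({'words': ['the', 'fox', 'news', 'is'],
                   'all_cnt': [9, 2, 3, 7]})
wc = wc[~wc.words.isin(stop)].copy()          # drop stop words
wc.sort_values('all_cnt', ascending=False, inplace=True)  # most common first
print(wc.words.tolist())
# → ['news', 'fox']
```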
Let's graph how much each source's counts deviate from the average and see whether there is any point in looking further. If these are all close to blank graphs (no deviation from the average), then the models generated text using almost the same words.
In this case, we can see that they all deviate in different ways on different words, so the media source did impact each model's word choice.
width, indexes = 0.7, np.arange(len(results))
fig, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(5, figsize=(5, 10))
ax_map = [(ax1, results.CNBC_cnt), (ax2, results.FoxNews_cnt), (ax3, results.CNN_cnt),
          (ax4, results.TheNewYorkTimes_cnt), (ax5, results.WashingtonPost_cnt)]
for ax, cnt in ax_map:
    ax.set_title(cnt.name)  # set_title is a method; assigning to it would silently do nothing
    ax.bar(indexes, cnt - results.all_cnt / 5, width)  # deviation from the 5-source average
    ax.set_ylim(-10, 10)
plt.show()