Using R to Compare Word Frequencies in Two of Shakespeare’s Comedies – Text Mining in History and the Humanities

R is a “free software environment for statistical computing and graphics” that can be used for text mining. For this blog post, I have used R to create tables of word frequencies in two of Shakespeare’s comedic plays: The Comedy of Errors and The Tempest.

The first page of Shakespeare’s The Comedy of Errors, printed in the First Folio of 1623 (Wikimedia Commons / Folger Shakespeare Library Digital Image Collection)

Below is a table showing the ten most frequent words in Shakespeare’s The Comedy of Errors. Not surprisingly, some of the most common words are prepositions (of, to), articles (the, a), and a conjunction (and). The first-person pronoun “I” occurs about 1.5 times as frequently as the second-person pronoun “you.” This fits the play’s story of the unwitting encounters between the lost twin sons (both named Antipholus) and their twin servants (both named Dromio). The most common noun, “Syracuse,” indicates a place in the story.

WORDS	FREQUENCY
“of”	612
“and”	465
“I”	461
“the”	448
“to”	335
“you”	302
“my”	265
“me”	262
“a”	244
“Syracuse”	234
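
As a quick check on the “I” to “you” ratio, the two counts from the table can be divided directly in R. This is only a small sketch; the variable names are illustrative and not part of the original script.

```r
# Counts copied from the table above; variable names are illustrative
comedy_counts <- c(I = 461, you = 302)

# Ratio of first-person "I" to second-person "you"
i_to_you <- unname(comedy_counts["I"] / comedy_counts["you"])
round(i_to_you, 2)  # 1.53
```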

Title page of The Tempest from the 1623 First Folio (Wikimedia Commons / The Internet Shakespeare Editions)

The most common words in The Tempest are not that different from those in The Comedy of Errors. Again, we mostly see conjunctions, articles, and prepositions. The first-person pronoun “I” occurs about 2.2 times as often as the second-person “you” (453 versus 209 occurrences), as Shakespeare tells the story from the point of view of the magician Prospero, a former duke of Milan exiled on an island, where he is accompanied by his daughter, Miranda, the spirit Ariel, and the monster Caliban.

WORDS	FREQUENCY
“and”	525
“the”	457
“I”	453
“to”	324
“of”	304
“a”	301
“my”	287
“you”	209
“that”	193
“this”	186

Now, let’s see how the two texts differ. The table below shows the ten most common words in The Tempest that are not among the fifty most frequent words in The Comedy of Errors.

RANK	WORDS
1	“Prospero”
2	“do”
3	“Ariel”
4	“all”
5	“Sebastian”
6	“Stephano”
7	“o”
8	“now”
9	“they”
10	“which”

Finally, this table shows the opposite: the ten most common words in The Comedy of Errors that are not in the fifty most frequently occurring words of The Tempest. As one would expect, the differences include character and place names.

RANK	WORDS
1	“Syracuse”
2	“Dromio”
3	“Antipholus”
4	“Ephesus”
5	“sir”
6	“Adriana”
7	“at”
8	“her”
9	“from”
10	“or”

These examples are only a starting point for what text mining can do. More detailed analyses could provide insight into mood shifts or even gender biases within texts.

# Code for creating a word frequency table of The Comedy of Errors

library("stringr") # Loads the stringr package

COMEDY.lines.scan <- scan("C://Users//…COMEDY_CLEAN.txt", what = "character", sep = "\n") # Reads "The Comedy of Errors" line by line from a txt file in a desktop folder. Note: I saved the text from Project Gutenberg and cleaned up the document so it would contain only the lines of the play

COMEDY.lines.df <- data.frame(COMEDY.lines.scan, stringsAsFactors = FALSE) # creates a data frame so it’s easier to handle

COMEDY.string <- paste(COMEDY.lines.df[, 1], collapse = " ") # Collapses all the lines into a single string, inserting a space where the lines are joined (the column is selected explicitly, because pasting the whole data frame would deparse it instead)

COMEDY.words <- str_split(string = COMEDY.string, pattern = " ") # Splits the string in COMEDY.string on single spaces

COMEDY.words <- unlist(COMEDY.words)

COMEDY.freq.df <- data.frame(table(COMEDY.words)) # Creates a table of the new object

COMEDY.words <- COMEDY.words[which(COMEDY.words != "")] # Removes the blank entries

COMEDY.words.df <- data.frame(COMEDY.words) # Creates a data frame so it’s easier to see elements side by side

COMEDY.words.df$lower <- tolower(COMEDY.words.df[,1]) # Changes text of all rows in the first column to lower case

colnames(COMEDY.words.df)[1] <- "words" # Simplifies the title of column one to "words"

COMEDY.words.df$clean_text <- str_replace_all(COMEDY.words.df$words, "[[:punct:]]", "") # Creates a new column with the punctuation removed (replaced with nothing)

COMEDY.words.df$cleaned <- str_replace_all(COMEDY.words.df$lower, "[[:punct:]]", "") # Removes punctuation from the lower case version of the text

COMEDY.cleaned.tbl.df <- data.frame(table(COMEDY.words.df$cleaned)) # Creates a data frame with a frequency table of the cleaned text

COMEDY.cleaned.tbl.ord.df <- COMEDY.cleaned.tbl.df[order(-COMEDY.cleaned.tbl.df$Freq),] # Reorders the rows so that most frequent words are at the top

colnames(COMEDY.cleaned.tbl.ord.df) <- c("Words", "Freq") # Renames the columns

write.table(COMEDY.cleaned.tbl.ord.df, "C://Users//…comedy_table.txt", sep = "\t") # Saves the table

# The same code was used for The Tempest (using a different txt document and saving with different file names)
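
Because the same steps are repeated for each play, they could be wrapped in a function. The sketch below is one possible way to do that, not the script actually used for this post: word_freq_table is a hypothetical helper, and it swaps in the base R functions strsplit() and gsub() for the stringr calls so that it runs without extra packages. Two lines from the play stand in for the cleaned text file.

```r
# Hypothetical helper (not in the original script): turns a character vector
# of lines into a frequency table sorted from most to least frequent
word_freq_table <- function(lines) {
  text <- paste(lines, collapse = " ")               # join lines with spaces
  words <- strsplit(text, " ", fixed = TRUE)[[1]]    # split on single spaces
  words <- gsub("[[:punct:]]", "", tolower(words))   # lower case, strip punctuation
  words <- words[words != ""]                        # drop blank entries
  freq <- as.data.frame(table(words), stringsAsFactors = FALSE)
  colnames(freq) <- c("Words", "Freq")
  freq[order(-freq$Freq), ]                          # most frequent first
}

# Two sample lines standing in for the cleaned play text
sample_lines <- c("I to the world am like a drop of water,",
                  "That in the ocean seeks another drop.")
sample_freq <- word_freq_table(sample_lines)
head(sample_freq, 3)
```

With a helper like this, each play would only need its own call to scan() followed by one call to the function.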

# Code comparing differences in The Comedy of Errors and The Tempest

setdiff(COMEDY.cleaned.tbl.ord.df$Words[1:50], TEMPEST.cleaned.tbl.ord.df$Words[1:50]) # Shows which of the top 50 words in "The Comedy of Errors" are not among the top 50 words in "The Tempest"
different_comedy.words.df <- data.frame(setdiff(COMEDY.cleaned.tbl.ord.df$Words[1:50], TEMPEST.cleaned.tbl.ord.df$Words[1:50])) # Stores the result in a data frame
write.table(different_comedy.words.df, "C://Users//…different_comedy_table.txt", sep = "\t") # Saves the data frame
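
setdiff() keeps the order of its first argument, so when it is given a column sorted by frequency, the words it returns are still ordered from most to least frequent. A small sketch with invented word lists (not real counts from either play) illustrates this:

```r
# Toy ranked word lists, most frequent first; invented for illustration
play_a_top <- c("the", "and", "syracuse", "i", "dromio", "sir")
play_b_top <- c("and", "the", "i", "prospero", "ariel")

# Words in play A's list that are absent from play B's list,
# returned in the same order as they appear in play_a_top
only_in_a <- setdiff(play_a_top, play_b_top)

# Pair each word with its position in the result
data.frame(Rank = seq_along(only_in_a), Words = only_in_a)
```

Swapping the two arguments gives the comparison in the other direction.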
