R is a “free software environment for statistical computing and graphics” that can be used for text mining. For this blog post, I have used R to create tables of word frequencies in two of Shakespeare’s comedic plays: The Comedy of Errors and The Tempest.
Below is a table showing the ten most frequent words occurring in Shakespeare’s The Comedy of Errors. Not surprisingly, some of the most common words are prepositions (of, to), articles (the, a), and a conjunction (and). The first-person pronoun “I” occurs about 1.5 times as often as the second-person pronoun “you.” This is consistent with the play’s story of the unwitting encounters between the lost twin sons (both named Antipholus) and their twin servants (both named Dromio). The most common noun, “Syracuse,” names a place in the story.
WORDS | FREQUENCY |
“of” | 612 |
“and” | 465 |
“I” | 461 |
“the” | 448 |
“to” | 335 |
“you” | 302 |
“my” | 265 |
“me” | 262 |
“a” | 244 |
“Syracuse” | 234 |
The most common words in The Tempest are not that different from those in The Comedy of Errors. Again, we mostly see conjunctions, articles, and prepositions. The first-person pronoun “I” occurs about 2.2 times as often as the second-person “you” (453 versus 209), as Shakespeare tells the story from the point of view of the magician Prospero, a former duke of Milan exiled on an island, where he is accompanied by his daughter, Miranda, the spirit Ariel, and the monster Caliban.
WORDS | FREQUENCY |
“and” | 525 |
“the” | 457 |
“I” | 453 |
“to” | 324 |
“of” | 304 |
“a” | 301 |
“my” | 287 |
“you” | 209 |
“that” | 193 |
“this” | 186 |
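The pronoun ratios discussed for both plays can be recomputed from the two tables above. This is a minimal sketch that uses only the counts reported in this post; the data frame and its column names are illustrative and are not part of the original script.

```r
# Counts copied from the two frequency tables in this post.
pronouns <- data.frame(
  play = c("The Comedy of Errors", "The Tempest"),
  I    = c(461, 453),
  you  = c(302, 209)
)

# Ratio of first-person "I" to second-person "you" for each play,
# rounded to two decimal places.
pronouns$ratio <- round(pronouns$I / pronouns$you, 2)
print(pronouns)
```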
Now, let’s see how the two texts differ. The table below shows the ten most common words in The Tempest that are not among the fifty most frequently occurring words of The Comedy of Errors.
RANK | WORDS |
1 | “Prospero” |
2 | “do” |
3 | “Ariel” |
4 | “all” |
5 | “Sebastian” |
6 | “Stephano” |
7 | “o” |
8 | “now” |
9 | “they” |
10 | “which” |
Finally, this table shows the opposite: the ten most common words in The Comedy of Errors that are not in the fifty most frequently occurring words of The Tempest. As one would expect, the differences include character and place names.
RANK | WORDS |
1 | “Syracuse” |
2 | “Dromio” |
3 | “Antipholus” |
4 | “Ephesus” |
5 | “sir” |
6 | “Adriana” |
7 | “at” |
8 | “her” |
9 | “from” |
10 | “or” |
These examples are only a starting point for what text mining can do. More detailed analyses could provide insight into mood shifts or even gender biases within texts.
# Code for creating a word frequency table of The Comedy of Errors
library("stringr") # Loads the stringr package
COMEDY.lines.scan <- scan("C://Users//…COMEDY_CLEAN.txt", what="character", sep="\n") # Reads "The Comedy of Errors" line by line from a txt file in a desktop folder. Note: I saved the text from Project Gutenberg and cleaned up the document so it would contain only the lines of the play
COMEDY.lines.df <- data.frame(COMEDY.lines.scan, stringsAsFactors = FALSE) # Creates a data frame so it's easier to handle
COMEDY.string <- paste(COMEDY.lines.df[[1]], collapse=" ") # Collapses all the lines into one string, inserting a space where lines are joined; [[1]] pulls out the column of lines, because pasting the data frame itself would deparse the whole column instead of joining the lines
COMEDY.words <- str_split(string=COMEDY.string, pattern=" ") # Splits the string in COMEDY.string on single spaces
COMEDY.words <- unlist(COMEDY.words) # Flattens the list returned by str_split into a character vector
COMEDY.freq.df <- data.frame(table(COMEDY.words)) # Preliminary frequency table of the raw words (blanks, case, and punctuation are still present)
COMEDY.words <- COMEDY.words[which(COMEDY.words!="")] # Removes the blank entries
COMEDY.words.df <- data.frame(COMEDY.words, stringsAsFactors = FALSE) # Creates a data frame so it's easier to see elements side by side
COMEDY.words.df$lower <- tolower(COMEDY.words.df[,1]) # Adds a lower-case copy of the words in the first column
colnames(COMEDY.words.df)[1] <- "words" # Simplifies the title of column one to "words"
COMEDY.words.df$clean_text <- str_replace_all(COMEDY.words.df$words, "[:punct:]", "") # Creates a new column with punctuation removed (original case kept)
COMEDY.words.df$cleaned <- str_replace_all(COMEDY.words.df$lower, "[:punct:]", "") # Removes punctuation from the lower-case version of the text
COMEDY.cleaned.tbl.df <- data.frame(table(COMEDY.words.df$cleaned)) # Creates a data frame with a frequency table of the cleaned text
COMEDY.cleaned.tbl.ord.df <- COMEDY.cleaned.tbl.df[order(-COMEDY.cleaned.tbl.df$Freq),] # Reorders the rows so that the most frequent words are at the top
colnames(COMEDY.cleaned.tbl.ord.df) <- c("Words","Freq") # Renames the columns
write.table(COMEDY.cleaned.tbl.ord.df, "C://Users//…comedy_table.txt", sep="\t") # Saves the table
# The same code was used for The Tempest (with a different txt document and different output file names)
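Because the same steps are run once per play, they could be wrapped in a function. The following is only a sketch under stated assumptions: `word_freq_table` is a hypothetical name that does not appear in the original script, and it uses base R (`strsplit`, `gsub`) in place of the stringr calls so that it runs without extra packages. The two sample lines stand in for a cleaned text file.

```r
# Hypothetical helper (not in the original script): builds an ordered
# word-frequency data frame from a character vector of text lines.
word_freq_table <- function(lines) {
  text  <- paste(lines, collapse = " ")               # join lines with spaces
  words <- unlist(strsplit(text, " ", fixed = TRUE))  # split on single spaces
  words <- tolower(words)                             # normalise case
  words <- gsub("[[:punct:]]", "", words)             # strip punctuation
  words <- words[words != ""]                         # drop blank entries
  freq  <- as.data.frame(table(words), stringsAsFactors = FALSE)
  colnames(freq) <- c("Words", "Freq")
  freq <- freq[order(-freq$Freq), ]                   # most frequent first
  rownames(freq) <- NULL
  freq
}

# Sample input standing in for the lines read from a cleaned txt file.
sample_lines <- c("I to the world am like a drop of water,",
                  "That in the ocean seeks another drop;")
sample_freq <- word_freq_table(sample_lines)
print(head(sample_freq, 3))
```

With a helper like this, each play would need only one `scan()` call followed by one function call, which avoids maintaining two copies of the same script.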
# Code comparing differences in The Comedy of Errors and The Tempest
setdiff(COMEDY.cleaned.tbl.ord.df$Words[1:50], TEMPEST.cleaned.tbl.ord.df$Words[1:50]) # Shows the words in the top 50 of "The Comedy of Errors" that are not in the top 50 of "The Tempest"
different_comedy.words.df <- data.frame(setdiff(COMEDY.cleaned.tbl.ord.df$Words[1:50], TEMPEST.cleaned.tbl.ord.df$Words[1:50])) # Stores the result in a data frame
write.table(different_comedy.words.df, "C://Users//…different_comedy_table.txt", sep="\t") # Saves the data frame
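The comparison in the other direction (words common in The Tempest but not in The Comedy of Errors) only requires swapping the arguments to `setdiff()`. The sketch below illustrates both directions on toy data: the two data frames hold just the top-ten counts reported in this post, so the cutoff is ten rather than fifty and the results differ from the top-fifty tables. `top_n_diff` is a hypothetical helper name, not part of the original script.

```r
# Toy frequency tables shaped like the script's output (Words, Freq),
# filled with the top-ten counts reported in this post (lower-cased).
comedy_freq <- data.frame(
  Words = c("of", "and", "i", "the", "to", "you", "my", "me", "a", "syracuse"),
  Freq  = c(612, 465, 461, 448, 335, 302, 265, 262, 244, 234),
  stringsAsFactors = FALSE
)
tempest_freq <- data.frame(
  Words = c("and", "the", "i", "to", "of", "a", "my", "you", "that", "this"),
  Freq  = c(525, 457, 453, 324, 304, 301, 287, 209, 193, 186),
  stringsAsFactors = FALSE
)

# Hypothetical helper: words in the top n rows of `a` that do not
# appear in the top n rows of `b`. Both inputs must already be sorted
# by descending frequency.
top_n_diff <- function(a, b, n = 50) {
  setdiff(head(a$Words, n), head(b$Words, n))
}

tempest_only <- top_n_diff(tempest_freq, comedy_freq, n = 10)
comedy_only  <- top_n_diff(comedy_freq, tempest_freq, n = 10)
print(tempest_only)
print(comedy_only)
```

Note that `setdiff()` keeps the order of its first argument, so the results stay ranked by frequency in the play being examined.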