Shakespeare – Text Mining in History and the Humanities

When William Shakespeare dedicated his narrative poem Venus and Adonis to his benefactor in 1593, he made a solemn promise: “I… vow to take advantage of all idle hours, till I have honoured you with some graver labour.” A year later he produced The Rape of Lucrece, a poem considered by many to be one of “the Bard’s” more serious works. Using the text mining tools in R, we can see that Shakespeare appears to have fulfilled his vow. While the two poems share many of their most frequent words, which points to an unsurprising similarity in style (after all, both are narrative poems written back to back), the more distinctive words in each seem to illustrate a marked gap in tone. The Rape of Lucrece uses words like “honour,” “sad,” and “sin” more often than Venus and Adonis does. By comparison, the latter makes use of more positive words like “kiss,” “boar,” and “cheek.” Yet context is all, and those of us who have read Venus and Adonis know that a “kiss” may not be enjoyed by all and the hunted may become the hunter. Thus, in a forthcoming post, we will delve deeper into these two works using R’s sentiment analysis tools and call Shakespeare to account for the vow he made 423 years ago.

The Comparison Table

Common	Lucrece Distinctive	Venus Distinctive	Lucrece “More” Distinctive*	Venus “More” Distinctive*
the	which	love	honour	kiss
and	when	now	sad	boar
to	then	shall	sin	boy
in	have	more	while	cheek
of	such	being	live	hard
his	did	heart	thing	best

			*These categories exclude proper nouns

The code that makes it work

#First download Venus and Adonis and the Rape of Lucrece in .txt form from Project Gutenberg. You will also need the stringr and stringi packages.
library(stringr)
library(stringi)
##Part 1- Cleaning up “The Rape of Lucrece”
Lucrece.lines.scan <- scan("c:\\yourname\\location\\TheRapeofLucrece.txt", what="character", sep="\n")
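# NOTE: the right-hand sides of the assignments below were lost when this post was published.
# They are reconstructed by analogy with the Aeneid code further down this page; treat them as assumptions.
# The original assigned Lucrece.lines twice, presumably to trim the Project Gutenberg header and footer.
# Those line ranges are lost, so no lines are dropped here.
Lucrece.lines <- Lucrece.lines.scan
Lucrece.string <- paste(Lucrece.lines, collapse=" ")
Lucrece.words <- str_split(string=Lucrece.string, pattern=" ")
Lucrece.words <- unlist(Lucrece.words)
Lucrece.words <- Lucrece.words[which(Lucrece.words!="")] # remove white space
Lucrece.words.df <- data.frame(Lucrece.words)
Lucrece.words.df$lower <- tolower(Lucrece.words.df[,1])
colnames(Lucrece.words.df)[1] <- "words"
Lucrece.words.df$clean_text <- str_replace_all(Lucrece.words.df$words, "[:punct:]", "") # remove punctuation
Lucrece.words.df$cleaned <- str_replace_all(Lucrece.words.df$lower, "[:punct:]", "") # remove punctuation
Lucrece.clean.tbl.df <- data.frame(table(Lucrece.words.df$cleaned))
Lucrece.cleaned.tbl.ord.df <- Lucrece.clean.tbl.df[order(-Lucrece.clean.tbl.df$Freq),]
colnames(Lucrece.cleaned.tbl.ord.df)[1] <- "Words"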
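#Cleaning up “Venus and Adonis”
# NOTE: the right-hand sides below were lost when this post was published, including the scan() call.
# They are reconstructed by analogy with the Aeneid code further down this page; treat them as assumptions.
# The file name is a placeholder, and the original header and footer trimming (two assignments to VenusAdonis.lines) is lost.
VenusAdonis.line.scan <- scan("c:\\yourname\\location\\VenusandAdonis.txt", what="character", sep="\n")
VenusAdonis.lines <- VenusAdonis.line.scan
VenusAdonis.string <- paste(VenusAdonis.lines, collapse=" ")
VenusAdonis.words <- str_split(string=VenusAdonis.string, pattern=" ")
VenusAdonis.words <- unlist(VenusAdonis.words)
VenusAdonis.words <- VenusAdonis.words[which(VenusAdonis.words!="")] # remove white space
VenusAdonis.words.df <- data.frame(VenusAdonis.words)
VenusAdonis.words.df$lower <- tolower(VenusAdonis.words.df[,1])
colnames(VenusAdonis.words.df)[1] <- "words"
VenusAdonis.words.df$clean_text <- str_replace_all(VenusAdonis.words.df$words, "[:punct:]", "") # remove punctuation
VenusAdonis.words.df$cleaned <- str_replace_all(VenusAdonis.words.df$lower, "[:punct:]", "") # remove punctuation
VenusAdonis.clean.tbl.df <- data.frame(table(VenusAdonis.words.df$cleaned))
VenusAdonis.cleaned.tbl.ord.df <- VenusAdonis.clean.tbl.df[order(-VenusAdonis.clean.tbl.df$Freq),]
colnames(VenusAdonis.cleaned.tbl.ord.df)[1] <- "Words"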
#Part 2- Comparison
##Which words are common in both “the Rape of Lucrece” and “Venus and Adonis”?
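# Stored as common.words rather than "table" so the base R table() function is not masked
common.words <- intersect(Lucrece.cleaned.tbl.ord.df$Words[1:10], VenusAdonis.cleaned.tbl.ord.df$Words[1:10])
write.table(common.words, "C:\\your.location\\VenusAdonis-Lucrece.csv", sep=",", col.names=NA)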
##Which words are “somewhat” distinctive?
setdiff(Lucrece.cleaned.tbl.ord.df$Words[1:50],VenusAdonis.cleaned.tbl.ord.df$Words[1:50])
setdiff(VenusAdonis.cleaned.tbl.ord.df$Words[1:50],Lucrece.cleaned.tbl.ord.df$Words[1:50])
##Which words are “more” distinctive?
VenusAdonis.cleaned.tbl.ord.df[which(!VenusAdonis.cleaned.tbl.ord.df$Words[1:500]%in% Lucrece.cleaned.tbl.ord.df$Words[1:500]),]
Lucrece.cleaned.tbl.ord.df[which(!Lucrece.cleaned.tbl.ord.df$Words[1:500]%in% VenusAdonis.cleaned.tbl.ord.df$Words[1:500]),]
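
For readers who want to try the comparison without downloading the poems, here is a minimal sketch of the same idea on two made-up snippets. It uses base R only, and the helper name word.freq is hypothetical (it does not appear in the code above); it simply bundles the lower-casing, punctuation stripping, counting, and sorting steps into one function.

```r
# Hypothetical helper (not from the original post): lower-case a text,
# strip punctuation, split on whitespace, and return a frequency table
# sorted from most to least common.
word.freq <- function(text) {
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  words <- unlist(strsplit(cleaned, "\\s+"))
  words <- words[words != ""]
  freq.df <- as.data.frame(table(words), stringsAsFactors = FALSE)
  freq.df <- freq.df[order(-freq.df$Freq), ]
  colnames(freq.df)[1] <- "Words"
  freq.df
}

# Two made-up snippets, not quotations from the poems.
a.df <- word.freq("The kiss, the boar, and the cheek.")
b.df <- word.freq("The honour and the sin; the sad thing.")

# Words found in both tables, then words found in only one of them.
common <- intersect(a.df$Words, b.df$Words)
only.a <- setdiff(a.df$Words, b.df$Words)
only.b <- setdiff(b.df$Words, a.df$Words)
```

On real texts the interesting choice is the cutoff: comparing only the top 50 rows of each table yields the “somewhat” distinctive words, while widening the window to the top 500 leaves only words that are rarer in the other poem.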

I began by copying the Latin text of Vergil’s Aeneid from Project Gutenberg and pasting it into a Word document. I had R pull the raw text from that document, and then I converted it into a dataframe. The Aeneid is divided into twelve “books” (closer in length to our “chapters”), so I had to scroll through the dataframe and note the line numbers where each book begins and ends, so that I could direct R to leave out the subheadings between them. The rest proceeded much as it did with Shakespeare’s Sonnets last week. Fortunately, punctuation in Latin texts is a later editorial convention rather than an ancient one, so I could strip it from the text without affecting the words.
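
Scrolling for line numbers works, but the subheadings could also be dropped by pattern. The sketch below is only an illustration on a few toy lines: the “LIBER” plus Roman numeral heading format is an assumption about how the book headings look, and the names toy.lines, is.heading, and toy.text are hypothetical.

```r
# Toy lines standing in for the scanned text.
# The "LIBER ..." heading format is an assumption, not checked against the file.
toy.lines <- c("LIBER I",
               "Arma virumque cano, Troiae qui primus ab oris",
               "Italiam, fato profugus, Laviniaque venit",
               "LIBER II",
               "Conticuere omnes intentique ora tenebant")

# TRUE for lines that consist only of "LIBER" followed by a Roman numeral.
is.heading <- grepl("^LIBER [IVXLC]+$", toy.lines)

# Keep every line that is not a heading.
toy.text <- toy.lines[!is.heading]
length(toy.text)
```

A pattern like this would still need checking against the actual file, since any front matter or license text would have to be removed separately.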

The only major problem that I ran into (and still need to figure out how to resolve) is that Latin is a highly inflected language. Word endings change depending on how the words are used in a sentence, and this complicates any attempt at counting word frequencies (e.g. haec is one of the most common words in the poem, but its other forms, hic, hoc, hanc, huius, huic, etc., were counted separately). That being said, the commonest words that showed up in my (imperfect) dataframe were conjunctions, prepositions, and particles (et, meaning “and”, is by far the most common word in the Aeneid), which generally do not change form. Finally, because I used a Latin text (which is what I would do when using R for my actual research), it was more difficult to compare and contrast the Aeneid’s word frequencies with those of Shakespeare, but they did have a few words in common: “in”, “me”, and “o”, which have roughly the same uses in both Latin and English (the English words are related to their Latin counterparts).
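
One possible way to approach the inflection problem is to map each inflected form to a dictionary headword before counting. The sketch below is not something I have run on the full poem: the word list is a toy stand-in, and the lookup table lemma.lookup is hypothetical and hand-made, covering only a few forms of hic.

```r
# Toy word list standing in for the cleaned Aeneid words (not the real data).
words <- c("haec", "et", "hic", "hoc", "et", "hanc", "arma", "huius", "et")

# Hypothetical hand-made lookup table: inflected form -> dictionary headword.
# Only a few forms of "hic" are listed; a real table would need many more.
lemma.lookup <- c(hic = "hic", haec = "hic", hoc = "hic",
                  hanc = "hic", huius = "hic", huic = "hic")

# Replace each word that appears in the lookup; leave all others unchanged.
matched <- words %in% names(lemma.lookup)
lemmas <- words
lemmas[matched] <- unname(lemma.lookup[words[matched]])

# Count by headword and sort from most to least frequent.
lemma.freq.df <- as.data.frame(table(lemmas), stringsAsFactors = FALSE)
lemma.freq.df <- lemma.freq.df[order(-lemma.freq.df$Freq), ]
colnames(lemma.freq.df)[1] <- "Words"
lemma.freq.df
```

A plain lookup cannot tell apart forms that are spelled alike but belong to different words (hic can also be the adverb “here”), so counts produced this way are still approximate.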

Code

library(stringr) # needed for str_split and str_replace_all

Aeneid.lines.scan <- scan("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/Aeneid Raw Text.txt", what="character", sep="\n") # Scan Aeneid Raw Text

Aeneid.lines.df <- data.frame(Aeneid.lines.scan, stringsAsFactors = FALSE) # Put into a dataframe

Aeneid.lines <- Aeneid.lines.df[c(27:782, 786:1589, 1593:2310, 2314:3018, 3022:3892, 3896:4796, 4800:5616, 5620:6350, 6354:7171, 7175:8082, 8086:9000, 9004:9955),] # Eliminate non-text lines

Aeneid.string <- paste(Aeneid.lines, collapse=" ") # Join all lines into one string

Aeneid.words <- str_split(string=Aeneid.string, pattern=" ") # Split on spaces

Aeneid.words <- unlist(Aeneid.words)

Aeneid.freq.df <- data.frame(table(Aeneid.words))

Aeneid.words <- Aeneid.words[which(Aeneid.words!="")] # Remove white space

Aeneid.words.df <- data.frame(Aeneid.words)

Aeneid.words.df$lower <- tolower(Aeneid.words.df[,1])

colnames(Aeneid.words.df)[1] <- "words"
Aeneid.words.df$clean_text <- str_replace_all(Aeneid.words.df$words, "[:punct:]", "") # remove punctuation

Aeneid.words.df$cleaned <- str_replace_all(Aeneid.words.df$lower, "[:punct:]", "") # remove punctuation

Aeneid.clean.tbl.df <- data.frame(table(Aeneid.words.df$cleaned))

Aeneid.cleaned.tbl.ord.df <- Aeneid.clean.tbl.df[order(-Aeneid.clean.tbl.df$Freq),]

colnames(Aeneid.cleaned.tbl.ord.df)[1] <- "Words"

write.table(Aeneid.cleaned.tbl.ord.df, "~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/Aeneid.tbl.ord.df.txt",
sep="\t") #Save Cleaned tabled ordered Aeneid

SONNETS.cleaned.tbl.ord.df <- read.table("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/SONNETS.tbl.ord.df.txt",
sep="\t", stringsAsFactors = FALSE) #Load Cleaned tabled ordered Sonnets

HAMLET.cleaned.tbl.ord.df <- read.table("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/HAMLET.tbl.ord.df.txt",
sep="\t", stringsAsFactors = FALSE) #Load Cleaned tabled ordered Hamlet

intersect(Aeneid.cleaned.tbl.ord.df$Words[1:50], HAMLET.cleaned.tbl.ord.df$Words[1:50])

setdiff(Aeneid.cleaned.tbl.ord.df$Words[1:50], HAMLET.cleaned.tbl.ord.df$Words[1:50])

setdiff(HAMLET.cleaned.tbl.ord.df$Words[1:50], Aeneid.cleaned.tbl.ord.df$Words[1:50])

Aeneid.cleaned.tbl.ord.df[which(Aeneid.cleaned.tbl.ord.df$Words[1:20]
%in% HAMLET.cleaned.tbl.ord.df$Words[1:20]),]

Aeneid.cleaned.tbl.ord.df[which(!Aeneid.cleaned.tbl.ord.df$Words[1:20]
%in% HAMLET.cleaned.tbl.ord.df$Words[1:20]),]

Tag: Shakespeare

Comparing Word Usage in Shakespeare’s the Rape of Lucrece and Venus and Adonis

Text Mining I – The Latin Text of Vergil’s Aeneid