Text Mining I – The Latin Text of Vergil’s Aeneid

I began by copying the Latin text of Vergil’s Aeneid from Project Gutenberg and pasting it into a Word document. I had R pull the raw text from that document and then converted it into a dataframe. The Aeneid is divided into twelve “books” (closer in length to our “chapters”), so I had to scroll through the dataframe and note the line numbers where each book begins and ends, so that I could direct R to leave out the subheadings between them. The rest proceeded much as it did with Shakespeare’s Sonnets last week. Fortunately, punctuation in Latin texts is a modern editorial convention rather than an ancient one, so it was easy to strip it out without affecting the words.
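Noting the book boundaries by hand works, but the same filtering can be sketched with a pattern match. The following is a minimal illustration on invented toy lines, not the method used below; the heading wording ("LIBER" plus a Roman numeral) is an assumption and would need to be checked against the actual Gutenberg file.

```r
# Toy stand-in for the scanned text; the heading wording is an assumption.
toy.lines <- c("LIBER I", "Arma virumque cano", "Troiae qui primus ab oris",
               "LIBER II", "Conticuere omnes", "intentique ora tenebant")

# Flag lines that look like book headings (hypothetical pattern).
is.heading <- grepl("^LIBER [IVX]+$", toy.lines)

# Keep only the lines of verse.
toy.text <- toy.lines[!is.heading]

# Record which book each remaining line belongs to.
toy.book <- cumsum(is.heading)[!is.heading]
```

Keeping the book number alongside each line would also make it possible to count words book by book later on.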

The only major problem that I ran into (and still need to figure out how to resolve) is that Latin is a highly inflected language. Word endings change depending on how a word is used in a sentence, and this complicates any attempt at counting word frequencies (e.g. haec is one of the most common words in the poem, but its other forms, hic, hoc, hanc, huius, huic, etc., were counted separately). That being said, the most common words in my (imperfect) dataframe were conjunctions, prepositions, and particles, which generally do not change form (et, meaning “and”, is by far the most common word in the Aeneid). Finally, because I used a Latin text (which is what I would do when using R for my actual research), it was harder to compare and contrast the Aeneid’s word frequencies with those of Shakespeare. They did have a few words in common, though: “in”, “me”, and “o”, which have roughly the same uses in both Latin and English (English “in” and “me” are cognates of the Latin words rather than borrowings from them).
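One possible way to approach the inflection problem is to map each inflected form to its dictionary form (lemma) before counting. The sketch below uses a tiny hand-made lookup table and invented toy words; `lemma.map` is hypothetical, and a real solution would need a full Latin lemmatizer, since forms such as hic can belong to more than one word.

```r
# Hypothetical lookup table: inflected form -> dictionary form (lemma).
lemma.map <- c(hic = "hic", haec = "hic", hoc = "hic",
               hanc = "hic", huius = "hic", huic = "hic")

# Invented toy input, already lowercased and stripped of punctuation.
toy.words <- c("haec", "et", "hic", "arma", "hanc", "et", "huic")

# Replace a word with its lemma when the table has an entry; otherwise keep it.
known <- toy.words %in% names(lemma.map)
toy.lemmas <- toy.words
toy.lemmas[known] <- unname(lemma.map[toy.words[known]])

# Count by lemma, most frequent first.
toy.lemma.freq <- sort(table(toy.lemmas), decreasing = TRUE)
```

With this table the four forms of hic are counted together, while words that are not in the table, such as et and arma, pass through unchanged.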


library(stringr) # Provides str_split() and str_replace_all()

Aeneid.lines.scan <- scan("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/Aeneid Raw Text.txt", what="character", sep="\n") # Scan Aeneid Raw Text

Aeneid.lines.df <- data.frame(Aeneid.lines.scan, stringsAsFactors = FALSE) # Put into a dataframe

Aeneid.lines <- Aeneid.lines.df[c(27:782, 786:1589, 1593:2310, 2314:3018, 3022:3892, 3896:4796, 4800:5616, 5620:6350, 6354:7171, 7175:8082, 8086:9000, 9004:9955),] # Eliminate non-text lines

Aeneid.string <- paste(Aeneid.lines, collapse=" ") # Join all lines into one string

Aeneid.words <- str_split(string=Aeneid.string, pattern = " ") # Split on spaces

Aeneid.words <- unlist(Aeneid.words)

Aeneid.words <- Aeneid.words[which(Aeneid.words!="")] # Remove white space

Aeneid.freq.df <- data.frame(table(Aeneid.words)) # Raw counts, before lowercasing and punctuation removal

Aeneid.words.df <- data.frame(Aeneid.words)

Aeneid.words.df$lower <- tolower(Aeneid.words.df[,1])

colnames(Aeneid.words.df)[1] <- "words"

Aeneid.words.df$clean_text <- str_replace_all(Aeneid.words.df$words, "[:punct:]","") # remove punctuation

Aeneid.words.df$cleaned <- str_replace_all(Aeneid.words.df$lower, "[:punct:]","") # remove punctuation

Aeneid.clean.tbl.df <- data.frame(table(Aeneid.words.df$cleaned))

Aeneid.cleaned.tbl.ord.df <- Aeneid.clean.tbl.df[order(-Aeneid.clean.tbl.df$Freq),]

colnames(Aeneid.cleaned.tbl.ord.df)[1] <- "Words"

write.table(Aeneid.cleaned.tbl.ord.df, "~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/Aeneid.tbl.ord.df.txt",
sep="\t") #Save Cleaned tabled ordered Aeneid

SONNETS.cleaned.tbl.ord.df <- read.table("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/SONNETS.tbl.ord.df.txt",
sep="\t", stringsAsFactors = FALSE) #Load Cleaned tabled ordered Sonnets

HAMLET.cleaned.tbl.ord.df <- read.table("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/HAMLET.tbl.ord.df.txt",
sep="\t", stringsAsFactors = FALSE) #Load Cleaned tabled ordered Hamlet

intersect(Aeneid.cleaned.tbl.ord.df$Words[1:50], HAMLET.cleaned.tbl.ord.df$Words[1:50])

setdiff(Aeneid.cleaned.tbl.ord.df$Words[1:50], HAMLET.cleaned.tbl.ord.df$Words[1:50])

setdiff(HAMLET.cleaned.tbl.ord.df$Words[1:50], Aeneid.cleaned.tbl.ord.df$Words[1:50])
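To show what the three set operations above return, here is a small sketch on two invented top-five lists. The vectors `top.a` and `top.b` are made up for illustration and are not the actual results from either text.

```r
# Invented toy lists standing in for two texts' most frequent words.
top.a <- c("et", "in", "ad", "me", "o")
top.b <- c("the", "and", "in", "me", "o")

shared <- intersect(top.a, top.b) # words in both lists
only.a <- setdiff(top.a, top.b)   # words only in the first list
only.b <- setdiff(top.b, top.a)   # words only in the second list
```

Note that setdiff() is not symmetric, which is why it is called twice with the arguments swapped.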

%in% HAMLET.cleaned.tbl.ord.df$Words[1:20]),]

%in% HAMLET.cleaned.tbl.ord.df$Words[1:20]),]