Use of Letters in Austen and Dickens and a Comparison between Two of Austen’s Novels

IDEA

My idea is to choose some texts that I have read and compare them to texts that I have never read, so I can raise interesting questions based on my previous knowledge. My choices are Pride and Prejudice and A Tale of Two Cities, two novels that I read about six years ago, and Sense and Sensibility, Mansfield Park, Persuasion, Emma, Great Expectations, and Oliver Twist, which I have never read.

R CODE

I made some improvements to the code from class.

  • Removing punctuation creates empty strings (""), since some tokens consist only of punctuation. Thus, I took out the punctuation before eliminating empty strings.
  • When I create the data frame, the "Word" column is automatically converted to a factor, so I converted it back to character (a possible alternative is sketched below).
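
A minimal sketch of that alternative, using the object names from the code that follows (whether the factor conversion happens at all depends on the R version): as.data.frame() for tables accepts a stringsAsFactors argument, so the column can be kept as character from the start rather than converted afterwards.

# Possible alternative (sketch): keep the table's names as character up front
word.freq.df <- as.data.frame(table(PRIDE.words.goodF), stringsAsFactors = FALSE)
str(word.freq.df)  # the first column should now be character rather than factor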

 

library(stringr)  # needed for str_split() and str_replace_all()

# Change the name of the file to import
PRIDE.scan <- scan("C:/Users/klijia/Desktop/HIST582A/W2/Raw Text/PRIDE.txt", what = "character", sep = "\n")
PRIDE.df <- data.frame(PRIDE.scan, stringsAsFactors = FALSE)

# Select the appropriate text
PRIDE.t <- PRIDE.df[c(16:10734),]
PRIDE.string <- paste(PRIDE.t, collapse = " ")

PRIDE.words <- str_split(string = PRIDE.string, pattern = " ")
PRIDE.words.good <- unlist(PRIDE.words)

# Take out punctuation before removing empty strings "",
# since some words consist only of punctuation
PRIDE.words.good1 <- str_replace_all(PRIDE.words.good, "[:punct:]", "")
PRIDE.words.good2 <- PRIDE.words.good1[which(PRIDE.words.good1 != "")]
PRIDE.words.goodF <- tolower(PRIDE.words.good2)

PRIDE.df <- data.frame(table(PRIDE.words.goodF))
PRIDE.ord.df <- PRIDE.df[order(-PRIDE.df$Freq),]
colnames(PRIDE.ord.df)[1] <- "Word"

# The first column of the df is a factor; the next line converts it to character.
PRIDE.ord.df$Word <- as.character(PRIDE.ord.df$Word)

# Change the name of the export file
write.table(PRIDE.ord.df, "C:/Users/klijia/Desktop/HIST582A/W2/Freq/Pride_Freq.txt", sep = "\t")


 

I used the same code for all eight novels, changing only the import path, the text-selection line, and the output path. Wrapping the steps in a function would make this even more convenient.
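
A minimal sketch of such a function, following the same steps as the code above (the function name make_freq_table and its arguments are my own; adjust the paths as needed):

library(stringr)

# Hypothetical helper: read a raw text file, keep lines `first` through `last`,
# and write an ordered word-frequency table to `outfile`
make_freq_table <- function(infile, first, last, outfile) {
  raw <- scan(infile, what = "character", sep = "\n")
  text <- paste(raw[first:last], collapse = " ")
  words <- unlist(str_split(text, pattern = " "))
  words <- str_replace_all(words, "[:punct:]", "")   # strip punctuation first
  words <- tolower(words[words != ""])               # then drop empty strings
  freq <- data.frame(table(words))
  freq <- freq[order(-freq$Freq), ]
  colnames(freq)[1] <- "Word"
  freq$Word <- as.character(freq$Word)
  write.table(freq, outfile, sep = "\t")
  invisible(freq)
}

# Example call with the Pride and Prejudice settings used above:
# make_freq_table("C:/Users/klijia/Desktop/HIST582A/W2/Raw Text/PRIDE.txt", 16, 10734,
#                 "C:/Users/klijia/Desktop/HIST582A/W2/Freq/Pride_Freq.txt")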

Questions

Epistolary Legacy

One thing I remember from my reading of Pride and Prejudice is that Jane Austen likes to use letters in her novels. Many early novels were written in epistolary style, and Austen’s own early works are in epistolary form, so it is not surprising that she preserves some of that legacy in her later novels. The method I used to confirm Austen’s preference for letters is simply to calculate the word frequency of “letter” and “letters”. The method is rudimentary, and I cannot claim that mere use of the words “letter” and “letters” proves that more letters are quoted in the novels, but the following graphs reveal interesting patterns.

[Graph 1 and Graph 2: frequency of “letter” and “letters” in the Austen and Dickens novels]

From the graphs, I found that Austen uses the words “letter” and “letters” about four times as often as Dickens does. In Pride and Prejudice, 10.8 of every 10,000 words are “letter” or “letters”. Austen’s works retain an epistolary legacy compared to Dickens’s, which also makes sense chronologically, since Dickens wrote after Austen.
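
For reference, a minimal sketch of how such a relative frequency could be computed from the frequency tables created above (assuming Pride_Freq is the exported Pride and Prejudice table, loaded back into R):

# Relative frequency of "letter"/"letters" per 10,000 words (sketch)
letter.count <- sum(Pride_Freq$Freq[Pride_Freq$Word %in% c("letter", "letters")])
total.count <- sum(Pride_Freq$Freq)
letter.count / total.count * 10000   # about 10.8 for Pride and Prejudice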

 

The Comparison between Pride and Prejudice and Sense and Sensibility

I also compared two of Austen’s novels. Since novels use many proper nouns, I compared the differences in their top 300 words. The code follows; I imported the frequency tables created previously before running it.


 

setdiff(Pride_Freq$Word[1:300],Sense_Freq$Word[1:300])


 

[1] "elizabeth" "darcy" "bennet" "jane" "bingley"
[6] "wickham" "collins" "lydia" "father" "catherine"
[11] "lizzy" "longbourn" "gardiner" "take" "anything"
[16] "aunt" "daughter" "let" "ladies" "netherfield"
[21] "evening" "added" "kitty" "charlotte" "marriage"
[26] "went" "lucas" "answer" "character" "gone"
[31] "passed" "received" "coming" "conversation" "part"
[36] "seeing" "began" "either" "those" "uncle"
[41] "whose" "daughters" "meryton" "means" "party"
[46] "possible" "able" "bingleys" "london" "pemberley"


 

setdiff(Sense_Freq$Word[1:300],Pride_Freq$Word[1:300])


 

[1] "elinor" "marianne" "dashwood" "edward" "jennings"
[6] "thing" "willoughby" "lucy" "john" "heart"
[11] "brandon" "ferrars" "barton" "middleton" "mariannes"
[16] "spirits" "person" "against" "feel" "hardly"
[21] "poor" "engagement" "palmer" "acquaintance" "elinors"
[26] "comfort" "cottage" "visit" "within" "brought"
[31] "dashwoods" "short" "continued" "eyes" "general"
[36] "half" "side" "situation" "suppose" "wished"
[41] "end" "norland" "people" "reason" "rest"
[46] "returned" "longer" "park" "took" "under"

 

Proper nouns are not interesting, so I ignored them. Some of the words that appear in Pride and Prejudice but not in Sense and Sensibility are “father”, “aunt”, “daughter”, and “uncle”. Sense and Sensibility has no frequent words about family members or relatives in its list, which suggests that Pride and Prejudice is more concerned with family relationships. Sense and Sensibility has more words with negative connotations: “poor”, “against”, “hardly”, “cottage” (compared to the mansions of Pride and Prejudice). This suggests that Sense and Sensibility tells a sadder story than Pride and Prejudice. Of course, through close reading I could figure out exactly whether Sense and Sensibility deals with family relations and whether it is a comedy or a tragedy, but text mining helps me get a general idea within a few seconds.
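
To go one step beyond the top-300 cutoff, the counts for a hand-picked group of family words can be looked up directly in both tables (a sketch, again assuming Pride_Freq and Sense_Freq are loaded; words absent from a novel come back as NA):

# Compare counts for selected family words across the two novels (sketch)
family.words <- c("father", "mother", "aunt", "uncle", "daughter", "sister")
data.frame(word = family.words,
           pride = Pride_Freq$Freq[match(family.words, Pride_Freq$Word)],
           sense = Sense_Freq$Freq[match(family.words, Sense_Freq$Word)])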

Text Mining I – The Latin Text of Vergil’s Aeneid

I began by copying the Latin text of Vergil’s Aeneid from Project Gutenberg and pasting it into a document. I had R pull the raw text from that document and then converted it into a dataframe. The Aeneid is divided into twelve “books” (closer in length to our “chapters”), so I had to scroll through the dataframe and note the line numbers of the beginning and end of each book, so that I could direct R to leave out the subheadings between them. The rest proceeded much as it did with Shakespeare’s Sonnets last week, and fortunately punctuation in Latin texts is a modern convention, so it was easy to eliminate it from the text without affecting the words.

The only major problem that I ran into (and still need to figure out how to resolve) is the fact that Latin is a highly inflected language. Depending on how words are being used in a sentence, their endings change, and this complicates any attempts at counting word frequencies (e.g. haec is one of the most common words in the poem, but its other forms, hic, hoc, hanc, huius, huic, etc, were counted separately). That being said, the commonest words that showed up in my (imperfect) dataframe were conjunctions, prepositions, and particles (et, meaning “and”, is by far the most common word in the Aeneid), which generally do not change form. Finally, the fact that I used a Latin text (which is what I would do when using R for my actual research) meant that it was more difficult to compare and contrast the Aeneid’s word frequencies with those of Shakespeare, but they did have a few words in common: “in”, “me”, and “o”, which have roughly the same uses in both Latin and English (and English of course gets those words from Latin).
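
One rough way to soften the inflection problem, sketched here, is to collapse known forms onto a single lemma with a small hand-made lookup table before counting. The table below covers only forms of hic as an illustration, and a real lemmatizer or a much fuller table would be needed for serious counts; the sketch assumes the cleaned word column Aeneid.words.df$cleaned produced by the code that follows.

# Sketch: merge inflected forms onto one lemma before counting (hand-made lookup)
lemma.lookup <- c(hic = "hic", haec = "hic", hoc = "hic", hanc = "hic", hunc = "hic",
                  huius = "hic", huic = "hic", hos = "hic", has = "hic", his = "hic")

Aeneid.lemmas <- Aeneid.words.df$cleaned
matched <- Aeneid.lemmas %in% names(lemma.lookup)
Aeneid.lemmas[matched] <- lemma.lookup[Aeneid.lemmas[matched]]

Aeneid.lemma.freq.df <- data.frame(table(Aeneid.lemmas))
Aeneid.lemma.freq.df <- Aeneid.lemma.freq.df[order(-Aeneid.lemma.freq.df$Freq), ]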

Code

library(stringr)  # needed for str_split() and str_replace_all()

Aeneid.lines.scan <- scan("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/Aeneid Raw Text.txt", what = "character", sep = "\n") # Scan Aeneid raw text

Aeneid.lines.df <- data.frame(Aeneid.lines.scan, stringsAsFactors = FALSE) # Put into a dataframe

Aeneid.lines <- Aeneid.lines.df[c(27:782, 786:1589, 1593:2310, 2314:3018, 3022:3892, 3896:4796, 4800:5616, 5620:6350, 6354:7171, 7175:8082, 8086:9000, 9004:9955),] # Eliminate non-text lines (book subheadings)

Aeneid.string <- paste(Aeneid.lines, collapse = " ")

Aeneid.words <- str_split(string = Aeneid.string, pattern = " ")

Aeneid.words <- unlist(Aeneid.words)

Aeneid.freq.df <- data.frame(table(Aeneid.words)) # Initial (uncleaned) frequency table

Aeneid.words <- Aeneid.words[which(Aeneid.words != "")] # Remove empty strings

Aeneid.words.df <- data.frame(Aeneid.words)

Aeneid.words.df$lower <- tolower(Aeneid.words.df[,1]) # Lowercased copy of each word

colnames(Aeneid.words.df)[1] <- "words"

Aeneid.words.df$clean_text <- str_replace_all(Aeneid.words.df$words, "[:punct:]", "") # Remove punctuation (original case)

Aeneid.words.df$cleaned <- str_replace_all(Aeneid.words.df$lower, "[:punct:]", "") # Remove punctuation (lowercased); used below

Aeneid.clean.tbl.df <- data.frame(table(Aeneid.words.df$cleaned))

Aeneid.cleaned.tbl.ord.df <- Aeneid.clean.tbl.df[order(-Aeneid.clean.tbl.df$Freq),]

colnames(Aeneid.cleaned.tbl.ord.df)[1] <- "Words"

write.table(Aeneid.cleaned.tbl.ord.df, "~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/Aeneid.tbl.ord.df.txt",
sep = "\t") # Save cleaned, tabled, ordered Aeneid

SONNETS.cleaned.tbl.ord.df <- read.table("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/SONNETS.tbl.ord.df.txt",
sep = "\t", stringsAsFactors = FALSE) # Load cleaned, tabled, ordered Sonnets

HAMLET.cleaned.tbl.ord.df <- read.table("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/HAMLET.tbl.ord.df.txt",
sep = "\t", stringsAsFactors = FALSE) # Load cleaned, tabled, ordered Hamlet

intersect(Aeneid.cleaned.tbl.ord.df$Words[1:50], HAMLET.cleaned.tbl.ord.df$Words[1:50])

setdiff(Aeneid.cleaned.tbl.ord.df$Words[1:50], HAMLET.cleaned.tbl.ord.df$Words[1:50])

setdiff(HAMLET.cleaned.tbl.ord.df$Words[1:50], Aeneid.cleaned.tbl.ord.df$Words[1:50])

Aeneid.cleaned.tbl.ord.df[which(Aeneid.cleaned.tbl.ord.df$Words[1:20]
%in% HAMLET.cleaned.tbl.ord.df$Words[1:20]),]

Aeneid.cleaned.tbl.ord.df[which(!Aeneid.cleaned.tbl.ord.df$Words[1:20]
%in% HAMLET.cleaned.tbl.ord.df$Words[1:20]),]