Text Mining in History and the Humanities – Page 3 – Just another Emory WordPress Sites site

Use of Letters in Austen and Dickens, and a Comparison of Two Austen Novels

IDEA

My idea is to choose some texts that I have read and compare them to texts that I have never read, so that I can raise interesting questions based on my previous knowledge. My choices are Pride and Prejudice and A Tale of Two Cities, two novels that I read about six years ago, and Sense and Sensibility, Mansfield Park, Persuasion, Emma, Great Expectations, and Oliver Twist, which I have never read.

R CODE

I made some improvements to the code from class.

Some "words" consist only of punctuation, so taking out punctuation creates empty strings (""). Thus, I took out the punctuation before eliminating empty strings.
When I created my data frame, I found that the "Word" column was automatically converted to a factor. I converted it to character.

library(stringr)  # provides str_split() and str_replace_all()

# Change the name of the file to import
PRIDE.scan <- scan("C:/Users/klijia/Desktop/HIST582A/W2/Raw Text/PRIDE.txt", what = "character", sep = "\n")
PRIDE.df <- data.frame(PRIDE.scan, stringsAsFactors = FALSE)

# Select the lines that contain the text of the novel
PRIDE.t <- PRIDE.df[c(16:10734), ]
PRIDE.string <- paste(PRIDE.t, collapse = " ")

# Split the text into words at every space
PRIDE.words <- str_split(string = PRIDE.string, pattern = " ")
PRIDE.words.good <- unlist(PRIDE.words)

# Take out punctuation before taking out empty strings (""),
# since some words consist only of punctuation
PRIDE.words.good1 <- str_replace_all(PRIDE.words.good, "[:punct:]", "")
PRIDE.words.good2 <- PRIDE.words.good1[which(PRIDE.words.good1 != "")]
PRIDE.words.goodF <- tolower(PRIDE.words.good2)

# Count each word and sort from most to least frequent
PRIDE.df <- data.frame(table(PRIDE.words.goodF))
PRIDE.ord.df <- PRIDE.df[order(-PRIDE.df$Freq), ]
colnames(PRIDE.ord.df)[1] <- "Word"

# data.frame(table(...)) stores the words as a factor,
# so convert the first column to character.
PRIDE.ord.df$Word <- as.character(PRIDE.ord.df$Word)

# Change the name of the export file (it should match the novel imported above)
write.table(PRIDE.ord.df, "C:/Users/klijia/Desktop/HIST582A/W2/Freq/A Tale_Freq.txt", sep = "\t")

I used the same code for all eight novels, changing only the import, text selection, and output lines each time. Wrapping the code in a function would make this even more convenient.
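As a rough sketch of what such a function could look like, the block below wraps the same steps in base R. The name word_freq_table and its arguments first_line and last_line are my own placeholders, not part of the class code, and the demonstration runs on a small temporary file rather than on a novel.

```r
# Build a word-frequency table from a plain-text file.
# first_line and last_line are assumed to mark where the text of the
# novel begins and ends in the file (they differ for every novel).
word_freq_table <- function(path, first_line, last_line) {
  lines <- scan(path, what = "character", sep = "\n", quiet = TRUE)
  text  <- paste(lines[first_line:last_line], collapse = " ")
  words <- unlist(strsplit(text, " ", fixed = TRUE))
  words <- gsub("[[:punct:]]", "", words)  # strip punctuation first
  words <- tolower(words[words != ""])     # then drop empty strings
  freq  <- as.data.frame(table(Word = words), stringsAsFactors = FALSE)
  freq[order(-freq$Freq), ]                # most frequent words first
}

# Tiny demonstration on a temporary file (made-up content)
demo_path <- tempfile(fileext = ".txt")
writeLines(c("HEADER", "A letter, a letter!", "-- the END --", "FOOTER"),
           demo_path)
demo_freq <- word_freq_table(demo_path, first_line = 2, last_line = 3)
print(demo_freq)
```

With a function like this, each novel would need only one call with its own path and line range, followed by write.table() on the result.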

Questions

Epistolary Legacy

One thing I remember from my reading of Pride and Prejudice is that Jane Austen likes to use letters in her novels. Many early novels were written in epistolary style, and so were Austen's early works. It is not surprising that Austen preserves some epistolary legacy in her later works. To confirm Austen's preference for letters, I simply calculated the word frequency of "letter" and "letters". The method is rudimentary, and I cannot claim that mere use of the words "letter" and "letters" shows that letters are quoted more often in the novels, but the following graphs reveal interesting patterns.

[Graph 1]

[Graph 2]

From the graphs, I found that Austen uses the words "letter" and "letters" about four times as often as Dickens does. In Pride and Prejudice, 10.8 out of every 10,000 words are "letter" or "letters". Austen's works retain an epistolary legacy compared to Dickens' works. This is also consistent chronologically, since Dickens wrote after Austen.
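The per-10,000 figure can be computed from a frequency table along the following lines. This is only a sketch: the function name rate_per_10k is my own, and the toy table uses made-up counts, not the real counts from any of the novels.

```r
# Rate of a set of target words per 10,000 words in a frequency table
# with columns Word and Freq.
rate_per_10k <- function(freq_df, targets) {
  hits <- sum(freq_df$Freq[freq_df$Word %in% targets])
  10000 * hits / sum(freq_df$Freq)
}

# Toy table with made-up counts, for illustration only
toy_freq <- data.frame(
  Word = c("the", "letter", "letters", "house"),
  Freq = c(9000, 8, 2, 990),
  stringsAsFactors = FALSE
)
toy_rate <- rate_per_10k(toy_freq, c("letter", "letters"))
print(toy_rate)
```

Dividing by the total number of words is what makes novels of different lengths comparable.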

The Comparison between Pride and Prejudice and Sense and Sensibility

I also compared two of Austen's novels. Since novels use many proper nouns, I compared the differences among the top 300 words. The code follows. I imported the tables created previously before running it.

setdiff(Pride_Freq$Word[1:300],Sense_Freq$Word[1:300])

 [1] "elizabeth" "darcy" "bennet" "jane" "bingley"
 [6] "wickham" "collins" "lydia" "father" "catherine"
[11] "lizzy" "longbourn" "gardiner" "take" "anything"
[16] "aunt" "daughter" "let" "ladies" "netherfield"
[21] "evening" "added" "kitty" "charlotte" "marriage"
[26] "went" "lucas" "answer" "character" "gone"
[31] "passed" "received" "coming" "conversation" "part"
[36] "seeing" "began" "either" "those" "uncle"
[41] "whose" "daughters" "meryton" "means" "party"
[46] "possible" "able" "bingleys" "london" "pemberley"

setdiff(Sense_Freq$Word[1:300],Pride_Freq$Word[1:300])

 [1] "elinor" "marianne" "dashwood" "edward" "jennings"
 [6] "thing" "willoughby" "lucy" "john" "heart"
[11] "brandon" "ferrars" "barton" "middleton" "mariannes"
[16] "spirits" "person" "against" "feel" "hardly"
[21] "poor" "engagement" "palmer" "acquaintance" "elinors"
[26] "comfort" "cottage" "visit" "within" "brought"
[31] "dashwoods" "short" "continued" "eyes" "general"
[36] "half" "side" "situation" "suppose" "wished"
[41] "end" "norland" "people" "reason" "rest"
[46] "returned" "longer" "park" "took" "under"

Proper nouns are not interesting, so I ignored them. Some of the words that are in the Pride and Prejudice list but not in the Sense and Sensibility list are "father", "aunt", "daughter", and "uncle". The Sense and Sensibility list has no frequent words for family members or relatives, which suggests that Pride and Prejudice is more concerned with family relationships. The Sense and Sensibility list has more words with negative connotations: "poor", "against", "hardly", and "cottage" (compared to the mansions in Pride and Prejudice). This suggests that Sense and Sensibility tells a sadder story than Pride and Prejudice. Of course, through close reading I could figure out exactly whether Sense and Sensibility deals with family relations and whether it is a comedy or a tragedy, but text mining helps me get a general idea within a few seconds.
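Instead of skipping the proper nouns by eye, they could be dropped in code. The sketch below assumes a hand-made list of names. The function distinct_top_words and the short toy vectors are hypothetical stand-ins for the Word columns of the two frequency tables.

```r
# Words in the top n of `a` that are absent from the top n of `b`,
# after dropping a hand-made list of proper nouns.
distinct_top_words <- function(a, b, n, proper_nouns = character(0)) {
  only_a <- setdiff(head(a, n), head(b, n))
  setdiff(only_a, proper_nouns)
}

# Toy vectors, already ordered by frequency (illustration only)
pride_top <- c("the", "elizabeth", "darcy", "father", "aunt")
sense_top <- c("the", "elinor", "marianne", "heart", "poor")
names_to_skip <- c("elizabeth", "darcy", "elinor", "marianne")

pride_only <- distinct_top_words(pride_top, sense_top, 5, names_to_skip)
sense_only <- distinct_top_words(sense_top, pride_top, 5, names_to_skip)
print(pride_only)
print(sense_only)
```

The list of names still has to be written by hand, so this saves reading time rather than judgment.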


What Is the Difference between the Old and the Standard Manchu Language?

The Origin of the Manchu Language

Before I address the origin of the Manchu language, I must introduce a historiographic approach: the New Qing History. The New Qing History emphasizes the importance of Manchu elements in the Qing Empire (1616-1912) by reading non-Han Chinese sources, mainly in the Manchu language, and by viewing the empire through the lens of global history. In other words, the New Qing historians regard the Qing Empire as an empire of the early modern period, just like the Ottoman Empire, the British Empire, and so on. Clearly, in order to conduct a study using the methodology and ideas of this historiography, it is necessary to understand the importance and evolution of the Manchu language.

Figure 1. The image of a document in the standard Manchu language (provided by Cheng-Heng Lu)

Zheng, Koxinga

The Manchu language was the official language of the Qing Empire and the most common language in it. As a universal empire, the Qing adopted different institutions to govern different areas efficiently, so the use of local languages in official documents in different areas expresses the essence of the empire. However, no matter which language was dominant locally, the Manchu language was written in parallel. For example, the empire used Chinese and Manchu in China, Mongolian and Manchu in Mongolia, and Tibetan and Manchu in Tibet. More generally, in order to govern various regions efficiently, the Qing Empire endeavored to translate classics from different languages, such as the Confucian classics from China and the Buddhist classics from Tibet, into the Manchu language. Therefore, Manchu was undoubtedly one of the most important and universal languages in East Asia, just like Chinese, during the Qing period.

The written language was created in 1599 on the basis of the rules and characters of the Mongolian language. In this period it is called the Old Manchu language because it was not yet mature or standardized. For example, the Old Manchu language could not spell Han Chinese names because it had not yet established a system for spelling sounds that were not used colloquially. Additionally, the characters for "h," "k," and "g" were written exactly the same in the Old Manchu language. Moreover, the grammar was not fully standardized. As the regime gradually grew, this immature language had to be revised. The significant turning point came in 1632, when Dahai, a literary scholar, was ordered to revise the Old Manchu language into the New Manchu language, also known simply as the Manchu language. After Dahai's revision, the standard Manchu language, which was widely used later, was mature enough to become the official language of the Court.

The Text in the Old Manchu language and the Text in the Manchu language

The Old Manchu language was used only from 1599 to 1632, so there are few sources in it apart from Man Wen Lao Dang. Man Wen Lao Dang (滿文老檔, The Old Archive of the Manchu Language) recorded daily events, including political, ethnic, economic, military, and social ones, before 1644, when the Manchu troops occupied Beijing and established the Qing Dynasty as one of the orthodox Chinese dynasties. Because Man Wen Lao Dang is the principal source in the Old Manchu language, it is the most significant source for understanding how the Old Manchu language was used.

After the Manchu army occupied Beijing and established the Qing Dynasty in China, an increasing number of archives were written in the Manchu language, as mentioned above. To be sure, Chinese was still the most important language, but, as mentioned, Manchu was undoubtedly the official language. As a result, Manchu was the only choice when the Court proposed to compile certain texts or books.

Ping Ding Hai Kou Fang Lue (平定海寇方略, The Book about Defeating Piracy), the other text in this analysis, was compiled around 1686. This book recorded how the Qing Empire suppressed and occupied Taiwan, which had been ruled by the Zheng Regime (1661-1683). Since it was compiled to proclaim the victory and sovereignty of the Qing Empire, it was reasonable to compile it in the Manchu language so that it could be delivered to every corner of the empire. In this sense, a Manchu-language version was appropriate because Manchu was the shared language of the empire's different ethnicities. Moreover, the standard Manchu language had been in use for over half a century by the time Ping Ding Hai Kou Fang Lue was compiled. As a result, this book is suitable for analyzing the linguistic usage of the Manchu language in comparison with Man Wen Lao Dang.

Most importantly, these two texts are among the few available in digital form, one in the Old and one in the standard Manchu language. Therefore, although comparing the two texts might yield only modest findings, this might be the first time that a study has applied the methodology of the digital humanities to Manchu-language sources.

The Analysis and Comparison of two texts

Analyzed with comparative and statistical methods, these two texts offered a surprising number of interesting details. I analyze each text individually here.

Table 1. The frequency of words in Man Wen Lao Dang

Order	Words in Manchu language	Frequency	Meaning in English
1	i	19374	of
2	de	17149	at
3	be	17115	is
4	emu	5049	one
5	han	4960	khan
6	niyalma	4692	people
7	seme	4568	(expletive)
8	juwe	3328	two
9	cooha	2803	military/army
10	juwan	2260	ten
11	weile	2095	affair
12	tere	2029	that/he/she
13	ilan	1883	three
14	gurun	1877	state/country
15	orin	1739	twenty
16	tanggū	1723	hundred
17	morin	1685	horse/the seventh character of Earth Branch
18	ni	1667	of (the previous word ending with “n”)
19	nikan	1576	Chinese
20	ere	1568	this

As can be seen in Table 1, apart from numbers, such as emu, juwe, and ilan, and auxiliary words, such as i, de, and be, the most frequent word is han. As noted, han means khan. This was the ruler's official title before the Manchu army invaded China. In other words, during this period the Qing regime was more like a khanate than an empire. This makes sense because the Qing became an "empire" only after the second khan, Hong Taiji, defeated the Mongolian army in 1635. To understand the nature of the Qing regime at this time more deeply, consider another term, hūwangdi, which means emperor: it appears in this text only 38 times, and every occurrence refers to the emperor of the Ming Dynasty. The title of the regime's leader in this period thus indicates that the regime was a khanate rather than an empire. Additionally, although scholars have argued that this regime was ruled by a tribal council made up of the khan and seven other feudatories, the beile, the word beile appears in this text only 1,516 times, compared with 4,960 times for han. Accordingly, the khan may have played a much more important role in this regime than these feudatories did.

Additionally, the frequencies of niyalma and nikan in Table 1 are hard to ignore. A close reading shows that niyalma is a general term for all people under the rule of this regime, whereas nikan specifically identifies Chinese people. Why is there no term for Manchu? In fact, Manchu, written as manju in this text, appears only 131 times. To be sure, the name Manchu was created to unite all the ethnicities in Manchuria after Hong Taiji took control of Mongolia and came to the throne as emperor after 1635. However, the frequency of nikan also indicates an important fact: Chinese people were still the majority in this region. This might also explain why the Qing Empire had to establish the Hanjun Eight Banners System to assimilate Chinese people into its ruling class.
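Looking up the counts of a few chosen terms, as done above for han, niyalma, nikan, and manju, can be sketched in R as follows. The function name term_counts and the term "not_in_text" are placeholders of mine. The small table only copies the four counts quoted above and is not the full frequency table.

```r
# Look up the counts of selected terms in a frequency table with
# columns Word and Freq; terms that never occur get a count of 0.
term_counts <- function(freq_df, terms) {
  counts <- freq_df$Freq[match(terms, freq_df$Word)]
  counts[is.na(counts)] <- 0
  data.frame(Word = terms, Freq = counts, stringsAsFactors = FALSE)
}

# Counts copied from Table 1 and from the paragraph above
mwld_freq <- data.frame(
  Word = c("han", "niyalma", "nikan", "manju"),
  Freq = c(4960, 4692, 1576, 131),
  stringsAsFactors = FALSE
)
looked_up <- term_counts(mwld_freq, c("han", "manju", "not_in_text"))
print(looked_up)
```

Returning zero for a missing term keeps absent words visible in the result instead of silently dropping them.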

Table 2. The frequency of words in Ping Ding Hai Kou Fang Lue

Order	Words in Manchu language	Frequency	Meaning in English
1	be	499	is
2	i	249	of
3	de	238	at
4	cooha	164	Military/army
5	jeng	111	Zheng (surname)
6	cuwan	83	(referring to Quanzhou, name of a city)
7	seme	80	(expletive)
8	hūlha	76	Bandit/pirate
9	wan	75	Wan(surname)
10	mederi	74	Ocean/sea/marine
11	tidu	64	Commander
12	fu	59	City
13	fugiyan	57	Fujian (name of a province)
14	dzungdu	54	governor
15	ni	53	of (the previous word ending with “n”)
16	wang	50	king
17	men	49	Door (referring to certain name of place with this term, where usually means port)
18	jeo	48	prefecture
19	dahame	46	because
20	sehe	45	(completed tense)

As can be seen in Table 2, once auxiliary words such as i, de, and be are set aside, this text challenges present knowledge. Why do I make this argument? The most significant reason is the frequency of wan. Wan is a surname, and during this war it referred to only one general: Wan Zhengse. Scholars have generally acknowledged Shi Lang as the most important figure in the defeat of the Zheng Regime. In this text, however, Wan Zhengse is mentioned much more frequently, because he was actually the general who organized and planned the defeat of the Zheng Regime, although all the credit later went to Shi Lang.

Since this text concentrates on the war, it makes sense that it mentions a considerable number of place names. Among the twenty most frequent words, at least five relate to place names. To be specific, jeo referred to two places, Zhangzhou and Quanzhou. Fu also referred to Zhangzhou and Quanzhou. Cuwan referred only to Quanzhou. In other words, Quanzhou seems to have been the most important place during this period. This is not surprising, for several reasons. First, Quanzhou was the most important city in southern Fujian. Second, Quanzhou was garrisoned by the Fujian navy marshal, the tidu. Third, Quanzhou was the name not only of a city but of the entire region. As a result, it can be concluded that Quanzhou was the most important area and city during this period.

Comparison of two texts

Admittedly, the two texts differ greatly in scale. The digital Man Wen Lao Dang is over 1,500 pages in a Word file, but the digital Ping Ding Hai Kou Fang Lue is only around 30 pages. Nevertheless, from a statistical point of view, the word frequencies are still worth comparing.
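One way to put texts of such different sizes on the same footing is to convert raw counts to rates per 10,000 words before setting them side by side. The sketch below is only an illustration of that idea: compare_rates is a name I made up, and the two toy tables contain invented counts, not the real counts from either text.

```r
# Convert two frequency tables (columns Word and Freq) to rates per
# 10,000 words and join them; a word missing from one text gets rate 0.
compare_rates <- function(freq_a, freq_b) {
  freq_a$Rate <- 10000 * freq_a$Freq / sum(freq_a$Freq)
  freq_b$Rate <- 10000 * freq_b$Freq / sum(freq_b$Freq)
  merged <- merge(freq_a[, c("Word", "Rate")], freq_b[, c("Word", "Rate")],
                  by = "Word", all = TRUE, suffixes = c(".a", ".b"))
  merged$Rate.a[is.na(merged$Rate.a)] <- 0
  merged$Rate.b[is.na(merged$Rate.b)] <- 0
  merged
}

# Toy tables with invented counts: one large text, one small text
large_text <- data.frame(Word = c("cooha", "han", "be"),
                         Freq = c(300, 500, 9200),
                         stringsAsFactors = FALSE)
small_text <- data.frame(Word = c("cooha", "be", "mederi"),
                         Freq = c(30, 150, 20),
                         stringsAsFactors = FALSE)
rates <- compare_rates(large_text, small_text)
print(rates)
```

In the toy data the small text has a tenth of the raw count for cooha but a rate five times higher, which is the kind of difference raw counts hide.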

As mentioned, these two texts were written in two "languages." According to the statistics, however, the Old and the standard Manchu language were actually similar, because the same auxiliary words were widely used in both. Nevertheless, when the two texts are compared, a difference in tense is easy to recognize. In Ping Ding Hai Kou Fang Lue, sehe, which marks the completed tense, appears frequently because this text was compiled after Taiwan had already been colonized by the Qing Empire. In contrast, Man Wen Lao Dang recorded current dialogues or events as officials reported them. As a result, the completed tense rarely appears in Man Wen Lao Dang.

A comparison of the two texts also suggests that the Old Manchu language was probably not immature. In fact, the grammar of the two texts is similar. For example, the verbs in Man Wen Lao Dang include the past tense (-ha, -he), imperative mood (-kini), final form (-fi), conditional form (-ci), appositive form (-ra, -re, -ro), and perfective form (-habi). These verb forms all appear in Ping Ding Hai Kou Fang Lue as well. Therefore, the difference between the Old and the standard Manchu language is probably not in grammar.
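A first, rough check for such verb endings could count the words that end in each suffix. The sketch below is not the procedure used for this post. The function count_suffixes is a hypothetical helper, and the toy vector simply reuses a few words from Tables 1 and 2.

```r
# Count how many tokens end with each suffix.
# This is a crude upper bound: nouns with the same ending are counted too.
count_suffixes <- function(words, suffixes) {
  sapply(suffixes, function(s) sum(endsWith(words, s)))
}

# Toy tokens taken from Tables 1 and 2 (illustration only)
toy_words <- c("sehe", "cooha", "dahame", "seme", "sehe")
suffix_counts <- count_suffixes(toy_words, c("ha", "he", "fi", "ci"))
print(suffix_counts)
```

Note that cooha (military/army) is matched by -ha although it is a noun, so counts of this kind would still need to be checked by reading.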

Conclusion

What can we learn from comparing the two texts? First, the grammar is the same in the Old Manchu language and the standard Manchu language. In other words, the Manchu language, whether old or standard, was a systematic and logical language. This could explain why it could be used widely across the vast territory of the Qing Empire for over three hundred years.

Second, both texts focus on military matters, as the frequency of cooha shows. To be sure, both texts discuss military events, especially Ping Ding Hai Kou Fang Lue. Man Wen Lao Dang recorded considerable military activity, but it might also be expected to describe administration or bureaucracy. Even so, it seems that military affairs were still the most significant concern for this regime in that period.

Finally, because the two texts differ in purpose and content, they emphasize different terms. In Man Wen Lao Dang, numbers are everywhere because they were used to record days, months, and years. In Ping Ding Hai Kou Fang Lue, by contrast, place names are widely recorded because geography was essential to this text.

Admittedly, comparing these two texts is not entirely appropriate. However, this reflects reality: few sources in the Manchu language have been transcribed or Romanized into digital form, although some institutions, such as Manchu Studies at Harvard University, have undertaken such work. Fortunately, these two texts have been digitized, and they represent different periods. With the surge of interest in the New Qing History over the past decade, the Manchu language has received significant emphasis. In order to use the Manchus' language to study the Manchus' history, it is necessary to use Manchu-language materials widely as primary sources for studying Qing history. Once a sufficient number of digital Manchu-language sources becomes available, scholars will be able to apply digital methods to Qing history and produce more meaningful research.

Text Mining I – The Latin Text of Vergil’s Aeneid

I began by copying the Latin text of Vergil's Aeneid from Project Gutenberg and pasting it into a word document. I had R pull the raw text from that document, and then I converted it into a dataframe. The Aeneid is divided into twelve "books" (closer in length to our "chapters"), so I had to scroll through the dataframe and note the line numbers at the beginning and end of each book, so that I could direct R to leave out the subheadings between them. The rest proceeded much as it did with Shakespeare's Sonnets last week. Fortunately, punctuation in Latin texts is a modern convention rather than an ancient one, so I could eliminate it from the text without affecting the words.

The only major problem that I ran into (and still need to figure out how to resolve) is the fact that Latin is a highly inflected language. Depending on how words are being used in a sentence, their endings change, and this complicates any attempt at counting word frequencies (e.g. haec is one of the most common words in the poem, but its other forms, hic, hoc, hanc, huius, huic, etc., were counted separately). That being said, the commonest words that showed up in my (imperfect) dataframe were conjunctions, prepositions, and particles (et, meaning "and", is by far the most common word in the Aeneid), which generally do not change form. Finally, the fact that I used a Latin text (which is what I would do when using R for my actual research) meant that it was more difficult to compare and contrast the Aeneid's word frequencies with those of Shakespeare, but they did have a few words in common: "in", "me", and "o", which have roughly the same uses in both Latin and English (the English and Latin forms are related).
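One possible way to handle the inflection problem, sketched below, is a hand-made lookup that maps each inflected form to a single headword before the counts are added up. This is an assumption about how it could be done, not something I have run on the full poem. The function collapse_forms, the lookup (which covers only the forms of hic listed above), and the toy counts are all made up for illustration.

```r
# Map inflected forms to one headword and add up their counts.
# freq_df has columns Words and Freq; lookup is a named character
# vector of the form c(form = "headword").
collapse_forms <- function(freq_df, lookup) {
  w <- as.character(freq_df$Words)
  lemma <- ifelse(w %in% names(lookup), unname(lookup[w]), w)
  grouped <- data.frame(Lemma = lemma, Freq = freq_df$Freq,
                        stringsAsFactors = FALSE)
  out <- aggregate(Freq ~ Lemma, data = grouped, FUN = sum)
  out[order(-out$Freq), ]
}

# Hand-made lookup for some forms of hic (far from complete)
hic_forms <- c(hic = "hic", haec = "hic", hoc = "hic",
               hanc = "hic", huius = "hic", huic = "hic")

# Toy frequency table with made-up counts
toy_aeneid <- data.frame(Words = c("et", "haec", "hic", "hoc", "arma"),
                         Freq = c(100, 10, 7, 3, 5),
                         stringsAsFactors = FALSE)
collapsed <- collapse_forms(toy_aeneid, hic_forms)
print(collapsed)
```

A full solution would need a lookup for every inflected word, which is why a proper Latin lemmatizer would be the better long-term answer.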

Code

library(stringr) # provides str_split() and str_replace_all()

Aeneid.lines.scan <- scan("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/Aeneid Raw Text.txt", what = "character", sep = "\n") # Scan Aeneid Raw Text

Aeneid.lines.df <- data.frame(Aeneid.lines.scan, stringsAsFactors = FALSE) # Put into a dataframe

Aeneid.lines <- Aeneid.lines.df[c(27:782, 786:1589, 1593:2310, 2314:3018, 3022:3892, 3896:4796, 4800:5616, 5620:6350, 6354:7171, 7175:8082, 8086:9000, 9004:9955),] # Eliminate non-text lines

Aeneid.string <- paste(Aeneid.lines, collapse = " ") # Join all lines into one string

Aeneid.words <- str_split(string = Aeneid.string, pattern = " ") # Split into words at spaces

Aeneid.words <- unlist(Aeneid.words)

Aeneid.freq.df <- data.frame(table(Aeneid.words)) # Raw counts before cleaning (not used below)

Aeneid.words <- Aeneid.words[which(Aeneid.words != "")] # Remove empty strings left by extra spaces

Aeneid.words.df <- data.frame(Aeneid.words)

Aeneid.words.df$lower <- tolower(Aeneid.words.df[,1])

colnames(Aeneid.words.df)[1] <- "words"
Aeneid.words.df$clean_text <- str_replace_all(Aeneid.words.df$words, "[:punct:]", "") # remove punctuation

Aeneid.words.df$cleaned <- str_replace_all(Aeneid.words.df$lower, "[:punct:]", "") # remove punctuation

Aeneid.clean.tbl.df <- data.frame(table(Aeneid.words.df$cleaned))

Aeneid.cleaned.tbl.ord.df <- Aeneid.clean.tbl.df[order(-Aeneid.clean.tbl.df$Freq),] # Most frequent words first

colnames(Aeneid.cleaned.tbl.ord.df)[1] <- "Words"

write.table(Aeneid.cleaned.tbl.ord.df, "~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/Aeneid.tbl.ord.df.txt",
sep = "\t") # Save Cleaned tabled ordered Aeneid

SONNETS.cleaned.tbl.ord.df <- read.table("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/SONNETS.tbl.ord.df.txt",
sep = "\t", stringsAsFactors = FALSE) # Load Cleaned tabled ordered Sonnets

HAMLET.cleaned.tbl.ord.df <- read.table("~/Education/Emory/Coursework/Digital Humanities Methods/RStudios Practice/HAMLET.tbl.ord.df.txt",
sep = "\t", stringsAsFactors = FALSE) # Load Cleaned tabled ordered Hamlet

intersect(Aeneid.cleaned.tbl.ord.df$Words[1:50], HAMLET.cleaned.tbl.ord.df$Words[1:50])

setdiff(Aeneid.cleaned.tbl.ord.df$Words[1:50], HAMLET.cleaned.tbl.ord.df$Words[1:50])

setdiff(HAMLET.cleaned.tbl.ord.df$Words[1:50], Aeneid.cleaned.tbl.ord.df$Words[1:50])

Aeneid.cleaned.tbl.ord.df[which(Aeneid.cleaned.tbl.ord.df$Words[1:20]
%in% HAMLET.cleaned.tbl.ord.df$Words[1:20]),]

Aeneid.cleaned.tbl.ord.df[which(!Aeneid.cleaned.tbl.ord.df$Words[1:20]
%in% HAMLET.cleaned.tbl.ord.df$Words[1:20]),]

Welcome to the 582A Text Mining Blog

We’ll be using this blog to:

Learn about blogging
Post our preliminary text mining results