Wednesday May 31, 2017
Breakfast (8:45 – 9:30 AM): Jones Room
Session Five (9:30 – 10:45 AM): Woodruff 312
From paper to image to digital text: an overview of Japanese-specific challenges (PC and OS X — Long, Des Jardin and Ravina)
A survey of the practical challenges of turning paper texts into digitized texts. Key issues will include:
- Image size and Optical Character Recognition (OCR)
- Streamlining workflow
- Repairing character misreads
In preparation for the workshop, try to install trial versions of the following OCR software:
- ABBYY FineReader
- ReadIris (main page; 10-day free trial, which counts days of use)
- e.Typist
As advanced topics, the session will also touch on the complexities of digitizing Japanese texts, such as:
- Simplified vs. traditional characters
- Hentaigana
- Rubi
Session Six (11:00 – 11:30 AM): Woodruff 312
Hands-on work on OCR
Lunch Break (11:30 AM – 12:30 PM): Jones Room
Session Seven (12:30 – 2:30 PM): Woodruff 312
Basics of Text Mining with R and RStudio (Ravina) — guide
This session will introduce text mining in the R programming language. To make the most of these sessions please do the following BEFORE the workshop:
- Install R from one of the following mirror sites (https://cran.r-project.org/mirrors.html) and RStudio (https://www.rstudio.com/products/rstudio/download/) on your laptop. Choose the free version. This is widely used software, so your local support staff should be able and willing to help.
- Before May 10, tell us your Gmail user name. NOT your password, or anything interesting like that, just your user name. We will have a web-based version of RStudio running in case there are problems on your desktop.
- Try the free “Introduction to R” or “R for the Intimidated” class at Datacamp.com, or otherwise familiarize yourself with the basics of R. We will review everything in Atlanta, but we’ll be able to go much faster (and you’ll learn much more) if the basics are familiar.
- Try to install MeCab on your desktop, with an emphasis on “try.” Local installation of MeCab can be difficult, and we’ll be posting detailed instructions shortly. You may need local support to give you “admin privileges” to change normally hidden files on your machine. If they won’t do this, or the install just doesn’t work, do not despair. Through the support of Emory and the Japan Foundation, the web-based version will be available for three years.
Topics for this first session will include:
- Importing texts
- Basic text manipulation
- Implementation of techniques such as
  - Word frequencies
  - Collocation
  - Document-term matrices
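For orientation, the three techniques listed above can be sketched in a few lines of base R. This is only an illustrative sketch, not the workshop's own material: the two-sentence English corpus and all variable names (`docs`, `tokens`, `vocab`, `dtm`) are made up for the example, and collocation is approximated simply as counts of adjacent word pairs. Japanese text would first need a morphological analyzer such as MeCab to split it into words; whitespace splitting is used here only because the toy texts are English.

```r
# Toy corpus: two short English "documents" (hypothetical, for illustration only).
docs <- c(doc1 = "the cat sat on the mat",
          doc2 = "the dog sat on the log")

# Basic text manipulation: lowercase, then split each document on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")
names(tokens) <- names(docs)

# Word frequencies across the whole corpus, most frequent first.
word_freq <- sort(table(unlist(tokens)), decreasing = TRUE)

# Collocation, approximated here as counts of adjacent word pairs (bigrams).
bigrams <- unlist(lapply(tokens, function(w) paste(head(w, -1), tail(w, -1))))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Document-term matrix: one row per document, one column per term.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))

print(word_freq)
print(bigram_freq)
print(dtm)
```

Dedicated text-mining packages offer the same operations with far more convenience; the base R version is shown only so that each step is visible.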
Session Eight (2:45 – 4:00 PM): Woodruff 312
Hands-on work in RStudio