Day Two

Wednesday May 31, 2017

Breakfast (8:45 – 9:30 AM): Jones Room

Session Five (9:30 – 10:45 AM): Woodruff 312

From paper to image to digital text: an overview of Japanese-specific challenges (PC and OS X — Long, Des Jardin and Ravina)

A survey of the practical challenges of turning paper texts into digitized texts. Key issues will include:

    • Image size and Optical Character Recognition (OCR)
    • Streamlining workflow
    • Repairing character misreads

In preparation for the workshop, try to install these trial versions of OCR software.

As advanced topics, the session will also touch on the complexities of digitizing Japanese texts, such as:

  • Simplified vs. traditional characters
  • Hentaigana
  • Rubi

Session Six (11:00 – 11:30 AM): Woodruff 312

Hands-on work with OCR

Lunch Break (11:30 AM – 12:30 PM): Jones Room

Session Seven (12:30 – 2:30 PM): Woodruff 312

Basics of Text Mining with R and RStudio (Ravina) — guide

This session will introduce text mining in the R programming language. To make the most of this session, please do the following BEFORE the workshop:

  • Install R from one of the following mirror sites (https://cran.r-project.org/mirrors.html) and RStudio (https://www.rstudio.com/products/rstudio/download/) on your laptop. Choose the free version. This is widely used software, so your local support staff should be able and willing to help.
  • Before May 10, tell us your Gmail username. NOT your password, or anything interesting like that, just your username. We will have a web-based version of RStudio running in case there are problems on your desktop.
  • Try the free “Introduction to R” or “R for the Intimidated” class at Datacamp.com, or otherwise familiarize yourself with the basics of R. We will review everything in Atlanta, but we’ll be able to go much faster (and you’ll learn much more) if the basics are familiar.
  • Try to install MeCab on your desktop, with an emphasis on “try.” Local installation of MeCab can be difficult, and we’ll be posting detailed instructions shortly. You may need local support to give you “admin privileges” to change normally hidden files on your machine. If they won’t do this, or the install just doesn’t work, do not despair. Through the support of Emory and the Japan Foundation, the web-based version will be available for three years.

Topics for this first session will include:

    • Importing texts
    • Basic text manipulation
    • Implementation of techniques such as
      • Word frequencies
      • Collocation
      • Document term matrices
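
For a rough preview of the topics listed above, here is a minimal sketch in base R. It is not the workshop's own material: the two toy English sentences, the variable names, and the whitespace tokenizer are all assumptions made for illustration. Japanese text has no spaces between words, so it would first need to be segmented with a tool such as MeCab before steps like these apply.

```r
# Toy corpus (hypothetical): two short "documents" in a named character vector.
# Real texts would be imported from files, for example with readLines().
docs <- c(
  doc1 = "the cat sat on the mat",
  doc2 = "the dog sat on the log"
)

# Basic text manipulation: lowercase, then split on whitespace.
# This naive tokenizer is an assumption that only suits space-delimited text.
tokens <- strsplit(tolower(docs), "\\s+")
names(tokens) <- names(docs)

# Word frequencies across the whole corpus.
all_words <- unlist(tokens, use.names = FALSE)
word_freq <- sort(table(all_words), decreasing = TRUE)

# Collocation (simplest form): count pairs of adjacent words in each document.
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}), use.names = FALSE)
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Document-term matrix: one row per document, one column per vocabulary word.
vocab <- sort(unique(all_words))
dtm <- t(vapply(
  tokens,
  function(w) as.integer(table(factor(w, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

print(word_freq)
print(bigram_freq)
print(dtm)
```

Packages built for text mining handle these steps more robustly; the point here is only to show what each term in the list refers to.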

Session Eight (2:45 – 4:00 PM): Woodruff 312

Hands-on work in RStudio