John Wang and I recently gave a presentation about the Connections group and the Civil War Pilot at the CNI Fall Meeting. We had a number of questions and comments from the audience at the end. We had mentioned our struggles to identify the best tools to use, and then to actually use those open source tools. Several IT people in the audience had suggestions, and among them was the idea that if we didn’t need to get deeply into ontologies and inference, back-end systems specifically designed as triplestores might be unnecessary; maybe we could accomplish our specific purpose with a LAMP stack or SOLR, or even avoid RDF altogether and use text mining and indexing.
Of course, one of the purposes of our pilot was to explore linked data and semantic web technology specifically, but these suggestions got me thinking about how important inferencing might be to us. This is partly because some of our troubles – e.g., the sudden loss of data “contexts” (the named graph part) after a power outage, a loss that wasn’t supposed to happen – probably came from the OWLIM part of our Sesame/OWLIM installation, and maybe that part, which provides advanced inferencing capability, is overkill for us right now.
Yes, we haven’t gotten heavily into incorporating ontologies and their classes and rules into our thinking – yet. But from the broad overview perch, it looks to me like the potential use of inference to generate new knowledge is the most powerful promise of linked data, beyond even “things not strings” and the “global web of knowledge”. This suspicion was recently reinforced when I watched/listened to the video of IBM’s Chris Welty’s great presentation at ESWC 2012 on the making of Watson, IBM’s question-answering, Jeopardy-winning creation. Inferencing based on linked data is a key component of parts of Watson. If you don’t have a good laundry afternoon to watch the whole thing, Welty begins talking about linked data around 40:36, and a good earlier contextual place to dive in is 33:13, “Knowledge is not the destination”.
Our Connections pilot is not “Watson” scale, but inferencing could be useful for us in practical terms, based on some familiar problems our current metadata doesn’t handle well. We want to be able to group and present content based on characteristics of the creators – for example, “African-American pamphlets”. Until recently, we didn’t have a place in our authority data structure for an author’s ethnicity to be recorded, which is really where the property of African-Americanness should be applied (not to the pamphlet). But even if we have such a field (see MARBI discussion 2012-DP05), we don’t know this characteristic for many of our authors or other creators, and if the creator is named as a corporate body, it’s not likely that such a characteristic would be applied at all. However, we might be able to tease it out from other sources – for example, if someone was named in an anthology of African-American writers, or was listed in an African-American Who’s Who, we (our algorithms) might make a reasonable inference that their ethnicity is African-American; and if all persons associated with a corporate body could be determined to be African-Americans, maybe we could make some inferences there too – something like the sketch below.
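To make that concrete, here is a minimal sketch of one such rule in Python with rdflib. Everything in it – the namespace, the listedIn and inferredEthnicity properties, and the trusted-source URIs – is a hypothetical placeholder for illustration, not our actual vocabulary or data.

```python
from rdflib import Graph, Literal, Namespace

# Hypothetical namespace and property names, purely for illustration.
EX = Namespace("http://example.org/connections/")

g = Graph()
# Sample assertions: two creators each appear in a source we trust
# as specifically African-American.
g.add((EX.author_jones, EX.listedIn, EX.whosWhoAmongAfricanAmericans))
g.add((EX.author_smith, EX.listedIn, EX.anthologyOfAfricanAmericanWriters))

# Sources whose membership we take as good evidence of ethnicity.
AA_SOURCES = {
    EX.whosWhoAmongAfricanAmericans,
    EX.anthologyOfAfricanAmericanWriters,
}

# The "rule": anyone listedIn a trusted source gets an inferred-ethnicity
# triple, kept distinct from cataloged data so the guess stays auditable.
for person, _, source in g.triples((None, EX.listedIn, None)):
    if source in AA_SOURCES:
        g.add((person, EX.inferredEthnicity, Literal("African American")))

for triple in g.triples((None, EX.inferredEthnicity, None)):
    print(triple)
```

A reasoner-backed triplestore would express this as an ontology rule rather than a hand-written loop, but the loop makes the logic – and the separate, lower-confidence predicate for the guess – easy to see.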
Similarly, if the “pamphletness” of a pamphlet isn’t actually recorded anywhere in the metadata, maybe we could infer it well enough from the dimensions and/or the number of pages, or from other sources.
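A back-of-the-envelope version of that guess might look like the following sketch; the field names and thresholds are assumptions for illustration (the 48-page ceiling echoes the UNESCO definition of a pamphlet), not settled cataloging rules.

```python
from typing import Optional

def looks_like_pamphlet(pages: Optional[int], height_cm: Optional[float]) -> bool:
    """Guess whether an item is a pamphlet from its physical description.

    The thresholds are illustrative: 48 pages echoes the UNESCO pamphlet
    definition, and 30 cm is a rough "too big to be a pamphlet" cutoff.
    """
    if pages is not None and pages > 48:
        return False
    if height_cm is not None and height_cm > 30:
        return False
    # With no disqualifying evidence, guess "pamphlet" only if we had
    # at least one real measurement to go on.
    return pages is not None or height_cm is not None

print(looks_like_pamphlet(pages=16, height_cm=22))      # True
print(looks_like_pamphlet(pages=320, height_cm=24))     # False
print(looks_like_pamphlet(pages=None, height_cm=None))  # False: no evidence
```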
It’s possible to script this kind of data winnowing, but it’s time-consuming. Linked data just seems to be designed for making these kinds of “good guesses” easier. Maybe we’ll have some (better) test cases in our next pilot!