Emory University Libraries
Connections Linked Data Pilot Project I: Civil War
Report Summary
Dec. 18, 2012
Pilot Team
Laura Akerman, leader
Tim Bryson
Kim Durante
Kyle Fenton
Bernardo Gomez
Bethany Nash
Elizabeth Russey Roke
John Wang
Sponsors
John Ellinger
Lars Meyer
Overview:
The Connections Linked Data group conducted planning sessions for our first pilot project in May and June 2012 and presented a project proposal to the sponsors for approval. The pilot focused on a subset of our resources that bore some relation to the Civil War. After some clarification, the pilot was approved; work began formally in July and concluded formally in September.
Goals, Experience, and Learnings
Goal 1: Gather tools.
Pilot experience:
Tools used:
- Sesame/Owlim linked data triplestore and management tools
- Oxygen for XML transformations.
Preliminary evaluations:
- OpenLink Virtuoso triplestore/management tools – required too many resources for this pilot.
- Callimachus, web framework/triplestore. Interesting but “beta”, and also required a lot of web programming knowledge.
- Drupal 7, CMS/web framework with linked data plugin – we were unable to create an integration with our Sesame instance in the timeframe we had to work with.
- Ontowiki, wiki/web framework with linked data capabilities – required a Virtuoso back end.
- Pubby and Djubby – linked data “publishing” tools – publishing was not in scope for this pilot, but we tried them out; we were able to use them with DBPedia but not with our Sesame instance.
- Simile Welkin – visualization tool – we were able to get the web version working with a subset of our data, but even our limited dataset was too large to load in its entirety.
- Protege – primarily an ontology management tool – not practical for developing instance data.
A search for RDF triple creation tools did not find anything simple enough to be feasible for this project, so we created N3 files by hand in Notepad.
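As a hedged illustration of the kind of lightweight scripting that could substitute for hand-editing, the Python sketch below uses the rdflib library to build a few triples and serialize them as N3. The namespace, resource identifier, title, and subject link are hypothetical placeholders for illustration, not actual pilot data.

    # A minimal sketch of scripted triple creation with rdflib, as an
    # alternative to hand-editing N3 in a text editor. All identifiers and
    # values below are hypothetical placeholders, not pilot data.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DC

    EX = Namespace("http://example.org/civilwar/")   # placeholder base URL

    g = Graph()
    g.bind("dc", DC)
    g.bind("ex", EX)

    book = EX["regimental-history-001"]              # hypothetical resource
    g.add((book, DC.title, Literal("History of an example regiment")))
    g.add((book, DC.subject, URIRef("http://dbpedia.org/resource/American_Civil_War")))

    # Serialize as N3, ready for review, validation, and upload.
    print(g.serialize(format="n3"))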
We did cursory investigations of many other tools that we did not find useful for the timeframe and scope of this project.
Learnings:
- There are many open source tools, but most are time consuming to install and configure, and no one tool achieves all the purposes we are aiming for.
- Having a test data set and some experience with a triplestore puts us in a better position to understand and compare tool features. Wider tool evaluation might lead to solutions that would produce better results, and would be essential before we could provide a production level service.
- Increasing technical staff involvement, both in terms of time and variety of skills, could help us to fully test and configure data management and web tools and to identify areas where we might need to build out functionality ourselves.
Goal 2: Transform our metadata into linked data.
Pilot experience: Starting with a modification of transformation stylesheets from ArchivesHub (for EAD) and Library of Congress (for MARCXML), we transformed sample data (selected finding aids and the Regimental Histories sub-collection of digitized books). Extensive modification of the complex ArchivesHub transform was needed to get it to work with our data.
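As a rough illustration of how such a conversion could be scripted in batch rather than run interactively in Oxygen, the Python sketch below applies an XSLT stylesheet to a folder of EAD files with lxml. The stylesheet and folder names are hypothetical placeholders; the pilot’s actual modified ArchivesHub stylesheet is not shown.

    # A hedged sketch of batch-applying an EAD-to-RDF XSLT stylesheet with
    # lxml. The file and folder names are hypothetical placeholders.
    from pathlib import Path
    from lxml import etree

    transform = etree.XSLT(etree.parse("ead2rdf-modified.xsl"))   # hypothetical stylesheet

    for ead_file in Path("finding-aids").glob("*.xml"):           # hypothetical input folder
        result = transform(etree.parse(str(ead_file)))
        out_path = ead_file.with_suffix(".rdf")
        out_path.write_text(str(result), encoding="utf-8")        # str() honors xsl:output settings
        print(f"wrote {out_path}")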
Learnings:
- We experimented with modifying the data architecture for our purposes, but realize that many more data modeling decisions need to be made before we are ready for “production level.”
- There is no one “standard” way of modeling cultural heritage data in RDF. Models other institutions have published varied greatly, and we will need to do our own analysis.
- In transforming data from different source schemas, we began to see possibilities for using common vocabularies. If we can use some common structures across different types of repositories, our data will be easier to query and work with, particularly if/when we publish it in RDF.
- Creating our own URLs for concepts and parts or aspects of our content makes sense, both to reference our unique data and to make assertions of equivalent or “close equivalent” relations to concepts defined by others. We need to develop a strategy for creating and maintaining such URLs before ramping up to production level.
- Working with the transformations led to some observations about our metadata practices; if we can achieve more consistency and better field definition and use of identifiers in our “native” metadata, it will benefit us down the road.
Goal 3: Create new linked data.
Pilot experience:
- Excel is unsatisfactory as a tool for creating triples.
- We used a simple text file process for creating N3 triples. After creating a graph for a resource, we used the online RDF Validator to prepare the RDF for upload through Sesame’s RDF Workbench (a rough sketch of this validate-and-load step appears below).
- We used simple Dublin Core metadata for this first effort.
- We incorporated links to LC vocabularies and DBPedia.
- We experimented a little with the RDA vocabulary.
In the time we had, we got only a small taste of the many vocabularies, relationships, and expanded links that creating metadata as RDF could open up for us.
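The Python sketch below is one hedged way the validate-and-load step could be scripted: it parses a hand-written N3 file with rdflib (standing in for the online RDF Validator) and then POSTs it to a Sesame repository over Sesame’s HTTP API. The host, repository name, file name, and content type are assumptions for illustration, not the pilot’s actual configuration.

    # A hedged sketch of validating a hand-written N3 file and loading it
    # into Sesame. Host, repository name, file name, and content type are
    # assumptions for illustration only.
    import requests
    from rdflib import Graph

    N3_FILE = "civilwar-resource.n3"                                  # hypothetical file
    STATEMENTS_URL = ("http://localhost:8080/openrdf-sesame/"
                      "repositories/civilwar-pilot/statements")       # hypothetical repository

    # Parsing raises an exception if the N3 is malformed (a local stand-in
    # for the online RDF Validator step).
    g = Graph()
    g.parse(N3_FILE, format="n3")
    print(f"{len(g)} triples parsed from {N3_FILE}")

    # Sesame's REST protocol accepts RDF documents POSTed to .../statements;
    # the exact content type accepted may vary by version.
    with open(N3_FILE, "rb") as fh:
        response = requests.post(STATEMENTS_URL, data=fh,
                                 headers={"Content-Type": "text/rdf+n3"})
    response.raise_for_status()
    print("uploaded")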
Learnings:
- With the right tools, it is possible to model any metadata we would need to create as linked data. Manually creating triples gives us a better picture of what a tool to create metadata as native RDF could do, and what capabilities we would need to implement a production system for metadata creation in RDF.
- Ideally, a tool would make URL capture simple, make lists of properties that we are using available for selection, and execute searches and queries against remote triplestores to retrieve URLs for linkage.
- Link creators need to see labels instead of (or in addition to) URLs in order to better visualize what statements are being made (RDF is even less “eye friendly” than XML). A sketch of a simple label lookup follows this list.
- With RDF we can choose to make any kind of statement, using any vocabulary. But we need to decide how to model our data to provide consistency. We should be open to adding different relationships, and to generating statements using different vocabularies if needed for a new purpose, by employing reasoning and ontologies. We need tools that give us the flexibility to add new types of statements without having to program them in.
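As a small, hedged illustration of the label-lookup idea above, the Python sketch below dereferences a linked data URI with rdflib and prints any rdfs:label or skos:prefLabel values it finds. Whether a given site serves RDF through content negotiation this way is an assumption to verify for each source.

    # A minimal sketch of showing labels instead of bare URLs: dereference a
    # URI and print any rdfs:label / skos:prefLabel values found. Whether a
    # given site serves RDF via content negotiation is an assumption to check.
    from rdflib import Graph, URIRef
    from rdflib.namespace import RDFS, SKOS

    def labels_for(uri):
        g = Graph()
        g.parse(uri)                   # fetch and parse whatever RDF the URI serves
        subject = URIRef(uri)
        values = list(g.objects(subject, RDFS.label)) + list(g.objects(subject, SKOS.prefLabel))
        return sorted({str(v) for v in values})

    # Example with a DBPedia resource URI; any linked data URI could be substituted.
    print(labels_for("http://dbpedia.org/resource/American_Civil_War"))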
Goal 4: Assess and harvest external data.
Pilot experience:
- Scripted retrieval of id.loc.gov identifiers for our controlled “LC” names and subject strings, linking that data to our converted EAD and MARCXML (a sketch of this kind of lookup follows this list).
- Investigated simple matching of names/subjects to DBPedia, but did not pursue it.
- Reviewed VIAF as a target, but did not find enough additional information to be useful for this pilot.
- As part of creating metadata in RDF, created links to id.loc.gov and to DBPedia for names and subjects.
- Manually searched and created links to DBPedia based on names/subjects in some finding aids.
- Reviewed additional Civil-War-related data sources, including Civil War Data 150 sources, but did not find any available in RDF and did not have time for this pilot to develop conversions.
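The scripted lookup mentioned in the first bullet could look roughly like the Python sketch below, which uses what we understand to be id.loc.gov’s known-label pattern (a label/<heading> request that redirects and carries an X-URI header). Both the endpoint behavior and the sample headings are assumptions for illustration and should be verified.

    # A hedged sketch of scripted id.loc.gov lookups for controlled headings.
    # The label/<heading> pattern and X-URI header reflect our understanding
    # of id.loc.gov's known-label service and should be verified; the sample
    # headings are placeholders, not actual pilot data.
    import urllib.parse
    import requests

    def loc_subject_uri(heading):
        url = ("http://id.loc.gov/authorities/subjects/label/"
               + urllib.parse.quote(heading))
        response = requests.head(url, allow_redirects=False)
        return response.headers.get("X-URI")   # None if the heading is not established

    for heading in ["United States--History--Civil War, 1861-1865",
                    "Soldiers--Georgia--Correspondence"]:
        uri = loc_subject_uri(heading)
        print(f"{heading} -> {uri or 'no match (e.g. an unestablished subdivision combination)'}")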
Learnings:
- Not all LC subject strings were found by the id.loc.gov script, because not all combinations with “free-floating subdivisions” are established there. We realized we need an approach to deal with this problem, perhaps establishing our own URL and making it a “subclass” of the broader vocabulary so we can still make use of id.loc.gov identifiers.
- With DBPedia, the scripted query did not find many matches because of differences in terminology, and there was concern about the lack of disambiguation.
- Searching for links by hand, particularly in DBPedia, was too time consuming to be a good option in a production setting.
- “I can find a little broader concept, or not quite the same thing as an LC heading but potentially useful, in DBPedia – should we use it?” This question came up often.
- We could see the potential value of linking to more information, which itself has additional links, in a production setting.
- For DBPedia (and other sources), a “semi-automated” process may be the best solution: a machine search would do some of the work, with a human review component (a sketch of this approach follows this list).
- We see concept mapping across knowledge domains as a crucial area, one that begs for cooperative work across cultural heritage institutions. Having the results of such work available and incorporated into our tools could simplify creation of new data in RDF.
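One hedged way the semi-automated approach might look is sketched below in Python: a SPARQL query against the public DBPedia endpoint gathers candidate matches for a heading, and a person reviews the candidates (with their abstracts as context) before any link is recorded. The search term is a placeholder; endpoint availability and result quality are assumptions rather than guarantees, and exact-label matching will miss the terminology variants noted above.

    # A hedged sketch of semi-automated DBPedia matching: the machine gathers
    # candidates, a human reviews them before a link is recorded. The search
    # term is a placeholder; endpoint availability is an assumption.
    from SPARQLWrapper import SPARQLWrapper, JSON

    def dbpedia_candidates(term, limit=5):
        sparql = SPARQLWrapper("https://dbpedia.org/sparql")
        sparql.setReturnFormat(JSON)
        # Exact English-label match; variant terminology would need fuzzier search.
        sparql.setQuery(f"""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            PREFIX dbo:  <http://dbpedia.org/ontology/>
            SELECT DISTINCT ?s ?abstract WHERE {{
              ?s rdfs:label "{term}"@en .
              OPTIONAL {{ ?s dbo:abstract ?abstract . FILTER (lang(?abstract) = "en") }}
            }} LIMIT {limit}
        """)
        rows = sparql.query().convert()["results"]["bindings"]
        return [(r["s"]["value"], r.get("abstract", {}).get("value", "")) for r in rows]

    # A person scans the candidates and decides whether any URI is a good link.
    for uri, abstract in dbpedia_candidates("American Civil War"):
        print(uri)
        print(" ", abstract[:120], "...")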
Goal 5: Web display and visualization. Configure and/or build a web interface to display the data we have, link to content, allow navigation relationships across described information resources and associated people, organizations, topics, concepts, and if time permits, create visualizations such as maps and/or timelines.
Pilot experience:
- Sesame’s OpenRDF Workbench does not provide end-user navigation or visualization features, but it did allow us to see the triples, run SPARQL queries on the data, and try to find patterns.
- Utilizing or developing web-based tools that could “talk to” our Sesame repository ended up requiring additional data publishing steps that were not envisioned for the pilot and that we couldn’t accomplish within its timeframe.
- We were able to get a taste of visualizations by using Simile Welkin on a subset of our data, showing “where the connections are” across domains, but this tool is limited.
- We did a cursory investigation of a number of visualization tools; some were no longer functional or had requirements we didn’t want to deal with in this pilot (e.g. requiring knowledge of Ruby); a few showed promise and could be explored later (Graphity, Sig.ma, lod:live).
Learnings:
- Other linked data projects have found that the functionality of available end-user applications is not mature. Our experience thus far supports that observation.
- Staff working to provide digital services could benefit from learning more about understanding and setting up simple web applications. We gained respect for the amount of time and knowledge required.
- We envisioned specific kinds of visualization (maps and timelines) but our sample data didn’t support such applications in a meaningful way — it wasn’t granular enough to map out a timeline of battles mentioned in archival collections, for example.
- We also realized that publishing our linked data on the web is necessary for many web applications; we did not attempt to address issues around stable URLs, data rights or attribution in this pilot but they will need to be addressed if we move beyond pilot stage.
Other learnings:
- The selection of a “sample” for the pilot that would “show good connections” across silos turned out to be a surprisingly difficult and time consuming task, reflecting the limitations of our current metadata search and reporting capabilities. We decided it would be easier to convert the data to triples and find connections using SPARQL queries (a sketch of such a query follows this list). We envision that access to linked data queries could be empowering for both staff and users.
- Observing how useful dates and places would have been at an “item level” for maps and timelines led to some engaged discussions. How do we provide this valuable information with current staffing? While we don’t have definitive answers, there is some interest in crowdsourcing/getting researcher involvement. Linked data’s additive/lightweight properties could make it a good fit for user-contributed metadata or annotations.
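To make the “find connections with SPARQL” point concrete, the hedged Python sketch below asks a triplestore which subject URIs are attached to more than one described resource, a rough proxy for cross-silo connections. The endpoint URL and the dc:subject modeling are assumptions about how converted data might be arranged, not the pilot’s actual model.

    # A hedged sketch of finding cross-silo connections with SPARQL: list
    # subject URIs shared by more than one described resource. The endpoint
    # URL and dc:subject modeling are assumptions, not the pilot's actual setup.
    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "http://localhost:8080/openrdf-sesame/repositories/civilwar-pilot"  # hypothetical

    QUERY = """
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?subject (COUNT(DISTINCT ?resource) AS ?resources)
    WHERE { ?resource dc:subject ?subject . }
    GROUP BY ?subject
    HAVING (COUNT(DISTINCT ?resource) > 1)
    ORDER BY DESC(?resources)
    LIMIT 20
    """

    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(QUERY)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["subject"]["value"], row["resources"]["value"])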
Recommended Next Steps
With the knowledge and skills gained from the first pilot, we recommend that our institution continue its investment of staff time in semantic technologies. The group proposes deeper learning in specific areas of linked data, considered in relation to managing the entire linked data life cycle.
1. Continuing education
It is vital that we continue to raise consciousness about linked open data, given its impacts on our profession. The group suggests restarting workshops to educate staff about this new method of diffusing and connecting knowledge. An increased understanding of this new technology by a wider base of staff will foster conversations and stimulate creative ideas for its adoption in our everyday work.
We recommend the same level of staff time allocation (2-5 hours per week) for learning. We particularly recommend that sponsors designate key technical services and DPS staff for linked data training.
2. Coordinated efforts
The group recommends coordinated learning that creates synergy and efficiency across teams and divisions.
- In the past, learning and experimentation with linked open data happened spontaneously, driven by a team or an individual (e.g. Networking the Belfast Group).
- In the next phase, a systematic approach that includes MARBL, Services, Research Commons, DPS, Content, and Cabinet subcommittees, and possibly collaboration with other libraries or even other University and external organizations, will make better use of staff resources and produce better learning results.
3. Focused learning and experimentation
The first pilot helped us acquire working knowledge of linked data operations. It also surfaced areas that need more attention and exploration. Possible tracks we could pursue are listed below; they are not ranked in order of importance, but are offered as recommendations to help sponsors set priorities:
- Continue to work on developing the sandbox and test data sets as a learning tool, adding more data and functionalities as we work toward specific goals. For example, the test data set could become useful for training on SPARQL or other aspects of linked data. It could also be a test bed for trying out Encoded Archival Context in RDF or other newer metadata approaches.
- Develop common linked data models (classes, properties) to which metadata describing common aspects of different types of content (finding aids, books, articles, web resources, etc.) can be mapped.
- Develop a strategy for publishing linked data, including consideration of data rights and attribution, domains and base URLs, pattern usage and web infrastructure.
- Investigate additional external data with which to interweave Emory unique content; extend the function of tools for automated or semi-automated mapping of entities.
- Experiment with RDFa / schema.org publishing.
- Continue to survey the open source and proprietary software tool landscape, but also develop more formal criteria to apply when evaluating other promising tools (i.e. don’t be limited to Sesame just because we tried it first).
- Develop tools that simplify linked data creation and mapping, leveraging grant opportunities and collaboration with other institutions if possible.
- Experiment with using the APIs of our core discovery tools (e.g. Primo) to bring in additional context for our resources using linked data, or to connect users to linked data visualizations or datasets.
We can pick one or several from the list depending on our resource allocation. Note that creating a real “production ready” user application would definitely require additional resources (see the “Future Directions” section).
4. Organizing the effort
We acknowledge the contribution of everyone on this first pilot. Because there were multiple goals and they were ambitious (more ambitious than we realized!), we were limited in what we could accomplish on any one of them in the established time frame. This was a good approach for a first start. Going forward, we suggest:
- more targeted projects with smaller goals,
- multiple project leads, or individual efforts,
- drawing on pilot group knowledge and experience, but also including others who can learn and contribute.
Continuing to have a central interest group and some coordination of education, communication, pilots and projects could help us maximize impact of our efforts going forward.
5. Future Directions
The pilot team finds that more work is needed to research and assess new technologies before we can recommend specific implementations of linked data for the libraries, including the technical components needed.
However, we want our experimentation to have a goal in sight beyond hands-on learning. Below are directions we think would be worthwhile to aim for; they would have some value in the near term and would better position the libraries to understand and further implement linked data:
- Use Core Services APIs + linked data to bring in, or link out to, more context for our users (this would eventually include whatever system we end up with for digital asset management/preservation/delivery). The “Primo and the Semantic Web” presentation from IGELU 2012 (http://igelu.org/wp-content/uploads/2012/09/DataEnrichment_Dominique_Ritze.ppt) gives some ideas of how this could be done.
- Put together suites of tools for library staff or researchers to collect, convert, access, display, and navigate linked data, including tools to harvest/convert our own metadata from its various sources.
- Find or start collaborations with other institutions both locally and nationally. Some ideas: locally, include them in data mashups e.g. the Civil War as an RDF-enabled web application. At the national/international level, projects could include vocabulary linking e.g. DBPedia=>LC vocabularies; data modeling, e.g. shared metadata maps/ontologies; or shared sandboxes for learning and development.
- Enhance web development skills – We appreciated having our sandbox. While we were able to do some things, when confronted with a need for HTML5, CSS, Javascript, using APIs, etc. most of the group didn’t feel comfortable tackling this within a short timeframe. We think the library should support including more staff with web development skills in projects such as this. We appreciate the professional competence of systems staff, but believe the library should encourage learning in this area by the many staff members who are working to support web services in different roles. The web has become the primary tool by which we provide discovery and other services; more understanding of this “core technology” by everyone is needed.
Conclusion
The pilot project team appreciates the opportunity to learn, to test out our ideas and gain new skills, and to share this experience with the library. We are even more interested in and convinced of the value of linked data now than we were when we began. We look forward to continued learning and to involving more of the library staff in the conversation about what linked data means for us and where it can lead us.