ALA 2012: Linked Data & Next Generation Catalogs Session — Part 1

“Linked Data & Next Generation Catalogs”
ALA Annual Conference 2012
8am on Saturday, June 23
The speakers, in order, were Corey Harper, Phil Schreur, Ted Fons, Yvette Diven and Jennifer Bowen.
Presentation slides at: ALA Connect: Next Generation Catalog Interest Group

Part I: Corey Harper and Phil Schreur

“Linked Library Data: Publish, Enrich, Relate and Un-Silo”
Corey Harper (On Twitter as @chrpr)

Harper began his talk with a short overview of Linked Data. Linked Data is all about publishing data for re-use. Harper referenced a 2009 TED Talk by Tim Berners-Lee on his vision of the Semantic Web: “Tim Berners-Lee on the next Web”. [Incidentally, there is a follow-up TED Talk, “The year open data went worldwide,” filmed in 2010, in which Berners-Lee looks at some examples of how linked data can be used.]

Harper summarized Berners-Lee’s famous “Design Note” that defines Linked Data:
use URIs as names
use HTTP URIs
provide information at the URI
include links to other URIs
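
[To make the four principles concrete, here is a minimal sketch in Python using the rdflib library; this is my own illustration, not something Harper showed. It dereferences a public DBpedia URI, so it assumes network access and that the endpoint serves RDF.]

```python
# Minimal sketch of the four Linked Data principles, using rdflib.
# The DBpedia URI is used purely as a convenient public example.
from rdflib import Graph, URIRef

# Principles 1 and 2: an HTTP URI that names a thing.
poe = URIRef("http://dbpedia.org/resource/Edgar_Allan_Poe")

# Principle 3: dereferencing the URI provides information about the thing.
g = Graph()
g.parse(poe)  # fetches the RDF served at that address

# Principle 4: the returned statements include links to other URIs,
# which a client can follow to discover more things.
for s, p, o in g.triples((poe, None, None)):
    if isinstance(o, URIRef):
        print(p, o)
```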

Next, Harper discussed metadata as a graph. Things and relationships are named with URIs, which makes it easy to integrate data across sources. Graphs can be merged to create new explicit relationships between previously unconnected pieces of data.

Library data, however, is primarily string-based, so it cannot easily be connected to other data sources. Slides 5-7 show the difference between a basic text-string description of a “thing” and the way that using URIs can allow that “thing” to link to other related “things.”
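
[As a sketch of what graph merging looks like in practice (again my illustration, not from the slides): when two sources name the same thing with the same URI, their graphs combine mechanically, whereas string-only descriptions cannot be joined this reliably. Every example.org name below is invented.]

```python
# Two graphs that share a URI for a thing merge into one graph,
# making cross-source relationships explicit. Names are invented.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")
book = URIRef("http://example.org/book/moby-dick")  # the shared identifier

catalog = Graph()  # one source's statements about the book
catalog.add((book, EX.title, Literal("Moby-Dick")))

annotations = Graph()  # a second source's statements about the same book
annotations.add((book, EX.subject, URIRef("http://example.org/topic/whaling")))

merged = catalog + annotations  # rdflib graphs support set union
for triple in merged:
    print(triple)
```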

Finally, Harper quickly went over some of the terminology used in describing Linked Data.
A resource is a thing
A class is an abstraction of a type of thing
An individual is an instance of a class
A property is an attribute of an individual
A statement or triple consists of the combination of resource, property, and value.
A graph is a visual representation of statements
An ontology is a domain-specific collection of classes and properties
Nodes are the subjects and objects in a graph
Arcs are the predicates in a graph
Domains and ranges provide constraints on nodes
The domain defines what things can be subjects
The range defines what things (or strings) can be objects
Literals are values that are strings rather than URIs
Named graphs are graphs that are themselves identified by a URI, so that a whole graph can be treated as a node and described (for example, for provenance)
Standards for named graphs are still under development
[This is not on a slide; did I make it up? Provenance allows you to know where the data comes from.]
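
[To tie the terminology together, here is a small sketch of my own in Python with rdflib; the example.org names are invented for illustration.]

```python
# The terminology in practice: classes, individuals, properties,
# literals, statements, and domain/range constraints.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

g.add((EX.Author, RDF.type, RDFS.Class))  # a class: a type of thing
g.add((EX.poe, RDF.type, EX.Author))      # an individual: an instance of the class

# A statement (triple): the subject and object are nodes; the predicate is the arc.
g.add((EX.poe, EX.wrote, EX.TheRaven))

# A literal: a string value rather than a URI.
g.add((EX.poe, EX.name, Literal("Edgar Allan Poe")))

# Domain and range constrain what can appear as subject and object.
g.add((EX.wrote, RDFS.domain, EX.Author))
g.add((EX.wrote, RDFS.range, EX.Work))

print(g.serialize(format="turtle"))
```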

Harper discussed the growth of the Linked Open Data cloud. Slides 11 and 12 show the growth from May 2007 to September 2011. Harper noted that Freebase has become almost as interlinked as DBpedia. Though Freebase is now owned by Google, anyone can add anything to it. For example, VIAF and id.loc.gov have added their data into it, which means that the data has Freebase URIs [in addition to the URIs generated by the respective projects themselves].

Projects that are using Linked Data:
Google Refine allows the input of unstructured data and reconciles it against Freebase to create new links (slides 14 and 15).

RelFinder (slide 16) finds linkages between data sets. [The RelFinder website states that “it extracts and visualizes relationships between given objects.”]

Next, Harper discussed Linked Open Data – Libraries, Archives & Museums (LOD-LAM). Two examples of these types of projects are Europeana and LOCAH: Linked Open Copac & Archives Hub.

Europeana describes 15 million European cultural items and includes data from the British Library, the Rijksmuseum and the Louvre. The data model builds on OAI-ORE.

LOCAH combines bibliographic and archival data. Using finding aids from over 200 institutions in the UK, they modeled EAD as RDF. [LOCAH has become Linking Lives.]

Harper also mentioned the Social Networks and Archival Context Project (SNAC), Linking Lives, and Viewshare.

The Linked Ancient World Data Institute (LAWDI) is actively modeling ancient place names as linked data and is very excited to get librarians involved.

The Library Linked Data Incubator Group Final Report includes a related document, Library Linked Data Incubator Group: Use Cases, which details over 50 use cases spread across a range of categories, including bibliographic data, authority data, archives and heterogeneous metadata, and citations (slides 31-34).

The Linked Open Data Special Interest Working Group of the International Group of Ex Libris Users (IGeLU) is working on incorporating Linked Open Data into its architecture. A public draft of their work is due in two weeks.

Linked Data is a distributed information ecosystem: it focuses on identification rather than description. [This is an important point to keep in mind. I think it is another paradigm shift like those discussed by Phil Schreur below.] All data retains its context and enriches the user experience. Library bibliographic data, on the other hand, has been removed from its context. A bibliographic record cannot, for example, tell a user anything detailed about the author’s life. Linked Data allows links to be made to outside sources that do provide such data.
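
[A sketch of this idea, mine rather than Harper’s: the catalog statement only identifies the author with a URI, and a client can follow that link to an outside source for biographical detail. DBpedia is used as the outside source purely as an example, and its property names may vary; the example.org names are invented.]

```python
# Identification vs. description: the record links to an author URI,
# and the description is fetched from the other end of the link.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import DCTERMS

record = URIRef("http://example.org/record/123")  # hypothetical catalog record
author = URIRef("http://dbpedia.org/resource/Edgar_Allan_Poe")

catalog = Graph()
catalog.add((record, DCTERMS.creator, author))  # identification only

# Follow the link for description (assumes network access and that the
# endpoint serves RDF).
bio = Graph()
bio.parse(author)

DBO = Namespace("http://dbpedia.org/ontology/")
for date in bio.objects(author, DBO.birthDate):
    print("Born:", date)
```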

Finally, Harper discussed facets and issues of Linked Data that are being actively worked on: provenance; licensing; best practices, modeling, and infrastructure; and work at DCMI and W3C.

Harper noted that the FRBR model is finding its way into other domains. [This seems to me to be a validation of the model.]

————————————————————————————-

“Shatter the Catalog, Free the Data, Link the Pieces”

Phil Schreur (LinkedIn profile)

Phil Schreur spoke about the stressors on our current catalogs and the Linked Data solution.

Stressors.
Schreur began his discussion of stressors with Google. Google has taught users to expect an all-inclusive, more-is-better approach to searching. As libraries try to adopt/adapt to this type of searching, the catalog starts to lose all local character. Carefully curated items become lost in the midst of giant dumps of e-book bibliographic data [and e-serials data as well, I would guess, since those are all package deals]. More bibliographic data of questionable quality is ingested from a variety of sources. [I have a note about supplementary data as well, but not what his point about it was.]

A second stressor on current library catalogs is the bibliographic records themselves. All are subject to “local practice” guidelines, which means catalogers are re-cataloging the same item over and over again at different institutions. Records can be missing elements and can contain other mistakes as well. [I think this is the point, though I did not include it in my notes: these problems all get fixed at each institution that finds them, with no way to propagate those changes out to other places.]

Next, Schreur discussed the fact that most library data is stored in relational databases in closed systems. The catalog needs MARC records for the discovery process to work, but the cost of cataloging an entire MARC record can be prohibitive. This results in a backlog of items that have no record in the database.

Finally, Schreur noted that not only is the data siloed within individual institutions, but the institutions themselves do not have consistent access to their own materials. In an academic environment, for example, data about related resources such as course descriptions and reading lists is not linked to the bibliographic data. Such materials are too expensive to catalog, because the only way to include them would be in cataloger-created MARC records.

Schreur stated that Linked Data is the answer to this problem. The Linked Data in an academic environment like Stanford could interact with the Linked Open Data on the web. This takes the data to where people are searching for it. It provides better discoverability and opportunities for innovation. It allows for continuous improvement of data without having to exchange MARC records. MARC data that machines cannot understand becomes machine-actionable data that is directly accessible. It breaks down the silos and allows for unanticipated opportunities.

Moving to Linked Data involves several paradigm shifts. First, MARC bibliographic records are often considered a commodity and can have many restrictions on them; Linked Data, on the other hand, is focused on the free and open exchange of data. Second, from entire MARC bibliographic records, we shift to simple RDF statements. Third, data [i.e. metadata?] will be captured at the point of creation: the RDF triples will be produced as part of the creation process, and they will be heterogeneous. Finally, instead of the problem of limited data that we have now, we will have the problem of an overwhelming amount of data; the triple stores will have to be managed.
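
[A sketch of the second shift, my illustration rather than Schreur’s: instead of exchanging a whole record as one opaque unit, the same information becomes independent statements that can be shared and corrected one at a time. The example.org names are invented.]

```python
# Record-as-commodity vs. statement-based data. Names are invented.
from rdflib import Graph, Literal, Namespace, URIRef

# Roughly how MARC bundles everything into one record.
marc_like = {"100": "Poe, Edgar Allan, 1809-1849.",
             "245": "The raven /"}

# The same facts as standalone RDF statements about a URI-named work.
EX = Namespace("http://example.org/")
work = URIRef("http://example.org/work/the-raven")

g = Graph()
g.add((work, EX.title, Literal("The raven")))
g.add((work, EX.creator, URIRef("http://example.org/person/poe")))

print(g.serialize(format="turtle"))
```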

Finally, Schreur provided four examples of projects implementing Linked Data:
Mashpoint
Bibliothèque nationale de France
LinkSailor
Google Knowledge Graph

Mashpoint
Allows the user to take one data set and apply it to another data set. [It sounds a lot like RelFinder and Google Refine.]

Bibliothèque nationale de France
This project provides good documentation for its data [data provenance!]. The search results for Edgar Allan Poe include not only bibliographic resources but also related materials that can be found in the Archives & Manuscripts department and links to resources that are outside of the collection entirely, like those in the Europeana project.

LinkSailor
This is a project started by Talis. LinkSailor allows the user to follow links from one resource to another: for example, from a map of Heathrow Airport to other airports.

Google Knowledge Graph
The Knowledge Graph collects links to a resource rather than searching text strings for things related to a resource. It is in use by Google now; it appears next to the more traditional text-string search results.

Questions/Discussion for Harper and Schreur

Q: About having authoritative data: for example, scholars are worried about the ESTC being “ruined” by crowdsourcing the data.
PS: It is still important to have authoritative data; there will be a push-pull between crowdsourcing and the idea of controlled authoritative data.
CH: The crowd is important; sometimes it is that random person on the Internet who has the necessary knowledge and expertise in the topic. The real value is in how the data gets used on the Web.

Q: If we open the data, where does the value-added cataloging go?
[This is the idea: if everyone thinks they can get the data from somewhere else, they are not going to pay to have their own quality data created. But it seems to me that the answer is the same as it is now: if you want quality/authoritative data, you have to pay someone to create it. If you are just going to copy catalog, then you take someone else’s data. This is a problem now. The technology of Linked Data is not going to fix what is, at its core, a cultural problem.]
This question was postponed to the panel discussion at the end of the session.