“Linked Data & Next Generation Catalogs”
8am on Saturday, June 23, 2012
The speakers, in order, were Corey Harper, Phil Schreur, Ted Fons, Yvette Diven and Jennifer Bowen.
Presentation slides at: ALA Connect: Next Generation Catalog Interest Group
Part II: Ted Fons, Yvette Diven and Jennifer Bowen
“A View of OCLC’s Strategy: Linked Data”
Ted Fons
Ted Fons discussed how OCLC is implementing Linked Data technologies.
OCLC is implementing schema.org markup on its webpages and has developed an extension beyond the basic bibliographic vocabulary provided by schema.org. OCLC began this effort to improve SEO for WorldCat and libraries, to strengthen the WorldShare Platform with a tangible new offering, to gain a position of authority in modeling post-MARC data, and to promote internal efficiency (slide 5). The ultimate objective is to position OCLC as a leader in the library community, at the forefront of Linked Data technology (slide 6).
How We Got to Linked Data
Fons gave an overview of schema.org and of the timeline that led to OCLC’s implementation of schema.org markup (slide 8).
Using schema.org markup allows OCLC to assign URIs to “library things” in a way that facilitates re-use of the data by entities outside the world of libraries, such as search engines.
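As a rough sketch of what this looks like in practice (the OCLC number, title, and author below are invented placeholders, not real WorldCat data, and this is not OCLC’s actual pipeline), here is how a “library thing” might be given a URI and described with schema.org terms using Python’s rdflib:

```python
# Minimal sketch: give a "library thing" a URI and describe it with
# schema.org terms. All values below are invented placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")

g = Graph()
g.bind("schema", SCHEMA)

# A stable URI is what lets search engines and other non-library
# consumers re-use the data.
book = URIRef("http://www.worldcat.org/oclc/000000000")  # placeholder

g.add((book, RDF.type, SCHEMA.Book))
g.add((book, SCHEMA.name, Literal("Example Title")))
g.add((book, SCHEMA.author, Literal("Example Author")))

print(g.serialize(format="turtle"))
```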
Future of Cataloging
Fons characterized traditional cataloging as being all about the local catalog: it includes description and limited use of authority files, and is about “locating the resource in the local catalog context” (slide 16). The future of cataloging builds on that, expanding the use of authority files beyond library-specific ones and locating the resource “within a network of useful links” (slide 17). Cataloging will still be core to the library, but it becomes an even more important source of authoritative data within the context of the web.
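To illustrate “a network of useful links” (again a sketch; every identifier below is made up), a record’s creator can point outward at shared authority URIs such as VIAF and id.loc.gov rather than sitting as a bare text heading in the local catalog:

```python
# Sketch: link a local record's creator outward to shared authority
# files. All URIs are illustrative placeholders.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL

SCHEMA = Namespace("http://schema.org/")

g = Graph()
book = URIRef("http://example.org/catalog/record/123")    # local record
author = URIRef("http://example.org/catalog/person/456")  # local identity

g.add((book, SCHEMA.author, author))

# Linking beyond library-specific authority files places the resource
# "within a network of useful links" instead of only the local catalog.
g.add((author, OWL.sameAs, URIRef("http://viaf.org/viaf/00000000")))
g.add((author, OWL.sameAs,
       URIRef("http://id.loc.gov/authorities/names/n00000000")))
```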
Fons concluded his presentation by noting that OCLC is positioning itself on the forefront of Linked Data research, that integrating with the rest of the Linked Data world will allow libraries to become major hubs and that the use of schema.org represents a major step for OCLC members.
———————————–
“Stories and Lessons from the Road to Linked Data”
Yvette Diven
Yvette Diven discussed Serials Solutions’ foray into Linked Data.
Diven began with an overview of Serials Solutions (SS). SS was founded specifically to provide authoritative metadata for libraries. At its core is a “centrally-provisioned Knowledgebase” (slide 2, notes). The knowledgebase started as an industry-standard flat file and has evolved from there.
The metadata imported into the repository comes in various formats including MARC, DC, XML, ONIX, text, etc. It is cleaned up and integrated into the services provided by SS.
Linked Data
Starting with this knowledgebase, they have chosen FRBR as the conceptual framework and RDA as the schema. The data is loaded into a separate relational database that is not a MARC database. MARC data can come in and go out, but it is not stored internally as MARC. This provides more flexibility for importing and exporting in multiple formats.
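As a toy illustration of that “MARC in, neutral model inside” pattern (using the pymarc library; the field choices and the InternalRecord shape are invented for the example and are not Serials Solutions’ actual model):

```python
# Toy sketch: ingest MARC but store a format-neutral internal record,
# so the same data can later be exported as MARC, DC, RDF, etc.
# The field mapping is deliberately minimal and invented.
from dataclasses import dataclass
from pymarc import MARCReader

@dataclass
class InternalRecord:
    title: str
    creator: str

def marc_to_internal(record) -> InternalRecord:
    """Map a pymarc Record onto the neutral internal shape."""
    titles = record.get_fields("245")
    names = record.get_fields("100")
    title = (titles[0].get_subfields("a") or [""])[0] if titles else ""
    creator = (names[0].get_subfields("a") or [""])[0] if names else ""
    return InternalRecord(title=title, creator=creator)

with open("records.mrc", "rb") as f:  # hypothetical input file
    for marc_record in MARCReader(f):
        print(marc_to_internal(marc_record))
```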
Serials Solutions’ new service, Intota, will be based on this new data framework and should be able to support the entire life cycle of the library’s collection as well as provide Linked Data benefits to service users.
Diven stated that one lesson they learned in moving toward Linked Data was to start simple, for example with controlled vocabularies.
Future pilot projects include publishing knowledgebase data as RDF/RDFa triples and including open access journal data and data from Ulrich’s.
————————————-
“‘Linked-Data-Ready’ Software for Libraries: The eXtensible Catalog (XC)”
Jennifer Bowen
Jennifer Bowen discussed how XC can facilitate the move to Linked Data.
Bowen noted that there are many questions surrounding Linked Data:
- Why should we do it?
- Who should do it?
- How can we get started?
- What are the outcomes?
Bowen asked if Linked Data could help us provide what our users need and if there are new roles for libraries. As part of the process of building XC, they performed user studies that might help answer these questions. The results of that research are available in the book Scholarly Practice, Participatory Design and the eXtensible Catalog.
The user studies showed that, first, scholars want:
- to read everything on the topic that they are researching
- to be in the middle of everything they need, with it all organized so it is findable and useable
- their research to be findable and usable by others
- to connect with people whose work is interesting and useful to them
The user studies also found that scholars don’t care about the technology, as long as it works.
Second, the studies showed a shift in how people seek and use information. Library-based systems (website, catalogs, etc.) are being bypassed, not only in favor of Google, et al., but also in favor of “tailored desktop, mobile, and web applications” (slide 10). Furthermore, even if they use library-provided tools to identify resources, scholars go outside the library domain to analyze their information.
The solution to this is to make library resources discoverable where users are looking for them: search engines, mobile apps and social media. Bowen noted that libraries could build their own tools and applications, but if they simply concentrate on making the data usable, someone else will happily build these types of tools.
Who should create Linked Data? Why create Linked Data?
Bowen stated that any/all libraries should be working on Linked Data. Libraries need to change to a new data paradigm and they need hands-on experience with Linked Data both to understand its potential and to develop best practices. Linked Data provides an opportunity to showcase unique local collections and serve local interests. Libraries also need to get started with Linked Data so that they can push their vendors to start thinking about Linked Data now and not at some amorphous point in the future. Finally, Linked Data will allow libraries to create or take advantage of new opportunities and explore new roles.
How can we get started?
To get to Linked Data, we need tools to convert legacy data into Linked Data. Bowen discussed how XC might be one such tool. XC is open source and provides both a discovery system and a set of tools to transform and manage metadata. This offers a risk-free platform for experimenting with metadata transformation (and potentially with Linked Data). It allows bulk conversion of existing library metadata and can keep the converted data synchronized with existing systems.
While XC was not built with Linked Data in mind, it could potentially be used to make Linked Data available to developers. Bowen envisions the User Interface (Drupal Toolkit) and the Metadata Services Toolkit (MST) as the main components for creating the RDF. The user interface could generate RDFa (which is built into Drupal 7), while the bulk metadata conversion processes could output RDF/XML or feed a SPARQL endpoint.
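A small sketch of those two output paths using rdflib (the graph content is invented; this is not XC’s actual conversion code):

```python
# Sketch of the two output paths Bowen described: bulk RDF/XML
# serialization and SPARQL access. The data is invented.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")

g = Graph()
work = URIRef("http://example.org/xc/work/1")
g.add((work, RDF.type, SCHEMA.CreativeWork))
g.add((work, SCHEMA.name, Literal("Example Work")))

# Path 1: bulk conversion written out as RDF/XML for developers.
g.serialize(destination="works.rdf", format="xml")

# Path 2: the same graph queried with SPARQL (in-process here; a real
# deployment would sit behind an HTTP SPARQL endpoint).
for row in g.query(
        "SELECT ?name WHERE { ?w <http://schema.org/name> ?name }"):
    print(row.name)
```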
The underlying schema for XC is based on elements drawn from registered element sets. The elements themselves already have URIs assigned to them. Some elements come from RDA, some from DC and some are XC-created (and registered) elements.
As an interim step in the data transformation process, XC converts the data (MARCXML) into FRBR entities: Work, Expression and Manifestation. This may actually produce more meaningful Linked Data in the end. The user research showed that users want to see the relationships between the resources, between the resources and people, between people and other people and between a search term and the resources actually retrieved. Relationships are what FRBR and FRAD are all about.
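As a sketch of what that interim FRBR step might yield as triples (the entity URIs are invented, and the public FRBR Core vocabulary stands in here for XC’s own registered elements):

```python
# Sketch: one record split into linked FRBR entities. The explicit
# relationships between entities carry the meaning users asked for.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

FRBR = Namespace("http://purl.org/vocab/frbr/core#")  # public FRBR vocab

g = Graph()
work = URIRef("http://example.org/xc/work/1")             # invented URIs
expression = URIRef("http://example.org/xc/expression/1")
manifestation = URIRef("http://example.org/xc/manifestation/1")

g.add((work, RDF.type, FRBR.Work))
g.add((expression, RDF.type, FRBR.Expression))
g.add((manifestation, RDF.type, FRBR.Manifestation))

g.add((work, FRBR.realization, expression))          # Work -> Expression
g.add((expression, FRBR.embodiment, manifestation))  # Expression -> Manifestation
```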
Finally, using XC as a template, Bowen looked at a couple of specific ways that Linked Data could address user needs.
Scholars want to read and access everything on a topic. In XC, a custom interface can be easily set up for a particular group of users or a particular need without doing any custom programming.
Scholars want to connect with others whose work interests them. This is a place where libraries have the opportunity to develop new technologies and take on new roles. Libraries could create tools that allow scholars to make Linked Data statements as part of the scholarly process. This could involve things like creating/managing vocabularies, augmenting metadata about a resource, making their own work more discoverable or understandable, or documenting the relationships between resources, people, etc.
To sum up, Bowen noted that they are currently looking for funding and partners for Linked Data development in XC.
——————–
Panel Discussion / Q&A
Query postponed from earlier in session: Who is going to create this data?
PS: The model will not change rapidly. Our Cataloging dept. has been renamed the Metadata dept. with no decrease in work.
TF: The tools will change, but the work is still the same and still has to be done.
Query: How do we ensure that we have the SEO rankings that we want?
TF: Ranking algorithms are all about linkages, more collaboration, more aggregation. More linkages means more clicking and that pushes us up where we need to be [in the search results].
??: If you re-imagine the authority file with the context of the person’s life, that is what the graphs are looking for.
CH: The value of cataloging is in providing authorities and context. Users will not rely on libraries for the easy-to-access, mass-market materials but for local special collections, rare items, archives, specialized data that is not available from commercial enterprises.
PS: Google likes library data and it will show up on the first page [of search results] now when it is available.
Query: What about the contract terms that state that OCLC owns the information that libraries contribute?
TF: The “agreement is to distribute data under certain guidelines.” There is a new license that members like, new guidelines. Libraries can open data as long as it is attributed to OCLC.
CH: The new license is good. It was the consortium that wrote the original license and basically effed it up.
Query: What about all these different schemas? Is that a problem?
TF: It is still early days and things are still in development. It is not something to really worry about. A bigger problem might be durable URIs. Things will evolve.
YD: Right now it is all really modeling and not hard publishing [?not sure I noted this down correctly?]. Controlled vocabularies are a bigger challenge.
JB: In the case of XC, it has its own internal schema that adheres to RDF and can output triples. [Implying that other schemas are translated into the system’s internal schema?] So the schema itself is not so important, and interoperability is probably possible.
CH: If you create linkages at the level of the vocabularies, they all become interoperable.
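[As a sketch of this point, with invented concept URIs: a single SKOS mapping statement is enough to let data described with either vocabulary be queried together.]

```python
# Sketch: vocabulary-level linking. One skos:exactMatch statement makes
# records indexed with either concept retrievable together. URIs invented.
from rdflib import Graph, URIRef
from rdflib.namespace import SKOS

g = Graph()
lcsh_concept = URIRef("http://id.loc.gov/authorities/subjects/sh00000000")
local_concept = URIRef("http://example.org/vocab/concept/railroads")

g.add((lcsh_concept, SKOS.exactMatch, local_concept))
```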
TF: Within a community of practice, there are opportunities to agree on URIs to be used: how to represent works, expressions, etc. Once these have been agreed upon, it will be more efficient and there will be less duplication of effort.
Query: What about the hard mappings? Author and title are easy.
PS: It is not an easy or perfect exercise. Implicit connections are made by humans in the MARC records and there is going to be some loss of understanding when moving over to a new infrastructure. However, the data should be self-improving as users re-connect the data points explicitly.
CH: We don’t lose data, just the [implicit] connections between them.
Query: How do we know users will contribute?
Examples: Tagasauris and Mechanical Turk
JB: XC platform allows for playing with data to see what works.
Query: What about making corrections to records? For example, Proquest sends records and the library makes corrections, then Proquest resends the records and the corrections are overwritten. The corrections never go back to Proquest.
PS: Nothing about an LD statement makes it inherently true. There will still be errors.
[Query or further discussion of previous query?]: With machine-actionable data, tools can be used to find the errors instead of humans. Freebase has confidence ratings, some data gets sent to humans for intervention.