Cataloging – Cataloging is not Sexy

August 6, 2015August 6, 2015

Digital Music Organization and Archiving

When I first considered writing this up, my first thought was that it wasn’t related to cataloging. But really, like many of my projects, it actually is. I am all about the information organization and having good metadata no matter what I am doing. So here it is.

Recently, I ran out of space on my ancient iMac’s 500 GB hard drive. I have thinking for quite awhile now about shifting my media off the internal hard drive and onto an external drive array of some kind; specifically so I could add more media. Fortunately, we had a Drobo sitting around begging to be used. Now I have 2 TBs of space that is expandable to 8 TBs! All that space! Let the ripping, er, archiving commence!

First question: What format should I use?

Being firmly planted in the Apple camp, of course I use iTunes as my media player, so I need to have my music in a format that it can use. I have ripped music to mp3s previously and I know that there is a newer format that iTunes uses as the default for purchased music. But I also vaguely know that these are not “perfect” versions, that they are compressed to save space, etc. So what about an “archival” version? Is is possible to rip an exact duplicate of the CD version of the music?

I discovered that there are basically three types of music files: uncompressed; lossless, compressed; and lossy, compressed. The uncompressed files take up huge amounts of space but are exact replicas of the original CD. I guess this would be the ultimate “preservation” file, assuming I had the disk space for it. A lossless, compressed file also exactly replicates the original CD though, so I think this is probably a better choice for me. Someone else might have some sort of specific reason to choose uncompressed. The lossy, compressed file seems to be a good choice for the “access” version of the music.

My research revealed two main types of uncompressed formats: WAVE and AIFF. WAVE (Waveform Audio File Format) is used primarily by Windows computers and AIFF (Audio Interchange File Format) by Macs. I didn’t investigate these formats any further as my focus was on the lossless, compressed file types.

The Wikipedia article “Audio file format” discusses several lossless, compressed formats and further searches narrowed my interest down to two: FLAC and ALAC. FLAC stands for “Free Lossless Audio Codec” and ALAC stands for “Apple Lossless Audio Codec”. I’ll cover these in more detail below.

Finally, there are many lossy, compressed formats, but again I quickly narrowed my interest down to two: mp3 and AAC. These also I did not pursue very far in my research. Despite being the defacto standard for digital music, it turns out that the mp3 format is actually encumbered with a jumble of licensing issues. However, the adoption rate for this file format is high enough that, not being a developer, the messy licensing would probably never be an issue for me. AAC is the format that iTunes uses for music downloaded through its store. Additionally, I learned that iTunes saves AAC DRM’d files as m4p and AAC unprotected files as m4a; and as an iTunes Match subscriber, I can get my (handful) of m4p songs updated to m4a.

As an aside to this discussion about “which format?”, I learned that file format (i.e. container) and codec are not necessarily the same thing. Often a codec uses a particular file format but that is not always true. For example, FLAC can be encoded in its native container (with a file extension of .flac) or in the Ogg container (with file extension .oga).

FLAC is open-source and supported by a range of software and hardware. Everyone on the internet, (by which I mean the posters in the audiophile forums that my search results turned up[1]), seems to have generally positive things to say about FLAC. Alas, it is not supported by iTunes. But that is okay; at this point in my research, my thought was that I could use this as my “preservation” format and transcode the files into a lossy “access” format (for the smaller file size) that iTunes can use.

The objections around ALAC seem to center mainly on it being proprietary (and/or because it’s Apple) and the fact that it won’t transcode 24 bit files back out of ALAC again[2]. It turns out that while the codec was originally proprietary, Apple released it as open source in 2011. So that is now a non-issue. As for the 24 bit transcoding problem; according to Wikipedia, audio CDs are 16 bit, while audio DVDs can be up to 24 bit. I am only ripping CDs at this point, so the 24 bit transcoding problem seems like a non-issue for me. Obviously, if I should want to rip audio DVDs in the future, I would need to revisit this issue and do some more research (DRM is also an issue for audio DVDs). For 16 bit files, the audiophile forums discussing the topic assert that there is no difference in the quality of files produced by ALAC versus FLAC.

In sum, I found that FLAC is supported in more places than ALAC, but ALAC is supported by Apple. Using ALAC would also allow me to rip the music into just the one lossless format, instead of needing separate preservation and access formats.

Question: Would using ALAC lock me into iTunes as my only alternative?

More research revealed that there are several tools for converting between ALAC and FLAC. ffmpeg was the most commonly mentioned (also, free!), but others I saw frequently mentioned were dBpowerAMP and XLD.

Question: What about the metadata?

This took a bit more careful searching (and some reading between the lines) to discover the answer. The metadata was not really explicitly discussed very often. I think that “everyone knows how it works” so it is not really discussed, or people just jump right in and try it out, so they don’t ever need to ask (or they don’t care?). In any case, the sense that I got from the discussion boards was that, yes, the metadata is transcoded along with the rest.

However, there seems to be an issue with album covers. This isn’t really an issue for me as I have never made any effort to include them, but it is something to be aware of. It is not really clear to me, but I think that if the art is embedded in the metadata (requiring some effort on the part of the user), it will be transferred. But, if the art is only linked to the metadata, then it will not be transferred. I believe that iTunes’ default is to link rather than embed. I have not researched how or if it is possible to change linked art to embedded art within iTunes.

Conclusion

So the answer is fairly simple for me: use iTunes to rip my CDs into ALAC. I’ll have a lossless, compressed file that works just fine with iTunes and can be transcoded fairly easily to another format at a later date without any loss of fidelity. The next step is the hard one – actually doing it!

A note about my research process (because I care about that sort of thing)

I did all of my research on the internet. From a general internet search, I started with Wikipedia articles. These gave me the basic keywords and understanding that I needed to ask more specific questions. The answers to those questions, I found on blogs and message forums, and in the documentation for the software/codecs/file types. This was a fairly straight-forward question that I basically spent an afternoon researching. Writing this blog post, on the other hand, has taken at least four times as long. Organizing a coherent explanation that cites all of the appropriate sources apparently takes me much longer than just making a decision based on those sources.

Notes

[1] In researching differences between FLAC and ALAC, I found several forums that discuss it.
Head-Fi Forums
Hydrogenaudio Forums
Linn Forums

[2] This discussion on the Stereophile Forums specifically addresses the 16 versus 24 bit issue.

December 10, 2012

ALA 2012: FRBR Presentation Four

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

Presentation 4 of 4

“FRBR and XC: Participatory Design”
Jennifer Bowen

No slides available for this presentation.

Bowen began her presentation with a brief introduction to the eXtensible Catalog (XC).

She then noted that user studies were built into the development of XC. The participatory design included observations of users working, surveys, and interviews. They asked users what they wanted.

The findings of that research were not FRBR-specific but what users wanted basically matched the FRBR model:

Users have preferred material and format types
Users want to know why items are on a result list
Users want to choose between versions of resource and see the relationships between resources

For ever-changing future needs, XC has a customizable user interface. Browsing of a collection of resources can be customized based on some common attribute or relationship within the collection.

Finally, Bowen concluded that their research showed that aspects of FRBR do address what users need to do.

Related links:
The results of the research are available in the book Scholarly Practice, Participatory Design and the eXtensible Catalog

December 3, 2012December 3, 2012

ALA 2012: FRBR Presentation Three

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

Presentation 3 of 4

“Research, Development and Evaluation of a FRBR-based Catalog Prototype”
Yin Zhang & Athena Salaba

Presentation Slides

Presentation outline:

background of the project
research and development of the project
user evaluation of the project
conclusion/next steps

Background

Zhang began the presentation by discussing the background of the project. While FRBR has the potential for libraries to develop better and more effective catalogs and discovery tools, there is not much in the way of guidance for how to implement FRBR. User studies are still few and far between. KSU received IMLS funding to develop and research FRBR-based systems. As part of that project, KSU conducted a series of user studies.

Methodology

1. Run user evaluation studies on FRBR-based catalogs already in existence
2. Put together a FRBR-ized data set
3. Develop an initial set of displays
4. User feedback on the developed prototypes

Step One

The first step was to evaluate existing FRBR-based catalogs. They evaluated three existing FRBR-based catalogs for user experiences and support for the FRBR tasks: OCLC WorldCat.org, FictionFinder, and Libraries Australias. The results of the evaluation served as the basis for their own FRBR prototype catalog.

Step Two

The next step was to extract Library of Congress bib records and authority records from WorldCat. They used OCLC’s Workset algorithm to identify works, but applied their own algorithm to identify expressions and manifestations. The results of this were used to develop FRBR-based displays.

Step Three

In the third step, they developed the layouts for the FRBR-based displays based of:

works from an author search
works from a subject search
works from a title search
expressions from a language/form search
manifestation (slide 7)

Step Four

Finally, they sought user feedback on the interface design. The study participants were interviewed using printed display layouts as prompts and asked about data elements and functions. The feedback was incorporated into the final prototype catalog programming.

Here I have appended a screenshot of the prototype catalog search results taken from presentation slide 10 and a screenshot taken of LC’s current catalog search results where I tried to run approximately the same search.

FRBR prototype catalog

Traditional catalog

Instead of the gazillion search results all strung out over many pages as seen in the traditional catalog (is this another record for the same thing that I already looked at three pages ago?), in the prototype, the records are gathered together under the author/title work sets and then by form and language. The resulting display seems cleaner and more compact, while still presenting plenty of information. It seems so obvious to me that catalogs should have always worked this way.

Study Design

Next Salaba discussed the study design for having users actually evaluate the FRBR prototype. They used a comparative approach: with the same set of records, they had users search using both the traditional catalog and the FRBR prototype catalog. The study group contained 34 participants and data was collected via observations, interviews, audio recordings and screen captures.

The participants were given two kinds of search strategies to pursue. The first set of searches were predefined and users were asked to evaluated the resulting displays. In the second set, participants were given criteria and allowed to use their own search strategies.

Findings

Overall, most users (85%) preferred the FRBR prototype for all of the searches they did. The table on slide 14 breaks down the findings into the categories of language or type of materials, author, title, title and publication information, entertainment, research, and a general topic. The biggest difference in searching the two catalogs was that the FRBR prototype allowed users to find expressions. Since the current catalog only provides access at the manifestation level and does not group by language or format, this cannot really be a surprise.

Features that the participants found “helpful”:
Grouping of results by work and expression (65%)
Refining results (24%)
Alphabetical order of results display (15%)
Interface appearance (24%)

Features that participants thought needed improvement:
More detail before manifestation level display (15%)
Prefer individual manifestation level results (9%)
Listing a resource under each language of a multi-language resource (3%)

88% of participants thought that clustering the resources by work/expression/manifestation made it easier to find things. 91% thought that the navigation made sense and was helpful in performing searches. One participant found the FRBR prototype less helpful for searching for a specific title, but helpful when searching for a specific topic.

Conclusions

Salaba noted the importance of user input into the design and implementation of FRBR-based catalogs. The study showed that users can successfully complete searching tasks using the FRBR-based catalog and that users do understand and can navigate the FRBR-based displays.

Finally, Salaba stated that more research is needed into other FRBR implementations, with more studies comparing those implementations. She noted that other issues include:

FRBRization algorithms
Existing MARC records
Attributes and relationships
FRBR-based catalogs the support user tasks
Displays

Additionally, it is unknown at this point how RDA and Linked Data will work into the whole equation.

Article (2007): From a Conceptual Model to Application and System Development

Poster (2007): User Research and Testing of FRBR Prototype Systems

Article (2009): User Interface for FRBR User Tasks in Online Catalogs

Article (2009): What is Next for Functional Requirements for Bibliographic Records? A Delphi Study

Book (2009): Implementing FRBR in Libraries: Key Issues and Future Directions

Presentation for the ALA 2010 Annual Conference: FRBRizing MARC Records Based on FRBR User Tasks

Presentation for ASIST 2010 Annual Conference: FRBR User Research and a User Study on Evaluating FRBR Based Catalogs

An abstract for a presentation at a panel discussion at ASIST 2010: FRBR Implementation and User Research

An abstract for a presentation at a panel discussion of FRBR at ASIST 2011: Developing FRBR-Based Library Catalogs for Users

October 19, 2012

ALA 2012: FRBR Presentation Two

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

This is the second of four presentations given at this session.

“FRBR at OCLC”
Thomas Hickey

No slides found online for this presentation.

Hickey spoke about the use of FRBR at OCLC.

OCLC manages 275 million bibliographic records at the work, expression and manifestation levels. The bib records are already clustered by work level. OCLC has now started the process of clustering “content” which is roughly equivalent to expression and manifestation.

Clustering is done by creating normalized keys from a combination of the author and title. The advantage is that the process is straight-forward and efficient. The disadvantage is that the algorithm misses cataloging variations.

OCLC is now working with the GLIMIR (Global Library Manifestation Identifier) project. This project will cluster records at the manifestation level and assign an identifier. The algorithms for this project go beyond the author title keys used for workset creation into even the note fields. [NOTE: Examples given in the code4lib article include: publishing information and pagination.]

Using the GLIMIR algorithms they have discovered the same manifestation hiding in different worksets. They tried pushing changes back up to the workset level but it didn’t work very well. [NOTE: the code4lib article gives several examples of ways the GLIMIR has improved their de-duplication efforts. Was it a computational/technical problem?] They are moving to Hadoop and HBase now [?to improve their ability to handle copious amounts of data?].

The goal is to pull together all of the keys, group them and then separate them into coherent work[?] clusters. One problem is the friend of a friend issue. This is used to cluster similar items, but if A links to B and B links to C, are A and C the same thing?

In sum:

the new algorithms are much more forgiving of variations in the records
the iterations can be controlled
the records are much easier to re-cluster (the processing takes hours rather than months)
the work cluster assignments can happen in real time

Worldcat contains:
1.8 billion holdings
275 million worksets
20% non-singltons
80% holdings [have 42 per workset?]
Top 30 worksets — 3-10 thousand records
30-100 holdings
largest group 3.3 million
2.7 million keys
GLIMIR content set — 483 records

Music is problematic for clustering.

VIAF and FRBR
VIAF contains 1.5 million uniform title records
links to and from expressions
link to author from author/title

OCLC can also do clustering in multiple alphabets, using many, many cross-references.

October 19, 2012October 19, 2012

ALA 2012: FRBR Presentation One

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

This is the first of four presentations given at this session.

“FRBRizing Mark Twain”
Erik Mitchell & Carolyn McCallum

The presentation slides are available on Slideshare or view them in the embedded slideshow below.

Current Research on and Use of FRBR in Libraries from cjmccallum

Erik Mitchell and Carolyn McCallum discussed their project to apply the FRBR model to a group of records relating to Mark Twain. McCallum organized the data manually while Mitchell created a program to do it in an automated fashion. They then compared the results. This presentation covered:

Metadata issues that arose from applying FRBR
Issues in migration
Comparison of the automated technique to an expert’s manual analysis

Carolyn McCallum spoke first about the manual processing portion of the project.

For this project, they focused on the Group 1 entities (work, expression, manifestation and item). They extracted 848 records from the Z. Smith Reynolds Library catalog at Wake Forest University for publications that were either by Mark Twain or about him. Using Mark Twain ensured that the data set had enough complexity to reveal any problems. The expert cataloger then grouped the metadata into worksets using titles and the OCLC FRBR key.

In the cataloger’s assessment, there were 410 records that grouped into 147 total worksets (each one having 2 or more expressions). The other 420 records sorted out into worksets with only one expression each. The largest worksets were for Huckleberry Finn (26 records) and Tom Sawyer (14 records). The most useful metadata was title, author, and a combination of title and author.

A couple of problems that were identified in the process were that whole to part and expression to manifestation were not expressed consistently across the records and that determining boundaries between entities was difficult. The line where one work changes enough to become another expression or even a completely different work can be open to interpretation. McCallum suggested that the entity classification should be guided by the needs of the local collection.

Mitchell then spoke about the automated version of the processing.

Comparison keys comprised of the OCLC FRBR keys (author & title) were again used to cluster records into worksets. The results were not as good as the manual expert process but were acceptable and comparable to OCLC’s results. To improve the results using the automated process, they built a Python script to extract normalized FRBR keys out of the MARC data and compared those keys. This did improve the results.

In conclusion, Mitchell noted that the metadata quality is not so much a problem as the intellectual content. The complex relationships between the various works/expressions/manifestations are simply not described by the metadata. Both methods, manual and automated are time and resource consuming. Finally, new data models, like Linked Data, “are changing our view of MARC metadata” (slide 21).

Question from the audience about problems [with the modeling process?]
Answer: Process could not deal well with multiple authors.

Other related links:
McCallum’s summary of their presentation (about halfway through the post).
A poster from the ASIS&T Annual Meeting in 2011

July 24, 2012July 24, 2012

ALA 2012: Linked Data & Next Generation Catalogs Session – Part 2

“Linked Data & Next Generation Catalogs”
8am on Saturday, June 23, 2012
The speakers, in order, were Corey Harper, Phil Schreur, Ted Fons, Yvette Diven and Jennifer Bowen.
Presentation slides at: ALA Connect: Next Generation Catalog Interest Group

Part II: Ted Fons, Yvette Diven and Jennifer Bowen

“A View of OCLCís Strategy: Linked Data”
Ted Fons

Ted Fons discussed how OCLC is implementing Linked Data technologies.

OCLC is implementing schema.org markup on their webpages and has developed an extension beyond the basic bibliographic vocabulary provided by schema.org. OCLC has begun this to improve SEO for Worldcat and libraries, to strengthen the WorldShare Platform with a tangible new offering, to gain a position of authority in modeling post-MARC data, and to promote internal efficiency (slide 5). The ultimate objective is to position OCLC as a leader in the library community, at the forefront of Linked Data technology (slide 6).

How We Got to Linked Data
Fons gave an overview of the timeline that lead to OCLC’s implementation of schema.org markup and of schema.org (slide 8).

Using the schema.org markup allows OCLC to assign URIs to “library things” in a way that facilitates re-use of the data by other entities outside the world of libraries, such as the search engines.

Future of Cataloging
Fons characterized traditional cataloging as all about the local catalog. It includes description and limited use of authority files and is about “locating the resource in the local catalog context” (slide 16). The future of cataloging takes that and allows expanded use of authority files, not just library-specific ones, and locates the resource “within a network of useful links” (slide 17). Cataloging will still be core to the library, but becomes an even more important source of authoritative data within the context of the web.

Fons concluded his presentation by noting that OCLC is positioning itself on the forefront of Linked Data research, that integrating with the rest of the Linked Data world will allow libraries to become major hubs and that the use of schema.org represents a major step for OCLC members.

———————————–

“Stories and Lessons from the Road to Linked Data”
Yvette Diven

Yvette Diven discussed Serials Solutions’ foray into Linked Data.

Diven began with an overview of Serials Solutions (SS). SS was founded specifically to provide authoritative metadata for libraries. At its core is a “centrally-provisioned Knowledgebase” (slide 2, notes). The knowledgebase started as an industry-standard flat file and has evolved from there.

The metadata imported into the repository comes in various formats including MARC, DC, XML, ONIX, text, etc. It is cleaned up and integrated into the services provided by SS.

Linked Data
Starting with this knowledgebase, they have chosen FRBR to use as the conceptual framework and RDA as the schema. The data is loaded into a separate relational database that is not a MARC database. MARC data can come in and go out, but it is not internally stored as MARC data. This provides more flexibility for importing and exporting is multiple formats.

Serial Solutions’ new service Intota will be based on this new data framework and should be able to support the entire life cycle of the library’s collection as well as providing Linked Data benefits to service users.

Diven stated that one lesson they learned about moving towards Linked Data was to start simply, like with controlled vocabularies.

Future pilot projects include publishing knowledgebase data as RDF/RDFa triples and including open access journal data and data from Ulrich’s.

————————————-

“‘Linked-Data-Ready’ Software for Libraries: The eXtensible Catalog (XC)”
Jennifer Bowen

Jennifer Bowen discussed how XC can facilitate the move to Linked Data.

Bowen noted that there are many questions surrounding Linked Data:

Why should we do it,
Who should do it,
How can we get started
and What are the outcomes?

Bowen asked if Linked Data could help us provide what our users need and if there are new roles for libraries. As part of the process of building XC, they performed user studies that might help answer these questions. The results of that research are available in the book Scholarly Practice, Participatory Design and the eXtensible Catalog.

The user studies showed that, first, scholars want:

to read everything on the topic that they are researching
to be in the middle of everything they need, with it all organized so it is findable and useable
their research to be findable and usable by others
to connect with people whose work is interesting and useful to them

Finally, the user studies found that scholars don’t care about the technology as long as it works.

Second, the studies showed a shift in how people seek and use information. Library-based systems (website, catalogs, etc.) are being bypassed, not only in favor of Google, et al., but also in favor of “tailored desktop, mobile, and web applications” (slide 10). Furthermore, even if they use library-provided tools to identify resources, scholars go outside the library domain to analyze their information.

The solution to this is to make library resources discoverable where users are looking for them: search engines, mobile apps and social media. Bowen noted that libraries could build their own tools and applications, but if they simply concentrate on making the data usable, someone else will happily build these types of tools.

Who should create Linked Data? Why create Linked Data?
Bowen stated that any/all libraries should be working on Linked Data. Libraries need to change to a new data paradigm and they need hands-on experience with Linked Data both to understand its potential and to develop best practices. Linked Data provides an opportunity to showcase unique local collections and serve local interests. Libraries also need to get started with Linked Data so that they can push their vendors to start thinking about Linked Data now and not at some amorphous point in the future. Finally, Linked Data will allow libraries to create or take advantage of new opportunities and explore new roles.

How can we get started?
To get to Linked Data, we need tools to convert legacy data into Linked Data. Bowen discussed how XC might be one such tool. XC is open source and provides both a discovery system and a set of tools to transform and manage metadata. This provides a platform for metadata transformation experimentation (and potentially for Linked Data) that is risk-free. It allows bulk conversion of existing library metadata and can synchronize data conversion to existing systems.

While not built with the idea of Linked Data in mind, XC could potentially be used to make Linked Data available to developers. Bowen envisions the User Interface (Drupal Toolkit) and the Metadata Services (MST toolkit) as being the main components for creating the RDF. The user interface could generate RDFa (which is built into Drupal 7). The bulk metadata conversion processes could output RDF/XML or a SPARQL endpoint.

The underlying schema for XC is based on elements drawn from registered element sets. The elements themselves already have URIs assigned to them. Some elements come from RDA, some from DC and some are XC-created (and registered) elements.

As an interim step in the data transformation process, XC converts the data (MARCXML) into FRBR entities: Work, Expression and Manifestation. This may actually produce more meaningful Linked Data in the end. The user research showed that users want to see the relationships between the resources, between the resources and people, between people and other people and between a search term and the resources actually retrieved. Relationships are what FRBR and FRAD are all about.

Finally, using XC as a template, Bowen looked at a couple of specific ways that Linked Data could address user needs.
Scholars want to read and access everything on a topic. In XC, a custom interface can be easily set up for a particular group of users or a particular need without doing any custom programming.

Scholars want to connect with others whose work interests them. This a place where libraries have the opportunity to develop new technologies and take on new roles. Libraries could create tools that allow scholars to make Linked Data statements as part of the scholarly process. This could involve things like: creating/managing vocabularies, augmenting metadata about a resource, making their own work more discoverable or understandable or documenting the relationships between resources/people/etc.

To sum up, Bowen noted that they are currently looking for funding and partners for Linked Data development in XC.

——————–

Panel Discussion / Q&A

Query postponed from earlier in session: Who is going to create this data?
PS: The model will not change rapidly. Our Cataloging dept. has been renamed the Metadata dept. with no decrease in work.
TF: The tools will change, but the work is still the same and still has to be done

Query: How do we ensure that we have the SEO rankings that we want?
TF: Ranking algorithms are all about linkages, more collaboration, more aggregation. More linkages means more clicking and that pushes us up where we need to be [in the search results].
??:If you re-imagine the authority file with the context of the person’s life, that is what the graphs are looking for.
CH: The value of cataloging is in providing authorities and context. Users will not rely on libraries for the easy-to-access, mass-market materials but for local special collections, rare items, archives, specialized data that is not available from commercial enterprises.
PS: Google likes library data and it will show up on the first page [of search results] now when it is available.

Query: What about the contract terms that state that OCLC owns the information that libraries contribute?
TF: The “agreement is to distribute data under certain guidelines.” There is a new license that members like, new guidelines. Libraries can open data as long as it is attributed to OCLC.
CH: The new license is good. It was the consortium that wrote the orignal license and basically effed it up.

Query: What about all these different schemas? Is that a problem?
TF: It is still early days and things are still in development. It is not something to really worry about. A bigger problem might be durable URIs. Things will evolve.
YD: Right now it is all really modeling and not hard publishing[?not sure I noted this down correctly?]. Controlled vocabularies are a bigger challenge.
JB: In the case of XC, it has its own internal schema that adheres to RDF and can output triples. [Implying that other schemas are translated into the system’s internal schema?] So the schema itself is not so important, interoperability probably possible.
CH: If you create linkages at the level of the vocabularies, they all become interoperable.
TF: Within a community of practice, there are opportunities to agree on URIs to be used: how to represent works, expressions, etc. Once these have been agreed upon, it will be more efficient and there will be less duplication of effort.

Query: What about the hard mappings? Author and title are easy.
PS: It is not an easy or perfect exercise. Implicit connections are made by humans in the MARC records and there is going to be some loss of understanding when moving over to a new infrastructure. However, the data should be self-improving as users re-connect the data points explicitly.
CH: We don’t lose data, just the [implicit] connections between them.

Query: How do we know users will contribute?
Examples: tagasauris and mechanical turk
JB: XC platform allows for playing with data to see what works.

Query: What about making corrections to records? For example, Proquest sends records and the library makes corrections, then Proquest resends the records and the corrections are overwritten. The corrections never go back to Proquest.
PS: Nothing about a LD statement makes it inherently true. There will still be errors.

[Query or further discussion of previous query?]: With machine-actionable data, tools can be used to find the errors instead of humans. Freebase has confidence ratings, some data gets sent to humans for intervention.

July 6, 2012

ALA 2012: Linked Data & Next Generation Catalogs Session — Part 1

“Linked Data & Next Generation Catalogs”
ALA Annual Conference 2012
8am on Saturday, June 23
The speakers, in order, were Corey Harper, Phil Schreur, Ted Fons, Yvette Diven and Jennifer Bowen.
Presentation slides at: ALA Connect: Next Generation Catalog Interest Group

Part I: Corey Harper and Phil Schreur

“Linked Library Data: Publish, Enrich, Relate and Un-Silo”
Corey Harper (On Twitter as @chrpr)

Harper began his talk with a short overview of Linked Data. Linked Data is all about publishing data for re-use. Harper referenced a TED Talk by Tim Berners-Lee filmed in 2009 on his vision of the Semantic Web: “Tim Berners-Lee on the next Web”. [Incidentally, there is a follow-up TED Talk, “The year open data went worldwide,” by Tim Berners-Lee filmed in 2010 where he looks at some examples of how linked data can be used.]

Harper summarized Berners-Lee’s famous “Design Note” that defines Linked Data:
use URIs as names
use HTTP URIs
provide information at the URI
includes links to other URIs

Next, Harper discussed metadata as a graph. Things and relationships are named with URIs allowing ease of integration across data sources. The graphs can be merged to create new explicit relationships between previously unconnected pieces of data.

Library data, however, is primarily string-based so it cannot be easily connected to other data sources. Slides 5-7 show the difference between a basic text-string description of a “thing” and the way that using URIs can allow that “thing” to link to other related “things.”

Finally, Harper quickly went over some of the terminology used in describing Linked Data.
A resource is a thing
A class is an abstraction of a type of thing
An individual is an instance of a class
A property is an attribute of individual
A statement or triple consists of the combination of resource, property, and value.
A graph is a visual representation of statements
An ontology is a domain-specific collection of classes and properties
Nodes are the subjects and objects in a graph
Arcs are the predicates in a graph
Domains and ranges provide constraints on nodes
The domain defines what things can be subjects
The range defines what things (or strings) can be objects
Literals are values that are strings rather than URIs
Named graphs are graphs where the URIs are treated as nodes [this is not clear to me]
These are still under development
[This is not on a slide, did I make it up?: Provenance allows you to know where the data comes from.]

Harper discussed the growth of the Linked Open Data cloud. Slides 11 and 12 show the growth from May 2007 to September 2011. Harper noted that Freebase has become almost as interlinked as DBpedia. Though Freebase is now owned by Google, anyone can add anything to it. For example, VIAF and id.loc.gov have added their data into it, which means that the data has Freebase URIs [in addition to the URIs generated by the respective projects themselves].

Projects that are using Linked Data.
Google Refine allows the input of unstructured data and reconciles it against Freebase to create new links (slides 14 and 15).

RelFinder (slide 16) finds linkages between between data sets. [The RelFinder website states that “it extracts and visualizes relationships between given objects.”]

Next, Harper discussed Linked Open Data – Libraries, Archives & Museums (LOD-LAM). Two examples of these types of projects are Europeana and LOCAH: Linked Open Copac & Archives Hub.

Europeana describes 15 million European cultural items and includes data from the British Library, the Rijksmuseum and the Louvre . The data model builds on OAI-ORE .

LOCAH combines bibliographic and archival data. Using finding aids from over 200 institutions in the UK, they modeled EAD as RDF. [LOCAH has become Linking Lives.]

The Social Networks and Archival Context Project (SNAC)

Linking Lives

Viewshare

The Linked Ancient World Data Institute (LAWDI) is actively modeling ancient place names as linked data and is very excited to get librarians involved.

The Library Linked Data Incubator Group Final Report includes a related document, Library Linked Data Incubator Group: Use Cases that details over 50 use cases spread across a range of different categories, including bibliographic data, authority data, archives and heterogeneous metadata, citations, etc. (slides 31-34).

The Linked Open Data Special Interest Working Group of the International Group of Ex Libris Users (IGeLU) is working on incorporating Linked Open Data into its architecture. A public draft on their work is due in two weeks.

Linked Data is a distributed information ecosystem – it focuses on identification rather than description. [This is an important point to keep in mind. I think it is another paradigm shift like those discussed by Phil Shreur below.] All data retains its context and enriches the user experience. Library bibliographic data, on the other hand, has been removed from its context. A bibliographic record cannot, for example, tell a user anything detailed about the author’s life. Linked Data allows links to be made to outside sources that do provide such data.

Finally, Harper discussed facets and issues of Linked Data that are actively being worked on: provenance; licensing; best practices, modeling and infrastructure; and DCMI and W3C work.

Harper noted that the FRBR model is finding its way into other domains. [This seems to me that it is a validation of the model.]

————————————————————————————-

“Shatter the Catalog, Free the Data, Link the Pieces”

Phil Schreur (LinkedIn profile)

Phil Schreur spoke about the stressors on our current catalogs and the Linked Data solution.

Stressors.
Schreur began his discussion of stressors with Google. Google has taught users to expect an all-inclusive, more-is-better approach to searching. As libraries try to adopt/adapt to this type of searching, the catalog starts to lose all local character. Carefully curated items become lost in the midst of giant dumps of e-book [and e-serials, I would guess as well, since those are all package deals] bibliographic data. More bibliographic data of questionable quality is ingested from a variety of sources. [I have a note about supplementary data as well, but not what his point about it is.]

A second stressor on current library catalogs are the bibliographic records themselves. All are subject to “local practice” guidelines that mean catalogers are re-cataloging the same item over and over again at different institutions. Records can be missing elements and include other mistakes as well. [I think this is the point, though I did not include it in my notes: These problems all get fixed at each institution that finds them, with no way to propagate those changes out to other places.]

Next, Schreur discussed the fact that most library data is stored in relational databases in closed systems. The catalog needs to have MARC records for the discovery process to work but the cost of cataloging an entire MARC record can be prohibitively expensive. This results in a backlog of items that have no record in the database.

Finally, Schreur noted that not only is the data siloed within the individual institutions but the institutions themselves do not have consistent access to their own materials. In an academic environment, for example, data about related resources such as course descriptions and reading lists are not linked to the bibliographic data. They are too expensive to catalog because the only way related materials can be included would be in cataloger-created MARC records.

Schreur stated that Linked Data is the answer to this problem. The Linked Data in an academic environment like Stanford could interact with the Linked Open Data on the web. This takes the data to where people are searching for it. It provides better discoverability and opportunities for innovation. It allows for continuous improvement of data without having to exchange MARC records. MARC data that machines cannot understand becomes machine-actionable data that is directly accessible. It breaks down the silos and allows for unanticipated opportunities.

Moving to Linked Data involves several paradigm shifts. First, MARC bibliographic records are often considered a commodity and can have many restrictions on them. Linked Data, on the other hand, is focused on the free and open exchange of data. Second, from entire MARC bibliographic records, we shift to simple RDF statements. Third, data [i.e. metadata?] will be captured at the point of creation. The RDF triples will result as part of the creation process and they will be heterogeneous. Finally, instead of the problem of limited data that we have now, we will have a problem of an overwhelming amount of data. The triple stores will have to be managed.

Finally, Schreur provided four examples of projects implementing Linked Data:
Mashpoint
Bibliotheque nationale de France
LinkSailor
Google Knowledge Graph

Mashpoint
Allows the user to take one data set and apply it to another data set. [It sounds a lot like RelFinder and Google Refine.]

Bibliotheque nationale de France
This project provides good documentation for its data [data provenance!]. The search results for Edgar Allan Poe include not only bibliographic resources but also related materials that can be found in the Archives & Manuscripts department and links to resources that are outside of the collection entirely, like those in the Europeana project.

LinkSailor
This is a project started by Talis. LinkSailor allows the user to follow the links themselves from one place to another: from a map for Heathrow Airport to other airports.

Google Knowledge Graph
Collects links to a resource rather than searching text strings for things related to a resource. This is in use now by Google, it appears next the more traditional text-string search results.

Questions/Discussion for Harper and Schreur

Q: About having authoritative data: For example, scholars are worried about the ESTC being “ruined” by crowd sourcing the data
PS: It is important to have authoritative data still, there will be push-pull between the crowd sourcing and the idea of controlled authoritative data.
CH: The crowd is important, sometimes it is that random person on the Internet who has the necessary knowledge and expertise in the topic. The real value is in how the data gets used on the Web.

Q: If we open the data, where does the value-added cataloging go?
[This is the idea, that if everyone thinks they can get the data from somewhere else, they are not going to pay to have their own quality data created. But it seems to me that the answer is the same that it is now. If you want quality/authoritative data, you have to pay someone to create it. If you are just going to copy catalog, then you take someone else’s data. This is a problem now. The technology of Linked Data is not going to fix what is at its core a cultural problem.]
This query postponed to the panel discussion at end of the session.