ALA 2012: FRBR Presentation Two

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

This is the second of four presentations given at this session.

“FRBR at OCLC”
Thomas Hickey

No slides found online for this presentation.

Hickey spoke about the use of FRBR at OCLC.

OCLC manages 275 million bibliographic records, modeled at the work, expression, and manifestation levels. The bib records are already clustered at the work level. OCLC has now started the process of clustering by “content,” which is roughly equivalent to the expression and manifestation levels.

Clustering is done by creating normalized keys from a combination of the author and title. The advantage is that the process is straightforward and efficient; the disadvantage is that the algorithm misses cataloging variations.
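
For illustration, here is a minimal Python sketch of this kind of normalized author/title key. It is an assumption about the general technique, not OCLC’s actual algorithm, but it shows both why the approach is efficient and why it misses cataloging variations:

```python
import re
import unicodedata

def normalized_key(author: str, title: str) -> str:
    """Build a rough author/title clustering key by folding case,
    accents, punctuation, and whitespace."""
    def norm(s: str) -> str:
        # Decompose accented characters, then drop the combining marks.
        s = unicodedata.normalize("NFKD", s)
        s = "".join(c for c in s if not unicodedata.combining(c))
        # Keep only letters, digits, and spaces; lowercase; collapse runs of spaces.
        s = re.sub(r"[^a-zA-Z0-9 ]+", " ", s)
        return " ".join(s.lower().split())
    return norm(author) + "/" + norm(title)

# Minor cataloging differences collapse to the same key...
assert normalized_key("Twain, Mark", "Adventures of Huckleberry Finn") == \
       normalized_key("TWAIN, MARK.", "Adventures of  Huckleberry Finn")
# ...but a variant title form yields a different key: exactly the kind
# of cataloging variation this simple approach misses.
assert normalized_key("Twain, Mark", "Huckleberry Finn") != \
       normalized_key("Twain, Mark", "Adventures of Huckleberry Finn")
```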

OCLC is now working on the GLIMIR (Global Library Manifestation Identifier) project. This project clusters records at the manifestation level and assigns each cluster an identifier. The algorithms go beyond the author/title keys used for workset creation, reaching even into the note fields. [NOTE: Examples given in the code4lib article include publishing information and pagination.]
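
A manifestation-level key could fold in the extra fields the code4lib article mentions. The sketch below is purely illustrative; GLIMIR’s real key set and matching rules are more elaborate:

```python
import re

def manifestation_key(author: str, title: str,
                      publisher: str, pagination: str) -> str:
    """Illustrative manifestation-level match key that adds publisher
    and pagination to the author/title base (not GLIMIR's real keys)."""
    norm = lambda s: " ".join(re.sub(r"[^a-z0-9 ]+", " ", s.lower()).split())
    # Pagination like "xii, 366 p." reduces to its digits.
    pages = "-".join(re.findall(r"\d+", pagination))
    return "/".join([norm(author), norm(title), norm(publisher), pages])

print(manifestation_key("Twain, Mark", "Adventures of Huckleberry Finn",
                        "Harper & Brothers", "xii, 366 p."))
# twain mark/adventures of huckleberry finn/harper brothers/366
```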

Using the GLIMIR algorithms, they have discovered the same manifestation hiding in different worksets. They tried pushing changes back up to the workset level, but it didn’t work very well. [NOTE: the code4lib article gives several examples of ways GLIMIR has improved their de-duplication efforts. Was it a computational/technical problem?] They are moving to Hadoop and HBase now [?to improve their ability to handle copious amounts of data?].

The goal is to pull together all of the keys, group them, and then separate them into coherent work[?] clusters. One problem is the friend-of-a-friend issue: pairwise links are used to cluster similar items, but if A links to B and B links to C, are A and C the same thing?
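
Mechanically, this grouping behaves like computing connected components over pairwise matches. A minimal union-find sketch (my illustration, not OCLC’s code) shows how A and C end up together without ever matching directly:

```python
class UnionFind:
    """Minimal union-find for grouping records joined by pairwise matches."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
uf.union("A", "B")  # A and B share a key
uf.union("B", "C")  # B and C share a different key
# A and C land in the same cluster despite never matching each other:
assert uf.find("A") == uf.find("C")
```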

In sum:

  • the new algorithms are much more forgiving of variations in the records
  • the iterations can be controlled
  • the records are much easier to re-cluster (the processing takes hours rather than months)
  • the work cluster assignments can happen in real time

WorldCat contains:

  • 1.8 billion holdings
  • 275 million worksets
  • 20% non-singletons
  • 80% of holdings [have 42 per workset?]
  • top 30 worksets: 3-10 thousand records, 30-100 holdings
  • largest group: 3.3 million
  • 2.7 million keys
  • GLIMIR content set: 483 records

Music is problematic for clustering.

VIAF and FRBR

  • VIAF contains 1.5 million uniform title records
  • links to and from expressions
  • links to author from author/title

OCLC can also do clustering in multiple alphabets, using many, many cross-references.
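
One way to picture this (purely illustrative; VIAF’s real machinery is much richer) is a cross-reference table mapping variant headings in any script to one canonical cluster key:

```python
# Hypothetical cross-reference table: variant headings, Latin or not,
# map to a single made-up canonical key.
xrefs = {
    "tolstoy, leo": "tolstoy-canonical",
    "tolstoi, lev nikolaevich": "tolstoy-canonical",
    "толстой, лев николаевич": "tolstoy-canonical",
}

def cluster_key(heading: str) -> str:
    """Resolve a heading through the cross-references so records in
    different alphabets fall into the same cluster."""
    return xrefs.get(" ".join(heading.lower().split()), heading)

assert cluster_key("Tolstoy, Leo") == cluster_key("Толстой, Лев Николаевич")
```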

ALA 2012: FRBR Presentation One

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

This is the first of four presentations given at this session.

“FRBRizing Mark Twain”
Erik Mitchell & Carolyn McCallum

The presentation slides are available on Slideshare.

Erik Mitchell and Carolyn McCallum discussed their project to apply the FRBR model to a group of records relating to Mark Twain. McCallum organized the data manually while Mitchell created a program to do it in an automated fashion. They then compared the results. This presentation covered:

  • Metadata issues that arose from applying FRBR
  • Issues in migration
  • Comparison of the automated technique to an expert’s manual analysis

Carolyn McCallum spoke first about the manual processing portion of the project.

For this project, they focused on the Group 1 entities (work, expression, manifestation and item). They extracted 848 records from the Z. Smith Reynolds Library catalog at Wake Forest University for publications that were either by Mark Twain or about him. Using Mark Twain ensured that the data set had enough complexity to reveal any problems. The expert cataloger then grouped the metadata into worksets using titles and the OCLC FRBR key.

In the cataloger’s assessment, there were 410 records that grouped into 147 total worksets (each one having 2 or more expressions). The other 420 records sorted out into worksets with only one expression each. The largest worksets were for Huckleberry Finn (26 records) and Tom Sawyer (14 records). The most useful metadata was title, author, and a combination of title and author.

Two problems identified in the process: whole-to-part and expression-to-manifestation relationships were not expressed consistently across the records, and determining the boundaries between entities was difficult. The point at which one work changes enough to become another expression, or even a completely different work, can be open to interpretation. McCallum suggested that entity classification should be guided by the needs of the local collection.

Mitchell then spoke about the automated version of the processing.

Comparison keys built from the OCLC FRBR keys (author and title) were again used to cluster records into worksets. The results were not as good as the expert’s manual process but were acceptable and comparable to OCLC’s results. To improve on them, the team built a Python script that extracts normalized FRBR keys from the MARC data and compares those keys, which did improve the results.
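
Their script isn’t reproduced in the slides, so the sketch below is an assumption about the general approach: it uses the pymarc library (their actual tooling isn’t specified) and a hypothetical extract file twain.mrc:

```python
import re
from collections import defaultdict
from pymarc import MARCReader  # assumes the pymarc library is installed

def frbr_key(record) -> str:
    """Normalized author/title key from the MARC 1XX and 245 fields,
    in the spirit of the OCLC FRBR key (not the presenters' exact script)."""
    def first_subfield_a(tags):
        for tag in tags:
            for field in record.get_fields(tag):
                values = field.get_subfields("a")
                if values:
                    return values[0]
        return ""
    author = first_subfield_a(["100", "110", "111"])
    title = first_subfield_a(["245"])
    norm = lambda s: " ".join(re.sub(r"[^a-z0-9 ]+", " ", s.lower()).split())
    return norm(author) + "/" + norm(title)

# Group the extracted records into candidate worksets by shared key.
worksets = defaultdict(list)
with open("twain.mrc", "rb") as fh:  # hypothetical MARC extract
    for record in MARCReader(fh):
        worksets[frbr_key(record)].append(record)
```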

In conclusion, Mitchell noted that metadata quality is not so much the problem as the intellectual content: the complex relationships among the various works, expressions, and manifestations are simply not described by the metadata. Both methods, manual and automated, are time- and resource-intensive. Finally, new data models, like Linked Data, “are changing our view of MARC metadata” (slide 21).

Question from the audience about problems [with the modeling process?]
Answer: Process could not deal well with multiple authors.

Other related links:
McCallum’s summary of their presentation (about halfway through the post).
A poster from the ASIS&T Annual Meeting in 2011