ALA 2012: FRBR Presentation Two – Cataloging is not Sexy

“Current Research on and Use of FRBR in Libraries”
8am on Sunday, June 24, 2012
Speakers: Erik Mitchell & Carolyn McCallum, Thomas Hickey, Yin Zhang & Athena Salaba, Jennifer Bowen

This is the second of four presentations given at this session.

“FRBR at OCLC”
Thomas Hickey

No slides found online for this presentation.

Hickey spoke about the use of FRBR at OCLC.

OCLC manages 275 million bibliographic records at the work, expression and manifestation levels. The bib records are already clustered at the work level. OCLC has now started the process of clustering “content,” which is roughly equivalent to expression and manifestation.

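Clustering is done by creating normalized keys from a combination of the author and title. The advantage is that the process is straightforward and efficient. The disadvantage is that the algorithm misses cataloging variations: two records for the same work land in different clusters whenever their author or title strings normalize to different keys.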
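To make the idea of a normalized author/title key concrete, here is a minimal sketch in Python. It is not OCLC's actual algorithm: the normalization rules (lowercasing, stripping accents and punctuation, collapsing whitespace), the function names and the sample records are all assumptions made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace.

    These rules are an assumption for illustration, not OCLC's rules.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def work_key(author: str, title: str) -> str:
    """Build a hypothetical work-level key from author and title."""
    return f"{normalize(author)}/{normalize(title)}"


def cluster_by_key(records):
    """Group record ids under their shared author/title key."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[work_key(rec["author"], rec["title"])].append(rec["id"])
    return dict(clusters)


# Invented sample records, not real WorldCat data.
records = [
    {"id": 1, "author": "Melville, Herman", "title": "Moby-Dick"},
    {"id": 2, "author": "MELVILLE, HERMAN.", "title": "Moby Dick"},
    {"id": 3, "author": "Melville, Herman", "title": "Moby Dick; or, The Whale"},
]
clusters = cluster_by_key(records)
for key, ids in clusters.items():
    print(key, ids)
```

Records 1 and 2 differ only in case and punctuation, so they share a key. Record 3 carries a fuller title, gets a different key and ends up in its own cluster, which is the kind of cataloging variation that a simple key approach misses.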

OCLC is now working with the GLIMIR (Global Library Manifestation Identifier) project. This project will cluster records at the manifestation level and assign an identifier to each cluster. The algorithms for this project go beyond the author/title keys used for workset creation and draw on other parts of the record, even the note fields. [NOTE: Examples given in the code4lib article include: publishing information and pagination.]

Using the GLIMIR algorithms, they have discovered the same manifestation hiding in different worksets. They tried pushing changes back up to the workset level, but it didn’t work very well. [NOTE: the code4lib article gives several examples of ways GLIMIR has improved their de-duplication efforts. Was it a computational/technical problem?] They are moving to Hadoop and HBase now [?to improve their ability to handle copious amounts of data?].

The goal is to pull together all of the keys, group them and then separate them into coherent work[?] clusters. One problem is the friend-of-a-friend issue: links between records are used to cluster similar items, but if A links to B and B links to C, are A and C the same thing?
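The friend-of-a-friend issue can be sketched with a small union-find example in Python. This is only an illustration of transitive grouping in general, not a description of OCLC's implementation; the record names, the key names and the function `cluster_by_shared_keys` are hypothetical.

```python
from collections import defaultdict


def find(parent, x):
    """Return the cluster representative for x, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster_by_shared_keys(record_keys):
    """Merge records that share any key, following links transitively."""
    parent = {rec: rec for rec in record_keys}
    first_seen = {}  # key -> first record that carried it
    for rec, keys in record_keys.items():
        for key in keys:
            if key in first_seen:
                root_a = find(parent, rec)
                root_b = find(parent, first_seen[key])
                parent[root_a] = root_b
            else:
                first_seen[key] = rec
    groups = defaultdict(set)
    for rec in record_keys:
        groups[find(parent, rec)].add(rec)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical records: A and C share no key, but both share one with B.
record_keys = {
    "A": {"k1"},
    "B": {"k1", "k2"},
    "C": {"k2"},
    "D": {"k3"},
}
clusters = cluster_by_shared_keys(record_keys)
print(clusters)
```

A and C have no key in common, yet they land in the same cluster because each is linked to B. Whether that merge is correct is exactly the open question, which is presumably why the grouped keys then have to be separated again into coherent clusters.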

In sum:

the new algorithms are much more forgiving of variations in the records
the iterations can be controlled
the records are much easier to re-cluster (the processing takes hours rather than months)
the work cluster assignments can happen in real time

WorldCat contains:
1.8 billion holdings
275 million worksets
20% non-singletons
80% holdings [have 42 per workset?]
Top 30 worksets — 3-10 thousand records
30-100 holdings
largest group 3.3 million
2.7 million keys
GLIMIR content set — 483 records

Music is problematic for clustering.

VIAF and FRBR
VIAF contains 1.5 million uniform title records
links to and from expressions
link to author from author/title

OCLC can also do clustering in multiple alphabets, using many, many cross-references.